
Approximation of Hessian Matrix for
Second-order SPSA Algorithm Addressed
Toward Parameter Optimization
in Non-linear Systems

JORGE IVAN MEDINA MARTINEZ

Doctoral Program in Electronic Engineering
Graduate School of Electro-Communications
The University of Electro-Communications

A thesis submitted for the degree of
DOCTOR OF ENGINEERING

The University of Electro-Communications
December 2009


Approximation of Hessian Matrix for
Second-order SPSA Algorithm Addressed
Toward Parameter Optimization
in Non-linear Systems

Approved by Supervisory Committee:

Chairperson : Prof. Kazushi Nakano
Member : Prof. Kohji Higuchi
Member : Prof. Masahide Kaneko
Member : Prof. Tetsuro Kirimoto
Member : Prof. Takayuki Inaba
Member : Prof. Seiichi Shin

Copyright 2009 by Jorge Ivan Medina Martinez
All Rights Reserved


Approximation of Hessian Matrix for
Second-order SPSA Algorithm Addressed
Toward Parameter Optimization
in Non-linear Systems

(2次型同時摂動確率近似アルゴリズムのヘッセ行列推定とその非線形システムにおけるパラメータ最適化への応用)

Jorge Ivan Medina Martinez

Abstract in Japanese (translated)

The system identification problem is, when the structure of the system is known, the problem of estimating unknown parameters from noisy observation data. In recent years in particular, non-linear models have been widely used for state estimation, control, and simulation, and, motivated by the success of non-linear model predictive control, the refinement of models based on first principles or on neural networks has been actively discussed. The identification problem for such non-linear and complex systems reduces to the problem of optimizing some error function with respect to many unknown parameters, and efficient optimization methods for this purpose are in demand.

Many algorithms have been proposed for this task, but when they are applied to complex systems with many parameters, such as non-linear state-space models, they incur an enormous computational cost. This thesis focuses on the facts that existing algorithms do not have sufficient stability in parameter estimation for complex systems and that their computation processes are complicated and costly, and it proposes a new estimation algorithm. We first focus on the simultaneous perturbation stochastic approximation (SPSA) algorithm, which is advantageous in computational complexity and cost, is easy to implement, and has stable convergence. When it is applied to complex systems, however, several problems are encountered. We therefore develop a new method that improves the SPSA algorithm while retaining its stable convergence and low computational cost; that is, the SPSA algorithm is improved based on a comparison between the 1st-Order SPSA (1-SPSA) and 2nd-Order SPSA (2-SPSA) methods from the viewpoint of the Hessian of the error function.



The algorithm proposed here (modified SPSA) removes the non-positive definiteness of an ill-conditioned Hessian and, in order to guarantee positive definiteness, adopts a procedure that uses the Fisher information matrix to suppress the error amplification caused by inverting an ill-conditioned Hessian. This also brings a substantial improvement in convergence to methods whose Hessians are well conditioned. Regarding asymptotic convergence, it is shown that the ratio of the mean square error of the modified SPSA method to that of the 2-SPSA method is smaller than for any other method, except in the case of a perfectly well-conditioned Hessian. Furthermore, if only the diagonal elements of the Hessian are estimated, a substantial reduction in computational cost is achieved compared with other methods.

In the modified SPSA method as well, all parameters are perturbed simultaneously, so the parameters can be updated with only two evaluations of the error function regardless of the parameter dimension. Thus, a substantial reduction in computational cost is possible with this SPSA algorithm. This thesis gives a convergence theorem for the proposed algorithm and carries out simulations to demonstrate the feasibility of parameter estimation with it.

Finally, three practical applications of the proposed method are considered. The first is the angle control problem of a one-link flexible arm aimed at vibration suppression. For this control purpose, a model reference sliding mode control (MR-SMC) method using a non-linear VSS (Variable Structure System) observer is proposed. The parameters of the non-linear observer are optimized using the modified 2-SPSA algorithm proposed here, and the design of the MR-SMC controller is discussed as well. The effectiveness of this method is confirmed by vibration control simulations. The second is an application to adaptive IIR filter algorithms. These correspond to the SHARF (Simple Hyperstable Adaptive Recursive Filter) and SM (Steiglitz-McBride) algorithms, and the coefficient parameters of the output-error-based identification filter are obtained using the proposed modified 2-SPSA algorithm. The effectiveness of this algorithm is shown by comparison with stochastic approximation (SA) algorithms. The last example applies the modified SPSA algorithm to the problem of estimating unknown static parameters of non-linear state-space systems. The proposed algorithm yields maximum likelihood estimates, and its performance is verified through comparison with the finite difference stochastic approximation (FDSA) algorithm.



Approximation of Hessian Matrix for
Second-order SPSA Algorithm Addressed
Toward Parameter Optimization
in Non-linear Systems

Jorge Ivan Medina Martinez

Abstract

The research presented in this dissertation is motivated by the fact that many widely used algorithms do not offer sufficient stability when estimating a large number of parameters in non-linear systems or other kinds of systems, and they also have high computational complexity and cost. We have therefore chosen the simultaneous perturbation stochastic approximation (SPSA) algorithm, which has several important advantages such as low computational complexity and stable convergence. Nevertheless, the typical SPSA algorithm runs into difficulties when it is applied to non-linear and complex systems. Therefore, this research proposes a novel extension of the SPSA algorithm based on the features and disadvantages of the first-order and second-order SPSA (1st-SPSA and 2nd-SPSA) algorithms and on comparisons made from the perspective of the loss-function Hessian. These comparisons matter because, at finite iterations, the convergence rate depends on the matrix conditioning of the loss-function Hessian. It is shown that 2nd-SPSA converges more slowly for a loss function with an ill-conditioned Hessian than for one with a well-conditioned Hessian, whereas the convergence rate of 1st-SPSA is less sensitive to the conditioning of the loss-function Hessian.

A main disadvantage of the 1st-SPSA and 2nd-SPSA algorithms is that the error for a loss function with an ill-conditioned Hessian is greater than for one with a well-conditioned Hessian. Our proposed modified version of 2nd-SPSA (M2-SPSA) eliminates the error amplification caused by the inversion of an ill-conditioned Hessian at finite iterations, which leads to significant improvements in its convergence rate for problems with an ill-conditioned Hessian matrix and for complex systems. Asymptotically, the efficiency analysis shows that our proposed SPSA is also superior to 2nd-SPSA in terms of its convergence rate coefficients. It is


shown that, for the same asymptotic convergence rate, the ratio of the mean square errors of our proposed SPSA to 2nd-SPSA is always less than one, except for a perfectly conditioned Hessian. We have also proposed to reduce the computational expense by evaluating only a diagonal estimate of the eigenvalues of the Hessian matrix. In this research, a new mapping is suggested for the 2nd-SPSA algorithm in order to eliminate non-positive definiteness while preserving key spectral properties of the estimated Hessian, using the Fisher information matrix. After defining the M2-SPSA algorithm, we apply it to parameter estimation. Because M2-SPSA perturbs all parameters simultaneously, it is possible to update the parameters with only two measurements of an evaluation function, regardless of the dimension of the parameter vector. A convergence theorem for the proposed algorithm is presented, and a simulation result also demonstrates the feasibility of the identification scheme proposed here. To show the efficiency of M2-SPSA, we present three important applications in which our proposed algorithm estimates and designs the parameters.

In the first application, our proposed algorithm is applied to control, in this case vibration reduction in the model considered here. The main objective is vibration control of a one-link flexible arm system. A variable structure system (VSS) non-linear observer is proposed in order to reduce the oscillation in controlling the angle of the flexible arm. The non-linear observer parameters are optimized using a modified version of the SPSA algorithm. This SPSA algorithm is especially useful when the number of parameters to be adjusted is large, and it makes it possible to estimate them very efficiently. For the vibration and position control, a model reference sliding-mode control (MR-SMC) is presented, whose parameters are obtained by our proposed M2-SPSA algorithm. The simulations show that vibration control of a one-link flexible arm system can be achieved more efficiently using our proposed methods.

In the second application, our proposed algorithm is applied to signal processing, in this case IIR lattice filters. Adaptive infinite impulse response (IIR), or recursive, filters are less attractive mainly because of stability issues and the difficulties associated with their adaptive algorithms. Therefore, in this research adaptive IIR lattice filters are studied in order to devise algorithms that preserve the stability properties of the corresponding direct-form schemes. We analyze the local properties of stationary points, and a transformation achieving this goal is suggested, which


yields algorithms that can be efficiently implemented. Application to the Steiglitz-McBride (SM) and Simple Hyperstable Adaptive Recursive Filter (SHARF) algorithms is presented. In addition, our proposed M2-SPSA algorithm is used to obtain the coefficients in lattice form more efficiently and with lower computational cost and complexity. The results are compared with previous lattice versions of these algorithms, which may fail to preserve the stability of stationary points.

Finally, the M2-SPSA algorithm is applied to the problem of estimating unknown static parameters in non-linear state-space models. The M2-SPSA algorithm can generate maximum likelihood estimates efficiently. The performance of the proposed algorithm is assessed through simulation, where M2-SPSA is compared with finite difference stochastic approximation (FDSA) in order to show its efficiency.

Therefore, in this dissertation we have proposed a modification of the SPSA algorithm whose main objectives are to estimate the parameters of complex systems, improve the convergence, and reduce the computational cost. This modification of the simultaneous perturbation approach seems particularly useful when the number of parameters to be identified is very large or when the observed values of what is to be identified can only be obtained through an unknown observation system.

Finally, this dissertation is organized as follows. Chapter 1 gives an introduction to SPSA, explaining its main concepts, advantages, disadvantages, recursions, formulation, and implementation. Our proposed SPSA algorithm is analyzed in detail in Chap. 2, where the asymptotic normality, the Hessian estimation, and the efficiency of M2-SPSA relative to the previous versions of SPSA are shown. In addition, we show how the M2-SPSA algorithm is applied to parameter estimation and demonstrate its efficiency in several simple numerical simulations. The first important application of the M2-SPSA algorithm, in the control area, is described in Chap. 3, where M2-SPSA is applied to parameter estimation for methods that control the vibration of the proposed system. Another application of the M2-SPSA algorithm is described in Chap. 4, where our proposed algorithm is applied to signal processing and M2-SPSA calculates the coefficients of some adaptive algorithms. In the final application, described in Chap. 5, the M2-SPSA algorithm is applied to the problem of estimating unknown static parameters in non-linear state-space models. Finally, the conclusions and future work are given in Chap. 6.



Contents

1. Introduction 1
1.1 Motivation and Background 1
  1.1.1 Motivation 1
  1.1.2 Background 2
1.2 Overview of Stochastic Algorithms 5
1.3 Introduction to SPSA Algorithm 7
1.4 Features of SPSA 10
1.5 Application Areas 11
1.6 Formulation of SPSA Algorithm 12
1.7 Basic Assumptions of SPSA Algorithm 14
1.8 Versions of SPSA Algorithms 15
2. Proposed SPSA Algorithm 19
2.1 Overview of Modified 2nd-SPSA Algorithm 19
2.2 SPSA Algorithm Recursions 20
2.3 Proposed Mapping 22
2.4 Description of Proposed SPSA Algorithm 26
2.5 Asymptotic Normality 27
2.6 Fisher Information Matrix 31
  2.6.1 Introduction to Fisher Information Matrix 31
  2.6.2 Two Key Properties of the Information Matrix: Connections to Covariance Matrix of Parameter Estimates 33
  2.6.3 Estimation of F(θ_n) 34
2.7 Efficiency Between 1st-SPSA, 2nd-SPSA and M2-SPSA 40
2.8 Implementation Aspects 41
2.9 Strong Convergence 44
2.10 Asymptotic Distribution and Efficiency Analysis 50
2.11 Perturbation Distribution for M2-SPSA 54
2.12 Parameter Estimation 57
  2.12.1 Introduction 57
  2.12.2 System to be Applied 64
  2.12.3 Convergence Theorem 69


2.13 Simulation 70
  2.13.1 Simulation 1 70
  2.13.2 Simulation 2 72
  2.13.3 Simulation 3 75
3. Vibration Suppression Control of a Flexible Arm using Non-linear Observer with SPSA 79
3.1 Introduction 79
3.2 Dynamic Modeling of a Single Link Robot Arm 81
  3.2.1 Dynamic Model 81
  3.2.2 Equation of Motion and State Equations 84
3.3 Design of Non-Linear Observer 85
3.4 Model Reference Sliding Mode Controller 87
3.5 Simulation 91
4. Lattice IIR Adaptive Filter Structure Adapted by SPSA Algorithm 99
4.1 Introduction 99
4.2 Procedure of Improved Algorithm 101
4.3 Lattice Structure 104
4.4 Adaptive Algorithm 105
  4.4.1 SHARF Algorithm 105
  4.4.2 Steiglitz-McBride Algorithm 108
4.5 Simulation 109
  4.5.1 SHARF Algorithm 109
  4.5.2 Steiglitz-McBride Algorithm 110
5. Parameter Estimation using a Modified Version of SPSA Algorithm Applied to State-Space Models 113
5.1 Introduction 113
5.2 Implementation of SPSA Toward Proposed Model 115
  5.2.1 State-Space Model 115
  5.2.2 Gradient-free Maximum Likelihood Estimation 118
5.3 Parameter Estimation by SPSA and FDSA 120
5.4 Simulation 122
6. Conclusions and Future Work 125
6.1 Conclusions 125
6.2 Future Work 129
References 131
Appendix A 139
Appendix B 155
List of Publications Directly Related to the Dissertation 159
Acknowledgements 163
Author Biography 165



List <strong>of</strong> Figures<br />

Fig. 1.1 Example <strong>of</strong> stochastic optimization algorithm minimizing loss function L θ 1<br />

θ ) 3<br />

(<br />

, 2<br />

Fig. 1.2 Per<strong>for</strong>mance <strong>of</strong> <strong>SPSA</strong> algorithm (two measurements). 9<br />

Fig. 2.1 The two-recursions in 2nd-<strong>SPSA</strong> <strong>Algorithm</strong> 21<br />

Fig. 2.2 Diagram <strong>of</strong> method <strong>for</strong> <strong>for</strong>ming estimate F ( )<br />

39<br />

M , N<br />

θ<br />

Fig. 2.3 Split uni<strong>for</strong>m distribution 56<br />

Fig. 2.4 Inverse split uni<strong>for</strong>m distribution 57<br />

Fig. 2.5 Symmetric double triangular distribution 57<br />

Fig. 2.6 Identification with an unknown observation system 65<br />

Fig. 2.7 Identification results (with bias compensation) 75<br />

Fig. 2.8 Identification results (without bias compensation) 76<br />

Fig. 3.1 One-link flexible arm 82<br />

Fig. 3.2 Sliding mode surface 88<br />

Fig. 3.3 Block diagram <strong>of</strong> the sliding mode control system incorporating the non-linear<br />

observer 91<br />

Fig. 3.4 Motor angle. Without M2-<strong>SPSA</strong> and MR-SMC (dotted-line (-.-)).With RM-SA<br />

algorithm and MR-SMC (dashed-line (- -)). With LS algorithm and MR-SMC (dash-dot-line<br />

(-.)).With M2-<strong>SPSA</strong> and MR-SMC (solid-line (-)) 94<br />

Fig. 3.5 Tip position. Without M2-<strong>SPSA</strong> and MR-SMC (dotted-line (-.-)).With RM-SA<br />

algorithm and MR-SMC (dashed-line (- -)). With LS algorithm and MR-SMC (dash-dot-line<br />

(-.)).With M2-<strong>SPSA</strong> and MR-SMC (solid-line (-)) 95<br />

Fig. 3.6 Tip Velocity. Without M2-<strong>SPSA</strong> and MR-SMC (dotted-line (-.-)).With RM-SA<br />

algorithm and MR-SMC (dashed-line (- -)). With LS algorithm and MR-SMC (dash-dot-line<br />

(-.)).With M2-<strong>SPSA</strong> and MR-SMC (solid-line (-)) 95<br />

Fig. 3.7 Control torque. Without M2-<strong>SPSA</strong> and MR-SMC (dotted-line (-.-)).With RM-SA<br />

algorithm and MR-SMC (dashed-line (- -)). With LS algorithm and MR-SMC (dash-dot-line<br />

(-.)).With M2-<strong>SPSA</strong> and MR-SMC (solid-line (-)) 96<br />

Fig. 3.8 Motor angle. Simulation using x 1<br />

with M2-<strong>SPSA</strong> and MR-SMC (solid-line).<br />

Simulation using x m<br />

with M2-<strong>SPSA</strong> and MR-SMC (dashed-line) 96<br />

Fig. 3.9 Tip position. Simulation using x 3<br />

with M2-<strong>SPSA</strong> and MR-SMC (solid-line).<br />

Simulation using ˆx 3<br />

with M2-<strong>SPSA</strong> and MR-SMC (dashed-line) 96<br />

Fig. 3.10 Tip velocity. Simulation using x 4<br />

with M2-<strong>SPSA</strong> and MR-SMC (solid-line).<br />

Tip velocity. Simulation using ˆx 4<br />

with M2-<strong>SPSA</strong> and MR-SMC (dashed-line) 97<br />


Fig. 4.1 Block diagram of the SHARF lattice algorithm 107
Fig. 4.2 Block diagram of the SM lattice algorithm 109
Fig. 4.3 Convergence of the proposed SHARF algorithm and M2-SPSA 111
Fig. 4.4 Instability of the existing SHARF algorithm 111
Fig. 4.5 Instability of the existing SM algorithm 112
Fig. 4.6 Convergence of the proposed SM algorithm and M2-SPSA 112
Fig. 5.1 ML parameter estimate θ_k = [θ_{1,k}, θ_{2,k}, θ_{3,k}]^T for the bi-modal non-linear model using M2-SPSA. The true parameters in the model are defined by θ* = [0.5, 25, 8]^T 122
Fig. 5.2 Parameter estimation using 2nd-SPSA and FDSA 123



List <strong>of</strong> Tables<br />

Table 2.1 Characteristics <strong>of</strong> the perturbation distributions 55<br />

Table 2.2 Normalized loss values <strong>for</strong> 1st-<strong>SPSA</strong> and M2-<strong>SPSA</strong> with σ = 0.001;<br />

90% confidence interval shown in [⋅]<br />

72<br />

Table 2.3. Values <strong>of</strong><br />

Table 2.4 Values <strong>of</strong><br />

*<br />

θˆ<br />

k<br />

− θ<br />

with no measurement noise 74<br />

ˆ *<br />

θ − θ<br />

0<br />

*<br />

θˆ<br />

k<br />

− θ<br />

with measurement noise 74<br />

ˆ *<br />

θ − θ<br />

0<br />

Table 2.5 Comparison <strong>of</strong> estimators 76<br />

Table 3.1 Comparison <strong>of</strong> estimators (non-linear observer) 92<br />

Table 3.2 Comparison <strong>of</strong> estimators (MR-SMC) 92<br />

Table 3.3 Per<strong>for</strong>mance comparisons among M2-<strong>SPSA</strong>, RM-SA and LS 93<br />

Table 5.1 Computational statistics 123<br />

Table 6.1. Comparison <strong>of</strong> algorithms (per<strong>for</strong>mance) 127<br />

Table 6.2. Comparison <strong>of</strong> algorithms (computational cost) 128<br />



List <strong>of</strong> Abbreviations<br />

<strong>SPSA</strong><br />

1st-<strong>SPSA</strong><br />

2nd-<strong>SPSA</strong><br />

SP<br />

SA<br />

M2-<strong>SPSA</strong><br />

NN<br />

R-M<br />

FDSA<br />

LMS<br />

L-M<br />

ASP<br />

SG<br />

i.o.<br />

a.s.<br />

FIM<br />

MCNR<br />

MSE<br />

BP<br />

RMS<br />

MR-SMC<br />

VSS<br />

LS<br />

SM<br />

SHARF<br />

IIR<br />

FIR<br />

ODE<br />

HARF<br />

MSOE<br />

SMC<br />

ML<br />

Simultaneous perturbation stochastic approximation<br />

First-<strong>order</strong> <strong>of</strong> simultaneous perturbation stochastic approximation<br />

<strong>Second</strong>-<strong>order</strong> <strong>of</strong> simultaneous perturbation stochastic approximation<br />

Simultaneous perturbation<br />

Stochastic approximation<br />

Modified version <strong>of</strong> 2nd-<strong>SPSA</strong><br />

Neural network<br />

Robbins-Monroe<br />

Finite difference stochastic approximation<br />

Least mean square<br />

Levenberg-Marquardt<br />

Adaptive simultaneous perturbation<br />

Stochastic gradient<br />

Infinitely <strong>of</strong>ten<br />

Almost sure<br />

Fisher in<strong>for</strong>mation matrix<br />

Monte Carlo Newton-Raphson<br />

Mean Squire error<br />

Back-propagation<br />

Root mean square error<br />

Model reference-sliding mode control<br />

Variable structure system<br />

Least squares<br />

Steiglitz-McBride<br />

Simple hyperstable adaptive recursive filter<br />

Infinite impulse response<br />

Finite impulse response<br />

Ordinary differential equation<br />

Hyperstable adaptive recursive filter<br />

Mean-square output error<br />

Sequential Monte Carlo<br />

Maximum likelihood<br />



Chapter 1

Introduction

Multivariate stochastic optimization plays a major role in the analysis and control of many engineering systems [1]. In almost all real-world optimization problems, it is necessary to use a mathematical algorithm that iteratively seeks out the solution, because an analytical (closed-form) solution is rarely available. In this spirit, the "simultaneous perturbation stochastic approximation (SPSA)" method for difficult multivariate optimization problems has been developed. SPSA has recently attracted considerable international attention in areas such as statistical parameter estimation, feedback control, simulation-based optimization, signal and image processing, and experimental design. The essential feature of SPSA, which accounts for its power and relative ease of implementation, is the underlying gradient approximation that requires only two measurements of the objective function regardless of the dimension of the optimization problem. This feature allows for a significant decrease in the cost of optimization, especially in problems with a large number of variables to be optimized.

1.1 Motivation and Background

1.1.1 Motivation

The simultaneous perturbation stochastic approximation (SPSA) method is a very useful tool for solving optimization problems in which the cost function is analytically unavailable or difficult to compute. The method is essentially a randomized version of the Kiefer-Wolfowitz method, in which the gradient is estimated using only two measurements of the cost function at each iteration. SPSA is particularly efficient in problems of high dimension and in problems where the cost function must be estimated through expensive simulations. Our motivation is based on the features of the SPSA algorithm, which can be oriented toward parameter estimation in complex systems, where many existing algorithms have serious disadvantages. It is often necessary to estimate the parameters of a model of an unknown system. Various techniques exist to accomplish this task, including Kalman and Wiener filtering, least mean square (LMS) algorithms, and the Levenberg-Marquardt (L-M) algorithm. These techniques require an analytic form of the gradient with respect to the parameters to be estimated and usually have high computational complexity and cost [2]. There are also other kinds of parameter estimation algorithms whose


convergence is not stable because they cannot manage a great volume of parameters to be estimated. Therefore, the SPSA algorithm is convenient for these kinds of complex systems with a large number of parameters.

1.1.2 Background

This dissertation is an introduction to the simultaneous perturbation stochastic approximation (SPSA) algorithm for stochastic optimization of multivariate systems. Optimization algorithms play a critical role in the design, analysis, and control of most engineering systems and are in widespread use in the work of many organizations. Before presenting the SPSA algorithm, we provide some general background on the stochastic optimization context of interest here.

The mathematical representation of most optimization problems is the minimization (or maximization) of some scalar-valued objective function with respect to a vector of adjustable parameters. The optimization algorithm is a step-by-step procedure for changing the adjustable parameters from some initial guess (or set of guesses) to a value that offers an improvement in the objective function [3][4]. Figure 1.1 depicts this process for a very simple case of only two variables, θ_1 and θ_2, where our objective function is a loss function to be minimized (without loss of generality, we will discuss optimization in the context of minimization because a maximization problem can be trivially converted to a minimization problem by changing the sign of the objective function). Most real-world problems would have many more variables. The illustration in Fig. 1.1 is a typical example of a stochastic optimization setting with noisy input information because the loss function value does not uniformly decrease as the iteration process proceeds (note the temporary increase in the loss value in the third step of the algorithm). Many optimization algorithms have been developed that assume a deterministic setting and that assume information is available on the gradient vector associated with the loss function (i.e., the gradient of the loss function with respect to the parameters being optimized). However, there has been a growing interest in recursive optimization algorithms that do not depend on direct gradient information or measurements. Rather, these algorithms are based on an approximation to the gradient formed from measurements (generally noisy) of the loss function. This interest has been motivated, for example, by problems in the adaptive control and statistical identification of complex systems, the optimization of processes by large Monte Carlo simulations, the training of recurrent neural networks, the recovery of images from noisy sensor data, and the design of complex queuing and discrete-event systems.


Fig. 1.1. Example of stochastic optimization algorithm minimizing loss function L(θ_1, θ_2).

This dissertation focuses on the case where such an approximation is going to be used as a result of direct gradient information not being readily available. Overall, gradient-free stochastic algorithms exhibit convergence properties similar to the gradient-based stochastic algorithms [e.g., Robbins-Monro stochastic approximation (R-M SA)] while requiring only loss function measurements [5][6]. A main advantage of such algorithms is that they do not require the detailed knowledge of the functional relationship between the parameters being adjusted (optimized) and the loss function being minimized that is required in gradient-based algorithms. Such a relationship can be notoriously difficult to develop in some areas (e.g., non-linear feedback controller design), whereas in other areas (such as Monte Carlo optimization or recursive statistical parameter estimation), there may be large computational savings in calculating a loss function relative to that required in calculating a gradient. To elaborate on the distinction between algorithms based on direct gradient measurements and those based on gradient approximations from measurements of the loss function, the prototype gradient-based algorithm is R-M SA, which may be considered a generalization of such techniques as deterministic steepest descent and Newton-Raphson, neural network back-propagation (BP), and infinitesimal perturbation analysis-based optimization for discrete-event systems [9]. The gradient-based algorithms rely on direct measurements of the gradient of the loss function with respect to the parameters being optimized. These measurements typically yield an estimate of the gradient because the underlying data generally include added noise. Because it is not usually the case that one would obtain direct measurements of the gradient (with or without added noise) naturally in the course of operating or simulating a system, one must have detailed knowledge of the underlying system input-output relationships to calculate the R-M gradient estimate from basic system output measurements. In contrast, the approaches based on gradient


approximations require only conversion of the basic output measurements to sample values of the loss function, which does not require full knowledge of the system input-output relationships.

The classical method for gradient-free stochastic optimization is the Kiefer-Wolfowitz finite-difference SA (FDSA) algorithm [8]. Because of the fundamentally different information needed in implementing these gradient-based (R-M) and gradient-free algorithms, it is difficult to construct meaningful methods of comparison. As a general rule, however, the gradient-based algorithms will be faster to converge than those using loss-function-based gradient approximations when speed is measured in the number of iterations. Intuitively, this result is not surprising given the additional information required for the gradient-based algorithms. In particular, on the basis of asymptotic theory, the optimal rate of convergence measured in terms of the deviation of the parameter estimate from the true optimal parameter vector is of order k^(−1/2) for the gradient-based algorithms and of order k^(−1/3) for the algorithms based on gradient approximations, where k represents the number of iterations. (Special cases exist where the maximum rate of convergence for a non-gradient algorithm is arbitrarily close to, or equal to, k^(−1/2).)

In practice, of course, many other factors must be considered in determining which algorithm is best for a given circumstance, for the following three reasons: (1) It may not be possible to obtain reliable knowledge of the system input-output relationships, implying that the gradient-based algorithms may be either infeasible (if no system model is available) or undependable (if a poor system model is used). (2) The total cost to achieve effective convergence depends not only on the number of iterations required, but also on the cost needed per iteration, which is typically greater in gradient-based algorithms. (This cost may include greater computational burden, additional human effort required for determining and coding gradients, and experimental costs for model building such as labor, materials, and fuel.) (3) The rates of convergence are based on asymptotic theory and may not be representative of practical convergence rates in finite samples. For these reasons, one cannot say in general that a gradient-based search algorithm is superior to a gradient approximation-based algorithm, even though the gradient-based algorithm has a faster asymptotic rate of convergence (and with simulation-based optimization such as infinitesimal perturbation analysis requires only one system run per iteration, whereas the approximation-based algorithm may require multiple system runs per iteration). As a general rule, however, if direct gradient information is


conveniently and reliably available, it is generally to one's advantage to use this information in the optimization process. The focus of this dissertation is the case where such information is not readily available. The next section describes SPSA and the related FDSA algorithm. Then some of the theory associated with the convergence and efficiency of SPSA is summarized.

1.2 Overview of Stochastic Algorithms

This dissertation considers the problem of minimizing a (scalar) differentiable loss function L(θ), where θ is a p-dimensional vector and where the optimization problem can be translated into finding the minimizing θ* such that ∂L/∂θ = 0. This is the classical formulation of (local) optimization for differentiable loss functions. It is assumed that measurements of L(θ) are available at various values of θ. These measurements may or may not include added noise. No direct measurements of ∂L/∂θ are assumed available, in contrast to the R-M framework.

This section will describe the FDSA and SPSA algorithms. Although the emphasis of this dissertation is SPSA, the FDSA discussion is included for comparison because FDSA is a classical method for stochastic optimization. The SPSA and FDSA procedures are in the general recursive SA form:

θ̂_{k+1} = θ̂_k − a_k ĝ_k(θ̂_k)    (1.1)

where ĝ_k(θ̂_k) is the estimate of the gradient g(θ) ≡ ∂L/∂θ at the iterate θ̂_k based on the previously mentioned measurements of the loss function. Under appropriate conditions, the iteration in (1.1) will converge to θ* in some stochastic sense (usually "almost surely"); see, e.g., [7].

The essential part of (1.1) is the gradient approximation ĝ_k(θ̂_k). We discuss the two forms of interest here. Let y(·) denote a measurement of L(·) at a design level represented by the dot (i.e., y(·) = L(·) + noise) and let c_k be some (usually small) positive number. One-sided gradient approximations involve measurements y(θ̂_k) and y(θ̂_k + perturbation), whereas two-sided gradient approximations involve measurements of the form y(θ̂_k ± perturbation). The two general forms of gradient approximations for use in FDSA and SPSA are finite difference


and simultaneous perturbation (SP), respectively, which are discussed in the following paragraphs. For the finite-difference approximation, each component of θ̂_k is perturbed one at a time, and corresponding measurements y(·) are obtained. Each component of the gradient estimate is formed by differencing the corresponding y(·) values and then dividing by a difference interval. This is the standard approach to approximating gradient vectors and is motivated directly from the definition of a gradient as a vector of p partial derivatives, each constructed as the limit of the ratio of a change in the function value over a corresponding change in one component of the argument vector. Typically, the i-th component of ĝ_k(θ̂_k) (i = 1, 2, ..., p) for a two-sided finite-difference approximation is given by

ĝ_ki(θ̂_k) = [y(θ̂_k + c_k e_i) − y(θ̂_k − c_k e_i)] / (2 c_k)    (1.2)

where e_i denotes a vector with a one in the i-th place and zeros elsewhere (an obvious analogue holds for the one-sided version; likewise for the simultaneous perturbation form below), and c_k denotes a small positive number that usually gets smaller as k gets larger. The simultaneous perturbation has all elements of θ̂_k randomly perturbed together to obtain two measurements of y(·), but each component ĝ_ki(θ̂_k) is formed from a ratio involving the individual components in the perturbation vector and the difference in the two corresponding measurements. For two-sided simultaneous perturbation (SP), we have

ĝ_ki(θ̂_k) = [y(θ̂_k + c_k Δ_k) − y(θ̂_k − c_k Δ_k)] / (2 c_k Δ_ki)    (1.3)

where the distribution of the user-specified p-dimensional random perturbation vector Δ_k = (Δ_k1, Δ_k2, ..., Δ_kp)^T satisfies conditions discussed later in this dissertation (superscript T denotes vector transpose). Note that the number of loss function measurements y(·) needed in


each iteration <strong>of</strong> FDSA grows with p, whereas with <strong>SPSA</strong>, only two measurements are needed<br />

independent <strong>of</strong> p because the numerator is the same in all p components. This circumstance, <strong>of</strong><br />

course, provides the potential <strong>for</strong> <strong>SPSA</strong> to achieve a large savings (over FDSA) in the total<br />

number <strong>of</strong> measurements required to estimate θ when p is large. This potential is realized only<br />

if the number <strong>of</strong> iterations required <strong>for</strong> effective convergence to<br />

*<br />

θ<br />

does not increase in a way<br />

to cancel the measurement savings per gradient approximation at each iteration. In the following<br />

sections, the advantages in this potential <strong>of</strong> <strong>SPSA</strong> over FDSA will be described.<br />
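To make Eqs. (1.1)-(1.3) concrete, the following minimal Python sketch implements both gradient estimators and one step of the SA recursion. It is not part of the original dissertation: the function names are illustrative, and the symmetric Bernoulli ±1 perturbation is assumed here as one common choice of distribution for Δ_k.

```python
import numpy as np

def fd_gradient(y, theta, c_k):
    # Finite-difference estimate, Eq. (1.2): perturbs one component at a time,
    # so it needs 2p measurements of y per gradient approximation.
    p = theta.size
    g = np.empty(p)
    for i in range(p):
        e_i = np.zeros(p)
        e_i[i] = 1.0
        g[i] = (y(theta + c_k * e_i) - y(theta - c_k * e_i)) / (2.0 * c_k)
    return g

def sp_gradient(y, theta, c_k, rng):
    # Simultaneous-perturbation estimate, Eq. (1.3): perturbs all components
    # together, so it needs only 2 measurements of y regardless of p.
    delta = rng.choice([-1.0, 1.0], size=theta.size)  # assumed Bernoulli +/-1 choice for Delta_k
    num = y(theta + c_k * delta) - y(theta - c_k * delta)  # same numerator for all i
    return num / (2.0 * c_k * delta)

def sa_step(theta, g_hat, a_k):
    # One iteration of the recursion (1.1): theta_{k+1} = theta_k - a_k * g_hat.
    return theta - a_k * g_hat
```

The loop over components in fd_gradient is exactly what makes the FDSA measurement count grow with p, while sp_gradient reuses a single measured difference in all p components.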

1.3 Introduction to SPSA Algorithm

From here, the SPSA algorithm will be described in more detail. Stochastic approximation (SA) has long been applied to problems of minimizing loss functions or root-finding with noisy input information [10]. As with all stochastic search algorithms, there are adjustable algorithm coefficients that must be specified and that can have a profound effect on algorithm performance. It is known that picking these coefficients according to an SA analogue of the deterministic Newton-Raphson (N-R) algorithm provides an optimal or near-optimal form of the algorithm. However, directly determining the required Hessian matrix (or Jacobian matrix for root-finding) to achieve this algorithm form has often been difficult or impossible in practice [11]. This research presents a general adaptive SA algorithm that is based on an easy method for estimating the Hessian matrix at each iteration while concurrently estimating the primary parameters of interest. The approach applies in both the gradient-free optimization (Kiefer-Wolfowitz) and root-finding stochastic gradient-based (Robbins-Monro) settings and is based on the simultaneous perturbation (SP) idea introduced in [12]. There has recently been much interest in recursive optimization algorithms that rely on measurements of only the objective function to be optimized, not on direct measurements of the gradient (derivative) of the objective function [12]. Such algorithms have the advantage of not requiring detailed modeling information describing the relationship between the parameters to be optimized and the objective function. For example, many systems involving human beings or computer simulations are difficult to treat analytically, and could potentially benefit from such an optimization approach [11][12]. Stochastic optimization algorithms are used in virtually all areas of engineering and the physical and social sciences. Such techniques apply in the usual case where a closed-form solution to the optimization problem of interest is not available and where the input information to the optimization method may be contaminated with noise.


Typical applications include model fitting and statistical parameter estimation, experimental design, adaptive control, pattern classification, simulation-based optimization, and performance evaluation from test data. Frequently, the solution to the optimization problem corresponds to a vector of parameters at which the gradient of the objective (say, loss) function with respect to the parameters being optimized is zero. In many practical settings, however, the gradient of the loss function for use in the optimization process is not available or is difficult to compute (knowledge of the gradient usually requires complete knowledge of the relationship between the parameters being optimized and the loss function). So, there is considerable interest in techniques for optimization that rely on measurements of the loss function only, not on measurements (or direct calculations) of the gradient (or higher-order derivatives) of the loss function. One such technique, which uses only loss function measurements and has attracted considerable recent attention for difficult multivariate problems, is the SPSA algorithm introduced in [12]. This contrasts with algorithms requiring direct measurements of the gradient of the objective function (which are often difficult or impossible to obtain). Further, SPSA is especially efficient in high-dimensional problems in terms of providing a good solution for a relatively small number of measurements of the objective function. The essential feature of SPSA, which provides its power and relative ease of use in difficult multivariate optimization problems, is the underlying gradient approximation that requires only two objective function measurements per iteration regardless of the dimension of the optimization problem. These two measurements are made by simultaneously varying in a "proper" random fashion all of the variables in the problem. This contrasts with the classical FDSA method, where the variables are varied one at a time. If the number of terms being optimized is p, then the finite-difference method takes 2p measurements of the objective function at each iteration (to form one gradient approximation) while SPSA takes only two measurements (see Fig. 1.2). A fundamental result on relative efficiency is described below.

Under reasonably general conditions, SPSA and the standard finite-difference SA method achieve the same level of statistical accuracy for a given number of iterations even though SPSA uses p times fewer measurements of the objective function at each iteration (since each gradient approximation uses only 1/p the number of function measurements). This indicates that SPSA will converge to the optimal solution within a given level of accuracy with p times fewer measurements of the objective function than the standard method. An equivalent way of interpreting this statement is described in the following paragraph.


One properly generated simultaneous random change of all p variables in the problem contains as much information for optimization as a full set of p one-at-a-time changes of each variable [13]. Further, SPSA, like other stochastic approximation methods, formally accommodates noisy measurements of the objective function. This is an important practical concern in a wide variety of problems involving Monte Carlo simulations, physical experiments, feedback systems, or incomplete knowledge. The need for solving multivariate optimization problems is pervasive in engineering and the physical and social sciences. The SPSA algorithm has already attracted considerable attention for challenging optimization problems where it is difficult or impossible to directly obtain a gradient of the objective function. As mentioned above, the gradient approximation is based on only two function measurements (regardless of the dimension of the gradient vector). This contrasts with standard finite-difference approaches, which require a number of function measurements proportional to the dimension of the gradient vector.

SPSA is generally used in non-linear problems having many variables where the objective function gradient is difficult or impossible to obtain. As an SA algorithm, SPSA may be rigorously applied when noisy measurements of the objective function are all that are available. There have also been many successful applications of SPSA in settings where perfect measurements of the loss function are available.

Fig. 1.2. Performance of SPSA algorithm (two measurements).
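As a concrete (and deliberately simple) illustration of this relative-efficiency statement, the self-contained sketch below minimizes a noisy quadratic loss with SPSA. The quadratic loss, noise level, gain constants, and iteration count are hypothetical choices, not values from this dissertation; with p = 20, FDSA would spend 2p = 40 measurements per iteration, whereas this loop spends exactly two.

```python
import numpy as np

rng = np.random.default_rng(0)
p = 20                                       # number of parameters being optimized
theta = np.ones(p)                           # initial guess
L = lambda t: float(t @ t)                   # hypothetical true loss, minimized at theta* = 0
y = lambda t: L(t) + 0.01 * rng.normal()     # noisy measurement y(.) = L(.) + noise

measurements = 0
for k in range(2000):
    a_k = 0.1 / (k + 1) ** 0.602             # decaying gain sequences (illustrative constants)
    c_k = 0.1 / (k + 1) ** 0.101
    delta = rng.choice([-1.0, 1.0], size=p)  # simultaneous random perturbation of all p variables
    g_hat = (y(theta + c_k * delta) - y(theta - c_k * delta)) / (2.0 * c_k * delta)
    measurements += 2                        # only two y(.) evaluations, independent of p
    theta = theta - a_k * g_hat

print(measurements, L(theta))                # 4000 total measurements; the loss should be near zero
```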


1.4 Features of SPSA

1. SPSA allows the input to the algorithm to be measurements of the objective function corrupted by noise. For example, this is ideal for the case where Monte Carlo simulations are being used, because each simulation run provides one noisy estimate of the performance measure. This is especially relevant in practice, as a very large number of scenarios often need to be evaluated, and it will not be possible to run a large number of simulations at each scenario (to average out noise). So, an algorithm explicitly designed to handle noise is needed.

2. The algorithm is appropriate for high-dimensional problems where many terms are being determined in the optimization process. Many practical applications involve a significant number of such terms.

3. Performance guarantees for SPSA exist in the form of an extensive convergence theory. The algorithm has desirable properties for both global and local optimization in the sense that the gradient approximation is sufficiently noisy to allow for escape from local minima while being informative about the slope of the function to facilitate local convergence. This may avoid the cumbersome need in many global optimization problems to manually switch from a global to a local algorithm. However, we concentrate on the optimal area, so we omit the local minima problem.

4. Implementation of SPSA may be easier than other stochastic optimization methods, since there are fewer algorithm coefficients that need to be specified, and there are published guidelines [12] providing insight into how to pick the coefficients in practical applications (a sketch of typical gain-sequence choices follows this list).

5. While the original SPSA method is designed for continuous optimization problems, there have been recent extensions to discrete optimization problems. This may be relevant to certain design problems, for example, where one wants to find the best number of items to use in a particular application.

6. “Basic” SPSA uses only objective function measurements to carry out the iteration process, in a stochastic analogue of the steepest descent method of deterministic optimization.

1.5 -Applications Areas

Over the past several years, non-linear models have been increasingly used for simulation, state estimation and control purposes. In particular, the rapid progress in computational techniques and the success of non-linear model predictive control have been strong incentives for the development of such models as neural networks or first-principles models. Process modeling requires the estimation of several unknown parameters from noisy measurement data. A least-squares or maximum-likelihood cost function is usually minimized using a gradient-based optimization method [7]. Several techniques for computing the gradient of the cost function are available, including finite-difference approximations and analytic differentiation. In these techniques, the computational expense required to estimate the current gradient direction is directly proportional to the number of unknown model parameters, which becomes an issue for models involving a large number of parameters. This is typically the case in neural network modeling, but it can also occur in other circumstances, such as the estimation of parameters and initial conditions in first-principles models. Moreover, the derivation of sensitivity equations requires analytic manipulation of the model equations, which is time-consuming and subject to errors [7].

In contrast to standard finite differences, which approximate the gradient by varying the parameters one at a time, the simultaneous perturbation approximation of the gradient proposed by Spall and Chin [12] makes use of a very efficient technique based on a simultaneous (random) perturbation of all the parameters: on each iteration, SPSA needs only a few loss measurements to estimate the gradient, regardless of the dimensionality of the problem (number of parameters) [12]. Hence, one gradient evaluation requires only two evaluations of the cost function. This approach was first applied to gradient estimation in a first-order stochastic approximation algorithm, and more recently to Hessian estimation in an accelerated second-order SPSA algorithm. Using these features, the SPSA algorithm proposed in this dissertation will likewise be applied to non-linear systems regardless of the dimensionality of the problem.

Some of the general areas for application of SPSA include statistical parameter estimation, simulation-based optimization, pattern recognition, non-linear regression, signal processing, neural network (NN) training, adaptive feedback control, and experimental design. Specific system applications represented in the list of references include the following [14]:

1. Adaptive optics

2. Aircraft modeling and control

3. Atmospheric and planetary modeling

4. Fault detection in plant operations

5. Human-machine interface control

6. Industrial quality improvement

7. Medical imaging

8. Noise cancellation

9. Process control

10. Queuing network design

11. Robot control

12. Parameter estimation in highly non-linear models

The last item is an important goal of this research, because parameter estimation is very useful in realistic systems. It is often necessary to estimate the parameters of a model of an unknown system. Various techniques exist to accomplish this task, including least-mean-squares (LMS) algorithms and the Levenberg-Marquardt (L-M) algorithm [15]. These techniques require an analytic form of the gradient of the function of the parameters to be estimated. A key feature of the SPSA method is that it is a gradient-free optimization technique. The function of the parameters to be identified here is highly non-linear and of sufficient difficulty that obtaining an analytic form of the gradient is impractical.

1.6 -Formulation of SPSA Algorithm

The problem of minimizing a (scalar) differentiable loss function L(θ), where θ ∈ R^p with p ≥ 1, is considered. A typical example of L(θ) would be some measure of the mean-square error (MSE) for the output of a process as a function of some design parameters θ. For many cases of practical interest, this is equivalent to finding the minimizing point θ* such that

g(θ) = ∂L/∂θ = 0.    (1.4)

For the gradient-free setting, it is assumed that measurements of L(θ), say y(θ), are available at various values of θ. These measurements may or may not include random noise. No direct measurements (either with or without noise) of g(θ) are assumed available in this setting. In the Robbins-Monro/stochastic gradient (SG) case [9], it is assumed that direct measurements of g(θ) are available, usually in the presence of added noise. The basic problem is to take the available measurements of L(θ) and/or g(θ) and attempt to estimate θ*. This is essentially a local unconstrained optimization problem. The SPSA algorithm is a tool for solving optimization problems in which the cost function is analytically unavailable or difficult to compute. The algorithm is essentially a randomized version of the Kiefer-Wolfowitz method, in which the gradient is estimated using only two measurements of the cost function at each iteration [15][16]. SPSA is particularly efficient in problems of high dimension and where the cost function must be estimated through expensive simulations. The convergence properties of the algorithm have been established in [16]. Consider the problem of finding the minimum of a real-valued function L(θ), for θ ∈ D, where D is an open domain in R^p. The function is not assumed to be explicitly known, but noisy measurements M(n, θ) of it are available:

M(n, θ) = L(θ) + ε_n(θ)    (1.5)

where {ε_n} is the measurement noise process. We assume that the function L(·) is at least three-times continuously differentiable and has a unique minimizer in D. The process {ε_n} is a zero-mean process, uniformly bounded and smooth in θ in an appropriate technical sense. The problem is to minimize L(·) using only the noisy measurements M(·). The SPSA algorithm for minimizing functions relies on the SP gradient approximation [16]. At each iteration k of the algorithm, a random perturbation vector Δ_k = (Δ_k1, ..., Δ_kp)^T is drawn, where the Δ_ki form a sequence of Bernoulli random variables taking the values ±1. The perturbations are assumed to be independent of the measurement noise process. In fixed-gain SPSA, the step size of the perturbation is fixed at, say, some c > 0. To compute the gradient estimate at iteration k, it is necessary to evaluate M(·) at two values of θ:

M_k^+(θ) = L(θ + cΔ_k) + ε_{2k−1}(θ + cΔ_k)    (1.6)

M_k^−(θ) = L(θ − cΔ_k) + ε_{2k}(θ − cΔ_k).    (1.7)

The i-th component of the gradient estimate is

H_i(k, θ) = [M_k^+(θ) − M_k^−(θ)] / (2cΔ_ki).
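As an illustration, the following minimal Python sketch implements this fixed-gain SP gradient estimate; the quadratic loss, the noise level, and the value of c are illustrative assumptions only, not quantities taken from this chapter.

    import numpy as np

    rng = np.random.default_rng(0)

    def M(theta):
        # Noisy measurement M(n, theta) = L(theta) + noise, with an
        # illustrative quadratic loss L(theta) = 0.5 * ||theta||^2
        return 0.5 * np.sum(theta ** 2) + 0.01 * rng.standard_normal()

    def sp_gradient(theta, c=0.1):
        # One Bernoulli +/-1 perturbation of all p components at once
        delta = rng.choice([-1.0, 1.0], size=theta.size)
        m_plus = M(theta + c * delta)   # first measurement
        m_minus = M(theta - c * delta)  # second measurement
        # i-th component: (M+ - M-) / (2 c Delta_ki); only two
        # measurements are needed regardless of the dimension p
        return (m_plus - m_minus) / (2.0 * c * delta)

    theta = np.ones(10)
    print(sp_gradient(theta))

Note that Δ_k enters only through its reciprocal, which is why distributions with finite inverse moments, such as the symmetric Bernoulli ±1, are required (see Sec. 1.7).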

1.7 -Basic Assumptions of SPSA Algorithm

Once again, the goal is to minimize a loss function L(θ) over θ ∈ C ⊆ R^p. The SPSA algorithm works by iterating from an initial guess of the optimal θ, where the iteration process depends on the above-mentioned simultaneous perturbation approximation to the gradient g(θ). Sufficient conditions for convergence of the SPSA iterate (θ̂_k → θ* a.s.) are presented in [16], using a differential-equation approach well known in SA theory [17]. In particular, conditions must be imposed on both gain sequences (a_k and c_k), on the user-specified distribution of Δ_k, and on the statistical relationship of Δ_k to the measurements y(·). We will not repeat the conditions here, since they are available in [17]. The main conditions are that a_k and c_k both go to 0 at rates neither too fast nor too slow, that L(θ) is sufficiently smooth (several times differentiable) near θ*, and that the {Δ_ki} are independent and symmetrically distributed about 0 with finite inverse moments E(|Δ_ki|^{−1}) for all k, i. One particular distribution for Δ_ki that satisfies these latter conditions is the symmetric Bernoulli ±1 distribution; two common distributions that do not satisfy the conditions (in particular, the critical finite-inverse-moment condition) are the uniform and the normal.
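For concreteness, a sketch of a valid perturbation generator and decaying gain sequences follows; the form a_k = a/(k+1+A)^α, c_k = c/(k+1)^γ and the particular coefficient values (α = 0.602 and γ = 0.101 are commonly cited guideline values in the spirit of [12]) are illustrative choices, not prescriptions of this thesis.

    import numpy as np

    rng = np.random.default_rng(1)

    def gains(k, a=0.16, A=100.0, alpha=0.602, c=0.1, gamma=0.101):
        # Both sequences decay to 0, neither too fast nor too slow
        a_k = a / (k + 1 + A) ** alpha
        c_k = c / (k + 1) ** gamma
        return a_k, c_k

    def perturbation(p):
        # Symmetric Bernoulli +/-1: zero mean, symmetric about 0, and
        # E|Delta^(-1)| = 1 is finite; uniform and normal variates
        # fail the finite-inverse-moment condition
        return rng.choice([-1.0, 1.0], size=p)

    a_k, c_k = gains(k=0)
    print(a_k, c_k, perturbation(5))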

Although the convergence result for SPSA is of some independent interest, the most interesting theoretical results in [16], and those that best justify the use of SPSA, are the asymptotic efficiency conclusions that follow from an asymptotic normality result. In particular, under some minor additional conditions in [16] (Proposition 2), it can be shown that

k^{β/2} (θ̂_k − θ*) → N(µ, Σ) in distribution as k → ∞    (1.8)

where β > 0 depends on the choice of the gain sequences (a_k and c_k), µ depends on both the Hessian and the third derivatives of L(θ) at θ*, and Σ depends on the Hessian matrix (note that in general µ ≠ 0, in contrast to many well-known asymptotic normality results in estimation). Given the restrictions on the gain sequences needed to ensure convergence and asymptotic normality, the fastest allowable rate of convergence of θ̂_k to θ* is k^{−1/3}.

In addition to establishing the formal convergence of SPSA, Spall [18] shows that the probability distribution of an appropriately scaled θ̂_k is approximately normal (with a specified mean and covariance matrix) for large k. Spall [18] uses the asymptotic normality result in (1.8), together with a parallel result for FDSA [9], to establish the relative efficiency of SPSA. This efficiency depends on the shape of L(θ), the values of {a_k} and {c_k}, and the distributions of the {Δ_k} and the measurement noise terms. There is no single expression that can be used to characterize the relative efficiency; however, as discussed in [17], in most practical problems SPSA will be asymptotically more efficient than FDSA.

For example, if a_k and c_k are chosen as in the guidelines of Spall [18], then by equating the asymptotic mean squared error E(‖θ̂_k − θ*‖²) in the SPSA and FDSA algorithms, we find

(No. of measurements of L(θ) in SPSA) / (No. of measurements of L(θ) in FDSA) → 1/p

as the number of loss measurements in both procedures gets large. For example, with p = 20 parameters, SPSA requires only about one-twentieth of the loss measurements that FDSA needs for comparable accuracy. Hence, the expression above implies that the p-fold savings per iteration (gradient approximation) translates directly into a p-fold savings in the overall optimization process, despite the complex non-linear ways in which the sequence of gradient approximations manifests itself in the ultimate solution θ̂_k. One properly chosen simultaneous random change in all the variables in a problem provides as much information for optimization as a full set of one-at-a-time changes of each variable.

1.8 -Versions of SPSA Algorithm

The standard first-order SA algorithms for estimating θ involve a simple recursion with, usually, a scalar gain and an approximation to the gradient based on measurements of L(·).

The first-order SPSA (1st-SPSA, or simply SPSA) algorithm mentioned previously requires only two measurements of L(·) to form the gradient approximation, independent of p (versus 2p in the standard multivariate finite-difference approximation considered, e.g., in [8]), which extends the scalar algorithm of Kiefer and Wolfowitz [8]. Theory presented in [17] shows that for large p the 1st-SPSA approach can be much more efficient (in terms of the total number of loss measurements needed to achieve effective convergence to θ*) than the finite-difference approach in many cases of practical interest. In extending 1st-SPSA to a second-order (accelerated) form [18], explained below, the gradient and inverse Hessian of L(·) can both be estimated on a per-iteration basis using only three measurements of L(·) (again, independent of p). With these estimates, it is possible to create an SA analogue of the Newton-Raphson algorithm (which, recall, is based on an update step that is negatively proportional to the inverse Hessian times the gradient) [17]. The aim of the second-order SPSA (2nd-SPSA) algorithm is to emulate the acceleration properties associated with deterministic algorithms of Newton-Raphson form, particularly in the terminal phase where the first-order SPSA algorithm slows down in its convergence [18]. This approach requires only three loss function measurements at each iteration, independent of the problem dimension. The 2nd-SPSA approach is composed of two parallel recursions, one for θ and one for the upper triangular matrix square root, say S = S(θ), of the Hessian of L(θ) (the square root is estimated to ensure that the inverse Hessian estimate used in the second-order SPSA recursion for θ is positive semi-definite). The two recursions are, respectively [18],

θ̂_{k+1} = θ̂_k − a_k (Ŝ_k^T Ŝ_k)^{−1} ĝ_k(θ̂_k)    (1.9)

Ŝ_{k+1} = Ŝ_k − ã_k Ĝ_k(Ŝ_k)    (1.10)

where a_k and ã_k are non-negative scalar gain coefficients, ĝ_k(θ̂_k) is the SP gradient approximation to g(θ̂_k) [18], and Ĝ_k is an observation related to the gradient of a certain loss function with respect to S. Note that Ŝ_k^T Ŝ_k (which depends on θ̂_k) represents an estimate of

the Hessian matrix of L(θ̂_k). Hence, equation (1.9) is a stochastic analogue of the well-known Newton-Raphson algorithm of deterministic optimization. Since ĝ_k(θ̂_k) has a known form, the parallel recursions in equations (1.9) and (1.10) can be implemented once Ĝ_k is specified. The SP gradient approximation requires two measurements of L(·): y_k^(+) and y_k^(−). These represent measurements at the design levels θ̂_k + c_kΔ_k and θ̂_k − c_kΔ_k, respectively, where c_k is a positive scalar and Δ_k represents a user-generated random vector satisfying certain regularity conditions; e.g., Δ_k being a vector of independent Bernoulli ±1 random variables satisfies these conditions, but a vector of uniformly distributed random variables does not. The “SP” comes from the fact that all elements of θ̂_k are perturbed simultaneously in forming ĝ_k(θ̂_k), as opposed to the finite-difference form, where they are perturbed one at a time. To perform one iteration of (1.9) and (1.10), one additional measurement, say y_k^(0), is required; this measurement represents an observation of L(·) at the nominal design level θ̂_k.
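A minimal sketch of the θ-recursion (1.9) in Python, assuming the current square-root factor Ŝ_k is available; the companion update (1.10) is omitted because the form of Ĝ_k is only referenced, not specified, here, and the small ridge term is an illustrative numerical safeguard.

    import numpy as np

    rng = np.random.default_rng(2)

    def theta_update(theta, S_hat, y, a_k, c_k):
        # y(.) returns a (noisy) measurement of the loss function
        delta = rng.choice([-1.0, 1.0], size=theta.size)
        y_plus = y(theta + c_k * delta)    # y_k^(+)
        y_minus = y(theta - c_k * delta)   # y_k^(-)
        g_hat = (y_plus - y_minus) / (2.0 * c_k * delta)
        # S^T S is positive semi-definite by construction
        H_hat = S_hat.T @ S_hat + 1e-8 * np.eye(theta.size)
        # Solve (S^T S) d = g_hat rather than forming an explicit inverse
        return theta - a_k * np.linalg.solve(H_hat, g_hat)

The third measurement, y_k^(0) at the nominal θ̂_k, is needed by the Ŝ_k recursion, which is not shown.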

Main Advantages:

- 1st-SPSA gives region(s) where the function value is low, and this allows one to conjecture in which region(s) the global solution lies.

- 2nd-SPSA is based on a highly efficient approximation of the gradient from loss function measurements. In particular, on each iteration it needs only three loss measurements, regardless of the dimensionality of the problem. Moreover, 2nd-SPSA is grounded in a solid mathematical framework that permits assessing its stochastic properties even for optimization problems affected by noise or uncertainties. Due to these striking advantages, 2nd-SPSA has recently been used as the optimization engine for adaptive control problems.

Main Disadvantages:

- 1st-SPSA gives slow convergence.

- 2nd-SPSA does not take into account equality/inequality constraints.

The 1st-SPSA and 2nd-SPSA algorithms do not depend on derivative information and are able to find a good approximation to the solution using few function values. Their disadvantage is that, once a good approximation is obtained, it may not satisfy some conditions and constraints associated with complex problems [17][18]. Also, in both versions of the SPSA algorithm, it is not possible to guarantee that the non-positive-definite part of the Hessian matrix can be eliminated when the number of parameters to be adjusted is large. This can cause instability in the system, and both versions can also become very expensive computationally. Finally, in the 1st-SPSA and 2nd-SPSA algorithms, the error for a loss function with an ill-conditioned Hessian is greater than the error for one with a well-conditioned Hessian, and with this problem the system performance decreases. Also, in estimating the optimum parameters of a model or time series, several factors must be considered when deciding on the appropriate optimization technique. Among these factors are convergence speed, accuracy, algorithm suitability, complexity, and computational cost in terms of time (coding, run-time, output) and power. In the parameter estimation application, 2nd-SPSA has had problems with convergence to local minima and with computational cost, and some techniques are therefore proposed in [18] in order to solve these kinds of problems efficiently. Nevertheless, when the number of parameters to be adjusted is very large, the convergence is slow and unstable. The techniques defined in [18] include a mapping of the Hessian matrix, but this mapping is not consistent under some conditions or applications. Therefore, in view of these disadvantages (theoretical and practical), in the following chapter we propose some improvements to the speed and stability of the 2nd-SPSA algorithm, in particular its stability, convergence, and computational cost. A new mapping is also suggested for implementation in 2nd-SPSA that eliminates the non-positive definiteness while preserving key spectral properties of the estimated Hessian. This Hessian is estimated using the Fisher information matrix in order to keep it positive definite and to improve the stability. These improvements constitute our proposed SPSA algorithm, which is described in the following chapter.



Chapter 2

Proposed SPSA Algorithm

We propose a modification to the simultaneous perturbation stochastic approximation (SPSA) methods, based on comparisons made between the first- and second-order SPSA (1st-SPSA and 2nd-SPSA) algorithms from the perspective of the loss function Hessian. At finite iterations, the accuracy of the algorithm depends on the matrix conditioning of the loss function Hessian. The error of the 2nd-SPSA algorithm for a loss function with an ill-conditioned Hessian is greater than for one with a well-conditioned Hessian. On the other hand, the 1st-SPSA algorithm is less sensitive to the matrix conditioning of loss function Hessians. The modified 2nd-SPSA (M2-SPSA) eliminates the error amplification caused by the inversion of an ill-conditioned Hessian. This leads to significant improvements in algorithm efficiency in problems with an ill-conditioned Hessian matrix. Asymptotically, the efficiency analysis shows that M2-SPSA is also superior to 2nd-SPSA over a large parameter domain. It is shown that the ratio of the mean square errors of M2-SPSA to 2nd-SPSA is always less than one, except for a perfectly conditioned Hessian or for an asymptotically optimal setting of the gain sequence. Also, an improved estimation of the Hessian matrix is proposed in order to guarantee that the non-positive-definite part of this matrix can be eliminated; using this proposed estimation, the computational cost is also reduced when our method is applied to parameter estimation.

2.1 -Overview of Modified 2nd-SPSA Algorithm

The recently developed simultaneous perturbation stochastic approximation (SPSA) method has found many applications in areas such as physical parameter estimation and simulation-based optimization. The novelty of SPSA is the underlying derivative approximation, which requires only two (for the gradient) or four (for the Hessian matrix) evaluations of the loss function regardless of the dimension of the optimization problem. There exist two basic SPSA algorithms that are based on the “simultaneous perturbation” (SP) concept and that use only (noisy) loss function measurements. The first-order SPSA (1st-SPSA) is related to the Kiefer–Wolfowitz (K–W) stochastic approximation (SA) method [17], whereas the second-order SPSA (2nd-SPSA) is a stochastic analogue of the deterministic Newton–Raphson algorithm [18]. There have been several studies that compare the efficiency of 1st-SPSA with other stochastic approximation (SA) methods. It is generally accepted that 1st-SPSA is superior to

other first-order SA methods (such as the standard K–W method) due to its efficient estimator of the loss function gradient. Spall [28] shows that a ‘standard’ implementation of 2nd-SPSA achieves a nearly optimal asymptotic error, with the asymptotic root-mean-square error being no more than twice the optimal (but unachievable) error from an infeasible gain sequence depending on the third derivatives of the loss function. This appealing result for 2nd-SPSA is achieved with a trivial gain sequence (a_k = 1/(k+1) in the notation below), which effectively eliminates the nettlesome issue of selecting a “good” gain sequence. Because this result is asymptotic, however, performance in finite samples may sometimes be improved using other considerations. Part of the purpose of this chapter is to provide a comparison between 1st-SPSA and 2nd-SPSA from the perspective of the conditioning of the loss function Hessian matrix. To achieve objectivity in the comparison, we also suggest a new mapping for implementing 2nd-SPSA that eliminates the non-positive definiteness while preserving key spectral properties of the estimated Hessian. While the focus of this chapter is finite-sample analysis, we are necessarily limited by the theory available for SA algorithms, almost all of which is asymptotic. The numerical examples illustrating the empirical results at finite iterations will be carefully chosen to represent a wide range of matrix conditioning for the loss function Hessians.

2.2 -SPSA Algorithm Recursions

There has recently been growing interest in recursive optimization algorithms of SA form that do not depend on direct gradient information or measurements [19]-[21]. Rather, these SA algorithms are based on an approximation to the p-dimensional gradient formed from measurements of the objective function. This interest has been motivated by problems such as the adaptive control of complex processes, the training of recurrent NNs, and the optimization of complex queuing and estimation parameters. The principal advantage of algorithms that do not require direct gradient measurements (gradient-free algorithms) is that they do not require knowledge of the functional relationship between the parameters being adjusted and the objective function being minimized. The SPSA algorithm, which is based on a highly efficient gradient approximation, is one such gradient-free algorithm. Within the SPSA family there are two important orders, 1st-SPSA (or simply SPSA) and 2nd-SPSA. These algorithms are described as follows:

1st-SPSA [17]:

θ̂_{k+1} = θ̂_k − a_k ĝ_k(θ̂_k),  k = 0, 1, 2, ...    (2.1)

2nd-SPSA [18]:

θ̂_{k+1} = θ̂_k − a_k H̄_k^{−1} ĝ_k(θ̂_k),  H̄_k = f_k(H_k)    (2.2a)

H_k = [k/(k+1)] H_{k−1} + [1/(k+1)] Ĥ_k,  k = 0, 1, 2, ...    (2.2b)

where a_k and c_k are scalar gain series that satisfy certain SA conditions [18], ĝ_k is the SP estimate of the loss function gradient, which depends on the gain sequence c_k (representing a difference interval for the perturbations), Ĥ_k is the SP estimate of the Hessian matrix, and f_k maps the usually non-positive-definite H_k to a positive-definite p×p matrix. The two recursions are shown in Fig. 2.1. Let Δ_k be a user-generated mean-zero random vector of dimension p with its components being independent random variables.

Fig. 2.1. The two recursions in the 2nd-SPSA algorithm (solid line: eq. (2.2a); dashed line: eq. (2.2b)).

The i-th element of the loss function gradient estimate is given by [18]:

(ĝ_k)_i = (2c_k Δ_ki)^{−1} [y(θ̂_k + c_kΔ_k) − y(θ̂_k − c_kΔ_k)],  i = 1, 2, ..., p    (2.3)

where Δ_ki is the i-th component of the Δ_k vector and y(θ) is a measurement of the loss function:

y(θ) = L(θ) + (noise)    (2.4)

and θ* denotes the true (optimal) value of the parameter θ.
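The averaging step (2.2b) itself is a one-line computation; a minimal sketch follows, in which the per-iteration estimate Ĥ_k is assumed to be supplied (e.g., by the FIM-based construction of Sec. 2.6):

    def average_hessian(H_prev, H_hat_k, k):
        # Eq. (2.2b): running weighted average of the per-iteration
        # Hessian estimates; H_prev is H_{k-1}, H_hat_k is the new estimate
        return (k / (k + 1.0)) * H_prev + (1.0 / (k + 1.0)) * H_hat_k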

It is noted that the 2nd-SPSA form is a special case of the general adaptive SP method. The general method can also be used in root-finding problems, where H_k represents an estimate of the associated Jacobian matrix. The true Hessian matrix H(θ) of the loss function has its ij-th element defined as H_ij = ∂²L/∂θ_i∂θ_j, and its value at the solution, H(θ*), is denoted by H*. Finally, its estimate, and the ij-th element of that estimate, are defined in Sec. 2.6 using the Fisher information matrix (FIM). The FIM is used here instead of the Hessian matrix in order to estimate this matrix efficiently [22]; it is obtained by Monte Carlo Newton-Raphson (MCNR) [23]. This Hessian matrix estimate is convenient in an optimization application and is a crucial requirement for the new mapping f_k proposed in the following section.

2.3 -Proposed Mapping

An important point in implementing 2nd-SPSA is to define the mapping f_k from H_k to H̄_k, since the former is often non-positive definite in practice. It is noted that there are no simple and universal conditions that guarantee a matrix to be positive definite. The existence of a minimum (or minima) for a loss function based on the problem's physical nature guarantees that its Hessian should be positive definite. The following approach eliminates the non-positive definiteness of H_k, and by using the Fisher information matrix we can maintain this condition even when the real application has very high computational complexity. This approach is motivated by finite-sample concerns, as we discuss below. First, we compute the eigenvalues of H_k and sort them into descending order:

Λ_k ≡ diag[λ_1, λ_2, ..., λ_{q−1}, λ_q, λ_{q+1}, ..., λ_p]    (2.5)

where λ_q > 0 and λ_{q+1} ≤ 0. As H_k is real-valued and symmetric, its eigenvalues are real-valued too. The eigenvalues of H_k are computed as follows. The number of non-zero eigenvalues is equal to the rank of H_k; i.e., at most three non-zero eigenvalues are available, and the arrangement λ_1 ≥ λ_2 ≥ λ_3 is assumed in this part. The technique presented here requires much less user interaction; the theoretical background leads to a two-fold threshold algorithm in which the only task of the user is to specify two thresholds. Finding the eigenvalues and eigenvectors of the Hessian matrix is closely related to its decomposition

H = P D_i P^{−1}    (2.6)

where P is a matrix whose columns are H's eigenvectors and D_i is a diagonal matrix having H's eigenvalues on its diagonal. While computing the gradient magnitude by the Euclidean norm requires three multiplications, two additions and one square root, the computation of the eigenvalues of the Hessian matrix is more suitable; the explicit formula would require solving cubic polynomials. In our implementation, a fast-converging numerical technique called Jacobi's method is used, as recommended in [20] for symmetric matrices. We have proposed an easy-to-use framework for exploiting the eigenvalues of the Hessian matrix to represent volume data by small subsets.

The relation of the eigenvalues to the Laplacian operator is recalled; this shows the suitability of thresholding eigenvalue volumes, and a two-fold threshold operation is defined to generate sparse data sets. For data where it can be assumed that objects exhibit higher intensities than the background, we modify the framework to take into account only the smallest eigenvalue. This results in a further reduction of the representative subsets, by selecting just the data at the interior side of object boundaries. For the sake of simplicity, we have omitted the index k for the individual eigenvalue λ_i, which is a function of k. Next, we assume that the negative eigenvalues will not lead to a physically meaningful solution. They are either caused by errors in H_k or are due to the fact that the iteration has not reached the neighborhood of θ* where the loss function is locally quadratic. Therefore, we replace them, together with the smallest positive eigenvalue, by a descending series of positive eigenvalues:

λ̂_q = ελ_{q−1},  λ̂_{q+1} = ελ̂_q, ...,  λ̂_p = ελ̂_{p−1}    (2.7)

where the adjustable parameter 0 < ε < 1 can be specified based on the existing positive eigenvalues:

ε = (λ_{q−1}/λ_1)^{1/(q−2)}.    (2.8)

The purpose of redefining the smallest positive eigenvalue λ_q is to avoid a possible near-zero value that would make the mapped matrix nearly singular. We let Λ̂_k be the diagonal matrix Λ_k with the eigenvalues λ_q, ..., λ_p replaced by λ̂_q, ..., λ̂_p defined according to (2.7); this also guarantees the stability of this diagonal matrix when the realistic system is very complex or, in our case, when the number of parameters to be estimated is very large. The Jacobi algorithm is proposed because the matrices in this algorithm need, in general, to be positive definite, and hence (2.2a) should be projected appropriately after each parameter update so as to ensure that the resulting matrices are positive definite. Equations (2.7) and (2.8) indicate that the spectral character of the existing positive eigenvalues, as measured by the ratio of the maximum to minimum eigenvalues, whether widely or narrowly spread, is extrapolated to the rest of the matrix spectrum. Other forms of specification, such as ε = (λ_{q−1}/λ_1)^{1/[2(q−2)]} or ε = 1, would also effectively eliminate the non-positive definiteness. Because the separating point q between the positive and negative eigenvalues slowly increases from 1 to p, we find numerically that the specification based on (2.8) yields relatively faster convergence in most cases. Since H_k is symmetric, it is orthogonally similar to the real diagonal matrix of its real eigenvalues:

H_k = P_k Λ_k P_k^T    (2.9)

where the orthogonal matrix P_k consists of all the eigenvectors of H_k, which are usually derived together with the eigenvalues. Now, the mapping f_k can be expressed as

f_k(H_k) = P_k Λ̂_k P_k^T.    (2.10)

Since it is H̄_k^{−1} that is used in the 2nd-SPSA recursion (2.2a), mapping (2.10) with the available eigenvectors of H_k also leads to an easy inversion of the estimated Hessian:

H̄_k^{−1} = P_k Λ̂_k^{−1} P_k^T.    (2.11)

The 2nd-SPSA based on mapping (2.10) makes the procedure of eliminating the non-positive definiteness of H_k a precise one. It is noted that the key parameters needed for the mapping (ε and λ_{q−1}) are internally determined by H_k at each iteration. This is different from some other forms of f_k, where a user-specified coefficient is needed.
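The proposed mapping can be summarized by the following Python sketch; it follows (2.5), (2.7), (2.8), (2.10) and (2.11) directly, while the guard for very small q is an implementation assumption and the sketch presumes at least two positive eigenvalues.

    import numpy as np

    def proposed_mapping(H_k):
        # Eigendecomposition of the symmetric estimate, eq. (2.9)
        w, P = np.linalg.eigh(H_k)
        order = np.argsort(w)[::-1]          # descending order, eq. (2.5)
        lam, P = w[order], P[:, order]
        q = int(np.sum(lam > 0))             # lam[q-1] > 0 >= lam[q]
        if q >= 2:
            # eq. (2.8); the max() guard for q = 2 is an assumption
            eps = (lam[q - 1] / lam[0]) ** (1.0 / max(q - 2, 1))
            # eq. (2.7): replace the smallest positive and all
            # non-positive eigenvalues by a descending positive series
            for i in range(q - 1, lam.size):
                lam[i] = eps * lam[i - 1]
        H_mapped = P @ np.diag(lam) @ P.T        # f_k(H_k), eq. (2.10)
        H_inv = P @ np.diag(1.0 / lam) @ P.T     # easy inverse, eq. (2.11)
        return H_mapped, H_inv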

λ_p(ΔH_k) ≤ λ_i − λ_i* ≤ λ_1(ΔH_k)  for all i = 1, 2, ..., p    (2.12)

where λ_i* denotes the eigenvalues of H*, and λ_p(ΔH_k) and λ_1(ΔH_k) are, respectively, the minimum and maximum eigenvalues of the k-th perturbation matrix ΔH_k = H_k − H*. Equation (2.12) suggests that the perturbation matrix will have a greater impact on the smaller eigenvalues, in terms of their fractional changes, as H_k converges to H*. Hence, the smallest positive eigenvalue (λ_q) has also been redefined at each iteration to avoid a possible near-zero value. When all the eigenvalues in (2.5) are positive and the smallest becomes stabilized, say empirically λ_p > 0.1(ελ_{p−1}) with ε = (λ_{p−1}/λ_1)^{1/(p−2)}, or λ_p > 0 in 10 consecutive iterations, we set Λ̂_k = Λ_k. Specifically, H_k asymptotically converges to a positive-definite H*, so that λ_p > 0 as k → ∞; see [24]. Hence, Λ̂_k − Λ_k → 0, since, asymptotically, the elements of Λ̂_k are continuous functions of H_k; here Λ_k is a continuous function of H_k. Therefore, Λ_k → Λ* almost surely when H_k → H*, where Λ* denotes all the eigenvalues of H*. This follows from the basic property of continuous functions for deterministic sequences; both Λ_k and H_k converge for almost all points in their underlying sample spaces. We further note that our mapping from Λ_k to Λ̂_k defined by (2.7) and (2.8) is also a continuous function asymptotically. Here, we would like to point out that the mapping f_k defined by (2.10) preserves key spectral characters, such as the spread of the known

positive eigenvalues, λ_1/λ_q. Furthermore, as k → ∞, any mapping for 2nd-SPSA should preserve the complete spectral property of H_k^{−1}. Therefore, the proposed mapping to a matrix in 2nd-SPSA is different from the matrix regularization in an ill-posed inversion problem, where the spectral property of an ill-conditioned matrix is changed to make the problem well posed.

2.4 -Description of Proposed SPSA Algorithm

The 1st-SPSA algorithm predetermines the gain series a_k for the whole iteration process, whereas 2nd-SPSA derives a generalized gain series a_k H̄_k^{−1} that is adapted to near-optimality at each iteration. However, based on the previous analyses, the inverse of the estimated Hessian, H̄_k^{−1}, generally introduces additional error sensitivity inherited from H_k for a non-perfectly conditioned matrix (κ_k > 1). To avoid computing the inverse of an ill-conditioned matrix while still approximately optimizing the gain series at each iteration, we can modify the first recursion of 2nd-SPSA (2.2a) by replacing Λ̂_k in the mapping f_k of (2.10) with a matrix Λ̄_k that contains constant diagonal elements:

θ̂_{k+1} = θ̂_k − a_k λ̄_k^{−1} ĝ_k(θ̂_k)    (2.13)

where λ̄_k is the geometric mean of all the eigenvalues of H̄_k:

λ̄_k = (λ_1 λ_2 ... λ_{q−1} λ̂_q λ̂_{q+1} ... λ̂_p)^{1/p}.    (2.14)
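A minimal sketch of the modified update (2.13)-(2.14) in Python, assuming the mapped eigenvalues λ̂_i from the previous section and a gradient estimate ĝ_k computed as in (2.3):

    import numpy as np

    def m2spsa_step(theta, g_hat, lam_hat, a_k):
        # Eq. (2.14): geometric mean of the (all positive) mapped
        # eigenvalues, computed in log space for numerical stability
        lam_bar = np.exp(np.mean(np.log(lam_hat)))
        # Eq. (2.13): a scalar gain replaces the full inverse Hessian,
        # so no ill-conditioned matrix is ever inverted
        return theta - a_k * g_hat / lam_bar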

Recursions (2.13) and (2.2b), together with (2.5), (2.7)-(2.8) and (2.14), form a modified version of 2nd-SPSA (M2-SPSA) that takes advantage of both the well-conditioned 1st-SPSA and the internally determined gain sequence of 2nd-SPSA. The proportionality coefficient a of a_k (= a/(k+1+A)^α, A ≥ 0) in 1st-SPSA depends on the individual loss function and is generally selected by a trial-and-error approach in practice. On the other hand, the 2nd-SPSA algorithm removes such uncertainty in selecting its proportionality coefficient a of a_k (= a/(k+1+A)^α, A ≥ 0), since the asymptotically near-optimal selection of a is 1 [24]. The crucial property that a in 1st-SPSA depends on the individual loss function has been built into 2nd-SPSA through its generalized gain series (k+1+A)^{−α} H̄_k^{−1}, A ≥ 0. From this perspective, our proposed SPSA algorithm (2.13) can be considered as an extension of

1st-SPSA in which a is replaced by a scalar series λ̄_k^{−1} that depends on the individual loss function and varies with the iteration. Before replacing a by λ̄_k^{−1}, in order to enhance convergence and stability, the use of an adaptive gain sequence for parameter updating is proposed; this application considers the following conditions:

a) a_k = η a_{k−1}, η ≥ 1, if J(θ_k) < (1+β) J(θ_{k−1});

b) a_k = µ a_{k−1}, µ < 1, if J(θ_k) ≥ (1+β) J(θ_{k−1}).

In addition to gain attenuation when the value of the criterion becomes worse, a “blocking” mechanism is also applied; i.e., the recurrent step is rejected and, starting from the previous parameter estimate, a new step is carried out (with a new gradient evaluation and a reduced updating gain). The parameter β in condition (a) represents the permissible increase in the criterion before step rejection and gain attenuation occur. A constant gain sequence c_k = c, as assumed in the implementation of SPSA in Sec. 2.8, can be used for the gradient approximation, the value of c being selected so as to overcome the influence of the noise. In the neighborhood of the optimum, a decaying sequence of the form defined in Sec. 2.8 is required to evaluate the gradient with enough accuracy and to avoid an amplification of the “slowing down” effect. Once these conditions have been implemented in a_k, a can be replaced by λ̄_k^{−1}.
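A sketch of this adaptive-gain and blocking logic, under the reading of conditions (a) and (b) given above (η, µ and β are illustrative values):

    def adapt_gain(a_prev, J_new, J_prev, eta=1.05, mu=0.5, beta=0.1):
        # Condition (a): criterion within the permissible increase ->
        # accept the step and (possibly) grow the gain
        if J_new < (1.0 + beta) * J_prev:
            return eta * a_prev, True    # (new gain, step accepted)
        # Condition (b): criterion became worse -> attenuate the gain
        # and signal that the step should be rejected ("blocking")
        return mu * a_prev, False

When the step is rejected, the iterate is rolled back to the previous parameter estimate and a new step is attempted with the reduced gain and a fresh gradient evaluation.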

2.5 -Asymptotic Normality

The strong convergence of θ̂_k generally implies an asymptotic normal distribution. In [24], the asymptotic normal distributions for both 1st-SPSA and 2nd-SPSA are established. Although our interest is mainly in finite samples, let us present the following asymptotic arguments as a way of relating to previously known results. Since the proposed algorithm can also be considered as an extension of 1st-SPSA with a special gain series λ̄_k^{−1}, the analysis of the asymptotic normality for 1st-SPSA can also be extended to M2-SPSA. In this section, we first review the asymptotic normal distributions for 1st-SPSA and 2nd-SPSA. Then, the asymptotic efficiency is compared for the three algorithms: 1st-SPSA, 2nd-SPSA, and the proposed SPSA algorithm. Using Fabian's result [19], the following asymptotic normality of θ̂_k in 1st-SPSA is established:

k^{β/2} (θ̂_k − θ*) → N(ξ, Σ) in distribution as k → ∞    (2.15)

where ξ and Σ are the mean vector and covariance matrix, and β/2 characterizes the rate of convergence and is related to the parameters of the gain sequences a_k and c_k. The mean ξ in (2.15) depends on the third derivatives of the loss function at θ* and generally vanishes except for a special set of gain sequences. The covariance matrix Σ for α ≤ 1 is orthogonally similar to the diagonal matrix that is proportional to the inverse eigenvalues of the Hessian:

Σ = ψ a P* Λ*^{−1} P*^T    (2.16)

where P* is orthogonal with H* = P* Λ* P*^T, Λ* = diag[λ_1*, λ_2*, ..., λ_p*], and the coefficient of proportionality ψ depends on the statistical parameters in the algorithm [16]. Again, according to the eigenvalue perturbation theorem [16], the difference between λ_i* (i = 1, 2, ..., p) and the corresponding λ_i at the k-th iteration in (2.16) is bounded by the difference in its Hessian:

|λ_i − λ_i*| ≤ κ_λ(P) ‖H_k(θ̂_k) − H*‖_2,  i = 1, 2, ..., p    (2.17)

where ‖·‖_2 denotes the spectral norm of a matrix, which leads to the definition of the spectral condition number

κ_λ(H) = λ_max/λ_min.    (2.18)

It is noted that H_k(θ̂_k) converges almost surely to H*, and that the mapping from H_k to H̄_k defined by (2.10) preserves the matrix spectra. Furthermore, Λ̂_k − Λ_k → 0 as k → ∞, and since the calculation from H_k to Λ_k is a continuous function, we also have the following strong convergence for the eigenvalues of the Hessian:

Λ_k → Λ* = diag[λ_1*, λ_2*, ..., λ_p*],  λ̄_k → λ̄* as k → ∞    (2.19)


where $\bar{\lambda}^*$ is the geometric mean of all the eigenvalues of $H^*$. Based on (2.15), (2.16) and (2.19), we conclude that the choice of $a_k \bar{\lambda}_k^{-1}$ in M2-SPSA can also be considered as a natural extension of 1st-SPSA with a sensible selection of $a_k$ based on its asymptotic normality:

$k^{\beta/2}(\hat{\theta}_k - \theta^*) \xrightarrow{dist} N(\mu, \Omega)$ as $k \to \infty$   (2.20)

where $\beta = \alpha - 2\gamma$. The covariance matrix $\Omega$ is proportional to $H^{*-2} = P \Lambda^{*-2} P^T$ with the same coefficient of proportionality $\psi$ as in (2.16), and the mean $\mu$ depends on both the gain sequence parameters and the third derivatives of the loss function at $\theta^*$. The asymptotic mean square error (MSE) of $k^{\beta/2}(\hat{\theta}_k - \theta^*)$ in (2.20) is given by [16]

$\mathrm{MSE}_{2SPSA}(\alpha, \gamma) = \mu^T \mu + \mathrm{trace}(\Omega)$.   (2.21)

We first consider a special case of a diagonal Hessian with constant eigenvalues ($\lambda_i^* = \lambda^* = \lambda$). It can be shown that the asymptotic normality of $\hat{\theta}_k$ in 2nd-SPSA [18] is identical to that in 1st-SPSA [17] when the following gain sequences are picked:

$N(\mu, \Omega) = N(\xi, \Sigma)$ when $a_k = \phi/(k+1)$ and $a_k = \phi/[(k+1)\lambda]$, respectively,   (2.22)

where the constant $\phi$ represents a common scale factor for the two gain sequences. The near-optimal selection of $\phi$ for 2nd-SPSA is $\phi = 1$. Note that the truly optimal selection of the gain is essentially infeasible, as it depends on the third derivatives of the loss [16]. Equation (2.22) suggests that the near-optimal MSE in 2nd-SPSA can be achieved in 1st-SPSA by picking its proportionality coefficient $a$ in such a way that $a = 1/\lambda$. Since $a$ in 1st-SPSA is externally prescribed, such an optimal picking of $a$ is only theoretically possible. On the other hand, the internally determined gain sequence $a_k \bar{\lambda}_k^{-1} = (k \bar{\lambda}_k)^{-1}$ in the proposed SPSA algorithm makes the near-optimal picking practically possible for the special case of constant eigenvalues. Next, we consider the specification of the gain sequence with $\alpha < 1$ and $3\gamma - \alpha/2 > 0$, from which $\mu = \xi = 0$ [16]. The asymptotic distribution-based MSE for 2nd-SPSA under this condition is proportional to the sum of all the squared inverse eigenvalues:


$\mathrm{MSE}_{2SPSA}(\alpha, \gamma) = \mathrm{trace}(\Omega) \propto \mathrm{trace}(\Lambda^{*-2}) = \sum_{i=1}^{p} \lambda_i^{*-2}$.   (2.23)

On the other hand, the MSE for our proposed SPSA can be derived by setting $a = 1/\bar{\lambda}^*$ in 1st-SPSA:

$\mathrm{MSE}_{M2SPSA}(\alpha, \gamma) = \mathrm{trace}(\Sigma)\big|_{a = 1/\bar{\lambda}^*} \propto \bar{\lambda}^{*-1}\, \mathrm{trace}(\Lambda^{*-1}) = \bar{\lambda}^{*-1} \sum_{i=1}^{p} \lambda_i^{*-1}$.   (2.24)

The constants of proportionality are related to $c$ and to the variances of $\Delta_k$ and the measurement noise. Therefore, the ratio of MSEs for M2-SPSA to 2nd-SPSA is given by

$R_0(\alpha, \bar{\lambda}) \equiv \dfrac{\mathrm{MSE}_{M2SPSA}(\alpha, \bar{\lambda})}{\mathrm{MSE}_{2SPSA}(\alpha, \bar{\lambda})} = \dfrac{\big[\prod_{i=1}^{p} \lambda_i^{*-1}\big]^{1/p} \cdot (1/p) \sum_{i=1}^{p} \lambda_i^{*-1}}{(1/p) \sum_{i=1}^{p} \lambda_i^{*-2}} \le 1$   (2.25)

where we have used a well-known relation in the last inequality of (2.25):

(geometric mean) ≤ (arithmetic mean) ≤ (root-mean-square).   (2.26)
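The inequality chain (2.26), and hence the bound $R_0 \le 1$ in (2.25), can be checked numerically. The sketch below (an illustrative check only, using an arbitrary synthetic spectrum) evaluates the ratio for random eigenvalues and for the equal-eigenvalue case:

```python
import numpy as np

# Numerical check of (2.25)-(2.26): with inv_i = 1/lambda_i*, the ratio
# R0 = GM(inv) * AM(inv) / mean(inv^2) satisfies R0 <= 1, with equality
# only when all eigenvalues coincide (perfectly conditioned Hessian).
def mse_ratio(lam):
    inv = 1.0 / lam
    gm = np.prod(inv) ** (1.0 / inv.size)    # geometric mean of 1/lambda
    am = inv.mean()                          # arithmetic mean of 1/lambda
    ms = (inv ** 2).mean()                   # mean square of 1/lambda
    return gm * am / ms

rng = np.random.default_rng(1)
print(mse_ratio(rng.uniform(0.5, 10.0, size=8)))  # strictly below 1
print(mse_ratio(np.full(8, 3.0)))                 # exactly 1
```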

Equality in (2.26) holds only when all the eigenvalues are equal, which corresponds to a perfectly conditioned Hessian with $\kappa(H^*) = 1$. Since the ratio $R_0$ has been derived from the asymptotic MSEs, the comparison between M2-SPSA and 2nd-SPSA has been made under the same rate of convergence. Our third case in the asymptotic efficiency analysis is to consider $\alpha = 1$ with $3\gamma - \alpha/2 \ge 0$ in 2nd-SPSA. This setting again corresponds to $\mu = \xi = 0$ in 2nd-SPSA and the proposed SPSA algorithm. It is possible for both 1st-SPSA and 2nd-SPSA to set $\alpha = 1$ for their gain sequence selection. The near-optimal rate of convergence obtained in 2nd-SPSA by setting $a = 1$ can be accomplished in 1st-SPSA by adjusting its $a$ to yield the same rate of convergence as 2nd-SPSA. By setting $a = 1/\bar{\lambda}^*$ in 1st-SPSA for the implementation of our proposed SPSA, we can again derive (2.25), which shows the superiority of our proposed SPSA over 2nd-SPSA under the same rate of convergence. However, the above setting of $a = 1/\bar{\lambda}^*$ in 1st-SPSA is allowed only if the resulting condition in 1st-SPSA of $\min_i(\lambda_i^*/\bar{\lambda}^*) \ge \beta/2$ still holds [16]. When the above condition is violated while implementing M2-SPSA for a relatively large $\kappa(H^*)$, the setting of $\alpha = 1$ in our proposed SPSA algorithm is excluded and we can no longer make a straight comparison of the asymptotic MSEs between 2nd-SPSA and M2-SPSA


under the same rate of convergence. Under this circumstance, neither M2-SPSA nor 2nd-SPSA is superior to the other in terms of the efficiency or the rate of convergence. The superiority of our proposed SPSA algorithm over 2nd-SPSA indicated by (2.25) only shows an improvement in the multiplier ($R_0$) for the convergence rate when the common convergence rate is sub-optimal. In [25] it is shown that, by setting $\alpha = 1$ and $\gamma = 1/6$, the asymptotically optimal MSE can be achieved with a maximum rate of convergence for the MSE of $\hat{\theta}_k$ of $k^{-\beta} = k^{-2/3}$ in both 1st-SPSA and 2nd-SPSA. We have already shown that, in order to avoid the violation of the condition $\min_i(\lambda_i^*/\bar{\lambda}^*) \ge \beta/2$, the setting of $\alpha = 1$ (with $\beta \approx 2/3$) is often not allowed in our proposed SPSA algorithm. Neither is it possible to choose a different set of $\alpha_m$ and $\gamma_m$ to yield $\beta_m = 2/3$ when $\gamma_m = 1/6$. Under this circumstance, the maximum rate of convergence of $k^{-2/3}$ for the MSE cannot be achieved by our proposed SPSA. It is noted that a mapping $f_k$ such as the one proposed in Sec. 2.3 will leave the asymptotic $H_k$ unchanged (when we set $\hat{\Lambda}_k = \Lambda_k$) as $k \to \infty$. On the other hand, our proposed SPSA algorithm changes $H_k$ when its $\Lambda_k$ is replaced by $\bar{\Lambda}_k$.

2.6 -Fisher Information Matrix

2.6.1 -Introduction to Fisher Information Matrix

In this section, we present a relatively simple MCNR method for obtaining the FIM, which is used in order to estimate the Hessian matrix efficiently. Thus, the resampling-based method relies on an efficient technique for estimating the Hessian matrix. The FIM plays a central role in the practice and theory of identification and estimation. This matrix provides a summary of the amount of information in the data relative to the quantities of interest [22]. Suppose that the $i$-th measurement of a process is $z_i$ and that a stacked vector of $n$ such measurement vectors is $z_n \equiv [z_1^T, z_2^T, \ldots, z_n^T]^T$. Let us assume that the general form for the joint probability density or probability mass function for $z_n$ is known, but that this function depends on an unknown vector $\theta$. Let the probability density/mass function for $z_n$ be $p_z(\zeta \mid \theta)$, where $\zeta$ ("zeta") is a dummy vector representing the possible outcomes for $z_n$ (in $p_z(\zeta \mid \theta)$, the index $n$ on $z_n$ is being suppressed for notational convenience). The corresponding likelihood function is

$\ell(\theta \mid \zeta) = p_z(\zeta \mid \theta)$.   (2.27)

With the definition of the likelihood function in (2.27), we are now in a position to present the Fisher information matrix. The expectations below are with respect to the dataset $z_n$. The $p \times p$ information matrix $F_n(\theta)$ for a differentiable log-likelihood function is given by [22]

$F_n(\theta) \equiv E\left( \dfrac{\partial \log \ell}{\partial \theta} \cdot \dfrac{\partial \log \ell}{\partial \theta^T} \;\middle|\; \theta \right)$.   (2.28)

In the case where the underlying data $\{z_1, z_2, \ldots\}$ are independent (and even in many cases where the data may be dependent), the magnitude of $F_n(\theta)$ will grow at a rate proportional to $n$, since $\log \ell(\cdot)$ will represent a sum of $n$ random terms. Then, the bounded quantity $F_n(\theta)/n$ is employed as an average information matrix over all measurements. Except for relatively simple problems, however, the form in (2.28) is generally not useful in the practical calculation of the information matrix. Computing the expectation of a product of multivariate non-linear functions is usually a hopeless task. A well-known equivalent form follows by assuming that $\log \ell(\cdot)$ is twice differentiable in $\theta$. The Hessian matrix

$H(\theta \mid \zeta) \equiv \dfrac{\partial^2 \log \ell(\theta \mid \zeta)}{\partial \theta\, \partial \theta^T}$

is then assumed to exist, together with certain standard regularity conditions on the likelihood [22]. One of these conditions is that the set $\{\zeta : \ell(\theta \mid \zeta) > 0\}$ does not depend on $\theta$. A fundamental implication of the regularity of the likelihood is that the necessary interchanges of differentiation and integration are valid. Then, the information matrix is related to the Hessian matrix of $\log \ell$ through

$F_n(\theta) = -E\left[ H(\theta \mid Z_n) \mid \theta \right]$.   (2.29)

The form in (2.29) is usually more amenable to calculating the matrix than the product-based


form in (2.28). Note that in some applications, the observed information matrix at a particular dataset $z_n$ may be easier to compute and/or preferred from an inference point of view relative to the actual information matrix $F_n(\theta)$ in (2.29). Although the method in this work is described for the determination of $F_n(\theta)$, the efficient Hessian estimation may also be used directly for the determination of $H(\theta \mid z_n)$ when it is not easy to calculate the Hessian directly.
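To make the two equivalent forms (2.28) and (2.29) concrete, consider the simple hypothetical model of i.i.d. scalar data $z_i \sim N(\theta, \sigma^2)$ with known $\sigma$, for which the exact FIM is $F_n(\theta) = n/\sigma^2$. The sketch below (an illustration only, not part of the method proper) estimates the outer-product form (2.28) by Monte Carlo and compares it with the Hessian-based value from (2.29):

```python
import numpy as np

# Compare the score outer-product form (2.28) with the negative expected
# Hessian form (2.29) for i.i.d. z_i ~ N(theta, sigma^2), known sigma.
rng = np.random.default_rng(2)
theta, sigma, n, reps = 1.5, 2.0, 50, 20000

F_outer = 0.0
for _ in range(reps):
    z = rng.normal(theta, sigma, size=n)
    score = np.sum(z - theta) / sigma**2    # d log l / d theta
    F_outer += score**2                     # form (2.28), Monte Carlo average
F_outer /= reps

F_hess = n / sigma**2                       # -E[H] is constant here, (2.29)
print(F_outer, "vs exact", F_hess)
```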

2.6.2 -Two Key Properties of the Information Matrix: Connections to Covariance Matrix of Parameter Estimates

Let $\theta^*$ denote the unknown "true" value of $\theta$. The primary rationale for $F_n(\theta)$ as a measure of information about $\theta$ within the data $z_n$ comes from its connection to the covariance matrix for the estimate of $\theta$ constructed from $z_n$. The first of the key properties makes this connection via an asymptotic normality result [23]. In particular, for some common forms of estimates $\hat{\theta}_n$ (e.g., maximum likelihood and Bayesian maximum a posteriori), it is known that, under modest conditions,

$\sqrt{n}\,(\hat{\theta}_n - \theta^*) \xrightarrow{dist} N(0, \bar{F}^{-1})$   (2.30)

where $\xrightarrow{dist}$ denotes convergence in distribution and

$\bar{F} \equiv \lim_{n \to \infty} \dfrac{F_n(\theta^*)}{n}$,   (2.31)

provided that the indicated limit exists and is invertible. Hence, in practice, for $n$ reasonably large, $F_n(\theta)^{-1}$ can serve as an approximate covariance matrix of the estimate $\hat{\theta}_n$ when $\theta$ is chosen close to the unknown $\theta^*$. Relationship (2.30) also holds for optimal implementations of some recursive algorithms where the data $z_i$ are processed recursively instead of in a batch mode, as is typical in maximum likelihood. This includes optimal versions of gradient-based SA algorithms, which include popular algorithms such as LMS and NN backpropagation as special cases. The second key property of the information matrix applies in finite samples.


If $\hat{\theta}_n$ is any unbiased estimator of $\theta$ [23],

$\mathrm{cov}(\hat{\theta}_n) \ge F_n(\theta^*)^{-1}, \quad \forall n.$   (2.32)

There is also an expression analogous to (2.32) for biased estimators, but it is not especially useful in practice because it requires knowledge of the gradient of the bias with respect to $\theta$. Expressions (2.30) and (2.32), taken together, point to the close connection between the inverse Fisher information matrix and the covariance matrix of the estimator. While (2.30) is an asymptotic result, (2.32) applies for all sample sizes, subject to the unbiasedness requirement. It is also clear why the name "information matrix" is used for $F_n(\theta)$: a larger $F_n(\theta)$ (in the matrix sense) is associated with a smaller covariance matrix (i.e., more information), while a smaller $F_n(\theta)$ is associated with a larger covariance matrix (i.e., less information). The calculation of $F_n(\theta)$ is often difficult or impossible in many non-linear problems. Obtaining the required first or second derivatives of the log-likelihood function may be a formidable task in some applications, and computing the required expectation of the generally non-linear multivariate function is often impossible in problems of practical interest. To address this difficulty, this subsection outlines a computer resampling approach to estimating $F_n(\theta)$. This approach is useful when analytical methods for computing $F_n(\theta)$ are infeasible. The approach makes use of an idea introduced for optimization, namely the Hessian estimation for SA, even though this problem is not directly one of optimization. The basis for the technique below is to use computational horsepower in lieu of traditional detailed theoretical analysis to determine $F_n(\theta)$. The method here is an example of an MCNR method for producing an estimate. Such methods have become very popular as a means of handling problems that were formerly infeasible. Other notable Monte Carlo techniques are the bootstrap method for determining statistical distributions of estimates and the Markov chain Monte Carlo method for producing pseudorandom numbers and related quantities. Part of the appeal of the Monte Carlo method here for estimating $F_n(\theta)$ is that it can be implemented with only evaluations of the log-likelihood.
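As a brief illustration of the connection expressed by (2.30)–(2.32), the sketch below (a hypothetical Gaussian-mean example, not from the thesis) compares the empirical variance of the maximum-likelihood estimate with $F_n(\theta^*)^{-1}$, for which the bound (2.32) holds with equality:

```python
import numpy as np

# Gaussian-mean model: the MLE is the sample mean, cov(theta_hat) = sigma^2/n,
# and F_n(theta*)^{-1} = sigma^2/n, so (2.32) is met with equality.
rng = np.random.default_rng(3)
theta_star, sigma, n, reps = 0.7, 1.3, 100, 50000

estimates = rng.normal(theta_star, sigma, size=(reps, n)).mean(axis=1)
print("empirical var:", estimates.var())
print("F_n(theta*)^-1:", sigma**2 / n)
```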

2.6.3 -Estimation of $F_n(\theta)$

The calculation of $F_n(\theta)$ is often difficult or impossible in practical problems. Obtaining the required first or second derivatives of the log-likelihood function may be a formidable task in some applications, and computing the required expectation of the generally non-linear multivariate function is often impossible in problems of practical interest. This section outlines a computer resampling approach to estimating $F_n(\theta)$ that is useful when analytical methods for computing $F_n(\theta)$ are infeasible. The approach makes use of a computationally efficient and easy-to-implement method for Hessian estimation that was described by Spall [24] in the context of optimization.

The computational efficiency follows from the low number of log-likelihood or gradient values needed to produce each Hessian estimate. Although there is no optimization here per se, we use the same basic simultaneous perturbation (SP) formula for Hessian estimation [this is the same SP principle given earlier in Spall [24] for gradient estimation]. However, the way in which the individual Hessian estimates are averaged differs from Spall [24] because of the distinction between the problem of recursive optimization and the problem of estimation of $F_n(\theta)$. The essence of the method is to produce a large number of SP estimates of the Hessian matrix of $\log \ell(\cdot)$ and then average the negative of these estimates to obtain an approximation to $F_n(\theta)$.

This approach is directly motivated by the definition of $F_n(\theta)$ as the mean value of the negative Hessian matrix, as in (2.29). To produce the SP Hessian estimates, we generate pseudodata vectors in a Monte Carlo manner. The pseudodata are generated according to a bootstrap resampling scheme treating the chosen $\theta$ as "truth," i.e., according to the probability model $p_z(\zeta \mid \theta)$ given in (2.27). So, for example, if it is assumed that the real data $Z_n$ are jointly normally distributed, $N(\mu(\theta), \Sigma(\theta))$, then the pseudodata are generated by Monte Carlo according to a normal distribution with the mean $\mu$ and covariance matrix $\Sigma$ evaluated at the chosen $\theta$. Let the $i$-th pseudodata vector be $Z_{pseudo}(i)$; the use of $Z_{pseudo}$ without the argument is a generic reference to a pseudodata vector. This data vector represents a sample of size $n$ from the assumed distribution for the set of data based on


the unknown parameters taking on the chosen value of $\theta$. The approach below can work with either $\log \ell(\theta \mid Z_{pseudo})$ values (alone) or with the gradient $g(\theta \mid Z_{pseudo}) \equiv \partial \log \ell(\theta \mid Z_{pseudo}) / \partial \theta$ if that is available. The former usually corresponds to cases where the likelihood function and the associated non-linear process are so complex that no gradients are available. To highlight the fundamental commonality of the approach in this dissertation, we assume the following:

Let $G(\theta \mid Z_{pseudo})$ represent either a gradient approximation (based on $\log \ell(\theta \mid Z_{pseudo})$ values) or the exact gradient $g(\theta \mid Z_{pseudo})$. Because of its efficiency, the SP gradient approximation is recommended in the case where only $\log \ell(\theta \mid Z_{pseudo})$ values are available (Spall [24]). We now present the Hessian estimate. Let $\hat{H}_k$ denote the $k$-th estimate of the Hessian $H(\cdot)$ of $\log \ell$. The formula for estimating the Hessian is

$\hat{H}_k = \dfrac{1}{2} \left\{ \dfrac{\delta G_k}{2 c_k} \left[ \Delta_{k1}^{-1}, \Delta_{k2}^{-1}, \ldots, \Delta_{kp}^{-1} \right] + \left( \dfrac{\delta G_k}{2 c_k} \left[ \Delta_{k1}^{-1}, \Delta_{k2}^{-1}, \ldots, \Delta_{kp}^{-1} \right] \right)^T \right\}$   (2.33)

where $\delta G_k = G(\theta + c_k \Delta_k \mid Z_{pseudo}) - G(\theta - c_k \Delta_k \mid Z_{pseudo})$ and the perturbation vector in this approach, $\Delta_k = [\Delta_{k1}, \Delta_{k2}, \ldots, \Delta_{kp}]^T$, is a mean-zero random vector such that the $\{\Delta_{ki}\}$ are "small" symmetrically distributed random variables that are uniformly bounded in $k, i$ and satisfy $E(|1/\Delta_{ki}|) < \infty$ uniformly in $k, i$. This latter condition excludes such commonly used Monte


Carlo distributions as uniform and Gaussian. Assume that $|\Delta_{k,j}| \le c$ for some small $c > 0$. In most implementations, the $\{\Delta_{k,j}\}$ are independent and identically distributed (iid) across $k$ and $j$. In implementations involving antithetic random numbers, $\Delta_k$ and $\Delta_{k+1}$ may be dependent random vectors for some $k$, but at each $k$ the $\{\Delta_{kj}\}$ are iid (across $j$). Note that the user has full control over the choice of the $\Delta_{ki}$ distribution. A valid (and simple) choice is the Bernoulli $\pm c$ distribution (it is not known at this time if this is the "best" distribution to choose).
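A minimal sketch of one Hessian estimate per (2.33) is given below. It assumes a user-supplied callable G(theta, Z) returning the gradient (or an SP gradient approximation) of $\log \ell$; the perturbation is implemented here with Bernoulli $\pm 1$ components scaled by a small step c, and all names are illustrative:

```python
import numpy as np

# One per-realization SP Hessian estimate, following (2.33): the symmetrized
# outer product of the gradient difference with the reciprocal perturbations.
def sp_hessian_estimate(G, theta, Z, c, rng):
    p = theta.size
    delta = rng.choice([-1.0, 1.0], size=p)                # Bernoulli +/-1
    dG = G(theta + c * delta, Z) - G(theta - c * delta, Z) # delta G_k
    half = np.outer(dG / (2.0 * c), 1.0 / delta)           # (dG/2c)[Delta^-1]
    return 0.5 * (half + half.T)                           # symmetrized; rank <= 2

# Averaging the negatives of many such estimates over pseudodata realizations
# approximates F_n(theta) to within the O(c^2) bias noted below.
```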

The prime rationale for (2.33) is that $\hat{H}_k$ is a nearly unbiased estimator of the unknown $H$. Spall [24] gave conditions such that the Hessian estimate has an $O(c^2)$ bias. The next proposition considers this further in the context of the resulting (small) bias in the estimate of the FIM.

Proposition 1. Suppose that $g(\theta \mid Z_{pseudo})$ is three times continuously differentiable in $\theta$ for almost all $Z_{pseudo}$. Then, based on the structure and assumptions of (2.33) (see reference [22]),

$E[F_{M,N}(\theta)] = F_n(\theta) + O(c^2).$

Proof: Spall [24] showed that $E(\hat{H}_k \mid Z_{pseudo}) = H(\theta \mid Z_{pseudo}) + O(c^2)$ under the stated conditions on $g(\cdot)$ and $\Delta_k$. Because $F_{M,N}(\theta)$ is a sample mean of $-\hat{H}_k$ values, the result to be proved follows immediately. The symmetrizing operation in (2.33) is convenient to maintain a symmetric Hessian estimate. To illustrate how the individual Hessian estimates may be quite poor, note that $\hat{H}_k$ in (2.33) has (at most) rank two (and may not even be positive semi-definite). This low quality, however, does not prevent the information matrix estimate of interest from being accurate, since it is not the Hessian per se that is of interest. The averaging process eliminates the inadequacies of the individual Hessian estimates.


Given the form for the Hessian estimate in (2.33), it is now relatively straightforward to estimate $F_n(\theta)$. Averaging Hessian estimates across many $Z_{pseudo}(i)$ yields an estimate of

$E[H(\theta \mid Z_{pseudo}(i))] = -F_n(\theta)$

to within an $O(c^2)$ bias (the expectation on the left-hand side above is with respect to the pseudodata). The resulting estimate can be made as accurate as desired by reducing $c$ and increasing the number of $\hat{H}_k$ values being averaged. The averaging of the $\hat{H}_k$ values may be done recursively to avoid having to store many matrices. Of course, the interest is not in the Hessian per se; rather, the interest is in the (negative) mean of the Hessian, according to (2.29) (so the averaging must reflect many different values of $Z_{pseudo}(i)$). This leads to greater variability for a given number ($N$) of pseudodata vectors. This estimation also allows us to keep the Hessian matrix estimate positive definite. Let us now present a step-by-step summary of the above Monte Carlo resampling approach for estimating $F_n(\theta)$. The MCNR method is an iterative procedure that can be used to approximate the maximum of a likelihood function in situations where direct likelihood computation is infeasible because of the existence of unmeasured variables, missing data, or measurement error. Let $\Delta_k^{(i)}$ represent the $k$-th perturbation vector for the $i$-th realization (i.e., for $Z_{pseudo}(i)$). The Monte Carlo algorithm with a resampling method for estimating $F_n(\theta)$ is described as follows:

Step 1. (Initialization). Determine $\theta$, the sample size $n$, and the number $N$ of pseudodata vectors that will be generated; in other words, we need to specify $\hat{\theta}_k$ and the number of pseudodata vectors. Determine whether log-likelihood values $\log \ell(\cdot)$ or gradient information $g(\cdot)$ will be used to form the $\hat{H}_k$. Pick a small number $c_k$ in the Bernoulli $\pm c_k$ distribution used to generate the perturbations $\Delta_{ki}$; e.g., $c_k = 0.001$.


Step 2. (Generating pseudodata). Based on $\hat{\theta}_k$ given in Step 1, generate by the Monte Carlo method the $i$-th pseudodata vector of $n$ pseudo-measurements, $Z_{pseudo}(i)$.

Step 3. (Hessian estimation). With the $i$-th pseudodata vector in Step 2, compute $M \ge 1$ Hessian estimates according to (2.33) [22]. Let the sample mean of these $M$ estimates be $\bar{H}^{(i)} = \bar{H}^{(i)}(Z_{pseudo}(i))$. Unless antithetic random numbers are being used, the perturbation vectors $\{\Delta_k^{(i)}\}$ should be mutually independent across the realizations $i$ and along the realizations (along $k$). (If only log-likelihood values are available and SP gradient approximations are being used to form the $G(\cdot)$ values, the perturbations forming the gradient approximations, say $\{\tilde{\Delta}_k^{(i)}\}$, should likewise be mutually independent.) $Z_{pseudo}(i)$ is the pseudodata vector; this vector represents a sample of size $n$ from the assumed distribution of the set of data based on the unknown parameters.

Step 4. (Averaging Hessian estimates). Repeat Steps 2 and 3 until $N$ pseudodata vectors have been processed. Take the negative of the average of the $N$ Hessian estimates $\bar{H}^{(i)}$ produced in Step 3; this is the estimate of $F_n(\theta)$. The key parameters needed for the mapping are internally determined by $F_n(\theta)$ at each iteration. Figure 2.2 is a schematic of the steps.

Fig. 2.2. Diagram of the method for forming the estimate $F_{M,N}(\theta)$.
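The four steps above can be collected into a compact sketch. The example below assumes a simple Gaussian pseudodata model ($Z_{pseudo}(i) \sim N(\theta, \sigma^2 I)$, as in the normal example mentioned earlier) with an exact gradient; make_pseudodata, G and the parameter values are illustrative assumptions rather than part of the thesis text:

```python
import numpy as np

# Sketch of Steps 1-4: Monte Carlo resampling estimate of F_n(theta).
rng = np.random.default_rng(4)
theta = np.array([1.0, -0.5])
sigma, n, N, M, c = 1.0, 40, 2000, 1, 0.001          # Step 1 (c_k = 0.001)

def make_pseudodata(theta, n, rng):                  # Step 2
    return theta + sigma * rng.standard_normal((n, theta.size))

def G(th, Z):                                        # exact gradient of log l
    return np.sum(Z - th, axis=0) / sigma**2

F_sum = np.zeros((theta.size, theta.size))
for _ in range(N):
    Z = make_pseudodata(theta, n, rng)
    H_bar = np.zeros_like(F_sum)
    for _ in range(M):                               # Step 3: M >= 1 estimates
        delta = rng.choice([-1.0, 1.0], size=theta.size)
        dG = G(theta + c * delta, Z) - G(theta - c * delta, Z)
        half = np.outer(dG / (2.0 * c), 1.0 / delta)
        H_bar += 0.5 * (half + half.T) / M
    F_sum -= H_bar                                   # Step 4: negative average
print(F_sum / N, "\nexact:", (n / sigma**2) * np.eye(theta.size))
```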


2.7 -Efficiency Between 1st-SPSA, 2nd-SPSA and M2-SPSA

The proposed SPSA algorithm presented above offers considerable potential for accelerating the convergence of SA algorithms while requiring only loss function measurements (no gradient or higher-derivative measurements are needed), since it requires only three measurements per iteration to estimate both the gradient and the Hessian, independently of the dimension of the problem. The relationships among 1st-SPSA, 2nd-SPSA and M2-SPSA can thus also be understood from a different perspective: 1st-SPSA (2.1) and M2-SPSA (2.13) weight the different components of the estimated gradient $\hat{g}_k(\hat{\theta}_k)$ equally, whereas 2nd-SPSA (2.2a) weights them differently to account for the different sensitivities of $\theta$. A steeper eigen-direction (greater $\lambda_i$) requires a smaller step ($\approx 1/\lambda_i$) to effectively reach the exact solution [25][26]. Both 2nd-SPSA and our proposed SPSA algorithm capture the dependence of the step size on the overall sensitivities of $\theta$ at each iteration. From this perspective, 2nd-SPSA and the proposed SPSA algorithm are superior to 1st-SPSA. However, since our proposed SPSA weights the different components of $\hat{g}_k(\hat{\theta}_k)$ equally with an averaged step ($\approx 1/\bar{\lambda}_k$), it has given up the further advantage of the higher-order sensitivity of $\theta$. Therefore, whether our proposed SPSA algorithm is better than 2nd-SPSA at finite iterations is determined by the relative importance of two competing factors that influence the efficiency of the algorithm: the elimination of the matrix inverse reduces the magnitude of errors, whereas the lack of gradient sensitivity may deteriorate the accuracy. It is noted that the asymptotic relation (2.25) only shows an improvement of our proposed SPSA over 2nd-SPSA in terms of its rate coefficient. Both our proposed SPSA algorithm and 2nd-SPSA have the same rate of convergence, characterized by $k^{-\beta/2}$ as shown by (2.20). The asymptotic relation (2.25) provides a theoretical rationale for considering M2-SPSA over 2nd-SPSA in practice, although the maximum rate of convergence of $k^{-2/3}$ for the MSE cannot be achieved by our proposed SPSA algorithm. Another rationale for proposing M2-SPSA is that the amplification of errors in an ill-conditioned $H_k^*$ through the matrix inversion is a well-established result, whereas the efficiency of the gradient sensitivity through the Newton–Raphson search shows only near the extreme point ($\theta^*$) with a near-exact Hessian [26]. Recall, however, that such justification for the proposed SPSA algorithm is restricted to the case where the gains are not asymptotically optimal, in order to achieve fast convergence with finite iterations. For the asymptotically optimal gains ($a_k \approx 1/k$, $c_k \approx 1/k^{1/6}$), 2nd-SPSA is superior to M2-SPSA except in the case where all eigenvalues of $H_k^*$ are identical (where 2nd-SPSA and M2-SPSA coincide). It is shown that the magnitude of errors in 2nd-SPSA is dependent on the matrix conditioning of $H_k^*$.


We have shown that the magnitude of errors in SPSA is dependent on the matrix conditioning of $H^*$ due to two competing factors. Since both factors are strongly related to the same quantity, the matrix conditioning, the relative efficiency between M2-SPSA and 2nd-SPSA might be less dependent on specific loss functions. However, such a replacement does not necessarily suggest that the magnitude of errors in our proposed SPSA is independent of the matrix conditioning of $H^*$, since the computation of $\bar{\lambda}_k$ is dependent on the matrix properties of $H^*$.

2.8 -Implementation Aspects

The five points below have been found important in making the adaptive simultaneous perturbation (ASP) approach perform well in practice. Before describing these points, we note that while the ASP structure in (2.2a) and (2.2b) is general, we will largely restrict our choice of $G_k(\cdot)$ (and $G_k^{(1)}(\cdot)$) in the remainder of the discussion in order to present concrete theoretical and numerical results. For M2-SPSA, we will consider the simultaneous perturbation approach for generating $G_k(\cdot)$ and $G_k^{(1)}(\cdot)$, while for the second-order stochastic gradient (2SG) case, we will suppose that $G_k(\cdot) = G_k^{(1)}(\cdot)$ is an unbiased direct measurement of $g(\cdot)$; in other words, $G_k(\hat{\theta}_k)$ is the input information related to $g(\hat{\theta}_k)$. The rationale for basic SPSA in the gradient-free case has been discussed extensively elsewhere (e.g., Spall [28]) and hence will not be discussed in detail here. (In summary, it tends to lead to more efficient optimization than the classical finite-difference Kiefer–Wolfowitz method while being no more difficult to implement; the relative efficiency grows with the problem dimension.) In the gradient-based case, stochastic gradient (SG) methods include as special cases the well-known approaches mentioned at the beginning of the dissertation (backpropagation, etc.). SG methods are themselves special cases of the general Robbins–Monro root-finding framework and, in fact, most of the results here apply in this root-finding setting as well. The associated Appendixes A and B provide part of the theoretical justification for SP, establishing conditions for the almost sure (a.s.) convergence of both the iterate and the Hessian estimate. Now, we can explain the five points in the implementation of M2-SPSA as follows:

1) $\theta$ and $H$ Initialization: Typically, (2.2a) is initialized at some $\hat{\theta}_0$ believed to be near $\theta^*$. One may wish to run the standard first-order SA (i.e., (2.2a) without $\bar{H}_k^{-1}$) or some other "rough" optimization approach for some period in order to move the initial $\theta$ for ASP closer to $\theta^*$. With the indexing shown in (2.2b), no initialization of the $\bar{H}_k$ recursion is required, since $\bar{H}_0$ is computed directly from $\hat{H}_0$; however, the recursion may be trivially modified to allow for an initialization if one has useful prior information. If this is done, then the recursion may be initialized at (say) $\mathrm{scale} \cdot I_{p \times p}$, $\mathrm{scale} \ge 0$, or some other positive semi-definite matrix reflecting the available prior information (e.g., if one knows that the $\theta$ elements will have very different magnitudes, then the initialization may be chosen to approximately scale for the differences). It is also possible to run (2.2b) in parallel with the rough search methods that might be used for initializing $\theta$. Note that $\hat{H}_k$ has (at most) rank 2 (and may not be positive semi-definite).

2) Numerical Issues in the Choice of $\Delta_k$ and $\bar{H}_k$: Generating the elements of $\Delta_k$ according to a Bernoulli $\pm 1$ distribution is easy and theoretically valid (and was shown to be asymptotically optimal in Brennan and Rogers [27] and Spall [28] for basic SPSA; its potential optimality for the adaptive approach here is an open question). Having a positive-definite initialization helps provide for the invertibility of $\bar{H}_k$, especially for small $k$ (if $\bar{H}_k$ is positive definite, $f_k(\cdot)$ in (2.2a) may be taken as the identity transformation). In some applications, however, it may be worth exploring other valid choices of distributions, since the generation of $\Delta_k$ represents a trivial part of the cost of optimization, and a different choice may yield improved finite-sample performance. Because $\bar{H}_k$ may not be positive definite, especially for small $k$ (even if $\bar{H}_0$ is initialized based on prior information to be positive definite), it is recommended that $\bar{H}_k$ in (2.2b) not generally be used directly in (2.2a). Hence, as shown in (2.2a), it is recommended that $\bar{H}_k$ be replaced by another matrix $\bar{\bar{H}}_k$ that is closely related to $\bar{H}_k$. One useful form when $p$ is not too large has been to take $\bar{\bar{H}}_k = (\bar{H}_k \bar{H}_k)^{1/2} + \delta_k I$, where the indicated square root is the (unique) positive semi-definite square root and $\delta_k \ge 0$ is some small number.


For large $p$, a more efficient method is to simply set $\bar{\bar{H}}_k = \bar{H}_k + \delta_k I$, but this is likely to require a larger $\delta_k$ to ensure the positive definiteness of $\bar{\bar{H}}_k$. For very large $p$, it may be advantageous to have $\bar{\bar{H}}_k$ be only a diagonal matrix based on the diagonal elements of $\bar{H}_k + \delta_k I$. This is a way of capturing large scaling differences in the $\theta$ elements (unavailable to first-order algorithms) while eliminating the potentially onerous computations associated with the inverse operation in (2.2a). Note that $\bar{\bar{H}}_k$ should only be used in (2.2a), as (2.2b) should remain in terms of $\bar{H}_k$ to ensure a.s. consistency. By Theorems 2a, b, one can set $\bar{\bar{H}}_k = \bar{H}_k$ for sufficiently large $k$. Also, for a general (non-diagonal) $\bar{\bar{H}}_k$, it is numerically advantageous to avoid a direct inversion of $\bar{\bar{H}}_k$ in (2.2a), preferring a method such as Gaussian elimination.
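The two regularizations just described, and the solve-instead-of-invert recommendation, can be sketched as follows (assuming NumPy/SciPy; the function names and the choice of $\delta_k$ are illustrative):

```python
import numpy as np
from scipy.linalg import sqrtm

def make_pd_sqrt(H_bar, delta_k):
    # (H_bar H_bar)^{1/2} + delta_k I : PSD square-root form for moderate p
    return np.real(sqrtm(H_bar @ H_bar)) + delta_k * np.eye(H_bar.shape[0])

def make_pd_shift(H_bar, delta_k):
    # H_bar + delta_k I : cheaper form for large p (may need a larger delta_k)
    return H_bar + delta_k * np.eye(H_bar.shape[0])

def step_direction(H_dd, grad_est):
    # Solve H_dd x = grad_est (Gaussian elimination) instead of inverting H_dd.
    return np.linalg.solve(H_dd, grad_est)
```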

3) Gradient/Hessian Averaging: At each iteration, it may be desirable to compute and average several gradient and Hessian estimates despite the additional cost. This may be especially true in a high-noise environment.

4) Gain Selection: The principles outlined in Brennan and Rogers [27] and Spall [28] are useful here as well for the practical selection of the gain sequences $\{a_k\}$, $\{c_k\}$ and, in the M2-SPSA case, $\{\tilde{c}_k\}$. For M2-SPSA, the critical gain $a_k$ can simply be chosen as $1/k$, $k \ge 1$, to achieve asymptotic near-optimality or optimality, respectively, although this may not be ideal in practical finite-sample problems. For the remainder, let us focus on the M2-SPSA case. Here we can choose $a_k = a/(k + A)^\alpha$, $c_k = c/k^\gamma$ and $\tilde{c}_k = \tilde{c}/k^\gamma$, with $a, c, \tilde{c}, \alpha, \gamma > 0$ and $A \ge 0$ for $k \ge 1$. In finite-sample practice, it may be better to choose $\alpha$ and $\gamma$ lower than their asymptotically optimal values of $\alpha = 1$ and $\gamma = 1/6$ (see Sec. 2.10); in particular, $\alpha = 0.602$ and $\gamma = 0.101$ are practically effective and approximately the lowest theoretically valid values allowed (see Theorems 1a, 2a, and 3a). Choosing $a$ so that the typical change in $\hat{\theta}_k$ is of "reasonable" magnitude, especially in the critical early iterations, has proven effective. Setting $A$ approximately equal to 5–10% of the total expected number of iterations enhances practical convergence by allowing for a larger $a$ than is possible with the more typical $A = 0$. However, in slight contrast to Spall [28] for the first-order algorithm, we recommend that $c$ have a magnitude


greater (by roughly a factor of 2–10) than the typical ("one-sigma") noise level in the $y(\cdot)$ measurements. Further, setting $\tilde{c} > c$ has been effective. These recommendations for larger $c$ (and $\tilde{c}$) values than given in Spall [28] are made due to the greater inherent sensitivity of a second-order algorithm to noise effects.
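A small sketch of the gain sequences in this guideline is given below, with the practically effective exponents $\alpha = 0.602$ and $\gamma = 0.101$; the constants a, c, c_tilde and A are illustrative placeholders to be tuned per problem (e.g., A at 5–10% of the expected iteration count, c above the one-sigma noise level of $y(\cdot)$, and c_tilde > c):

```python
# Decaying gain sequences a_k = a/(k + A)^alpha, c_k = c/k^gamma and
# c_tilde_k = c_tilde/k^gamma; k starts at 1 to avoid division by zero.
def gains(k, a=0.5, A=50.0, c=0.1, c_tilde=0.2, alpha=0.602, gamma=0.101):
    a_k = a / (k + A) ** alpha
    c_k = c / k ** gamma
    c_tilde_k = c_tilde / k ** gamma
    return a_k, c_k, c_tilde_k

print(gains(1), gains(1000))   # gains shrink slowly across iterations
```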

2.9 -Strong Convergence

This section presents results related to the strong (a.s.) convergence $\hat{\theta}_k \to \theta^*$ and $\bar{H}_k \to H(\theta^*)$ (all limits are as $k \to \infty$ unless otherwise noted). This section establishes separate results for M2-SPSA. One of the challenges, of course, in establishing convergence is the coupling between the recursions for $\hat{\theta}_k$ and $\bar{H}_k$. Formal convergence of $\bar{H}_k$ (see Theorems 2a, b) may still hold under such weighting provided that the analog to expressions (A10) and (A13) in the proof of Theorem 2a (see Appendix) holds. We present a martingale approach that seems to provide a relatively simple solution with reasonable regularity conditions. Alternative conditions for convergence might be available using the ordinary differential equation approach of Metivier and Priouret [29] and Benveniste [30], which includes a certain Markov dependence that would, in principle, accommodate the recursion coupling. However, this approach was not pursued here due to the difficulty of checking certain regularity conditions associated with the Markov dependence (e.g., those related to the solution of the "Poisson equation"). The results below are in two parts, with the first part (Theorems 1a, b) establishing conditions for the convergence of $\hat{\theta}_k$, and the second part (Theorems 2a, b) doing the same for $\bar{H}_k$. The proofs of the theorems are in Appendix A. We let $\|\cdot\|$ denote the standard Euclidean vector norm or a compatible matrix spectral norm (as appropriate), let $(\theta^*)_i$ and $(\theta - \theta^*)_i$ represent the $i$-th components of the indicated vectors (notation chosen to avoid confusion with the iteration subscript $k$), let "i.o." represent "infinitely often", and define $\bar{g}_k(\hat{\theta}_k) \equiv \bar{H}_k^{-1} g(\hat{\theta}_k)$. Below are some regularity conditions that will be used in Theorem 1a for M2-SPSA and, in part, in the succeeding theorems. Some comments on the practical implications of the conditions are given immediately following their statement. Note that some conditions show a dependence on $\hat{\theta}_k$

and $\bar{H}_k$, the very quantities for which we are showing convergence. Although such "circularity" is generally undesirable, it is fairly common in the SA field (e.g., Kushner and Yin [31], Benveniste [30]). The inherent difficulty in establishing theoretical properties of adaptive approaches comes from the need to couple the estimates for the parameters of interest and for the Hessian (Jacobian) matrix. Note that the bulk of the conditions here showing a dependence on $\hat{\theta}_k$ and $\bar{H}_k$ are conditions on the measurement noise and the smoothness of the loss function (C.0, C.2, and C.3 below; C.0', C.2', C.3', C.8, and C.8' in later theorems); the explicit dependence on $\hat{\theta}_k$ can be removed by assuming that the relevant condition holds uniformly for all "reasonable" $\theta$. The dependence in C.5 is handled in the lemma below. The following assumptions are guidelines [16] that are very useful for establishing our theorems.

C.0 $E(\varepsilon_k^{(+)} - \varepsilon_k^{(-)} \mid \Delta_k; \bar{H}_k) = 0$ a.s. $\forall k$, where $\varepsilon_k^{(\pm)}$ is the effective SA measurement noise, i.e., $\varepsilon_k^{(\pm)} \equiv y(\hat{\theta}_k \pm c_k \Delta_k) - L(\hat{\theta}_k \pm c_k \Delta_k)$.

C.1 $a_k, c_k > 0$ $\forall k$; $a_k \to 0$ and $c_k \to 0$ as $k \to \infty$; $\sum_{k=0}^{\infty} a_k = \infty$; $\sum_{k=0}^{\infty} (a_k / c_k)^2 < \infty$.

C.2 For some $\delta, \rho > 0$ and $\forall k, l$: $E\big( |y(\hat{\theta}_k \pm c_k \Delta_k) / \Delta_{kl}|^{2+\delta} \big) \le \rho$, $|\Delta_{kl}| \le \rho$, $\Delta_{kl}$ is symmetrically distributed about 0, and the $\{\Delta_{kl}\}$ are mutually independent.

C.3 For some $\rho > 0$ and almost all $\hat{\theta}_k$, the function $g(\cdot)$ is continuously twice differentiable with a uniformly (in $k$) bounded second derivative for all $\theta$ such that $\|\hat{\theta}_k - \theta\| \le \rho$.

C.4 For each $k \ge 1$ and all $\theta$, there exists a $\rho > 0$ not dependent on $k$ and $\theta$ such that $(\theta - \theta^*)^T \bar{g}_k(\theta) > \rho\, \|\theta - \theta^*\|$.

C.5 For each $i = 1, 2, \ldots, p$ and any $\rho > 0$, $P\big( \{\bar{g}_{ki}(\hat{\theta}_k) \ge 0 \text{ i.o.}\} \cap \{\bar{g}_{ki}(\hat{\theta}_k) < 0 \text{ i.o.}\} \,\big|\, \{ |\hat{\theta}_{ki} - (\theta^*)_i| \ge \rho\ \forall k \} \big) = 0$.

C.6 $\bar{H}_k^{-1}$ exists a.s. $\forall k$; $c_k^2 \bar{H}_k^{-1} \to 0$ a.s.; and for some $\delta, \rho > 0$, $E\big( \|\bar{H}_k^{-1}\|^{2+\delta} \big) \le \rho$.

C.7 For any $\tau > 0$ and non-empty $S \subseteq \{1, 2, \ldots, p\}$, there exists a $\rho'(\tau, S) > \tau$ such that

$\limsup_{k \to \infty} \dfrac{\big| \sum_{i \notin S} (\theta - \theta^*)_i\, \bar{g}_{ki}(\theta) \big|}{\big| \sum_{i \in S} (\theta - \theta^*)_i\, \bar{g}_{ki}(\theta) \big|} < 1$   (2.34)

for all $|(\theta - \theta^*)_i| < \tau$ when $i \notin S$ and $|(\theta - \theta^*)_i| \ge \rho'(\tau, S)$ when $i \in S$.

C.0 and C.1 are common martingale-difference noise and gain sequence conditions. C.2 provides a structure to ensure that the gradient approximations $G_k(\cdot)$ and $G_k^{(1)}(\cdot)$ are well behaved. The conditions on $\Delta_k$ preclude its elements from being uniformly or normally distributed, due to their violation of the implied finite inverse moments condition in $E\big( |y(\hat{\theta}_k \pm c_k \Delta_k)/\Delta_{kl}|^{2+\delta} \big) \le \rho$. An independent Bernoulli $\pm 1$ distribution is frequently used for the elements of $\Delta_k$. C.3 and C.4 provide basic assumptions about the smoothness and steepness of $L(\theta)$. C.3 holds, of course, if $g(\theta)$ is twice continuously differentiable with a bounded second derivative on $R^p$. C.5 is a modest condition that says that $\hat{\theta}_k$ cannot be bouncing around in a manner that causes the signs of the normalized gradient elements to change an infinite number of times if $\hat{\theta}_k$ is uniformly bounded away from $\theta^*$. C.6 provides some conditions on the surrogate for the Hessian estimate that appears in (2.2a) and (2.2b). Since the user has full control over the definition of $\bar{\bar{H}}_k$, these conditions should be relatively easy to satisfy. Note that the middle part of C.6 ($\bar{H}_k^{-1} = o(c_k^{-2})$ a.s.) allows $\bar{H}_k^{-1}$ to "occasionally" be large, provided that the boundedness of moments in the last part of the condition is satisfied. The example for $\bar{\bar{H}}_k$ given in Sec. 2.8 [guideline 2] would satisfy this potential growth condition, for instance, if $\delta_k = c_k^\rho$, $0 < \rho < 2$. Finally, C.7 ensures that, for $k$ sufficiently large, each element of $\bar{g}_k(\theta)$ tends to make a non-negligible contribution to products of the form $(\theta - \theta^*)^T \bar{g}_k(\theta)$ (see C.4). A sufficient condition for C.7 is that, for each $i$, $\bar{g}_{ki}(\theta)$ be uniformly (in $k$) bounded $> 0$ and $< \infty$ when $(\theta - \theta^*)_i$ is bounded as stated in the lemma below. Note that, although no explicit conditions are shown on $\{\tilde{c}_k\}$, there are implicit conditions in C.4–C.7 given $\tilde{c}_k$'s effect on $\bar{H}_k$ (via $\hat{H}_k$). In Theorem 2a on the convergence of $\bar{H}_k$, there are explicit conditions on $\{\tilde{c}_k\}$.

Conditions C.5 and C.7 are relatively unfamiliar. So, be<strong>for</strong>e showing the main theorems on<br />

convergence <strong>for</strong> M2-<strong>SPSA</strong>, we give sufficient conditions <strong>for</strong> these two conditions in the lemma<br />

below. The main sufficient condition is the well-known boundedness condition on the SA<br />

iterate (e.g., Benveniste [30, Theorem II.15]). Although some authors have relaxed this<br />

boundedness condition (e.g., Kushner and Yin [31]), the condition imposes no practical<br />

limitation. This boundedness condition also <strong>for</strong>mally eliminates the need <strong>for</strong> the explicit<br />

dependence <strong>of</strong> other conditions (C.2 and C.3 above; C.0’, C.2’, C.3’, C.8, and C.8’ below) on<br />

θˆ k<br />

since the conditions can be restated to hold <strong>for</strong> all θ in the bounded set containing<br />

Note also that the condition a /<br />

2 → 0 holds automatically <strong>for</strong> gains in the standard <strong>for</strong>m<br />

k<br />

c k<br />

discussed in 2.9.1. One example <strong>of</strong> when the remaining condition <strong>of</strong> the lemma (2.35), is<br />

θˆ k .<br />

trivially satisfied is<br />

Hk<br />

is chosen as a diagonal matrix (see guideline 2).<br />

Lemma—Sufficient Conditions <strong>for</strong> C.5 and C.7: Assume that C.1–C.4 and C.6 hold, and<br />

lim sup k<br />

θˆ < ∞ a.s. Then condition C.7 is not needed. Further, let a /<br />

2 → 0, and suppose<br />

−∞<br />

that, <strong>for</strong> any ρ > 0<br />

k<br />

P (sign g ˆ θ )<br />

ki<br />

( k<br />

≠ sign g ˆ θ ) i.o. ˆ θ − ( θ<br />

* ) ≥ ρ)<br />

= 0<br />

i<br />

( k<br />

.<br />

ki i<br />

k<br />

c k<br />

∀ i<br />

(2.35)<br />

Then C.5 is automatically satisfied.<br />

(1)<br />

Theorem 1a—M2-<strong>SPSA</strong>: Consider the <strong>SPSA</strong> estimate <strong>for</strong> G (⋅)<br />

with G ( ⋅)<br />

given by (2.34).<br />

Let conditions C.0–C.7 hold. Then ˆ θ * k<br />

−θ →0<br />

a.s.<br />

k<br />

k<br />

Theorem 1b below on the second-<strong>order</strong> stochastic gradient (2SG) approach is a straight<strong>for</strong>ward<br />

modification <strong>of</strong> Theorem 1a on M2-<strong>SPSA</strong>. In <strong>order</strong> to explain more clearly the theorems <strong>of</strong><br />

M2-<strong>SPSA</strong>, we take some references from the theorems <strong>of</strong> the SG <strong>for</strong>m [21]. There<strong>for</strong>e, we<br />

replace C.0, C.1, and C.2 with the following SG analogs. Equalities hold a.s. where needed.<br />

47


CHAPTER 2. PROPOSED <strong>SPSA</strong> ALGORITHM<br />

( + )<br />

C.0’: E(<br />

e ˆ<br />

k<br />

θ ; ∆ ; H ) = 0 where e = G ˆ θ ) − g( ˆ θ ).<br />

k<br />

k<br />

k<br />

∞<br />

→<br />

k ∑ ∑<br />

k<br />

k<br />

(<br />

k k<br />

2<br />

C.1’: a 0∀k<br />

; a →0;<br />

a = ∞,<br />

a < ∞.<br />

k<br />

∞<br />

k=<br />

0 k<br />

k=<br />

0 k<br />

2+<br />

δ<br />

C.2’: For some δ , ρ > 0, E ( G ( θˆ<br />

) ) ≤ ρ ∀ k .<br />

k<br />

k<br />

Note (analogous to ~ c } in Theorem 1a) that there are no explicit conditions on c } here.<br />

{ k<br />

{ k<br />

These conditions are implicit via the conditions on<br />

H<br />

k<br />

, and will be made explicit when we<br />

consider the convergence <strong>of</strong><br />

H<br />

k<br />

in Theorem 2b.<br />

Theorem 1b—2SG: Consider the setting where ( ⋅)<br />

Suppose that C.0’ –C.2’ and C.3–C.7 hold. Then ˆ θ * k<br />

−θ →0<br />

a.s.<br />

Theorem 2a below treats the convergence <strong>of</strong><br />

conditions as follows, which are largely self-explanatory:<br />

C.1’’: The conditions <strong>of</strong> C.1 hold plus<br />

∑<br />

G is a direct measurement <strong>of</strong> the gradient.<br />

k<br />

H<br />

k<br />

in the <strong>SPSA</strong> case. We introduce several new<br />

−2<br />

−2<br />

k + 1) ( c ~ c ) < ∞ with c ~ = O(<br />

).<br />

∞<br />

(<br />

k=<br />

0<br />

k k<br />

k<br />

c k<br />

C.3’: Change “thrice differentiable” in C.3 to “four-times differentiable” with all else<br />

unchanged.<br />

C.8: For some ρ > 0 and all k ,l,<br />

m ,<br />

ˆ θ ± ∆ + ~ ~ 2 ~ 2<br />

[ y(<br />

c c ∆ ) /( ∆ ∆ ) ] ≤ ρ<br />

E<br />

k k k k k kl<br />

km<br />

and<br />

ˆ<br />

2 ~ 2<br />

[ y(<br />

θ ± ∆ ) /( ∆ ∆ ) ] ≤ ρ<br />

E<br />

k<br />

c k k kl<br />

km<br />

(<br />

~ (<br />

E ε<br />

ˆ ~<br />

θ ; ∆ ; H<br />

± ) ( )<br />

− ±<br />

k<br />

ε<br />

k k k k<br />

) = 0<br />

and<br />

~ (<br />

ε<br />

± ) (<br />

− ε<br />

± ) 2 ~ 2<br />

[( ) /( ∆ ∆ ) ]<br />

E<br />

k k<br />

kl<br />

km<br />

≤ ρ<br />

where ~ ( ± ) ˆ ~ ~ ˆ ~ ~<br />

ε = y(<br />

θ ± c ∆ + c ∆ ) − L(<br />

θ ± c ∆ + c ∆ ).<br />

k<br />

k<br />

k<br />

k<br />

k<br />

k<br />

k<br />

k<br />

k<br />

k<br />

k<br />

48


2.9 STRONG CONVERGENCE<br />

C.9:<br />

∆ ~ ~<br />

k<br />

satisfies the assumptions <strong>for</strong> ∆k<br />

in C.2 (i.e., ∀ k , l , ∆<br />

kl<br />

≤ ρ and ∆ ~<br />

l<br />

k<br />

is<br />

symmetrically distributed about 0; { ∆ ~ kl<br />

} are mutually independent); ∆<br />

k<br />

and<br />

∆ ~<br />

k<br />

are<br />

independent;<br />

E<br />

−2<br />

−2<br />

( ∆ ) ≤ , E( ∆ ) ≤ ρ∀k<br />

l<br />

ρ and some ρ > 0 .<br />

kl kl<br />

,<br />

Theorem 2a—M2-<strong>SPSA</strong>: Let conditions C.0, C.1’’, C.2, C.3’, and C.4–C.9 hold. Then,<br />

H H( θ<br />

* k<br />

→ ) a.s. Our final strong convergence result is <strong>for</strong> the <strong>Hessian</strong> estimate in 2SG. As<br />

above, we introduce some additional modified conditions.<br />

−2<br />

−2<br />

C.1’’’: The conditions <strong>of</strong> C1’ hold plus c 0,<br />

c →0<br />

and ( k + 1) c < ∞.<br />

C.8’: For some ρ →0<br />

and all k , l ,<br />

2<br />

E<br />

⎛ θ ⎞<br />

⎜ g( ˆ<br />

k<br />

± c k<br />

∆k<br />

) / ∆k<br />

l ⎟ ≤ ρ<br />

⎝<br />

⎠<br />

k<br />

><br />

k<br />

∑ ∞ k=<br />

0<br />

k<br />

and<br />

E<br />

⎜⎛<br />

⎝<br />

( )<br />

( e − −<br />

k<br />

e k<br />

) / ∆k<br />

l<br />

+ 2<br />

⎟⎞<br />

≤ ρ<br />

⎠<br />

E<br />

( )<br />

( e )/ ˆ ) + − −<br />

k<br />

e k<br />

∆k<br />

l<br />

θ = 0<br />

k<br />

ˆ θ<br />

ˆ θ<br />

( ± )<br />

where e = G ( ± c ∆ ) − g( ± c ∆ ).<br />

k<br />

k<br />

k<br />

k<br />

k<br />

C.9’ : For some ρ > 0 and all k , l , ∆<br />

kl<br />

≤ , ∆kl<br />

2<br />

are mutually independent, and E ( ) .<br />

k<br />

k<br />

∆ − kl<br />

k<br />

≤ ρ<br />

ρ ,is symmetrically distributed about 0, { ∆ }<br />

kl<br />

Unlike this theorem’s companion result <strong>for</strong> 2SG (Theorem 1b), explicit conditions are necessary<br />

on { c k<br />

} to control the convergence <strong>of</strong> the <strong>Hessian</strong> iteration. Note that due to the simpler<br />

structure <strong>of</strong> 2SG (versus M2-<strong>SPSA</strong>), the conditions in C.9’ are a subset <strong>of</strong> the conditions in C.9<br />

<strong>for</strong> Theorem 2a.<br />

Theorem 2b—2SG: Suppose that C.0, C.1, C.2, C.3, C.4–C.7, C.8 and C.9 hold. Then<br />

H H( θ<br />

* k<br />

→ ) a.s.<br />

49


CHAPTER 2. PROPOSED <strong>SPSA</strong> ALGORITHM<br />

2.10 -Asymptotic Distributions and Efficiency Analysis<br />

A. Asymptotic Distributions <strong>of</strong> ASP<br />

This subsection builds on the convergence results in the previous section, establishing the<br />

asymptotic normality <strong>of</strong> the M2-<strong>SPSA</strong> and 2SG <strong>for</strong>mulations <strong>of</strong> ASP. The asymptotic normality<br />

is then used in Sec. 2.9 to analyze the asymptotic efficiency <strong>of</strong> the algorithms. Pro<strong>of</strong>s are in<br />

Appendix A.<br />

M2-<strong>SPSA</strong> Setting: As be<strong>for</strong>e, we consider 2nd-<strong>SPSA</strong> be<strong>for</strong>e 2SG. Asymptotic normality or the<br />

related issue <strong>of</strong> convergence <strong>of</strong> moments in basic first-<strong>order</strong> <strong>SPSA</strong> has been established under<br />

slightly differing conditions by Spall [3], Spall and Criston et al. [32], Dippon and Renz [33],<br />

Kushner and Yin [31, ch. 10]. We consider gains <strong>of</strong> the typical <strong>for</strong>m<br />

c k<br />

γ<br />

= c / k , a,<br />

c,<br />

α,<br />

γ > 0, A ≥ 0, k ≥1<br />

and take<br />

=<br />

ki<br />

a k<br />

+<br />

α<br />

= a /( k A) and<br />

β = α − 2γ<br />

, 2 (<br />

−2 2 −2<br />

ρ E ∆ ) , ξ = E(<br />

∆ ki<br />

) ∀k,<br />

i .The<br />

asymptotic mean below relies on the third derivative <strong>of</strong> L(θ<br />

) we let L ( * )<br />

derivative <strong>of</strong> Lwith respect to elements i,j,k <strong>of</strong> θ evaluated at<br />

conditions will be used in the asymptotic normality result.<br />

3 ijk<br />

θ<br />

represent the third<br />

*<br />

θ . The following regularity<br />

E<br />

~ ˆ<br />

as <strong>for</strong>m some σ<br />

2 > 0 . In this point, <strong>for</strong> some all<br />

( ) ( ) 2<br />

2<br />

C.10: ( ( ε − ε ) θ , ) ± H → σ ;<br />

± k k k k<br />

(<br />

{ ( ε − ε ) θ ∆ η )}<br />

+ ) ( − )<br />

E ˆ<br />

k k k<br />

, ck<br />

k<br />

2<br />

ˆ<br />

k<br />

θ ,<br />

=<br />

is an equicontinuous sequence at η = 0 and is continuous<br />

in η on some compact, connected set containing the actual (observed) value <strong>of</strong><br />

c ∆ a.s.<br />

k<br />

k<br />

C.11: In addition to implicit conditions an α and γ via C.1’’, 3γ −α / 2 ≥ 0 and β > 0 .<br />

Further, whenα = 1,<br />

a > β / 2 . Let f (⋅)<br />

in (2.2a) be chosen such that H − H → 0 a.s.<br />

k<br />

Although, in some applications, the “ → ” <strong>for</strong> the noise second moments in C.10 may be<br />

replaced by “=,” the limiting operation allows <strong>for</strong> a more general setting. Since the user has<br />

k<br />

k<br />

full control over f (⋅)<br />

, it is not difficult to guarantee in C.11 that H −H<br />

→ 0<br />

k<br />

k<br />

k<br />

a.s.<br />

Theorem 3a—M2-<strong>SPSA</strong>: Suppose that C.0, C.1’’, C.2, C.3’, and C.4–C.9 hold (implying<br />

50


2.10 ASYMPTOTIC DISTRIBUTIONS AND ANALYSIS<br />

convergence <strong>of</strong><br />

θˆ and H ). Then, if C.10 and C.11 hold and<br />

k<br />

k<br />

H(<br />

θ<br />

* −1<br />

)<br />

exists,<br />

β /2 ˆ *<br />

k ( θ −θ<br />

) dist<br />

k<br />

⎯ ⎯→<br />

N(<br />

µ , Ω)<br />

(2.36)<br />

where µ = 0{<br />

0 3γ<br />

− α / 2 > 0<br />

T is<br />

−1<br />

*<br />

if ( ) T /( / 2)<br />

H θ a − β if 3 γ −α / 2 = 0} the j-th element <strong>of</strong><br />

+<br />

Ω =<br />

⎡<br />

⎤<br />

P<br />

1 2 2⎢<br />

(3) *<br />

(3) *<br />

− ac ξ L + ⎥<br />

⎢ jjj<br />

( θ ) 3∑<br />

L jjj<br />

( θ )<br />

(2.37)<br />

6<br />

⎥<br />

i=<br />

1<br />

⎢⎣<br />

i≠1<br />

⎥⎦<br />

( 8a<br />

+ β )<br />

2 −2<br />

2 2 * −2<br />

a c σ ρ H(<br />

θ ) / 4<br />

+<br />

and β = β<br />

+<br />

if α = 1 and β<br />

+<br />

= 0 if α 1/ 2 if α = 1, and<br />

k<br />

k<br />

f<br />

k<br />

(⋅) is chosen such that H<br />

k<br />

− H<br />

k<br />

→ 0 a.s. As with C.10, frequently, → can be replaced<br />

with “=” in the limiting covariance expression. Likewise, see the comments following C.11<br />

regarding the condition H − H →0<br />

a.s.<br />

k<br />

k<br />

51


CHAPTER 2. PROPOSED <strong>SPSA</strong> ALGORITHM<br />

Theorem 3b—2SG: Suppose that C.0’, C.1’’’, C.2’, C.3’, C.4–C.7, C.8’, and C.9’ hold<br />

(implying convergence <strong>of</strong><br />

θˆ k<br />

and H<br />

k<br />

) that C.12 holds with<br />

H(<br />

θ<br />

* −1<br />

)<br />

existing. Then,<br />

k<br />

α / 2<br />

dist<br />

ˆ *<br />

( θ k<br />

−θ<br />

) →N(0,<br />

Ω')<br />

(2.38)<br />

2 * −1<br />

* −1<br />

where Ω'<br />

= a H(<br />

θ ) ΣH(<br />

θ ) /(2a<br />

− β ) with β = 1 if α = 1 and β = 0 if α


2.10 ASYMPTOTIC DISTRIBUTIONS AND ANALYSIS<br />

1st-<strong>SPSA</strong> and M2-<strong>SPSA</strong>, we have<br />

2a<br />

1<br />

rms<br />

2<strong>SPSA</strong>(1,1,<br />

c,<br />

)<br />

6<br />

< 2,<br />

1<br />

min rms1<br />

<strong>SPSA</strong>(<br />

a,1,<br />

c,<br />

)<br />

> 1/ λmin<br />

6<br />

∀c<br />

> 0<br />

(2.40a)<br />

2a<br />

1<br />

rms<br />

2<strong>SPSA</strong>(1,1,<br />

c,<br />

)<br />

6<br />

< 2<br />

1<br />

min min rms1<br />

<strong>SPSA</strong>(1,1,<br />

c,<br />

)<br />

> 1/ λ min c><br />

0<br />

6<br />

(2.40b)<br />

where<br />

λ is the minimum eigenvalue <strong>of</strong> H ( θ<br />

* ) . The interpretation <strong>of</strong> (2.40a), (2.40b) is as<br />

min<br />

follows. From (2.40a), we know that, <strong>for</strong> any common value <strong>of</strong> , the asymptotic rms error <strong>of</strong><br />

M2-<strong>SPSA</strong> is less than twice that <strong>of</strong> 1st-<strong>SPSA</strong> with an optimal (even when c is chosen optimally<br />

<strong>for</strong> 1st-<strong>SPSA</strong>). Expression (2.40b) states that, if we optimize only <strong>for</strong> M2-<strong>SPSA</strong>, while<br />

optimizing both a and c <strong>for</strong> 1st-<strong>SPSA</strong>, we are still guaranteed that the asymptotic rms error <strong>for</strong><br />

M2-<strong>SPSA</strong> is no more than twice the optimized rms error <strong>for</strong> 1st-<strong>SPSA</strong>. Another interesting<br />

aspect <strong>of</strong> M2-<strong>SPSA</strong> is the relative robustness apparent in (2.40a), (2.40b) given that the optimal<br />

<strong>for</strong> 1st-<strong>SPSA</strong> will not typically be known in practice. For certain suboptimal values <strong>of</strong> a in<br />

1st-<strong>SPSA</strong>, the rms error can get very large whereas simply choosing a= 1 <strong>for</strong> M2-<strong>SPSA</strong><br />

provides the factor <strong>of</strong> guarantee mentioned above. Although (2.40a), (2.40b) suggest that the<br />

M2-<strong>SPSA</strong> approach yields a solution that is quite good, one might wonder if a true optimal<br />

solution is possible. Dippon and Renz [33, pp.1817–1818] pursue this issue, and provide an<br />

alternative to<br />

θ<br />

* −1<br />

H ( ) as the limiting weighting matrix <strong>for</strong> use in an SA <strong>for</strong>m such as (2.2a).<br />

Un<strong>for</strong>tunately, this limiting matrix has no closed-<strong>for</strong>m solution, and depends on the third<br />

derivatives <strong>of</strong> L (θ ) at<br />

adaptive matrix (analogous to<br />

*<br />

θ , and furthermore, it is not apparent how one would construct an<br />

H<br />

k<br />

that would converge to this optimal limiting matrix.<br />

Likewise, the optimal <strong>for</strong> M2-<strong>SPSA</strong> is typically unavailable in practice since it also depends on<br />

the third derivatives <strong>of</strong> L (θ ). Expressions (2.40a), (2.40b) are based on an assumption that<br />

1st-<strong>SPSA</strong> and M2-<strong>SPSA</strong> have used the same number <strong>of</strong> iterations. This is a reasonable basis <strong>for</strong><br />

a core comparison since the “cost” <strong>of</strong> solving <strong>for</strong> the optimal 1st-<strong>SPSA</strong> gains is unknown.<br />

However, a more conservative representation <strong>of</strong> relative efficiency is possible by considering<br />

only the direct number <strong>of</strong> loss measurements, ignoring the extra cost <strong>for</strong> optimal gains in<br />

1st-<strong>SPSA</strong>. In particular, 1st-<strong>SPSA</strong> uses two loss measurements per iteration and M2-<strong>SPSA</strong> uses<br />

four measurements per iteration. Hence, with both algorithms using the same number <strong>of</strong> loss<br />

53


CHAPTER 2. PROPOSED <strong>SPSA</strong> ALGORITHM<br />

measurements, the corresponding upper bounds to the ratios in (2.40a), (2.40b) (reflecting the<br />

ratio <strong>of</strong> rms errors as the common number <strong>of</strong> loss measurements gets large) would be<br />

2 / 3<br />

4 ≈ 2.52 , an increase from the bound <strong>of</strong> 2 under a common number <strong>of</strong> iterations. This bound’s<br />

likely excessive conservativeness follows from the fact that the cost <strong>of</strong> solving <strong>for</strong> the optimal<br />

gains in 1<strong>SPSA</strong> is being ignored. Note that, <strong>for</strong> other adaptive approaches that are also<br />

asymptotically normally distributed, the same relative cost analysis can be used. Hence, <strong>for</strong><br />

example, with the Fabian [19] approach using O ( p<br />

2 ) measurements per iteration to generate<br />

2/3<br />

the <strong>Hessian</strong> estimate, the corresponding upper bounds would be <strong>of</strong> magnitude O ( p ), bounds<br />

that, unlike the bounds <strong>for</strong> M2-<strong>SPSA</strong>, increase with problem dimension.<br />

In the following chapters, once finished these numerical simulations in <strong>order</strong> show the<br />

M2-<strong>SPSA</strong> per<strong>for</strong>mance, we will prove the proposed <strong>SPSA</strong> algorithm applied to parameters<br />

estimation per<strong>for</strong>mance in some realistic systems. The main advantages <strong>of</strong> our proposed<br />

algorithm will be shown, such as low computational cost and efficient accuracy and<br />

convergence.<br />

2.11 -Perturbation Distribution <strong>for</strong> M2-<strong>SPSA</strong><br />

As discussed above, the perturbations<br />

∆<br />

k<br />

in the gradient estimate are based on Bernoulli<br />

random variables on {–1, 1}. In fact, the requirements are merely that the<br />

∆<br />

ki<br />

must be<br />

independent and symmetrically distributed about zero with finite absolute inverse moments<br />

−1<br />

E[<br />

∆ ki<br />

] <strong>for</strong> all k, i. The Bernoulli is just one distribution <strong>for</strong> ∆<br />

ki<br />

that satisfies these<br />

conditions. It has been shown that one cannot do better than this distribution in the asymptotic<br />

case [34], but less is known about the best distribution <strong>for</strong> small-sample approximations. Some<br />

numerical results seem to show better per<strong>for</strong>mance on some problems with non-Bernoulli<br />

distributions. The per<strong>for</strong>mance <strong>of</strong> three such alternative distributions is reported here: a split<br />

uni<strong>for</strong>m distribution, an inverse split uni<strong>for</strong>m distribution, and a symmetric double triangular<br />

distribution (referred to as candidate distributions in the following). The {–1, 1} Bernoulli<br />

distribution has variance and absolute first moment (mean magnitude) both equal to one. It is<br />

the only qualified distribution with these qualities. We conjecture that these characteristics are<br />

necessary conditions <strong>for</strong> optimal per<strong>for</strong>mance <strong>of</strong> the M2-<strong>SPSA</strong> algorithm, given optimal step<br />

size parameters. Variations in mean magnitude can be addressed by scaling the gradient step<br />

54


2.11 PERTURBATION DISTRIBUTION FOR M2-<strong>SPSA</strong><br />

size (c), so <strong>for</strong> comparisons, candidate distributions should have the same variance as the {–1,<br />

1} Bernoulli. Then differences in per<strong>for</strong>mance could be attributed to differences in the nature <strong>of</strong><br />

variability in that distribution.<br />

Table 2.1. Characteristics <strong>of</strong> the perturbation distributions.<br />

To ensure consistency in the comparison, we normalized the candidate distributions so that their<br />

variances were one and their main magnitudes were close to one, but not so close that the<br />

essential character <strong>of</strong> the distributions were lost. The probability density functions <strong>of</strong> these<br />

distributions are given at right. The characteristics <strong>of</strong> each distribution are given in Table 2.1.<br />

The M2-<strong>SPSA</strong> algorithm with each distribution <strong>for</strong> the perturbations was applied to 34<br />

functions from Moré’s suite <strong>of</strong> optimization problems [35]. The initial points recommended in<br />

Moré were used <strong>for</strong> each function. The functions values were obscured with normally<br />

distributed errors with mean zero and a variance <strong>of</strong> one. We then used these noisy function<br />

values to calculate a simultaneous perturbation gradient approximation. For nearly all <strong>of</strong> the<br />

functions, errors <strong>of</strong> this magnitude are insignificant away from the minimum. However, most<br />

functions in the optimization suite have minimums at or near zero, where N(0, 1) errors are<br />

quite significant. This situation is further complicated by the fact that many functions are<br />

extremely flat near the minimum as well. The result was a demanding examination <strong>of</strong> the<br />

M2-<strong>SPSA</strong> algorithm <strong>of</strong>fering ample opportunity to test alternative perturbation distributions.<br />

The step size parameters <strong>of</strong> the M2-<strong>SPSA</strong> algorithm (that is, a and c) were optimized <strong>for</strong> each<br />

distribution and each function by random search. The procedure to optimize the step parameters<br />

used 20,000 iterations <strong>of</strong> a directed random search algorithm.<br />

55


CHAPTER 2. PROPOSED <strong>SPSA</strong> ALGORITHM<br />

In the directed random search (sometimes called a localized random search, see [36], p. 45),<br />

new trial values are generated near the location <strong>of</strong> the current best value. The algorithm accepts<br />

the input parameters as the current optimal values if they produce results that are better than the<br />

best yet obtained, otherwise they are rejected. This method is somewhat more sophisticated than<br />

simple random search, and generally more computationally efficient in that it uses in<strong>for</strong>mation<br />

from previous iterations. For more in<strong>for</strong>mation on random search methods, see Solis and Wets<br />

[37]. For each iteration <strong>of</strong> the random search we executed fifty Monte Carlo trials <strong>of</strong> the <strong>SPSA</strong><br />

algorithm, and then accepted or rejected the parameter values based on the average <strong>of</strong> these fifty<br />

trials. The theoretically optimal values <strong>for</strong> a and g were used. The M2-<strong>SPSA</strong> algorithm in the<br />

procedure outlined above was run <strong>for</strong> stopping times <strong>of</strong> n = 10, 100, and 1000 iterations to<br />

determine whether any one distribution outper<strong>for</strong>med the others over small, moderate, and large<br />

iteration domains. Common random numbers (CRN) were used to minimize variance. With<br />

CRN, the sequences <strong>of</strong> function values generated by the iteration differ only as a result <strong>of</strong> how<br />

the <strong>SPSA</strong> algorithm processes the random numbers in a different way. In this evaluation, the<br />

sequence <strong>of</strong> CRN were used to generate random perturbations from the appropriate distribution.<br />

This method allows the use <strong>of</strong> matched pairs testing to determine the significance <strong>of</strong> differences<br />

in the minimum values observed. Matched pairs testing generally leads to sharper analysis.<br />

⎧ 1<br />

⎪ −b≤x≤−a<br />

2( b−a)<br />

f SU<br />

( x;<br />

a,<br />

b)<br />

= ⎨<br />

⎪<br />

⎪⎩<br />

0 otherwise<br />

or<br />

a≤x≤b<br />

Fig. 2.3. Split uni<strong>for</strong>m distribution.<br />

56


2.12. PARAMETER ESTIMATION<br />

⎧ ab<br />

⎪<br />

−b<br />

≤ x ≤ −a<br />

2<br />

2( b−a)<br />

x<br />

f ISU<br />

( x;<br />

a,<br />

b)<br />

= ⎨<br />

⎪<br />

⎪⎩<br />

0 otherwise<br />

or<br />

a ≤ x ≤b<br />

Fig. 2.4. Inverse split uni<strong>for</strong>m distribution.<br />

⎧ x + c<br />

⎪<br />

− c ≤ x ≤ −b<br />

( c − a)(<br />

c − b)<br />

⎪<br />

⎪<br />

x + a<br />

− b ≤ x ≤ −a<br />

⎪(<br />

c − a)(<br />

c − b)<br />

⎪<br />

f SDT<br />

( x;<br />

a,<br />

b)<br />

= ⎨ x − a<br />

a ≤ x ≤ b<br />

⎪(<br />

c − a)(<br />

c − a)<br />

⎪<br />

⎪ x − c<br />

b ≤ x ≤ c<br />

⎪(<br />

c − a)(<br />

b − c)<br />

⎪<br />

⎩0<br />

otherwise<br />

Fig. 2.5. Symmetric double triangular<br />

distribution.<br />

2.12 -Parameter Estimation<br />

2.12.1 -Introduction<br />

In the proposed <strong>SPSA</strong> algorithm, all parameters are perturbed simultaneously; it is possible to<br />

57


CHAPTER 2. PROPOSED <strong>SPSA</strong> ALGORITHM<br />

modify parameters with only two measurements <strong>of</strong> an evaluation function regardless <strong>of</strong> the<br />

dimension <strong>of</strong> the parameter. A parameter estimation algorithm using M2-<strong>SPSA</strong> is proposed.<br />

The contribution <strong>of</strong> this chapter is a <strong>SPSA</strong> algorithm <strong>for</strong> parameter estimation that can be used<br />

with non-linear systems or systems with parameters estimation very high. The proposed <strong>SPSA</strong><br />

algorithm is an iterative method <strong>for</strong> optimization, with randomized search direction, that<br />

requires at most three function (model) evaluations at each iteration. The M2-<strong>SPSA</strong><br />

incorporates the 2nd-<strong>SPSA</strong> usually reduced number <strong>of</strong> iterations, to do an initial estimate <strong>of</strong> the<br />

*<br />

optimum values <strong>for</strong> the parameter, θ . The proposed <strong>SPSA</strong> algorithm makes use <strong>of</strong> the <strong>Hessian</strong><br />

matrix to increase the rate <strong>of</strong> convergence. First, second and modified second-<strong>order</strong> <strong>SPSA</strong><br />

algorithm was implemented to estimate the unknown parameters <strong>of</strong> the highly non-linear<br />

physical model. Hence, execution time per iteration does not increase with the number <strong>of</strong><br />

parameters. The method can handle non-linear dynamic models, non-equilibrium transient test<br />

conditions and data obtained in close loop. For this reason, this method is suitable <strong>for</strong> the<br />

estimation <strong>of</strong> parameters in realistic applications. Firstly, it is necessary to show the general<br />

implementation <strong>of</strong> <strong>SPSA</strong> algorithm. The general steps in implementation <strong>of</strong> <strong>SPSA</strong> algorithm are<br />

[28]: 1) initialization and coefficient selection, 2) numerical issued, 3) gradient/<strong>Hessian</strong><br />

averaging, 4) gain selection, (see Sec. 2.8). Finally, we have proposed a modification in this<br />

implementation. This modification is explained on base to the recursive update <strong>for</strong>m <strong>for</strong> the<br />

parameter vector is given by<br />

θˆ<br />

= θˆ<br />

− a<br />

gˆ<br />

( θˆ<br />

)<br />

k + 1 k k k k<br />

(2.41)<br />

where<br />

ak<br />

is a weight or gain constant <strong>for</strong> the recurrent iteration and<br />

ĝ<br />

k<br />

is a gradient estimate<br />

<strong>for</strong> the recurrent iteration. To update<br />

θˆ k to a new value ˆ<br />

k+<br />

1<br />

ˆ<br />

+<br />

θ . If θ k 1<br />

falls outside the range <strong>of</strong><br />

allowable values <strong>for</strong> θ . Then project the updated θ k 1<br />

to the nearest boundary and reassign this<br />

ˆ +<br />

ˆ<br />

+<br />

projected value θ k 1<br />

. Mathematically we have, <strong>for</strong> every -i = 1, … , n;<br />

ˆ θ<br />

k+<br />

1, i<br />

⎧ ˆ θk<br />

⎪<br />

= ⎨θ<br />

i<br />

⎪<br />

⎪⎩<br />

θi<br />

+ 1, i<br />

min<br />

max<br />

if θ<br />

if ˆ θ<br />

if ˆ θ<br />

min<br />

i<br />

k+<br />

1, i<br />

k+<br />

1, i<br />

≤ ˆ θ<br />

< θ<br />

> θ<br />

k+<br />

1, i<br />

min<br />

i<br />

max<br />

i<br />

< θ<br />

max<br />

i<br />

.<br />

58


2.11 PARAMETER ESTIMATION<br />

Modifications to this step may be needed to enhance the best convergence <strong>of</strong> the algorithm. In<br />

particular the update could be block if the cost function actually worsens after the “the basic”<br />

update in this step. The choice <strong>of</strong> various parameters <strong>of</strong> the algorithm plays an important role in<br />

the convergence <strong>of</strong> the algorithm. It is suggested that α = 0. 602 and γ = 0. 101<br />

practically effective and theoretically valid choice. The value <strong>of</strong> A is chosen to be 10% <strong>of</strong> the<br />

maximum iterations allowed. The maximum number <strong>of</strong> iterations was chosen to be 100 and<br />

hence A was chosen to be 10. It is recommended that if the measurements are (almost) error free<br />

c, can be chosen as a small positive number. In this case it was chosen to be 0.01.<br />

are<br />

The value <strong>of</strong> a should be chosen such that the<br />

α<br />

a /( A +1) times the magnitude <strong>of</strong> elements <strong>of</strong><br />

( ˆ ) is approximately equal to the smallest <strong>of</strong> the desired change magnitudes among the<br />

gˆ 0<br />

θ0<br />

elements <strong>of</strong> θ in early iterations. For the problem at hand a=1 gave a good results. This value<br />

if a was chosen to ensure that the component <strong>of</strong> θ during the iterations would remain within the<br />

allowed bounds.<br />

We have proposed modify the typical implementation <strong>of</strong> <strong>SPSA</strong> algorithm <strong>for</strong> the estimation<br />

parameters application according to M2-<strong>SPSA</strong> algorithm, so that, the optimization in the vector<br />

parameter θˆ was modified and showed as follows: The vector parameter θˆ is obtained by<br />

solving the following problem:<br />

ˆ θ = arg min<br />

θ H<br />

( θ )<br />

subject to<br />

θ<br />

θ<br />

M<br />

θ<br />

min<br />

1<br />

min<br />

2<br />

min<br />

n<br />

≤ θ ≤ θ<br />

1<br />

≤ θ ≤ θ<br />

2<br />

≤ θ ≤ θ<br />

n<br />

max<br />

1<br />

max<br />

2<br />

max<br />

n<br />

(2.42)<br />

where the cost function H (θ ) is given by a cost function and n gives the total number <strong>of</strong><br />

parameters in the case n=19. Most conventional tools used <strong>for</strong> optimization <strong>of</strong> the cost function<br />

to arrive at local minimum. However this optimization method is very time consuming if there<br />

are many variables to be optimized or if the cost function evaluations are computationally<br />

expensive. If the number <strong>of</strong> parameters increases, the number <strong>of</strong> function evaluations required<br />

computing the gradients also increase. Moreover, the chance <strong>of</strong> solution convergence to local<br />

59


CHAPTER 2. PROPOSED <strong>SPSA</strong> ALGORITHM<br />

minimum also increases with the number <strong>of</strong> parameters to be optimized. For the problem at<br />

hand, which several parameters to be optimized, it was found that the gradient-based approach<br />

was not practical. For this reason, the <strong>SPSA</strong> algorithm was used to minimize the cost function.<br />

Once the approximate gradient is computed the parameters are update and a new value <strong>of</strong> θ is<br />

computed. It is recommended once more that the cost function evaluation at this point to check<br />

if the cost function at this new value <strong>of</strong> θ is less that the cost function using<br />

θ<br />

k<br />

. The number<br />

<strong>of</strong> cost function evaluations per iteration does not depend on the number <strong>of</strong> variable, which<br />

makes this method very attractive <strong>for</strong> optimization problems with several variables. There<strong>for</strong>e,<br />

this method can be represented as follows: The i-th element <strong>of</strong> the gradient estimate, g ˆ ( ˆ θ ) is<br />

given by<br />

ˆ<br />

ˆ<br />

ˆ y(<br />

θk<br />

+ ck∆k<br />

) − y(<br />

θk<br />

− ck∆k<br />

)<br />

gˆ<br />

k<br />

( θ<br />

k<br />

) =<br />

.<br />

(2.43)<br />

2c<br />

∆<br />

k<br />

ki<br />

k<br />

The term<br />

θˆ ± c ∆ represents a perturbation to the optimization parameters about the recurrent<br />

k<br />

k<br />

k<br />

estimate. Similar to a standard SA <strong>for</strong>m,<br />

ck<br />

is small, positive weighting value. The vector <strong>of</strong><br />

zero-mean random variables, which must have bounded inverse moments. One valid choice <strong>for</strong><br />

∆k<br />

is a vector <strong>of</strong> Bernoulli-distributed, i.e. ± 1, random perturbation terms. In resume, the fifth<br />

guideline that we have proposed and is complement <strong>of</strong> Sec. 2.8 is given as follows:<br />

At each iteration, block “bad” steps if the new estimate <strong>for</strong> θ fails a certain criterion.<br />

H<br />

k<br />

should typically continue to be updated even if θ k 1<br />

is blocked. The most obvious blocking<br />

applies when θ must satisfy constraints; an updated value may be blocked or modified if a<br />

constraint is violated. There are two ways 5a) and 5b) that one might implement blocking when<br />

constraints are not the limiting factor.<br />

ˆ<br />

+<br />

5a) Based on θˆ k<br />

and θ k 1<br />

directly.<br />

5b) Based on loss measurements.<br />

ˆ<br />

+<br />

Both <strong>of</strong> 5a) and 5b) may be implemented in a given applications. In 5a), one simply blocks the<br />

step from<br />

θˆ k to<br />

ˆk 1<br />

θ if ˆ θ − ˆ<br />

1<br />

θ > (tolerance1 ) where the norm is any convenient distances<br />

k+ k<br />

60


2.11 PARAMETER ESTIMATION<br />

measure and (tolerance1 >0) is some “reasonable” maximum distance to cover in one step. The<br />

rationale behind 5a) is that a well-behaving algorithm should be moving toward the solution in a<br />

smooth manner, and very large steps are indicative <strong>of</strong> potential divergence. The second potential<br />

method, 5b), is based on blocking the step if y( ˆ θ ) ( ˆ<br />

k+ 1<br />

> y θk<br />

)(tolerance 2 ) where (tolerance 2 )≥0<br />

might be set at about one or two times the approximate standard deviation <strong>of</strong> the noise in the<br />

y (⋅) measurements. In a setting where the noise in the loss measurements tends to be large (say,<br />

much larger than the allowable difference between L( θ<br />

* ) and L ˆ θ )), it may be undesirable<br />

( final<br />

to use 5b) due to the difficulty in obtaining meaningful in<strong>for</strong>mation about the relative old and<br />

new loss values. For any nonzero noise levels, it may be beneficial to average several y (⋅)<br />

measurements in making the decision about whether to block the step; this may be done. Having<br />

tolerance 2 >0 as specified above when there is noise in the<br />

y (⋅)'<br />

builds some conservativeness<br />

into the algorithm by allowing a new step only if there is relatively strong statistical evidence <strong>of</strong><br />

an improved loss value. Let us close this subsection with a few summary comments about the<br />

implementation aspects above. Without the second blocking procedure 5b) in use, 2nd-<strong>SPSA</strong><br />

requires four measurements y(⋅)<br />

per iteration, regardless <strong>of</strong> the dimension p (two <strong>for</strong> the standard<br />

G (⋅) k<br />

estimate and two new values <strong>for</strong> the one sided SP gradients G<br />

1 ( ⋅ k<br />

)) . For 2SG, three<br />

gradient measurements G (⋅ k<br />

) are needed, again independent <strong>of</strong> p. If the second blocking<br />

procedure 5b) is used, one or more additional y (⋅)<br />

measurements are needed <strong>for</strong> both 2nd-<br />

<strong>SPSA</strong> and 2SG. The use <strong>of</strong> gradient/ <strong>Hessian</strong> averaging 3) would increase the number <strong>of</strong> loss or<br />

gradient evaluations, <strong>of</strong> course.<br />

The standard deviation <strong>for</strong> the measurement noise (used in items 4) and 5b in this chapter) can<br />

be estimated by collecting several y (⋅)<br />

values at θ = θˆ<br />

0<br />

; neither 4) nor 5a) requires this<br />

estimate to be precise (so relatively few y(⋅)<br />

values are needed). In general, 5a) can be used<br />

anytime, while 5b) is more appropriate in a low- or no-noise setting. Note that 5a) helps to<br />

prevent divergence, but lacks direct insight into whether the loss function is improving, while<br />

5b) does provide that insight, but requires additional y (⋅)<br />

measurements, the number <strong>of</strong> which<br />

might grow prohibitively in a high-noise setting. Once finished the modifications in the<br />

implementation <strong>SPSA</strong> algorithm according to our proposed algorithm, we can start to explain<br />

how is applied toward the estimation parameters.<br />

61


CHAPTER 2. PROPOSED <strong>SPSA</strong> ALGORITHM<br />

Firstly, we defined a simple model in <strong>order</strong> to explain how is developed the estimation<br />

parameters algorithm using our proposed algorithm. This model was used be<strong>for</strong>e by other<br />

authors [24][25] <strong>for</strong> explain estimation parameters using the 1st-<strong>SPSA</strong> algorithm. So that, this<br />

system is used because is very suitable and illustrates very well M2-<strong>SPSA</strong> algorithm<br />

per<strong>for</strong>mance. Of such a way, the following single-input single-output (SISO) discrete system<br />

with input x and output y [24][25] is considered:<br />

x<br />

k<br />

= a k + K + a x + b u + K b u .<br />

(2.44)<br />

1 k − 1<br />

n kn 1 k −1<br />

+<br />

m k − m<br />

Here, k is the discrete time, a<br />

1<br />

, . . . ,<br />

a<br />

n<br />

and b<br />

1, . . . ,<br />

b<br />

m<br />

represent the constant coefficients.<br />

Also, in general, n ≥ m. It is assumed that the system input<br />

value<br />

y k<br />

accompanied by some <strong>for</strong>m <strong>of</strong> noise<br />

υk<br />

xk<br />

is observed as the observed<br />

y<br />

= + υk.<br />

(2.45)<br />

k<br />

x k<br />

Here, the noise<br />

satisfy the following:<br />

vk<br />

the input<br />

uk<br />

and the output<br />

xk<br />

are independent <strong>of</strong> one another, and they<br />

E ( u ) = u , E ( u u ) = r<br />

2 δ<br />

(2.46 a)<br />

k<br />

a<br />

k<br />

i<br />

ki<br />

2<br />

E ( υ ) = 0 , E ( υ υ ) = σ δ<br />

(2.46 b)<br />

k<br />

k<br />

i<br />

ki<br />

2 2<br />

where,δ represents the Kronecker delta, r and σ represent the variances <strong>of</strong> the noise, and<br />

u is the average value <strong>for</strong> the input. At this point, the parameter estimation problem <strong>for</strong><br />

a<br />

consecutively finding unknown parameters { a ,..., a , b b }<br />

values{ y k<br />

, u k<br />

}. The parameters are defined as follows:<br />

n m 1<br />

based on the observed<br />

1<br />

,...,<br />

u )<br />

T<br />

k 1<br />

( uk−m,...,<br />

uk−<br />

1<br />

− = (2.47a)<br />

T<br />

x<br />

k− 1<br />

= ( xk−n,...,<br />

xk−<br />

1)<br />

(2.47b)<br />

T<br />

υ<br />

k − 1<br />

= ( υk−n,...,<br />

υk−<br />

1)<br />

(2.47c)<br />

62


2.11 PARAMETER ESTIMATION<br />

T<br />

y<br />

k − 1<br />

= ( yk<br />

−n<br />

,..., yk<br />

−1,<br />

uk<br />

−m<br />

,..., uk<br />

−1)<br />

(2.47d)<br />

T<br />

φ = a ,..., a , b ,..., b ) .<br />

(2.47e)<br />

(<br />

n 1 m 1<br />

Furthermore, based on the conditions in (2.46 b) <strong>for</strong> the observed noise,<br />

E ( ) = 0<br />

(2.48)<br />

e k<br />

E( e e ) = 0, k − i n.<br />

(2.49)<br />

k i<br />

><br />

There<strong>for</strong>e, the error function J can be defined as follows. The problem <strong>of</strong> minimizing this error<br />

function and finding the system parameter vector φ is addressed in this chapter.<br />

1 2<br />

⎧<br />

T<br />

(<br />

ˆ ⎫<br />

J = E ⎨ y<br />

k<br />

− y<br />

k − 1φ<br />

) ⎬<br />

(2.50)<br />

⎩ 2<br />

⎭ .<br />

Here, E represented the expected value, and φˆ represents the estimated value. This kind <strong>of</strong> error<br />

function, with the expected value, cannot be found in practice. Thus, using SA with this as an<br />

iterated function is considered. The problem <strong>of</strong> finding a parameter that yields a minimum in<br />

this kind <strong>of</strong> iterated function can be solved by using the SA method. The partial derivative <strong>of</strong><br />

the error function (2.50) with respect to the estimation φˆ is<br />

− y y − y<br />

T ˆ).<br />

(2.51)<br />

k−1( k k−1φ<br />

Here, let us look at the expected value <strong>for</strong><br />

independent with υ<br />

k−1<br />

, then<br />

yk<br />

− 1<br />

ek<br />

. If we consider that<br />

k−1<br />

x and u<br />

k−1<br />

are<br />

E<br />

2<br />

⎧⎛ x + υ ⎞<br />

⎫ ⎡σ<br />

I 0⎤<br />

k−1<br />

k<br />

= ⎨⎜<br />

⎬ ⎢ ⎥<br />

⎩ υ ⎟ k 1 k−1<br />

L<br />

n k−n<br />

(2.52)<br />

⎝ k−1<br />

⎠<br />

⎭ ⎣ 0 0⎦<br />

k−1<br />

k−1<br />

{ y e } E ⎜ ⎟( υ − aυ<br />

− − a υ ) = − φ<br />

holds, with no result being zero. Consequently, in the estimate using (2.51), a bias occurs; thus,<br />

63


CHAPTER 2. PROPOSED <strong>SPSA</strong> ALGORITHM<br />

(2.51) does not give a consistent estimate [15]. There<strong>for</strong>e, this bias must be compensated. The<br />

reference [15] <strong>of</strong>fers a detailed explanation <strong>of</strong> this. Moreover, if (2.49) is considered,<br />

calculations must be per<strong>for</strong>med every (n + 1) instances <strong>of</strong> sampling, to guarantee the<br />

independence <strong>of</strong> { e k<br />

}. The modifying time k can be represented by the actual sampling time n;<br />

k = 1, n + 2, 2n + 3, . . . . Then, the following recursion <strong>for</strong> the estimated parameters will be<br />

considered:<br />

ˆ ˆ k − 1<br />

φ<br />

k + n<br />

= φ<br />

k −1<br />

− ρ<br />

e<br />

∆φ<br />

k −1,<br />

k = 1,…, n + 2, 2n+3,.... (2.53)<br />

n + 1<br />

ˆ<br />

−<br />

Here, ∆φ<br />

k 1<br />

is the basic quantity which provides the quantity <strong>for</strong> the estimation parameters.<br />

Furthermore,<br />

a fraction.<br />

ρe<br />

represents the gain coefficient. The subscript on the coefficient ρe<br />

represents<br />

Because this takes a value <strong>for</strong> every (n + 1) instances, <strong>for</strong> example 1, n + 2, 2n + 3, . . . , with<br />

respect to the actual sampling time n, as a result, the subscript<br />

ρ<br />

e<br />

refers to taking the value:<br />

1 – 1/n + 1 = 0, n + 2 –1/n + 1 = 1, . . . , 0, 1, 2,….<br />

In <strong>SPSA</strong>, the perturbations are superimposed simultaneously on all the parameters. As a result,<br />

even as the number <strong>of</strong> parameters rises, the estimated parameters can be revised based on the<br />

two values <strong>of</strong> the error functions either when perturbation is added or when there is not<br />

perturbation. A parameter estimation method that uses this kind <strong>of</strong> SP is extremely useful in the<br />

many circumstances.<br />

2.12.2 -System to be Applied<br />

Let us consider the differential with respect to the parameter φ <strong>for</strong> the model <strong>of</strong> the error <strong>of</strong><br />

squares<br />

2<br />

e in this instance [24][25]. For the sake <strong>of</strong> simplicity, when considering a case in<br />

which all variables are scalar, results<br />

2<br />

∂ e<br />

∂ φ<br />

=<br />

2 ( y<br />

−<br />

y<br />

q<br />

)<br />

∂ y<br />

q<br />

∂ φ<br />

=<br />

2 ( y<br />

−<br />

y<br />

q<br />

)<br />

∂ y<br />

q<br />

∂ x<br />

∂ x<br />

.<br />

∂ φ<br />

(2.54)<br />

64


2.11 PARAMETER ESTIMATION<br />

∂ y q<br />

/ ∂x in this equation represents a Jacobian observation system. If the observation system is<br />

assumed to be unknown, then it cannot be found.<br />

There<strong>for</strong>e, when identifying a system that includes an unknown observation system, the amount<br />

<strong>of</strong> correction <strong>for</strong> the parameters cannot be found in methods that directly find the slope <strong>of</strong> the<br />

error. In other words, identification algorithms based on the conventional slope approach cannot<br />

be used.<br />

In contrast, in the SP method proposed in this chapter, the amount <strong>of</strong> correction <strong>for</strong> the<br />

estimation parameters is found directly from the value<br />

characteristics <strong>of</strong> the observation system are not needed.<br />

2<br />

e <strong>for</strong> the error. As result, the<br />

Moreover, in distinction with differential approximation methods, in ours method, regardless <strong>of</strong><br />

how many paramters are to be estimated, the parameters can be corrected using only two<br />

observations.<br />

In this research, we refer to many authors that have proposed a parameter estimation algorithm<br />

using the <strong>SPSA</strong> algorithm. The following system was considered by other authors [24][25] and<br />

is very suitable <strong>for</strong> show the proposed <strong>SPSA</strong> algorithm per<strong>for</strong>mance. The system considered is a<br />

case in which the observed values <strong>for</strong> an unknown system to be identified can only be obtained<br />

from its characteristics (see Fig. 2.6).<br />

Fig. 2.6. Identification with an unknown observation system.<br />

65


CHAPTER 2. PROPOSED <strong>SPSA</strong> ALGORITHM<br />

Once proposed the model structure, the next step is to estimate the parameters <strong>of</strong> the system.<br />

This is done by assuming an initial value <strong>of</strong> the parameters and then optimizing them so as to<br />

minimize the errror between the measurements and the model predictions. In then next<br />

simulation, a code using standard MATLAB commands implementing the <strong>SPSA</strong> <strong>for</strong> constrained<br />

optimization was developed. Consider the following successive equations:<br />

φ<br />

= ˆ φ − ρ ∆φ<br />

k+<br />

1 k ek k<br />

(2.55)<br />

T<br />

∆φ = ∆φ<br />

,..., ∆φ<br />

) .<br />

(2.56)<br />

k<br />

(<br />

k ,1 k , n+<br />

m<br />

∆ φ represents the modifying vector <strong>for</strong> the estimated parameters. Also, ρe<br />

represents the<br />

correct gain. The estimation parameter vector<br />

to the perturbation c is defined as follows:<br />

+ i<br />

φˆ<br />

with only the i-th estimation parameter added<br />

ˆ + i<br />

i<br />

k<br />

= ˆ φk<br />

+ cke<br />

(i=1,…, n+m). (2.57)<br />

φ<br />

Here, the vector<br />

i<br />

e represents the fundamental vector <strong>for</strong> which the i-th element alone is 1, and<br />

everything else is 0. Consequently, the error function <strong>for</strong> when perturbation is superimposed on<br />

each parameter is structured as follows:<br />

1 2<br />

T ˆ+<br />

i<br />

( y k + 1<br />

y k k<br />

) .<br />

2<br />

− φ (2.58)<br />

Based on the error function in the equation above, the estimation parameters can be updated as<br />

shown below. In other words, an algorithm in which<br />

1 ( y − y φˆ<br />

) − ( y − y φˆ<br />

)<br />

T + i 2<br />

T 2<br />

k + 1 k k<br />

k + 1 k k<br />

∆ φ<br />

k , i<br />

=<br />

(i=1,…, n+m) (2.59)<br />

2<br />

c<br />

k<br />

represents each element <strong>for</strong> the correction parameters can be conceived. The equation above<br />

provides the amount <strong>of</strong> estimation <strong>for</strong> the differential with respect to the i-th parameter in the<br />

66


2.11 PARAMETER ESTIMATION<br />

error. Finding values in the above equation <strong>for</strong> i = 1, . . . , n + m means finding the square <strong>of</strong><br />

errors in (2.58) by superimposing the perturbation on each parameter successively. As a result,<br />

the error function must be calculated (number <strong>of</strong> parameters + 1) times. As the number <strong>of</strong><br />

dimensions <strong>for</strong> the parameters rises, the number <strong>of</strong> calculations <strong>for</strong> the error increases in this<br />

method.<br />

We consider a signed vector<br />

whether the element takes +1 or -1 is determined randomly by<br />

sk<br />

consisting <strong>of</strong> the elements +1 or -1. As is described [38],<br />

s<br />

k<br />

(<br />

k ,1 k , n+<br />

m<br />

T<br />

= s K , s ) .<br />

(2.60)<br />

By making use <strong>of</strong> this, perturbation can be superimposed on the parameter vector as shown<br />

below:<br />

ˆ<br />

+<br />

k<br />

= ˆ χ + c s<br />

k<br />

k<br />

k<br />

.<br />

χ (2.61)<br />

By making use <strong>of</strong> this, the perturbation<br />

+ ck<br />

and ck<br />

− is added at the same time to all<br />

parameters. The parameter estimation using our modified <strong>SPSA</strong> algorithm is give as follows:<br />

ˆ χ<br />

ˆ<br />

k + n<br />

= χ<br />

k −1<br />

−<br />

ψ<br />

k − 1<br />

n + 1<br />

⎧<br />

⎪ 1 ( W<br />

⋅ ⎨<br />

⎪ 2<br />

⎩<br />

Xs<br />

k + n<br />

− W<br />

T<br />

k<br />

ˆ χ<br />

2<br />

⎡υ<br />

I<br />

− ⎢<br />

⎣ 0<br />

− ( W<br />

c<br />

k −1<br />

n + 1<br />

− W<br />

ˆ χ<br />

+ 2<br />

T + 2<br />

k −1 )<br />

k + n k + n k −1<br />

)<br />

0 ⎤<br />

⎥ χ<br />

0 ⎦<br />

n<br />

ˆ<br />

k − 1 k − 1<br />

⎪⎫<br />

⎬<br />

⎪⎭<br />

(2.62)<br />

where<br />

W<br />

k is measured output, c is the perturbation, υ represents the variance, n, k are<br />

sampling time, χ is the parameter to be estimated, and ψ is a gain coefficient and the<br />

subscript in this coefficient represents a fraction because this takes value <strong>for</strong> every (n+1)<br />

instances. Note that<br />

χ<br />

+<br />

k −1<br />

is calculated as follows:<br />

67


CHAPTER 2. PROPOSED <strong>SPSA</strong> ALGORITHM<br />

ˆ χ<br />

+<br />

ˆ<br />

k − 1<br />

= χ<br />

k − 1<br />

+ c<br />

k − 1<br />

s<br />

k − 1<br />

n + 1<br />

. (2.63)<br />

In estimating the optimum parameters <strong>of</strong> a model or times, there are several factors, which must<br />

be considered when deciding on the appropriate optimization technique. Among these factors<br />

are convergence speed, accuracy, algorithm suitability, complexity, and computational cost in<br />

terms <strong>of</strong> time and power. In the current problem it is necessary to estimate the parameters <strong>of</strong> a<br />

geometrical object in real time. This algorithm updates the estimates using the following<br />

procedure:<br />

y k+<br />

n<br />

(S1) The output to be identified { }<br />

is observed with respect to a particular input.<br />

(S2) Perturbation is added to all the parameters in the estimation vector <strong>for</strong> the parameters.<br />

(Calculation <strong>of</strong> (2.63)).<br />

(S3) The value <strong>for</strong> the error function<br />

( y ˆ φ is calculated.<br />

− T +<br />

) 2<br />

k+ n<br />

yk+<br />

n k−1 (S4)-The amount <strong>of</strong> correction is calculated and the estimation parameters is updated.<br />

(Calculation <strong>of</strong> (2.62)).<br />

(S5) Return to S1.<br />

At each correction time, the value <strong>of</strong> { y k<br />

, u k<br />

} is observed, and the amount <strong>of</strong> correction is<br />

calculated based on these values. The above represents the proposal <strong>for</strong> an algorithm using a<br />

one-sided difference with the error <strong>for</strong> when perturbation is or is not present. However, as is the<br />

case <strong>for</strong> (2.61) the following two-sided <strong>for</strong>m <strong>of</strong> algorithm using<br />

−<br />

χˆ<br />

k<br />

in which the perturbation<br />

is subtracted from the estimation parameter can also be considered:<br />

T ˆ+<br />

2<br />

T<br />

1 ( ) ( ˆ−<br />

2<br />

yk+ 1<br />

− yk<br />

φk<br />

− yk+<br />

1<br />

− yk<br />

φk<br />

)<br />

∆φ k<br />

=<br />

.<br />

(2.64)<br />

2<br />

2c<br />

k<br />

68


2.11 PARAMETER ESTIMATION<br />

This algorithm to estimate the parameters is based on the M2-<strong>SPSA</strong>, which is capable <strong>of</strong><br />

optimizing any number <strong>of</strong> parameters in reasonable time. This is because the number <strong>of</strong> cost<br />

function evaluations needed to estimate the gradient is independent <strong>of</strong> the number <strong>of</strong> parameters<br />

to be optimized.<br />

2.12.3 -Convergence Theorem<br />

In this section a convergence theorem <strong>for</strong> the parameter estimation algorithm using the<br />

M2-<strong>SPSA</strong> is described. First, let us consider the following conditions.<br />

(A11) The coefficient<br />

ρe<br />

satisfies the following conditions:<br />

∞<br />

∑<br />

i=<br />

1<br />

∞<br />

∑<br />

ρ = ∞,<br />

ρ < ∞ .<br />

ei<br />

i=<br />

1<br />

2<br />

ei<br />

(A12) The perturbation c (> 0)<br />

is bounded.<br />

i<br />

(B11)<br />

E<br />

( sk, i)<br />

= 0, E(<br />

sk,<br />

i,<br />

slj<br />

) = δljδ<br />

kl<br />

.<br />

Note that δ represents the Kronecker delta.<br />

(C11) The input<br />

uk<br />

and the observed noise<br />

vk<br />

satisfy (2.46a) and (2.46b), and they are<br />

mutually independent. Further, they have a bounded fourth-<strong>order</strong> moment. Here, condition<br />

(A11) is related to the correction gain, and is the same as the condition required <strong>for</strong> an ordinary<br />

Robbin-Monroe type stochastic approximation.<br />

Condition (A12) is related to the magnitude <strong>of</strong> the perturbation. Condition (B11) is related to<br />

the signed vector. Conditions (A12) and (B11) are related to the perturbation required because<br />

this is a <strong>SPSA</strong>. The condition in (C11) is related to the nature <strong>of</strong> the noise and the input signal.<br />

It is also required <strong>for</strong> identification using a conventional R-M type stochastic approximation.<br />

69


CHAPTER 2. PROPOSED <strong>SPSA</strong> ALGORITHM<br />

Theorem 4a (Convergence in parameter estimation by M2-SPSA). For $\{\hat{\phi}_k\}$ given in (2.62), when the conditions (A11), (A12), (B11) and (C11) are satisfied, we have

$$\lim_{k\to\infty}E\Big\{\big\|\hat{\phi}_k-\phi\big\|^{2}\Big\}=0.$$

Refer to the Appendix for the details of the proof of this theorem.

2.13 Simulation

2.13.1 Simulation 1

This section compares M2-SPSA with the corresponding "standard" forms, 1st-SPSA and 2nd-SPSA. Numerical studies on other functions are given in Spall [18]. The loss function considered here is a fourth-order polynomial with p = 10, significant variable interaction, and highly skewed level surfaces (the ratio of the maximum to minimum eigenvalue of $H(\theta^{*})$ is approximately 65). Gaussian noise is added to the $L(\cdot)$ or $g(\cdot)$ evaluations as appropriate. MATLAB software was used to carry out this study. The loss function is

$$L(\theta)=\theta^{T}A^{T}A\theta+0.1\sum_{i=1}^{p}(A\theta)_{i}^{3}+0.001\sum_{i=1}^{p}(A\theta)_{i}^{4}\qquad(2.65)$$

where $(\cdot)_i$ represents the i-th component of the argument vector and A is such that pA is an upper triangular matrix of ones. The minimum occurs at $\theta^{*}=0$ with $L(\theta^{*})=0$. The noise in the loss function measurements at any value of $\theta$ is given by $[\theta^{T},1]z$, where $z\sim N(0,\sigma^{2}I_{11\times11})$ is independently generated at each $\theta$. This is a relatively simple noise structure representing the usual scenario where the noise values in $y(\cdot)$ depend on $\theta$ (and are therefore dependent over iterations); the $z_{11}$ term provides some degree of independence in each noise contribution and ensures that $y(\cdot)$ always contains noise of variance at least $\sigma^{2}$ (even if $\theta=0$). Guidelines 1), 2) and 4) from Sec. 2.8, our proposed modifications to the 2nd-SPSA implementation, were applied here. A fundamental philosophy in the comparisons

below is that the loss function and gradient measurements are the dominant cost in the optimization process; the other calculations in the algorithms are considered relatively unimportant. This philosophy is consistent with most complex stochastic optimization problems, where the loss function or gradient measurement may represent a large-scale simulation or a physical experiment. The relatively simple loss function here is, of course, merely a proxy for the more complex functions encountered in practice.
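To make the setup concrete, the following sketch (our own; the variable names are illustrative) implements the loss (2.65) and the noisy measurement $y(\theta)=L(\theta)+[\theta^{T},1]z$ described above:

```python
import numpy as np

p = 10
A = np.triu(np.ones((p, p))) / p  # pA is an upper triangular matrix of ones

def loss(theta):
    """Fourth-order polynomial loss (2.65) with skewed level surfaces."""
    At = A @ theta
    return theta @ (A.T @ A) @ theta + 0.1 * np.sum(At**3) + 0.001 * np.sum(At**4)

def noisy_measurement(theta, sigma, rng):
    """y(theta) = L(theta) + [theta^T, 1] z with z ~ N(0, sigma^2 I_{11x11})."""
    z = rng.normal(0.0, sigma, size=p + 1)
    return loss(theta) + np.concatenate([theta, [1.0]]) @ z

rng = np.random.default_rng(0)
print(noisy_measurement(np.ones(p), 0.001, rng))
```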

M2-SPSA Versus 1st-SPSA and 2nd-SPSA Results: We compared M2-SPSA with 1st-SPSA because our proposed method is an extension of 1st-SPSA, so this comparison shows the improvements of our proposed SPSA over 1st-SPSA; we also compared it with 2nd-SPSA because it is the most recent version of SPSA, so it is important to verify our improvements against this algorithm. Spall [18] provides a thorough numerical study based on the loss function (2.65). Three noise levels were considered: σ = 0.10, 0.001 and 0. The results here are a condensed study based on the same loss function.

Table 2.2 shows results for the low-noise (σ = 0.001) case: the mean terminal loss value after 50 independent experiments, where the values are normalized (divided) by $L(\hat{\theta}_0)$. Approximate 90% confidence intervals are shown below each mean loss value. The gains $a_k$, $c_k$ and $\tilde{c}_k$ decayed at the rates $1/k^{0.602}$, $1/k^{0.101}$ and $1/k^{0.101}$, respectively.

These decay rates are approximately the slowest allowed by the theory and are slower than the asymptotically optimal values discussed in Sec. 2.10 (which do not tend to work as well in finite-sample practice). Four separate algorithms are shown: basic 1st-SPSA with the coefficients of the slowly decaying gains mentioned above chosen empirically according to Spall [18]; the same 1st-SPSA algorithm but with the final estimate taken as the iterate average of the last 200 iterations; 2nd-SPSA; and M2-SPSA. Additional study details are as in Spall [18]. We see that M2-SPSA provides a considerable reduction in the loss function value for the same number of measurements used in 1st-SPSA and 2nd-SPSA. Based on the numbers in the table, together with supplementary studies, we find that 1st-SPSA and 2nd-SPSA need approximately five to ten times the number of function evaluations used by M2-SPSA to reach the levels of accuracy shown.
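For reference, the gain schedules just described can be written as follows (a minimal sketch; the numerator coefficients a, A and c are placeholders to be chosen empirically, not values from the study):

```python
def sa_gains(k, a=1.0, A=100.0, c=1.0):
    """Slowly decaying SA gains with the Table 2.2 decay rates:
    a_k ~ 1/k^0.602, and c_k (likewise the c~_k sequence) ~ 1/k^0.101."""
    a_k = a / (k + 1 + A) ** 0.602
    c_k = c / (k + 1) ** 0.101
    return a_k, c_k
```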


The behavior of iterate averaging was consistent with the discussion in the previous section, in which the 1st-SPSA iterates had not yet settled into bouncing roughly uniformly around the solution. Using the numerical studies in Spall [18], we can show that M2-SPSA outperforms 1st-SPSA and 2nd-SPSA even more strongly in the noise-free (σ = 0) case for this loss function, but that it is inferior to 1st-SPSA in the high-noise (σ = 0.10) case. However, Spall [18] presents a study based on a larger number of loss measurements (i.e., more asymptotic) in which we can show that M2-SPSA outperforms 1st-SPSA and 2nd-SPSA in the high-noise case as well.

Table 2.2. Normalized loss values for 1st-SPSA, 2nd-SPSA and M2-SPSA with σ = 0.001; 90% confidence interval shown in [·].

No. of loss     1st-SPSA           1st-SPSA with       2nd-SPSA           M2-SPSA
measurements                       iterate averaging
2000            0.0046             0.0047              0.0041             0.0023
                [0.0040, 0.0052]   [0.0040, 0.0054]    [0.0037, 0.0050]   [0.0021, 0.0025]
10 000          0.0023             0.0023              0.0019             8.6×10⁻⁴
                [0.0021, 0.0025]   [0.0021, 0.0025]    [0.0019, 0.0022]   [7.6×10⁻⁴, 9.6×10⁻⁴]

It was also found that, if the iterates were constrained to lie in some hypercube around $\theta^{*}$ (as required, e.g., in genetic algorithms), then all values in Table 2.2 would be reduced, sometimes by several orders of magnitude. Such prior information can be valuable in speeding convergence.

2.13.2 Simulation 2

We will compare the performance of M2-SPSA with that of the standard first-order SPSA algorithm in Spall [18]. The loss function $L(\cdot)$ we consider is a fourth-order polynomial with significant interaction among the p = 10 elements in $\theta$; this makes the loss function flat near $\theta^{*}$ and, consequently, the optimization problem challenging. Tables 2.3 and 2.4 provide the results for this preliminary study, showing the ratio of the estimation error $\|\hat{\theta}_k-\theta^{*}\|$ to the initial error $\|\hat{\theta}_0-\theta^{*}\|$, based on an average of five independent runs (the same $\hat{\theta}_0$ was used in all runs, and $\|\cdot\|$ represents the standard Euclidean norm). 1st-SPSA and M2-SPSA represent the first-order and modified second-order SPSA algorithms, respectively. Table 2.3 considers the case where there is no noise in the measurements of $L(\cdot)$, while Table 2.4 includes Gaussian measurement noise (with a one-sigma value that ranges from 3 to over 100 percent of the $L(\theta)$ value as $\theta$ varies).

The left-hand column represents the total number of measurements used (so with 3000 measurements, 1st-SPSA has gone through k = 1500 iterations, while M2-SPSA has gone through k = 1000 iterations). The first two results columns in the tables represent runs with the same SA gains $a_k$, $c_k$, tuned numerically to approximately optimize the performance of the 1st-SPSA algorithm. The third results column is based on a (numerical) recalibration of $a_k$, $c_k$ to be approximately optimal for the M2-SPSA algorithm (an identical $a_k$ sequence was used for both M2-SPSA columns).

The results in both tables illustrate the performance of the M2-SPSA approach for a difficult-to-optimize (i.e., flat-surface) function. As expected, we see that the ratios (for both 1st-SPSA and M2-SPSA) tend to be lower in the no-noise case of Table 2.3. Further, we see that the M2-SPSA algorithm provides solutions closer to $\theta^{*}$ both with and without optimal M2-SPSA gains. An enlightening way to look at the numbers in the tables is to compare the number of measurements needed to achieve the same level of accuracy. We see that in the no-noise case (Table 2.3), the ratio of the number of measurements for M2-SPSA : 1st-SPSA ranged from 1:2 to 1:50. In the noisy measurement case (Table 2.4), the ratios for M2-SPSA : 1st-SPSA ranged from 1:2 to 1:20. These ratios offer considerable promise for practical problems, where p is even larger (say, as in the neural network-based direct adaptive control method of Spall and Cristion [25], where p can easily be of order $10^{2}$ or $10^{3}$). In such cases, other second-order techniques that require a growing (with p) number of function measurements are likely to become infeasible.


Table 2.3. Values of $\|\hat{\theta}_k-\theta^{*}\|\,/\,\|\hat{\theta}_0-\theta^{*}\|$ with no measurement noise.

Number of       1st-SPSA    M2-SPSA            M2-SPSA
measurements                w/1st-SPSA gains   w/optimal gains
3000            0.265       0.287              0.122
15000           0.184       0.160              0.033
30000           0.146       0.128              0.018

Table 2.4. Values of $\|\hat{\theta}_k-\theta^{*}\|\,/\,\|\hat{\theta}_0-\theta^{*}\|$ with measurement noise.

Number of       1st-SPSA    M2-SPSA            M2-SPSA
measurements                w/1st-SPSA gains   w/optimal gains
3000            0.273       0.292              0.243
15000           0.184       0.163              0.103
30000           0.146       0.141              0.097

There are several important practical concerns in implementing the M2-SPSA algorithm. One, of course, involves the choice of SA gains. As in all SA algorithms, this must be done with some care to ensure good performance of the algorithm. Some theoretical guidance is provided in Fabian [19], but we have found empirical experimentation to be more effective and easier. Another practical aspect involves the use of the Hessian estimate: in the studies here, we found it more effective not to use the Hessian estimate for the first few (100) iterations. This allows the inverse Hessian estimate to improve while it is not really needed, since $L(\cdot)$ is dropping quickly because of the characteristic steep initial decline of the standard SPSA algorithm.
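A minimal sketch of this warm-up logic (our illustration; the argument names and the scalar second-order scaling are assumptions, cf. the M2-SPSA step described earlier in this chapter):

```python
def m2_spsa_update(theta, k, a_k, grad_est, hess_scale, warmup=100):
    """For the first `warmup` iterations take the plain 1st-SPSA step,
    letting the (inverse-)Hessian estimate improve in the background;
    afterwards scale the step by the current Hessian-based factor."""
    if k < warmup:
        return theta - a_k * grad_est
    return theta - a_k * hess_scale * grad_est
```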


2.13.3 Simulation 3

First, let us consider the following system:

$$x_k+a_1x_{k-1}+a_2x_{k-2}=b_1u_{k-1}+b_2u_{k-2}\qquad(2.66)$$

where $a_1=-1.2$, $a_2=0.4$, $b_1=1.0$ and $b_2=0.7$.

Figure 2.7 shows the parameter estimation results using the algorithm in (2.62), and Fig. 2.8 shows the results for when bias compensation was not performed. Here, the input is white noise generated using a normal distribution with a variance of 0.6 and a mean of 0. The observed noise is a separate white noise generated using a normal distribution with a variance of 0.1 and a mean of 0. Also, the initial values for the estimation parameters are all 0, the magnitude c of the perturbation used in the algorithm is 0.0015, and the gain coefficient is $\rho_i=1/(i+1)^{0.9}$.
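For reproducibility, a minimal sketch of the data generation for this simulation (the additive observation model y = x + v is our assumption of how the observed noise enters):

```python
import numpy as np

a1, a2, b1, b2 = -1.2, 0.4, 1.0, 0.7
N = 100_000
rng = np.random.default_rng(0)
u = rng.normal(0.0, np.sqrt(0.6), N)  # input: white noise, variance 0.6
v = rng.normal(0.0, np.sqrt(0.1), N)  # observed noise, variance 0.1

x = np.zeros(N)
for k in range(2, N):
    # system (2.66): x_k = -a1 x_{k-1} - a2 x_{k-2} + b1 u_{k-1} + b2 u_{k-2}
    x[k] = -a1 * x[k-1] - a2 * x[k-2] + b1 * u[k-1] + b2 * u[k-2]
y = x + v  # the estimator observes only y and u
```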

Fig. 2.7. Identification results (with bias compensation): $\hat{a}_1$ (solid line), $\hat{b}_2$ (dashed line), $\hat{a}_2$ (dash-dot line), $\hat{b}_1$ (dotted line).


Fig. 2.8. Identification results (without bias compensation): $\hat{a}_1$ (solid line), $\hat{b}_2$ (dashed line), $\hat{a}_2$ (dash-dot line), $\hat{b}_1$ (dotted line).

These settings satisfy conditions (A11) and (A12) of the convergence theorem. In the figures above, the horizontal axis represents the number of iterations for the parameters. In Fig. 2.7, we can confirm that the estimated values converge to the true values. On the other hand, when bias compensation was not performed, it is clear from Fig. 2.8 that an estimation error occurs, as can be seen in (2.52); this means that the estimates could not be consistent for the system. Now, our proposed method is compared with other methods, namely the R-M type SA [9] and the 2nd-SPSA algorithm [18]. For all these methods, the variance of 0.1 for the observed noise was known, and the compensation algorithm was used. The results of estimation with almost 100,000 iterations of parameter correction are shown in Table 2.5. The average values over 50 trials are given for the estimation results.

Table 2.5. Comparison of estimators.

Algorithms    â₁            â₂         b̂₁         b̂₂
RM            -1.1770170    0.361731   0.964721   0.635410
M2-SPSA       -1.20511120   0.401234   1.006991   0.67401
2nd-SPSA      -1.1916300    0.393394   0.990554   0.664451
True value    -1.2          0.4        1.0        0.7

M2-SPSA: Estimators using the proposed method.
2nd-SPSA: Second-order SPSA [18].
RM: Estimators using the R-M SA [9].


In terms of estimation precision, 2nd-SPSA and M2-SPSA are better than the R-M SA method (see Table 2.5). In Fig. 2.7, we can see the corrections required in order to achieve suitable results. The values from the proposed SPSA algorithm are the closest to the true values. Also, in the other method (the RM algorithm), an exact value of the slope of the evaluation function is used. In contrast, in the proposed method the slope is estimated, and the estimation error for the slope affects the convergence speed. However, as explained before, when the system output can only be obtained via unknown characteristics, conventional estimation methods cannot be used. This is only a small study intended to show how the proposed SPSA algorithm is applied to parameter estimation.

To conclude this chapter, we have proposed a parameter estimation algorithm using M2-SPSA. The identification method using the SP seems particularly useful when the number of parameters to be identified is very large, or when the observed values of what is to be identified can only be obtained via an unknown observation system [38]-[41]. Furthermore, an improved time-differential SP method that requires only one observation of the error for each time increment has been proposed as an improvement; the system can also be used for identification problems. In this chapter, we have also made some empirical and theoretical comparisons between 1st-SPSA, 2nd-SPSA and other SA algorithms. It is found that the magnitude of the errors introduced by matrix inversion in 2nd-SPSA is greater for an ill-conditioned Hessian than for a well-conditioned Hessian. On the other hand, the errors in 1st-SPSA are less sensitive to the matrix conditioning of the loss function Hessians. To eliminate the errors introduced by the inversion of the estimated Hessian $H_k$, a modification (2.13) to 2nd-SPSA is suggested that replaces $H_k^{-1}$ with the scalar inverse of the geometric mean of all the eigenvalues of $H_k$. At finite iterations, it is found that the introduced M2-SPSA based on (2.13) and (2.14) outperforms 1st-SPSA and 2nd-SPSA in numerical experiments that represent a wide range of matrix conditioning. The asymptotic efficiency analysis shows that the ratio of the mean square errors of the proposed SPSA algorithm to those of 2nd-SPSA is always less than unity, except for a perfectly conditioned Hessian or for an asymptotically optimal setting of the gain sequence. Therefore, the general difference between the previous versions of the SPSA algorithm and our version presented above is that our proposed SPSA algorithm offers considerable potential for accelerating the convergence of SA algorithms while requiring only loss function measurements (no gradient or higher derivative measurements are needed). Since it requires only three measurements per iteration to estimate both the gradient and the Hessian, independently of the problem dimension p, it does not impose a large requirement for data collection. Also, the computational complexity and cost are reduced, as the previous simulations showed. The main features of our proposed SPSA are the following:

1) M2-SPSA is useful for complex problems where a great number of parameters need to be estimated; its description is given in Secs. 2.4 and 2.5.

2) The computation time is reduced by evaluating only a diagonal estimate of the Hessian matrix (see Sec. 2.3).

3) The eigenvalues of the Hessian matrix are computed very efficiently (see Sec. 2.3); a sketch of the geometric-mean scalar-inverse step built on them is given after this list.

4) M2-SPSA guarantees that the non-positive-definite part is eliminated using the FIM; the Hessian matrix inverse is improved (see Sec. 2.6).

5) The modification in the SPSA implementation improves the convergence of the algorithm when it is applied to parameter estimation (see Secs. 2.8 - 2.11).
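Following up on point 3) above, here is a minimal sketch (ours, not the thesis code) of the scalar replacement for $H_k^{-1}$: the step is scaled by the inverse of the geometric mean of the eigenvalues of the Hessian estimate instead of a full matrix inverse:

```python
import numpy as np

def geometric_mean_inverse(H_k):
    """Scalar replacement for H_k^{-1}: 1 / (geometric mean of the
    eigenvalues of H_k). Assumes H_k has been made positive definite
    beforehand (e.g., via the FIM-based correction of point 4))."""
    eigvals = np.linalg.eigvalsh(H_k)             # symmetric eigenvalues
    return 1.0 / np.exp(np.mean(np.log(eigvals)))

# Usage sketch: theta <- theta - a_k * geometric_mean_inverse(H_k) * g_k
```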


Chapter 3

Vibration Suppression Control of a Flexible Arm using Non-linear Observer with SPSA

In this first application, the proposed SPSA algorithm is applied to parameter estimation in methods for vibration control of the model proposed here: a non-linear observer and model reference sliding mode control. In both cases, the parameter estimation by M2-SPSA is compared with other good parameter estimators in order to show its efficiency; the computational cost and the accuracy of the parameters are compared here. Finally, a novel model reference sliding mode control applied to a non-linear observer is proposed. The main objective of this study concerns the vibration control of a one-link flexible arm system. A variable structure system (VSS) non-linear observer has been proposed in order to reduce the oscillation in controlling the angle of the flexible arm. The non-linear observer parameters are optimized using a modified version of the simultaneous perturbation stochastic approximation (SPSA) algorithm. The SPSA algorithm is especially useful when the number of parameters to be adjusted is large, and makes it possible to estimate them simultaneously. As for the vibration and position control, a model reference sliding-mode control (MR-SMC) has been proposed; the MR-SMC parameters are also optimized using the modified version of the SPSA algorithm. The simulations show that the vibration control of a one-link flexible arm system can be achieved more efficiently using our method. Therefore, by applying the MR-SMC method to the non-linear observer, we can improve the performance of this kind of model, and with our proposed SPSA algorithm, we can determine the control parameters very easily and efficiently.

3.1 Introduction

Traditionally, robotic manipulators have been designed and built in a manner that maximizes stiffness in order to minimize vibration and allow for good positional accuracy with relatively simple controllers [41]. High stiffness is achieved by using heavy links, which limits the rapid motion of the manipulator, increases the size of the actuators and boosts the energy consumption. Conversely, a lightweight manipulator is less expensive to manufacture and

operate. Weight reduction, however, incurs a penalty in that the manipulator becomes more flexible and more difficult to control accurately [41]. Since the manipulator is a distributed-parameter system, the control difficulty is caused by the fact that a large number of flexible modes are required to accurately model its behavior. Accordingly, we overcome these problems in this chapter. Since a simple model can be used for a flexible manipulator that carries a great tip load [41]-[43], this research has been centered on such a simple model, particularly the single flexible link moved in a horizontal plane. This kind of model is also very convenient because it shows more clearly the advantages of our method and of the control strategies described in this chapter. We have proposed a method with which the vibrations can be suppressed satisfactorily in the single flexible link system; this method helps to achieve very suitable control of the angular position of this system. The mathematical model of this system is described in Sec. 3.2. In the single flexible link, one end of the arm is attached to a motor and the other end carries a payload. In this chapter, control of the angular position of the arm while suppressing the oscillation is taken as the control purpose. Since feedback of only the motor angle is not sufficient to suppress the oscillation, we have considered a VSS non-linear observer combined with an MR-SMC in order to reduce the oscillation more efficiently. The variable structure systems theory has been successfully used in the development of robust observers for dynamical systems with bounded non-linearities and/or uncertainties. These observers do not require exact knowledge of the plant parameters and/or non-linearities; their design is solely based on knowing the upper bounds of the system uncertainties and/or non-linearities. Furthermore, in some studies, the estimated state variables were preferred over the measured ones in order to enhance the performance of the controller [47] or to reduce the effect of observation spillover in the active control of flexible structures [47]. In other words, VSS is fundamentally based on stability equations and minimization of the cost function. Therefore, the performance of the non-linear observer is assessed herein by examining its capability of predicting the rigid and flexible motions of a compliant beam that is connected to a revolute joint. With respect to MR-SMC, its advantage is robustness against parameter uncertainties, external disturbances and so on; MR-SMC is robust under the matching condition. In general, a suspension system is easily subjected to several parameter variations, such as variation of the sprung mass. The robustness of the SMC can be improved by shortening the time required to attain the sliding mode, or may be guaranteed during the whole interval of control action by eliminating the reaching phase. One easy way to minimize the reaching phase is to employ a large control input.


This MR-SMC is formulated for the position control of a single flexible link subjected to parameter variations. Also, a sliding surface that guarantees stable sliding mode motion during the sliding phase is synthesized in an optimal manner; this will be analyzed in Secs. 3.3 and 3.4. The MR-SMC and the observer have been designed based on a simplified model of the arm, which accounts only for the first elastic mode of the beam. Moreover, there are many parameters to be determined, so it is difficult to obtain them. Hence, in order to overcome this problem, a modified version of 2nd-SPSA has been proposed to obtain the observer/controller gains more efficiently. In the traditional SPSA, since all parameters are perturbed simultaneously, it is possible to modify the parameters with only two measurements of an evaluation function, regardless of the dimension of the parameter. This is very useful, but this SPSA can in some cases incur a high computational cost [3]. Therefore, M2-SPSA is applied to a parameter estimation algorithm in order to obtain the observer and controller parameters more efficiently and also to reduce the cost. We apply a parameter estimation algorithm using our proposed SPSA described in Chap. 2. The performance of this algorithm will be examined in terms of parameter selection, computational cost, and convergence performance in the current problem. Finally, in order to understand the proposed method using the non-linear observer, MR-SMC and SPSA, the control system uses only measurable data such as the motor angle, tip velocity, tip position, and control torque, as shown in Sec. 3.5.

3.2 Dynamic Modeling of a Single Link Robot Arm

3.2.1 Dynamic Model

The single flexible link is considered as a continuous cantilever beam of length L carrying a mass M, with a torque T applied by a motor that rotates the beam in a horizontal plane. The mass and elastic properties are assumed to be distributed uniformly along the single flexible link [44]. The physical configuration of this system is shown in Fig. 3.1. The system consists of a beam of length L with mass m, a torque T (that rotates the elastic arm) and an additional mass M (the payload at the end of the arm) [44]. The deflection y(x,t) is described by a series of separable modes:

$$y(x,t)=\sum_{i=1}^{n}\phi_i(x)\,q_i(t)\qquad(3.1)$$


which is assumed for the elastic displacement of the single flexible link, where $\phi_i(x)$ is a characteristic function and $q_i(t)$ is a mode function. The kinetic and potential energies of this system can be determined as follows:

$$T_{e}=\frac{1}{2}\dot{\theta}^{2}J+\frac{m}{2L}\Big(\dot{\theta}^{2}\sum_{i=1}^{n}A_{i}q_{i}^{2}+\sum_{i=1}^{n}A_{i}\dot{q}_{i}^{2}+2L\dot{\theta}\sum_{i=1}^{n}B_{i}\dot{q}_{i}\Big)+\frac{M}{2}\Big(L^{2}\dot{\theta}^{2}+\dot{\theta}^{2}\sum_{i=1}^{n}C_{i}^{2}q_{i}^{2}+2L\dot{\theta}\sum_{i=1}^{n}C_{i}\dot{q}_{i}+\sum_{i=1}^{n}C_{i}^{2}\dot{q}_{i}^{2}\Big)\qquad(3.2)$$

$$V=\frac{EI}{2}\sum_{i=1}^{n}D_{i}q_{i}^{2}\qquad(3.3)$$

where $\theta$ is the angle of the joint, E is Young's modulus, and I is the area moment of inertia, with the following variables:

$$A_{i}=\int_{0}^{L}\phi_{i}^{2}(x)\,dx,\quad B_{i}=\int_{0}^{L}x\,\phi_{i}(x)\,dx,\quad C_{i}=\phi_{i}(L),\quad D_{i}=\int_{0}^{L}\big[d^{2}\phi_{i}(x)/dx^{2}\big]^{2}dx.$$

The equation of motion of the cantilever beam for free vibration is based on the Euler-Bernoulli equation [45] and is written as follows:

$$EIL\frac{\partial^{4}y}{\partial x^{4}}+m\frac{\partial^{2}y}{\partial t^{2}}=0.\qquad(3.4)$$

Fig. 3.1. One-link flexible arm.


The beam has a uniform cross-section, and its boundary conditions are defined as follows [45]:

The deflection is zero at x = 0:
$$y(0,t)=0.\qquad(3.5)$$

The slope of the deflection is zero at x = 0:
$$\frac{dy}{dx}(0,t)=0.\qquad(3.6)$$

The bending moment is zero at x = L:
$$\frac{d^{2}y}{dx^{2}}(L,t)=0.\qquad(3.7)$$

Shear force balance at the tip:
$$EI\frac{d^{3}y}{dx^{3}}(L,t)=m\frac{d^{2}y}{dt^{2}}(L,t).\qquad(3.8)$$

From (3.4) and (3.5)-(3.8), we have

$$y_{i}(x,t)=\phi_{i}(x)\cos\omega_{i}t.\qquad(3.9)$$

Then $\phi_{i}(x)$ can be found as

$$\phi_{i}(x)=c_{1i}\cos\beta_{i}x+c_{2i}\cosh\beta_{i}x+c_{3i}\sin\beta_{i}x+c_{4i}\sinh\beta_{i}x\qquad(3.10)$$

$$\omega_{i}^{2}=\frac{EI}{\rho a}\,\beta_{i}^{4}.\qquad(3.11)$$

Substituting $\phi_{i}(x)$ from (3.10) into (3.9) and using (3.5)-(3.8), $\beta_{i}$ and $c_{1i}\sim c_{4i}$ are determined.


3.2.2 Equation of Motion and State Equations

The state equations of the system are derived to describe the dynamics of the single flexible link under certain assumptions [45]. Therefore, assuming that only the first mode exists, from (3.2) and (3.3), and using Lagrange's equations as in [45][46], we obtain

$$\frac{d}{dt}\left(\frac{\partial T_{e}}{\partial\dot{\theta}}\right)-\frac{\partial T_{e}}{\partial\theta}+\frac{\partial V}{\partial\theta}=T\qquad(3.12)$$

$$\frac{d}{dt}\left(\frac{\partial T_{e}}{\partial\dot{q}_{1}}\right)-\frac{\partial T_{e}}{\partial q_{1}}+\frac{\partial V}{\partial q_{1}}=0\qquad(3.13)$$

then

$$\begin{bmatrix}\alpha_{00}&\alpha_{01}\\\alpha_{01}&\alpha_{11}\end{bmatrix}\begin{bmatrix}\ddot{\theta}\\\ddot{q}_{1}\end{bmatrix}=\begin{bmatrix}T-2\alpha_{11}q_{1}\dot{q}_{1}\dot{\theta}\\-H_{1}q_{1}+\alpha_{11}q_{1}\dot{\theta}^{2}\end{bmatrix}\qquad(3.14)$$

$$y=\theta\qquad(3.15)$$

where $\alpha_{00}=J+ML^{2}+\alpha_{11}q_{1}^{2}$, T is the motor's shaft torque, J is the moment of inertia about the joint axis, $\alpha_{01}=\omega_{1}+ML\phi_{1e}$, $\alpha_{11}=v_{1}+ML\phi_{1e}$, $v_{1}=\rho a\int_{0}^{L}\phi_{1}^{2}\,dx$, $\rho$ is the density, $H_{1}=EI\int_{0}^{L}\big(d^{2}\phi_{1}/dx^{2}\big)^{2}dx$, $\phi_{1e}=\phi_{1}(L)$, $\omega_{1}=\rho a\int_{0}^{L}x\,\phi_{1}\,dx$, a is the area of the cross-section, and y is the observation of $\theta$. In order to obtain the variables that we will use to evaluate our method, the state variables are defined as

$$x_{1}=\theta,\quad x_{2}=\dot{\theta},\quad x_{3}=q_{1},\quad x_{4}=\dot{q}_{1}.$$


Then

$$\begin{bmatrix}\dot{x}_{1}\\\dot{x}_{2}\\\dot{x}_{3}\\\dot{x}_{4}\end{bmatrix}=\begin{bmatrix}x_{2}\\f_{1}(x_{2},x_{3},x_{4})\\x_{4}\\f_{2}(x_{2},x_{3},x_{4})\end{bmatrix}+\begin{bmatrix}0\\b_{1}\\0\\b_{2}\end{bmatrix}T\qquad(3.16)$$

where

$$f_{1}(x_{2},x_{3},x_{4})=\frac{-2\alpha_{11}^{2}x_{2}x_{3}x_{4}-\alpha_{01}\big(-H_{1}x_{3}+\alpha_{11}x_{2}^{2}x_{3}\big)}{\alpha_{00}\alpha_{11}-\alpha_{01}^{2}}$$

$$f_{2}(x_{2},x_{3},x_{4})=\frac{2\alpha_{01}\alpha_{11}x_{2}x_{3}x_{4}+\alpha_{00}\big(-H_{1}x_{3}+\alpha_{11}x_{2}^{2}x_{3}\big)}{\alpha_{00}\alpha_{11}-\alpha_{01}^{2}}$$

$$b_{1}=\frac{\alpha_{11}}{\alpha_{00}\alpha_{11}-\alpha_{01}^{2}},\qquad b_{2}=\frac{-\alpha_{01}}{\alpha_{00}\alpha_{11}-\alpha_{01}^{2}}.$$
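For reference, a minimal sketch (our illustration) of how (3.16) can be integrated numerically with simple Euler stepping; for simplicity the α coefficients are treated as constants here, although α₀₀ in fact varies with q₁:

```python
import numpy as np

def flexible_link_rhs(x, T, a00, a01, a11, H1):
    """Right-hand side of the state equation (3.16)."""
    _, x2, x3, x4 = x
    det = a00 * a11 - a01**2
    core = -H1 * x3 + a11 * x2**2 * x3
    f1 = (-2 * a11**2 * x2 * x3 * x4 - a01 * core) / det
    f2 = (2 * a01 * a11 * x2 * x3 * x4 + a00 * core) / det
    b1, b2 = a11 / det, -a01 / det
    return np.array([x2, f1 + b1 * T, x4, f2 + b2 * T])

def euler_step(x, T, dt, params):
    return x + dt * flexible_link_rhs(x, T, *params)
```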

3.3 Design of Non-linear Observer

In this section, since only the motor angle $x_{1}$ is a measurable state variable, the remaining states $x_{2}$, $x_{3}$ and $x_{4}$ are predicted using an intelligent state observer design [47]. For this, (3.14)-(3.15) are written as follows:

State equations:
$$\dot{x}=f(x)+g(x)T\qquad(3.17)$$

Output equations:
$$y=c^{T}x,\qquad c^{T}=[1\ \ 0\ \ 0\ \ 0].\qquad(3.18)$$


For this non-linear system, we consider a robust VSS observer which predicts the system states. This observer is defined as follows:

$$\dot{\hat{x}}=f(\hat{x})+g(\hat{x})T+M(\bar{y})+K(\hat{y}-y)\qquad(3.19)$$

$$\hat{y}=c^{T}\hat{x}\qquad(3.20)$$

$$M(\bar{y})=-g(x)\,\varsigma\,\frac{\bar{y}}{|\bar{y}|+\gamma}\qquad(3.21)$$

$$\bar{y}=\hat{y}-y=c^{T}(\hat{x}-x)\qquad(3.22)$$

where $\hat{x}$ represents the predicted value of the system state as in [47], K is the observer gain matrix, $M(\bar{y})$ is the observer non-linearity term, $\varsigma$ represents the gain, and $\gamma>0$ is an averaging constant for removing chattering. Now, defining the estimation error as

$$e=\hat{x}-x\qquad(3.23)$$

we have

$$\dot{e}=f(\hat{x})-f(x)+\big[g(\hat{x})-g(x)\big]T+Kc^{T}(\hat{x}-x)+M(\bar{y}).\qquad(3.24)$$
(3.24)<br />

For evaluating <strong>of</strong> the observer gain K with<br />

xd<br />

as the desired point, using the Taylor series<br />

expansion and its first <strong>order</strong> approximation, the error system is given as follows:<br />

e&<br />

= [ f '( x<br />

d<br />

= A0e<br />

+ M(<br />

y).<br />

) + g'(<br />

x<br />

d<br />

) T + Kc<br />

T<br />

] e + M(<br />

y)<br />

(3.25)<br />

where<br />

A +<br />

A<br />

T<br />

0<br />

= A + GT Kc<br />

(3.26)<br />

∂f<br />

i<br />

= (3.27)<br />

∂x<br />

∂g<br />

G ∂ x<br />

j<br />

i<br />

= (i,-j = 1,2,3,4). (3.28)<br />

j<br />


Choosing a Lyapunov function of e as

$$V=\frac{1}{2}e^{2}\qquad(3.29)$$

and differentiating V along the error trajectory yields

$$\dot{V}=e^{T}\dot{e}=e^{T}\Big(A_{0}e-g(x)\,\varsigma\,\frac{c^{T}e}{|c^{T}e|+\gamma}\Big).\qquad(3.30)$$

If K is designed such that the eigenvalues of the error system (3.26) are all negative, then the selection of $A_{0}-g(x)\varsigma<0$ yields $\dot{V}<0$, and Lyapunov's stability theory gives $e(t)\to0$ as $t\to\infty$.

In the simulation, we chose $x_{d}=[0.1\ \ 0\ \ 0\ \ 0]^{T}$ and computed A and G with the observer parameters determined by the M2-SPSA algorithm (see Chap. 2). Therefore, to ensure this stability, the following evaluation function is minimized:

$$J_{0}=\Sigma\,(y-\hat{y})^{2}.\qquad(3.31)$$

In the determination of the unknown parameters of the non-linear observer, $k_{1}$, $k_{2}$, $k_{3}$, $k_{4}$, $\varsigma$ and $\gamma$, each parameter is calculated by (2.62). The parameters are determined as $k=[-227\ \ {-25015}\ \ 13.69\ \ {-11101}]^{T}$, $\varsigma=0.010$ and $\gamma=0.002$.
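A minimal discrete-time sketch of the observer (3.19)-(3.22) (forward-Euler form, our illustration; f and g are the model functions of (3.17), and g is evaluated at the estimate since the true state is unavailable):

```python
import numpy as np

c = np.array([1.0, 0.0, 0.0, 0.0])  # output vector, y = c^T x

def observer_step(x_hat, y_meas, T, dt, f, g, K, varsigma, gamma):
    """One Euler step of the VSS observer: model prediction, smoothed
    switching term (3.21) to remove chattering, and linear correction."""
    y_bar = c @ x_hat - y_meas                        # output error (3.22)
    M = -g(x_hat) * varsigma * y_bar / (abs(y_bar) + gamma)
    x_hat_dot = f(x_hat) + g(x_hat) * T + M + K * y_bar
    return x_hat + dt * x_hat_dot
```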

3.4 Model Reference Sliding Mode Controller

The MR-SMC is often used for robust control of non-linear systems and also for stabilizing single-input systems. The main purpose of the MR-SMC is to make the states converge to the sliding mode surface; this normally depends on the sliding mode controller design. For the MR-SMC, a Lyapunov function is applied to keep the non-linear system under control. In this case, the MR-SMC is formulated for the tip position control of a single flexible link subjected to parameter variations. The desired response is based on a second-order reference model given as [47]


$$\begin{bmatrix}\dot{x}_{m}\\\ddot{x}_{m}\end{bmatrix}=\begin{bmatrix}0&1\\-\omega_{n}^{2}&-2\omega_{n}\end{bmatrix}\begin{bmatrix}x_{m}\\\dot{x}_{m}\end{bmatrix}+\begin{bmatrix}0\\\omega_{n}^{2}\end{bmatrix}U_{m}\qquad(3.32)$$

where $\omega_{n}$ is the natural angular frequency and $U_{m}$ is the model input. For the sliding mode controller, the Lyapunov stability method is applied to keep the non-linear system under control. The sliding mode approach is a method which transforms a higher-order system into a first-order system. In that way, a simple control algorithm can be applied, which is very straightforward and robust.

The surface is called a switching surface: when the plant state trajectory is "above" the surface, a feedback path has one gain, and a different gain if the trajectory drops "below" the surface. This surface defines the rule for proper switching, and is also called a sliding surface (sliding manifold). Ideally, once intercepted, the switched control maintains the plant's state trajectory on the surface for all subsequent time, and the plant's state trajectory slides along this surface (see Fig. 3.2). With the sliding surface mentioned above, sliding mode control became an important robust control approach. For the class of systems to which it applies, sliding mode controller design provides a systematic approach to the problem of maintaining stability and consistent performance in the face of modeling imprecision. Moreover, by allowing the tradeoffs between modeling and performance to be quantified in a simple fashion, it can illuminate the whole design process.

Fig. 3.2. Sliding mode surface.


The most important task is to design a switched control that will drive the plant state to the switching surface and maintain it on the surface upon interception; a Lyapunov approach is used to characterize this task, as explained later. Now, we assume the sliding mode hyper-plane for the system of (3.14), with the state variables predicted by the observer, as

$$\sigma=s_{1}(x_{1}-x_{m})+s_{2}(x_{2}-\dot{x}_{m})+s_{3}x_{3}+s_{4}x_{4}.\qquad(3.33)$$

(3.33)<br />

When the sliding mode is in operation, then<br />

σ = 0<br />

(3.34)<br />

σ& = 0.<br />

(3.35)<br />

The equivalent control input can be obtained by substituting (3.14) into (3.35). This gives<br />

T<br />

eq<br />

= 2α<br />

x x x<br />

∆<br />

⋅<br />

⎡<br />

−<br />

⎢<br />

s1(<br />

x2<br />

− x<br />

s ⎣<br />

2<br />

11<br />

2<br />

3<br />

4<br />

m<br />

α01<br />

+ ( −H1x<br />

α<br />

) − s<br />

11<br />

2<br />

⋅⋅<br />

3<br />

3<br />

x + s x<br />

m<br />

4<br />

2<br />

+ α x x )<br />

11<br />

2<br />

⋅ ⎤<br />

+ s4x4<br />

⎥<br />

⎦<br />

3<br />

(3.36)<br />

where it can be assumed that<br />

∆<br />

=<br />

2<br />

α<br />

00<br />

− α<br />

01<br />

/ α ) > 0 .<br />

(<br />

11<br />

Now, the design of the MR-SMC is considered, in which the non-linear input makes the state converge to the hyper-plane. In general, the eventual sliding mode input can be considered as two independent inputs, namely, the equivalent control input $T_{eq}$ and the non-linear control input $T_{l}$; in other words,

$$T=T_{eq}+T_{l}=T_{eq}-k(x,t)\,\mathrm{sat}(\sigma)\qquad(3.37)$$

where


$$\mathrm{sat}(\sigma)=\begin{cases}1 & \text{if }\ \sigma>\delta\\ \sigma/\delta & \text{if }\ |\sigma|\le\delta\\ -1 & \text{if }\ \sigma<-\delta\end{cases}\qquad(3.38)$$

and $k(x,t)$ is the control input function; $\delta$ is a constant to eliminate the chattering. The condition for realization of the sliding mode is obtained from the Lyapunov function, as mentioned before. The Lyapunov method is usually used to determine the stability properties of an equilibrium point without solving the state equation. A generalized Lyapunov function that characterizes the motion of the state trajectory to the sliding surface is defined in terms of the surface. For each chosen switched control structure, one chooses the "gains" so that the derivative of this Lyapunov function is negative definite, thus guaranteeing motion of the state trajectory to the surface. After proper design of the surface, a switched controller is constructed so that the tangent vectors of the state trajectory point towards the surface, such that the state is driven to and maintained on the sliding surface. Such controllers result in discontinuous closed-loop systems. The following Lyapunov function of $\sigma$ is chosen to confirm $\sigma=0$:

$$V=\frac{1}{2}\sigma^{2}.\qquad(3.39)$$

With this, $\dot{V}$ is given by

$$\dot{V}=\sigma\dot{\sigma}=\sigma\left\{\frac{s_{2}}{\Delta}\left[T-2\alpha_{11}x_{2}x_{3}x_{4}-\frac{\alpha_{01}}{\alpha_{11}}\big(-H_{1}x_{3}+\alpha_{11}x_{2}^{2}x_{3}\big)\right]+s_{1}(x_{2}-\dot{x}_{m})-s_{2}\ddot{x}_{m}+s_{3}x_{4}+s_{4}\dot{x}_{4}\right\}.\qquad(3.40)$$

Substituting (3.37) into (3.40), the existence condition for the sliding mode is given as

$$\dot{V}=\sigma\left\{-\frac{s_{2}}{\Delta}\,k(x,t)\,\mathrm{sgn}(\sigma)\right\}=-\frac{s_{2}}{\Delta}\,k(x,t)\,|\sigma|<0.\qquad(3.41)$$

Since $s_{2}/\Delta>0$, if we choose $k(x,t)>0$, then the state variable x will converge to the sliding

mode hyper-plane and a stable SMC can be realized. The controller gains are determined using our proposed algorithm (see Chap. 2) so as to minimize the cost function

$$J_{h}=\sum\big[L\cdot(x_{1}-x_{m})+x_{3}\big].\qquad(3.42)$$

The unknown parameters of the MR-SMC, $s_{1}$, $s_{2}$, $s_{3}$, $s_{4}$, $k(x,t)$ and $\delta$, are each calculated by (2.62). The parameter values are $s_{1}=4.2$, $s_{2}=1$, $s_{3}=10.19$, $s_{4}=-0.41$, $\delta=0.2$ and $k(x,t)=2.14$. Figure 3.3 shows the block diagram of the designed system.

Fig. 3.3. Block diagram of the MR-SMC system incorporating the non-linear observer.
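A minimal sketch of the control law (3.33), (3.37) and (3.38) (our illustration; the equivalent-control term of (3.36) is abstracted as an argument):

```python
import numpy as np

def sat(sigma, delta):
    """Saturation function (3.38): linear inside the boundary layer of
    width delta, equal to sign(sigma) outside; this removes chattering."""
    return np.clip(sigma / delta, -1.0, 1.0)

def smc_torque(x, x_m, xdot_m, T_eq, s, k_gain, delta):
    """Total torque (3.37): equivalent control plus the switching term
    that drives the state onto the sliding surface (3.33)."""
    s1, s2, s3, s4 = s
    sigma = s1 * (x[0] - x_m) + s2 * (x[1] - xdot_m) + s3 * x[2] + s4 * x[3]
    return T_eq - k_gain * sat(sigma, delta)

# Gains reported in the text: s = (4.2, 1.0, 10.19, -0.41), delta = 0.2,
# k_gain = 2.14.
```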

3.5 Simulation

The MR-SMC method and M2-SPSA are used in order to achieve very suitable control of the angular position of the single flexible link, suppressing its oscillation. The results are compared with simulations done previously [47] without the proposed SMC. The numerical values are as follows: J = 0.00135520 [kg·m²], m = 0.026 [kg], ρa = 0.0630 [kg/m], EI = 0.09007 [N·m²], L = 0.4 [m], $x_{0}=[-0.1\ \ 0\ \ 0\ \ 0]^{T}$, $x_{d}=[0.1\ \ 0\ \ 0\ \ 0]^{T}$, Δt = 0.1 [ms], M = 0.025 [kg]. First, the parameter estimation in the non-linear observer and the MR-SMC using the proposed SPSA algorithm is


compared with effective estimation algorithms under the same conditions mentioned previously; the Robbins-Monro stochastic approximation (RM-SA) [9] and the least-squares (LS) method [10] are used here.

Table 3.1. Comparison of estimators (non-linear observer).

Algorithm   k1     k2       k3      k4       ζ       γ
M2-SPSA     -227   -25015   13.69   -11101   0.010   0.002
RM-SA       -366   -30055   19.10   -12971   0.019   0.006
LS          -397   -30471   20.16   -13100   0.042   0.009

Table 3.2. Comparison of estimators (MR-SMC).

Algorithm   s1    s2   s3      s4      δ     k(x,t)
M2-SPSA     4.2   1    10.19   -0.41   0.2   2.14
RM-SA       5.0   2    17.72   -0.67   0.2   3.63
LS          5.8   2    20.14   -0.84   0.2   4.01

In the above tables, the values obtained by M2-SPSA are very suitable in terms of estimation precision for the current system. The results obtained by our algorithm are explained by the fact that M2-SPSA does not depend on derivative information and is able to find a good approximation to the solution using few function values; this results in a low computational cost. Also, its implementation is easier than that of the other methods, since our algorithm needs fewer coefficients to be specified. For this reason, it is possible to obtain good parameter estimates. Finally, in the other methods, an exact value of the slope [48] is used for the evaluation function. The variability of the parameter values is explained by the stopping condition: when the value becomes very small, the iterations are stopped; the tables reflect this criterion as defined in this simulation.

In contrast, in M2-SPSA the slope is estimated, and the estimation error for the slope affects the convergence speed. Table 3.3 compares the number of iterations and the computational load, or normalized CPU (central processing unit) time [49] (the computational cost in processing time), with the CPU time required by M2-SPSA as the reference. These comparisons are


made according to the average performance of M2-SPSA and the SA algorithms for the estimated parameters given in Tables 3.1 and 3.2. The CPU time is the processing time needed to estimate each parameter; here, the CPU time of M2-SPSA is represented as 1, so we can evaluate whether the other algorithms used for comparison need two or more times the CPU time required by our proposed SPSA.

Table 3.3. Performance comparison among M2-SPSA, RM-SA and LS.

Algorithm   Iterations   CPU
M2-SPSA     30000        1
RM-SA       29000        2.1
LS          28000        5.2

In Table 3.3, LS is efficient in terms <strong>of</strong> the number <strong>of</strong> iterations required to achieve a certain<br />

level <strong>of</strong> accuracy in the parameter estimation <strong>for</strong> the current system, but it is computationally<br />

expensive and also has a high computational complexity. The LS and RM-SA algorithms<br />

depend on derivative in<strong>for</strong>mation and its solution in each iteration this can increase the<br />

computational cost and complexity.<br />

The CPU time required by LS and RM-SA is 5 to 2 times respectively the CPU required by<br />

M2-<strong>SPSA</strong>, so that, in terms <strong>of</strong> efficiency, the use <strong>of</strong> these algorithms might be questionable. On<br />

the other hand, the proposed <strong>SPSA</strong> algorithm has a low computational cost and usually provides<br />

less dispersed parameters. In the number <strong>of</strong> iterations, these algorithms are almost similar but<br />

according to features <strong>of</strong> our proposed <strong>SPSA</strong>, this can reduce the computational cost (see Chap.<br />

2) and this is a great advantage. Even, the typical <strong>SPSA</strong> algorithm has a modest computational<br />

complexity as is shown in [6], this reason causes a low computational expensive in M2-<strong>SPSA</strong>.<br />

The data obtained by M2-SPSA in Table 3.3 are explained by the fact that this algorithm is a very powerful technique that approximates the gradient or Hessian by applying simultaneous random perturbations to all the parameters. Therefore, the data of the proposed SPSA algorithm contrast with the other approximations, in which the evaluation of the gradient is obtained by varying the parameters one at a time.
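A minimal sketch of this contrast is given below (the objective `J`, the gain `c` and the Bernoulli perturbation are illustrative assumptions, not the tuned values of the thesis): the simultaneous-perturbation estimate uses two evaluations of the objective in total, whereas perturbing one parameter at a time uses two evaluations per parameter.

```python
import numpy as np

def spsa_gradient(J, theta, c=1e-2, rng=np.random.default_rng(0)):
    """Gradient estimate from ONE simultaneous random perturbation of all
    parameters: only two evaluations of J, whatever the dimension of theta."""
    delta = rng.choice([-1.0, 1.0], size=theta.shape)   # Bernoulli +/-1 vector
    return (J(theta + c * delta) - J(theta - c * delta)) / (2.0 * c * delta)

def fd_gradient(J, theta, c=1e-2):
    """Finite-difference estimate perturbing one parameter at a time:
    2*m evaluations of J for an m-dimensional theta."""
    g = np.zeros_like(theta)
    for i in range(theta.size):
        e = np.zeros_like(theta)
        e[i] = c
        g[i] = (J(theta + e) - J(theta - e)) / (2.0 * c)
    return g
```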

Figures 3.4-3.7 show the simulation results for the state variables and the torque. Figure 3.4 shows the response of the motor shaft angle in the simulation by the proposed method. The tracking performance associated with the motor angle is very good when the non-linear observer is applied together with the MR-SMC method.


Fig. 3.4. Motor angle. Without M2-SPSA and MR-SMC (dotted line (.-)). With RM-SA and MR-SMC (dashed line (- -)). With LS and MR-SMC (dash-dot line (-.-)). With M2-SPSA and MR-SMC (solid line (-)).

Figure 3.5 shows the tip position response of the single flexible link. The VSS non-linear observer is very important in eliminating the effects due to the load of the arm (see the solid line). Figure 3.6 shows the tip velocity. The proposed algorithm with MR-SMC reduces the magnitude of the velocity to a small value (solid line). We can see that after 0.5 seconds the system starts to become stable and the state variables predicted by the non-linear observer converge more efficiently on the sliding-mode plane. Figure 3.7 shows the control torque. This simulation shows the control of the force that rotates the beam generated by our method (solid line), which is stabilized after 0.5 seconds. In these simulations, we can see that using the non-linear observer and MR-SMC it is possible to obtain good performance, since the non-linear observer is very reliable in predicting the state variables. Also, MR-SMC is an important control method used here that needs an indispensable estimate of all the state variables predicted by the non-linear observer. The sliding-mode control method is thus an important robust control approach. For the class of systems to which it applies, sliding-mode controller design provides a systematic approach to the problem of maintaining stability and consistent performance in the face of modeling imprecision.


On the other hand, by allowing the tradeoffs between modeling and performance to be quantified in a simple fashion, it can illuminate the whole design process.

Fig. 3.5. Tip position. Without M2-SPSA and MR-SMC (dotted line (.)). With RM-SA and MR-SMC (dashed line (- -)). With LS and MR-SMC (dash-dot line (-.-)). With M2-SPSA and MR-SMC (solid line (-)).

Fig. 3.6. Tip velocity. Without M2-SPSA and MR-SMC (dotted line (.)). With RM-SA and MR-SMC (dashed line (- -)). With LS and MR-SMC (dash-dot line (-.-)). With M2-SPSA and MR-SMC (solid line (-)).


Fig. 3.7. Control torque. Without M2-SPSA and MR-SMC (dotted line (.)). With RM-SA and MR-SMC (dashed line (- -)). With LS and MR-SMC (dash-dot line (-.-)). With M2-SPSA and MR-SMC (solid line (-)).

Fig. 3.8. Motor angle. Simulation using $x_1$ with M2-SPSA and MR-SMC (solid line). Simulation using $x_m$ with M2-SPSA and MR-SMC (dashed line).

Fig. 3.9. Tip position. Simulation using $x_3$ with M2-SPSA and MR-SMC (solid line). Simulation using $\hat{x}_3$ with M2-SPSA and MR-SMC (dashed line).


Fig. 3.10. Tip velocity. Simulation using $x_4$ with M2-SPSA and MR-SMC (solid line). Simulation using $\hat{x}_4$ with M2-SPSA and MR-SMC (dashed line).

These figures confirm the observations above: the non-linear observer is very reliable in predicting the state variables, and MR-SMC needs an indispensable estimate of all the state variables predicted by the observer. For this kind of system, the MR-SMC design (see Fig. 3.3) provides a systematic approach to the problem of maintaining stability and consistent performance in the face of modeling imprecision. Moreover, M2-SPSA showed a better performance in estimating the observer and MR-SMC parameters than the other algorithms.

In this chapter, we have proposed an MR-SMC method using a non-linear observer for controlling the angular position of a single flexible link while suppressing its oscillation. We can see that the non-linear observer and the MR-SMC provide successful and stable operation of the system. We have also proposed the use of M2-SPSA to determine the observer/controller gains; it could determine them very efficiently and with a low computational cost. The non-linear observer was successful in predicting the state variables from the motor angular position, and the MR-SMC was a very efficient control method.


In future work, we plan to carry out real experiments using this model. Before that, however, it is necessary to evaluate several factors, such as the physical conditions (the dimensions and material of the flexible arm) and the estimation of the gradient, which needs to reach a certain level of accuracy. The handling of the deflection within the proposed method is also a factor to be considered in the real experiments. Since a robust controller must also be considered, a reasonably exact model is thought to be necessary in order to predict the experimental results through simulations, and this feature must be taken into account as well. Finally, friction is another important factor to consider in the real experiments.



Chapter 4

Lattice IIR Adaptive Filter Structure Adapted by SPSA Algorithm

In this second application, the M2-SPSA algorithm is applied to parameter estimation, in this case to obtain the coefficients of the adaptive algorithms in the model proposed here; these adaptive algorithms are the Steiglitz-McBride (SM) and the Simple Hyperstable Adaptive Recursive Filter (SHARF). The results are compared with previous lattice versions of these algorithms, and the performance of the coefficients is compared. Finally, we also make some modifications to the adaptive algorithms proposed here in order to obtain suitable stability and convergence.

Adaptive infinite impulse response (IIR), or recursive, filters are less attractive mainly because of the stability issues and the difficulties associated with their adaptive algorithms. Therefore, in this chapter adaptive IIR lattice filters are studied in order to devise algorithms that preserve the stability properties of the corresponding direct-form schemes. We analyze the local properties of stationary points, and a transformation achieving this goal is suggested, which yields algorithms that can be efficiently implemented. The application to the SM and SHARF algorithms is presented. M2-SPSA is used to obtain the coefficients of the lattice form more efficiently and with a lower computational cost and complexity. The results are compared with previous lattice versions of these algorithms, which may fail to preserve the stability of the stationary points.

4.1 Introduction

In the last decade, substantial research effort has been spent on turning adaptive IIR filtering techniques into a reliable alternative to traditional adaptive finite impulse response (FIR) filters. The main advantages of IIR filters are that, owing to their pole-zero structure, they are better suited to modeling physical systems, and that they require far fewer parameters to achieve the same performance level as FIR filters. Unfortunately, these good characteristics come along with some possible drawbacks inherent to adaptive filters with a recursive structure, such as algorithm


instability, convergence to biased and/or locally minimal solutions, as well as slow convergence. Consequently, several new algorithms for adaptive IIR filtering have been proposed in the literature attempting to overcome these problems. Extensive research on the subject, however, seems to suggest that no general-purpose optimal algorithm exists. In fact, all available information must be considered when applying adaptive IIR filtering, in order to determine the most appropriate algorithm for a given problem. The need for ensuring stable operation of adaptive IIR filters has spawned much interest in structures other than the direct form. In particular, the lattice structure has received considerable attention due to several advantages, such as a one-to-one correspondence between transfer functions and parameter spaces, good numerical properties, as well as built-in stability [50]. Therefore, several adaptive algorithms described in [50], originally devised for direct-form structures, have been modified to allow for a lattice realization of the filter. These algorithms use a conventional method based on exploiting the properties of the lattice structure [52] and suitable approximations [53]. Algorithms based on this conventional method offer a relatively low computational load, and in most cases these approximate lattice algorithms preserve the set of stationary points. Nevertheless, it has not been clear whether the convergence properties of the stationary points are well preserved. Also, the reduction in the computational load is not sufficient, especially in the estimation of the reflection coefficients of the lattice form. Hence, in this chapter a new approach to improve the lattice structure is proposed. The Ordinary Differential Equation (ODE) method [50]-[54] is used to derive a transformation, which allows sufficient conditions for convergence to be established. The method is very general, applying to any pair of structures as long as a one-to-one correspondence exists between them. For the direct-form to lattice case, it is shown how to efficiently implement this transformation. This approach is applied to the same adaptive algorithms used in [50], in this case the lattice versions of the Steiglitz-McBride (SM) and the Simple Hyperstable Adaptive Recursive Filter (SHARF) algorithms, for which it is also shown how pre-existing approximate algorithms may fail to converge in some cases. Finally, in order to obtain the reflection coefficients of the lattice form, we have proposed a gradient-free method. Such methods are based only on objective function measurements and do not require knowledge of the gradients of the underlying model. As a result, they are very easy to implement and reduce the computational cost of their applications. The gradient-free method proposed here is the Simultaneous Perturbation Stochastic Approximation (SPSA) algorithm [3]. It is based on a randomized method in which all parameters are perturbed simultaneously [3], which makes it possible to update the parameters with only two measurements of an evaluation function regardless of the dimension of the parameter vector. This algorithm is very


useful, but the traditional SPSA algorithm can incur, in some cases (systems with a large number of parameters), a high computational cost [3]. Therefore, we have proposed a modified version of SPSA applied to the estimation of the reflection coefficients of the current system, in order to obtain the estimated coefficients more efficiently while reducing the computational cost. The organization of the present chapter is as follows. In Sec. 4.2, the derivation of the proposed algorithm is described. In Sec. 4.3, the application to the lattice structure is explained. The adaptive algorithms are described in Sec. 4.4. The simulation results with the proposed methods are shown in Sec. 4.5.

4.2 Procedure of Improved Algorithm

Consider a direct-form adaptive filter

$$\hat{H}(z) = \frac{B(z)}{A(z)} = \frac{\sum_{i=0}^{N} b_i z^{-i}}{1 + \sum_{j=1}^{M} a_j z^{-j}} \qquad (4.1)$$

parameterized by $\theta_d = [\,b_0, \ldots, b_N, a_1, \ldots, a_M\,]^T$. Usually, constant-gain algorithms can be written as

$$\theta_d(n+1) = \theta_d(n) + \mu\, X_d(n)\, e(n) \qquad (4.2)$$

where $\mu > 0$ is a step size, $e(\cdot)$ is some signal and $X_d(\cdot)$ is a driving vector that depends on the specific algorithm. Let $\theta_l$ be the corresponding parameter vector for a different implementation of the filter, such that there exists a one-to-one map $\theta_d = f(\theta_l)$ defined on a suitable stability domain that allows one to move back and forth between both descriptions. The objective is to reformulate algorithm (4.2) in terms of $\theta_l$. Let us define the Jacobian matrix as

$$F(\theta_f) = \frac{df(\theta_l)}{d\theta_l}. \qquad (4.3)$$

We omit the subscript in the argument, since $F$ can be expressed as a function of either $\theta_d$ or $\theta_l$ by means of the map $f$. We can think of $\theta_f$ as representing the actual transfer function $\hat{H}(z)$, while $\theta_d$ and $\theta_l$ are the parameter vectors that describe $\hat{H}(z)$ in a particular set of coordinates. The following algorithm can update $\theta_l$:


$$\theta_l(n+1) = \theta_l(n) + \mu\, X_l(n)\, e(n) \qquad (4.4)$$

$$X_l(n) = F^T(\theta_f(n))\, X_d(n). \qquad (4.5)$$

That is, the driving vector for the new coordinates, $X_l(n)$, is related to $X_d(n)$ through the Jacobian $F$. Since the map $f$ is one-to-one, $F(\theta_f)$ has full rank for all $\theta_f$ describing stable transfer functions. Therefore, if $\theta_d^* = f(\theta_l^*)$, then $\theta_d^*$ is a stationary point of (4.2) iff $\theta_l^*$ is a stationary point of (4.4), since

$$E[\,X_l(n)e(n)\,]\big|_{\theta_f^*} = 0 \;\Longleftrightarrow\; E[\,X_d(n)e(n)\,]\big|_{\theta_f^*} = 0. \qquad (4.6)$$

Thus the stationary points are preserved. We now turn to the convergence issue. By applying the ODE method [55], for sufficiently small $\mu$ the stationary point $\theta_l^*$ is locally stable for algorithm (4.4) iff all the eigenvalues of the matrix

$$S_l = \frac{dE[\,X_l(n)e(n)\,]}{d\theta_l}\bigg|_{\theta_f^*} = \underbrace{E\!\left[\frac{dX_l(n)}{d\theta_l}\, e(n)\right]_{\theta_f^*}}_{=\,P} + \underbrace{E\!\left[X_l(n)\, \frac{de(n)}{d\theta_l}^{T}\right]_{\theta_f^*}}_{=\,Q} \qquad (4.7)$$

have negative real parts. For a vector $V$, let $V^{(k)}$ denote its $k$-th component. Then, the $i,j$ element of $P$ is given by

$$P_{i,j} = E\!\left[\frac{\partial X_l^{(i)}(n)}{\partial \theta_l^{(j)}}\, e(n)\right]_{\theta_f^*} = \sum_{k=1}^{N+M+1} \frac{\partial F_{ki}(\theta_f)}{\partial \theta_l^{(j)}}\, \underbrace{E\!\left[X_d^{(k)}(n)\, e(n)\right]_{\theta_f^*}}_{=\,0} + \sum_{k=1}^{N+M+1} F_{ki}(\theta_f^*)\, E\!\left[\frac{\partial X_d^{(k)}(n)}{\partial \theta_l^{(j)}}\, e(n)\right]_{\theta_f^*} \qquad (4.8)$$


Using (4.8) and the chain rule,

$$P = F^T(\theta_f^*) \cdot E\!\left[\frac{dX_d(n)}{d\theta_d}\, e(n)\right]_{\theta_f^*} \cdot F(\theta_f^*). \qquad (4.9)$$

On the other hand, using the chain rule and (4.5) again,

$$Q = F^T(\theta_f^*) \cdot E\!\left[X_d(n)\, \frac{de(n)}{d\theta_d}^{T}\right]_{\theta_f^*} \cdot F(\theta_f^*). \qquad (4.10)$$

Therefore, the derivative matrix $S_l = P + Q$ reduces to

$$S_l = \frac{dE[\,X_l(n)e(n)\,]}{d\theta_l}\bigg|_{\theta_f^*} = F^T(\theta_f^*) \cdot \underbrace{\frac{dE[\,X_d(n)e(n)\,]}{d\theta_d}\bigg|_{\theta_f^*}}_{=\,S_d} \cdot\, F(\theta_f^*). \qquad (4.11)$$

Here, (4.11) relates the stability matrices of algorithms (4.2) and (4.4) through the Jacobian $F(\theta_f^*)$. If the matrix $S_d$ is symmetric, then $\theta_l^*$ is a locally stable stationary point for algorithm (4.4) iff $\theta_d^*$ is a locally stable stationary point for algorithm (4.2). This follows because, in view of (4.11) and Sylvester's law of inertia, the signs of the eigenvalues of the matrices $S_d$ and $S_l$ are the same. Also, if $S_d < 0$, then $\theta_l^*$ is a locally stable stationary point for algorithm (4.4): in view of (4.11), $S_l < 0$ iff $S_d < 0$, and since all the eigenvalues of a negative definite matrix have negative real parts, it follows that $\theta_l^*$ is locally stable for (4.4) (and $\theta_d^*$ is locally stable for (4.2)). These arguments give sufficient conditions under which the stability of algorithm (4.2) implies the stability of algorithm (4.4).
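To summarize the derivation, a minimal sketch of one iteration of the transformed update (4.4)-(4.5) follows; the routine names and the callable `jacobian_F` are our illustrative assumptions, since the error signal, the driving vector and the map f all depend on the particular algorithm and parameterization:

```python
import numpy as np

def transformed_update(theta_l, mu, e_n, X_d, jacobian_F):
    """One step of the coordinate-transformed constant-gain update (4.4)-(4.5).

    theta_l    : parameter vector in the new coordinates (e.g. lattice form)
    mu         : step size
    e_n        : scalar error signal e(n) of the underlying algorithm
    X_d        : direct-form driving vector X_d(n)
    jacobian_F : callable returning F(theta) = df/d(theta_l), user supplied
    """
    F = jacobian_F(theta_l)
    X_l = F.T @ X_d                       # (4.5): map the driving vector
    return theta_l + mu * e_n * X_l       # (4.4): update in the new coordinates
```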

4.3 Lattice Structure

Lattice filters are typically used as linear predictors because it is easy to ensure that they are minimum phase and hence that their inverses are stable [52]. The lattice-form adaptive IIR algorithms derived here are expected to have at least the following advantages over direct-form algorithms: i) faster convergence; ii) easier stability monitoring, even simpler than for the parallel form; iii) more robustness under finite-precision implementation [52]. One important characteristic of this structure is the possibility of representing multiple poles [52]. It is expected that these structural advantages can bring about a substantial performance improvement for adaptive filters. The derivation described in Sec. 4.2 is applied in this section to obtain efficient adaptive algorithms for lattice filters according to the characteristics of this structure mentioned above. Firstly, this approach is implemented with the adaptive filter as a cascade of a direct-form FIR filter $B(z) = \sum_{i=0}^{N} b_i z^{-i}$ and an all-pole lattice filter $1/A(z)$, so that $\theta_l$ is defined by

$$\theta_l = [\, b_0 \;\cdots\; b_N \;\; \sin\alpha_1 \;\cdots\; \sin\alpha_M \,]^T \qquad (4.12)$$

where the $\sin\alpha_k$ are the reflection coefficients of the lattice part (these coefficients can be calculated using the modified version of SPSA explained in Chap. 2). In general, the reflection coefficients are estimated as cross-correlation coefficients between the forward and backward prediction errors in each stage of the adaptive lattice filter. Accordingly, two divisions are required in each stage, effectively doubling the number of stages. A problem is that the processing cost of a division is higher than that of a multiplication, especially on cheap digital signal processors (DSPs). Here these coefficients are calculated by our modified version of SPSA, which reduces the number of divisions; the proposed technique can decrease the number of divisions to one. This algorithm is explained in the following section.

For this parameterization, the Jacobian takes the block form

$$F(\theta_f) = \begin{bmatrix} I_{N+1} & 0 \\ 0 & D \end{bmatrix} \quad \text{with} \quad D_{ij} = \frac{\partial a_i}{\partial \sin\alpha_j}.$$


Also, we have $X_d(n) = [\, V_d^T(n) \;\; W_d^T(n) \,]^T$ with

$$V_d(n) = \begin{bmatrix} 1 \\ z^{-1} \\ \vdots \\ z^{-N} \end{bmatrix} v(n), \qquad W_d(n) = \begin{bmatrix} z^{-1} \\ z^{-2} \\ \vdots \\ z^{-M} \end{bmatrix} \frac{1}{A(z)}\, \omega(n)$$

for some signals $v(n)$, $\omega(n)$ which depend on the particular algorithm. If we similarly partition $X_l(n) = [\, V_l^T(n) \;\; W_l^T(n) \,]^T$, we find that $V_l(n) = V_d(n)$ and

$$W_l(n) = D^T W_d(n) = \begin{bmatrix} \dfrac{\partial a_1}{\partial \sin\alpha_1} & \cdots & \dfrac{\partial a_M}{\partial \sin\alpha_1} \\ \vdots & & \vdots \\ \dfrac{\partial a_1}{\partial \sin\alpha_M} & \cdots & \dfrac{\partial a_M}{\partial \sin\alpha_M} \end{bmatrix} \begin{bmatrix} z^{-1} \\ \vdots \\ z^{-M} \end{bmatrix} \frac{1}{A(z)}\, \omega(n) = \begin{bmatrix} \dfrac{\partial A(z)}{\partial \sin\alpha_1} & \cdots & \dfrac{\partial A(z)}{\partial \sin\alpha_M} \end{bmatrix}^T \frac{1}{A(z)}\, \omega(n).$$

Thus the problem boils down to efficiently implementing the transfer function $T(z) = [\, T_1(z), \ldots, T_M(z) \,]^T$ with

$$T_k(z) = \frac{1}{A(z)}\, \frac{\partial A(z)}{\partial \sin\alpha_k} = \frac{1}{\cos\alpha_k}\, \frac{1}{A(z)}\, \frac{\partial A(z)}{\partial \alpha_k}.$$

A structure that performs exactly this task, with complexity proportional to the filter order, was developed in [50]. Hence (4.4)-(4.5) can be efficiently implemented.
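As a numerical illustration of the map f and of the matrix D (the function names and the central-difference Jacobian below are our assumptions; the thesis relies on the order-proportional structure of [50] rather than on finite differences):

```python
import numpy as np

def stepup(sin_alpha):
    """Map reflection coefficients sin(alpha_k) to the direct-form denominator
    A(z) = 1 + a_1 z^-1 + ... + a_M z^-M via the Levinson step-up recursion."""
    a = np.array([1.0])
    for k in sin_alpha:
        a_ext = np.concatenate([a, [0.0]])
        a = a_ext + k * a_ext[::-1]       # one lattice stage
    return a                              # a[0] = 1, then a_1 ... a_M

def jacobian_D(sin_alpha, h=1e-6):
    """Numerical D_ij = d a_i / d sin(alpha_j) by central differences."""
    M = len(sin_alpha)
    D = np.zeros((M, M))
    for j in range(M):
        p = np.array(sin_alpha, dtype=float); p[j] += h
        m = np.array(sin_alpha, dtype=float); m[j] -= h
        D[:, j] = (stepup(p)[1:] - stepup(m)[1:]) / (2.0 * h)
    return D
```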


4.4 Adaptive Algorithms

4.4.1 SHARF Algorithm

The hyperstable adaptive recursive filter (HARF) algorithm is an early application of hyperstability [56] to signal processing, but it suffers from many setbacks that make it very hard to implement [57]. Landau [58] developed an algorithm for off-line system identification, based on hyperstability theory [58], that can be considered the origin of the SHARF algorithm. Basically, the SHARF algorithm has the following convergence properties [56][57]:

Property 1: In the case of sufficient order in identification ($n^* \geq 0$), the SHARF algorithm may not converge to the global minimum of the mean-square output error (MSOE) [57][58] performance surface if the plant transfer function denominator polynomial does not satisfy the following strictly-positive-realness condition:

$$\mathrm{Re}\left[\frac{D(z^{-1})}{A(z^{-1})}\right] > 0\,; \quad |z| = 1. \qquad (4.13)$$

Property 2: In the case of insufficient order in identification ($n^* < 0$), the adaptive filter output signal $\hat{y}(n)$ and the adaptive filter coefficient vector $\hat{\theta}$ are stable sequences, provided the input signal is sufficiently persistently exciting.

The main problem of the SHARF algorithm seems to be the nonexistence of a robust practical procedure for defining the moving-average filter $D(q^{-1})$ so as to guarantee the global convergence of the algorithm, where $D(q^{-1}) = \sum_{k=1}^{n_d} d_k q^{-k}$. This is a consequence of the fact that the condition in (4.13) depends on the plant denominator characteristics, which in practice are unknown. We now particularize (4.4)-(4.5) to the SHARF algorithm. For the direct-form SHARF [49], we have

$$\upsilon(n) = u(n), \qquad \omega(n) = -B(z)\,u(n) = -A(z)\,\hat{y}(n), \qquad e(n) = C(z)\,\big( y(n) - \hat{y}(n) \big).$$

In this expression, $C(z)$ is a compensating filter designed to make the transfer function $C(z)/A^*(z)$ strictly positive real (SPR) [57], where $A^*(z)$ is the denominator of $H(z)$. The


transfer function $G(z)$ is SPR if it is stable and causal and satisfies $\mathrm{Re}\, G(e^{j\omega}) > 0$ for all $\omega$. This SPR condition is a common convergence requirement for all hyperstability-based adaptive algorithms [57]. The block diagram of the adaptive filter is shown in Fig. 4.1.

Fig. 4.1. Block diagram of the SHARF lattice algorithm.

Assuming a sufficient-order setting and that the SPR condition is satisfied, it can be proved that the matrix $S_d$ for the SHARF algorithm is negative definite [57]. In order to guarantee global convergence of the SHARF algorithm independently of the plant characteristics, Landau [58] proposed applying a time-varying moving-average filter to the output error signal. Using Landau's approach, the modified SHARF algorithm can be given by

$$e_{SHARF}(n) = \big[ D(q^{-1}, n) \big]\, e_{OE}(n) \quad \text{with} \quad D(q^{-1}, n) = \sum_{k=0}^{n_d} d_k(n)\, q^{-k} \qquad (4.14)$$

$$d_k(n+1) = d_k(n) + \mu_d\, e_{SHARF}(n)\, e_{OE}(n-k), \qquad k = 0, 1, \ldots, n_d \qquad (4.15)$$

$$\hat{\theta}_f(n+1) = \hat{\theta}_f(n) + \mu\, e_{SHARF}(n)\, \hat{\phi}_{MOE}(n). \qquad (4.16)$$
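A minimal sketch of one iteration of (4.14)-(4.16) is given below; the assembly of the information vector and the generation of the signals are left to the caller, and all names are our illustrative assumptions:

```python
import numpy as np

def modified_sharf_step(theta, d, e_oe_hist, phi, mu, mu_d):
    """One iteration of the modified SHARF update (4.14)-(4.16).

    theta     : adaptive filter coefficient vector
    d         : time-varying moving-average coefficients d_0(n) ... d_nd(n)
    e_oe_hist : recent output errors [e_OE(n), e_OE(n-1), ..., e_OE(n-nd)]
    phi       : extended information vector, assembled by the caller (cf. (4.18))
    """
    e_sharf = d @ e_oe_hist                  # (4.14): filtered output error
    d_new = d + mu_d * e_sharf * e_oe_hist   # (4.15): adapt the MA filter
    theta_new = theta + mu * e_sharf * phi   # (4.16): adapt the coefficients
    return theta_new, d_new, e_sharf
```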

Another interesting interpretation of the modified SHARF algorithm can be found in [59]. Regarding the convergence of the modified SHARF algorithm, the error signal $e_{SHARF}(n)$ is a sequence that converges to zero in the mean sense if $n^* \geq 0$ and $\mu$ satisfies

$$0 < \mu < \frac{1}{\left\| \phi_{SHARF}(n) \right\|^2} \qquad (4.17)$$

where $\phi_{SHARF}(n)$ is the extended information vector defined as

$$\phi_{SHARF}(n) = \big[\, \hat{y}(n-i) \;\; x(n-j) \;\; e_{SHARF}(n-k) \,\big]^T. \qquad (4.18)$$

It should be mentioned that if the signal $\phi_{SHARF}(n)$ tends to zero, the output error signal $e_{OE}(n)$ does not necessarily tend to zero. In fact, it was shown in [60] that the minimum-phase condition on $D(q^{-1}, n)$ must also be satisfied in order to guarantee that $e_{OE}(n)$ converges to zero in the mean sense. This additional condition implies that continuous minimum-phase monitoring should be performed on the polynomial $D(q^{-1}, n)$ to assure global convergence of the SHARF algorithm. This fact prevents the general use of the SHARF algorithm in practice. It is also important to mention that although the members of the SHARF family of adaptive algorithms, which includes the modified output error (MOE) and SHARF algorithms, attempt to minimize the output error signal, their convergence concept is derived from hyperstability theory rather than from gradient descent.

4.4.2 Steiglitz-McBride Algorithm

In [61], Steiglitz and McBride developed an adaptive algorithm attempting to combine the good characteristics of the output-error and equation-error algorithms, namely an unbiased and a unique global solution, respectively. In order to achieve these properties, the so-called SM algorithm is based on an error signal e(n) that is a linear function of the adaptive filter coefficients, yielding a unimodal performance surface, and that has a physical interpretation similar to the output error signal, leading to an unbiased global solution. For this adaptive algorithm, let $u(\cdot)$, $\hat{y}(\cdot)$ be the adaptive filter input and output, respectively, and let $y(\cdot)$ be the reference signal. For the SM adaptive algorithm described in [61] we have

$$e(n) = \frac{1}{A(z)}\big( y(n) - \hat{y}(n) \big), \qquad \upsilon(n) = u(n), \qquad \omega(n) = -y(n).$$
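As a sketch of this error computation (the signal names are ours; only the filtering of the output error through 1/A(z) is taken from the expression above):

```python
import numpy as np
from scipy.signal import lfilter

def sm_error(y, y_hat, a):
    """Steiglitz-McBride error e(n) = (1/A(z)) [y(n) - y_hat(n)], where the
    array `a` holds [1, a_1, ..., a_M], the current denominator estimate."""
    return lfilter([1.0], a, np.asarray(y) - np.asarray(y_hat))
```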


Figure 4.2 shows the block diagram of the lattice implementation of (4.4)-(4.5) for the SM algorithm. Suppose that $y(n) = H(z)u(n)$, where $H(z)$ is a filter of the same order as $\hat{H}(z)$.

Fig. 4.2. Block diagram of the SM lattice algorithm.

In case there is an additive output disturbance, the SM estimate remains unbiased as long as the disturbance is white [61][62]. For simplicity it is assumed here that the reference signal $y(\cdot)$ is not contaminated by noise. It can be shown that for this sufficient-order case, the matrix $S_d$ at the stationary point $\theta_f^*$ corresponding to $\hat{H}(z) = H(z)$ coincides with the Hessian matrix of the cost function $E[e^2(n)]$ evaluated at $\theta_d^*$, and therefore it is symmetric. Since $\theta_d^*$ is locally stable for the direct-form SM algorithm [61], $\theta_l^*$ is locally stable for the lattice algorithm. In [62] an alternative way of implementing the SM algorithm using a normalized tapped lattice structure was presented; however, the stability of the stationary point is not guaranteed there.

4.5 Simulation Results

4.5.1 SHARF Algorithm

Here we considered a setting in which $u(\cdot)$ was taken as unit-variance white noise, with N = 0, M = 6 and

$$H(z) = \frac{0.1}{A^*(z)}$$

where $A^*(z)$ is parameterized in lattice form by the reflection coefficients estimated by our proposed SPSA algorithm, $[\,\sin\alpha_1^* \;\cdots\; \sin\alpha_6^*\,] = [\,0.6 \;\; 0.95 \;\; 0.86 \;\; 0.84 \;\; 0.9 \;\; 0.51\,]$, and also $C(z) = A^*(z)$. Figure 4.3 shows the parameter trajectories of algorithm (4.4)-(4.5). The initial

. Figure 4.3 shows the parameter trajectories <strong>of</strong> algorithm (4.4)-(4.5). The initial<br />

point was θ ( 0) = 0 . The convergence is achieved, as expected. On the other hand, the lattice<br />

l<br />

version <strong>of</strong> SHARF presented in [63] using fails to converge in this setting, as shown in Fig. 4.4.<br />

The initial value θ (0)<br />

was taken very close to the stationary point. For this algorithm the<br />

l<br />

corresponding matrix can be shown to have unstable eigenvalues, which implies that the<br />

stationary point is not convergent [63]. Note that the SPR condition is satisfied; the problem<br />

does not reside there, but in the simplifications introduced when passing from the direct <strong>for</strong>m to<br />

the lattice algorithm. In the figures 4.3-4.6, the dashed-lines show the parameter values at the<br />

stationary point.<br />
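As a side check of the built-in stability of the lattice parameterization used in this setting (the step-up routine is our illustrative assumption, and the coefficient values are copied from the setting above, whose exact reading from the source is slightly ambiguous):

```python
import numpy as np

def stepup(sin_alpha):
    """Reflection coefficients -> direct-form denominator (Levinson step-up)."""
    a = np.array([1.0])
    for k in sin_alpha:
        a_ext = np.concatenate([a, [0.0]])
        a = a_ext + k * a_ext[::-1]
    return a

sin_alpha = [0.6, 0.95, 0.86, 0.84, 0.9, 0.51]   # reflection coefficients above
a_star = stepup(sin_alpha)
poles = np.roots(a_star)
print("max |pole| =", np.abs(poles).max())       # < 1 since all |sin(alpha_k)| < 1,
                                                 # i.e. A*(z) is stable by construction
```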

4.5.2 Steiglitz-McBride Algorithm

Let N = 0, M = 6 and

$$H(z) = \frac{0.01}{A^*(z)}$$

with $A^*(z)$ parameterized in lattice form by the reflection coefficients estimated by the proposed SPSA algorithm, $[\,\sin\alpha_1^* \;\cdots\; \sin\alpha_6^*\,] = [\,0.6 \;\; 0.95 \;\; 0.86 \;\; 0.84 \;\; 0.81 \;\; 0.72\,]$.

Assume that $u(\cdot)$ is unit-variance white noise. Then it can be shown that, even with no measurement noise, the corresponding stability matrix for the SM lattice algorithm of [62], evaluated at the stationary point $\hat{H}(z) = H(z)$, has a pair of unstable eigenvalues. This means that this stationary point cannot be locally convergent. This is illustrated in Fig. 4.5, where the results of a computer simulation of this algorithm in the above setting are presented. The initial parameters were set to those of the stationary point, except for $\sin\alpha_2(0)$, which was set to 0.9499. Despite the proximity of this starting point to the stationary point, the algorithm clearly diverges, as expected. The reflection coefficients are estimated by our proposed SPSA algorithm. In Fig. 4.6 we show the results obtained by applying algorithm (4.4)-(4.5) in the same setting, though now the initial point was $\theta_l(0) = [\,1 \;\; .5 \;\; .9 \;\; .7 \;\; .7 \;\; .7 \;\; .8\,]^T$. Convergence is achieved in this case, as predicted by the theory.


Fig. 4.3. Convergence of the proposed SHARF algorithm and M2-SPSA.

Fig. 4.4. Instability of the existing SHARF algorithm.


Fig. 4.5. Instability of the existing SM algorithm.

Fig. 4.6. Convergence of the proposed SM algorithm and M2-SPSA.

In the previous figures we can see the better convergence achieved by our proposed method with the M2-SPSA algorithm in comparison with the earlier simulations shown in [62]-[63]. We can also see that the number of iterations needed to achieve this convergence with our proposed algorithm is reduced; this is because the M2-SPSA algorithm can calculate the coefficients of the lattice form more efficiently and with less computational burden, as explained in Chap. 2.



Chapter 5

Parameter Estimation using a Modified Version of SPSA Algorithm Applied to State-Space Models

Finally, in this third application, M2-SPSA is applied to the estimation of unknown static parameters in a non-linear, non-Gaussian state-space model. The results are compared with the FDSA algorithm, and the performance of the coefficients in a bi-modal non-linear model is compared. The objective of this chapter is the estimation of unknown static parameters in a non-linear, non-Gaussian state-space model. The Simultaneous Perturbation Stochastic Approximation (SPSA) algorithm is considered due to its highly efficient gradient approximation. We consider a particle filtering method and employ the SPSA algorithm to recursively maximize the likelihood function. Nevertheless, the SPSA algorithm can become inadequate in models such as the non-Gaussian state-space model. Therefore, we have proposed to modify the SPSA algorithm in order to estimate parameters very efficiently in complex models such as the one proposed here, while reducing its computational cost. An efficient parameter estimator, the Finite Difference Stochastic Approximation (FDSA) algorithm, is considered here for comparison with the efficiency of the proposed SPSA algorithm. The proposed algorithm can generate maximum likelihood estimates very efficiently. The performance of the proposed SPSA algorithm is shown through simulation using a model with a highly multimodal likelihood.

5.1 Introduction

Dynamic state-space models are useful for describing data in many different areas, such as engineering, financial mathematics, environmental data, and physical science. Most real-world problems are non-linear and non-Gaussian; therefore optimal state estimation in such problems does not admit a closed-form solution. Sequential Monte Carlo (SMC) methods, also known as particle filters, are a set of practical and flexible simulation-based techniques that have become increasingly popular for performing optimal filtering in non-linear non-Gaussian models [64][65]. SMC methods are simulation-based techniques that recursively


generate and update a set of weighted samples, which provide approximations to the posterior probability distributions of interest. Standard SMC methods, however, assume knowledge of the model parameters. In many real-world applications, these parameters are unknown and need to be estimated. We therefore address here the challenging problem of obtaining their maximum likelihood (ML) estimates. ML parameter estimation using SMC methods still remains an open problem, despite various earlier attempts in the literature [66]. Previous approaches that extend the state with the unknown parameters and transform the problem into an optimal filtering problem suffered from several drawbacks [66][68]. Recently, a robust particle method to approximate the optimal filter derivative and perform ML parameter estimation has been proposed [64]. This method is efficient but computationally intensive. Gradient-based SA algorithms rely on a direct measurement of the gradient of an objective function with respect to the parameters of interest. Such an approach assumes that detailed knowledge of the system dynamics is available so that the gradient equations can be calculated. In the SMC framework, the gradient estimates of the particle approximations require an infinitesimal-perturbation-analysis-based approach [65]. This often results in a very high estimation variance that increases with the number of particles and with time. Although this problem can be successfully mitigated with a number of variance reduction techniques, this adds to the computational burden. In this chapter, we investigate the use of gradient-free SA techniques as a simple alternative for generating ML parameter estimates. A related approach was described in [67] to optimize the performance of SMC algorithms; we adapt this approach to our ML parameter estimation. In principle, gradient-free techniques have a slower rate of convergence than gradient-based methods. However, gradient-free methods are based only on objective function measurements and do not require knowledge of the gradients of the underlying model. As a result, they are very easy to implement and have a reduced computational complexity. The classical gradient-free method is FDSA [21]. However, we have proposed a more efficient approach that has recently attracted attention, SPSA [3]. It is based on a randomized method where all parameters are perturbed simultaneously, making it possible to update the parameters with only two measurements of an evaluation function regardless of the dimension of the parameter. This is very useful, but the traditional SPSA can in some cases incur a high computational cost [3]. Therefore, M2-SPSA is applied to ML parameter estimation in order to obtain the estimated parameters more efficiently while reducing this cost. In this chapter, FDSA is considered as a point of comparison for our proposed SPSA algorithm.


5.2 Implementation of SPSA Toward the Proposed Model

5.2.1 State-Space Model

In order to describe the state-space models [61], let $\{X_k\}_{k\geq 0}$ and $\{Y_k\}_{k\geq 0}$ be $\mathbb{R}^{n_x}$- and $\mathbb{R}^{n_y}$-valued stochastic processes defined on a measurable space $(\Omega, F)$. Let $\theta \in \Theta$ be the parameter vector, where $\Theta$ is an open subset of $\mathbb{R}^m$ [69]. A general discrete-time state-space model represents the unobserved state $\{X_k\}_{k\geq 0}$ as a Markov process with initial density $X_0 \sim \mu$ and Markov transition density $f_\theta(x' \mid x)$ [61]. The observations $\{Y_k\}_{k\geq 0}$ are assumed conditionally independent given $\{X_k\}_{k\geq 0}$ and are characterized by their conditional marginal density $g_\theta(y \mid x)$. The model is summarized as follows:

$$X_k \mid X_{k-1} = x_{k-1} \;\sim\; f_\theta(\cdot \mid x_{k-1}) \qquad (5.1)$$

$$Y_k \mid X_k = x_k \;\sim\; g_\theta(\cdot \mid x_k) \qquad (5.2)$$

where the two densities can be non-Gaussian and may involve non-linearity. For any sequence $\{z_p\}$ and random process $\{Z_p\}$ we use the notation $z_{i:j} = (z_i, z_{i+1}, \ldots, z_j)$ and $Z_{i:j} = (Z_i, Z_{i+1}, \ldots, Z_j)$. Assume for the time being that $\theta$ is known. In such a situation, one is interested in estimating the hidden state $X_k$ given the observation sequence $\{Y_k\}_{k\geq 0}$. This leads to the so-called optimal filtering problem, which seeks to compute the posterior density $p_\theta(x_k \mid Y_{0:k})$ sequentially in time. We introduce a proposal distribution $q_\theta(x_k \mid Y_k, x_{k-1})$ whose support includes the support of $g_\theta(Y_k \mid x_k)\, f_\theta(x_k \mid x_{k-1})$. The SMC method [70] then approximates the optimal filtering density by a weighted empirical distribution, i.e., a weighted sum of $N > 1$ samples, termed particles. Here we will assume that at time $k-1$ the filtering density $p_\theta(x_{k-1} \mid Y_{0:k-1})$ is approximated by the particle set $X_{k-1}^{(1:N)} \triangleq [\,X_{k-1}^{(1)}, \ldots, X_{k-1}^{(N)}\,]$ having equal weights. The filtering distribution at the next time step can be recursively approximated by a


new set <strong>of</strong> particles<br />

X<br />

( 1: N )<br />

k<br />

generated via an importance sampling and a resampling step. In the<br />

importance sampling step, a set <strong>of</strong> prediction particles are generated independently from<br />

( )<br />

( ⋅Y<br />

, X )<br />

~ ( i)<br />

X ~<br />

i<br />

k<br />

q<br />

k k−1<br />

θ<br />

and are weighted by an importance weight<br />

~ ( i)<br />

a θ , k<br />

that accounts <strong>for</strong> the<br />

( i)<br />

~ ( i)<br />

i<br />

discrepancy with the “target” distribution. Here, this is given by a θ<br />

= α θ<br />

X , X , Y ) and<br />

, k<br />

(<br />

k k−1<br />

k<br />

i i<br />

a~ ( ) ( )<br />

,<br />

= a /<br />

θ k<br />

θ,<br />

k<br />

N<br />

∑ j = 1<br />

a<br />

( j)<br />

θ , k<br />

. In the resampling step, the particles<br />

~ (1: N )<br />

X<br />

k<br />

are multiplied or eliminated<br />

according to their importance<br />

weights<br />

~ ( i:<br />

N )<br />

a θ , k<br />

to give the new set <strong>of</strong> particles<br />

X<br />

( 1: N )<br />

k<br />

. Now, let<br />

us now consider the case where the model includes some unknown parameters. We will assume<br />

*<br />

that the system to be identified evolves according to a true but unknown static parameter θ ,<br />

i.e.<br />

X<br />

k<br />

X<br />

k−1 = xk−<br />

1 *<br />

θ k−<br />

~ f ( ⋅ x<br />

1)<br />

(5.3)<br />

Y<br />

k<br />

X<br />

k<br />

= xk<br />

θ<br />

~ g * ( ⋅ xk<br />

).<br />

(5.4)<br />

The aim is to identify this parameter. Addressing this problem <strong>for</strong> a non-Gaussian and<br />

*<br />

non-linear system is very challenging. We aim to identify θ based on an infinite (or very<br />

Y . A standard method to do so is to maximize the limit <strong>of</strong> the<br />

large) observation sequence { k<br />

} k≥0<br />

time averaged log-likelihood function:<br />

1<br />

l θ ( Y Y<br />

(5.5)<br />

k<br />

( ) = lim ∑ log pθ<br />

k → ∞ k + 1 k = 0<br />

k<br />

0 : k −1)<br />

with respect to θ . Suitable regularity conditions ensure that this limits exists and<br />

l (θ) admits θ * as a global maximum [70]. The expression Y Y n<br />

)<br />

defined as<br />

p θ<br />

(<br />

0 : k −1<br />

is the predictive likelihood<br />


$$p_\theta(Y_k \mid Y_{0:k-1}) = \iint \alpha_\theta(x_{k-1:k}, Y_k)\, q_\theta(x_k \mid Y_k, x_{k-1})\, p_\theta(x_{k-1} \mid Y_{0:k-1})\, dx_{k-1:k}. \qquad (5.6)$$

Note that this is a normalization constant [70]. This approach is known as recursive ML parameter estimation. We now propose to use M2-SPSA in ML parameter estimation based on the GSMC algorithm (Generic Sequential Monte Carlo algorithm) described in [70]. It is very difficult to compute $\log p_\theta(Y_k \mid Y_{0:k-1})$ in closed form. Instead, we use a particle approximation and propose to optimize an alternative criterion: the SMC method provides us with samples $(X_{k-1}^{(i)}, \tilde{X}_k^{(i)})$ from $p_\theta(x_{k-1} \mid Y_{0:k-1})\, q_\theta(x_k \mid Y_k, x_{k-1})$. A particle approximation to $\log p_\theta(Y_k \mid Y_{0:k-1})$ is given by

$$\log \hat{p}_\theta(Y_k \mid Y_{0:k-1}) = \log\left( N^{-1} \sum_{i=1}^{N} a_{\theta,k}^{(i)} \right). \qquad (5.7)$$
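A minimal sketch of (5.7) inside a bootstrap particle filter is shown below. The bootstrap choice $q_\theta = f_\theta$ (so that $a_{\theta,k}^{(i)} = g_\theta(Y_k \mid \tilde{X}_k^{(i)})$) and the scalar toy model in the comments are our illustrative assumptions, not the benchmark model of this chapter:

```python
import numpy as np

def particle_loglik(theta, y, N=500, rng=np.random.default_rng(0)):
    """Particle estimate of sum_k log p_theta(Y_k | Y_0:k-1) via (5.7).

    Toy model (an assumption for illustration): X_k = theta*X_{k-1} + V_k,
    Y_k = X_k^2/20 + W_k, with V_k, W_k standard Gaussian.
    """
    x = rng.normal(0.0, 1.0, size=N)       # equally weighted particles at k-1
    loglik = 0.0
    for yk in y:
        x = theta * x + rng.normal(0.0, 1.0, size=N)    # predict with f_theta
        a = np.exp(-0.5 * (yk - x**2 / 20.0) ** 2)      # weights g_theta(Y_k|x)
        loglik += np.log(a.mean() + 1e-300)             # (5.7), up to a constant
        x = x[rng.choice(N, size=N, p=a / a.sum())]     # resampling step
    return loglik
```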

We now use the key fact that the current hidden state $X_k$, the observation $Y_k$, the predicted particles $\tilde{X}_k^{(1:N)}$ and their corresponding unnormalized weights $a_{\theta,k}^{(1:N)}$ form a homogeneous Markov chain [70].

In the following section, we propose SA algorithms to solve $\vartheta^* = \arg\max_{\theta\in\Theta} J(\theta)$. Note that because we only use a finite number $N$ of particles, $(\tilde{X}_k^{(1:N)}, a_{\theta,k}^{(1:N)})$ is only an approximation to the exact prediction density $p_\theta(x_k \mid Y_{0:k-1})$. Hence $\vartheta^*$ will not be equal to the true parameter $\theta^*$. However, as $N$ increases, $J(\theta)$ will get closer to $l(\theta)$ and $\vartheta^*$ will converge to $\theta^*$. Our simulation results indicate that $\vartheta^*$ provides a good approximation to $\theta^*$ for a moderate number of particles.


5.2.2 Gradient-free Maximum Likelihood Estimation

The function $J(\theta)$ must be maximized with respect to the $m$-dimensional parameter vector $\theta$. The function $J(\theta)$ does not admit an analytical expression; additionally, we do not have direct access to it. Using the geometric ergodicity of the Markov chain $\{Z_k\}_{k\geq 0}$, $J(\theta)$ can be approximated in the limit as follows:

$$J(\theta) \triangleq \lim_{k\to\infty} \big\{\, J_k(\theta) = E_\theta[\, r(Z_k)\,] \,\big\} \qquad (5.8)$$

where the expectation is taken with respect to the distribution of $Z_k$. This implies that, although $J(\theta)$ is unknown, we have access to a sequence of functions $J_k$ that converge to $J(\theta)$. One way to exploit this sequence in order to optimize $J(\theta)$ is to use a recursion of the form

$$\theta_k = \theta_{k-1} + \gamma_k\, \hat{\nabla} J_k(\theta_{k-1}) \qquad (5.9)$$

where $\theta_{k-1}$ is the parameter estimate at time $k-1$ and $\hat{\nabla} J_k$ denotes an estimate of $\nabla J_k$. The

idea is that we take incremental steps to improve $\theta$, where each step uses a particular function from the sequence. Under suitable conditions on the step size, the above iteration will converge to $\vartheta^*$ [71]. We consider the case where the expression for the gradient of $J_k$ is either not available or too complex to calculate. One may approximate $\nabla J_k(\theta)$ by recourse to finite difference methods. These are "gradient-free" methods that only use measurements of $J_k(\theta)$. The idea behind this approach is to measure the change in the function induced by a small perturbation $\Delta\theta_k$ in the value of the parameter. If we denote an estimate of $J_k(\theta)$ by $\hat{J}_k(\theta)$, one-sided gradient approximations consider the change between $\hat{J}_k(\theta)$ and $\hat{J}_k(\theta + \Delta\theta_k)$, while two-sided approximations consider the difference between $\hat{J}_k(\theta - \Delta\theta_k)$ and $\hat{J}_k(\theta + \Delta\theta_k)$. A gradient-free approach can provide a maximum likelihood parameter estimate that is

118


5.2 IMPLEMENTATION OF <strong>SPSA</strong> ALGORITHM TO THE PROPOSED MODEL<br />

computationally cheap, as well as very simple to implement. The key feature <strong>of</strong> the <strong>SPSA</strong><br />

technique is that it requires only two measurements <strong>of</strong> the cost function regardless <strong>of</strong> the<br />

dimension <strong>of</strong> the parameter vector. This efficiency is achieved by the fact that all the elements<br />

in θ are perturbed together. The i-th component <strong>of</strong> the two-sided gradient approximation<br />

^<br />

^<br />

^<br />

⎡<br />

⎤<br />

∇ J<br />

k<br />

=<br />

⎢<br />

∇J<br />

k ,1(<br />

θ ),..., ∇J<br />

k , m<br />

( θ ) is<br />

⎣<br />

⎥<br />

⎦<br />

∇J<br />

^<br />

^<br />

^<br />

J<br />

k<br />

( θk−<br />

1<br />

+ ck∆k<br />

) − J<br />

k<br />

( θk−<br />

1<br />

+ ck∆k<br />

)<br />

k,<br />

i<br />

( θ<br />

n−1)<br />

=<br />

(5.10)<br />

2ck<br />

∆ki<br />

where ∆<br />

k<br />

=<br />

∆ [ ∆<br />

k , 1<br />

,..., ∆<br />

k , m<br />

] is a random perturbation vector and { k<br />

} k ≥1<br />

c is defined in the<br />

Sec. 1.7. Note that the computational saving stems from the fact that the objective function<br />

difference is now common in all m components <strong>of</strong> the gradient approximation vector. Almost<br />

sure convergence <strong>of</strong> the SA recursion in (5.9) is guaranteed if J (θ ) is sufficiently smooth near<br />

k<br />

*<br />

θ . Additionally, the elements <strong>of</strong><br />

∆k<br />

must be mutually independent random variables,<br />

−1<br />

symmetrically distributed around zero and with finite inverse moments E ( ∆ k , i<br />

)<br />

. A simple and<br />

popular choice <strong>for</strong> ∆ that satisfies these requirements is the Bernoulli ± 1distribution and the<br />

positive step sizes should satisfy<br />

k<br />

∑ ∞ →0 , k<br />

→0,<br />

k=<br />

1<br />

γ<br />

k<br />

c γ<br />

k<br />

= ∞ and ∑ ∞<br />

k =<br />

1<br />

⎛ γ<br />

k<br />

⎜<br />

⎝ c<br />

k<br />

⎞<br />

⎟<br />

⎠<br />

2<br />

<<br />

∞<br />

.<br />
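For concreteness, the per-component estimate (5.10) can be sketched as follows (a minimal Python illustration, assuming a user-supplied noisy measurement function `J_hat`; the function and argument names are hypothetical):

```python
import numpy as np

def spsa_gradient(J_hat, theta, c_k, rng):
    """Two-sided SP gradient estimate of (5.10).

    J_hat : callable returning a noisy measurement of J(theta).
    theta : current m-dimensional parameter estimate.
    c_k   : perturbation size at iteration k.
    """
    m = theta.size
    delta = rng.choice([-1.0, 1.0], size=m)   # Bernoulli +/-1 perturbation vector
    j_plus = J_hat(theta + c_k * delta)       # first of the two measurements
    j_minus = J_hat(theta - c_k * delta)      # second of the two measurements
    # The same scalar difference is shared by all m components of the estimate.
    return (j_plus - j_minus) / (2.0 * c_k * delta)
```

Note that only two evaluations of `J_hat` are used, regardless of the dimension $m$.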

The choice of the step sequences is crucial to the performance of the algorithm. Note that if a constant step size is used for $\gamma_k$, the SA estimate will still converge, but it will oscillate about the limiting value with a variance proportional to the step size. In most of our simulations, $\gamma_k$ was set to a small constant step size that was repeatedly halved after several thousand iterations.
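A small sketch of one admissible gain schedule in this spirit (the constants here are hypothetical placeholders, not the tuned values used in the simulations):

```python
def gains(k, c0=0.1, gamma0=0.005, halve_every=5000):
    """Gain schedule for iteration k >= 1: decaying perturbation size c_k
    and a piecewise-constant step size gamma_k halved periodically."""
    c_k = c0 / k**0.101                            # slowly decaying perturbation size
    gamma_k = gamma0 * 0.5**(k // halve_every)     # halve the step size every few thousand steps
    return c_k, gamma_k
```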

For the two-sided SPSA case, for example, the two measurements would be $\hat{J}_k(\theta_k + c_k\Delta_k;\,\omega_k^{+})$ and $\hat{J}_k(\theta_k - c_k\Delta_k;\,\omega_k^{-})$, where $\omega_k^{+}$ and $\omega_k^{-}$ denote the randomness of each realization. This implies that, besides the desired objective function change induced by the perturbation in $\theta$, there is also some undesirable variability in $\omega_k^{\pm}$. Although in a real system $\omega_k^{\pm}$ cannot be controlled, in simulation settings it might be possible to eliminate the undesirable variability component by using the same random seeds at every time instant $k$, so that $\omega_k^{+} = \omega_k^{-}$. The SA recursion of (5.9) can be thought of as a stochastic generalization of the steepest descent method. Faster convergence can be achieved if one uses a Newton-type SA algorithm that is based on an estimate of the second derivative of the objective function. This will be of the form

$$\theta_k = \theta_{k-1} - \gamma_k \left[\hat{\nabla}^2 J_k(\theta_{k-1})\right]^{-1} \hat{\nabla}J_k(\theta_{k-1}) \qquad (5.11)$$

where $\hat{\nabla}^2 J_k$ is an estimate of the negative definite Hessian matrix $\nabla^2 J_k$. Such an approach can be particularly attractive in terms of convergence acceleration in the terminal phase of the algorithm, where the steepest descent-type method of (5.9) slows down. The main difficulty with this approach is that the estimate of the Hessian can be unstable. In order to keep the Hessian matrix stable, we applied the procedure used in Chap. 2. Also, as suggested in [70], it might be useful to average several SP gradient approximations at each iteration, each with an independent value of $\Delta_k$. Despite the expense of additional objective function evaluations, this can reduce the noise effects and accelerate convergence.
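A minimal sketch of the Newton-type step (5.11) for a maximization problem is given below, assuming externally supplied gradient and Hessian estimates; the eigenvalue guard is a hypothetical stand-in for the stabilization procedure of Chap. 2, not its exact implementation, and all names here are illustrative:

```python
import numpy as np

def newton_sa_step(theta, grad_est, hess_est, gamma_k, eps=1e-6):
    """One Newton-type SA step (5.11); hess_est should estimate the
    negative definite Hessian of J_k at theta."""
    h_sym = 0.5 * (hess_est + hess_est.T)       # symmetrize the Hessian estimate
    w, v = np.linalg.eigh(h_sym)
    w = np.minimum(w, -eps)                     # guard: force eigenvalues to stay negative
    h_inv = v @ np.diag(1.0 / w) @ v.T          # stabilized inverse of the Hessian estimate
    return theta - gamma_k * h_inv @ grad_est   # ascent step since h_inv is negative definite
```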

5.3 Parameter Estimation by SPSA and FDSA

Now, we present two maximum likelihood parameter estimation algorithms based on the FDSA and SPSA algorithms. In line with our objectives, the algorithm below only requires a single realization of observations $\{Y_k\}_{k\ge 1}$ of the true system. At time $k-1$, we denote the current parameter estimate by $\theta_{k-1}$. Also, let the filtering density $p_{\theta_{0:k-1}}(x_{k-1} \mid Y_{0:k-1})$ be approximated by the particle set $X_{k-1}^{(1:N)}$ having equal importance weights. Note that the subscript $\theta_{0:k-1}$ indicates that the filtering density estimate is a function of all the past parameter values. The parameter estimation using SPSA is performed as follows:


First, generate a random perturbation vector $\Delta_k$. For $i = 1,\ldots,N$, sample

$$\tilde{X}_{k,+}^{(i)} \sim q_{\theta_{k-1}+c_k\Delta_k}(\,\cdot \mid Y_k, X_{k-1}^{(i)}), \qquad \tilde{X}_{k,-}^{(i)} \sim q_{\theta_{k-1}-c_k\Delta_k}(\,\cdot \mid Y_k, X_{k-1}^{(i)}),$$

and use the following evaluation:

$$a_{\theta}(x_{k-1:k}, Y_k) = \frac{g_{\theta}(Y_k \mid x_k)\, f_{\theta}(x_k \mid x_{k-1})}{q_{\theta}(x_k \mid Y_k, x_{k-1})}.$$

We can then evaluate the weights $a_{\theta_{k-1}+c_k\Delta_k}\big(Y_k, \tilde{X}_{k,+}^{(i)}, X_{k-1}^{(i)}\big)$ and $a_{\theta_{k-1}-c_k\Delta_k}\big(Y_k, \tilde{X}_{k,-}^{(i)}, X_{k-1}^{(i)}\big)$ and compute

$$\hat{J}_k(\theta_{k-1} \pm c_k\Delta_k) = \log\left\{\frac{1}{N}\sum_{i=1}^{N} a_{\theta_{k-1}\pm c_k\Delta_k}\big(Y_k, \tilde{X}_{k,\pm}^{(i)}, X_{k-1}^{(i)}\big)\right\},$$

$$\hat{\nabla}J_{k,i}(\theta_{k-1}) = \frac{\hat{J}_k(\theta_{k-1}+c_k\Delta_k) - \hat{J}_k(\theta_{k-1}-c_k\Delta_k)}{2c_k\Delta_{k,i}}, \qquad \hat{\nabla}J_k(\theta_{k-1}) = \big[\hat{\nabla}J_{k,1}(\theta_{k-1}),\ldots,\hat{\nabla}J_{k,m}(\theta_{k-1})\big],$$

and update

$$\theta_k = \theta_{k-1} + \gamma_k \hat{\nabla}J_k(\theta_{k-1}).$$

Finally, for each particle $i = 1,\ldots,N$, sample $\tilde{X}_k^{(i)} \sim q_{\theta_k}(\,\cdot \mid Y_k, X_{k-1}^{(i)})$ and evaluate the weights $\tilde{a}_{\theta,k}^{(i)}$. Sample $I_k^{(1:N)} \sim \mathcal{L}(\,\cdot \mid \tilde{a}_{\theta,k}^{(1:N)})$ using a standard resampling scheme and set $X_k^{(1:N)} = H(\tilde{X}_k^{(1:N)}, I_k^{(1:N)})$.
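The iteration above can be sketched compactly as follows (Python; `q_sample` and `weight` are hypothetical placeholders for the importance density $q_\theta$ and the unnormalized weight $a_\theta$, and multinomial resampling stands in for the scheme denoted by $H$):

```python
import numpy as np

def spsa_particle_step(theta, particles, y_k, q_sample, weight, c_k, gamma_k, rng):
    """One SPSA iteration driven by two particle-based likelihood estimates.

    q_sample(theta, y, x_prev, rng) -> array of propagated particles
    weight(theta, y, x_new, x_prev) -> array of unnormalized weights
    """
    delta = rng.choice([-1.0, 1.0], size=theta.size)

    def j_hat(th):
        # Log of the averaged unnormalized weights: the estimate J_k(th).
        # Reusing one seed for both calls would implement the common random
        # numbers idea of Sec. 5.2.2 (omega_k^+ = omega_k^-).
        x_new = q_sample(th, y_k, particles, rng)
        return np.log(np.mean(weight(th, y_k, x_new, particles)))

    grad = (j_hat(theta + c_k * delta) - j_hat(theta - c_k * delta)) / (2.0 * c_k * delta)
    theta_new = theta + gamma_k * grad          # ascent on the log-likelihood estimate

    # Propagate and resample the particle set at the updated parameter value.
    x_new = q_sample(theta_new, y_k, particles, rng)
    w = weight(theta_new, y_k, x_new, particles)
    idx = rng.choice(len(x_new), size=len(x_new), p=w / w.sum())
    return theta_new, x_new[idx]
```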


5.4 Simulation

The following bi-modal non-linear model [72] is proposed here:

$$X_k = \theta_1 X_{k-1} + \theta_2 \frac{X_{k-1}}{1+X_{k-1}^2} + \theta_3 \cos(1.2k) + \sigma_{\upsilon} V_k, \qquad (5.12)$$

$$Y_k = cX_k^2 + \sigma_{\omega} W_k, \qquad (5.13)$$

where $\sigma_{\upsilon}^2 = 10$, $c = 0.05$, $\sigma_{\omega} = 1$, $X_0 \sim N(0,2)$, $V_k \overset{\mathrm{i.i.d.}}{\sim} N(0,1)$ and $W_k \overset{\mathrm{i.i.d.}}{\sim} N(0,1)$; these are zero-mean Gaussian random variables. Here, we seek the ML estimates $\theta = [\theta_1, \theta_2, \theta_3]^T$. It is also important to initialize the algorithm properly; otherwise some of the parameter estimates might get trapped in local maxima. In this model, we can initialize at $\theta_0 = [0.2, 20, 5]^T$. The choice of the step size is very important; here, this is particularly true due to the difference in the relative sensitivity of the three unknown parameters. The values for the step size are $c_k = c_0/k^{0.101}$, where $c_0 = [0.01, 2.0, 1] \times 10^{-4}$, and the constant step size is $\gamma_0 = [0.005, 7, 17] \times 10^{-4}$.
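A minimal sketch for generating data from (5.12)-(5.13) at the true parameters $\theta^* = [0.5, 25, 8]^T$ is given below (the function and variable names are hypothetical):

```python
import numpy as np

def simulate_bimodal(T, theta=(0.5, 25.0, 8.0), c=0.05, sigma_v2=10.0, sigma_w=1.0, seed=0):
    """Generate T steps of (X_k, Y_k) from the bi-modal model (5.12)-(5.13)."""
    rng = np.random.default_rng(seed)
    th1, th2, th3 = theta
    x = rng.normal(0.0, np.sqrt(2.0))            # X_0 ~ N(0, 2)
    xs, ys = [], []
    for k in range(1, T + 1):
        x = (th1 * x + th2 * x / (1.0 + x**2) + th3 * np.cos(1.2 * k)
             + np.sqrt(sigma_v2) * rng.standard_normal())
        y = c * x**2 + sigma_w * rng.standard_normal()
        xs.append(x)
        ys.append(y)
    return np.array(xs), np.array(ys)
```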

Fig. 5.1. ML parameter estimates $\theta_k = [\theta_{1,k}, \theta_{2,k}, \theta_{3,k}]^T$ for the bi-modal non-linear model using M2-SPSA. The true parameters in the model are defined by $\theta^* = [0.5, 25, 8]^T$.


Fig. 5.2. Parameter estimation using 2nd-SPSA and FDSA.

Figure 5.1 shows the efficiency obtained using M2-SPSA. These results are compared with 2nd-SPSA and FDSA in Fig. 5.2, which shows the best performance found by each algorithm for the current model. Table 5.1 compares the number of particles used by each algorithm and the computational load, i.e., the normalized CPU time [49] (computational cost in processing time), with the CPU time required by M2-SPSA as the reference. These comparisons are based on the average CPU time used by each algorithm for the estimation.

Table 5.1. Computational statistics.

Algorithm    No. of Particles    CPU
M2-SPSA      800                 1
2nd-SPSA     920                 2.8
FDSA         1000                3.2

The results obtained here by M2-SPSA show its efficiency: $\vartheta^*$ provides a good approximation to $\theta^*$ using a moderate number of particles in comparison with 2nd-SPSA and FDSA. M2-SPSA only uses 800 particles to obtain a good and accurate parameter estimate, whereas 2nd-SPSA uses 920 particles to find a suitable estimate and FDSA uses 1000 particles to estimate the parameters correctly. Also, regarding the computational cost, the CPU time required to estimate the parameters by 2nd-SPSA and FDSA is 2.8 and 3.2 times, respectively, the CPU time required by M2-SPSA, so that, in terms of efficiency, the use of these algorithms might be questionable. Note that the number of loss function measurements needed in each iteration of FDSA grows with $p$, while for M2-SPSA only two measurements are needed, independent of $p$. This, according to the characteristics of M2-SPSA described in Chap. 2, gives our proposed algorithm the potential to achieve a large saving (over FDSA) in the total number of measurements required to estimate $\theta$ when $p$ is large. Also, we can see that the performance of FDSA was highly dependent on the shape of the loss function surface [21]; consequently, this places a higher burden on the selection of initial parameter values. Thus, M2-SPSA has a low computational cost and usually provides less dispersed and more accurate parameter estimates. The reason for these results is that M2-SPSA is a very powerful technique that allows an approximation of the gradient or Hessian by effecting simultaneous random perturbations in all the parameters. This contrasts with FDSA, in which the evaluation of the gradient is achieved by varying the parameters one at a time. In general, these results obtained by M2-SPSA are explained by the fact that this algorithm does not depend on derivative information and is able to find a good approximation to the solution using few function values (see Chap. 2); this leads to a low computational cost and complexity. In comparison with 2nd-SPSA, M2-SPSA has a lower computational cost, as explained in Chap. 2. Also, the M2-SPSA algorithm can satisfy some conditions and constraints associated with the problem, in contrast with 2nd-SPSA, which cannot satisfy them [18]. In contrast with FDSA, in M2-SPSA the slope is estimated, and the estimation error for the slope has an effect on the convergence speed; thus, M2-SPSA is a very suitable algorithm. Nevertheless, if one decides to allow for more resources and use a gradient-based approach, the SPSA proposed here can still prove extremely useful in exploring the parameter space and choosing suitable initial values for the parameter vector.
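To make the measurement-count comparison concrete: per iteration, two-sided FDSA requires two loss measurements per coordinate, whereas the SP-based methods require two in total,

$$\text{FDSA: } 2p \ \text{measurements/iteration}, \qquad \text{M2-SPSA: } 2 \ \text{measurements/iteration},$$

so for the three-parameter model above ($p = 3$), FDSA already uses three times as many loss measurements per iteration, and the ratio grows linearly with $p$.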



Chapter 6

Conclusions and Future Work

6.1 Conclusions

In this research, we have proposed a new modification of the SPSA algorithm whose main objectives are to estimate the parameters of complex systems, to improve the convergence, and to reduce the computational expense. This modification is called the "modified version of the 2nd-SPSA algorithm" (M2-SPSA). The identification method using the SP seems particularly useful when the number of parameters to be identified is very large or when the observed values of the quantities to be identified can only be obtained via an unknown observation system. Furthermore, a time-differential SP method that only requires one observation of the error for each time increment has been proposed as an improvement to the SPSA algorithm. The procedure of the proposed SPSA algorithm can be explained as follows.

To eliminate the errors introduced by the inversion of the estimated Hessian $H_k^{-1}$, a modification (2.13) to 2nd-SPSA is suggested that replaces $H_k^{-1}$ with the scalar inverse of the geometric mean of all the eigenvalues of $H_k$. This leads to significant improvements in the efficiency of the proposed SPSA algorithm. At finite iterations, it is found that the newly introduced M2-SPSA based on (2.13) and (2.14) frequently outperforms 2nd-SPSA in numerical simulations that represent a wide range of matrix conditioning. Moreover, the ratio of the mean square errors from M2-SPSA to 2nd-SPSA is always less than unity except for a perfectly conditioned Hessian. The magnitude of the errors in 2nd-SPSA depends on the matrix conditioning of $H^*$ due to competing factors [16]. Since these factors are strongly related to the same measure of matrix conditioning, the relative efficiency of the proposed SPSA algorithm and 2nd-SPSA might be less dependent on specific loss functions. We have also proposed to reduce the computational expense by evaluating only a diagonal estimate of the Hessian matrix. The reduction in computation time (in comparison with SA algorithms and previous versions of SPSA) is due to savings in the evaluation of the Hessian estimate, as well as in the recursion on $\theta$, which only requires a trivial matrix inverse. The performance, in terms of rate of convergence and accuracy, remains almost unchanged, which demonstrates that the diagonal Hessian estimate still captures potentially large scaling differences in the elements of $\theta$. In this latter algorithm, regularization can be achieved in a straightforward way, by imposing positivity on the diagonal elements of the Hessian.
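The two devices just described can be sketched as follows (a minimal illustration of the idea behind (2.13) and the diagonal regularization, not their exact implementation; the function names and floors are hypothetical):

```python
import numpy as np

def geometric_mean_scalar(H_k, floor=1e-12):
    """Scalar replacement for the inverse of H_k: the inverse of the
    geometric mean of the eigenvalue magnitudes of H_k."""
    eig = np.linalg.eigvalsh(0.5 * (H_k + H_k.T))
    gm = np.exp(np.mean(np.log(np.maximum(np.abs(eig), floor))))
    return 1.0 / gm

def positive_diagonal(H_k, floor=1e-8):
    """Diagonal Hessian estimate with positivity imposed on its entries;
    'inverting' it is the trivial element-wise reciprocal."""
    return np.maximum(np.abs(np.diag(H_k)), floor)
```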

We have explained our proposed SPSA algorithm in detail in this dissertation, and three important applications have been proposed in order to evaluate the proposed M2-SPSA algorithm. These applications are in the areas of control and signal processing, where the M2-SPSA algorithm was implemented very successfully. In the following paragraphs, the conclusions corresponding to these applications are given.

1) First application

We have proposed an MR-SMC method using a non-linear observer for controlling the angular position of a single flexible link while suppressing its oscillation. The non-linear observer and the MR-SMC provide successful and stable operation of the system. The M2-SPSA algorithm is used to determine the observer/controller gains, and it could determine them very efficiently and with a low computational cost. The non-linear observer was successful in predicting the state variables from the motor angular position, and the MR-SMC was a very efficient control method. The performance of our proposed system was very satisfactory and close to the real results obtained in [47].

2) Second application

In this research, we have also shown a method for deriving adaptive algorithms for IIR lattice filters from the corresponding direct-form algorithms. The advantage of this approach is that it provides conditions under which the convergence characteristics of stationary points are preserved when passing from the direct form to the lattice algorithm. We use M2-SPSA to obtain the coefficients in the lattice form more efficiently, so that we can reduce the computational burden required to obtain a suitable performance. This allowed the design of lattice versions of the SM and SHARF algorithms, which are locally convergent, at least in the sufficient-order case. It was also shown that this was not the case for previous lattice versions.


3) Third application

Finally, a fast and efficient modified SPSA algorithm to perform ML parameter estimation in state-space models using SMC filters has been proposed. The algorithm proposed here is based on measurements of the objective function and does not involve any gradient calculations. The estimation using M2-SPSA seems particularly useful when the number of parameters to identify is large or when the observed values of what is to be identified can only be obtained via an unknown observation system. Also, M2-SPSA outperforms FDSA and 2nd-SPSA due to its reduced computational cost and complexity, which remains fixed with the dimension of the parameter vector. However, its performance is very sensitive to the step-size parameters, and special care should be taken when these are selected.

Tables 6.1 and 6.2 show the final performance of M2-SPSA for the applications described in this dissertation; this performance is compared with previous versions of the SPSA algorithm and with SA algorithms.

Table 6.1. Comparison of algorithms (performance).

Algorithm    No. of Loss Measurements
M2-SPSA      Low
2nd-SPSA     Relatively Low
1st-SPSA     High

Table 6.1 presents a comparison between M2-SPSA and previous versions of SPSA according to the simulation results obtained in Chap. 2 of this dissertation. The number of loss measurements is reduced significantly by our proposed method M2-SPSA, and this is confirmed by Tables 2.2-2.4; there, following the study of Spall [18] based on a larger number of loss measurements (i.e., more asymptotic), we show that M2-SPSA outperforms 1st-SPSA and 2nd-SPSA (in terms of the iterations needed to reach given normalized loss values) in the high-noise case. The ratios of M2-SPSA shown in Tables 2.3-2.4 offer considerable promise for practical problems (using a low number of measurements in comparison with 1st-SPSA), where $p$ is even larger (say, as in the neural network-based direct adaptive control method of Spall and Cristion [25], where $p$ can easily be of order $10^2$ or $10^3$). In such cases, other second-order techniques that require a number of function measurements growing with $p$ are likely to become infeasible.

In Table 2.2, we see that M2-SPSA provides a considerable reduction in the loss function value for the same number of measurements used in 1st-SPSA and 2nd-SPSA. Based on the numbers in Tables 2.2-2.4, together with supplementary studies described in Chap. 2, we find that 1st-SPSA and 2nd-SPSA need approximately five to ten times the number of function evaluations used by M2-SPSA to reach the levels of accuracy shown.

Table 6.2. Comparison of algorithms (computational cost).

Algorithm      Computational Cost
M2-SPSA        Low
2nd-SPSA       Relatively Low
SA Algorithms  High

Table 6.2 presents a comparison between M2-SPSA, previous versions of SPSA and SA algorithms according to CPU time. These results are confirmed by the values obtained in Tables 3.3 and 5.1 in Chaps. 3 and 5, respectively, where the computational load, i.e., the normalized CPU time [49] (computational cost in processing time), is used with the CPU time required by M2-SPSA as the reference. These comparisons are based on the average CPU time used by each algorithm to estimate each parameter.

The CPU time, or CPU usage, is the amount of time a computer program spends processing instructions, as opposed to, for example, waiting for input/output operations. In this case, the CPU time required to estimate the parameters by 2nd-SPSA is about 2 times the CPU time required by M2-SPSA; for that reason it is classified as relatively low in comparison with our proposed SPSA. The CPU time required to estimate the parameters by the SA algorithms is approximately 2 to 5 times the CPU time required by M2-SPSA; for that reason it is classified as high in comparison with our proposed SPSA. Therefore, these simulations show that the SA algorithms have a high computational cost in comparison with M2-SPSA (see Tables 3.3 and 5.1), even though the 2nd-SPSA algorithm has the same or a lower computational cost than the SA algorithms (see Table 5.5). This is explained by the fact that the number of loss function measurements needed in each iteration of FDSA (Table 5.1), RM-SA or LS (Table 3.3) grows with $p$, while for M2-SPSA or 2nd-SPSA only two measurements are needed, independent of $p$; this is described in detail in Chap. 2, and the difference between 2nd-SPSA and M2-SPSA is demonstrated by the simulations (Tables 2.2-2.4). Also, M2-SPSA allows an approximation of the gradient or Hessian by effecting simultaneous random perturbations in all the parameters. This contrasts with the evaluation of the gradient in FDSA, which is achieved by varying the parameters one at a time.

6.2 Future Work

Referring to the conclusions given above, we still have many topics to investigate in the near future. One is to assess the performance of SPSA for constrained and unconstrained aerodynamic shape design studies. This study will be carried out in the near future to establish the cost benefits and to investigate the extent to which SPSA offers comparative advantages over other kinds of similar methods for dynamic design optimization problems.

The M2-SPSA algorithm can also be applied to image processing; in this case we focus on two main applications. First, the M2-SPSA algorithm will be used in multidimensional image processing (medical images) in order to reduce CPU time, in the same way as in the applications presented in this dissertation. Second, extracting a multivariate non-linear physical model from a set of satellite images can be considered as a multivariate non-linear regression problem. Multiple local solutions often prevent gradient-type algorithms from obtaining globally optimal solutions; the M2-SPSA algorithm is a method of solving this problem. The method will be applied to the problem of estimating the distribution of energetic ion populations from global images of the magnetosphere.


Finally, we have applied our proposed M2-SPSA algorithm to the applications proposed here, but our proposed SPSA can also be applied to other kinds of applications in other areas, for example the image processing mentioned in this section. The M2-SPSA algorithm can be applied to different applications provided that they satisfy in advance the conditions described by the main theorems (Theorems 1, 2 and 3 of M2-SPSA and their guidelines C.1' and C.3') explained in Sec. 2.9; if an application satisfies these conditions, M2-SPSA can be used.



References

[1] G. Cassandras, L. Dai, and C. G. Panayiotou, "Ordinal Optimization for a Class of Deterministic and Stochastic Discrete Resource Allocation Problems," IEEE Trans. Autom. Control, vol. 43, no. 7, pp. 881-900, 1998.
[2] G. N. Saridis, "Stochastic Approximation Methods for Identification and Control," IEEE Trans. Autom. Control, vol. 19, pp. 798-809, 1974.
[3] J. C. Spall, "Multivariate Stochastic Approximation using a Simultaneous Perturbation Gradient Approximation," IEEE Trans. Autom. Control, vol. 37, pp. 332-341, 1992.
[4] S. N. Evans and N. C. Weber, "On the Almost Sure Convergence of a General Stochastic Approximation Procedure," Bull. Australian Math. Soc., vol. 34, pp. 335-342, 1986.
[5] H. F. Chen, T. E. Duncan, and B. Pasik-Duncan, "A Stochastic Approximation Algorithm with Random Differences," Proceedings of the 13th Triennial IFAC World Congress, pp. 493-496, 1996.
[6] J. C. Spall, "An Overview of the Simultaneous Perturbation Algorithm for Stochastic Optimization," IEEE Trans. Aerosp. Electron. Syst., vol. 34, pp. 817-823, 1998.
[7] A. Vande Wouwer, C. Renotte, and Ph. Bogaerts, "Application of SPSA Techniques in Non-linear System Identification," European Control Conference, 2001.
[8] J. Kiefer and J. Wolfowitz, "Stochastic Estimation of the Maximum of a Regression Function," Ann. Math. Statist., vol. 23, pp. 498-506, 1952.
[9] H. Robbins and S. Monro, "A Stochastic Approximation Method," Ann. Math. Statist., vol. 22, pp. 400-407, 1951.
[10] S. A. Billings and G. N. Jones, "Orthogonal Least-Squares Parameter Estimation Algorithms for Non-Linear Stochastic Systems," Int. Journal of Systems Science, vol. 23, no. 7, pp. 1019-1032, 1990.
[11] L. Gerencser, "SPSA with State-Dependent Noise: A Tool for Direct Adaptive Control," Proceedings of the 37th Conference on Decision and Control (CDC), 1998.
[12] J. C. Spall and D. C. Chin, "Traffic-Responsive Signal Timing for System-Wide Traffic Control," Transp. Res., Part C, vol. 5, pp. 153-163, 1997.
[13] J. H. Venter, "An Extension of the Robbins-Monro Algorithm," Annals of Mathematical Statistics, vol. 38, pp. 181-190, 1967.
[14] D. Ruppert, "Stochastic Approximation," Handbook of Sequential Analysis, pp. 503-529, 1991.
[15] G. N. Saridis and G. Stein, "Stochastic Approximation Algorithms for Linear Discrete-time System Identification," IEEE Trans. Autom. Control, vol. 13, pp. 515-523, 1968.
[16] L. Gerencser, "Rate of Convergence of Moments for a Simultaneous Perturbation Stochastic Approximation Method for Function Minimization," IEEE Trans. Autom. Control, vol. 44, pp. 894-906, 1999.
[17] J. C. Spall, "Adaptive Stochastic Approximation by the Simultaneous Perturbation Method," Proceedings of the 1998 IEEE CDC, pp. 3872-3879, 1998.
[18] J. C. Spall, "A Second-Order Stochastic Approximation Algorithm using only Function Measurements," Proceedings of the IEEE Conference on Decision and Control, pp. 2472-2477, 1994.
[19] V. Fabian, "On Asymptotic Normality in Stochastic Approximation," Ann. Math. Statist., vol. 39, pp. 1327-1332, 1968.
[20] H. F. Chen and Y. Zhu, "Stochastic Approximation Procedure with Randomly Varying Truncations," Scientia Sinica (Series A), vol. 29, pp. 914-926, 1986.
[21] D. C. Chin, "Comparative Study of Stochastic Algorithms for System Optimization Based on Gradient Approximations," IEEE Trans. Syst., Man, and Cybernetics, vol. 27, pp. 244-249, 1997.
[22] B. Efron and D. V. Hinkley, "Assessing the Accuracy of the Maximum Likelihood Estimator: Observed versus Expected Fisher Information," Biometrika, vol. 65, pp. 457-487, 1978.
[23] S. Das, R. Ghanem, and J. C. Spall, "Asymptotic Sampling Distribution for Polynomial Chaos Representation of Data: A Maximum Entropy and Fisher Information Approach," SIAM Journal on Scientific Computing, 2006.
[24] J. C. Spall, "A Stochastic Approximation Algorithm for Large-Dimensional Systems in the Kiefer-Wolfowitz Setting," Proc. IEEE Conf. on Decision and Control, pp. 1544-1548, 1988.
[25] J. C. Spall and J. A. Cristion, "Non-linear Adaptive Control Using Neural Networks: Estimation Based on a Smoothed Form of Simultaneous Perturbation Gradient Approximation," Statistica Sinica, vol. 4, pp. 1-27, 1994.
[26] D. W. Hutchison, "On an Efficient Distribution of Perturbations for Simulation Optimization using Simultaneous Perturbation Stochastic Approximation," Proceedings of the IASTED International Conference on Applied Modeling and Simulation, pp. 440-445, 2002.
[27] R. W. Brennan and P. Rogers, "Stochastic Optimization Applied to a Manufacturing System Operation Problem," Proc. Winter Simulation Conf. (C. Alexopoulos, K. Kang, W. R. Lilegdon, and D. Goldsman, Eds.), pp. 857-864, 1995.
[28] J. C. Spall, "Implementation of the Simultaneous Perturbation Algorithm for Stochastic Optimization," IEEE Trans. Aerosp. Electron. Syst., vol. 34, pp. 817-823, 1998.
[29] M. Metivier and P. Priouret, "Applications of a Kushner and Clark Lemma to General Classes of Stochastic Algorithms," IEEE Trans. Inform. Theory, vol. IT-30, pp. 140-151, 1984.
[30] A. Benveniste, M. Metivier, and P. Priouret, Adaptive Algorithms and Stochastic Approximations, New York: Springer-Verlag, 1990.
[31] H. J. Kushner and G. G. Yin, Stochastic Approximation Algorithms and Applications, New York: Springer-Verlag, 1997.
[32] J. C. Spall and J. A. Cristion, "Model-free Control of Non-linear Stochastic Systems with Discrete-time Measurements," IEEE Trans. Autom. Control, vol. 43, pp. 1198-1210, 1998.
[33] J. Dippon and J. Renz, "Weighted Means in Stochastic Approximation of Minima," SIAM J. Control Optim., vol. 35, pp. 1811-1827, 1997.
[34] J. R. Blum, "Approximation Methods which Converge with Probability One," Ann. Math. Statist., vol. 25, pp. 382-386, 1954.
[35] J. J. More, B. S. Garbow, and K. E. Hillstrom, "Testing Unconstrained Optimization Software," ACM Transactions on Mathematical Software, vol. 7, no. 1, pp. 17-41, 1981.
[36] R. G. Laha and V. K. Rohatgi, Probability Theory, New York: Wiley, 1979.
[37] F. J. Solis and R. J. Wets, "Minimization by Random Search Techniques," Mathematics of Operations Research, vol. 6, pp. 19-30, 1981.
[38] Y. Maeda and Y. Kanata, "Learning Rules for Recurrent Neural Networks using Perturbation and Their Application to Neuro-control," Trans. IEE Japan, vol. 113-C, pp. 402-408, 1995 (in Japanese).
[39] J. C. Spall, "A One-Measurement Form of Simultaneous Perturbation Stochastic Approximation," Automatica, vol. 33, pp. 109-112, 1997.
[40] J. C. Spall and J. A. Cristion, "A Neural Network Controller for Systems with Unmodeled Dynamics with Applications to Wastewater Treatment," IEEE Trans. Syst., Man, Cybern. B, vol. 27, pp. 369-375, 1997.
[41] J. Lin and F. L. Lewis, "Two-Time Scale Fuzzy Logic Controller of Flexible Link Robot Arm," Fuzzy Sets and Systems, vol. 139, no. 7, pp. 125-149, 2003.
[42] R. H. Cannon and E. Schmitz, "Initial Experiments on the End-Point Control of a Flexible One-Link Robot," Int. Journal of Robotics Research, vol. 3, no. 3, pp. 62-75, 1984.
[43] Y. Sakawa, F. Matsuno, and S. Fukushima, "Modeling and Feedback Control of a Flexible Arm," Journal of Robotic Systems, vol. 2, no. 4, pp. 453-472, 1985.
[44] S. Nicosia, P. Tomei, and A. Tornambe, "Non-Linear Control and Observation Algorithms for a Single-Link Flexible Arm," Int. Journal of Control, vol. 49, no. 5, pp. 827-840, 1989.
[45] J. Yuh, "Application of Discrete-Time Model Reference Adaptive Control to a Flexible Single-Link Robot," Journal of Robotic Systems, vol. 4, pp. 621-630, 1987.
[46] E. Bayo et al., "Inverse Dynamics and Kinematics of Multi-Link Elastic Robots: An Iterative Frequency Domain Approach," Int. Journal of Robotics Research, vol. 8, no. 6, pp. 49-62, 1989.
[47] U. Sawut, N. Umeda, T. Hanamoto, and T. Tsuji, "Applications of Non-Linear Observer in Flexible Arm Control," Trans. of SICE, vol. 35, no. 3, pp. 401-406, 1999 (in Japanese).
[48] C. Z. Wei, "Multivariate Adaptive Stochastic Approximation," Ann. Statist., vol. 15, pp. 1115-1130, 1987.
[49] A. Vande Wouwer, C. Renotte, and M. Remy, "Application of Stochastic Approximation Techniques in Neural Modelling and Control," Int. Journal of Systems Science, vol. 34, no. 14, pp. 851-863, 2003.
[50] P. A. Regalia, Adaptive IIR Filtering in Signal Processing and Control, Marcel Dekker, 1995.
[51] D. Parikh, N. Ahmed, and S. D. Stearns, "An Adaptive Lattice Algorithm for Recursive Filters," IEEE Trans. Acoust., Speech, Signal Processing, vol. 28, pp. 110-112, 1980.
[52] J. A. Rodriguez-Fonollosa and E. Masgrau, "Simplified Gradient Calculation in Adaptive IIR Lattice Filters," IEEE Trans. on Signal Processing, vol. 39, pp. 1702-1705, 1991.
[53] P. A. Regalia, "Stable and Efficient Lattice Algorithms for Adaptive IIR Filtering," IEEE Trans. on Signal Processing, vol. 40, pp. 375-388, 1992.
[54] H. Fan, "Application of Benveniste's Convergence Results in the Study of Adaptive IIR Filtering Algorithms," IEEE Trans. Inform. Theory, vol. 34, pp. 692-709, 1988.
[55] P. Lancaster and M. Tismenetsky, The Theory of Matrices, Academic Press, 1985.
[56] C. R. Johnson, Jr., M. G. Larimore, J. R. Treichler, and B. D. O. Anderson, "SHARF Convergence Properties," IEEE Trans. on Acoustics, Speech, and Signal Processing, vol. 28, no. 4, pp. 428-440, 1980.
[57] M. G. Larimore and J. R. Treichler, "SHARF: An Algorithm for Adapting IIR Digital Filters," IEEE Trans. on Acoustics, Speech, and Signal Processing, vol. 28, no. 4, pp. 428-440, 1980.
[58] I. D. Landau, "Elimination of the Real Positivity Condition in the Design of Parallel MRAS," IEEE Trans. Automat. Control, vol. AC-23, no. 6, pp. 1015-1020, 1978.
[59] K. Kurosawa and S. Tsuji, "An IIR Parallel-Type Adaptive Algorithm using the Fast Least Squares Method," IEEE Trans. Acoust., Speech, Signal Processing, vol. 37, no. 8, pp. 1226-1230, 1989.
[60] C. R. Johnson, Jr., and Taylor, "Failure of a Parallel Adaptive Identifier with Adaptive Error Filtering," IEEE Trans. Automat. Control, vol. AC-25, no. 6, pp. 1248-1250, 1980.
[61] K. Steiglitz and L. E. McBride, "A Technique for the Identification of Linear Systems," IEEE Trans. Automat. Control, vol. AC-10, no. 4, pp. 461-464, 1965.
[62] P. Regalia and M. Mboup, "An a Priori Error Bound for the Steiglitz-McBride Method," IEEE Trans. on Circuits and Systems II: Analog and Digital Signal Processing, vol. 41, no. 2, pp. 105-116, 1996.
[63] K. X. Miao, H. Fan, and M. Doroslovacki, "Cascade Normalized Lattice Adaptive IIR Filters," IEEE Trans. on Signal Processing, vol. 42, pp. 721-742, 1994.
[64] G. Poyiadjis, A. Doucet, and S. S. Singh, "Particle Methods for Optimal Filter Derivative: Application to Parameter Estimation," Proceedings of IEEE ICASSP, 2005.
[65] G. Poyiadjis, S. S. Singh, and A. Doucet, "Novel Particle Filter Methods for Recursive and Batch Parameter Estimation in General State Space Models," Technical Report CUED/F-INFENG/TR-536, Engineering Department, Cambridge University, 2005.
[66] P. Fearnhead, "MCMC, Sufficient Statistics and Particle Filters," Journal of Computational and Graphical Statistics, vol. 11, pp. 848-862, 2002.
[67] B. L. Chan, A. Doucet, and V. B. Tadic, "Optimization of Particle Filters using Simultaneous Perturbation Stochastic Approximation," Proc. IEEE ICASSP, pp. 681-684, 2003.
[68] J. Liu and M. West, "Combined Parameter and State Estimation in Simulation-based Filtering," in Sequential Monte Carlo Methods in Practice (A. Doucet, J. F. G. de Freitas, and N. J. Gordon, Eds.), New York: Springer-Verlag, 2001.
[69] G. Storvik, "Particle Filters in State Space Models with the Presence of Unknown Static Parameters," IEEE Trans. Signal Processing, vol. 50, pp. 281-289, 2002.
[70] A. Doucet and V. B. Tadic, "On-line Optimization of Sequential Monte Carlo Methods using Stochastic Approximation," Proceedings of the American Control Conference, pp. 2565-2570, 2002.
[71] H. J. Kushner and D. S. Clark, Stochastic Approximation Methods for Constrained and Unconstrained Systems, New York: Springer-Verlag, 1978.
[72] N. J. Gordon, D. J. Salmond, and A. F. M. Smith, "Novel Approach to Non-linear/Non-Gaussian Bayesian State Estimation," IEE Proceedings F, vol. 140, pp. 107-113, 1993.


Appendix A

Proofs of Convergence Results and Asymptotic Distribution Results

Proof of Lemma (Sufficient Conditions for C.5 and C.7)

C.7 is used in the proofs of Theorems 1a and 1b only to ensure that $P(\limsup_{k\to\infty}\|\hat{\theta}_k\| = \infty) = 0$. Given the boundedness of $\hat{\theta}_k$, this condition becomes superfluous. Regarding C.5, the boundedness condition together with the facts that $a_k/c_k^2 \to 0$ and $c_k^2\bar{H}_k^{-1} \to 0$ (C.6) imply that, for some $0 < \rho' < \rho$, $|a_k g_{ki}(\hat{\theta}_k)| \le \rho'$ a.s. for all $k$ sufficiently large. From the basic recursion, $\tilde{\theta}_{k+1,i} = \tilde{\theta}_{ki} - a_k g_{ki}(\hat{\theta}_k) - a_k e_{ki}$, where $e_k = G_k(\hat{\theta}_k) - g_k(\hat{\theta}_k)$. But $a_k e_{ki} \to 0$ a.s. by the martingale convergence theorem (see (8) and (9) in Spall and Cristion [25]). Since $|\tilde{\theta}_{ki}| \ge \rho > \rho'$, we know that $\operatorname{sign}\tilde{\theta}_{ki} = \operatorname{sign}\tilde{\theta}_{k+1,i}$ for all $k$ sufficiently large, implying that $\operatorname{sign} g_i(\hat{\theta}_k) = \operatorname{sign} g_i(\hat{\theta}_{k+1})$ a.s.

Proof of Theorem 1a (M2-SPSA)

The proof will proceed in three parts. Some of the proof closely follows that of the proposition in Spall and Cristion [25], in which case the details will be omitted here and the reader will be directed to that reference. However, some of the proof differs in nontrivial ways due to, among other factors, the need to explicitly treat the bias in the gradient estimate $G_k(\cdot)$. First, we will show that $\tilde{\theta}_k = \hat{\theta}_k - \theta^*$ does not diverge in magnitude to $\infty$ on any set of nonzero measure. Second, we will show that $\tilde{\theta}_k$ converges a.s. to some random vector, and third, we will show that this random vector is the constant 0, as desired. Equalities hold a.s. where relevant.

Part 1: First, from C.0, C.2, and C.3, it can be shown in the manner of Spall [3, Lemma 1] that, for all $k$ sufficiently large,

$$E\big(G_k(\hat{\theta}_k) \mid \hat{\theta}_k\big) = g(\hat{\theta}_k) + b_k \qquad \text{(A1)}$$

where $c_k^{-2} b_k$ is uniformly bounded a.s. Using C.6, we know that $\bar{H}_k^{-1}$ exists a.s., and hence we can write $M_k \equiv a_k \bar{H}_k^{-1}\big(g(\hat{\theta}_k) + b_k\big)$. Then, as in the proposition of Spall and Cristion [25], C.1, C.2, and C.6, and Holder's inequality imply, via the martingale convergence theorem,

$$\tilde{\theta}_{k+1} + \sum_{j=0}^{k} M_j \xrightarrow{\;a.s.\;} X \qquad \text{(A2)}$$

where $X$ is some integrable random vector.

Let us now show that $P(\limsup_{k\to\infty}\|\tilde{\theta}_k\| = \infty) = 0$. Since the arguments below apply along any subsequence, we will, for ease of notation and without loss of generality, consider the event $\{\tilde{\theta}_k \to \infty\}$. We will show that this event has probability 0 by a modification of the arguments in [25, proposition] (which is a multivariate extension of scalar arguments in Blum [34] and Evans and Weber [4]). Furthermore, suppose that the limiting quantity of the unbounded elements is $+\infty$ (trivial modifications cover a limiting quantity including $-\infty$ limits). Then, as shown in [25], the event of interest $\{\tilde{\theta}_k \to \infty\}$ has probability 0 if

$$\big\{\tilde{\theta}_{ki} \ge \rho'(\tau,S)\ \forall i \in S,\ \tilde{\theta}_{ki} \le \tau\ \forall i \notin S,\ k \ge K(\tau,S)\big\} \cap \limsup_{k\to\infty}\{M_{ki} < 0\ \forall i \in S\} \qquad \text{(A3a)}$$

and

$$\big\{\tilde{\theta}_{ki} \to \infty\ \forall i \in S\big\} \cap \liminf_{k\to\infty}\{M_{ki} < 0\ \forall i \in S\}^c \qquad \text{(A3b)}$$

both have probability 0 for all $\tau$, $S$ and $\rho'(\tau,S)$ as defined in C.7, where $K(\tau,S) < \infty$ and the superscript $c$ denotes set complement. For event (A3a), we know that there exists a subsequence $\{k_0, k_1, k_2, \ldots\}$, $k_0 \ge K(\tau,S)$, such that $\{\tilde{\theta}_{k_j,i} \ge \rho'(\tau,S)\ \forall i \in S\} \cap \{M_{k_j,i} < 0\ \forall i \in S\}$ is true. Then, from C.6 and (A1),

$$\sum_{i\in S} \tilde{\theta}_{k_j i}\big(g_{k_j i}(\tilde{\theta}_{k_j}) + o(1)\big) < 0 \quad \text{a.s.} \qquad \text{(A4)}$$

for all $k_j$. By C.4, $\tilde{\theta}_{k_j}^T g_{k_j}(\hat{\theta}_{k_j}) \ge \rho\|\tilde{\theta}_{k_j}\|$ a.s., which, by C.7, implies, for all $j$ sufficiently large,

$$\sum_{i\in S} \tilde{\theta}_{k_j i}\, g_{k_j i}(\tilde{\theta}_{k_j}) \ge \frac{\rho}{2}\|\tilde{\theta}_{k_j}\| \ge \left(\frac{\rho}{2}\right)\dim(S)\,\rho'(\tau,S) \ge \frac{\rho\tau}{2} \qquad \text{(A5)}$$

since $\rho'(\tau,S) \ge \tau$ and $\dim(S) \ge 1$. Taken together, (A4) and (A5) imply that, for each sample point (except possibly on a set of measure 0), the event in (A3a) has probability 0. Now, consider the second event (A3b). From (A2), we know that, for almost all sample points, $\sum_{k=0}^{\infty} M_{ki} = -\infty\ \forall i \in S$ must be true. But this implies, from C.5 and the above-mentioned uniformly bounded decaying bias $b_k$, that for no $i \in S$ can $M_{ki} \ge 0$ occur infinitely often. However, at each $k$, the event $\{M_{ki} < 0\ \forall i \in S\}^c$ is composed of the union of $2^{\dim(S)} - 1$ events, each of which has $M_{ki} \ge 0$ for at least one $i \in S$. This, of course, requires that $M_{ki} \ge 0$ for at least one $i \in S$, which creates a contradiction. Hence, the probability of the event in (A3b) is 0. This completes Part 1 of the proof.


Part 2: To show that $\tilde{\theta}_k$ converges a.s. to a unique (finite) limit, we show that

$$P\left(\liminf_{k\to\infty}\tilde{\theta}_{ki} < a' < b' < \limsup_{k\to\infty}\tilde{\theta}_{ki}\right) = 0 \quad \forall i \qquad \text{(A6)}$$

for any $a' < b'$. This result follows as in the proof of Part 2 of the proposition in Spall and Cristion [25].

Part 3: Let us now show that the unique finite limit from Part 2 is 0. From (A2) and the conclusion of Part 1, we have $\big|\sum_{k=0}^{\infty} M_{ki}\big| < \infty$ a.s. $\forall i$. Then the result to be shown follows if

$$P\left(\lim_{k\to\infty}\tilde{\theta}_k \ne 0,\ \Big\|\sum_{k=0}^{\infty} M_k\Big\| < \infty\right) = 0. \qquad \text{(A7)}$$

Suppose that the event in the probability of (A7) is true, and let $I \subseteq \{1,2,\ldots,p\}$ represent those indices $i$ such that $\tilde{\theta}_{ki}$ does not converge to 0 as $k\to\infty$. Then, by the convergence in Part 2, there exist (for almost any sample point in the underlying sample space) some $0 < a' < b' < \infty$ and $K(a',b') < \infty$ (dependent on the sample point) such that $\forall k > K$, $0 < a' \le |\tilde{\theta}_{ki}| \le b' < \infty$ when $i \in I$ ($I \ne \emptyset$) and $|\tilde{\theta}_{ki}| \le a'$ when $i \in I^c$. From C.4, it follows that

$$\sum_{k=K+1}^{n} a_k \sum_{i\in I} \tilde{\theta}_{ki}\, g_{ki}(\hat{\theta}_k) \ge a'\rho \sum_{k=K+1}^{n} a_k. \qquad \text{(A8)}$$

But since C.5 implies that $g_{ki}(\hat{\theta}_k)$ can change sign only a finite number of times (except possibly on a set of sample points of measure 0), and since $|\tilde{\theta}_{ki}| \le b'$, we know from (A8) that, for at least one $i \in I$,

$$\limsup_{n\to\infty} \frac{\rho a' \sum_{k=K+1}^{n} a_k}{\Big|\sum_{k=K+1}^{n} a_k\, g_{ki}(\hat{\theta}_k)\Big|} < \infty. \qquad \text{(A9)}$$

Recall that $a_k g_k(\hat{\theta}_k) = M_k - a_k \bar{H}_k^{-1} b_k$ and $b_k = O(c_k^2)$ a.s. Hence, from C.6, we have $\bar{H}_k^{-1} b_k = o(1)$. Then, by (A9), $\big|\sum_{k=K+1}^{\infty} M_{ki}\big| = \infty$. Since, for the $a' < b'$ above, there exists such a $K$ for each sample point in a set of measure one, we know from the above discussion that there also exists an $i \in I$ ($i$ possibly dependent on the sample point) such that $\big|\sum_{k=K+1}^{\infty} M_{ki}\big| = \infty$. Since $I$ has a finite number of elements, $\big|\sum_{k=0}^{\infty} M_{ki}\big| = \infty$ for at least one $i$ on this event. However, this is inconsistent with the event in (A7), showing that the event does, in fact, have probability 0. This completes Part 3, which completes the proof.

Proof of Theorem 1b (2SG). The initial martingale convergence arguments establishing the 2SG analog of (A2) are based on C.0'-C.2' and C.6. Although there is no bias in the gradient measurement, C.4 and C.7 still work together to guarantee that the elements potentially diverging [in the arguments analogous to those surrounding (A3a), (A3b)] asymptotically dominate the product $\hat{\theta}_{kj}^T g_{kj}(\hat{\theta}_{kj})$. As in the proof of Theorem 1a, this sets up a contradiction. The remainder of the proof follows exactly as in Parts 2 and 3 of the proof of Theorem 1a, with some of the arguments made easier since $b_k = 0$.


Proof of Theorem 2a (M2-SPSA)

First, note that the conditions subsume those of Theorem 1a; hence, we have a.s. convergence of $\hat{\theta}_k$. By C.8, we have $E\big((c_k\tilde{c}_k)^2\hat{H}_k^2\big)$ uniformly bounded $\forall k$. Hence, by the additional assumption introduced in C.1'' (beyond that in C.1), the martingale convergence result in, say, Gerencser [16], yields

$$\frac{1}{n+1}\sum_{k=0}^{n}\big(\hat{H}_k - E(\hat{H}_k \mid \hat{\theta}_k)\big) \to 0 \quad \text{a.s. as } n \to \infty. \qquad \text{(A10)}$$

Let $H(\theta)$ represent the true Hessian matrix, and suppose that $g(\theta)$ is three-times continuously differentiable in a neighborhood of $\hat{\theta}_k$. Then, simple Taylor series arguments show that

$$E(\delta G_k \mid \hat{\theta}_k, \Delta_k) \equiv \delta g_k + O(c_k^3) = g(\hat{\theta}_k + c_k\Delta_k) - g(\hat{\theta}_k - c_k\Delta_k) + O(c_k^3) \qquad \big(O(c_k^3) = 0 \text{ in the SG case}\big)$$

where this result is immediate in the SG case, and follows easily by a Taylor series argument in the SPSA case (where the $O(c_k^3)$ term is the difference of the two $O(c_k^2)$ bias terms in the one-sided SP gradient approximations and $\tilde{c}_k = O(c_k)$). Hence, by an expansion of each of $g(\hat{\theta}_k \pm c_k\Delta_k)$, we have for any $i, j$:


$$E\left(\frac{\delta G_{ki}}{2c_k\Delta_{kj}}\ \Big|\ \hat{\theta}_k, \Delta_k\right) = E\left(\frac{\delta g_{ki}}{2c_k\Delta_{kj}}\ \Big|\ \hat{\theta}_k, \Delta_k\right) + O(c_k^2) = H_{ij}(\hat{\theta}_k) + \sum_{l\neq j} H_{il}(\hat{\theta}_k)\,\frac{\Delta_{kl}}{\Delta_{kj}} + O(c_k^2)$$

where the $O(c_k^2)$ term in the second line absorbs higher-order terms in the expansion of $\delta g_k$. Then, since $E(\Delta_{kl}/\Delta_{kj}) = 0\ \forall j \neq l$ by the assumptions for $\Delta_k$, we have

$$E\left(\frac{\delta G_{ki}}{2c_k\Delta_{kj}}\ \Big|\ \hat{\theta}_k\right) = H_{ij}(\hat{\theta}_k) + O(c_k^2)$$

implying that the Hessian estimate is "nearly unbiased," with the bias disappearing at rate $O(c_k^2)$. The additional operation in

$$\hat{H}_k = \frac{1}{2}\left[\frac{\delta G_k}{2c_k\Delta_k^{T}} + \left(\frac{\delta G_k}{2c_k\Delta_k^{T}}\right)^{T}\right]$$

(division by the vector $\Delta_k^T$ understood elementwise) simply forces the per-iteration estimate to be symmetric. Then, by the above equations, conditions C.3', C.8, and C.9 imply (A14) for every $l$, where $L^{(3)}_{hij}$ represents the third derivative of $L$ w.r.t. the $h$th, $i$th, and $j$th elements of $\theta$; $\bar{\theta}_k^{\pm}$ are points on the line segments between $\hat{\theta}_k \pm c_k\Delta_k$ and $\hat{\theta}_k \pm c_k\Delta_k + \tilde{c}_k\tilde{\Delta}_k$; and we used the fact that $E(\tilde{\Delta}_{ki}\tilde{\Delta}_{kj}/\tilde{\Delta}_{kl}) = 0$ for all $i, j, k$, and $l$ (implied by C.9 and the Cauchy–Schwarz inequality). Let


$$B_{kl} = \frac{1}{6}\,E\left[\tilde{\Delta}_{kl}^{-1}\sum_{h,i,j}\Big(L^{(3)}_{hij}(\bar{\theta}_k^{+}) - L^{(3)}_{hij}(\bar{\theta}_k^{-})\Big)\,\tilde{\Delta}_{kh}\tilde{\Delta}_{ki}\tilde{\Delta}_{kj}\ \Big|\ \hat{\theta}_k, \Delta_k\right]. \qquad \text{(A11)}$$

By C.3' (bounding the difference in the $L^{(3)}_{hij}$ terms) and C.9 in conjunction with the Cauchy–Schwarz inequality and C.1'' ($\tilde{c}_k = O(c_k)$), we have $B_{kl}/c_k$ uniformly bounded (in $\hat{\theta}_k, \Delta_k$) for all $k$ sufficiently large. Hence, from (A11) the $(l,m)$-th element of $\hat{H}_k$ satisfies

$$\begin{aligned}
E(\hat{H}_{k,lm} \mid \hat{\theta}_k)
&= E\left(\frac{G^{(1)}_{kl}(\hat{\theta}_k + c_k\Delta_k) - G^{(1)}_{kl}(\hat{\theta}_k - c_k\Delta_k)}{2c_k\Delta_{km}}\ \Big|\ \hat{\theta}_k\right)\\
&= E\left(\frac{g_l(\hat{\theta}_k + c_k\Delta_k) - g_l(\hat{\theta}_k - c_k\Delta_k) + \tilde{c}_k^{2}B_{kl}}{2c_k\Delta_{km}}\ \Big|\ \hat{\theta}_k\right)\\
&= E\left(\frac{2c_k\,[\partial g_l/\partial\theta]^T\big|_{\theta=\hat{\theta}_k}\,\Delta_k + O(c_k^3)}{2c_k\Delta_{km}}\ \Big|\ \hat{\theta}_k\right)\\
&= H_{lm}(\hat{\theta}_k) + O(c_k^2)
\end{aligned}\qquad \text{(A12)}$$

where the $O(c_k^3)$ term in the third line of (A12) encompasses both $\tilde{c}_k^{2}B_{kl}$ and the uniformly bounded contributions due to $\partial^2 g_l/\partial\theta\,\partial\theta^T$ in the remainder terms of the expansion of $g_l(\hat{\theta}_k + c_k\Delta_k) - g_l(\hat{\theta}_k - c_k\Delta_k)$ ($O(c_k^3)/c_k$ is uniformly bounded, allowing the use of C.9 and the Cauchy–Schwarz inequality in producing the $O(c_k^2)$ term in the last line of (A12)). Then, by (A12), the continuity of $H$ near $\hat{\theta}_k$ and the fact that $\hat{\theta}_k \to \theta^*$ a.s. (Theorem 1a), the principle of Cesaro summability implies


$$\frac{1}{n+1}\sum_{k=0}^{n} E(\hat{H}_k \mid \hat{\theta}_k) = \frac{1}{n+1}\sum_{k=0}^{n}\big(H(\hat{\theta}_k) + O(c_k^2)\big) \to H(\theta^*) \quad \text{a.s.} \qquad \text{(A13)}$$

Given that $\bar{H}_n = (n+1)^{-1}\sum_{k=0}^{n}\hat{H}_k$, (A10) and (A13) then yield the result to be proved.
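The estimator analyzed above is straightforward to realize in code. The following is a minimal sketch of the per-iteration simultaneous-perturbation Hessian estimate and its running (Cesaro) average, written for the SG case where direct gradient evaluations are available; the function names, the test loss, and the gain choices are illustrative assumptions, not the implementation used elsewhere in this thesis.

```python
import numpy as np

def sp_hessian_estimate(grad, theta, c_k, rng):
    """One simultaneous-perturbation estimate of the Hessian at theta,
    symmetrized as in the discussion around (A12)."""
    p = theta.size
    delta = rng.choice([-1.0, 1.0], size=p)       # Bernoulli +/-1 perturbation
    dG = grad(theta + c_k * delta) - grad(theta - c_k * delta)
    Hk = np.outer(dG / (2.0 * c_k), 1.0 / delta)  # delta G_k / (2 c_k Delta_k^T), elementwise inverse
    return 0.5 * (Hk + Hk.T)                      # force per-iteration symmetry

# Illustrative test: gradient of L(theta) = x^4 + x^2 + y^2 + x*y
grad = lambda th: np.array([4.0 * th[0]**3 + 2.0 * th[0] + th[1],
                            2.0 * th[1] + th[0]])

rng = np.random.default_rng(0)
theta = np.zeros(2)                               # evaluate at theta = (0, 0)
H_bar = np.zeros((2, 2))
for k in range(5000):
    c_k = 0.5 / (k + 1) ** 0.101                  # decaying perturbation size
    H_bar += (sp_hessian_estimate(grad, theta, c_k, rng) - H_bar) / (k + 1)
# H_bar approaches the true Hessian [[2, 1], [1, 2]], illustrating (A13)
```

The running-average update `H_bar += (Hk - H_bar) / (k + 1)` is exactly the Cesaro mean $\bar{H}_n$ appearing in (A13), computed without storing past estimates.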

Proof of Theorem 2b (2SG) Since the conditions subsume those of Theorem 1b, we have $\hat{\theta}_k \to \theta^*$ a.s. Analogous to (A10), C.1''' and C.8' yield a martingale convergence result for the sample mean of $\hat{H}_k - E(\hat{H}_k \mid \hat{\theta}_k)$. Then, given the boundedness of the third derivatives of $L(\theta)$ near $\hat{\theta}_k$ for all $k$, the Cauchy–Schwarz inequality and C.8', C.9' imply that $E(\hat{H}_k \mid \hat{\theta}_k) = H(\hat{\theta}_k) + O(c_k^2)$. By $\hat{\theta}_k \to \theta^*$ a.s., the Cesaro summability arguments in (A13) yield the result to be proved.

Proof of Theorem 3a (M2-SPSA)

Beginning with the expansion $E(G_k(\hat{\theta}_k) \mid \hat{\theta}_k) = H(\bar{\theta}_k)(\hat{\theta}_k - \theta^*) + b_k$, where $\bar{\theta}_k$ is on the line segment between $\hat{\theta}_k$ and $\theta^*$ and the bias $b_k$ is defined in (A1), the estimation error can be represented in the notation of [19] as


$$\hat{\theta}_{k+1} - \theta^* = (I - k^{-\alpha}\Gamma_k)(\hat{\theta}_k - \theta^*) + k^{-(\alpha+\beta)/2}\,\Phi_k V_k + k^{-\alpha-\beta/2}\,\bar{H}_k^{-1}T_k$$

where $\Gamma_k = a\bar{H}_k^{-1}H(\bar{\theta}_k)$, $\Phi_k = -a\bar{H}_k^{-1}$, $V_k = k^{-\gamma}\big[G_k(\hat{\theta}_k) - E(G_k(\hat{\theta}_k) \mid \hat{\theta}_k)\big]$, and $T_k = -a\,k^{\beta/2}\,b_k$. The proof follows that of Spall [3, Proposition 2] closely, which shows that the three sufficient conditions for asymptotic normality in Fabian [19] hold. By the convergence of $\hat{\theta}_k$, it is straightforward to show a.s. convergence of $T_k$ to 0 if $3\gamma - \alpha/2 > 0$, or to $T$ in (2.37) if $3\gamma - \alpha/2 = 0$. The mean expression $\mu$ then follows directly from Fabian [19] and the convergence of $\bar{H}_k$ (and hence $\bar{H}_k^{-1}$) by C.11 and the existence of $H(\theta^*)^{-1}$. Further, as in Spall [3], $E(V_k V_k^T \mid \hat{\theta}_k)$ is a.s. convergent by C.2 and C.10, leading to the covariance matrix $\Omega$. This shows Fabian [19, (2.2.1) and (2.2.2)]. The final condition [19, (2.2.3)] follows as in Spall [3, Proposition 2] since the definition of $V_k$ is identical in both standard SPSA and M2-SPSA.

For reference, the expansion used in the Proof of Theorem 2a is

$$\begin{aligned}
E\big[G^{(1)}_{kl}(\hat{\theta}_k \pm c_k\Delta_k) \mid \hat{\theta}_k, \Delta_k\big]
&= E\left[\tilde{c}_k^{-1}\tilde{\Delta}_{kl}^{-1}\Big(\tilde{c}_k\, g(\hat{\theta}_k \pm c_k\Delta_k)^T\tilde{\Delta}_k
+ \tfrac{1}{2}\tilde{c}_k^{2}\,\tilde{\Delta}_k^T H(\hat{\theta}_k \pm c_k\Delta_k)\tilde{\Delta}_k\right.\\
&\qquad\left.{}+ \tfrac{1}{6}\tilde{c}_k^{3}\sum_{h,i,j} L^{(3)}_{hij}(\bar{\theta}_k^{\pm})\,\tilde{\Delta}_{kh}\tilde{\Delta}_{ki}\tilde{\Delta}_{kj}\Big)\ \Big|\ \hat{\theta}_k, \Delta_k\right]\\
&= g_l(\hat{\theta}_k \pm c_k\Delta_k) + \frac{\tilde{c}_k^{2}}{6}\, E\left[\tilde{\Delta}_{kl}^{-1}\sum_{h,i,j} L^{(3)}_{hij}(\bar{\theta}_k^{\pm})\,\tilde{\Delta}_{kh}\tilde{\Delta}_{ki}\tilde{\Delta}_{kj}\ \Big|\ \hat{\theta}_k, \Delta_k\right].
\end{aligned}\qquad \text{(A14)}$$


Proof of Theorem 3b (2SG) Analogous to the Proof of Theorem 3a, the estimation error can be represented as

$$\hat{\theta}_{k+1} - \theta^* = (I - k^{-\alpha}\Gamma_k)(\hat{\theta}_k - \theta^*) + k^{-\alpha}\,\Phi_k e_k$$

where $\Gamma_k = a\bar{H}_k^{-1}H(\bar{\theta}_k)$ and $\Phi_k = -a\bar{H}_k^{-1}$. Conditions (2.2.1) and (2.2.2) of Fabian [19] follow immediately by the smoothness of $L(\theta)$ (from C.3'), the convergence of $\hat{\theta}_k$ and $\bar{H}_k$, and C.12. Condition (2.2.3) of Fabian [19] follows by Hölder's inequality and C.2', C.3'.

Proof of Theorem 4a (Convergence in parameter estimation, M2-SPSA)

The convergence theorem for the proposed method is proven here based on RM-type stochastic approximation. In contrast to the RM-type stochastic approximation, the simultaneous perturbation stochastic approximation estimates the slope of the error function from values of the error function itself; the estimated slope therefore necessarily includes an error. In this proof, the nature of that estimation error is clarified, which allows the convergence of the parameter estimation algorithm to be established via the conventional RM-type stochastic approximation argument. In the proof below, subscripts that can be readily understood are omitted.

Let $\tilde{\phi} = \hat{\phi} - \phi$. Subtracting the true parameter value $\phi$ from both sides of (2.62) and then expanding and simplifying the right-hand side yields

$$\tilde{\phi}_{k+n} = \big(I - \rho\, z_{k+n-1} y_{k+n-1}^T\big)\tilde{\phi}_{k-1} + \rho\left\{ z_{k+n-1} e_{k+n} + \begin{bmatrix}\sigma^2 I_n & 0\\ 0 & 0\end{bmatrix}\hat{\phi}_{k-1} - \frac{1}{2}c\,\big(y_{k+n-1}^T s_{k-1}\big)^2 s_{k-1} \right\} \qquad \text{(B.1)}$$


results. Here, $z_{k+n-1}$ is given by

$$z_{k+n-1} = s_{k-1}s_{k-1}^T\, y_{k+n-1} = y_{k+n-1} + d_{k+n-1}.$$

Note that $d_{k+n-1}$ represents the difference between $s_{k-1}s_{k-1}^T\, y_{k+n-1}$ and $y_{k+n-1}$ and is given by the following equation, where $s_{,i}$ represents the $i$-th element of the sign vector $s_{k-1}$ at time $k-1$:

$$d_{k+n-1} = \begin{pmatrix}
y_{k-1}\,s_{,2}s_{,1} + \cdots + u_{k+n-1}\,s_{,n+m}s_{,1}\\
y_{k}\,s_{,1}s_{,2} + \cdots + u_{k+n-1}\,s_{,n+m}s_{,2}\\
\vdots\\
y_{k}\,s_{,1}s_{,n+m} + \cdots + u_{k+n-2}\,s_{,n+m-1}s_{,n+m}
\end{pmatrix}. \qquad \text{(B.2)}$$
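Each entry of (B.2) is a sum of products of two distinct sign elements, and, as noted just below, such products have zero mean. A quick Monte Carlo sketch (illustrative dimensions and sample count, not the thesis's setup) confirms that $E[s\,s^T] = I$ for independent $\pm 1$ signs, so $E\{z\} = E\{y\}$ and $E\{d\} = 0$.

```python
import numpy as np

rng = np.random.default_rng(2)
p, N = 5, 100_000                          # illustrative dimension and sample count
s = rng.choice([-1.0, 1.0], size=(N, p))   # independent +/-1 sign vectors
print(np.round(s.T @ s / N, 3))            # approximately the identity: E[s s^T] = I
# Off-diagonal entries (the s_i s_j products with i != j that build d) average to ~0.
```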

Caution is required here: the product of $s_{,i}$ and $s_{,j}$ in each term of each element of $d$ is a product of mutually distinct elements of the sign vector $s_{k-1}$; in other words, since $i \neq j$, each such product has expected value 0.

At this point, based on Eq. (B.1), $\|\tilde{\phi}_{k+n}\|^2$ is calculated as follows:

$$\begin{aligned}
\|\tilde{\phi}_{k+n}\|^2 &= \tilde{\phi}_{k+n}^T\tilde{\phi}_{k+n}\\
&= \left\|\tilde{\phi}_{k-1} + \rho\left(-zy^T\tilde{\phi}_{k-1} + ze + \begin{bmatrix}\sigma^2 I_n & 0\\ 0 & 0\end{bmatrix}\hat{\phi}_{k-1} - \frac{1}{2}c\,(y^Ts)^2 s\right)\right\|^2\\
&= \tilde{\phi}_{k-1}^T\big(I - \rho\, zy^T - \rho\, yz^T\big)\tilde{\phi}_{k-1}
+ 2\rho\,\tilde{\phi}_{k-1}^T\left\{ ze + \begin{bmatrix}\sigma^2 I_n & 0\\ 0 & 0\end{bmatrix}\hat{\phi}_{k-1} - \frac{1}{2}c\,(y^Ts)^2 s\right\}
+ \rho^2 h.
\end{aligned}\qquad \text{(B.3)}$$


However,

$$h = \left\| -zy^T\tilde{\phi}_{k-1} + ze + \begin{bmatrix}\sigma^2 I_n & 0\\ 0 & 0\end{bmatrix}\hat{\phi}_{k-1} - \frac{1}{2}c\,(y^Ts)^2 s \right\|^2. \qquad \text{(B.4)}$$

Finally, the expected value of Eq. (B.3) is found conditionally on $\tilde{\phi}_{k-1} = \beta$. Before this, though, each term in the equation is evaluated. First, the conditional expected value of the $zy^T$ term must be considered:

$$E\{zy^T \mid \tilde{\phi}_{k-1} = \beta\} = E\{yy^T \mid \beta\} + E\{dy^T \mid \beta\}. \qquad \text{(B.5)}$$

Here, the second term on the right is 0 based on the signed vector condition (B11). Therefore, only the first term on the right needs to be considered, and so the following equation results:

$$\begin{aligned}
E\{yz^T \mid \beta\} = E\{yy^T \mid \beta\}
&= E\left\{\begin{pmatrix}x\\ u\end{pmatrix}\begin{pmatrix}x^T & u^T\end{pmatrix} + \begin{pmatrix}\upsilon\\ 0\end{pmatrix}\begin{pmatrix}\upsilon^T & 0^T\end{pmatrix}\ \Big|\ \beta\right\}\\
&= E\left\{\begin{bmatrix}xx^T & xu^T\\ ux^T & uu^T\end{bmatrix}\ \Big|\ \beta\right\} + \begin{bmatrix}\sigma^2 I_n & 0\\ 0 & 0\end{bmatrix}.
\end{aligned}\qquad \text{(B.6)}$$
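Equation (B.6) says that the second-moment matrix of the measurement vector splits into the noise-free block matrix and a $\sigma^2$ block contributed by the observation noise. The following Monte Carlo sketch checks that split numerically; the dimensions, distributions, and variable names are illustrative assumptions rather than the thesis's system model.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, N, sigma = 3, 2, 200_000, 0.5
x = rng.standard_normal((N, n))              # noise-free regressor block
u = rng.standard_normal((N, m))              # input block
ups = sigma * rng.standard_normal((N, n))    # observation noise (enters the x-block only)

y = np.hstack([x + ups, u])                  # y = (x; u) + (upsilon; 0)
Eyy = y.T @ y / N                            # sample E{y y^T}

w = np.hstack([x, u])
D = w.T @ w / N                              # sample of the noise-free block matrix
Sigma = np.zeros((n + m, n + m))
Sigma[:n, :n] = sigma**2 * np.eye(n)         # [sigma^2 I_n, 0; 0, 0]

print(np.max(np.abs(Eyy - (D + Sigma))))     # small (Monte Carlo error only)
```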

In the same fashion, for $E\{yz^T \mid \tilde{\phi}_{k-1} = \beta\}$ the same result as in Eq. (B.6) is obtained. In addition, the conditional expected value of the $ze$ term given $\tilde{\phi}_{k-1} = \beta$ must be considered:

$$E\{ze \mid \beta\} = E\{ye \mid \beta\} + E\{de \mid \beta\}. \qquad \text{(B.7)}$$


Because the second term in the equation above is 0 based on the condition (B11), only the first term needs to be considered. The first term is given by (2.52), and so the following equation results:

$$E\{ze \mid \beta\} = -\begin{bmatrix}\sigma^2 I_n & 0\\ 0 & 0\end{bmatrix}\phi. \qquad \text{(B.8)}$$

Now let us consider $h$ as represented in Eq. (B.4). Although a similar discussion can be found in [15], only the fourth term on the right varies as a result of the perturbation. Expanding Eq. (B.4) reveals a term affected by $(y^Ts)^2 s$ and a term affected by its square. The term multiplied by $(y^Ts)^2 s$ takes 0 for its expected value based on the signing condition (B11). The latter term involves a fourth-order moment of $y$. When the assumption (C11) of the boundedness of the fourth-order moments of the stochastic input $u$ and the observation noise $v$, and the assumption (A12) of the boundedness of the perturbation, are taken into consideration, from Eq. (B.4) we have the following inequality for appropriate constants $0 \le \alpha_1, \alpha_2 < \infty$:

$$E\{h \mid \tilde{\phi}_{k-1} = \beta\} \le \alpha_1\|\beta\|^2 + \alpha_2. \qquad \text{(B.9)}$$

Given the above relationships, the conditional expectation of (B.3) given $\tilde{\phi}_{k-1} = \beta$ satisfies the following:

$$\begin{aligned}
E\big\{\|\tilde{\phi}_{k+n}\|^2 \mid \tilde{\phi}_{k-1} = \beta\big\}
&\le \beta^T(I - 2\rho D)\beta - 2\rho\,\beta^T\begin{bmatrix}\sigma^2 I_n & 0\\ 0 & 0\end{bmatrix}\beta
+ 2\rho\,\beta^T\left\{-\begin{bmatrix}\sigma^2 I_n & 0\\ 0 & 0\end{bmatrix}\phi + \begin{bmatrix}\sigma^2 I_n & 0\\ 0 & 0\end{bmatrix}\hat{\phi}\right\}
+ \rho^2\big(\alpha_1\|\beta\|^2 + \alpha_2\big)\\
&= \beta^T(I - 2\rho D)\beta + \rho^2\alpha_1\|\beta\|^2 + \rho^2\alpha_2
\end{aligned}\qquad \text{(B.10)}$$

(the $\sigma^2$ cross terms cancel because $\hat{\phi} - \phi = \beta$),


where

$$D = E\begin{bmatrix}xx^T & xu^T\\ ux^T & uu^T\end{bmatrix}.$$

Based on the condition (C11), $D$ is a symmetric positive definite matrix and has a minimum eigenvalue $\lambda > 0$. Therefore, we can obtain (2.52) by using

$$E\big\{\|\tilde{\phi}_{k+n}\|^2\big\} \le \big(1 - 2\rho\lambda + \rho^2\alpha_1\big)\, E\big\{\|\tilde{\phi}_{k-1}\|^2\big\} + \rho^2\alpha_2. \qquad \text{(B.11)}$$

The above equation returns us to the proof [15] of the convergence theorem for the parameter estimation algorithm using the Robbins–Monro stochastic approximation. Therefore, under the condition (A11) on the gain coefficient,

$$\lim_{k\to\infty} E\big\{\|\hat{\phi}_k - \phi\|^2\big\} = 0$$

holds.
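The scalar recursion (B.11) makes the role of the gain condition visible numerically. Below is a minimal sketch iterating the bound with an illustrative decaying gain $\rho_k = 0.5/k$; the constants $\lambda$, $\alpha_1$, $\alpha_2$ are placeholders, not values derived in the thesis.

```python
# Iterate the bound (B.11): m <- (1 - 2*rho*lam + rho**2 * a1) * m + rho**2 * a2,
# with a decaying gain as required by condition (A11).
lam, a1, a2 = 0.8, 1.5, 2.0      # placeholders: min eigenvalue of D and the (B.9) constants
m = 10.0                          # initial E{ ||phi_tilde||^2 }
for k in range(1, 100001):
    rho = 0.5 / k                 # illustrative gain sequence
    m = (1.0 - 2.0 * rho * lam + rho**2 * a1) * m + rho**2 * a2
print(m)                          # decays toward 0, consistent with the limit above
```

With a constant gain $\rho$ the same recursion settles at a nonzero floor of order $\rho\alpha_2/(2\lambda)$; it is the decay of $\rho_k$ that drives the mean-squared error all the way to zero.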



Appendix B

Interpretation of Regularity Conditions

This Appendix provides comments on some of the conditions of ASP relative to other adaptive SA approaches. In the confines of a short discussion, it is obviously not possible to provide a detailed discussion of all conditions of all known adaptive approaches. Nevertheless, we hope to convey a flavor of the relative nature of the conditions.

As discussed in Sec. 2.9, some of the conditions of ASP depend on $\hat{\theta}_k$ itself, creating a type of circularity (i.e., direct conditions on the quantity being analyzed). This circularity has been discussed elsewhere, since other SA algorithms also have dependent conditions. Some of the ASP conditions can be eliminated or simplified if the conditions of the lemma in Sec. 2.9 hold. The foremost lemma condition is that $\hat{\theta}_k$ be uniformly bounded. Of course, this uniform boundedness condition is itself a circular condition, but it helps to simplify the other conditions of the theorems that are dependent on $\hat{\theta}_k$, since the $\hat{\theta}_k$ dependence can be replaced by an assumption that these other conditions hold uniformly over all $\theta$ in the bounded set guaranteed to contain $\hat{\theta}_k$ (e.g., the current assumption C.3, that $g(\theta)$ be twice continuously differentiable in neighborhoods of the estimates $\hat{\theta}_k$, can be replaced by an assumption that $g(\theta)$ is twice continuously differentiable on some bounded set known to contain $\hat{\theta}_k$). If the lemma applies, condition C.5 (on the i.o. behavior of $\hat{\theta}_k$) is unnecessary.

In showing convergence and asymptotic normality, one might wonder whether other adaptive algorithms could avoid conditions that depend on $\hat{\theta}_k$, and avoid alternative conditions that are similarly undesirable. Based on currently available adaptive approaches, the answer appears to be "no." As an illustration, let us analyze one of the more powerful results on adaptive algorithms, the result in Wei [48].


The Wei [48] approach is restricted to the SG/root-finding setting, as opposed to the more general setting for ASP that encompasses both gradient-free and SG/root-finding problems. The approach is based on 2p measurements of $g(\theta)$ at each iteration to estimate the Jacobian (Hessian) matrix. Some of the conditions in Wei [48] are similar to conditions for ASP (e.g., decaying gain sequences and smoothness of the functions involved), while other conditions are more stringent (the restriction to only the root-finding setting and the requirement for i.i.d. measurement noise). There are also conditions in ASP that are not required in Wei [48], principally those associated with "nice" behavior of the user-specified quantities (bounded moments, etc.), the steepness conditions C.4 and C.7 (similar to standard conditions in some other adaptive approaches, e.g., Ruppert [14]), and limits on the amount of bouncing in "big steps" around the solution (the i.o. condition C.5). An additional key assumption in Wei [48] is the symmetric function condition on the Jacobian (or Hessian) matrix:

$$H(\theta)^T H(\theta') + H(\theta')^T H(\theta) > 0, \quad \forall\, \theta, \theta'. \qquad \text{(D.1)}$$

This, unfortunately, is a stringent condition that may be easily violated. In the optimization case (where $H$ is a Hessian), this condition may fail even for benign (e.g., convex) loss functions. Consider, for example, a case with $\theta = (x, y)^T$ and a simple convex loss function $L(\theta) = x^4 + x^2 + y^2 + xy$. Letting $\theta = (0, 0)^T$ and $\theta' = (2, 0)^T$, we have

$$H(\theta)H(\theta')^T + H(\theta')H(\theta)^T = \begin{bmatrix}202 & 56\\ 56 & 10\end{bmatrix}$$
which is not positive definite, violating condition (D.1). Aside from the fact that this condition may be easily violated, it is also generally impossible to check in practice because it requires knowledge of the true $H(\theta)$ over the whole domain; this, of course, is the very quantity that is being estimated. The requirement for such prior knowledge is also apparent in other adaptive approaches discussed in Ruppert [14] and Fabian [19]. Given the above, it is clear that neither ASP nor Wei [48] (nor others) have uniformly "easier" conditions for their respective approaches. The inherent difficulty in establishing theoretical properties of adaptive approaches comes from the need to couple the estimates for the parameters of interest and for the Hessian/Jacobian matrix.
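The counterexample above is easy to verify numerically. The short sketch below (an illustrative check in NumPy, with symbols as in the text) rebuilds the two Hessians of $L(\theta) = x^4 + x^2 + y^2 + xy$ and confirms that the symmetrized product in (D.1) has a negative eigenvalue.

```python
import numpy as np

def H(x, y):
    """Hessian of L(theta) = x**4 + x**2 + y**2 + x*y at theta = (x, y)."""
    return np.array([[12.0 * x**2 + 2.0, 1.0],
                     [1.0,               2.0]])

A = H(0, 0) @ H(2, 0).T + H(2, 0) @ H(0, 0).T   # left-hand side of (D.1)
print(A)                                         # [[202. 56.] [56. 10.]]
print(np.linalg.eigvalsh(A))                     # one eigenvalue is negative
```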


This tends to lead to nontrivial regularity conditions, as seen in the $\hat{\theta}_k$-dependent conditions of ASP and in the stringent conditions that have appeared in the literature for other approaches. There appear to be no easy conditions for establishing rigorous properties of adaptive algorithms. However, given that all of these approaches have a strong intuitive appeal based on analogies to deterministic optimization, the needs of practical users will focus less on the nuances of the regularity conditions and more on the cost of implementation (e.g., the number of function measurements needed), the ease of implementation, and the practical performance.



List of Publications Directly Related to the Dissertation

1) Jorge Medina Martínez, Mariko Nakano Miyatake, Kazushi Nakano, Héctor Pérez Meana: Low Complexity Cascade Lattice IIR Adaptive Filter Algorithms using Simultaneous Perturbations Approach, WSEAS Transactions on Communications, Vol. 10, No. 10, pp. 1058-1068 (2005). (Related to the contents of Chap. 4.)

2) Jorge Ivan Medina Martinez, Kazushi Nakano, Kohji Higuchi: Parameter Estimation using a Modified Version of SPSA Algorithm Applied to State Space Models, IEEJ Transactions on Industry Applications, Vol. 129, No. 12/Sec. D (2009). (Related to the contents of Chap. 5.)

3) Jorge Ivan Medina Martinez, Kazushi Nakano, Sawut Umerujan: Vibration Suppression Control of a Flexible Arm using Non-linear Observer with Simultaneous Perturbation Stochastic Approximation, Journal of Artificial Life and Robotics, Vol. 14 (2009). (Related to the contents of Chap. 3.)

4) Jorge Ivan Medina Martinez, Kazushi Nakano, Kohji Higuchi: New Approach for IIR Adaptive Lattice Filter Structure using Simultaneous Perturbation Algorithm, IEEJ Transactions on Industry Applications, Vol. 130, No. 4/Sec. D (2010). (Related to the contents of Chap. 4.)

List of Other Publications and Presentations

- Presentations at International Symposia

1) Jorge Ivan Medina Martinez, Kazushi Nakano: Neural Control of a Flexible Arm System using Simultaneous Perturbation Method, SICE 7th Annual Conference on Control Systems, March 6-8, 2007, Chofu, Tokyo, Japan.


2) Jorge Ivan Medina Martinez, Kazushi Nakano, Sawut Umerujan: Simultaneous Perturbation Approach to Neural Control of a Flexible System, ECTI-CON 2007, Mae Fah Luang University, Chiang Rai, Thailand, May 9-12, 2007.

3) Jorge Ivan Medina Martinez, Kazushi Nakano, Sawut Umerujan: Cascade Lattice IIR Adaptive Filter Structure using Simultaneous Perturbation Method for Self-Adjusting SHARF Algorithm, International Conference on Instrumentation, Control and Information Technology (SICE Annual Conference 2008), Aug. 20-22, The University of Electro-Communications, Chofu, Tokyo, Japan. (Related to the contents of Chap. 5.)

4) Jorge Ivan Medina Martinez, Sawut Umerujan, Kazushi Nakano: Application of Non-linear Observer with Simultaneous Perturbation Stochastic Approximation Method to Single Flexible Link SMC, International Conference on Instrumentation, Control and Information Technology (SICE Annual Conference 2008), Aug. 20-22, The University of Electro-Communications, Chofu, Tokyo, Japan. (Related to the contents of Chap. 4.)

5) Jorge Ivan Medina Martinez, Sawut Umerujan, Kazushi Nakano: Vibration Suppression Control of a Flexible Arm using Non-linear Observer with Simultaneous Perturbation Stochastic Approximation, The Fourteenth International Symposium on Artificial Life and Robotics (AROB 14th '09), Feb. 5-7, 2009, B-Con Plaza, Beppu, Oita, Japan. (Related to the contents of Chap. 4.)

6) Jorge Ivan Medina Martinez, Kazushi Nakano, Kohji Higuchi: Parameters Estimation in Neural Networks by Improved Version of Simultaneous Perturbation Stochastic Approximation Algorithm, ICCAS-SICE 2009, August 18-21, 2009, Fukuoka, Japan.


- Other Publications, Presentations and Submissions

1) Jorge Ivan Medina Martinez, Kazushi Nakano: Development of an IIR Adaptive Filter with Low Computational Complexity using Simultaneous Perturbation Method, 2nd KMUTT-UEC Workshop, May 14, 2007, King Mongkut's University of Technology Thonburi, Bangkok, Thailand.

2) Jorge Ivan Medina Martinez, Kazushi Nakano: A Fast Converging and Self-Adjusting SHARF Algorithm using Simultaneous Perturbation Method and Vibration Control of a Flexible Arm using Non-linear Observer with Simultaneous Perturbation Stochastic Approximation Method, 3rd KMUTT-UEC Workshop, August 19, 2008, The University of Electro-Communications, Chofu, Tokyo, Japan.



Acknowledgements

This dissertation is a summary of my doctoral study at the Department of Electronic Engineering of the University of Electro-Communications. This work would not have been accomplished without the help of many people; the following is a brief account of some, but not all, who deserve my thanks.

I would like to extend my deepest thanks to Prof. Kazushi Nakano for taking on the burden of supervising my research work in his laboratory for so long, right from the beginning in October 2006 up to the conclusion of this work in December 2009. It has been my pleasure to do this research under his supervision, and I have also greatly enjoyed the life of research work.

My special thanks are due to all the reviewers:

Prof. Kohji Higuchi
Prof. Masahide Kaneko
Prof. Tetsuro Kirimoto
Prof. Takayuki Inaba
Prof. Seiichi Shin

Also, my special thanks go to our research group, both past and present, for their helpful cooperation over the years. They have all been very kind to me and provided a nice and friendly environment during these years.

My gratitude goes to the Ministry of Education, Science and Culture of Japan, which granted me this opportunity and financially supported this work. I am thankful to the administrative staff of the Department of Electronic Engineering and the Foreign Students Affairs Office at the University of Electro-Communications for their amiability and effective support.

Finally, I would like to give special thanks to my family and friends for their love, warm support and encouragement.



Author Biography

Jorge Ivan Medina Martinez was born in Mexico City, Mexico, on April 23, 1978. He received the Master of Science degree from the National Polytechnic Institute, Mexico City, Mexico, in 2005. Since 2006, he has been with the Department of Electronic Engineering of the University of Electro-Communications, Tokyo, Japan, working toward his Ph.D. degree. His research interests include signal processing and control using SPSA.
interests include signal processing and control using <strong>SPSA</strong>.<br />

165
