
Approximation of Hessian Matrix for
Second-order SPSA Algorithm Addressed
Toward Parameter Optimization
in Non-linear Systems

JORGE IVAN MEDINA MARTINEZ

Doctoral Program in Electronic Engineering
Graduate School of Electro-Communications
The University of Electro-Communications

A thesis submitted for the degree of
DOCTOR OF ENGINEERING

The University of Electro-Communications
December 2009


Approximation of Hessian Matrix for
Second-order SPSA Algorithm Addressed
Toward Parameter Optimization
in Non-linear Systems

Approved by Supervisory Committee:

Chairperson : Prof. Kazushi Nakano
Member : Prof. Kohji Higuchi
Member : Prof. Masahide Kaneko
Member : Prof. Tetsuro Kirimoto
Member : Prof. Takayuki Inaba
Member : Prof. Seiichi Shin

Copyright 2009 by Jorge Ivan Medina Martinez
All Rights Reserved


Approximation of Hessian Matrix for
Second-order SPSA Algorithm Addressed
Toward Parameter Optimization
in Non-linear Systems

(2次型同時摂動確率近似アルゴリズムのヘッセ行列推定とその非線形システムにおけるパラメータ最適化への応用)

Jorge Ivan Medina Martinez

Abstract in Japanese (translated)

The system identification problem is, when the structure of the system is known, the problem of estimating unknown parameters from noisy observation data. In recent years in particular, non-linear models have been widely used for state estimation, control, and simulation, and, motivated by the success of non-linear model predictive control, the refinement of models based on first principles or on neural networks has been actively discussed. The identification problem for such non-linear and complex systems reduces to the problem of optimizing some error function with respect to many unknown parameters, and efficient optimization methods for this purpose are in demand.

Many algorithms have been proposed for this task, but when they are applied to complex systems with many parameters, such as non-linear state-space models, they incur an enormous computational cost. This thesis focuses on the facts that existing algorithms do not have sufficient stability in parameter estimation for complex systems and that their computation processes are complicated and costly, and it proposes a new estimation algorithm. We first focus on the simultaneous perturbation stochastic approximation (SPSA) algorithm, which is advantageous in computational complexity and cost, is easy to implement, and has stable convergence. When it is applied to complex systems, however, several problems are encountered. We therefore develop a new method that improves the SPSA algorithm while retaining its stable convergence and low computational cost; that is, the SPSA algorithm is improved based on a comparison between the 1st-Order SPSA (1-SPSA) and 2nd-Order SPSA (2-SPSA) methods from the viewpoint of the Hessian of the error function.



The algorithm proposed here (modified SPSA) removes the non-positive definiteness of an ill-conditioned Hessian and, in order to guarantee positive definiteness, adopts a procedure that uses the Fisher information matrix to suppress the error amplification caused by inverting an ill-conditioned Hessian. This also brings a substantial improvement in convergence to methods whose Hessians are well conditioned. Regarding asymptotic convergence, it is shown that the ratio of the mean square error of the modified SPSA method to that of the 2-SPSA method is smaller than for any other method, except in the case of a perfectly well-conditioned Hessian. Furthermore, if only the diagonal elements of the Hessian are estimated, a substantial reduction in computational cost is achieved compared with other methods.

In the modified SPSA method as well, all parameters are perturbed simultaneously, so the parameters can be updated with only two evaluations of the error function regardless of the parameter dimension. Thus, a substantial reduction in computational cost is possible with this SPSA algorithm. This thesis gives a convergence theorem for the proposed algorithm and carries out simulations to demonstrate the feasibility of parameter estimation with it.

Finally, three practical applications of the proposed method are considered. The first is the angle control problem of a one-link flexible arm aimed at vibration suppression. For this control purpose, a model reference sliding mode control (MR-SMC) method using a non-linear VSS (Variable Structure System) observer is proposed. The parameters of the non-linear observer are optimized using the modified 2-SPSA algorithm proposed here, and the design of the MR-SMC controller is discussed as well. The effectiveness of this method is confirmed by vibration control simulations. The second is an application to adaptive IIR filter algorithms. These correspond to the SHARF (Simple Hyperstable Adaptive Recursive Filter) and SM (Steiglitz-McBride) algorithms, and the coefficient parameters of the output-error-based identification filter are obtained using the proposed modified 2-SPSA algorithm. The effectiveness of this algorithm is shown by comparison with stochastic approximation (SA) algorithms. The last example applies the modified SPSA algorithm to the problem of estimating unknown static parameters of non-linear state-space systems. The proposed algorithm yields maximum likelihood estimates, and its performance is verified through comparison with the finite difference stochastic approximation (FDSA) algorithm.



Approximation of Hessian Matrix for
Second-order SPSA Algorithm Addressed
Toward Parameter Optimization
in Non-linear Systems

Jorge Ivan Medina Martinez

Abstract

The research presented in this dissertation is motivated by the fact that many widely used algorithms do not offer sufficient stability when estimating a large number of parameters in non-linear systems or other kinds of systems, and they also have high computational complexity and cost. We have therefore chosen the simultaneous perturbation stochastic approximation (SPSA) algorithm, which has several important advantages such as low computational complexity and stable convergence. Nevertheless, the typical SPSA algorithm runs into difficulties when it is applied to non-linear and complex systems. Therefore, this research proposes a novel extension of the SPSA algorithm based on the features and disadvantages of the first-order and second-order SPSA (1st-SPSA and 2nd-SPSA) algorithms and on comparisons made from the perspective of the loss-function Hessian. These comparisons matter because, at finite iterations, the convergence rate depends on the matrix conditioning of the loss-function Hessian. It is shown that 2nd-SPSA converges more slowly for a loss function with an ill-conditioned Hessian than for one with a well-conditioned Hessian, whereas the convergence rate of 1st-SPSA is less sensitive to the conditioning of the loss-function Hessian.

A main disadvantage of the 1st-SPSA and 2nd-SPSA algorithms is that the error for a loss function with an ill-conditioned Hessian is greater than for one with a well-conditioned Hessian. Our proposed modified version of 2nd-SPSA (M2-SPSA) eliminates the error amplification caused by the inversion of an ill-conditioned Hessian at finite iterations, which leads to significant improvements in its convergence rate for problems with an ill-conditioned Hessian matrix and for complex systems. Asymptotically, the efficiency analysis shows that our proposed SPSA is also superior to 2nd-SPSA in terms of its convergence rate coefficients. It is


shown that, for the same asymptotic convergence rate, the ratio of the mean square errors of our proposed SPSA to 2nd-SPSA is always less than one, except for a perfectly conditioned Hessian. We have also proposed to reduce the computational expense by evaluating only a diagonal estimate of the eigenvalues of the Hessian matrix. In this research, a new mapping is suggested for the 2nd-SPSA algorithm in order to eliminate non-positive definiteness while preserving key spectral properties of the estimated Hessian, using the Fisher information matrix. After defining the M2-SPSA algorithm, we apply it to parameter estimation. Because M2-SPSA perturbs all parameters simultaneously, it is possible to update the parameters with only two measurements of an evaluation function, regardless of the dimension of the parameter vector. A convergence theorem for the proposed algorithm is presented, and a simulation result also demonstrates the feasibility of the identification scheme proposed here. To show the efficiency of M2-SPSA, we present three important applications in which our proposed algorithm estimates and designs the parameters.

In the first application, our proposed algorithm is applied to control, in this case vibration reduction in the model considered here. The main objective is vibration control of a one-link flexible arm system. A variable structure system (VSS) non-linear observer is proposed in order to reduce the oscillation in controlling the angle of the flexible arm. The non-linear observer parameters are optimized using a modified version of the SPSA algorithm. This SPSA algorithm is especially useful when the number of parameters to be adjusted is large, and it makes it possible to estimate them very efficiently. For the vibration and position control, a model reference sliding-mode control (MR-SMC) is presented, whose parameters are obtained by our proposed M2-SPSA algorithm. The simulations show that vibration control of a one-link flexible arm system can be achieved more efficiently using our proposed methods.

In the second application, our proposed algorithm is applied to signal processing, in this case IIR lattice filters. Adaptive infinite impulse response (IIR), or recursive, filters are less attractive mainly because of stability issues and the difficulties associated with their adaptive algorithms. Therefore, in this research adaptive IIR lattice filters are studied in order to devise algorithms that preserve the stability properties of the corresponding direct-form schemes. We analyze the local properties of stationary points, and a transformation achieving this goal is suggested, which


yields algorithms that can be efficiently implemented. Application to the Steiglitz-McBride (SM) and Simple Hyperstable Adaptive Recursive Filter (SHARF) algorithms is presented. In addition, our proposed M2-SPSA algorithm is used to obtain the coefficients in lattice form more efficiently and with lower computational cost and complexity. The results are compared with previous lattice versions of these algorithms, which may fail to preserve the stability of stationary points.

Finally, the M2-SPSA algorithm is applied to the problem of estimating unknown static parameters in non-linear state-space models. The M2-SPSA algorithm can generate maximum likelihood estimates efficiently. The performance of the proposed algorithm is assessed through simulation, where M2-SPSA is compared with finite difference stochastic approximation (FDSA) in order to show its efficiency.

Therefore, in this dissertation we have proposed a modification of the SPSA algorithm whose main objectives are to estimate the parameters of complex systems, improve the convergence, and reduce the computational cost. This modification of the simultaneous perturbation approach seems particularly useful when the number of parameters to be identified is very large or when the observed values of what is to be identified can only be obtained through an unknown observation system.

Finally, this dissertation is organized as follows. Chapter 1 gives an introduction to SPSA, explaining its main concepts, advantages, disadvantages, recursions, formulation, and implementation. Our proposed SPSA algorithm is analyzed in detail in Chap. 2, where the asymptotic normality, the Hessian estimation, and the efficiency of M2-SPSA relative to the previous versions of SPSA are shown. In addition, we show how the M2-SPSA algorithm is applied to parameter estimation and demonstrate its efficiency in several simple numerical simulations. The first important application of the M2-SPSA algorithm, in the control area, is described in Chap. 3, where M2-SPSA is applied to parameter estimation for methods that control the vibration of the proposed system. Another application of the M2-SPSA algorithm is described in Chap. 4, where our proposed algorithm is applied to signal processing and M2-SPSA calculates the coefficients of some adaptive algorithms. In the final application, described in Chap. 5, the M2-SPSA algorithm is applied to the problem of estimating unknown static parameters in non-linear state-space models. Finally, the conclusions and future work are given in Chap. 6.



Contents

1. Introduction 1
1.1 Motivation and Background 1
  1.1.1 Motivation 1
  1.1.2 Background 2
1.2 Overview of Stochastic Algorithms 5
1.3 Introduction to SPSA Algorithm 7
1.4 Features of SPSA 10
1.5 Application Areas 11
1.6 Formulation of SPSA Algorithm 12
1.7 Basic Assumptions of SPSA Algorithm 14
1.8 Versions of SPSA Algorithms 15
2. Proposed SPSA Algorithm 19
2.1 Overview of Modified 2nd-SPSA Algorithm 19
2.2 SPSA Algorithm Recursions 20
2.3 Proposed Mapping 22
2.4 Description of Proposed SPSA Algorithm 26
2.5 Asymptotic Normality 27
2.6 Fisher Information Matrix 31
  2.6.1 Introduction to Fisher Information Matrix 31
  2.6.2 Two Key Properties of the Information Matrix: Connections to Covariance Matrix of Parameter Estimates 33
  2.6.3 Estimation of F(θ_n) 34
2.7 Efficiency Between 1st-SPSA, 2nd-SPSA and M2-SPSA 40
2.8 Implementation Aspects 41
2.9 Strong Convergence 44
2.10 Asymptotic Distribution and Efficiency Analysis 50
2.11 Perturbation Distribution for M2-SPSA 54
2.12 Parameter Estimation 57
  2.12.1 Introduction 57
  2.12.2 System to be Applied 64
  2.12.3 Convergence Theorem 69


2.13 Simulation 70
  2.13.1 Simulation 1 70
  2.13.2 Simulation 2 72
  2.13.3 Simulation 3 75
3. Vibration Suppression Control of a Flexible Arm using Non-linear Observer with SPSA 79
3.1 Introduction 79
3.2 Dynamic Modeling of a Single Link Robot Arm 81
  3.2.1 Dynamic Model 81
  3.2.2 Equation of Motion and State Equations 84
3.3 Design of Non-Linear Observer 85
3.4 Model Reference Sliding Mode Controller 87
3.5 Simulation 91
4. Lattice IIR Adaptive Filter Structure Adapted by SPSA Algorithm 99
4.1 Introduction 99
4.2 Procedure of Improved Algorithm 101
4.3 Lattice Structure 104
4.4 Adaptive Algorithm 105
  4.4.1 SHARF Algorithm 105
  4.4.2 Steiglitz-McBride Algorithm 108
4.5 Simulation 109
  4.5.1 SHARF Algorithm 109
  4.5.2 Steiglitz-McBride Algorithm 110
5. Parameter Estimation using a Modified Version of SPSA Algorithm Applied to State-Space Models 113
5.1 Introduction 113
5.2 Implementation of SPSA Toward Proposed Model 115
  5.2.1 State-Space Model 115
  5.2.2 Gradient-free Maximum Likelihood Estimation 118
5.3 Parameter Estimation by SPSA and FDSA 120
5.4 Simulation 122
6. Conclusions and Future Work 125
6.1 Conclusions 125
6.2 Future Work 129
References 131
Appendix A 139
Appendix B 155
List of Publications Directly Related to the Dissertation 159
Acknowledgements 163
Author Biography 165



List <strong>of</strong> Figures<br />

Fig. 1.1 Example <strong>of</strong> stochastic optimization algorithm minimizing loss function L θ 1<br />

θ ) 3<br />

(<br />

, 2<br />

Fig. 1.2 Per<strong>for</strong>mance <strong>of</strong> <strong>SPSA</strong> algorithm (two measurements). 9<br />

Fig. 2.1 The two-recursions in 2nd-<strong>SPSA</strong> <strong>Algorithm</strong> 21<br />

Fig. 2.2 Diagram <strong>of</strong> method <strong>for</strong> <strong>for</strong>ming estimate F ( )<br />

39<br />

M , N<br />

θ<br />

Fig. 2.3 Split uni<strong>for</strong>m distribution 56<br />

Fig. 2.4 Inverse split uni<strong>for</strong>m distribution 57<br />

Fig. 2.5 Symmetric double triangular distribution 57<br />

Fig. 2.6 Identification with an unknown observation system 65<br />

Fig. 2.7 Identification results (with bias compensation) 75<br />

Fig. 2.8 Identification results (without bias compensation) 76<br />

Fig. 3.1 One-link flexible arm 82<br />

Fig. 3.2 Sliding mode surface 88<br />

Fig. 3.3 Block diagram <strong>of</strong> the sliding mode control system incorporating the non-linear<br />

observer 91<br />

Fig. 3.4 Motor angle. Without M2-<strong>SPSA</strong> and MR-SMC (dotted-line (-.-)).With RM-SA<br />

algorithm and MR-SMC (dashed-line (- -)). With LS algorithm and MR-SMC (dash-dot-line<br />

(-.)).With M2-<strong>SPSA</strong> and MR-SMC (solid-line (-)) 94<br />

Fig. 3.5 Tip position. Without M2-<strong>SPSA</strong> and MR-SMC (dotted-line (-.-)).With RM-SA<br />

algorithm and MR-SMC (dashed-line (- -)). With LS algorithm and MR-SMC (dash-dot-line<br />

(-.)).With M2-<strong>SPSA</strong> and MR-SMC (solid-line (-)) 95<br />

Fig. 3.6 Tip Velocity. Without M2-<strong>SPSA</strong> and MR-SMC (dotted-line (-.-)).With RM-SA<br />

algorithm and MR-SMC (dashed-line (- -)). With LS algorithm and MR-SMC (dash-dot-line<br />

(-.)).With M2-<strong>SPSA</strong> and MR-SMC (solid-line (-)) 95<br />

Fig. 3.7 Control torque. Without M2-<strong>SPSA</strong> and MR-SMC (dotted-line (-.-)).With RM-SA<br />

algorithm and MR-SMC (dashed-line (- -)). With LS algorithm and MR-SMC (dash-dot-line<br />

(-.)).With M2-<strong>SPSA</strong> and MR-SMC (solid-line (-)) 96<br />

Fig. 3.8 Motor angle. Simulation using x 1<br />

with M2-<strong>SPSA</strong> and MR-SMC (solid-line).<br />

Simulation using x m<br />

with M2-<strong>SPSA</strong> and MR-SMC (dashed-line) 96<br />

Fig. 3.9 Tip position. Simulation using x 3<br />

with M2-<strong>SPSA</strong> and MR-SMC (solid-line).<br />

Simulation using ˆx 3<br />

with M2-<strong>SPSA</strong> and MR-SMC (dashed-line) 96<br />

Fig. 3.10 Tip velocity. Simulation using x 4<br />

with M2-<strong>SPSA</strong> and MR-SMC (solid-line).<br />

Tip velocity. Simulation using ˆx 4<br />

with M2-<strong>SPSA</strong> and MR-SMC (dashed-line) 97<br />


Fig. 4.1 Block diagram of the SHARF lattice algorithm 107
Fig. 4.2 Block diagram of the SM lattice algorithm 109
Fig. 4.3 Convergence of the proposed SHARF algorithm and M2-SPSA 111
Fig. 4.4 Instability of the existing SHARF algorithm 111
Fig. 4.5 Instability of the existing SM algorithm 112
Fig. 4.6 Convergence of the proposed SM algorithm and M2-SPSA 112
Fig. 5.1 ML parameter estimate θ_k = [θ_{1,k}, θ_{2,k}, θ_{3,k}]^T for the bi-modal non-linear model using M2-SPSA. The true parameters in the model are defined by θ* = [0.5, 25, 8]^T 122
Fig. 5.2 Parameter estimation using 2nd-SPSA and FDSA 123



List <strong>of</strong> Tables<br />

Table 2.1 Characteristics <strong>of</strong> the perturbation distributions 55<br />

Table 2.2 Normalized loss values <strong>for</strong> 1st-<strong>SPSA</strong> and M2-<strong>SPSA</strong> with σ = 0.001;<br />

90% confidence interval shown in [⋅]<br />

72<br />

Table 2.3. Values <strong>of</strong><br />

Table 2.4 Values <strong>of</strong><br />

*<br />

θˆ<br />

k<br />

− θ<br />

with no measurement noise 74<br />

ˆ *<br />

θ − θ<br />

0<br />

*<br />

θˆ<br />

k<br />

− θ<br />

with measurement noise 74<br />

ˆ *<br />

θ − θ<br />

0<br />

Table 2.5 Comparison <strong>of</strong> estimators 76<br />

Table 3.1 Comparison <strong>of</strong> estimators (non-linear observer) 92<br />

Table 3.2 Comparison <strong>of</strong> estimators (MR-SMC) 92<br />

Table 3.3 Per<strong>for</strong>mance comparisons among M2-<strong>SPSA</strong>, RM-SA and LS 93<br />

Table 5.1 Computational statistics 123<br />

Table 6.1. Comparison <strong>of</strong> algorithms (per<strong>for</strong>mance) 127<br />

Table 6.2. Comparison <strong>of</strong> algorithms (computational cost) 128<br />



List <strong>of</strong> Abbreviations<br />

<strong>SPSA</strong><br />

1st-<strong>SPSA</strong><br />

2nd-<strong>SPSA</strong><br />

SP<br />

SA<br />

M2-<strong>SPSA</strong><br />

NN<br />

R-M<br />

FDSA<br />

LMS<br />

L-M<br />

ASP<br />

SG<br />

i.o.<br />

a.s.<br />

FIM<br />

MCNR<br />

MSE<br />

BP<br />

RMS<br />

MR-SMC<br />

VSS<br />

LS<br />

SM<br />

SHARF<br />

IIR<br />

FIR<br />

ODE<br />

HARF<br />

MSOE<br />

SMC<br />

ML<br />

Simultaneous perturbation stochastic approximation<br />

First-<strong>order</strong> <strong>of</strong> simultaneous perturbation stochastic approximation<br />

<strong>Second</strong>-<strong>order</strong> <strong>of</strong> simultaneous perturbation stochastic approximation<br />

Simultaneous perturbation<br />

Stochastic approximation<br />

Modified version <strong>of</strong> 2nd-<strong>SPSA</strong><br />

Neural network<br />

Robbins-Monroe<br />

Finite difference stochastic approximation<br />

Least mean square<br />

Levenberg-Marquardt<br />

Adaptive simultaneous perturbation<br />

Stochastic gradient<br />

Infinitely <strong>of</strong>ten<br />

Almost sure<br />

Fisher in<strong>for</strong>mation matrix<br />

Monte Carlo Newton-Raphson<br />

Mean Squire error<br />

Back-propagation<br />

Root mean square error<br />

Model reference-sliding mode control<br />

Variable structure system<br />

Least squares<br />

Steiglitz-McBride<br />

Simple hyperstable adaptive recursive filter<br />

Infinite impulse response<br />

Finite impulse response<br />

Ordinary differential equation<br />

Hyperstable adaptive recursive filter<br />

Mean-square output error<br />

Sequential Monte Carlo<br />

Maximum likelihood<br />



Chapter 1

Introduction

Multivariate stochastic optimization plays a major role in the analysis and control of many engineering systems [1]. In almost all real-world optimization problems, it is necessary to use a mathematical algorithm that iteratively seeks out the solution, because an analytical (closed-form) solution is rarely available. In this spirit, the "simultaneous perturbation stochastic approximation (SPSA)" method for difficult multivariate optimization problems has been developed. SPSA has recently attracted considerable international attention in areas such as statistical parameter estimation, feedback control, simulation-based optimization, signal and image processing, and experimental design. The essential feature of SPSA, which accounts for its power and relative ease of implementation, is the underlying gradient approximation that requires only two measurements of the objective function regardless of the dimension of the optimization problem. This feature allows for a significant decrease in the cost of optimization, especially in problems with a large number of variables to be optimized.

1.1 Motivation and Background

1.1.1 Motivation

The simultaneous perturbation stochastic approximation (SPSA) method is a very useful tool for solving optimization problems in which the cost function is analytically unavailable or difficult to compute. The method is essentially a randomized version of the Kiefer-Wolfowitz method, in which the gradient is estimated using only two measurements of the cost function at each iteration. SPSA is particularly efficient in problems of high dimension and in problems where the cost function must be estimated through expensive simulations. Our motivation is based on the features of the SPSA algorithm, which can be oriented toward parameter estimation in complex systems, where many existing algorithms have serious disadvantages. It is often necessary to estimate the parameters of a model of an unknown system. Various techniques exist to accomplish this task, including Kalman and Wiener filtering, least mean square (LMS) algorithms, and the Levenberg-Marquardt (L-M) algorithm. These techniques require an analytic form of the gradient with respect to the parameters to be estimated and usually have high computational complexity and cost [2]. There are also other kinds of parameter estimation algorithms whose


convergence is not stable because they cannot manage a great volume of parameters to be estimated. Therefore, the SPSA algorithm is convenient for these kinds of complex systems with a large number of parameters.

1.1.2 Background

This dissertation is an introduction to the simultaneous perturbation stochastic approximation (SPSA) algorithm for stochastic optimization of multivariate systems. Optimization algorithms play a critical role in the design, analysis, and control of most engineering systems and are in widespread use in the work of many organizations. Before presenting the SPSA algorithm, we provide some general background on the stochastic optimization context of interest here.

The mathematical representation of most optimization problems is the minimization (or maximization) of some scalar-valued objective function with respect to a vector of adjustable parameters. The optimization algorithm is a step-by-step procedure for changing the adjustable parameters from some initial guess (or set of guesses) to a value that offers an improvement in the objective function [3][4]. Figure 1.1 depicts this process for a very simple case of only two variables, θ_1 and θ_2, where our objective function is a loss function to be minimized (without loss of generality, we will discuss optimization in the context of minimization because a maximization problem can be trivially converted to a minimization problem by changing the sign of the objective function). Most real-world problems would have many more variables. The illustration in Fig. 1.1 is a typical example of a stochastic optimization setting with noisy input information because the loss function value does not uniformly decrease as the iteration process proceeds (note the temporary increase in the loss value in the third step of the algorithm). Many optimization algorithms have been developed that assume a deterministic setting and that assume information is available on the gradient vector associated with the loss function (i.e., the gradient of the loss function with respect to the parameters being optimized). However, there has been a growing interest in recursive optimization algorithms that do not depend on direct gradient information or measurements. Rather, these algorithms are based on an approximation to the gradient formed from measurements (generally noisy) of the loss function. This interest has been motivated, for example, by problems in the adaptive control and statistical identification of complex systems, the optimization of processes by large Monte Carlo simulations, the training of recurrent neural networks, the recovery of images from noisy sensor data, and the design of complex queuing and discrete-event systems.


Fig. 1.1. Example of stochastic optimization algorithm minimizing loss function L(θ_1, θ_2).

This dissertation focuses on the case where such an approximation is going to be used as a result of direct gradient information not being readily available. Overall, gradient-free stochastic algorithms exhibit convergence properties similar to the gradient-based stochastic algorithms [e.g., Robbins-Monro stochastic approximation (R-M SA)] while requiring only loss function measurements [5][6]. A main advantage of such algorithms is that they do not require the detailed knowledge of the functional relationship between the parameters being adjusted (optimized) and the loss function being minimized that is required in gradient-based algorithms. Such a relationship can be notoriously difficult to develop in some areas (e.g., non-linear feedback controller design), whereas in other areas (such as Monte Carlo optimization or recursive statistical parameter estimation), there may be large computational savings in calculating a loss function relative to that required in calculating a gradient. To elaborate on the distinction between algorithms based on direct gradient measurements and those based on gradient approximations from measurements of the loss function, the prototype gradient-based algorithm is R-M SA, which may be considered a generalization of such techniques as deterministic steepest descent and Newton-Raphson, neural network back-propagation (BP), and infinitesimal perturbation analysis-based optimization for discrete-event systems [9]. The gradient-based algorithms rely on direct measurements of the gradient of the loss function with respect to the parameters being optimized. These measurements typically yield an estimate of the gradient because the underlying data generally include added noise. Because it is not usually the case that one would obtain direct measurements of the gradient (with or without added noise) naturally in the course of operating or simulating a system, one must have detailed knowledge of the underlying system input-output relationships to calculate the R-M gradient estimate from basic system output measurements. In contrast, the approaches based on gradient


approximations require only conversion of the basic output measurements to sample values of the loss function, which does not require full knowledge of the system input-output relationships.

The classical method for gradient-free stochastic optimization is the Kiefer-Wolfowitz finite-difference SA (FDSA) algorithm [8]. Because of the fundamentally different information needed in implementing these gradient-based (R-M) and gradient-free algorithms, it is difficult to construct meaningful methods of comparison. As a general rule, however, the gradient-based algorithms will be faster to converge than those using loss-function-based gradient approximations when speed is measured in the number of iterations. Intuitively, this result is not surprising given the additional information required for the gradient-based algorithms. In particular, on the basis of asymptotic theory, the optimal rate of convergence measured in terms of the deviation of the parameter estimate from the true optimal parameter vector is of order k^(−1/2) for the gradient-based algorithms and of order k^(−1/3) for the algorithms based on gradient approximations, where k represents the number of iterations. (Special cases exist where the maximum rate of convergence for a non-gradient algorithm is arbitrarily close to, or equal to, k^(−1/2).)

In practice, of course, many other factors must be considered in determining which algorithm is best for a given circumstance, for the following three reasons: (1) It may not be possible to obtain reliable knowledge of the system input-output relationships, implying that the gradient-based algorithms may be either infeasible (if no system model is available) or undependable (if a poor system model is used). (2) The total cost to achieve effective convergence depends not only on the number of iterations required, but also on the cost needed per iteration, which is typically greater in gradient-based algorithms. (This cost may include greater computational burden, additional human effort required for determining and coding gradients, and experimental costs for model building such as labor, materials, and fuel.) (3) The rates of convergence are based on asymptotic theory and may not be representative of practical convergence rates in finite samples. For these reasons, one cannot say in general that a gradient-based search algorithm is superior to a gradient approximation-based algorithm, even though the gradient-based algorithm has a faster asymptotic rate of convergence (and with simulation-based optimization such as infinitesimal perturbation analysis requires only one system run per iteration, whereas the approximation-based algorithm may require multiple system runs per iteration). As a general rule, however, if direct gradient information is


conveniently and reliably available, it is generally to one's advantage to use this information in the optimization process. The focus of this dissertation is the case where such information is not readily available. The next section describes SPSA and the related FDSA algorithm. Then some of the theory associated with the convergence and efficiency of SPSA is summarized.

1.2 Overview of Stochastic Algorithms

This dissertation considers the problem of minimizing a (scalar) differentiable loss function L(θ), where θ is a p-dimensional vector and where the optimization problem can be translated into finding the minimizing θ* such that ∂L/∂θ = 0. This is the classical formulation of (local) optimization for differentiable loss functions. It is assumed that measurements of L(θ) are available at various values of θ. These measurements may or may not include added noise. No direct measurements of ∂L/∂θ are assumed available, in contrast to the R-M framework.

This section will describe the FDSA and SPSA algorithms. Although the emphasis of this dissertation is SPSA, the FDSA discussion is included for comparison because FDSA is a classical method for stochastic optimization. The SPSA and FDSA procedures are in the general recursive SA form:

θ̂_{k+1} = θ̂_k − a_k ĝ_k(θ̂_k)    (1.1)

where ĝ_k(θ̂_k) is the estimate of the gradient g(θ) ≡ ∂L/∂θ at the iterate θ̂_k based on the previously mentioned measurements of the loss function. Under appropriate conditions, the iteration in (1.1) will converge to θ* in some stochastic sense (usually "almost surely"); see, e.g., [7].

The essential part of (1.1) is the gradient approximation ĝ_k(θ̂_k). We discuss the two forms of interest here. Let y(·) denote a measurement of L(·) at a design level represented by the dot (i.e., y(·) = L(·) + noise) and let c_k be some (usually small) positive number. One-sided gradient approximations involve measurements y(θ̂_k) and y(θ̂_k + perturbation), whereas two-sided gradient approximations involve measurements of the form y(θ̂_k ± perturbation). The two general forms of gradient approximations for use in FDSA and SPSA are finite difference


and simultaneous perturbation (SP), respectively, which are discussed in the following paragraphs. For the finite-difference approximation, each component of θ̂_k is perturbed one at a time, and corresponding measurements y(·) are obtained. Each component of the gradient estimate is formed by differencing the corresponding y(·) values and then dividing by a difference interval. This is the standard approach to approximating gradient vectors and is motivated directly from the definition of a gradient as a vector of p partial derivatives, each constructed as the limit of the ratio of a change in the function value over a corresponding change in one component of the argument vector. Typically, the i-th component of ĝ_k(θ̂_k) (i = 1, 2, ..., p) for a two-sided finite-difference approximation is given by

ĝ_ki(θ̂_k) = [y(θ̂_k + c_k e_i) − y(θ̂_k − c_k e_i)] / (2 c_k)    (1.2)

where e_i denotes a vector with a one in the i-th place and zeros elsewhere (an obvious analogue holds for the one-sided version; likewise for the simultaneous perturbation form below), and c_k denotes a small positive number that usually gets smaller as k gets larger. The simultaneous perturbation has all elements of θ̂_k randomly perturbed together to obtain two measurements of y(·), but each component ĝ_ki(θ̂_k) is formed from a ratio involving the individual components in the perturbation vector and the difference in the two corresponding measurements. For two-sided simultaneous perturbation (SP), we have

ĝ_ki(θ̂_k) = [y(θ̂_k + c_k Δ_k) − y(θ̂_k − c_k Δ_k)] / (2 c_k Δ_ki)    (1.3)

where the distribution of the user-specified p-dimensional random perturbation vector Δ_k = (Δ_k1, Δ_k2, ..., Δ_kp)^T satisfies conditions discussed later in this dissertation (superscript T denotes vector transpose). Note that the number of loss function measurements y(·) needed in


each iteration <strong>of</strong> FDSA grows with p, whereas with <strong>SPSA</strong>, only two measurements are needed<br />

independent <strong>of</strong> p because the numerator is the same in all p components. This circumstance, <strong>of</strong><br />

course, provides the potential <strong>for</strong> <strong>SPSA</strong> to achieve a large savings (over FDSA) in the total<br />

number <strong>of</strong> measurements required to estimate θ when p is large. This potential is realized only<br />

if the number <strong>of</strong> iterations required <strong>for</strong> effective convergence to<br />

*<br />

θ<br />

does not increase in a way<br />

to cancel the measurement savings per gradient approximation at each iteration. In the following<br />

sections, the advantages in this potential <strong>of</strong> <strong>SPSA</strong> over FDSA will be described.<br />
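To make Eqs. (1.1)-(1.3) concrete, the following minimal Python sketch implements both gradient estimators and one step of the SA recursion. It is not part of the original dissertation: the function names are illustrative, and the symmetric Bernoulli ±1 perturbation is assumed here as one common choice of distribution for Δ_k.

```python
import numpy as np

def fd_gradient(y, theta, c_k):
    # Finite-difference estimate, Eq. (1.2): perturbs one component at a time,
    # so it needs 2p measurements of y per gradient approximation.
    p = theta.size
    g = np.empty(p)
    for i in range(p):
        e_i = np.zeros(p)
        e_i[i] = 1.0
        g[i] = (y(theta + c_k * e_i) - y(theta - c_k * e_i)) / (2.0 * c_k)
    return g

def sp_gradient(y, theta, c_k, rng):
    # Simultaneous-perturbation estimate, Eq. (1.3): perturbs all components
    # together, so it needs only 2 measurements of y regardless of p.
    delta = rng.choice([-1.0, 1.0], size=theta.size)  # assumed Bernoulli +/-1 choice for Delta_k
    num = y(theta + c_k * delta) - y(theta - c_k * delta)  # same numerator for all i
    return num / (2.0 * c_k * delta)

def sa_step(theta, g_hat, a_k):
    # One iteration of the recursion (1.1): theta_{k+1} = theta_k - a_k * g_hat.
    return theta - a_k * g_hat
```

The loop over components in fd_gradient is exactly what makes the FDSA measurement count grow with p, while sp_gradient reuses a single measured difference in all p components.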

1.3 Introduction to SPSA Algorithm

From here, the SPSA algorithm will be described in more detail. Stochastic approximation (SA) has long been applied to problems of minimizing loss functions or root-finding with noisy input information [10]. As with all stochastic search algorithms, there are adjustable algorithm coefficients that must be specified and that can have a profound effect on algorithm performance. It is known that picking these coefficients according to an SA analogue of the deterministic Newton-Raphson (N-R) algorithm provides an optimal or near-optimal form of the algorithm. However, directly determining the required Hessian matrix (or Jacobian matrix for root-finding) to achieve this algorithm form has often been difficult or impossible in practice [11]. This research presents a general adaptive SA algorithm that is based on an easy method for estimating the Hessian matrix at each iteration while concurrently estimating the primary parameters of interest. The approach applies in both the gradient-free optimization (Kiefer-Wolfowitz) and root-finding stochastic gradient-based (Robbins-Monro) settings and is based on the simultaneous perturbation (SP) idea introduced in [12]. There has recently been much interest in recursive optimization algorithms that rely on measurements of only the objective function to be optimized, not on direct measurements of the gradient (derivative) of the objective function [12]. Such algorithms have the advantage of not requiring detailed modeling information describing the relationship between the parameters to be optimized and the objective function. For example, many systems involving human beings or computer simulations are difficult to treat analytically, and could potentially benefit from such an optimization approach [11][12]. Stochastic optimization algorithms are used in virtually all areas of engineering and the physical and social sciences. Such techniques apply in the usual case where a closed-form solution to the optimization problem of interest is not available and where the input information to the optimization method may be contaminated with noise.


Typical applications include model fitting and statistical parameter estimation, experimental design, adaptive control, pattern classification, simulation-based optimization, and performance evaluation from test data. Frequently, the solution to the optimization problem corresponds to a vector of parameters at which the gradient of the objective (say, loss) function with respect to the parameters being optimized is zero. In many practical settings, however, the gradient of the loss function for use in the optimization process is not available or is difficult to compute (knowledge of the gradient usually requires complete knowledge of the relationship between the parameters being optimized and the loss function). So, there is considerable interest in techniques for optimization that rely on measurements of the loss function only, not on measurements (or direct calculations) of the gradient (or higher-order derivatives) of the loss function. One such technique, which uses only loss function measurements and has attracted considerable recent attention for difficult multivariate problems, is the SPSA algorithm introduced in [12]. This contrasts with algorithms requiring direct measurements of the gradient of the objective function (which are often difficult or impossible to obtain). Further, SPSA is especially efficient in high-dimensional problems in terms of providing a good solution for a relatively small number of measurements of the objective function. The essential feature of SPSA, which provides its power and relative ease of use in difficult multivariate optimization problems, is the underlying gradient approximation that requires only two objective function measurements per iteration regardless of the dimension of the optimization problem. These two measurements are made by simultaneously varying in a "proper" random fashion all of the variables in the problem. This contrasts with the classical FDSA method, where the variables are varied one at a time. If the number of terms being optimized is p, then the finite-difference method takes 2p measurements of the objective function at each iteration (to form one gradient approximation) while SPSA takes only two measurements (see Fig. 1.2). A fundamental result on relative efficiency is described below.

Under reasonably general conditions, SPSA and the standard finite-difference SA method achieve the same level of statistical accuracy for a given number of iterations even though SPSA uses p times fewer measurements of the objective function at each iteration (since each gradient approximation uses only 1/p the number of function measurements). This indicates that SPSA will converge to the optimal solution within a given level of accuracy with p times fewer measurements of the objective function than the standard method. An equivalent way of interpreting this statement is described in the following paragraph.


One properly generated simultaneous random change of all p variables in the problem contains as much information for optimization as a full set of p one-at-a-time changes of each variable [13]. Further, SPSA, like other stochastic approximation methods, formally accommodates noisy measurements of the objective function. This is an important practical concern in a wide variety of problems involving Monte Carlo simulations, physical experiments, feedback systems, or incomplete knowledge. The need for solving multivariate optimization problems is pervasive in engineering and the physical and social sciences. The SPSA algorithm has already attracted considerable attention for challenging optimization problems where it is difficult or impossible to directly obtain a gradient of the objective function. As mentioned above, the gradient approximation is based on only two function measurements (regardless of the dimension of the gradient vector). This contrasts with standard finite-difference approaches, which require a number of function measurements proportional to the dimension of the gradient vector.

SPSA is generally used in non-linear problems having many variables where the objective function gradient is difficult or impossible to obtain. As an SA algorithm, SPSA may be rigorously applied when noisy measurements of the objective function are all that are available. There have also been many successful applications of SPSA in settings where perfect measurements of the loss function are available.

Fig. 1.2. Performance of SPSA algorithm (two measurements).
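As a concrete (and deliberately simple) illustration of this relative-efficiency statement, the self-contained sketch below minimizes a noisy quadratic loss with SPSA. The quadratic loss, noise level, gain constants, and iteration count are hypothetical choices, not values from this dissertation; with p = 20, FDSA would spend 2p = 40 measurements per iteration, whereas this loop spends exactly two.

```python
import numpy as np

rng = np.random.default_rng(0)
p = 20                                       # number of parameters being optimized
theta = np.ones(p)                           # initial guess
L = lambda t: float(t @ t)                   # hypothetical true loss, minimized at theta* = 0
y = lambda t: L(t) + 0.01 * rng.normal()     # noisy measurement y(.) = L(.) + noise

measurements = 0
for k in range(2000):
    a_k = 0.1 / (k + 1) ** 0.602             # decaying gain sequences (illustrative constants)
    c_k = 0.1 / (k + 1) ** 0.101
    delta = rng.choice([-1.0, 1.0], size=p)  # simultaneous random perturbation of all p variables
    g_hat = (y(theta + c_k * delta) - y(theta - c_k * delta)) / (2.0 * c_k * delta)
    measurements += 2                        # only two y(.) evaluations, independent of p
    theta = theta - a_k * g_hat

print(measurements, L(theta))                # 4000 total measurements; the loss should be near zero
```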


1.4 Features of SPSA

1. SPSA allows the input to the algorithm to be measurements of the objective function corrupted by noise. For example, this is ideal for the case where Monte Carlo simulations are being used, because each simulation run provides one noisy estimate of the performance measure. This is especially relevant in practice, as a very large number of scenarios often need to be evaluated, and it will not be possible to run a large number of simulations at each scenario (to average out noise). So, an algorithm explicitly designed to handle noise is needed.

2. The algorithm is appropriate for high-dimensional problems where many terms are being determined in the optimization process. Many practical applications involve a significant number of such terms.

3. Performance guarantees for SPSA exist in the form of an extensive convergence theory. The algorithm has desirable properties for both global and local optimization in the sense that the gradient approximation is sufficiently noisy to allow for escape from local minima while being informative about the slope of the function to facilitate local convergence. This may avoid the cumbersome need in many global optimization problems to manually switch from a global to a local algorithm. However, we concentrate on the optimal area, so we omit the local minima problem.

4. Implementation of SPSA may be easier than other stochastic optimization methods, since there are fewer algorithm coefficients that need to be specified, and there are published guidelines [12] providing insight into how to pick the coefficients in practical applications (a sketch of typical gain-sequence choices follows this list).

5. While the original SPSA method is designed for continuous optimization problems, there have been recent extensions to discrete optimization problems. This may be relevant to certain design problems, for example, where one wants to find the best number of items to use in a particular application.

6. “Basic” SPSA uses only objective function measurements to carry out the iteration process, in a stochastic analogue of the steepest descent method of deterministic optimization.

1.5 -Applications Areas

Over the past several years, non-linear models have been increasingly used for simulation, state estimation and control purposes. In particular, the rapid progress in computational techniques and the success of non-linear model predictive control have been strong incentives for the development of such models as neural networks or first-principles models. Process modeling requires the estimation of several unknown parameters from noisy measurement data. A least-squares or maximum-likelihood cost function is usually minimized using a gradient-based optimization method [7]. Several techniques for computing the gradient of the cost function are available, including finite-difference approximations and analytic differentiation. In these techniques, the computational expense required to estimate the current gradient direction is directly proportional to the number of unknown model parameters, which becomes an issue for models involving a large number of parameters. This is typically the case in neural network modeling, but it can also occur in other circumstances, such as the estimation of parameters and initial conditions in first-principles models. Moreover, the derivation of sensitivity equations requires analytic manipulation of the model equations, which is time-consuming and subject to errors [7].

In contrast to standard finite differences, which approximate the gradient by varying the parameters one at a time, the simultaneous perturbation approximation of the gradient proposed by Spall and Chin [12] makes use of a very efficient technique based on a simultaneous (random) perturbation of all the parameters: on each iteration, SPSA needs only a few loss measurements to estimate the gradient, regardless of the dimensionality of the problem (number of parameters) [12]. Hence, one gradient evaluation requires only two evaluations of the cost function. This approach was first applied to gradient estimation in a first-order stochastic approximation algorithm, and more recently to Hessian estimation in an accelerated second-order SPSA algorithm. Using these features, the SPSA algorithm proposed in this dissertation will likewise be applied to non-linear systems regardless of the dimensionality of the problem.

Some of the general areas for application of SPSA include statistical parameter estimation, simulation-based optimization, pattern recognition, non-linear regression, signal processing, neural network (NN) training, adaptive feedback control, and experimental design. Specific system applications represented in the list of references include the following [14]:

1. Adaptive optics

2. Aircraft modeling and control

3. Atmospheric and planetary modeling

4. Fault detection in plant operations

5. Human-machine interface control

6. Industrial quality improvement

7. Medical imaging

8. Noise cancellation

9. Process control

10. Queuing network design

11. Robot control

12. Parameter estimation in highly non-linear models

The last item is an important goal of this research, because parameter estimation is very useful in realistic systems. It is often necessary to estimate the parameters of a model of an unknown system. Various techniques exist to accomplish this task, including least-mean-squares (LMS) algorithms and the Levenberg-Marquardt (L-M) algorithm [15]. These techniques require an analytic form of the gradient of the function of the parameters to be estimated. A key feature of the SPSA method is that it is a gradient-free optimization technique. The function of the parameters to be identified here is highly non-linear and of sufficient difficulty that obtaining an analytic form of the gradient is impractical.

1.6 -Formulation of SPSA Algorithm

The problem of minimizing a (scalar) differentiable loss function L(θ), where θ ∈ R^p with p ≥ 1, is considered. A typical example of L(θ) would be some measure of the mean-square error (MSE) for the output of a process as a function of some design parameters θ. For many cases of practical interest, this is equivalent to finding the minimizing point θ* such that

g(θ) = ∂L/∂θ = 0.    (1.4)

For the gradient-free setting, it is assumed that measurements of L(θ), say y(θ), are available at various values of θ. These measurements may or may not include random noise. No direct measurements (either with or without noise) of g(θ) are assumed available in this setting. In the Robbins-Monro/stochastic gradient (SG) case [9], it is assumed that direct measurements of g(θ) are available, usually in the presence of added noise. The basic problem is to take the available measurements of L(θ) and/or g(θ) and attempt to estimate θ*. This is essentially a local unconstrained optimization problem. The SPSA algorithm is a tool for solving optimization problems in which the cost function is analytically unavailable or difficult to compute. The algorithm is essentially a randomized version of the Kiefer-Wolfowitz method, in which the gradient is estimated using only two measurements of the cost function at each iteration [15][16]. SPSA is particularly efficient in problems of high dimension and where the cost function must be estimated through expensive simulations. The convergence properties of the algorithm have been established in [16]. Consider the problem of finding the minimum of a real-valued function L(θ), for θ ∈ D, where D is an open domain in R^p. The function is not assumed to be explicitly known, but noisy measurements M(n, θ) of it are available:

M(n, θ) = L(θ) + ε_n(θ)    (1.5)

where {ε_n} is the measurement noise process. We assume that the function L(·) is at least three-times continuously differentiable and has a unique minimizer in D. The process {ε_n} is a zero-mean process, uniformly bounded and smooth in θ in an appropriate technical sense. The problem is to minimize L(·) using only the noisy measurements M(·). The SPSA algorithm for minimizing functions relies on the SP gradient approximation [16]. At each iteration k of the algorithm, a random perturbation vector Δ_k = (Δ_k1, ..., Δ_kp)^T is drawn, where the Δ_ki form a sequence of Bernoulli random variables taking the values ±1. The perturbations are assumed to be independent of the measurement noise process. In fixed-gain SPSA, the step size of the perturbation is fixed at, say, some c > 0. To compute the gradient estimate at iteration k, it is necessary to evaluate M(·) at two values of θ:

M_k^+(θ) = L(θ + cΔ_k) + ε_{2k−1}(θ + cΔ_k)    (1.6)

M_k^−(θ) = L(θ − cΔ_k) + ε_{2k}(θ − cΔ_k).    (1.7)

The i-th component of the gradient estimate is

H_i(k, θ) = [M_k^+(θ) − M_k^−(θ)] / (2cΔ_ki).
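As an illustration, the following minimal Python sketch implements this fixed-gain SP gradient estimate; the quadratic loss, the noise level, and the value of c are illustrative assumptions only, not quantities taken from this chapter.

    import numpy as np

    rng = np.random.default_rng(0)

    def M(theta):
        # Noisy measurement M(n, theta) = L(theta) + noise, with an
        # illustrative quadratic loss L(theta) = 0.5 * ||theta||^2
        return 0.5 * np.sum(theta ** 2) + 0.01 * rng.standard_normal()

    def sp_gradient(theta, c=0.1):
        # One Bernoulli +/-1 perturbation of all p components at once
        delta = rng.choice([-1.0, 1.0], size=theta.size)
        m_plus = M(theta + c * delta)   # first measurement
        m_minus = M(theta - c * delta)  # second measurement
        # i-th component: (M+ - M-) / (2 c Delta_ki); only two
        # measurements are needed regardless of the dimension p
        return (m_plus - m_minus) / (2.0 * c * delta)

    theta = np.ones(10)
    print(sp_gradient(theta))

Note that Δ_k enters only through its reciprocal, which is why distributions with finite inverse moments, such as the symmetric Bernoulli ±1, are required (see Sec. 1.7).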

1.7 -Basic Assumptions of SPSA Algorithm

Once again, the goal is to minimize a loss function L(θ) over θ ∈ C ⊆ R^p. The SPSA algorithm works by iterating from an initial guess of the optimal θ, where the iteration process depends on the above-mentioned simultaneous perturbation approximation to the gradient g(θ). Sufficient conditions for convergence of the SPSA iterate (θ̂_k → θ* a.s.) are presented in [16], using a differential-equation approach well known in SA theory [17]. In particular, conditions must be imposed on both gain sequences (a_k and c_k), on the user-specified distribution of Δ_k, and on the statistical relationship of Δ_k to the measurements y(·). We will not repeat the conditions here, since they are available in [17]. The main conditions are that a_k and c_k both go to 0 at rates neither too fast nor too slow, that L(θ) is sufficiently smooth (several times differentiable) near θ*, and that the {Δ_ki} are independent and symmetrically distributed about 0 with finite inverse moments E(|Δ_ki|^{−1}) for all k, i. One particular distribution for Δ_ki that satisfies these latter conditions is the symmetric Bernoulli ±1 distribution; two common distributions that do not satisfy the conditions (in particular, the critical finite-inverse-moment condition) are the uniform and the normal.
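For concreteness, a sketch of a valid perturbation generator and decaying gain sequences follows; the form a_k = a/(k+1+A)^α, c_k = c/(k+1)^γ and the particular coefficient values (α = 0.602 and γ = 0.101 are commonly cited guideline values in the spirit of [12]) are illustrative choices, not prescriptions of this thesis.

    import numpy as np

    rng = np.random.default_rng(1)

    def gains(k, a=0.16, A=100.0, alpha=0.602, c=0.1, gamma=0.101):
        # Both sequences decay to 0, neither too fast nor too slow
        a_k = a / (k + 1 + A) ** alpha
        c_k = c / (k + 1) ** gamma
        return a_k, c_k

    def perturbation(p):
        # Symmetric Bernoulli +/-1: zero mean, symmetric about 0, and
        # E|Delta^(-1)| = 1 is finite; uniform and normal variates
        # fail the finite-inverse-moment condition
        return rng.choice([-1.0, 1.0], size=p)

    a_k, c_k = gains(k=0)
    print(a_k, c_k, perturbation(5))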

Although the convergence result for SPSA is of some independent interest, the most interesting theoretical results in [16], and those that best justify the use of SPSA, are the asymptotic efficiency conclusions that follow from an asymptotic normality result. In particular, under some minor additional conditions in [16] (Proposition 2), it can be shown that

k^{β/2} (θ̂_k − θ*) → N(µ, Σ) in distribution as k → ∞    (1.8)

where β > 0 depends on the choice of the gain sequences (a_k and c_k), µ depends on both the Hessian and the third derivatives of L(θ) at θ*, and Σ depends on the Hessian matrix (note that in general µ ≠ 0, in contrast to many well-known asymptotic normality results in estimation). Given the restrictions on the gain sequences needed to ensure convergence and asymptotic normality, the fastest allowable rate of convergence of θ̂_k to θ* is k^{−1/3}.

In addition to establishing the formal convergence of SPSA, Spall [18] shows that the probability distribution of an appropriately scaled θ̂_k is approximately normal (with a specified mean and covariance matrix) for large k. Spall [18] uses the asymptotic normality result in (1.8), together with a parallel result for FDSA [9], to establish the relative efficiency of SPSA. This efficiency depends on the shape of L(θ), the values of {a_k} and {c_k}, and the distributions of the {Δ_k} and the measurement noise terms. There is no single expression that can be used to characterize the relative efficiency; however, as discussed in [17], in most practical problems SPSA will be asymptotically more efficient than FDSA.

For example, if a_k and c_k are chosen as in the guidelines of Spall [18], then by equating the asymptotic mean squared error E(‖θ̂_k − θ*‖²) in the SPSA and FDSA algorithms, we find

(No. of measurements of L(θ) in SPSA) / (No. of measurements of L(θ) in FDSA) → 1/p

as the number of loss measurements in both procedures gets large. For example, with p = 20 parameters, SPSA requires only about one-twentieth of the loss measurements that FDSA needs for comparable accuracy. Hence, the expression above implies that the p-fold savings per iteration (gradient approximation) translates directly into a p-fold savings in the overall optimization process, despite the complex non-linear ways in which the sequence of gradient approximations manifests itself in the ultimate solution θ̂_k. One properly chosen simultaneous random change in all the variables in a problem provides as much information for optimization as a full set of one-at-a-time changes of each variable.

1.8 -Versions of SPSA Algorithm

The standard first-order SA algorithms for estimating θ involve a simple recursion with, usually, a scalar gain and an approximation to the gradient based on measurements of L(·).

The first-order SPSA (1st-SPSA, or simply SPSA) algorithm mentioned previously requires only two measurements of L(·) to form the gradient approximation, independent of p (versus 2p in the standard multivariate finite-difference approximation considered, e.g., in [8]), which extends the scalar algorithm of Kiefer and Wolfowitz [8]. Theory presented in [17] shows that for large p the 1st-SPSA approach can be much more efficient (in terms of the total number of loss measurements needed to achieve effective convergence to θ*) than the finite-difference approach in many cases of practical interest. In extending 1st-SPSA to a second-order (accelerated) form [18], explained below, the gradient and inverse Hessian of L(·) can both be estimated on a per-iteration basis using only three measurements of L(·) (again, independent of p). With these estimates, it is possible to create an SA analogue of the Newton-Raphson algorithm (which, recall, is based on an update step that is negatively proportional to the inverse Hessian times the gradient) [17]. The aim of the second-order SPSA (2nd-SPSA) algorithm is to emulate the acceleration properties associated with deterministic algorithms of Newton-Raphson form, particularly in the terminal phase where the first-order SPSA algorithm slows down in its convergence [18]. This approach requires only three loss function measurements at each iteration, independent of the problem dimension. The 2nd-SPSA approach is composed of two parallel recursions, one for θ and one for the upper triangular matrix square root, say S = S(θ), of the Hessian of L(θ) (the square root is estimated to ensure that the inverse Hessian estimate used in the second-order SPSA recursion for θ is positive semi-definite). The two recursions are, respectively [18],

θ̂_{k+1} = θ̂_k − a_k (Ŝ_k^T Ŝ_k)^{−1} ĝ_k(θ̂_k)    (1.9)

Ŝ_{k+1} = Ŝ_k − ã_k Ĝ_k(Ŝ_k)    (1.10)

where a_k and ã_k are non-negative scalar gain coefficients, ĝ_k(θ̂_k) is the SP gradient approximation to g(θ̂_k) [18], and Ĝ_k is an observation related to the gradient of a certain loss function with respect to S. Note that Ŝ_k^T Ŝ_k (which depends on θ̂_k) represents an estimate of

the Hessian matrix of L(θ̂_k). Hence, equation (1.9) is a stochastic analogue of the well-known Newton-Raphson algorithm of deterministic optimization. Since ĝ_k(θ̂_k) has a known form, the parallel recursions in equations (1.9) and (1.10) can be implemented once Ĝ_k is specified. The SP gradient approximation requires two measurements of L(·): y_k^(+) and y_k^(−). These represent measurements at the design levels θ̂_k + c_kΔ_k and θ̂_k − c_kΔ_k, respectively, where c_k is a positive scalar and Δ_k represents a user-generated random vector satisfying certain regularity conditions; e.g., Δ_k being a vector of independent Bernoulli ±1 random variables satisfies these conditions, but a vector of uniformly distributed random variables does not. The “SP” comes from the fact that all elements of θ̂_k are perturbed simultaneously in forming ĝ_k(θ̂_k), as opposed to the finite-difference form, where they are perturbed one at a time. To perform one iteration of (1.9) and (1.10), one additional measurement, say y_k^(0), is required; this measurement represents an observation of L(·) at the nominal design level θ̂_k.
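A minimal sketch of the θ-recursion (1.9) in Python, assuming the current square-root factor Ŝ_k is available; the companion update (1.10) is omitted because the form of Ĝ_k is only referenced, not specified, here, and the small ridge term is an illustrative numerical safeguard.

    import numpy as np

    rng = np.random.default_rng(2)

    def theta_update(theta, S_hat, y, a_k, c_k):
        # y(.) returns a (noisy) measurement of the loss function
        delta = rng.choice([-1.0, 1.0], size=theta.size)
        y_plus = y(theta + c_k * delta)    # y_k^(+)
        y_minus = y(theta - c_k * delta)   # y_k^(-)
        g_hat = (y_plus - y_minus) / (2.0 * c_k * delta)
        # S^T S is positive semi-definite by construction
        H_hat = S_hat.T @ S_hat + 1e-8 * np.eye(theta.size)
        # Solve (S^T S) d = g_hat rather than forming an explicit inverse
        return theta - a_k * np.linalg.solve(H_hat, g_hat)

The third measurement, y_k^(0) at the nominal θ̂_k, is needed by the Ŝ_k recursion, which is not shown.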

Main Advantages:

- 1st-SPSA gives region(s) where the function value is low, and this allows one to conjecture in which region(s) the global solution lies.

- 2nd-SPSA is based on a highly efficient approximation of the gradient from loss function measurements. In particular, on each iteration it needs only three loss measurements, regardless of the dimensionality of the problem. Moreover, 2nd-SPSA is grounded in a solid mathematical framework that permits assessing its stochastic properties even for optimization problems affected by noise or uncertainties. Due to these striking advantages, 2nd-SPSA has recently been used as the optimization engine for adaptive control problems.

Main Disadvantages:

- 1st-SPSA gives slow convergence.

- 2nd-SPSA does not take into account equality/inequality constraints.

The 1st-SPSA and 2nd-SPSA algorithms do not depend on derivative information and are able to find a good approximation to the solution using few function values. Their disadvantage is that, once a good approximation is obtained, it may not satisfy some conditions and constraints associated with complex problems [17][18]. Also, in both versions of the SPSA algorithm, it is not possible to guarantee that the non-positive-definite part of the Hessian matrix can be eliminated when the number of parameters to be adjusted is large. This can cause instability in the system, and both versions can also become very expensive computationally. Finally, in the 1st-SPSA and 2nd-SPSA algorithms, the error for a loss function with an ill-conditioned Hessian is greater than the error for one with a well-conditioned Hessian, and with this problem the system performance decreases. Also, in estimating the optimum parameters of a model or time series, several factors must be considered when deciding on the appropriate optimization technique. Among these factors are convergence speed, accuracy, algorithm suitability, complexity, and computational cost in terms of time (coding, run-time, output) and power. In the parameter estimation application, 2nd-SPSA has had problems with convergence to local minima and with computational cost, and some techniques are therefore proposed in [18] in order to solve these kinds of problems efficiently. Nevertheless, when the number of parameters to be adjusted is very large, the convergence is slow and unstable. The techniques defined in [18] include a mapping of the Hessian matrix, but this mapping is not consistent under some conditions or applications. Therefore, in view of these disadvantages (theoretical and practical), in the following chapter we propose some improvements to the speed and stability of the 2nd-SPSA algorithm, in particular its stability, convergence, and computational cost. A new mapping is also suggested for implementation in 2nd-SPSA that eliminates the non-positive definiteness while preserving key spectral properties of the estimated Hessian. This Hessian is estimated using the Fisher information matrix in order to keep it positive definite and to improve the stability. These improvements constitute our proposed SPSA algorithm, which is described in the following chapter.



Chapter 2

Proposed SPSA Algorithm

We propose a modification to the simultaneous perturbation stochastic approximation (SPSA) methods, based on comparisons made between the first- and second-order SPSA (1st-SPSA and 2nd-SPSA) algorithms from the perspective of the loss function Hessian. At finite iterations, the accuracy of the algorithm depends on the matrix conditioning of the loss function Hessian. The error of the 2nd-SPSA algorithm for a loss function with an ill-conditioned Hessian is greater than for one with a well-conditioned Hessian. On the other hand, the 1st-SPSA algorithm is less sensitive to the matrix conditioning of loss function Hessians. The modified 2nd-SPSA (M2-SPSA) eliminates the error amplification caused by the inversion of an ill-conditioned Hessian. This leads to significant improvements in algorithm efficiency in problems with an ill-conditioned Hessian matrix. Asymptotically, the efficiency analysis shows that M2-SPSA is also superior to 2nd-SPSA over a large parameter domain. It is shown that the ratio of the mean square errors of M2-SPSA to 2nd-SPSA is always less than one, except for a perfectly conditioned Hessian or for an asymptotically optimal setting of the gain sequence. Also, an improved estimation of the Hessian matrix is proposed in order to guarantee that the non-positive-definite part of this matrix can be eliminated; using this proposed estimation, the computational cost is also reduced when our method is applied to parameter estimation.

2.1 -Overview of Modified 2nd-SPSA Algorithm

The recently developed simultaneous perturbation stochastic approximation (SPSA) method has found many applications in areas such as physical parameter estimation and simulation-based optimization. The novelty of SPSA is the underlying derivative approximation, which requires only two (for the gradient) or four (for the Hessian matrix) evaluations of the loss function regardless of the dimension of the optimization problem. There exist two basic SPSA algorithms that are based on the “simultaneous perturbation” (SP) concept and that use only (noisy) loss function measurements. The first-order SPSA (1st-SPSA) is related to the Kiefer–Wolfowitz (K–W) stochastic approximation (SA) method [17], whereas the second-order SPSA (2nd-SPSA) is a stochastic analogue of the deterministic Newton–Raphson algorithm [18]. There have been several studies that compare the efficiency of 1st-SPSA with other stochastic approximation (SA) methods. It is generally accepted that 1st-SPSA is superior to

other first-order SA methods (such as the standard K–W method) due to its efficient estimator of the loss function gradient. Spall [28] shows that a ‘standard’ implementation of 2nd-SPSA achieves a nearly optimal asymptotic error, with the asymptotic root-mean-square error being no more than twice the optimal (but unachievable) error from an infeasible gain sequence depending on the third derivatives of the loss function. This appealing result for 2nd-SPSA is achieved with a trivial gain sequence (a_k = 1/(k+1) in the notation below), which effectively eliminates the nettlesome issue of selecting a “good” gain sequence. Because this result is asymptotic, however, performance in finite samples may sometimes be improved using other considerations. Part of the purpose of this chapter is to provide a comparison between 1st-SPSA and 2nd-SPSA from the perspective of the conditioning of the loss function Hessian matrix. To achieve objectivity in the comparison, we also suggest a new mapping for implementing 2nd-SPSA that eliminates the non-positive definiteness while preserving key spectral properties of the estimated Hessian. While the focus of this chapter is finite-sample analysis, we are necessarily limited by the theory available for SA algorithms, almost all of which is asymptotic. The numerical examples illustrating the empirical results at finite iterations will be carefully chosen to represent a wide range of matrix conditioning for the loss function Hessians.

2.2 -SPSA Algorithm Recursions

There has recently been growing interest in recursive optimization algorithms of SA form that do not depend on direct gradient information or measurements [19]-[21]. Rather, these SA algorithms are based on an approximation to the p-dimensional gradient formed from measurements of the objective function. This interest has been motivated by problems such as the adaptive control of complex processes, the training of recurrent NNs, and the optimization of complex queuing and estimation parameters. The principal advantage of algorithms that do not require direct gradient measurements (gradient-free algorithms) is that they do not require knowledge of the functional relationship between the parameters being adjusted and the objective function being minimized. The SPSA algorithm, which is based on a highly efficient gradient approximation, is one such gradient-free algorithm. Within the SPSA family there are two important orders, 1st-SPSA (or simply SPSA) and 2nd-SPSA. These algorithms are described as follows:

1st-SPSA [17]:

θ̂_{k+1} = θ̂_k − a_k ĝ_k(θ̂_k),  k = 0, 1, 2, ...    (2.1)

2nd-SPSA [18]:

θ̂_{k+1} = θ̂_k − a_k H̄_k^{−1} ĝ_k(θ̂_k),  H̄_k = f_k(H_k)    (2.2a)

H_k = [k/(k+1)] H_{k−1} + [1/(k+1)] Ĥ_k,  k = 0, 1, 2, ...    (2.2b)

where a_k and c_k are scalar gain series that satisfy certain SA conditions [18], ĝ_k is the SP estimate of the loss function gradient, which depends on the gain sequence c_k (representing a difference interval for the perturbations), Ĥ_k is the SP estimate of the Hessian matrix, and f_k maps the usually non-positive-definite H_k to a positive-definite p×p matrix. The two recursions are shown in Fig. 2.1. Let Δ_k be a user-generated mean-zero random vector of dimension p with its components being independent random variables.

Fig. 2.1. The two recursions in the 2nd-SPSA algorithm (solid line: eq. (2.2a); dashed line: eq. (2.2b)).

The i-th element of the loss function gradient estimate is given by [18]:

(ĝ_k)_i = (2c_k Δ_ki)^{−1} [y(θ̂_k + c_kΔ_k) − y(θ̂_k − c_kΔ_k)],  i = 1, 2, ..., p    (2.3)

where Δ_ki is the i-th component of the Δ_k vector and y(θ) is a measurement of the loss function:

y(θ) = L(θ) + (noise)    (2.4)

and θ* denotes the true (optimal) value of the parameter θ.
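The averaging step (2.2b) itself is a one-line computation; a minimal sketch follows, in which the per-iteration estimate Ĥ_k is assumed to be supplied (e.g., by the FIM-based construction of Sec. 2.6):

    def average_hessian(H_prev, H_hat_k, k):
        # Eq. (2.2b): running weighted average of the per-iteration
        # Hessian estimates; H_prev is H_{k-1}, H_hat_k is the new estimate
        return (k / (k + 1.0)) * H_prev + (1.0 / (k + 1.0)) * H_hat_k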

It is noted that the 2nd-SPSA form is a special case of the general adaptive SP method. The general method can also be used in root-finding problems, where H_k represents an estimate of the associated Jacobian matrix. The true Hessian matrix H(θ) of the loss function has its ij-th element defined as H_ij = ∂²L/∂θ_i∂θ_j, and its value at the solution, H(θ*), is denoted by H*. Finally, its estimate, and the ij-th element of that estimate, are defined in Sec. 2.6 using the Fisher information matrix (FIM). The FIM is used here instead of the Hessian matrix in order to estimate this matrix efficiently [22]; it is obtained by Monte Carlo Newton-Raphson (MCNR) [23]. This Hessian matrix estimate is convenient in an optimization application and is a crucial requirement for the new mapping f_k proposed in the following section.

2.3 -Proposed Mapping

An important point in implementing 2nd-SPSA is to define the mapping f_k from H_k to H̄_k, since the former is often non-positive definite in practice. It is noted that there are no simple and universal conditions that guarantee a matrix to be positive definite. The existence of a minimum (or minima) for a loss function based on the problem's physical nature guarantees that its Hessian should be positive definite. The following approach eliminates the non-positive definiteness of H_k, and by using the Fisher information matrix we can maintain this condition even when the real application has very high computational complexity. This approach is motivated by finite-sample concerns, as we discuss below. First, we compute the eigenvalues of H_k and sort them into descending order:

Λ_k ≡ diag[λ_1, λ_2, ..., λ_{q−1}, λ_q, λ_{q+1}, ..., λ_p]    (2.5)

where λ_q > 0 and λ_{q+1} ≤ 0. As H_k is real-valued and symmetric, its eigenvalues are real-valued too. The eigenvalues of H_k are computed as follows. The number of non-zero eigenvalues is equal to the rank of H_k; i.e., at most three non-zero eigenvalues are available, and the arrangement λ_1 ≥ λ_2 ≥ λ_3 is assumed in this part. The technique presented here requires much less user interaction; the theoretical background leads to a two-fold threshold algorithm in which the only task of the user is to specify two thresholds. Finding the eigenvalues and eigenvectors of the Hessian matrix is closely related to its decomposition

H = P D_i P^{−1}    (2.6)

where P is a matrix whose columns are H's eigenvectors and D_i is a diagonal matrix having H's eigenvalues on its diagonal. While computing the gradient magnitude by the Euclidean norm requires three multiplications, two additions and one square root, the computation of the eigenvalues of the Hessian matrix is more suitable; the explicit formula would require solving cubic polynomials. In our implementation, a fast-converging numerical technique called Jacobi's method is used, as recommended in [20] for symmetric matrices. We have proposed an easy-to-use framework for exploiting the eigenvalues of the Hessian matrix to represent volume data by small subsets.

The relation of the eigenvalues to the Laplacian operator is recalled; this shows the suitability of thresholding eigenvalue volumes, and a two-fold threshold operation is defined to generate sparse data sets. For data where it can be assumed that objects exhibit higher intensities than the background, we modify the framework to take into account only the smallest eigenvalue. This results in a further reduction of the representative subsets, by selecting just the data at the interior side of object boundaries. For the sake of simplicity, we have omitted the index k for the individual eigenvalue λ_i, which is a function of k. Next, we assume that the negative eigenvalues will not lead to a physically meaningful solution. They are either caused by errors in H_k or are due to the fact that the iteration has not reached the neighborhood of θ* where the loss function is locally quadratic. Therefore, we replace them, together with the smallest positive eigenvalue, by a descending series of positive eigenvalues:

λ̂_q = ελ_{q−1},  λ̂_{q+1} = ελ̂_q, ...,  λ̂_p = ελ̂_{p−1}    (2.7)

where the adjustable parameter 0 < ε < 1 can be specified based on the existing positive eigenvalues:

ε = (λ_{q−1}/λ_1)^{1/(q−2)}.    (2.8)

The purpose of redefining the smallest positive eigenvalue λ_q is to avoid a possible near-zero value that would make the mapped matrix nearly singular. We let Λ̂_k be the diagonal matrix Λ_k with the eigenvalues λ_q, ..., λ_p replaced by λ̂_q, ..., λ̂_p defined according to (2.7); this also guarantees the stability of this diagonal matrix when the realistic system is very complex or, in our case, when the number of parameters to be estimated is very large. The Jacobi algorithm is proposed because the matrices in this algorithm need, in general, to be positive definite, and hence (2.2a) should be projected appropriately after each parameter update so as to ensure that the resulting matrices are positive definite. Equations (2.7) and (2.8) indicate that the spectral character of the existing positive eigenvalues, as measured by the ratio of the maximum to minimum eigenvalues, whether widely or narrowly spread, is extrapolated to the rest of the matrix spectrum. Other forms of specification, such as ε = (λ_{q−1}/λ_1)^{1/[2(q−2)]} or ε = 1, would also effectively eliminate the non-positive definiteness. Because the separating point q between the positive and negative eigenvalues slowly increases from 1 to p, we find numerically that the specification based on (2.8) yields relatively faster convergence in most cases. Since H_k is symmetric, it is orthogonally similar to the real diagonal matrix of its real eigenvalues:

H_k = P_k Λ_k P_k^T    (2.9)

where the orthogonal matrix P_k consists of all the eigenvectors of H_k, which are usually derived together with the eigenvalues. Now, the mapping f_k can be expressed as

f_k(H_k) = P_k Λ̂_k P_k^T.    (2.10)

Since it is H̄_k^{−1} that is used in the 2nd-SPSA recursion (2.2a), mapping (2.10) with the available eigenvectors of H_k also leads to an easy inversion of the estimated Hessian:

H̄_k^{−1} = P_k Λ̂_k^{−1} P_k^T.    (2.11)

The 2nd-SPSA based on mapping (2.10) makes the procedure of eliminating the non-positive definiteness of H_k a precise one. It is noted that the key parameters needed for the mapping (ε and λ_{q−1}) are internally determined by H_k at each iteration. This is different from some other forms of f_k, where a user-specified coefficient is needed.
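The proposed mapping can be summarized by the following Python sketch; it follows (2.5), (2.7), (2.8), (2.10) and (2.11) directly, while the guard for very small q is an implementation assumption and the sketch presumes at least two positive eigenvalues.

    import numpy as np

    def proposed_mapping(H_k):
        # Eigendecomposition of the symmetric estimate, eq. (2.9)
        w, P = np.linalg.eigh(H_k)
        order = np.argsort(w)[::-1]          # descending order, eq. (2.5)
        lam, P = w[order], P[:, order]
        q = int(np.sum(lam > 0))             # lam[q-1] > 0 >= lam[q]
        if q >= 2:
            # eq. (2.8); the max() guard for q = 2 is an assumption
            eps = (lam[q - 1] / lam[0]) ** (1.0 / max(q - 2, 1))
            # eq. (2.7): replace the smallest positive and all
            # non-positive eigenvalues by a descending positive series
            for i in range(q - 1, lam.size):
                lam[i] = eps * lam[i - 1]
        H_mapped = P @ np.diag(lam) @ P.T        # f_k(H_k), eq. (2.10)
        H_inv = P @ np.diag(1.0 / lam) @ P.T     # easy inverse, eq. (2.11)
        return H_mapped, H_inv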

λ_p(ΔH_k) ≤ λ_i − λ_i* ≤ λ_1(ΔH_k)  for all i = 1, 2, ..., p    (2.12)

where λ_i* denotes the eigenvalues of H*, and λ_p(ΔH_k) and λ_1(ΔH_k) are, respectively, the minimum and maximum eigenvalues of the k-th perturbation matrix ΔH_k = H_k − H*. Equation (2.12) suggests that the perturbation matrix will have a greater impact on the smaller eigenvalues, in terms of their fractional changes, as H_k converges to H*. Hence, the smallest positive eigenvalue (λ_q) has also been redefined at each iteration to avoid a possible near-zero value. When all the eigenvalues in (2.5) are positive and the smallest becomes stabilized, say empirically λ_p > 0.1(ελ_{p−1}) with ε = (λ_{p−1}/λ_1)^{1/(p−2)}, or λ_p > 0 in 10 consecutive iterations, we set Λ̂_k = Λ_k. Specifically, H_k asymptotically converges to a positive-definite H*, so that λ_p > 0 as k → ∞; see [24]. Hence, Λ̂_k − Λ_k → 0, since, asymptotically, the elements of Λ̂_k are continuous functions of H_k; here Λ_k is a continuous function of H_k. Therefore, Λ_k → Λ* almost surely when H_k → H*, where Λ* denotes all the eigenvalues of H*. This follows from the basic property of continuous functions for deterministic sequences; both Λ_k and H_k converge for almost all points in their underlying sample spaces. We further note that our mapping from Λ_k to Λ̂_k defined by (2.7) and (2.8) is also a continuous function asymptotically. Here, we would like to point out that the mapping f_k defined by (2.10) preserves key spectral characters, such as the spread of the known

positive eigenvalues, λ_1/λ_q. Furthermore, as k → ∞, any mapping for 2nd-SPSA should preserve the complete spectral property of H_k^{−1}. Therefore, the proposed mapping to a matrix in 2nd-SPSA is different from the matrix regularization in an ill-posed inversion problem, where the spectral property of an ill-conditioned matrix is changed to make the problem well posed.

2.4 -Description of Proposed SPSA Algorithm

The 1st-SPSA algorithm predetermines the gain series a_k for the whole iteration process, whereas 2nd-SPSA derives a generalized gain series a_k H̄_k^{−1} that is adapted to near-optimality at each iteration. However, based on the previous analyses, the inverse of the estimated Hessian, H̄_k^{−1}, generally introduces additional error sensitivity inherited from H_k for a non-perfectly conditioned matrix (κ_k > 1). To avoid computing the inverse of an ill-conditioned matrix while still approximately optimizing the gain series at each iteration, we can modify the first recursion of 2nd-SPSA (2.2a) by replacing Λ̂_k in the mapping f_k of (2.10) with a matrix Λ̄_k that contains constant diagonal elements:

θ̂_{k+1} = θ̂_k − a_k λ̄_k^{−1} ĝ_k(θ̂_k)    (2.13)

where λ̄_k is the geometric mean of all the eigenvalues of H̄_k:

λ̄_k = (λ_1 λ_2 ... λ_{q−1} λ̂_q λ̂_{q+1} ... λ̂_p)^{1/p}.    (2.14)
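A minimal sketch of the modified update (2.13)-(2.14) in Python, assuming the mapped eigenvalues λ̂_i from the previous section and a gradient estimate ĝ_k computed as in (2.3):

    import numpy as np

    def m2spsa_step(theta, g_hat, lam_hat, a_k):
        # Eq. (2.14): geometric mean of the (all positive) mapped
        # eigenvalues, computed in log space for numerical stability
        lam_bar = np.exp(np.mean(np.log(lam_hat)))
        # Eq. (2.13): a scalar gain replaces the full inverse Hessian,
        # so no ill-conditioned matrix is ever inverted
        return theta - a_k * g_hat / lam_bar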

Recursions (2.13) and (2.2b), together with (2.5), (2.7)-(2.8) and (2.14), form a modified version of 2nd-SPSA (M2-SPSA) that takes advantage of both the well-conditioned 1st-SPSA and the internally determined gain sequence of 2nd-SPSA. The proportionality coefficient a of a_k (= a/(k+1+A)^α, A ≥ 0) in 1st-SPSA depends on the individual loss function and is generally selected by a trial-and-error approach in practice. On the other hand, the 2nd-SPSA algorithm removes such uncertainty in selecting its proportionality coefficient a of a_k (= a/(k+1+A)^α, A ≥ 0), since the asymptotically near-optimal selection of a is 1 [24]. The crucial property that a in 1st-SPSA depends on the individual loss function has been built into 2nd-SPSA through its generalized gain series (k+1+A)^{−α} H̄_k^{−1}, A ≥ 0. From this perspective, our proposed SPSA algorithm (2.13) can be considered as an extension of

1st-SPSA in which a is replaced by a scalar series λ̄_k^{−1} that depends on the individual loss function and varies with the iteration. Before replacing a by λ̄_k^{−1}, in order to enhance convergence and stability, the use of an adaptive gain sequence for parameter updating is proposed; this application considers the following conditions:

a) a_k = η a_{k−1}, η ≥ 1, if J(θ_k) < (1+β) J(θ_{k−1});

b) a_k = µ a_{k−1}, µ < 1, if J(θ_k) ≥ (1+β) J(θ_{k−1}).

In addition to gain attenuation when the value of the criterion becomes worse, a “blocking” mechanism is also applied; i.e., the recurrent step is rejected and, starting from the previous parameter estimate, a new step is carried out (with a new gradient evaluation and a reduced updating gain). The parameter β in condition (a) represents the permissible increase in the criterion before step rejection and gain attenuation occur. A constant gain sequence c_k = c, as assumed in the implementation of SPSA in Sec. 2.8, can be used for the gradient approximation, the value of c being selected so as to overcome the influence of the noise. In the neighborhood of the optimum, a decaying sequence of the form defined in Sec. 2.8 is required to evaluate the gradient with enough accuracy and to avoid an amplification of the “slowing down” effect. Once these conditions have been implemented in a_k, a can be replaced by λ̄_k^{−1}.
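A sketch of this adaptive-gain and blocking logic, under the reading of conditions (a) and (b) given above (η, µ and β are illustrative values):

    def adapt_gain(a_prev, J_new, J_prev, eta=1.05, mu=0.5, beta=0.1):
        # Condition (a): criterion within the permissible increase ->
        # accept the step and (possibly) grow the gain
        if J_new < (1.0 + beta) * J_prev:
            return eta * a_prev, True    # (new gain, step accepted)
        # Condition (b): criterion became worse -> attenuate the gain
        # and signal that the step should be rejected ("blocking")
        return mu * a_prev, False

When the step is rejected, the iterate is rolled back to the previous parameter estimate and a new step is attempted with the reduced gain and a fresh gradient evaluation.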

2.5 -Asymptotic Normality

The strong convergence of θ̂_k generally implies an asymptotic normal distribution. In [24], the asymptotic normal distributions for both 1st-SPSA and 2nd-SPSA are established. Although our interest is mainly in finite samples, let us present the following asymptotic arguments as a way of relating to previously known results. Since the proposed algorithm can also be considered as an extension of 1st-SPSA with a special gain series λ̄_k^{−1}, the analysis of the asymptotic normality for 1st-SPSA can also be extended to M2-SPSA. In this section, we first review the asymptotic normal distributions for 1st-SPSA and 2nd-SPSA. Then, the asymptotic efficiency is compared for the three algorithms: 1st-SPSA, 2nd-SPSA, and the proposed SPSA algorithm. Using Fabian's result [19], the following asymptotic normality of θ̂_k in 1st-SPSA is established:

k^{β/2} (θ̂_k − θ*) → N(ξ, Σ) in distribution as k → ∞    (2.15)

where ξ and Σ are the mean vector and covariance matrix, and β/2 characterizes the rate of convergence and is related to the parameters of the gain sequences a_k and c_k. The mean ξ in (2.15) depends on the third derivatives of the loss function at θ* and generally vanishes except for a special set of gain sequences. The covariance matrix Σ for α ≤ 1 is orthogonally similar to the diagonal matrix that is proportional to the inverse eigenvalues of the Hessian:

Σ = ψ a P* Λ*^{−1} P*^T    (2.16)

where P* is orthogonal with H* = P* Λ* P*^T, Λ* = diag[λ_1*, λ_2*, ..., λ_p*], and the coefficient of proportionality ψ depends on the statistical parameters in the algorithm [16]. Again, according to the eigenvalue perturbation theorem [16], the difference between λ_i* (i = 1, 2, ..., p) and the corresponding λ_i at the k-th iteration in (2.16) is bounded by the difference in its Hessian:

|λ_i − λ_i*| ≤ κ_λ(P) ‖H_k(θ̂_k) − H*‖_2,  i = 1, 2, ..., p    (2.17)

where ‖·‖_2 denotes the spectral norm of a matrix, which leads to the definition of the spectral condition number

κ_λ(H) = λ_max/λ_min.    (2.18)

It is noted that H_k(θ̂_k) converges almost surely to H*, and that the mapping from H_k to H̄_k defined by (2.10) preserves the matrix spectra. Furthermore, Λ̂_k − Λ_k → 0 as k → ∞, and since the calculation from H_k to Λ_k is a continuous function, we also have the following strong convergence for the eigenvalues of the Hessian:

Λ_k → Λ* = diag[λ_1*, λ_2*, ..., λ_p*],  λ̄_k → λ̄* as k → ∞    (2.19)


where $\bar{\lambda}^*$ is the geometric mean of all the eigenvalues of $H^*$. Based on (2.15), (2.16) and (2.19), we conclude that the choice of $a_k \bar{\lambda}_k^{-1}$ in M2-SPSA can also be considered as a natural extension of 1st-SPSA with a sensible selection of $a_k$ based on its asymptotic normality:

$k^{\beta/2}(\hat{\theta}_k - \theta^*) \xrightarrow{dist} N(\mu, \Omega)$ as $k \to \infty$   (2.20)

where $\beta = \alpha - 2\gamma$. The covariance matrix $\Omega$ is proportional to $H^{*-2} = P \Lambda^{*-2} P^T$ with the same coefficient of proportionality $\psi$ as in (2.16), and the mean $\mu$ depends on both the gain sequence parameters and the third derivatives of the loss function at $\theta^*$. The asymptotic mean square error (MSE) of $k^{\beta/2}(\hat{\theta}_k - \theta^*)$ in (2.20) is given by [16]

$\mathrm{MSE}_{2SPSA}(\alpha, \gamma) = \mu^T \mu + \mathrm{trace}(\Omega)$.   (2.21)

We first consider a special case of a diagonal Hessian with constant eigenvalues ($\lambda_i^* = \lambda^* = \lambda$). It can be shown that the asymptotic normality of $\hat{\theta}_k$ in 2nd-SPSA [18] is identical to that in 1st-SPSA [17] when the following gain sequences are picked:

$N(\mu, \Omega) = N(\xi, \Sigma)$ when $a_k = \phi/(k+1)$ and $a_k = \phi/[(k+1)\lambda]$, respectively,   (2.22)

where the constant $\phi$ represents a common scale factor for the two gain sequences. The near-optimal selection of $\phi$ for 2nd-SPSA is $\phi = 1$. Note that the truly optimal selection of the gain is essentially infeasible, as it depends on the third derivatives of the loss [16]. Equation (2.22) suggests that the near-optimal MSE in 2nd-SPSA can be achieved in 1st-SPSA by picking its proportionality coefficient $a$ in such a way that $a = 1/\lambda$. Since $a$ in 1st-SPSA is externally prescribed, such an optimal picking of $a$ is only theoretically possible. On the other hand, the internally determined gain sequence $a_k \bar{\lambda}_k^{-1} = (k \bar{\lambda}_k)^{-1}$ in the proposed SPSA algorithm makes the near-optimal picking practically possible for the special case of constant eigenvalues. Next, we consider the specification of the gain sequence with $\alpha < 1$ and $3\gamma - \alpha/2 > 0$, from which $\mu = \xi = 0$ [16]. The asymptotic distribution-based MSE for 2nd-SPSA under this condition is proportional to the sum of all the squared inverse eigenvalues:


$\mathrm{MSE}_{2SPSA}(\alpha, \gamma) = \mathrm{trace}(\Omega) \propto \mathrm{trace}(\Lambda^{*-2}) = \sum_{i=1}^{p} \lambda_i^{*-2}$.   (2.23)

On the other hand, the MSE for our proposed SPSA can be derived by setting $a = 1/\bar{\lambda}^*$ in 1st-SPSA:

$\mathrm{MSE}_{M2SPSA}(\alpha, \gamma) = \mathrm{trace}(\Sigma)\big|_{a = 1/\bar{\lambda}^*} \propto \bar{\lambda}^{*-1}\, \mathrm{trace}(\Lambda^{*-1}) = \bar{\lambda}^{*-1} \sum_{i=1}^{p} \lambda_i^{*-1}$.   (2.24)

The constants of proportionality are related to $c$ and to the variances of $\Delta_k$ and the measurement noise. Therefore, the ratio of MSEs for M2-SPSA to 2nd-SPSA is given by

$R_0(\alpha, \bar{\lambda}) \equiv \dfrac{\mathrm{MSE}_{M2SPSA}(\alpha, \bar{\lambda})}{\mathrm{MSE}_{2SPSA}(\alpha, \bar{\lambda})} = \dfrac{\big[\prod_{i=1}^{p} \lambda_i^{*-1}\big]^{1/p} \cdot (1/p) \sum_{i=1}^{p} \lambda_i^{*-1}}{(1/p) \sum_{i=1}^{p} \lambda_i^{*-2}} \le 1$   (2.25)

where we have used a well-known relation in the last inequality of (2.25):

(geometric mean) ≤ (arithmetic mean) ≤ (root-mean-square).   (2.26)
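The inequality chain (2.26), and hence the bound $R_0 \le 1$ in (2.25), can be checked numerically. The sketch below (an illustrative check only, using an arbitrary synthetic spectrum) evaluates the ratio for random eigenvalues and for the equal-eigenvalue case:

```python
import numpy as np

# Numerical check of (2.25)-(2.26): with inv_i = 1/lambda_i*, the ratio
# R0 = GM(inv) * AM(inv) / mean(inv^2) satisfies R0 <= 1, with equality
# only when all eigenvalues coincide (perfectly conditioned Hessian).
def mse_ratio(lam):
    inv = 1.0 / lam
    gm = np.prod(inv) ** (1.0 / inv.size)    # geometric mean of 1/lambda
    am = inv.mean()                          # arithmetic mean of 1/lambda
    ms = (inv ** 2).mean()                   # mean square of 1/lambda
    return gm * am / ms

rng = np.random.default_rng(1)
print(mse_ratio(rng.uniform(0.5, 10.0, size=8)))  # strictly below 1
print(mse_ratio(np.full(8, 3.0)))                 # exactly 1
```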

Equality in (2.26) holds only when all the eigenvalues are equal, which corresponds to a perfectly conditioned Hessian with $\kappa(H^*) = 1$. Since the ratio $R_0$ has been derived from the asymptotic MSEs, the comparison between M2-SPSA and 2nd-SPSA has been made under the same rate of convergence. Our third case in the asymptotic efficiency analysis is to consider $\alpha = 1$ with $3\gamma - \alpha/2 \ge 0$ in 2nd-SPSA. This setting again corresponds to $\mu = \xi = 0$ in 2nd-SPSA and the proposed SPSA algorithm. It is possible for both 1st-SPSA and 2nd-SPSA to set $\alpha = 1$ for their gain sequence selection. The near-optimal rate of convergence obtained in 2nd-SPSA by setting $a = 1$ can be accomplished in 1st-SPSA by adjusting its $a$ to yield the same rate of convergence as 2nd-SPSA. By setting $a = 1/\bar{\lambda}^*$ in 1st-SPSA for the implementation of our proposed SPSA, we can again derive (2.25), which shows the superiority of our proposed SPSA over 2nd-SPSA under the same rate of convergence. However, the above setting of $a = 1/\bar{\lambda}^*$ in 1st-SPSA is allowed only if the resulting condition in 1st-SPSA of $\min_i(\lambda_i^*/\bar{\lambda}^*) \ge \beta/2$ still holds [16]. When the above condition is violated while implementing M2-SPSA for a relatively large $\kappa(H^*)$, the setting of $\alpha = 1$ in our proposed SPSA algorithm is excluded and we can no longer make a straight comparison of the asymptotic MSEs between 2nd-SPSA and M2-SPSA


under the same rate of convergence. Under this circumstance, neither M2-SPSA nor 2nd-SPSA is superior to the other in terms of the efficiency or the rate of convergence. The superiority of our proposed SPSA algorithm over 2nd-SPSA indicated by (2.25) only shows an improvement in the multiplier ($R_0$) for the convergence rate when the common convergence rate is sub-optimal. In [25] it is shown that, by setting $\alpha = 1$ and $\gamma = 1/6$, the asymptotically optimal MSE can be achieved with a maximum rate of convergence for the MSE of $\hat{\theta}_k$ of $k^{-\beta} = k^{-2/3}$ in both 1st-SPSA and 2nd-SPSA. We have already shown that, in order to avoid the violation of the condition $\min_i(\lambda_i^*/\bar{\lambda}^*) \ge \beta/2$, the setting of $\alpha = 1$ (with $\beta \approx 2/3$) is often not allowed in our proposed SPSA algorithm. Neither is it possible to choose a different set of $\alpha_m$ and $\gamma_m$ to yield $\beta_m = 2/3$ when $\gamma_m = 1/6$. Under this circumstance, the maximum rate of convergence of $k^{-2/3}$ for the MSE cannot be achieved by our proposed SPSA. It is noted that a mapping $f_k$ such as the one proposed in Sec. 2.3 will leave the asymptotic $H_k$ unchanged (when we set $\hat{\Lambda}_k = \Lambda_k$) as $k \to \infty$. On the other hand, our proposed SPSA algorithm changes $H_k$ when its $\Lambda_k$ is replaced by $\bar{\Lambda}_k$.

2.6 -Fisher Information Matrix

2.6.1 -Introduction to Fisher Information Matrix

In this section, we present a relatively simple MCNR method for obtaining the FIM, which is used in order to estimate the Hessian matrix efficiently. Thus, the resampling-based method relies on an efficient technique for estimating the Hessian matrix. The FIM plays a central role in the practice and theory of identification and estimation. This matrix provides a summary of the amount of information in the data relative to the quantities of interest [22]. Suppose that the $i$-th measurement of a process is $z_i$ and that a stacked vector of $n$ such measurement vectors is $z_n \equiv [z_1^T, z_2^T, \ldots, z_n^T]^T$. Let us assume that the general form for the joint probability density or probability mass function for $z_n$ is known, but that this function depends on an unknown vector $\theta$. Let the probability density/mass function for $z_n$ be $p_z(\zeta \mid \theta)$, where $\zeta$ ("zeta") is a dummy vector representing the possible outcomes for $z_n$ (in $p_z(\zeta \mid \theta)$, the index $n$ on $z_n$ is being suppressed for notational convenience). The corresponding likelihood function is

$\ell(\theta \mid \zeta) = p_z(\zeta \mid \theta)$.   (2.27)

With the definition of the likelihood function in (2.27), we are now in a position to present the Fisher information matrix. The expectations below are with respect to the dataset $z_n$. The $p \times p$ information matrix $F_n(\theta)$ for a differentiable log-likelihood function is given by [22]

$F_n(\theta) \equiv E\left( \dfrac{\partial \log \ell}{\partial \theta} \cdot \dfrac{\partial \log \ell}{\partial \theta^T} \;\middle|\; \theta \right)$.   (2.28)

In the case where the underlying data $\{z_1, z_2, \ldots\}$ are independent (and even in many cases where the data may be dependent), the magnitude of $F_n(\theta)$ will grow at a rate proportional to $n$, since $\log \ell(\cdot)$ will represent a sum of $n$ random terms. Then, the bounded quantity $F_n(\theta)/n$ is employed as an average information matrix over all measurements. Except for relatively simple problems, however, the form in (2.28) is generally not useful in the practical calculation of the information matrix. Computing the expectation of a product of multivariate non-linear functions is usually a hopeless task. A well-known equivalent form follows by assuming that $\log \ell(\cdot)$ is twice differentiable in $\theta$. The Hessian matrix

$H(\theta \mid \zeta) \equiv \dfrac{\partial^2 \log \ell(\theta \mid \zeta)}{\partial \theta\, \partial \theta^T}$

is then assumed to exist, together with certain standard regularity conditions on the likelihood [22]. One of these conditions is that the set $\{\zeta : \ell(\theta \mid \zeta) > 0\}$ does not depend on $\theta$. A fundamental implication of the regularity of the likelihood is that the necessary interchanges of differentiation and integration are valid. Then, the information matrix is related to the Hessian matrix of $\log \ell$ through

$F_n(\theta) = -E\left[ H(\theta \mid Z_n) \mid \theta \right]$.   (2.29)

The form in (2.29) is usually more amenable to calculating the matrix than the product-based


form in (2.28). Note that in some applications, the observed information matrix at a particular dataset $z_n$ may be easier to compute and/or preferred from an inference point of view relative to the actual information matrix $F_n(\theta)$ in (2.29). Although the method in this work is described for the determination of $F_n(\theta)$, the efficient Hessian estimation may also be used directly for the determination of $H(\theta \mid z_n)$ when it is not easy to calculate the Hessian directly.
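To make the two equivalent forms (2.28) and (2.29) concrete, consider the simple hypothetical model of i.i.d. scalar data $z_i \sim N(\theta, \sigma^2)$ with known $\sigma$, for which the exact FIM is $F_n(\theta) = n/\sigma^2$. The sketch below (an illustration only, not part of the method proper) estimates the outer-product form (2.28) by Monte Carlo and compares it with the Hessian-based value from (2.29):

```python
import numpy as np

# Compare the score outer-product form (2.28) with the negative expected
# Hessian form (2.29) for i.i.d. z_i ~ N(theta, sigma^2), known sigma.
rng = np.random.default_rng(2)
theta, sigma, n, reps = 1.5, 2.0, 50, 20000

F_outer = 0.0
for _ in range(reps):
    z = rng.normal(theta, sigma, size=n)
    score = np.sum(z - theta) / sigma**2    # d log l / d theta
    F_outer += score**2                     # form (2.28), Monte Carlo average
F_outer /= reps

F_hess = n / sigma**2                       # -E[H] is constant here, (2.29)
print(F_outer, "vs exact", F_hess)
```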

2.6.2 -Two Key Properties of the Information Matrix: Connections to Covariance Matrix of Parameter Estimates

Let $\theta^*$ denote the unknown "true" value of $\theta$. The primary rationale for $F_n(\theta)$ as a measure of information about $\theta$ within the data $z_n$ comes from its connection to the covariance matrix for the estimate of $\theta$ constructed from $z_n$. The first of the key properties makes this connection via an asymptotic normality result [23]. In particular, for some common forms of estimates $\hat{\theta}_n$ (e.g., maximum likelihood and Bayesian maximum a posteriori), it is known that, under modest conditions,

$\sqrt{n}\,(\hat{\theta}_n - \theta^*) \xrightarrow{dist} N(0, \bar{F}^{-1})$   (2.30)

where $\xrightarrow{dist}$ denotes convergence in distribution and

$\bar{F} \equiv \lim_{n \to \infty} \dfrac{F_n(\theta^*)}{n}$,   (2.31)

provided that the indicated limit exists and is invertible. Hence, in practice, for $n$ reasonably large, $F_n(\theta)^{-1}$ can serve as an approximate covariance matrix of the estimate $\hat{\theta}_n$ when $\theta$ is chosen close to the unknown $\theta^*$. Relationship (2.30) also holds for optimal implementations of some recursive algorithms where the data $z_i$ are processed recursively instead of in a batch mode, as is typical in maximum likelihood. This includes optimal versions of gradient-based SA algorithms, which include popular algorithms such as LMS and NN backpropagation as special cases. The second key property of the information matrix applies in finite samples.


If $\hat{\theta}_n$ is any unbiased estimator of $\theta$ [23],

$\mathrm{cov}(\hat{\theta}_n) \ge F_n(\theta^*)^{-1}, \quad \forall n.$   (2.32)

There is also an expression analogous to (2.32) for biased estimators, but it is not especially useful in practice because it requires knowledge of the gradient of the bias with respect to $\theta$. Expressions (2.30) and (2.32), taken together, point to the close connection between the inverse Fisher information matrix and the covariance matrix of the estimator. While (2.30) is an asymptotic result, (2.32) applies for all sample sizes, subject to the unbiasedness requirement. It is also clear why the name "information matrix" is used for $F_n(\theta)$: a larger $F_n(\theta)$ (in the matrix sense) is associated with a smaller covariance matrix (i.e., more information), while a smaller $F_n(\theta)$ is associated with a larger covariance matrix (i.e., less information). The calculation of $F_n(\theta)$ is often difficult or impossible in many non-linear problems. Obtaining the required first or second derivatives of the log-likelihood function may be a formidable task in some applications, and computing the required expectation of the generally non-linear multivariate function is often impossible in problems of practical interest. To address this difficulty, this subsection outlines a computer resampling approach to estimating $F_n(\theta)$. This approach is useful when analytical methods for computing $F_n(\theta)$ are infeasible. The approach makes use of an idea introduced for optimization, namely the Hessian estimation for SA, even though this problem is not directly one of optimization. The basis for the technique below is to use computational horsepower in lieu of traditional detailed theoretical analysis to determine $F_n(\theta)$. The method here is an example of an MCNR method for producing an estimate. Such methods have become very popular as a means of handling problems that were formerly infeasible. Other notable Monte Carlo techniques are the bootstrap method for determining statistical distributions of estimates and the Markov chain Monte Carlo method for producing pseudorandom numbers and related quantities. Part of the appeal of the Monte Carlo method here for estimating $F_n(\theta)$ is that it can be implemented with only evaluations of the log-likelihood.
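As a brief illustration of the connection expressed by (2.30)–(2.32), the sketch below (a hypothetical Gaussian-mean example, not from the thesis) compares the empirical variance of the maximum-likelihood estimate with $F_n(\theta^*)^{-1}$, for which the bound (2.32) holds with equality:

```python
import numpy as np

# Gaussian-mean model: the MLE is the sample mean, cov(theta_hat) = sigma^2/n,
# and F_n(theta*)^{-1} = sigma^2/n, so (2.32) is met with equality.
rng = np.random.default_rng(3)
theta_star, sigma, n, reps = 0.7, 1.3, 100, 50000

estimates = rng.normal(theta_star, sigma, size=(reps, n)).mean(axis=1)
print("empirical var:", estimates.var())
print("F_n(theta*)^-1:", sigma**2 / n)
```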

2.6.3 -Estimation of $F_n(\theta)$

The calculation of $F_n(\theta)$ is often difficult or impossible in practical problems. Obtaining the required first or second derivatives of the log-likelihood function may be a formidable task in some applications, and computing the required expectation of the generally non-linear multivariate function is often impossible in problems of practical interest. This section outlines a computer resampling approach to estimating $F_n(\theta)$ that is useful when analytical methods for computing $F_n(\theta)$ are infeasible. The approach makes use of a computationally efficient and easy-to-implement method for Hessian estimation that was described by Spall [24] in the context of optimization.

The computational efficiency follows from the low number of log-likelihood or gradient values needed to produce each Hessian estimate. Although there is no optimization here per se, we use the same basic simultaneous perturbation (SP) formula for Hessian estimation [this is the same SP principle given earlier in Spall [24] for gradient estimation]. However, the way in which the individual Hessian estimates are averaged differs from Spall [24] because of the distinction between the problem of recursive optimization and the problem of estimation of $F_n(\theta)$. The essence of the method is to produce a large number of SP estimates of the Hessian matrix of $\log \ell(\cdot)$ and then average the negative of these estimates to obtain an approximation to $F_n(\theta)$.

This approach is directly motivated by the definition of $F_n(\theta)$ as the mean value of the negative Hessian matrix, as in (2.29). To produce the SP Hessian estimates, we generate pseudodata vectors in a Monte Carlo manner. The pseudodata are generated according to a bootstrap resampling scheme treating the chosen $\theta$ as "truth," i.e., according to the probability model $p_z(\zeta \mid \theta)$ given in (2.27). So, for example, if it is assumed that the real data $Z_n$ are jointly normally distributed, $N(\mu(\theta), \Sigma(\theta))$, then the pseudodata are generated by Monte Carlo according to a normal distribution with the mean $\mu$ and covariance matrix $\Sigma$ evaluated at the chosen $\theta$. Let the $i$-th pseudodata vector be $Z_{pseudo}(i)$; the use of $Z_{pseudo}$ without the argument is a generic reference to a pseudodata vector. This data vector represents a sample of size $n$ from the assumed distribution for the set of data based on


the unknown parameters taking on the chosen value of $\theta$. The approach below can work with either $\log \ell(\theta \mid Z_{pseudo})$ values (alone) or with the gradient $g(\theta \mid Z_{pseudo}) \equiv \partial \log \ell(\theta \mid Z_{pseudo}) / \partial \theta$ if that is available. The former usually corresponds to cases where the likelihood function and the associated non-linear process are so complex that no gradients are available. To highlight the fundamental commonality of the approach in this dissertation, we assume the following:

Let $G(\theta \mid Z_{pseudo})$ represent either a gradient approximation (based on $\log \ell(\theta \mid Z_{pseudo})$ values) or the exact gradient $g(\theta \mid Z_{pseudo})$. Because of its efficiency, the SP gradient approximation is recommended in the case where only $\log \ell(\theta \mid Z_{pseudo})$ values are available (Spall [24]). We now present the Hessian estimate. Let $\hat{H}_k$ denote the $k$-th estimate of the Hessian $H(\cdot)$ of $\log \ell$. The formula for estimating the Hessian is

$\hat{H}_k = \dfrac{1}{2} \left\{ \dfrac{\delta G_k}{2 c_k} \left[ \Delta_{k1}^{-1}, \Delta_{k2}^{-1}, \ldots, \Delta_{kp}^{-1} \right] + \left( \dfrac{\delta G_k}{2 c_k} \left[ \Delta_{k1}^{-1}, \Delta_{k2}^{-1}, \ldots, \Delta_{kp}^{-1} \right] \right)^T \right\}$   (2.33)

where $\delta G_k = G(\theta + c_k \Delta_k \mid Z_{pseudo}) - G(\theta - c_k \Delta_k \mid Z_{pseudo})$ and the perturbation vector in this approach, $\Delta_k = [\Delta_{k1}, \Delta_{k2}, \ldots, \Delta_{kp}]^T$, is a mean-zero random vector such that the $\{\Delta_{ki}\}$ are "small" symmetrically distributed random variables that are uniformly bounded in $k, i$ and satisfy $E(|1/\Delta_{ki}|) < \infty$ uniformly in $k, i$. This latter condition excludes such commonly used Monte


Carlo distributions as uniform and Gaussian. Assume that $|\Delta_{k,j}| \le c$ for some small $c > 0$. In most implementations, the $\{\Delta_{k,j}\}$ are independent and identically distributed (iid) across $k$ and $j$. In implementations involving antithetic random numbers, $\Delta_k$ and $\Delta_{k+1}$ may be dependent random vectors for some $k$, but at each $k$ the $\{\Delta_{kj}\}$ are iid (across $j$). Note that the user has full control over the choice of the $\Delta_{ki}$ distribution. A valid (and simple) choice is the Bernoulli $\pm c$ distribution (it is not known at this time if this is the "best" distribution to choose).
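A minimal sketch of one Hessian estimate per (2.33) is given below. It assumes a user-supplied callable G(theta, Z) returning the gradient (or an SP gradient approximation) of $\log \ell$; the perturbation is implemented here with Bernoulli $\pm 1$ components scaled by a small step c, and all names are illustrative:

```python
import numpy as np

# One per-realization SP Hessian estimate, following (2.33): the symmetrized
# outer product of the gradient difference with the reciprocal perturbations.
def sp_hessian_estimate(G, theta, Z, c, rng):
    p = theta.size
    delta = rng.choice([-1.0, 1.0], size=p)                # Bernoulli +/-1
    dG = G(theta + c * delta, Z) - G(theta - c * delta, Z) # delta G_k
    half = np.outer(dG / (2.0 * c), 1.0 / delta)           # (dG/2c)[Delta^-1]
    return 0.5 * (half + half.T)                           # symmetrized; rank <= 2

# Averaging the negatives of many such estimates over pseudodata realizations
# approximates F_n(theta) to within the O(c^2) bias noted below.
```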

The prime rationale for (2.33) is that $\hat{H}_k$ is a nearly unbiased estimator of the unknown $H$. Spall [24] gave conditions such that the Hessian estimate has an $O(c^2)$ bias. The next proposition considers this further in the context of the resulting (small) bias in the estimate of the FIM.

Proposition 1. Suppose that $g(\theta \mid Z_{pseudo})$ is three times continuously differentiable in $\theta$ for almost all $Z_{pseudo}$. Then, based on the structure and assumptions of (2.33) (see reference [22]),

$E[F_{M,N}(\theta)] = F_n(\theta) + O(c^2).$

Proof: Spall [24] showed that $E(\hat{H}_k \mid Z_{pseudo}) = H(\theta \mid Z_{pseudo}) + O(c^2)$ under the stated conditions on $g(\cdot)$ and $\Delta_k$. Because $F_{M,N}(\theta)$ is a sample mean of $-\hat{H}_k$ values, the result to be proved follows immediately. The symmetrizing operation in (2.33) is convenient to maintain a symmetric Hessian estimate. To illustrate how the individual Hessian estimates may be quite poor, note that $\hat{H}_k$ in (2.33) has (at most) rank two (and may not even be positive semi-definite). This low quality, however, does not prevent the information matrix estimate of interest from being accurate, since it is not the Hessian per se that is of interest. The averaging process eliminates the inadequacies of the individual Hessian estimates.


Given the form for the Hessian estimate in (2.33), it is now relatively straightforward to estimate $F_n(\theta)$. Averaging Hessian estimates across many $Z_{pseudo}(i)$ yields an estimate of

$E[H(\theta \mid Z_{pseudo}(i))] = -F_n(\theta)$

to within an $O(c^2)$ bias (the expectation on the left-hand side above is with respect to the pseudodata). The resulting estimate can be made as accurate as desired by reducing $c$ and increasing the number of $\hat{H}_k$ values being averaged. The averaging of the $\hat{H}_k$ values may be done recursively to avoid having to store many matrices. Of course, the interest is not in the Hessian per se; rather, the interest is in the (negative) mean of the Hessian, according to (2.29) (so the averaging must reflect many different values of $Z_{pseudo}(i)$). This leads to greater variability for a given number ($N$) of pseudodata vectors. This estimation also allows us to keep the Hessian matrix estimate positive definite. Let us now present a step-by-step summary of the above Monte Carlo resampling approach for estimating $F_n(\theta)$. The MCNR method is an iterative procedure that can be used to approximate the maximum of a likelihood function in situations where direct likelihood computation is infeasible because of the existence of unmeasured variables, missing data, or measurement error. Let $\Delta_k^{(i)}$ represent the $k$-th perturbation vector for the $i$-th realization (i.e., for $Z_{pseudo}(i)$). The Monte Carlo algorithm with a resampling method for estimating $F_n(\theta)$ is described as follows:

Step 1. (Initialization). Determine $\theta$, the sample size $n$, and the number $N$ of pseudodata vectors that will be generated; in other words, we need to specify $\hat{\theta}_k$ and the number of pseudodata vectors. Determine whether log-likelihood values $\log \ell(\cdot)$ or gradient information $g(\cdot)$ will be used to form the $\hat{H}_k$. Pick a small number $c_k$ in the Bernoulli $\pm c_k$ distribution used to generate the perturbations $\Delta_{ki}$; e.g., $c_k = 0.001$.


Step 2. (Generating pseudodata). Based on $\hat{\theta}_k$ given in Step 1, generate by the Monte Carlo method the $i$-th pseudodata vector of $n$ pseudo-measurements, $Z_{pseudo}(i)$.

Step 3. (Hessian estimation). With the $i$-th pseudodata vector in Step 2, compute $M \ge 1$ Hessian estimates according to (2.33) [22]. Let the sample mean of these $M$ estimates be $\bar{H}^{(i)} = \bar{H}^{(i)}(Z_{pseudo}(i))$. Unless antithetic random numbers are being used, the perturbation vectors $\{\Delta_k^{(i)}\}$ should be mutually independent across the realizations $i$ and along the realizations (along $k$). (If only log-likelihood values are available and SP gradient approximations are being used to form the $G(\cdot)$ values, the perturbations forming the gradient approximations, say $\{\tilde{\Delta}_k^{(i)}\}$, should likewise be mutually independent.) $Z_{pseudo}(i)$ is the pseudodata vector; this vector represents a sample of size $n$ from the assumed distribution of the set of data based on the unknown parameters.

Step 4. (Averaging Hessian estimates). Repeat Steps 2 and 3 until $N$ pseudodata vectors have been processed. Take the negative of the average of the $N$ Hessian estimates $\bar{H}^{(i)}$ produced in Step 3; this is the estimate of $F_n(\theta)$. The key parameters needed for the mapping are internally determined by $F_n(\theta)$ at each iteration. Figure 2.2 is a schematic of the steps.

Fig. 2.2. Diagram of the method for forming the estimate $F_{M,N}(\theta)$.
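The four steps above can be collected into a compact sketch. The example below assumes a simple Gaussian pseudodata model ($Z_{pseudo}(i) \sim N(\theta, \sigma^2 I)$, as in the normal example mentioned earlier) with an exact gradient; make_pseudodata, G and the parameter values are illustrative assumptions rather than part of the thesis text:

```python
import numpy as np

# Sketch of Steps 1-4: Monte Carlo resampling estimate of F_n(theta).
rng = np.random.default_rng(4)
theta = np.array([1.0, -0.5])
sigma, n, N, M, c = 1.0, 40, 2000, 1, 0.001          # Step 1 (c_k = 0.001)

def make_pseudodata(theta, n, rng):                  # Step 2
    return theta + sigma * rng.standard_normal((n, theta.size))

def G(th, Z):                                        # exact gradient of log l
    return np.sum(Z - th, axis=0) / sigma**2

F_sum = np.zeros((theta.size, theta.size))
for _ in range(N):
    Z = make_pseudodata(theta, n, rng)
    H_bar = np.zeros_like(F_sum)
    for _ in range(M):                               # Step 3: M >= 1 estimates
        delta = rng.choice([-1.0, 1.0], size=theta.size)
        dG = G(theta + c * delta, Z) - G(theta - c * delta, Z)
        half = np.outer(dG / (2.0 * c), 1.0 / delta)
        H_bar += 0.5 * (half + half.T) / M
    F_sum -= H_bar                                   # Step 4: negative average
print(F_sum / N, "\nexact:", (n / sigma**2) * np.eye(theta.size))
```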


2.7 -Efficiency Between 1st-SPSA, 2nd-SPSA and M2-SPSA

The proposed SPSA algorithm presented above offers considerable potential for accelerating the convergence of SA algorithms while requiring only loss function measurements (no gradient or higher-derivative measurements are needed), since it requires only three measurements per iteration to estimate both the gradient and the Hessian, independently of the dimension of the problem. The relationships among 1st-SPSA, 2nd-SPSA and M2-SPSA can thus also be understood from a different perspective: 1st-SPSA (2.1) and M2-SPSA (2.13) weight the different components of the estimated gradient $\hat{g}_k(\hat{\theta}_k)$ equally, whereas 2nd-SPSA (2.2a) weights them differently to account for the different sensitivities of $\theta$. A steeper eigen-direction (greater $\lambda_i$) requires a smaller step ($\approx 1/\lambda_i$) to effectively reach the exact solution [25][26]. Both 2nd-SPSA and our proposed SPSA algorithm capture the dependence of the step size on the overall sensitivities of $\theta$ at each iteration. From this perspective, 2nd-SPSA and the proposed SPSA algorithm are superior to 1st-SPSA. However, since our proposed SPSA weights the different components of $\hat{g}_k(\hat{\theta}_k)$ equally with an averaged step ($\approx 1/\bar{\lambda}_k$), it has given up the further advantage of the higher-order sensitivity of $\theta$. Therefore, whether our proposed SPSA algorithm is better than 2nd-SPSA at finite iterations is determined by the relative importance of two competing factors that influence the efficiency of the algorithm: the elimination of the matrix inverse reduces the magnitude of errors, whereas the lack of gradient sensitivity may deteriorate the accuracy. It is noted that the asymptotic relation (2.25) only shows an improvement of our proposed SPSA over 2nd-SPSA in terms of its rate coefficient. Both our proposed SPSA algorithm and 2nd-SPSA have the same rate of convergence, characterized by $k^{-\beta/2}$ as shown by (2.20). The asymptotic relation (2.25) provides a theoretical rationale for considering M2-SPSA over 2nd-SPSA in practice, although the maximum rate of convergence of $k^{-2/3}$ for the MSE cannot be achieved by our proposed SPSA algorithm. Another rationale for proposing M2-SPSA is that the amplification of errors in an ill-conditioned $H_k^*$ through the matrix inversion is a well-established result, whereas the efficiency of the gradient sensitivity through the Newton–Raphson search shows only near the extreme point ($\theta^*$) with a near-exact Hessian [26]. Recall, however, that such justification for the proposed SPSA algorithm is restricted to the case where the gains are not asymptotically optimal, in order to achieve fast convergence with finite iterations. For the asymptotically optimal gains ($a_k \approx 1/k$, $c_k \approx 1/k^{1/6}$), 2nd-SPSA is superior to M2-SPSA except in the case where all eigenvalues of $H_k^*$ are identical (where 2nd-SPSA and M2-SPSA coincide). It is shown that the magnitude of errors in 2nd-SPSA is dependent on the matrix conditioning of $H_k^*$.


We have shown that the magnitude of errors in SPSA is dependent on the matrix conditioning of $H^*$ due to two competing factors. Since both factors are strongly related to the same quantity, the matrix conditioning, the relative efficiency between M2-SPSA and 2nd-SPSA might be less dependent on specific loss functions. However, such a replacement does not necessarily suggest that the magnitude of errors in our proposed SPSA is independent of the matrix conditioning of $H^*$, since the computation of $\bar{\lambda}_k$ is dependent on the matrix properties of $H^*$.

2.8 -Implementation Aspects

The five points below have been found important in making the adaptive simultaneous perturbation (ASP) approach perform well in practice. Before describing these points, we note that while the ASP structure in (2.2a) and (2.2b) is general, we will largely restrict our choice of $G_k(\cdot)$ (and $G_k^{(1)}(\cdot)$) in the remainder of the discussion in order to present concrete theoretical and numerical results. For M2-SPSA, we will consider the simultaneous perturbation approach for generating $G_k(\cdot)$ and $G_k^{(1)}(\cdot)$, while for the second-order stochastic gradient (2SG) case, we will suppose that $G_k(\cdot) = G_k^{(1)}(\cdot)$ is an unbiased direct measurement of $g(\cdot)$; in other words, $G_k(\hat{\theta}_k)$ is the input information related to $g(\hat{\theta}_k)$. The rationale for basic SPSA in the gradient-free case has been discussed extensively elsewhere (e.g., Spall [28]) and hence will not be discussed in detail here. (In summary, it tends to lead to more efficient optimization than the classical finite-difference Kiefer–Wolfowitz method while being no more difficult to implement; the relative efficiency grows with the problem dimension.) In the gradient-based case, stochastic gradient (SG) methods include as special cases the well-known approaches mentioned at the beginning of the dissertation (backpropagation, etc.). SG methods are themselves special cases of the general Robbins–Monro root-finding framework and, in fact, most of the results here apply in this root-finding setting as well. The associated Appendixes A and B provide part of the theoretical justification for SP, establishing conditions for the almost sure (a.s.) convergence of both the iterate and the Hessian estimate. Now, we can explain the five points in the implementation of M2-SPSA as follows:

1) $\theta$ and $H$ Initialization: Typically, (2.2a) is initialized at some $\hat{\theta}_0$ believed to be near $\theta^*$. One may wish to run the standard first-order SA (i.e., (2.2a) without $\bar{H}_k^{-1}$) or some other "rough" optimization approach for some period in order to move the initial $\theta$ for ASP closer to $\theta^*$. With the indexing shown in (2.2b), no initialization of the $\bar{H}_k$ recursion is required, since $\bar{H}_0$ is computed directly from $\hat{H}_0$; however, the recursion may be trivially modified to allow for an initialization if one has useful prior information. If this is done, then the recursion may be initialized at (say) $\mathrm{scale} \cdot I_{p \times p}$, $\mathrm{scale} \ge 0$, or some other positive semi-definite matrix reflecting the available prior information (e.g., if one knows that the $\theta$ elements will have very different magnitudes, then the initialization may be chosen to approximately scale for the differences). It is also possible to run (2.2b) in parallel with the rough search methods that might be used for initializing $\theta$. Note that $\hat{H}_k$ has (at most) rank 2 (and may not be positive semi-definite).

2) Numerical Issues in the Choice of $\Delta_k$ and $\bar{H}_k$: Generating the elements of $\Delta_k$ according to a Bernoulli $\pm 1$ distribution is easy and theoretically valid (and was shown to be asymptotically optimal in Brennan and Rogers [27] and Spall [28] for basic SPSA; its potential optimality for the adaptive approach here is an open question). Having a positive-definite initialization helps provide for the invertibility of $\bar{H}_k$, especially for small $k$ (if $\bar{H}_k$ is positive definite, $f_k(\cdot)$ in (2.2a) may be taken as the identity transformation). In some applications, however, it may be worth exploring other valid choices of distributions, since the generation of $\Delta_k$ represents a trivial part of the cost of optimization, and a different choice may yield improved finite-sample performance. Because $\bar{H}_k$ may not be positive definite, especially for small $k$ (even if $\bar{H}_0$ is initialized based on prior information to be positive definite), it is recommended that $\bar{H}_k$ in (2.2b) not generally be used directly in (2.2a). Hence, as shown in (2.2a), it is recommended that $\bar{H}_k$ be replaced by another matrix $\bar{\bar{H}}_k$ that is closely related to $\bar{H}_k$. One useful form when $p$ is not too large has been to take $\bar{\bar{H}}_k = (\bar{H}_k \bar{H}_k)^{1/2} + \delta_k I$, where the indicated square root is the (unique) positive semi-definite square root and $\delta_k \ge 0$ is some small number.


For large $p$, a more efficient method is to simply set $\bar{\bar{H}}_k = \bar{H}_k + \delta_k I$, but this is likely to require a larger $\delta_k$ to ensure the positive definiteness of $\bar{\bar{H}}_k$. For very large $p$, it may be advantageous to have $\bar{\bar{H}}_k$ be only a diagonal matrix based on the diagonal elements of $\bar{H}_k + \delta_k I$. This is a way of capturing large scaling differences in the $\theta$ elements (unavailable to first-order algorithms) while eliminating the potentially onerous computations associated with the inverse operation in (2.2a). Note that $\bar{\bar{H}}_k$ should only be used in (2.2a), as (2.2b) should remain in terms of $\bar{H}_k$ to ensure a.s. consistency. By Theorems 2a, b, one can set $\bar{\bar{H}}_k = \bar{H}_k$ for sufficiently large $k$. Also, for a general (non-diagonal) $\bar{\bar{H}}_k$, it is numerically advantageous to avoid a direct inversion of $\bar{\bar{H}}_k$ in (2.2a), preferring a method such as Gaussian elimination.
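The two regularizations just described, and the solve-instead-of-invert recommendation, can be sketched as follows (assuming NumPy/SciPy; the function names and the choice of $\delta_k$ are illustrative):

```python
import numpy as np
from scipy.linalg import sqrtm

def make_pd_sqrt(H_bar, delta_k):
    # (H_bar H_bar)^{1/2} + delta_k I : PSD square-root form for moderate p
    return np.real(sqrtm(H_bar @ H_bar)) + delta_k * np.eye(H_bar.shape[0])

def make_pd_shift(H_bar, delta_k):
    # H_bar + delta_k I : cheaper form for large p (may need a larger delta_k)
    return H_bar + delta_k * np.eye(H_bar.shape[0])

def step_direction(H_dd, grad_est):
    # Solve H_dd x = grad_est (Gaussian elimination) instead of inverting H_dd.
    return np.linalg.solve(H_dd, grad_est)
```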

3) Gradient/Hessian Averaging: At each iteration, it may be desirable to compute and average several gradient and Hessian estimates despite the additional cost. This may be especially true in a high-noise environment.

4) Gain Selection: The principles outlined in Brennan and Rogers [27] and Spall [28] are useful here as well for the practical selection of the gain sequences $\{a_k\}$, $\{c_k\}$ and, in the M2-SPSA case, $\{\tilde{c}_k\}$. For M2-SPSA, the critical gain $a_k$ can simply be chosen as $1/k$, $k \ge 1$, to achieve asymptotic near-optimality or optimality, respectively, although this may not be ideal in practical finite-sample problems. For the remainder, let us focus on the M2-SPSA case. Here we can choose $a_k = a/(k + A)^\alpha$, $c_k = c/k^\gamma$ and $\tilde{c}_k = \tilde{c}/k^\gamma$, with $a, c, \tilde{c}, \alpha, \gamma > 0$ and $A \ge 0$ for $k \ge 1$. In finite-sample practice, it may be better to choose $\alpha$ and $\gamma$ lower than their asymptotically optimal values of $\alpha = 1$ and $\gamma = 1/6$ (see Sec. 2.10); in particular, $\alpha = 0.602$ and $\gamma = 0.101$ are practically effective and approximately the lowest theoretically valid values allowed (see Theorems 1a, 2a, and 3a). Choosing $a$ so that the typical change in $\hat{\theta}_k$ is of "reasonable" magnitude, especially in the critical early iterations, has proven effective. Setting $A$ approximately equal to 5–10% of the total expected number of iterations enhances practical convergence by allowing for a larger $a$ than is possible with the more typical $A = 0$. However, in slight contrast to Spall [28] for the first-order algorithm, we recommend that $c$ have a magnitude


greater (by roughly a factor of 2–10) than the typical ("one-sigma") noise level in the $y(\cdot)$ measurements. Further, setting $\tilde{c} > c$ has been effective. These recommendations for larger $c$ (and $\tilde{c}$) values than given in Spall [28] are made due to the greater inherent sensitivity of a second-order algorithm to noise effects.
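A small sketch of the gain sequences in this guideline is given below, with the practically effective exponents $\alpha = 0.602$ and $\gamma = 0.101$; the constants a, c, c_tilde and A are illustrative placeholders to be tuned per problem (e.g., A at 5–10% of the expected iteration count, c above the one-sigma noise level of $y(\cdot)$, and c_tilde > c):

```python
# Decaying gain sequences a_k = a/(k + A)^alpha, c_k = c/k^gamma and
# c_tilde_k = c_tilde/k^gamma; k starts at 1 to avoid division by zero.
def gains(k, a=0.5, A=50.0, c=0.1, c_tilde=0.2, alpha=0.602, gamma=0.101):
    a_k = a / (k + A) ** alpha
    c_k = c / k ** gamma
    c_tilde_k = c_tilde / k ** gamma
    return a_k, c_k, c_tilde_k

print(gains(1), gains(1000))   # gains shrink slowly across iterations
```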

2.9 -Strong Convergence

This section presents results related to the strong (a.s.) convergence $\hat{\theta}_k \to \theta^*$ and $\bar{H}_k \to H(\theta^*)$ (all limits are as $k \to \infty$ unless otherwise noted). This section establishes separate results for M2-SPSA. One of the challenges, of course, in establishing convergence is the coupling between the recursions for $\hat{\theta}_k$ and $\bar{H}_k$. Formal convergence of $\bar{H}_k$ (see Theorems 2a, b) may still hold under such weighting provided that the analog to expressions (A10) and (A13) in the proof of Theorem 2a (see Appendix) holds. We present a martingale approach that seems to provide a relatively simple solution with reasonable regularity conditions. Alternative conditions for convergence might be available using the ordinary differential equation approach of Metivier and Priouret [29] and Benveniste [30], which includes a certain Markov dependence that would, in principle, accommodate the recursion coupling. However, this approach was not pursued here due to the difficulty of checking certain regularity conditions associated with the Markov dependence (e.g., those related to the solution of the "Poisson equation"). The results below are in two parts, with the first part (Theorems 1a, b) establishing conditions for the convergence of $\hat{\theta}_k$, and the second part (Theorems 2a, b) doing the same for $\bar{H}_k$. The proofs of the theorems are in Appendix A. We let $\|\cdot\|$ denote the standard Euclidean vector norm or a compatible matrix spectral norm (as appropriate), let $(\theta^*)_i$ and $(\theta - \theta^*)_i$ represent the $i$-th components of the indicated vectors (notation chosen to avoid confusion with the iteration subscript $k$), let "i.o." represent "infinitely often", and define $\bar{g}_k(\hat{\theta}_k) \equiv \bar{H}_k^{-1} g(\hat{\theta}_k)$. Below are some regularity conditions that will be used in Theorem 1a for M2-SPSA and, in part, in the succeeding theorems. Some comments on the practical implications of the conditions are given immediately following their statement. Note that some conditions show a dependence on $\hat{\theta}_k$

and $\bar{H}_k$, the very quantities for which we are showing convergence. Although such "circularity" is generally undesirable, it is fairly common in the SA field (e.g., Kushner and Yin [31], Benveniste [30]). The inherent difficulty in establishing theoretical properties of adaptive approaches comes from the need to couple the estimates for the parameters of interest and for the Hessian (Jacobian) matrix. Note that the bulk of the conditions here showing a dependence on $\hat{\theta}_k$ and $\bar{H}_k$ are conditions on the measurement noise and the smoothness of the loss function (C.0, C.2, and C.3 below; C.0', C.2', C.3', C.8, and C.8' in later theorems); the explicit dependence on $\hat{\theta}_k$ can be removed by assuming that the relevant condition holds uniformly for all "reasonable" $\theta$. The dependence in C.5 is handled in the lemma below. The following assumptions are guidelines [16] that are very useful for establishing our theorems.

C.0 $E(\varepsilon_k^{(+)} - \varepsilon_k^{(-)} \mid \Delta_k; \bar{H}_k) = 0$ a.s. $\forall k$, where $\varepsilon_k^{(\pm)}$ is the effective SA measurement noise, i.e., $\varepsilon_k^{(\pm)} \equiv y(\hat{\theta}_k \pm c_k \Delta_k) - L(\hat{\theta}_k \pm c_k \Delta_k)$.

C.1 $a_k, c_k > 0$ $\forall k$; $a_k \to 0$ and $c_k \to 0$ as $k \to \infty$; $\sum_{k=0}^{\infty} a_k = \infty$; $\sum_{k=0}^{\infty} (a_k / c_k)^2 < \infty$.

C.2 For some $\delta, \rho > 0$ and $\forall k, l$: $E\big( |y(\hat{\theta}_k \pm c_k \Delta_k) / \Delta_{kl}|^{2+\delta} \big) \le \rho$, $|\Delta_{kl}| \le \rho$, $\Delta_{kl}$ is symmetrically distributed about 0, and the $\{\Delta_{kl}\}$ are mutually independent.

C.3 For some $\rho > 0$ and almost all $\hat{\theta}_k$, the function $g(\cdot)$ is continuously twice differentiable with a uniformly (in $k$) bounded second derivative for all $\theta$ such that $\|\hat{\theta}_k - \theta\| \le \rho$.

C.4 For each $k \ge 1$ and all $\theta$, there exists a $\rho > 0$ not dependent on $k$ and $\theta$ such that $(\theta - \theta^*)^T \bar{g}_k(\theta) > \rho\, \|\theta - \theta^*\|$.

C.5 For each $i = 1, 2, \ldots, p$ and any $\rho > 0$, $P\big( \{\bar{g}_{ki}(\hat{\theta}_k) \ge 0 \text{ i.o.}\} \cap \{\bar{g}_{ki}(\hat{\theta}_k) < 0 \text{ i.o.}\} \,\big|\, \{ |\hat{\theta}_{ki} - (\theta^*)_i| \ge \rho\ \forall k \} \big) = 0$.

C.6 $\bar{H}_k^{-1}$ exists a.s. $\forall k$; $c_k^2 \bar{H}_k^{-1} \to 0$ a.s.; and for some $\delta, \rho > 0$, $E\big( \|\bar{H}_k^{-1}\|^{2+\delta} \big) \le \rho$.

C.7 For any $\tau > 0$ and non-empty $S \subseteq \{1, 2, \ldots, p\}$, there exists a $\rho'(\tau, S) > \tau$ such that

$\limsup_{k \to \infty} \dfrac{\big| \sum_{i \notin S} (\theta - \theta^*)_i\, \bar{g}_{ki}(\theta) \big|}{\big| \sum_{i \in S} (\theta - \theta^*)_i\, \bar{g}_{ki}(\theta) \big|} < 1$   (2.34)

for all $|(\theta - \theta^*)_i| < \tau$ when $i \notin S$ and $|(\theta - \theta^*)_i| \ge \rho'(\tau, S)$ when $i \in S$.

C.0 and C.1 are common martingale-difference noise and gain sequence conditions. C.2 provides a structure to ensure that the gradient approximations $G_k(\cdot)$ and $G_k^{(1)}(\cdot)$ are well behaved. The conditions on $\Delta_k$ preclude its elements from being uniformly or normally distributed, due to their violation of the implied finite inverse moments condition in $E\big( |y(\hat{\theta}_k \pm c_k \Delta_k)/\Delta_{kl}|^{2+\delta} \big) \le \rho$. An independent Bernoulli $\pm 1$ distribution is frequently used for the elements of $\Delta_k$. C.3 and C.4 provide basic assumptions about the smoothness and steepness of $L(\theta)$. C.3 holds, of course, if $g(\theta)$ is twice continuously differentiable with a bounded second derivative on $R^p$. C.5 is a modest condition that says that $\hat{\theta}_k$ cannot be bouncing around in a manner that causes the signs of the normalized gradient elements to change an infinite number of times if $\hat{\theta}_k$ is uniformly bounded away from $\theta^*$. C.6 provides some conditions on the surrogate for the Hessian estimate that appears in (2.2a) and (2.2b). Since the user has full control over the definition of $\bar{\bar{H}}_k$, these conditions should be relatively easy to satisfy. Note that the middle part of C.6 ($\bar{H}_k^{-1} = o(c_k^{-2})$ a.s.) allows $\bar{H}_k^{-1}$ to "occasionally" be large, provided that the boundedness of moments in the last part of the condition is satisfied. The example for $\bar{\bar{H}}_k$ given in Sec. 2.8 [guideline 2] would satisfy this potential growth condition, for instance, if $\delta_k = c_k^\rho$, $0 < \rho < 2$. Finally, C.7 ensures that, for $k$ sufficiently large, each element of $\bar{g}_k(\theta)$ tends to make a non-negligible contribution to products of the form $(\theta - \theta^*)^T \bar{g}_k(\theta)$ (see C.4). A sufficient condition for C.7 is that, for each $i$, $\bar{g}_{ki}(\theta)$ be uniformly (in $k$) bounded $> 0$ and $< \infty$ when $(\theta - \theta^*)_i$ is bounded as stated in the lemma below. Note that, although no explicit conditions are shown on $\{\tilde{c}_k\}$, there are implicit conditions in C.4–C.7 given $\tilde{c}_k$'s effect on $\bar{H}_k$ (via $\hat{H}_k$). In Theorem 2a on the convergence of $\bar{H}_k$, there are explicit conditions on $\{\tilde{c}_k\}$.

Conditions C.5 and C.7 are relatively unfamiliar. So, be<strong>for</strong>e showing the main theorems on<br />

convergence <strong>for</strong> M2-<strong>SPSA</strong>, we give sufficient conditions <strong>for</strong> these two conditions in the lemma<br />

below. The main sufficient condition is the well-known boundedness condition on the SA<br />

iterate (e.g., Benveniste [30, Theorem II.15]). Although some authors have relaxed this<br />

boundedness condition (e.g., Kushner and Yin [31]), the condition imposes no practical<br />

limitation. This boundedness condition also <strong>for</strong>mally eliminates the need <strong>for</strong> the explicit<br />

dependence <strong>of</strong> other conditions (C.2 and C.3 above; C.0’, C.2’, C.3’, C.8, and C.8’ below) on<br />

θˆ k<br />

since the conditions can be restated to hold <strong>for</strong> all θ in the bounded set containing<br />

Note also that the condition a /<br />

2 → 0 holds automatically <strong>for</strong> gains in the standard <strong>for</strong>m<br />

k<br />

c k<br />

discussed in 2.9.1. One example <strong>of</strong> when the remaining condition <strong>of</strong> the lemma (2.35), is<br />

θˆ k .<br />

trivially satisfied is<br />

Hk<br />

is chosen as a diagonal matrix (see guideline 2).<br />

Lemma—Sufficient Conditions <strong>for</strong> C.5 and C.7: Assume that C.1–C.4 and C.6 hold, and<br />

lim sup k<br />

θˆ < ∞ a.s. Then condition C.7 is not needed. Further, let a /<br />

2 → 0, and suppose<br />

−∞<br />

that, <strong>for</strong> any ρ > 0<br />

k<br />

P (sign g ˆ θ )<br />

ki<br />

( k<br />

≠ sign g ˆ θ ) i.o. ˆ θ − ( θ<br />

* ) ≥ ρ)<br />

= 0<br />

i<br />

( k<br />

.<br />

ki i<br />

k<br />

c k<br />

∀ i<br />

(2.35)<br />

Then C.5 is automatically satisfied.<br />

(1)<br />

Theorem 1a—M2-<strong>SPSA</strong>: Consider the <strong>SPSA</strong> estimate <strong>for</strong> G (⋅)<br />

with G ( ⋅)<br />

given by (2.34).<br />

Let conditions C.0–C.7 hold. Then ˆ θ * k<br />

−θ →0<br />

a.s.<br />

k<br />

k<br />

Theorem 1b below on the second-<strong>order</strong> stochastic gradient (2SG) approach is a straight<strong>for</strong>ward<br />

modification <strong>of</strong> Theorem 1a on M2-<strong>SPSA</strong>. In <strong>order</strong> to explain more clearly the theorems <strong>of</strong><br />

M2-<strong>SPSA</strong>, we take some references from the theorems <strong>of</strong> the SG <strong>for</strong>m [21]. There<strong>for</strong>e, we<br />

replace C.0, C.1, and C.2 with the following SG analogs. Equalities hold a.s. where needed.<br />

47


CHAPTER 2. PROPOSED <strong>SPSA</strong> ALGORITHM<br />

( + )<br />

C.0’: E(<br />

e ˆ<br />

k<br />

θ ; ∆ ; H ) = 0 where e = G ˆ θ ) − g( ˆ θ ).<br />

k<br />

k<br />

k<br />

∞<br />

→<br />

k ∑ ∑<br />

k<br />

k<br />

(<br />

k k<br />

2<br />

C.1’: a 0∀k<br />

; a →0;<br />

a = ∞,<br />

a < ∞.<br />

k<br />

∞<br />

k=<br />

0 k<br />

k=<br />

0 k<br />

2+<br />

δ<br />

C.2’: For some δ , ρ > 0, E ( G ( θˆ<br />

) ) ≤ ρ ∀ k .<br />

k<br />

k<br />

Note (analogous to ~ c } in Theorem 1a) that there are no explicit conditions on c } here.<br />

{ k<br />

{ k<br />

These conditions are implicit via the conditions on<br />

H<br />

k<br />

, and will be made explicit when we<br />

consider the convergence <strong>of</strong><br />

H<br />

k<br />

in Theorem 2b.<br />

Theorem 1b—2SG: Consider the setting where ( ⋅)<br />

Suppose that C.0’ –C.2’ and C.3–C.7 hold. Then ˆ θ * k<br />

−θ →0<br />

a.s.<br />

Theorem 2a below treats the convergence <strong>of</strong><br />

conditions as follows, which are largely self-explanatory:<br />

C.1’’: The conditions <strong>of</strong> C.1 hold plus<br />

∑<br />

G is a direct measurement <strong>of</strong> the gradient.<br />

k<br />

H<br />

k<br />

in the <strong>SPSA</strong> case. We introduce several new<br />

−2<br />

−2<br />

k + 1) ( c ~ c ) < ∞ with c ~ = O(<br />

).<br />

∞<br />

(<br />

k=<br />

0<br />

k k<br />

k<br />

c k<br />

C.3’: Change “thrice differentiable” in C.3 to “four-times differentiable” with all else<br />

unchanged.<br />

C.8: For some ρ > 0 and all k ,l,<br />

m ,<br />

ˆ θ ± ∆ + ~ ~ 2 ~ 2<br />

[ y(<br />

c c ∆ ) /( ∆ ∆ ) ] ≤ ρ<br />

E<br />

k k k k k kl<br />

km<br />

and<br />

ˆ<br />

2 ~ 2<br />

[ y(<br />

θ ± ∆ ) /( ∆ ∆ ) ] ≤ ρ<br />

E<br />

k<br />

c k k kl<br />

km<br />

(<br />

~ (<br />

E ε<br />

ˆ ~<br />

θ ; ∆ ; H<br />

± ) ( )<br />

− ±<br />

k<br />

ε<br />

k k k k<br />

) = 0<br />

and<br />

~ (<br />

ε<br />

± ) (<br />

− ε<br />

± ) 2 ~ 2<br />

[( ) /( ∆ ∆ ) ]<br />

E<br />

k k<br />

kl<br />

km<br />

≤ ρ<br />

where ~ ( ± ) ˆ ~ ~ ˆ ~ ~<br />

ε = y(<br />

θ ± c ∆ + c ∆ ) − L(<br />

θ ± c ∆ + c ∆ ).<br />

k<br />

k<br />

k<br />

k<br />

k<br />

k<br />

k<br />

k<br />

k<br />

k<br />

k<br />

48


2.9 STRONG CONVERGENCE<br />

C.9:<br />

∆ ~ ~<br />

k<br />

satisfies the assumptions <strong>for</strong> ∆k<br />

in C.2 (i.e., ∀ k , l , ∆<br />

kl<br />

≤ ρ and ∆ ~<br />

l<br />

k<br />

is<br />

symmetrically distributed about 0; { ∆ ~ kl<br />

} are mutually independent); ∆<br />

k<br />

and<br />

∆ ~<br />

k<br />

are<br />

independent;<br />

E<br />

−2<br />

−2<br />

( ∆ ) ≤ , E( ∆ ) ≤ ρ∀k<br />

l<br />

ρ and some ρ > 0 .<br />

kl kl<br />

,<br />

Theorem 2a—M2-<strong>SPSA</strong>: Let conditions C.0, C.1’’, C.2, C.3’, and C.4–C.9 hold. Then,<br />

H H( θ<br />

* k<br />

→ ) a.s. Our final strong convergence result is <strong>for</strong> the <strong>Hessian</strong> estimate in 2SG. As<br />

above, we introduce some additional modified conditions.<br />

−2<br />

−2<br />

C.1’’’: The conditions <strong>of</strong> C1’ hold plus c 0,<br />

c →0<br />

and ( k + 1) c < ∞.<br />

C.8’: For some ρ →0<br />

and all k , l ,<br />

2<br />

E<br />

⎛ θ ⎞<br />

⎜ g( ˆ<br />

k<br />

± c k<br />

∆k<br />

) / ∆k<br />

l ⎟ ≤ ρ<br />

⎝<br />

⎠<br />

k<br />

><br />

k<br />

∑ ∞ k=<br />

0<br />

k<br />

and<br />

E<br />

⎜⎛<br />

⎝<br />

( )<br />

( e − −<br />

k<br />

e k<br />

) / ∆k<br />

l<br />

+ 2<br />

⎟⎞<br />

≤ ρ<br />

⎠<br />

E<br />

( )<br />

( e )/ ˆ ) + − −<br />

k<br />

e k<br />

∆k<br />

l<br />

θ = 0<br />

k<br />

ˆ θ<br />

ˆ θ<br />

( ± )<br />

where e = G ( ± c ∆ ) − g( ± c ∆ ).<br />

k<br />

k<br />

k<br />

k<br />

k<br />

C.9’ : For some ρ > 0 and all k , l , ∆<br />

kl<br />

≤ , ∆kl<br />

2<br />

are mutually independent, and E ( ) .<br />

k<br />

k<br />

∆ − kl<br />

k<br />

≤ ρ<br />

ρ ,is symmetrically distributed about 0, { ∆ }<br />

kl<br />

Unlike this theorem’s companion result <strong>for</strong> 2SG (Theorem 1b), explicit conditions are necessary<br />

on { c k<br />

} to control the convergence <strong>of</strong> the <strong>Hessian</strong> iteration. Note that due to the simpler<br />

structure <strong>of</strong> 2SG (versus M2-<strong>SPSA</strong>), the conditions in C.9’ are a subset <strong>of</strong> the conditions in C.9<br />

<strong>for</strong> Theorem 2a.<br />

Theorem 2b—2SG: Suppose that C.0, C.1, C.2, C.3, C.4–C.7, C.8 and C.9 hold. Then<br />

H H( θ<br />

* k<br />

→ ) a.s.<br />

49


CHAPTER 2. PROPOSED <strong>SPSA</strong> ALGORITHM<br />

2.10 -Asymptotic Distributions and Efficiency Analysis<br />

A. Asymptotic Distributions <strong>of</strong> ASP<br />

This subsection builds on the convergence results in the previous section, establishing the<br />

asymptotic normality <strong>of</strong> the M2-<strong>SPSA</strong> and 2SG <strong>for</strong>mulations <strong>of</strong> ASP. The asymptotic normality<br />

is then used in Sec. 2.9 to analyze the asymptotic efficiency <strong>of</strong> the algorithms. Pro<strong>of</strong>s are in<br />

Appendix A.<br />

M2-<strong>SPSA</strong> Setting: As be<strong>for</strong>e, we consider 2nd-<strong>SPSA</strong> be<strong>for</strong>e 2SG. Asymptotic normality or the<br />

related issue <strong>of</strong> convergence <strong>of</strong> moments in basic first-<strong>order</strong> <strong>SPSA</strong> has been established under<br />

slightly differing conditions by Spall [3], Spall and Criston et al. [32], Dippon and Renz [33],<br />

Kushner and Yin [31, ch. 10]. We consider gains <strong>of</strong> the typical <strong>for</strong>m<br />

c k<br />

γ<br />

= c / k , a,<br />

c,<br />

α,<br />

γ > 0, A ≥ 0, k ≥1<br />

and take<br />

=<br />

ki<br />

a k<br />

+<br />

α<br />

= a /( k A) and<br />

β = α − 2γ<br />

, 2 (<br />

−2 2 −2<br />

ρ E ∆ ) , ξ = E(<br />

∆ ki<br />

) ∀k,<br />

i .The<br />

asymptotic mean below relies on the third derivative <strong>of</strong> L(θ<br />

) we let L ( * )<br />

derivative <strong>of</strong> Lwith respect to elements i,j,k <strong>of</strong> θ evaluated at<br />

conditions will be used in the asymptotic normality result.<br />

3 ijk<br />

θ<br />

represent the third<br />

*<br />

θ . The following regularity<br />

E<br />

~ ˆ<br />

as <strong>for</strong>m some σ<br />

2 > 0 . In this point, <strong>for</strong> some all<br />

( ) ( ) 2<br />

2<br />

C.10: ( ( ε − ε ) θ , ) ± H → σ ;<br />

± k k k k<br />

(<br />

{ ( ε − ε ) θ ∆ η )}<br />

+ ) ( − )<br />

E ˆ<br />

k k k<br />

, ck<br />

k<br />

2<br />

ˆ<br />

k<br />

θ ,<br />

=<br />

is an equicontinuous sequence at η = 0 and is continuous<br />

in η on some compact, connected set containing the actual (observed) value <strong>of</strong><br />

c ∆ a.s.<br />

k<br />

k<br />

C.11: In addition to implicit conditions an α and γ via C.1’’, 3γ −α / 2 ≥ 0 and β > 0 .<br />

Further, whenα = 1,<br />

a > β / 2 . Let f (⋅)<br />

in (2.2a) be chosen such that H − H → 0 a.s.<br />

k<br />

Although, in some applications, the “ → ” <strong>for</strong> the noise second moments in C.10 may be<br />

replaced by “=,” the limiting operation allows <strong>for</strong> a more general setting. Since the user has<br />

k<br />

k<br />

full control over f (⋅)<br />

, it is not difficult to guarantee in C.11 that H −H<br />

→ 0<br />

k<br />

k<br />

k<br />

a.s.<br />

Theorem 3a—M2-<strong>SPSA</strong>: Suppose that C.0, C.1’’, C.2, C.3’, and C.4–C.9 hold (implying<br />

50


2.10 ASYMPTOTIC DISTRIBUTIONS AND ANALYSIS<br />

convergence <strong>of</strong><br />

θˆ and H ). Then, if C.10 and C.11 hold and<br />

k<br />

k<br />

H(<br />

θ<br />

* −1<br />

)<br />

exists,<br />

β /2 ˆ *<br />

k ( θ −θ<br />

) dist<br />

k<br />

⎯ ⎯→<br />

N(<br />

µ , Ω)<br />

(2.36)<br />

where µ = 0{<br />

0 3γ<br />

− α / 2 > 0<br />

T is<br />

−1<br />

*<br />

if ( ) T /( / 2)<br />

H θ a − β if 3 γ −α / 2 = 0} the j-th element <strong>of</strong><br />

+<br />

Ω =<br />

⎡<br />

⎤<br />

P<br />

1 2 2⎢<br />

(3) *<br />

(3) *<br />

− ac ξ L + ⎥<br />

⎢ jjj<br />

( θ ) 3∑<br />

L jjj<br />

( θ )<br />

(2.37)<br />

6<br />

⎥<br />

i=<br />

1<br />

⎢⎣<br />

i≠1<br />

⎥⎦<br />

( 8a<br />

+ β )<br />

2 −2<br />

2 2 * −2<br />

a c σ ρ H(<br />

θ ) / 4<br />

+<br />

and β = β<br />

+<br />

if α = 1 and β<br />

+<br />

= 0 if α 1/ 2 if α = 1, and<br />

k<br />

k<br />

f<br />

k<br />

(⋅) is chosen such that H<br />

k<br />

− H<br />

k<br />

→ 0 a.s. As with C.10, frequently, → can be replaced<br />

with “=” in the limiting covariance expression. Likewise, see the comments following C.11<br />

regarding the condition H − H →0<br />

a.s.<br />

k<br />

k<br />

51


CHAPTER 2. PROPOSED <strong>SPSA</strong> ALGORITHM<br />

Theorem 3b—2SG: Suppose that C.0’, C.1’’’, C.2’, C.3’, C.4–C.7, C.8’, and C.9’ hold<br />

(implying convergence <strong>of</strong><br />

θˆ k<br />

and H<br />

k<br />

) that C.12 holds with<br />

H(<br />

θ<br />

* −1<br />

)<br />

existing. Then,<br />

k<br />

α / 2<br />

dist<br />

ˆ *<br />

( θ k<br />

−θ<br />

) →N(0,<br />

Ω')<br />

(2.38)<br />

2 * −1<br />

* −1<br />

where Ω'<br />

= a H(<br />

θ ) ΣH(<br />

θ ) /(2a<br />

− β ) with β = 1 if α = 1 and β = 0 if α


2.10 ASYMPTOTIC DISTRIBUTIONS AND ANALYSIS<br />

1st-<strong>SPSA</strong> and M2-<strong>SPSA</strong>, we have<br />

2a<br />

1<br />

rms<br />

2<strong>SPSA</strong>(1,1,<br />

c,<br />

)<br />

6<br />

< 2,<br />

1<br />

min rms1<br />

<strong>SPSA</strong>(<br />

a,1,<br />

c,<br />

)<br />

> 1/ λmin<br />

6<br />

∀c<br />

> 0<br />

(2.40a)<br />

2a<br />

1<br />

rms<br />

2<strong>SPSA</strong>(1,1,<br />

c,<br />

)<br />

6<br />

< 2<br />

1<br />

min min rms1<br />

<strong>SPSA</strong>(1,1,<br />

c,<br />

)<br />

> 1/ λ min c><br />

0<br />

6<br />

(2.40b)<br />

where<br />

λ is the minimum eigenvalue <strong>of</strong> H ( θ<br />

* ) . The interpretation <strong>of</strong> (2.40a), (2.40b) is as<br />

min<br />

follows. From (2.40a), we know that, <strong>for</strong> any common value <strong>of</strong> , the asymptotic rms error <strong>of</strong><br />

M2-<strong>SPSA</strong> is less than twice that <strong>of</strong> 1st-<strong>SPSA</strong> with an optimal (even when c is chosen optimally<br />

<strong>for</strong> 1st-<strong>SPSA</strong>). Expression (2.40b) states that, if we optimize only <strong>for</strong> M2-<strong>SPSA</strong>, while<br />

optimizing both a and c <strong>for</strong> 1st-<strong>SPSA</strong>, we are still guaranteed that the asymptotic rms error <strong>for</strong><br />

M2-<strong>SPSA</strong> is no more than twice the optimized rms error <strong>for</strong> 1st-<strong>SPSA</strong>. Another interesting<br />

aspect <strong>of</strong> M2-<strong>SPSA</strong> is the relative robustness apparent in (2.40a), (2.40b) given that the optimal<br />

<strong>for</strong> 1st-<strong>SPSA</strong> will not typically be known in practice. For certain suboptimal values <strong>of</strong> a in<br />

1st-<strong>SPSA</strong>, the rms error can get very large whereas simply choosing a= 1 <strong>for</strong> M2-<strong>SPSA</strong><br />

provides the factor <strong>of</strong> guarantee mentioned above. Although (2.40a), (2.40b) suggest that the<br />

M2-<strong>SPSA</strong> approach yields a solution that is quite good, one might wonder if a true optimal<br />

solution is possible. Dippon and Renz [33, pp.1817–1818] pursue this issue, and provide an<br />

alternative to<br />

θ<br />

* −1<br />

H ( ) as the limiting weighting matrix <strong>for</strong> use in an SA <strong>for</strong>m such as (2.2a).<br />

Un<strong>for</strong>tunately, this limiting matrix has no closed-<strong>for</strong>m solution, and depends on the third<br />

derivatives <strong>of</strong> L (θ ) at<br />

adaptive matrix (analogous to<br />

*<br />

θ , and furthermore, it is not apparent how one would construct an<br />

H<br />

k<br />

that would converge to this optimal limiting matrix.<br />

Likewise, the optimal <strong>for</strong> M2-<strong>SPSA</strong> is typically unavailable in practice since it also depends on<br />

the third derivatives <strong>of</strong> L (θ ). Expressions (2.40a), (2.40b) are based on an assumption that<br />

1st-<strong>SPSA</strong> and M2-<strong>SPSA</strong> have used the same number <strong>of</strong> iterations. This is a reasonable basis <strong>for</strong><br />

a core comparison since the “cost” <strong>of</strong> solving <strong>for</strong> the optimal 1st-<strong>SPSA</strong> gains is unknown.<br />

However, a more conservative representation <strong>of</strong> relative efficiency is possible by considering<br />

only the direct number <strong>of</strong> loss measurements, ignoring the extra cost <strong>for</strong> optimal gains in<br />

1st-<strong>SPSA</strong>. In particular, 1st-<strong>SPSA</strong> uses two loss measurements per iteration and M2-<strong>SPSA</strong> uses<br />

four measurements per iteration. Hence, with both algorithms using the same number <strong>of</strong> loss<br />

53


CHAPTER 2. PROPOSED <strong>SPSA</strong> ALGORITHM<br />

measurements, the corresponding upper bounds to the ratios in (2.40a), (2.40b) (reflecting the<br />

ratio <strong>of</strong> rms errors as the common number <strong>of</strong> loss measurements gets large) would be<br />

2 / 3<br />

4 ≈ 2.52 , an increase from the bound <strong>of</strong> 2 under a common number <strong>of</strong> iterations. This bound’s<br />

likely excessive conservativeness follows from the fact that the cost <strong>of</strong> solving <strong>for</strong> the optimal<br />

gains in 1<strong>SPSA</strong> is being ignored. Note that, <strong>for</strong> other adaptive approaches that are also<br />

asymptotically normally distributed, the same relative cost analysis can be used. Hence, <strong>for</strong><br />

example, with the Fabian [19] approach using O ( p<br />

2 ) measurements per iteration to generate<br />

2/3<br />

the <strong>Hessian</strong> estimate, the corresponding upper bounds would be <strong>of</strong> magnitude O ( p ), bounds<br />

that, unlike the bounds <strong>for</strong> M2-<strong>SPSA</strong>, increase with problem dimension.<br />

In the following chapters, once finished these numerical simulations in <strong>order</strong> show the<br />

M2-<strong>SPSA</strong> per<strong>for</strong>mance, we will prove the proposed <strong>SPSA</strong> algorithm applied to parameters<br />

estimation per<strong>for</strong>mance in some realistic systems. The main advantages <strong>of</strong> our proposed<br />

algorithm will be shown, such as low computational cost and efficient accuracy and<br />

convergence.<br />

2.11 -Perturbation Distribution <strong>for</strong> M2-<strong>SPSA</strong><br />

As discussed above, the perturbations<br />

∆<br />

k<br />

in the gradient estimate are based on Bernoulli<br />

random variables on {–1, 1}. In fact, the requirements are merely that the<br />

∆<br />

ki<br />

must be<br />

independent and symmetrically distributed about zero with finite absolute inverse moments<br />

−1<br />

E[<br />

∆ ki<br />

] <strong>for</strong> all k, i. The Bernoulli is just one distribution <strong>for</strong> ∆<br />

ki<br />

that satisfies these<br />

conditions. It has been shown that one cannot do better than this distribution in the asymptotic<br />

case [34], but less is known about the best distribution <strong>for</strong> small-sample approximations. Some<br />

numerical results seem to show better per<strong>for</strong>mance on some problems with non-Bernoulli<br />

distributions. The per<strong>for</strong>mance <strong>of</strong> three such alternative distributions is reported here: a split<br />

uni<strong>for</strong>m distribution, an inverse split uni<strong>for</strong>m distribution, and a symmetric double triangular<br />

distribution (referred to as candidate distributions in the following). The {–1, 1} Bernoulli<br />

distribution has variance and absolute first moment (mean magnitude) both equal to one. It is<br />

the only qualified distribution with these qualities. We conjecture that these characteristics are<br />

necessary conditions <strong>for</strong> optimal per<strong>for</strong>mance <strong>of</strong> the M2-<strong>SPSA</strong> algorithm, given optimal step<br />

size parameters. Variations in mean magnitude can be addressed by scaling the gradient step<br />

54


2.11 PERTURBATION DISTRIBUTION FOR M2-<strong>SPSA</strong><br />

size (c), so <strong>for</strong> comparisons, candidate distributions should have the same variance as the {–1,<br />

1} Bernoulli. Then differences in per<strong>for</strong>mance could be attributed to differences in the nature <strong>of</strong><br />

variability in that distribution.<br />

Table 2.1. Characteristics <strong>of</strong> the perturbation distributions.<br />

To ensure consistency in the comparison, we normalized the candidate distributions so that their<br />

variances were one and their main magnitudes were close to one, but not so close that the<br />

essential character <strong>of</strong> the distributions were lost. The probability density functions <strong>of</strong> these<br />

distributions are given at right. The characteristics <strong>of</strong> each distribution are given in Table 2.1.<br />

The M2-<strong>SPSA</strong> algorithm with each distribution <strong>for</strong> the perturbations was applied to 34<br />

functions from Moré’s suite <strong>of</strong> optimization problems [35]. The initial points recommended in<br />

Moré were used <strong>for</strong> each function. The functions values were obscured with normally<br />

distributed errors with mean zero and a variance <strong>of</strong> one. We then used these noisy function<br />

values to calculate a simultaneous perturbation gradient approximation. For nearly all <strong>of</strong> the<br />

functions, errors <strong>of</strong> this magnitude are insignificant away from the minimum. However, most<br />

functions in the optimization suite have minimums at or near zero, where N(0, 1) errors are<br />

quite significant. This situation is further complicated by the fact that many functions are<br />

extremely flat near the minimum as well. The result was a demanding examination <strong>of</strong> the<br />

M2-<strong>SPSA</strong> algorithm <strong>of</strong>fering ample opportunity to test alternative perturbation distributions.<br />

The step size parameters <strong>of</strong> the M2-<strong>SPSA</strong> algorithm (that is, a and c) were optimized <strong>for</strong> each<br />

distribution and each function by random search. The procedure to optimize the step parameters<br />

used 20,000 iterations <strong>of</strong> a directed random search algorithm.<br />

55


CHAPTER 2. PROPOSED <strong>SPSA</strong> ALGORITHM<br />

In the directed random search (sometimes called a localized random search, see [36], p. 45),<br />

new trial values are generated near the location <strong>of</strong> the current best value. The algorithm accepts<br />

the input parameters as the current optimal values if they produce results that are better than the<br />

best yet obtained, otherwise they are rejected. This method is somewhat more sophisticated than<br />

simple random search, and generally more computationally efficient in that it uses in<strong>for</strong>mation<br />

from previous iterations. For more in<strong>for</strong>mation on random search methods, see Solis and Wets<br />

[37]. For each iteration <strong>of</strong> the random search we executed fifty Monte Carlo trials <strong>of</strong> the <strong>SPSA</strong><br />

algorithm, and then accepted or rejected the parameter values based on the average <strong>of</strong> these fifty<br />

trials. The theoretically optimal values <strong>for</strong> a and g were used. The M2-<strong>SPSA</strong> algorithm in the<br />

procedure outlined above was run <strong>for</strong> stopping times <strong>of</strong> n = 10, 100, and 1000 iterations to<br />

determine whether any one distribution outper<strong>for</strong>med the others over small, moderate, and large<br />

iteration domains. Common random numbers (CRN) were used to minimize variance. With<br />

CRN, the sequences <strong>of</strong> function values generated by the iteration differ only as a result <strong>of</strong> how<br />

the <strong>SPSA</strong> algorithm processes the random numbers in a different way. In this evaluation, the<br />

sequence <strong>of</strong> CRN were used to generate random perturbations from the appropriate distribution.<br />

This method allows the use <strong>of</strong> matched pairs testing to determine the significance <strong>of</strong> differences<br />

in the minimum values observed. Matched pairs testing generally leads to sharper analysis.<br />

⎧ 1<br />

⎪ −b≤x≤−a<br />

2( b−a)<br />

f SU<br />

( x;<br />

a,<br />

b)<br />

= ⎨<br />

⎪<br />

⎪⎩<br />

0 otherwise<br />

or<br />

a≤x≤b<br />

Fig. 2.3. Split uni<strong>for</strong>m distribution.<br />

56


2.12. PARAMETER ESTIMATION<br />

⎧ ab<br />

⎪<br />

−b<br />

≤ x ≤ −a<br />

2<br />

2( b−a)<br />

x<br />

f ISU<br />

( x;<br />

a,<br />

b)<br />

= ⎨<br />

⎪<br />

⎪⎩<br />

0 otherwise<br />

or<br />

a ≤ x ≤b<br />

Fig. 2.4. Inverse split uni<strong>for</strong>m distribution.<br />

⎧ x + c<br />

⎪<br />

− c ≤ x ≤ −b<br />

( c − a)(<br />

c − b)<br />

⎪<br />

⎪<br />

x + a<br />

− b ≤ x ≤ −a<br />

⎪(<br />

c − a)(<br />

c − b)<br />

⎪<br />

f SDT<br />

( x;<br />

a,<br />

b)<br />

= ⎨ x − a<br />

a ≤ x ≤ b<br />

⎪(<br />

c − a)(<br />

c − a)<br />

⎪<br />

⎪ x − c<br />

b ≤ x ≤ c<br />

⎪(<br />

c − a)(<br />

b − c)<br />

⎪<br />

⎩0<br />

otherwise<br />

Fig. 2.5. Symmetric double triangular<br />

distribution.<br />

2.12 -Parameter Estimation<br />

2.12.1 -Introduction<br />

In the proposed <strong>SPSA</strong> algorithm, all parameters are perturbed simultaneously; it is possible to<br />

57


CHAPTER 2. PROPOSED <strong>SPSA</strong> ALGORITHM<br />

modify parameters with only two measurements <strong>of</strong> an evaluation function regardless <strong>of</strong> the<br />

dimension <strong>of</strong> the parameter. A parameter estimation algorithm using M2-<strong>SPSA</strong> is proposed.<br />

The contribution <strong>of</strong> this chapter is a <strong>SPSA</strong> algorithm <strong>for</strong> parameter estimation that can be used<br />

with non-linear systems or systems with parameters estimation very high. The proposed <strong>SPSA</strong><br />

algorithm is an iterative method <strong>for</strong> optimization, with randomized search direction, that<br />

requires at most three function (model) evaluations at each iteration. The M2-<strong>SPSA</strong><br />

incorporates the 2nd-<strong>SPSA</strong> usually reduced number <strong>of</strong> iterations, to do an initial estimate <strong>of</strong> the<br />

*<br />

optimum values <strong>for</strong> the parameter, θ . The proposed <strong>SPSA</strong> algorithm makes use <strong>of</strong> the <strong>Hessian</strong><br />

matrix to increase the rate <strong>of</strong> convergence. First, second and modified second-<strong>order</strong> <strong>SPSA</strong><br />

algorithm was implemented to estimate the unknown parameters <strong>of</strong> the highly non-linear<br />

physical model. Hence, execution time per iteration does not increase with the number <strong>of</strong><br />

parameters. The method can handle non-linear dynamic models, non-equilibrium transient test<br />

conditions and data obtained in close loop. For this reason, this method is suitable <strong>for</strong> the<br />

estimation <strong>of</strong> parameters in realistic applications. Firstly, it is necessary to show the general<br />

implementation <strong>of</strong> <strong>SPSA</strong> algorithm. The general steps in implementation <strong>of</strong> <strong>SPSA</strong> algorithm are<br />

[28]: 1) initialization and coefficient selection, 2) numerical issued, 3) gradient/<strong>Hessian</strong><br />

averaging, 4) gain selection, (see Sec. 2.8). Finally, we have proposed a modification in this<br />

implementation. This modification is explained on base to the recursive update <strong>for</strong>m <strong>for</strong> the<br />

parameter vector is given by<br />

θˆ<br />

= θˆ<br />

− a<br />

gˆ<br />

( θˆ<br />

)<br />

k + 1 k k k k<br />

(2.41)<br />

where<br />

ak<br />

is a weight or gain constant <strong>for</strong> the recurrent iteration and<br />

ĝ<br />

k<br />

is a gradient estimate<br />

<strong>for</strong> the recurrent iteration. To update<br />

θˆ k to a new value ˆ<br />

k+<br />

1<br />

ˆ<br />

+<br />

θ . If θ k 1<br />

falls outside the range <strong>of</strong><br />

allowable values <strong>for</strong> θ . Then project the updated θ k 1<br />

to the nearest boundary and reassign this<br />

ˆ +<br />

ˆ<br />

+<br />

projected value θ k 1<br />

. Mathematically we have, <strong>for</strong> every -i = 1, … , n;<br />

ˆ θ<br />

k+<br />

1, i<br />

⎧ ˆ θk<br />

⎪<br />

= ⎨θ<br />

i<br />

⎪<br />

⎪⎩<br />

θi<br />

+ 1, i<br />

min<br />

max<br />

if θ<br />

if ˆ θ<br />

if ˆ θ<br />

min<br />

i<br />

k+<br />

1, i<br />

k+<br />

1, i<br />

≤ ˆ θ<br />

< θ<br />

> θ<br />

k+<br />

1, i<br />

min<br />

i<br />

max<br />

i<br />

< θ<br />

max<br />

i<br />

.<br />

58


2.11 PARAMETER ESTIMATION<br />

Modifications to this step may be needed to enhance the best convergence <strong>of</strong> the algorithm. In<br />

particular the update could be block if the cost function actually worsens after the “the basic”<br />

update in this step. The choice <strong>of</strong> various parameters <strong>of</strong> the algorithm plays an important role in<br />

the convergence <strong>of</strong> the algorithm. It is suggested that α = 0. 602 and γ = 0. 101<br />

practically effective and theoretically valid choice. The value <strong>of</strong> A is chosen to be 10% <strong>of</strong> the<br />

maximum iterations allowed. The maximum number <strong>of</strong> iterations was chosen to be 100 and<br />

hence A was chosen to be 10. It is recommended that if the measurements are (almost) error free<br />

c, can be chosen as a small positive number. In this case it was chosen to be 0.01.<br />

are<br />

The value <strong>of</strong> a should be chosen such that the<br />

α<br />

a /( A +1) times the magnitude <strong>of</strong> elements <strong>of</strong><br />

( ˆ ) is approximately equal to the smallest <strong>of</strong> the desired change magnitudes among the<br />

gˆ 0<br />

θ0<br />

elements <strong>of</strong> θ in early iterations. For the problem at hand a=1 gave a good results. This value<br />

if a was chosen to ensure that the component <strong>of</strong> θ during the iterations would remain within the<br />

allowed bounds.<br />

We have proposed modify the typical implementation <strong>of</strong> <strong>SPSA</strong> algorithm <strong>for</strong> the estimation<br />

parameters application according to M2-<strong>SPSA</strong> algorithm, so that, the optimization in the vector<br />

parameter θˆ was modified and showed as follows: The vector parameter θˆ is obtained by<br />

solving the following problem:<br />

ˆ θ = arg min<br />

θ H<br />

( θ )<br />

subject to<br />

θ<br />

θ<br />

M<br />

θ<br />

min<br />

1<br />

min<br />

2<br />

min<br />

n<br />

≤ θ ≤ θ<br />

1<br />

≤ θ ≤ θ<br />

2<br />

≤ θ ≤ θ<br />

n<br />

max<br />

1<br />

max<br />

2<br />

max<br />

n<br />

(2.42)<br />

where the cost function H (θ ) is given by a cost function and n gives the total number <strong>of</strong><br />

parameters in the case n=19. Most conventional tools used <strong>for</strong> optimization <strong>of</strong> the cost function<br />

to arrive at local minimum. However this optimization method is very time consuming if there<br />

are many variables to be optimized or if the cost function evaluations are computationally<br />

expensive. If the number <strong>of</strong> parameters increases, the number <strong>of</strong> function evaluations required<br />

computing the gradients also increase. Moreover, the chance <strong>of</strong> solution convergence to local<br />

59


CHAPTER 2. PROPOSED <strong>SPSA</strong> ALGORITHM<br />

minimum also increases with the number <strong>of</strong> parameters to be optimized. For the problem at<br />

hand, which several parameters to be optimized, it was found that the gradient-based approach<br />

was not practical. For this reason, the <strong>SPSA</strong> algorithm was used to minimize the cost function.<br />

Once the approximate gradient is computed the parameters are update and a new value <strong>of</strong> θ is<br />

computed. It is recommended once more that the cost function evaluation at this point to check<br />

if the cost function at this new value <strong>of</strong> θ is less that the cost function using<br />

θ<br />

k<br />

. The number<br />

<strong>of</strong> cost function evaluations per iteration does not depend on the number <strong>of</strong> variable, which<br />

makes this method very attractive <strong>for</strong> optimization problems with several variables. There<strong>for</strong>e,<br />

this method can be represented as follows: The i-th element <strong>of</strong> the gradient estimate, g ˆ ( ˆ θ ) is<br />

given by<br />

ˆ<br />

ˆ<br />

ˆ y(<br />

θk<br />

+ ck∆k<br />

) − y(<br />

θk<br />

− ck∆k<br />

)<br />

gˆ<br />

k<br />

( θ<br />

k<br />

) =<br />

.<br />

(2.43)<br />

2c<br />

∆<br />

k<br />

ki<br />

k<br />

The term<br />

θˆ ± c ∆ represents a perturbation to the optimization parameters about the recurrent<br />

k<br />

k<br />

k<br />

estimate. Similar to a standard SA <strong>for</strong>m,<br />

ck<br />

is small, positive weighting value. The vector <strong>of</strong><br />

zero-mean random variables, which must have bounded inverse moments. One valid choice <strong>for</strong><br />

∆k<br />

is a vector <strong>of</strong> Bernoulli-distributed, i.e. ± 1, random perturbation terms. In resume, the fifth<br />

guideline that we have proposed and is complement <strong>of</strong> Sec. 2.8 is given as follows:<br />

At each iteration, block “bad” steps if the new estimate <strong>for</strong> θ fails a certain criterion.<br />

H<br />

k<br />

should typically continue to be updated even if θ k 1<br />

is blocked. The most obvious blocking<br />

applies when θ must satisfy constraints; an updated value may be blocked or modified if a<br />

constraint is violated. There are two ways 5a) and 5b) that one might implement blocking when<br />

constraints are not the limiting factor.<br />

ˆ<br />

+<br />

5a) Based on θˆ k<br />

and θ k 1<br />

directly.<br />

5b) Based on loss measurements.<br />

ˆ<br />

+<br />

Both <strong>of</strong> 5a) and 5b) may be implemented in a given applications. In 5a), one simply blocks the<br />

step from<br />

θˆ k to<br />

ˆk 1<br />

θ if ˆ θ − ˆ<br />

1<br />

θ > (tolerance1 ) where the norm is any convenient distances<br />

k+ k<br />

60


2.11 PARAMETER ESTIMATION<br />

measure and (tolerance1 >0) is some “reasonable” maximum distance to cover in one step. The<br />

rationale behind 5a) is that a well-behaving algorithm should be moving toward the solution in a<br />

smooth manner, and very large steps are indicative <strong>of</strong> potential divergence. The second potential<br />

method, 5b), is based on blocking the step if y( ˆ θ ) ( ˆ<br />

k+ 1<br />

> y θk<br />

)(tolerance 2 ) where (tolerance 2 )≥0<br />

might be set at about one or two times the approximate standard deviation <strong>of</strong> the noise in the<br />

y (⋅) measurements. In a setting where the noise in the loss measurements tends to be large (say,<br />

much larger than the allowable difference between L( θ<br />

* ) and L ˆ θ )), it may be undesirable<br />

( final<br />

to use 5b) due to the difficulty in obtaining meaningful in<strong>for</strong>mation about the relative old and<br />

new loss values. For any nonzero noise levels, it may be beneficial to average several y (⋅)<br />

measurements in making the decision about whether to block the step; this may be done. Having<br />

tolerance 2 >0 as specified above when there is noise in the<br />

y (⋅)'<br />

builds some conservativeness<br />

into the algorithm by allowing a new step only if there is relatively strong statistical evidence <strong>of</strong><br />

an improved loss value. Let us close this subsection with a few summary comments about the<br />

implementation aspects above. Without the second blocking procedure 5b) in use, 2nd-<strong>SPSA</strong><br />

requires four measurements y(⋅)<br />

per iteration, regardless <strong>of</strong> the dimension p (two <strong>for</strong> the standard<br />

G (⋅) k<br />

estimate and two new values <strong>for</strong> the one sided SP gradients G<br />

1 ( ⋅ k<br />

)) . For 2SG, three<br />

gradient measurements G (⋅ k<br />

) are needed, again independent <strong>of</strong> p. If the second blocking<br />

procedure 5b) is used, one or more additional y (⋅)<br />

measurements are needed <strong>for</strong> both 2nd-<br />

<strong>SPSA</strong> and 2SG. The use <strong>of</strong> gradient/ <strong>Hessian</strong> averaging 3) would increase the number <strong>of</strong> loss or<br />

gradient evaluations, <strong>of</strong> course.<br />

The standard deviation <strong>for</strong> the measurement noise (used in items 4) and 5b in this chapter) can<br />

be estimated by collecting several y (⋅)<br />

values at θ = θˆ<br />

0<br />

; neither 4) nor 5a) requires this<br />

estimate to be precise (so relatively few y(⋅)<br />

values are needed). In general, 5a) can be used<br />

anytime, while 5b) is more appropriate in a low- or no-noise setting. Note that 5a) helps to<br />

prevent divergence, but lacks direct insight into whether the loss function is improving, while<br />

5b) does provide that insight, but requires additional y (⋅)<br />

measurements, the number <strong>of</strong> which<br />

might grow prohibitively in a high-noise setting. Once finished the modifications in the<br />

implementation <strong>SPSA</strong> algorithm according to our proposed algorithm, we can start to explain<br />

how is applied toward the estimation parameters.<br />

61


CHAPTER 2. PROPOSED <strong>SPSA</strong> ALGORITHM<br />

Firstly, we defined a simple model in <strong>order</strong> to explain how is developed the estimation<br />

parameters algorithm using our proposed algorithm. This model was used be<strong>for</strong>e by other<br />

authors [24][25] <strong>for</strong> explain estimation parameters using the 1st-<strong>SPSA</strong> algorithm. So that, this<br />

system is used because is very suitable and illustrates very well M2-<strong>SPSA</strong> algorithm<br />

per<strong>for</strong>mance. Of such a way, the following single-input single-output (SISO) discrete system<br />

with input x and output y [24][25] is considered:<br />

x<br />

k<br />

= a k + K + a x + b u + K b u .<br />

(2.44)<br />

1 k − 1<br />

n kn 1 k −1<br />

+<br />

m k − m<br />

Here, k is the discrete time, a<br />

1<br />

, . . . ,<br />

a<br />

n<br />

and b<br />

1, . . . ,<br />

b<br />

m<br />

represent the constant coefficients.<br />

Also, in general, n ≥ m. It is assumed that the system input<br />

value<br />

y k<br />

accompanied by some <strong>for</strong>m <strong>of</strong> noise<br />

υk<br />

xk<br />

is observed as the observed<br />

y<br />

= + υk.<br />

(2.45)<br />

k<br />

x k<br />

Here, the noise<br />

satisfy the following:<br />

vk<br />

the input<br />

uk<br />

and the output<br />

xk<br />

are independent <strong>of</strong> one another, and they<br />

E ( u ) = u , E ( u u ) = r<br />

2 δ<br />

(2.46 a)<br />

k<br />

a<br />

k<br />

i<br />

ki<br />

2<br />

E ( υ ) = 0 , E ( υ υ ) = σ δ<br />

(2.46 b)<br />

k<br />

k<br />

i<br />

ki<br />

2 2<br />

where,δ represents the Kronecker delta, r and σ represent the variances <strong>of</strong> the noise, and<br />

u is the average value <strong>for</strong> the input. At this point, the parameter estimation problem <strong>for</strong><br />

a<br />

consecutively finding unknown parameters { a ,..., a , b b }<br />

values{ y k<br />

, u k<br />

}. The parameters are defined as follows:<br />

n m 1<br />

based on the observed<br />

1<br />

,...,<br />

u )<br />

T<br />

k 1<br />

( uk−m,...,<br />

uk−<br />

1<br />

− = (2.47a)<br />

T<br />

x<br />

k− 1<br />

= ( xk−n,...,<br />

xk−<br />

1)<br />

(2.47b)<br />

T<br />

υ<br />

k − 1<br />

= ( υk−n,...,<br />

υk−<br />

1)<br />

(2.47c)<br />

62


2.11 PARAMETER ESTIMATION<br />

T<br />

y<br />

k − 1<br />

= ( yk<br />

−n<br />

,..., yk<br />

−1,<br />

uk<br />

−m<br />

,..., uk<br />

−1)<br />

(2.47d)<br />

T<br />

φ = a ,..., a , b ,..., b ) .<br />

(2.47e)<br />

(<br />

n 1 m 1<br />

Furthermore, based on the conditions in (2.46 b) <strong>for</strong> the observed noise,<br />

E ( ) = 0<br />

(2.48)<br />

e k<br />

E( e e ) = 0, k − i n.<br />

(2.49)<br />

k i<br />

><br />

There<strong>for</strong>e, the error function J can be defined as follows. The problem <strong>of</strong> minimizing this error<br />

function and finding the system parameter vector φ is addressed in this chapter.<br />

1 2<br />

⎧<br />

T<br />

(<br />

ˆ ⎫<br />

J = E ⎨ y<br />

k<br />

− y<br />

k − 1φ<br />

) ⎬<br />

(2.50)<br />

⎩ 2<br />

⎭ .<br />

Here, E represented the expected value, and φˆ represents the estimated value. This kind <strong>of</strong> error<br />

function, with the expected value, cannot be found in practice. Thus, using SA with this as an<br />

iterated function is considered. The problem <strong>of</strong> finding a parameter that yields a minimum in<br />

this kind <strong>of</strong> iterated function can be solved by using the SA method. The partial derivative <strong>of</strong><br />

the error function (2.50) with respect to the estimation φˆ is<br />

− y y − y<br />

T ˆ).<br />

(2.51)<br />

k−1( k k−1φ<br />

Here, let us look at the expected value <strong>for</strong><br />

independent with υ<br />

k−1<br />

, then<br />

yk<br />

− 1<br />

ek<br />

. If we consider that<br />

k−1<br />

x and u<br />

k−1<br />

are<br />

E<br />

2<br />

⎧⎛ x + υ ⎞<br />

⎫ ⎡σ<br />

I 0⎤<br />

k−1<br />

k<br />

= ⎨⎜<br />

⎬ ⎢ ⎥<br />

⎩ υ ⎟ k 1 k−1<br />

L<br />

n k−n<br />

(2.52)<br />

⎝ k−1<br />

⎠<br />

⎭ ⎣ 0 0⎦<br />

k−1<br />

k−1<br />

{ y e } E ⎜ ⎟( υ − aυ<br />

− − a υ ) = − φ<br />

holds, with no result being zero. Consequently, in the estimate using (2.51), a bias occurs; thus,<br />

63


CHAPTER 2. PROPOSED <strong>SPSA</strong> ALGORITHM<br />

(2.51) does not give a consistent estimate [15]. There<strong>for</strong>e, this bias must be compensated. The<br />

reference [15] <strong>of</strong>fers a detailed explanation <strong>of</strong> this. Moreover, if (2.49) is considered,<br />

calculations must be per<strong>for</strong>med every (n + 1) instances <strong>of</strong> sampling, to guarantee the<br />

independence <strong>of</strong> { e k<br />

}. The modifying time k can be represented by the actual sampling time n;<br />

k = 1, n + 2, 2n + 3, . . . . Then, the following recursion <strong>for</strong> the estimated parameters will be<br />

considered:<br />

ˆ ˆ k − 1<br />

φ<br />

k + n<br />

= φ<br />

k −1<br />

− ρ<br />

e<br />

∆φ<br />

k −1,<br />

k = 1,…, n + 2, 2n+3,.... (2.53)<br />

n + 1<br />

ˆ<br />

−<br />

Here, ∆φ<br />

k 1<br />

is the basic quantity which provides the quantity <strong>for</strong> the estimation parameters.<br />

Furthermore,<br />

a fraction.<br />

ρe<br />

represents the gain coefficient. The subscript on the coefficient ρe<br />

represents<br />

Because this takes a value <strong>for</strong> every (n + 1) instances, <strong>for</strong> example 1, n + 2, 2n + 3, . . . , with<br />

respect to the actual sampling time n, as a result, the subscript<br />

ρ<br />

e<br />

refers to taking the value:<br />

1 – 1/n + 1 = 0, n + 2 –1/n + 1 = 1, . . . , 0, 1, 2,….<br />

In <strong>SPSA</strong>, the perturbations are superimposed simultaneously on all the parameters. As a result,<br />

even as the number <strong>of</strong> parameters rises, the estimated parameters can be revised based on the<br />

two values <strong>of</strong> the error functions either when perturbation is added or when there is not<br />

perturbation. A parameter estimation method that uses this kind <strong>of</strong> SP is extremely useful in the<br />

many circumstances.<br />

2.12.2 -System to be Applied<br />

Let us consider the differential with respect to the parameter φ <strong>for</strong> the model <strong>of</strong> the error <strong>of</strong><br />

squares<br />

2<br />

e in this instance [24][25]. For the sake <strong>of</strong> simplicity, when considering a case in<br />

which all variables are scalar, results<br />

2<br />

∂ e<br />

∂ φ<br />

=<br />

2 ( y<br />

−<br />

y<br />

q<br />

)<br />

∂ y<br />

q<br />

∂ φ<br />

=<br />

2 ( y<br />

−<br />

y<br />

q<br />

)<br />

∂ y<br />

q<br />

∂ x<br />

∂ x<br />

.<br />

∂ φ<br />

(2.54)<br />

64


2.11 PARAMETER ESTIMATION<br />

∂ y q<br />

/ ∂x in this equation represents a Jacobian observation system. If the observation system is<br />

assumed to be unknown, then it cannot be found.<br />

There<strong>for</strong>e, when identifying a system that includes an unknown observation system, the amount<br />

<strong>of</strong> correction <strong>for</strong> the parameters cannot be found in methods that directly find the slope <strong>of</strong> the<br />

error. In other words, identification algorithms based on the conventional slope approach cannot<br />

be used.<br />

In contrast, in the SP method proposed in this chapter, the amount <strong>of</strong> correction <strong>for</strong> the<br />

estimation parameters is found directly from the value<br />

characteristics <strong>of</strong> the observation system are not needed.<br />

2<br />

e <strong>for</strong> the error. As result, the<br />

Moreover, in distinction with differential approximation methods, in ours method, regardless <strong>of</strong><br />

how many paramters are to be estimated, the parameters can be corrected using only two<br />

observations.<br />

In this research, we refer to many authors that have proposed a parameter estimation algorithm<br />

using the <strong>SPSA</strong> algorithm. The following system was considered by other authors [24][25] and<br />

is very suitable <strong>for</strong> show the proposed <strong>SPSA</strong> algorithm per<strong>for</strong>mance. The system considered is a<br />

case in which the observed values <strong>for</strong> an unknown system to be identified can only be obtained<br />

from its characteristics (see Fig. 2.6).<br />

Fig. 2.6. Identification with an unknown observation system.<br />

65


CHAPTER 2. PROPOSED <strong>SPSA</strong> ALGORITHM<br />

Once proposed the model structure, the next step is to estimate the parameters <strong>of</strong> the system.<br />

This is done by assuming an initial value <strong>of</strong> the parameters and then optimizing them so as to<br />

minimize the errror between the measurements and the model predictions. In then next<br />

simulation, a code using standard MATLAB commands implementing the <strong>SPSA</strong> <strong>for</strong> constrained<br />

optimization was developed. Consider the following successive equations:<br />

φ<br />

= ˆ φ − ρ ∆φ<br />

k+<br />

1 k ek k<br />

(2.55)<br />

T<br />

∆φ = ∆φ<br />

,..., ∆φ<br />

) .<br />

(2.56)<br />

k<br />

(<br />

k ,1 k , n+<br />

m<br />

∆ φ represents the modifying vector <strong>for</strong> the estimated parameters. Also, ρe<br />

represents the<br />

correct gain. The estimation parameter vector<br />

to the perturbation c is defined as follows:<br />

+ i<br />

φˆ<br />

with only the i-th estimation parameter added<br />

ˆ + i<br />

i<br />

k<br />

= ˆ φk<br />

+ cke<br />

(i=1,…, n+m). (2.57)<br />

φ<br />

Here, the vector<br />

i<br />

e represents the fundamental vector <strong>for</strong> which the i-th element alone is 1, and<br />

everything else is 0. Consequently, the error function <strong>for</strong> when perturbation is superimposed on<br />

each parameter is structured as follows:<br />

1 2<br />

T ˆ+<br />

i<br />

( y k + 1<br />

y k k<br />

) .<br />

2<br />

− φ (2.58)<br />

Based on the error function in the equation above, the estimation parameters can be updated as<br />

shown below. In other words, an algorithm in which<br />

1 ( y − y φˆ<br />

) − ( y − y φˆ<br />

)<br />

T + i 2<br />

T 2<br />

k + 1 k k<br />

k + 1 k k<br />

∆ φ<br />

k , i<br />

=<br />

(i=1,…, n+m) (2.59)<br />

2<br />

c<br />

k<br />

represents each element <strong>for</strong> the correction parameters can be conceived. The equation above<br />

provides the amount <strong>of</strong> estimation <strong>for</strong> the differential with respect to the i-th parameter in the<br />

66


2.11 PARAMETER ESTIMATION<br />

error. Finding values in the above equation <strong>for</strong> i = 1, . . . , n + m means finding the square <strong>of</strong><br />

errors in (2.58) by superimposing the perturbation on each parameter successively. As a result,<br />

the error function must be calculated (number <strong>of</strong> parameters + 1) times. As the number <strong>of</strong><br />

dimensions <strong>for</strong> the parameters rises, the number <strong>of</strong> calculations <strong>for</strong> the error increases in this<br />

method.<br />

We consider a signed vector<br />

whether the element takes +1 or -1 is determined randomly by<br />

sk<br />

consisting <strong>of</strong> the elements +1 or -1. As is described [38],<br />

s<br />

k<br />

(<br />

k ,1 k , n+<br />

m<br />

T<br />

= s K , s ) .<br />

(2.60)<br />

By making use <strong>of</strong> this, perturbation can be superimposed on the parameter vector as shown<br />

below:<br />

ˆ<br />

+<br />

k<br />

= ˆ χ + c s<br />

k<br />

k<br />

k<br />

.<br />

χ (2.61)<br />

By making use <strong>of</strong> this, the perturbation<br />

+ ck<br />

and ck<br />

− is added at the same time to all<br />

parameters. The parameter estimation using our modified <strong>SPSA</strong> algorithm is give as follows:<br />

ˆ χ<br />

ˆ<br />

k + n<br />

= χ<br />

k −1<br />

−<br />

ψ<br />

k − 1<br />

n + 1<br />

⎧<br />

⎪ 1 ( W<br />

⋅ ⎨<br />

⎪ 2<br />

⎩<br />

Xs<br />

k + n<br />

− W<br />

T<br />

k<br />

ˆ χ<br />

2<br />

⎡υ<br />

I<br />

− ⎢<br />

⎣ 0<br />

− ( W<br />

c<br />

k −1<br />

n + 1<br />

− W<br />

ˆ χ<br />

+ 2<br />

T + 2<br />

k −1 )<br />

k + n k + n k −1<br />

)<br />

0 ⎤<br />

⎥ χ<br />

0 ⎦<br />

n<br />

ˆ<br />

k − 1 k − 1<br />

⎪⎫<br />

⎬<br />

⎪⎭<br />

(2.62)<br />

where<br />

W<br />

k is measured output, c is the perturbation, υ represents the variance, n, k are<br />

sampling time, χ is the parameter to be estimated, and ψ is a gain coefficient and the<br />

subscript in this coefficient represents a fraction because this takes value <strong>for</strong> every (n+1)<br />

instances. Note that<br />

χ<br />

+<br />

k −1<br />

is calculated as follows:<br />

67


CHAPTER 2. PROPOSED <strong>SPSA</strong> ALGORITHM<br />

ˆ χ<br />

+<br />

ˆ<br />

k − 1<br />

= χ<br />

k − 1<br />

+ c<br />

k − 1<br />

s<br />

k − 1<br />

n + 1<br />

. (2.63)<br />

In estimating the optimum parameters <strong>of</strong> a model or times, there are several factors, which must<br />

be considered when deciding on the appropriate optimization technique. Among these factors<br />

are convergence speed, accuracy, algorithm suitability, complexity, and computational cost in<br />

terms <strong>of</strong> time and power. In the current problem it is necessary to estimate the parameters <strong>of</strong> a<br />

geometrical object in real time. This algorithm updates the estimates using the following<br />

procedure:<br />

y k+<br />

n<br />

(S1) The output to be identified { }<br />

is observed with respect to a particular input.<br />

(S2) Perturbation is added to all the parameters in the estimation vector <strong>for</strong> the parameters.<br />

(Calculation <strong>of</strong> (2.63)).<br />

(S3) The value <strong>for</strong> the error function<br />

( y ˆ φ is calculated.<br />

− T +<br />

) 2<br />

k+ n<br />

yk+<br />

n k−1 (S4)-The amount <strong>of</strong> correction is calculated and the estimation parameters is updated.<br />

(Calculation <strong>of</strong> (2.62)).<br />

(S5) Return to S1.<br />

At each correction time, the value <strong>of</strong> { y k<br />

, u k<br />

} is observed, and the amount <strong>of</strong> correction is<br />

calculated based on these values. The above represents the proposal <strong>for</strong> an algorithm using a<br />

one-sided difference with the error <strong>for</strong> when perturbation is or is not present. However, as is the<br />

case <strong>for</strong> (2.61) the following two-sided <strong>for</strong>m <strong>of</strong> algorithm using<br />

−<br />

χˆ<br />

k<br />

in which the perturbation<br />

is subtracted from the estimation parameter can also be considered:<br />

T ˆ+<br />

2<br />

T<br />

1 ( ) ( ˆ−<br />

2<br />

yk+ 1<br />

− yk<br />

φk<br />

− yk+<br />

1<br />

− yk<br />

φk<br />

)<br />

∆φ k<br />

=<br />

.<br />

(2.64)<br />

2<br />

2c<br />

k<br />

68


2.11 PARAMETER ESTIMATION<br />

This algorithm to estimate the parameters is based on the M2-<strong>SPSA</strong>, which is capable <strong>of</strong><br />

optimizing any number <strong>of</strong> parameters in reasonable time. This is because the number <strong>of</strong> cost<br />

function evaluations needed to estimate the gradient is independent <strong>of</strong> the number <strong>of</strong> parameters<br />

to be optimized.<br />

2.12.3 -Convergence Theorem<br />

In this section a convergence theorem <strong>for</strong> the parameter estimation algorithm using the<br />

M2-<strong>SPSA</strong> is described. First, let us consider the following conditions.<br />

(A11) The coefficient<br />

ρe<br />

satisfies the following conditions:<br />

∞<br />

∑<br />

i=<br />

1<br />

∞<br />

∑<br />

ρ = ∞,<br />

ρ < ∞ .<br />

ei<br />

i=<br />

1<br />

2<br />

ei<br />

(A12) The perturbation c (> 0)<br />

is bounded.<br />

i<br />

(B11)<br />

E<br />

( sk, i)<br />

= 0, E(<br />

sk,<br />

i,<br />

slj<br />

) = δljδ<br />

kl<br />

.<br />

Note that δ represents the Kronecker delta.<br />

(C11) The input<br />

uk<br />

and the observed noise<br />

vk<br />

satisfy (2.46a) and (2.46b), and they are<br />

mutually independent. Further, they have a bounded fourth-<strong>order</strong> moment. Here, condition<br />

(A11) is related to the correction gain, and is the same as the condition required <strong>for</strong> an ordinary<br />

Robbin-Monroe type stochastic approximation.<br />

Condition (A12) is related to the magnitude <strong>of</strong> the perturbation. Condition (B11) is related to<br />

the signed vector. Conditions (A12) and (B11) are related to the perturbation required because<br />

this is a <strong>SPSA</strong>. The condition in (C11) is related to the nature <strong>of</strong> the noise and the input signal.<br />

It is also required <strong>for</strong> identification using a conventional R-M type stochastic approximation.<br />

69


CHAPTER 2. PROPOSED <strong>SPSA</strong> ALGORITHM<br />

Theorem 4a (Convergence in parameter estimation by M2-SPSA). For $\{\hat{\phi}_k\}$ given in (2.62), when the conditions (A11), (A12), (B11) and (C11) are satisfied, we have

$$\lim_{k\to\infty}E\Big\{\big\|\hat{\phi}_k-\phi\big\|^{2}\Big\}=0.$$

Refer to the Appendix for the details of the proof of this theorem.

2.13 Simulation

2.13.1 Simulation 1

This section compares M2-SPSA with the corresponding "standard" forms, 1st-SPSA and 2nd-SPSA. Numerical studies on other functions are given in Spall [18]. The loss function considered here is a fourth-order polynomial with p = 10, significant variable interaction, and highly skewed level surfaces (the ratio of the maximum to minimum eigenvalue of $H(\theta^{*})$ is approximately 65). Gaussian noise is added to the $L(\cdot)$ or $g(\cdot)$ evaluations as appropriate. MATLAB software was used to carry out this study. The loss function is

$$L(\theta)=\theta^{T}A^{T}A\theta+0.1\sum_{i=1}^{p}(A\theta)_{i}^{3}+0.001\sum_{i=1}^{p}(A\theta)_{i}^{4}\qquad(2.65)$$

where $(\cdot)_i$ represents the i-th component of the argument vector and A is such that pA is an upper triangular matrix of ones. The minimum occurs at $\theta^{*}=0$ with $L(\theta^{*})=0$. The noise in the loss function measurements at any value of $\theta$ is given by $[\theta^{T},1]z$, where $z\sim N(0,\sigma^{2}I_{11\times11})$ is independently generated at each $\theta$. This is a relatively simple noise structure representing the usual scenario where the noise values in $y(\cdot)$ depend on $\theta$ (and are therefore dependent over iterations); the $z_{11}$ term provides some degree of independence in each noise contribution and ensures that $y(\cdot)$ always contains noise of variance at least $\sigma^{2}$ (even if $\theta=0$). Guidelines 1), 2) and 4) from Sec. 2.8, our proposed modifications to the 2nd-SPSA implementation, were applied here. A fundamental philosophy in the comparisons

below is that the loss function and gradient measurements are the dominant cost in the optimization process; the other calculations in the algorithms are considered relatively unimportant. This philosophy is consistent with most complex stochastic optimization problems, where the loss function or gradient measurement may represent a large-scale simulation or a physical experiment. The relatively simple loss function here is, of course, merely a proxy for the more complex functions encountered in practice.
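To make the setup concrete, the following sketch (our own; the variable names are illustrative) implements the loss (2.65) and the noisy measurement $y(\theta)=L(\theta)+[\theta^{T},1]z$ described above:

```python
import numpy as np

p = 10
A = np.triu(np.ones((p, p))) / p  # pA is an upper triangular matrix of ones

def loss(theta):
    """Fourth-order polynomial loss (2.65) with skewed level surfaces."""
    At = A @ theta
    return theta @ (A.T @ A) @ theta + 0.1 * np.sum(At**3) + 0.001 * np.sum(At**4)

def noisy_measurement(theta, sigma, rng):
    """y(theta) = L(theta) + [theta^T, 1] z with z ~ N(0, sigma^2 I_{11x11})."""
    z = rng.normal(0.0, sigma, size=p + 1)
    return loss(theta) + np.concatenate([theta, [1.0]]) @ z

rng = np.random.default_rng(0)
print(noisy_measurement(np.ones(p), 0.001, rng))
```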

M2-SPSA Versus 1st-SPSA and 2nd-SPSA Results: We compared M2-SPSA with 1st-SPSA because our proposed method is an extension of 1st-SPSA, so this comparison shows the improvements of our proposed SPSA over 1st-SPSA; we also compared it with 2nd-SPSA because it is the most recent version of SPSA, so it is important to verify our improvements against this algorithm. Spall [18] provides a thorough numerical study based on the loss function (2.65). Three noise levels were considered: σ = 0.10, 0.001 and 0. The results here are a condensed study based on the same loss function.

Table 2.2 shows results for the low-noise (σ = 0.001) case: the mean terminal loss value after 50 independent experiments, where the values are normalized (divided) by $L(\hat{\theta}_0)$. Approximate 90% confidence intervals are shown below each mean loss value. The gains $a_k$, $c_k$ and $\tilde{c}_k$ decayed at the rates $1/k^{0.602}$, $1/k^{0.101}$ and $1/k^{0.101}$, respectively.

These decay rates are approximately the slowest allowed by the theory and are slower than the asymptotically optimal values discussed in Sec. 2.10 (which do not tend to work as well in finite-sample practice). Four separate algorithms are shown: basic 1st-SPSA with the coefficients of the slowly decaying gains mentioned above chosen empirically according to Spall [18]; the same 1st-SPSA algorithm but with the final estimate taken as the iterate average of the last 200 iterations; 2nd-SPSA; and M2-SPSA. Additional study details are as in Spall [18]. We see that M2-SPSA provides a considerable reduction in the loss function value for the same number of measurements used in 1st-SPSA and 2nd-SPSA. Based on the numbers in the table, together with supplementary studies, we find that 1st-SPSA and 2nd-SPSA need approximately five to ten times the number of function evaluations used by M2-SPSA to reach the levels of accuracy shown.
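For reference, the gain schedules just described can be written as follows (a minimal sketch; the numerator coefficients a, A and c are placeholders to be chosen empirically, not values from the study):

```python
def sa_gains(k, a=1.0, A=100.0, c=1.0):
    """Slowly decaying SA gains with the Table 2.2 decay rates:
    a_k ~ 1/k^0.602, and c_k (likewise the c~_k sequence) ~ 1/k^0.101."""
    a_k = a / (k + 1 + A) ** 0.602
    c_k = c / (k + 1) ** 0.101
    return a_k, c_k
```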


The behavior of iterate averaging was consistent with the discussion in the previous section, in which the 1st-SPSA iterates had not yet settled into bouncing roughly uniformly around the solution. Using the numerical studies in Spall [18], we can show that M2-SPSA outperforms 1st-SPSA and 2nd-SPSA even more strongly in the noise-free (σ = 0) case for this loss function, but that it is inferior to 1st-SPSA in the high-noise (σ = 0.10) case. However, Spall [18] presents a study based on a larger number of loss measurements (i.e., more asymptotic) in which we can show that M2-SPSA outperforms 1st-SPSA and 2nd-SPSA in the high-noise case as well.

Table 2.2. Normalized loss values for 1st-SPSA, 2nd-SPSA and M2-SPSA with σ = 0.001; 90% confidence interval shown in [·].

No. of loss     1st-SPSA           1st-SPSA with       2nd-SPSA           M2-SPSA
measurements                       iterate averaging
2000            0.0046             0.0047              0.0041             0.0023
                [0.0040, 0.0052]   [0.0040, 0.0054]    [0.0037, 0.0050]   [0.0021, 0.0025]
10 000          0.0023             0.0023              0.0019             8.6×10⁻⁴
                [0.0021, 0.0025]   [0.0021, 0.0025]    [0.0019, 0.0022]   [7.6×10⁻⁴, 9.6×10⁻⁴]

It was also found that, if the iterates were constrained to lie in some hypercube around $\theta^{*}$ (as required, e.g., in genetic algorithms), then all values in Table 2.2 would be reduced, sometimes by several orders of magnitude. Such prior information can be valuable in speeding convergence.

2.13.2 Simulation 2

We will compare the performance of M2-SPSA with that of the standard first-order SPSA algorithm in Spall [18]. The loss function $L(\cdot)$ we consider is a fourth-order polynomial with significant interaction among the p = 10 elements in $\theta$; this makes the loss function flat near $\theta^{*}$ and, consequently, the optimization problem challenging. Tables 2.3 and 2.4 provide the results for this preliminary study, showing the ratio of the estimation error $\|\hat{\theta}_k-\theta^{*}\|$ to the initial error $\|\hat{\theta}_0-\theta^{*}\|$, based on an average of five independent runs (the same $\hat{\theta}_0$ was used in all runs, and $\|\cdot\|$ represents the standard Euclidean norm). 1st-SPSA and M2-SPSA represent the first-order and modified second-order SPSA algorithms, respectively. Table 2.3 considers the case where there is no noise in the measurements of $L(\cdot)$, while Table 2.4 includes Gaussian measurement noise (with a one-sigma value that ranges from 3 to over 100 percent of the $L(\theta)$ value as $\theta$ varies).

The left-hand column represents the total number of measurements used (so with 3000 measurements, 1st-SPSA has gone through k = 1500 iterations, while M2-SPSA has gone through k = 1000 iterations). The first two results columns in the tables represent runs with the same SA gains $a_k$, $c_k$, tuned numerically to approximately optimize the performance of the 1st-SPSA algorithm. The third results column is based on a (numerical) recalibration of $a_k$, $c_k$ to be approximately optimal for the M2-SPSA algorithm (an identical $a_k$ sequence was used for both M2-SPSA columns).

The results in both tables illustrate the performance of the M2-SPSA approach for a difficult-to-optimize (i.e., flat-surface) function. As expected, we see that the ratios (for both 1st-SPSA and M2-SPSA) tend to be lower in the no-noise case of Table 2.3. Further, we see that the M2-SPSA algorithm provides solutions closer to $\theta^{*}$ both with and without optimal M2-SPSA gains. An enlightening way to look at the numbers in the tables is to compare the number of measurements needed to achieve the same level of accuracy. We see that in the no-noise case (Table 2.3), the ratio of the number of measurements for M2-SPSA : 1st-SPSA ranged from 1:2 to 1:50. In the noisy measurement case (Table 2.4), the ratios for M2-SPSA : 1st-SPSA ranged from 1:2 to 1:20. These ratios offer considerable promise for practical problems, where p is even larger (say, as in the neural network-based direct adaptive control method of Spall and Cristion [25], where p can easily be of order $10^{2}$ or $10^{3}$). In such cases, other second-order techniques that require a growing (with p) number of function measurements are likely to become infeasible.


Table 2.3. Values of $\|\hat{\theta}_k-\theta^{*}\|\,/\,\|\hat{\theta}_0-\theta^{*}\|$ with no measurement noise.

Number of       1st-SPSA    M2-SPSA            M2-SPSA
measurements                w/1st-SPSA gains   w/optimal gains
3000            0.265       0.287              0.122
15000           0.184       0.160              0.033
30000           0.146       0.128              0.018

Table 2.4. Values of $\|\hat{\theta}_k-\theta^{*}\|\,/\,\|\hat{\theta}_0-\theta^{*}\|$ with measurement noise.

Number of       1st-SPSA    M2-SPSA            M2-SPSA
measurements                w/1st-SPSA gains   w/optimal gains
3000            0.273       0.292              0.243
15000           0.184       0.163              0.103
30000           0.146       0.141              0.097

There are several important practical concerns in implementing the M2-SPSA algorithm. One, of course, involves the choice of SA gains. As in all SA algorithms, this must be done with some care to ensure good performance of the algorithm. Some theoretical guidance is provided in Fabian [19], but we have found empirical experimentation to be more effective and easier. Another practical aspect involves the use of the Hessian estimate: in the studies here, we found it more effective not to use the Hessian estimate for the first few (100) iterations. This allows the inverse Hessian estimate to improve while it is not really needed, since $L(\cdot)$ is dropping quickly because of the characteristic steep initial decline of the standard SPSA algorithm.
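A minimal sketch of this warm-up logic (our illustration; the argument names and the scalar second-order scaling are assumptions, cf. the M2-SPSA step described earlier in this chapter):

```python
def m2_spsa_update(theta, k, a_k, grad_est, hess_scale, warmup=100):
    """For the first `warmup` iterations take the plain 1st-SPSA step,
    letting the (inverse-)Hessian estimate improve in the background;
    afterwards scale the step by the current Hessian-based factor."""
    if k < warmup:
        return theta - a_k * grad_est
    return theta - a_k * hess_scale * grad_est
```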


2.13.3 Simulation 3

First, let us consider the following system:

$$x_k+a_1x_{k-1}+a_2x_{k-2}=b_1u_{k-1}+b_2u_{k-2}\qquad(2.66)$$

where $a_1=-1.2$, $a_2=0.4$, $b_1=1.0$ and $b_2=0.7$.

Figure 2.7 shows the parameter estimation results using the algorithm in (2.62), and Fig. 2.8 shows the results for when bias compensation was not performed. Here, the input is white noise generated using a normal distribution with a variance of 0.6 and a mean of 0. The observed noise is a separate white noise generated using a normal distribution with a variance of 0.1 and a mean of 0. Also, the initial values for the estimation parameters are all 0, the magnitude c of the perturbation used in the algorithm is 0.0015, and the gain coefficient is $\rho_i=1/(i+1)^{0.9}$.
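For reproducibility, a minimal sketch of the data generation for this simulation (the additive observation model y = x + v is our assumption of how the observed noise enters):

```python
import numpy as np

a1, a2, b1, b2 = -1.2, 0.4, 1.0, 0.7
N = 100_000
rng = np.random.default_rng(0)
u = rng.normal(0.0, np.sqrt(0.6), N)  # input: white noise, variance 0.6
v = rng.normal(0.0, np.sqrt(0.1), N)  # observed noise, variance 0.1

x = np.zeros(N)
for k in range(2, N):
    # system (2.66): x_k = -a1 x_{k-1} - a2 x_{k-2} + b1 u_{k-1} + b2 u_{k-2}
    x[k] = -a1 * x[k-1] - a2 * x[k-2] + b1 * u[k-1] + b2 * u[k-2]
y = x + v  # the estimator observes only y and u
```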

Fig. 2.7. Identification results (with bias compensation): $\hat{a}_1$ (solid line), $\hat{b}_2$ (dashed line), $\hat{a}_2$ (dash-dot line), $\hat{b}_1$ (dotted line).


Fig. 2.8. Identification results (without bias compensation): $\hat{a}_1$ (solid line), $\hat{b}_2$ (dashed line), $\hat{a}_2$ (dash-dot line), $\hat{b}_1$ (dotted line).

These settings satisfy conditions (A11) and (A12) of the convergence theorem. In the figures above, the horizontal axis represents the number of iterations for the parameters. In Fig. 2.7, we can confirm that the estimated values converge to the true values. On the other hand, when bias compensation was not performed, it is clear from Fig. 2.8 that an estimation error occurs, as can be seen in (2.52); this means that the estimates could not be consistent for the system. Now, our proposed method is compared with other methods, namely the R-M type SA [9] and the 2nd-SPSA algorithm [18]. For all these methods, the variance of 0.1 for the observed noise was known, and the compensation algorithm was used. The results of estimation with almost 100,000 iterations of parameter correction are shown in Table 2.5. The average values over 50 trials are given for the estimation results.

Table 2.5. Comparison of estimators.

Algorithms    â₁            â₂         b̂₁         b̂₂
RM            -1.1770170    0.361731   0.964721   0.635410
M2-SPSA       -1.20511120   0.401234   1.006991   0.67401
2nd-SPSA      -1.1916300    0.393394   0.990554   0.664451
True value    -1.2          0.4        1.0        0.7

M2-SPSA: Estimators using the proposed method.
2nd-SPSA: Second-order SPSA [18].
RM: Estimators using the R-M SA [9].


In terms of estimation precision, 2nd-SPSA and M2-SPSA are better than the R-M SA method (see Table 2.5). In Fig. 2.7, we can see the corrections required in order to achieve suitable results. The values from the proposed SPSA algorithm are the closest to the true values. Also, in the other method (the RM algorithm), an exact value of the slope of the evaluation function is used. In contrast, in the proposed method the slope is estimated, and the estimation error for the slope affects the convergence speed. However, as explained before, when the system output can only be obtained via unknown characteristics, conventional estimation methods cannot be used. This is only a small study intended to show how the proposed SPSA algorithm is applied to parameter estimation.

To conclude this chapter, we have proposed a parameter estimation algorithm using M2-SPSA. The identification method using the SP seems particularly useful when the number of parameters to be identified is very large, or when the observed values of what is to be identified can only be obtained via an unknown observation system [38]-[41]. Furthermore, an improved time-differential SP method that requires only one observation of the error for each time increment has been proposed as an improvement; the system can also be used for identification problems. In this chapter, we have also made some empirical and theoretical comparisons between 1st-SPSA, 2nd-SPSA and other SA algorithms. It is found that the magnitude of the errors introduced by matrix inversion in 2nd-SPSA is greater for an ill-conditioned Hessian than for a well-conditioned Hessian. On the other hand, the errors in 1st-SPSA are less sensitive to the matrix conditioning of the loss function Hessians. To eliminate the errors introduced by the inversion of the estimated Hessian $H_k$, a modification (2.13) to 2nd-SPSA is suggested that replaces $H_k^{-1}$ with the scalar inverse of the geometric mean of all the eigenvalues of $H_k$. At finite iterations, it is found that the introduced M2-SPSA based on (2.13) and (2.14) outperforms 1st-SPSA and 2nd-SPSA in numerical experiments that represent a wide range of matrix conditioning. The asymptotic efficiency analysis shows that the ratio of the mean square errors of the proposed SPSA algorithm to those of 2nd-SPSA is always less than unity, except for a perfectly conditioned Hessian or for an asymptotically optimal setting of the gain sequence. Therefore, the general difference between the previous versions of the SPSA algorithm and our version presented above is that our proposed SPSA algorithm offers considerable potential for accelerating the convergence of SA algorithms while requiring only loss function measurements (no gradient or higher derivative measurements are needed). Since it requires only three measurements per iteration to estimate both the gradient and the Hessian, independently of the problem dimension p, it does not impose a large requirement for data collection. Also, the computational complexity and cost are reduced, as the previous simulations showed. The main features of our proposed SPSA are the following:

1) M2-SPSA is useful for complex problems where a great number of parameters need to be estimated; its description is given in Secs. 2.4 and 2.5.

2) The computation time is reduced by evaluating only a diagonal estimate of the Hessian matrix (see Sec. 2.3).

3) The eigenvalues of the Hessian matrix are computed very efficiently (see Sec. 2.3); a sketch of the geometric-mean scalar-inverse step built on them is given after this list.

4) M2-SPSA guarantees that the non-positive-definite part is eliminated using the FIM; the Hessian matrix inverse is improved (see Sec. 2.6).

5) The modification in the SPSA implementation improves the convergence of the algorithm when it is applied to parameter estimation (see Secs. 2.8 - 2.11).
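Following up on point 3) above, here is a minimal sketch (ours, not the thesis code) of the scalar replacement for $H_k^{-1}$: the step is scaled by the inverse of the geometric mean of the eigenvalues of the Hessian estimate instead of a full matrix inverse:

```python
import numpy as np

def geometric_mean_inverse(H_k):
    """Scalar replacement for H_k^{-1}: 1 / (geometric mean of the
    eigenvalues of H_k). Assumes H_k has been made positive definite
    beforehand (e.g., via the FIM-based correction of point 4))."""
    eigvals = np.linalg.eigvalsh(H_k)             # symmetric eigenvalues
    return 1.0 / np.exp(np.mean(np.log(eigvals)))

# Usage sketch: theta <- theta - a_k * geometric_mean_inverse(H_k) * g_k
```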


Chapter 3

Vibration Suppression Control of a Flexible Arm using Non-linear Observer with SPSA

In this first application, the proposed SPSA algorithm is applied to parameter estimation in methods for vibration control of the model proposed here: a non-linear observer and model reference sliding mode control. In both cases, the parameter estimation by M2-SPSA is compared with other good parameter estimators in order to show its efficiency; the computational cost and the accuracy of the parameters are compared here. Finally, a novel model reference sliding mode control applied to a non-linear observer is proposed. The main objective of this study concerns the vibration control of a one-link flexible arm system. A variable structure system (VSS) non-linear observer has been proposed in order to reduce the oscillation in controlling the angle of the flexible arm. The non-linear observer parameters are optimized using a modified version of the simultaneous perturbation stochastic approximation (SPSA) algorithm. The SPSA algorithm is especially useful when the number of parameters to be adjusted is large, and makes it possible to estimate them simultaneously. As for the vibration and position control, a model reference sliding-mode control (MR-SMC) has been proposed; the MR-SMC parameters are also optimized using the modified version of the SPSA algorithm. The simulations show that the vibration control of a one-link flexible arm system can be achieved more efficiently using our method. Therefore, by applying the MR-SMC method to the non-linear observer, we can improve the performance of this kind of model, and with our proposed SPSA algorithm, we can determine the control parameters very easily and efficiently.

3.1 Introduction

Traditionally, robotic manipulators have been designed and built in a manner that maximizes stiffness in order to minimize vibration and allow for good positional accuracy with relatively simple controllers [41]. High stiffness is achieved by using heavy links, which limits the rapid motion of the manipulator, increases the size of the actuators and boosts the energy consumption. Conversely, a lightweight manipulator is less expensive to manufacture and

operate. Weight reduction, however, incurs a penalty in that the manipulator becomes more flexible and more difficult to control accurately [41]. Since the manipulator is a distributed-parameter system, the control difficulty is caused by the fact that a large number of flexible modes are required to accurately model its behavior. Accordingly, we overcome these problems in this chapter. Since a simple model can be used for a flexible manipulator that carries a great tip load [41]-[43], this research has been centered on such a simple model, particularly the single flexible link moved in a horizontal plane. This kind of model is also very convenient because it shows more clearly the advantages of our method and of the control strategies described in this chapter. We have proposed a method with which the vibrations can be suppressed satisfactorily in the single flexible link system; this method helps to achieve very suitable control of the angular position of this system. The mathematical model of this system is described in Sec. 3.2. In the single flexible link, one end of the arm is attached to a motor and the other end carries a payload. In this chapter, control of the angular position of the arm while suppressing the oscillation is taken as the control purpose. Since feedback of only the motor angle is not sufficient to suppress the oscillation, we have considered a VSS non-linear observer combined with an MR-SMC in order to reduce the oscillation more efficiently. The variable structure systems theory has been successfully used in the development of robust observers for dynamical systems with bounded non-linearities and/or uncertainties. These observers do not require exact knowledge of the plant parameters and/or non-linearities; their design is solely based on knowing the upper bounds of the system uncertainties and/or non-linearities. Furthermore, in some studies, the estimated state variables were preferred over the measured ones in order to enhance the performance of the controller [47] or to reduce the effect of observation spillover in the active control of flexible structures [47]. In other words, VSS is fundamentally based on stability equations and minimization of the cost function. Therefore, the performance of the non-linear observer is assessed herein by examining its capability of predicting the rigid and flexible motions of a compliant beam that is connected to a revolute joint. With respect to MR-SMC, its advantage is robustness against parameter uncertainties, external disturbances and so on; MR-SMC is robust under the matching condition. In general, a suspension system is easily subjected to several parameter variations, such as variation of the sprung mass. The robustness of the SMC can be improved by shortening the time required to attain the sliding mode, or may be guaranteed during the whole interval of control action by eliminating the reaching phase. One easy way to minimize the reaching phase is to employ a large control input.


This MR-SMC is formulated for the position control of a single flexible link subjected to parameter variations. Also, a sliding surface that guarantees stable sliding mode motion during the sliding phase is synthesized in an optimal manner; this will be analyzed in Secs. 3.3 and 3.4. The MR-SMC and the observer have been designed based on a simplified model of the arm, which accounts only for the first elastic mode of the beam. Moreover, there are many parameters to be determined, so it is difficult to obtain them. Hence, in order to overcome this problem, a modified version of 2nd-SPSA has been proposed to obtain the observer/controller gains more efficiently. In the traditional SPSA, since all parameters are perturbed simultaneously, it is possible to modify the parameters with only two measurements of an evaluation function, regardless of the dimension of the parameter. This is very useful, but this SPSA can in some cases incur a high computational cost [3]. Therefore, M2-SPSA is applied to a parameter estimation algorithm in order to obtain the observer and controller parameters more efficiently and also to reduce the cost. We apply a parameter estimation algorithm using our proposed SPSA described in Chap. 2. The performance of this algorithm will be examined in terms of parameter selection, computational cost, and convergence performance in the current problem. Finally, in order to understand the proposed method using the non-linear observer, MR-SMC and SPSA, the control system uses only measurable data such as the motor angle, tip velocity, tip position, and control torque, as shown in Sec. 3.5.

3.2 Dynamic Modeling of a Single Link Robot Arm

3.2.1 Dynamic Model

The single flexible link is considered as a continuous cantilever beam of length L carrying a mass M, with a torque T applied by a motor that rotates the beam in a horizontal plane. The mass and elastic properties are assumed to be distributed uniformly along the single flexible link [44]. The physical configuration of this system is shown in Fig. 3.1. The system consists of a beam of length L with mass m, a torque T (that rotates the elastic arm) and an additional mass M (the payload at the end of the arm) [44]. The deflection y(x,t) is described by a series of separable modes:

$$y(x,t)=\sum_{i=1}^{n}\phi_i(x)\,q_i(t)\qquad(3.1)$$


which is assumed for the elastic displacement of the single flexible link, where $\phi_i(x)$ is a characteristic function and $q_i(t)$ is a mode function. The kinetic and potential energies of this system can be determined as follows:

$$T_{e}=\frac{1}{2}\dot{\theta}^{2}J+\frac{m}{2L}\Big(\dot{\theta}^{2}\sum_{i=1}^{n}A_{i}q_{i}^{2}+\sum_{i=1}^{n}A_{i}\dot{q}_{i}^{2}+2L\dot{\theta}\sum_{i=1}^{n}B_{i}\dot{q}_{i}\Big)+\frac{M}{2}\Big(L^{2}\dot{\theta}^{2}+\dot{\theta}^{2}\sum_{i=1}^{n}C_{i}^{2}q_{i}^{2}+2L\dot{\theta}\sum_{i=1}^{n}C_{i}\dot{q}_{i}+\sum_{i=1}^{n}C_{i}^{2}\dot{q}_{i}^{2}\Big)\qquad(3.2)$$

$$V=\frac{EI}{2}\sum_{i=1}^{n}D_{i}q_{i}^{2}\qquad(3.3)$$

where $\theta$ is the angle of the joint, E is Young's modulus, and I is the area moment of inertia, with the following variables:

$$A_{i}=\int_{0}^{L}\phi_{i}^{2}(x)\,dx,\quad B_{i}=\int_{0}^{L}x\,\phi_{i}(x)\,dx,\quad C_{i}=\phi_{i}(L),\quad D_{i}=\int_{0}^{L}\big[d^{2}\phi_{i}(x)/dx^{2}\big]^{2}dx.$$

The equation of motion of the cantilever beam for free vibration is based on the Euler-Bernoulli equation [45] and is written as follows:

$$EIL\frac{\partial^{4}y}{\partial x^{4}}+m\frac{\partial^{2}y}{\partial t^{2}}=0.\qquad(3.4)$$

Fig. 3.1. One-link flexible arm.


The beam has a uniform cross-section, and its boundary conditions are defined as follows [45]:

The deflection is zero at x = 0:
$$y(0,t)=0.\qquad(3.5)$$

The slope of the deflection is zero at x = 0:
$$\frac{dy}{dx}(0,t)=0.\qquad(3.6)$$

The bending moment is zero at x = L:
$$\frac{d^{2}y}{dx^{2}}(L,t)=0.\qquad(3.7)$$

Shear force balance at the tip:
$$EI\frac{d^{3}y}{dx^{3}}(L,t)=m\frac{d^{2}y}{dt^{2}}(L,t).\qquad(3.8)$$

From (3.4) and (3.5)-(3.8), we have

$$y_{i}(x,t)=\phi_{i}(x)\cos\omega_{i}t.\qquad(3.9)$$

Then $\phi_{i}(x)$ can be found as

$$\phi_{i}(x)=c_{1i}\cos\beta_{i}x+c_{2i}\cosh\beta_{i}x+c_{3i}\sin\beta_{i}x+c_{4i}\sinh\beta_{i}x\qquad(3.10)$$

$$\omega_{i}^{2}=\frac{EI}{\rho a}\,\beta_{i}^{4}.\qquad(3.11)$$

Substituting $\phi_{i}(x)$ from (3.10) into (3.9) and using (3.5)-(3.8), $\beta_{i}$ and $c_{1i}\sim c_{4i}$ are determined.


3.2.2 Equation of Motion and State Equations

The state equations of the system are derived to describe the dynamics of the single flexible link under certain assumptions [45]. Therefore, assuming that only the first mode exists, from (3.2) and (3.3), and using Lagrange's equations as in [45][46], we obtain

$$\frac{d}{dt}\left(\frac{\partial T_{e}}{\partial\dot{\theta}}\right)-\frac{\partial T_{e}}{\partial\theta}+\frac{\partial V}{\partial\theta}=T\qquad(3.12)$$

$$\frac{d}{dt}\left(\frac{\partial T_{e}}{\partial\dot{q}_{1}}\right)-\frac{\partial T_{e}}{\partial q_{1}}+\frac{\partial V}{\partial q_{1}}=0\qquad(3.13)$$

then

$$\begin{bmatrix}\alpha_{00}&\alpha_{01}\\\alpha_{01}&\alpha_{11}\end{bmatrix}\begin{bmatrix}\ddot{\theta}\\\ddot{q}_{1}\end{bmatrix}=\begin{bmatrix}T-2\alpha_{11}q_{1}\dot{q}_{1}\dot{\theta}\\-H_{1}q_{1}+\alpha_{11}q_{1}\dot{\theta}^{2}\end{bmatrix}\qquad(3.14)$$

$$y=\theta\qquad(3.15)$$

where $\alpha_{00}=J+ML^{2}+\alpha_{11}q_{1}^{2}$, T is the motor's shaft torque, J is the moment of inertia about the joint axis, $\alpha_{01}=\omega_{1}+ML\phi_{1e}$, $\alpha_{11}=v_{1}+ML\phi_{1e}$, $v_{1}=\rho a\int_{0}^{L}\phi_{1}^{2}\,dx$, $\rho$ is the density, $H_{1}=EI\int_{0}^{L}\big(d^{2}\phi_{1}/dx^{2}\big)^{2}dx$, $\phi_{1e}=\phi_{1}(L)$, $\omega_{1}=\rho a\int_{0}^{L}x\,\phi_{1}\,dx$, a is the area of the cross-section, and y is the observation of $\theta$. In order to obtain the variables that we will use to evaluate our method, the state variables are defined as

$$x_{1}=\theta,\quad x_{2}=\dot{\theta},\quad x_{3}=q_{1},\quad x_{4}=\dot{q}_{1}.$$


Then

$$\begin{bmatrix}\dot{x}_{1}\\\dot{x}_{2}\\\dot{x}_{3}\\\dot{x}_{4}\end{bmatrix}=\begin{bmatrix}x_{2}\\f_{1}(x_{2},x_{3},x_{4})\\x_{4}\\f_{2}(x_{2},x_{3},x_{4})\end{bmatrix}+\begin{bmatrix}0\\b_{1}\\0\\b_{2}\end{bmatrix}T\qquad(3.16)$$

where

$$f_{1}(x_{2},x_{3},x_{4})=\frac{-2\alpha_{11}^{2}x_{2}x_{3}x_{4}-\alpha_{01}\big(-H_{1}x_{3}+\alpha_{11}x_{2}^{2}x_{3}\big)}{\alpha_{00}\alpha_{11}-\alpha_{01}^{2}}$$

$$f_{2}(x_{2},x_{3},x_{4})=\frac{2\alpha_{01}\alpha_{11}x_{2}x_{3}x_{4}+\alpha_{00}\big(-H_{1}x_{3}+\alpha_{11}x_{2}^{2}x_{3}\big)}{\alpha_{00}\alpha_{11}-\alpha_{01}^{2}}$$

$$b_{1}=\frac{\alpha_{11}}{\alpha_{00}\alpha_{11}-\alpha_{01}^{2}},\qquad b_{2}=\frac{-\alpha_{01}}{\alpha_{00}\alpha_{11}-\alpha_{01}^{2}}.$$
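For reference, a minimal sketch (our illustration) of how (3.16) can be integrated numerically with simple Euler stepping; for simplicity the α coefficients are treated as constants here, although α₀₀ in fact varies with q₁:

```python
import numpy as np

def flexible_link_rhs(x, T, a00, a01, a11, H1):
    """Right-hand side of the state equation (3.16)."""
    _, x2, x3, x4 = x
    det = a00 * a11 - a01**2
    core = -H1 * x3 + a11 * x2**2 * x3
    f1 = (-2 * a11**2 * x2 * x3 * x4 - a01 * core) / det
    f2 = (2 * a01 * a11 * x2 * x3 * x4 + a00 * core) / det
    b1, b2 = a11 / det, -a01 / det
    return np.array([x2, f1 + b1 * T, x4, f2 + b2 * T])

def euler_step(x, T, dt, params):
    return x + dt * flexible_link_rhs(x, T, *params)
```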

3.3 Design of Non-linear Observer

In this section, since only the motor angle $x_{1}$ is a measurable state variable, the remaining states $x_{2}$, $x_{3}$ and $x_{4}$ are predicted using an intelligent state observer design [47]. For this, (3.14)-(3.15) are written as follows:

State equations:
$$\dot{x}=f(x)+g(x)T\qquad(3.17)$$

Output equations:
$$y=c^{T}x,\qquad c^{T}=[1\ \ 0\ \ 0\ \ 0].\qquad(3.18)$$


For this non-linear system, we consider a robust VSS observer which predicts the system states. This observer is defined as follows:

$$\dot{\hat{x}}=f(\hat{x})+g(\hat{x})T+M(\bar{y})+K(\hat{y}-y)\qquad(3.19)$$

$$\hat{y}=c^{T}\hat{x}\qquad(3.20)$$

$$M(\bar{y})=-g(x)\,\varsigma\,\frac{\bar{y}}{|\bar{y}|+\gamma}\qquad(3.21)$$

$$\bar{y}=\hat{y}-y=c^{T}(\hat{x}-x)\qquad(3.22)$$

where $\hat{x}$ represents the predicted value of the system state as in [47], K is the observer gain matrix, $M(\bar{y})$ is the observer non-linearity term, $\varsigma$ represents the gain, and $\gamma>0$ is an averaging constant for removing chattering. Now, defining the estimation error as

$$e=\hat{x}-x\qquad(3.23)$$

we have

$$\dot{e}=f(\hat{x})-f(x)+\big[g(\hat{x})-g(x)\big]T+Kc^{T}(\hat{x}-x)+M(\bar{y}).\qquad(3.24)$$
(3.24)<br />

For evaluating <strong>of</strong> the observer gain K with<br />

xd<br />

as the desired point, using the Taylor series<br />

expansion and its first <strong>order</strong> approximation, the error system is given as follows:<br />

e&<br />

= [ f '( x<br />

d<br />

= A0e<br />

+ M(<br />

y).<br />

) + g'(<br />

x<br />

d<br />

) T + Kc<br />

T<br />

] e + M(<br />

y)<br />

(3.25)<br />

where<br />

A +<br />

A<br />

T<br />

0<br />

= A + GT Kc<br />

(3.26)<br />

∂f<br />

i<br />

= (3.27)<br />

∂x<br />

∂g<br />

G ∂ x<br />

j<br />

i<br />

= (i,-j = 1,2,3,4). (3.28)<br />

j<br />


Choosing a Lyapunov function of e as

$$V=\frac{1}{2}e^{2}\qquad(3.29)$$

and differentiating V along the error trajectory yields

$$\dot{V}=e^{T}\dot{e}=e^{T}\Big(A_{0}e-g(x)\,\varsigma\,\frac{c^{T}e}{|c^{T}e|+\gamma}\Big).\qquad(3.30)$$

If K is designed such that the eigenvalues of the error system (3.26) are all negative, then the selection of $A_{0}-g(x)\varsigma<0$ yields $\dot{V}<0$, and Lyapunov's stability theory gives $e(t)\to0$ as $t\to\infty$.

In the simulation, we chose $x_{d}=[0.1\ \ 0\ \ 0\ \ 0]^{T}$ and computed A and G with the observer parameters determined by the M2-SPSA algorithm (see Chap. 2). Therefore, to ensure this stability, the following evaluation function is minimized:

$$J_{0}=\Sigma\,(y-\hat{y})^{2}.\qquad(3.31)$$

In the determination of the unknown parameters of the non-linear observer, $k_{1}$, $k_{2}$, $k_{3}$, $k_{4}$, $\varsigma$ and $\gamma$, each parameter is calculated by (2.62). The parameters are determined as $k=[-227\ \ {-25015}\ \ 13.69\ \ {-11101}]^{T}$, $\varsigma=0.010$ and $\gamma=0.002$.
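A minimal discrete-time sketch of the observer (3.19)-(3.22) (forward-Euler form, our illustration; f and g are the model functions of (3.17), and g is evaluated at the estimate since the true state is unavailable):

```python
import numpy as np

c = np.array([1.0, 0.0, 0.0, 0.0])  # output vector, y = c^T x

def observer_step(x_hat, y_meas, T, dt, f, g, K, varsigma, gamma):
    """One Euler step of the VSS observer: model prediction, smoothed
    switching term (3.21) to remove chattering, and linear correction."""
    y_bar = c @ x_hat - y_meas                        # output error (3.22)
    M = -g(x_hat) * varsigma * y_bar / (abs(y_bar) + gamma)
    x_hat_dot = f(x_hat) + g(x_hat) * T + M + K * y_bar
    return x_hat + dt * x_hat_dot
```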

3.4 Model Reference Sliding Mode Controller

The MR-SMC is often used for robust control of non-linear systems and also for stabilizing single-input systems. The main purpose of the MR-SMC is to make the states converge to the sliding mode surface; this normally depends on the sliding mode controller design. For the MR-SMC, a Lyapunov function is applied to keep the non-linear system under control. In this case, the MR-SMC is formulated for the tip position control of a single flexible link subjected to parameter variations. The desired response is based on a second-order reference model given as [47]


$$\begin{bmatrix}\dot{x}_{m}\\\ddot{x}_{m}\end{bmatrix}=\begin{bmatrix}0&1\\-\omega_{n}^{2}&-2\omega_{n}\end{bmatrix}\begin{bmatrix}x_{m}\\\dot{x}_{m}\end{bmatrix}+\begin{bmatrix}0\\\omega_{n}^{2}\end{bmatrix}U_{m}\qquad(3.32)$$

where $\omega_{n}$ is the natural angular frequency and $U_{m}$ is the model input. For the sliding mode controller, the Lyapunov stability method is applied to keep the non-linear system under control. The sliding mode approach is a method which transforms a higher-order system into a first-order system. In that way, a simple control algorithm can be applied, which is very straightforward and robust.

The surface is called a switching surface: when the plant state trajectory is "above" the surface, a feedback path has one gain, and a different gain if the trajectory drops "below" the surface. This surface defines the rule for proper switching, and is also called a sliding surface (sliding manifold). Ideally, once intercepted, the switched control maintains the plant's state trajectory on the surface for all subsequent time, and the plant's state trajectory slides along this surface (see Fig. 3.2). With the sliding surface mentioned above, sliding mode control became an important robust control approach. For the class of systems to which it applies, sliding mode controller design provides a systematic approach to the problem of maintaining stability and consistent performance in the face of modeling imprecision. Moreover, by allowing the tradeoffs between modeling and performance to be quantified in a simple fashion, it can illuminate the whole design process.

Fig. 3.2. Sliding mode surface.


The most important task is to design a switched control that will drive the plant state to the switching surface and maintain it on the surface upon interception; a Lyapunov approach is used to characterize this task, as explained later. Now, we assume the sliding mode hyper-plane for the system of (3.14), with the state variables predicted by the observer, as

$$\sigma=s_{1}(x_{1}-x_{m})+s_{2}(x_{2}-\dot{x}_{m})+s_{3}x_{3}+s_{4}x_{4}.\qquad(3.33)$$

(3.33)<br />

When the sliding mode is in operation, then<br />

σ = 0<br />

(3.34)<br />

σ& = 0.<br />

(3.35)<br />

The equivalent control input can be obtained by substituting (3.14) into (3.35). This gives<br />

T<br />

eq<br />

= 2α<br />

x x x<br />

∆<br />

⋅<br />

⎡<br />

−<br />

⎢<br />

s1(<br />

x2<br />

− x<br />

s ⎣<br />

2<br />

11<br />

2<br />

3<br />

4<br />

m<br />

α01<br />

+ ( −H1x<br />

α<br />

) − s<br />

11<br />

2<br />

⋅⋅<br />

3<br />

3<br />

x + s x<br />

m<br />

4<br />

2<br />

+ α x x )<br />

11<br />

2<br />

⋅ ⎤<br />

+ s4x4<br />

⎥<br />

⎦<br />

3<br />

(3.36)<br />

where it can be assumed that<br />

∆<br />

=<br />

2<br />

α<br />

00<br />

− α<br />

01<br />

/ α ) > 0 .<br />

(<br />

11<br />

Now, the design of the MR-SMC is considered, in which the non-linear input makes the state converge to the hyper-plane. In general, the eventual sliding mode input can be considered as two independent inputs, namely, the equivalent control input $T_{eq}$ and the non-linear control input $T_{l}$; in other words,

$$T=T_{eq}+T_{l}=T_{eq}-k(x,t)\,\mathrm{sat}(\sigma)\qquad(3.37)$$

where


$$\mathrm{sat}(\sigma)=\begin{cases}1 & \text{if }\ \sigma>\delta\\ \sigma/\delta & \text{if }\ |\sigma|\le\delta\\ -1 & \text{if }\ \sigma<-\delta\end{cases}\qquad(3.38)$$

and $k(x,t)$ is the control input function; $\delta$ is a constant to eliminate the chattering. The condition for realization of the sliding mode is obtained from the Lyapunov function, as mentioned before. The Lyapunov method is usually used to determine the stability properties of an equilibrium point without solving the state equation. A generalized Lyapunov function that characterizes the motion of the state trajectory to the sliding surface is defined in terms of the surface. For each chosen switched control structure, one chooses the "gains" so that the derivative of this Lyapunov function is negative definite, thus guaranteeing motion of the state trajectory to the surface. After proper design of the surface, a switched controller is constructed so that the tangent vectors of the state trajectory point towards the surface, such that the state is driven to and maintained on the sliding surface. Such controllers result in discontinuous closed-loop systems. The following Lyapunov function of $\sigma$ is chosen to confirm $\sigma=0$:

$$V=\frac{1}{2}\sigma^{2}.\qquad(3.39)$$

With this, $\dot{V}$ is given by

$$\dot{V}=\sigma\dot{\sigma}=\sigma\left\{\frac{s_{2}}{\Delta}\left[T-2\alpha_{11}x_{2}x_{3}x_{4}-\frac{\alpha_{01}}{\alpha_{11}}\big(-H_{1}x_{3}+\alpha_{11}x_{2}^{2}x_{3}\big)\right]+s_{1}(x_{2}-\dot{x}_{m})-s_{2}\ddot{x}_{m}+s_{3}x_{4}+s_{4}\dot{x}_{4}\right\}.\qquad(3.40)$$

Substituting (3.37) into (3.40), the existence condition for the sliding mode is given as

$$\dot{V}=\sigma\left\{-\frac{s_{2}}{\Delta}\,k(x,t)\,\mathrm{sgn}(\sigma)\right\}=-\frac{s_{2}}{\Delta}\,k(x,t)\,|\sigma|<0.\qquad(3.41)$$

Since $s_{2}/\Delta>0$, if we choose $k(x,t)>0$, then the state variable x will converge to the sliding

mode hyper-plane and a stable SMC can be realized. The controller gains are determined using our proposed algorithm (see Chap. 2) so as to minimize the cost function

$$J_{h}=\sum\big[L\cdot(x_{1}-x_{m})+x_{3}\big].\qquad(3.42)$$

The unknown parameters of the MR-SMC, $s_{1}$, $s_{2}$, $s_{3}$, $s_{4}$, $k(x,t)$ and $\delta$, are each calculated by (2.62). The parameter values are $s_{1}=4.2$, $s_{2}=1$, $s_{3}=10.19$, $s_{4}=-0.41$, $\delta=0.2$ and $k(x,t)=2.14$. Figure 3.3 shows the block diagram of the designed system.

Fig. 3.3. Block diagram of the MR-SMC system incorporating the non-linear observer.
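A minimal sketch of the control law (3.33), (3.37) and (3.38) (our illustration; the equivalent-control term of (3.36) is abstracted as an argument):

```python
import numpy as np

def sat(sigma, delta):
    """Saturation function (3.38): linear inside the boundary layer of
    width delta, equal to sign(sigma) outside; this removes chattering."""
    return np.clip(sigma / delta, -1.0, 1.0)

def smc_torque(x, x_m, xdot_m, T_eq, s, k_gain, delta):
    """Total torque (3.37): equivalent control plus the switching term
    that drives the state onto the sliding surface (3.33)."""
    s1, s2, s3, s4 = s
    sigma = s1 * (x[0] - x_m) + s2 * (x[1] - xdot_m) + s3 * x[2] + s4 * x[3]
    return T_eq - k_gain * sat(sigma, delta)

# Gains reported in the text: s = (4.2, 1.0, 10.19, -0.41), delta = 0.2,
# k_gain = 2.14.
```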

3.5 Simulation

The MR-SMC method and M2-SPSA are used in order to achieve very suitable control of the angular position of the single flexible link, suppressing its oscillation. The results are compared with simulations done previously [47] without the proposed SMC. The numerical values are as follows: J = 0.00135520 [kg·m²], m = 0.026 [kg], ρa = 0.0630 [kg/m], EI = 0.09007 [N·m²], L = 0.4 [m], $x_{0}=[-0.1\ \ 0\ \ 0\ \ 0]^{T}$, $x_{d}=[0.1\ \ 0\ \ 0\ \ 0]^{T}$, Δt = 0.1 [ms], M = 0.025 [kg]. First, the parameter estimation in the non-linear observer and the MR-SMC using the proposed SPSA algorithm is


compared with effective estimation algorithms under the same conditions mentioned previously; the Robbins-Monro stochastic approximation (RM-SA) [9] and the least-squares (LS) method [10] are used here.

Table 3.1. Comparison of estimators (non-linear observer).

Algorithm   k1     k2       k3      k4       ζ       γ
M2-SPSA     -227   -25015   13.69   -11101   0.010   0.002
RM-SA       -366   -30055   19.10   -12971   0.019   0.006
LS          -397   -30471   20.16   -13100   0.042   0.009

Table 3.2. Comparison of estimators (MR-SMC).

Algorithm   s1    s2   s3      s4      δ     k(x,t)
M2-SPSA     4.2   1    10.19   -0.41   0.2   2.14
RM-SA       5.0   2    17.72   -0.67   0.2   3.63
LS          5.8   2    20.14   -0.84   0.2   4.01

In the above tables, the values obtained by M2-SPSA are very suitable in terms of estimation precision for the current system. The results obtained by our algorithm are explained by the fact that M2-SPSA does not depend on derivative information and is able to find a good approximation to the solution using few function values; this results in a low computational cost. Also, its implementation is easier than that of the other methods, since our algorithm needs fewer coefficients to be specified. For this reason, it is possible to obtain good parameter estimates. Finally, in the other methods, an exact value of the slope [48] is used for the evaluation function. The variability of the parameter values is explained by the stopping condition: when the value becomes very small, the iterations are stopped; the tables reflect this criterion as defined in this simulation.

In contrast, in M2-SPSA the slope is estimated, and the estimation error for the slope affects the convergence speed. Table 3.3 compares the number of iterations and the computational load, or normalized CPU (central processing unit) time [49] (the computational cost in processing time), with the CPU time required by M2-SPSA as the reference. These comparisons are


made according to the average performance of M2-SPSA and the SA algorithms for the estimated parameters given in Tables 3.1 and 3.2. The CPU time is the processing time needed to estimate each parameter; here, the CPU time of M2-SPSA is represented as 1, so we can evaluate whether the other algorithms used for comparison need two or more times the CPU time required by our proposed SPSA.

Table 3.3. Performance comparison among M2-SPSA, RM-SA and LS.

Algorithm   Iterations   CPU
M2-SPSA     30000        1
RM-SA       29000        2.1
LS          28000        5.2

In Table 3.3, LS is efficient in terms <strong>of</strong> the number <strong>of</strong> iterations required to achieve a certain<br />

level <strong>of</strong> accuracy in the parameter estimation <strong>for</strong> the current system, but it is computationally<br />

expensive and also has a high computational complexity. The LS and RM-SA algorithms<br />

depend on derivative in<strong>for</strong>mation and its solution in each iteration this can increase the<br />

computational cost and complexity.<br />

The CPU time required by LS and RM-SA is 5 to 2 times respectively the CPU required by<br />

M2-<strong>SPSA</strong>, so that, in terms <strong>of</strong> efficiency, the use <strong>of</strong> these algorithms might be questionable. On<br />

the other hand, the proposed <strong>SPSA</strong> algorithm has a low computational cost and usually provides<br />

less dispersed parameters. In the number <strong>of</strong> iterations, these algorithms are almost similar but<br />

according to features <strong>of</strong> our proposed <strong>SPSA</strong>, this can reduce the computational cost (see Chap.<br />

2) and this is a great advantage. Even, the typical <strong>SPSA</strong> algorithm has a modest computational<br />

complexity as is shown in [6], this reason causes a low computational expensive in M2-<strong>SPSA</strong>.<br />

The data obtained by M2-SPSA in Table 3.3 are explained by the fact that this algorithm is a very powerful technique that approximates the gradient or Hessian by applying simultaneous random perturbations to all the parameters. Therefore, the data of the proposed SPSA algorithm contrast with the other approximations, in which the evaluation of the gradient is obtained by varying the parameters one at a time.
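A minimal sketch of this contrast is given below (the objective `J`, the gain `c` and the Bernoulli perturbation are illustrative assumptions, not the tuned values of the thesis): the simultaneous-perturbation estimate uses two evaluations of the objective in total, whereas perturbing one parameter at a time uses two evaluations per parameter.

```python
import numpy as np

def spsa_gradient(J, theta, c=1e-2, rng=np.random.default_rng(0)):
    """Gradient estimate from ONE simultaneous random perturbation of all
    parameters: only two evaluations of J, whatever the dimension of theta."""
    delta = rng.choice([-1.0, 1.0], size=theta.shape)   # Bernoulli +/-1 vector
    return (J(theta + c * delta) - J(theta - c * delta)) / (2.0 * c * delta)

def fd_gradient(J, theta, c=1e-2):
    """Finite-difference estimate perturbing one parameter at a time:
    2*m evaluations of J for an m-dimensional theta."""
    g = np.zeros_like(theta)
    for i in range(theta.size):
        e = np.zeros_like(theta)
        e[i] = c
        g[i] = (J(theta + e) - J(theta - e)) / (2.0 * c)
    return g
```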

Figures 3.4-3.7 show the simulation results for the state variables and the torque. Figure 3.4 shows the response of the motor shaft angle in the simulation by the proposed method. The tracking performance associated with the motor angle is very good when the non-linear observer is applied together with the MR-SMC method.


Fig. 3.4. Motor angle. Without M2-SPSA and MR-SMC (dotted line (.-)). With RM-SA and MR-SMC (dashed line (- -)). With LS and MR-SMC (dash-dot line (-.-)). With M2-SPSA and MR-SMC (solid line (-)).

Figure 3.5 shows the tip position response of the single flexible link. The VSS non-linear observer is very important in eliminating the effects due to the load of the arm (see the solid line). Figure 3.6 shows the tip velocity. The proposed algorithm with MR-SMC reduces the magnitude of the velocity to a small value (solid line). We can see that after 0.5 seconds the system starts to become stable and the state variables predicted by the non-linear observer converge more efficiently on the sliding-mode plane. Figure 3.7 shows the control torque. This simulation shows the control of the force that rotates the beam generated by our method (solid line), which is stabilized after 0.5 seconds. In these simulations, we can see that using the non-linear observer and MR-SMC it is possible to obtain good performance, since the non-linear observer is very reliable in predicting the state variables. Also, MR-SMC is an important control method used here that needs an indispensable estimate of all the state variables predicted by the non-linear observer. The sliding-mode control method is thus an important robust control approach. For the class of systems to which it applies, sliding-mode controller design provides a systematic approach to the problem of maintaining stability and consistent performance in the face of modeling imprecision.


On the other hand, by allowing the tradeoffs between modeling and performance to be quantified in a simple fashion, it can illuminate the whole design process.

Fig. 3.5. Tip position. Without M2-SPSA and MR-SMC (dotted line (.)). With RM-SA and MR-SMC (dashed line (- -)). With LS and MR-SMC (dash-dot line (-.-)). With M2-SPSA and MR-SMC (solid line (-)).

Fig. 3.6. Tip velocity. Without M2-SPSA and MR-SMC (dotted line (.)). With RM-SA and MR-SMC (dashed line (- -)). With LS and MR-SMC (dash-dot line (-.-)). With M2-SPSA and MR-SMC (solid line (-)).


Fig. 3.7. Control torque. Without M2-SPSA and MR-SMC (dotted line (.)). With RM-SA and MR-SMC (dashed line (- -)). With LS and MR-SMC (dash-dot line (-.-)). With M2-SPSA and MR-SMC (solid line (-)).

Fig. 3.8. Motor angle. Simulation using $x_1$ with M2-SPSA and MR-SMC (solid line). Simulation using $x_m$ with M2-SPSA and MR-SMC (dashed line).

Fig. 3.9. Tip position. Simulation using $x_3$ with M2-SPSA and MR-SMC (solid line). Simulation using $\hat{x}_3$ with M2-SPSA and MR-SMC (dashed line).


Fig. 3.10. Tip velocity. Simulation using $x_4$ with M2-SPSA and MR-SMC (solid line). Simulation using $\hat{x}_4$ with M2-SPSA and MR-SMC (dashed line).

These figures confirm the observations above: the non-linear observer is very reliable in predicting the state variables, and MR-SMC needs an indispensable estimate of all the state variables predicted by the observer. For this kind of system, the MR-SMC design (see Fig. 3.3) provides a systematic approach to the problem of maintaining stability and consistent performance in the face of modeling imprecision. Moreover, M2-SPSA showed a better performance in estimating the observer and MR-SMC parameters than the other algorithms.

In this chapter, we have proposed an MR-SMC method using a non-linear observer for controlling the angular position of a single flexible link while suppressing its oscillation. We can see that the non-linear observer and the MR-SMC provide successful and stable operation of the system. We have also proposed the use of M2-SPSA to determine the observer/controller gains; it could determine them very efficiently and with a low computational cost. The non-linear observer was successful in predicting the state variables from the motor angular position, and the MR-SMC was a very efficient control method.


In future work, we plan to carry out real experiments using this model. Before that, however, it is necessary to evaluate several factors, such as the physical conditions (the dimensions and material of the flexible arm) and the estimation of the gradient, which needs to reach a certain level of accuracy. The handling of the deflection within the proposed method is also a factor to be considered in the real experiments. Since a robust controller must also be considered, a reasonably exact model is thought to be necessary in order to predict the experimental results through simulations, and this feature must be taken into account as well. Finally, friction is another important factor to consider in the real experiments.



Chapter 4

Lattice IIR Adaptive Filter Structure Adapted by SPSA Algorithm

In this second application, the M2-SPSA algorithm is applied to parameter estimation, in this case to obtain the coefficients of the adaptive algorithms in the model proposed here; these adaptive algorithms are the Steiglitz-McBride (SM) and the Simple Hyperstable Adaptive Recursive Filter (SHARF). The results are compared with previous lattice versions of these algorithms, and the performance of the coefficients is compared. Finally, we also make some modifications to the adaptive algorithms proposed here in order to obtain suitable stability and convergence.

Adaptive infinite impulse response (IIR), or recursive, filters are less attractive mainly because of the stability issues and the difficulties associated with their adaptive algorithms. Therefore, in this chapter adaptive IIR lattice filters are studied in order to devise algorithms that preserve the stability properties of the corresponding direct-form schemes. We analyze the local properties of stationary points, and a transformation achieving this goal is suggested, which yields algorithms that can be efficiently implemented. The application to the SM and SHARF algorithms is presented. M2-SPSA is used to obtain the coefficients of the lattice form more efficiently and with a lower computational cost and complexity. The results are compared with previous lattice versions of these algorithms, which may fail to preserve the stability of the stationary points.

4.1 Introduction

In the last decade, substantial research effort has been spent on turning adaptive IIR filtering techniques into a reliable alternative to traditional adaptive finite impulse response (FIR) filters. The main advantages of IIR filters are that, owing to their pole-zero structure, they are better suited to modeling physical systems, and that they require far fewer parameters to achieve the same performance level as FIR filters. Unfortunately, these good characteristics come along with some possible drawbacks inherent to adaptive filters with a recursive structure, such as algorithm


instability, convergence to biased and/or locally minimal solutions, as well as slow convergence. Consequently, several new algorithms for adaptive IIR filtering have been proposed in the literature attempting to overcome these problems. Extensive research on the subject, however, seems to suggest that no general-purpose optimal algorithm exists. In fact, all available information must be considered when applying adaptive IIR filtering, in order to determine the most appropriate algorithm for a given problem. The need for ensuring stable operation of adaptive IIR filters has spawned much interest in structures other than the direct form. In particular, the lattice structure has received considerable attention due to several advantages, such as a one-to-one correspondence between transfer functions and parameter spaces, good numerical properties, as well as built-in stability [50]. Therefore, several adaptive algorithms described in [50], originally devised for direct-form structures, have been modified to allow for a lattice realization of the filter. These algorithms use a conventional method based on exploiting the properties of the lattice structure [52] and suitable approximations [53]. Algorithms based on this conventional method offer a relatively low computational load, and in most cases these approximate lattice algorithms preserve the set of stationary points. Nevertheless, it has not been clear whether the convergence properties of the stationary points are well preserved. Also, the reduction in the computational load is not sufficient, especially in the estimation of the reflection coefficients of the lattice form. Hence, in this chapter a new approach to improve the lattice structure is proposed. The Ordinary Differential Equation (ODE) method [50]-[54] is used to derive a transformation, which allows sufficient conditions for convergence to be established. The method is very general, applying to any pair of structures as long as a one-to-one correspondence exists between them. For the direct-form to lattice case, it is shown how to efficiently implement this transformation. This approach is applied to the same adaptive algorithms used in [50], in this case the lattice versions of the Steiglitz-McBride (SM) and the Simple Hyperstable Adaptive Recursive Filter (SHARF) algorithms, for which it is also shown how pre-existing approximate algorithms may fail to converge in some cases. Finally, in order to obtain the reflection coefficients of the lattice form, we have proposed a gradient-free method. Such methods are based only on objective function measurements and do not require knowledge of the gradients of the underlying model. As a result, they are very easy to implement and reduce the computational cost of their applications. The gradient-free method proposed here is the Simultaneous Perturbation Stochastic Approximation (SPSA) algorithm [3]. It is based on a randomized method in which all parameters are perturbed simultaneously [3], which makes it possible to update the parameters with only two measurements of an evaluation function regardless of the dimension of the parameter vector. This algorithm is very


useful, but the traditional SPSA algorithm can incur, in some cases (systems with a large number of parameters), a high computational cost [3]. Therefore, we have proposed a modified version of SPSA applied to the estimation of the reflection coefficients of the current system, in order to obtain the estimated coefficients more efficiently while reducing the computational cost. The organization of the present chapter is as follows. In Sec. 4.2, the derivation of the proposed algorithm is described. In Sec. 4.3, the application to the lattice structure is explained. The adaptive algorithms are described in Sec. 4.4. The simulation results with the proposed methods are shown in Sec. 4.5.

4.2 Procedure of Improved Algorithm

Consider a direct-form adaptive filter

$$\hat{H}(z) = \frac{B(z)}{A(z)} = \frac{\sum_{i=0}^{N} b_i z^{-i}}{1 + \sum_{j=1}^{M} a_j z^{-j}} \qquad (4.1)$$

parameterized by $\theta_d = [\,b_0, \ldots, b_N, a_1, \ldots, a_M\,]^T$. Usually, constant-gain algorithms can be written as

$$\theta_d(n+1) = \theta_d(n) + \mu\, X_d(n)\, e(n) \qquad (4.2)$$

where $\mu > 0$ is a step size, $e(\cdot)$ is some signal and $X_d(\cdot)$ is a driving vector that depends on the specific algorithm. Let $\theta_l$ be the corresponding parameter vector for a different implementation of the filter, such that there exists a one-to-one map $\theta_d = f(\theta_l)$ defined on a suitable stability domain that allows one to move back and forth between both descriptions. The objective is to reformulate algorithm (4.2) in terms of $\theta_l$. Let us define the Jacobian matrix as

$$F(\theta_f) = \frac{df(\theta_l)}{d\theta_l}. \qquad (4.3)$$

We omit the subscript in the argument, since $F$ can be expressed as a function of either $\theta_d$ or $\theta_l$ by means of the map $f$. We can think of $\theta_f$ as representing the actual transfer function $\hat{H}(z)$, while $\theta_d$ and $\theta_l$ are the parameter vectors that describe $\hat{H}(z)$ in a particular set of coordinates. The following algorithm can update $\theta_l$:


$$\theta_l(n+1) = \theta_l(n) + \mu\, X_l(n)\, e(n) \qquad (4.4)$$

$$X_l(n) = F^T(\theta_f(n))\, X_d(n). \qquad (4.5)$$

That is, the driving vector for the new coordinates, $X_l(n)$, is related to $X_d(n)$ through the Jacobian $F$. Since the map $f$ is one-to-one, $F(\theta_f)$ has full rank for all $\theta_f$ describing stable transfer functions. Therefore, if $\theta_d^* = f(\theta_l^*)$, then $\theta_d^*$ is a stationary point of (4.2) iff $\theta_l^*$ is a stationary point of (4.4), since

$$E[\,X_l(n)e(n)\,]\big|_{\theta_f^*} = 0 \;\Longleftrightarrow\; E[\,X_d(n)e(n)\,]\big|_{\theta_f^*} = 0. \qquad (4.6)$$

Thus the stationary points are preserved. We now turn to the convergence issue. By applying the ODE method [55], for sufficiently small $\mu$ the stationary point $\theta_l^*$ is locally stable for algorithm (4.4) iff all the eigenvalues of the matrix

$$S_l = \frac{dE[\,X_l(n)e(n)\,]}{d\theta_l}\bigg|_{\theta_f^*} = \underbrace{E\!\left[\frac{dX_l(n)}{d\theta_l}\, e(n)\right]_{\theta_f^*}}_{=\,P} + \underbrace{E\!\left[X_l(n)\, \frac{de(n)}{d\theta_l}^{T}\right]_{\theta_f^*}}_{=\,Q} \qquad (4.7)$$

have negative real parts. For a vector $V$, let $V^{(k)}$ denote its $k$-th component. Then, the $i,j$ element of $P$ is given by

$$P_{i,j} = E\!\left[\frac{\partial X_l^{(i)}(n)}{\partial \theta_l^{(j)}}\, e(n)\right]_{\theta_f^*} = \sum_{k=1}^{N+M+1} \frac{\partial F_{ki}(\theta_f)}{\partial \theta_l^{(j)}}\, \underbrace{E\!\left[X_d^{(k)}(n)\, e(n)\right]_{\theta_f^*}}_{=\,0} + \sum_{k=1}^{N+M+1} F_{ki}(\theta_f^*)\, E\!\left[\frac{\partial X_d^{(k)}(n)}{\partial \theta_l^{(j)}}\, e(n)\right]_{\theta_f^*} \qquad (4.8)$$


Using (4.8) and the chain rule,

$$P = F^T(\theta_f^*) \cdot E\!\left[\frac{dX_d(n)}{d\theta_d}\, e(n)\right]_{\theta_f^*} \cdot F(\theta_f^*). \qquad (4.9)$$

On the other hand, using the chain rule and (4.5) again,

$$Q = F^T(\theta_f^*) \cdot E\!\left[X_d(n)\, \frac{de(n)}{d\theta_d}^{T}\right]_{\theta_f^*} \cdot F(\theta_f^*). \qquad (4.10)$$

Therefore, the derivative matrix $S_l = P + Q$ reduces to

$$S_l = \frac{dE[\,X_l(n)e(n)\,]}{d\theta_l}\bigg|_{\theta_f^*} = F^T(\theta_f^*) \cdot \underbrace{\frac{dE[\,X_d(n)e(n)\,]}{d\theta_d}\bigg|_{\theta_f^*}}_{=\,S_d} \cdot\, F(\theta_f^*). \qquad (4.11)$$

Here, (4.11) relates the stability matrices of algorithms (4.2) and (4.4) through the Jacobian $F(\theta_f^*)$. If the matrix $S_d$ is symmetric, then $\theta_l^*$ is a locally stable stationary point for algorithm (4.4) iff $\theta_d^*$ is a locally stable stationary point for algorithm (4.2). This follows because, in view of (4.11) and Sylvester's law of inertia, the signs of the eigenvalues of the matrices $S_d$ and $S_l$ are the same. Also, if $S_d < 0$, then $\theta_l^*$ is a locally stable stationary point for algorithm (4.4): in view of (4.11), $S_l < 0$ iff $S_d < 0$, and since all the eigenvalues of a negative definite matrix have negative real parts, it follows that $\theta_l^*$ is locally stable for (4.4) (and $\theta_d^*$ is locally stable for (4.2)). These arguments give sufficient conditions under which the stability of algorithm (4.2) implies the stability of algorithm (4.4).
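To summarize the derivation, a minimal sketch of one iteration of the transformed update (4.4)-(4.5) follows; the routine names and the callable `jacobian_F` are our illustrative assumptions, since the error signal, the driving vector and the map f all depend on the particular algorithm and parameterization:

```python
import numpy as np

def transformed_update(theta_l, mu, e_n, X_d, jacobian_F):
    """One step of the coordinate-transformed constant-gain update (4.4)-(4.5).

    theta_l    : parameter vector in the new coordinates (e.g. lattice form)
    mu         : step size
    e_n        : scalar error signal e(n) of the underlying algorithm
    X_d        : direct-form driving vector X_d(n)
    jacobian_F : callable returning F(theta) = df/d(theta_l), user supplied
    """
    F = jacobian_F(theta_l)
    X_l = F.T @ X_d                       # (4.5): map the driving vector
    return theta_l + mu * e_n * X_l       # (4.4): update in the new coordinates
```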

4.3 Lattice Structure

Lattice filters are typically used as linear predictors because it is easy to ensure that they are minimum phase and hence that their inverses are stable [52]. The lattice-form adaptive IIR algorithms derived here are expected to have at least the following advantages over direct-form algorithms: i) faster convergence; ii) easier stability monitoring, even simpler than for the parallel form; iii) more robustness under finite-precision implementation [52]. One important characteristic of this structure is the possibility of representing multiple poles [52]. It is expected that these structural advantages can bring about a substantial performance improvement for adaptive filters. The derivation described in Sec. 4.2 is applied in this section to obtain efficient adaptive algorithms for lattice filters according to the characteristics of this structure mentioned above. Firstly, this approach is implemented with the adaptive filter as a cascade of a direct-form FIR filter $B(z) = \sum_{i=0}^{N} b_i z^{-i}$ and an all-pole lattice filter $1/A(z)$, so that $\theta_l$ is defined by

$$\theta_l = [\, b_0 \;\cdots\; b_N \;\; \sin\alpha_1 \;\cdots\; \sin\alpha_M \,]^T \qquad (4.12)$$

where the $\sin\alpha_k$ are the reflection coefficients of the lattice part (these coefficients can be calculated using the modified version of SPSA explained in Chap. 2). In general, the reflection coefficients are estimated as cross-correlation coefficients between the forward and backward prediction errors in each stage of the adaptive lattice filter. Accordingly, two divisions are required in each stage, effectively doubling the number of stages. A problem is that the processing cost of a division is higher than that of a multiplication, especially on cheap digital signal processors (DSPs). Here these coefficients are calculated by our modified version of SPSA, which reduces the number of divisions; the proposed technique can decrease the number of divisions to one. This algorithm is explained in the following section.

For this parameterization, the Jacobian takes the block form

$$F(\theta_f) = \begin{bmatrix} I_{N+1} & 0 \\ 0 & D \end{bmatrix} \quad \text{with} \quad D_{ij} = \frac{\partial a_i}{\partial \sin\alpha_j}.$$


Also, we have $X_d(n) = [\, V_d^T(n) \;\; W_d^T(n) \,]^T$ with

$$V_d(n) = \begin{bmatrix} 1 \\ z^{-1} \\ \vdots \\ z^{-N} \end{bmatrix} v(n), \qquad W_d(n) = \begin{bmatrix} z^{-1} \\ z^{-2} \\ \vdots \\ z^{-M} \end{bmatrix} \frac{1}{A(z)}\, \omega(n)$$

for some signals $v(n)$, $\omega(n)$ which depend on the particular algorithm. If we similarly partition $X_l(n) = [\, V_l^T(n) \;\; W_l^T(n) \,]^T$, we find that $V_l(n) = V_d(n)$ and

$$W_l(n) = D^T W_d(n) = \begin{bmatrix} \dfrac{\partial a_1}{\partial \sin\alpha_1} & \cdots & \dfrac{\partial a_M}{\partial \sin\alpha_1} \\ \vdots & & \vdots \\ \dfrac{\partial a_1}{\partial \sin\alpha_M} & \cdots & \dfrac{\partial a_M}{\partial \sin\alpha_M} \end{bmatrix} \begin{bmatrix} z^{-1} \\ \vdots \\ z^{-M} \end{bmatrix} \frac{1}{A(z)}\, \omega(n) = \begin{bmatrix} \dfrac{\partial A(z)}{\partial \sin\alpha_1} & \cdots & \dfrac{\partial A(z)}{\partial \sin\alpha_M} \end{bmatrix}^T \frac{1}{A(z)}\, \omega(n).$$

Thus the problem boils down to efficiently implementing the transfer function $T(z) = [\, T_1(z), \ldots, T_M(z) \,]^T$ with

$$T_k(z) = \frac{1}{A(z)}\, \frac{\partial A(z)}{\partial \sin\alpha_k} = \frac{1}{\cos\alpha_k}\, \frac{1}{A(z)}\, \frac{\partial A(z)}{\partial \alpha_k}.$$

A structure that performs exactly this task, with complexity proportional to the filter order, was developed in [50]. Hence (4.4)-(4.5) can be efficiently implemented.
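As a numerical illustration of the map f and of the matrix D (the function names and the central-difference Jacobian below are our assumptions; the thesis relies on the order-proportional structure of [50] rather than on finite differences):

```python
import numpy as np

def stepup(sin_alpha):
    """Map reflection coefficients sin(alpha_k) to the direct-form denominator
    A(z) = 1 + a_1 z^-1 + ... + a_M z^-M via the Levinson step-up recursion."""
    a = np.array([1.0])
    for k in sin_alpha:
        a_ext = np.concatenate([a, [0.0]])
        a = a_ext + k * a_ext[::-1]       # one lattice stage
    return a                              # a[0] = 1, then a_1 ... a_M

def jacobian_D(sin_alpha, h=1e-6):
    """Numerical D_ij = d a_i / d sin(alpha_j) by central differences."""
    M = len(sin_alpha)
    D = np.zeros((M, M))
    for j in range(M):
        p = np.array(sin_alpha, dtype=float); p[j] += h
        m = np.array(sin_alpha, dtype=float); m[j] -= h
        D[:, j] = (stepup(p)[1:] - stepup(m)[1:]) / (2.0 * h)
    return D
```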


4.4 Adaptive Algorithms

4.4.1 SHARF Algorithm

The hyperstable adaptive recursive filter (HARF) algorithm is an early application of hyperstability [56] to signal processing, but it suffers from many setbacks that make it very hard to implement [57]. Landau [58] developed an algorithm for off-line system identification, based on hyperstability theory [58], that can be considered the origin of the SHARF algorithm. Basically, the SHARF algorithm has the following convergence properties [56][57]:

Property 1: In the case of sufficient order in identification ($n^* \geq 0$), the SHARF algorithm may not converge to the global minimum of the mean-square output error (MSOE) [57][58] performance surface if the plant transfer function denominator polynomial does not satisfy the following strictly-positive-realness condition:

$$\mathrm{Re}\left[\frac{D(z^{-1})}{A(z^{-1})}\right] > 0\,; \quad |z| = 1. \qquad (4.13)$$

Property 2: In the case of insufficient order in identification ($n^* < 0$), the adaptive filter output signal $\hat{y}(n)$ and the adaptive filter coefficient vector $\hat{\theta}$ are stable sequences, provided the input signal is sufficiently persistently exciting.

The main problem of the SHARF algorithm seems to be the nonexistence of a robust practical procedure for defining the moving-average filter $D(q^{-1})$ so as to guarantee the global convergence of the algorithm, where $D(q^{-1}) = \sum_{k=1}^{n_d} d_k q^{-k}$. This is a consequence of the fact that the condition in (4.13) depends on the plant denominator characteristics, which in practice are unknown. We now particularize (4.4)-(4.5) to the SHARF algorithm. For the direct-form SHARF [49], we have

$$\upsilon(n) = u(n), \qquad \omega(n) = -B(z)\,u(n) = -A(z)\,\hat{y}(n), \qquad e(n) = C(z)\,\big( y(n) - \hat{y}(n) \big).$$

In this expression, $C(z)$ is a compensating filter designed to make the transfer function $C(z)/A^*(z)$ strictly positive real (SPR) [57], where $A^*(z)$ is the denominator of $H(z)$. The


transfer function $G(z)$ is SPR if it is stable and causal and satisfies $\mathrm{Re}\, G(e^{j\omega}) > 0$ for all $\omega$. This SPR condition is a common convergence requirement for all hyperstability-based adaptive algorithms [57]. The block diagram of the adaptive filter is shown in Fig. 4.1.

Fig. 4.1. Block diagram of the SHARF lattice algorithm.

Assuming a sufficient-order setting and that the SPR condition is satisfied, it can be proved that the matrix $S_d$ for the SHARF algorithm is negative definite [57]. In order to guarantee global convergence of the SHARF algorithm independently of the plant characteristics, Landau [58] proposed applying a time-varying moving-average filter to the output error signal. Using Landau's approach, the modified SHARF algorithm can be given by

$$e_{SHARF}(n) = \big[ D(q^{-1}, n) \big]\, e_{OE}(n) \quad \text{with} \quad D(q^{-1}, n) = \sum_{k=0}^{n_d} d_k(n)\, q^{-k} \qquad (4.14)$$

$$d_k(n+1) = d_k(n) + \mu_d\, e_{SHARF}(n)\, e_{OE}(n-k), \qquad k = 0, 1, \ldots, n_d \qquad (4.15)$$

$$\hat{\theta}_f(n+1) = \hat{\theta}_f(n) + \mu\, e_{SHARF}(n)\, \hat{\phi}_{MOE}(n). \qquad (4.16)$$
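A minimal sketch of one iteration of (4.14)-(4.16) is given below; the assembly of the information vector and the generation of the signals are left to the caller, and all names are our illustrative assumptions:

```python
import numpy as np

def modified_sharf_step(theta, d, e_oe_hist, phi, mu, mu_d):
    """One iteration of the modified SHARF update (4.14)-(4.16).

    theta     : adaptive filter coefficient vector
    d         : time-varying moving-average coefficients d_0(n) ... d_nd(n)
    e_oe_hist : recent output errors [e_OE(n), e_OE(n-1), ..., e_OE(n-nd)]
    phi       : extended information vector, assembled by the caller (cf. (4.18))
    """
    e_sharf = d @ e_oe_hist                  # (4.14): filtered output error
    d_new = d + mu_d * e_sharf * e_oe_hist   # (4.15): adapt the MA filter
    theta_new = theta + mu * e_sharf * phi   # (4.16): adapt the coefficients
    return theta_new, d_new, e_sharf
```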

Another interesting interpretation of the modified SHARF algorithm can be found in [59]. Regarding the convergence of the modified SHARF algorithm, the error signal $e_{SHARF}(n)$ is a sequence that converges to zero in the mean sense if $n^* \geq 0$ and $\mu$ satisfies

$$0 < \mu < \frac{1}{\left\| \phi_{SHARF}(n) \right\|^2} \qquad (4.17)$$

where $\phi_{SHARF}(n)$ is the extended information vector defined as

$$\phi_{SHARF}(n) = \big[\, \hat{y}(n-i) \;\; x(n-j) \;\; e_{SHARF}(n-k) \,\big]^T. \qquad (4.18)$$

It should be mentioned that if the signal $\phi_{SHARF}(n)$ tends to zero, the output error signal $e_{OE}(n)$ does not necessarily tend to zero. In fact, it was shown in [60] that the minimum-phase condition on $D(q^{-1}, n)$ must also be satisfied in order to guarantee that $e_{OE}(n)$ converges to zero in the mean sense. This additional condition implies that continuous minimum-phase monitoring should be performed on the polynomial $D(q^{-1}, n)$ to assure global convergence of the SHARF algorithm. This fact prevents the general use of the SHARF algorithm in practice. It is also important to mention that although the members of the SHARF family of adaptive algorithms, which includes the modified output error (MOE) and SHARF algorithms, attempt to minimize the output error signal, their convergence concept is derived from hyperstability theory rather than from gradient descent.

4.4.2 Steiglitz-McBride Algorithm

In [61], Steiglitz and McBride developed an adaptive algorithm attempting to combine the good characteristics of the output-error and equation-error algorithms, namely an unbiased and a unique global solution, respectively. In order to achieve these properties, the so-called SM algorithm is based on an error signal e(n) that is a linear function of the adaptive filter coefficients, yielding a unimodal performance surface, and that has a physical interpretation similar to the output error signal, leading to an unbiased global solution. For this adaptive algorithm, let $u(\cdot)$, $\hat{y}(\cdot)$ be the adaptive filter input and output, respectively, and let $y(\cdot)$ be the reference signal. For the SM adaptive algorithm described in [61] we have

$$e(n) = \frac{1}{A(z)}\big( y(n) - \hat{y}(n) \big), \qquad \upsilon(n) = u(n), \qquad \omega(n) = -y(n).$$
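As a sketch of this error computation (the signal names are ours; only the filtering of the output error through 1/A(z) is taken from the expression above):

```python
import numpy as np
from scipy.signal import lfilter

def sm_error(y, y_hat, a):
    """Steiglitz-McBride error e(n) = (1/A(z)) [y(n) - y_hat(n)], where the
    array `a` holds [1, a_1, ..., a_M], the current denominator estimate."""
    return lfilter([1.0], a, np.asarray(y) - np.asarray(y_hat))
```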


Figure 4.2 shows the block diagram of the lattice implementation of (4.4)-(4.5) for the SM algorithm. Suppose that $y(n) = H(z)u(n)$, where $H(z)$ is a filter of the same order as $\hat{H}(z)$.

Fig. 4.2. Block diagram of the SM lattice algorithm.

In case there is an additive output disturbance, the SM estimate remains unbiased as long as the disturbance is white [61][62]. For simplicity it is assumed here that the reference signal $y(\cdot)$ is not contaminated by noise. It can be shown that for this sufficient-order case, the matrix $S_d$ at the stationary point $\theta_f^*$ corresponding to $\hat{H}(z) = H(z)$ coincides with the Hessian matrix of the cost function $E[e^2(n)]$ evaluated at $\theta_d^*$, and therefore it is symmetric. Since $\theta_d^*$ is locally stable for the direct-form SM algorithm [61], $\theta_l^*$ is locally stable for the lattice algorithm. In [62] an alternative way of implementing the SM algorithm using a normalized tapped lattice structure was presented; however, the stability of the stationary point is not guaranteed there.

4.5 Simulation Results

4.5.1 SHARF Algorithm

Here we considered a setting in which $u(\cdot)$ was taken as unit-variance white noise, with N = 0, M = 6 and

$$H(z) = \frac{0.1}{A^*(z)}$$

where $A^*(z)$ is parameterized in lattice form by the reflection coefficients estimated by our proposed SPSA algorithm, $[\,\sin\alpha_1^* \;\cdots\; \sin\alpha_6^*\,] = [\,0.6 \;\; 0.95 \;\; 0.86 \;\; 0.84 \;\; 0.9 \;\; 0.51\,]$, and also $C(z) = A^*(z)$. Figure 4.3 shows the parameter trajectories of algorithm (4.4)-(4.5). The initial

. Figure 4.3 shows the parameter trajectories <strong>of</strong> algorithm (4.4)-(4.5). The initial<br />

point was θ ( 0) = 0 . The convergence is achieved, as expected. On the other hand, the lattice<br />

l<br />

version <strong>of</strong> SHARF presented in [63] using fails to converge in this setting, as shown in Fig. 4.4.<br />

The initial value θ (0)<br />

was taken very close to the stationary point. For this algorithm the<br />

l<br />

corresponding matrix can be shown to have unstable eigenvalues, which implies that the<br />

stationary point is not convergent [63]. Note that the SPR condition is satisfied; the problem<br />

does not reside there, but in the simplifications introduced when passing from the direct <strong>for</strong>m to<br />

the lattice algorithm. In the figures 4.3-4.6, the dashed-lines show the parameter values at the<br />

stationary point.<br />
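As a side check of the built-in stability of the lattice parameterization used in this setting (the step-up routine is our illustrative assumption, and the coefficient values are copied from the setting above, whose exact reading from the source is slightly ambiguous):

```python
import numpy as np

def stepup(sin_alpha):
    """Reflection coefficients -> direct-form denominator (Levinson step-up)."""
    a = np.array([1.0])
    for k in sin_alpha:
        a_ext = np.concatenate([a, [0.0]])
        a = a_ext + k * a_ext[::-1]
    return a

sin_alpha = [0.6, 0.95, 0.86, 0.84, 0.9, 0.51]   # reflection coefficients above
a_star = stepup(sin_alpha)
poles = np.roots(a_star)
print("max |pole| =", np.abs(poles).max())       # < 1 since all |sin(alpha_k)| < 1,
                                                 # i.e. A*(z) is stable by construction
```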

4.5.2 Steiglitz-McBride Algorithm

Let N = 0, M = 6 and

$$H(z) = \frac{0.01}{A^*(z)}$$

with $A^*(z)$ parameterized in lattice form by the reflection coefficients estimated by the proposed SPSA algorithm, $[\,\sin\alpha_1^* \;\cdots\; \sin\alpha_6^*\,] = [\,0.6 \;\; 0.95 \;\; 0.86 \;\; 0.84 \;\; 0.81 \;\; 0.72\,]$.

Assume that $u(\cdot)$ is unit-variance white noise. Then it can be shown that, even with no measurement noise, the corresponding stability matrix for the SM lattice algorithm of [62], evaluated at the stationary point $\hat{H}(z) = H(z)$, has a pair of unstable eigenvalues. This means that this stationary point cannot be locally convergent. This is illustrated in Fig. 4.5, where the results of a computer simulation of this algorithm in the above setting are presented. The initial parameters were set to those of the stationary point, except for $\sin\alpha_2(0)$, which was set to 0.9499. Despite the proximity of this starting point to the stationary point, the algorithm clearly diverges, as expected. The reflection coefficients are estimated by our proposed SPSA algorithm. In Fig. 4.6 we show the results obtained by applying algorithm (4.4)-(4.5) in the same setting, though now the initial point was $\theta_l(0) = [\,1 \;\; .5 \;\; .9 \;\; .7 \;\; .7 \;\; .7 \;\; .8\,]^T$. Convergence is achieved in this case, as predicted by the theory.


Fig. 4.3. Convergence of the proposed SHARF algorithm and M2-SPSA.

Fig. 4.4. Instability of the existing SHARF algorithm.


Fig. 4.5. Instability of the existing SM algorithm.

Fig. 4.6. Convergence of the proposed SM algorithm and M2-SPSA.

In the previous figures we can see the better convergence achieved by our proposed method with the M2-SPSA algorithm in comparison with the earlier simulations shown in [62]-[63]. We can also see that the number of iterations needed to achieve this convergence with our proposed algorithm is reduced; this is because the M2-SPSA algorithm can calculate the coefficients of the lattice form more efficiently and with less computational burden, as explained in Chap. 2.



Chapter 5

Parameter Estimation using a Modified Version of SPSA Algorithm Applied to State-Space Models

Finally, in this third application, M2-SPSA is applied to the estimation of unknown static parameters in a non-linear, non-Gaussian state-space model. The results are compared with the FDSA algorithm, and the performance of the coefficients in a bi-modal non-linear model is compared. The objective of this chapter is the estimation of unknown static parameters in a non-linear, non-Gaussian state-space model. The Simultaneous Perturbation Stochastic Approximation (SPSA) algorithm is considered due to its highly efficient gradient approximation. We consider a particle filtering method and employ the SPSA algorithm to recursively maximize the likelihood function. Nevertheless, the SPSA algorithm can become inadequate in models such as the non-Gaussian state-space model. Therefore, we have proposed to modify the SPSA algorithm in order to estimate parameters very efficiently in complex models such as the one proposed here, while reducing its computational cost. An efficient parameter estimator, the Finite Difference Stochastic Approximation (FDSA) algorithm, is considered here for comparison with the efficiency of the proposed SPSA algorithm. The proposed algorithm can generate maximum likelihood estimates very efficiently. The performance of the proposed SPSA algorithm is shown through simulation using a model with a highly multimodal likelihood.

5.1 Introduction

Dynamic state-space models are useful for describing data in many different areas, such as engineering, financial mathematics, environmental data, and physical science. Most real-world problems are non-linear and non-Gaussian; therefore optimal state estimation in such problems does not admit a closed-form solution. Sequential Monte Carlo (SMC) methods, also known as particle filters, are a set of practical and flexible simulation-based techniques that have become increasingly popular for performing optimal filtering in non-linear non-Gaussian models [64][65]. SMC methods are simulation-based techniques that recursively


generate and update a set of weighted samples, which provide approximations to the posterior probability distributions of interest. Standard SMC methods, however, assume knowledge of the model parameters. In many real-world applications, these parameters are unknown and need to be estimated. We therefore address here the challenging problem of obtaining their maximum likelihood (ML) estimates. ML parameter estimation using SMC methods still remains an open problem, despite various earlier attempts in the literature [66]. Previous approaches that extend the state with the unknown parameters and transform the problem into an optimal filtering problem suffered from several drawbacks [66][68]. Recently, a robust particle method to approximate the optimal filter derivative and perform ML parameter estimation has been proposed [64]. This method is efficient but computationally intensive. Gradient-based SA algorithms rely on a direct measurement of the gradient of an objective function with respect to the parameters of interest. Such an approach assumes that detailed knowledge of the system dynamics is available so that the gradient equations can be calculated. In the SMC framework, the gradient estimates of the particle approximations require an infinitesimal-perturbation-analysis-based approach [65]. This often results in a very high estimation variance that increases with the number of particles and with time. Although this problem can be successfully mitigated with a number of variance reduction techniques, this adds to the computational burden. In this chapter, we investigate the use of gradient-free SA techniques as a simple alternative for generating ML parameter estimates. A related approach was described in [67] to optimize the performance of SMC algorithms; we adapt this approach to our ML parameter estimation. In principle, gradient-free techniques have a slower rate of convergence than gradient-based methods. However, gradient-free methods are based only on objective function measurements and do not require knowledge of the gradients of the underlying model. As a result, they are very easy to implement and have a reduced computational complexity. The classical gradient-free method is FDSA [21]. However, we have proposed a more efficient approach that has recently attracted attention, SPSA [3]. It is based on a randomized method where all parameters are perturbed simultaneously, making it possible to update the parameters with only two measurements of an evaluation function regardless of the dimension of the parameter. This is very useful, but the traditional SPSA can in some cases incur a high computational cost [3]. Therefore, M2-SPSA is applied to ML parameter estimation in order to obtain the estimated parameters more efficiently while reducing this cost. In this chapter, FDSA is considered as a point of comparison for our proposed SPSA algorithm.


5.2 Implementation of SPSA Toward the Proposed Model

5.2.1 State-Space Model

In order to describe the state-space models [61], let $\{X_k\}_{k\geq 0}$ and $\{Y_k\}_{k\geq 0}$ be $\mathbb{R}^{n_x}$- and $\mathbb{R}^{n_y}$-valued stochastic processes defined on a measurable space $(\Omega, F)$. Let $\theta \in \Theta$ be the parameter vector, where $\Theta$ is an open subset of $\mathbb{R}^m$ [69]. A general discrete-time state-space model represents the unobserved state $\{X_k\}_{k\geq 0}$ as a Markov process with initial density $X_0 \sim \mu$ and Markov transition density $f_\theta(x' \mid x)$ [61]. The observations $\{Y_k\}_{k\geq 0}$ are assumed conditionally independent given $\{X_k\}_{k\geq 0}$ and are characterized by their conditional marginal density $g_\theta(y \mid x)$. The model is summarized as follows:

$$X_k \mid X_{k-1} = x_{k-1} \;\sim\; f_\theta(\cdot \mid x_{k-1}) \qquad (5.1)$$

$$Y_k \mid X_k = x_k \;\sim\; g_\theta(\cdot \mid x_k) \qquad (5.2)$$

where the two densities can be non-Gaussian and may involve non-linearity. For any sequence $\{z_p\}$ and random process $\{Z_p\}$ we use the notation $z_{i:j} = (z_i, z_{i+1}, \ldots, z_j)$ and $Z_{i:j} = (Z_i, Z_{i+1}, \ldots, Z_j)$. Assume for the time being that $\theta$ is known. In such a situation, one is interested in estimating the hidden state $X_k$ given the observation sequence $\{Y_k\}_{k\geq 0}$. This leads to the so-called optimal filtering problem, which seeks to compute the posterior density $p_\theta(x_k \mid Y_{0:k})$ sequentially in time. We introduce a proposal distribution $q_\theta(x_k \mid Y_k, x_{k-1})$ whose support includes the support of $g_\theta(Y_k \mid x_k)\, f_\theta(x_k \mid x_{k-1})$. The SMC method [70] then approximates the optimal filtering density by a weighted empirical distribution, i.e., a weighted sum of $N > 1$ samples, termed particles. Here we will assume that at time $k-1$ the filtering density $p_\theta(x_{k-1} \mid Y_{0:k-1})$ is approximated by the particle set $X_{k-1}^{(1:N)} \triangleq [\,X_{k-1}^{(1)}, \ldots, X_{k-1}^{(N)}\,]$ having equal weights. The filtering distribution at the next time step can be recursively approximated by a


new set <strong>of</strong> particles<br />

X<br />

( 1: N )<br />

k<br />

generated via an importance sampling and a resampling step. In the<br />

importance sampling step, a set <strong>of</strong> prediction particles are generated independently from<br />

( )<br />

( ⋅Y<br />

, X )<br />

~ ( i)<br />

X ~<br />

i<br />

k<br />

q<br />

k k−1<br />

θ<br />

and are weighted by an importance weight<br />

~ ( i)<br />

a θ , k<br />

that accounts <strong>for</strong> the<br />

( i)<br />

~ ( i)<br />

i<br />

discrepancy with the “target” distribution. Here, this is given by a θ<br />

= α θ<br />

X , X , Y ) and<br />

, k<br />

(<br />

k k−1<br />

k<br />

i i<br />

a~ ( ) ( )<br />

,<br />

= a /<br />

θ k<br />

θ,<br />

k<br />

N<br />

∑ j = 1<br />

a<br />

( j)<br />

θ , k<br />

. In the resampling step, the particles<br />

~ (1: N )<br />

X<br />

k<br />

are multiplied or eliminated<br />

according to their importance<br />

weights<br />

~ ( i:<br />

N )<br />

a θ , k<br />

to give the new set <strong>of</strong> particles<br />

X<br />

( 1: N )<br />

k<br />

. Now, let<br />

us now consider the case where the model includes some unknown parameters. We will assume<br />

*<br />

that the system to be identified evolves according to a true but unknown static parameter θ ,<br />

i.e.<br />

X<br />

k<br />

X<br />

k−1 = xk−<br />

1 *<br />

θ k−<br />

~ f ( ⋅ x<br />

1)<br />

(5.3)<br />

Y<br />

k<br />

X<br />

k<br />

= xk<br />

θ<br />

~ g * ( ⋅ xk<br />

).<br />

(5.4)<br />

The aim is to identify this parameter. Addressing this problem <strong>for</strong> a non-Gaussian and<br />

*<br />

non-linear system is very challenging. We aim to identify θ based on an infinite (or very<br />

Y . A standard method to do so is to maximize the limit <strong>of</strong> the<br />

large) observation sequence { k<br />

} k≥0<br />

time averaged log-likelihood function:<br />

1<br />

l θ ( Y Y<br />

(5.5)<br />

k<br />

( ) = lim ∑ log pθ<br />

k → ∞ k + 1 k = 0<br />

k<br />

0 : k −1)<br />

with respect to θ . Suitable regularity conditions ensure that this limits exists and<br />

l (θ) admits θ * as a global maximum [70]. The expression Y Y n<br />

)<br />

defined as<br />

p θ<br />

(<br />

0 : k −1<br />

is the predictive likelihood<br />


$$p_\theta(Y_k \mid Y_{0:k-1}) = \iint \alpha_\theta(x_{k-1:k}, Y_k)\, q_\theta(x_k \mid Y_k, x_{k-1})\, p_\theta(x_{k-1} \mid Y_{0:k-1})\, dx_{k-1:k}. \qquad (5.6)$$

Note that this is a normalization constant [70]. This approach is known as recursive ML parameter estimation. We now propose to use M2-SPSA in ML parameter estimation based on the GSMC algorithm (Generic Sequential Monte Carlo algorithm) described in [70]. It is very difficult to compute $\log p_\theta(Y_k \mid Y_{0:k-1})$ in closed form. Instead, we use a particle approximation and propose to optimize an alternative criterion: the SMC method provides us with samples $(X_{k-1}^{(i)}, \tilde{X}_k^{(i)})$ from $p_\theta(x_{k-1} \mid Y_{0:k-1})\, q_\theta(x_k \mid Y_k, x_{k-1})$. A particle approximation to $\log p_\theta(Y_k \mid Y_{0:k-1})$ is given by

$$\log \hat{p}_\theta(Y_k \mid Y_{0:k-1}) = \log\left( N^{-1} \sum_{i=1}^{N} a_{\theta,k}^{(i)} \right). \qquad (5.7)$$
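A minimal sketch of (5.7) inside a bootstrap particle filter is shown below. The bootstrap choice $q_\theta = f_\theta$ (so that $a_{\theta,k}^{(i)} = g_\theta(Y_k \mid \tilde{X}_k^{(i)})$) and the scalar toy model in the comments are our illustrative assumptions, not the benchmark model of this chapter:

```python
import numpy as np

def particle_loglik(theta, y, N=500, rng=np.random.default_rng(0)):
    """Particle estimate of sum_k log p_theta(Y_k | Y_0:k-1) via (5.7).

    Toy model (an assumption for illustration): X_k = theta*X_{k-1} + V_k,
    Y_k = X_k^2/20 + W_k, with V_k, W_k standard Gaussian.
    """
    x = rng.normal(0.0, 1.0, size=N)       # equally weighted particles at k-1
    loglik = 0.0
    for yk in y:
        x = theta * x + rng.normal(0.0, 1.0, size=N)    # predict with f_theta
        a = np.exp(-0.5 * (yk - x**2 / 20.0) ** 2)      # weights g_theta(Y_k|x)
        loglik += np.log(a.mean() + 1e-300)             # (5.7), up to a constant
        x = x[rng.choice(N, size=N, p=a / a.sum())]     # resampling step
    return loglik
```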

We now use the key fact that the current hidden state $X_k$, the observation $Y_k$, the predicted particles $\tilde{X}_k^{(1:N)}$ and their corresponding unnormalized weights $a_{\theta,k}^{(1:N)}$ form a homogeneous Markov chain [70].

In the following section, we propose SA algorithms to solve $\vartheta^* = \arg\max_{\theta\in\Theta} J(\theta)$. Note that because we only use a finite number $N$ of particles, $(\tilde{X}_k^{(1:N)}, a_{\theta,k}^{(1:N)})$ is only an approximation to the exact prediction density $p_\theta(x_k \mid Y_{0:k-1})$. Hence $\vartheta^*$ will not be equal to the true parameter $\theta^*$. However, as $N$ increases, $J(\theta)$ will get closer to $l(\theta)$ and $\vartheta^*$ will converge to $\theta^*$. Our simulation results indicate that $\vartheta^*$ provides a good approximation to $\theta^*$ for a moderate number of particles.


5.2.2 Gradient-free Maximum Likelihood Estimation

The function $J(\theta)$ must be maximized with respect to the $m$-dimensional parameter vector $\theta$. The function $J(\theta)$ does not admit an analytical expression; additionally, we do not have direct access to it. Using the geometric ergodicity of the Markov chain $\{Z_k\}_{k\geq 0}$, $J(\theta)$ can be approximated in the limit as follows:

$$J(\theta) \triangleq \lim_{k\to\infty} \big\{\, J_k(\theta) = E_\theta[\, r(Z_k)\,] \,\big\} \qquad (5.8)$$

where the expectation is taken with respect to the distribution of $Z_k$. This implies that, although $J(\theta)$ is unknown, we have access to a sequence of functions $J_k$ that converge to $J(\theta)$. One way to exploit this sequence in order to optimize $J(\theta)$ is to use a recursion of the form

$$\theta_k = \theta_{k-1} + \gamma_k\, \hat{\nabla} J_k(\theta_{k-1}) \qquad (5.9)$$

where $\theta_{k-1}$ is the parameter estimate at time $k-1$ and $\hat{\nabla} J_k$ denotes an estimate of $\nabla J_k$. The

idea is that we take incremental steps to improve $\theta$, where each step uses a particular function from the sequence. Under suitable conditions on the step size, the above iteration will converge to $\vartheta^*$ [71]. We consider the case where the expression for the gradient of $J_k$ is either not available or too complex to calculate. One may approximate $\nabla J_k(\theta)$ by recourse to finite difference methods. These are "gradient-free" methods that only use measurements of $J_k(\theta)$. The idea behind this approach is to measure the change in the function induced by a small perturbation $\Delta\theta_k$ in the value of the parameter. If we denote an estimate of $J_k(\theta)$ by $\hat{J}_k(\theta)$, one-sided gradient approximations consider the change between $\hat{J}_k(\theta)$ and $\hat{J}_k(\theta + \Delta\theta_k)$, while two-sided approximations consider the difference between $\hat{J}_k(\theta - \Delta\theta_k)$ and $\hat{J}_k(\theta + \Delta\theta_k)$. A gradient-free approach can provide a maximum likelihood parameter estimate that is

118


5.2 IMPLEMENTATION OF <strong>SPSA</strong> ALGORITHM TO THE PROPOSED MODEL<br />

computationally cheap, as well as very simple to implement. The key feature <strong>of</strong> the <strong>SPSA</strong><br />

technique is that it requires only two measurements <strong>of</strong> the cost function regardless <strong>of</strong> the<br />

dimension <strong>of</strong> the parameter vector. This efficiency is achieved by the fact that all the elements<br />

in θ are perturbed together. The i-th component <strong>of</strong> the two-sided gradient approximation<br />

^<br />

^<br />

^<br />

⎡<br />

⎤<br />

∇ J<br />

k<br />

=<br />

⎢<br />

∇J<br />

k ,1(<br />

θ ),..., ∇J<br />

k , m<br />

( θ ) is<br />

⎣<br />

⎥<br />

⎦<br />

∇J<br />

^<br />

^<br />

^<br />

J<br />

k<br />

( θk−<br />

1<br />

+ ck∆k<br />

) − J<br />

k<br />

( θk−<br />

1<br />

+ ck∆k<br />

)<br />

k,<br />

i<br />

( θ<br />

n−1)<br />

=<br />

(5.10)<br />

2ck<br />

∆ki<br />

where ∆<br />

k<br />

=<br />

∆ [ ∆<br />

k , 1<br />

,..., ∆<br />

k , m<br />

] is a random perturbation vector and { k<br />

} k ≥1<br />

c is defined in the<br />

Sec. 1.7. Note that the computational saving stems from the fact that the objective function<br />

difference is now common in all m components <strong>of</strong> the gradient approximation vector. Almost<br />

sure convergence <strong>of</strong> the SA recursion in (5.9) is guaranteed if J (θ ) is sufficiently smooth near<br />

k<br />

*<br />

θ . Additionally, the elements <strong>of</strong><br />

∆k<br />

must be mutually independent random variables,<br />

−1<br />

symmetrically distributed around zero and with finite inverse moments E ( ∆ k , i<br />

)<br />

. A simple and<br />

popular choice <strong>for</strong> ∆ that satisfies these requirements is the Bernoulli ± 1distribution and the<br />

positive step sizes should satisfy<br />

k<br />

∑ ∞ →0 , k<br />

→0,<br />

k=<br />

1<br />

γ<br />

k<br />

c γ<br />

k<br />

= ∞ and ∑ ∞<br />

k =<br />

1<br />

⎛ γ<br />

k<br />

⎜<br />

⎝ c<br />

k<br />

⎞<br />

⎟<br />

⎠<br />

2<br />

<<br />

∞<br />

.<br />
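For concreteness, the per-component estimate (5.10) can be sketched as follows (a minimal Python illustration, assuming a user-supplied noisy measurement function `J_hat`; the function and argument names are hypothetical):

```python
import numpy as np

def spsa_gradient(J_hat, theta, c_k, rng):
    """Two-sided SP gradient estimate of (5.10).

    J_hat : callable returning a noisy measurement of J(theta).
    theta : current m-dimensional parameter estimate.
    c_k   : perturbation size at iteration k.
    """
    m = theta.size
    delta = rng.choice([-1.0, 1.0], size=m)   # Bernoulli +/-1 perturbation vector
    j_plus = J_hat(theta + c_k * delta)       # first of the two measurements
    j_minus = J_hat(theta - c_k * delta)      # second of the two measurements
    # The same scalar difference is shared by all m components of the estimate.
    return (j_plus - j_minus) / (2.0 * c_k * delta)
```

Note that only two evaluations of `J_hat` are used, regardless of the dimension $m$.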

The choice of the step sequences is crucial to the performance of the algorithm. Note that if a constant step size is used for $\gamma_k$, the SA estimate will still converge, but it will oscillate about the limiting value with a variance proportional to the step size. In most of our simulations, $\gamma_k$ was set to a small constant step size that was repeatedly halved after several thousand iterations.
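A small sketch of one admissible gain schedule in this spirit (the constants here are hypothetical placeholders, not the tuned values used in the simulations):

```python
def gains(k, c0=0.1, gamma0=0.005, halve_every=5000):
    """Gain schedule for iteration k >= 1: decaying perturbation size c_k
    and a piecewise-constant step size gamma_k halved periodically."""
    c_k = c0 / k**0.101                            # slowly decaying perturbation size
    gamma_k = gamma0 * 0.5**(k // halve_every)     # halve the step size every few thousand steps
    return c_k, gamma_k
```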

For the two-sided SPSA case, for example, the two measurements would be $\hat{J}_k(\theta_k + c_k\Delta_k;\,\omega_k^{+})$ and $\hat{J}_k(\theta_k - c_k\Delta_k;\,\omega_k^{-})$, where $\omega_k^{+}$ and $\omega_k^{-}$ denote the randomness of each realization. This implies that, besides the desired objective function change induced by the perturbation in $\theta$, there is also some undesirable variability in $\omega_k^{\pm}$. Although in a real system $\omega_k^{\pm}$ cannot be controlled, in simulation settings it might be possible to eliminate the undesirable variability component by using the same random seeds at every time instant $k$, so that $\omega_k^{+} = \omega_k^{-}$. The SA recursion of (5.9) can be thought of as a stochastic generalization of the steepest descent method. Faster convergence can be achieved if one uses a Newton-type SA algorithm that is based on an estimate of the second derivative of the objective function. This will be of the form

$$\theta_k = \theta_{k-1} - \gamma_k \left[\hat{\nabla}^2 J_k(\theta_{k-1})\right]^{-1} \hat{\nabla}J_k(\theta_{k-1}) \qquad (5.11)$$

where $\hat{\nabla}^2 J_k$ is an estimate of the negative definite Hessian matrix $\nabla^2 J_k$. Such an approach can be particularly attractive in terms of convergence acceleration in the terminal phase of the algorithm, where the steepest descent-type method of (5.9) slows down. The main difficulty with this approach is that the estimate of the Hessian can be unstable. In order to keep the Hessian matrix stable, we applied the procedure used in Chap. 2. Also, as suggested in [70], it might be useful to average several SP gradient approximations at each iteration, each with an independent value of $\Delta_k$. Despite the expense of additional objective function evaluations, this can reduce the noise effects and accelerate convergence.
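A minimal sketch of the Newton-type step (5.11) for a maximization problem is given below, assuming externally supplied gradient and Hessian estimates; the eigenvalue guard is a hypothetical stand-in for the stabilization procedure of Chap. 2, not its exact implementation, and all names here are illustrative:

```python
import numpy as np

def newton_sa_step(theta, grad_est, hess_est, gamma_k, eps=1e-6):
    """One Newton-type SA step (5.11); hess_est should estimate the
    negative definite Hessian of J_k at theta."""
    h_sym = 0.5 * (hess_est + hess_est.T)       # symmetrize the Hessian estimate
    w, v = np.linalg.eigh(h_sym)
    w = np.minimum(w, -eps)                     # guard: force eigenvalues to stay negative
    h_inv = v @ np.diag(1.0 / w) @ v.T          # stabilized inverse of the Hessian estimate
    return theta - gamma_k * h_inv @ grad_est   # ascent step since h_inv is negative definite
```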

5.3 Parameter Estimation by SPSA and FDSA

Now, we present two maximum likelihood parameter estimation algorithms based on the FDSA and SPSA algorithms. In line with our objectives, the algorithm below only requires a single realization of observations $\{Y_k\}_{k\ge 1}$ of the true system. At time $k-1$, we denote the current parameter estimate by $\theta_{k-1}$. Also, let the filtering density $p_{\theta_{0:k-1}}(x_{k-1} \mid Y_{0:k-1})$ be approximated by the particle set $X_{k-1}^{(1:N)}$ having equal importance weights. Note that the subscript $\theta_{0:k-1}$ indicates that the filtering density estimate is a function of all the past parameter values. The parameter estimation using SPSA is performed as follows:


First, generate a random perturbation vector $\Delta_k$. For $i = 1,\ldots,N$, sample

$$\tilde{X}_{k,+}^{(i)} \sim q_{\theta_{k-1}+c_k\Delta_k}(\,\cdot \mid Y_k, X_{k-1}^{(i)}), \qquad \tilde{X}_{k,-}^{(i)} \sim q_{\theta_{k-1}-c_k\Delta_k}(\,\cdot \mid Y_k, X_{k-1}^{(i)}),$$

and use the following evaluation:

$$a_{\theta}(x_{k-1:k}, Y_k) = \frac{g_{\theta}(Y_k \mid x_k)\, f_{\theta}(x_k \mid x_{k-1})}{q_{\theta}(x_k \mid Y_k, x_{k-1})}.$$

We can then evaluate the weights $a_{\theta_{k-1}+c_k\Delta_k}\big(Y_k, \tilde{X}_{k,+}^{(i)}, X_{k-1}^{(i)}\big)$ and $a_{\theta_{k-1}-c_k\Delta_k}\big(Y_k, \tilde{X}_{k,-}^{(i)}, X_{k-1}^{(i)}\big)$ and compute

$$\hat{J}_k(\theta_{k-1} \pm c_k\Delta_k) = \log\left\{\frac{1}{N}\sum_{i=1}^{N} a_{\theta_{k-1}\pm c_k\Delta_k}\big(Y_k, \tilde{X}_{k,\pm}^{(i)}, X_{k-1}^{(i)}\big)\right\},$$

$$\hat{\nabla}J_{k,i}(\theta_{k-1}) = \frac{\hat{J}_k(\theta_{k-1}+c_k\Delta_k) - \hat{J}_k(\theta_{k-1}-c_k\Delta_k)}{2c_k\Delta_{k,i}}, \qquad \hat{\nabla}J_k(\theta_{k-1}) = \big[\hat{\nabla}J_{k,1}(\theta_{k-1}),\ldots,\hat{\nabla}J_{k,m}(\theta_{k-1})\big],$$

and update

$$\theta_k = \theta_{k-1} + \gamma_k \hat{\nabla}J_k(\theta_{k-1}).$$

Finally, for each particle $i = 1,\ldots,N$, sample $\tilde{X}_k^{(i)} \sim q_{\theta_k}(\,\cdot \mid Y_k, X_{k-1}^{(i)})$ and evaluate the weights $\tilde{a}_{\theta,k}^{(i)}$. Sample $I_k^{(1:N)} \sim \mathcal{L}(\,\cdot \mid \tilde{a}_{\theta,k}^{(1:N)})$ using a standard resampling scheme and set $X_k^{(1:N)} = H(\tilde{X}_k^{(1:N)}, I_k^{(1:N)})$.
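The iteration above can be sketched compactly as follows (Python; `q_sample` and `weight` are hypothetical placeholders for the importance density $q_\theta$ and the unnormalized weight $a_\theta$, and multinomial resampling stands in for the scheme denoted by $H$):

```python
import numpy as np

def spsa_particle_step(theta, particles, y_k, q_sample, weight, c_k, gamma_k, rng):
    """One SPSA iteration driven by two particle-based likelihood estimates.

    q_sample(theta, y, x_prev, rng) -> array of propagated particles
    weight(theta, y, x_new, x_prev) -> array of unnormalized weights
    """
    delta = rng.choice([-1.0, 1.0], size=theta.size)

    def j_hat(th):
        # Log of the averaged unnormalized weights: the estimate J_k(th).
        # Reusing one seed for both calls would implement the common random
        # numbers idea of Sec. 5.2.2 (omega_k^+ = omega_k^-).
        x_new = q_sample(th, y_k, particles, rng)
        return np.log(np.mean(weight(th, y_k, x_new, particles)))

    grad = (j_hat(theta + c_k * delta) - j_hat(theta - c_k * delta)) / (2.0 * c_k * delta)
    theta_new = theta + gamma_k * grad          # ascent on the log-likelihood estimate

    # Propagate and resample the particle set at the updated parameter value.
    x_new = q_sample(theta_new, y_k, particles, rng)
    w = weight(theta_new, y_k, x_new, particles)
    idx = rng.choice(len(x_new), size=len(x_new), p=w / w.sum())
    return theta_new, x_new[idx]
```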


5.4 Simulation

The following bi-modal non-linear model [72] is proposed here:

$$X_k = \theta_1 X_{k-1} + \theta_2 \frac{X_{k-1}}{1+X_{k-1}^2} + \theta_3 \cos(1.2k) + \sigma_{\upsilon} V_k, \qquad (5.12)$$

$$Y_k = cX_k^2 + \sigma_{\omega} W_k, \qquad (5.13)$$

where $\sigma_{\upsilon}^2 = 10$, $c = 0.05$, $\sigma_{\omega} = 1$, $X_0 \sim N(0,2)$, $V_k \overset{\mathrm{i.i.d.}}{\sim} N(0,1)$ and $W_k \overset{\mathrm{i.i.d.}}{\sim} N(0,1)$; these are zero-mean Gaussian random variables. Here, we seek the ML estimates $\theta = [\theta_1, \theta_2, \theta_3]^T$. It is also important to initialize the algorithm properly; otherwise some of the parameter estimates might get trapped in local maxima. In this model, we can initialize at $\theta_0 = [0.2, 20, 5]^T$. The choice of the step size is very important; here, this is particularly true due to the difference in the relative sensitivity of the three unknown parameters. The values for the step size are $c_k = c_0/k^{0.101}$, where $c_0 = [0.01, 2.0, 1] \times 10^{-4}$, and the constant step size is $\gamma_0 = [0.005, 7, 17] \times 10^{-4}$.
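A minimal sketch for generating data from (5.12)-(5.13) at the true parameters $\theta^* = [0.5, 25, 8]^T$ is given below (the function and variable names are hypothetical):

```python
import numpy as np

def simulate_bimodal(T, theta=(0.5, 25.0, 8.0), c=0.05, sigma_v2=10.0, sigma_w=1.0, seed=0):
    """Generate T steps of (X_k, Y_k) from the bi-modal model (5.12)-(5.13)."""
    rng = np.random.default_rng(seed)
    th1, th2, th3 = theta
    x = rng.normal(0.0, np.sqrt(2.0))            # X_0 ~ N(0, 2)
    xs, ys = [], []
    for k in range(1, T + 1):
        x = (th1 * x + th2 * x / (1.0 + x**2) + th3 * np.cos(1.2 * k)
             + np.sqrt(sigma_v2) * rng.standard_normal())
        y = c * x**2 + sigma_w * rng.standard_normal()
        xs.append(x)
        ys.append(y)
    return np.array(xs), np.array(ys)
```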

Fig. 5.1. ML parameter estimates $\theta_k = [\theta_{1,k}, \theta_{2,k}, \theta_{3,k}]^T$ for the bi-modal non-linear model using M2-SPSA. The true parameters in the model are defined by $\theta^* = [0.5, 25, 8]^T$.


Fig. 5.2. Parameter estimation using 2nd-SPSA and FDSA.

Figure 5.1 shows the efficiency obtained using M2-SPSA. These results are compared with 2nd-SPSA and FDSA in Fig. 5.2, which shows the best performance found by each algorithm for the current model. Table 5.1 compares the number of particles used by each algorithm and the computational load, i.e., the normalized CPU time [49] (computational cost in processing time), with the CPU time required by M2-SPSA as the reference. These comparisons are based on the average CPU time used by each algorithm for the estimation.

Table 5.1. Computational statistics.

Algorithm    No. of Particles    CPU
M2-SPSA      800                 1
2nd-SPSA     920                 2.8
FDSA         1000                3.2

The results obtained here by M2-SPSA show its efficiency: $\vartheta^*$ provides a good approximation to $\theta^*$ using a moderate number of particles in comparison with 2nd-SPSA and FDSA. M2-SPSA only uses 800 particles to obtain a good and accurate parameter estimate, whereas 2nd-SPSA uses 920 particles to find a suitable estimate and FDSA uses 1000 particles to estimate the parameters correctly. Also, regarding the computational cost, the CPU time required to estimate the parameters by 2nd-SPSA and FDSA is 2.8 and 3.2 times, respectively, the CPU time required by M2-SPSA, so that, in terms of efficiency, the use of these algorithms might be questionable. Note that the number of loss function measurements needed in each iteration of FDSA grows with $p$, while for M2-SPSA only two measurements are needed, independent of $p$. This, according to the characteristics of M2-SPSA described in Chap. 2, gives our proposed algorithm the potential to achieve a large saving (over FDSA) in the total number of measurements required to estimate $\theta$ when $p$ is large. Also, we can see that the performance of FDSA was highly dependent on the shape of the loss function surface [21]; consequently, this places a higher burden on the selection of initial parameter values. Thus, M2-SPSA has a low computational cost and usually provides less dispersed and more accurate parameter estimates. The reason for these results is that M2-SPSA is a very powerful technique that allows an approximation of the gradient or Hessian by effecting simultaneous random perturbations in all the parameters. This contrasts with FDSA, in which the evaluation of the gradient is achieved by varying the parameters one at a time. In general, these results obtained by M2-SPSA are explained by the fact that this algorithm does not depend on derivative information and is able to find a good approximation to the solution using few function values (see Chap. 2); this leads to a low computational cost and complexity. In comparison with 2nd-SPSA, M2-SPSA has a lower computational cost, as explained in Chap. 2. Also, the M2-SPSA algorithm can satisfy some conditions and constraints associated with the problem, in contrast with 2nd-SPSA, which cannot satisfy them [18]. In contrast with FDSA, in M2-SPSA the slope is estimated, and the estimation error for the slope has an effect on the convergence speed; thus, M2-SPSA is a very suitable algorithm. Nevertheless, if one decides to allow for more resources and use a gradient-based approach, the SPSA proposed here can still prove extremely useful in exploring the parameter space and choosing suitable initial values for the parameter vector.
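To make the measurement-count comparison concrete: per iteration, two-sided FDSA requires two loss measurements per coordinate, whereas the SP-based methods require two in total,

$$\text{FDSA: } 2p \ \text{measurements/iteration}, \qquad \text{M2-SPSA: } 2 \ \text{measurements/iteration},$$

so for the three-parameter model above ($p = 3$), FDSA already uses three times as many loss measurements per iteration, and the ratio grows linearly with $p$.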



Chapter 6

Conclusions and Future Work

6.1 Conclusions

In this research, we have proposed a new modification of the SPSA algorithm whose main objectives are to estimate the parameters of complex systems, to improve the convergence, and to reduce the computational expense. This modification is called the "modified version of the 2nd-SPSA algorithm" (M2-SPSA). The identification method using the SP seems particularly useful when the number of parameters to be identified is very large or when the observed values of the quantities to be identified can only be obtained via an unknown observation system. Furthermore, a time-differential SP method that only requires one observation of the error for each time increment has been proposed as an improvement to the SPSA algorithm. The procedure of the proposed SPSA algorithm can be explained as follows.

To eliminate the errors introduced by the inversion of the estimated Hessian $H_k^{-1}$, a modification (2.13) to 2nd-SPSA is suggested that replaces $H_k^{-1}$ with the scalar inverse of the geometric mean of all the eigenvalues of $H_k$. This leads to significant improvements in the efficiency of the proposed SPSA algorithm. At finite iterations, it is found that the newly introduced M2-SPSA based on (2.13) and (2.14) frequently outperforms 2nd-SPSA in numerical simulations that represent a wide range of matrix conditioning. Moreover, the ratio of the mean square errors from M2-SPSA to 2nd-SPSA is always less than unity except for a perfectly conditioned Hessian. The magnitude of the errors in 2nd-SPSA depends on the matrix conditioning of $H^*$ due to competing factors [16]. Since these factors are strongly related to the same measure of matrix conditioning, the relative efficiency of the proposed SPSA algorithm and 2nd-SPSA might be less dependent on specific loss functions. We have also proposed to reduce the computational expense by evaluating only a diagonal estimate of the Hessian matrix. The reduction in computation time (in comparison with SA algorithms and previous versions of SPSA) is due to savings in the evaluation of the Hessian estimate, as well as in the recursion on $\theta$, which only requires a trivial matrix inverse. The performance, in terms of rate of convergence and accuracy, remains almost unchanged, which demonstrates that the diagonal Hessian estimate still captures potentially large scaling differences in the elements of $\theta$. In this latter algorithm, regularization can be achieved in a straightforward way, by imposing positivity on the diagonal elements of the Hessian.
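The two devices just described can be sketched as follows (a minimal illustration of the idea behind (2.13) and the diagonal regularization, not their exact implementation; the function names and floors are hypothetical):

```python
import numpy as np

def geometric_mean_scalar(H_k, floor=1e-12):
    """Scalar replacement for the inverse of H_k: the inverse of the
    geometric mean of the eigenvalue magnitudes of H_k."""
    eig = np.linalg.eigvalsh(0.5 * (H_k + H_k.T))
    gm = np.exp(np.mean(np.log(np.maximum(np.abs(eig), floor))))
    return 1.0 / gm

def positive_diagonal(H_k, floor=1e-8):
    """Diagonal Hessian estimate with positivity imposed on its entries;
    'inverting' it is the trivial element-wise reciprocal."""
    return np.maximum(np.abs(np.diag(H_k)), floor)
```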

We have explained our proposed SPSA algorithm in detail in this dissertation, and three important applications have been proposed in order to evaluate the proposed M2-SPSA algorithm. These applications are in the areas of control and signal processing, where the M2-SPSA algorithm was implemented very successfully. In the following paragraphs, the conclusions corresponding to these applications are given.

1) First application

We have proposed an MR-SMC method using a non-linear observer for controlling the angular position of a single flexible link while suppressing its oscillation. The non-linear observer and the MR-SMC provide successful and stable operation of the system. The M2-SPSA algorithm is used to determine the observer/controller gains, and it could determine them very efficiently and with a low computational cost. The non-linear observer was successful in predicting the state variables from the motor angular position, and the MR-SMC was a very efficient control method. The performance of our proposed system was very satisfactory and close to the real results obtained in [47].

2) Second application

In this research, we have also shown a method for deriving adaptive algorithms for IIR lattice filters from the corresponding direct-form algorithms. The advantage of this approach is that it provides conditions under which the convergence characteristics of stationary points are preserved when passing from the direct form to the lattice algorithm. We use M2-SPSA to obtain the coefficients in the lattice form more efficiently, so that we can reduce the computational burden required to obtain a suitable performance. This allowed the design of lattice versions of the SM and SHARF algorithms, which are locally convergent, at least in the sufficient-order case. It was also shown that this was not the case for previous lattice versions.


3) Third application

Finally, a fast and efficient modified SPSA algorithm to perform ML parameter estimation in state-space models using SMC filters has been proposed. The algorithm proposed here is based on measurements of the objective function and does not involve any gradient calculations. The estimation using M2-SPSA seems particularly useful when the number of parameters to identify is large or when the observed values of what is to be identified can only be obtained via an unknown observation system. Also, M2-SPSA outperforms FDSA and 2nd-SPSA due to its reduced computational cost and complexity, which remains fixed with the dimension of the parameter vector. However, its performance is very sensitive to the step-size parameters, and special care should be taken when these are selected.

Tables 6.1 and 6.2 show the final performance of M2-SPSA for the applications described in this dissertation; this performance is compared with previous versions of the SPSA algorithm and with SA algorithms.

Table 6.1. Comparison of algorithms (performance).

Algorithm    No. of Loss Measurements
M2-SPSA      Low
2nd-SPSA     Relatively Low
1st-SPSA     High

Table 6.1 presents a comparison between M2-SPSA and previous versions of SPSA according to the simulation results obtained in Chap. 2 of this dissertation. The number of loss measurements is reduced significantly by our proposed method M2-SPSA, and this is confirmed by Tables 2.2-2.4; there, following the study of Spall [18] based on a larger number of loss measurements (i.e., more asymptotic), we show that M2-SPSA outperforms 1st-SPSA and 2nd-SPSA (in terms of the iterations needed to reach given normalized loss values) in the high-noise case. The ratios of M2-SPSA shown in Tables 2.3-2.4 offer considerable promise for practical problems (using a low number of measurements in comparison with 1st-SPSA), where $p$ is even larger (say, as in the neural network-based direct adaptive control method of Spall and Cristion [25], where $p$ can easily be of order $10^2$ or $10^3$). In such cases, other second-order techniques that require a number of function measurements growing with $p$ are likely to become infeasible.

In Table 2.2, we see that M2-SPSA provides a considerable reduction in the loss function value for the same number of measurements used in 1st-SPSA and 2nd-SPSA. Based on the numbers in Tables 2.2-2.4, together with supplementary studies described in Chap. 2, we find that 1st-SPSA and 2nd-SPSA need approximately five to ten times the number of function evaluations used by M2-SPSA to reach the levels of accuracy shown.

Table 6.2. Comparison of algorithms (computational cost).

Algorithm      Computational Cost
M2-SPSA        Low
2nd-SPSA       Relatively Low
SA Algorithms  High

Table 6.2 presents a comparison between M2-SPSA, previous versions of SPSA and SA algorithms according to CPU time. These results are confirmed by the values obtained in Tables 3.3 and 5.1 in Chaps. 3 and 5, respectively, where the computational load, i.e., the normalized CPU time [49] (computational cost in processing time), is used with the CPU time required by M2-SPSA as the reference. These comparisons are based on the average CPU time used by each algorithm to estimate each parameter.

The CPU time, or CPU usage, is the amount of time a computer program spends processing instructions, as opposed to, for example, waiting for input/output operations. In this case, the CPU time required to estimate the parameters by 2nd-SPSA is about 2 times the CPU time required by M2-SPSA; for that reason it is classified as relatively low in comparison with our proposed SPSA. The CPU time required to estimate the parameters by the SA algorithms is approximately 2 to 5 times the CPU time required by M2-SPSA; for that reason it is classified as high in comparison with our proposed SPSA. Therefore, these simulations show that the SA algorithms have a high computational cost in comparison with M2-SPSA (see Tables 3.3 and 5.1), even though the 2nd-SPSA algorithm has the same or a lower computational cost than the SA algorithms (see Table 5.5). This is explained by the fact that the number of loss function measurements needed in each iteration of FDSA (Table 5.1), RM-SA or LS (Table 3.3) grows with $p$, while for M2-SPSA or 2nd-SPSA only two measurements are needed, independent of $p$; this is described in detail in Chap. 2, and the difference between 2nd-SPSA and M2-SPSA is demonstrated by the simulations (Tables 2.2-2.4). Also, M2-SPSA allows an approximation of the gradient or Hessian by effecting simultaneous random perturbations in all the parameters. This contrasts with the evaluation of the gradient in FDSA, which is achieved by varying the parameters one at a time.

6.2 Future Work

Referring to the conclusions given above, we still have many topics to investigate in the near future. One is to assess the performance of SPSA for constrained and unconstrained aerodynamic shape design studies. This study will be carried out in the near future to establish the cost benefits and to investigate the extent to which SPSA offers comparative advantages over other kinds of similar methods for dynamic design optimization problems.

The M2-SPSA algorithm can also be applied to image processing; in this case we focus on two main applications. First, the M2-SPSA algorithm will be used in multidimensional image processing (medical images) in order to reduce CPU time, in the same way as in the applications presented in this dissertation. Second, extracting a multivariate non-linear physical model from a set of satellite images can be considered as a multivariate non-linear regression problem. Multiple local solutions often prevent gradient-type algorithms from obtaining globally optimal solutions; the M2-SPSA algorithm is a method of solving this problem. The method will be applied to the problem of estimating the distribution of energetic ion populations from global images of the magnetosphere.


Finally, we have applied our proposed M2-SPSA algorithm to the applications proposed here, but our proposed SPSA can also be applied to other kinds of applications in other areas, for example the image processing mentioned in this section. The M2-SPSA algorithm can be applied to different applications provided that they satisfy in advance the conditions described by the main theorems (Theorems 1, 2 and 3 of M2-SPSA and their guidelines C.1' and C.3') explained in Sec. 2.9; if an application satisfies these conditions, M2-SPSA can be used.



References

[1] G. Cassandras, L. Dai, and C. G. Panayiotou, "Ordinal Optimization for a Class of Deterministic and Stochastic Discrete Resource Allocation Problems," IEEE Trans. Autom. Control, vol. 43, no. 7, pp. 881-900, 1998.
[2] G. N. Saridis, "Stochastic Approximation Methods for Identification and Control," IEEE Trans. Autom. Control, vol. 19, pp. 798-809, 1974.
[3] J. C. Spall, "Multivariate Stochastic Approximation using a Simultaneous Perturbation Gradient Approximation," IEEE Trans. Autom. Control, vol. 37, pp. 332-341, 1992.
[4] S. N. Evans and N. C. Weber, "On the Almost Sure Convergence of a General Stochastic Approximation Procedure," Bull. Australian Math. Soc., vol. 34, pp. 335-342, 1986.
[5] H. F. Chen, T. E. Duncan, and B. Pasik-Duncan, "A Stochastic Approximation Algorithm with Random Differences," Proceedings of the 13th Triennial IFAC World Congress, pp. 493-496, 1996.
[6] J. C. Spall, "An Overview of the Simultaneous Perturbation Algorithm for Stochastic Optimization," IEEE Trans. Aerosp. Electron. Syst., vol. 34, pp. 817-823, 1998.
[7] A. Vande Wouwer, C. Renotte, and Ph. Bogaerts, "Application of SPSA Techniques in Non-linear System Identification," European Control Conference, 2001.
[8] J. Kiefer and J. Wolfowitz, "Stochastic Estimation of the Maximum of a Regression Function," Ann. Math. Statist., vol. 23, pp. 498-506, 1952.
[9] H. Robbins and S. Monro, "A Stochastic Approximation Method," Ann. Math. Statist., vol. 22, pp. 400-407, 1951.
[10] S. A. Billings and G. N. Jones, "Orthogonal Least-Squares Parameter Estimation Algorithms for Non-Linear Stochastic Systems," Int. Journal of Systems Science, vol. 23, no. 7, pp. 1019-1032, 1990.
[11] L. Gerencser, "SPSA with State-Dependent Noise: A Tool for Direct Adaptive Control," Proceedings of the 37th Conference on Decision and Control (CDC), 1998.
[12] J. C. Spall and D. C. Chin, "Traffic-Responsive Signal Timing for System-Wide Traffic Control," Transp. Res., Part C, vol. 5, pp. 153-163, 1997.
[13] J. H. Venter, "An Extension of the Robbins-Monro Algorithm," Annals of Mathematical Statistics, vol. 38, pp. 181-190, 1967.
[14] D. Ruppert, "Stochastic Approximation," Handbook of Sequential Analysis, pp. 503-529, 1991.
[15] G. N. Saridis and G. Stein, "Stochastic Approximation Algorithms for Linear Discrete-time System Identification," IEEE Trans. Autom. Control, vol. 13, pp. 515-523, 1968.
[16] L. Gerencser, "Rate of Convergence of Moments for a Simultaneous Perturbation Stochastic Approximation Method for Function Minimization," IEEE Trans. Autom. Control, vol. 44, pp. 894-906, 1999.
[17] J. C. Spall, "Adaptive Stochastic Approximation by the Simultaneous Perturbation Method," Proceedings of the 1998 IEEE CDC, pp. 3872-3879, 1998.
[18] J. C. Spall, "A Second-Order Stochastic Approximation Algorithm using only Function Measurements," Proceedings of the IEEE Conference on Decision and Control, pp. 2472-2477, 1994.
[19] V. Fabian, "On Asymptotic Normality in Stochastic Approximation," Ann. Math. Statist., vol. 39, pp. 1327-1332, 1968.
[20] H. F. Chen and Y. Zhu, "Stochastic Approximation Procedure with Randomly Varying Truncations," Scientia Sinica (Series A), vol. 29, pp. 914-926, 1986.
[21] D. C. Chin, "Comparative Study of Stochastic Algorithms for System Optimization Based on Gradient Approximations," IEEE Trans. Syst., Man, and Cybernetics, vol. 27, pp. 244-249, 1997.
[22] B. Efron and D. V. Hinkley, "Assessing the Accuracy of the Maximum Likelihood Estimator: Observed versus Expected Fisher Information," Biometrika, vol. 65, pp. 457-487, 1978.
[23] S. Das, R. Ghanem, and J. C. Spall, "Asymptotic Sampling Distribution for Polynomial Chaos Representation of Data: A Maximum Entropy and Fisher Information Approach," SIAM Journal on Scientific Computing, 2006.
[24] J. C. Spall, "A Stochastic Approximation Algorithm for Large-Dimensional Systems in the Kiefer-Wolfowitz Setting," Proc. IEEE Conf. on Decision and Control, pp. 1544-1548, 1988.
[25] J. C. Spall and J. A. Cristion, "Non-linear Adaptive Control Using Neural Networks: Estimation Based on a Smoothed Form of Simultaneous Perturbation Gradient Approximation," Statistica Sinica, vol. 4, pp. 1-27, 1994.
[26] D. W. Hutchison, "On an Efficient Distribution of Perturbations for Simulation Optimization using Simultaneous Perturbation Stochastic Approximation," Proceedings of the IASTED International Conference on Applied Modeling and Simulation, pp. 440-445, 2002.
[27] R. W. Brennan and P. Rogers, "Stochastic Optimization Applied to a Manufacturing System Operation Problem," Proc. Winter Simulation Conf. (C. Alexopoulos, K. Kang, W. R. Lilegdon, and D. Goldsman, Eds.), pp. 857-864, 1995.
[28] J. C. Spall, "Implementation of the Simultaneous Perturbation Algorithm for Stochastic Optimization," IEEE Trans. Aerosp. Electron. Syst., vol. 34, pp. 817-823, 1998.
[29] M. Metivier and P. Priouret, "Applications of a Kushner and Clark Lemma to General Classes of Stochastic Algorithms," IEEE Trans. Inform. Theory, vol. IT-30, pp. 140-151, 1984.
[30] A. Benveniste, M. Metivier, and P. Priouret, Adaptive Algorithms and Stochastic Approximations, New York: Springer-Verlag, 1990.
[31] H. J. Kushner and G. G. Yin, Stochastic Approximation Algorithms and Applications, New York: Springer-Verlag, 1997.
[32] J. C. Spall and J. A. Cristion, "Model-free Control of Non-linear Stochastic Systems with Discrete-time Measurements," IEEE Trans. Autom. Control, vol. 43, pp. 1198-1210, 1998.
[33] J. Dippon and J. Renz, "Weighted Means in Stochastic Approximation of Minima," SIAM J. Control Optim., vol. 35, pp. 1811-1827, 1997.
[34] J. R. Blum, "Approximation Methods which Converge with Probability One," Ann. Math. Statist., vol. 25, pp. 382-386, 1954.
[35] J. J. More, B. S. Garbow, and K. E. Hillstrom, "Testing Unconstrained Optimization Software," ACM Transactions on Mathematical Software, vol. 7, no. 1, pp. 17-41, 1981.
[36] R. G. Laha and V. K. Rohatgi, Probability Theory, New York: Wiley, 1979.
[37] F. J. Solis and R. J. Wets, "Minimization by Random Search Techniques," Mathematics of Operations Research, vol. 6, pp. 19-30, 1981.
[38] Y. Maeda and Y. Kanata, "Learning Rules for Recurrent Neural Networks using Perturbation and Their Application to Neuro-control," Trans. IEE Japan, vol. 113-C, pp. 402-408, 1995 (in Japanese).
[39] J. C. Spall, "A One-Measurement Form of Simultaneous Perturbation Stochastic Approximation," Automatica, vol. 33, pp. 109-112, 1997.
[40] J. C. Spall and J. A. Cristion, "A Neural Network Controller for Systems with Unmodeled Dynamics with Applications to Wastewater Treatment," IEEE Trans. Syst., Man, Cybern. B, vol. 27, pp. 369-375, 1997.
[41] J. Lin and F. L. Lewis, "Two-Time Scale Fuzzy Logic Controller of Flexible Link Robot Arm," Fuzzy Sets and Systems, vol. 139, no. 7, pp. 125-149, 2003.
[42] R. H. Cannon and E. Schmitz, "Initial Experiments on the End-Point Control of a Flexible One-Link Robot," Int. Journal of Robotics Research, vol. 3, no. 3, pp. 62-75, 1984.
[43] Y. Sakawa, F. Matsuno, and S. Fukushima, "Modeling and Feedback Control of a Flexible Arm," Journal of Robotic Systems, vol. 2, no. 4, pp. 453-472, 1985.
[44] S. Nicosia, P. Tomei, and A. Tornambe, "Non-Linear Control and Observation Algorithms for a Single-Link Flexible Arm," Int. Journal of Control, vol. 49, no. 5, pp. 827-840, 1989.
[45] J. Yuh, "Application of Discrete-Time Model Reference Adaptive Control to a Flexible Single-Link Robot," Journal of Robotic Systems, vol. 4, pp. 621-630, 1987.
[46] E. Bayo et al., "Inverse Dynamics and Kinematics of Multi-Link Elastic Robots: An Iterative Frequency Domain Approach," Int. Journal of Robotics Research, vol. 8, no. 6, pp. 49-62, 1989.
[47] U. Sawut, N. Umeda, T. Hanamoto, and T. Tsuji, "Applications of Non-Linear Observer in Flexible Arm Control," Trans. of SICE, vol. 35, no. 3, pp. 401-406, 1999 (in Japanese).
[48] C. Z. Wei, "Multivariate Adaptive Stochastic Approximation," Ann. Statist., vol. 15, pp. 1115-1130, 1987.
[49] A. Vande Wouwer, C. Renotte, and M. Remy, "Application of Stochastic Approximation Techniques in Neural Modelling and Control," Int. Journal of Systems Science, vol. 34, no. 14, pp. 851-863, 2003.
[50] P. A. Regalia, Adaptive IIR Filtering in Signal Processing and Control, Marcel Dekker, 1995.
[51] D. Parikh, N. Ahmed, and S. D. Stearns, "An Adaptive Lattice Algorithm for Recursive Filters," IEEE Trans. Acoust., Speech, Signal Processing, vol. 28, pp. 110-112, 1980.
[52] J. A. Rodriguez-Fonollosa and E. Masgrau, "Simplified Gradient Calculation in Adaptive IIR Lattice Filters," IEEE Trans. on Signal Processing, vol. 39, pp. 1702-1705, 1991.
[53] P. A. Regalia, "Stable and Efficient Lattice Algorithms for Adaptive IIR Filtering," IEEE Trans. on Signal Processing, vol. 40, pp. 375-388, 1992.
[54] H. Fan, "Application of Benveniste's Convergence Results in the Study of Adaptive IIR Filtering Algorithms," IEEE Trans. Inform. Theory, vol. 34, pp. 692-709, 1988.
[55] P. Lancaster and M. Tismenetsky, The Theory of Matrices, Academic Press, 1985.
[56] C. R. Johnson, Jr., M. G. Larimore, J. R. Treichler, and B. D. O. Anderson, "SHARF Convergence Properties," IEEE Trans. on Acoustics, Speech, and Signal Processing, vol. 28, no. 4, pp. 428-440, 1980.
[57] M. G. Larimore and J. R. Treichler, "SHARF: An Algorithm for Adapting IIR Digital Filters," IEEE Trans. on Acoustics, Speech, and Signal Processing, vol. 28, no. 4, pp. 428-440, 1980.
[58] I. D. Landau, "Elimination of the Real Positivity Condition in the Design of Parallel MRAS," IEEE Trans. Automat. Control, vol. AC-23, no. 6, pp. 1015-1020, 1978.
[59] K. Kurosawa and S. Tsuji, "An IIR Parallel-Type Adaptive Algorithm using the Fast Least Squares Method," IEEE Trans. Acoust., Speech, Signal Processing, vol. 37, no. 8, pp. 1226-1230, 1989.
[60] C. R. Johnson, Jr., and Taylor, "Failure of a Parallel Adaptive Identifier with Adaptive Error Filtering," IEEE Trans. Automat. Control, vol. AC-25, no. 6, pp. 1248-1250, 1980.
[61] K. Steiglitz and L. E. McBride, "A Technique for the Identification of Linear Systems," IEEE Trans. Automat. Control, vol. AC-10, no. 4, pp. 461-464, 1965.
[62] P. Regalia and M. Mboup, "An a Priori Error Bound for the Steiglitz-McBride Method," IEEE Trans. on Circuits and Systems II: Analog and Digital Signal Processing, vol. 41, no. 2, pp. 105-116, 1996.
[63] K. X. Miao, H. Fan, and M. Doroslovacki, "Cascade Normalized Lattice Adaptive IIR Filters," IEEE Trans. on Signal Processing, vol. 42, pp. 721-742, 1994.
[64] G. Poyiadjis, A. Doucet, and S. S. Singh, "Particle Methods for Optimal Filter Derivative: Application to Parameter Estimation," Proceedings of IEEE ICASSP, 2005.
[65] G. Poyiadjis, S. S. Singh, and A. Doucet, "Novel Particle Filter Methods for Recursive and Batch Parameter Estimation in General State Space Models," Technical Report CUED/F-INFENG/TR-536, Engineering Department, Cambridge University, 2005.
[66] P. Fearnhead, "MCMC, Sufficient Statistics and Particle Filters," Journal of Computational and Graphical Statistics, vol. 11, pp. 848-862, 2002.
[67] B. L. Chan, A. Doucet, and V. B. Tadic, "Optimization of Particle Filters using Simultaneous Perturbation Stochastic Approximation," Proc. IEEE ICASSP, pp. 681-684, 2003.
[68] J. Liu and M. West, "Combined Parameter and State Estimation in Simulation-based Filtering," in Sequential Monte Carlo Methods in Practice (A. Doucet, J. F. G. de Freitas, and N. J. Gordon, Eds.), New York: Springer-Verlag, 2001.
[69] G. Storvik, "Particle Filters in State Space Models with the Presence of Unknown Static Parameters," IEEE Trans. Signal Processing, vol. 50, pp. 281-289, 2002.
[70] A. Doucet and V. B. Tadic, "On-line Optimization of Sequential Monte Carlo Methods using Stochastic Approximation," Proceedings of the American Control Conference, pp. 2565-2570, 2002.
[71] H. J. Kushner and D. S. Clark, Stochastic Approximation Methods for Constrained and Unconstrained Systems, New York: Springer-Verlag, 1978.
[72] N. J. Gordon, D. J. Salmond, and A. F. M. Smith, "Novel Approach to Non-linear/Non-Gaussian Bayesian State Estimation," IEE Proceedings F, vol. 140, pp. 107-113, 1993.


Appendix A

Proofs of Convergence Results and Asymptotic Distribution Results

Proof of Lemma (Sufficient Conditions for C.5 and C.7)

C.7 is used in the proofs of Theorems 1a and 1b only to ensure that $P(\limsup_{k\to\infty}\|\hat{\theta}_k\| = \infty) = 0$. Given the boundedness of $\hat{\theta}_k$, this condition becomes superfluous. Regarding C.5, the boundedness condition together with the facts that $a_k/c_k^2 \to 0$ and $c_k^2\bar{H}_k^{-1} \to 0$ (C.6) imply that, for some $0 < \rho' < \rho$, $|a_k g_{ki}(\hat{\theta}_k)| \le \rho'$ a.s. for all $k$ sufficiently large. From the basic recursion, $\tilde{\theta}_{k+1,i} = \tilde{\theta}_{ki} - a_k g_{ki}(\hat{\theta}_k) - a_k e_{ki}$, where $e_k = G_k(\hat{\theta}_k) - g_k(\hat{\theta}_k)$. But $a_k e_{ki} \to 0$ a.s. by the martingale convergence theorem (see (8) and (9) in Spall and Cristion [25]). Since $|\tilde{\theta}_{ki}| \ge \rho > \rho'$, we know that $\operatorname{sign}\tilde{\theta}_{ki} = \operatorname{sign}\tilde{\theta}_{k+1,i}$ for all $k$ sufficiently large, implying that $\operatorname{sign} g_i(\hat{\theta}_k) = \operatorname{sign} g_i(\hat{\theta}_{k+1})$ a.s.

Proof of Theorem 1a (M2-SPSA)

The proof will proceed in three parts. Some of the proof closely follows that of the proposition in Spall and Cristion [25], in which case the details will be omitted here and the reader will be directed to that reference. However, some of the proof differs in nontrivial ways due to, among other factors, the need to explicitly treat the bias in the gradient estimate $G_k(\cdot)$. First, we will show that $\tilde{\theta}_k = \hat{\theta}_k - \theta^*$ does not diverge in magnitude to $\infty$ on any set of nonzero measure. Second, we will show that $\tilde{\theta}_k$ converges a.s. to some random vector, and third, we will show that this random vector is the constant 0, as desired. Equalities hold a.s. where relevant.

Part 1: First, from C.0, C.2, and C.3, it can be shown in the manner of Spall [3, Lemma 1] that, for all $k$ sufficiently large,

$$E\big(G_k(\hat{\theta}_k) \mid \hat{\theta}_k\big) = g(\hat{\theta}_k) + b_k \qquad \text{(A1)}$$

where $c_k^{-2} b_k$ is uniformly bounded a.s. Using C.6, we know that $\bar{H}_k^{-1}$ exists a.s., and hence we can write $M_k \equiv a_k \bar{H}_k^{-1}\big(g(\hat{\theta}_k) + b_k\big)$. Then, as in the proposition of Spall and Cristion [25], C.1, C.2, and C.6, and Holder's inequality imply, via the martingale convergence theorem,

$$\tilde{\theta}_{k+1} + \sum_{j=0}^{k} M_j \xrightarrow{\;a.s.\;} X \qquad \text{(A2)}$$

where $X$ is some integrable random vector.

Let us now show that $P(\limsup_{k\to\infty}\|\tilde{\theta}_k\| = \infty) = 0$. Since the arguments below apply along any subsequence, we will, for ease of notation and without loss of generality, consider the event $\{\tilde{\theta}_k \to \infty\}$. We will show that this event has probability 0 by a modification of the arguments in [25, proposition] (which is a multivariate extension of scalar arguments in Blum [34] and Evans and Weber [4]). Furthermore, suppose that the limiting quantity of the unbounded elements is $+\infty$ (trivial modifications cover a limiting quantity including $-\infty$ limits). Then, as shown in [25], the event of interest $\{\tilde{\theta}_k \to \infty\}$ has probability 0 if

$$\big\{\tilde{\theta}_{ki} \ge \rho'(\tau,S)\ \forall i \in S,\ \tilde{\theta}_{ki} \le \tau\ \forall i \notin S,\ k \ge K(\tau,S)\big\} \cap \limsup_{k\to\infty}\{M_{ki} < 0\ \forall i \in S\} \qquad \text{(A3a)}$$

and

$$\big\{\tilde{\theta}_{ki} \to \infty\ \forall i \in S\big\} \cap \liminf_{k\to\infty}\{M_{ki} < 0\ \forall i \in S\}^c \qquad \text{(A3b)}$$

both have probability 0 for all $\tau$, $S$ and $\rho'(\tau,S)$ as defined in C.7, where $K(\tau,S) < \infty$ and the superscript $c$ denotes set complement. For event (A3a), we know that there exists a subsequence $\{k_0, k_1, k_2, \ldots\}$, $k_0 \ge K(\tau,S)$, such that $\{\tilde{\theta}_{k_j,i} \ge \rho'(\tau,S)\ \forall i \in S\} \cap \{M_{k_j,i} < 0\ \forall i \in S\}$ is true. Then, from C.6 and (A1),

$$\sum_{i\in S} \tilde{\theta}_{k_j i}\big(g_{k_j i}(\tilde{\theta}_{k_j}) + o(1)\big) < 0 \quad \text{a.s.} \qquad \text{(A4)}$$

for all $k_j$. By C.4, $\tilde{\theta}_{k_j}^T g_{k_j}(\hat{\theta}_{k_j}) \ge \rho\|\tilde{\theta}_{k_j}\|$ a.s., which, by C.7, implies, for all $j$ sufficiently large,

$$\sum_{i\in S} \tilde{\theta}_{k_j i}\, g_{k_j i}(\tilde{\theta}_{k_j}) \ge \frac{\rho}{2}\|\tilde{\theta}_{k_j}\| \ge \left(\frac{\rho}{2}\right)\dim(S)\,\rho'(\tau,S) \ge \frac{\rho\tau}{2} \qquad \text{(A5)}$$

since $\rho'(\tau,S) \ge \tau$ and $\dim(S) \ge 1$. Taken together, (A4) and (A5) imply that, for each sample point (except possibly on a set of measure 0), the event in (A3a) has probability 0. Now, consider the second event (A3b). From (A2), we know that, for almost all sample points, $\sum_{k=0}^{\infty} M_{ki} = -\infty\ \forall i \in S$ must be true. But this implies, from C.5 and the above-mentioned uniformly bounded decaying bias $b_k$, that for no $i \in S$ can $M_{ki} \ge 0$ occur infinitely often. However, at each $k$, the event $\{M_{ki} < 0\ \forall i \in S\}^c$ is composed of the union of $2^{\dim(S)} - 1$ events, each of which has $M_{ki} \ge 0$ for at least one $i \in S$. This, of course, requires that $M_{ki} \ge 0$ for at least one $i \in S$, which creates a contradiction. Hence, the probability of the event in (A3b) is 0. This completes Part 1 of the proof.


Part 2: To show that $\tilde{\theta}_k$ converges a.s. to a unique (finite) limit, we show that

$$P\left(\liminf_{k\to\infty}\tilde{\theta}_{ki} < a' < b' < \limsup_{k\to\infty}\tilde{\theta}_{ki}\right) = 0 \quad \forall i \qquad \text{(A6)}$$

for any $a' < b'$. This result follows as in the proof of Part 2 of the proposition in Spall and Cristion [25].

Part 3: Let us now show that the unique finite limit from Part 2 is 0. From (A2) and the conclusion of Part 1, we have $\big|\sum_{k=0}^{\infty} M_{ki}\big| < \infty$ a.s. $\forall i$. Then the result to be shown follows if

$$P\left(\lim_{k\to\infty}\tilde{\theta}_k \ne 0,\ \Big\|\sum_{k=0}^{\infty} M_k\Big\| < \infty\right) = 0. \qquad \text{(A7)}$$

Suppose that the event in the probability of (A7) is true, and let $I \subseteq \{1,2,\ldots,p\}$ represent those indices $i$ such that $\tilde{\theta}_{ki}$ does not converge to 0 as $k\to\infty$. Then, by the convergence in Part 2, there exist (for almost any sample point in the underlying sample space) some $0 < a' < b' < \infty$ and $K(a',b') < \infty$ (dependent on the sample point) such that $\forall k > K$, $0 < a' \le |\tilde{\theta}_{ki}| \le b' < \infty$ when $i \in I$ ($I \ne \emptyset$) and $|\tilde{\theta}_{ki}| \le a'$ when $i \in I^c$. From C.4, it follows that

$$\sum_{k=K+1}^{n} a_k \sum_{i\in I} \tilde{\theta}_{ki}\, g_{ki}(\hat{\theta}_k) \ge a'\rho \sum_{k=K+1}^{n} a_k. \qquad \text{(A8)}$$

But since C.5 implies that $g_{ki}(\hat{\theta}_k)$ can change sign only a finite number of times (except possibly on a set of sample points of measure 0), and since $|\tilde{\theta}_{ki}| \le b'$, we know from (A8) that, for at least one $i \in I$,

$$\limsup_{n\to\infty} \frac{\rho a' \sum_{k=K+1}^{n} a_k}{\Big|\sum_{k=K+1}^{n} a_k\, g_{ki}(\hat{\theta}_k)\Big|} < \infty. \qquad \text{(A9)}$$

Recall that $a_k g_k(\hat{\theta}_k) = M_k - a_k \bar{H}_k^{-1} b_k$ and $b_k = O(c_k^2)$ a.s. Hence, from C.6, we have $\bar{H}_k^{-1} b_k = o(1)$. Then, by (A9), $\big|\sum_{k=K+1}^{\infty} M_{ki}\big| = \infty$. Since, for the $a' < b'$ above, there exists such a $K$ for each sample point in a set of measure one, we know from the above discussion that there also exists an $i \in I$ ($i$ possibly dependent on the sample point) such that $\big|\sum_{k=K+1}^{\infty} M_{ki}\big| = \infty$. Since $I$ has a finite number of elements, $\big|\sum_{k=0}^{\infty} M_{ki}\big| = \infty$ for at least one $i$ on this event. However, this is inconsistent with the event in (A7), showing that the event does, in fact, have probability 0. This completes Part 3, which completes the proof.

Proof of Theorem 1b (2SG). The initial martingale convergence arguments establishing the 2SG analog of (A2) are based on C.0'-C.2' and C.6. Although there is no bias in the gradient measurement, C.4 and C.7 still work together to guarantee that the elements potentially diverging [in the arguments analogous to those surrounding (A3a), (A3b)] asymptotically dominate the product $\hat{\theta}_{kj}^T g_{kj}(\hat{\theta}_{kj})$. As in the proof of Theorem 1a, this sets up a contradiction. The remainder of the proof follows exactly as in Parts 2 and 3 of the proof of Theorem 1a, with some of the arguments made easier since $b_k = 0$.


Proof of Theorem 2a (M2-SPSA)

First, note that the conditions subsume those of Theorem 1a; hence, we have a.s. convergence of $\hat{\theta}_k$. By C.8, we have $E\big((c_k\tilde{c}_k)^2\hat{H}_k^2\big)$ uniformly bounded $\forall k$. Hence, by the additional assumption introduced in C.1'' (beyond that in C.1), the martingale convergence result in, say, Gerencser [16], yields

$$\frac{1}{n+1}\sum_{k=0}^{n}\big(\hat{H}_k - E(\hat{H}_k \mid \hat{\theta}_k)\big) \to 0 \quad \text{a.s. as } n \to \infty. \qquad \text{(A10)}$$

Let $H(\theta)$ represent the true Hessian matrix, and suppose that $g(\theta)$ is three-times continuously differentiable in a neighborhood of $\hat{\theta}_k$. Then, simple Taylor series arguments show that

$$E(\delta G_k \mid \hat{\theta}_k, \Delta_k) \equiv \delta g_k + O(c_k^3) = g(\hat{\theta}_k + c_k\Delta_k) - g(\hat{\theta}_k - c_k\Delta_k) + O(c_k^3) \qquad \big(O(c_k^3) = 0 \text{ in the SG case}\big)$$

where this result is immediate in the SG case, and follows easily by a Taylor series argument in the SPSA case (where the $O(c_k^3)$ term is the difference of the two $O(c_k^2)$ bias terms in the one-sided SP gradient approximations and $\tilde{c}_k = O(c_k)$). Hence, by an expansion of each of $g(\hat{\theta}_k \pm c_k\Delta_k)$, we have for any $i, j$:


$$E\left(\frac{\delta G_{ki}}{2c_k\Delta_{kj}}\ \Big|\ \hat{\theta}_k, \Delta_k\right) = E\left(\frac{\delta g_{ki}}{2c_k\Delta_{kj}}\ \Big|\ \hat{\theta}_k, \Delta_k\right) + O(c_k^2) = H_{ij}(\hat{\theta}_k) + \sum_{l\neq j} H_{il}(\hat{\theta}_k)\,\frac{\Delta_{kl}}{\Delta_{kj}} + O(c_k^2)$$

where the $O(c_k^2)$ term in the second line absorbs higher-order terms in the expansion of $\delta g_k$. Then, since $E(\Delta_{kl}/\Delta_{kj}) = 0\ \forall j \neq l$ by the assumptions for $\Delta_k$, we have

$$E\left(\frac{\delta G_{ki}}{2c_k\Delta_{kj}}\ \Big|\ \hat{\theta}_k\right) = H_{ij}(\hat{\theta}_k) + O(c_k^2)$$

implying that the Hessian estimate is "nearly unbiased," with the bias disappearing at rate $O(c_k^2)$. The additional operation in

$$\hat{H}_k = \frac{1}{2}\left[\frac{\delta G_k}{2c_k\Delta_k^{T}} + \left(\frac{\delta G_k}{2c_k\Delta_k^{T}}\right)^{T}\right]$$

(division by the vector $\Delta_k^T$ understood elementwise) simply forces the per-iteration estimate to be symmetric. Then, by the above equations, conditions C.3', C.8, and C.9 imply (A14) for every $l$, where $L^{(3)}_{hij}$ represents the third derivative of $L$ w.r.t. the $h$th, $i$th, and $j$th elements of $\theta$; $\bar{\theta}_k^{\pm}$ are points on the line segments between $\hat{\theta}_k \pm c_k\Delta_k$ and $\hat{\theta}_k \pm c_k\Delta_k + \tilde{c}_k\tilde{\Delta}_k$; and we used the fact that $E(\tilde{\Delta}_{ki}\tilde{\Delta}_{kj}/\tilde{\Delta}_{kl}) = 0$ for all $i, j, k$, and $l$ (implied by C.9 and the Cauchy–Schwarz inequality). Let


$$B_{kl} = \frac{1}{6}\,E\left[\tilde{\Delta}_{kl}^{-1}\sum_{h,i,j}\Big(L^{(3)}_{hij}(\bar{\theta}_k^{+}) - L^{(3)}_{hij}(\bar{\theta}_k^{-})\Big)\,\tilde{\Delta}_{kh}\tilde{\Delta}_{ki}\tilde{\Delta}_{kj}\ \Big|\ \hat{\theta}_k, \Delta_k\right]. \qquad \text{(A11)}$$

By C.3' (bounding the difference in the $L^{(3)}_{hij}$ terms) and C.9 in conjunction with the Cauchy–Schwarz inequality and C.1'' ($\tilde{c}_k = O(c_k)$), we have $B_{kl}/c_k$ uniformly bounded (in $\hat{\theta}_k, \Delta_k$) for all $k$ sufficiently large. Hence, from (A11) the $(l,m)$-th element of $\hat{H}_k$ satisfies

$$\begin{aligned}
E(\hat{H}_{k,lm} \mid \hat{\theta}_k)
&= E\left(\frac{G^{(1)}_{kl}(\hat{\theta}_k + c_k\Delta_k) - G^{(1)}_{kl}(\hat{\theta}_k - c_k\Delta_k)}{2c_k\Delta_{km}}\ \Big|\ \hat{\theta}_k\right)\\
&= E\left(\frac{g_l(\hat{\theta}_k + c_k\Delta_k) - g_l(\hat{\theta}_k - c_k\Delta_k) + \tilde{c}_k^{2}B_{kl}}{2c_k\Delta_{km}}\ \Big|\ \hat{\theta}_k\right)\\
&= E\left(\frac{2c_k\,[\partial g_l/\partial\theta]^T\big|_{\theta=\hat{\theta}_k}\,\Delta_k + O(c_k^3)}{2c_k\Delta_{km}}\ \Big|\ \hat{\theta}_k\right)\\
&= H_{lm}(\hat{\theta}_k) + O(c_k^2)
\end{aligned}\qquad \text{(A12)}$$

where the $O(c_k^3)$ term in the third line of (A12) encompasses both $\tilde{c}_k^{2}B_{kl}$ and the uniformly bounded contributions due to $\partial^2 g_l/\partial\theta\,\partial\theta^T$ in the remainder terms of the expansion of $g_l(\hat{\theta}_k + c_k\Delta_k) - g_l(\hat{\theta}_k - c_k\Delta_k)$ ($O(c_k^3)/c_k$ is uniformly bounded, allowing the use of C.9 and the Cauchy–Schwarz inequality in producing the $O(c_k^2)$ term in the last line of (A12)). Then, by (A12), the continuity of $H$ near $\hat{\theta}_k$ and the fact that $\hat{\theta}_k \to \theta^*$ a.s. (Theorem 1a), the principle of Cesaro summability implies


$$\frac{1}{n+1}\sum_{k=0}^{n} E(\hat{H}_k \mid \hat{\theta}_k) = \frac{1}{n+1}\sum_{k=0}^{n}\big(H(\hat{\theta}_k) + O(c_k^2)\big) \to H(\theta^*) \quad \text{a.s.} \qquad \text{(A13)}$$

Given that $\bar{H}_n = (n+1)^{-1}\sum_{k=0}^{n}\hat{H}_k$, (A10) and (A13) then yield the result to be proved.
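The estimator analyzed above is straightforward to realize in code. The following is a minimal sketch of the per-iteration simultaneous-perturbation Hessian estimate and its running (Cesaro) average, written for the SG case where direct gradient evaluations are available; the function names, the test loss, and the gain choices are illustrative assumptions, not the implementation used elsewhere in this thesis.

```python
import numpy as np

def sp_hessian_estimate(grad, theta, c_k, rng):
    """One simultaneous-perturbation estimate of the Hessian at theta,
    symmetrized as in the discussion around (A12)."""
    p = theta.size
    delta = rng.choice([-1.0, 1.0], size=p)       # Bernoulli +/-1 perturbation
    dG = grad(theta + c_k * delta) - grad(theta - c_k * delta)
    Hk = np.outer(dG / (2.0 * c_k), 1.0 / delta)  # delta G_k / (2 c_k Delta_k^T), elementwise inverse
    return 0.5 * (Hk + Hk.T)                      # force per-iteration symmetry

# Illustrative test: gradient of L(theta) = x^4 + x^2 + y^2 + x*y
grad = lambda th: np.array([4.0 * th[0]**3 + 2.0 * th[0] + th[1],
                            2.0 * th[1] + th[0]])

rng = np.random.default_rng(0)
theta = np.zeros(2)                               # evaluate at theta = (0, 0)
H_bar = np.zeros((2, 2))
for k in range(5000):
    c_k = 0.5 / (k + 1) ** 0.101                  # decaying perturbation size
    H_bar += (sp_hessian_estimate(grad, theta, c_k, rng) - H_bar) / (k + 1)
# H_bar approaches the true Hessian [[2, 1], [1, 2]], illustrating (A13)
```

The running-average update `H_bar += (Hk - H_bar) / (k + 1)` is exactly the Cesaro mean $\bar{H}_n$ appearing in (A13), computed without storing past estimates.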

Proof of Theorem 2b (2SG) Since the conditions subsume those of Theorem 1b, we have $\hat{\theta}_k \to \theta^*$ a.s. Analogous to (A10), C.1''' and C.8' yield a martingale convergence result for the sample mean of $\hat{H}_k - E(\hat{H}_k \mid \hat{\theta}_k)$. Then, given the boundedness of the third derivatives of $L(\theta)$ near $\hat{\theta}_k$ for all $k$, the Cauchy–Schwarz inequality and C.8', C.9' imply that $E(\hat{H}_k \mid \hat{\theta}_k) = H(\hat{\theta}_k) + O(c_k^2)$. By $\hat{\theta}_k \to \theta^*$ a.s., the Cesaro summability arguments in (A13) yield the result to be proved.

Proof of Theorem 3a (M2-SPSA)

Beginning with the expansion $E(G_k(\hat{\theta}_k) \mid \hat{\theta}_k) = H(\bar{\theta}_k)(\hat{\theta}_k - \theta^*) + b_k$, where $\bar{\theta}_k$ is on the line segment between $\hat{\theta}_k$ and $\theta^*$ and the bias $b_k$ is defined in (A1), the estimation error can be represented in the notation of [19] as


$$\hat{\theta}_{k+1} - \theta^* = (I - k^{-\alpha}\Gamma_k)(\hat{\theta}_k - \theta^*) + k^{-(\alpha+\beta)/2}\,\Phi_k V_k + k^{-\alpha-\beta/2}\,\bar{H}_k^{-1}T_k$$

where $\Gamma_k = a\bar{H}_k^{-1}H(\bar{\theta}_k)$, $\Phi_k = -a\bar{H}_k^{-1}$, $V_k = k^{-\gamma}\big[G_k(\hat{\theta}_k) - E(G_k(\hat{\theta}_k) \mid \hat{\theta}_k)\big]$, and $T_k = -a\,k^{\beta/2}\,b_k$. The proof follows that of Spall [3, Proposition 2] closely, which shows that the three sufficient conditions for asymptotic normality in Fabian [19] hold. By the convergence of $\hat{\theta}_k$, it is straightforward to show a.s. convergence of $T_k$ to 0 if $3\gamma - \alpha/2 > 0$, or to $T$ in (2.37) if $3\gamma - \alpha/2 = 0$. The mean expression $\mu$ then follows directly from Fabian [19] and the convergence of $\bar{H}_k$ (and hence $\bar{H}_k^{-1}$) by C.11 and the existence of $H(\theta^*)^{-1}$. Further, as in Spall [3], $E(V_k V_k^T \mid \hat{\theta}_k)$ is a.s. convergent by C.2 and C.10, leading to the covariance matrix $\Omega$. This shows Fabian [19, (2.2.1) and (2.2.2)]. The final condition [19, (2.2.3)] follows as in Spall [3, Proposition 2] since the definition of $V_k$ is identical in both standard SPSA and M2-SPSA.

For reference, the expansion used in the Proof of Theorem 2a is

$$\begin{aligned}
E\big[G^{(1)}_{kl}(\hat{\theta}_k \pm c_k\Delta_k) \mid \hat{\theta}_k, \Delta_k\big]
&= E\left[\tilde{c}_k^{-1}\tilde{\Delta}_{kl}^{-1}\Big(\tilde{c}_k\, g(\hat{\theta}_k \pm c_k\Delta_k)^T\tilde{\Delta}_k
+ \tfrac{1}{2}\tilde{c}_k^{2}\,\tilde{\Delta}_k^T H(\hat{\theta}_k \pm c_k\Delta_k)\tilde{\Delta}_k\right.\\
&\qquad\left.{}+ \tfrac{1}{6}\tilde{c}_k^{3}\sum_{h,i,j} L^{(3)}_{hij}(\bar{\theta}_k^{\pm})\,\tilde{\Delta}_{kh}\tilde{\Delta}_{ki}\tilde{\Delta}_{kj}\Big)\ \Big|\ \hat{\theta}_k, \Delta_k\right]\\
&= g_l(\hat{\theta}_k \pm c_k\Delta_k) + \frac{\tilde{c}_k^{2}}{6}\, E\left[\tilde{\Delta}_{kl}^{-1}\sum_{h,i,j} L^{(3)}_{hij}(\bar{\theta}_k^{\pm})\,\tilde{\Delta}_{kh}\tilde{\Delta}_{ki}\tilde{\Delta}_{kj}\ \Big|\ \hat{\theta}_k, \Delta_k\right].
\end{aligned}\qquad \text{(A14)}$$


Proof of Theorem 3b (2SG) Analogous to the Proof of Theorem 3a, the estimation error can be represented as

$$\hat{\theta}_{k+1} - \theta^* = (I - k^{-\alpha}\Gamma_k)(\hat{\theta}_k - \theta^*) + k^{-\alpha}\,\Phi_k e_k$$

where $\Gamma_k = a\bar{H}_k^{-1}H(\bar{\theta}_k)$ and $\Phi_k = -a\bar{H}_k^{-1}$. Conditions (2.2.1) and (2.2.2) of Fabian [19] follow immediately by the smoothness of $L(\theta)$ (from C.3'), the convergence of $\hat{\theta}_k$ and $\bar{H}_k$, and C.12. Condition (2.2.3) of Fabian [19] follows by Hölder's inequality and C.2', C.3'.

Proof of Theorem 4a (Convergence in parameter estimation, M2-SPSA)

The convergence theorem for the proposed method is proven here based on RM-type stochastic approximation. In contrast to the RM-type stochastic approximation, the simultaneous perturbation stochastic approximation estimates the slope of the error function from values of the error function itself; the estimated slope therefore necessarily includes an error. In this proof, the nature of that estimation error is clarified, which allows the convergence of the parameter estimation algorithm to be established via the conventional RM-type stochastic approximation argument. In the proof below, subscripts that can be readily understood are omitted.

Let $\tilde{\phi} = \hat{\phi} - \phi$. Subtracting the true parameter value $\phi$ from both sides of (2.62) and then expanding and simplifying the right-hand side yields

$$\tilde{\phi}_{k+n} = \big(I - \rho\, z_{k+n-1} y_{k+n-1}^T\big)\tilde{\phi}_{k-1} + \rho\left\{ z_{k+n-1} e_{k+n} + \begin{bmatrix}\sigma^2 I_n & 0\\ 0 & 0\end{bmatrix}\hat{\phi}_{k-1} - \frac{1}{2}c\,\big(y_{k+n-1}^T s_{k-1}\big)^2 s_{k-1} \right\} \qquad \text{(B.1)}$$


results. Here, $z_{k+n-1}$ is given by

$$z_{k+n-1} = s_{k-1}s_{k-1}^T\, y_{k+n-1} = y_{k+n-1} + d_{k+n-1}.$$

Note that $d_{k+n-1}$ represents the difference between $s_{k-1}s_{k-1}^T\, y_{k+n-1}$ and $y_{k+n-1}$ and is given by the following equation, where $s_{,i}$ represents the $i$-th element of the sign vector $s_{k-1}$ at time $k-1$:

$$d_{k+n-1} = \begin{pmatrix}
y_{k-1}\,s_{,2}s_{,1} + \cdots + u_{k+n-1}\,s_{,n+m}s_{,1}\\
y_{k}\,s_{,1}s_{,2} + \cdots + u_{k+n-1}\,s_{,n+m}s_{,2}\\
\vdots\\
y_{k}\,s_{,1}s_{,n+m} + \cdots + u_{k+n-2}\,s_{,n+m-1}s_{,n+m}
\end{pmatrix}. \qquad \text{(B.2)}$$
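Each entry of (B.2) is a sum of products of two distinct sign elements, and, as noted just below, such products have zero mean. A quick Monte Carlo sketch (illustrative dimensions and sample count, not the thesis's setup) confirms that $E[s\,s^T] = I$ for independent $\pm 1$ signs, so $E\{z\} = E\{y\}$ and $E\{d\} = 0$.

```python
import numpy as np

rng = np.random.default_rng(2)
p, N = 5, 100_000                          # illustrative dimension and sample count
s = rng.choice([-1.0, 1.0], size=(N, p))   # independent +/-1 sign vectors
print(np.round(s.T @ s / N, 3))            # approximately the identity: E[s s^T] = I
# Off-diagonal entries (the s_i s_j products with i != j that build d) average to ~0.
```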

Caution is required here: the product of $s_{,i}$ and $s_{,j}$ in each term of each element of $d$ is a product of mutually distinct elements of the sign vector $s_{k-1}$; in other words, since $i \neq j$, each such product has expected value 0.

At this point, based on Eq. (B.1), $\|\tilde{\phi}_{k+n}\|^2$ is calculated as follows:

$$\begin{aligned}
\|\tilde{\phi}_{k+n}\|^2 &= \tilde{\phi}_{k+n}^T\tilde{\phi}_{k+n}\\
&= \left\|\tilde{\phi}_{k-1} + \rho\left(-zy^T\tilde{\phi}_{k-1} + ze + \begin{bmatrix}\sigma^2 I_n & 0\\ 0 & 0\end{bmatrix}\hat{\phi}_{k-1} - \frac{1}{2}c\,(y^Ts)^2 s\right)\right\|^2\\
&= \tilde{\phi}_{k-1}^T\big(I - \rho\, zy^T - \rho\, yz^T\big)\tilde{\phi}_{k-1}
+ 2\rho\,\tilde{\phi}_{k-1}^T\left\{ ze + \begin{bmatrix}\sigma^2 I_n & 0\\ 0 & 0\end{bmatrix}\hat{\phi}_{k-1} - \frac{1}{2}c\,(y^Ts)^2 s\right\}
+ \rho^2 h.
\end{aligned}\qquad \text{(B.3)}$$


However,

$$h = \left\| -zy^T\tilde{\phi}_{k-1} + ze + \begin{bmatrix}\sigma^2 I_n & 0\\ 0 & 0\end{bmatrix}\hat{\phi}_{k-1} - \frac{1}{2}c\,(y^Ts)^2 s \right\|^2. \qquad \text{(B.4)}$$

Finally, the expected value of Eq. (B.3) is found conditionally on $\tilde{\phi}_{k-1} = \beta$. Before this, though, each term in the equation is evaluated. First, the conditional expected value of the $zy^T$ term must be considered:

$$E\{zy^T \mid \tilde{\phi}_{k-1} = \beta\} = E\{yy^T \mid \beta\} + E\{dy^T \mid \beta\}. \qquad \text{(B.5)}$$

Here, the second term on the right is 0 based on the signed vector condition (B11). Therefore, only the first term on the right needs to be considered, and so the following equation results:

$$\begin{aligned}
E\{yz^T \mid \beta\} = E\{yy^T \mid \beta\}
&= E\left\{\begin{pmatrix}x\\ u\end{pmatrix}\begin{pmatrix}x^T & u^T\end{pmatrix} + \begin{pmatrix}\upsilon\\ 0\end{pmatrix}\begin{pmatrix}\upsilon^T & 0^T\end{pmatrix}\ \Big|\ \beta\right\}\\
&= E\left\{\begin{bmatrix}xx^T & xu^T\\ ux^T & uu^T\end{bmatrix}\ \Big|\ \beta\right\} + \begin{bmatrix}\sigma^2 I_n & 0\\ 0 & 0\end{bmatrix}.
\end{aligned}\qquad \text{(B.6)}$$
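Equation (B.6) says that the second-moment matrix of the measurement vector splits into the noise-free block matrix and a $\sigma^2$ block contributed by the observation noise. The following Monte Carlo sketch checks that split numerically; the dimensions, distributions, and variable names are illustrative assumptions rather than the thesis's system model.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, N, sigma = 3, 2, 200_000, 0.5
x = rng.standard_normal((N, n))              # noise-free regressor block
u = rng.standard_normal((N, m))              # input block
ups = sigma * rng.standard_normal((N, n))    # observation noise (enters the x-block only)

y = np.hstack([x + ups, u])                  # y = (x; u) + (upsilon; 0)
Eyy = y.T @ y / N                            # sample E{y y^T}

w = np.hstack([x, u])
D = w.T @ w / N                              # sample of the noise-free block matrix
Sigma = np.zeros((n + m, n + m))
Sigma[:n, :n] = sigma**2 * np.eye(n)         # [sigma^2 I_n, 0; 0, 0]

print(np.max(np.abs(Eyy - (D + Sigma))))     # small (Monte Carlo error only)
```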

In the same fashion, for $E\{yz^T \mid \tilde{\phi}_{k-1} = \beta\}$ the same result as in Eq. (B.6) is obtained. In addition, the conditional expected value of the $ze$ term given $\tilde{\phi}_{k-1} = \beta$ must be considered:

$$E\{ze \mid \beta\} = E\{ye \mid \beta\} + E\{de \mid \beta\}. \qquad \text{(B.7)}$$


Because the second term in the equation above is 0 based on the condition (B11), only the first term needs to be considered. The first term is given by (2.52), and so the following equation results:

$$E\{ze \mid \beta\} = -\begin{bmatrix}\sigma^2 I_n & 0\\ 0 & 0\end{bmatrix}\phi. \qquad \text{(B.8)}$$

Now let us consider $h$ as represented in Eq. (B.4). Although a similar discussion can be found in [15], only the fourth term on the right varies as a result of the perturbation. Expanding Eq. (B.4) reveals a term affected by $(y^Ts)^2 s$ and a term affected by its square. The term multiplied by $(y^Ts)^2 s$ takes 0 for its expected value based on the signing condition (B11). The latter term involves a fourth-order moment of $y$. When the assumption (C11) of the boundedness of the fourth-order moments of the stochastic input $u$ and the observation noise $v$, and the assumption (A12) of the boundedness of the perturbation, are taken into consideration, from Eq. (B.4) we have the following inequality for appropriate constants $0 \le \alpha_1, \alpha_2 < \infty$:

$$E\{h \mid \tilde{\phi}_{k-1} = \beta\} \le \alpha_1\|\beta\|^2 + \alpha_2. \qquad \text{(B.9)}$$

Given the above relationships, the conditional expectation of (B.3) given $\tilde{\phi}_{k-1} = \beta$ satisfies the following:

$$\begin{aligned}
E\big\{\|\tilde{\phi}_{k+n}\|^2 \mid \tilde{\phi}_{k-1} = \beta\big\}
&\le \beta^T(I - 2\rho D)\beta - 2\rho\,\beta^T\begin{bmatrix}\sigma^2 I_n & 0\\ 0 & 0\end{bmatrix}\beta
+ 2\rho\,\beta^T\left\{-\begin{bmatrix}\sigma^2 I_n & 0\\ 0 & 0\end{bmatrix}\phi + \begin{bmatrix}\sigma^2 I_n & 0\\ 0 & 0\end{bmatrix}\hat{\phi}\right\}
+ \rho^2\big(\alpha_1\|\beta\|^2 + \alpha_2\big)\\
&= \beta^T(I - 2\rho D)\beta + \rho^2\alpha_1\|\beta\|^2 + \rho^2\alpha_2
\end{aligned}\qquad \text{(B.10)}$$

(the $\sigma^2$ cross terms cancel because $\hat{\phi} - \phi = \beta$),


where

$$D = E\begin{bmatrix}xx^T & xu^T\\ ux^T & uu^T\end{bmatrix}.$$

Based on the condition (C11), $D$ is a symmetric positive definite matrix and has a minimum eigenvalue $\lambda > 0$. Therefore, we can obtain (2.52) by using

$$E\big\{\|\tilde{\phi}_{k+n}\|^2\big\} \le \big(1 - 2\rho\lambda + \rho^2\alpha_1\big)\, E\big\{\|\tilde{\phi}_{k-1}\|^2\big\} + \rho^2\alpha_2. \qquad \text{(B.11)}$$

The above equation returns us to the proof [15] of the convergence theorem for the parameter estimation algorithm using the Robbins–Monro stochastic approximation. Therefore, under the condition (A11) on the gain coefficient,

$$\lim_{k\to\infty} E\big\{\|\hat{\phi}_k - \phi\|^2\big\} = 0$$

holds.
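The scalar recursion (B.11) makes the role of the gain condition visible numerically. Below is a minimal sketch iterating the bound with an illustrative decaying gain $\rho_k = 0.5/k$; the constants $\lambda$, $\alpha_1$, $\alpha_2$ are placeholders, not values derived in the thesis.

```python
# Iterate the bound (B.11): m <- (1 - 2*rho*lam + rho**2 * a1) * m + rho**2 * a2,
# with a decaying gain as required by condition (A11).
lam, a1, a2 = 0.8, 1.5, 2.0      # placeholders: min eigenvalue of D and the (B.9) constants
m = 10.0                          # initial E{ ||phi_tilde||^2 }
for k in range(1, 100001):
    rho = 0.5 / k                 # illustrative gain sequence
    m = (1.0 - 2.0 * rho * lam + rho**2 * a1) * m + rho**2 * a2
print(m)                          # decays toward 0, consistent with the limit above
```

With a constant gain $\rho$ the same recursion settles at a nonzero floor of order $\rho\alpha_2/(2\lambda)$; it is the decay of $\rho_k$ that drives the mean-squared error all the way to zero.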



Appendix B

Interpretation of Regularity Conditions

This Appendix provides comments on some of the conditions of ASP relative to other adaptive SA approaches. In the confines of a short discussion, it is obviously not possible to provide a detailed discussion of all conditions of all known adaptive approaches. Nevertheless, we hope to convey a flavor of the relative nature of the conditions.

As discussed in Sec. 2.9, some of the conditions of ASP depend on $\hat{\theta}_k$ itself, creating a type of circularity (i.e., direct conditions on the quantity being analyzed). This circularity has been discussed elsewhere, since other SA algorithms also have dependent conditions. Some of the ASP conditions can be eliminated or simplified if the conditions of the lemma in Sec. 2.9 hold. The foremost lemma condition is that $\hat{\theta}_k$ be uniformly bounded. Of course, this uniform boundedness condition is itself a circular condition, but it helps to simplify the other conditions of the theorems that are dependent on $\hat{\theta}_k$, since the $\hat{\theta}_k$ dependence can be replaced by an assumption that these other conditions hold uniformly over all $\theta$ in the bounded set guaranteed to contain $\hat{\theta}_k$ (e.g., the current assumption C.3, that $g(\theta)$ be twice continuously differentiable in neighborhoods of the estimates $\hat{\theta}_k$, can be replaced by an assumption that $g(\theta)$ is twice continuously differentiable on some bounded set known to contain $\hat{\theta}_k$). If the lemma applies, condition C.5 (on the i.o. behavior of $\hat{\theta}_k$) is unnecessary.

In showing convergence and asymptotic normality, one might wonder whether other adaptive algorithms could avoid conditions that depend on $\hat{\theta}_k$, and avoid alternative conditions that are similarly undesirable. Based on currently available adaptive approaches, the answer appears to be "no." As an illustration, let us analyze one of the more powerful results on adaptive algorithms, the result in Wei [48].


The Wei [48] approach is restricted to the SG/root-finding setting, as opposed to the more general setting for ASP that encompasses both gradient-free and SG/root-finding problems. The approach is based on 2p measurements of $g(\theta)$ at each iteration to estimate the Jacobian (Hessian) matrix. Some of the conditions in Wei [48] are similar to conditions for ASP (e.g., decaying gain sequences and smoothness of the functions involved), while other conditions are more stringent (the restriction to only the root-finding setting and the requirement for i.i.d. measurement noise). There are also conditions in ASP that are not required in Wei [48], principally those associated with "nice" behavior of the user-specified quantities (bounded moments, etc.), the steepness conditions C.4 and C.7 (similar to standard conditions in some other adaptive approaches, e.g., Ruppert [14]), and limits on the amount of bouncing in "big steps" around the solution (the i.o. condition C.5). An additional key assumption in Wei [48] is the symmetric function condition on the Jacobian (or Hessian) matrix:

$$H(\theta)^T H(\theta') + H(\theta')^T H(\theta) > 0, \quad \forall\, \theta, \theta'. \qquad \text{(D.1)}$$

This, unfortunately, is a stringent condition that may be easily violated. In the optimization case (where $H$ is a Hessian), this condition may fail even for benign (e.g., convex) loss functions. Consider, for example, a case with $\theta = (x, y)^T$ and a simple convex loss function $L(\theta) = x^4 + x^2 + y^2 + xy$. Letting $\theta = (0, 0)^T$ and $\theta' = (2, 0)^T$, we have

$$H(\theta)H(\theta')^T + H(\theta')H(\theta)^T = \begin{bmatrix}202 & 56\\ 56 & 10\end{bmatrix}$$
which is not positive definite, violating condition (D.1). Aside from the fact that this condition may be easily violated, it is also generally impossible to check in practice because it requires knowledge of the true $H(\theta)$ over the whole domain; this, of course, is the very quantity that is being estimated. The requirement for such prior knowledge is also apparent in other adaptive approaches discussed in Ruppert [14] and Fabian [19]. Given the above, it is clear that neither ASP nor Wei [48] (nor others) have uniformly "easier" conditions for their respective approaches. The inherent difficulty in establishing theoretical properties of adaptive approaches comes from the need to couple the estimates for the parameters of interest and for the Hessian/Jacobian matrix.
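The counterexample above is easy to verify numerically. The short sketch below (an illustrative check in NumPy, with symbols as in the text) rebuilds the two Hessians of $L(\theta) = x^4 + x^2 + y^2 + xy$ and confirms that the symmetrized product in (D.1) has a negative eigenvalue.

```python
import numpy as np

def H(x, y):
    """Hessian of L(theta) = x**4 + x**2 + y**2 + x*y at theta = (x, y)."""
    return np.array([[12.0 * x**2 + 2.0, 1.0],
                     [1.0,               2.0]])

A = H(0, 0) @ H(2, 0).T + H(2, 0) @ H(0, 0).T   # left-hand side of (D.1)
print(A)                                         # [[202. 56.] [56. 10.]]
print(np.linalg.eigvalsh(A))                     # one eigenvalue is negative
```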


This tends to lead to nontrivial regularity conditions, as seen in the $\hat{\theta}_k$-dependent conditions of ASP and in the stringent conditions that have appeared in the literature for other approaches. There appear to be no easy conditions for establishing rigorous properties of adaptive algorithms. However, given that all of these approaches have a strong intuitive appeal based on analogies to deterministic optimization, the needs of practical users will focus less on the nuances of the regularity conditions and more on the cost of implementation (e.g., the number of function measurements needed), the ease of implementation, and the practical performance.



List of Publications Directly Related to the Dissertation

1) Jorge Medina Martínez, Mariko Nakano Miyatake, Kazushi Nakano, Héctor Pérez Meana: Low Complexity Cascade Lattice IIR Adaptive Filter Algorithms using Simultaneous Perturbations Approach, WSEAS Transactions on Communications, Vol. 10, No. 10, pp. 1058-1068 (2005). (Related to the contents of Chap. 4.)

2) Jorge Ivan Medina Martinez, Kazushi Nakano, Kohji Higuchi: Parameter Estimation using a Modified Version of SPSA Algorithm Applied to State Space Models, IEEJ Transactions on Industry Applications, Vol. 129, No. 12/Sec. D (2009). (Related to the contents of Chap. 5.)

3) Jorge Ivan Medina Martinez, Kazushi Nakano, Sawut Umerujan: Vibration Suppression Control of a Flexible Arm using Non-linear Observer with Simultaneous Perturbation Stochastic Approximation, Journal of Artificial Life and Robotics, Vol. 14 (2009). (Related to the contents of Chap. 3.)

4) Jorge Ivan Medina Martinez, Kazushi Nakano, Kohji Higuchi: New Approach for IIR Adaptive Lattice Filter Structure using Simultaneous Perturbation Algorithm, IEEJ Transactions on Industry Applications, Vol. 130, No. 4/Sec. D (2010). (Related to the contents of Chap. 4.)

List of Other Publications and Presentations

- Presentations at International Symposia

1) Jorge Ivan Medina Martinez, Kazushi Nakano: Neural Control of a Flexible Arm System using Simultaneous Perturbation Method, SICE 7th Annual Conference on Control Systems, March 6-8, 2007, Chofu, Tokyo, Japan.


2) Jorge Ivan Medina Martinez, Kazushi Nakano, Sawut Umerujan: Simultaneous Perturbation Approach to Neural Control of a Flexible System, ECTI-CON 2007, Mae Fah Luang University, Chiang Rai, Thailand, May 9-12, 2007.

3) Jorge Ivan Medina Martinez, Kazushi Nakano, Sawut Umerujan: Cascade Lattice IIR Adaptive Filter Structure using Simultaneous Perturbation Method for Self-Adjusting SHARF Algorithm, International Conference on Instrumentation, Control and Information Technology (SICE Annual Conference 2008), Aug. 20-22, The University of Electro-Communications, Chofu, Tokyo, Japan. (Related to the contents of Chap. 5.)

4) Jorge Ivan Medina Martinez, Sawut Umerujan, Kazushi Nakano: Application of Non-linear Observer with Simultaneous Perturbation Stochastic Approximation Method to Single Flexible Link SMC, International Conference on Instrumentation, Control and Information Technology (SICE Annual Conference 2008), Aug. 20-22, The University of Electro-Communications, Chofu, Tokyo, Japan. (Related to the contents of Chap. 4.)

5) Jorge Ivan Medina Martinez, Sawut Umerujan, Kazushi Nakano: Vibration Suppression Control of a Flexible Arm using Non-linear Observer with Simultaneous Perturbation Stochastic Approximation, The Fourteenth International Symposium on Artificial Life and Robotics (AROB 14th '09), Feb. 5-7, 2009, B-Con Plaza, Beppu, Oita, Japan. (Related to the contents of Chap. 4.)

6) Jorge Ivan Medina Martinez, Kazushi Nakano, Kohji Higuchi: Parameters Estimation in Neural Networks by Improved Version of Simultaneous Perturbation Stochastic Approximation Algorithm, ICCAS-SICE 2009, August 18-21, 2009, Fukuoka, Japan.


- Other Publications, Presentations and Submissions

1) Jorge Ivan Medina Martinez, Kazushi Nakano: Development of an IIR Adaptive Filter with Low Computational Complexity using Simultaneous Perturbation Method, 2nd KMUTT-UEC Workshop, May 14, 2007, King Mongkut's University of Technology Thonburi, Bangkok, Thailand.

2) Jorge Ivan Medina Martinez, Kazushi Nakano: A Fast Converging and Self-Adjusting SHARF Algorithm using Simultaneous Perturbation Method and Vibration Control of a Flexible Arm using Non-linear Observer with Simultaneous Perturbation Stochastic Approximation Method, 3rd KMUTT-UEC Workshop, August 19, 2008, The University of Electro-Communications, Chofu, Tokyo, Japan.



Acknowledgements

This dissertation is a summary of my doctoral study at the Department of Electronic Engineering of the University of Electro-Communications. This work would not have been accomplished without the help of many people; the following is a brief account of some, but not all, who deserve my thanks.

I would like to extend my deepest thanks to Prof. Kazushi Nakano for taking on the burden of supervising my research work in his laboratory for so long, right from the beginning in October 2006 up to the conclusion of this work in December 2009. It has been my pleasure to do this research under his supervision, and I have also greatly enjoyed the life of research work.

My special thanks are due to all the reviewers:

Prof. Kohji Higuchi
Prof. Masahide Kaneko
Prof. Tetsuro Kirimoto
Prof. Takayuki Inaba
Prof. Seiichi Shin

Also, my special thanks go to our research group, both past and present, for their helpful cooperation over the years. They have all been very kind to me and provided a nice and friendly environment during these years.

My gratitude goes to the Ministry of Education, Science and Culture of Japan, which granted me this opportunity and financially supported this work. I am thankful to the administrative staff of the Department of Electronic Engineering and the Foreign Students Affairs Office at the University of Electro-Communications for their amiability and effective support.

Finally, I would like to give special thanks to my family and friends for their love, warm support and encouragement.



Author Biography

Jorge Ivan Medina Martinez was born in Mexico City, Mexico, on April 23, 1978. He received the Master of Science degree from the National Polytechnic Institute, Mexico City, Mexico, in 2005. Since 2006, he has been with the Department of Electronic Engineering of the University of Electro-Communications, Tokyo, Japan, working toward his Ph.D. degree. His research interests include signal processing and control using SPSA.
interests include signal processing and control using <strong>SPSA</strong>.<br />

165
