Approximation of Hessian Matrix for
Second-order SPSA Algorithm Addressed
Toward Parameter Optimization
in Non-linear Systems

JORGE IVAN MEDINA MARTINEZ

Doctoral Program in Electronic Engineering
Graduate School of Electro-Communications
The University of Electro-Communications

A thesis submitted for the degree of
DOCTOR OF ENGINEERING

The University of Electro-Communications
December 2009
Approximation of Hessian Matrix for
Second-order SPSA Algorithm Addressed
Toward Parameter Optimization
in Non-linear Systems

Approved by Supervisory Committee:

Chairperson : Prof. Kazushi Nakano
Member : Prof. Kohji Higuchi
Member : Prof. Masahide Kaneko
Member : Prof. Tetsuro Kirimoto
Member : Prof. Takayuki Inaba
Member : Prof. Seiichi Shin
Copyright 2009 by Jorge Ivan Medina Martinez
All Rights Reserved
Approximation of Hessian Matrix for
Second-order SPSA Algorithm Addressed
Toward Parameter Optimization
in Non-linear Systems
(2次型同時摂動確率近似アルゴリズムのヘッセ行列推定とその非線形システムにおけるパラメータ最適化への応用)

Jorge Ivan Medina Martinez
— Abstract in Japanese —

The system identification problem, when the structure of the system is known, reduces to the problem of estimating unknown parameters from noisy observation data. Recently, non-linear models have come into wide use for state estimation, control, and simulation, and, motivated by the success of non-linear model predictive control, the refinement of models based on first principles or on neural networks has been actively discussed. The identification problem for such non-linear and complex systems reduces to optimizing some error function with respect to a large number of unknown parameters, and efficient optimization methods for this purpose are required.

Many algorithms have been proposed to this end, but when they are applied to complex systems with a large number of parameters, such as non-linear state-space models, they incur an enormous computational cost. Focusing on the facts that existing algorithms do not have sufficient stability for parameter estimation in complex systems and that their computation is complicated and costly, this dissertation proposes a new estimation algorithm. We first focus on the simultaneous perturbation stochastic approximation (SPSA) algorithm, which is advantageous in computational complexity and cost, is easy to implement, and exhibits stable convergence. When applied to complex systems, however, it runs into several problems. We therefore develop a new method that improves the SPSA algorithm while retaining its stable convergence and its computational advantages; that is, we improve the SPSA algorithm based on a comparison of the first-order SPSA (1st-SPSA) and second-order SPSA (2nd-SPSA) methods from the viewpoint of the Hessian of the error function.

The algorithm proposed here (the modified SPSA) removes the non-positive definiteness of an ill-conditioned Hessian and, using the Fisher information matrix to guarantee positive definiteness, adopts a procedure that suppresses the error amplification caused by inverting an ill-conditioned Hessian. It also brings a substantial improvement in convergence even for methods whose Hessian is well conditioned. Asymptotically, the ratio of the mean square error of the modified SPSA method to that of the 2nd-SPSA method is shown to be smaller in every case except that of a perfectly conditioned Hessian. Furthermore, by estimating only the diagonal elements of the Hessian, a large reduction in computational cost is achieved compared with other methods.

In the modified SPSA method as well, all parameters are perturbed simultaneously, so the parameters can be updated with only two evaluations of the error function regardless of the parameter dimension. A large reduction in computational cost is thus possible with this SPSA algorithm. This dissertation gives a convergence theorem for the proposed algorithm and carries out simulations to demonstrate the feasibility of parameter estimation with it.

Finally, three practical applications of the proposed method are considered. The first is the angle control of a one-link flexible arm aimed at vibration suppression. For this control objective, a model reference sliding mode control (MR-SMC) method using a non-linear VSS (variable structure system) observer is proposed. The parameters of the non-linear observer are optimized with the modified 2nd-SPSA algorithm proposed here, and the design of the MR-SMC controller is discussed as well. The effectiveness of this method is confirmed by vibration control simulations. The second is an application to adaptive IIR filter algorithms, corresponding to the SHARF (simple hyperstable adaptive recursive filter) and SM (Steiglitz-McBride) algorithms; the coefficient parameters of the output-error-based identification filter are obtained with the proposed modified 2nd-SPSA algorithm, and its effectiveness is shown by comparison with stochastic approximation (SA) algorithms. The last example applies the modified SPSA algorithm to the problem of estimating unknown static parameters in non-linear state-space systems. The proposed algorithm yields maximum likelihood estimates, and its performance is verified through comparison with the finite difference stochastic approximation (FDSA) algorithm.
Approximation of Hessian Matrix for
Second-order SPSA Algorithm Addressed
Toward Parameter Optimization
in Non-linear Systems

Jorge Ivan Medina Martinez

Abstract
The research presented in this dissertation is motivated by the fact that many widely used algorithms do not offer sufficient stability when estimating a large number of parameters in non-linear or other kinds of systems, and that they also have high computational complexity and cost. We have therefore chosen the simultaneous perturbation stochastic approximation (SPSA) algorithm, which has several important advantages, such as low computational complexity and stable convergence. Nevertheless, the typical SPSA algorithm runs into difficulties when it is applied to non-linear and complex systems. This research therefore proposes a novel extension of the SPSA algorithm, based on the features and disadvantages of the first-order and second-order SPSA (1st-SPSA and 2nd-SPSA) algorithms and on comparisons made from the perspective of the Hessian of the loss function. These comparisons matter because, at finite iterations, the convergence rate depends on the conditioning of the loss-function Hessian. It is shown that 2nd-SPSA converges more slowly for a loss function with an ill-conditioned Hessian than for one with a well-conditioned Hessian, whereas the convergence rate of 1st-SPSA is less sensitive to the conditioning of the loss-function Hessian.
A main disadvantage shared by the 1st-SPSA and 2nd-SPSA algorithms is that the error for a loss function with an ill-conditioned Hessian is greater than for one with a well-conditioned Hessian. Our proposed modified version of 2nd-SPSA (M2-SPSA) eliminates the error amplification caused by the inversion of an ill-conditioned Hessian at finite iterations, which leads to significant improvements in the convergence rate for problems with an ill-conditioned Hessian matrix and for complex systems. Asymptotically, the efficiency analysis shows that our proposed SPSA is also superior to 2nd-SPSA in terms of its convergence rate coefficients. It is
shown that, for the same asymptotic convergence rate, the ratio of the mean square errors of our proposed SPSA to those of 2nd-SPSA is always less than one, except for a perfectly conditioned Hessian. We also propose reducing the computational expense by estimating only the diagonal elements of the Hessian matrix. In this research, a new mapping is suggested for the 2nd-SPSA algorithm that eliminates the non-positive-definite part while preserving key spectral properties of the estimated Hessian, using the Fisher information matrix. After defining the M2-SPSA algorithm, we apply it to parameter estimation. Because M2-SPSA perturbs all parameters simultaneously, it can update the parameters with only two measurements of an evaluation function, regardless of the dimension of the parameter vector. A convergence theorem for the proposed algorithm is presented, and a simulation result also demonstrates the feasibility of the identification scheme proposed here. To show the efficiency of M2-SPSA, we present three important applications in which the efficiency of our proposed algorithm for estimating and designing parameters can be seen.
In the first application, our proposed algorithm is directed toward control, in this case the reduction of vibration in the model considered here. The main objective is vibration control of a one-link flexible arm system. A variable structure system (VSS) non-linear observer is proposed in order to reduce the oscillation when controlling the angle of the flexible arm. The non-linear observer parameters are optimized using a modified version of the SPSA algorithm, which is especially useful when the number of parameters to be adjusted is large and makes it possible to estimate them very efficiently. For the vibration and position control, a model reference sliding mode control (MR-SMC) is presented, and our proposed M2-SPSA algorithm obtains the parameters of the MR-SMC method. The simulations show that vibration control of a one-link flexible arm system can be achieved more efficiently using our proposed methods.
In the second application, our proposed algorithm is directed toward signal processing, in this case IIR lattice filters. Adaptive infinite impulse response (IIR), or recursive, filters are less attractive mainly because of stability issues and the difficulties associated with their adaptive algorithms. In this research, adaptive IIR lattice filters are therefore studied in order to devise algorithms that preserve the stability properties of the corresponding direct-form schemes. We analyze the local properties of the stationary points, and a transformation achieving this goal is suggested, which
yields algorithms that can be implemented efficiently. Application to the Steiglitz-McBride (SM) and Simple Hyperstable Adaptive Recursive Filter (SHARF) algorithms is presented. Our proposed M2-SPSA algorithm is also used to obtain the coefficients in lattice form more efficiently and with lower computational cost and complexity. The results are compared with previous lattice versions of these algorithms, which may fail to preserve the stability of the stationary points.
Finally, the M2-SPSA algorithm is addressed to the problem of estimating unknown static parameters in non-linear state-space models. The M2-SPSA algorithm can generate maximum likelihood estimates efficiently. The performance of the proposed algorithm is assessed through simulation, in which M2-SPSA is compared with finite difference stochastic approximation (FDSA) to show its efficiency.
In summary, this dissertation proposes a modification of the SPSA algorithm whose main objectives are to estimate the parameters of complex systems, improve convergence, and reduce computational cost. This modification of the simultaneous perturbation approach seems particularly useful when the number of parameters to be identified is very large, or when the observed values of the quantity to be identified can only be obtained through an unknown observation system.
Finally, this dissertation is organized as follows. Chapter 1 gives an introduction to SPSA, explaining its main concepts, advantages, disadvantages, recursions, formulation, and implementation. Our proposed SPSA algorithm is analyzed in detail in Chap. 2, where the asymptotic normality, the Hessian estimation, and the relative efficiency of M2-SPSA and the previous versions of SPSA are shown. In addition, we show how the M2-SPSA algorithm is applied to parameter estimation and demonstrate its efficiency in several simple numerical simulations. The first important application of the M2-SPSA algorithm, in the control area, is described in Chap. 3, where M2-SPSA is applied to parameter estimation for several methods of controlling the vibration of the proposed system. Another application of the M2-SPSA algorithm is described in Chap. 4, where our proposed algorithm is applied to signal processing and M2-SPSA computes the coefficients of several adaptive algorithms. In the final application, described in Chap. 5, the M2-SPSA algorithm is addressed to the problem of estimating unknown static parameters in non-linear state-space models. Finally, the conclusions and future work are given in Chap. 6.
Contents

1. Introduction 1
   1.1 Motivation and Background 1
       1.1.1 Motivation 1
       1.1.2 Background 2
   1.2 Overview of Stochastic Algorithms 5
   1.3 Introduction to SPSA Algorithm 7
   1.4 Features of SPSA 10
   1.5 Application Areas 11
   1.6 Formulation of SPSA Algorithm 12
   1.7 Basic Assumptions of SPSA Algorithm 14
   1.8 Versions of SPSA Algorithms 15
2. Proposed SPSA Algorithm 19
   2.1 Overview of Modified 2nd-SPSA Algorithm 19
   2.2 SPSA Algorithm Recursions 20
   2.3 Proposed Mapping 22
   2.4 Description of Proposed SPSA Algorithm 26
   2.5 Asymptotic Normality 27
   2.6 Fisher Information Matrix 31
       2.6.1 Introduction to Fisher Information Matrix 31
       2.6.2 Two Key Properties of the Information Matrix: Connections to Covariance Matrix of Parameter Estimates 33
       2.6.3 Estimation of F(θn) 34
   2.7 Efficiency Between 1st-SPSA, 2nd-SPSA and M2-SPSA 40
   2.8 Implementation Aspects 41
   2.9 Strong Convergence 44
   2.10 Asymptotic Distribution and Efficiency Analysis 50
   2.11 Perturbation Distribution for M2-SPSA 54
   2.12 Parameter Estimation 57
       2.12.1 Introduction 57
       2.12.2 System to be Applied 64
       2.12.3 Convergence Theorem 69
   2.13 Simulation 70
       2.13.1 Simulation 1 70
       2.13.2 Simulation 2 72
       2.13.3 Simulation 3 75
3. Vibration Suppression Control of a Flexible Arm using Non-linear Observer with SPSA 79
   3.1 Introduction 79
   3.2 Dynamic Modeling of a Single Link Robot Arm 81
       3.2.1 Dynamic Model 81
       3.2.2 Equation of Motion and State Equations 84
   3.3 Design of Non-Linear Observer 85
   3.4 Model Reference Sliding Mode Controller 87
   3.5 Simulation 91
4. Lattice IIR Adaptive Filter Structure Adapted by SPSA Algorithm 99
   4.1 Introduction 99
   4.2 Procedure of Improved Algorithm 101
   4.3 Lattice Structure 104
   4.4 Adaptive Algorithm 105
       4.4.1 SHARF Algorithm 105
       4.4.2 Steiglitz-McBride Algorithm 108
   4.5 Simulation 109
       4.5.1 SHARF Algorithm 109
       4.5.2 Steiglitz-McBride Algorithm 110
5. Parameter Estimation using a Modified Version of SPSA Algorithm Applied to State-Space Models 113
   5.1 Introduction 113
   5.2 Implementation of SPSA Toward Proposed Model 115
       5.2.1 State-Space Model 115
       5.2.2 Gradient-free Maximum Likelihood Estimation 118
   5.3 Parameter Estimation by SPSA and FDSA 120
   5.4 Simulation 122
6. Conclusions and Future Work 125
   6.1 Conclusions 125
   6.2 Future Work 129
References 131
Appendix A 139
Appendix B 155
List of Publications Directly Related to the Dissertation 159
Acknowledgements 163
Author Biography 165
List <strong>of</strong> Figures<br />
Fig. 1.1 Example <strong>of</strong> stochastic optimization algorithm minimizing loss function L θ 1<br />
θ ) 3<br />
(<br />
, 2<br />
Fig. 1.2 Per<strong>for</strong>mance <strong>of</strong> <strong>SPSA</strong> algorithm (two measurements). 9<br />
Fig. 2.1 The two-recursions in 2nd-<strong>SPSA</strong> <strong>Algorithm</strong> 21<br />
Fig. 2.2 Diagram <strong>of</strong> method <strong>for</strong> <strong>for</strong>ming estimate F ( )<br />
39<br />
M , N<br />
θ<br />
Fig. 2.3 Split uniform distribution 56
Fig. 2.4 Inverse split uniform distribution 57
Fig. 2.5 Symmetric double triangular distribution 57
Fig. 2.6 Identification with an unknown observation system 65
Fig. 2.7 Identification results (with bias compensation) 75
Fig. 2.8 Identification results (without bias compensation) 76
Fig. 3.1 One-link flexible arm 82
Fig. 3.2 Sliding mode surface 88
Fig. 3.3 Block diagram of the sliding mode control system incorporating the non-linear observer 91
Fig. 3.4 Motor angle. Without M2-SPSA and MR-SMC (dotted line). With RM-SA algorithm and MR-SMC (dashed line). With LS algorithm and MR-SMC (dash-dot line). With M2-SPSA and MR-SMC (solid line) 94
Fig. 3.5 Tip position. Without M2-SPSA and MR-SMC (dotted line). With RM-SA algorithm and MR-SMC (dashed line). With LS algorithm and MR-SMC (dash-dot line). With M2-SPSA and MR-SMC (solid line) 95
Fig. 3.6 Tip velocity. Without M2-SPSA and MR-SMC (dotted line). With RM-SA algorithm and MR-SMC (dashed line). With LS algorithm and MR-SMC (dash-dot line). With M2-SPSA and MR-SMC (solid line) 95
Fig. 3.7 Control torque. Without M2-SPSA and MR-SMC (dotted line). With RM-SA algorithm and MR-SMC (dashed line). With LS algorithm and MR-SMC (dash-dot line). With M2-SPSA and MR-SMC (solid line) 96
Fig. 3.8 Motor angle. Simulation using x1 with M2-SPSA and MR-SMC (solid line). Simulation using xm with M2-SPSA and MR-SMC (dashed line) 96
Fig. 3.9 Tip position. Simulation using x3 with M2-SPSA and MR-SMC (solid line). Simulation using x̂3 with M2-SPSA and MR-SMC (dashed line) 96
Fig. 3.10 Tip velocity. Simulation using x4 with M2-SPSA and MR-SMC (solid line). Simulation using x̂4 with M2-SPSA and MR-SMC (dashed line) 97
Fig. 4.1 Block diagram <strong>of</strong> the SHARF lattice algorithm 107<br />
Fig. 4.2 Block diagram <strong>of</strong> the SM lattice algorithm 109<br />
Fig. 4.3 Convergence <strong>of</strong> the proposed SHARF algorithm and M2-<strong>SPSA</strong> 111<br />
Fig. 4.4 Instability <strong>of</strong> the existing SHARF algorithm 111<br />
Fig. 4.5 Instability <strong>of</strong> the existing SM algorithm 112<br />
Fig. 4.6 Convergence <strong>of</strong> the proposed SM algorithm and M2-<strong>SPSA</strong> 112<br />
Fig. 5.1 ML parameter estimate θk = [θ1,k, θ2,k, θ3,k]T for the bi-modal non-linear model using M2-SPSA. The true parameters in the model are defined by θ* = [0.5, 25, 8]T 122
Fig. 5.2 Parameter estimation using 2nd-SPSA and FDSA 123
List <strong>of</strong> Tables<br />
Table 2.1 Characteristics <strong>of</strong> the perturbation distributions 55<br />
Table 2.2 Normalized loss values <strong>for</strong> 1st-<strong>SPSA</strong> and M2-<strong>SPSA</strong> with σ = 0.001;<br />
90% confidence interval shown in [⋅]<br />
72<br />
Table 2.3. Values <strong>of</strong><br />
Table 2.4 Values <strong>of</strong><br />
*<br />
θˆ<br />
k<br />
− θ<br />
with no measurement noise 74<br />
ˆ *<br />
θ − θ<br />
0<br />
*<br />
θˆ<br />
k<br />
− θ<br />
with measurement noise 74<br />
ˆ *<br />
θ − θ<br />
0<br />
Table 2.5 Comparison of estimators 76
Table 3.1 Comparison of estimators (non-linear observer) 92
Table 3.2 Comparison of estimators (MR-SMC) 92
Table 3.3 Performance comparisons among M2-SPSA, RM-SA and LS 93
Table 5.1 Computational statistics 123
Table 6.1 Comparison of algorithms (performance) 127
Table 6.2 Comparison of algorithms (computational cost) 128
List <strong>of</strong> Abbreviations<br />
<strong>SPSA</strong><br />
1st-<strong>SPSA</strong><br />
2nd-<strong>SPSA</strong><br />
SP<br />
SA<br />
M2-<strong>SPSA</strong><br />
NN<br />
R-M<br />
FDSA<br />
LMS<br />
L-M<br />
ASP<br />
SG<br />
i.o.<br />
a.s.<br />
FIM<br />
MCNR<br />
MSE<br />
BP<br />
RMS<br />
MR-SMC<br />
VSS<br />
LS<br />
SM<br />
SHARF<br />
IIR<br />
FIR<br />
ODE<br />
HARF<br />
MSOE<br />
SMC<br />
ML<br />
Simultaneous perturbation stochastic approximation<br />
First-<strong>order</strong> <strong>of</strong> simultaneous perturbation stochastic approximation<br />
<strong>Second</strong>-<strong>order</strong> <strong>of</strong> simultaneous perturbation stochastic approximation<br />
Simultaneous perturbation<br />
Stochastic approximation<br />
Modified version <strong>of</strong> 2nd-<strong>SPSA</strong><br />
Neural network<br />
Robbins-Monroe<br />
Finite difference stochastic approximation<br />
Least mean square<br />
Levenberg-Marquardt<br />
Adaptive simultaneous perturbation<br />
Stochastic gradient<br />
Infinitely <strong>of</strong>ten<br />
Almost sure<br />
Fisher in<strong>for</strong>mation matrix<br />
Monte Carlo Newton-Raphson<br />
Mean Squire error<br />
Back-propagation<br />
Root mean square error<br />
Model reference-sliding mode control<br />
Variable structure system<br />
Least squares<br />
Steiglitz-McBride<br />
Simple hyperstable adaptive recursive filter<br />
Infinite impulse response<br />
Finite impulse response<br />
Ordinary differential equation<br />
Hyperstable adaptive recursive filter<br />
Mean-square output error<br />
Sequential Monte Carlo<br />
Maximum likelihood<br />
xv
Chapter 1

Introduction
Multivariate stochastic optimization plays a major role in the analysis and control of many engineering systems [1]. In almost all real-world optimization problems, it is necessary to use a mathematical algorithm that iteratively seeks out the solution because an analytical (closed-form) solution is rarely available. In this spirit, the "simultaneous perturbation stochastic approximation (SPSA)" method for difficult multivariate optimization problems has been developed. SPSA has recently attracted considerable international attention in areas such as statistical parameter estimation, feedback control, simulation-based optimization, signal and image processing, and experimental design. The essential feature of SPSA, which accounts for its power and relative ease of implementation, is the underlying gradient approximation that requires only two measurements of the objective function regardless of the dimension of the optimization problem. This feature allows for a significant decrease in the cost of optimization, especially in problems with a large number of variables to be optimized.
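The two-measurement gradient approximation described above can be sketched in a few lines. The following is a minimal illustrative sketch of first-order SPSA, not the dissertation's implementation: at each iteration every component of the parameter vector is perturbed simultaneously by a random ±1 vector, and only two loss measurements are used to form the gradient estimate, whatever the dimension. The gain values and decay exponents are common textbook choices, not values taken from this thesis.

```python
import numpy as np

def spsa_minimize(loss, theta0, n_iter=500, a=0.05, c=0.1,
                  alpha=0.602, gamma=0.101, seed=0):
    """Minimal first-order SPSA sketch: each iteration uses only two
    loss measurements, regardless of the dimension of theta."""
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, dtype=float)
    for k in range(n_iter):
        ak = a / (k + 1) ** alpha          # decaying step-size gain
        ck = c / (k + 1) ** gamma          # decaying perturbation size
        delta = rng.choice([-1.0, 1.0], size=theta.shape)  # simultaneous +/-1 perturbation
        y_plus = loss(theta + ck * delta)  # first loss measurement
        y_minus = loss(theta - ck * delta) # second loss measurement
        # Simultaneous-perturbation gradient estimate; dividing by delta_i
        # equals multiplying by delta_i here because each component is +/-1
        ghat = (y_plus - y_minus) / (2.0 * ck) * delta
        theta = theta - ak * ghat
    return theta

# Simple quadratic test loss with its minimum at the origin
quad = lambda t: float(np.sum(t ** 2))
theta_hat = spsa_minimize(quad, [2.0, -1.5, 0.5])
```

With a perturbation distribution other than ±1, the last step would require an elementwise division by the perturbation vector instead of the multiplication used here.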
1.1 Motivation and Background

1.1.1 Motivation
The simultaneous perturbation stochastic approximation (SPSA) method is a very useful tool for solving optimization problems in which the cost function is analytically unavailable or difficult to compute. The method is essentially a randomized version of the Kiefer-Wolfowitz method in which the gradient is estimated using only two measurements of the cost function at each iteration. SPSA is particularly efficient in high-dimensional problems and in problems where the cost function must be estimated through expensive simulations. Our motivation stems from the features of the SPSA algorithm, which make it well suited to parameter estimation in complex systems, where many existing algorithms have serious disadvantages. It is often necessary to estimate the parameters of a model of an unknown system. Various techniques exist to accomplish this task, including Kalman and Wiener filtering, least mean square (LMS) algorithms, and the Levenberg-Marquardt (L-M) algorithm. These techniques require an analytic form of the gradient of the function of the parameters to be estimated and usually have high computational complexity and cost [2]. Also, there are other kinds of parameter estimation algorithms in which
the convergence is not stable because they cannot handle a large number of parameters to be estimated. The SPSA algorithm is therefore convenient for these kinds of complex systems with a large number of parameters.
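The cost advantage just mentioned is easy to make concrete. As a hypothetical illustration (the function names and dimensions below are ours, not the dissertation's), compare the number of loss evaluations one gradient estimate requires under a two-sided finite-difference scheme (FDSA) versus the simultaneous-perturbation estimate: FDSA needs 2p evaluations for p parameters, while SPSA always needs exactly 2.

```python
import numpy as np

def fdsa_gradient(loss, theta, c=1e-3):
    """Two-sided finite differences: 2*p loss evaluations for p parameters."""
    p = theta.size
    g = np.zeros(p)
    for i in range(p):
        e = np.zeros(p)
        e[i] = c                      # perturb one coordinate at a time
        g[i] = (loss(theta + e) - loss(theta - e)) / (2.0 * c)
    return g

def spsa_gradient(loss, theta, c=1e-3, seed=0):
    """Simultaneous perturbation: 2 loss evaluations, independent of p."""
    rng = np.random.default_rng(seed)
    delta = rng.choice([-1.0, 1.0], size=theta.size)  # perturb all coordinates at once
    return (loss(theta + c * delta) - loss(theta - c * delta)) / (2.0 * c) * delta

calls = {"n": 0}                      # counts how often the loss is measured
def loss(t):
    calls["n"] += 1
    return float(np.sum(t ** 2))

theta = np.ones(50)                   # 50 parameters
calls["n"] = 0
fdsa_gradient(loss, theta)
fdsa_evals = calls["n"]               # 2 * 50 evaluations
calls["n"] = 0
spsa_gradient(loss, theta)
spsa_evals = calls["n"]               # 2 evaluations
```

The gap widens linearly with the parameter dimension, which is exactly the regime the thesis targets.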
1.1.2 Background
This dissertation is an introduction to the simultaneous perturbation stochastic approximation (SPSA) algorithm for stochastic optimization of multivariate systems. Optimization algorithms play a critical role in the design, analysis, and control of most engineering systems and are in widespread use in the work of many organizations. Before presenting the SPSA algorithm, we provide some general background on the stochastic optimization context of interest here.

The mathematical representation of most optimization problems is the minimization (or maximization) of some scalar-valued objective function with respect to a vector of adjustable parameters. The optimization algorithm is a step-by-step procedure for changing the adjustable parameters from some initial guess (or set of guesses) to a value that offers an improvement in
the objective function [3][4]. Figure 1.1 depicts this process for a very simple case of only two variables, θ1 and θ2, where our objective function is a loss function to be minimized (without loss of generality, we will discuss optimization in the context of minimization because a maximization problem can be trivially converted to a minimization problem by changing the sign of the objective function). Most real-world problems would have many more variables.
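The sign-flip convention can be checked with a trivial example of our own (the objective below is illustrative, not from the thesis): the maximizer of an objective f is exactly the minimizer of its negation.

```python
import numpy as np

f = lambda theta: -(theta - 3.0) ** 2 + 5.0   # concave objective, maximum at theta = 3
neg_f = lambda theta: -f(theta)               # flipping the sign turns it into a loss

grid = np.linspace(0.0, 6.0, 601)             # crude grid search, for illustration only
theta_max = grid[np.argmax(f(grid))]          # maximizer of f
theta_min = grid[np.argmin(neg_f(grid))]      # minimizer of -f: the same point
```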
The illustration in Fig. 1.1 is a typical example of a stochastic optimization setting with noisy input information because the loss function value does not uniformly decrease as the iteration process proceeds (note the temporary increase in the loss value in the third step of the algorithm). Many optimization algorithms have been developed that assume a deterministic setting and that assume information is available on the gradient vector associated with the loss function (i.e., the gradient of the loss function with respect to the parameters being optimized). However, there has been a growing interest in recursive optimization algorithms that do not depend on direct gradient information or measurements. Rather, these algorithms are based on an approximation to the gradient formed from measurements (generally noisy) of the loss function. This interest has been motivated, for example, by problems in the adaptive control and statistical identification of complex systems, the optimization of processes by large Monte Carlo simulations, the training of recurrent neural networks, the recovery of images from noisy sensor data, and the design of complex queuing and discrete-event systems.
1.1 MOTIVATION AND BACKGROUND

Fig. 1.1. Example of a stochastic optimization algorithm minimizing the loss function L(θ_1, θ_2).
This dissertation focuses on the case where such an approximation is used because direct gradient information is not readily available. Overall, gradient-free stochastic algorithms exhibit convergence properties similar to the gradient-based stochastic algorithms [e.g., Robbins-Monroe stochastic approximation (R-M SA)] while requiring only loss function measurements [5][6]. A main advantage of such algorithms is that they do not require the detailed knowledge of the functional relationship between the parameters being adjusted (optimized) and the loss function being minimized that is required in gradient-based algorithms. Such a relationship can be notoriously difficult to develop in some areas (e.g., non-linear feedback controller design), whereas in other areas (such as Monte Carlo optimization or recursive statistical parameter estimation), there may be large computational savings in calculating a loss function relative to that required in calculating a gradient. To elaborate on the distinction between algorithms based on direct gradient measurements and those based on gradient approximations from measurements of the loss function, the prototype gradient-based algorithm is R-M SA, which may be considered a generalization of such techniques as deterministic steepest descent and Newton-Raphson, neural network back-propagation (BP), and infinitesimal perturbation analysis-based optimization for discrete-event systems [9]. The gradient-based algorithms rely on direct measurements of the gradient of the loss function with respect to the parameters being optimized. These measurements typically yield an estimate of the gradient because the underlying data generally include added noise. Because it is not usually the case that one would obtain direct measurements of the gradient (with or without added noise) naturally in the course of operating or simulating a system, one must have detailed knowledge of the underlying system input-output relationships to calculate the R-M gradient
estimate from basic system output measurements. In contrast, the approaches based on gradient approximations require only conversion of the basic output measurements to sample values of the loss function, which does not require full knowledge of the system input-output relationships.
The classical method for gradient-free stochastic optimization is the Kiefer-Wolfowitz finite-difference SA (FDSA) algorithm [8]. Because of the fundamentally different information needed in implementing these gradient-based (R-M) and gradient-free algorithms, it is difficult to construct meaningful methods of comparison. As a general rule, however, the gradient-based algorithms will be faster to converge than those using loss-function-based gradient approximations when speed is measured in the number of iterations. Intuitively, this result is not surprising given the additional information required for the gradient-based algorithms. In particular, on the basis of asymptotic theory, the optimal rate of convergence measured in terms
of the deviation of the parameter estimate from the true optimal parameter vector is of order k^{-1/2} for the gradient-based algorithms and of order k^{-1/3} for the algorithms based on gradient approximations, where k represents the number of iterations. (Special cases exist where the maximum rate of convergence for a non-gradient algorithm is arbitrarily close to, or equal to, k^{-1/2}.)
In practice, of course, many other factors must be considered in determining which algorithm is best for a given circumstance, for the following three reasons: (1) It may not be possible to obtain reliable knowledge of the system input-output relationships, implying that the gradient-based algorithms may be either infeasible (if no system model is available) or undependable (if a poor system model is used). (2) The total cost to achieve effective convergence depends not only on the number of iterations required, but also on the cost needed per iteration, which is typically greater in gradient-based algorithms. (This cost may include greater computational burden, additional human effort required for determining and coding gradients, and experimental costs for model building such as labor, materials, and fuel.) (3) The rates of convergence are based on asymptotic theory and may not be representative of practical convergence rates in finite samples. For these reasons, one cannot say in general that a gradient-based search algorithm is superior to a gradient approximation-based algorithm, even though the gradient-based algorithm has a faster asymptotic rate of convergence (and, with simulation-based optimization such as infinitesimal perturbation analysis, requires only one system run per iteration, whereas the approximation-based algorithm may require multiple system runs per iteration). As a general rule, however, if direct gradient information is
conveniently and reliably available, it is generally to one's advantage to use this information in the optimization process. The focus in this dissertation is the case where such information is not readily available. The next section describes SPSA and the related FDSA algorithm. Then some of the theory associated with the convergence and efficiency of SPSA is summarized.
1.2 Overview of Stochastic Algorithms
This dissertation considers the problem of minimizing a (scalar) differentiable loss function L(θ), where θ is a p-dimensional vector and where the optimization problem can be translated into finding the minimizing θ* such that ∂L/∂θ = 0. This is the classical formulation of (local) optimization for differentiable loss functions. It is assumed that measurements of L(θ) are available at various values of θ. These measurements may or may not include added noise. No direct measurements of ∂L/∂θ are assumed available, in contrast to the R-M framework.
This section will describe the FDSA and SPSA algorithms. Although the emphasis of this dissertation is SPSA, the FDSA discussion is included for comparison because FDSA is a classical method for stochastic optimization. The SPSA and FDSA procedures are in the general recursive SA form:

θ̂_{k+1} = θ̂_k − a_k ĝ_k(θ̂_k)    (1.1)
where ĝ_k(θ̂_k) is the estimate of the gradient g(θ) ≡ ∂L/∂θ at the iterate θ̂_k based on the previously mentioned measurements of the loss function. Under appropriate conditions, the iteration in (1.1) will converge to θ* in some stochastic sense (usually "almost surely"); see, e.g., [7].
The essential part of (1.1) is the gradient approximation ĝ_k(θ̂_k). We discuss the two forms of interest here. Let y(·) denote a measurement of L(·) at a design level represented by the dot (i.e., y(·) = L(·) + noise) and let c_k be some (usually small) positive number. One-sided gradient approximations involve measurements y(θ̂_k) and y(θ̂_k + perturbation), whereas two-sided gradient approximations involve measurements of the form y(θ̂_k ± perturbation). The two general forms of gradient approximations for use in FDSA and SPSA are finite difference
and simultaneous perturbation (SP), respectively, which are discussed in the following paragraphs. For the finite-difference approximation, each component of θ̂_k is perturbed one at a time, and corresponding measurements y(·) are obtained. Each component of the gradient estimate is formed by differencing the corresponding y(·) values and then dividing by a difference interval. This is the standard approach to approximating gradient vectors and is motivated directly from the definition of a gradient as a vector of p partial derivatives, each constructed as the limit of the ratio of a change in the function value over a corresponding change in one component of the argument vector.
Typically, the i-th component of ĝ_k(θ̂_k) (i = 1, 2, ..., p) for a two-sided finite-difference approximation is given by

ĝ_{ki}(θ̂_k) = [y(θ̂_k + c_k e_i) − y(θ̂_k − c_k e_i)] / (2c_k)    (1.2)
where e_i denotes a vector with a one in the i-th place and zeros elsewhere (an obvious analogue holds for the one-sided version; likewise for the simultaneous perturbation form below), and c_k denotes a small positive number that usually gets smaller as k gets larger. The simultaneous perturbation has all elements of θ̂_k randomly perturbed together to obtain two measurements of y(·), but each component of ĝ_k(θ̂_k) is formed from a ratio involving the individual components in the perturbation vector and the difference in the two corresponding measurements. For two-sided simultaneous perturbation (SP), we have
ĝ_{ki}(θ̂_k) = [y(θ̂_k + c_k Δ_k) − y(θ̂_k − c_k Δ_k)] / (2c_k Δ_{ki})    (1.3)

where the distribution of the user-specified p-dimensional random perturbation vector Δ_k = (Δ_{k1}, Δ_{k2}, ..., Δ_{kp})^T satisfies conditions discussed later in this dissertation (superscript T denotes vector transpose). Note that the number of loss function measurements y(·) needed in
each iteration of FDSA grows with p, whereas with SPSA, only two measurements are needed independent of p because the numerator is the same in all p components. This circumstance, of course, provides the potential for SPSA to achieve a large savings (over FDSA) in the total number of measurements required to estimate θ when p is large. This potential is realized only if the number of iterations required for effective convergence to θ* does not increase in a way that cancels the measurement savings per gradient approximation at each iteration. In the following sections, these potential advantages of SPSA over FDSA will be described.
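As a concrete sketch of the two approximations, the code below implements the two-sided forms (1.2) and (1.3) in plain Python. The quadratic loss is a hypothetical example; the point of the sketch is the measurement count: `fdsa_gradient` makes 2p loss measurements per call, while `spsa_gradient` always makes exactly two.

```python
import random

def fdsa_gradient(y, theta, c):
    # Two-sided finite-difference approximation (1.2): perturb one
    # component at a time -> 2p loss measurements per gradient estimate.
    p = len(theta)
    g = []
    for i in range(p):
        e = [c if j == i else 0.0 for j in range(p)]
        up = [t + d for t, d in zip(theta, e)]
        dn = [t - d for t, d in zip(theta, e)]
        g.append((y(up) - y(dn)) / (2.0 * c))
    return g

def spsa_gradient(y, theta, c):
    # Simultaneous perturbation approximation (1.3): perturb all
    # components at once -> only 2 loss measurements, independent of p.
    delta = [random.choice([-1.0, 1.0]) for _ in theta]  # Bernoulli +/-1
    up = [t + c * d for t, d in zip(theta, delta)]
    dn = [t - c * d for t, d in zip(theta, delta)]
    diff = (y(up) - y(dn)) / (2.0 * c)
    return [diff / d for d in delta]

# Noise-free quadratic loss (hypothetical example): L(theta) = sum(theta_i^2)
L = lambda th: sum(t * t for t in th)
print(fdsa_gradient(L, [1.0, 2.0, 3.0], 1e-3))  # close to the true gradient [2, 4, 6]
print(spsa_gradient(L, [1.0, 2.0, 3.0], 1e-3))  # a noisier single-sample estimate
```

For p = 100 parameters, for example, FDSA would spend 200 loss measurements per gradient approximation while SPSA still spends 2; the SP estimate is noisier per iteration, but under the usual conditions on Δ_k its average over iterations recovers the gradient.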
1.3 Introduction to SPSA Algorithm
From here, the SPSA algorithm will be described in more detail. Stochastic approximation (SA) has long been applied to problems of minimizing loss functions or root-finding with noisy input information [10]. As with all stochastic search algorithms, there are adjustable algorithm coefficients that must be specified and that can have a profound effect on algorithm performance. It is known that picking these coefficients according to an SA analogue of the deterministic Newton-Raphson (N-R) algorithm provides an optimal or near-optimal form of the algorithm. However, directly determining the required Hessian matrix (or Jacobian matrix for root-finding) to achieve this algorithm form has often been difficult or impossible in practice [11]. This research presents a general adaptive SA algorithm that is based on a simple method for estimating the Hessian matrix at each iteration while concurrently estimating the primary parameters of interest. The approach applies in both the gradient-free optimization (Kiefer-Wolfowitz) and root-finding stochastic gradient-based (Robbins-Monroe) settings and is based on the simultaneous perturbation (SP) idea introduced in [12]. There has recently been much interest in recursive optimization algorithms that rely on measurements of only the objective function to be optimized, not on direct measurements of the gradient (derivative) of the objective function [12]. Such algorithms have the advantage of not requiring detailed modeling information describing the relationship between the parameters to be optimized and the objective function. For example, many systems involving human beings or computer simulations are difficult to treat analytically, and could potentially benefit from such an optimization approach [11][12]. Stochastic optimization algorithms are used in virtually all areas of engineering and the physical and social sciences. Such techniques apply in the usual case where a closed-form solution to the optimization problem of interest is not available and where the input information into the optimization method may be contaminated with noise.
Typical applications include model fitting and statistical parameter estimation, experimental design, adaptive control, pattern classification, simulation-based optimization, and performance evaluation from test data. Frequently, the solution to the optimization problem corresponds to a vector of parameters at which the gradient of the objective (say, loss) function with respect to the parameters being optimized is zero. In many practical settings, however, the gradient of the loss function for use in the optimization process is not available or is difficult to compute (knowledge of the gradient usually requires complete knowledge of the relationship between the parameters being optimized and the loss function). There is therefore considerable interest in techniques for optimization that rely on measurements of the loss function only, not on measurements (or direct calculations) of the gradient (or higher-order derivatives) of the loss function. One such technique, which uses only loss function measurements and has attracted considerable recent attention for difficult multivariate problems, is the SPSA algorithm introduced in [12]. This contrasts with algorithms requiring direct measurements of the gradient of the objective function (which are often difficult or impossible to obtain). Further, SPSA is especially efficient in high-dimensional problems in terms of providing a good solution for a relatively small number of measurements of the objective function. The essential feature of SPSA, which provides its power and relative ease of use in difficult multivariate optimization problems, is the underlying gradient approximation that requires only two objective function measurements per iteration regardless of the dimension of the optimization problem. These two measurements are made by simultaneously varying, in a "proper" random fashion, all of the variables in the problem. This contrasts with the classical FDSA method where the variables are varied one at a time. If the number of terms being optimized is p, then the finite-difference method takes 2p measurements of the objective function at each iteration (to form one gradient approximation) while SPSA takes only two measurements (see Fig. 1.2). A fundamental result on relative efficiency is described below.

Under reasonably general conditions, SPSA and the standard finite-difference SA method achieve the same level of statistical accuracy for a given number of iterations even though SPSA uses p times fewer measurements of the objective function at each iteration (since each gradient approximation uses only 1/p the number of function measurements). This indicates that SPSA will converge to the optimal solution within a given level of accuracy with p times fewer measurements of the objective function than the standard method. An equivalent way of interpreting this statement is described in the following paragraph.
One properly generated simultaneous random change of all p variables in the problem contains as much information for optimization as a full set of p one-at-a-time changes of each variable [13]. Further, SPSA, like other stochastic approximation methods, formally accommodates noisy measurements of the objective function. This is an important practical concern in a wide variety of problems involving Monte Carlo simulations, physical experiments, feedback systems, or incomplete knowledge. The need for solving multivariate optimization problems is pervasive in engineering and the physical and social sciences. The SPSA algorithm has already attracted considerable attention for challenging optimization problems where it is difficult or impossible to directly obtain a gradient of the objective function. As mentioned above, the gradient approximation is based on only two function measurements (regardless of the dimension of the gradient vector). This contrasts with standard finite-difference approaches, which require a number of function measurements proportional to the dimension of the gradient vector.

SPSA is generally used in non-linear problems having many variables where the objective function gradient is difficult or impossible to obtain. As an SA algorithm, SPSA may be rigorously applied when noisy measurements of the objective function are all that are available. There have also been many successful applications of SPSA in settings where perfect measurements of the loss function are available.
Fig. 1.2. Performance of the SPSA algorithm (two measurements).
1.4 Features of SPSA
1. SPSA allows the input to the algorithm to be measurements of the objective function corrupted by noise. For example, this is ideal for the case where Monte Carlo simulations are being used, because each simulation run provides one noisy estimate of the performance measure. This is especially relevant in practice, as a very large number of scenarios often need to be evaluated, and it will not be possible to run a large number of simulations at each scenario (to average out noise). So, an algorithm explicitly designed to handle noise is needed.
2. The algorithm is appropriate for high-dimensional problems where many terms are being determined in the optimization process. Many practical applications involve a significant number of such terms.
3. Performance guarantees for SPSA exist in the form of an extensive convergence theory. The algorithm has desirable properties for both global and local optimization in the sense that the gradient approximation is sufficiently noisy to allow for escape from local minima while being informative about the slope of the function to facilitate local convergence. This may avoid the cumbersome need in many global optimization problems to manually switch from a global to a local algorithm. However, we concentrate on the region near the optimum, so the local minima problem is omitted here.
4. Implementation of SPSA may be easier than other stochastic optimization methods since there are fewer algorithm coefficients that need to be specified, and there are published guidelines [12] providing insight into how to pick the coefficients in practical applications.
5. While the original SPSA method is designed for continuous optimization problems, there have been recent extensions to discrete optimization problems. This may be relevant to certain design problems, for example, where one wants to find the best number of items to use in a particular application.
6. "Basic" SPSA uses only objective function measurements to carry out the iteration process in a stochastic analogue of the steepest-descent method of deterministic optimization.
1.5 Application Areas
Over the past several years, non-linear models have been increasingly used for simulation, state estimation, and control purposes. In particular, the rapid progress in computational techniques and the success of non-linear model predictive control have been strong incentives for the development of such models as neural networks or first-principle models. Process modeling requires the estimation of several unknown parameters from noisy measurement data. A least-squares or maximum-likelihood cost function is usually minimized using a gradient-based optimization method [7]. Several techniques for computing the gradient of the cost function are available, including finite-difference approximations and analytic differentiation. In these techniques, the computational expense required to estimate the current gradient direction is directly proportional to the number of unknown model parameters, which becomes an issue for models involving a large number of parameters. This is typically the case in neural network modeling, but can also occur in other circumstances, such as the estimation of parameters and initial conditions in first-principle models. Moreover, the derivation of sensitivity equations requires analytic manipulation of the model equations, which is time-consuming and subject to errors [7].
In contrast to standard finite differences, which approximate the gradient by varying the parameters one at a time, the simultaneous perturbation approximation of the gradient proposed by Spall and Chin [12] makes use of a very efficient technique based on a simultaneous (random) perturbation of all the parameters; on each iteration, SPSA needs only a few loss measurements to estimate the gradient, regardless of the dimensionality of the problem (number of parameters) [12]. Hence, one gradient evaluation requires only two evaluations of the cost function. This approach was first applied to gradient estimation in a first-order stochastic approximation algorithm, and more recently to Hessian estimation in an accelerated second-order SPSA algorithm. Using those features, the SPSA algorithm proposed in this dissertation will also be applied to non-linear systems regardless of the dimensionality of the problem.
Some of the general areas for application of SPSA include statistical parameter estimation, simulation-based optimization, pattern recognition, non-linear regression, signal processing, neural network (NN) training, adaptive feedback control, and experimental design. Specific system applications represented in the list of references include [14]:

1. Adaptive optics
2. Aircraft modeling and control
3. Atmospheric and planetary modeling
4. Fault detection in plant operations
5. Human-machine interface control
6. Industrial quality improvement
7. Medical imaging
8. Noise cancellation
9. Process control
10. Queuing network design
11. Robot control
12. Parameter estimation in highly non-linear models
On this last point, the research has an important goal, because this application (parameter estimation) is very useful in realistic systems. It is often necessary to estimate the parameters of a model of an unknown system. Various techniques exist to accomplish this task, including LMS algorithms and the L-M algorithm [15]. These techniques require an analytic form of the gradient of the function of the parameters to be estimated. A key feature of the SPSA method is that it is a gradient-free optimization technique. The function of the parameters to be identified here is highly non-linear and of sufficient difficulty that obtaining an analytic form of the gradient is impractical.
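To make the parameter-estimation use case concrete, the following self-contained sketch estimates the two parameters of a hypothetical non-linear model y = a(1 − e^{−bx}) by minimizing a mean-square-error loss with basic first-order SPSA. The model, data, and gain constants are illustrative assumptions only, not a system from this dissertation; the gain sequences follow the commonly published form a_k = a/(k+1+A)^α, c_k = c/(k+1)^γ and would need tuning for a real problem.

```python
import math
import random

def spsa_minimize(loss, theta0, iters=8000, a=0.3, A=100, alpha=0.602, c=0.1, gamma=0.101):
    # First-order SPSA: two loss measurements per iteration, any dimension p.
    theta = list(theta0)
    for k in range(iters):
        a_k = a / (k + 1 + A) ** alpha   # decaying step-size gain
        c_k = c / (k + 1) ** gamma       # decaying perturbation size
        delta = [random.choice([-1.0, 1.0]) for _ in theta]  # Bernoulli +/-1
        up = [t + c_k * d for t, d in zip(theta, delta)]
        dn = [t - c_k * d for t, d in zip(theta, delta)]
        diff = (loss(up) - loss(dn)) / (2.0 * c_k)
        theta = [t - a_k * diff / d for t, d in zip(theta, delta)]
    return theta

# Hypothetical non-linear model y = a * (1 - exp(-b * x)); data generated
# with true parameters (a, b) = (2.0, 0.5), then "forgotten" by the estimator.
xs = [0.5 * i for i in range(1, 21)]
model = lambda th, x: th[0] * (1.0 - math.exp(-th[1] * x))
data = [(x, model([2.0, 0.5], x)) for x in xs]
mse = lambda th: sum((model(th, x) - y) ** 2 for x, y in data) / len(data)

random.seed(2)
theta_hat = spsa_minimize(mse, [1.0, 1.0])
print(theta_hat)  # approaches [2.0, 0.5]
```

Note that the estimator never touches ∂(MSE)/∂θ: only loss evaluations are used, which is exactly the gradient-free property emphasized above.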
1.6 Formulation of SPSA Algorithm
The problem of minimizing a (scalar) differentiable loss function L(θ), where θ ∈ R^p, p ≥ 1, is considered. A typical example of L(θ) would be some measure of mean-square error (MSE) for the output of a process as a function of some design parameters θ. For many cases of practical interest, this is equivalent to finding the minimizing θ* such that
g(θ) = ∂L/∂θ = 0.    (1.4)
For the gradient-free setting, it is assumed that measurements of L(θ), say y(θ), are available at various values of θ. These measurements may or may not include random noise. No direct measurements (either with or without noise) of g(θ) are assumed available in this setting. In the Robbins-Monroe/stochastic gradient (SG) case [9], it is assumed that direct measurements of g(θ) are available, usually in the presence of added noise. The basic problem is to take the available information (measurements of L(θ) and/or g(θ)) and attempt to estimate θ*. This is essentially a local unconstrained optimization problem. The SPSA algorithm is a tool for solving optimization problems in which the cost function is analytically unavailable or difficult to compute. The algorithm is essentially a randomized version of the Kiefer-Wolfowitz method in which the gradient is estimated using only two measurements of the cost function at each iteration [15][16]. SPSA is particularly efficient in problems of high dimension and where the cost function must be estimated through expensive simulations. The convergence properties of
the algorithm have been established in [16]. Consider the problem of finding the minimum of a real-valued function L(θ), for θ ∈ D, where D is an open domain in R^p. The function is not assumed to be explicitly known, but noisy measurements M(n, θ) of it are available:

M(n, θ) = L(θ) + ε_n(θ)  (1.5)
where {ε_n} is the measurement noise process. We assume that the function L(⋅) is at least three-times continuously differentiable and has a unique minimizer in D. The process {ε_n} is a zero-mean process, uniformly bounded and smooth in θ in an appropriate technical sense. The problem is to minimize L(⋅) using only the noisy measurements M(⋅). The SPSA algorithm for minimizing functions relies on the SP gradient approximation [16]. At each iteration k of the
algorithm, a random perturbation vector ∆_k = (∆_k1, ..., ∆_kp)^T is taken, where the ∆_ki form a
sequence of Bernoulli random variables taking the values ±1. The perturbations are assumed to be independent of the measurement noise process. In fixed-gain SPSA, the step size of the perturbation is fixed at some c > 0. To compute the gradient estimate at iteration k, it is necessary to evaluate M(⋅) at two values of θ:
M_k^+(θ) = L(θ + c∆_k) + ε_{2k−1}(θ + c∆_k)  (1.6)
CHAPTER 1. INTRODUCTION<br />
M_k^−(θ) = L(θ − c∆_k) + ε_{2k}(θ − c∆_k).  (1.7)
The i-th component of the gradient estimate is

H_i(k, θ) = (M_k^+(θ) − M_k^−(θ)) / (2c∆_ki).
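As an illustration, the two-measurement SP gradient estimate of (1.6)-(1.7) can be sketched in a few lines. The quadratic loss, noise level, and difference interval below are placeholder choices of our own, not values from this thesis:

```python
import numpy as np

rng = np.random.default_rng(0)

def y(theta):
    """Noisy loss measurement: L(theta) plus zero-mean noise (toy quadratic L)."""
    return float(np.sum(theta ** 2)) + 0.01 * rng.standard_normal()

def sp_gradient(theta, c):
    """Two-measurement SP gradient estimate, per eqs. (1.6)-(1.7)."""
    delta = rng.choice([-1.0, 1.0], size=theta.size)  # Bernoulli +-1 perturbation
    m_plus = y(theta + c * delta)                     # M_k^+(theta)
    m_minus = y(theta - c * delta)                    # M_k^-(theta)
    return (m_plus - m_minus) / (2.0 * c * delta)     # i-th component divides by delta_i

theta = np.array([1.0, -2.0, 0.5])
g_hat = sp_gradient(theta, c=0.1)
```

For this quadratic loss the estimate is unbiased, so averaging many independent estimates recovers the true gradient 2θ, even though each single estimate is noisy.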
1.7 Basic Assumptions of SPSA Algorithm
Once again, the goal is to minimize a loss function L(θ) over θ ∈ C ⊆ R^p. The SPSA algorithm works by iterating from an initial guess of the optimal θ, where the iteration process depends on the above-mentioned simultaneous perturbation approximation to the gradient g(θ). In [16], sufficient conditions are presented for convergence of the SPSA iterate (θ̂_k → θ* a.s.) using a differential equation approach well known in SA theory [17]. In particular, we must impose conditions on both gain sequences (a_k and c_k), the user-specified distribution of ∆_k, and the statistical relationship of ∆_k to the measurements y(·). We will not repeat the
conditions here since they are available in [17]. The main conditions are that a_k and c_k both go to 0 at rates neither too fast nor too slow, that L(θ) is sufficiently smooth (several times differentiable) near θ*, and that the {∆_ki} are independent and symmetrically distributed about 0 with finite inverse moments E(|∆_ki|^{−1}) for all k, i. One particular distribution for ∆_ki that satisfies these latter conditions is the symmetric Bernoulli ±1 distribution; two common distributions that do not satisfy the conditions (in particular, the critical finite inverse moment condition) are the uniform and the normal. Although the convergence results for SPSA are of some independent interest, the most interesting theoretical results in [16], and those that best justify the use of SPSA, are the asymptotic efficiency conclusions that follow from an asymptotic
normality result. In particular, under some minor additional conditions in [16] (proposition 2), it<br />
can be shown that

k^{β/2} (θ̂_k − θ*) → N(µ, Σ) in distribution, as k → ∞  (1.8)
where β > 0 depends on the choice of the gain sequences (a_k and c_k), µ depends on both the Hessian and the third derivatives of L(θ) at θ*, and Σ depends on the Hessian matrix at θ* (note that in general µ ≠ 0, in contrast to many well-known asymptotic normality results in estimation). Given the restrictions on the gain sequences to ensure convergence and asymptotic normality, the fastest allowable value for the rate of convergence of θ̂_k to θ* is k^{−1/3}.
In addition to establishing the formal convergence of SPSA, Spall in [18] shows that the probability distribution of an appropriately scaled θ̂_k is approximately normal (with a specified mean and covariance matrix) for large k. Spall in [18] uses the asymptotic normality result in (1.8), together with a parallel result for FDSA [9], to establish the relative efficiency of SPSA. This efficiency depends on the shape of L(θ), the values for {a_k} and {c_k}, and the distributions of the {∆_k} and measurement noise terms. There is no single expression that can be used to characterize the relative efficiency; however, as discussed in [17], in most practical problems SPSA will be asymptotically more efficient than FDSA.
For example, if a_k and c_k are chosen as in the guidelines of Spall [18], then by equating the asymptotic mean squared errors E(‖θ̂_k − θ*‖²) of the SPSA and FDSA algorithms, we find

No. of measurements of L(θ) in SPSA / No. of measurements of L(θ) in FDSA → 1/p
as the number of loss measurements in both procedures gets large. Hence, the above expression implies that the p-fold savings per iteration (gradient approximation) translates directly into a p-fold savings in the overall optimization process, despite the complex non-linear ways in which the sequence of gradient approximations manifests itself in the ultimate solution θ̂_k. One properly chosen simultaneous random change in all the variables in a problem provides as much information for optimization as a full set of one-at-a-time changes of each variable.
1.8 Versions of SPSA Algorithm
The standard first-order SA algorithms for estimating θ involve a simple recursion with, usually, a scalar gain and an approximation to the gradient based on the measurements of L(⋅).
The first-order SPSA (1st-SPSA or SPSA) algorithm mentioned previously requires only two measurements of L(⋅) to form the gradient approximation, independent of p (versus 2p in the standard multivariate finite-difference approximation considered, e.g., in [8]), which extends the scalar algorithm of Kiefer and Wolfowitz [8]. Theory presented in [17] shows that for large p the 1st-SPSA approach can be much more efficient (in terms of the total number of loss measurements to achieve effective convergence to θ*) than the finite-difference approach in many cases of practical interest. In extending 1st-SPSA to a second-order (accelerated) form [18] that will be explained below, we can see how the gradient and inverse Hessian of L(⋅) can both be estimated on a per-iteration basis using only three measurements of L(⋅) (again, independent of p). With these estimates, it is possible to create an SA analogue of the Newton-Raphson algorithm (which, recall, is based on an update step that is negatively proportional to the inverse Hessian times the gradient) [17]. The aim of the second-order SPSA (2nd-SPSA) algorithm is to emulate the acceleration properties associated with deterministic algorithms of Newton-Raphson form, particularly in the terminal phase where the first-order SPSA algorithm slows down in its convergence [18]. This approach requires only three loss function measurements at each iteration, independent of the problem dimension. The 2nd-SPSA approach is composed of two parallel recursions, one for θ and one for the upper triangular matrix square root, say S = S(θ), of the Hessian of L(θ) (the square root is estimated to ensure that the inverse Hessian estimate used in the second-order SPSA recursion for θ is positive semi-definite). The two recursions are, respectively [18],
θ̂_{k+1} = θ̂_k − a_k (Ŝ_k^T Ŝ_k)^{−1} ĝ_k(θ̂_k)  (1.9)

Ŝ_{k+1} = Ŝ_k − ã_k Ĝ_k(Ŝ_k)  (1.10)
where a_k and ã_k are non-negative scalar gain coefficients, ĝ_k(θ̂_k) is the SP gradient approximation to g(θ̂_k) [18] and Ĝ_k is an observation related to the gradient of a certain loss function with respect to S. Note that Ŝ_k^T Ŝ_k (which depends on θ̂_k) represents an estimate of
the Hessian matrix of L(θ̂_k). Hence, equation (1.9) is a stochastic analogue of the well-known Newton-Raphson algorithm of deterministic optimization. Since ĝ_k(θ̂_k) has a known form, the parallel recursions in equations (1.9) and (1.10) can be implemented once Ĝ_k is specified. The SP gradient approximation requires two measurements of L(⋅): y_k^{(+)} and y_k^{(−)}. These represent measurements at design levels θ̂_k + c_k∆_k and θ̂_k − c_k∆_k, respectively, where c_k is a positive scalar and ∆_k represents a user-generated random vector satisfying certain regularity conditions; e.g., ∆_k being a vector of independent Bernoulli ±1 random variables satisfies these conditions, but a vector of uniformly distributed random variables does not. The "SP" comes from the fact that all elements of θ̂_k are perturbed simultaneously in forming ĝ_k(θ̂_k), as opposed to the finite-difference form, where they are perturbed one at a time. To perform one iteration of (1.9) and (1.10), one additional measurement, say y_k^{(0)}, is required; this measurement represents an observation of L(⋅) at the nominal design level θ̂_k.
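To see why the square-root parameterization keeps the θ-update of (1.9) well behaved, consider a small sketch: since Ŝ_k^T Ŝ_k is positive semi-definite for any real Ŝ_k, the update direction can be obtained from two triangular solves rather than an explicit matrix inverse. The numerical values of S and the gradient estimate below are illustrative only; the recursion (1.10) producing Ŝ_k is specified in [18]:

```python
import numpy as np

def theta_update(theta, S, g_hat, a):
    """One step of eq. (1.9): theta - a * (S^T S)^{-1} g_hat.

    S is the upper-triangular square-root factor of the Hessian estimate,
    so S^T S is positive semi-definite by construction.  Instead of forming
    the inverse, solve S^T z = g_hat, then S d = z.
    """
    z = np.linalg.solve(S.T, g_hat)   # forward substitution (S^T is lower-triangular)
    d = np.linalg.solve(S, z)         # back substitution
    return theta - a * d

# Illustrative values only:
S = np.array([[2.0, 1.0],
              [0.0, 1.5]])            # upper-triangular square-root factor
g_hat = np.array([1.0, -0.5])
theta_next = theta_update(np.zeros(2), S, g_hat, a=1.0)
```

The factored solve avoids ever inverting the Hessian estimate explicitly, which is both cheaper and numerically safer when the estimate is poorly conditioned.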
Main Advantages:
- 1st-SPSA gives region(s) where the function value is low, and this allows one to conjecture in which region(s) the global solution lies.
- 2nd-SPSA is based on a highly efficient approximation of the gradient from loss function measurements. In particular, on each iteration 2nd-SPSA needs only three loss measurements, regardless of the dimensionality of the problem. Moreover, 2nd-SPSA is grounded on a solid mathematical framework that permits assessment of its stochastic properties, also for optimization problems affected by noise or uncertainties. Due to these striking advantages, 2nd-SPSA has recently been used as an optimization engine for adaptive control problems.
Main Disadvantages:
- 1st-SPSA gives slow convergence.
- 2nd-SPSA does not take into account equality/inequality constraints.
The 1st-SPSA and 2nd-SPSA algorithms do not depend on derivative information, and they are able to find a good approximation to the solution using few function values. Their disadvantage is that, once a good approximation is obtained, it may not satisfy some conditions and constraints associated with some complex problems [17][18]. Also, in both versions of the SPSA algorithm it is not possible to guarantee that the non-positive definite part of the Hessian matrix can be eliminated when the number of parameters to be adjusted is large. This can cause instability in the system, and both versions can also become very expensive computationally. Finally, in the 1st-SPSA and 2nd-SPSA algorithms, the error for a loss function with an ill-conditioned Hessian is greater than that for one with a well-conditioned Hessian, which degrades the system performance. Also, in estimating the optimum parameters of a model or time series, there are several factors which must be considered when deciding on the appropriate optimization technique. Among these factors are convergence speed, accuracy, algorithm suitability, complexity, and computational cost in terms of time (coding, run-time, output) and power. In the parameter estimation application, 2nd-SPSA had problems with convergence to local minima and with computational cost. Thus, several techniques are proposed in [18] in order to solve these kinds of problems efficiently. Nevertheless, when the number of parameters to be adjusted is very large, the convergence is slow and unstable. The techniques defined in [18] include a mapping of the Hessian matrix, but this is not consistent under some conditions or applications. Therefore, in view of these disadvantages (theoretical and practical), in the following chapter we propose some improvements to the 2nd-SPSA algorithm, in particular in its stability, convergence, and computational cost. A new mapping is also suggested for 2nd-SPSA that eliminates the non-positive definiteness while preserving key spectral properties of the estimated Hessian. This Hessian is estimated using the Fisher information matrix in order to keep it positive definite and improve the stability. These improvements constitute our proposed SPSA algorithm, which is described in the following chapter.
Chapter 2<br />
Proposed SPSA Algorithm
We propose a modification to the simultaneous perturbation stochastic approximation (SPSA) method, based on comparisons made between the first- and second-order SPSA (1st-SPSA and 2nd-SPSA) algorithms from the perspective of the loss function Hessian. At finite iterations, the accuracy of the algorithm depends on the matrix conditioning of the loss function Hessian. The error of the 2nd-SPSA algorithm for a loss function with an ill-conditioned Hessian is greater than that for one with a well-conditioned Hessian. On the other hand, the 1st-SPSA algorithm is less sensitive to the matrix conditioning of loss function Hessians. The modified 2nd-SPSA (M2-SPSA) eliminates the error amplification caused by the inversion of an ill-conditioned Hessian. This leads to significant improvements in algorithm efficiency in problems with an ill-conditioned Hessian matrix. Asymptotically, the efficiency analysis shows that M2-SPSA is also superior to 2nd-SPSA in a large parameter domain. It is shown that the ratio of the mean square errors of M2-SPSA to 2nd-SPSA is always less than one, except for a perfectly conditioned Hessian or for an asymptotically optimal setting of the gain sequence. Also, an improved estimation of the Hessian matrix is proposed in order to guarantee that the non-positive definite part of this matrix can be eliminated; using this proposed estimation, the computational cost is also reduced when our method is applied to parameter estimation.
2.1 Overview of Modified 2nd-SPSA Algorithm
The recently developed simultaneous perturbation stochastic approximation (SPSA) method has found many applications in areas such as physical parameter estimation and simulation-based optimization. The novelty of SPSA is the underlying derivative approximation that requires only two (for the gradient) or four (for the Hessian matrix) evaluations of the loss function, regardless of the dimension of the optimization problem. There exist two basic SPSA algorithms that are based on the "simultaneous perturbation" (SP) concept and that use only (noisy) loss function measurements. The first-order SPSA (1st-SPSA) is related to the Kiefer-Wolfowitz (K-W) stochastic approximation (SA) method [17], whereas the second-order SPSA (2nd-SPSA) is a stochastic analogue of the deterministic Newton-Raphson algorithm [18]. There have been several studies that compare the efficiency of 1st-SPSA with other stochastic approximation (SA) methods. It is generally accepted that 1st-SPSA is superior to other first-order SA methods (such as the standard K-W method) due to its efficient estimator for the loss function gradient. Spall [28] shows that a "standard" implementation of 2nd-SPSA achieves a nearly optimal asymptotic error, with the asymptotic root-mean-square error being no more than twice the optimal (but unachievable) error from an infeasible gain sequence depending on the third derivatives of the loss function. This appealing result for 2nd-SPSA is achieved with a trivial gain sequence (a_k = 1/(k + 1) in the notation below), which effectively eliminates the nettlesome issue of selecting a "good" gain sequence. Because this result is asymptotic, however, performance in finite samples may sometimes be improved using other considerations. Part of the purpose of this chapter is to provide a comparison between 1st-SPSA and 2nd-SPSA from the perspective of the conditioning of the loss function Hessian matrix. To achieve objectivity in the comparison, we also suggest a new mapping for implementing 2nd-SPSA that eliminates the non-positive definiteness while preserving key spectral properties of the estimated Hessian. While the focus of this chapter is finite-sample analysis, we are necessarily limited by the theory available for SA algorithms, almost all of which is asymptotic. The numerical examples illustrating the empirical results at finite iterations will be carefully chosen to represent a wide range of matrix conditioning for the loss function Hessians.
2.2 SPSA Algorithm Recursions
There has recently been a growing interest in recursive optimization algorithms of SA form that do not depend on direct gradient information or measurements [19]-[21]. Rather, these SA algorithms are based on an approximation to the p-dimensional gradient formed from measurements of the objective function. This interest has been motivated by problems such as the adaptive control of complex processes, the training of recurrent neural networks, and the optimization of complex queuing and estimation parameters. The principal advantage of algorithms that do not require direct gradient measurements (gradient-free algorithms) is that they do not require knowledge of the functional relationship between the parameters being adjusted and the objective function being minimized. The SPSA algorithm, which is based on a highly efficient gradient approximation, is one such gradient-free algorithm. In the SPSA algorithm there are two important orders: the 1st-SPSA (or SPSA) and the 2nd-SPSA. These algorithms are described as follows:
1st-SPSA [17]:

θ̂_{k+1} = θ̂_k − a_k ĝ_k(θ̂_k),  k = 0, 1, 2, ...  (2.1)
2nd-SPSA [18]:

θ̂_{k+1} = θ̂_k − a_k H̃_k^{−1} ĝ_k(θ̂_k),  H̃_k = f_k(H̄_k)  (2.2a)

H̄_k = (k/(k+1)) H̄_{k−1} + (1/(k+1)) Ĥ_k,  k = 0, 1, 2, ...  (2.2b)
where a_k is a scalar gain sequence that satisfies certain SA conditions [18], ĝ_k is the SP estimate of the loss function gradient that depends on the gain sequence c_k (representing a difference interval of the perturbations), Ĥ_k is the SP estimate of the Hessian matrix, and f_k maps the usually non-positive-definite averaged estimate H̄_k to a positive-definite p×p matrix. The two recursions are shown in Fig. 2.1. Let ∆_k be a user-generated mean-zero random vector of dimension p with its components being independent random variables.
Fig. 2.1. The two recursions in the 2nd-SPSA algorithm (solid line: eq. (2.2a); dashed line: eq. (2.2b)).
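As a quick check on the averaging recursion (2.2b), note that it is algebraically a running mean of the per-iteration Hessian estimates. A minimal sketch, with arbitrary random matrices standing in for the SP estimates Ĥ_k:

```python
import numpy as np

rng = np.random.default_rng(3)
H_hats = [rng.standard_normal((3, 3)) for _ in range(10)]  # stand-ins for H_hat_k

H_bar = np.zeros((3, 3))
for k, H_hat in enumerate(H_hats):
    # eq. (2.2b): H_bar_k = k/(k+1) * H_bar_{k-1} + 1/(k+1) * H_hat_k
    H_bar = (k / (k + 1)) * H_bar + (1.0 / (k + 1)) * H_hat

# After the loop, H_bar equals the sample mean of all estimates seen so far.
```

Writing the recursion this way shows why it smooths the noisy per-iteration estimates: each new Ĥ_k enters with weight 1/(k+1), so old information is never discarded.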
The i-th element of the loss function gradient estimate is given by [18]

(ĝ_k)_i = (2c_k ∆_ki)^{−1} [y(θ̂_k + c_k∆_k) − y(θ̂_k − c_k∆_k)],  i = 1, 2, ..., p  (2.3)
where ∆_ki is the i-th component of the ∆_k vector and y(θ) is the measurement of the loss function:

y(θ) = L(θ) + (noise)  (2.4)

where θ* denotes the true (optimal) value of θ.
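Putting (2.1), (2.3) and (2.4) together, a minimal 1st-SPSA loop might look as follows. The toy loss, noise level, gain constants and decay exponents are placeholder choices of our own; decaying gains of this general form are common in the SA literature:

```python
import numpy as np

rng = np.random.default_rng(1)
theta_true = np.array([1.0, -1.0])

def y(theta):
    """Noisy loss measurement, eq. (2.4): L(theta) + noise, toy quadratic L."""
    return float(np.sum((theta - theta_true) ** 2)) + 0.01 * rng.standard_normal()

def spsa_1st(theta0, n_iter=2000, a=0.2, c=0.1):
    """First-order SPSA recursion (2.1) driven by the SP gradient estimate (2.3)."""
    theta = np.array(theta0, dtype=float)
    for k in range(n_iter):
        a_k = a / (k + 1) ** 0.602   # decaying gain sequences (placeholder
        c_k = c / (k + 1) ** 0.101   # constants and exponents)
        delta = rng.choice([-1.0, 1.0], size=theta.size)
        g_hat = (y(theta + c_k * delta) - y(theta - c_k * delta)) / (2.0 * c_k * delta)
        theta -= a_k * g_hat          # eq. (2.1)
    return theta

theta_hat = spsa_1st([0.0, 0.0])      # should approach theta_true
```

Note that each iteration uses exactly two loss measurements, independent of the dimension p, which is the source of the efficiency gain over finite-difference SA discussed in Chapter 1.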
It is noted that the 2nd-SPSA form is a special case of the general adaptive SP method. The general method can also be used in root-finding problems, where Ĥ_k represents an estimate of the associated Jacobian matrix. The true Hessian matrix of the loss function, H(θ), has its ij-th element defined as H_ij = ∂²L/∂θ_i∂θ_j, and its value at the solution, H(θ*), is denoted by H*. Finally, the estimate of H and its ij-th element are defined in Sec. 2.6 using the Fisher information matrix (FIM). The FIM is used here instead of the Hessian matrix in order to estimate this matrix efficiently [22]. The FIM is obtained by Monte Carlo Newton-Raphson (MCNR) [23]. Moreover, this Hessian matrix estimate is convenient in an optimization application and is a crucial requirement for the new mapping f_k proposed in the following section.
2.3 Proposed Mapping
An important point in implementing 2nd-SPSA is to define the mapping f_k from the averaged estimate H̄_k to its positive-definite replacement, since the former is often non-positive definite in practice. It is noted that there are no simple and universal conditions that guarantee a matrix to be positive definite. The existence of a minimum (or minima) for a loss function, based on the problem's physical nature, guarantees that its Hessian should be positive definite. The following approach eliminates the non-positive definiteness of H̄_k, and by using the Fisher information matrix we can maintain this condition even when the computational complexity of the real application is very high. This approach is motivated by finite-sample concerns, as we discuss below. First, we compute the eigenvalues of H̄_k and sort them into descending order:

Λ_k ≡ diag[λ_1, λ_2, ..., λ_{q−1}, λ_q, λ_{q+1}, ..., λ_p]  (2.5)
where λ_q > 0 and λ_{q+1} ≤ 0. As H̄_k is a real symmetric matrix, its eigenvalues are real-valued, too. The eigenvalues of H̄_k are computed as follows.
The number of non-zero eigenvalues is equal to the rank of Ĥ_k, i.e., at most three non-zero eigenvalues are available. In this part, the following arrangement of eigenvalues is assumed: λ_1 ≥ λ_2 ≥ λ_3. The technique presented here requires much less user interaction. Now, the theoretical background is explained, leading to a two-fold threshold algorithm where the only task of the user is to specify two thresholds. Finding the eigenvalues and eigenvectors of the Hessian matrix is closely related to its decomposition

H = P D P^{−1}  (2.6)

where P is a matrix whose columns are the eigenvectors of H, and D is a diagonal matrix with the eigenvalues of H on its diagonal. While computing the gradient magnitude by the Euclidean norm requires only three multiplications, two additions and one square root, the computation of the eigenvalues of the Hessian matrix is more demanding: an explicit formula would require solving cubic polynomials. In our implementation, a fast-converging numerical technique called Jacobi's method is used, as recommended in [20] for symmetric matrices. We have proposed an easy-to-use framework for exploiting eigenvalues of the Hessian matrix to represent volume data by small subsets.
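As a sketch of the Jacobi method mentioned above (a hypothetical minimal implementation for small symmetric matrices; production code would use a tuned library routine such as LAPACK's symmetric eigensolvers):

```python
import numpy as np

def jacobi_eigenvalues(A, tol=1e-10, max_sweeps=50):
    """Cyclic Jacobi method for a real symmetric matrix.

    Each plane rotation annihilates one off-diagonal pair; sweeps repeat
    until the off-diagonal mass is negligible.  Returns the eigenvalues
    in descending order (the diagonal of the converged matrix).
    """
    A = np.array(A, dtype=float)
    n = A.shape[0]
    for _ in range(max_sweeps):
        off = np.sqrt(np.sum(A ** 2) - np.sum(np.diag(A) ** 2))
        if off < tol:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if A[p, q] == 0.0:
                    continue
                # Rotation angle that zeroes A[p, q]
                phi = 0.5 * np.arctan2(2.0 * A[p, q], A[q, q] - A[p, p])
                c, s = np.cos(phi), np.sin(phi)
                J = np.eye(n)
                J[p, p] = J[q, q] = c
                J[p, q], J[q, p] = s, -s
                A = J.T @ A @ J
    return np.sort(np.diag(A))[::-1]
```

Because every update is an orthogonal similarity transform, the eigenvalues are preserved exactly at each step, which is what makes the method attractive for the symmetric Hessian estimates used here.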
The relation of the eigenvalues to the Laplacian operator is recalled; this shows the suitability of thresholding eigenvalue volumes, and a two-fold threshold operation is defined to generate sparse data sets. For data where it can be assumed that objects exhibit higher intensities than the background, we modify the framework by taking into account only the smallest eigenvalue. This results in a further reduction of the representative subsets by selecting just the data at the interior side of object boundaries. For the sake of simplicity, we have omitted the index k for the individual eigenvalue λ_i, which is a function of k. Next, we assume that the negative eigenvalues will not lead to a physically meaningful solution. They are either caused by errors in H̄_k or are due to the fact that the iteration has not reached the neighborhood of θ* where the loss function is locally quadratic. Therefore, we replace them, together with the smallest positive eigenvalue, with a descending series of positive eigenvalues:
CHAPTER 2. PROPOSED SPSA ALGORITHM
$$\hat{\lambda}_q = \varepsilon \lambda_{q-1}, \quad \hat{\lambda}_{q+1} = \varepsilon \hat{\lambda}_q, \quad \ldots, \quad \hat{\lambda}_p = \varepsilon \hat{\lambda}_{p-1} \qquad (2.7)$$
where the adjustable parameter $0 < \varepsilon < 1$ can be specified based on the existing positive eigenvalues:
$$\varepsilon = \left( \lambda_{q-1} / \lambda_1 \right)^{1/(q-2)}. \qquad (2.8)$$
The purpose of redefining the smallest positive eigenvalue $\lambda_q$ is to avoid a possible near-zero value that would make the mapped matrix nearly singular. We let $\hat{\Lambda}_k$ be the diagonal matrix $\Lambda_k$ with the eigenvalues $\lambda_q, \ldots, \lambda_p$ replaced by $\hat{\lambda}_q, \ldots, \hat{\lambda}_p$ defined according to (2.7); this guarantees the stability of the diagonal matrix when the real system is very complex or, as in our case, when the number of estimated parameters is very large. The Jacobi algorithm is proposed because the matrices in this algorithm need to be positive definite in general, and hence should be projected appropriately after each parameter update (2.2a) so as to ensure that the resulting matrices are positive definite. Equations (2.7) and (2.8) indicate that the spectral character of the existing positive eigenvalues, as measured by the ratio of the maximum-to-minimum eigenvalues, whether widely or narrowly spread, is extrapolated to the rest of the matrix spectrum. Other forms of specification such as
$$\varepsilon = \left( \lambda_{q-1} / \lambda_1 \right)^{1/(q-2)} / 2 \quad \text{or} \quad \varepsilon = 1$$
would also effectively eliminate the non-positive-definiteness. Because the separating point $q$ between the positive and negative eigenvalues slowly increases from 1 to $p$, we find numerically that the specification based on (2.8) yields a relatively faster convergence in most cases. Since $H_k$ is symmetric, it is orthogonally similar to the real diagonal matrix of its real eigenvalues:
$$H_k = P_k \Lambda_k P_k^T \qquad (2.9)$$
where the orthogonal matrix $P_k$ consists of all the eigenvectors of $H_k$, which are usually derived together with the eigenvalues. Now, the mapping $f_k$ can be expressed as
2.3 PROPOSED MAPPING
$$f_k(H_k) = P_k \hat{\Lambda}_k P_k^T. \qquad (2.10)$$
Since it is $H_k^{-1}$ that is used in the 2nd-SPSA recursion (2.2a), mapping (2.10) with the available eigenvectors of $H_k$ also leads to an easy inversion of the estimated Hessian:
$$H_k^{-1} = P_k \hat{\Lambda}_k^{-1} P_k^T. \qquad (2.11)$$
The 2nd-SPSA based on mapping (2.10) makes the procedure of eliminating the non-positive-definiteness of $H_k$ a precise one. It is noted that the key parameters needed for the mapping ($\varepsilon$ and $\lambda_{q-1}$) are internally determined by $H_k$ at each iteration. This is different from some other forms of $f_k$ where a user-specified coefficient is needed.
From the eigenvalue perturbation theorem,
$$\lambda_p(\Delta H_k) \le \lambda_i - \lambda_i^* \le \lambda_1(\Delta H_k) \quad \text{for all } i = 1, 2, \ldots, p \qquad (2.12)$$
where $\lambda_i^*$ denotes the eigenvalues of $H^*$. Furthermore, $\lambda_p(\Delta H_k)$ and $\lambda_1(\Delta H_k)$ are the minimum and maximum eigenvalues of the $k$-th perturbation matrix $\Delta H_k = H_k - H^*$, respectively.
respectively. Equation (2.12) suggests that the perturbation matrix will have greater impact on<br />
*<br />
the smaller eigenvalues in terms <strong>of</strong> their fractional changes as H converges to H . Hence,<br />
k<br />
the smallest positive eigenvalue ( λ ) has also been redefined at each iteration to avoid its<br />
q<br />
possible near-zero value. When all the eigenvalues in (2.5) are positive and the smallest<br />
becomes stabilized, say empirically λ > 0.1<br />
p<br />
( ελ<br />
p −1<br />
) with<br />
p − 2<br />
ε = ( λ<br />
p − 1<br />
/ λ<br />
1<br />
) or λ > 0 in<br />
p<br />
10 consecutive iterations, we set Λˆ k<br />
= Λ . Specifically, k<br />
H asymptotically converges to a<br />
k<br />
*<br />
positively definite H so that λ >0 as k → ∞ see [24]. Hence,<br />
p<br />
Λˆ<br />
k<br />
→ Λ<br />
k<br />
→ 0 since,<br />
asymptotically, elements <strong>of</strong> Λˆ are continuous functions <strong>of</strong> k<br />
H . Here<br />
k<br />
Λ is a continuous<br />
k<br />
function <strong>of</strong> H . There<strong>for</strong>e, ˆ<br />
*<br />
*<br />
*<br />
Λ k<br />
→ Λ almost surely when H k<br />
→ H where Λ denotes all the<br />
*<br />
eigenvalues <strong>of</strong> H<br />
k<br />
. This follows from the basic property <strong>of</strong> continuous function <strong>for</strong><br />
deterministic sequence. Both Λ and<br />
k<br />
H converge <strong>for</strong> almost all points in their underlying<br />
k<br />
sample spaces. We further note that our mapping from Λ to<br />
k<br />
Λˆ defined by (2.7) and (2.8) is<br />
k<br />
also a continuous function asymptotically. Here, we like to point out that the mapping<br />
f<br />
k defined by (2.10) preserves the key spectral characters such as the spread <strong>of</strong> those known<br />
k<br />
k<br />
λ<br />
p<br />
25
CHAPTER 2. PROPOSED <strong>SPSA</strong> ALGORITHM<br />
positive eigenvalues λ<br />
1<br />
/ λ q<br />
Furthermore, as $k \to \infty$, any mapping for 2nd-SPSA should preserve the complete spectral property of $H_k^{-1}$. Therefore, the proposed mapping to a matrix in 2nd-SPSA is different from the matrix regularization in an ill-posed inversion problem, where the spectral property of an ill-conditioned matrix is changed to make the problem well posed.
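For concreteness, the eigenvalue mapping of (2.7)-(2.10) can be sketched in a few lines of Python. This is a minimal illustration written for this discussion, not the implementation used in the experiments; the function name `map_hessian` and the choice to skip the replacement when fewer than three positive eigenvalues exist are our own.

```python
import numpy as np

def map_hessian(H):
    """Positive-definiteness mapping f_k for a symmetric Hessian estimate.

    The smallest positive eigenvalue and all non-positive eigenvalues are
    replaced by a descending geometric series, cf. (2.7), with the decay
    rate epsilon fitted to the remaining positive eigenvalues, cf. (2.8).
    """
    lam, P = np.linalg.eigh(H)             # H = P diag(lam) P^T, cf. (2.9)
    lam, P = lam[::-1].copy(), P[:, ::-1]  # sort lam_1 >= lam_2 >= ... >= lam_p
    p = lam.size
    q = int(np.sum(lam > 0.0))             # 1-based index of the smallest positive eigenvalue
    if q >= 3:                             # enough positives to fit epsilon; else left unchanged here
        eps = (lam[q - 2] / lam[0]) ** (1.0 / (q - 2))  # (2.8): (lam_{q-1}/lam_1)^{1/(q-2)}
        for i in range(q - 1, p):          # replace lam_q, ..., lam_p by the series (2.7)
            lam[i] = eps * lam[i - 1]
    return P @ np.diag(lam) @ P.T          # f_k(H_k) = P_k Lambda_hat_k P_k^T, cf. (2.10)
```

For the indefinite spectrum $\{4, 2, 1, -0.5, -3\}$ we have $q = 3$ and $\varepsilon = 2/4 = 0.5$, so the mapped spectrum becomes $\{4, 2, 1, 0.5, 0.25\}$, which is strictly positive.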
2.4 -Description of Proposed SPSA Algorithm
The 1st-SPSA algorithm predetermines the gain series $a_k$ for the whole iteration process, whereas 2nd-SPSA derives a generalized gain series $a_k H_k^{-1}$ that is adapted to near optimality at each iteration. However, based on previous analyses, the inverse of the estimated Hessian $H_k$ generally introduces additional error sensitivity inherited in $H_k$ for a non-perfectly conditioned matrix with $\kappa(H_k) > 1$. To avoid computing the inverse of an ill-conditioned matrix while still approximately optimizing the gain series at each iteration, we can modify the first recursion of 2nd-SPSA (2.2a) by replacing $\hat{\Lambda}_k$ in the mapping $f_k$ of (2.10) with $\bar{\Lambda}_k$, which contains constant diagonal elements:
$$\hat{\theta}_{k+1} = \hat{\theta}_k - a_k \bar{\lambda}_k^{-1} \hat{g}_k(\hat{\theta}_k) \qquad (2.13)$$
where $\bar{\lambda}_k$ is the geometric mean of all the eigenvalues of $H_k$:
$$\bar{\lambda}_k = \left( \lambda_1 \lambda_2 \cdots \lambda_{q-1} \hat{\lambda}_q \hat{\lambda}_{q+1} \cdots \hat{\lambda}_p \right)^{1/p}. \qquad (2.14)$$
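The modified recursion (2.13)-(2.14) then amounts to a plain gradient step scaled by the inverse geometric mean of the mapped eigenvalues. A minimal sketch follows; the function name and calling convention are illustrative, and the eigenvalues passed in are assumed to be the already-mapped, strictly positive ones.

```python
import numpy as np

def proposed_spsa_step(theta, g_hat, mapped_eigvals, a_k):
    """One update of recursion (2.13): rather than multiplying the gradient
    estimate by the inverse Hessian estimate, scale it by the inverse
    geometric mean lam_bar of the mapped eigenvalues, cf. (2.14)."""
    lam_bar = np.exp(np.mean(np.log(mapped_eigvals)))  # geometric mean, (2.14)
    return theta - a_k * g_hat / lam_bar               # (2.13)
```

For the mapped spectrum $\{4, 2, 1, 0.5, 0.25\}$ the product of the eigenvalues is 1, so $\bar{\lambda}_k = 1$ and (2.13) reduces to an ordinary first-order gradient step.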
Recursions (2.13) and (2.2b), together with (2.5), (2.7)-(2.8) and (2.14), form a modified version of 2nd-SPSA that takes advantage of both the well-conditioned 1st-SPSA and the internally determined gain sequence of 2nd-SPSA. The proportionality coefficient $a$ of $a_k$ $(= a/(k+1+A)^{\alpha},\ A \ge 0)$ in 1st-SPSA depends on the individual loss function and is generally selected by a trial-and-error approach in practice. On the other hand, the 2nd-SPSA algorithm removes such an uncertainty in selecting its proportionality coefficient $a$ of $a_k$ $(= a/(k+1+A)^{\alpha},\ A \ge 0)$, since the asymptotically near-optimal selection of $a$ is 1 [24]. The crucial property that $a$ in 1st-SPSA depends on the individual loss function has been built into 2nd-SPSA through its generalized gain series $(k+1+A)^{-\alpha} H_k^{-1}$, $A \ge 0$. From this perspective, our proposed SPSA algorithm (2.13) can be considered as an extension of
1st-SPSA in which $a$ is replaced by a scalar series $\bar{\lambda}_k^{-1}$ that depends on the individual loss function and varies with the iteration. Before replacing $a$ by $\bar{\lambda}_k^{-1}$, in order to enhance convergence and stability, the use of an adaptive gain sequence for parameter updating is proposed; this application considers the following conditions:
a) $a_k = \eta\, a_{k-1}$, $\eta \ge 1$, if $J(\theta_k) < (1 + \beta)\, J(\theta_{k-1})$
b) $a_k = \mu\, a_{k-1}$, $\mu \le 1$, if $J(\theta_k) \ge (1 + \beta)\, J(\theta_{k-1})$.
In addition to gain attenuation when the value of the criterion becomes worse, a "blocking" mechanism is also applied, i.e., the current step is rejected and, starting from the previous parameter estimate, a new step is carried out (with a new gradient evaluation and a reduced updating gain). The parameter $\beta$ in condition (a) represents the permissible increase in the criterion before step rejection and gain attenuation occur. A constant gain sequence $c_k = c$, as in the SPSA assumptions and implementation of Sec. 2.8, can be used for the gradient approximation, the value of $c$ being selected so as to overcome the influence of noise. In the neighborhood of the optimum, a decaying sequence of the form defined in Sec. 2.8 is required to evaluate the gradient with enough accuracy and to avoid an amplification of the "slowing down" effect. When these conditions have been implemented in $a_k$, it can be replaced by $\bar{\lambda}_k^{-1}$.
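The adaptive gain conditions (a)-(b) and the blocking mechanism can be sketched as follows. This is our reading of the scheme; the function name, the default values of $\eta$, $\mu$ and $\beta$, and the inequality direction in condition (b) are illustrative assumptions rather than values prescribed above.

```python
def adaptive_gain_step(theta_prev, theta_cand, J, a_prev, eta=1.05, mu=0.5, beta=0.1):
    """Adaptive gain with 'blocking': accept the candidate step and enlarge
    the gain while the criterion J has not worsened by more than the factor
    (1 + beta); otherwise reject (block) the step and attenuate the gain,
    so the next step restarts from the previous estimate."""
    if J(theta_cand) < (1.0 + beta) * J(theta_prev):  # condition (a)
        return theta_cand, eta * a_prev               # accept step, eta >= 1
    return theta_prev, mu * a_prev                    # condition (b): block step, mu <= 1
```

The returned pair is the retained parameter estimate and the updated gain; a rejected step triggers a fresh gradient evaluation with the reduced gain.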
2.5 -Asymptotic Normality
The strong convergence of $\hat{\theta}_k$ generally implies an asymptotic normal distribution. In [24], the asymptotic normal distributions for both 1st-SPSA and 2nd-SPSA are established. Although our interests are mainly in finite samples, let us present the following asymptotic arguments as a way of relating to previously known results. Since the proposed algorithm can also be considered as an extension of 1st-SPSA with a special gain series $\bar{\lambda}_k^{-1}$, the analysis of the asymptotic normality for 1st-SPSA can also be extended to M2-SPSA. In this section, we first review the asymptotic normal distributions for 1st-SPSA and 2nd-SPSA. Then, the asymptotic efficiency is compared for three different algorithms: 1st-SPSA, 2nd-SPSA, and the proposed SPSA algorithm. Using Fabian's result [19], the following asymptotic normality of $\hat{\theta}_k$ in 1st-SPSA is established:
$$k^{\beta/2} (\hat{\theta}_k - \theta^*) \xrightarrow{\mathrm{dist}} N(\xi, \Sigma) \quad \text{as } k \to \infty \qquad (2.15)$$
where $\xi$ and $\Sigma$ are the mean vector and covariance matrix, and $\beta/2$ characterizes the rate of convergence and is related to the parameters of the gain sequences $a_k$ and $c_k$. The mean $\xi$ in (2.15) depends on the third derivatives of the loss function at $\theta^*$ and generally vanishes except for a special set of gain sequences. The covariance matrix $\Sigma$ for $\alpha \le 1$ is orthogonally similar to the diagonal matrix that is proportional to the inverse eigenvalues of the Hessian:
$$\Sigma = \psi a P^* \Lambda^{*-1} P^{*T} \qquad (2.16)$$
where $P^*$ is orthogonal with $H^* = P^* \Lambda^* P^{*T}$, $\Lambda^* = \mathrm{diag}[\lambda_1^*, \lambda_2^*, \ldots, \lambda_p^*]$, and the coefficient of proportionality $\psi$ depends on the statistical parameters in the algorithm [16]. Again, according to the eigenvalue perturbation theorem [16], the difference between $\lambda_i$ $(i = 1, 2, \ldots, p)$ at the $k$-th iteration and $\lambda_i^*$ in (2.16) is bounded by the difference in its Hessian:
$$\left| \lambda_i - \lambda_i^* \right| \le \kappa_{\lambda}(P) \left\| H_k(\hat{\theta}_k) - H^* \right\|_2, \quad i = 1, 2, \ldots, p \qquad (2.17)$$
where $\| \cdot \|_2$ denotes the spectral norm of a matrix, which leads to the definition of the spectral condition number in
$$\kappa_{\lambda}(H) = \lambda_{\max} / \lambda_{\min}. \qquad (2.18)$$
It is noted that $H_k(\hat{\theta}_k)$ converges almost surely to $H^*$ and that the mapping from $H_k$ to $f_k(H_k)$ defined by (2.10) preserves the matrix spectra. Furthermore, since $\hat{\Lambda}_k - \Lambda_k \to 0$ as $k \to \infty$ and the calculation from $H_k$ to $\Lambda_k$ is a continuous function, we also have the following strong convergence for the eigenvalues of the Hessian:
$$\hat{\Lambda}_k \to \Lambda^* = \mathrm{diag}[\lambda_1^*, \lambda_2^*, \ldots, \lambda_p^*], \quad \bar{\lambda}_k \to \bar{\lambda}^* \quad \text{as } k \to \infty \qquad (2.19)$$
where $\bar{\lambda}^*$ is the geometric mean of all the eigenvalues of $H^*$. Based on (2.15), (2.16) and (2.19), we conclude that the choice of the gain $a_k \bar{\lambda}_k^{-1}$ in M2-SPSA can also be considered as a natural extension of 1st-SPSA with a sensible selection of $a_k$ based on its asymptotic normality. For 2nd-SPSA, the corresponding asymptotic normality result is
$$k^{\beta/2} (\hat{\theta}_k - \theta^*) \xrightarrow{\mathrm{dist}} N(\mu, \Omega) \quad \text{as } k \to \infty \qquad (2.20)$$
where $\beta = \alpha - 2\gamma$. The covariance matrix $\Omega$ is proportional to $H^{*-2} = P^* \Lambda^{*-2} P^{*T}$ with the same coefficient of proportionality $\psi$ as in (2.16), and the mean $\mu$ depends on both the gain sequence parameters and the third derivatives of the loss function at $\theta^*$. The asymptotic mean square error (MSE) of $k^{\beta/2}(\hat{\theta}_k - \theta^*)$ in (2.20) is given by [16]:
$$\mathrm{MSE}_{2ndSPSA}(\alpha, \gamma) = \mu^T \mu + \mathrm{trace}(\Omega). \qquad (2.21)$$
We first consider the special case of a diagonal Hessian with constant eigenvalues $(\lambda_i^* = \lambda^* = \bar{\lambda}^*)$. It can be shown that the asymptotic normality of $\hat{\theta}_k$ in 2nd-SPSA [18] is identical to that in 1st-SPSA [17] when the following gain sequences are picked:
$$N(\mu, \Omega) = N(\xi, \Sigma) \quad \text{when } a_k = \phi/(k+1) \text{ and } a_k = \phi/[(k+1)\bar{\lambda}^*] \qquad (2.22)$$
where the constant $\phi$ represents a common scale factor for the two gain sequences. The near-optimal selection of $\phi$ for 2nd-SPSA is $\phi = 1$. Note that the true optimal selection of the gain is essentially infeasible as it depends on the third derivatives of the loss [16]. Equation (2.22) suggests that the near-optimal MSE in 2nd-SPSA can be achieved in 1st-SPSA by picking its proportionality coefficient $a$ in such a way that $a = 1/\bar{\lambda}^*$. Since $a$ in 1st-SPSA is externally prescribed, such an optimal picking of $a$ is only theoretically possible. On the other hand, the internally determined gain sequence $a_k \bar{\lambda}_k^{-1}$ $(= k^{-1} \bar{\lambda}_k^{-1})$ in the proposed SPSA algorithm makes the near-optimal picking practically possible for the special case of constant eigenvalues. Next, we consider the specification of the gain sequence with $\alpha < 1$ and $3\gamma - \alpha/2 > 0$, from which $\mu = \xi = 0$ [16]. The asymptotic distribution-based MSE for 2nd-SPSA under this condition is proportional to the sum of the squared inverse eigenvalues:
$$\mathrm{MSE}_{2SPSA}(\alpha, \gamma) = \mathrm{trace}(\Omega) \propto \mathrm{trace}(\Lambda^{*-2}) = \sum_{i=1}^{p} \lambda_i^{*-2}. \qquad (2.23)$$
On the other hand, the MSE for our proposed SPSA can be derived by setting $a = 1/\bar{\lambda}^*$ in 1st-SPSA:
$$\mathrm{MSE}_{M2SPSA}(\alpha, \gamma) = \mathrm{trace}(\Sigma)\big|_{a = 1/\bar{\lambda}^*} \propto \bar{\lambda}^{*-1}\, \mathrm{trace}(\Lambda^{*-1}) = \bar{\lambda}^{*-1} \sum_{i=1}^{p} \lambda_i^{*-1}. \qquad (2.24)$$
The constants of proportionality are related to $c$ and to the variances of $\Delta_k$ and of the measurement noise. Therefore, the ratio of the MSEs for M2-SPSA to 2nd-SPSA is given by
$$R_0(\alpha, \lambda) \equiv \frac{\mathrm{MSE}_{M2SPSA}}{\mathrm{MSE}_{2SPSA}} = \frac{\left[\prod_{i=1}^{p} \lambda_i^{*-1}\right]^{1/p} \cdot (1/p)\sum_{i=1}^{p} \lambda_i^{*-1}}{(1/p)\sum_{i=1}^{p} \lambda_i^{*-2}} \le 1 \qquad (2.25)$$
where we have used a well-known relation in the last inequality of (2.25):
$$\text{(geometric mean)} \le \text{(arithmetic mean)} \le \text{(root-mean-square)}. \qquad (2.26)$$
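The bound in (2.25)-(2.26) is easy to verify numerically. The sketch below is our own construction: it evaluates $R_0$ for random positive spectra and for a constant spectrum, the latter corresponding to a perfectly conditioned Hessian.

```python
import numpy as np

def mse_ratio_r0(eigvals):
    """R_0 of (2.25): ratio of the asymptotic MSEs of M2-SPSA and 2nd-SPSA
    for a positive spectrum lambda_1^*, ..., lambda_p^*."""
    x = 1.0 / np.asarray(eigvals, dtype=float)  # x_i = 1 / lambda_i^*
    gm = np.exp(np.mean(np.log(x)))             # geometric mean of the x_i
    return gm * np.mean(x) / np.mean(x ** 2)    # GM * AM / RMS^2 <= 1 by (2.26)

rng = np.random.default_rng(0)
ratios = [mse_ratio_r0(s) for s in rng.uniform(0.1, 10.0, size=(100, 6))]
```

Every value in `ratios` is at most 1, and equality holds only when all eigenvalues are equal, i.e., when $\kappa(H^*) = 1$.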
Equality in (2.26) holds only when all the eigenvalues are equal, which corresponds to a perfectly conditioned Hessian with $\kappa(H^*) = 1$. Since the ratio $R_0$ has been derived from the asymptotic MSEs, the comparison between M2-SPSA and 2nd-SPSA has been made under the same rate of convergence. Our third case in the asymptotic efficiency analysis is to consider $\alpha = 1$ with $3\gamma - \alpha/2 \ge 0$ in 2nd-SPSA. This setting again corresponds to $\mu = \xi = 0$ in 2nd-SPSA and in the proposed SPSA algorithm. It is possible for both 1st-SPSA and 2nd-SPSA to set $\alpha = 1$ in their gain sequence selection. The near-optimal rate of convergence obtained in 2nd-SPSA by setting $a = 1$ can be attained in 1st-SPSA by adjusting its $a$ to yield the same rate of convergence as 2nd-SPSA. By setting $a = 1/\bar{\lambda}^*$ in 1st-SPSA for the implementation of our proposed SPSA, we can again derive (2.25), which shows the superiority of our proposed SPSA over 2nd-SPSA under the same rate of convergence. However, the above setting of $a = 1/\bar{\lambda}^*$ in 1st-SPSA is allowed only if the resulting condition in 1st-SPSA, $\min_i (\lambda_i^* / \bar{\lambda}^*) \ge \beta/2$, still holds [16]. When this condition is violated while implementing M2-SPSA for a relatively large $\kappa(H^*)$, the setting of $\alpha = 1$ in our proposed SPSA algorithm is excluded and we can no longer make a straight comparison of the asymptotic MSEs between 2nd-SPSA and M2-SPSA
under the same rate of convergence. Under this circumstance, there is no superiority of either M2-SPSA or 2nd-SPSA over the other in terms of efficiency or rate of convergence. The superiority of our proposed SPSA algorithm over 2nd-SPSA indicated by (2.25) only shows an improvement in the multiplier of the convergence rate ($R_0$) when the common convergence rate is sub-optimal. In [25], it is shown that by setting $\alpha = 1$ and $\gamma = 1/6$, an asymptotically optimal MSE can be achieved with a maximum rate of convergence for the MSE of $\hat{\theta}_k$ of $k^{-\beta} = k^{-2/3}$ in both 1st-SPSA and 2nd-SPSA. We have already shown that, in order to avoid violating the condition $\min_i (\lambda_i^* / \bar{\lambda}^*) \ge \beta/2$, the setting of $\alpha = 1$ (with $\beta \approx 2/3$) is often not allowed in our proposed SPSA algorithm. Neither is it possible to choose a different set of $\alpha$ and $\gamma$ to yield $\beta = 2/3$ other than $\gamma = 1/6$. Under this circumstance, the maximum rate of convergence of $k^{-2/3}$ for the MSE cannot be achieved by our proposed SPSA. It is noted that a mapping $f_k$ such as the one proposed in Sec. 2.3 will leave the asymptotic $H_k$ unchanged (when we set $\hat{\Lambda}_k = \Lambda_k$) as $k \to \infty$. On the other hand, our proposed SPSA algorithm changes $H_k$ when its $\Lambda_k$ is replaced by $\bar{\Lambda}_k$.
2.6 -Fisher Information Matrix
2.6.1 -Introduction to Fisher Information Matrix
In this section, we present a relatively simple MCNR method for obtaining the FIM, which is used in order to estimate the Hessian matrix efficiently. Thus, the resampling-based method relies on an efficient technique for estimating the Hessian matrix. The FIM plays a central role in the practice and theory of identification and estimation. This matrix provides a summary of the amount of information in the data relative to the quantities of interest [22]. Suppose that the $i$-th measurement of a process is $z_i$ and that a stacked vector of $n$ such measurement vectors is $z_n \equiv [z_1^T, z_2^T, \ldots, z_n^T]^T$. Let us assume that the general form of the joint probability density or probability mass function for $z_n$ is known, but that this function depends on an unknown vector $\theta$. Let the probability density/mass function for $z_n$ be $p_f(\zeta \mid \theta)$, where $\zeta$ ("zeta") is a
dummy vector representing the possible outcomes for $z_n$ (in $p_f(\zeta \mid \theta)$, the index $n$ on $z_n$ is being suppressed for notational convenience). The corresponding likelihood function is
$$\ell(\theta \mid \zeta) = p_f(\zeta \mid \theta). \qquad (2.27)$$
With the definition of the likelihood function in (2.27), we are now in a position to present the Fisher information matrix. The expectations below are with respect to the dataset $z_n$. The $p \times p$ information matrix $F_n(\theta)$ for a differentiable log-likelihood function is given by [22]:
$$F_n(\theta) \equiv E\left( \frac{\partial \log \ell}{\partial \theta} \cdot \frac{\partial \log \ell}{\partial \theta^T} \;\middle|\; \theta \right). \qquad (2.28)$$
In the case where the underlying data $\{z_1, z_2, \ldots, z_n\}$ are independent (and even in many cases where the data may be dependent), the magnitude of $F_n(\theta)$ will grow at a rate proportional to $n$, since $\log \ell(\cdot)$ will represent a sum of $n$ random terms. Then, the bounded quantity $F_n(\theta)/n$ is employed as an average information matrix over all measurements. Except for relatively simple problems, however, the form in (2.28) is generally not useful in the practical calculation of the information matrix, since computing the expectation of a product of multivariate non-linear functions is usually a hopeless task. A well-known equivalent form follows by assuming that $\log \ell(\cdot)$ is twice differentiable in $\theta$. The following Hessian matrix
$$H(\theta \mid \zeta) \equiv \frac{\partial^2 \log \ell(\theta \mid \zeta)}{\partial \theta \, \partial \theta^T}$$
is assumed to exist, subject to standard regularity conditions [22]. One of these conditions is that the set $\{\zeta : \ell(\theta \mid \zeta) > 0\}$ does not depend on $\theta$. A fundamental implication of the regularity of the likelihood is that the necessary interchanges of differentiation and integration are valid. Then, the information matrix is related to the Hessian matrix of $\ell$ through:
$$F_n(\theta) = -E\left[ H(\theta \mid Z_n) \mid \theta \right]. \qquad (2.29)$$
The form in (2.29) is usually more amenable to calculating the matrix than the product-based
form in (2.28). Note that in some applications, the observed information matrix at a particular dataset $z_n$ may be easier to compute and/or preferred from an inference point of view relative to the actual information matrix $F_n(\theta)$ in (2.29). Although the method in this work is described for the determination of $F_n(\theta)$, the efficient Hessian estimation may also be used directly for the determination of $H(\theta \mid z_n)$ when it is not easy to calculate the Hessian directly.
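As a toy illustration of (2.29) (our own example, not taken from this work), consider a zero-mean Gaussian with unknown variance $v$. Here $\log \ell = -(n/2)\log(2\pi v) - \sum_i z_i^2/(2v)$, so $\partial^2 \log \ell / \partial v^2 = n/(2v^2) - \sum_i z_i^2 / v^3$, and averaging the negative Hessian over simulated datasets approximates the analytic value $F_n(v) = n/(2v^2)$:

```python
import numpy as np

def fim_variance_mc(v, n, n_mc=20000, seed=0):
    """Monte Carlo estimate of the Fisher information for the variance v of a
    zero-mean Gaussian, via F_n(v) = -E[ d^2 log l / d v^2 ], cf. (2.29)."""
    rng = np.random.default_rng(seed)
    z = rng.normal(0.0, np.sqrt(v), size=(n_mc, n))  # n_mc simulated datasets of length n
    s = (z ** 2).sum(axis=1)                         # sum of squares per dataset
    hess = n / (2.0 * v ** 2) - s / v ** 3           # d^2 log l / d v^2, per dataset
    return -hess.mean()                              # averaged negative Hessian
```

With $v = 2$ and $n = 10$ the estimate is close to the analytic value $10/(2 \cdot 4) = 1.25$, with Monte Carlo error shrinking as the number of simulated datasets grows.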
2.6.2 -Two Key Properties of the Information Matrix: Connections to the Covariance Matrix of Parameter Estimates
Let $\theta^*$ denote the unknown "true" value of $\theta$. The primary rationale for $F_n(\theta)$ as a measure of information about $\theta$ within the data $z_n$ comes from its connection to the covariance matrix of the estimate of $\theta$ constructed from $z_n$. The first of the key properties makes this connection via an asymptotic normality result [23]. In particular, for some common forms of estimates $\hat{\theta}_n$ (e.g., maximum likelihood and Bayesian maximum a posteriori), it is known that, under modest conditions,
$$\sqrt{n}\, (\hat{\theta}_n - \theta^*) \xrightarrow{\mathrm{dist}} N(0, \bar{F}^{-1}) \qquad (2.30)$$
where $\xrightarrow{\mathrm{dist}}$ denotes convergence in distribution and
$$\bar{F} \equiv \lim_{n \to \infty} \frac{F_n(\theta^*)}{n} \qquad (2.31)$$
provided that the indicated limit exists and is invertible. Hence, in practice, for $n$ reasonably large, $F_n(\theta)^{-1}$ can serve as an approximate covariance matrix of the estimate $\hat{\theta}_n$ when $\theta$ is chosen close to the unknown $\theta^*$. Relationship (2.30) also holds for optimal implementations of some recursive algorithms where the data $z_i$ are processed recursively instead of in a batch mode as is typical in maximum likelihood. This includes optimal versions of gradient-based SA algorithms, which include popular algorithms such as LMS and NN backpropagation as special cases. The second key property of the information matrix applies in finite samples.
If $\hat{\theta}_n$ is any unbiased estimator of $\theta$ [23], then
$$\mathrm{cov}(\hat{\theta}_n) \ge F_n(\theta^*)^{-1}, \quad \forall n. \qquad (2.32)$$
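A quick Monte Carlo check of (2.32), in a toy scenario of our own choosing: for $N(\theta, \sigma^2)$ data with known $\sigma^2$, $F_n(\theta) = n/\sigma^2$, and the sample mean is unbiased with variance exactly $\sigma^2/n$, so it attains the bound.

```python
import numpy as np

# Numerical check of the information inequality (2.32) for the sample mean
# of N(theta, sigma^2) data: the sample mean is unbiased and efficient, so
# its variance coincides with the inverse Fisher information sigma^2 / n.
rng = np.random.default_rng(1)
sigma, n, n_mc = 2.0, 25, 40000
theta_hats = rng.normal(0.0, sigma, size=(n_mc, n)).mean(axis=1)
crb = sigma ** 2 / n            # inverse Fisher information, F_n(theta)^{-1}
mc_var = theta_hats.var()       # Monte Carlo variance of the estimator
```

Here `mc_var` is close to `crb` (0.16), illustrating that the bound in (2.32) is met with equality for this efficient estimator.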
There is also an expression analogous to (2.32) for biased estimators, but it is not especially useful in practice because it requires knowledge of the gradient of the bias with respect to θ. Expressions (2.30) and (2.32), taken together, point to the close connection between the inverse Fisher information matrix and the covariance matrix of the estimator. While (2.30) is an asymptotic result, (2.32) applies for all sample sizes subject to the unbiasedness requirement. It is also clear why the name "information matrix" is used for F_n(θ): a larger F_n(θ) (in the matrix sense) is associated with a smaller covariance matrix (i.e., more information), while a smaller F_n(θ) is associated with a larger covariance matrix (i.e., less information). The calculation of F_n(θ) is often difficult or impossible in many non-linear problems. Obtaining the required first or second derivatives of the log-likelihood function may be a formidable task in some applications, and computing the required expectation of the generally non-linear multivariate function is often impossible in problems of practical interest. To address this difficulty, this subsection outlines a computer resampling approach to estimating F_n(θ). This approach is useful when analytical methods for computing F_n(θ) are infeasible. The approach makes use of an idea introduced for Hessian estimation in SA-based optimization, even though this problem is not directly one of optimization. The basis for the technique below is to use computational horsepower in lieu of traditional detailed theoretical analysis to determine F_n(θ). The method here is an example of a Monte Carlo Newton–Raphson (MCNR) method for producing an estimate. Such methods have become very popular as a means of handling problems that were formerly infeasible. Other notable Monte Carlo techniques are the bootstrap method for determining statistical distributions of estimates and the Markov chain Monte Carlo method for producing pseudorandom numbers and related quantities. Part of the appeal of the Monte Carlo method here for estimating F_n(θ) is that it can be implemented with only evaluations of the log-likelihood.
2.6.3 Estimation of F_n(θ)
The calculation of F_n(θ) is often difficult or impossible in practical problems. Obtaining the
2.6 FISHER INFORMATION MATRIX<br />
required first or second derivatives of the log-likelihood function may be a formidable task in some applications, and computing the required expectation of the generally non-linear multivariate function is often impossible in problems of practical interest. This section outlines a computer resampling approach to estimating F_n(θ) that is useful when analytical methods for computing F_n(θ) are infeasible. The approach makes use of a computationally efficient and easy-to-implement method for Hessian estimation that was described by Spall [24] in the context of optimization.

The computational efficiency follows from the low number of log-likelihood or gradient values needed to produce each Hessian estimate. Although there is no optimization here per se, we use the same basic simultaneous perturbation (SP) formula for Hessian estimation [this is the same SP principle given earlier in Spall [24] for gradient estimation]. However, the way in which the individual Hessian estimates are averaged differs from Spall [24] because of the distinction between the problem of recursive optimization and the problem of estimation of F_n(θ). The essence of the method is to produce a large number of SP estimates of the Hessian matrix of log l(·) and then average the negative of these estimates to obtain an approximation to F_n(θ). This approach is directly motivated by the definition of F_n(θ) as the mean value of the negative Hessian matrix (2.29). To produce the SP Hessian estimates, we generate pseudodata vectors in a Monte Carlo manner. The pseudodata are generated according to a bootstrap resampling scheme treating the chosen θ as "truth," i.e., according to the probability model p_Z(ζ | θ) given in (2.27). So, for example, if it is assumed that the real data Z_n are jointly normally distributed, N(µ(θ), Σ(θ)), then the pseudodata are generated by Monte Carlo according to a normal distribution with mean µ and covariance matrix Σ evaluated at the chosen θ. Let the i-th pseudodata vector be Z_pseudo(i); the use of Z_pseudo without the argument is a generic reference to a pseudodata vector. This data vector represents a sample of size n from the assumed distribution for the set of data based on
the unknown parameters taking on the chosen value of θ. Hence, the basis for the technique is to use computational horsepower in lieu of traditional detailed theoretical analysis to determine F_n(θ). Two other notable Monte Carlo techniques are the bootstrap method for determining statistical distributions of estimates and the Markov chain Monte Carlo method for producing pseudorandom numbers and related quantities. Part of the appeal of the Monte Carlo method here for estimating F_n(θ) is that it can be implemented with only evaluations of the log-likelihood. The approach below can work either with log l(θ | Z_pseudo) values alone or with the gradient g(θ | Z_pseudo) ≡ ∂ log l(θ | Z_pseudo)/∂θ if that is available. The former usually corresponds to cases where the likelihood function and associated non-linear process are so complex that no gradients are available. To highlight the fundamental commonality of the approach in this dissertation, we assume the following: let G(θ | Z_pseudo) represent either a gradient approximation (based on log l(θ | Z_pseudo) values) or the exact gradient g(θ | Z_pseudo). Because of its efficiency, the SP gradient approximation is recommended in the case where only log l(θ | Z_pseudo) values are
available (Spall [24]). We now present the Hessian estimate. Let Ĥ_k denote the k-th estimate of the Hessian H(·). The formula for estimating the Hessian is

Ĥ_k = (1/2) { (δG_k / 2c_k) [Δ_k1^{-1}, Δ_k2^{-1}, ..., Δ_kp^{-1}] + ( (δG_k / 2c_k) [Δ_k1^{-1}, Δ_k2^{-1}, ..., Δ_kp^{-1}] )^T }    (2.33)
where δG_k = G(θ + Δ_k | Z_pseudo) − G(θ − Δ_k | Z_pseudo), and the perturbation vector in this approach, Δ_k = [Δ_k1, Δ_k2, ..., Δ_kp]^T, is a mean-zero random vector such that the {Δ_ki} are "small" symmetrically distributed random variables that are uniformly bounded in k, i and satisfy E(|1/Δ_ki|) < ∞ uniformly in k, i. This latter condition excludes such commonly used Monte
Carlo distributions as the uniform and the Gaussian. Assume that |Δ_kj| ≤ c for some small c > 0. In most implementations, the {Δ_kj} are independent and identically distributed (iid) across k and j. In implementations involving antithetic random numbers, Δ_k and Δ_{k+1} may be dependent random vectors for some k, but at each k the {Δ_kj} are iid (across j). Note that the user has full control over the choice of the Δ_ki distribution. A valid (and simple) choice is the Bernoulli ±c distribution (it is not known at this time if this is the "best" distribution to choose).
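To make (2.33) concrete, the sketch below builds one SP Hessian estimate for an assumed quadratic log-likelihood whose Hessian is known, so the estimate can be checked. The model, the dimension, and the convention that Δ_k itself carries the scale c (so the finite-difference divisor is 2 rather than 2c_k) are illustrative assumptions consistent with the Bernoulli ±c choice above:

```python
import numpy as np

# Sketch of one SP Hessian estimate in the spirit of (2.33), for an assumed
# quadratic log-likelihood with gradient grad(theta) = H_true @ theta, so the
# true Hessian is known. Delta is Bernoulli +/- c (an assumed convention,
# consistent with |Delta_kj| <= c above, so the divisor is 2 rather than 2c).
rng = np.random.default_rng(2)
H_true = np.array([[4.0, 1.0, 0.0],
                   [1.0, 3.0, 1.0],
                   [0.0, 1.0, 2.0]])

def grad(theta):
    return H_true @ theta

def sp_hessian_estimate(theta, c, rng):
    """One estimate H_hat_k: symmetrized outer product of delta-G with 1/Delta."""
    delta = c * rng.choice([-1.0, 1.0], size=theta.size)  # Bernoulli +/- c
    dG = grad(theta + delta) - grad(theta - delta)
    M = np.outer(dG / 2.0, 1.0 / delta)
    return 0.5 * (M + M.T)

H_hat = sp_hessian_estimate(np.zeros(3), c=0.001, rng=rng)
print(np.linalg.matrix_rank(H_hat))   # a single estimate has rank at most 2

# A single estimate is crude, but the average of many approaches H_true.
H_avg = np.mean([sp_hessian_estimate(np.zeros(3), c=0.001, rng=rng)
                 for _ in range(4000)], axis=0)
print(np.round(H_avg, 1))
```

The rank check and the averaged estimate illustrate numerically the two properties discussed in this section: each Ĥ_k is a poor, rank-deficient estimate, yet the averaging process recovers the full Hessian.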
The prime rationale for (2.33) is that Ĥ_k is a nearly unbiased estimator of the unknown H. Spall [24] gave conditions such that the Hessian estimate has an O(c²) bias. The next proposition considers this further in the context of the resulting (small) bias in the estimate of the FIM.
Proposition 1. Suppose that g(θ | Z_pseudo) is three times continuously differentiable in θ for almost all Z_pseudo. Then, based on the structure and assumptions of (2.33) (see reference [22]),

E[F_{M,N}(θ)] = F(θ) + O(c²).

Proof: Spall [24] showed that E(Ĥ_k | Z_pseudo) = H(θ | Z_pseudo) + O(c²) under the stated conditions on g(·) and Δ_k. Because F_{M,N}(θ) is a sample mean of −Ĥ_k values, the result to be proved follows immediately.

The symmetrizing operation in (2.33) is convenient to maintain a symmetric Hessian estimate. To illustrate how the individual Hessian estimates may be quite poor, note that Ĥ_k in (2.33) has (at most) rank two (and may not even be positive semi-definite). This low quality, however, does not prevent the information matrix estimate of interest from being accurate, since it is not the Hessian per se that is of interest. The averaging process eliminates the inadequacies of the individual Hessian estimates.
Given the form for the Hessian estimate in (2.33), it is now relatively straightforward to estimate F_n(θ). Averaging Hessian estimates across many Z_pseudo(i) yields an estimate of

E[H(θ | Z_pseudo(i))] = −F_n(θ)

to within an O(c²) bias (the expectation on the left-hand side above is with respect to the pseudodata). The resulting estimate can be made as accurate as desired by reducing c and increasing the number of Ĥ_k values being averaged. The averaging of the Ĥ_k values may be done recursively to avoid having to store many matrices. Of course, the interest is not in the Hessian per se; rather, the interest is in the (negative) mean of the Hessian, according to (2.29) (so the averaging must reflect many different values of Z_pseudo(i)). This leads to greater variability for a given number (N) of pseudodata vectors. This estimation procedure also helps keep the Hessian matrix estimate positive definite. Let us now present a step-by-step summary of the above Monte Carlo resampling approach for estimating F_n(θ). The MCNR method is an iterative procedure that can be used to approximate the maximum of a likelihood function in situations where direct likelihood computation is infeasible because of the existence of unmeasured variables, missing data, or measurement error. Let Δ_k^(i) represent the k-th perturbation vector for the i-th realization (i.e., for Z_pseudo(i)).
The Monte Carlo algorithm with a resampling method for estimating F_n(θ) is described as follows:

Step 1. (Initialization). Determine θ, the sample size n, and the number N of pseudodata vectors that will be generated. In other words, we need to determine θ̂(k) and the number of pseudodata vectors that will be generated. Determine whether log-likelihood values log l(·) or gradient information g(·) will be used to form the Ĥ_k. Pick a small number c_k (e.g., c_k = 0.001) in the Bernoulli ±c_k distribution used to generate the perturbations Δ_ki.
Step 2. (Generating pseudodata). Based on the θ̂(k) given in step 1, generate by the Monte Carlo method the i-th pseudodata vector of n pseudo-measurements, Z_pseudo(i).
Step 3. (Hessian estimation). With the i-th pseudodata vector from step 2, compute M ≥ 1 Hessian estimates according to (2.33) [22]. Let the sample mean of these M estimates be H̄^(i) = H̄^(i)(Z_pseudo(i)). Unless antithetic random numbers are being used, the perturbation vectors {Δ_k^(i)} should be mutually independent across the realizations i and within each realization (across k). (If only log-likelihood values are available and SP gradient approximations are being used to form the G(·) values, the perturbations forming the gradient approximations, say {Δ̃_k^(i)}, should likewise be mutually independent.) Z_pseudo(i) is the pseudodata vector; this vector represents a sample of size n from the assumed distribution of the set of data based on the unknown parameters.
Step 4. (Averaging Hessian estimates). Repeat steps 2 and 3 until N pseudodata vectors have been processed. Take the negative of the average of the N Hessian estimates H̄^(i) produced in step 3; this is the estimate of F_n(θ). The key parameters needed for the mapping are internally determined by F_n(θ) at each iteration. Figure 2.2 is a schematic of the steps.

Fig. 2.2. Diagram of the method for forming the estimate F_{M,N}(θ).
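The four steps can be sketched end to end for a simple assumed model where the answer is known in closed form: for Z_i ~ N(µ, σ²) with θ = (µ, σ), the exact information is F_n(θ) = n·diag(1/σ², 2/σ²). The model, the sample sizes (n, N, M), the perturbation size c, and the use of the exact gradient for G(·) are all illustrative assumptions:

```python
import numpy as np

# Steps 1-4 sketched for the assumed model Z_i ~ N(mu, sigma^2), theta = (mu,
# sigma), whose exact information is F_n = n * diag(1/sigma^2, 2/sigma^2).
rng = np.random.default_rng(3)
mu0, s0, n = 0.0, 2.0, 30
theta = np.array([mu0, s0])
N, M, c = 3000, 2, 0.001   # Step 1: pseudodata count, estimates per vector, c

def loglik_grad(theta, z):
    """Exact gradient g(theta | z) of the normal log-likelihood."""
    mu, s = theta
    r = z - mu
    return np.array([r.sum() / s**2, -z.size / s + (r**2).sum() / s**3])

def sp_hessian(theta, z, c, rng):
    """One SP Hessian estimate of the log-likelihood, as in (2.33)."""
    delta = c * rng.choice([-1.0, 1.0], size=theta.size)  # Bernoulli +/- c
    dG = loglik_grad(theta + delta, z) - loglik_grad(theta - delta, z)
    Mmat = np.outer(dG / 2.0, 1.0 / delta)
    return 0.5 * (Mmat + Mmat.T)

acc = np.zeros((2, 2))
for _ in range(N):                         # Step 2: generate pseudodata
    z = rng.normal(mu0, s0, size=n)
    for _ in range(M):                     # Step 3: M Hessian estimates
        acc += sp_hessian(theta, z, c, rng)
F_hat = -acc / (N * M)                     # Step 4: negated average
print(F_hat)
```

For these assumed values, F_n(θ) = diag(7.5, 15.0), and the negated average of the 6000 rank-two estimates should land close to it, with accuracy improving as N·M grows and c shrinks, exactly as Proposition 1 suggests.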
2.7 Efficiency Between 1st-SPSA, 2nd-SPSA and M2-SPSA
The proposed SPSA algorithm presented above offers considerable potential for accelerating the convergence of SA algorithms while requiring only loss function measurements (no gradient or higher-derivative measurements are needed), since it requires only three measurements per iteration to estimate both the gradient and the Hessian, independent of the dimension of the problem. The relationships among 1st-SPSA, 2nd-SPSA and M2-SPSA can also be understood from a different perspective: 1st-SPSA (2.1) and M2-SPSA (2.13) weight the different components of the estimated gradient ĝ_k(θ̂_k) equally, whereas 2nd-SPSA (2.2a) weights them differently to account for the different sensitivities of θ. A steeper eigen-direction (greater λ_i) requires a smaller step (≈ 1/λ_i) to effectively reach the exact solution [25][26]. Both 2nd-SPSA and our proposed SPSA algorithm capture the dependence of the step size on the overall sensitivities of θ at each iteration. From this perspective, 2nd-SPSA and the proposed SPSA algorithm are superior to 1st-SPSA. However, because our proposed SPSA weights the different components of ĝ_k(θ̂_k) equally with an averaged step (≈ 1/λ̄_k), it has given up the further advantage of higher-order sensitivity of θ. Therefore, whether our proposed SPSA algorithm is better than 2nd-SPSA or not at finite iterations is determined by the relative importance of two competing factors that influence the efficiency of the algorithm: the elimination of the matrix inverse reduces the magnitude of errors, whereas the lack of gradient sensitivity may deteriorate the accuracy. It is noted that the asymptotic relation (2.25) only shows an improvement of our proposed SPSA over 2nd-SPSA in terms of its rate coefficient. Both our proposed SPSA algorithm and 2nd-SPSA have the same rate of convergence, characterized by k^{−β/2}, as shown by (2.20). The asymptotic relation (2.25) provides a theoretical rationale for considering M2-SPSA over 2nd-SPSA in practice, although the maximum rate of convergence of k^{−2/3} for the MSE cannot be achieved for our proposed SPSA algorithm. Another rationale for proposing M2-SPSA is that the amplification of errors in an ill-conditioned H_k* through the matrix inversion is a well-established result, whereas the benefit of the gradient sensitivity through a Newton–Raphson search only shows up near the extreme point (θ*) with a near-exact Hessian [26]. Recall, however, that such justification for the proposed SPSA algorithm is restricted to the case where the gains are not asymptotically optimal, in order to achieve fast convergence within finite iterations. For the asymptotically optimal gains (a_k ≈ 1/k, c_k ≈ 1/k^{1/6}), 2nd-SPSA is superior to M2-SPSA except in the case where all eigenvalues of H_k* are identical (where 2nd-SPSA and M2-SPSA coincide). It is shown that the magnitude of errors in 2nd-SPSA is dependent on the matrix conditioning.
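The step-size argument above can be made tangible with a toy quadratic loss (an assumed example, not from the text): a Newton-type step θ − H^{-1}g scales each eigen-direction by 1/λ_i and lands on the minimizer in one step, while a single averaged step ≈ 1/λ̄ over-damps the steep direction and under-damps the shallow one:

```python
import numpy as np

# Toy quadratic loss L(theta) = 0.5 * theta' H theta (assumed example):
# Newton weights each eigen-direction by 1/lambda_i; a scalar averaged
# step 1/lambda_bar treats all directions alike.
H = np.diag([1.0, 25.0])                      # eigenvalues 1 and 25
grad = lambda th: H @ th
th0 = np.array([1.0, 1.0])

th_newton = th0 - np.linalg.solve(H, grad(th0))   # one Newton step: exact

th = th0.copy()
step = 1.0 / np.mean(np.diag(H))              # averaged step 1/13
for _ in range(50):
    th = th - step * grad(th)

print(np.linalg.norm(th_newton), np.linalg.norm(th))
```

With the averaged step 1/λ̄ = 1/13, the steep direction oscillates (contraction factor −12/13 per iteration) while the shallow one contracts by 12/13, so after 50 iterations both directions retain noticeable error; the Newton step removes both at once, illustrating the higher-order sensitivity that the averaged-step scheme gives up.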
We have shown that the magnitude of errors in SPSA is dependent on the matrix conditioning of H* due to two competing factors. Since both factors are strongly related to the same quantity, the matrix conditioning, the relative efficiency between M2-SPSA and 2nd-SPSA might be less dependent on specific loss functions. However, such a replacement does not necessarily suggest that the magnitude of errors in our proposed SPSA is independent of the matrix conditioning of H*, since the computation of λ̄_k is dependent on the matrix properties of H*.
2.8 Implementation Aspects
The five points below have been found important in making the adaptive simultaneous perturbation (ASP) approach perform well in practice. Before describing these points, we note that, while the ASP structure in (2.2a), (2.2b), and (2.2) is general, we will largely restrict ourselves in our choice of G_k(·) (and G_k^(1)(·)) in the remainder of the discussion in order to present concrete theoretical and numerical results. For M2-SPSA, we will consider the simultaneous perturbation approach for generating G_k(·) and G_k^(1)(·), while for second-order stochastic gradient (2SG), we will suppose that G_k(·) = G_k^(1)(·) is an unbiased direct measurement of g(·); in other words, G_k(θ̂_k) is the input information related to g(θ̂_k). The rationale for basic SPSA in the gradient-free case has been discussed extensively elsewhere (e.g., Spall [28]) and hence will not be discussed in detail here. (In summary, it tends to lead to more efficient optimization than the classical finite-difference Kiefer–Wolfowitz method while being no more difficult to implement; the relative efficiency grows with the problem dimension.) In the gradient-based case, stochastic gradient (SG) methods include as special cases the well-known approaches mentioned at the beginning of the dissertation (backpropagation, etc.). SG methods are themselves special cases of the general Robbins–Monro root-finding framework and, in fact, most of the results here can apply in this root-finding setting as well. The associated Appendixes A and B provide part of the theoretical justification for SP, establishing conditions for the almost sure (a.s.) convergence of both the iterate and the Hessian estimate. We can now explain the five points in the implementation of M2-SPSA as follows:
1) θ and H Initialization: Typically, (2.2a) is initialized at some θ̂_0 believed to be near θ*. One may wish to run the standard first-order SA algorithm (i.e., (2.2a) without H̄_k^{-1}) or some other "rough" optimization approach for some period in order to move the initial θ for ASP closer to θ*. With the indexing shown in (2.2b), no initialization of the H_k recursion is required, since H_0 is computed directly from Ĥ_0; nevertheless, the recursion may be trivially modified to allow for an initialization if one has useful prior information. If this is done, then the recursion may be initialized at (say) scale·I_{p×p}, scale ≥ 0, or some other positive semi-definite matrix reflecting available prior information (e.g., if one knows that the θ elements will have very different magnitudes, then the initialization may be chosen to approximately scale for the differences). It is also possible to run (2.2b) in parallel with the rough search methods that might be used for initializing θ. Since Ĥ_k has (at most) rank two (and may not be positive semi-definite), having a positive-definite initialization helps provide for the invertibility of H_k, especially for small k (if H_k is positive definite, f_k(·) in (2.2a) may be taken as the identity transformation).

2) Numerical Issues in the Choice of Δ_k and H̄_k: Generating the elements of Δ_k according to a Bernoulli ±1 distribution is easy and theoretically valid (and was shown to be asymptotically optimal in Brennan and Rogers [27] and Spall [28] for basic SPSA; its potential optimality for the adaptive approach here is an open question). In some applications, however, it may be worth exploring other valid choices of distributions, since the generation of Δ_k represents a trivial part of the cost of optimization, and a different choice may yield improved finite-sample performance. Because H_k may not be positive definite, especially for small k (even if H_0 is initialized to be positive definite based on prior information), it is recommended that H_k in (2.2b) not generally be used directly in (2.2a). Hence, as shown in (2.2a), it is recommended that H_k be replaced by another matrix H̄_k that is closely related to H_k. One useful form when p is not too large has been to take H̄_k = (H_k H_k)^{1/2} + δ_k I, where the indicated square root is the (unique) positive semi-definite square root and δ_k ≥ 0 is some small number.
For large p, a more efficient method is to simply set H̄_k = H_k + δ_k I, but this is likely to require a larger δ_k to ensure positive definiteness of H̄_k. For very large p, it may be advantageous to have H̄_k be only a diagonal matrix based on the diagonal elements of H_k + δ_k I. This is a way of capturing large scaling differences in the elements (unavailable to first-order algorithms) while eliminating the potentially onerous computations associated with the inverse operation in (2.2a). Note that H̄_k should only be used in (2.2a), as (2.2b) should remain in terms of H_k to ensure a.s. consistency. By Theorems 2a, b, one can set H̄_k = H_k for sufficiently large k. Also, for a general (non-diagonal) H̄_k, it is numerically advantageous to avoid a direct inversion of H̄_k in (2.2a), preferring a method such as Gaussian elimination.
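A minimal sketch of the H_k → H̄_k mapping described above, for a symmetric but indefinite H_k (the matrix and the δ_k value are assumptions): the positive semi-definite square root of H_k H_k is obtained from an eigendecomposition, and the update direction is computed with a solve rather than an explicit inverse:

```python
import numpy as np

# Map an indefinite symmetric H_k to H_bar_k = (H_k H_k)^{1/2} + delta_k * I.
# For symmetric H_k = V diag(w) V', the PSD square root of H_k H_k is
# V diag(|w|) V'.
def make_pd(Hk, delta):
    w, V = np.linalg.eigh(Hk)
    return (V * np.abs(w)) @ V.T + delta * np.eye(Hk.shape[0])

Hk = np.array([[2.0, 0.0],
               [0.0, -1.0]])                  # indefinite Hessian estimate
Hbar = make_pd(Hk, delta=1e-3)

# Use Hbar only in the theta-update (2.2a); prefer a solve to an inverse.
g = np.array([1.0, 1.0])
newton_dir = np.linalg.solve(Hbar, g)
print(np.linalg.eigvalsh(Hbar))               # strictly positive eigenvalues
```

For this example the mapping simply flips the negative eigenvalue and adds δ_k, giving H̄_k = diag(2.001, 1.001), which is safely invertible even though H_k is not positive definite.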
3) Gradient/Hessian Averaging: At each iteration, it may be desirable to compute and average several gradient estimates G_k(·) and Hessian estimates Ĥ_k despite the additional cost. This may be especially true in a high-noise environment.
4) Gain Selection: The principles outlined in Brennan and Rogers [27] and Spall [28] are useful here as well for the practical selection of the gain sequences {a_k}, {c_k} and, in the M2-SPSA case, {c̃_k}. For M2-SPSA the critical gain a_k can simply be chosen as 1/k, k ≥ 1, to achieve asymptotic near-optimality or optimality, respectively, although this may not be ideal in practical finite-sample problems. For the remainder, let us focus on the M2-SPSA case. Here we can choose a_k = a/(k + A)^α, c_k = c/k^γ and c̃_k = c̃/k^γ, with a, c, c̃, α, γ > 0 and A ≥ 0 for k ≥ 1. In finite-sample practice, it may be better to choose α and γ lower than their asymptotically optimal values of α = 1 and γ = 1/6 (see Sec. 2.10); in particular, α = 0.602 and γ = 0.101 are practically effective and approximately the lowest theoretically valid values allowed (see Theorems 1a, 2a, and 3a). Choosing a so that the typical change in θ̂_k is of "reasonable" magnitude, especially in the critical early iterations, has proven effective. Setting A approximately equal to 5–10% of the total expected number of iterations enhances practical convergence by allowing for a larger a than is possible with the more typical A = 0. However, in slight contrast to Spall [28] for the first-order algorithm, we recommend that c have a magnitude
greater (by roughly a factor of 2–10) than the typical ("one-sigma") noise level in the y(·) measurements. Further, setting c̃ > c has been effective. These recommendations for larger c (and c̃) values than those given in Spall [28] are made because of the greater inherent sensitivity of a second-order algorithm to noise effects.
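The gain choices in point 4 can be sketched directly (the constants a and c and the iteration budget are illustrative assumptions; the exponents 0.602 and 0.101 and the "A ≈ 10% of the iterations" guideline follow the recommendations above):

```python
# Gain sequences from point 4: a_k = a/(k + A)^alpha, c_k = c/k^gamma.
# The constants a, c and n_iter are assumptions; alpha, gamma and A follow
# the practical guidelines in the text.
n_iter = 1000
a, c = 0.5, 0.2
alpha, gamma = 0.602, 0.101
A = 0.1 * n_iter                 # ~10% of the expected number of iterations

def gains(k):
    """Return (a_k, c_k) for iteration k >= 1."""
    return a / (k + A) ** alpha, c / k ** gamma

print(gains(1), gains(n_iter))   # slowly decaying step and perturbation sizes
```

Note how A > 0 keeps the early steps a_k from being much larger than the later ones, while the small exponent γ = 0.101 lets the perturbation size c_k decay very slowly, consistent with keeping c above the measurement-noise level during the critical early iterations.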
2.9 Strong Convergence
This section presents results related to the strong (a.s.) convergence <strong>of</strong><br />
ˆ θ * θ → k<br />
and<br />
H H( θ<br />
* k<br />
→ ) (all limits are as unless otherwise noted). This section establishes separate results<br />
<strong>for</strong> M2-<strong>SPSA</strong>. One <strong>of</strong> the challenges, <strong>of</strong> course, in establishing convergence is the coupling<br />
between the recursions <strong>for</strong><br />
θˆ k<br />
and<br />
H<br />
k<br />
. Formal convergence <strong>of</strong><br />
H<br />
k<br />
(see Theorems 2a, b) may<br />
still hold under such weighting provided that the analog to expressions (A10) and (A13) in the<br />
pro<strong>of</strong> <strong>of</strong> Theorem 2a (see Appendix) holds. We present a martingale approach that seems to<br />
provide a relatively simple solution with reasonable regularity conditions. Alternative conditions for convergence might be available using the ordinary differential equation approach of Metivier and Priouret [29] and Benveniste [30], which includes a certain Markov dependence that would, in principle, accommodate the recursion coupling. However, this approach was not pursued here due to the difficulty of checking certain regularity conditions associated with the Markov dependence (e.g., those related to the solution of the “Poisson equation”). The results
below are in two parts, with the first part (Theorems 1a, b) establishing conditions for the convergence of θ̂_k, and the second part (Theorems 2a, b) doing the same for H̄_k. The proofs of the theorems are in Appendix A. We let ‖⋅‖ denote the standard Euclidean vector norm or compatible matrix spectral norm (as appropriate), (θ*)_i and (θ − θ*)_i represent the i-th components of the indicated vectors (notation chosen to avoid confusion with the iteration subscript k), “i.o.” represent “infinitely often,” and ḡ_k(θ) ≡ H̄_k^{-1} g(θ). Below are some regularity conditions that will be used in Theorem 1a for M2-SPSA and, in part, in the succeeding theorems. Some comments on the practical implications of the conditions are given immediately following their statement. Note that some conditions show a dependence on θ̂_k and H̄_k
, the very quantities for which we are showing convergence. Although such “circularity” is generally undesirable, it is fairly common in the SA field (e.g., Kushner and Yin [31], Benveniste [30]). The inherent difficulty in establishing theoretical properties of adaptive approaches comes from the need to couple the estimates for the parameters of interest and for
the Hessian (Jacobian) matrix. Note that the bulk of the conditions here showing dependence on θ̂_k and H̄_k are conditions on the measurement noise and smoothness of the loss function (C.0, C.2, and C.3 below; C.0′, C.2′, C.3′, C.8, and C.8′ in later theorems); the explicit dependence on θ̂_k can be removed by assuming that the relevant condition holds uniformly for all “reasonable” θ. The dependence in C.5 is handled in the lemma below. The following assumptions, which follow the guidelines in [16], are very useful for establishing our theorems.
C.0: E(ε_k^(+) − ε_k^(−) | θ̂_k, Δ_k; H̄_k) = 0 a.s. ∀k, where ε_k^(±) is the effective SA measurement noise, i.e., ε_k^(±) ≡ y(θ̂_k ± c_kΔ_k) − L(θ̂_k ± c_kΔ_k).
C.1: a_k, c_k > 0 ∀k; a_k → 0 and c_k → 0 as k → ∞; Σ_{k=0}^∞ a_k = ∞; and Σ_{k=0}^∞ (a_k/c_k)² < ∞.
C.2: For some δ, ρ > 0 and ∀k, l: E(|y(θ̂_k ± c_kΔ_k)/Δ_kl|^{2+δ}) ≤ ρ; |Δ_kl| ≤ ρ; Δ_kl is symmetrically distributed about 0; and {Δ_kl} are mutually independent.
C.3: For some ρ > 0 and almost all θ̂_k, the function g(⋅) is continuously twice differentiable with a uniformly (in k) bounded second derivative for all θ such that ‖θ̂_k − θ‖ ≤ ρ.
C.4: For each k ≥ 1 and all θ, there exists a ρ > 0 not dependent on k and θ such that (θ − θ*)ᵀ ḡ_k(θ) ≥ ρ‖θ − θ*‖.
C.5: For each i = 1, 2, …, p and any ρ > 0, P({ḡ_ki(θ̂_k) ≥ 0 i.o.} ∩ {ḡ_ki(θ̂_k) < 0 i.o.} | {|θ̂_ki − (θ*)_i| ≥ ρ ∀k}) = 0.
C.6: H̄_k^{-1} exists a.s. ∀k; c_k² H̄_k^{-1} → 0 a.s.; and, for some δ, ρ > 0, E(‖H̄_k^{-1}‖^{2+δ}) ≤ ρ.
C.7: For any τ > 0 and non-empty S ⊆ {1, 2, …, p}, there exists a ρ′(τ, S) > τ such that
lim sup_{k→∞} |Σ_{i∉S} (θ − θ*)_i ḡ_ki(θ)| / |Σ_{i∈S} (θ − θ*)_i ḡ_ki(θ)| < 1 a.s.   (2.34)
for all |(θ − θ*)_i| < τ when i ∉ S and |(θ − θ*)_i| ≥ ρ′(τ, S) when i ∈ S.
C.0 and C.1 are common martingale-difference noise and gain sequence conditions. C.2 provides a structure to ensure that the gradient approximations G_k(⋅) and G_k^(1)(⋅) are well behaved. The conditions on Δ_k preclude its elements from being uniformly or normally distributed due to their violation of the implied finite inverse moments condition in E(|y(θ̂_k ± c_kΔ_k)/Δ_kl|^{2+δ}) ≤ ρ. An independent Bernoulli ±1 distribution is frequently used for the elements of Δ_k. C.3 and C.4 provide basic assumptions about the smoothness and steepness of L(θ). C.3 holds, of course, if g(θ) is twice continuously differentiable with a bounded second derivative on ℝ^p. C.5 is a modest condition that says that θ̂_k cannot be bouncing around in a manner that causes the signs of the normalized gradient elements to be changing an infinite number of times if θ̂_k is uniformly bounded away from θ*. C.6 provides some conditions on the surrogate for the Hessian estimate that appears in (2.2a) and (2.2b). Since the user has full control over the definition of H̄_k, these conditions should be relatively easy to satisfy. Note that the middle part of C.6 (H̄_k^{-1} = o(c_k^{-2}) a.s.) allows for H̄_k^{-1} to “occasionally” be large provided that the boundedness of moments in the last part of the condition is satisfied. The example for H̄_k given in Sec. 2.8 [guideline 2] would satisfy this potential growth condition, for instance, if δ_k = c_k^ρ, 0 < ρ < 2.
Finally, C.7 ensures that, for k sufficiently large, each element of ḡ_k(θ) tends to make a non-negligible contribution to products of the form (θ − θ*)ᵀ ḡ_k(θ) (see C.4). A sufficient condition for C.7 is that, for each i, ḡ_ki(θ) be uniformly (in k) bounded > 0 and < ∞ when ‖θ − θ*‖ is bounded, as stated in the lemma below. Note that, although no explicit conditions are shown on {c̃_k}, there are implicit conditions in C.4–C.7 given c̃_k’s effect on ḡ_k (via H̄_k). In Theorem 2a on the convergence of H̄_k, there are explicit conditions on {c̃_k}.
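To make the role of these conditions concrete, here is a minimal sketch (illustrative Python, not the thesis implementation) of the simultaneous perturbation gradient approximation that C.0–C.2 regulate: two measurements of y(⋅) per estimate regardless of dimension, with Bernoulli ±1 perturbations, whose bounded inverse moments satisfy C.2.

```python
import random

def spsa_gradient(y, theta, c_k, rng):
    """One simultaneous-perturbation gradient estimate: two measurements
    of y(.) suffice regardless of the dimension p of theta."""
    p = len(theta)
    # Bernoulli +/-1 perturbations satisfy C.2: |Delta_kl| is bounded and
    # E|Delta_kl^{-1}| is finite (uniform/normal perturbations would not be).
    delta = [rng.choice((-1.0, 1.0)) for _ in range(p)]
    y_plus = y([t + c_k * d for t, d in zip(theta, delta)])
    y_minus = y([t - c_k * d for t, d in zip(theta, delta)])
    return [(y_plus - y_minus) / (2.0 * c_k * d) for d in delta]

# Noise-free quadratic loss for illustration; the true gradient at theta is 2*theta.
loss = lambda th: sum(t * t for t in th)
g = spsa_gradient(loss, [1.0, -2.0, 0.5], 0.1, random.Random(1))
```

Averaged over many independent Δ_k, the estimate is unbiased for the true gradient here, while a uniform or normal choice of Δ_k would violate the inverse-moment bound in C.2.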
Conditions C.5 and C.7 are relatively unfamiliar. So, before showing the main theorems on convergence for M2-SPSA, we give sufficient conditions for these two conditions in the lemma below. The main sufficient condition is the well-known boundedness condition on the SA iterate (e.g., Benveniste [30, Theorem II.15]). Although some authors have relaxed this boundedness condition (e.g., Kushner and Yin [31]), the condition imposes no practical limitation. This boundedness condition also formally eliminates the need for the explicit dependence of other conditions (C.2 and C.3 above; C.0′, C.2′, C.3′, C.8, and C.8′ below) on θ̂_k, since the conditions can be restated to hold for all θ in the bounded set containing θ̂_k. Note also that the condition a_k/c_k² → 0 holds automatically for gains in the standard form discussed in Sec. 2.9.1. One example where the remaining condition of the lemma, (2.35), is trivially satisfied is when H̄_k is chosen as a diagonal matrix (see guideline 2).
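A minimal sketch of the diagonal case just mentioned (an illustrative choice of f_k, not the thesis’s exact mapping): flooring the diagonal entries at δ_k = c_k^ρ keeps H̄_k invertible and keeps its inverse within the growth allowed by C.6.

```python
def floor_diagonal(diag_H, c_k, rho=1.0):
    """Map a diagonal Hessian surrogate into an invertible one by flooring
    each entry at delta_k = c_k**rho (0 < rho < 2), so that the inverse is
    O(c_k**-rho) = o(c_k**-2), consistent with the middle part of C.6."""
    delta_k = c_k ** rho
    return [max(h, delta_k) for h in diag_H]

# Negative or near-zero entries are lifted to delta_k = 0.1 here.
H_bar = floor_diagonal([4.0, -0.3, 1e-9], c_k=0.1)
```
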
Lemma—Sufficient Conditions for C.5 and C.7: Assume that C.1–C.4 and C.6 hold, and lim sup_{k→∞} ‖θ̂_k‖ < ∞ a.s. Then condition C.7 is automatically satisfied. Further, let a_k/c_k² → 0, and suppose that, for any ρ > 0,
P(sign ḡ_ki(θ̂_{k+1}) ≠ sign ḡ_ki(θ̂_k) i.o. | |θ̂_ki − (θ*)_i| ≥ ρ ∀k) = 0  ∀i.   (2.35)
Then C.5 is automatically satisfied.
Theorem 1a—M2-SPSA: Consider the SPSA estimate G_k(⋅), with G_k^(1)(⋅) given by (2.34). Let conditions C.0–C.7 hold. Then ‖θ̂_k − θ*‖ → 0 a.s.
Theorem 1b below on the second-order stochastic gradient (2SG) approach is a straightforward modification of Theorem 1a on M2-SPSA. In order to explain the theorems for M2-SPSA more clearly, we borrow from the theorems for the SG form [21]. Therefore, we replace C.0, C.1, and C.2 with the following SG analogs. Equalities hold a.s. where needed.
C.0′: E(e_k | θ̂_k, Δ_k; H̄_k) = 0 a.s. ∀k, where e_k ≡ G_k(θ̂_k) − g(θ̂_k).
C.1′: a_k > 0 ∀k; a_k → 0; Σ_{k=0}^∞ a_k = ∞; and Σ_{k=0}^∞ a_k² < ∞.
C.2′: For some δ, ρ > 0, E(‖G_k(θ̂_k)‖^{2+δ}) ≤ ρ ∀k.
Note (analogous to {c̃_k} in Theorem 1a) that there are no explicit conditions on {c_k} here. These conditions are implicit via the conditions on H̄_k, and will be made explicit when we consider the convergence of H̄_k in Theorem 2b.
Theorem 1b—2SG: Consider the setting where G_k(⋅) is a direct measurement of the gradient. Suppose that C.0′–C.2′ and C.3–C.7 hold. Then ‖θ̂_k − θ*‖ → 0 a.s.
Theorem 2a below treats the convergence of H̄_k in the SPSA case. We introduce several new conditions as follows, which are largely self-explanatory:
C.1′′: The conditions of C.1 hold plus Σ_{k=0}^∞ (k + 1)^{-2} (c̃_k c_k)^{-2} < ∞ with c̃_k = O(c_k).
C.3′: Change “thrice differentiable” in C.3 to “four-times differentiable,” with all else unchanged.
C.8: For some ρ > 0 and all k, l, m:
E[y(θ̂_k ± c_kΔ_k + c̃_kΔ̃_k)² / (Δ_kl² Δ̃_km²)] ≤ ρ and E[y(θ̂_k ± c_kΔ_k)² / (Δ_kl² Δ̃_km²)] ≤ ρ;
E(ε̃_k^(±) − ε_k^(±) | θ̂_k, Δ_k, Δ̃_k; H̄_k) = 0 and E[(ε̃_k^(±) − ε_k^(±))² / (Δ_kl² Δ̃_km²)] ≤ ρ,
where ε̃_k^(±) ≡ y(θ̂_k ± c_kΔ_k + c̃_kΔ̃_k) − L(θ̂_k ± c_kΔ_k + c̃_kΔ̃_k).
C.9: Δ̃_k satisfies the assumptions for Δ_k in C.2 (i.e., ∀k, l, |Δ̃_kl| ≤ ρ and Δ̃_kl is symmetrically distributed about 0; {Δ̃_kl} are mutually independent); Δ_k and Δ̃_k are independent; and, for some ρ > 0, E(Δ_kl^{-2}) ≤ ρ and E(Δ̃_kl^{-2}) ≤ ρ ∀k, l.
Theorem 2a—M2-SPSA: Let conditions C.0, C.1′′, C.2, C.3′, and C.4–C.9 hold. Then H̄_k → H(θ*) a.s.
Our final strong convergence result is for the Hessian estimate in 2SG. As above, we introduce some additional modified conditions.
C.1′′′: The conditions of C.1′ hold plus c_k > 0, c_k → 0, and Σ_{k=0}^∞ (k + 1)^{-2} c_k^{-2} < ∞.
C.8′: For some ρ > 0 and all k, l:
E(‖g(θ̂_k ± c_kΔ_k)/Δ_kl‖²) ≤ ρ, E(‖(e_k^(+) − e_k^(−))/Δ_kl‖²) ≤ ρ, and E((e_k^(+) − e_k^(−))/Δ_kl | θ̂_k) = 0,
where e_k^(±) ≡ G_k(θ̂_k ± c_kΔ_k) − g(θ̂_k ± c_kΔ_k).
C.9′: For some ρ > 0 and all k, l: |Δ_kl| ≤ ρ; Δ_kl is symmetrically distributed about 0; {Δ_kl} are mutually independent; and E(Δ_kl^{-2}) ≤ ρ.
Unlike this theorem’s companion result for 2SG (Theorem 1b), explicit conditions on {c_k} are necessary to control the convergence of the Hessian iteration. Note that, due to the simpler structure of 2SG (versus M2-SPSA), the conditions in C.9′ are a subset of the conditions in C.9 for Theorem 2a.
Theorem 2b—2SG: Suppose that C.0′, C.1′′′, C.2′, C.3′, C.4–C.7, C.8′, and C.9′ hold. Then H̄_k → H(θ*) a.s.
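The convergence H̄_k → H(θ*) asserted by Theorems 2a and 2b can be illustrated numerically. The sketch below is illustrative code under simplifying assumptions (not the thesis implementation): it follows the one-sided gradient-difference construction of 2SPSA in [16] with made-up numeric values, averaging per-iteration Hessian estimates at a fixed θ̂ for a noise-free quadratic loss, for which the average approaches H = diag(2, 6).

```python
import random

def hessian_estimate(y, theta, c, c_tilde, rng):
    """One 2SPSA-style per-iteration Hessian estimate built from four
    measurements of y(.), using Bernoulli +/-1 perturbations."""
    p = len(theta)
    d = [rng.choice((-1.0, 1.0)) for _ in range(p)]   # Delta_k
    dt = [rng.choice((-1.0, 1.0)) for _ in range(p)]  # Delta-tilde_k
    tp = [t + c * di for t, di in zip(theta, d)]
    tm = [t - c * di for t, di in zip(theta, d)]
    y_p, y_m = y(tp), y(tm)
    y_pt = y([t + c_tilde * di for t, di in zip(tp, dt)])
    y_mt = y([t + c_tilde * di for t, di in zip(tm, dt)])
    g1p = [(y_pt - y_p) / (c_tilde * dt[i]) for i in range(p)]
    g1m = [(y_mt - y_m) / (c_tilde * dt[i]) for i in range(p)]
    dG = [gp - gm for gp, gm in zip(g1p, g1m)]
    # Symmetrize the outer-product form so the estimate is a symmetric matrix.
    return [[0.5 * (dG[i] / (2 * c * d[j]) + dG[j] / (2 * c * d[i]))
             for j in range(p)] for i in range(p)]

def averaged_hessian(n=5000, seed=3):
    """Average per-iteration estimates at a fixed theta; for the quadratic
    loss below the average approaches H = diag(2, 6)."""
    rng = random.Random(seed)
    loss = lambda th: th[0] ** 2 + 3.0 * th[1] ** 2
    H_bar = [[0.0, 0.0], [0.0, 0.0]]
    for k in range(n):
        H_hat = hessian_estimate(loss, [0.5, -0.5], 0.1, 0.1, rng)
        for i in range(2):
            for j in range(2):
                # Running average, as in the recursion that forms H_bar_k.
                H_bar[i][j] += (H_hat[i][j] - H_bar[i][j]) / (k + 1)
    return H_bar

H_bar = averaged_hessian()
```

Each per-iteration estimate uses four loss measurements, matching the measurement count discussed in Sec. 2.10.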
2.10 Asymptotic Distributions and Efficiency Analysis
A. Asymptotic Distributions of ASP
This subsection builds on the convergence results in the previous section, establishing the asymptotic normality of the M2-SPSA and 2SG formulations of ASP. The asymptotic normality is then used in Subsection B to analyze the asymptotic efficiency of the algorithms. Proofs are in Appendix A.
M2-SPSA Setting: As before, we consider 2nd-SPSA before 2SG. Asymptotic normality, or the related issue of convergence of moments, in basic first-order SPSA has been established under slightly differing conditions by Spall [3], Spall and Cristion [32], Dippon and Renz [33], and Kushner and Yin [31, Ch. 10]. We consider gains of the typical form a_k = a/(k + A)^α and c_k = c/k^γ, with a, c, α, γ > 0, A ≥ 0, and k ≥ 1, and take β = α − 2γ, ρ² = E(Δ_ki^{-2}), and ξ² = E(Δ_ki²) ∀k, i. The asymptotic mean below relies on the third derivative of L(θ); we let L^(3)_ijk(θ*) represent the third derivative of L with respect to elements i, j, k of θ, evaluated at θ*. The following regularity conditions will be used in the asymptotic normality result.
C.10: E((ε_k^(+) − ε_k^(−))² | θ̂_k, H̄_k) → σ² a.s. for some σ² > 0. In addition, {E((ε_k^(+) − ε_k^(−))² | θ̂_k, c_kΔ_k = η)} is an equicontinuous sequence at η = 0 and is continuous in η on some compact, connected set containing the actual (observed) values of c_kΔ_k a.s.
C.11: In addition to the implicit conditions on α and γ via C.1′′, 3γ − α/2 ≥ 0 and β > 0. Further, when α = 1, a > β/2. Let f_k(⋅) in (2.2a) be chosen such that ‖f_k(H̄_k) − H̄_k‖ → 0 a.s.
Although, in some applications, the “→” for the noise second moments in C.10 may be replaced by “=,” the limiting operation allows for a more general setting. Since the user has full control over f_k(⋅), it is not difficult to guarantee in C.11 that ‖f_k(H̄_k) − H̄_k‖ → 0 a.s.
Theorem 3a—M2-SPSA: Suppose that C.0, C.1′′, C.2, C.3′, and C.4–C.9 hold (implying convergence of θ̂_k and H̄_k). Then, if C.10 and C.11 hold and H(θ*)^{-1} exists,
k^{β/2} (θ̂_k − θ*) →^{dist} N(µ, Ω)   (2.36)
where µ = 0 if 3γ − α/2 > 0 and µ = H(θ*)^{-1} T/(a − β_+/2) if 3γ − α/2 = 0, the j-th element of T is
−(1/6) a c² ξ² [ L^(3)_jjj(θ*) + 3 Σ_{i=1, i≠j}^p L^(3)_iij(θ*) ],   (2.37)
Ω = a² c^{-2} σ² ρ² H(θ*)^{-2}/(8a − 4β_+), and β_+ = β if α = 1 and β_+ = 0 if α < 1.
2SG Setting: The asymptotic normality of 2SG requires an analog of C.10 and C.11:
C.12: E(e_k e_kᵀ | θ̂_k) → Σ a.s. for some positive semidefinite matrix Σ; a > 1/2 if α = 1; and f_k(⋅) is chosen such that ‖f_k(H̄_k) − H̄_k‖ → 0 a.s.
As with C.10, “→” can frequently be replaced with “=” in the limiting covariance expression. Likewise, see the comments following C.11 regarding the condition ‖f_k(H̄_k) − H̄_k‖ → 0 a.s.
Theorem 3b—2SG: Suppose that C.0′, C.1′′′, C.2′, C.3′, C.4–C.7, C.8′, and C.9′ hold (implying convergence of θ̂_k and H̄_k), and that C.12 holds with H(θ*)^{-1} existing. Then,
k^{α/2} (θ̂_k − θ*) →^{dist} N(0, Ω′)   (2.38)
where Ω′ = a² H(θ*)^{-1} Σ H(θ*)^{-1}/(2a − β), with β = 1 if α = 1 and β = 0 if α < 1.
For 1st-SPSA and M2-SPSA, we have
rms_2SPSA(1, 1, c, 1/6) / [min_{2a > 1/λ_min} rms_1SPSA(a, 1, c, 1/6)] < 2  ∀c > 0,   (2.40a)
[min_{c > 0} rms_2SPSA(1, 1, c, 1/6)] / [min_{2a > 1/λ_min} min_{c > 0} rms_1SPSA(a, 1, c, 1/6)] < 2,   (2.40b)
where λ_min is the minimum eigenvalue of H(θ*). The interpretation of (2.40a) and (2.40b) is as follows. From (2.40a), we know that, for any common value of c, the asymptotic rms error of M2-SPSA is less than twice that of 1st-SPSA with an optimal a (even when c is chosen optimally for 1st-SPSA). Expression (2.40b) states that, if we optimize c only for M2-SPSA, while optimizing both a and c for 1st-SPSA, we are still guaranteed that the asymptotic rms error for M2-SPSA is no more than twice the optimized rms error for 1st-SPSA. Another interesting aspect of M2-SPSA is the relative robustness apparent in (2.40a) and (2.40b), given that the optimal a for 1st-SPSA will not typically be known in practice. For certain suboptimal values of a in 1st-SPSA, the rms error can get very large, whereas simply choosing a = 1 for M2-SPSA provides the factor-of-two guarantee mentioned above. Although (2.40a) and (2.40b) suggest that the M2-SPSA approach yields a solution that is quite good, one might wonder if a true optimal solution is possible. Dippon and Renz [33, pp. 1817–1818] pursue this issue, and provide an
solution is possible. Dippon and Renz [33, pp.1817–1818] pursue this issue, and provide an<br />
alternative to<br />
θ<br />
* −1<br />
H ( ) as the limiting weighting matrix <strong>for</strong> use in an SA <strong>for</strong>m such as (2.2a).<br />
Un<strong>for</strong>tunately, this limiting matrix has no closed-<strong>for</strong>m solution, and depends on the third<br />
derivatives <strong>of</strong> L (θ ) at<br />
adaptive matrix (analogous to<br />
*<br />
θ , and furthermore, it is not apparent how one would construct an<br />
H<br />
k<br />
that would converge to this optimal limiting matrix.<br />
Likewise, the optimal <strong>for</strong> M2-<strong>SPSA</strong> is typically unavailable in practice since it also depends on<br />
the third derivatives <strong>of</strong> L (θ ). Expressions (2.40a), (2.40b) are based on an assumption that<br />
1st-SPSA and M2-SPSA have used the same number of iterations. This is a reasonable basis for a core comparison since the “cost” of solving for the optimal 1st-SPSA gains is unknown. However, a more conservative representation of relative efficiency is possible by considering only the direct number of loss measurements, ignoring the extra cost for optimal gains in 1st-SPSA. In particular, 1st-SPSA uses two loss measurements per iteration and M2-SPSA uses four measurements per iteration. Hence, with both algorithms using the same number of loss
measurements, the corresponding upper bounds to the ratios in (2.40a) and (2.40b) (reflecting the ratio of rms errors as the common number of loss measurements gets large) would be 4^{2/3} ≈ 2.52, an increase from the bound of 2 under a common number of iterations. This bound’s
likely excessive conservativeness follows from the fact that the cost of solving for the optimal gains in 1st-SPSA is being ignored. Note that, for other adaptive approaches that are also asymptotically normally distributed, the same relative cost analysis can be used. Hence, for example, with the Fabian [19] approach using O(p²) measurements per iteration to generate the Hessian estimate, the corresponding upper bounds would be of magnitude O(p^{2/3}), bounds that, unlike the bounds for M2-SPSA, increase with problem dimension.
In the following chapters, once these numerical simulations showing the M2-SPSA performance are finished, we will demonstrate the performance of the proposed SPSA algorithm applied to parameter estimation in some realistic systems. The main advantages of our proposed algorithm will be shown, such as low computational cost, good accuracy, and fast convergence.
2.11 Perturbation Distribution for M2-SPSA
As discussed above, the perturbations Δ_k in the gradient estimate are based on Bernoulli random variables on {–1, 1}. In fact, the requirements are merely that the Δ_ki must be independent and symmetrically distributed about zero with finite absolute inverse moments E(|Δ_ki|^{-1}) for all k, i. The Bernoulli is just one distribution for Δ_ki that satisfies these
conditions. It has been shown that one cannot do better than this distribution in the asymptotic case [34], but less is known about the best distribution for small-sample approximations. Some numerical results seem to show better performance on some problems with non-Bernoulli distributions. The performance of three such alternative distributions is reported here: a split uniform distribution, an inverse split uniform distribution, and a symmetric double triangular distribution (referred to as candidate distributions in the following). The {–1, 1} Bernoulli distribution has variance and absolute first moment (mean magnitude) both equal to one. It is the only qualified distribution with these qualities. We conjecture that these characteristics are necessary conditions for optimal performance of the M2-SPSA algorithm, given optimal step size parameters. Variations in mean magnitude can be addressed by scaling the gradient step size (c), so, for comparisons, candidate distributions should have the same variance as the {–1, 1} Bernoulli. Then differences in performance could be attributed to differences in the nature of variability in that distribution.
Table 2.1. Characteristics of the perturbation distributions.
To ensure consistency in the comparison, we normalized the candidate distributions so that their variances were one and their mean magnitudes were close to one, but not so close that the essential character of the distributions was lost. The probability density functions of these distributions are given in Figs. 2.3–2.5. The characteristics of each distribution are given in Table 2.1.
The M2-SPSA algorithm with each distribution for the perturbations was applied to 34 functions from Moré’s suite of optimization problems [35]. The initial points recommended in Moré were used for each function. The function values were obscured with normally distributed errors with mean zero and a variance of one. We then used these noisy function values to calculate a simultaneous perturbation gradient approximation. For nearly all of the functions, errors of this magnitude are insignificant away from the minimum. However, most functions in the optimization suite have minima at or near zero, where N(0, 1) errors are quite significant. This situation is further complicated by the fact that many functions are extremely flat near the minimum as well. The result was a demanding examination of the M2-SPSA algorithm, offering ample opportunity to test alternative perturbation distributions. The step size parameters of the M2-SPSA algorithm (that is, a and c) were optimized for each distribution and each function by random search. The procedure to optimize the step parameters used 20,000 iterations of a directed random search algorithm.
In the directed random search (sometimes called a localized random search, see [36], p. 45), new trial values are generated near the location of the current best value. The algorithm accepts the input parameters as the current optimal values if they produce results that are better than the best yet obtained; otherwise they are rejected. This method is somewhat more sophisticated than simple random search, and generally more computationally efficient in that it uses information from previous iterations. For more information on random search methods, see Solis and Wets
[37]. For each iteration of the random search, we executed fifty Monte Carlo trials of the SPSA algorithm, and then accepted or rejected the parameter values based on the average of these fifty trials. The theoretically optimal values for α and γ were used. The M2-SPSA algorithm in the procedure outlined above was run for stopping times of n = 10, 100, and 1000 iterations to determine whether any one distribution outperformed the others over small, moderate, and large iteration domains. Common random numbers (CRN) were used to minimize variance. With CRN, the sequences of function values generated by the iteration differ only as a result of how the SPSA algorithm processes the random numbers. In this evaluation, the sequences of CRN were used to generate random perturbations from the appropriate distribution. This method allows the use of matched-pairs testing to determine the significance of differences in the minimum values observed. Matched-pairs testing generally leads to sharper analysis.
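The tuning procedure just described can be sketched as follows (a simplified, illustrative version with far smaller evaluation counts than the 20,000-iteration, fifty-trial procedure above):

```python
import random

def localized_random_search(f, x0, n_iters=600, step=0.2, n_avg=5, seed=0):
    """Directed (localized) random search: propose candidates near the
    current best point and accept them only if they improve on the best
    value observed so far. f(x, rng) may be noisy, so each candidate is
    scored by an average of n_avg evaluations."""
    rng = random.Random(seed)
    avg = lambda x: sum(f(x, rng) for _ in range(n_avg)) / n_avg
    best_x, best_f = list(x0), avg(x0)
    for _ in range(n_iters):
        cand = [xi + step * rng.gauss(0.0, 1.0) for xi in best_x]
        fc = avg(cand)
        if fc < best_f:  # accept only improvements on the best yet obtained
            best_x, best_f = cand, fc
    return best_x, best_f

# A noisy quadratic stands in for the Monte Carlo average of M2-SPSA
# terminal losses; its minimizer is at (1, -2).
noisy_quad = lambda x, rng: (x[0] - 1.0) ** 2 + (x[1] + 2.0) ** 2 + rng.gauss(0.0, 0.1)
x_best, f_best = localized_random_search(noisy_quad, [0.0, 0.0])
```

In the actual study, the scored objective would be the CRN-paired Monte Carlo average over fifty SPSA runs rather than this stand-in function.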
f_SU(x; a, b) = 1/(2(b − a)) for −b ≤ x ≤ −a or a ≤ x ≤ b, and 0 otherwise.
Fig. 2.3. Split uniform distribution.
f_ISU(x; a, b) = ab/(2(b − a)x²) for −b ≤ x ≤ −a or a ≤ x ≤ b, and 0 otherwise.
Fig. 2.4. Inverse split uniform distribution.
$$
f_{SDT}(x; a, b, c) =
\begin{cases}
\dfrac{x+c}{(c-a)(c-b)} & -c \le x \le -b \\[4pt]
\dfrac{-(x+a)}{(c-a)(b-a)} & -b \le x \le -a \\[4pt]
\dfrac{x-a}{(c-a)(b-a)} & a \le x \le b \\[4pt]
\dfrac{x-c}{(c-a)(b-c)} & b \le x \le c \\[4pt]
0 & \text{otherwise}
\end{cases}
$$

Fig. 2.5. Symmetric double triangular distribution.
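As an illustration of how perturbation samples can be drawn from the distributions of Figs. 2.3 and 2.4, the following sketch (function names and parameter values are ours, not from the text) uses inverse-transform sampling; both densities keep |x| in [a, b], so the bounded-inverse-moment requirement on SPSA perturbations is respected:

```python
import numpy as np

def split_uniform(rng, a, b, size):
    """Split uniform: |x| uniform on [a, b], with a random sign."""
    mag = rng.uniform(a, b, size)
    sign = rng.choice([-1.0, 1.0], size)
    return sign * mag

def inverse_split_uniform(rng, a, b, size):
    """Sample f(x) = ab / (2(b-a)x^2) on [-b,-a] U [a,b].

    The magnitude CDF is G(x) = b(x-a) / ((b-a)x), whose inverse is
    x = ab / (b - u(b-a)) for u ~ U(0, 1); the sign is drawn separately.
    """
    u = rng.uniform(0.0, 1.0, size)
    mag = a * b / (b - u * (b - a))
    sign = rng.choice([-1.0, 1.0], size)
    return sign * mag

rng = np.random.default_rng(0)
su = split_uniform(rng, 0.5, 1.5, 10000)
isu = inverse_split_uniform(rng, 0.5, 1.5, 10000)
# Every sample satisfies a <= |x| <= b, so E[1/x^2] stays bounded.
```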
2.12 Parameter Estimation

2.12.1 Introduction

In the proposed SPSA algorithm, all parameters are perturbed simultaneously; it is possible to
CHAPTER 2. PROPOSED SPSA ALGORITHM
modify the parameters with only two measurements of an evaluation function, regardless of the dimension of the parameter vector. A parameter estimation algorithm using M2-SPSA is proposed. The contribution of this chapter is an SPSA algorithm for parameter estimation that can be used with non-linear systems or systems with a very large number of parameters to be estimated. The proposed SPSA algorithm is an iterative method for optimization, with a randomized search direction, that requires at most three function (model) evaluations at each iteration. The M2-SPSA incorporates the 2nd-SPSA, which usually requires a reduced number of iterations, to obtain an initial estimate of the optimum parameter values θ*. The proposed SPSA algorithm makes use of the Hessian matrix to increase the rate of convergence. First-, second- and modified second-order SPSA algorithms were implemented to estimate the unknown parameters of the highly non-linear physical model. Hence, the execution time per iteration does not increase with the number of parameters. The method can handle non-linear dynamic models, non-equilibrium transient test conditions and data obtained in closed loop. For this reason, this method is suitable for the estimation of parameters in realistic applications. Firstly, it is necessary to show the general implementation of the SPSA algorithm. The general steps in the implementation of the SPSA algorithm are [28]: 1) initialization and coefficient selection, 2) numerical issues, 3) gradient/Hessian averaging, 4) gain selection (see Sec. 2.8). Finally, we have proposed a modification in this implementation. This modification is based on the recursive update form for the parameter vector, which is given by
$$
\hat{\theta}_{k+1} = \hat{\theta}_k - a_k\,\hat{g}_k(\hat{\theta}_k)
\tag{2.41}
$$
where a_k is a weight or gain constant for the current iteration and ĝ_k is a gradient estimate for the current iteration. This recursion updates θ̂_k to a new value θ̂_{k+1}. If θ̂_{k+1} falls outside the range of allowable values for θ, then the updated θ̂_{k+1} is projected to the nearest boundary and reassigned this projected value. Mathematically we have, for every i = 1, …, n:
$$
\hat{\theta}_{k+1,i} =
\begin{cases}
\hat{\theta}_{k+1,i} & \text{if } \theta_i^{\min} \le \hat{\theta}_{k+1,i} \le \theta_i^{\max} \\
\theta_i^{\min} & \text{if } \hat{\theta}_{k+1,i} < \theta_i^{\min} \\
\theta_i^{\max} & \text{if } \hat{\theta}_{k+1,i} > \theta_i^{\max}.
\end{cases}
$$
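A minimal sketch of this projected update, i.e. the recursion (2.41) followed by the componentwise projection onto the allowed bounds (the gradient-estimate routine is left abstract and all names are ours):

```python
import numpy as np

def projected_spsa_step(theta, grad_est, a_k, theta_min, theta_max):
    """One SPSA update (2.41) followed by projection onto the box bounds."""
    theta_new = theta - a_k * grad_est(theta)
    # Componentwise projection to the nearest boundary, as in the piecewise rule.
    return np.clip(theta_new, theta_min, theta_max)

# Toy usage: a quadratic loss whose true gradient stands in for the SPSA
# estimate, with bounds [0, 2] on each component.
grad = lambda th: 2.0 * (th - 5.0)   # unconstrained minimum at 5, outside bounds
theta = np.array([1.0, 0.5])
for _ in range(50):
    theta = projected_spsa_step(theta, grad, 0.1, 0.0, 2.0)
# The iterate is driven to, and held at, the upper boundary.
```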
Modifications to this step may be needed to enhance the convergence of the algorithm. In particular, the update could be blocked if the cost function actually worsens after the "basic" update in this step. The choice of the various parameters of the algorithm plays an important role in its convergence. It is suggested that α = 0.602 and γ = 0.101 are a practically effective and theoretically valid choice. The value of A is chosen to be 10% of the maximum number of iterations allowed. The maximum number of iterations was chosen to be 100, and hence A was chosen to be 10. It is recommended that, if the measurements are (almost) error free, c can be chosen as a small positive number; in this case it was chosen to be 0.01. The value of a should be chosen such that a/(A+1)^α times the magnitude of the elements of ĝ₀(θ̂₀) is approximately equal to the smallest of the desired change magnitudes among the elements of θ in the early iterations. For the problem at hand, a = 1 gave good results. This value of a was chosen to ensure that the components of θ would remain within the allowed bounds during the iterations.
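With the tuning values quoted above (α = 0.602, γ = 0.101, A = 10, a = 1, c = 0.01), the gain sequences can be computed as follows; the sequence forms a_k = a/(k+1+A)^α and c_k = c/(k+1)^γ are the usual SPSA choices and are assumed here rather than stated explicitly in the text:

```python
# Standard SPSA gain sequences with the tuning values discussed above.
alpha, gamma = 0.602, 0.101
a, c, A = 1.0, 0.01, 10.0   # A is 10% of the 100 allowed iterations

def gains(k):
    """Gains for iteration k = 0, 1, 2, ...; a_k decays faster than c_k."""
    a_k = a / (k + 1 + A) ** alpha
    c_k = c / (k + 1) ** gamma
    return a_k, c_k

a0, c0 = gains(0)    # first-iteration gains: a_0 = 1/11**0.602, c_0 = 0.01
a99, c99 = gains(99) # gains at the final allowed iteration
```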
We have proposed modifying the typical implementation of the SPSA algorithm for the parameter estimation application according to the M2-SPSA algorithm, so that the optimization of the parameter vector θ̂ is modified as follows. The vector θ̂ is obtained by solving the following problem:
$$
\hat{\theta} = \arg\min_{\theta} H(\theta)
\quad \text{subject to} \quad
\begin{aligned}
\theta_1^{\min} &\le \theta_1 \le \theta_1^{\max} \\
\theta_2^{\min} &\le \theta_2 \le \theta_2^{\max} \\
&\;\;\vdots \\
\theta_n^{\min} &\le \theta_n \le \theta_n^{\max}
\end{aligned}
\tag{2.42}
$$
where H(θ) is the cost function and n gives the total number of parameters, in this case n = 19. Most conventional tools used for optimization of the cost function arrive at a local minimum. However, such optimization is very time consuming if there are many variables to be optimized or if the cost function evaluations are computationally expensive. If the number of parameters increases, the number of function evaluations required to compute the gradients also increases. Moreover, the chance of the solution converging to a local
minimum also increases with the number of parameters to be optimized. For the problem at hand, with several parameters to be optimized, it was found that the gradient-based approach was not practical. For this reason, the SPSA algorithm was used to minimize the cost function. Once the approximate gradient is computed, the parameters are updated and a new value of θ is computed. It is recommended once more to evaluate the cost function at this point, to check whether the cost function at this new value of θ is less than the cost function using θ_k. The number of cost function evaluations per iteration does not depend on the number of variables, which makes this method very attractive for optimization problems with several variables. Therefore, this method can be represented as follows. The i-th element of the gradient estimate ĝ(θ̂) is given by
$$
\hat{g}_{ki}(\hat{\theta}_k)
= \frac{y(\hat{\theta}_k + c_k\Delta_k) - y(\hat{\theta}_k - c_k\Delta_k)}{2c_k\Delta_{ki}}.
\tag{2.43}
$$
The term θ̂_k ± c_kΔ_k represents a perturbation of the optimization parameters about the current estimate. Similar to a standard SA form, c_k is a small, positive weighting value. Δ_k is a vector of zero-mean random variables, which must have bounded inverse moments. One valid choice for Δ_k is a vector of Bernoulli-distributed, i.e. ±1, random perturbation terms. In summary, the fifth guideline that we have proposed, complementing Sec. 2.8, is given as follows:
At each iteration, block "bad" steps if the new estimate for θ fails a certain criterion. H_k should typically continue to be updated even if θ̂_{k+1} is blocked. The most obvious blocking applies when θ must satisfy constraints; an updated value may be blocked or modified if a constraint is violated. There are two ways, 5a) and 5b), in which one might implement blocking when constraints are not the limiting factor.
5a) Based on θ̂_k and θ̂_{k+1} directly.

5b) Based on loss measurements.

Both 5a) and 5b) may be implemented in a given application. In 5a), one simply blocks the step from θ̂_k to θ̂_{k+1} if ‖θ̂_{k+1} − θ̂_k‖ > tolerance_1, where the norm is any convenient distance
measure and tolerance_1 > 0 is some "reasonable" maximum distance to cover in one step. The rationale behind 5a) is that a well-behaved algorithm should be moving toward the solution in a smooth manner, and very large steps are indicative of potential divergence. The second potential method, 5b), is based on blocking the step if y(θ̂_{k+1}) > y(θ̂_k) + tolerance_2, where tolerance_2 ≥ 0 might be set at about one or two times the approximate standard deviation of the noise in the y(⋅) measurements. In a setting where the noise in the loss measurements tends to be large (say, much larger than the allowable difference between L(θ*) and L(θ̂_final)), it may be undesirable to use 5b) due to the difficulty in obtaining meaningful information about the relative old and new loss values. For any nonzero noise level, it may be beneficial to average several y(⋅) measurements in making the decision about whether to block the step. Having tolerance_2 > 0 as specified above when there is noise in the y(⋅) measurements builds some conservativeness into the algorithm by allowing a new step only if there is relatively strong statistical evidence of an improved loss value. Let us close this subsection with a few summary comments about the
implementation aspects above. Without the second blocking procedure 5b) in use, 2nd-SPSA requires four measurements y(⋅) per iteration, regardless of the dimension p (two for the standard G_k(⋅) estimate and two new values for the one-sided SP gradients G_k^(1)(⋅)). For 2SG, three gradient measurements G_k(⋅) are needed, again independent of p. If the second blocking procedure 5b) is used, one or more additional y(⋅) measurements are needed for both 2nd-SPSA and 2SG. The use of gradient/Hessian averaging 3) would increase the number of loss or gradient evaluations, of course.
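The two blocking rules 5a) and 5b) can be sketched as a pair of predicates; the tolerance values and names here are illustrative, not from the text:

```python
import numpy as np

def block_step_5a(theta_old, theta_new, tol1):
    """Rule 5a): block if the step length exceeds a maximum allowed distance."""
    return np.linalg.norm(theta_new - theta_old) > tol1

def block_step_5b(y_old, y_new, tol2):
    """Rule 5b): block if the (noisy) loss worsens by more than tol2, where
    tol2 is about one or two standard deviations of the measurement noise."""
    return y_new > y_old + tol2

# A well-behaved small step with an improved loss is accepted by both rules.
accept = (not block_step_5a(np.zeros(3), 0.1 * np.ones(3), tol1=1.0)
          and not block_step_5b(y_old=5.0, y_new=4.2, tol2=0.5))
```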
The standard deviation of the measurement noise (used in items 4) and 5b) in this chapter) can be estimated by collecting several y(⋅) values at θ = θ̂₀; neither 4) nor 5a) requires this estimate to be precise (so relatively few y(⋅) values are needed). In general, 5a) can be used at any time, while 5b) is more appropriate in a low- or no-noise setting. Note that 5a) helps to prevent divergence, but lacks direct insight into whether the loss function is improving, while 5b) does provide that insight, but requires additional y(⋅) measurements, the number of which might grow prohibitively in a high-noise setting. Having finished the modifications in the implementation of the SPSA algorithm according to our proposed algorithm, we can start to explain how it is applied to parameter estimation.
Firstly, we define a simple model in order to explain how the parameter estimation algorithm is developed using our proposed algorithm. This model was used before by other authors [24][25] to explain parameter estimation using the 1st-SPSA algorithm. This system is used because it is very suitable and illustrates the performance of the M2-SPSA algorithm very well. In this way, the following single-input single-output (SISO) discrete system with input u and output y [24][25] is considered:
$$
x_k = a_1 x_{k-1} + \cdots + a_n x_{k-n} + b_1 u_{k-1} + \cdots + b_m u_{k-m}.
\tag{2.44}
$$
Here, k is the discrete time, and a₁, …, a_n and b₁, …, b_m represent the constant coefficients. Also, in general, n ≥ m. It is assumed that the system output x_k is observed as the observed value y_k accompanied by some form of noise υ_k:

$$
y_k = x_k + \upsilon_k.
\tag{2.45}
$$
Here, the noise υ_k, the input u_k and the output x_k are independent of one another, and they satisfy the following:

$$
E(u_k) = \bar{u}_a, \qquad E(u_k u_i) = r^2\delta_{ki}
\tag{2.46a}
$$

$$
E(\upsilon_k) = 0, \qquad E(\upsilon_k \upsilon_i) = \sigma^2\delta_{ki}
\tag{2.46b}
$$
where δ represents the Kronecker delta, r² and σ² represent the variances, and ū_a is the average value of the input. At this point, we address the parameter estimation problem of consecutively finding the unknown parameters {a₁, …, a_n, b₁, …, b_m} based on the observed values {y_k, u_k}. The parameters are defined as follows:
$$
u_{k-1} = (u_{k-m}, \ldots, u_{k-1})^T
\tag{2.47a}
$$

$$
x_{k-1} = (x_{k-n}, \ldots, x_{k-1})^T
\tag{2.47b}
$$

$$
\upsilon_{k-1} = (\upsilon_{k-n}, \ldots, \upsilon_{k-1})^T
\tag{2.47c}
$$

$$
y_{k-1} = (y_{k-n}, \ldots, y_{k-1}, u_{k-m}, \ldots, u_{k-1})^T
\tag{2.47d}
$$

$$
\phi = (a_n, \ldots, a_1, b_m, \ldots, b_1)^T.
\tag{2.47e}
$$
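The regressor and parameter vectors of (2.47d) and (2.47e) can be assembled as follows (a sketch with illustrative data; the slicing convention is our own reading of the definitions):

```python
import numpy as np

def regressor(y_hist, u_hist, n, m):
    """Build y_{k-1} = (y_{k-n},...,y_{k-1}, u_{k-m},...,u_{k-1})^T from
    output and input histories ending just before time k."""
    return np.concatenate([y_hist[-n:], u_hist[-m:]])

def parameter_vector(a_coeffs, b_coeffs):
    """Build phi = (a_n,...,a_1, b_m,...,b_1)^T from coefficients given
    in the order a_1..a_n and b_1..b_m."""
    return np.concatenate([a_coeffs[::-1], b_coeffs[::-1]])

# n = 2, m = 1 example: the model prediction is y_{k-1}^T phi.
phi = parameter_vector(np.array([0.5, -0.2]), np.array([1.0]))  # (a_2, a_1, b_1)
yk1 = regressor(np.array([0.0, 1.0, 2.0]), np.array([0.3, 0.7]), n=2, m=1)
pred = yk1 @ phi   # = a_2*y_{k-2} + a_1*y_{k-1} + b_1*u_{k-1} = 1.5
```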
Furthermore, based on the conditions in (2.46b) for the observed noise,

$$
E(e_k) = 0
\tag{2.48}
$$

$$
E(e_k e_i) = 0, \qquad k - i > n.
\tag{2.49}
$$
Therefore, the error function J can be defined as follows; the problem of minimizing this error function and finding the system parameter vector φ is addressed in this chapter:

$$
J = E\left\{ \frac{1}{2}\left( y_k - y_{k-1}^T\hat{\phi} \right)^2 \right\}.
\tag{2.50}
$$
Here, E represents the expected value, and φ̂ represents the estimated value. This kind of error function, with the expected value, cannot be evaluated in practice. Thus, using SA with this as an iterated function is considered; the problem of finding a parameter that yields a minimum of this kind of iterated function can be solved by the SA method. The partial derivative of the error function (2.50) with respect to the estimate φ̂ is

$$
-\,y_{k-1}\left( y_k - y_{k-1}^T\hat{\phi} \right).
\tag{2.51}
$$
Here, let us look at the expected value of y_{k-1} e_k. If we consider that x_{k-1} and u_{k-1} are independent of υ_{k-1}, then

$$
E\{ y_{k-1} e_k \}
= E\left\{ \begin{pmatrix} x_{k-1} + \upsilon_{k-1} \\ u_{k-1} \end{pmatrix}
\left( \upsilon_k - a_1\upsilon_{k-1} - \cdots - a_n\upsilon_{k-n} \right) \right\}
= -\begin{bmatrix} \sigma^2 I & 0 \\ 0 & 0 \end{bmatrix}\phi
\tag{2.52}
$$

holds, and the result is not zero. Consequently, a bias occurs in the estimate using (2.51); thus,
(2.51) does not give a consistent estimate [15]. Therefore, this bias must be compensated; reference [15] offers a detailed explanation of this. Moreover, if (2.49) is considered, calculations must be performed every (n + 1) instances of sampling to guarantee the independence of {e_k}. The modification times k can be represented in terms of the actual sampling time n as k = 1, n + 2, 2n + 3, …. Then, the following recursion for the estimated parameters will be considered:
$$
\hat{\phi}_{k+n} = \hat{\phi}_{k-1} - \rho_{\frac{k-1}{n+1}}\,\Delta\hat{\phi}_{k-1},
\qquad k = 1,\ n+2,\ 2n+3,\ \ldots
\tag{2.53}
$$
Here, Δφ̂_{k-1} is the basic quantity which provides the correction for the estimation parameters. Furthermore, ρ_e represents the gain coefficient; the subscript e on the coefficient ρ_e represents a fraction. Because k takes a value every (n + 1) instances, for example 1, n + 2, 2n + 3, …, with respect to the actual sampling time n, the subscript e = (k − 1)/(n + 1) takes the values (1 − 1)/(n + 1) = 0, (n + 2 − 1)/(n + 1) = 1, …, that is, 0, 1, 2, ….
In SPSA, the perturbations are superimposed simultaneously on all the parameters. As a result, even as the number of parameters rises, the estimated parameters can be revised based on just two values of the error function, one when the perturbation is added and one when it is not. A parameter estimation method that uses this kind of SP is extremely useful in many circumstances.
2.12.2 System to be Applied
Let us consider the derivative of the squared error e² with respect to the parameter φ for this model [24][25]. For the sake of simplicity, considering a case in which all variables are scalar, we obtain

$$
\frac{\partial e^2}{\partial \phi}
= 2\left( y - y_q \right)\frac{\partial y_q}{\partial \phi}
= 2\left( y - y_q \right)\frac{\partial y_q}{\partial x}\,\frac{\partial x}{\partial \phi}.
\tag{2.54}
$$
∂y_q/∂x in this equation represents the Jacobian of the observation system. If the observation system is assumed to be unknown, then this Jacobian cannot be found. Therefore, when identifying a system that includes an unknown observation system, the amount of correction for the parameters cannot be found with methods that directly compute the slope of the error. In other words, identification algorithms based on the conventional slope approach cannot be used. In contrast, in the SP method proposed in this chapter, the amount of correction for the estimation parameters is found directly from the value e² of the error. As a result, the characteristics of the observation system are not needed. Moreover, in contrast to differential approximation methods, in our method the parameters can be corrected using only two observations, regardless of how many parameters are to be estimated.
In this research, we refer to several authors who have proposed parameter estimation algorithms using the SPSA algorithm. The following system was considered by other authors [24][25] and is very suitable for showing the performance of the proposed SPSA algorithm. The system considered is a case in which the observed values for an unknown system to be identified can only be obtained through its characteristics (see Fig. 2.6).

Fig. 2.6. Identification with an unknown observation system.
Once the model structure is proposed, the next step is to estimate the parameters of the system. This is done by assuming initial values of the parameters and then optimizing them so as to minimize the error between the measurements and the model predictions. In the next simulation, a code using standard MATLAB commands implementing the SPSA for constrained optimization was developed. Consider the following successive equations:
$$
\hat{\phi}_{k+1} = \hat{\phi}_k - \rho_{e_k}\,\Delta\phi_k
\tag{2.55}
$$

$$
\Delta\phi_k = (\Delta\phi_{k,1}, \ldots, \Delta\phi_{k,n+m})^T.
\tag{2.56}
$$
Δφ_k represents the modifying vector for the estimated parameters. Also, ρ_{e_k} represents the correction gain. The estimation parameter vector φ̂^{+i}, with the perturbation c added to only the i-th estimation parameter, is defined as follows:

$$
\hat{\phi}_k^{+i} = \hat{\phi}_k + c_k e^i, \qquad i = 1, \ldots, n+m.
\tag{2.57}
$$
Here, the vector e^i represents the fundamental vector whose i-th element alone is 1 and whose other elements are all 0. Consequently, the error function for when the perturbation is superimposed on each parameter is structured as follows:

$$
\frac{1}{2}\left( y_{k+1} - y_k^T\hat{\phi}_k^{+i} \right)^2.
\tag{2.58}
$$
Based on the error function in the equation above, the estimation parameters can be updated as shown below. In other words, an algorithm in which

$$
\Delta\phi_{k,i}
= \frac{\left( y_{k+1} - y_k^T\hat{\phi}_k^{+i} \right)^2
      - \left( y_{k+1} - y_k^T\hat{\phi}_k \right)^2}{2c_k},
\qquad i = 1, \ldots, n+m
\tag{2.59}
$$
represents each element of the correction for the parameters can be conceived. The equation above provides the estimate of the derivative of the error with respect to the i-th parameter. Finding the values in the above equation for i = 1, …, n + m means finding the squared errors in (2.58) by superimposing the perturbation on each parameter successively. As a result, the error function must be calculated (number of parameters + 1) times. As the number of dimensions of the parameters rises, the number of error calculations in this method increases.
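The cost of this per-parameter scheme can be made concrete by counting error-function evaluations per update; the sketch below (names are ours) implements the one-sided form (2.59) on a toy squared-error function and reports the count, which grows with the number of parameters, unlike the simultaneous perturbation that follows:

```python
import numpy as np

def per_parameter_correction(err2, phi, c):
    """One-sided per-parameter estimate as in (2.59): perturb each parameter
    in turn, so err2 is evaluated (number of parameters + 1) times."""
    evals = 0
    base = err2(phi); evals += 1
    delta = np.empty_like(phi)
    for i in range(len(phi)):
        e_i = np.zeros_like(phi); e_i[i] = 1.0
        delta[i] = (err2(phi + c * e_i) - base) / (2.0 * c)
        evals += 1
    return delta, evals

err2 = lambda phi: float(np.sum(phi ** 2))    # toy squared-error function
delta, evals = per_parameter_correction(err2, np.zeros(10), c=0.01)
# evals == 11 for ten parameters, versus only two evaluations per update
# for the simultaneous perturbation form, whatever the dimension.
```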
We consider a signed vector s_k consisting of the elements +1 or −1. As described in [38], whether an element takes +1 or −1 is determined randomly:

$$
s_k = (s_{k,1}, \ldots, s_{k,n+m})^T.
\tag{2.60}
$$
By making use of this, the perturbation can be superimposed on the parameter vector as shown below:

$$
\hat{\chi}_k^{+} = \hat{\chi}_k + c_k s_k.
\tag{2.61}
$$
In this way, the perturbations +c_k and −c_k are added at the same time to all the parameters. The parameter estimation using our modified SPSA algorithm is given as follows:
$$
\hat{\chi}_{k+n} = \hat{\chi}_{k-1}
- \psi_{\frac{k-1}{n+1}}
\left\{
\frac{1}{2}\,
\frac{\left( W_{k+n} - X_{k+n}^T\hat{\chi}_{k-1}^{+} \right)^2
    - \left( W_{k+n} - X_{k+n}^T\hat{\chi}_{k-1} \right)^2}
     {c_{\frac{k-1}{n+1}}}\, s_{\frac{k-1}{n+1}}
- \begin{bmatrix} \upsilon^2 I & 0 \\ 0 & 0 \end{bmatrix}\hat{\chi}_{k-1}
\right\}
\tag{2.62}
$$
where W_k is the measured output, c is the perturbation, υ² represents the variance of the noise, n and k are the sampling times, χ is the parameter vector to be estimated, and ψ is a gain coefficient whose subscript represents a fraction, because it takes a value every (n + 1) instances. Note that χ̂⁺_{k−1} is calculated as follows:
$$
\hat{\chi}_{k-1}^{+} = \hat{\chi}_{k-1} + c_{\frac{k-1}{n+1}}\, s_{\frac{k-1}{n+1}}.
\tag{2.63}
$$
In estimating the optimum parameters of a model, there are several factors which must be considered when deciding on the appropriate optimization technique. Among these factors are convergence speed, accuracy, algorithm suitability, complexity, and computational cost in terms of time and power. In the current problem it is necessary to estimate the parameters of a geometrical object in real time. This algorithm updates the estimates using the following procedure:

(S1) The output to be identified, {y_{k+n}}, is observed with respect to a particular input.

(S2) The perturbation is added to all the parameters in the estimation vector for the parameters (calculation of (2.63)).

(S3) The value of the error function ( y_{k+n} − y_{k+n}^T φ̂⁺_{k−1} )² is calculated.

(S4) The amount of correction is calculated and the estimation parameters are updated (calculation of (2.62)).

(S5) Return to (S1).
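Steps (S1)-(S5) can be sketched as a loop; the plant, gain sequence, and dimensions below are illustrative stand-ins rather than the thesis model, and a one-sided correction in the style of (2.59) applied simultaneously to all parameters is used:

```python
import numpy as np

rng = np.random.default_rng(1)
p = 3                                    # number of parameters to estimate
phi_true = np.array([0.8, -0.3, 0.5])    # unknown parameters (simulation only)
phi_hat = np.zeros(p)

for k in range(2000):
    reg = rng.normal(size=p)             # (S1) observe regressor and noisy output
    y_obs = reg @ phi_true + 0.01 * rng.normal()
    c_k, rho_k = 0.1, 0.1 / (1 + k) ** 0.602
    s_k = rng.choice([-1.0, 1.0], p)     # (S2) perturb all parameters at once
    phi_plus = phi_hat + c_k * s_k
    e2_plus = (y_obs - reg @ phi_plus) ** 2   # (S3) error with the perturbation
    e2 = (y_obs - reg @ phi_hat) ** 2         #      and without it
    # (S4) simultaneous one-sided correction over all parameters
    phi_hat = phi_hat - rho_k * (e2_plus - e2) / (2.0 * c_k) * s_k
    # (S5) next iteration
```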
At each correction time, the values {y_k, u_k} are observed, and the amount of correction is calculated based on these values. The above represents the proposal for an algorithm using a one-sided difference between the errors with and without the perturbation. However, as in the case of (2.61), the following two-sided form of the algorithm, using χ̂⁻_k in which the perturbation is subtracted from the estimation parameter, can also be considered:
$$
\Delta\phi_k = \frac{1}{2}\,
\frac{\left( y_{k+1} - y_k^T\hat{\phi}_k^{+} \right)^2
    - \left( y_{k+1} - y_k^T\hat{\phi}_k^{-} \right)^2}{2c_k}.
\tag{2.64}
$$
This algorithm to estimate the parameters is based on the M2-SPSA, which is capable of optimizing any number of parameters in a reasonable time. This is because the number of cost function evaluations needed to estimate the gradient is independent of the number of parameters to be optimized.
2.12.3 Convergence Theorem
In this section, a convergence theorem for the parameter estimation algorithm using the M2-SPSA is described. First, let us consider the following conditions.

(A11) The coefficient ρ_e satisfies the following conditions:

$$
\sum_{i=1}^{\infty} \rho_{e_i} = \infty, \qquad \sum_{i=1}^{\infty} \rho_{e_i}^2 < \infty.
$$
(A12) The perturbation c_i (> 0) is bounded.

(B11) E(s_{k,i}) = 0, E(s_{k,i} s_{l,j}) = δ_{ij} δ_{kl}.
Note that δ represents the Kronecker delta.

(C11) The input u_k and the observed noise υ_k satisfy (2.46a) and (2.46b), and they are mutually independent. Further, they have bounded fourth-order moments.

Here, condition (A11) is related to the correction gain, and is the same as the condition required for an ordinary Robbins-Monro type stochastic approximation. Condition (A12) is related to the magnitude of the perturbation. Condition (B11) is related to the signed vector. Conditions (A12) and (B11) are related to the perturbation, required because this is an SPSA algorithm. Condition (C11) is related to the nature of the noise and the input signal; it is also required for identification using a conventional R-M type stochastic approximation.
Theorem 4a (Convergence of parameter estimation with M2-SPSA). For {φ̂_k} given in (2.62), when the conditions (A11), (A12), (B11) and (C11) are satisfied, we have

$$
\lim_{k\to\infty} E\left\{ \left\| \hat{\phi}_k - \phi \right\|^2 \right\} = 0.
$$

Refer to the Appendix for details of the proof of this theorem.
2.13 Simulation

2.13.1 Simulation 1

This section compares M2-SPSA with the corresponding "standard" forms, 1st-SPSA and 2nd-SPSA. Numerical studies on other functions are given in Spall [18]. The loss function considered here is a fourth-order polynomial with p = 10, significant variable interaction, and highly skewed level surfaces (the ratio of the maximum to the minimum eigenvalue of H(θ*) is approximately 65). Gaussian noise is added to the L(⋅) or g(⋅) evaluations as appropriate.
MATLAB software was used to carry out this study. The loss function is

$$
L(\theta) = \theta^T A^T A\theta
+ 0.1\sum_{i=1}^{p} (A\theta)_i^3
+ 0.001\sum_{i=1}^{p} (A\theta)_i^4
\tag{2.65}
$$
where (·)_i represents the i-th component of the argument vector and A is such that pA is an upper triangular matrix of ones. The minimum occurs at θ* = 0 with L(θ*) = 0. The noise in the loss function measurements at any value of θ is given by [θᵀ, 1]z, where z ~ N(0, σ²I_{11×11}) is independently generated at each θ. This is a relatively simple noise structure representing the usual scenario where the noise values in y(·) depend on θ (and are therefore dependent over iterations); the z_11 term provides some degree of independence in each noise contribution, and ensures that y(·) always contains noise of variance at least σ²
(even if θ = 0). Guidelines 1), 2) and 4) from Sec. 2.8 of our proposed modified implementation of 2nd-SPSA were applied here. A fundamental philosophy in the comparisons
below is that the loss function and gradient measurements are the dominant cost in the<br />
2.12 NUMERICAL SIMULATIONS<br />
optimization process; the other calculations in the algorithms are considered relatively<br />
unimportant. This philosophy is consistent with most complex stochastic optimization problems<br />
where the loss function or gradient measurement may represent a large-scale simulation or a<br />
physical experiment. The relatively simple loss function here, <strong>of</strong> course, is merely a proxy <strong>for</strong><br />
the more complex functions encountered in practice.<br />
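As a concrete illustration, the loss (2.65) and its noisy measurement can be sketched in Python (the original study used MATLAB; the function and variable names below are ours, not from the thesis):

```python
import numpy as np

p = 10
# pA is an upper-triangular matrix of ones, so A itself is that matrix / p
A = np.triu(np.ones((p, p))) / p

def loss(theta):
    """Fourth-order polynomial loss (2.65); minimum L(0) = 0 at theta* = 0."""
    At = A @ theta
    return float(theta @ A.T @ A @ theta
                 + 0.1 * np.sum(At ** 3) + 0.001 * np.sum(At ** 4))

def noisy_loss(theta, sigma=0.001, rng=np.random.default_rng(0)):
    """Measurement y(theta) = L(theta) + [theta^T, 1] z, z ~ N(0, sigma^2 I_11)."""
    z = rng.normal(0.0, sigma, size=p + 1)
    return loss(theta) + float(np.concatenate([theta, [1.0]]) @ z)
```

Note how the trailing 1 in [θᵀ, 1] guarantees a noise variance of at least σ² even at θ = 0.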
M2-SPSA Versus 1st-SPSA and 2nd-SPSA Results: We compared M2-SPSA with 1st-SPSA because our proposed method is an extension of 1st-SPSA, so a direct comparison shows the improvements of our proposed SPSA over 1st-SPSA; it is also compared with 2nd-SPSA, which is the latest version of SPSA, so that our improvements relative to that algorithm can be verified. Spall [18] provides a thorough
numerical study based on the loss function (2.65). Three noise levels were considered: σ =<br />
0.10, 0.001, and 0. The results here are a condensed study based on the same loss function.<br />
Table 2.2 shows results for the low-noise (σ = 0.001) case: the mean terminal loss value after 50 independent experiments, where the values are normalized (divided) by L(θ̂_0). Approximate 90% confidence intervals are shown below each mean loss value. The gains a_k, c_k and c̃_k decayed at the rates 1/k^0.602, 1/k^0.101 and 1/k^0.101, respectively.
These decay rates are approximately the slowest allowed by the theory and are slower than the<br />
asymptotically optimal values discussed in Sec. 2.10 (which do not tend to work as well in<br />
finite-sample practice). Four separate algorithms are shown: basic 1st-<strong>SPSA</strong> with the<br />
coefficients <strong>of</strong> the slowly decaying gains mentioned above chosen empirically according to<br />
Spall[18], the same 1st-<strong>SPSA</strong> algorithm but with final estimate taken as the iterate average <strong>of</strong><br />
the last 200 iterations, 2nd-<strong>SPSA</strong> and M2-<strong>SPSA</strong>. Additional study details are as in Spall[18].<br />
We see that M2-<strong>SPSA</strong> provides a considerable reduction in the loss function value <strong>for</strong> the same<br />
number <strong>of</strong> measurements used in 1st-<strong>SPSA</strong> and 2nd-<strong>SPSA</strong>. Based on the numbers in the table<br />
together with supplementary studies, we find that 1st-<strong>SPSA</strong> and 2nd-<strong>SPSA</strong> need approximately<br />
five–ten times the number <strong>of</strong> function evaluations used by M2-<strong>SPSA</strong> to reach the levels <strong>of</strong><br />
accuracy shown.<br />
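For reference, the basic first-order SPSA iteration with gains of the form a_k = a/(k+1)^0.602 and c_k = c/(k+1)^0.101 can be sketched as follows (a generic illustration of 1st-SPSA under stated coefficient choices, not the exact code used in the study):

```python
import numpy as np

def spsa_minimize(y, theta0, n_iter=2000, a=0.1, c=0.01,
                  alpha=0.602, gamma=0.101, rng=None):
    """Basic 1st-SPSA: two (possibly noisy) loss measurements per
    iteration, independent of the problem dimension p."""
    rng = rng or np.random.default_rng(0)
    theta = np.asarray(theta0, dtype=float)
    for k in range(n_iter):
        ak = a / (k + 1) ** alpha          # step-size gain
        ck = c / (k + 1) ** gamma          # perturbation gain
        delta = rng.choice([-1.0, 1.0], size=theta.size)  # Bernoulli +/-1
        # simultaneous-perturbation gradient estimate
        g = (y(theta + ck * delta) - y(theta - ck * delta)) / (2 * ck * delta)
        theta = theta - ak * g
    return theta
```

On a simple noise-free quadratic such as y(θ) = θᵀθ, this iteration drives the iterate close to the minimizer within a few thousand steps.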
The behavior of iterate averaging was consistent with the discussion in the previous section, in which the 1st-SPSA iterates had not yet settled into bouncing roughly uniformly around the solution. Using the numerical studies in Spall [18], we can show that M2-SPSA outperforms 1st-SPSA and 2nd-SPSA even more strongly in the noise-free (σ = 0) case for this loss function, but that it is inferior to 1st-SPSA in the high-noise (σ = 0.10) case.
However, Spall [18] presents a study based on a larger number <strong>of</strong> loss measurements (i.e., more<br />
asymptotic) where we can show that M2-<strong>SPSA</strong> outper<strong>for</strong>ms 1st-<strong>SPSA</strong> and 2nd-<strong>SPSA</strong> in the<br />
high-noise case.<br />
Table 2.2. Normalized loss values for 1st-SPSA, 2nd-SPSA and M2-SPSA with σ = 0.001; 90% confidence interval shown in [·].

No. of loss measurements | 1st-SPSA | 1st-SPSA with iterate averaging | 2nd-SPSA | M2-SPSA
2000 | 0.0046 [0.0040, 0.0052] | 0.0047 [0.0040, 0.0054] | 0.0041 [0.0037, 0.0050] | 0.0023 [0.0021, 0.0025]
10 000 | 0.0023 [0.0021, 0.0025] | 0.0023 [0.0021, 0.0025] | 0.0019 [0.0019, 0.0022] | 8.6×10⁻⁴ [7.6×10⁻⁴, 9.6×10⁻⁴]
It was also found that, if the iterates were constrained to lie in some hypercube around θ* (as required, e.g., in genetic algorithms), then all values in Table 2.2 would be reduced, sometimes by several orders of magnitude. Such prior information can be valuable in speeding convergence.
2.13.2 Simulation 2
We will compare the performance of M2-SPSA with that of the standard first-order SPSA algorithm in Spall [18]. The loss function L(·) we consider is a fourth-order polynomial with significant interaction among the p = 10 elements in θ; this makes the loss function flat near θ* and, consequently, the optimization problem challenging. Tables 2.3 and 2.4 provide the results for this preliminary study, showing the ratio of the estimation error ‖θ̂_k − θ*‖ to the initial error ‖θ̂_0 − θ*‖ based on an average of five independent runs (the same θ̂_0 was used in all runs, and ‖·‖ represents the standard Euclidean norm). 1st-SPSA and M2-SPSA represent the
first-<strong>order</strong> and modified second-<strong>order</strong> <strong>SPSA</strong> algorithms, respectively. Table 2.3 considers the<br />
case where there is no noise in the measurements <strong>of</strong> L (⋅)<br />
, while Table 2.4 includes Gaussian<br />
measurement noise (with a one-sigma value that ranges from 3 to over 100 percent <strong>of</strong> the<br />
L(θ )<br />
value as θ varies).<br />
The left-hand column represents the total number <strong>of</strong> measurements used (so with 3000<br />
measurements, 1st-<strong>SPSA</strong> has gone through k = 1500 iterations while M2-<strong>SPSA</strong> has gone<br />
through k = 1000 iterations). The first two results columns in the tables represent runs with the<br />
same SA gains a_k, c_k, tuned numerically to approximately optimize the performance of the 1st-SPSA algorithm. The third results column is based on a (numerical) recalibration of a_k, c_k to be approximately optimal for the M2-SPSA algorithm (an identical a_k sequence was used for both M2-SPSA columns).
The results in both tables illustrate the performance of the M2-SPSA approach for a difficult-to-optimize (i.e., flat-surface) function. As expected, we see that the ratios (for both 1st-SPSA and M2-SPSA) tend to be lower in the no-noise case of Table 2.3. Further, we see that the M2-SPSA algorithm provides solutions closer to θ* both with and without optimal M2-SPSA gains. An
enlightening way to look at the numbers in the tables is to compare the number <strong>of</strong><br />
measurements needed to achieve the same level <strong>of</strong> accuracy. We see that in the no-noise case<br />
(Table 2.3), the ratio <strong>of</strong> number <strong>of</strong> measurements <strong>for</strong> M2-<strong>SPSA</strong>: 1st-<strong>SPSA</strong> ranged from 1:2 to<br />
1:50. In the noisy measurement case (Table 2.4), the ratios <strong>for</strong> M2-<strong>SPSA</strong>: 1st-<strong>SPSA</strong> ranged<br />
from 1:2 to 1:20. These ratios <strong>of</strong>fer considerable promise <strong>for</strong> practical problems, where p is<br />
even larger (say, as in the neural network-based direct adaptive control method of Spall and Cristion [25], where p can easily be of order 10² or 10³). In such cases, other second-order techniques that require a number of function measurements growing with p are likely to become infeasible.
Table 2.3. Values of ‖θ̂_k − θ*‖ / ‖θ̂_0 − θ*‖ with no measurement noise.

Number of measurements | 1st-SPSA | M2-SPSA w/1st-SPSA gains | M2-SPSA w/optimal gains
3000 | 0.265 | 0.287 | 0.122
15000 | 0.184 | 0.160 | 0.033
30000 | 0.146 | 0.128 | 0.018
Table 2.4. Values of ‖θ̂_k − θ*‖ / ‖θ̂_0 − θ*‖ with measurement noise.

Number of measurements | 1st-SPSA | M2-SPSA w/1st-SPSA gains | M2-SPSA w/optimal gains
3000 | 0.273 | 0.292 | 0.243
15000 | 0.184 | 0.163 | 0.103
30000 | 0.146 | 0.141 | 0.097
There are several important practical concerns in implementing the M2-<strong>SPSA</strong> algorithm. One,<br />
<strong>of</strong> course, involves the choice <strong>of</strong> SA gains. As in all SA algorithms, this must be done with<br />
some care to ensure good per<strong>for</strong>mance <strong>of</strong> the algorithm. Some theoretical guidance is provided<br />
in Fabian [19], but we have found that empirical experimentation is more effective and easier.<br />
Another practical aspect involves the use of the Hessian estimate: in the studies here we found it more effective not to use the Hessian estimate for the first few (100) iterations. This allows the inverse-Hessian estimate to improve while it is not really needed, since L(·) is dropping quickly because of the characteristic steep initial decline of the standard SPSA algorithm.
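This warm-up heuristic can be sketched as follows (an illustration in Python; the function name and the `warmup` parameter are ours, and the averaged Hessian estimate `H_est` is assumed to be maintained elsewhere as in the algorithm described earlier):

```python
import numpy as np

def m2spsa_step(theta, grad, H_est, k, ak, warmup=100):
    """Sketch of the warm-up heuristic: take plain first-order steps
    for the first `warmup` iterations while the averaged Hessian
    estimate H_est is still poor, then switch to second-order
    (Newton-like) steps using H_est."""
    if k < warmup:
        return theta - ak * grad
    return theta - ak * np.linalg.solve(H_est, grad)
```

During the warm-up phase the update is exactly the first-order SPSA step, so nothing is lost while the loss is still in its steep initial decline.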
2.13.3 Simulation 3
First, let us consider the following:

x_k + a_1 x_{k−1} + a_2 x_{k−2} = b_1 u_{k−1} + b_2 u_{k−2}        (2.66)

where a_1 = −1.2, a_2 = 0.4, b_1 = 1.0 and b_2 = 0.7.
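The system (2.66) and its data generation can be sketched as follows (a Python illustration with names of our choosing; the thesis does not specify its simulation code):

```python
import numpy as np

# True parameters of the second-order system (2.66)
a1, a2, b1, b2 = -1.2, 0.4, 1.0, 0.7

def simulate(u):
    """x_k = -a1*x_{k-1} - a2*x_{k-2} + b1*u_{k-1} + b2*u_{k-2}."""
    x = np.zeros(len(u))
    for k in range(2, len(u)):
        x[k] = -a1 * x[k-1] - a2 * x[k-2] + b1 * u[k-1] + b2 * u[k-2]
    return x

rng = np.random.default_rng(0)
u = rng.normal(0.0, np.sqrt(0.6), 2000)  # input: zero-mean white noise, variance 0.6
v = rng.normal(0.0, np.sqrt(0.1), 2000)  # observation noise: variance 0.1
y = simulate(u) + v                      # observed (noisy) output
```

With noise-free data, ordinary least squares recovers the parameters exactly; with the observation noise v added, a plain least-squares fit is biased, which is precisely what the bias compensation discussed below addresses.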
Figure 2.7 shows the parameter estimation results using the algorithm in (2.62). Fig. 2.8 shows<br />
the results <strong>for</strong> when bias compensation was not per<strong>for</strong>med. Here, the input is white noise<br />
generated using a normal distribution with a variance <strong>of</strong> 0.6 and an average <strong>of</strong> 0.<br />
The observed noise is a separate white noise sequence generated using a normal distribution with a variance of 0.1 and a mean of 0. Also, the initial values for the estimation parameters are all 0, the magnitude c of the perturbation used in the algorithm is 0.0015, and the gain coefficient is ρ_i = 1/(i + 1)^0.9.
Fig. 2.7. Identification results (with bias compensation): â_1 (solid line), b̂_2 (dashed line), â_2 (dashed-dot line), b̂_1 (dotted line).
Fig. 2.8. Identification results (without bias compensation): â_1 (solid line), b̂_2 (dashed line), â_2 (dashed-dot line), b̂_1 (dotted line).
These settings satisfy conditions (A11) and (A12) for the convergence theorem. In the figures above, the horizontal axis represents the number of parameter-update iterations. In Fig. 2.7, we can confirm that the estimated values converge to the true values. On the other hand, when bias compensation was not performed, it is clear from Fig. 2.8 that an estimation error occurs, as can be seen from (2.52); this means that the estimates are not consistent for this system. Now, our
proposed method is compared to other methods such as the R-M type SA [9] and the 2nd-<strong>SPSA</strong><br />
algorithm [18]. For all these methods, the variance <strong>of</strong> 0.1 <strong>for</strong> the observed noise was known,<br />
and the compensation algorithm was used. The results <strong>of</strong> estimations with almost 100,000<br />
iterations <strong>of</strong> parameter correction are shown in Table 2.5. The average values <strong>for</strong> 50 trials are<br />
given <strong>for</strong> the estimation results.<br />
Table 2.5. Comparison of estimators.

Algorithms | â_1 | b̂_2 | â_2 | b̂_1
RM | -1.1770170 | 0.635410 | 0.361731 | 0.964721
M2-SPSA | -1.20511120 | 0.67401 | 0.401234 | 1.006991
2nd-SPSA | -1.1916300 | 0.664451 | 0.393394 | 0.990554
True value | -1.2 | 0.7 | 0.4 | 1.0

M2-SPSA: Estimators using the proposed method.
2nd-SPSA: Second-order SPSA [18].
RM: Estimators using R-M SA [9].
In terms of estimation precision, 2nd-SPSA and M2-SPSA are better than the R-M SA method (see Table 2.5). In Fig. 2.7, we can see the corrections required in order to achieve suitable results. The values from the proposed SPSA algorithm are closest to the true values. Also, note that in the other method (the RM algorithm), an exact value of the slope of the evaluation function is used. In contrast, in the proposed method the slope is estimated, and the estimation error for the slope affects the convergence speed. However, as was explained before, when the system output can only be obtained via unknown characteristics, conventional estimation methods cannot be used. This is only a small study intended to show how the proposed SPSA algorithm is applied to parameter estimation.
In conclusion, in this chapter we have proposed a parameter estimation algorithm using M2-SPSA. The identification method using the SP seems particularly useful when the number of parameters to be identified is very large, or when the observed values for what is to be identified can only be obtained via an unknown observation system [38]-[41]. Furthermore, an improved time-differential SP method that requires only one error observation per time increment has been proposed as an improvement. The method can also be used for identification problems. In this chapter, we have then made some empirical and theoretical comparisons between 1st-SPSA, 2nd-SPSA and other SA algorithms. It is found that the magnitude of
errors introduced by matrix inversion in 2nd-<strong>SPSA</strong> is greater <strong>for</strong> an ill-conditioned<br />
<strong>Hessian</strong> than a well-conditioned <strong>Hessian</strong>. On the other hand, the errors in 1st-<strong>SPSA</strong> are less<br />
sensitive to the conditioning of the loss-function Hessian. To eliminate the errors introduced by the inversion of the estimated Hessian H_k^{-1}, a modification (2.13) to 2nd-SPSA is suggested that replaces H_k^{-1} with the scalar inverse of the geometric mean of all the eigenvalues of H_k. At
finite iterations, it is found that the introduced M2-<strong>SPSA</strong> based on (2.13) and (2.14)<br />
outper<strong>for</strong>ms 1st-<strong>SPSA</strong> and 2nd-<strong>SPSA</strong> in numerical experiments that represent a wide range <strong>of</strong><br />
matrix conditioning. The asymptotic efficiency analysis shows that the ratio <strong>of</strong> the mean square<br />
errors <strong>for</strong> the proposed <strong>SPSA</strong> algorithm to 2nd-<strong>SPSA</strong> is always less than unity except <strong>for</strong> a<br />
perfectly conditioned <strong>Hessian</strong> or <strong>for</strong> an asymptotically optimal setting <strong>of</strong> the gain sequence.<br />
Therefore, the general difference between the previous versions of the SPSA algorithm and our version presented above is that our proposed SPSA algorithm offers considerable potential for accelerating the convergence of SA algorithms while requiring only loss function measurements (no gradient or higher-derivative measurements are needed). Since it requires only
three measurements per iteration to estimate both the gradient and the Hessian, independent of the problem dimension p, it does not impose a large requirement for data collection. Also, the computational complexity and cost are reduced, as the previous simulations showed. The main features of our proposed SPSA are the following:
1) M2-SPSA is useful for complex problems where a great number of parameters need to be estimated; its description is given in Secs. 2.4 and 2.5.
2) The computation time is reduced by evaluating only a diagonal estimate of the Hessian matrix (see Sec. 2.3).
3) The eigenvalues of the Hessian matrix are computed very efficiently (see Sec. 2.3).
4) M2-SPSA guarantees that the non-positive-definite part is eliminated using the FIM; the Hessian matrix inverse is improved (see Sec. 2.6).
5) The modification in the SPSA implementation improves the convergence of the algorithm when applied to parameter estimation (see Secs. 2.8 - 2.11).
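The scalar-inverse modification summarized in the conclusions above can be sketched as follows (a Python illustration under the assumption that the eigenvalues of the Hessian estimate have already been made positive, cf. Sec. 2.6; the function name is ours):

```python
import numpy as np

def geo_mean_inverse_step(grad, H_k):
    """Scale the gradient by the inverse of the geometric mean of
    the eigenvalues of H_k, instead of computing H_k^{-1} itself.
    For a symmetric positive-definite H_k of size p, the geometric
    mean of the eigenvalues equals det(H_k)^(1/p)."""
    eig = np.abs(np.linalg.eigvalsh((H_k + H_k.T) / 2.0))  # symmetrize the estimate
    geo_mean = np.exp(np.mean(np.log(eig)))
    return grad / geo_mean
```

This keeps the overall scaling of a second-order step while avoiding the error amplification that full matrix inversion causes for an ill-conditioned Hessian.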
Chapter 3<br />
Vibration Suppression Control <strong>of</strong> a Flexible<br />
Arm using Non-linear Observer with <strong>SPSA</strong><br />
In this first application, the proposed SPSA algorithm is applied to parameter estimation in two methods for vibration control of the model proposed here: a non-linear observer and model-reference sliding mode control. In both cases, the parameter estimation by M2-SPSA is compared with other algorithms in order to show its efficiency relative to other good parameter estimators. The computational cost and parameter accuracy are compared here. Finally, a novel model-reference sliding mode control applied to the non-linear observer is proposed. The main objective of this study is the vibration control of a one-link flexible arm system. A variable structure system (VSS) non-linear observer is proposed in order to reduce the oscillation in controlling the angle of the flexible arm. The non-linear observer parameters are optimized using a modified version of the simultaneous perturbation stochastic approximation (SPSA) algorithm. The SPSA algorithm is especially useful when the number of parameters to be adjusted is large, and it makes it possible to estimate them simultaneously. For the vibration and position control, a model-reference sliding-mode control (MR-SMC) is proposed. The MR-SMC parameters are also optimized using a modified version of the SPSA algorithm. The simulations show that the vibration control of a one-link flexible arm system can be achieved more efficiently using our method. Therefore, by applying the MR-SMC method to the non-linear observer, we can improve the performance of this kind of model, and with our proposed SPSA algorithm, we can determine the control parameters easily and efficiently.
3.1 Introduction
Traditionally, robotic manipulators have been designed and built in a manner that maximizes<br />
stiffness in <strong>order</strong> to minimize vibration and allow <strong>for</strong> good positional accuracy with relatively<br />
simple controllers [41]. High stiffness is achieved by using heavy links, which limits the rapid motion of the manipulator, increases the size of the actuators and boosts the energy
CHAPTER 3. APPLICATION USING M2-SPSA ALGORITHM I
consumption. Conversely, a lightweight manipulator is less expensive to manufacture and<br />
operate. Weight reduction, however, incurs a penalty in that the manipulator becomes more<br />
flexible and more difficult to control accurately [41]. Since the manipulator is a<br />
distributed-parameter system, the control difficulty is caused by the fact that a large number <strong>of</strong><br />
flexible modes are required to accurately model its behavior. We overcome these problems in this chapter. Since a simple model can be used for a flexible manipulator that carries a large tip load [41]-[43], this research has centered on such a simple model, particularly the single flexible link moving in a horizontal plane. This kind of model is also very convenient because it shows more clearly the advantages of our method and the control strategies described in this chapter. We have proposed a method by which the vibrations of the single flexible link system can be suppressed satisfactorily; this method helps to achieve very suitable control of the angular position of the system. The mathematical model of this system is
described in Sec. 3.2. In the single flexible link, one end <strong>of</strong> this arm is attached to a motor and<br />
the other end carries a payload. In this chapter, controlling the angular position of the arm while suppressing the oscillation is taken as the control objective. Since feedback of only the motor angle is not sufficient to suppress the oscillation, we have considered a VSS non-linear observer combined with an MR-SMC in order to reduce the oscillation more efficiently. The
variable structure systems theory has been successfully used in the development <strong>of</strong> robust<br />
observers <strong>for</strong> dynamical systems with bounded non-linearities and/or uncertainties. These<br />
observers do not require exact knowledge <strong>of</strong> the plant parameters and/or non-linearities. Their<br />
design is solely based on knowing the upper bounds <strong>of</strong> the system uncertainties and/or<br />
non-linearities. Furthermore, in some studies, the estimated state variables were preferred over<br />
the measured ones in <strong>order</strong> to enhance the per<strong>for</strong>mance <strong>of</strong> the controller [47] or to reduce the<br />
effect of observation spillover in the active control of flexible structures [47]. In other words, VSS design is fundamentally based on stability equations and minimization of a cost function. Therefore, the performance of the non-linear observer is assessed herein by examining its capability of predicting the rigid and flexible motions of a compliant beam that is connected to a revolute joint. Regarding MR-SMC, its advantage is robustness against parameter uncertainties, external disturbances and so on; MR-SMC is robust under the matching condition. In general, a suspension system is easily subjected to several parameter variations, such as variation of the sprung mass. The robustness of the SMC can be improved by shortening the time required to attain the sliding mode, or may be guaranteed during the whole interval of control action by eliminating the reaching phase. One easy way to minimize the reaching phase is to employ a large control input.
3.2 DYNAMIC MODELING OF A SINGLE LINK ROBOT ARM<br />
This MR-SMC is <strong>for</strong>mulated <strong>for</strong> the position control <strong>of</strong> a single flexible link subjected to<br />
parameter variations. Also, a sliding surface which guarantees stable sliding mode motion during<br />
the sliding phase is synthesized in an optimal manner; this will be analyzed in Sec. 3.3 and 3.4.<br />
The MR-SMC and the observer have been designed based on a simplified model <strong>of</strong> the arm,<br />
which only accounts for the first elastic mode of the beam. Moreover, there are many parameters to be determined, so it is difficult to obtain them. Hence, in order to overcome this problem, a modified version of 2nd-SPSA has been proposed to obtain the observer/controller gains more efficiently. In the traditional SPSA, since all parameters are perturbed simultaneously, it is possible to update the parameters with only two measurements of an evaluation function, regardless of the parameter dimension. This is very useful, but this SPSA can in some cases incur a high computational cost [3]. Therefore, M2-SPSA is applied to a parameter estimation algorithm in order to obtain the observer and controller parameters more efficiently and also reduce the cost. We apply a parameter estimation algorithm using our proposed SPSA described in Chap. 2. The performance of this algorithm will be examined in terms of parameter selection, computational cost, and convergence performance for the current problem. Finally, in order to illustrate the proposed method using the non-linear observer, MR-SMC and SPSA, the control system only uses measurable data such as motor angle, tip velocity, tip position, and control torque, as shown in Sec. 3.5.
3.2 Dynamic Modeling of a Single Link Robot Arm
3.2.1 Dynamic Model
The single flexible link is considered as a continuous cantilever beam <strong>of</strong> length L carrying a<br />
mass M and a torque T applied by a motor that rotates the beam in a horizontal plane. The mass<br />
and elastic properties are assumed to be distributed uni<strong>for</strong>mly along single flexible link [44].<br />
The physical configuration of this system is shown in Fig. 3.1. The system consists of a beam of length L with mass m, a torque T (that rotates the elastic arm) and an additional mass M (the payload at the end of the arm) [44]. The deflection y(x,t) is described by an infinite series of separable modes:
y(x, t) = Σ_{i=1}^n φ_i(x) q_i(t)        (3.1)
81
CHAPTER 3. APPLICATION USING M2-<strong>SPSA</strong> ALGORTIHM I<br />
which is assumed for the elastic displacement of the single flexible link, where φ_i(x) is a characteristic function and q_i(t) is a mode function. The kinetic and potential energies of this system can be determined as follows:
T_e = (1/2)Jθ̇² + (m/2L)[ (L³θ̇²)/3 + Σ_{i=1}^n A_i q̇_i² + 2θ̇ Σ_{i=1}^n B_i q̇_i ] + (M/2)[ L²θ̇² + Σ_{i=1}^n C_i² q̇_i² + 2Lθ̇ Σ_{i=1}^n C_i q̇_i ]        (3.2)

V = (EI/2) Σ_{i=1}^n D_i q_i²        (3.3)
V<br />
=<br />
EI<br />
2<br />
n<br />
∑<br />
i = 1<br />
D i q i<br />
2<br />
(3.3)<br />
where θ is the angle <strong>of</strong> the joint, E is Young's modulus, and I is the area moment <strong>of</strong> inertia<br />
with the following variables:
$$A_i = \int_0^L \phi_i^2(x)\,dx, \qquad B_i = \int_0^L x\phi_i(x)\,dx, \qquad C_i = \phi_i(L), \qquad D_i = \int_0^L \left[d^2\phi_i(x)/dx^2\right]^2 dx.$$
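As a numerical illustration of these definitions, the constants can be evaluated by quadrature once a mode shape is fixed. The trial shape used below, $\phi(x) = (x/L)^2$, is a hypothetical stand-in that merely satisfies the clamped-end conditions $\phi(0) = \phi'(0) = 0$; the true $\phi_i(x)$ follows from (3.10):

```python
# Numerical evaluation of the constants A_i, B_i, C_i, D_i. The trial mode
# shape phi(x) = (x/L)**2 is a hypothetical stand-in that merely satisfies the
# clamped-end conditions phi(0) = phi'(0) = 0; the true phi_i(x) follows from
# Eq. (3.10).

def trapezoid(f, a, b, n=10000):
    """Composite trapezoidal rule for f on [a, b]."""
    h = (b - a) / n
    return h * (0.5 * (f(a) + f(b)) + sum(f(a + k * h) for k in range(1, n)))

L = 0.4                                # beam length [m], value used in Sec. 3.5
phi = lambda x: (x / L) ** 2           # trial mode shape
d2phi = lambda x: 2.0 / L ** 2         # its (constant) second derivative

A = trapezoid(lambda x: phi(x) ** 2, 0.0, L)    # A_i = int phi^2 dx     -> L/5
B = trapezoid(lambda x: x * phi(x), 0.0, L)     # B_i = int x*phi dx     -> L^2/4
C = phi(L)                                      # C_i = phi(L)           -> 1
D = trapezoid(lambda x: d2phi(x) ** 2, 0.0, L)  # D_i = int (phi'')^2 dx -> 4/L^3

print(A, B, C, D)
```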
The equation <strong>of</strong> motion <strong>of</strong> the cantilever beam <strong>for</strong> free vibration is based on the Euler-Bernoulli<br />
equation [45] and is written as follows:<br />
$$EIL\frac{\partial^4 y}{\partial x^4} + m\frac{\partial^2 y}{\partial t^2} = 0. \quad (3.4)$$
Fig. 3.1. One-link flexible arm.<br />
3.2 DYNAMIC MODELING OF A SINGLE LINK ROBOT ARM<br />
The beam has a uniform cross-section and its boundary conditions are defined as follows [45]:
The deflection is zero at x = 0:
$$y(0,t) = 0. \quad (3.5)$$
The slope of the deflection is zero at x = 0:
$$\frac{dy}{dx}(0,t) = 0. \quad (3.6)$$
The bending moment is zero at x = L:
$$\frac{d^2y}{dx^2}(L,t) = 0. \quad (3.7)$$
The shear force balances at the tip:
$$EI\frac{d^3y}{dx^3}(L,t) = m\frac{d^2y}{dt^2}(L,t). \quad (3.8)$$
From (3.4) and (3.5) - (3.8), we have<br />
$$y_i(x,t) = \phi_i(x)\cos\omega_i t. \quad (3.9)$$
Then $\phi_i(x)$ can be found as:
$$\phi_i(x) = c_{1i}\cos\beta_i x + c_{2i}\cosh\beta_i x + c_{3i}\sin\beta_i x + c_{4i}\sinh\beta_i x \quad (3.10)$$
$$\omega_i^2 = \frac{EI}{\rho a}\beta_i^4. \quad (3.11)$$
Substituting $\phi_i(x)$ from (3.10) into (3.9) and using (3.5)-(3.8), $\beta_i$ and $c_{1i} \sim c_{4i}$ are determined.
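The roots $\beta_i$ can be located numerically. The sketch below uses the classical clamped-free characteristic equation $1 + \cos(\beta L)\cosh(\beta L) = 0$, which is the simplified case of negligible payload (M → 0); with the tip mass M the boundary conditions (3.5)-(3.8) yield a modified equation, so this is illustrative only:

```python
import math

# Numerical roots of the simplified frequency equation. For negligible payload
# (M -> 0) the boundary conditions (3.5)-(3.8) reduce to the classical
# clamped-free characteristic equation 1 + cos(bL)*cosh(bL) = 0; with the tip
# mass M the equation gains extra terms, so this is an illustrative sketch only.

def f(bL):
    return 1.0 + math.cos(bL) * math.cosh(bL)

def bisect(lo, hi, tol=1e-10):
    """Bisection on [lo, hi], assuming exactly one sign change of f."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if f(lo) * f(mid) <= 0.0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)

# the i-th root is known to lie near (2i - 1) * pi / 2
roots = [bisect(1.0, 3.0), bisect(4.0, 5.0), bisect(7.0, 8.0)]
print(roots)    # approximately [1.8751, 4.6941, 7.8548]
```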
3.2.2 -Equation <strong>of</strong> Motion and State Equations<br />
The state equations of the system are derived to describe the dynamics of the single flexible link under certain assumptions [45]. Therefore, assuming that only the first mode exists, from (3.2) and (3.3), and using Lagrange's equations as in [45][46], we obtain
$$\frac{d}{dt}\left(\frac{\partial T_e}{\partial\dot{\theta}}\right) - \frac{\partial T_e}{\partial\theta} + \frac{\partial V}{\partial\theta} = T \quad (3.12)$$
$$\frac{d}{dt}\left(\frac{\partial T_e}{\partial\dot{q}_1}\right) - \frac{\partial T_e}{\partial q_1} + \frac{\partial V}{\partial q_1} = 0 \quad (3.13)$$
then
$$\begin{bmatrix}\alpha_{00} & \alpha_{01}\\ \alpha_{01} & \alpha_{11}\end{bmatrix}\begin{bmatrix}\ddot{\theta}\\ \ddot{q}_1\end{bmatrix} = \begin{bmatrix}T - 2\alpha_{11}q_1\dot{q}_1\dot{\theta}\\ -H_1q_1 + \alpha_{11}q_1\dot{\theta}^2\end{bmatrix} \quad (3.14)$$
$$y = \theta \quad (3.15)$$
where $\alpha_{00} = J + ML^2 + \alpha_{11}q_1^2$, T is the motor's shaft torque, and J is the moment of inertia about the joint axis, with
$$\alpha_{01} = \omega_1 + ML\phi_{1e}, \qquad \alpha_{11} = v_1 + M\phi_{1e}^2, \qquad v_1 = \rho a\int_0^L \phi_1^2\,dx,$$
where ρ is the density, $H_1 = EI\int_0^L (d^2\phi_1/dx^2)^2\,dx$, $\phi_{1e} = \phi_1(L)$, $\omega_1 = \rho a\int_0^L x\phi_1\,dx$, a is the area of the cross-section, and y is the observation of θ. In order to obtain the variables that we will use to evaluate our method, the state variables are defined as
$$x_1 = \theta, \quad x_2 = \dot{\theta}, \quad x_3 = q_1, \quad x_4 = \dot{q}_1.$$
3.3 DESIGN OF NON-LINEAR OBSERVER<br />
Then
$$\begin{bmatrix}\dot{x}_1\\ \dot{x}_2\\ \dot{x}_3\\ \dot{x}_4\end{bmatrix} = \begin{bmatrix}x_2\\ f_1(x_2,x_3,x_4)\\ x_4\\ f_2(x_2,x_3,x_4)\end{bmatrix} + \begin{bmatrix}0\\ b_1\\ 0\\ b_2\end{bmatrix}T \quad (3.16)$$
where
$$f_1(x_2,x_3,x_4) = \frac{1}{\alpha_{00}\alpha_{11} - \alpha_{01}^2}\left[-2\alpha_{11}^2 x_2x_3x_4 - \alpha_{01}\left(-H_1x_3 + \alpha_{11}x_3x_2^2\right)\right]$$
$$f_2(x_2,x_3,x_4) = \frac{1}{\alpha_{00}\alpha_{11} - \alpha_{01}^2}\left[2\alpha_{01}\alpha_{11} x_2x_3x_4 + \alpha_{00}\left(-H_1x_3 + \alpha_{11}x_3x_2^2\right)\right]$$
$$b_1 = \frac{\alpha_{11}}{\alpha_{00}\alpha_{11} - \alpha_{01}^2}, \qquad b_2 = \frac{-\alpha_{01}}{\alpha_{00}\alpha_{11} - \alpha_{01}^2}.$$
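The state equation (3.16) translates directly into code. In the sketch below, the model constants $\alpha_{00}$, $\alpha_{01}$, $\alpha_{11}$ and $H_1$ are placeholder values for illustration only; in the thesis they follow from the physical constants given in Sec. 3.5:

```python
# Direct implementation of the state equation (3.16). The numerical values of
# alpha_00, alpha_01, alpha_11 and H_1 are placeholders for illustration only;
# in the thesis they follow from the physical constants given in Sec. 3.5.

A00, A01, A11, H1 = 1.0e-2, 2.0e-3, 5.0e-3, 0.5   # hypothetical model constants

def state_derivative(x, T):
    """Return [x1_dot, ..., x4_dot] for x = [x1, x2, x3, x4] and torque T."""
    x1, x2, x3, x4 = x
    det = A00 * A11 - A01 ** 2              # common denominator a00*a11 - a01^2
    flex = -H1 * x3 + A11 * x3 * x2 ** 2    # shared flexible-mode term
    f1 = (-2.0 * A11 ** 2 * x2 * x3 * x4 - A01 * flex) / det
    f2 = (2.0 * A01 * A11 * x2 * x3 * x4 + A00 * flex) / det
    b1, b2 = A11 / det, -A01 / det
    return [x2, f1 + b1 * T, x4, f2 + b2 * T]

print(state_derivative([0.1, 0.0, 0.0, 0.0], 0.0))   # rest state: all zeros
```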
3.3 -Design <strong>of</strong> Non-linear Observer<br />
In this section, since only the motor angle $x_1$ is the measurable state variable, the remaining states $x_2$, $x_3$ and $x_4$ are predicted using an intelligent state observer design [47]. For this, (3.14)-(3.15) are written as follows:
State equations:
$$\dot{x} = f(x) + g(x)T \quad (3.17)$$
Output equations:
$$y = c^T x, \qquad c^T = [1\ 0\ 0\ 0]. \quad (3.18)$$
For this non-linear system, we consider a robust VSS observer, which predicts system states.<br />
This observer is defined as follows:<br />
$$\dot{\hat{x}} = f(\hat{x}) + g(\hat{x})T + M(\bar{y}) + K(\hat{y} - y) \quad (3.19)$$
$$\hat{y} = c^T\hat{x} \quad (3.20)$$
$$M(\bar{y}) = -g(x)\,\varsigma\,\frac{\bar{y}}{|\bar{y}| + \gamma} \quad (3.21)$$
$$\bar{y} = \hat{y} - y = c^T(\hat{x} - x) \quad (3.22)$$
where $\hat{x}$ represents the predicted value of the system state as in [47], K is the observer gain matrix, $M(\bar{y})$ is the observer non-linearity term, ς represents the gain, and γ > 0 is an averaging constant for removing chattering. Now, defining the estimation error as
$$e = \hat{x} - x \quad (3.23)$$
we have
$$\dot{e} = f(\hat{x}) - f(x) + \left[g(\hat{x}) - g(x)\right]T + Kc^T(\hat{x} - x) + M(\bar{y}). \quad (3.24)$$
For evaluating the observer gain K with $x_d$ as the desired point, using the Taylor series expansion and its first-order approximation, the error system is given as follows:
$$\dot{e} = \left[f'(x_d) + g'(x_d)T + Kc^T\right]e + M(\bar{y}) = A_0 e + M(\bar{y}). \quad (3.25)$$
where
$$A_0 = A + GT + Kc^T \quad (3.26)$$
$$A = \frac{\partial f_i}{\partial x_j} \quad (3.27)$$
$$G = \frac{\partial g_i}{\partial x_j} \qquad (i, j = 1, 2, 3, 4). \quad (3.28)$$
3.4 MODEL REFERENCE – SLIDING MODE CONTROLLER
Choosing a Lyapunov function of e as
$$V = \frac{1}{2}e^2 \quad (3.29)$$
and differentiating V with respect to time yields
$$\dot{V} = e\dot{e} = e\left(A_0 e - g(x)\,\varsigma\,\frac{c^T e}{|c^T e| + \gamma}\right). \quad (3.30)$$
If K is designed such that the eigenvalues of the error system (3.26) are all negative, then the selection of $A_0 - g(x)\varsigma < 0$ yields $\dot{V} < 0$, and Lyapunov's stability theory gives $e(t) \to 0$ as $t \to \infty$.
In the simulation, we chose $x_d = [0.1\ 0\ 0\ 0]$ and computed A and G with the observer parameters determined by the M2-SPSA algorithm (see Chap. 2). Therefore, to ensure the stability of (3.25), the following evaluation function is minimized:
$$J = \sum\left(y - \hat{y}\right)^2. \quad (3.31)$$
In the determination of the unknown parameters of the non-linear observer, $k_1$, $k_2$, $k_3$, $k_4$, ς and γ, each parameter is calculated by (2.62). Therefore, the parameters are determined as k = [-227 -25015 13.69 -11101]$^T$, ς = 0.010 and γ = 0.002.
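A single Euler-integration step of the observer (3.19)-(3.21) can be sketched as follows, using the gain vector and the constants ς, γ determined above. The drift and input fields f and g passed in the demo call are hypothetical stand-ins (not the flexible-link model), so the snippet stays self-contained:

```python
# One Euler step of the VSS observer (3.19)-(3.21). K, ZETA and GAMMA are the
# values determined above by M2-SPSA; the dynamics f and g used in the demo
# call are hypothetical stand-ins, not the flexible-link model.

K = [-227.0, -25015.0, 13.69, -11101.0]   # observer gains k1..k4
ZETA, GAMMA = 0.010, 0.002                # smoothing constants

def observer_step(x_hat, y, T, f, g, dt):
    """Advance the estimate x_hat by dt, given the measurement y = x1."""
    y_err = x_hat[0] - y                   # (3.22), since c = [1 0 0 0]
    fx, gx = f(x_hat), g(x_hat)
    # observer non-linearity (3.21): -g * zeta * y_err / (|y_err| + gamma)
    m = [-gi * ZETA * y_err / (abs(y_err) + GAMMA) for gi in gx]
    return [x_hat[i] + dt * (fx[i] + gx[i] * T + m[i] + K[i] * y_err)
            for i in range(4)]

# demo with the hypothetical fields f(x) = 0 and g(x) = [0, 1, 0, 1]
x_hat = observer_step([0.0, 0.0, 0.0, 0.0], 0.1, 0.0,
                      lambda x: [0.0] * 4, lambda x: [0.0, 1.0, 0.0, 1.0], 1e-4)
print(x_hat)
```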
3.4 -Model Reference - Sliding Mode Controller
The MR-SMC is <strong>of</strong>ten used in robust control <strong>of</strong> non-linear systems and also <strong>for</strong> stabilizes single<br />
inputs systems. The main purpose <strong>of</strong> the MR-SMC is to make the states converge to the sliding<br />
mode surface. This normally depends on the sliding mode controller design. For MR-SMC, the<br />
Lyapunov function is applied to keep the non-linear system under control. In this case,<br />
MR-SMC is <strong>for</strong>mulated <strong>for</strong> the tip position control <strong>of</strong> a single flexible link subjected to parameter<br />
variations. The desired response is based on a second <strong>order</strong> reference model given as [47]<br />
$$\begin{bmatrix}\dot{x}_m\\ \ddot{x}_m\end{bmatrix} = \begin{bmatrix}0 & 1\\ -\omega_n^2 & -2\omega_n\end{bmatrix}\begin{bmatrix}x_m\\ \dot{x}_m\end{bmatrix} + \begin{bmatrix}0\\ \omega_n^2\end{bmatrix}U_m \quad (3.32)$$
where $\omega_n$ is the natural angular frequency and $U_m$ is the model input. For the sliding mode controller, the Lyapunov stability method is applied to keep the non-linear system under control. The sliding mode approach is a method which transforms a higher-order system into a first-order system. In that way, a simple control algorithm can be applied, which is very straightforward and robust.
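The reference model (3.32) is easy to simulate by forward Euler integration; the values of $\omega_n$, $U_m$ and the step size below are illustrative only, not taken from the thesis:

```python
# Forward-Euler simulation of the reference model (3.32):
# x_m_ddot = -wn^2 * x_m - 2*wn * x_m_dot + wn^2 * U_m (critically damped).
# The values of WN, UM and DT are illustrative only.

WN, UM, DT = 10.0, 0.1, 1e-4     # natural frequency, model input, step size

xm, vm = 0.0, 0.0                # x_m and its derivative
for _ in range(200000):          # 20 s of simulated time
    am = -WN ** 2 * xm - 2.0 * WN * vm + WN ** 2 * UM
    xm, vm = xm + DT * vm, vm + DT * am

print(xm)                        # settles at the commanded value U_m
```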
The surface is called a switching surface. When the plant state trajectory is “above” the surface,<br />
a feedback path has one gain and a different gain if the trajectory drops “below” the surface.<br />
This surface defines the rule <strong>for</strong> proper switching. This surface is also called a sliding surface<br />
(sliding manifold).<br />
Ideally, once intercepted, the switched control maintains the plant’s state trajectory on the<br />
surface <strong>for</strong> all subsequent time and the plant’s state trajectory slides along this surface (see Fig.<br />
3.2). Then, using the sliding surface mentioned above, sliding mode control becomes an important robust control approach. For the class of systems to which it applies, sliding mode
controller design provides a systematic approach to the problem <strong>of</strong> maintaining stability and<br />
consistent per<strong>for</strong>mance in the face <strong>of</strong> modeling imprecision. On the other hand, by allowing the<br />
trade<strong>of</strong>fs between modeling and per<strong>for</strong>mance to be quantified in a simple fashion, it can<br />
illuminate the whole design process.<br />
Fig. 3.2. Sliding mode surface.<br />
The most important task is to design a switched control that will drive the plant state to the switching surface and maintain it on the surface upon interception. A Lyapunov approach is used to characterize this task, as will be explained later. Now, we assume the sliding mode hyper-plane for the system of (3.14), with the state variables predicted by the observer, as
$$\sigma = s_1(x_1 - x_m) + s_2(x_2 - \dot{x}_m) + s_3 x_3 + s_4 x_4. \quad (3.33)$$
When the sliding mode is in operation, then
$$\sigma = 0 \quad (3.34)$$
$$\dot{\sigma} = 0. \quad (3.35)$$
The equivalent control input can be obtained by substituting (3.14) into (3.35). This gives
$$T_{eq} = 2\alpha_{11}x_2x_3x_4 + \frac{\alpha_{01}}{\alpha_{11}}\left(-H_1x_3 + \alpha_{11}x_2^2x_3\right) - \frac{\Delta}{s_2}\left[s_1(x_2 - \dot{x}_m) - s_2\ddot{x}_m + s_3x_4 + s_4\dot{x}_4\right] \quad (3.36)$$
where it can be assumed that $\Delta = (\alpha_{00} - \alpha_{01}^2/\alpha_{11}) > 0$.
Now, the design of the MR-SMC is considered, in which the non-linear input makes the state converge to the hyper-plane. In general, the eventual sliding mode input can be considered as two independent inputs, namely the equivalent control input $T_{eq}$ and the non-linear control input $T_l$; in other words,
$$T = T_{eq} + T_l = T_{eq} - k(x,t)\,\mathrm{sat}(\sigma) \quad (3.37)$$
where
$$\mathrm{sat}(\sigma) = \begin{cases}1 & \text{if } \sigma > \delta\\ \sigma/\delta & \text{if } |\sigma| \le \delta\\ -1 & \text{if } \sigma < -\delta\end{cases} \quad (3.38)$$
and k(x,t) is the control input function; δ is a constant to eliminate the chattering. The condition for realization of the sliding mode is obtained from the Lyapunov function, as mentioned before. The Lyapunov method is usually used to determine the stability properties of an equilibrium point without solving the state equation. A generalized Lyapunov function that characterizes the motion of the state trajectory to the sliding surface is defined in terms of the surface. For each chosen switched control structure, one chooses the "gains" so that the derivative of this Lyapunov function is negative definite, thus guaranteeing motion of the state trajectory to the surface. After proper design of the surface, a switched controller is constructed so that the tangent vectors of the state trajectory point towards the surface, such that the state is driven to and maintained on the sliding surface. Such controllers result in discontinuous closed-loop systems. The following Lyapunov function of σ is chosen to confirm σ = 0:
$$V = \frac{1}{2}\sigma^2. \quad (3.39)$$
With this, V & is given by<br />
⎧ ⎡<br />
⎪ s2<br />
⎢<br />
α<br />
− T − 2α<br />
11x2<br />
x3x4<br />
−<br />
V&<br />
= σ σ&<br />
= σ ⎨ ∆ ⎢<br />
α<br />
⎪<br />
⎣<br />
⋅<br />
⋅⋅<br />
⎪⎩<br />
+ s1<br />
( x2<br />
− xm<br />
) − s2<br />
xm<br />
+ s3x<br />
01<br />
11<br />
4<br />
⎛ − H ⎫<br />
1x3<br />
+ ⎞⎤<br />
⎜ ⎟⎥⎪<br />
⎜ ⎟⎥<br />
2<br />
⎝α<br />
⎠<br />
⎬<br />
11x2<br />
x3<br />
⎦<br />
⎪<br />
⋅<br />
+ s ⎪<br />
4<br />
x4<br />
⎭ .<br />
(3.40)<br />
Substituting (3.37) into (3.40), the existence condition for the sliding mode is given as
$$\dot{V} = \sigma\left\{-\frac{s_2}{\Delta}k(x,t)\,\mathrm{sgn}(\sigma)\right\} = -k(x,t)\frac{s_2}{\Delta}|\sigma| < 0. \quad (3.41)$$
Since $s_2/\Delta > 0$, if we choose k(x,t) > 0, then the state variable x will converge to the sliding
3.5 SIMULATION RESULTS<br />
mode hyper-plane and a stable SMC can be realized. The controller gains are determined using<br />
our proposed algorithm (see Chap. 2) so as to minimize the cost function given by
$$J_h = \sum\left[L\cdot(x_1 - x_m) + x_3\right]. \quad (3.42)$$
In the estimation of the unknown parameters of the MR-SMC, $s_1$, $s_2$, $s_3$, $s_4$, k(x,t) and δ, each one is calculated by (2.62). The parameter values are $s_1$ = 4.2, $s_2$ = 1, $s_3$ = 10.19, $s_4$ = -0.41, δ = 0.2 and k(x,t) = 2.14. Figure 3.3 shows the block diagram of the designed system.
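The control law (3.33), (3.37), (3.38) with the gains determined above can be sketched as follows. $T_{eq}$ is left as an argument, since evaluating (3.36) requires the model constants $\alpha_{ij}$ and $H_1$:

```python
# Sketch of the MR-SMC control law: sliding surface (3.33), saturation (3.38)
# and total torque (3.37), with the gains of Table 3.2. T_eq is passed in as an
# argument because evaluating (3.36) needs the model constants alpha_ij and H_1.

S = [4.2, 1.0, 10.19, -0.41]     # surface gains s1..s4
DELTA = 0.2                      # boundary-layer width (delta)
K_GAIN = 2.14                    # switching gain k(x,t), taken constant here

def sat(sigma):
    """Saturation non-linearity of Eq. (3.38)."""
    if sigma > DELTA:
        return 1.0
    if sigma < -DELTA:
        return -1.0
    return sigma / DELTA

def smc_torque(x, xm, xm_dot, T_eq):
    """Total torque T = T_eq - k(x,t) * sat(sigma), Eq. (3.37)."""
    sigma = (S[0] * (x[0] - xm) + S[1] * (x[1] - xm_dot)
             + S[2] * x[2] + S[3] * x[3])     # sliding surface (3.33)
    return T_eq - K_GAIN * sat(sigma)

print(smc_torque([0.1, 0.0, 0.0, 0.0], 0.1, 0.0, 0.0))  # on the surface: 0.0
```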
Fig. 3.3. Block diagram of the MR-SMC system incorporating the non-linear observer.
3.5 -Simulation<br />
The MR-SMC method and M2-SPSA are used in order to achieve very suitable control of the angular position of the single flexible link, suppressing its oscillation. The results are compared with simulations done previously [47] without the proposed SMC. The numerical values are as follows:
J = 0.00135520 [kg·m²], m = 0.026 [kg], ρa = 0.0630 [kg/m], EI = 0.09007 [N·m²], L = 0.4 [m], $x_0$ = [-0.1 0 0 0]$^T$, $x_d$ = [0.1 0 0 0]$^T$, Δt = 0.1 [ms], M = 0.025 [kg]. First, the parameter estimation in the non-linear observer and MR-SMC using the proposed SPSA algorithm is
compared with effective estimation algorithms under the same conditions mentioned previously; the Robbins-Monro stochastic approximation (RM-SA) [9] and the Least-Squares (LS) method [10] are used here.
Table 3.1. Comparison of estimators (non-linear observer).

Algorithm   k1      k2       k3      k4       ς       γ
M2-SPSA     -227    -25015   13.69   -11101   0.010   0.002
RM-SA       -366    -30055   19.10   -12971   0.019   0.006
LS          -397    -30471   20.16   -13100   0.042   0.009
Table 3.2. Comparison of estimators (MR-SMC).

Algorithm   s1     s2    s3      s4      δ     k(x,t)
M2-SPSA     4.2    1     10.19   -0.41   0.2   2.14
RM-SA       5.0    2     17.72   -0.67   0.2   3.63
LS          5.8    2     20.14   -0.84   0.2   4.01
In the above tables, the values obtained by M2-SPSA are very suitable in terms of estimation precision for the current system. The results obtained by our algorithm are explained by the fact that M2-SPSA does not depend on derivative information and is able to find a good approximation to the solution using few function values, which results in a low computational cost. Also, its implementation is easier than that of the other methods, since our algorithm needs fewer coefficients to be specified. For this reason, it is possible to obtain good parameter estimates. Finally, in the other methods, an accurate value of the slope [48] is used for the evaluation function.
The variability of the parameter values is explained by the stopping condition: when its value becomes very small, the iterations are stopped. The tables are explained using this criterion as defined in this simulation.
In contrast, in M2-SPSA the slope is estimated, and the estimation error for the slope affects the convergence speed. Table 3.3 compares the number of iterations and the computational load, or normalized CPU (central processing unit) time [49] (computational cost in processing time), with the CPU time required by M2-SPSA as the reference. These comparisons are done according to the average performance of M2-SPSA and the SA algorithms for the estimated parameters reported in Tables 3.1 and 3.2. The CPU time is the processing time needed to estimate each parameter; here the CPU time of M2-SPSA is represented as 1, from which we can evaluate whether the other algorithms used for comparison need two or more times the CPU time required by our proposed SPSA.
Table 3.3. Per<strong>for</strong>mance comparison among M2-<strong>SPSA</strong>, RM-SA and LS.<br />
<strong>Algorithm</strong> Iterations CPU<br />
M2-<strong>SPSA</strong> 30000 1<br />
RM-SA 29000 2.1<br />
LS 28000 5.2<br />
In Table 3.3, LS is efficient in terms of the number of iterations required to achieve a certain level of accuracy in the parameter estimation for the current system, but it is computationally expensive and also has a high computational complexity. The LS and RM-SA algorithms depend on derivative information and its solution at each iteration, which can increase the computational cost and complexity.
The CPU time required by LS and RM-SA is 5 and 2 times, respectively, the CPU time required by M2-SPSA, so that, in terms of efficiency, the use of these algorithms might be questionable. On the other hand, the proposed SPSA algorithm has a low computational cost and usually provides less dispersed parameters. In the number of iterations, these algorithms are almost similar, but the features of our proposed SPSA can reduce the computational cost (see Chap. 2), which is a great advantage. Even the typical SPSA algorithm has a modest computational complexity, as shown in [6], which leads to a low computational expense in M2-SPSA.
The reason for the results obtained by M2-SPSA in Table 3.3 is that this algorithm is a very powerful technique that allows an approximation of the gradient or Hessian by effecting simultaneous random perturbations in all the parameters. Therefore, the data of the proposed SPSA algorithm contrast with the other approximations, in which the evaluation of the gradient is achieved by varying the parameters one at a time. Figures 3.4-3.7 show the simulation results using the state variables and torque. Figure 3.4 shows the response of the motor shaft angle in the simulation by the proposed method. The tracking performance associated with the motor angle is very suitable using a non-linear observer applied to the MR-SMC method.
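The computational advantage described above can be made concrete: a two-sided SPSA gradient estimate costs two loss evaluations regardless of the parameter dimension p, whereas one-at-a-time finite differences cost 2p. The quadratic loss and parameter values below are a toy illustration only:

```python
import random

# Toy comparison of evaluation counts: one two-sided SPSA gradient estimate
# needs only 2 loss evaluations regardless of the number of parameters p, while
# one-at-a-time (finite-difference) gradients need 2p. The quadratic loss and
# the parameter vector are illustrative only.

random.seed(0)
evals = {"n": 0}          # counts how many times the loss is evaluated

def loss(theta):
    evals["n"] += 1
    return sum(t * t for t in theta)

def spsa_gradient(theta, c=1e-3):
    """Simultaneous-perturbation gradient estimate with Bernoulli +/-1 deltas."""
    delta = [random.choice((-1.0, 1.0)) for _ in theta]
    lp = loss([t + c * d for t, d in zip(theta, delta)])
    lm = loss([t - c * d for t, d in zip(theta, delta)])
    return [(lp - lm) / (2.0 * c * d) for d in delta]

theta = [1.0, -2.0, 0.5, 3.0]
g = spsa_gradient(theta)
print(evals["n"], g)      # 2 evaluations, whatever the dimension of theta
```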
Fig. 3.4. Motor angle. Without M2-SPSA and MR-SMC (dotted line (.-)). With RM-SA and MR-SMC (dashed line (- -)). With LS and MR-SMC (dash-dot line (-.-)). With M2-SPSA and MR-SMC (solid line (-)).
Figure 3.5 shows the tip position response of the single flexible link. The VSS non-linear observer is very important in eliminating the effects due to the load of the arm (see the solid line).
Figure 3.6 shows the tip velocity. The proposed algorithm with MR-SMC reduces the magnitude of the velocity to a small value (solid line). We can see that after 0.5 seconds the system starts to become stable and the state variables predicted by the non-linear observer converge more efficiently in the sliding mode plane.
Figure 3.7 shows the control torque. This simulation shows the control of the force that rotates the beam generated by our method (solid line), which is stabilized after 0.5 seconds. In these simulations, we can see that using the non-linear observer and MR-SMC it is possible to obtain good performance, since the non-linear observer is very reliable in predicting the state variables. Also, MR-SMC is an important control method used here that requires an estimate of all state variables predicted by the non-linear observer. Thus, the sliding mode control method is an important robust control approach. For the class of systems to which it applies, sliding mode controller design provides a systematic approach to the problem of maintaining stability and consistent performance in the face of modeling imprecision.
On the other hand, by allowing the trade<strong>of</strong>fs between modeling and per<strong>for</strong>mance to be<br />
quantified in a simple fashion, it can illuminate the whole design process.<br />
Fig. 3.5. Tip position. Without M2-SPSA and MR-SMC (dotted line (.)). With RM-SA and MR-SMC (dashed line (- -)). With LS and MR-SMC (dash-dot line (-.-)). With M2-SPSA and MR-SMC (solid line (-)).
Fig. 3.6. Tip velocity. Without M2-SPSA and MR-SMC (dotted line (.)). With RM-SA and MR-SMC (dashed line (- -)). With LS and MR-SMC (dash-dot line (-.-)). With M2-SPSA and MR-SMC (solid line (-)).
Fig. 3.7. Control torque. Without M2-SPSA and MR-SMC (dotted line (.)). With RM-SA and MR-SMC (dashed line (- -)). With LS and MR-SMC (dash-dot line (-.-)). With M2-SPSA and MR-SMC (solid line (-)).
Fig. 3.8. Motor angle. Simulation using $x_1$ with M2-SPSA and MR-SMC (solid line). Simulation using $x_m$ with M2-SPSA and MR-SMC (dashed line).
Fig. 3.9. Tip position. Simulation using $x_3$ with M2-SPSA and MR-SMC (solid line). Simulation using $\hat{x}_3$ with M2-SPSA and MR-SMC (dashed line).
Fig. 3.10. Tip velocity. Simulation using $x_4$ with M2-SPSA and MR-SMC (solid line). Simulation using $\hat{x}_4$ with M2-SPSA and MR-SMC (dashed line).
In these simulations, we can see that using the non-linear observer and MR-SMC it is possible to obtain good performance, since the non-linear observer is very reliable in predicting the state variables. Also, MR-SMC is an important control method used here that requires an estimate of all state variables predicted by the non-linear observer. For this kind of system, the MR-SMC design (see Fig. 3.3) provides a systematic approach to the problem of maintaining stability and consistent performance in the face of modeling imprecision. Moreover, M2-SPSA showed better performance in estimating the observer and MR-SMC parameters in comparison with the other algorithms.
In this chapter, we have proposed an MR-SMC method using a non-linear observer for controlling the angular position of the single flexible link, suppressing its oscillation.
We can see that the non-linear observer and the MR-SMC provide successful and stable operation of the system. We have also proposed the use of M2-SPSA in order to determine the observer/controller gains; it could determine them very efficiently and with a low computational cost. The non-linear observer was successful in predicting the state variables from the motor angular position, and the MR-SMC was a very efficient control method.
In future work, we plan to perform real experiments using this model. Beforehand, however, it is necessary to evaluate several factors, such as the physical conditions (dimensions and material of the flexible arm) and the estimation of the gradient, which needs to reach a certain level of accuracy. The handling of the deflection within the proposed method is also considered a factor in the real experiments. Even when considering the robust controller, an exact modeling to some extent is thought to be necessary in order to be able to predict the experimental results through simulations, and this feature must also be considered. Finally, friction will also be an important factor to consider in the real experiments.
Chapter 4<br />
Lattice IIR Adaptive Filter Structure<br />
Adapted by <strong>SPSA</strong> <strong>Algorithm</strong><br />
In this second application, the M2-SPSA algorithm is applied to parameter estimation, in this case to obtain the coefficients of the adaptive algorithms in the model proposed here; these adaptive algorithms are Steiglitz-McBride (SM) and the Simple Hyperstable Adaptive Recursive Filter (SHARF). The results are compared with previous lattice versions of these algorithms, and the performance of the coefficients is compared. Finally, we also make some modifications to the adaptive algorithms proposed here in order to obtain suitable stability and convergence.
Adaptive infinite impulse response (IIR), or recursive, filters are less attractive mainly because of stability issues and the difficulties associated with their adaptive algorithms. Therefore, in this chapter adaptive IIR lattice filters are studied in order to devise algorithms that preserve the stability of the corresponding direct-form schemes. We analyze the local properties of stationary points; a transformation achieving this goal is suggested, which yields algorithms that can be efficiently implemented. Application to the SM and SHARF algorithms is presented. The M2-SPSA is presented in order to obtain the coefficients in lattice form more efficiently and with a lower computational cost and complexity. The results are compared with previous lattice versions of these algorithms, which may fail to preserve the stability of stationary points.
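As a preview of the lattice structures studied in this chapter, the sketch below implements a minimal all-pole lattice section and its built-in stability test: the filter is stable whenever every reflection coefficient satisfies $|k_i| < 1$. The reflection-coefficient values are arbitrary illustrative choices, not taken from the thesis:

```python
# Minimal all-pole IIR lattice section. Stability is guaranteed by construction
# whenever every reflection coefficient satisfies |k_i| < 1, which is the
# "built-in stability" property of the lattice mentioned above. The reflection
# coefficients used in the demo are arbitrary illustrative values.

def lattice_allpole(x, k):
    """Filter sequence x through an all-pole lattice with reflection
    coefficients k (filter order = len(k)); returns the output sequence."""
    assert all(abs(ki) < 1.0 for ki in k), "reflection coefficient outside (-1, 1)"
    b = [0.0] * (len(k) + 1)        # backward prediction errors b_i(n-1)
    y = []
    for sample in x:
        f = sample                   # f_M(n) = input
        for i in range(len(k) - 1, -1, -1):
            f = f - k[i] * b[i]          # f_i(n) = f_{i+1}(n) - k_{i+1} b_i(n-1)
            b[i + 1] = b[i] + k[i] * f   # b_{i+1}(n) = b_i(n-1) + k_{i+1} f_i(n)
        b[0] = f                     # b_0(n) = f_0(n)
        y.append(f)                  # output y(n) = f_0(n)
    return y

# impulse response of a 2nd-order section; it decays because |k_i| < 1
h = lattice_allpole([1.0] + [0.0] * 99, [0.5, -0.3])
print(h[:3])    # approximately [1.0, -0.35, 0.4225]
```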
4.1 -Introduction<br />
In the last decade, substantial research effort has been spent on turning adaptive IIR filtering techniques into a reliable alternative to traditional adaptive finite impulse response (FIR) filters. The main advantages of IIR filters are that they are more suitable for modeling physical systems, due to their pole-zero structure, and that they require many fewer parameters to achieve the same performance level as FIR filters. Unfortunately, these good characteristics come along with some possible drawbacks inherent to adaptive filters with a recursive structure, such as algorithm
CHAPTER 4. APPLICATION USING M2-SPSA ALGORITHM II
instability, convergence to biased and/or local-minimum solutions, as well as slow convergence. Consequently, several new algorithms for adaptive IIR filtering have been proposed in the literature attempting to overcome these problems. Extensive research on the subject, however, seems to suggest that no general-purpose optimal algorithm exists. In fact, all available information must be considered when applying adaptive IIR filtering, in order to determine the most appropriate algorithm for a given problem. The need for ensuring stable operation of adaptive IIR filters has thus spawned much interest in structures other than the direct form. In particular, the lattice structure has received considerable attention due to several advantages, such as the one-to-one correspondence between transfer functions and parameter spaces, good numerical properties, and built-in stability [50]. Therefore, several adaptive algorithms described in [50], originally devised for direct-form structures, have been modified in order to allow for a lattice realization of the filter. These algorithms use a conventional method based on exploiting the properties of the lattice structure [52] and suitable approximations [53]. Algorithms based on this conventional method offer a relatively low computational load, and in most cases these approximate lattice algorithms preserve the set of stationary points. Nevertheless, it has not been clear whether the convergence properties of the stationary points are well preserved. Moreover, the reduction in computational load is not sufficient, especially in the estimation of the reflection coefficients of the lattice form. Hence, in this chapter a new approach to improving the lattice structure is proposed. The Ordinary Differential Equation (ODE) method [50]-[54] is used in order to obtain a transformation which allows deriving sufficient conditions for convergence. The method is very general, applying to any pair of structures as long as a one-to-one correspondence exists between them. For the direct-form to lattice case, it is shown how to efficiently implement this transformation. This approach is applied to the same adaptive algorithms used in [50], in this case the lattice versions of the Steiglitz-McBride (SM) and the Simple Hyperstable Adaptive Recursive Filter (SHARF) algorithms, for which it is also shown how pre-existing approximate algorithms may fail to converge in some cases. Finally, in order to obtain the reflection coefficients in lattice form, we propose a gradient-free method. This method is based only on objective function measurements and does not require knowledge of the gradients of the underlying model. As a result, it is very easy to implement and reduces the computational cost in its applications. The gradient-free method proposed here is the Simultaneous Perturbation Stochastic Approximation (SPSA) algorithm [3]. It is based on a randomized method in which all parameters are perturbed simultaneously [3], making it possible to update the parameters with only two measurements of an evaluation function regardless of the dimension of the parameter vector. This algorithm is very
4.2 PROCEDURE OF IMPROVED ALGORITHM<br />
useful, but the traditional SPSA algorithm can incur, in some cases (systems with a large number of parameters), a high computational cost [3]. Therefore, we have proposed a modified version of SPSA applied to reflection coefficient estimation in the current system, in order to obtain the estimated coefficients more efficiently while reducing the computational cost. The organization of the present chapter is as follows: in Sec. 4.2, the derivation of the proposed algorithm is described; in Sec. 4.3, the application to the lattice structure is explained; the adaptive algorithms are described in Sec. 4.4; and the simulation results obtained with the proposed methods are shown in Sec. 4.5.
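The two-measurement property just described can be sketched as follows. This is a minimal illustrative Python sketch of standard first-order SPSA, not the M2-SPSA variant developed in this thesis; the quadratic loss and the gain sequences are arbitrary assumptions chosen only for the demonstration.

```python
import numpy as np

def spsa_gradient(loss, theta, c):
    """Two-measurement SPSA gradient estimate (simultaneous perturbation)."""
    delta = np.random.choice([-1.0, 1.0], size=theta.shape)  # Bernoulli +/-1 perturbation
    y_plus = loss(theta + c * delta)
    y_minus = loss(theta - c * delta)
    # The same two loss measurements serve every component of the gradient,
    # regardless of the dimension of theta.
    return (y_plus - y_minus) / (2.0 * c * delta)

def spsa_minimize(loss, theta0, a=0.1, c=0.1, iters=2000, seed=0):
    """Run SPSA with standard decaying gain sequences a_k, c_k."""
    np.random.seed(seed)
    theta = theta0.astype(float)
    for k in range(iters):
        ak = a / (k + 1) ** 0.602   # commonly used gain decay exponents
        ck = c / (k + 1) ** 0.101
        theta = theta - ak * spsa_gradient(loss, theta, ck)
    return theta

# Toy example: quadratic loss with a known minimizer.
target = np.array([1.0, -2.0, 0.5])
loss = lambda th: float(np.sum((th - target) ** 2))
est = spsa_minimize(loss, np.zeros(3))
```

Note that each iteration costs two loss evaluations no matter how many parameters are estimated, which is the property exploited throughout this chapter.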
4.2 Procedure of Improved Algorithm
Consider a direct-form adaptive filter

$$\hat{H}(z) = \frac{B(z)}{A(z)} = \frac{\sum_{i=0}^{N} b_i z^{-i}}{1 + \sum_{j=1}^{M} a_j z^{-j}} \qquad (4.1)$$
parameterized by $\theta_d = [\,b_0, \ldots, b_N, a_1, \ldots, a_M\,]^T$. Usually, constant-gain algorithms can be written as

$$\theta_d(n+1) = \theta_d(n) + \mu\, X_d(n)\, e(n) \qquad (4.2)$$
where $\mu > 0$ is a step size, $e(\cdot)$ is some signal, and $X_d(\cdot)$ is a driving vector that depends on the specific algorithm. Let $\theta_l$ be the corresponding parameter vector for a different implementation of the filter, such that there exists a one-to-one map $\theta_d = f(\theta_l)$ defined on a suitable stability domain that allows one to move back and forth between both descriptions. The objective is to reformulate algorithm (4.2) in terms of $\theta_l$. Let us define the Jacobian matrix as

$$F(\theta_f) = \frac{df(\theta_l)}{d\theta_l}. \qquad (4.3)$$
We omit a subscript in the argument of $F$, since $F$ can be expressed as a function of either $\theta_d$ or $\theta_l$ by means of the map $f$; we can think of $\theta_f$ as representing the actual transfer function $\hat{H}(z)$, while $\theta_d$ and $\theta_l$ are the parameter vectors that describe $\hat{H}(z)$ in a particular set of coordinates. The following algorithm can be used to update $\theta_l$:
$$\theta_l(n+1) = \theta_l(n) + \mu\, X_l(n)\, e(n) \qquad (4.4)$$

$$X_l(n) = F^T(\theta_f(n))\, X_d(n). \qquad (4.5)$$

That is, the driving vector for the new coordinates, $X_l(n)$, is related to $X_d(n)$ through the Jacobian $F$. Since the map $f$ is one-to-one, $F(\theta_f)$ has full rank for all $\theta_f$ describing stable transfer functions. Therefore, if $\theta_d^* = f(\theta_l^*)$, then $\theta_d^*$ is a stationary point of (4.2) iff $\theta_l^*$ is a stationary point of (4.4), since

$$E\big[X_d(n)\, e(n)\big]\Big|_{\theta_f^*} = 0 \;\Longleftrightarrow\; E\big[X_l(n)\, e(n)\big]\Big|_{\theta_f^*} = 0. \qquad (4.6)$$
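As a concrete toy illustration of the update (4.4)-(4.5), the sketch below performs one transformed step: the driving vector in the new coordinates is simply the direct-form driving vector multiplied by the transposed Jacobian. The linear map and its constant Jacobian here are placeholder assumptions for illustration, not the lattice map developed later.

```python
import numpy as np

def transformed_step(theta_l, x_d, e_n, mu, jacobian_f):
    """One iteration of (4.4)-(4.5): theta_l <- theta_l + mu * e * F^T x_d."""
    F = jacobian_f(theta_l)          # F(theta_f) = df/d(theta_l), full rank
    x_l = F.T @ x_d                  # driving vector in the new coordinates (4.5)
    return theta_l + mu * e_n * x_l  # update (4.4)

# Placeholder map: f(theta_l) = A @ theta_l for a fixed invertible A,
# so the Jacobian is constant and full rank (an assumption for this sketch).
A = np.array([[2.0, 1.0],
              [0.0, 1.0]])
jac = lambda theta: A
theta = transformed_step(np.zeros(2), np.array([1.0, -1.0]), 0.5, 0.1, jac)
```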
So the stationary points are preserved. Now, the convergence issue is described. By applying the ODE method [55], for sufficiently small $\mu$ the stationary point $\theta_l^*$ is locally stable for algorithm (4.4) iff all the eigenvalues of the matrix

$$S_l = \frac{dE[X_l(n)\, e(n)]}{d\theta_l}\bigg|_{\theta_f^*} = \underbrace{E\!\left[\frac{dX_l(n)}{d\theta_l}\, e(n)\right]_{\theta_f^*}}_{=\,P} + \underbrace{E\!\left[X_l(n)\, \frac{de(n)}{d\theta_l^T}\right]_{\theta_f^*}}_{=\,Q} \qquad (4.7)$$
have negative real parts. For a vector $V$, let $V^{(k)}$ denote its $k$-th component. Then, the $(i,j)$ element of $P$ is given by

$$P_{i,j} = E\!\left[\frac{\partial X_l^{(i)}(n)}{\partial \theta_l^{(j)}}\, e(n)\right]_{\theta_f^*} = \underbrace{\sum_{k=1}^{N+M+1} \frac{\partial F_{ki}(\theta_f)}{\partial \theta_l^{(j)}}\, E\big[X_d^{(k)}(n)\, e(n)\big]_{\theta_f^*}}_{=\,0} + \sum_{k=1}^{N+M+1} F_{ki}(\theta_f^*)\, E\!\left[\frac{\partial X_d^{(k)}(n)}{\partial \theta_l^{(j)}}\, e(n)\right]_{\theta_f^*} \qquad (4.8)$$

where the first sum vanishes because $\theta_f^*$ is a stationary point.
Using (4.8) and the chain rule,

$$P = F^T(\theta_f^*) \cdot E\!\left[\frac{dX_d(n)}{d\theta_d}\, e(n)\right]_{\theta_f^*} \cdot F(\theta_f^*). \qquad (4.9)$$

On the other hand, using the chain rule again together with (4.5),

$$Q = F^T(\theta_f^*) \cdot E\!\left[X_d(n)\, \frac{de(n)}{d\theta_d^T}\right]_{\theta_f^*} \cdot F(\theta_f^*). \qquad (4.10)$$
Therefore, the derivative matrix $S_l = P + Q$ reduces to

$$S_l = \frac{dE[X_l(n)\, e(n)]}{d\theta_l}\bigg|_{\theta_f^*} = F^T(\theta_f^*) \cdot \underbrace{\frac{dE[X_d(n)\, e(n)]}{d\theta_d}\bigg|_{\theta_f^*}}_{=\,S_d} \cdot\, F(\theta_f^*). \qquad (4.11)$$
Here, (4.11) relates the stability matrices of algorithms (4.2) and (4.4) through the Jacobian $F(\theta_f^*)$. If the matrix $S_d$ is symmetric, then $\theta_l^*$ is a locally stable stationary point for algorithm (4.4) iff $\theta_d^*$ is a locally stable stationary point for algorithm (4.2). This is proved in the following way: in view of (4.11) and Sylvester's law of inertia, the signs of the eigenvalues of the matrices $S_d$ and $S_l$ are the same. Also, if $S_d < 0$, then $\theta_l^*$ is a locally stable stationary point for algorithm (4.4): in view of (4.11), $S_l < 0$ iff $S_d < 0$, and since all the eigenvalues of a negative definite matrix have negative real parts, it follows that $\theta_l^*$ is locally stable for (4.4) (and $\theta_d^*$ is locally stable for (4.2)). These observations give sufficient conditions under which the stability of algorithm (4.2) implies the stability of algorithm (4.4).
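The role of Sylvester's law of inertia here can be checked numerically: a congruence $S_l = F^T S_d F$ with full-rank $F$ leaves the signs of the eigenvalues unchanged. The particular matrices below are arbitrary illustrative choices, not quantities from the filtering problem.

```python
import numpy as np

def inertia(S):
    """Eigenvalue sign counts (negative, zero, positive) of a symmetric matrix."""
    w = np.linalg.eigvalsh(S)
    tol = 1e-12
    return (int(np.sum(w < -tol)), int(np.sum(np.abs(w) <= tol)), int(np.sum(w > tol)))

# Congruence S_l = F^T S_d F with full-rank F preserves inertia (Sylvester).
S_d = np.diag([-2.0, -1.0, 3.0])       # two negative, one positive eigenvalue
F = np.array([[1.0, 2.0, 0.0],
              [0.0, 1.0, 1.0],
              [1.0, 0.0, 1.0]])        # det = 3, so full rank
S_l = F.T @ S_d @ F
```

In particular, $S_d$ negative definite implies $S_l$ negative definite, which is the case used in the proof above.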
4.3 Lattice Structure
Lattice filters are typically used as linear predictors because it is easy to ensure that they are minimum phase and hence that their inverses are stable [52]. The lattice-form adaptive IIR algorithms derived here are expected to have at least the following advantages over the direct-form algorithms: i) faster convergence; ii) easier stability monitoring, even simpler than for the parallel form; iii) greater robustness under finite-precision implementation [52]. One important characteristic of this structure is the possibility of representing multiple poles [52]. It is expected that these structural advantages can bring about a substantial performance improvement for adaptive filters. The derivation described in Sec. 4.2 is therefore applied in this section to obtain efficient adaptive algorithms for lattice filters, exploiting the characteristics of this structure mentioned above. First, the adaptive filter is implemented as a cascade of a direct-form FIR filter
$B(z) = \sum_{i=0}^{N} b_i z^{-i}$ and an all-pole lattice filter $1/A(z)$. Then $\theta_l$ is defined by

$$\theta_l = [\,b_0, \ldots, b_N, \sin\alpha_1, \ldots, \sin\alpha_M\,]^T \qquad (4.12)$$
where the $\sin\alpha_k$ are the reflection coefficients of the lattice part (these coefficients can be calculated using a modified version of SPSA; this algorithm is explained in Chap. 2). In general, the reflection coefficients are estimated as cross-correlation coefficients between forward and backward prediction errors in each stage of the adaptive lattice filter. Accordingly, two divisions per stage are required, effectively doubling the number of stages. A problem is that the processing cost of division is higher than that of multiplication, especially in cheap digital signal processors (DSPs). Here, these coefficients are calculated by our modified version of SPSA, which reduces the number of divisions; the proposed technique can decrease the number of divisions to one. This algorithm will be explained in the following section.
The Jacobian is

$$F(\theta_f) = \begin{bmatrix} I_{N+1} & \\ & D \end{bmatrix} \quad\text{with}\quad D_{ij} = \frac{\partial a_i}{\partial \sin\alpha_j}.$$
4.4 ADAPTIVE ALGORITHM<br />
Also, we have $X_d(n) = [\,V_d^T(n)\;\; W_d^T(n)\,]^T$ with

$$V_d(n) = \begin{bmatrix} 1 \\ z^{-1} \\ \vdots \\ z^{-N} \end{bmatrix} v(n), \qquad W_d(n) = \begin{bmatrix} z^{-1} \\ z^{-2} \\ \vdots \\ z^{-M} \end{bmatrix} \frac{1}{A(z)}\, \omega(n)$$

for some signals $v(n)$, $\omega(n)$ which depend on the particular algorithm. Similarly partitioning $X_l(n) = [\,V_l^T(n)\;\; W_l^T(n)\,]^T$, we find that $V_l(n) = V_d(n)$ and
$$W_l(n) = D^T W_d(n) = \begin{bmatrix} \dfrac{\partial a_1}{\partial \sin\alpha_1} & \cdots & \dfrac{\partial a_M}{\partial \sin\alpha_1} \\ \vdots & & \vdots \\ \dfrac{\partial a_1}{\partial \sin\alpha_M} & \cdots & \dfrac{\partial a_M}{\partial \sin\alpha_M} \end{bmatrix} \begin{bmatrix} z^{-1} \\ \vdots \\ z^{-M} \end{bmatrix} \frac{1}{A(z)}\, \omega(n) = \left[ \frac{\partial A(z)}{\partial \sin\alpha_1} \;\cdots\; \frac{\partial A(z)}{\partial \sin\alpha_M} \right]^T \frac{1}{A(z)}\, \omega(n).$$
Thus the problem boils down to efficiently implementing the transfer function $T(z) = [\,T_1(z), \ldots, T_M(z)\,]^T$ with

$$T_k(z) = \frac{1}{A(z)} \frac{\partial A(z)}{\partial \sin\alpha_k} = \frac{1}{\cos\alpha_k}\, \frac{1}{A(z)} \frac{\partial A(z)}{\partial \alpha_k}.$$

A structure that performs exactly this task, with complexity proportional to the filter order, was developed in [50]. Hence (4.4)-(4.5) can be efficiently implemented.
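For reference, the lattice-to-direct-form correspondence underlying $D_{ij} = \partial a_i / \partial \sin\alpha_j$ can be sketched with the standard step-up recursion, which maps reflection coefficients $k_m = \sin\alpha_m$ to the denominator coefficients $a_j$. This is a textbook construction shown only as an illustration; it is not the efficient structure of [50], and the example coefficients are arbitrary assumptions.

```python
import numpy as np

def reflection_to_direct(refl):
    """Step-up recursion: reflection coefficients k_m = sin(alpha_m)
    -> A(z) = 1 + a_1 z^-1 + ... + a_M z^-M, returned as [1, a_1, ..., a_M]."""
    a = np.array([1.0])
    for k in refl:
        a_prev = a
        # A_m(z) = A_{m-1}(z) + k_m * z^{-m} A_{m-1}(z^{-1})
        a = np.concatenate([a_prev, [0.0]])
        a += k * np.concatenate([[0.0], a_prev[::-1]])
    return a

def is_stable(refl):
    """Built-in stability of the lattice: all |sin(alpha_k)| < 1."""
    return all(abs(k) < 1.0 for k in refl)

a = reflection_to_direct([0.5, -0.25])
```

The built-in stability check is the practical payoff: monitoring $|\sin\alpha_k| < 1$ is trivial compared with testing the roots of $A(z)$ in the direct form.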
4.4 Adaptive Algorithm
4.4.1 SHARF Algorithm
The hyperstable adaptive recursive filter (HARF) algorithm is an early application of hyperstability [56] to signal processing, but it suffers from several setbacks that make it very hard to implement [57]. Landau [58] developed an algorithm for off-line system identification, based on hyperstability theory [58], that can be considered the origin of the SHARF algorithm. Basically, the SHARF algorithm has the following convergence properties [56][57]:
Property 1: In the case of sufficient order in identification ($n^* \geq 0$), the SHARF algorithm may not converge to the global minimum of the mean-square output error (MSOE) [57][58] performance surface if the plant transfer function denominator polynomial does not satisfy the following strictly positive realness condition:

$$\mathrm{Re}\!\left[\frac{D(z^{-1})}{A(z^{-1})}\right] > 0\,; \quad |z| = 1. \qquad (4.13)$$
Property 2: In the case of insufficient order in identification ($n^* < 0$), the adaptive filter output signal $\hat{y}(n)$ and the adaptive filter coefficient vector $\hat{\theta}$ are stable sequences, provided the input signal is sufficiently persistently exciting. The main problem of the SHARF algorithm seems to be the nonexistence of a robust practical procedure to define the moving-average filter $D(q^{-1})$ so as to guarantee the global convergence of the algorithm; here $D(q^{-1}) = \sum_{k=1}^{n_d} d_k q^{-k}$. This is a consequence of the fact that the condition in (4.13) depends on the plant denominator characteristics, which in practice are unknown. We now particularize (4.4)-(4.5) to the SHARF algorithm. For the direct-form SHARF [49], we have
$$v(n) = u(n), \qquad \omega(n) = -B(z)\, u(n) = -A(z)\, \hat{y}(n), \qquad e(n) = C(z)\,\big(y(n) - \hat{y}(n)\big).$$

In this expression, $C(z)$ is a compensating filter designed to make the transfer function $C(z)/A^*(z)$ strictly positive real (SPR) [57], where $A^*(z)$ is the denominator of $H(z)$.
A transfer function $G(z)$ is SPR if it is stable and causal and satisfies $\mathrm{Re}\, G(e^{j\omega}) > 0\;\; \forall \omega$. This SPR condition is a common convergence requirement for all hyperstability-based adaptive algorithms [57]. The block diagram of the adaptive filter is shown in Fig. 4.1.
Fig. 4.1. Block diagram of the SHARF lattice algorithm.
Assuming a sufficient-order setting and that the SPR condition is satisfied, it can be proved that the matrix $S_d$ for the SHARF algorithm is negative definite [57]. In order to guarantee global convergence of the SHARF algorithm independently of the plant characteristics, Landau [58] proposed applying a time-varying moving-average filter to the output error signal. Using Landau's approach, the modified SHARF algorithm can be given by
$$e_{\mathrm{SHARF}}(n) = \big[D(q^{-1}, n)\big]\, e_{\mathrm{OE}}(n), \qquad D(q^{-1}, n) = \sum_{k=0}^{n_d - 1} d_k(n)\, q^{-k} \qquad (4.14)$$

$$d_k(n+1) = d_k(n) + \mu_d\, e_{\mathrm{SHARF}}(n)\, e_{\mathrm{OE}}(n-k), \qquad k = 0, 1, \ldots, n_d - 1 \qquad (4.15)$$

$$\hat{\theta}_f(n+1) = \hat{\theta}_f(n) + \mu\, e_{\mathrm{SHARF}}(n)\, \hat{\phi}_{\mathrm{MOE}}(n). \qquad (4.16)$$
Another interesting interpretation of the modified SHARF algorithm can be found in [59]. Concerning the convergence of the modified SHARF algorithm, the error signal $e_{\mathrm{SHARF}}(n)$ is a sequence that converges to zero in the mean sense if $n^* \geq 0$ and $\mu$ satisfies

$$0 < \mu < \frac{1}{\|\phi_{\mathrm{SHARF}}(n)\|^2} \qquad (4.17)$$

where $\phi_{\mathrm{SHARF}}(n)$ is the extended information vector defined as

$$\phi_{\mathrm{SHARF}}(n) = \big[\, \hat{y}(n-i) \;\; x(n-j) \;\; e_{\mathrm{SHARF}}(n-k) \,\big]^T. \qquad (4.18)$$
It should be mentioned that if the signal $e_{\mathrm{SHARF}}(n)$ tends to zero, the output error signal $e_{\mathrm{OE}}(n)$ does not necessarily tend to zero. In fact, it was shown in [60] that the minimum-phase condition on $D(q^{-1}, n)$ must also be satisfied in order to guarantee that $e_{\mathrm{OE}}(n)$ converges to zero in the mean sense. This additional condition implies that continuous minimum-phase monitoring of the polynomial $D(q^{-1}, n)$ should be performed to assure global convergence of the SHARF algorithm, which prevents the general use of the SHARF algorithm in practice. It is also important to mention that although the members of the SHARF family of adaptive algorithms, which includes the modified output error (MOE) and SHARF algorithms, attempt to minimize the output error signal, these algorithms do not follow a gradient-descent concept; their convergence results come from hyperstability theory.
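One iteration of the modified SHARF recursion (4.14)-(4.16) can be sketched as follows. The vector sizes, the gains $\mu$ and $\mu_d$, and the placeholder information vector are illustrative assumptions, not values from the simulations of this chapter.

```python
import numpy as np

def sharf_step(theta, d, phi, e_oe_hist, mu, mu_d):
    """One modified-SHARF iteration following (4.14)-(4.16).

    e_oe_hist[k] holds e_OE(n - k); d holds the moving-average taps d_k(n)."""
    e_sharf = float(d @ e_oe_hist)            # (4.14): e_SHARF(n) = D(q^-1, n) e_OE(n)
    d_next = d + mu_d * e_sharf * e_oe_hist   # (4.15): adapt the moving-average taps
    theta_next = theta + mu * e_sharf * phi   # (4.16): adapt the filter coefficients
    return theta_next, d_next, e_sharf

theta, d, e_s = sharf_step(
    theta=np.zeros(2),
    d=np.array([1.0, 0.2]),
    phi=np.array([0.5, -0.5]),       # placeholder information vector
    e_oe_hist=np.array([0.4, 0.1]),  # e_OE(n), e_OE(n-1)
    mu=0.1, mu_d=0.05)
```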
4.4.2 Steiglitz-McBride Algorithm
In [61], Steiglitz and McBride developed an adaptive algorithm attempting to combine the good characteristics of the output error and equation error algorithms, namely an unbiased and a unique global solution, respectively. In order to achieve these properties, the so-called SM algorithm is based on an error signal $e(n)$ that is a linear function of the adaptive filter coefficients, yielding a unimodal performance surface, and that has a physical interpretation similar to the output error signal, leading to an unbiased global solution. For this adaptive algorithm, let $u(\cdot)$, $\hat{y}(\cdot)$ be the adaptive filter input and output, respectively, and let $y(\cdot)$ be the reference signal. For the SM adaptive algorithm described in [61] we have
$$e(n) = y(n) - \hat{y}(n), \qquad v(n) = u(n), \qquad \omega(n) = -\frac{1}{A(z)}\, y(n).$$
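The SM signal definitions can be sketched as follows; the all-pole filtering by $1/A(z)$ is written out as a direct time-domain recursion, and the example sequences and coefficients are arbitrary assumptions for the demonstration.

```python
import numpy as np

def all_pole_filter(a, x):
    """y(n) = x(n) - sum_j a[j-1] * y(n - j), i.e. filtering by 1/A(z)
    with A(z) = 1 + a[0] z^-1 + ... + a[M-1] z^-M."""
    y = np.zeros_like(x, dtype=float)
    for n in range(len(x)):
        acc = x[n]
        for j, aj in enumerate(a, start=1):
            if n - j >= 0:
                acc -= aj * y[n - j]
        y[n] = acc
    return y

def sm_signals(y, y_hat, u, a):
    """SM quantities: e(n) = y(n) - y_hat(n), v(n) = u(n),
    omega(n) = -(1/A(z)) y(n)."""
    e = y - y_hat
    v = u.copy()
    omega = -all_pole_filter(a, y)
    return e, v, omega

u = np.array([1.0, 0.0, 0.0])
y = np.array([1.0, 0.5, 0.25])
e, v, omega = sm_signals(y, y_hat=np.zeros(3), u=u, a=[-0.5])
```

Note that, unlike the output-error formulation, the regressor signal $\omega(n)$ is built from the reference $y(n)$ rather than the filter output, which is what makes the error linear in the coefficients.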
4.5 SIMULATION RESULTS<br />
Figure 4.2 shows the block diagram of the lattice implementation of (4.4)-(4.5) for the SM algorithm. Suppose that $y(n) = H(z)\, u(n)$, where $H(z)$ is a filter of the same order as $\hat{H}(z)$.
Fig. 4.2. Block diagram <strong>of</strong> the SM lattice algorithm.<br />
In case there is an additive output disturbance, the SM estimate remains unbiased as long as the disturbance is white [61][62]. For simplicity it is assumed here that the reference signal $y(\cdot)$ is not contaminated by noise. It can be shown that for this sufficient-order case, the matrix $S_d$ at the stationary point $\theta_f^*$ corresponding to $\hat{H}(z) = H(z)$ coincides with the Hessian matrix of the cost function $E[e^2(n)]$ evaluated at $\theta_d^*$, and is therefore symmetric. Since $\theta_d^*$ is locally stable for the direct-form SM algorithm [61], $\theta_l^*$ is locally stable for the lattice algorithm. In [62] an alternative way of implementing the SM algorithm using a normalized tapped lattice structure was presented. However, the stability of the stationary point is not guaranteed there.
4.5 Simulation Results
4.5.1 SHARF Algorithm
Here, we considered a setting in which $u(\cdot)$ was taken as unit-variance white noise, $N = 0$, $M = 6$, and $H(z) = 0.1/A^*(z)$, with $A^*(z)$ parameterized in lattice form by the reflection coefficients estimated by our proposed SPSA algorithm, $[\sin\alpha_1^* \cdots \sin\alpha_6^*] = [\,.6\;\; .95\;\; .86\;\; .84\;\; .9\;\; .51\,]$, and also $C(z) = A^*(z)$. Figure 4.3 shows the parameter trajectories of algorithm (4.4)-(4.5). The initial point was $\theta_l(0) = 0$. Convergence is achieved, as expected. On the other hand, the lattice version of SHARF presented in [63] fails to converge in this setting, as shown in Fig. 4.4. There, the initial value $\theta_l(0)$ was taken very close to the stationary point. For this algorithm the corresponding stability matrix can be shown to have unstable eigenvalues, which implies that the stationary point is not convergent [63]. Note that the SPR condition is satisfied; the problem does not reside there, but in the simplifications introduced when passing from the direct-form to the lattice algorithm. In Figs. 4.3-4.6, the dashed lines show the parameter values at the stationary point.
4.5.2 Steiglitz-McBride Algorithm
Let $N = 0$, $M = 6$ and $H(z) = 0.01/A^*(z)$, with $A^*(z)$ parameterized in lattice form by the reflection coefficients estimated by the proposed SPSA algorithm:

$$[\sin\alpha_1^* \cdots \sin\alpha_6^*] = [\,.6\;\; .95\;\; .86\;\; .84\;\; .81\;\; .72\,].$$
Assume that $u(\cdot)$ is unit-variance white noise. Then, it can be shown that even with no measurement noise, the corresponding stability matrix for the SM lattice algorithm of [62], evaluated at the stationary point $\hat{H}(z) = H(z)$, has a pair of unstable eigenvalues. This means that this stationary point cannot be locally convergent, as illustrated in Fig. 4.5, where the results of a computer simulation of this algorithm in the above setting are presented. The initial parameters were set to those of the stationary point, except for $\sin\alpha_2(0)$, which was set to 0.9499. Despite the proximity of this starting point to the stationary point, the algorithm clearly diverges, as expected. The reflection coefficients are estimated by our proposed SPSA algorithm. Fig. 4.6 shows the results obtained by applying algorithm (4.4)-(4.5) in the same setting, though now the initial point was $\theta_l(0) = [\,1\;\; .5\;\; .9\;\; .7\;\; .7\;\; .7\;\; .8\,]^T$. Convergence is achieved in this case, as predicted by the theory.
Fig. 4.3. Convergence <strong>of</strong> the proposed SHARF algorithm and M2-<strong>SPSA</strong>.<br />
Fig. 4.4. Instability <strong>of</strong> the existing SHARF algorithm.<br />
Fig. 4.5. Instability <strong>of</strong> the existing SM algorithm.<br />
Fig. 4.6. Convergence <strong>of</strong> the proposed SM algorithm and M2-<strong>SPSA</strong>.<br />
The previous figures show the better convergence achieved by our proposed method with the M2-SPSA algorithm in comparison with the previous simulations shown in [62]-[63]. We can also see that the number of iterations needed to achieve convergence with our proposed algorithm is reduced; this is because the M2-SPSA algorithm calculates the coefficients in the lattice form more efficiently and with less computational burden, as explained in Chap. 2.
Chapter 5<br />
Parameter Estimation using a Modified<br />
Version <strong>of</strong> <strong>SPSA</strong> <strong>Algorithm</strong> Applied to<br />
State-Space Models<br />
Finally, in this third application, M2-SPSA is applied to the estimation of unknown static parameters in non-linear, non-Gaussian state-space models. The results are compared with the FDSA algorithm, and the performance of the coefficients in a bi-modal non-linear model is compared. The objective of this chapter is the estimation of unknown static parameters in a non-linear, non-Gaussian state-space model. The Simultaneous Perturbation Stochastic Approximation (SPSA) algorithm is considered due to its highly efficient gradient approximation. We consider a particle filtering method and employ the SPSA algorithm to recursively maximize the likelihood function. Nevertheless, the SPSA algorithm can become inadequate in models such as non-Gaussian state-space models, so we propose a modification of the SPSA algorithm in order to estimate parameters efficiently in complex models such as the one considered here, while reducing the computational cost. An efficient parameter estimator, the Finite Difference Stochastic Approximation (FDSA) algorithm, is considered here in order to compare its efficiency with that of the proposed SPSA algorithm. The proposed algorithm can generate maximum likelihood estimates very efficiently. The performance of the proposed SPSA algorithm is shown through simulation using a model with a highly multimodal likelihood.
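To make the SPSA/FDSA efficiency comparison concrete: two-sided FDSA perturbs one parameter at a time and therefore needs $2p$ loss measurements per gradient estimate for $p$ parameters, while SPSA always needs two. A counting sketch follows; the quadratic loss is an arbitrary assumption used only to make the gradient easy to verify.

```python
import numpy as np

def fdsa_gradient(loss, theta, c, counter):
    """Two-sided finite differences: 2*p loss measurements for p parameters."""
    g = np.zeros_like(theta)
    for i in range(len(theta)):
        e = np.zeros_like(theta)
        e[i] = c
        g[i] = (loss(theta + e) - loss(theta - e)) / (2 * c)
        counter[0] += 2
    return g

def spsa_gradient(loss, theta, c, counter):
    """Simultaneous perturbation: always 2 measurements, any dimension."""
    delta = np.random.choice([-1.0, 1.0], size=theta.shape)
    counter[0] += 2
    return (loss(theta + c * delta) - loss(theta - c * delta)) / (2 * c * delta)

loss = lambda th: float(np.sum(th ** 2))   # gradient is exactly 2*theta
theta = np.ones(10)
n_fdsa, n_spsa = [0], [0]
g_fd = fdsa_gradient(loss, theta, 0.1, n_fdsa)
spsa_gradient(loss, theta, 0.1, n_spsa)
```

The trade-off is that the SPSA estimate is noisier per iteration; its efficiency advantage comes from the measurement count, not from per-step accuracy.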
5.1 Introduction
Dynamic state-space models are useful for describing data in many different areas, such as engineering, financial mathematics, environmental data, and physical science. Most real-world problems are non-linear and non-Gaussian (1); therefore, optimal state estimation in such problems does not admit a closed-form solution. Sequential Monte Carlo (SMC) methods, also known as particle filters, are a set of practical and flexible simulation-based techniques that have become increasingly popular for performing optimal filtering in non-linear, non-Gaussian models [64][65]. SMC methods recursively
CHAPTER 5. APPLICATION USING M2-<strong>SPSA</strong> ALGORITHM III<br />
generate and update a set of weighted samples, which provide approximations to the posterior probability distributions of interest. Standard SMC methods, however, assume knowledge of the model parameters. In many real-world applications, these parameters are unknown and need to be estimated. We therefore address here the challenging problem of obtaining their maximum likelihood (ML) estimates. ML parameter estimation using SMC methods still remains an open problem, despite various earlier attempts in the literature [66]. Previous approaches that extend the state with the unknown parameters and transform the problem into an optimal filtering problem suffered from several drawbacks [66][68]. Recently, a robust particle method to approximate the optimal filter derivative and perform ML parameter estimation has been proposed [64]. This method is efficient but computationally intensive. Gradient-based SA algorithms rely on a direct measurement of the gradient of an objective function with respect to the parameters of interest. Such an approach assumes that detailed knowledge of the system dynamics is available so that the gradient equations can be calculated. In the SMC framework, the gradient estimates of the particle approximations require an infinitesimal perturbation analysis-based approach [65]. This often results in a very high estimation variance that increases with the number of particles and with time. Although this problem can be successfully mitigated with a number of variance reduction techniques, this adds to the computational burden. In this chapter, we investigate the use of gradient-free SA techniques as a simple alternative for generating ML parameter estimates. A related approach was described in [67] to optimize the performance of SMC algorithms; we adapt this approach to our ML parameter estimation. In principle, gradient-free techniques have a slower rate of convergence compared to gradient-based methods. However, gradient-free methods are based only on objective function measurements and do not require knowledge of the gradients of the underlying model. As a result, they are very easy to implement and have a reduced computational complexity. The classical gradient-free method is the FDSA [21]. However, we have proposed a more efficient approach that has recently attracted attention, the SPSA [3]. This is based on a randomized method in which all parameters are perturbed simultaneously; it is thus possible to update the parameters with only two measurements of an evaluation function, regardless of the dimension of the parameter vector. This is very useful, but the traditional SPSA can in some cases incur a high computational cost [3]. Therefore, M2-SPSA is applied to ML parameter estimation in order to estimate the parameters more efficiently while reducing this cost. In this chapter, FDSA is used as a baseline for comparison with our proposed SPSA algorithm.
5.2 Implementation of SPSA Toward the Proposed Model
5.2.1 State-Space Model
In order to describe the state-space model [61], let $\{X_k\}_{k\ge0}$ and $\{Y_k\}_{k\ge0}$ be $\mathbb{R}^{n_x}$- and $\mathbb{R}^{n_y}$-valued stochastic processes defined on a measurable space $(\Omega, \mathcal{F})$. Let $\theta \in \Theta$ be the parameter vector, where $\Theta$ is an open subset of $\mathbb{R}^m$ [69]. A general discrete-time state-space model represents the unobserved state $\{X_k\}_{k\ge0}$ as a Markov process with initial density $X_0 \sim \mu$ and Markov transition density $f_\theta(x' \mid x)$ [61]. The observations $\{Y_k\}_{k\ge0}$ are assumed conditionally independent given $\{X_k\}_{k\ge0}$ and are characterized by their conditional marginal density $g_\theta(y \mid x)$. The model is summarized as follows:
$$X_k \mid \{X_{k-1} = x_{k-1}\} \sim f_\theta(\cdot \mid x_{k-1}) \qquad (5.1)$$

$$Y_k \mid \{X_k = x_k\} \sim g_\theta(\cdot \mid x_k) \qquad (5.2)$$
where the two densities can be non-Gaussian and may involve non-linearity. For any sequence $\{z_p\}$ and random process $\{Z_p\}$, we will use the notation $z_{i:j} = (z_i, z_{i+1}, \ldots, z_j)$ and $Z_{i:j} = (Z_i, Z_{i+1}, \ldots, Z_j)$. Assume for the time being that $\theta$ is known. In such a situation, one is interested in estimating the hidden state $X_k$ given the observation sequence $\{Y_k\}_{k\ge0}$. This leads
to the so-called optimal filtering problem that seeks to compute the posterior density $p_\theta(x_k \mid Y_{0:k})$ sequentially in time. We introduce a proposal distribution $q_\theta(x_k \mid Y_k, x_{k-1})$ whose support includes the support of $g_\theta(Y_k \mid x_k)\, f_\theta(x_k \mid x_{k-1})$. The SMC method [70] then approximates the optimal filtering density by a weighted empirical distribution, i.e. a weighted sum of $N > 1$ samples, termed particles. Here we will assume that at time $k-1$ the filtering density $p_\theta(x_{k-1} \mid Y_{0:k-1})$ is approximated by the particle set $X_{k-1}^{(1:N)} = \big[X_{k-1}^{(1)}, \ldots, X_{k-1}^{(N)}\big]$ having equal weights. The filtering distribution at the next time step can be recursively approximated by a
new set of particles $X_k^{(1:N)}$ generated via an importance sampling and a resampling step. In the importance sampling step, a set of prediction particles is generated independently from $\tilde X_k^{(i)} \sim q_\theta(\cdot \mid Y_k, X_{k-1}^{(i)})$ and weighted by an importance weight $\tilde a_{\theta,k}^{(i)}$ that accounts for the discrepancy with the “target” distribution. Here, this is given by $a_{\theta,k}^{(i)} = \alpha_\theta(\tilde X_k^{(i)}, X_{k-1}^{(i)}, Y_k)$ and $\tilde a_{\theta,k}^{(i)} = a_{\theta,k}^{(i)} \big/ \sum_{j=1}^{N} a_{\theta,k}^{(j)}$. In the resampling step, the particles $\tilde X_k^{(1:N)}$ are multiplied or eliminated according to their importance weights $\tilde a_{\theta,k}^{(1:N)}$ to give the new set of particles $X_k^{(1:N)}$. Let us now consider the case where the model includes some unknown parameters. We will assume
that the system to be identified evolves according to a true but unknown static parameter $\theta^*$, i.e.

$$X_k \mid \{X_{k-1} = x_{k-1}\} \sim f_{\theta^*}(\cdot \mid x_{k-1}) \qquad (5.3)$$

$$Y_k \mid \{X_k = x_k\} \sim g_{\theta^*}(\cdot \mid x_k). \qquad (5.4)$$
The aim is to identify this parameter. Addressing this problem for a non-Gaussian and non-linear system is very challenging. We aim to identify $\theta^*$ based on an infinite (or very large) observation sequence $\{Y_k\}_{k\ge0}$. A standard method to do so is to maximize the limit of the time-averaged log-likelihood function

$$l(\theta) = \lim_{k\to\infty} \frac{1}{k+1} \sum_{n=0}^{k} \log p_\theta(Y_n \mid Y_{0:n-1}) \qquad (5.5)$$

with respect to $\theta$. Suitable regularity conditions ensure that this limit exists and that $l(\theta)$ admits $\theta^*$ as a global maximum [70]. The expression $p_\theta(Y_n \mid Y_{0:n-1})$ is the predictive likelihood, defined as
$$p_\theta(Y_k \mid Y_{0:k-1}) = \iint \alpha_\theta(x_{k-1:k}, Y_k)\, q_\theta(x_k \mid Y_k, x_{k-1})\, p_\theta(x_{k-1} \mid Y_{0:k-1})\, dx_{k-1:k}. \qquad (5.6)$$
Note that this is a normalization constant [70]. This approach is known as recursive ML parameter estimation. We now propose to use M2-SPSA for ML parameter estimation, based on the Generic Sequential Monte Carlo (GSMC) algorithm described in [70]. It is very difficult to compute $\log p_\theta(Y_k \mid Y_{0:k-1})$ in closed form. Instead, we use a particle approximation and propose to optimize an alternative criterion: the SMC method provides us with samples $(X_{k-1}^{(i)}, \tilde X_k^{(i)})$ from $p_\theta(x_{k-1} \mid Y_{0:k-1})\, q_\theta(x_k \mid Y_k, x_{k-1})$. A particle approximation to $\log p_\theta(Y_k \mid Y_{0:k-1})$ is given by
$$\widehat{\log p}_\theta(Y_k \mid Y_{0:k-1}) = \log\!\Big( N^{-1} \sum_{i=1}^{N} a_{\theta,k}^{(i)} \Big). \qquad (5.7)$$
Now, we use the key fact that the current hidden state $X_k$, the observation $Y_k$, the predicted particles $\tilde X_k^{(1:N)}$ and their corresponding unnormalized weights $a_{\theta,k}^{(1:N)}$ form a homogeneous Markov chain [70]. In the following section, we propose SA algorithms to solve $\vartheta^* = \arg\max_{\theta\in\Theta} J(\theta)$. Note that because we only use a finite number $N$ of particles, $(\tilde X_k^{(1:N)}, a_{\theta,k}^{(1:N)})$ is only an approximation to the exact prediction density $p_\theta(x_k \mid Y_{0:k-1})$. Hence $\vartheta^*$ will not be equal to the true parameter $\theta^*$. However, as $N$ increases, $J(\theta)$ will get closer to $l(\theta)$ and $\vartheta^*$ will converge to $\theta^*$. Our simulation results indicate that $\vartheta^*$ provides a good approximation to $\theta^*$ for a moderate number of particles.
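To make the SMC recursion above concrete, here is a minimal sketch of one importance-sampling/resampling step together with the particle estimate of the predictive log-likelihood in (5.7). This is an illustration, not the thesis implementation: it assumes the bootstrap proposal $q_\theta = f_\theta$ (so the unnormalized weight reduces to $g_\theta(Y_k \mid \tilde X_k^{(i)})$) and multinomial resampling; `f_sample` and `g_logpdf` are placeholder model functions.

```python
import numpy as np

def smc_step(particles, y_k, f_sample, g_logpdf, rng):
    """One bootstrap SMC step: propagate, weight, estimate
    log p(y_k | y_{0:k-1}) as in (5.7), then resample."""
    N = len(particles)
    # Importance sampling with the bootstrap proposal q = f:
    # the weight reduces to a_{theta,k}^(i) = g(y_k | x_tilde_k^(i)).
    pred = f_sample(particles, rng)          # predicted particles
    log_a = g_logpdf(y_k, pred)              # unnormalised log-weights
    # Particle estimate of the predictive log-likelihood, eq. (5.7),
    # computed stably via the log-sum-exp trick.
    m = log_a.max()
    log_pred_lik = m + np.log(np.exp(log_a - m).mean())
    # Multinomial resampling according to the normalised weights.
    w = np.exp(log_a - m)
    w /= w.sum()
    idx = rng.choice(N, size=N, p=w)
    return pred[idx], log_pred_lik
```

Summing the returned `log_pred_lik` values over time yields a particle approximation of the log-likelihood whose limit is maximized in (5.5).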
5.2.2 Gradient-free Maximum Likelihood Estimation
The function $J(\theta)$ must be maximized with respect to the $m$-dimensional parameter vector $\theta$. The function $J(\theta)$ does not admit an analytical expression; additionally, we do not have direct access to it. Using the geometric ergodicity of the Markov chain $\{Z_k\}_{k\ge0}$, $J(\theta)$ can be approximated in the limit as follows:

$$J(\theta) \triangleq \lim_{k\to\infty} \big\{ J_k(\theta) = E_\theta[r(Z_{\theta,k})] \big\} \qquad (5.8)$$

where the expectation is taken with respect to the distribution of $Z_{\theta,k}$. This implies that although $J(\theta)$ is unknown, we have access to a sequence of functions $J_k$ that converge to $J(\theta)$. One way to exploit this sequence in order to optimize $J(\theta)$ is to use a recursion as follows:
$$\theta_k = \theta_{k-1} + \gamma_k \widehat{\nabla J}_k(\theta_{k-1}) \qquad (5.9)$$

where $\theta_{k-1}$ is the parameter estimate at time $k-1$ and $\widehat{\nabla J}_k$ denotes an estimate of $\nabla J_k$. The
idea is that we take incremental steps to improve $\theta$, where each step uses a particular function from the sequence. Under suitable conditions on the step size, the above iteration will converge to $\vartheta^*$ [71]. We will consider the case where the expression for the gradient of $J_k$ is either not available or too complex to calculate. One may approximate $\nabla J_k(\theta)$ by recourse to finite difference methods. These are “gradient-free” methods that only use measurements of $J(\theta)$. The idea behind this approach is to measure the change in the function induced by a small perturbation $\Delta\theta$ in the value of the parameter. If we denote an estimate of $J_k(\theta)$ by $\hat J_k(\theta)$, one-sided gradient approximations consider the change between $\hat J_k(\theta)$ and $\hat J_k(\theta + \Delta\theta)$, while two-sided approximations consider the difference between $\hat J_k(\theta - \Delta\theta)$ and $\hat J_k(\theta + \Delta\theta)$. A gradient-free approach can provide a maximum likelihood parameter estimate that is
computationally cheap, as well as very simple to implement. The key feature of the SPSA technique is that it requires only two measurements of the cost function regardless of the dimension of the parameter vector. This efficiency is achieved by the fact that all the elements of $\theta$ are perturbed together. The $i$-th component of the two-sided gradient approximation $\widehat{\nabla J}_k = \big[\widehat{\nabla J}_{k,1}(\theta), \ldots, \widehat{\nabla J}_{k,m}(\theta)\big]$ is

$$\widehat{\nabla J}_{k,i}(\theta_{k-1}) = \frac{\hat J_k(\theta_{k-1} + c_k \Delta_k) - \hat J_k(\theta_{k-1} - c_k \Delta_k)}{2 c_k \Delta_{k,i}} \qquad (5.10)$$
where $\Delta_k = [\Delta_{k,1}, \ldots, \Delta_{k,m}]$ is a random perturbation vector and $\{c_k\}_{k\ge1}$ is defined in Sec. 1.7. Note that the computational saving stems from the fact that the objective function difference is now common to all $m$ components of the gradient approximation vector. Almost sure convergence of the SA recursion in (5.9) is guaranteed if $J_k(\theta)$ is sufficiently smooth near $\theta^*$. Additionally, the elements of $\Delta_k$ must be mutually independent random variables, symmetrically distributed around zero and with finite inverse moments $E\big(|\Delta_{k,i}|^{-1}\big)$. A simple and popular choice for $\Delta_k$ that satisfies these requirements is the Bernoulli $\pm1$ distribution, and the positive step sizes should satisfy

$$\gamma_k \to 0, \qquad c_k \to 0, \qquad \sum_{k=1}^{\infty} \gamma_k = \infty \qquad \text{and} \qquad \sum_{k=1}^{\infty} \Big(\frac{\gamma_k}{c_k}\Big)^2 < \infty.$$
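The recursion (5.9) with the two-sided SP gradient (5.10) and Bernoulli $\pm1$ perturbations can be sketched as follows. This is a minimal sketch, not the thesis code; the gain sequences $\gamma_k = a/k^{0.602}$ and $c_k = c/k^{0.101}$ are an assumed illustrative choice (the $c_k$ exponent matches the one used in Sec. 5.4) that satisfies the summability conditions above.

```python
import numpy as np

def spsa_gradient(J, theta, c_k, rng):
    """Two-sided SP gradient estimate, eq. (5.10): only two
    evaluations of J regardless of the dimension of theta."""
    delta = rng.choice([-1.0, 1.0], size=theta.shape)  # Bernoulli +/-1
    j_plus = J(theta + c_k * delta)
    j_minus = J(theta - c_k * delta)
    # Same scalar difference is reused for every component.
    return (j_plus - j_minus) / (2.0 * c_k * delta)

def spsa_maximize(J, theta0, n_iter=2000, a=0.1, c=0.1,
                  alpha=0.602, gamma=0.101, seed=0):
    """Ascent recursion (5.9) with decaying gains gamma_k = a/k^alpha
    and perturbation sizes c_k = c/k^gamma."""
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, dtype=float)
    for k in range(1, n_iter + 1):
        g = spsa_gradient(J, theta, c / k**gamma, rng)
        theta = theta + (a / k**alpha) * g
    return theta
```

On a smooth objective the iterates oscillate around the maximizer with a variance controlled by the gain sequence, consistent with the constant-step remark below.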
The choice of the step-size sequences is crucial to the performance of the algorithm. Note that if a constant step size is used for $\gamma_k$, the SA estimate will still converge but will oscillate about the limiting value with a variance proportional to the step size. In most of our simulations $\gamma_k$ was set to a small constant step size that was repeatedly halved after several thousands of iterations. For the two-sided SPSA case, for example, the two measurements would be $\hat J_k(\theta + c_k \Delta_k; \omega_k^+)$ and $\hat J_k(\theta - c_k \Delta_k; \omega_k^-)$, where $\omega_k^+$ and $\omega_k^-$ denote the randomness of each realization. This implies that besides the desired objective function change induced by the perturbation in $\theta$, there is also some
undesirable variability in $\omega_k^{\pm}$. Although in a real system $\omega_k^{\pm}$ cannot be controlled, in simulation settings it might be possible to eliminate the undesirable variability component by using the same random seeds at every time instant $k$, so that $\omega_k^+ = \omega_k^-$. The SA recursion of (5.9) can be
thought of as a stochastic generalization of the steepest descent method. Faster convergence can be achieved if one uses a Newton-type SA algorithm that is based on an estimate of the second derivative of the objective function. This will be of the form

$$\theta_k = \theta_{k-1} - \gamma_k \Big[\widehat{\nabla^2 J}_k(\theta_{k-1})\Big]^{-1} \widehat{\nabla J}_k(\theta_{k-1}) \qquad (5.11)$$

where $\widehat{\nabla^2 J}_k$ is an estimate of the negative-definite Hessian matrix $\nabla^2 J_k$. Such an approach can be particularly attractive in terms of convergence acceleration in the terminal phase of the algorithm, where the steepest descent-type method of (5.9) slows down; the main difficulty, however, is that the estimate of the Hessian can be unstable. In order to keep the Hessian matrix stable, we applied the procedure used in Chap. 2. Also, as suggested in [70], it might be useful to average several SP gradient approximations at each iteration, each with an independent value of $\Delta_k$. Despite the expense of additional objective function evaluations, this can reduce the noise effects and accelerate convergence.
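As an illustration of the Newton-type update (5.11) and of the stabilization issue, the sketch below forms a simultaneous-perturbation Hessian estimate from four function evaluations (in the spirit of 2nd-SPSA) and the scalar replacement of the Hessian inverse by the inverse geometric mean of the eigenvalue magnitudes, as in the M2-SPSA modification summarized in the conclusions. The exact Chap. 2 procedure is not reproduced here; the function names and constants are hypothetical.

```python
import numpy as np

def sp_hessian(J, theta, c1, c2, rng):
    """One simultaneous-perturbation Hessian estimate: two Bernoulli
    perturbation directions and four evaluations of J (illustrative
    function-value-only construction)."""
    d1 = rng.choice([-1.0, 1.0], size=theta.shape)
    d2 = rng.choice([-1.0, 1.0], size=theta.shape)
    dj = (J(theta + c1 * d1 + c2 * d2) - J(theta + c1 * d1)
          - J(theta - c1 * d1 + c2 * d2) + J(theta - c1 * d1))
    h = dj / (2.0 * c1 * c2) * np.outer(1.0 / d1, 1.0 / d2)
    return 0.5 * (h + h.T)          # symmetrise the raw estimate

def m2_scalar_step(H):
    """Scalar replacement for the Hessian inverse: the inverse of the
    geometric mean of the absolute eigenvalues of H (the M2-SPSA
    stabilisation described in the text)."""
    lam = np.maximum(np.abs(np.linalg.eigvalsh(H)), 1e-12)
    return 1.0 / np.exp(np.log(lam).mean())
```

A single `sp_hessian` sample is noisy, which is precisely why averaging over iterations, or collapsing to the scalar `m2_scalar_step`, is needed before inverting in (5.11).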
5.3 Parameter Estimation by SPSA and FDSA
Now, we present two maximum likelihood parameter estimation algorithms, based on the FDSA and SPSA algorithms. In line with our objectives, the algorithm below only requires a single realization of observations $\{Y_k\}_{k\ge1}$ of the true system. At time $k-1$, we denote the current parameter estimate by $\theta_{k-1}$. Also, let the filtering density $p_{\theta_{0:k-1}}(x_{k-1} \mid Y_{0:k-1})$ be approximated by the particle set $X_{k-1}^{(1:N)}$ having equal importance weights. Note that the subscript $\theta_{0:k-1}$ indicates that the filtering density estimate is a function of all the past parameter values. The parameter estimation using SPSA is performed as follows:
First, generate a random perturbation vector $\Delta_k$. For $i = 1, \ldots, N$, sample

$$\tilde X_{k,+}^{(i)} \sim q_{\theta_{k-1} + c_k \Delta_k}\big(\cdot \mid Y_k, X_{k-1}^{(i)}\big),$$

$$\tilde X_{k,-}^{(i)} \sim q_{\theta_{k-1} - c_k \Delta_k}\big(\cdot \mid Y_k, X_{k-1}^{(i)}\big),$$

and use the following evaluation:

$$\alpha_\theta(x_{k-1:k}, Y_k) = \frac{g_\theta(Y_k \mid x_k)\, f_\theta(x_k \mid x_{k-1})}{q_\theta(x_k \mid Y_k, x_{k-1})}.$$

We can then evaluate the weights $a_{\theta_{k-1}+c_k\Delta_k}^{(i)}\big(Y_k, \tilde X_{k,+}^{(i)}, X_{k-1}^{(i)}\big)$ and $a_{\theta_{k-1}-c_k\Delta_k}^{(i)}\big(Y_k, \tilde X_{k,-}^{(i)}, X_{k-1}^{(i)}\big)$, and compute

$$\hat J_k(\theta_{k-1} \pm c_k \Delta_k) = \log\Big\{ \frac{1}{N} \sum_{i=1}^{N} a_{\theta_{k-1} \pm c_k \Delta_k}^{(i)}\big(Y_k, \tilde X_{k,\pm}^{(i)}, X_{k-1}^{(i)}\big) \Big\},$$

$$\widehat{\nabla J}_{k,i}(\theta_{k-1}) = \frac{\hat J_k(\theta_{k-1} + c_k \Delta_k) - \hat J_k(\theta_{k-1} - c_k \Delta_k)}{2 c_k \Delta_{k,i}},$$

where $\widehat{\nabla J}_k(\theta_{k-1}) = \big[\widehat{\nabla J}_{k,1}(\theta_{k-1}), \ldots, \widehat{\nabla J}_{k,m}(\theta_{k-1})\big]$, and update

$$\theta_k = \theta_{k-1} + \gamma_k \widehat{\nabla J}_k(\theta_{k-1}).$$

Then, for each particle $i = 1, \ldots, N$, sample $\tilde X_k^{(i)} \sim q_{\theta_k}\big(\cdot \mid Y_k, X_{k-1}^{(i)}\big)$ and evaluate the weights $a_{\theta_k,k}^{(i)}$. Sample $I_k^{(1:N)} \sim \mathcal{L}\big(\cdot \mid \tilde a_{\theta_k,k}^{(1:N)}\big)$ using a standard resampling scheme, and set $X_k^{(1:N)} = H\big(\tilde X_k^{(1:N)}, I_k^{(1:N)}\big)$.
5.4 Simulation
The following bi-modal non-linear model [72] is proposed here:

$$X_k = \theta_1 X_{k-1} + \theta_2 \frac{X_{k-1}}{1 + X_{k-1}^2} + \theta_3 \cos(1.2k) + \sigma_\upsilon V_k \qquad (5.12)$$

$$Y_k = c X_k^2 + \sigma_\omega W_k \qquad (5.13)$$

where $\sigma_\upsilon^2 = 10$, $c = 0.05$, $\sigma_\omega = 1$, $X_0 \sim N(0,2)$, $V_k \overset{\mathrm{i.i.d.}}{\sim} N(0,1)$ and $W_k \overset{\mathrm{i.i.d.}}{\sim} N(0,1)$; these are zero-mean Gaussian random variables. Here, we seek the ML estimates of $\theta = [\theta_1, \theta_2, \theta_3]^T$. It is also important to initialize the algorithm properly; otherwise some of the parameter estimates might get trapped in local maxima. In this model, we can initialize at $\theta_0 = [0.2, 20, 5]^T$. The choice of the step size is very important; here, this is particularly true due to the difference in the relative sensitivity of the three unknown parameters. The values for the step size are $c_k = c_0 / k^{0.101}$ where $c_0 = [0.01, 2.0, 1] \times 10^{-4}$, and the constant step size is $\gamma_0 = [0.005, 7, 17] \times 10^{-4}$.
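A minimal sketch for generating data from (5.12)-(5.13) follows. It assumes that $\sigma_\upsilon^2 = 10$ denotes the state-noise variance and that $X_0 \sim N(0, 2)$ is specified by its variance; both readings are taken from the model description above.

```python
import numpy as np

def simulate_bimodal(theta, T, sigma_v2=10.0, c=0.05, sigma_w=1.0, seed=0):
    """Generate a trajectory (X_{0:T}, Y_{0:T}) from the bi-modal
    non-linear model (5.12)-(5.13); theta = [theta1, theta2, theta3]."""
    t1, t2, t3 = theta
    rng = np.random.default_rng(seed)
    x = np.empty(T + 1)
    y = np.empty(T + 1)
    x[0] = rng.normal(0.0, np.sqrt(2.0))        # X_0 ~ N(0, 2), variance 2
    y[0] = c * x[0] ** 2 + sigma_w * rng.normal()
    for k in range(1, T + 1):
        x[k] = (t1 * x[k - 1]
                + t2 * x[k - 1] / (1.0 + x[k - 1] ** 2)
                + t3 * np.cos(1.2 * k)
                + np.sqrt(sigma_v2) * rng.normal())   # state noise var 10
        y[k] = c * x[k] ** 2 + sigma_w * rng.normal()
    return x, y
```

The quadratic observation (5.13) is what makes the likelihood bi-modal in the sign of $X_k$, which is why proper initialization matters here.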
Fig. 5.1. ML parameter estimates $\theta_k = [\theta_{1,k}, \theta_{2,k}, \theta_{3,k}]^T$ for the bi-modal non-linear model using M2-SPSA. The true parameters in the model are defined by $\theta^* = [0.5, 25, 8]^T$.
Fig. 5.2. Parameter estimation using 2nd-SPSA and FDSA.
Figure 5.1 shows the efficiency obtained using M2-SPSA. These results are compared with 2nd-SPSA and FDSA in Fig. 5.2, showing the best performance found by each algorithm in the current model. Table 5.1 compares the number of particles used by each algorithm and the computational load, i.e. the normalized CPU time [49] (computational cost in processing time), taking the CPU time required by M2-SPSA as the reference. These comparisons are based on the average CPU time used by each algorithm for estimation.
Table 5.1. Computational statistics.

Algorithm    No. of particles    Normalized CPU time
M2-SPSA      800                 1.0
2nd-SPSA     920                 2.8
FDSA         1000                3.2
The results obtained here by M2-SPSA show its efficiency: $\vartheta^*$ provides a good approximation to $\theta^*$ using a moderate number of particles in comparison with 2nd-SPSA and FDSA. M2-SPSA uses only 800 particles to obtain a good approximation and accurate parameters. The 2nd-SPSA algorithm uses 920 particles to find a suitable estimate. Finally, FDSA uses 1000 particles to estimate the parameters correctly. Also, the computational
cost comparison shows that the CPU time required to estimate the parameters by 2nd-SPSA and FDSA is 2.8 and 3.2 times, respectively, the CPU time required by M2-SPSA; in terms of efficiency, the use of these algorithms might therefore be questionable. Note that the number of loss function measurements needed in each iteration of FDSA grows with $p$, while M2-SPSA needs only two measurements, independent of $p$. According to the characteristics of M2-SPSA described in Chap. 2, this gives our proposed algorithm the potential to achieve a large saving (over FDSA) in the total number of measurements required to estimate $\theta$ when $p$ is large. Also, we can see that the performance of FDSA was highly dependent on the shape of the loss function surface [21]. Consequently, this places a higher burden on the selection of initial parameter values. Thus, M2-SPSA has a low computational cost and usually provides less dispersed and more accurate parameter estimates. The reason for these results is that M2-SPSA is a very powerful technique that approximates the gradient or Hessian by effecting simultaneous random perturbations in all the parameters. The results of M2-SPSA therefore contrast with those of FDSA, in which the gradient is evaluated by varying the parameters one at a time. In general, these results are explained by the fact that the algorithm does not depend on derivative information and is able to find a good approximation to the solution using few function values (see Chap. 2), which yields a low computational cost and complexity. In comparison with 2nd-SPSA, M2-SPSA has a lower computational cost, as explained in Chap. 2. Moreover, the M2-SPSA algorithm can satisfy certain conditions and constraints associated with the problem, in contrast with 2nd-SPSA, which cannot satisfy them [18]. In contrast with FDSA, in M2-SPSA the slope is estimated, and the estimation error for the slope affects the convergence speed. Hence, M2-SPSA is a very suitable algorithm. Nevertheless, if one decides to allow for more resources and use a gradient-based approach, the SPSA algorithm proposed here can still prove extremely useful in exploring the parameter space and choosing suitable initial values for the parameter vector.
Chapter 6<br />
Conclusions and Future Work<br />
6.1 Conclusions
In this research, we have proposed a new modification to the SPSA algorithm whose main objectives are to estimate the parameters of complex systems, improve the convergence, and reduce the computational expense. This modification is called the “modified version of the 2nd-SPSA algorithm” (M2-SPSA). The identification method using the SP seems particularly useful when the number of parameters to be identified is very large, or when the observed values of the quantities to be identified can only be obtained via an unknown observation system. Furthermore, a time-differential SP method that requires only one observation of the error for each time increment has been proposed as an improvement to the SPSA algorithm. The procedure of the proposed SPSA algorithm can be explained as follows:
To eliminate the errors introduced by the inversion of the estimated Hessian $H_k^{-1}$, a modification (2.13) to 2nd-SPSA is suggested that replaces $H_k^{-1}$ with the scalar inverse of the geometric mean of all the eigenvalues of $H_k$. This leads to significant improvements in the efficiency of the proposed SPSA algorithm. At finite iterations, it is found that the newly introduced M2-SPSA based on (2.13) and (2.14) frequently outperforms 2nd-SPSA in numerical simulations that represent a wide range of matrix conditioning. Moreover, the ratio of the mean square errors of M2-SPSA to 2nd-SPSA is always less than unity, except for a perfectly conditioned Hessian. The magnitude of the errors in 2nd-SPSA depends on the matrix conditioning of $H^*$ due to competing factors [16]. Since these factors are strongly related to the same measure of matrix conditioning, the relative efficiency of the proposed SPSA algorithm and 2nd-SPSA may be less dependent on specific loss functions. We have also proposed to reduce the computational expense by evaluating only a diagonal estimate of the Hessian matrix. The reduction in the computation time (in comparison with SA algorithms and previous versions of SPSA) is due to savings in the evaluation of the Hessian estimate, as well as in the recursion on $\theta$, which only requires a trivial matrix inverse. The performance, in terms of rate of convergence and accuracy, remains almost unchanged, which demonstrates that the diagonal
Hessian estimate still captures potentially large scaling differences in the elements of $\theta$. In this latter algorithm, regularization can be achieved in a straightforward way, by imposing positivity of the diagonal elements of the Hessian estimate.
We have explained our proposed SPSA algorithm in detail in this dissertation. Furthermore, three important applications have been proposed in order to evaluate our proposed M2-SPSA algorithm. These applications were addressed toward the control and signal processing areas, where the M2-SPSA algorithm was implemented very successfully. In the following paragraphs, the conclusions corresponding to these applications are presented.
1) First application<br />
We have proposed an MR-SMC method using a non-linear observer for controlling the angular position of a single flexible link while suppressing its oscillation. The non-linear observer and the MR-SMC provide successful and stable operation of the system. The M2-SPSA algorithm is used to determine the observer/controller gains, and it determined them very efficiently and at a low computational cost. The non-linear observer was successful in predicting the state variables from the motor angular position, and the MR-SMC proved to be a very efficient control method. The performance of our proposed system was very satisfactory and close to the real results obtained in [47].
2) <strong>Second</strong> application<br />
In this research, we have also shown a method for deriving adaptive algorithms for IIR lattice filters from the corresponding direct-form algorithms. The advantage of this approach is that it provides conditions under which the convergence characteristics of stationary points are preserved when passing from the direct form to the lattice algorithm. We use M2-SPSA to obtain the coefficients in the lattice form more efficiently, so that the computational burden of reaching a suitable performance is reduced. This allowed the design of lattice versions of the SM and SHARF algorithms, which are locally convergent, at least in the sufficient-order case. It was also shown that this was not the case for previous lattice versions.
6.1 CONCLUSIONS<br />
3) Third application<br />
Finally, a fast and efficient modified SPSA algorithm has been proposed to perform ML parameter estimation in state-space models using SMC filters. The algorithm proposed here is based on measurements of the objective function and does not involve any gradient calculations. Estimation using M2-SPSA seems particularly useful when the number of parameters to identify is large, or when the observed values of the quantities to be identified can only be obtained via an unknown observation system. M2-SPSA also outperforms FDSA and 2nd-SPSA due to its reduced computational cost and a complexity that remains fixed with the dimension of the parameter vector. However, its performance is very sensitive to the step-size parameters, and special care should be taken when these are selected.
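To make the role of these step-size parameters concrete, SPSA practice commonly uses decaying gain sequences of the form a_k = a/(k+1+A)^alpha and c_k = c/(k+1)^gamma (see, e.g., the implementation guidelines in [28]). The sketch below uses illustrative placeholder constants, not the values tuned for the applications in this dissertation:

```python
def spsa_gains(k, a=0.16, c=0.1, A=100.0, alpha=0.602, gamma=0.101):
    """Decaying SPSA gain sequences a_k and c_k for iteration k (0-based).

    a, c, and A are problem-dependent tuning constants (placeholders here);
    alpha and gamma are the practically effective exponents recommended in
    the SPSA literature. Poorly chosen a or c can make the recursion diverge
    or stall, which is why step-size selection needs special care.
    """
    a_k = a / (k + 1.0 + A) ** alpha
    c_k = c / (k + 1.0) ** gamma
    return a_k, c_k
```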
Tables 6.1 and 6.2 summarize the performance of M2-SPSA in the applications described in this dissertation; this performance is compared with that of previous versions of the SPSA algorithm and of SA algorithms.
Table 6.1. Comparison of algorithms (performance).

    Algorithm     No. of Loss Measurements
    M2-SPSA       Low
    2nd-SPSA      Relatively Low
    1st-SPSA      High
Table 6.1 presents a comparison between M2-SPSA and previous versions of SPSA according to the simulation results obtained in Chap. 2 of this dissertation. The number of loss measurements is significantly reduced by our proposed M2-SPSA method, as confirmed by Tables 2.2 - 2.4, where, following the study of Spall [18] based on a larger number of loss measurements (i.e., a more asymptotic setting), we show that M2-SPSA outperforms 1st-SPSA and 2nd-SPSA in the high-noise case in terms of the iterations needed to reach normalized loss values.
The ratios of M2-SPSA shown in Tables 2.3 - 2.4 offer considerable promise for practical problems (using a low number of measurements in comparison with 1st-SPSA), where p is even larger (say, as in the neural network-based direct adaptive control method of Spall and Cristion [25], where p can easily be of order 10^2 or 10^3). In such cases, other second-order techniques that require a number of function measurements growing with p are likely to become infeasible.
In Table 2.2, we see that M2-SPSA provides a considerable reduction in the loss function value for the same number of measurements used in 1st-SPSA and 2nd-SPSA. Based on the numbers in Tables 2.2 - 2.4, together with the supplementary studies described in Chap. 2, we find that 1st-SPSA and 2nd-SPSA need approximately five to ten times the number of function evaluations used by M2-SPSA to reach the levels of accuracy shown.
Table 6.2. Comparison of algorithms (computational cost).

    Algorithm       Computational Cost
    M2-SPSA         Low
    2nd-SPSA        Relatively Low
    SA Algorithms   High
Table 6.2 presents a comparison between M2-SPSA, previous versions of SPSA, and SA algorithms according to CPU time. These results are confirmed by the values obtained in Tables 3.3 and 5.1 in Chaps. 3 and 5, respectively, where the computational load, or normalized CPU time [49] (computational cost in processing time), is reported using the CPU time required by M2-SPSA as the reference. These comparisons are based on the average CPU time used by each algorithm to estimate each parameter.
The CPU time, or CPU usage, is the amount of time a computer program spends processing instructions, as opposed to, for example, waiting for input/output operations. In this case, the CPU time required to estimate the parameters by 2nd-SPSA is about 2 times the CPU time required by M2-SPSA; for that reason it is rated as relatively low in comparison with our proposed SPSA.
The CPU time required to estimate the parameters by the SA algorithms is approximately 2 to 5 times the CPU time required by M2-SPSA; for that reason it is rated as high in comparison with our proposed SPSA. These simulations therefore show that the SA algorithms have a high computational cost in comparison with M2-SPSA (see Tables 3.3 and 5.1), even though the 2nd-SPSA algorithm has the same or a lower computational cost than the SA algorithms (see Table 5.5). This is explained by the fact that the number of loss function measurements needed in each iteration of FDSA (Table 5.1), RM-SA or LS (Table 3.3) grows with p, whereas M2-SPSA and 2nd-SPSA need only two measurements, independently of p. This is described in detail in Chap. 2, and the difference between 2nd-SPSA and M2-SPSA is demonstrated by simulations (Tables 2.2 - 2.4).
Also, M2-SPSA allows an approximation of the gradient or Hessian by effecting simultaneous random perturbations of all the parameters. This contrasts with the evaluation of the gradient in FDSA, which is achieved by varying the parameters one at a time.
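To illustrate this difference in measurement counts, the two gradient estimators can be sketched as follows. This is a minimal sketch under stated assumptions: `loss` is a placeholder objective, and `c` plays the role of the perturbation size.

```python
import numpy as np

def spsa_gradient(loss, theta, c):
    """Simultaneous-perturbation gradient estimate: 2 loss measurements,
    independently of the dimension p of theta."""
    delta = np.random.choice([-1.0, 1.0], size=theta.shape)  # Bernoulli +/-1
    return (loss(theta + c * delta) - loss(theta - c * delta)) / (2.0 * c * delta)

def fdsa_gradient(loss, theta, c):
    """Finite-difference gradient estimate: 2*p loss measurements,
    perturbing one parameter at a time."""
    g = np.zeros_like(theta)
    for i in range(theta.size):
        e = np.zeros_like(theta)
        e[i] = c
        g[i] = (loss(theta + e) - loss(theta - e)) / (2.0 * c)
    return g
```

For a p-dimensional parameter vector, `spsa_gradient` always evaluates the loss twice, while `fdsa_gradient` evaluates it 2p times, which is the source of the cost gap discussed above.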
6.2 Future Work
Referring to the conclusions given above, we still have many topics to investigate in the near future.
Future work will assess the performance of SPSA for constrained and unconstrained aerodynamic shape design studies. This study will be carried out in the near future to establish the cost benefits and to investigate the extent to which SPSA offers comparative advantages over other, similar methods for dynamic design optimization problems.
The M2-SPSA algorithm can also be applied to image processing; here we focus on two main applications. First, the M2-SPSA algorithm will be used in multidimensional image processing (medical images) in order to reduce CPU time, in the same way as in the applications presented in this dissertation. Second, extracting a multivariate non-linear physical model from a set of satellite images can be considered as a multivariate non-linear regression problem. Multiple local solutions often prevent gradient-type algorithms from obtaining globally optimal solutions, and the M2-SPSA algorithm is a method for addressing this problem. The method will be applied to the problem of estimating the distribution of energetic ion populations from global images of the magnetosphere.
Finally, we have applied our proposed M2-SPSA algorithm to the applications presented here, but our proposed SPSA can also be applied to other kinds of applications in other areas, for example the image processing mentioned in this section. The M2-SPSA algorithm can be applied to a different application provided that it satisfies in advance the conditions stated by the main theorems (Theorems 1, 2 and 3 of M2-SPSA and their guidelines C.1' and C.3') explained in Sec. 2.9; if an application satisfies these conditions, M2-SPSA can be used.
References<br />
[1] C. G. Cassandras, L. Dai, and C. G. Panayiotou, "Ordinal Optimization for a Class of Deterministic and Stochastic Discrete Resource Allocation Problems," IEEE Trans. Autom. Contr., vol.43, no.7, pp.881-900, 1998.
[2] G. N. Saridis, "Stochastic Approximation Methods for Identification and Control," IEEE Trans. Autom. Control, vol.19, pp.798-809, 1974.
[3] J. C. Spall,“Multivariate Stochastic <strong>Approximation</strong> using a Simultaneous Perturbation<br />
Gradient <strong>Approximation</strong>,” IEEE Transactions on Automatic Control, vol.37, pp.332-341, 1992.<br />
[4] S. N. Evans and N. C. Weber, "On the Almost Sure Convergence of a General Stochastic Approximation Procedure," Bull. Australian Math. Soc., vol.34, pp.335-342, 1986.
[5] H. F. Chen, T. E. Duncan, and B. Pasik-Duncan, "A Stochastic Approximation Algorithm with Random Differences," Proceedings of the 13th Triennial IFAC World Congress, pp.493-496, 1996.
[6] J. C. Spall, "An Overview of the Simultaneous Perturbation Algorithm for Stochastic Optimization," IEEE Transactions on Aerospace and Electronic Systems, vol.34, pp.817-823, 1998.
[7] A. Vande Wouwer, C. Renotte, and Ph. Bogaerts, "Application of SPSA Techniques in Non-linear System Identification," European Control Conference, 2001.
[8] J. Kiefer and J. Wolfowitz, "Stochastic Estimation of the Maximum of a Regression Function," Ann. Math. Statist., vol.23, pp.498-506, 1952.
[9] H. Robbins and S. Monro, "A Stochastic Approximation Method," Ann. Math. Statist., vol.22, pp.400-407, 1951.
[10] S. A. Billings and G. N. Jones, "Orthogonal Least-Squares Parameter Estimation Algorithms for Non-Linear Stochastic Systems," Int. Journal of Systems Science, vol.23, no.7, pp.1019-1032, 1990.
[11] L. Gerencser, "SPSA with State-Dependent Noise - a Tool for Direct Adaptive Control," Proceedings of the Conference on Decision and Control, CDC 37, 1998.
[12] J. C. Spall, and D.C Chin, “Traffic Responsive Signal Timing <strong>for</strong> System-Wide Traffic<br />
Control,” Transp. Res., Part C, vol.5, pp.153-163, 1997.<br />
[13] J. H. Venter, "An Extension of the Robbins-Monro Algorithm," Annals of Mathematical Statistics, vol.38, pp.181-190.
[14] D. Ruppert, "Stochastic approximation,” Handbook <strong>of</strong> Sequential Analysis, pp.503-529,<br />
1991.<br />
[15] G. N. Saridis, G. Stein, "Stochastic <strong>Approximation</strong> <strong>Algorithm</strong>s <strong>for</strong> Linear Discrete-time<br />
System Identification," IEEE Trans Autom. Control, vol.13, pp.515–523, 1968.<br />
[16] L. Gerencser, “Rate <strong>of</strong> Convergence <strong>of</strong> Moments <strong>for</strong> a Simultaneous Perturbation<br />
Stochastic <strong>Approximation</strong> Method <strong>for</strong> Function Minimization,” IEEE Trans. on Automat.<br />
Contr. ,vol.44, pp.894-906, 1999.<br />
[17] J. C. Spall, “Adaptive Stochastic <strong>Approximation</strong> by the Simultaneous Perturbation<br />
Method,” Proceedings <strong>of</strong> the 1998 IEEE CDC, pp.3872 -3879, 1998.<br />
[18] J. C. Spall , “A <strong>Second</strong>-Order Stochastic <strong>Approximation</strong> <strong>Algorithm</strong> using only Function<br />
Measurements,” Proceedings <strong>of</strong> the IEEE Conference on Decision and Control, pp. 2472–2477,<br />
1994.<br />
[19] V. Fabian, "On Asymptotic Normality in Stochastic Approximation," Ann. Math. Statist., vol.39, pp.1327-1332, 1968.
[20] H. F. Chen and Y. Zhu, "Stochastic Estimation Procedure with Randomly Varying Truncations," Scientia Sinica (Series A), vol.29, pp.914-926, 1986.
[21] D. C. Chin, “Comparative Study <strong>of</strong> Stochastic <strong>Algorithm</strong>s <strong>for</strong> System Optimization Based<br />
on Gradient <strong>Approximation</strong>s,” IEEE Trans. Syst., Man, and Cybernetics, vol.27, pp.244–249,<br />
1997.<br />
[22] B. Efron and D. V. Hinkley, "Assessing the Accuracy of the Maximum Likelihood Estimator: Observed versus Expected Fisher Information," Biometrika, vol.65, pp.457-487, 1978.
[23] S. Das, R. Ghanem, and J. C. Spall, "Asymptotic Sampling Distribution for Polynomial Chaos Representation of Data: A Maximum Entropy and Fisher Information Approach," SIAM Journal on Scientific Computing, 2006.
[24] J. C. Spall, “A Stochastic <strong>Approximation</strong> <strong>Algorithm</strong> <strong>for</strong> Large-Dimensional Systems in the<br />
Kiefer-Wolfowitz Setting,” Proc. IEEE Conf. on Decision and Control, pp.1544–1548, 1988.<br />
[25] J. C. Spall and J. A. Cristion, "Non-linear Adaptive Control Using Neural Networks: Estimation Based on a Smoothed Form of Simultaneous Perturbation Gradient Approximation," Statistica Sinica, vol.4, pp.1-27, 1994.
[26] D. W. Hutchison, "On an Efficient Distribution of Perturbations for Simulation Optimization using Simultaneous Perturbation Stochastic Approximation," Proceedings of the IASTED International Conference on Applied Modeling and Simulation, pp.440-445, 2002.
[27] R. W. Brennan and P. Rogers, “Stochastic Optimization Applied to a Manufacturing<br />
System Operation Problem,” Proc.Winter Simulation Conf., C. Alexopoulos, K. Kang,W. R.<br />
Lilegdon, and D. Goldsman, Eds., pp.857–864, 1995.<br />
[28] J. C. Spall, “Implementation <strong>of</strong> the Simultaneous Perturbation <strong>Algorithm</strong> <strong>for</strong> Stochastic<br />
Optimization,” IEEE Trans. Aerosp. Electron. Syst., vol.34, pp.817–823, 1998.<br />
[29] M. Metivier and P. Priouret, “Applications <strong>of</strong> a Kushner and Clark Lemma to General<br />
Classes <strong>of</strong> Stochastic <strong>Algorithm</strong>s,” IEEE Trans. In<strong>for</strong>m. Theory, vol. IT-30, pp.140–151, 1984.<br />
[30] A. Benveniste, M. Metivier, and P. Priouret, “Adaptive <strong>Algorithm</strong>s and Stochastic<br />
<strong>Approximation</strong>s,” New York: Springer Verlag, 1990.<br />
[31] H. J. Kushner and G. G. Yin, Stochastic <strong>Approximation</strong> <strong>Algorithm</strong>s and Applications. New<br />
York: Springer Verlag, 1997.<br />
[32] J. C. Spall and J. A. Cristion, "Model-free Control of Non-linear Stochastic Systems with Discrete-time Measurements," IEEE Trans. Automat. Contr., vol.43, pp.1198-1210, 1998.
[33] J. Dippon and J. Renz, “Weighted Means in Stochastic <strong>Approximation</strong> <strong>of</strong> Minima,” SIAM J.<br />
Contr. Optimiz., vol.35, pp.1811–1827, 1997.<br />
[34] J. R. Blum, “<strong>Approximation</strong> Methods which Converge with Probability One,” Ann. Mat.<br />
Statist., vol.25, pp.382–386, 1954.<br />
[35] J. J. More, B. S. Garbow, and K. E. Hillstrom, "Testing Unconstrained Optimization Software," ACM Transactions on Mathematical Software, vol.7, no.1, pp.17-41, 1981.
[36] R. G. Laha and V. K. Rohatgi, Probability Theory, New York: Wiley, 1979.<br />
[37] F. J. Solis, R. J. Wets, “Minimization by Random Search Techniques,”Mathematics <strong>of</strong><br />
Operations Research, vol.6, pp.19-30, 1981.<br />
[38] Y. Maeda, Y. Kanata, “Learning Rules <strong>for</strong> Recurrent Neural Networks using Perturbation<br />
and Their Application to Neuro-control”, Trans. IEE Japan, vol.113-C, pp.402-408, 1995 (in<br />
Japanese).<br />
[39] J. C. Spall, “A One-Measurement Form <strong>of</strong> Simultaneous Perturbation Stochastic<br />
<strong>Approximation</strong>,” Automatica, vol.33, pp.109–112, 1997.<br />
[40] J. C. Spall and J. A. Cristion, "A Neural Network Controller for Systems with Unmodeled Dynamics with Applications to Wastewater Treatment," IEEE Trans. Syst., Man, Cybern. B, vol.27, pp.369-375, 1997.
[41] J. Link, F. L. Lewis, “Two-Time Fuzzy Logic Controller <strong>of</strong> Flexible Link Robot Arm,”<br />
Fuzzy sets and system, vol.139, no.7, pp.125-149, 2003.<br />
[42] R. H. Cannon, E. Schmitz, “Initial Experiments on the End-Point Control <strong>of</strong> a Flexible<br />
One-Link Robot,” Int. journal <strong>of</strong> robotics research, vol.8, no.3, pp. 62-75, 1984.<br />
[43] Y. Sakawa, F. Matsuno, and S. Fukushima, “Modeling and Feedback Control <strong>of</strong> a Flexible<br />
Arm,” Journal <strong>of</strong> robotic systems, vol.2, no.4, pp.453-472, 1985.<br />
[44] S. Nicosia, P. Tomei,and A. Tornambe, “Non-Linear Control and Observation <strong>Algorithm</strong>s<br />
<strong>for</strong> a Single-Link Flexible Arm,” Int. Journal Control, vol. 49, no.5, pp.827-840, 1989.<br />
[45] J. Yuh, “Application <strong>of</strong> Discrete-Time Model Reference Adaptive Control to a Flexible<br />
Single-Link Robot,” Journal <strong>of</strong> robotic system, vol.4, pp.621-630, 1987.<br />
[46] E. Bayo et al, “Inverse Dynamic and Kinematics <strong>of</strong> Multi-Link Elastic Robots: An iterative<br />
frequency domain approach,” Int. Journal <strong>of</strong> Robotics Research, vol.8, no.6, pp.49-62, 1989.<br />
[47] U. Sawut, N. Umeda, T. Hanamoto, T. Tsuji, “Applications <strong>of</strong> Non-Linear Observer in<br />
Flexible Arm Control,” Trans. <strong>of</strong> SICE, vol. (35), no.3, pp. 401-406, 1999 (in Japanese).<br />
[48] C. Z. Wei, "Multivariate Adaptive Stochastic Approximation," Ann. Statist., vol.15, pp.1115-1130.
[49] A. Vande Wouwer, C. Renotte and M.Remy, “Application <strong>of</strong> Stochastic <strong>Approximation</strong><br />
Techniques in Neural Modeling and Control,” Int. Journal. <strong>of</strong> Syst. Science, vol.34, no.14,<br />
pp.851-863, 2003.<br />
[50] P. A. Regalia, Adaptive IIR Filtering in Signal Processing and Control, Marcel Dekker, 1995.
[51] D. Parikh, N. Ahmed, and S. D. Stearns, "An Adaptive Lattice Algorithm for Recursive Filters," IEEE Trans. Acoust., Speech, Signal Processing, vol.28, pp.110-112, 1980.
[52] J. A. Rodriguez-Fonollosa and E. Magrau, “Simplified Gradient Calculation in Adaptive<br />
IIR Lattice Filters,” IEEE Trans.on Signal Processing, vol.39, pp.1702-1705, 1991.<br />
[53] P. A. Regalia, “Stable and Efficient Lattice <strong>Algorithm</strong>s <strong>for</strong> Adaptive IIR Filtering,” IEEE<br />
Trans. on Signal Processing, vol.40, pp.375-388, 1992.<br />
[54] H. Fan: “Application <strong>of</strong> Benveniste’s Convergence Results in the Study <strong>of</strong> Adaptive IIR<br />
Filtering <strong>Algorithm</strong>s,” IEEE Trans. In<strong>for</strong>m. Theory, vol.34, pp.692-709, 1988.<br />
[55] P. Lancaster, M. Tismenetsky, “The Theory <strong>of</strong> Matices,” Academic Press ,1985.<br />
[56] C. R. Johnson, Jr., M. G. Larimore, J. R. Treichler, and B. D. O. Anderson, "SHARF Convergence Properties," IEEE Trans. on Acoustics, Speech, and Signal Processing, vol.28, no.4, pp.428-440, 1980.
[57] M. G. Larimore, J. R. Treichler, “SHARF: An algorithm <strong>for</strong> Adapting IIR Digital Filters,”<br />
IEEE Trans. on Acoustics, Speech, and Signal Processing, vol.28, no.4, pp.428-440, 1980.<br />
[58] I. D. Landau: “Elimination <strong>of</strong> the Real Positivity Condition in the Design <strong>of</strong> Parallel<br />
MRAS,” IEEE Trans. Automat. Cont., vol.AC-23, no.6, pp.1015-1020, 1978.<br />
[59] K. Kurosawa and S. Tsuji, “An IIR Parallel-Type Adaptive <strong>Algorithm</strong> using the Fast Least<br />
Square Method,” IEEE Trans. Acoust.,Speech, Signal Processing, vol.37, no.8, pp.1226-1230,<br />
1989.<br />
[60] C. R. Johnson Jr., and Taylor, “Failure <strong>of</strong> a Parallel Adaptive Identifier with Adaptive Error<br />
Filtering,” IEEE Trans. Automat. Cont., vol. AC-25, no.6, pp.1248-1250, 1980.<br />
[61] K. Steiglitz and L. E. McBride, "A Technique for the Identification of Linear Systems," IEEE Trans. Automat. Cont., vol.AC-10, no.4, pp.461-464, 1965.
[62] P. A. Regalia and M. Mboup, "An a Priori Error Bound for the Steiglitz-McBride Method," IEEE Trans. on Circuits and Systems II: Analog and Digital Signal Processing, vol.41, no.2, pp.105-116, 1996.
[63] K. X. Miao, H. Fan and M. Doroslovacki: “Cascade Normalized Lattice Adaptive IIR<br />
Filters,” IEEE Trans. on Signal Processing, vol.42, pp.721-742, 1994.<br />
[64] G. Poyiadjis, A. Doucet, and S. S. Singh, "Particle Methods for Optimal Filter Derivative: Application to Parameter Estimation," Proceedings IEEE ICASSP, 2005.
[65] G. Poyiadjis, S. S. Singh, and A. Doucet, "Novel Particle Filter Methods for Recursive and Batch Parameter Estimation in General State Space Models," Technical Report CUED/F-INFENG/TR-536, Engineering Department, Cambridge University, 2005.
[66] P. Fearnhead, “MCMC, Sufficient Statistic and Particle Filter,” Journal Comp. Graph. Stat.,<br />
vol.11, pp.848-862, 2002.<br />
[67] B. L. Chan, A. Doucet, and V. B. Tadic, "Optimization of Particle Filters using Simultaneous Perturbation Stochastic Approximation," Proc. IEEE ICASSP, pp.681-684, 2003.
[68] J. Liu and M. West, “Combined parameter and state estimation in simulation-based<br />
filtering,” In Sequential Monte Carlo Methods in Practice (eds Doucet A., de Freitas J.F.G. and<br />
Gordon N.J. NY): Springer Verlag, 2001.<br />
[69] G. Storvik, “Particle Filters in State Space Models with The Presence <strong>of</strong> Unknown Static<br />
Parameters,” IEEE Trans. Signal Processing, vol.50, pp.281-289, 1998.<br />
[70] A. Doucet and V. B. Tadic, "On-line Optimization of Sequential Monte Carlo Methods using Stochastic Approximation," Proceedings of the American Control Conference, pp.2565-2570, 2002.
[71] H. J. Kushner and D. S. Clark, Stochastic Approximation Methods for Constrained and Unconstrained Systems, Springer-Verlag, New York, 1978.
[72] N. J. Gordon, D. J. Salmond, and A. F. M. Smith, "Novel Approach to Non-linear/Non-Gaussian Bayesian State Estimation," IEE Proc. F, vol.140, pp.107-113, 1993.
Appendix A
Proofs of Convergence Results and Asymptotic Distribution Results
Proof of Lemma (Sufficient Conditions for C.5 and C.7)

C.7 is used in the proofs of Theorems 1a and 1b only to ensure that $P(\limsup_{k\to\infty}\|\hat\theta_k\| = \infty) = 0$. Given the boundedness of $\hat\theta_k$, this condition becomes superfluous. Regarding C.5, the boundedness condition, together with the facts that $a_k/c_k^2 \to 0$ and $c_k^2 \bar H_k^{-1} \to 0$ (C.6), implies that, for some $0 < \rho' < \rho$, $|a_k g_{ki}(\hat\theta_k)| \le \rho'$ a.s. for all $k$ sufficiently large. From the basic recursion, $\tilde\theta_{k+1,i} = \tilde\theta_{ki} - a_k g_{ki}(\hat\theta_k) - a_k e_{ki}$, where $e_{ki} = G_{ki}(\hat\theta_k) - g_{ki}(\hat\theta_k)$. But $a_k e_k \to 0$ a.s. by the martingale convergence theorem (see (8) and (9) in Spall and Cristion [25]). Since $|\tilde\theta_{ki}| \ge \rho > \rho'$, we know that $\operatorname{sign}\tilde\theta_{ki} = \operatorname{sign}\tilde\theta_{k+1,i}$ for all $k$ sufficiently large, implying that $\operatorname{sign} g_i(\hat\theta_k) = \operatorname{sign} g_i(\hat\theta_{k+1})$ a.s.
Pro<strong>of</strong> <strong>of</strong> Theorem 1a (M2-<strong>SPSA</strong>)<br />
The proof will proceed in three parts. Some of the proof closely follows that of the proposition in Spall and Cristion [25], in which case the details will be omitted here and the reader will be directed to that reference. However, some of the proof differs in nontrivial ways due to, among other factors, the need to explicitly treat the bias in the gradient estimate $G_k(\cdot)$. First, we will show that $\tilde\theta_k = \hat\theta_k - \theta^{*}$ does not diverge in magnitude to $\infty$ on any set of nonzero measure. Second, we will show that $\tilde\theta_k$ converges a.s. to some random vector, and third, we will show
that this random vector is the constant 0, as desired. Equalities hold a.s. where relevant.<br />
Part 1: First, from C.0, C.2, and C.3, it can be shown in the manner of Spall [3, Lemma 1] that, for all $k$ sufficiently large,

$E\bigl(G_k(\hat\theta_k) \mid \hat\theta_k\bigr) = g(\hat\theta_k) + b_k$   (A1)

where $c_k^{-2} b_k$ is uniformly bounded a.s. Using C.6, we know that $\bar H_k^{-1}$ exists a.s., and hence we can write $M_k \equiv -a_k \bar H_k^{-1}\bigl(g(\hat\theta_k) + b_k\bigr)$. Then, as in the proposition of Spall and Cristion [25], C.1, C.2, and C.6, and Holder's inequality imply, via the martingale convergence theorem,

$\tilde\theta_{k+1} = \sum_{j=0}^{k} M_j \xrightarrow{\mathrm{a.s.}} X$   (A2)

where $X$ is some integrable random vector.
Let us now show that $P(\limsup_{k\to\infty}\|\tilde\theta_k\| = \infty) = 0$. Since the arguments below apply along any subsequence, we will, for ease of notation and without loss of generality, consider the event $\{\|\tilde\theta_k\| \to \infty\}$. We will show that this event has probability 0 by a modification of the arguments in [25, proposition] (which is a multivariate extension of scalar arguments in Blum [34], and Evans and Weber [4]). Furthermore, suppose that the limiting quantity of the unbounded elements is $+\infty$ (trivial modifications cover a limiting quantity including $-\infty$ limits). Then, as shown in [25], the event of interest $\{\|\tilde\theta_k\| \to \infty\}$ has probability 0 if

$\bigl\{\tilde\theta_{ki} \ge \rho'(\tau,S)\ \forall i \in S,\ |\tilde\theta_{ki}| \le \tau\ \forall i \notin S,\ k \ge K(\tau,S)\bigr\} \cap \limsup_{k\to\infty}\{M_{ki} < 0\ \forall i \in S\}$   (A3a)

and

$\bigl\{\tilde\theta_{ki} \to \infty\ \forall i \in S\bigr\} \cap \liminf_{k\to\infty}\{M_{ki} < 0\ \forall i \in S\}^{c}$   (A3b)
both have probability 0 for all $\tau$, $S$, and $\rho'(\tau,S)$ as defined in C.7, where $K(\tau,S) < \infty$ and the superscript $c$ denotes set complement. For event (A3a), we know that there exists a subsequence $\{k_0, k_1, k_2, \ldots\}$, $k_0 \ge K(\tau,S)$, such that $\{\tilde\theta_{k,i} \ge \rho'(\tau,S)\ \forall i \in S\} \cap \{M_{k,i} < 0\ \forall i \in S\}$ is true. Then, from C.6 and (A1),

$\sum_{i \in S} \tilde\theta_{k_j i}\bigl(g_{k_j i}(\tilde\theta_{k_j}) + o(1)\bigr) < 0$ a.s.   (A4)
for all $k_j$. By C.4, $\tilde\theta_{k_j}^{T} g_{k_j}(\hat\theta_{k_j}) \ge \rho\,\|\tilde\theta_{k_j}\|$ a.s., which, by C.7, implies, for all $j$ sufficiently large,

$\sum_{i \in S} \tilde\theta_{k_j i}\, g_{k_j i}(\tilde\theta_{k_j}) \ \ge\ \frac{\rho}{2}\,\|\tilde\theta_{k_j}\| \ \ge\ \Bigl(\frac{\rho}{2}\Bigr)\dim(S)\,\rho'(\tau,S) \ \ge\ \frac{\rho\tau}{2}$   (A5)
since $\rho'(\tau,S) \ge \tau$ and $\dim(S) \ge 1$. Taken together, (A4) and (A5) imply that, for each sample point (except possibly on a set of measure 0), the event in (A3a) has probability 0. Now, consider the second event (A3b). From (A2), we know that, for almost all sample points, $\sum_{k=0}^{\infty} M_{ki} = -\infty\ \forall i \in S$ must be true. But this implies, from C.5 and the above-mentioned uniformly bounded decaying bias ($b_k$), that for no $i \in S$ can $M_{ki} \ge 0$ occur. However, at each $k$, the event $\{M_{ki} < 0\ \forall i \in S\}^{c}$ is composed of the union of $2^{\dim(S)} - 1$ events, each of which has $M_{ki} \ge 0$ for at least one $i \in S$. This, of course, requires that $M_{ki} \ge 0$ for at least one $i \in S$, which creates a contradiction. Hence, the probability of the event in (A3b) is 0. This completes Part 1 of the proof.
Part 2: To show that $\tilde\theta_k$ converges a.s. to a unique (finite) limit, we show that

$$P\Bigl(\liminf_{k\to\infty}\tilde\theta_{ki} < a' < b' < \limsup_{k\to\infty}\tilde\theta_{ki}\Bigr) = 0 \quad \forall i \qquad \text{(A6)}$$

for any $a' < b'$. This result follows as in the proof of Part 2 of the proposition in Spall and Cristion [25].
Part 3: Let us now show that the unique finite limit from Part 2 is 0. From (A2) and the conclusion of Part 1, we have $\limsup_{n\to\infty} \sum_{k=0}^{n} M_{ki} < \infty$ a.s. $\forall i$. Then the result to be shown follows if

$$P\Bigl(\lim_{k\to\infty}\tilde\theta_k \neq 0,\ \sum_{k=0}^{\infty} M_k < \infty\Bigr) = 0. \qquad \text{(A7)}$$
Suppose that the event in the probability of (A7) is true, and let $I \subseteq \{1,2,\ldots,p\}$ represent those indexes $i$ such that $\tilde\theta_{ki}$ does not converge to 0 as $k \to \infty$. Then, by the convergence in Part 2, there exist (for almost any sample point in the underlying sample space) some $0 < a' < b' < \infty$ and $K(a',b') < \infty$ (dependent on the sample point) such that $\forall k > K$, $0 < a' \le \tilde\theta_{ki} \le b' < \infty$ when $i \in I$ ($I \neq \varnothing$) and $\tilde\theta_{ki} \le a'$ when $i \in I^{c}$. From C.4, it follows that
$$\sum_{k=K+1}^{n} a_k \sum_{i\in I} \tilde\theta_{ki}\, g_{ki}(\hat\theta_k) \ \ge\ a'\rho \sum_{k=K+1}^{n} a_k. \qquad \text{(A8)}$$
But since C.5 implies that $\hat g_{ki}(\hat\theta_k)$ can change sign only a finite number of times (except possibly on a set of sample points of measure 0), and since $\tilde\theta_{ki} \le b'$, we know from (A8) that, for at least one $i \in I$,

$$\limsup_{n\to\infty}\ \frac{\rho\, a' \sum_{k=K+1}^{n} a_k}{\displaystyle\sum_{k=K+1}^{n} a_k\, g_{ki}(\hat\theta_k)} < \infty. \qquad \text{(A9)}$$
Recall that $a_k g_k(\hat\theta_k) = M_k - a_k H_k^{-1} b_k$ and $b_k = O(c_k^2)$ a.s. Hence, from C.6, we have $H_k^{-1} b_k = o(1)$. Then, by (A9), $\sum_{k=K+1}^{\infty} M_{ki} = \infty$. Since, for the $a' < b'$ above, there exists such a $K$ for each sample point in a set of measure one, we know from the above discussion that there also exists an $i \in I$ ($i$ possibly dependent on the sample point) such that $\sum_{k=K+1}^{\infty} M_{ki} = \infty$. Since $I$ has a finite number of elements, $\sum_{k=0}^{\infty} M_{ki} = \infty$ for at least one $i$. However, this is inconsistent with the event in (A7), showing that the event does, in fact, have probability 0. This completes Part 3, which completes the proof.
Proof of Theorem 1b (2SG): The initial martingale convergence arguments establishing the 2SG analog to (A2) are based on C.0'–C.2' and C.6. Although there is no bias in the gradient measurement, C.4 and C.7 still work together to guarantee that the elements potentially diverging [in the arguments analogous to those surrounding (A3a), (A3b)] asymptotically dominate the product $\hat\theta_{kj}^{T}\,\hat g_{kj}(\hat\theta_{kj})$. As in the Proof of Theorem 1a, this sets up a contradiction. The remainder of the proof follows exactly as in Parts 2 and 3 of the Proof of Theorem 1a, with some of the arguments made easier since $b_k = 0$.
Proof of Theorem 2a (M2-SPSA):

First, note that the conditions subsume those of Theorem 1a; hence, we have a.s. convergence of $\hat\theta_k$. By C.8, $E\bigl((c_k \tilde c_k \hat H_k)^2\bigr)$ is uniformly bounded $\forall k$. Hence, by the additional assumption introduced in C.1'' (beyond that in C.1), the martingale convergence result in, say, Gerencser [16] yields

$$\frac{1}{n+1}\sum_{k=0}^{n}\bigl(\hat H_k - E(\hat H_k \mid \hat\theta_k)\bigr) \to 0 \quad \text{a.s. as } n \to \infty. \qquad \text{(A10)}$$
Let $H(\theta)$ represent the true Hessian matrix, and suppose that $g(\theta)$ is three-times continuously differentiable in a neighborhood of $\hat\theta_k$. Then, simple Taylor series arguments show that

$$E(\delta G_k \mid \hat\theta_k, \Delta_k) = \delta g_k + O(c_k^3) = g(\hat\theta_k + c_k\Delta_k) - g(\hat\theta_k - c_k\Delta_k) + O(c_k^3)$$

(with $O(c_k^3) = 0$ in the SG case), where this result is immediate in the SG case and follows easily by a Taylor series argument in the SPSA case (where the $O(c_k^3)$ term is the difference of the two $O(c_k^2)$ bias terms in the one-sided SP gradient approximations and $\tilde c_k = O(c_k)$). Hence, by an expansion of each of $g(\hat\theta_k \pm c_k\Delta_k)$, we have, for any $i$, $j$,
$$E\Bigl(\frac{\delta G_{ki}}{2c_k\Delta_{kj}} \Bigm| \hat\theta_k, \Delta_k\Bigr) = E\Bigl(\frac{\delta g_{ki}}{2c_k\Delta_{kj}} \Bigm| \hat\theta_k, \Delta_k\Bigr) + O(c_k^2) = H_{ij}(\hat\theta_k) + \sum_{l\neq j} H_{lj}(\hat\theta_k)\,\frac{\Delta_{kl}}{\Delta_{kj}} + O(c_k^2)$$

where the $O(c_k^2)$ term in the second line absorbs higher-order terms in the expansion of $\delta g_k$. Then, since $E(\Delta_{kl}/\Delta_{kj}) = 0\ \forall j \neq l$ by the assumptions for $\Delta_k$, we have

$$E\Bigl(\frac{\delta G_{ki}}{2c_k\Delta_{kj}} \Bigm| \hat\theta_k\Bigr) = H_{ij}(\hat\theta_k) + O(c_k^2),$$
implying that the Hessian estimate is "nearly unbiased," with the bias disappearing at rate $O(c_k^2)$. The additional operation in

$$\hat H_k = \frac{1}{2}\left[\frac{\delta G_k}{2c_k\,\Delta_k^{T}} + \Bigl(\frac{\delta G_k}{2c_k\,\Delta_k^{T}}\Bigr)^{T}\right]$$
simply forces the per-iteration estimate to be symmetric. Then, by the above equations, conditions C.3', C.8, and C.9 imply (A14) $\forall l$, where $L^{(3)}_{hij}$ represents the third derivative of $L$ w.r.t. the $h$th, $i$th, and $j$th elements of $\theta$; $\bar\theta_k^{\pm}$ are points on the line segments between $\hat\theta_k \pm c_k\Delta_k + \tilde c_k\tilde\Delta_k$ and $\hat\theta_k \pm c_k\Delta_k$; and we used the fact that $E(\tilde\Delta_{ki}\tilde\Delta_{kj}/\tilde\Delta_{kl}) = 0\ \forall i, j,$ and $l$ (implied by C.9 and the Cauchy–Schwarz inequality). Let
$$B_{kl} \equiv \frac{1}{6}\,E\Bigl[\tilde\Delta_{kl}^{-1}\sum_{h,i,j}\bigl(L^{(3)}_{hij}(\bar\theta_k^{+}) - L^{(3)}_{hij}(\bar\theta_k^{-})\bigr)\,\tilde\Delta_{kh}\tilde\Delta_{ki}\tilde\Delta_{kj} \Bigm| \hat\theta_k, \Delta_k\Bigr]. \qquad \text{(A11)}$$
By C.3' (bounding the difference in the $L^{(3)}_{hij}$ terms) and C.9, in conjunction with the Cauchy–Schwarz inequality and C.1'' ($\tilde c_k = O(c_k)$), we have $B_{kl}/c_k$ uniformly bounded (in $\hat\theta_k$, $\Delta_k$) for all $k$ sufficiently large. Hence, from (A11), the $(l,m)$-th element of $\hat H_k$ satisfies
$$\begin{aligned} E(\hat H_{k,l,m} \mid \hat\theta_k) &= E\Bigl(\frac{G^{(1)}_{kl}(\hat\theta_k + c_k\Delta_k) - G^{(1)}_{kl}(\hat\theta_k - c_k\Delta_k)}{2c_k\Delta_{km}} \Bigm| \hat\theta_k\Bigr) \\ &= E\Bigl(\frac{g_l(\hat\theta_k + c_k\Delta_k) - g_l(\hat\theta_k - c_k\Delta_k) + \tilde c_k^2 B_{kl}}{2c_k\Delta_{km}} \Bigm| \hat\theta_k\Bigr) \\ &= E\Bigl(\frac{2c_k\,[\partial g_l/\partial\theta]^{T}\big|_{\hat\theta_k}\,\Delta_k + O(c_k^3)}{2c_k\Delta_{km}} \Bigm| \hat\theta_k\Bigr) \\ &= H_{lm}(\hat\theta_k) + O(c_k^2) \end{aligned} \qquad \text{(A12)}$$
where the $O(c_k^3)$ term in the third line of (A12) encompasses both $\tilde c_k^2 B_{kl}$ and the uniformly bounded contributions due to $\partial^2 g_l/\partial\theta\,\partial\theta^{T}$ in the remainder terms of the expansion of $g_l(\hat\theta_k + c_k\Delta_k) - g_l(\hat\theta_k - c_k\Delta_k)$ ($O(c_k^3)/c_k$ is uniformly bounded, allowing the use of C.9 and the Cauchy–Schwarz inequality in producing the $O(c_k^2)$ term in the last line of (A12)). Then, by (A12), the continuity of $H$ near $\hat\theta_k$, and the fact that $\hat\theta_k \to \theta^*$ a.s. (Theorem 1a), the principle of Cesaro summability implies
$$\frac{1}{n+1}\sum_{k=0}^{n} E(\hat H_k \mid \hat\theta_k) = \frac{1}{n+1}\sum_{k=0}^{n}\bigl(H(\hat\theta_k) + O(c_k^2)\bigr) \to H(\theta^*) \quad \text{a.s.} \qquad \text{(A13)}$$

Given that $\bar H_n = (n+1)^{-1}\sum_{k=0}^{n}\hat H_k$, (A10) and (A13) then yield the result to be proved.
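The "nearly unbiased" property (A12) and the Cesaro-mean convergence (A13) can be illustrated numerically. The sketch below is an illustration only, not part of the proof: it assumes a fixed quadratic loss with exact gradient measurements (the SG case), Bernoulli $\pm 1$ perturbations, a fixed $c_k$, and an arbitrarily chosen matrix $A$ and iterate $\theta$.

```python
import numpy as np

rng = np.random.default_rng(0)

# True Hessian of the quadratic loss L(theta) = 0.5 * theta^T A theta;
# A and all numerical values here are assumptions for illustration only.
A = np.array([[4.0, 1.0],
              [1.0, 3.0]])
grad = lambda th: A @ th                  # exact gradient measurements (SG case)

theta = np.array([1.0, -1.0])             # current iterate (held fixed here)
c = 0.1                                   # perturbation magnitude c_k
n = 100_000

H_bar = np.zeros_like(A)
for k in range(n):
    delta = rng.choice([-1.0, 1.0], size=2)        # Bernoulli +/-1 perturbation
    dG = grad(theta + c * delta) - grad(theta - c * delta)
    raw = np.outer(dG / (2.0 * c), 1.0 / delta)    # (delta G_k / 2 c_k) per element of Delta_k
    H_hat = 0.5 * (raw + raw.T)                    # symmetrizing operation from the text
    H_bar += (H_hat - H_bar) / (k + 1)             # running sample mean, as in (A13)

print(np.round(H_bar, 2))   # close to A
```

Each per-iteration estimate is noisy (its off-diagonal entries fluctuate with the sign products), but the running mean settles near the true Hessian, mirroring (A10) plus (A13).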
Proof of Theorem 2b (2SG): Since the conditions subsume those of Theorem 1b, we have $\hat\theta_k \to \theta^*$ a.s. Analogous to (A10), C.1''' and C.8' yield a martingale convergence result for the sample mean of $\hat H_k - E(\hat H_k \mid \hat\theta_k)$. Then, given the boundedness of the third derivatives of $L(\theta)$ near $\hat\theta_k$ for all $k$, the Cauchy–Schwarz inequality and C.8', C.9' imply that $E(\hat H_k \mid \hat\theta_k) = H(\hat\theta_k) + O(c_k^2)$. By $\hat\theta_k \to \theta^*$ a.s., the Cesaro summability arguments in (A13) yield the result to be proved.
Proof of Theorem 3a (M2-SPSA):

Beginning with the expansion $E(G_k(\hat\theta_k) \mid \hat\theta_k) = H(\bar\theta_k)(\hat\theta_k - \theta^*) + b_k$, where $\bar\theta_k$ is on the line segment between $\hat\theta_k$ and $\theta^*$ and the bias $b_k$ is defined in (A1), the estimation error can be represented in the notation of [19] as
$$\hat\theta_{k+1} - \theta^* = (I - k^{-\alpha}\Gamma_k)(\hat\theta_k - \theta^*) + k^{-(\alpha+\beta)/2}\,\Phi_k V_k + k^{-\alpha-\beta/2}\,T_k$$

where

$$\Gamma_k = a\,\bar H_k^{-1} H(\bar\theta_k), \qquad V_k = k^{-\gamma}\bigl[G_k(\hat\theta_k) - E(G_k(\hat\theta_k) \mid \hat\theta_k)\bigr],$$

$$\Phi_k = -a\,\bar H_k^{-1}, \qquad T_k = -a\,k^{\beta/2}\,\bar H_k^{-1} b_k.$$

The proof follows that of Spall [3, Proposition 2] closely, which shows that the three sufficient conditions for asymptotic normality in Fabian [19] hold. By the convergence of $\hat\theta_k$, it is straightforward to show a.s. convergence of $T_k$ to 0 if $3\gamma - \alpha/2 > 0$, or to $T$ in (2.37) if $3\gamma - \alpha/2 = 0$. The mean expression $\mu$ then follows directly from Fabian [19] and the convergence of $\bar H_k$ (and hence $\bar H_k^{-1}$) by C.11 and the existence of $H(\theta^*)^{-1}$. Further, as in Spall [3], $E(V_k V_k^{T} \mid \hat\theta_k)$ is a.s. convergent by C.2 and C.10, leading to the covariance matrix $\Omega$. This shows Fabian [19, (2.2.1) and (2.2.2)]. The final condition [19, (2.2.3)] follows as in Spall [3, Proposition 2] since the definition of $V_k$ is identical in both standard SPSA and M2-SPSA.
Here, (A14), referenced in the Proof of Theorem 2a, is given by

$$\begin{aligned} E\bigl[G^{(1)}_{kl}(\hat\theta_k \pm c_k\Delta_k) \mid \hat\theta_k, \Delta_k\bigr] &= E\Bigl[(\tilde c_k\tilde\Delta_{kl})^{-1}\Bigl(\tilde c_k\, g(\hat\theta_k \pm c_k\Delta_k)^{T}\tilde\Delta_k + \frac{\tilde c_k^2}{2}\,\tilde\Delta_k^{T} H(\hat\theta_k \pm c_k\Delta_k)\,\tilde\Delta_k \\ &\qquad\qquad + \frac{\tilde c_k^3}{6}\sum_{h,i,j} L^{(3)}_{hij}(\bar\theta_k^{\pm})\,\tilde\Delta_{kh}\tilde\Delta_{ki}\tilde\Delta_{kj}\Bigr) \Bigm| \hat\theta_k, \Delta_k\Bigr] \\ &= g_l(\hat\theta_k \pm c_k\Delta_k) + \frac{\tilde c_k^2}{6}\,E\Bigl[\tilde\Delta_{kl}^{-1}\sum_{h,i,j} L^{(3)}_{hij}(\bar\theta_k^{\pm})\,\tilde\Delta_{kh}\tilde\Delta_{ki}\tilde\Delta_{kj} \Bigm| \hat\theta_k, \Delta_k\Bigr]. \end{aligned} \qquad \text{(A14)}$$
Proof of Theorem 3b (2SG): Analogous to the Proof of Theorem 3a, the estimation error can be represented as

$$\hat\theta_{k+1} - \theta^* = (I - k^{-\alpha}\Gamma_k)(\hat\theta_k - \theta^*) + k^{-\alpha}\,\Phi_k e_k$$

where $\Gamma_k = a\,\bar H_k^{-1} H(\bar\theta_k)$ and $\Phi_k = -a\,\bar H_k^{-1}$. Conditions (2.2.1) and (2.2.2) of Fabian [19] follow immediately by the smoothness of $L(\theta)$ (from C.3'), the convergence of $\hat\theta_k$ and $\bar H_k$, and C.12. Condition (2.2.3) of Fabian [19] follows by Hölder's inequality and C.2', C.3'.
Proof of Theorem 4a (convergence in parameter estimation, M2-SPSA): The convergence theorem for the proposed method is proven here based on RM-type (Robbins–Monro) stochastic approximation. In contrast to the RM-type method, the simultaneous perturbation stochastic approximation estimates the slope of the error function from values of the error function itself; the estimated slope therefore contains an estimation error. In this proof, the nature of that estimation error is clarified, which allows convergence of the parameter estimation algorithm to be established via the conventional RM-type stochastic approximation argument. In the proof below, subscripts that can be readily understood from context are omitted.
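The point about the estimated slope containing an error can be made concrete with a small numerical sketch (purely illustrative; the error function, dimension, and constants below are assumptions, not the thesis model): a single two-sided simultaneous-perturbation slope estimate deviates from the true gradient through sign-product cross terms, but those cross terms average to zero over the perturbation distribution.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative error function (not the thesis model) and its true gradient.
f = lambda p: p[0]**2 + p[1]**2 + p[0] * p[1]
true_grad = lambda p: np.array([2*p[0] + p[1], 2*p[1] + p[0]])

phi = np.array([0.8, -0.5])
c = 0.05                                    # perturbation magnitude

# A single SP estimate: each sample carries an estimation error.
delta = rng.choice([-1.0, 1.0], size=2)
g_hat = (f(phi + c*delta) - f(phi - c*delta)) / (2.0 * c) / delta
print(g_hat - true_grad(phi))               # per-sample estimation error

# Averaged over many perturbations, the estimate recovers the true slope.
est = np.zeros(2)
for k in range(5000):
    delta = rng.choice([-1.0, 1.0], size=2)
    g_hat = (f(phi + c*delta) - f(phi - c*delta)) / (2.0 * c) / delta
    est += (g_hat - est) / (k + 1)
print(np.round(est, 2), true_grad(phi))     # mean estimate vs true gradient
```

This is exactly the situation the proof handles: the per-sample error is nonzero, so the convergence argument must characterize it before the RM-type machinery applies.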
For $\tilde\varphi = \hat\varphi - \varphi$, if the true parameter value $\varphi$ is subtracted from both sides of (2.62) and the right-hand side is then expanded and rearranged,

$$\tilde\varphi_{k+n} = \bigl(I_{n+m} - \rho\, z_{k+n-1}\, y_{k+n-1}^{T}\bigr)\tilde\varphi_{k-1} + \rho\left\{ z_{k+n-1}\, e_{k+n-1} + \begin{bmatrix}\sigma^2 I_n & 0\\ 0 & 0\end{bmatrix}\hat\varphi_{k-1} - \frac{1}{2}\,c\,\bigl(y_{k+n-1}^{T} s_{k-1}\bigr)^2 s_{k-1} \right\} \qquad \text{(B.1)}$$
results. Here, $z_{k+n-1}$ is given by

$$z_{k+n-1} = \bigl(y_{k+n-1}^{T} s_{k-1}\bigr)\, s_{k-1} = y_{k+n-1} + d_{k+n-1}.$$

Note that $d_{k+n-1}$ represents the difference between $(y_{k+n-1}^{T} s_{k-1})\, s_{k-1}$ and $y_{k+n-1}$, and is given by the following equation, where $s_{,i}$ denotes the $i$-th element $s_{k-1,i}$ of the signed vector at time $k-1$:
$$d_{k+n-1} = \begin{pmatrix} \sum_{j\neq 1} y_{k+n-1,j}\; s_{,j}\, s_{,1} \\ \sum_{j\neq 2} y_{k+n-1,j}\; s_{,j}\, s_{,2} \\ \vdots \\ \sum_{j\neq n+m} y_{k+n-1,j}\; s_{,j}\, s_{,n+m} \end{pmatrix}. \qquad \text{(B.2)}$$
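The decomposition $z = (y^{T}s)s = y + d$ and the vanishing expectation of the cross terms in $d$ can be checked numerically. The sketch below uses an arbitrary vector $y$ and independent equiprobable $\pm 1$ signs; the dimension 4 is an assumption for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

y = rng.normal(size=4)                       # regressor vector (illustrative)
s = rng.choice([-1.0, 1.0], size=4)          # signed vector s_{k-1}

z = (y @ s) * s                              # z = (y^T s) s
d = z - y                                    # cross-term vector d

# Each element of d collects y_j s_j s_i over j != i, since s_i^2 = 1.
d_check = np.array([sum(y[j] * s[j] * s[i] for j in range(4) if j != i)
                    for i in range(4)])
assert np.allclose(d, d_check)

# With independent equiprobable signs, E[s_i s_j] = 0 for i != j,
# so d averages to zero over the sign distribution.
mean_d = np.zeros(4)
for k in range(20000):
    s = rng.choice([-1.0, 1.0], size=4)
    mean_d += ((y @ s) * s - y - mean_d) / (k + 1)
print(np.round(mean_d, 2))                   # approximately the zero vector
```

This is the mechanism exploited below: conditioned expectations of terms built from $d$ vanish, leaving only the $yy^{T}$-type contributions.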
Caution is required because the product of $s_{,i}$ and $s_{,j}$ in each term of each element of $d$ is a product of mutually distinct elements of the signed vector $s_{k-1}$; in other words, since $i \neq j$, its expected value is 0. At this point, based on Eq. (B.1), $\|\tilde\varphi_{k+n}\|^2$ is calculated as follows:
$$\begin{aligned} \|\tilde\varphi_{k+n}\|^2 &= \tilde\varphi_{k+n}^{T}\,\tilde\varphi_{k+n} \\ &= \left\|\tilde\varphi_{k-1} + \rho\left(-zy^{T}\tilde\varphi_{k-1} + ze + \begin{bmatrix}\sigma^2 I_n & 0\\ 0 & 0\end{bmatrix}\hat\varphi_{k-1} - \frac{1}{2}\,c\,(y^{T}s)^2 s\right)\right\|^2 \\ &= \tilde\varphi_{k-1}^{T}\bigl(I - \rho\, zy^{T} - \rho\, yz^{T}\bigr)\tilde\varphi_{k-1} + 2\rho\,\tilde\varphi_{k-1}^{T}\left\{ze + \begin{bmatrix}\sigma^2 I_n & 0\\ 0 & 0\end{bmatrix}\hat\varphi_{k-1} - \frac{1}{2}\,c\,(y^{T}s)^2 s\right\} + \rho^2 h. \end{aligned} \qquad \text{(B.3)}$$
Here,

$$h = \left\| -zy^{T}\tilde\varphi_{k-1} + ze + \begin{bmatrix}\sigma^2 I_n & 0\\ 0 & 0\end{bmatrix}\hat\varphi_{k-1} - \frac{1}{2}\,c\,(y^{T}s)^2 s \right\|^2. \qquad \text{(B.4)}$$
Finally, the expected value of Eq. (B.3) is found conditioned on $\tilde\varphi_{k-1} = \beta$. Before this, though, each term in the equation is evaluated. First, the conditional expected value of $zy^{T}$ given $\tilde\varphi_{k-1} = \beta$ must be considered:

$$E\{zy^{T} \mid \tilde\varphi_{k-1} = \beta\} = E\{yy^{T} \mid \beta\} + E\{dy^{T} \mid \beta\}. \qquad \text{(B.5)}$$
Here, the second term on the right is 0 based on the signed-vector condition (B11). Therefore, only the first term on the right needs to be considered, and so the following equation results:

$$E\{yz^{T} \mid \beta\} = E\{yy^{T} \mid \beta\} = E\left\{\begin{pmatrix}x\\ u\end{pmatrix}\bigl(x^{T}\ \ u^{T}\bigr) + \begin{pmatrix}\upsilon\\ 0\end{pmatrix}\bigl(\upsilon^{T}\ \ 0\bigr) \,\Bigm|\, \beta\right\} = E\left\{\begin{bmatrix}xx^{T} & xu^{T}\\ ux^{T} & uu^{T}\end{bmatrix} \Bigm| \beta\right\} + \begin{bmatrix}\sigma^2 I_n & 0\\ 0 & 0\end{bmatrix}. \qquad \text{(B.6)}$$
In the same fashion, for $E\{yz^{T} \mid \tilde\varphi_{k-1} = \beta\}$ the same result as in Eq. (B.6) is obtained. In addition, the expected value of $ze$ given $\tilde\varphi_{k-1} = \beta$ must be considered:

$$E\{ze \mid \beta\} = E\{ye \mid \beta\} + E\{de \mid \beta\}. \qquad \text{(B.7)}$$
Because the second term in the equation above is 0 based on condition (B11), only the first term needs to be considered. The first term is given by (2.52), and so the following equation results:

$$E\{ze \mid \beta\} = -\begin{bmatrix}\sigma^2 I_n & 0\\ 0 & 0\end{bmatrix}\varphi. \qquad \text{(B.8)}$$
Now let us consider $h$ as represented in Eq. (B.4). Although a similar discussion can be found in [15], only the fourth term on the right varies as a result of the perturbation. Expanding Eq. (B.4) reveals a term multiplied by $(y^{T}s)^2 s$ and a term involving its square. The term multiplied by $(y^{T}s)^2 s$ takes 0 for its expected value based on the signing condition (B11). The latter term involves a fourth-order moment of $y$. When the assumption (C11) of the boundedness of the fourth-order moments of the stochastic input $u$ and the observation noise $v$, together with the assumption (A12) of the boundedness of the perturbation, is taken into consideration, from Eq. (B.4) we have the following inequality for appropriate constants $0 \le \alpha_1, \alpha_2 < \infty$:

$$E\{h \mid \tilde\varphi_{k-1} = \beta\} \le \alpha_1\|\beta\|^2 + \alpha_2. \qquad \text{(B.9)}$$
Given the above relationships, the conditional expected value of (B.3) given $\tilde\varphi_{k-1} = \beta$ satisfies the following:

$$\begin{aligned} E\{\|\tilde\varphi_{k+n}\|^2 \mid \tilde\varphi_{k-1} = \beta\} &\le \beta^{T}(I_{n+m} - 2\rho D)\beta - 2\rho\,\beta^{T}\begin{bmatrix}\sigma^2 I_n & 0\\ 0 & 0\end{bmatrix}\beta \\ &\quad + 2\rho\,\beta^{T}\left\{-\begin{bmatrix}\sigma^2 I_n & 0\\ 0 & 0\end{bmatrix}\varphi + \begin{bmatrix}\sigma^2 I_n & 0\\ 0 & 0\end{bmatrix}\hat\varphi\right\} + \rho^2\bigl(\alpha_1\|\beta\|^2 + \alpha_2\bigr) \\ &= \beta^{T}(I_{n+m} - 2\rho D)\beta + \rho^2\alpha_1\|\beta\|^2 + \rho^2\alpha_2 \end{aligned} \qquad \text{(B.10)}$$
where

$$D = E\begin{bmatrix}xx^{T} & xu^{T}\\ ux^{T} & uu^{T}\end{bmatrix}.$$
Based on the condition (C11), $D$ is a symmetric positive definite matrix and has a minimum eigenvalue $\lambda > 0$. Therefore, we can obtain (2.52) by using

$$E\{\|\tilde\varphi_{k+n}\|^2\} \le \bigl(1 - 2\rho\lambda + \rho^2\alpha_1\bigr)\,E\{\|\tilde\varphi_{k-1}\|^2\} + \rho^2\alpha_2. \qquad \text{(B.11)}$$

The above equation returns us to the proof [15] of the convergence theorem for the parameter estimation algorithm using the Robbins–Monro stochastic approximation. Therefore, under the condition (A11) on the gain coefficient,

$$\lim_{k\to\infty} E\bigl\{\|\hat\varphi_k - \varphi\|^2\bigr\} = 0$$

holds.
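The role of the contraction bound (B.11) can be seen by iterating it numerically. In the sketch below, $\lambda$, $\alpha_1$, $\alpha_2$, the gains, and the iteration counts are assumed values for illustration: with a fixed gain $\rho$ the bound settles at a positive level of order $\rho$, while a decaying gain in the spirit of the condition (A11) drives it to zero.

```python
# Iterate the scalar bound from (B.11) with illustrative constants.
lam, a1, a2 = 1.0, 0.5, 2.0     # lambda, alpha_1, alpha_2 (assumed values)

# Fixed gain: the bound settles at a positive level of order rho.
v = 10.0
rho = 0.05
for _ in range(2000):
    v = (1 - 2*rho*lam + rho**2 * a1) * v + rho**2 * a2
fixed_level = v                  # approx. rho*a2 / (2*lam - rho*a1)

# Decaying gain rho_k = 1/k: the bound tends to 0. The max(0, ...) clamp is
# valid because the bounded quantity is a squared norm, hence nonnegative.
w = 10.0
for k in range(1, 20001):
    rho_k = 1.0 / k
    w = max(0.0, (1 - 2*rho_k*lam + rho_k**2 * a1) * w) + rho_k**2 * a2

print(fixed_level, w)            # positive plateau vs near-zero value
```

The second iteration is the scalar analogue of the Robbins–Monro argument invoked above: the contraction factor $1 - 2\rho_k\lambda + \rho_k^2\alpha_1$ dominates the forcing term $\rho_k^2\alpha_2$ when the gain decays.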
Appendix B

Interpretation of Regularity Conditions

This Appendix provides comments on some of the conditions of ASP relative to other adaptive SA approaches. In the confines of a short discussion, it is obviously not possible to provide a detailed discussion of all conditions of all known adaptive approaches. Nevertheless, we hope to convey a flavor of the relative nature of the conditions.
As discussed in Sec. 2.9, some of the conditions of ASP depend on $\hat\theta_k$ itself, creating a type of circularity (i.e., direct conditions on the quantity being analyzed). This circularity has been discussed elsewhere, since other SA algorithms also have dependent conditions. Some of the ASP conditions can be eliminated or simplified if the conditions of the lemma in Sec. 2.9 hold. The foremost lemma condition is that $\hat\theta_k$ be uniformly bounded. Of course, this uniform boundedness is itself a circular condition, but it helps to simplify the other conditions of the theorems that depend on $\hat\theta_k$, since the $\hat\theta_k$ dependence can be replaced by an assumption that these other conditions hold uniformly over all $\theta$ in the bounded set guaranteed to contain $\hat\theta_k$ (e.g., the current assumption C.3, that $g(\theta)$ be twice continuously differentiable in neighborhoods of the estimates $\hat\theta_k$, can be replaced by an assumption that $g(\theta)$ is twice continuously differentiable on some bounded set known to contain $\hat\theta_k$). If the lemma applies, condition C.5 (on the i.o. behavior of $\hat\theta_k$) is unnecessary.
In showing convergence and asymptotic normality, one might wonder whether other adaptive algorithms could avoid conditions that depend on $\hat\theta_k$, while also avoiding alternative conditions that are similarly undesirable. Based on currently available adaptive approaches, the answer appears to be "no." As an illustration, let us analyze one of the more powerful results on adaptive algorithms, the result in Wei [48].
The Wei [48] approach is restricted to the SG/root-finding setting, as opposed to the more general setting for ASP that encompasses both gradient-free and SG/root-finding problems. The approach is based on $2p$ measurements of $g(\theta)$ at each iteration to estimate the Jacobian (Hessian) matrix. Some of the conditions in Wei [48] are similar to conditions for ASP (e.g., decaying gain sequences and smoothness of the functions involved), while other conditions are more stringent (the restriction to only the root-finding setting and the requirement for i.i.d. measurement noise). There are also conditions in ASP that are not required in Wei [48]: principally those associated with "nice" behavior of the user-specified inputs (bounded moments, etc.), the steepness conditions C.4 and C.7 (similar to standard conditions in some other adaptive approaches, e.g., Ruppert [14]), and limits on the amount of "big-step" bouncing of the iterates (the i.o. condition C.5). An additional key assumption in Wei [48] is the symmetric function condition on the Jacobian (or Hessian) matrix:
$$H(\theta)^{T} H(\theta') + H(\theta')^{T} H(\theta) > 0, \quad \forall\,\theta,\ \theta'. \qquad \text{(D.1)}$$
This, unfortunately, is a stringent condition that may be easily violated. In the optimization case (where $H$ is a Hessian), this condition may fail even for benign (e.g., convex) loss functions. Consider, for example, a case with $\theta = (x, y)^{T}$ and a simple convex loss function $L(\theta) = x^4 + x^2 + y^2 + xy$. Letting $\theta = (0,0)^{T}$ and $\theta' = (2,0)^{T}$, we have

$$H(\theta)\,H(\theta')^{T} + H(\theta')\,H(\theta)^{T} = \begin{bmatrix}202 & 56\\ 56 & 10\end{bmatrix},$$

which is not positive definite, violating condition (D.1). Aside from the fact that this condition may be easily violated, it is also generally impossible to check in practice because it requires knowledge of the true $H(\theta)$ over the whole domain; this, of course, is the very quantity that is being estimated. The requirement for such prior knowledge is also apparent in other adaptive approaches discussed in Ruppert [14] and Fabian [19]. Given the above, it is clear that neither
approaches. The inherent difficulty in establishing theoretical properties <strong>of</strong> adaptive approaches<br />
comes from the need to couple the estimates <strong>for</strong> the parameters <strong>of</strong> interest and <strong>for</strong> the<br />
<strong>Hessian</strong>/Jacobian matrix.<br />
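The counterexample above is easy to verify numerically; the Hessian formula below follows directly from $L(\theta) = x^4 + x^2 + y^2 + xy$.

```python
import numpy as np

# Hessian of L(theta) = x^4 + x^2 + y^2 + x*y; it does not depend on y.
H = lambda x, y: np.array([[12*x**2 + 2.0, 1.0],
                           [1.0,            2.0]])

H0 = H(0.0, 0.0)        # theta  = (0, 0)^T
H2 = H(2.0, 0.0)        # theta' = (2, 0)^T

M = H0 @ H2.T + H2 @ H0.T
print(M)                               # the matrix [[202, 56], [56, 10]]
print(np.linalg.eigvalsh(M).min())     # negative -> (D.1) is violated
```

The negative determinant ($202 \cdot 10 - 56^2 = -1116$) already shows $M$ has a negative eigenvalue, even though each individual Hessian here is positive definite.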
This tends to lead to nontrivial regularity conditions, as seen in the $\hat\theta_k$-dependent conditions of ASP and in the stringent conditions that have appeared in the literature for other approaches.
There appear to be no easy conditions for establishing rigorous properties of adaptive algorithms. However, given that all of these approaches have a strong intuitive appeal based on analogies to deterministic optimization, practical users will focus less on the nuances of the regularity conditions and more on the cost of implementation (e.g., the number of function measurements needed), the ease of implementation, and the practical performance.
List of Publications Directly Related to the Dissertation
1) Jorge Medina Martínez, Mariko Nakano Miyatake, Kazushi Nakano, Héctor Pérez Meana: Low Complexity Cascade Lattice IIR Adaptive Filter Algorithms using Simultaneous Perturbations Approach, WSEAS Transactions on Communications, Vol. 10, No. 10, pp. 1058-1068 (2005).
(Related to the contents of Chap. 4).
2) Jorge Ivan Medina Martinez, Kazushi Nakano, Kohji Higuchi: Parameter Estimation using a Modified Version of SPSA Algorithm Applied to State Space Models, IEEJ Transactions on Industry Applications, Vol. 129, No. 12, Sec. D (2009).
(Related to the contents of Chap. 5).
3) Jorge Ivan Medina Martinez, Kazushi Nakano, Sawut Umerujan: Vibration Suppression Control of a Flexible Arm using Non-linear Observer with Simultaneous Perturbation Stochastic Approximation, Journal of Artificial Life and Robotics, Vol. 14 (2009).
(Related to the contents of Chap. 3).
4) Jorge Ivan Medina Martinez, Kazushi Nakano, Kohji Higuchi: New Approach for IIR Adaptive Lattice Filter Structure using Simultaneous Perturbation Algorithm, IEEJ Transactions on Industry Applications, Vol. 130, No. 4, Sec. D (2010).
(Related to the contents of Chap. 4).
List of Other Publications and Presentations
-Presentations in International Symposia
1) Jorge Ivan Medina Martinez, Kazushi Nakano: Neural Control of a Flexible Arm System using Simultaneous Perturbation Method, SICE 7th Annual Conference on Control Systems, March 6-8, 2007, Chofu, Tokyo, Japan.
2) Jorge Ivan Medina Martinez, Kazushi Nakano, Sawut Umerujan: Simultaneous Perturbation Approach to Neural Control of a Flexible System, ECTI-CON 2007, Mae Fah Luang University, Chiang Rai, Thailand, May 9-12, 2007.
3) Jorge Ivan Medina Martinez, Kazushi Nakano, Sawut Umerujan: Cascade Lattice IIR Adaptive Filter Structure using Simultaneous Perturbation Method for Self-Adjusting SHARF Algorithm, International Conference on Instrumentation, Control and Information Technology (SICE Annual Conference 2008), Aug. 20-22, 2008, The University of Electro-Communications, Chofu, Tokyo, Japan.
(Related to the contents of Chap. 5).
4) Jorge Ivan Medina Martinez, Sawut Umerujan, Kazushi Nakano: Application of Non-linear Observer with Simultaneous Perturbation Stochastic Approximation Method to Single Flexible Link SMC, International Conference on Instrumentation, Control and Information Technology (SICE Annual Conference 2008), Aug. 20-22, 2008, The University of Electro-Communications, Chofu, Tokyo, Japan.
(Related to the contents of Chap. 4).
5) Jorge Ivan Medina Martinez, Sawut Umerujan, Kazushi Nakano: Vibration Suppression Control of a Flexible Arm using Non-linear Observer with Simultaneous Perturbation Stochastic Approximation, The Fourteenth International Symposium on Artificial Life and Robotics (AROB 14th '09), Feb. 5-7, 2009, B-Con Plaza, Beppu, Oita, Japan.
(Related to the contents of Chap. 4).
6) Jorge Ivan Medina Martinez, Kazushi Nakano, Kohji Higuchi: Parameters Estimation in Neural Networks by Improved Version of Simultaneous Perturbation Stochastic Approximation Algorithm, ICCAS-SICE 2009, August 18-21, 2009, Fukuoka, Japan.
-Other Publications, Presentations and Submissions
1) Jorge Ivan Medina Martinez, Kazushi Nakano: Development of an IIR Adaptive Filter with Low Computational Complexity using Simultaneous Perturbation Method, 2nd KMUTT-UEC Workshop, May 14, 2007, King Mongkut's University of Technology Thonburi, Bangkok, Thailand.
2) Jorge Ivan Medina Martinez, Kazushi Nakano: A Fast Converging and Self-Adjusting SHARF Algorithm using Simultaneous Perturbation Method and Vibration Control of a Flexible Arm using Non-linear Observer with Simultaneous Perturbation Stochastic Approximation Method, 3rd KMUTT-UEC Workshop, August 19, 2008, The University of Electro-Communications, Chofu, Tokyo, Japan.
Acknowledgements
This dissertation is a summary of my doctoral study at the Department of Electronic Engineering of the University of Electro-Communications. This work would not have been accomplished without the help of so many people. The following is a brief account of some, but not all, of those who deserve my thanks.
I would like to extend my deepest thanks to my supervisor, Prof. Kazushi Nakano, for taking on the burden of supervising my research work in his laboratory for so long, from the beginning in October 2006 up to the conclusion of this work in December 2009. It has been my pleasure to have the chance to do this research under his supervision, and I have also enjoyed the research life in his laboratory.
My special thanks are due to all the reviewers:
Prof. Kohji Higuchi
Prof. Masahide Kaneko
Prof. Tetsuro Kirimoto
Prof. Takayuki Inaba
Prof. Seiichi Shin
Also, my special thanks go to our research group, both past and present, for their helpful cooperation over the years. They have all been very kind to me and provided a nice and friendly environment during these years.
My gratitude goes to the Ministry of Education, Science and Culture of Japan, which granted me this opportunity and financially supported this work. I am thankful to the administrative staff of the Department of Electronic Engineering and the Foreign Students Affairs Office at the University of Electro-Communications for their amiability and effective support.
Finally, I would like to give special thanks to my family and friends for their love, warm support and encouragement.
Author Biography
Jorge Ivan Medina Martinez was born in Mexico City, Mexico, on April 23, 1978. He received the Master of Science degree from the National Polytechnic Institute, Mexico City, Mexico, in 2005. Since 2006, he has been with the Department of Electronic Engineering at the University of Electro-Communications, Tokyo, Japan, working toward his Ph.D. degree. His research interests include signal processing and control using SPSA.