Approximation of Hessian Matrix for Second-order SPSA Algorithm ...
where β > 0 depends on the choice of the gain sequences (a_k and c_k), µ depends on both the Hessian and the third derivatives of L(θ) at θ*, and Σ depends on the Hessian matrix (note that in general µ ≠ 0, in contrast to many well-known asymptotic normality results in estimation). Given the restrictions on the gain sequences that ensure convergence and asymptotic normality, the fastest allowable rate of convergence of θ̂_k to θ* is k^(−1/3).
In addition to establishing the formal convergence of SPSA, Spall in [18] shows that the probability distribution of an appropriately scaled θ̂_k is approximately normal (with a specified mean and covariance matrix) for large k. Spall in [18] uses the asymptotic normality result in (1.8), together with a parallel result for FDSA [9], to establish the relative efficiency of SPSA.
This efficiency depends on the shape of L(θ), the values of {a_k} and {c_k}, and the distributions of the {∆_k} and measurement noise terms. There is no single expression that can be used to characterize the relative efficiency; however, as discussed in [17], in most practical problems SPSA will be asymptotically more efficient than FDSA.
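To make the role of the gain sequences concrete, here is a minimal SPSA sketch in Python. The quadratic loss, the gain constants a, c, and A, the iteration count, and the seed are illustrative assumptions chosen only so the recursion visibly converges; the gain exponents α = 0.602 and γ = 0.101 and the Bernoulli ±1 perturbations do follow Spall's standard guidelines.

```python
import numpy as np

# Toy quadratic loss with minimizer theta* = 0 (an assumption for illustration).
def loss(theta):
    return float(np.sum(theta ** 2))

def spsa(theta0, n_iter=1000, a=0.1, c=0.1, A=100, alpha=0.602, gamma=0.101, seed=0):
    """Basic first-order SPSA recursion with decaying gain sequences."""
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, dtype=float)
    for k in range(n_iter):
        ak = a / (k + 1 + A) ** alpha   # a_k = a / (k + 1 + A)^alpha
        ck = c / (k + 1) ** gamma       # c_k = c / (k + 1)^gamma
        # Simultaneous Bernoulli +/-1 perturbation of ALL components.
        delta = rng.choice([-1.0, 1.0], size=theta.size)
        # Two loss measurements give the full gradient estimate.
        ghat = (loss(theta + ck * delta) - loss(theta - ck * delta)) / (2.0 * ck * delta)
        theta = theta - ak * ghat
    return theta

theta_hat = spsa(np.ones(5))
print(np.linalg.norm(theta_hat))  # small: theta_hat approaches theta* = 0
```

Note that only two loss measurements are taken per iteration, regardless of the dimension of θ; this is the source of the efficiency advantage discussed above.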
For example, if a_k and c_k are chosen as in the guidelines of Spall [18], then by equating the asymptotic mean squared error E(‖θ̂_k − θ*‖²) in the SPSA and FDSA algorithms, we find

No. of measurements of L(θ) in SPSA / No. of measurements of L(θ) in FDSA → 1/p
as the number of loss measurements in both procedures gets large. Hence, the above expression implies that the p-fold savings per iteration (gradient approximation) translates directly into a p-fold savings in the overall optimization process, despite the complex non-linear ways in which the sequence of gradient approximations manifests itself in the ultimate solution θ̂_k. One properly chosen simultaneous random change in all the variables in a problem provides as much information for optimization as a full set of one-at-a-time changes of each variable.
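The 1/p ratio can be verified directly by counting loss measurements per gradient approximation. The sketch below (the quadratic loss, the value of c, and the dimension p = 10 are assumptions for illustration) contrasts SPSA's two measurements per iteration with FDSA's 2p:

```python
import numpy as np

# Hypothetical loss function used only to count measurements.
def loss(theta):
    return float(np.sum(theta ** 2))

def spsa_gradient(theta, c, rng):
    """One SPSA gradient estimate: 2 loss measurements, independent of p."""
    delta = rng.choice([-1.0, 1.0], size=theta.size)  # Bernoulli +/-1
    y_plus = loss(theta + c * delta)
    y_minus = loss(theta - c * delta)
    return (y_plus - y_minus) / (2.0 * c * delta), 2

def fdsa_gradient(theta, c):
    """One two-sided finite-difference (FDSA) estimate: 2p measurements."""
    p = theta.size
    g = np.zeros(p)
    n_meas = 0
    for i in range(p):
        e = np.zeros(p)
        e[i] = 1.0  # perturb one coordinate at a time
        g[i] = (loss(theta + c * e) - loss(theta - c * e)) / (2.0 * c)
        n_meas += 2
    return g, n_meas

rng = np.random.default_rng(0)
theta = np.ones(10)  # p = 10
_, n_spsa = spsa_gradient(theta, 0.1, rng)
_, n_fdsa = fdsa_gradient(theta, 0.1)
print(n_spsa, n_fdsa, n_spsa / n_fdsa)  # prints: 2 20 0.1
```

Per gradient approximation the measurement ratio is exactly 2/(2p) = 1/p, which is the per-iteration saving that, by the asymptotic argument above, carries over to the overall optimization.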
1.8 Versions of SPSA Algorithm
The standard first-order SA algorithms for estimating θ involve a simple recursion with