Approximation of Hessian Matrix for Second-order SPSA Algorithm ...
1.1 MOTIVATION AND BACKGROUND
Fig. 1.1. Example of stochastic optimization algorithm minimizing loss function L(θ1, θ2).
This dissertation focuses on the case where such an approximation is used because direct gradient information is not readily available. Overall, gradient-free stochastic algorithms exhibit convergence properties similar to those of gradient-based stochastic algorithms [e.g., Robbins-Monro stochastic approximation (R-M SA)] while requiring only loss function measurements [5], [6]. A main advantage of such algorithms is that they do not require the detailed knowledge of the functional relationship between the parameters being adjusted (optimized) and the loss function being minimized that is required in gradient-based algorithms.
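As a concrete illustration of an algorithm that needs only loss measurements, the sketch below implements a plain first-order SPSA recursion: each gradient estimate is built from just two loss evaluations at simultaneously perturbed parameter values, with no analytic gradient required. The gain sequences, decay exponents, and the quadratic test loss are illustrative choices, not ones prescribed by the text.

```python
import numpy as np

def spsa_gradient(loss, theta, c, rng):
    """Estimate the gradient of `loss` at `theta` from two loss
    measurements, using one simultaneous random perturbation."""
    # Rademacher (+/-1) perturbation of every component at once
    delta = rng.choice([-1.0, 1.0], size=theta.shape)
    y_plus = loss(theta + c * delta)
    y_minus = loss(theta - c * delta)
    # All components of the estimate share the same two measurements
    return (y_plus - y_minus) / (2.0 * c * delta)

def spsa_minimize(loss, theta0, a=0.1, c=0.1, n_iter=3000, seed=0):
    """First-order SPSA recursion: theta_{k+1} = theta_k - a_k * g_hat_k."""
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, dtype=float)
    for k in range(n_iter):
        a_k = a / (k + 1) ** 0.602   # commonly used gain decay exponents
        c_k = c / (k + 1) ** 0.101
        theta = theta - a_k * spsa_gradient(loss, theta, c_k, rng)
    return theta
```

Note the contrast with a finite-difference scheme: the cost per iteration here is two loss evaluations regardless of the parameter dimension, which is the source of SPSA's efficiency in high-dimensional problems.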
Such a relationship can be notoriously difficult to develop in some areas (e.g., non-linear feedback controller design), whereas in other areas (such as Monte Carlo optimization or recursive statistical parameter estimation), there may be large computational savings in calculating a loss function relative to that required in calculating a gradient. To elaborate on the
distinction between algorithms based on direct gradient measurements and those based on gradient approximations from measurements of the loss function, the prototype gradient-based algorithm is R-M SA, which may be considered a generalization of such techniques as deterministic steepest descent and Newton–Raphson, neural network back-propagation (BP), and infinitesimal perturbation analysis–based optimization for discrete-event systems [9]. The gradient-based algorithms rely on direct measurements of the gradient of the loss function with respect to the parameters being optimized. These measurements typically yield an estimate of the gradient because the underlying data generally include added noise. Because it is not usually the case that one would obtain direct measurements of the gradient (with or without added noise) naturally in the course of operating or simulating a system, one must have detailed knowledge of the underlying system input–output relationships to calculate the R-M gradient estimate from basic system output measurements. In contrast, the approaches based on gradient-