Approximation of Hessian Matrix for Second-order SPSA Algorithm ...
where β > 0 depends on the choice of the gain sequences (a_k and c_k), µ depends on both the Hessian and the third derivatives of L(θ) at θ*, and Σ depends on the Hessian matrix (note that in general µ ≠ 0, in contrast to many well-known asymptotic normality results in estimation). Given the restrictions on the gain sequences that ensure convergence and asymptotic normality, the fastest allowable rate of convergence of θ̂_k to θ* is k^(−1/3).
In addition to establishing the formal convergence of SPSA, Spall in [18] shows that the probability distribution of an appropriately scaled θ̂_k is approximately normal (with a specified mean and covariance matrix) for large k. Spall in [18] uses the asymptotic normality result in (1.8), together with a parallel result for FDSA [9], to establish the relative efficiency of SPSA.
This efficiency depends on the shape of L(θ), the values of {a_k} and {c_k}, and the distributions of the {∆_k} and measurement noise terms. There is no single expression that can be used to characterize the relative efficiency; however, as discussed in [17], in most practical problems SPSA will be asymptotically more efficient than FDSA.
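To make the role of the gain sequences concrete, here is a minimal SPSA sketch in Python. The quadratic loss, the gain constants a, c, and A, the iteration count, and the seed are illustrative assumptions chosen only so the recursion visibly converges; the gain exponents α = 0.602 and γ = 0.101 and the Bernoulli ±1 perturbations do follow Spall's standard guidelines.

```python
import numpy as np

# Toy quadratic loss with minimizer theta* = 0 (an assumption for illustration).
def loss(theta):
    return float(np.sum(theta ** 2))

def spsa(theta0, n_iter=1000, a=0.1, c=0.1, A=100, alpha=0.602, gamma=0.101, seed=0):
    """Basic first-order SPSA recursion with decaying gain sequences."""
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, dtype=float)
    for k in range(n_iter):
        ak = a / (k + 1 + A) ** alpha   # a_k = a / (k + 1 + A)^alpha
        ck = c / (k + 1) ** gamma       # c_k = c / (k + 1)^gamma
        # Simultaneous Bernoulli +/-1 perturbation of ALL components.
        delta = rng.choice([-1.0, 1.0], size=theta.size)
        # Two loss measurements give the full gradient estimate.
        ghat = (loss(theta + ck * delta) - loss(theta - ck * delta)) / (2.0 * ck * delta)
        theta = theta - ak * ghat
    return theta

theta_hat = spsa(np.ones(5))
print(np.linalg.norm(theta_hat))  # small: theta_hat approaches theta* = 0
```

Note that only two loss measurements are taken per iteration, regardless of the dimension of θ; this is the source of the efficiency advantage discussed above.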
For example, if a_k and c_k are chosen as in the guidelines of Spall [18], then by equating the asymptotic mean squared error E(‖θ̂_k − θ*‖²) in the SPSA and FDSA algorithms, we find

No. of measurements of L(θ) in SPSA / No. of measurements of L(θ) in FDSA → 1/p
as the number of loss measurements in both procedures gets large. Hence, the above expression implies that the p-fold savings per iteration (gradient approximation) translates directly into a p-fold savings in the overall optimization process, despite the complex non-linear ways in which the sequence of gradient approximations manifests itself in the ultimate solution θ̂_k. One properly chosen simultaneous random change in all the variables in a problem provides as much information for optimization as a full set of one-at-a-time changes of each variable.
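The 1/p ratio can be verified directly by counting loss measurements per gradient approximation. The sketch below (the quadratic loss, the value of c, and the dimension p = 10 are assumptions for illustration) contrasts SPSA's two measurements per iteration with FDSA's 2p:

```python
import numpy as np

# Hypothetical loss function used only to count measurements.
def loss(theta):
    return float(np.sum(theta ** 2))

def spsa_gradient(theta, c, rng):
    """One SPSA gradient estimate: 2 loss measurements, independent of p."""
    delta = rng.choice([-1.0, 1.0], size=theta.size)  # Bernoulli +/-1
    y_plus = loss(theta + c * delta)
    y_minus = loss(theta - c * delta)
    return (y_plus - y_minus) / (2.0 * c * delta), 2

def fdsa_gradient(theta, c):
    """One two-sided finite-difference (FDSA) estimate: 2p measurements."""
    p = theta.size
    g = np.zeros(p)
    n_meas = 0
    for i in range(p):
        e = np.zeros(p)
        e[i] = 1.0  # perturb one coordinate at a time
        g[i] = (loss(theta + c * e) - loss(theta - c * e)) / (2.0 * c)
        n_meas += 2
    return g, n_meas

rng = np.random.default_rng(0)
theta = np.ones(10)  # p = 10
_, n_spsa = spsa_gradient(theta, 0.1, rng)
_, n_fdsa = fdsa_gradient(theta, 0.1)
print(n_spsa, n_fdsa, n_spsa / n_fdsa)  # prints: 2 20 0.1
```

Per gradient approximation the measurement ratio is exactly 2/(2p) = 1/p, which is the per-iteration saving that, by the asymptotic argument above, carries over to the overall optimization.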
1.8 Versions of SPSA Algorithm
The standard first-order SA algorithms for estimating θ involve a simple recursion with