Adaptive Critic Designs - Neural Networks, IEEE ... - IEEE Xplore

PROKHOROV AND WUNSCH: ADAPTIVE CRITIC DESIGNS 1003

… above. Each training cycle is continued until the network's weights converge (i.e., until the weight update in the procedure above becomes negligible). It is also suggested to use a new randomly chosen initial vector on every return to the beginning of the critic's training cycle (line 1.6 is modified to draw this new initial vector; continue from 1.0). It is argued that whenever the action network's weights converge one has a stable control, and that such a training procedure eventually finds the optimal control sequence.

While the theory behind classical dynamic programming demands choosing the optimal action vector of (22) for each training cycle of the action network, we suggest incremental learning of the action network in the training procedure above. The action vector produced at the end of the action network's training cycle does not necessarily match the optimal vector. However, our experience [28], [30], [44], [46], along with successful results in [33], [38], and [43], indicates that choosing this vector precisely is not critical.

No training procedure currently exists that explicitly addresses the issues of an inaccurate or uncertain model network. It appears that model network errors of as much as 20% are tolerable, and ACD's trained with such inaccurate model networks are nevertheless sufficiently robust [30]. Although this seems consistent with assessments of the robustness of conventional neurocontrol (model-reference control with neural networks) [31], [32], further research on the robustness of control with ACD's is needed, and we are currently pursuing this work.

To allow use of the training procedure above in the presence of the model network's inaccuracies, we suggest running the model network concurrently with the actual plant or with another model that imitates the plant more accurately than the model network but, unlike that network, is not differentiable. The plant's outputs are then fed into the model network every so often (usually, every time step) to provide the necessary alignments and prevent errors of multiple-step-ahead predictions from accumulating.
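As a minimal numerical sketch of why this per-step re-alignment helps (not from the paper: the one-dimensional plant, the mismatched model, and all constants are invented for illustration), compare multi-step prediction with and without feeding the plant's output back into the model at every step:

```python
import numpy as np

def plant(x, u):
    """Stand-in for the actual plant (hypothetical 1-D linear dynamics)."""
    return 0.9 * x + 0.5 * u

def model(x, u):
    """Differentiable model network with deliberate parameter error."""
    return 0.75 * x + 0.6 * u

def rollout(x0, controls, realign=True):
    """Multi-step prediction with the model network.

    realign=True feeds the plant's output into the model at every
    time step (the concurrent, series-parallel arrangement), so only
    one-step errors appear; realign=False lets the model feed back
    its own predictions, so errors accumulate over the horizon.
    """
    x_plant = x_model = x0
    preds = []
    for u in controls:
        x_model = model(x_plant if realign else x_model, u)
        x_plant = plant(x_plant, u)
        preds.append((x_plant, x_model))
    return preds

controls = [1.0] * 50  # constant control, for a deterministic comparison
sp = rollout(1.0, controls, realign=True)
par = rollout(1.0, controls, realign=False)
err_sp = max(abs(p - m) for p, m in sp)
err_par = max(abs(p - m) for p, m in par)
# With re-alignment the prediction error stays bounded by the one-step
# model error; without it, the model drifts to its own equilibrium and
# the multiple-step-ahead error grows several-fold.
```

The same mechanism carries over to nonlinear plants and neural-network models: re-alignment bounds the prediction error by the model's one-step accuracy instead of letting it compound.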
Such a concurrently running arrangement is known under different names, including teacher forcing [35] and the series-parallel model [36]. After this arrangement is incorporated in an ACD, the critic will usually input the plant's outputs, rather than the predicted ones from the model network. Thus, the model network is mainly utilized to calculate the auxiliary derivatives (the Jacobians of the model's outputs with respect to its state and action inputs).

IV. SIMPLE ACD'S VERSUS ADVANCED ACD'S

The use of derivatives of an optimization criterion, rather than the optimization criterion itself, is known to be the most important information to have in order to find an acceptable solution. In the simple ACD's, HDP and ADHDP, this information is obtained indirectly: by backpropagation through the critic network. This has a potential problem of being too coarse, since the critic network in HDP is not trained to approximate the derivatives of J directly. An approach to improve the accuracy of this approximation has been proposed in [27]. It is suggested to explore a set of trajectories bordering a volume around the nominal trajectory of the plant during the critic's training, rather than the nominal trajectory alone. In spite of this enhancement, we still expect better performance from the advanced ACD's.

Furthermore, Baird [39] showed that the shorter the discretization interval becomes, the slower the training of ADHDP proceeds. In continuous time, it is completely incapable of learning.

DHP and ADDHP have an important advantage over the simple ACD's since their critic networks build a representation for the derivatives of J by being explicitly trained on them. For instance, in the area of model-based control we usually have a sufficiently accurate model network with well-defined Jacobians. To adapt the action network we ultimately need the derivatives of J with respect to the action or the state, rather than the function J itself. But an approximation of these derivatives is already a direct output of the DHP and ADDHP critics.
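This distinction can be illustrated with a deliberately simple, hypothetical analogy (not from the paper): polynomial fits stand in for the critic networks, and the true cost-to-go is taken as J(x) = x², so its derivative is 2x. The HDP-style route fits J itself and then differentiates the fit; the DHP-style route fits derivative targets directly. At equal noise and model size, differentiating the fit amplifies the fitting noise:

```python
import numpy as np

# Hypothetical 1-D state grid; true cost-to-go J(x) = x**2, dJ/dx = 2x.
x = np.linspace(-1.0, 1.0, 41)
true_deriv = 2.0 * x

hdp_errs, dhp_errs = [], []
for seed in range(20):
    # Same noise realization is added to both kinds of targets.
    noise = 0.05 * np.random.default_rng(seed).standard_normal(x.shape)

    # HDP-style (indirect): approximate J, then differentiate the fit.
    j_fit = np.polynomial.Polynomial.fit(x, x**2 + noise, deg=4)
    hdp_errs.append(np.max(np.abs(j_fit.deriv()(x) - true_deriv)))

    # DHP-style (direct): approximate dJ/dx from derivative targets.
    d_fit = np.polynomial.Polynomial.fit(x, true_deriv + noise, deg=4)
    dhp_errs.append(np.max(np.abs(d_fit(x) - true_deriv)))

err_hdp = float(np.mean(hdp_errs))
err_dhp = float(np.mean(dhp_errs))
# Differentiation amplifies the noisy high-order coefficients of the
# fit, so the indirect derivative estimate is markedly less accurate
# on average than the directly fitted one.
```

The analogy is loose (least-squares polynomials in place of neural networks trained by backpropagation), but it shows the mechanism behind the advantage claimed for DHP and ADDHP critics.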
Although multilayer neural networks are well known to be universal approximators not only of a function itself (the direct output of the network) but also of its derivatives with respect to the network's inputs (an indirect output obtained through backpropagation) [41], we note that the quality of such a direct approximation is always better than that of any indirect approximation for given sizes of the network and the training data. Work on a formal proof of this advantage of DHP and ADDHP is currently in progress, but the reader is referred to Section V for our experimental justification.

Critic networks in GDHP and ADGDHP directly approximate not only the function J but also its derivatives. Knowing both J and its derivatives is useful in problems where the availability of global information associated with the function itself is as important as knowledge of the slope of J, i.e., the derivatives of J [40]. Besides, any shift of attention paid to the values of J or its derivatives during training can readily be accommodated by selecting unequal learning rates in (11) (see Section II-C). In Section II-C we described three GDHP designs. While the design of Fig. 5 seems to be the most straightforward and beneficial from the viewpoint of small computational expense, the designs of Figs. 3 and 4 use the critic network more efficiently.

Advanced ACD's include DHP, ADDHP, GDHP, and ADGDHP, the latter two being capable of emulating all the previous ACD's. All these designs assume the availability of the model network. Along with direct approximation of the derivatives of J, this contributes to the superior performance of the advanced ACD's over the simple ones (see the next section for examples of performance comparison). Although the final selection among advanced ACD's should certainly be based on comparative results, we believe that in many applications the use of DHP or ADDHP is quite enough. We also note that the AD forms of the designs may have an advantage in training recurrent action networks.

V. EXPERIMENTAL STUDIES

This section provides an overview of our experimental work on applying various ACD's to control of dynamic systems. For detailed information on interesting experiments carried out by
