D. Wang et al. / Neurocomputing 78 (2012) 14–22

we can obtain

$J(e_k, \hat{u}_k^{k+1}) = U(e_k, v_0(e_k)) + U(e_{k+1}, \hat{u}_{k+1}) = U(e_k, v_0(e_k)) = V_1(e_k).$

On the other hand, according to (20), we have

$V_2(e_k) = \min\{J(e_k, u_k^{k+1}) : u_k^{k+1} \in A_{e_k}^{(2)}\},$

which reveals that

$V_2(e_k) \le J(e_k, \hat{u}_k^{k+1}) = V_1(e_k).$   (22)

Therefore, the theorem holds for $i = 1$.

Next, assume that the theorem holds for $i = q$, where $q > 1$. The current cost function can be expressed as

$V_q(e_k) = \sum_{j=0}^{q-1} U(e_{k+j}, v_{q-1-j}(e_{k+j})),$

where $\hat{u}_k^{k+q-1} = (v_{q-1}(e_k), v_{q-2}(e_{k+1}), \ldots, v_0(e_{k+q-1}))$ is the corresponding finite-horizon admissible control sequence.

Then, for $i = q+1$, we can construct a control sequence $\hat{u}_k^{k+q} = (v_{q-1}(e_k), v_{q-2}(e_{k+1}), \ldots, v_0(e_{k+q-1}), 0)$ of length $q+1$, under which the error trajectory is given as $e_k$, $e_{k+1} = F(e_k, v_{q-1}(e_k))$, $e_{k+2} = F(e_{k+1}, v_{q-2}(e_{k+1}))$, ..., $e_{k+q} = F(e_{k+q-1}, v_0(e_{k+q-1})) = 0$, $e_{k+q+1} = F(e_{k+q}, \hat{u}_{k+q}) = F(0, 0) = 0$. This shows that $\hat{u}_k^{k+q}$ is a finite-horizon admissible control sequence. Since $U(e_{k+q}, \hat{u}_{k+q}) = U(0, 0) = 0$, we can acquire

$J(e_k, \hat{u}_k^{k+q}) = U(e_k, v_{q-1}(e_k)) + U(e_{k+1}, v_{q-2}(e_{k+1})) + \cdots + U(e_{k+q-1}, v_0(e_{k+q-1})) + U(e_{k+q}, \hat{u}_{k+q}) = \sum_{j=0}^{q-1} U(e_{k+j}, v_{q-1-j}(e_{k+j})) = V_q(e_k).$

On the other hand, according to (20), we have

$V_{q+1}(e_k) = \min\{J(e_k, u_k^{k+q}) : u_k^{k+q} \in A_{e_k}^{(q+1)}\},$

which implies that

$V_{q+1}(e_k) \le J(e_k, \hat{u}_k^{k+q}) = V_q(e_k).$   (23)

Accordingly, we complete the proof by mathematical induction. □

We have concluded that the cost function sequence $\{V_i(e_k)\}$ is monotonically nonincreasing and bounded below, and therefore its limit exists. We denote it by $V_\infty(e_k)$, i.e., $\lim_{i \to \infty} V_i(e_k) = V_\infty(e_k)$. Next, let us consider what happens when we let $i \to \infty$ in (17).

Theorem 2. For any discrete time step $k$ and tracking error $e_k$, the following equation holds:

$V_\infty(e_k) = \min_{u_k}\{U(e_k, u_k) + V_\infty(e_{k+1})\}.$   (24)

Proof. For any admissible control $\tau_k = \tau(e_k)$ and any $i$, according to Theorem 1 and (17), we have

$V_\infty(e_k) \le V_{i+1}(e_k) = \min_{u_k}\{U(e_k, u_k) + V_i(e_{k+1})\} \le U(e_k, \tau_k) + V_i(e_{k+1}).$

Letting $i \to \infty$, we get

$V_\infty(e_k) \le U(e_k, \tau_k) + V_\infty(e_{k+1}).$

Note that in the above inequality, $\tau_k$ is chosen arbitrarily. Thus, we can obtain

$V_\infty(e_k) \le \min_{u_k}\{U(e_k, u_k) + V_\infty(e_{k+1})\}.$   (25)

On the other hand, let $\delta > 0$ be an arbitrary positive number. Then, there exists a positive integer $l$ such that

$V_l(e_k) - \delta \le V_\infty(e_k) \le V_l(e_k)$   (26)

because $V_i(e_k)$ is nonincreasing for $i \ge 1$ with $V_\infty(e_k)$ as its limit. Besides, from (17), we can acquire

$V_l(e_k) = \min_{u_k}\{U(e_k, u_k) + V_{l-1}(e_{k+1})\} = U(e_k, v_{l-1}(e_k)) + V_{l-1}(F(e_k, v_{l-1}(e_k))).$

Combining with (26), we can obtain

$V_\infty(e_k) \ge U(e_k, v_{l-1}(e_k)) + V_{l-1}(F(e_k, v_{l-1}(e_k))) - \delta \ge U(e_k, v_{l-1}(e_k)) + V_\infty(F(e_k, v_{l-1}(e_k))) - \delta \ge \min_{u_k}\{U(e_k, u_k) + V_\infty(e_{k+1})\} - \delta,$

which reveals that

$V_\infty(e_k) \ge \min_{u_k}\{U(e_k, u_k) + V_\infty(e_{k+1})\}$   (27)

because of the arbitrariness of $\delta$. Based on (25) and (27), we can conclude that (24) is true. □
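To see the monotonicity of Theorem 1 concretely, the following minimal sketch instantiates the value-iteration recursion (17) on a hypothetical scalar linear-quadratic error system (the values of $a$, $b$, $q$, $r$ are illustrative assumptions, not taken from the paper). With quadratic utility, every $V_i$ has the form $p_i e^2$, so the minimization in (17) can be done in closed form.

```python
import numpy as np

# Hypothetical scalar LQ instance of the recursion (17):
# error dynamics e_{k+1} = a*e_k + b*u_k, utility U(e, u) = q*e^2 + r*u^2.
a, b, q, r = 1.2, 1.0, 1.0, 1.0

# With quadratic utility, each cost function is quadratic: V_i(e) = p_i*e^2.
# Initialization in the spirit of (14)-(15): v_0 drives the error to zero
# in one step (deadbeat), giving p_1 = q + r*(a/b)**2.
p = q + r * (a / b) ** 2
history = [p]

# Update (17): V_{i+1}(e) = min_u { U(e, u) + V_i(a*e + b*u) }.
# Minimizing over u in closed form yields p_{i+1} = q + a^2*r*p_i/(r + b^2*p_i).
for _ in range(30):
    p = q + a**2 * r * p / (r + b**2 * p)
    history.append(p)

# Theorem 1: {V_i(e_k)} is monotonically nonincreasing and bounded below.
assert all(p2 <= p1 + 1e-12 for p1, p2 in zip(history, history[1:]))
print(history[:4], "->", history[-1])
```

Running the sketch, $p_1 \approx 2.44$ decreases monotonically toward a limit near $1.95$, mirroring the nonincreasing, bounded-below behavior established above.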
Next, we prove that the cost function sequence $\{V_i\}$ converges to the optimal cost function $J^*$ as $i \to \infty$.

Theorem 3. Define the cost function sequence $\{V_i\}$ as in (17) with $V_0(\cdot) = 0$. If the system state $e_k$ is controllable, then $J^*$ is the limit of the cost function sequence $\{V_i\}$, i.e., $V_\infty(e_k) = J^*(e_k)$.

Proof. On one hand, in accordance with (9) and (20), we can acquire

$J^*(e_k) = \inf_{u_k}\{J(e_k, u_k) : u_k \in A_{e_k}\} \le \min\{J(e_k, u_k^{k+i-1}) : u_k^{k+i-1} \in A_{e_k}^{(i)}\} = V_i(e_k).$

Letting $i \to \infty$, we get

$J^*(e_k) \le V_\infty(e_k).$   (28)

On the other hand, according to the definition of $J^*(e_k)$, for any $\eta > 0$ there exists an admissible control sequence $\sigma_k \in A_{e_k}$ such that

$J(e_k, \sigma_k) \le J^*(e_k) + \eta.$   (29)

Now, suppose that $|\sigma_k| = q$, which shows that $\sigma_k \in A_{e_k}^{(q)}$. Then, using Theorem 1 and (20), we can obtain

$V_\infty(e_k) \le V_q(e_k) = \min\{J(e_k, u_k^{k+q-1}) : u_k^{k+q-1} \in A_{e_k}^{(q)}\} \le J(e_k, \sigma_k).$

Combining with (29), we get

$V_\infty(e_k) \le J^*(e_k) + \eta.$

Noticing that $\eta$ is chosen arbitrarily in the above expression, we have

$V_\infty(e_k) \le J^*(e_k).$   (30)

Based on (28) and (30), we can conclude that $J^*(e_k)$ is the limit of the cost function sequence $\{V_i\}$ as $i \to \infty$, i.e., $V_\infty(e_k) = J^*(e_k)$. □

From Theorems 1–3, we obtain that the cost function sequence $\{V_i(e_k)\}$ converges to the optimal cost function $J^*(e_k)$ of the DTHJB equation, i.e., $V_i \to J^*$ as $i \to \infty$. Then, according to (12) and (16), we can conclude the convergence of the corresponding control law sequence. Now, we present the following corollary.

Corollary 1. Define the cost function sequence $\{V_i\}$ as in (17) with $V_0(\cdot) = 0$, and the control law sequence $\{v_i\}$ as in (16). If the system state $e_k$ is controllable, then the sequence $\{v_i\}$ converges to the optimal control law $u^*$ as $i \to \infty$, i.e., $\lim_{i \to \infty} v_i(e_k) = u^*(e_k)$.
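Continuing the same toy LQ instance, Theorem 3 can be checked numerically: the limit of the recursion should coincide with the optimal cost, which for this scalar problem is available independently from the discrete-time algebraic Riccati equation via scipy. This is a sketch under the same illustrative assumptions as before, not the paper's setting.

```python
import numpy as np
from scipy.linalg import solve_discrete_are

a, b, q, r = 1.2, 1.0, 1.0, 1.0           # same hypothetical instance

# Iterate the closed-form recursion induced by (17) to numerical convergence.
p = q + r * (a / b) ** 2                   # deadbeat initialization
while True:
    p_next = q + a**2 * r * p / (r + b**2 * p)
    if abs(p - p_next) <= 1e-12:
        break
    p = p_next

# Theorem 3: V_infty(e) = J*(e) = p_star*e^2, where p_star solves the DARE.
p_star = solve_discrete_are(np.array([[a]]), np.array([[b]]),
                            np.array([[q]]), np.array([[r]]))[0, 0]
print(abs(p - p_star))                     # ~0 up to the stopping tolerance
```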


3.3. The ε-optimal control algorithm

According to Theorems 1–3 and Corollary 1, we would have to run the iterative ADP algorithm (14)–(17) until $i \to \infty$ to obtain the optimal cost function $J^*(e_k)$, and then get a control vector $v_\infty(e_k)$ from which the control sequence $u_\infty(e_k) = (v_\infty(e_k), v_\infty(e_{k+1}), \ldots, v_\infty(e_{k+i}), \ldots)$ could be constructed to drive the state to the target. Obviously, $u_\infty(e_k)$ has infinite length. Though feasible in theory, this is not practical, because most real-world systems need to be controlled effectively within a finite horizon. Therefore, in this section, we propose a novel ε-optimal control strategy based on the iterative ADP algorithm. The idea is that, for a given error bound $\varepsilon > 0$, the iteration number $i$ is chosen so that the error between $V_i(e_k)$ and $J^*(e_k)$ is within the bound.

Let $\varepsilon > 0$ be any small number, $e_k$ be any controllable state, and $J^*(e_k)$ be the optimal value of the cost function sequence defined as in (17). From Theorem 3, it is clear that there exists a finite $i$ such that

$|V_i(e_k) - J^*(e_k)| \le \varepsilon.$   (31)

The length of the optimal control sequence starting from $e_k$ with respect to $\varepsilon$ is defined as

$K_\varepsilon(e_k) = \min\{i : |V_i(e_k) - J^*(e_k)| \le \varepsilon\}.$   (32)

The corresponding control law

$v_{i-1}(e_k) = \arg\min_{u_k}\{U(e_k, u_k) + V_{i-1}(e_{k+1})\} = -\frac{1}{2} R^{-1} g^{\mathrm{T}}(e_k + r_k)\,\frac{\partial V_{i-1}(e_{k+1})}{\partial e_{k+1}}$   (33)

is called the ε-optimal control and is denoted as $\mu_\varepsilon^*(e_k)$.

In this sense, an error $\varepsilon$ between $V_i(e_k)$ and $J^*(e_k)$ is introduced into the iterative ADP algorithm, which makes the cost function sequence $\{V_i(e_k)\}$ converge within a finite number of iteration steps.

However, the optimality criterion (31) is difficult to verify because the optimal cost function $J^*(e_k)$ is unknown in general. Consequently, we use the equivalent criterion

$|V_i(e_k) - V_{i+1}(e_k)| \le \varepsilon$   (34)

to replace (31).

In fact, if $|V_i(e_k) - J^*(e_k)| \le \varepsilon$ holds, we have $V_i(e_k) \le J^*(e_k) + \varepsilon$. Combining this with $J^*(e_k) \le V_{i+1}(e_k) \le V_i(e_k)$, we find that

$0 \le V_i(e_k) - V_{i+1}(e_k) \le \varepsilon,$

which means $|V_i(e_k) - V_{i+1}(e_k)| \le \varepsilon$. On the other hand, according to Theorem 3, $|V_i(e_k) - V_{i+1}(e_k)| \to 0$ implies that $V_i(e_k) \to J^*(e_k)$. As a result, if $|V_i(e_k) - V_{i+1}(e_k)| \le \varepsilon$ holds for a given small $\varepsilon$, we can conclude that $|V_i(e_k) - J^*(e_k)| \le \varepsilon$ holds when $i$ is sufficiently large. A minimal sketch of this stopping rule follows.
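The surrogate test (34) translates directly into a stopping rule. In the sketch below, `value_update` is a hypothetical placeholder for one application of the recursion (17) at a fixed $e_k$ (in practice, one critic/action training iteration), and the returned count plays the role of $K_\varepsilon(e_k)$ in (32).

```python
from typing import Callable, Tuple

def k_epsilon(v1: float, value_update: Callable[[float], float],
              eps: float, i_max: int) -> Tuple[int, float]:
    """Iterate V_{i+1} = value_update(V_i) from V_1 until criterion (34),
    |V_i - V_{i+1}| <= eps, holds; returns (iteration count, final cost)."""
    v, i = v1, 1
    while i <= i_max:
        v_next = value_update(v)
        if abs(v - v_next) <= eps:        # criterion (34)
            return i, v_next
        v, i = v_next, i + 1
    return i_max, v                       # bound not met within i_max

# Usage with the scalar LQ recursion from the earlier sketches:
a, b, q, r = 1.2, 1.0, 1.0, 1.0
steps, cost = k_epsilon(q + r * (a / b) ** 2,
                        lambda p: q + a**2 * r * p / (r + b**2 * p),
                        eps=1e-5, i_max=100)
print(steps, cost)
```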
3.4. Design procedure of the finite-horizon optimal tracking control scheme using the iterative ADP algorithm

In this section, we give the detailed design procedure of the finite-horizon nonlinear optimal tracking control scheme using the iterative ADP algorithm.

Step 1: Specify an error bound $\varepsilon$ for the given initial state $x_0$. Choose $i_{\max}$, the reference trajectory $r_k$, and the matrices $Q$ and $R$.
Step 2: Compute $e_k$ according to (2) and (3).
Step 3: Set $i = 0$ and $V_0(e_k) = 0$. Obtain the initial finite-horizon admissible vector $v_0(e_k)$ by (14) and update the cost function $V_1(e_k)$ by (15).
Step 4: Set $i = i + 1$.
Step 5: Compute $v_i(e_k)$ by (16) and the corresponding cost function $V_{i+1}(e_k)$ by (17).
Step 6: If $|V_i(e_k) - V_{i+1}(e_k)| \le \varepsilon$, go to Step 8; otherwise, go to Step 7.
Step 7: If $i > i_{\max}$, go to Step 8; otherwise, go to Step 4.
Step 8: Stop.

After the optimal control law $u^*(e_k)$ for system (6) is derived under the given error bound $\varepsilon$, we can compute the optimal tracking control input for the original system (1) by

$u_{pk} = u^*(e_k) + u_{dk} = u^*(e_k) + g^{-1}(r_k)\,(\phi(r_k) - f(r_k)).$   (35)

In the following section, we describe the NN-based implementation of the iterative ADP algorithm in detail.

4. NN implementation of the iterative ADP algorithm via HDP technique

Now, we implement the iterative HDP algorithm in (14)–(17) using NNs. The iterative HDP algorithm employs three networks: the model network, the critic network, and the action network. All of them are chosen as three-layer feedforward NNs. The inputs of the critic network and the action network are $e_k$, while the inputs of the model network are $e_k$ and $\hat{v}_i(e_k)$. The structure diagram of the iterative HDP algorithm is shown in Fig. 1.

[Fig. 1. The structure diagram of the iterative HDP algorithm: the action network maps $e_k$ to $\hat{v}_i(e_k)$; the model network maps $(e_k, \hat{v}_i(e_k))$ to $\hat{e}_{k+1}$; the critic network produces $\hat{V}_{i+1}(e_k)$ and $\hat{V}_i(\hat{e}_{k+1})$, combined with the utility $U(e_k, u_k)$; signal lines, back-propagating paths, and weight transmission are indicated.]

4.1. The model network

The purpose of the model network is to approximate the error dynamics. The model network must be trained before the iterative HDP algorithm is carried out. For given $e_k$ and $\hat{v}_i(e_k)$, the output of the model network is

$\hat{e}_{k+1} = \omega_m^{\mathrm{T}}\,\sigma(\nu_m^{\mathrm{T}} z_k),$   (36)

where $z_k = [e_k^{\mathrm{T}}\ \ \hat{v}_i^{\mathrm{T}}(e_k)]^{\mathrm{T}}$.

We define the error function of the model network as

$e_{mk} = \hat{e}_{k+1} - e_{k+1}.$   (37)

The weights of the model network are updated to minimize the performance measure

$E_{mk} = \tfrac{1}{2} e_{mk}^{\mathrm{T}} e_{mk}.$   (38)

Using the gradient-based adaptation rule, the weights are updated as

$\omega_m(j+1) = \omega_m(j) - \alpha_m \dfrac{\partial E_{mk}}{\partial \omega_m(j)},$   (39)

$\nu_m(j+1) = \nu_m(j) - \alpha_m \dfrac{\partial E_{mk}}{\partial \nu_m(j)},$   (40)

where $\alpha_m > 0$ is the learning rate of the model network, and $j$ is the iterative step for updating the weight parameters. After the model network is trained, its weights are kept unchanged.
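As an illustration of (36)–(40), the following numpy sketch trains a three-layer model network by plain gradient descent. The layer width, the tanh activation, the learning rate, and the data source are illustrative assumptions rather than the paper's exact settings.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: n-dim error, m-dim control, H hidden neurons.
n, m, H = 2, 2, 8
nu_m = rng.uniform(-1.0, 1.0, (n + m, H))   # input-to-hidden weights
om_m = rng.uniform(-1.0, 1.0, (H, n))       # hidden-to-output weights
alpha_m = 0.1                                # learning rate

def model_forward(z):
    """Eq. (36): e_hat_{k+1} = omega_m^T sigma(nu_m^T z_k); sigma = tanh here."""
    h = np.tanh(nu_m.T @ z)
    return om_m.T @ h, h

def model_update(z, e_next):
    """One gradient step (39)-(40) on E_mk = 0.5*e_mk^T e_mk, Eqs. (37)-(38)."""
    global nu_m, om_m
    e_hat, h = model_forward(z)
    e_mk = e_hat - e_next                        # Eq. (37)
    grad_h = (om_m @ e_mk) * (1.0 - h**2)        # backprop through tanh
    om_m = om_m - alpha_m * np.outer(h, e_mk)    # Eq. (39)
    nu_m = nu_m - alpha_m * np.outer(z, grad_h)  # Eq. (40)
    return 0.5 * float(e_mk @ e_mk)              # E_mk, Eq. (38)

# Training loop over pre-collected samples z_k = [e_k; v_k] with target e_{k+1}:
# for z, e_next in dataset: model_update(z, e_next)
```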


4.2. The critic network

The critic network is used to approximate the cost function $V_i(e_k)$. Its output is denoted as

$\hat{V}_i(e_k) = \omega_{ci}^{\mathrm{T}}\,\sigma(\nu_{ci}^{\mathrm{T}} e_k).$   (41)

The target function can be written as

$V_i(e_k) = e_k^{\mathrm{T}} Q e_k + v_{i-1}^{\mathrm{T}}(e_k) R\, v_{i-1}(e_k) + \hat{V}_{i-1}(\hat{e}_{k+1}).$   (42)

Then, we define the error function for the critic network as

$e_{cik} = \hat{V}_i(e_k) - V_i(e_k).$   (43)

The objective function to be minimized for the critic network is

$E_{cik} = \tfrac{1}{2} e_{cik}^{\mathrm{T}} e_{cik}.$   (44)

The weight updating rule for training the critic network is also gradient-based adaptation:

$\omega_{ci}(j+1) = \omega_{ci}(j) - \alpha_c \dfrac{\partial E_{cik}}{\partial \omega_{ci}(j)},$   (45)

$\nu_{ci}(j+1) = \nu_{ci}(j) - \alpha_c \dfrac{\partial E_{cik}}{\partial \nu_{ci}(j)},$   (46)

where $\alpha_c > 0$ is the learning rate of the critic network, and $j$ is the inner-loop iterative step for updating the weight parameters.

4.3. The action network

In the action network, the state $e_k$ is used as the input, and the (approximately) optimal control is obtained as the output of the network. The output can be formulated as

$\hat{v}_i(e_k) = \omega_{ai}^{\mathrm{T}}\,\sigma(\nu_{ai}^{\mathrm{T}} e_k).$   (47)

The target control input is given by

$v_i(e_k) = -\frac{1}{2} R^{-1} g^{\mathrm{T}}(e_k + r_k)\,\frac{\partial \hat{V}_i(\hat{e}_{k+1})}{\partial \hat{e}_{k+1}}.$   (48)

The error function of the action network can be defined as

$e_{aik} = \hat{v}_i(e_k) - v_i(e_k).$   (49)

The weights of the action network are updated to minimize the performance error measure

$E_{aik} = \tfrac{1}{2} e_{aik}^{\mathrm{T}} e_{aik}.$   (50)

Similarly, the weight updating algorithm is

$\omega_{ai}(j+1) = \omega_{ai}(j) - \alpha_a \dfrac{\partial E_{aik}}{\partial \omega_{ai}(j)},$   (51)

$\nu_{ai}(j+1) = \nu_{ai}(j) - \alpha_a \dfrac{\partial E_{aik}}{\partial \nu_{ai}(j)},$   (52)

where $\alpha_a > 0$ is the learning rate of the action network, and $j$ is the inner-loop iterative step for updating the weight parameters.
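To connect (42) and (48) to the training loop, the sketch below computes the two targets for given critic evaluations. It is only a sketch: the critic gradient in (48) is approximated here by central finite differences, whereas an actual HDP implementation back-propagates through the critic network, and all helper names are hypothetical.

```python
import numpy as np

def critic_target(e, v_prev, V_prev_hat, e_next_hat, Q, R):
    """Eq. (42): e^T Q e + v_{i-1}^T R v_{i-1} + V_hat_{i-1}(e_hat_{k+1})."""
    return float(e @ Q @ e + v_prev @ R @ v_prev + V_prev_hat(e_next_hat))

def action_target(e, r_k, g, V_hat, e_next_hat, R, h=1e-6):
    """Eq. (48): -0.5 * R^{-1} g^T(e_k + r_k) * dV_hat_i/d(e_hat_{k+1}),
    with the critic gradient taken by central differences (illustrative)."""
    grad = np.empty(e_next_hat.size)
    for idx in range(e_next_hat.size):
        d = np.zeros(e_next_hat.size)
        d[idx] = h
        grad[idx] = (V_hat(e_next_hat + d) - V_hat(e_next_hat - d)) / (2 * h)
    return -0.5 * np.linalg.solve(R, g(e + r_k).T @ grad)
```

The critic and action weights are then driven toward these targets by the gradient rules (45)–(46) and (51)–(52), in the same manner as the model-network sketch above.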


5. Simulation study

In this section, two simulation examples are provided to confirm the theoretical results.

5.1. Example 1

The first example is derived from [31] with some modifications. Consider the following nonlinear system:

$x_{k+1} = f(x_k) + g(x_k) u_{pk},$   (53)

where $x_k = [x_{1k}\ x_{2k}]^{\mathrm{T}} \in \mathbf{R}^2$ and $u_{pk} = [u_{p1k}\ u_{p2k}]^{\mathrm{T}} \in \mathbf{R}^2$ are the state and control variables, respectively. The parameters of the cost function are chosen as $Q = 0.5I$ and $R = 2I$, where $I$ denotes the identity matrix with suitable dimensions. The state of the controlled system is initialized to $x_0 = [0.8\ 0.5]^{\mathrm{T}}$. The system functions are given as

$f(x_k) = \begin{bmatrix} \sin(0.5 x_{2k})\, x_{1k}^2 \\ \cos(1.4 x_{2k}) \sin(0.9 x_{1k}) \end{bmatrix}, \qquad g(x_k) = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}.$

The reference trajectory for the above system is selected as

$r_k = \begin{bmatrix} \sin(0.25k) \\ \cos(0.25k) \end{bmatrix}.$

We set the error bound of the iterative HDP algorithm to $\varepsilon = 10^{-5}$ and implement the algorithm at time instant $k = 0$. The initial control vector of system (6) can be computed as $v_0(e_0) = [-0.64\sin(0.25)\ \ -\sin(0.72)\cos(0.7)]^{\mathrm{T}}$, where $e_0 = [0.8\ -0.5]^{\mathrm{T}}$. Then, we choose three-layer feedforward NNs as the model network, critic network, and action network with structures 4–8–2, 2–8–1, and 2–8–2, respectively. The initial weights of the three networks are all set to be random in $[-1, 1]$. It should be mentioned that the model network is trained first: we train it for 1000 steps using 500 data samples under the learning rate $\alpha_m = 0.1$. After the training of the model network is completed, its weights are kept unchanged. Then, we train the critic network and the action network for 20 iterations (i.e., for $i = 1, 2, \ldots, 20$), each iteration consisting of 2000 training steps, to make sure the given error bound $\varepsilon = 10^{-5}$ is reached. In the training process, the learning rates are $\alpha_c = \alpha_a = 0.05$. The convergence process of the cost function of the iterative HDP algorithm for $k = 0$ is shown in Fig. 2. We can see that the iterative cost function sequence converges to the optimal cost function quite rapidly, which indicates the effectiveness of the iterative HDP algorithm. Therefore, we have $|V_{19}(e_0) - V_{20}(e_0)| \le \varepsilon$, which means that the number of steps of the ε-optimal control is $K_\varepsilon(e_0) = 19$. Besides, the ε-optimal control law $\mu_\varepsilon^*(e_0)$ for system (6) can also be obtained during the iteration process.

[Fig. 2. The convergence process of the cost function (cost function vs. iteration steps 0–20).]
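For orientation, the following sketch replays this example's closed loop. Since the learned HDP controller cannot be reproduced here, a hypothetical one-step deadbeat feedback on the error system stands in for $\mu_\varepsilon^*$ (exact for this example because $g \equiv I$); at $k = 0$ it coincides with the initial vector $v_0(e_0)$ quoted above, and the compensation term corresponds to $u_{dk}$ in (35) under the reading $\phi(r_k) = r_{k+1}$.

```python
import numpy as np

f = lambda x: np.array([np.sin(0.5 * x[1]) * x[0]**2,
                        np.cos(1.4 * x[1]) * np.sin(0.9 * x[0])])
r = lambda k: np.array([np.sin(0.25 * k), np.cos(0.25 * k)])

x = np.array([0.8, 0.5])          # x_0
for k in range(40):
    u_d = r(k + 1) - f(r(k))      # desired control g^{-1}(r_k)(r_{k+1} - f(r_k)), g = I
    v = f(r(k)) - f(x)            # stand-in deadbeat feedback for mu*_eps(e_k)
    x = f(x) + (v + u_d)          # system (53) with u_pk = v + u_d, cf. (35)
print(x - r(40))                  # tracking error after 40 steps (~0)
```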


Next, we compute the near-optimal tracking control law for the original system (1) using (35) and apply it to the controlled system for 40 time steps. The obtained state curves are shown in Figs. 3 and 4, where the corresponding reference trajectories are also plotted to evaluate the tracking performance. The tracking control curves and the tracking errors are shown in Figs. 5 and 6, respectively. Besides, we can derive that the tracking error becomes $e_{19} = [0.2778 \times 10^{-5}\ \ -0.8793 \times 10^{-5}]^{\mathrm{T}}$ after 19 time steps. These simulation results verify the excellent performance of the tracking controller developed by the iterative ADP algorithm.

[Fig. 3. The state trajectory $x_1$ and the reference trajectory $r_1$ (40 time steps).]
[Fig. 4. The state trajectory $x_2$ and the reference trajectory $r_2$ (40 time steps).]
[Fig. 5. The tracking control trajectories $u_{p1}$ and $u_{p2}$ (40 time steps).]
[Fig. 6. The tracking error components $e_1$ and $e_2$ (40 time steps).]

5.2. Example 2

The second example is obtained from [32]. Consider the nonlinear discrete-time system described by (53), where

$f(x_k) = \begin{bmatrix} 0.2 x_{1k}\, e^{x_{2k}^2} \\ 0.3 x_{2k}^3 \end{bmatrix}, \qquad g(x_k) = \begin{bmatrix} 0.2 & 0 \\ 0 & 0.2 \end{bmatrix}.$

The desired trajectory is set to

$r_k = \begin{bmatrix} \sin(k + 0.5\pi) \\ 0.5\cos k \end{bmatrix}.$

In the implementation of the iterative HDP algorithm, the initial weights and the structures of the three networks are set the same as in Example 1. Then, for the given initial state $x_0 = [1.5\ 1]^{\mathrm{T}}$, we train the model network for 10,000 steps using 1000 data samples under the learning rate $\alpha_m = 0.05$. Besides, the critic network and the action network are trained for 5000 iterations so that the given error bound $\varepsilon = 10^{-6}$ is reached. The learning rates in the training process are again $\alpha_c = \alpha_a = 0.05$.

The convergence process of the cost function of the iterative HDP algorithm for $k = 0$ is shown in Fig. 7(a). Then, we apply the tracking control law to the system for 250 time steps and obtain the state and reference trajectories shown in Fig. 7(b) and (c). Besides, the tracking control curves are given in Fig. 7(d). It is clear from the simulation results that the iterative HDP algorithm proposed in this paper is very effective in solving finite-horizon tracking control problems.

[Fig. 7. Simulation results of Example 2: (a) the convergence process of the cost function; (b) the state trajectory $x_1$ and reference trajectory $r_1$; (c) the state trajectory $x_2$ and reference trajectory $r_2$; (d) the tracking control trajectories $u_{p1}$ and $u_{p2}$ (250 time steps).]

6. Conclusion

An effective method is proposed in this paper to design a finite-horizon near-optimal tracking controller for a class of discrete-time nonlinear systems. The iterative ADP algorithm is introduced to solve the cost function of the DTHJB equation, with convergence analysis, yielding a finite-horizon near-optimal tracking controller that brings the cost function within an ε-error bound of its optimal value. Three NNs are used to approximate the cost function, the control law, and the nonlinear system, respectively. The simulation examples confirm the validity of the tracking control approach. The strategy presented in this paper applies only to a class of affine nonlinear systems and requires complete knowledge of the system dynamics. Though there are many practical systems to which the approach can be applied, it is necessary to broaden its applicability to a more general class of nonlinear systems. Consequently, our future work includes studying optimal tracking control problems for non-affine nonlinear systems and model-free systems.

References

[1] L. Cui, H. Zhang, B. Chen, Q. Zhang, Asymptotic tracking control scheme for mechanical systems with external disturbances and friction, Neurocomputing 73 (2010) 1293–1302.


[2] S. Devasia, D. Chen, B. Paden, Nonlinear inversion-based output tracking, IEEE Trans. Autom. Control 41 (1996) 930–942.
[3] G. Tang, Y. Liu, Y. Zhang, Approximate optimal output tracking control for nonlinear discrete-time systems, Control Theory Appl. 27 (2010) 400–405.
[4] F.L. Lewis, V.L. Syrmos, Optimal Control, Wiley, New York, 1995.
[5] R.E. Bellman, Dynamic Programming, Princeton University Press, Princeton, NJ, 1957.
[6] T. Poggio, F. Girosi, Networks for approximation and learning, Proc. IEEE 78 (1990) 1481–1497.
[7] S. Jagannathan, Neural Network Control of Nonlinear Discrete-time Systems, CRC Press, Boca Raton, FL, 2006.
[8] W. Yu, Recent Advances in Intelligent Control Systems, Springer-Verlag, London, 2009.
[9] P.J. Werbos, Approximate dynamic programming for real-time control and neural modeling, in: D.A. White, D.A. Sofge (Eds.), Handbook of Intelligent Control, Van Nostrand Reinhold, New York, 1992 (Chapter 13).
[10] P.J. Werbos, Intelligence in the brain: a theory of how it works and how to build it, Neural Networks 22 (2009) 200–212.
[11] P.J. Werbos, Advanced forecasting methods for global crisis warning and models of intelligence, General Syst. Yearb. 22 (1977) 25–38.
[12] J.J. Murray, C.J. Cox, G.G. Lendaris, R. Saeks, Adaptive dynamic programming, IEEE Trans. Syst. Man Cybern. Part C Appl. Rev. 32 (2002) 140–153.
[13] F.Y. Wang, H. Zhang, D. Liu, Adaptive dynamic programming: an introduction, IEEE Comput. Intell. Mag. 4 (2009) 39–47.
[14] F.L. Lewis, D. Vrabie, Reinforcement learning and adaptive dynamic programming for feedback control, IEEE Circuits Syst. Mag. 9 (2009) 32–50.
[15] J. Si, A.G. Barto, W.B. Powell, D.C. Wunsch (Eds.), Handbook of Learning and Approximate Dynamic Programming, IEEE Press, Wiley, New York, 2004.
[16] A. Al-Tamimi, F.L. Lewis, M. Abu-Khalaf, Discrete-time nonlinear HJB solution using approximate dynamic programming: convergence proof, IEEE Trans. Syst. Man Cybern. Part B Cybern. 38 (2008) 943–949.
[17] D.P. Bertsekas, J.N. Tsitsiklis, Neuro-Dynamic Programming, Athena Scientific, Belmont, MA, 1996.
[18] J. Si, Y.T. Wang, On-line learning control by association and reinforcement, IEEE Trans. Neural Networks 12 (2001) 264–276.
[19] D.V. Prokhorov, D.C. Wunsch, Adaptive critic designs, IEEE Trans. Neural Networks 8 (1997) 997–1007.
[20] R.S. Sutton, A.G. Barto, Reinforcement Learning: An Introduction, The MIT Press, Cambridge, MA, 1998.
[21] D. Liu, X. Xiong, Y. Zhang, Action-dependent adaptive critic designs, in: Proceedings of the International Joint Conference on Neural Networks, Washington, DC, July 2001, pp. 990–995.
[22] D. Liu, Y. Zhang, H. Zhang, A self-learning call admission control scheme for CDMA cellular networks, IEEE Trans. Neural Networks 16 (2005) 1219–1228.
[23] F.Y. Wang, N. Jin, D. Liu, Q. Wei, Adaptive dynamic programming for finite-horizon optimal control of discrete-time nonlinear systems with ε-error bound, IEEE Trans. Neural Networks 22 (2011) 24–36.
[24] S.N. Balakrishnan, V. Biega, Adaptive-critic based neural networks for aircraft optimal control, J. Guidance Control Dyn. 19 (1996) 893–898.
[25] S.N. Balakrishnan, J. Ding, F.L. Lewis, Issues on stability of ADP feedback controllers for dynamic systems, IEEE Trans. Syst. Man Cybern. Part B Cybern. 38 (2008) 913–917.
[26] G.K. Venayagamoorthy, R.G. Harley, D.C. Wunsch, Comparison of heuristic dynamic programming and dual heuristic programming adaptive critics for neurocontrol of a turbogenerator, IEEE Trans. Neural Networks 13 (2002) 764–773.
[27] G.K. Venayagamoorthy, R.G. Harley, D.C. Wunsch, Implementation of adaptive critic-based neurocontrollers for turbogenerators in a multimachine power system, IEEE Trans. Neural Networks 14 (2003) 1047–1064.
[28] M. Abu-Khalaf, F.L. Lewis, Nearly optimal control laws for nonlinear systems with saturating actuators using a neural network HJB approach, Automatica 41 (2005) 779–791.
[29] T. Cheng, F.L. Lewis, M. Abu-Khalaf, A neural network solution for fixed-final time optimal control of nonlinear systems, Automatica 43 (2007) 482–490.
[30] H. Zhang, Y. Luo, D. Liu, Neural-network-based near-optimal control for a class of discrete-time affine nonlinear systems with control constraints, IEEE Trans. Neural Networks 20 (2009) 1490–1503.
[31] T. Dierks, S. Jagannathan, Optimal tracking control of affine nonlinear discrete-time systems with unknown internal dynamics, in: Proceedings of the Joint 48th IEEE Conference on Decision and Control and 28th Chinese Control Conference, Shanghai, PR China, December 2009, pp. 6750–6755.
[32] H. Zhang, Q. Wei, Y. Luo, A novel infinite-time optimal tracking control scheme for a class of discrete-time nonlinear systems via the greedy HDP iteration algorithm, IEEE Trans. Syst. Man Cybern. Part B Cybern. 38 (2008) 937–942.
[33] D. Vrabie, O. Pastravanu, M. Abu-Khalaf, F.L. Lewis, Adaptive optimal control for continuous-time linear systems based on policy iteration, Automatica 45 (2009) 477–484.
[34] K.G. Vamvoudakis, F.L. Lewis, Online actor-critic algorithm to solve the continuous-time infinite horizon optimal control problem, Automatica 46 (2010) 878–888.
[35] R. Song, H. Zhang, Y. Luo, Q. Wei, Optimal control laws for time-delay systems with saturating actuators based on heuristic dynamic programming, Neurocomputing 73 (2010) 3020–3027.
[36] Y.M. Park, M.S. Choi, K.Y. Lee, An optimal tracking neuro-controller for nonlinear dynamic systems, IEEE Trans. Neural Networks 7 (1996) 1099–1110.

Ding Wang received the B.S. degree in mathematics from Zhengzhou University of Light Industry, Zhengzhou, China, and the M.S. degree in operational research and cybernetics from Northeastern University, Shenyang, China, in 2007 and 2009, respectively. He is currently working toward the Ph.D. degree in the Key Laboratory of Complex Systems and Intelligence Science, Institute of Automation, Chinese Academy of Sciences, Beijing, China. His research interests include adaptive dynamic programming, neural networks, and intelligent control.

Derong Liu received the Ph.D. degree in electrical engineering from the University of Notre Dame, Notre Dame, IN, in 1994. Dr. Liu was a Staff Fellow with General Motors Research and Development Center, Warren, MI, from 1993 to 1995. He was an Assistant Professor in the Department of Electrical and Computer Engineering, Stevens Institute of Technology, Hoboken, NJ, from 1995 to 1999. He joined the University of Illinois at Chicago in 1999, where he became a Full Professor of Electrical and Computer Engineering and of Computer Science in 2006. He was selected for the "100 Talents Program" by the Chinese Academy of Sciences in 2008.
Dr. Liu was an Associate Editor of the IEEE Transactions on Circuits and Systems—Part I: Fundamental Theory and Applications (1997–1999), the IEEE Transactions on Signal Processing (2001–2003), the IEEE Transactions on Neural Networks (2004–2009), the IEEE Computational Intelligence Magazine (2006–2009), and the IEEE Circuits and Systems Magazine (2008–2009), and the Letters Editor of the IEEE Transactions on Neural Networks (2006–2008). Currently, he is the Editor-in-Chief of the IEEE Transactions on Neural Networks and an Associate Editor of the IEEE Transactions on Control Systems Technology. He received the Michael J. Birck Fellowship from the University of Notre Dame (1990), the Harvey N. Davis Distinguished Teaching Award from Stevens Institute of Technology (1997), the Faculty Early Career Development (CAREER) Award from the National Science Foundation (1999), the University Scholar Award from the University of Illinois (2006), and the Overseas Outstanding Young Scholar Award from the National Natural Science Foundation of China (2008).

Qinglai Wei received the B.S. degree in automation, the M.S. degree in control theory and control engineering, and the Ph.D. degree in control theory and control engineering from Northeastern University, Shenyang, China, in 2002, 2005, and 2008, respectively. He is currently a postdoctoral fellow with the Key Laboratory of Complex Systems and Intelligence Science, Institute of Automation, Chinese Academy of Sciences, Beijing, China. His research interests include neural-networks-based control, nonlinear control, adaptive dynamic programming, and their industrial applications.
