D. Wang et al. / Neurocomputing 78 (2012) 14–22

we can obtain
\[ J(e_k, \hat{\underline{u}}_k^{k+1}) = U(e_k, v_0(e_k)) + U(e_{k+1}, \hat{u}_{k+1}) = U(e_k, v_0(e_k)) = V_1(e_k). \]
On the other hand, according to (20), we have
\[ V_2(e_k) = \min_{\underline{u}_k^{k+1}} \left\{ J(e_k, \underline{u}_k^{k+1}) : \underline{u}_k^{k+1} \in \mathfrak{A}_{e_k}^{(2)} \right\}, \]
which reveals that
\[ V_2(e_k) \le J(e_k, \hat{\underline{u}}_k^{k+1}) = V_1(e_k). \tag{22} \]
Therefore, the theorem holds for $i = 1$.

Next, assume that the theorem holds for any $i = q$, where $q > 1$. The current cost function can be expressed as
\[ V_q(e_k) = \sum_{j=0}^{q-1} U(e_{k+j}, v_{q-1-j}(e_{k+j})), \]
where $\hat{\underline{u}}_k^{k+q-1} = (v_{q-1}(e_k), v_{q-2}(e_{k+1}), \ldots, v_0(e_{k+q-1}))$ is the corresponding finite-horizon admissible control sequence.

Then, for $i = q+1$, we can construct a control sequence $\hat{\underline{u}}_k^{k+q} = (v_{q-1}(e_k), v_{q-2}(e_{k+1}), \ldots, v_0(e_{k+q-1}), 0)$ with length $q+1$, under which the error trajectory is given as $e_k$, $e_{k+1} = F(e_k, v_{q-1}(e_k))$, $e_{k+2} = F(e_{k+1}, v_{q-2}(e_{k+1}))$, $\ldots$, $e_{k+q} = F(e_{k+q-1}, v_0(e_{k+q-1})) = 0$, $e_{k+q+1} = F(e_{k+q}, \hat{u}_{k+q}) = F(0, 0) = 0$. This shows that $\hat{\underline{u}}_k^{k+q}$ is a finite-horizon admissible control sequence. As $U(e_{k+q}, \hat{u}_{k+q}) = U(0, 0) = 0$, we can acquire
\[ J(e_k, \hat{\underline{u}}_k^{k+q}) = U(e_k, v_{q-1}(e_k)) + U(e_{k+1}, v_{q-2}(e_{k+1})) + \cdots + U(e_{k+q-1}, v_0(e_{k+q-1})) + U(e_{k+q}, \hat{u}_{k+q}) = \sum_{j=0}^{q-1} U(e_{k+j}, v_{q-1-j}(e_{k+j})) = V_q(e_k). \]
On the other hand, according to (20), we have
\[ V_{q+1}(e_k) = \min_{\underline{u}_k^{k+q}} \left\{ J(e_k, \underline{u}_k^{k+q}) : \underline{u}_k^{k+q} \in \mathfrak{A}_{e_k}^{(q+1)} \right\}, \]
which implies that
\[ V_{q+1}(e_k) \le J(e_k, \hat{\underline{u}}_k^{k+q}) = V_q(e_k). \tag{23} \]
Accordingly, we complete the proof by mathematical induction. □

We have concluded that the cost function sequence $\{V_i(e_k)\}$ is a monotonically nonincreasing sequence which is bounded below, and therefore, its limit exists. Here, we denote it as $V_\infty(e_k)$, i.e., $\lim_{i \to \infty} V_i(e_k) = V_\infty(e_k)$. Next, let us consider what will happen when we make $i \to \infty$ in (17).

Theorem 2.
For any discrete time step $k$ and tracking error $e_k$, the following equation holds:
\[ V_\infty(e_k) = \min_{u_k} \{ U(e_k, u_k) + V_\infty(e_{k+1}) \}. \tag{24} \]

Proof. For any admissible control $\tau_k = \tau(e_k)$ and any $i$, according to Theorem 1 and (17), we have
\[ V_\infty(e_k) \le V_{i+1}(e_k) = \min_{u_k} \{ U(e_k, u_k) + V_i(e_{k+1}) \} \le U(e_k, \tau_k) + V_i(e_{k+1}). \]
Letting $i \to \infty$, we get
\[ V_\infty(e_k) \le U(e_k, \tau_k) + V_\infty(e_{k+1}). \]
Note that in the above inequality, $\tau_k$ is chosen arbitrarily. Thus, we can obtain
\[ V_\infty(e_k) \le \min_{u_k} \{ U(e_k, u_k) + V_\infty(e_{k+1}) \}. \tag{25} \]
On the other hand, let $\delta > 0$ be an arbitrary positive number. Then, there exists a positive integer $l$ such that
\[ V_l(e_k) - \delta \le V_\infty(e_k) \le V_l(e_k) \tag{26} \]
because $V_i(e_k)$ is nonincreasing for $i \ge 1$ with $V_\infty(e_k)$ as its limit. Besides, from (17), we can acquire
\[ V_l(e_k) = \min_{u_k} \{ U(e_k, u_k) + V_{l-1}(e_{k+1}) \} = U(e_k, v_{l-1}(e_k)) + V_{l-1}(F(e_k, v_{l-1}(e_k))). \]
Combining with (26), we can obtain
\[ V_\infty(e_k) \ge U(e_k, v_{l-1}(e_k)) + V_{l-1}(F(e_k, v_{l-1}(e_k))) - \delta \ge U(e_k, v_{l-1}(e_k)) + V_\infty(F(e_k, v_{l-1}(e_k))) - \delta \ge \min_{u_k} \{ U(e_k, u_k) + V_\infty(e_{k+1}) \} - \delta, \]
which reveals that
\[ V_\infty(e_k) \ge \min_{u_k} \{ U(e_k, u_k) + V_\infty(e_{k+1}) \} \tag{27} \]
because of the arbitrariness of $\delta$. Based on (25) and (27), we can conclude that (24) is true. □

Next, we will prove that the cost function sequence $\{V_i\}$ converges to the optimal cost function $J^*$ as $i \to \infty$.

Theorem 3. Define the cost function sequence $\{V_i\}$ as in (17) with $V_0(\cdot) = 0$. If the system state $e_k$ is controllable, then $J^*$ is the limit of the cost function sequence $\{V_i\}$, i.e.,
\[ V_\infty(e_k) = J^*(e_k). \]

Proof.
On one hand, in accordance with (9) and (20), we can acquire
\[ J^*(e_k) = \inf_{\underline{u}_k} \left\{ J(e_k, \underline{u}_k) : \underline{u}_k \in \mathfrak{A}_{e_k} \right\} \le \min_{\underline{u}_k^{k+i-1}} \left\{ J(e_k, \underline{u}_k^{k+i-1}) : \underline{u}_k^{k+i-1} \in \mathfrak{A}_{e_k}^{(i)} \right\} = V_i(e_k). \]
Letting $i \to \infty$, we get
\[ J^*(e_k) \le V_\infty(e_k). \tag{28} \]
On the other hand, according to the definition of $J^*(e_k)$, for any $\eta > 0$, there exists an admissible control sequence $\underline{\sigma}_k \in \mathfrak{A}_{e_k}$ such that
\[ J(e_k, \underline{\sigma}_k) \le J^*(e_k) + \eta. \tag{29} \]
Now, we suppose that $|\underline{\sigma}_k| = q$, which shows that $\underline{\sigma}_k \in \mathfrak{A}_{e_k}^{(q)}$. Then, we can obtain
\[ V_\infty(e_k) \le V_q(e_k) = \min_{\underline{u}_k^{k+q-1}} \left\{ J(e_k, \underline{u}_k^{k+q-1}) : \underline{u}_k^{k+q-1} \in \mathfrak{A}_{e_k}^{(q)} \right\} \le J(e_k, \underline{\sigma}_k), \]
using Theorem 1 and (20). Combining with (29), we get
\[ V_\infty(e_k) \le J^*(e_k) + \eta. \]
Noticing that $\eta$ is chosen arbitrarily in the above expression, we have
\[ V_\infty(e_k) \le J^*(e_k). \tag{30} \]
Based on (28) and (30), we can conclude that $J^*(e_k)$ is the limit of the cost function sequence $\{V_i\}$ as $i \to \infty$, i.e., $V_\infty(e_k) = J^*(e_k)$. □

From Theorems 1–3, we can obtain that the cost function sequence $\{V_i(e_k)\}$ converges to the optimal cost function $J^*(e_k)$ of the DTHJB equation, i.e., $V_i \to J^*$ as $i \to \infty$. Then, according to (12) and (16), we can conclude the convergence of the corresponding control law sequence. Now, we present the following corollary.

Corollary 1. Define the cost function sequence $\{V_i\}$ as in (17) with $V_0(\cdot) = 0$, and the control law sequence $\{v_i\}$ as in (16). If the system state $e_k$ is controllable, then the sequence $\{v_i\}$ converges to the
optimal control law $u^*$ as $i \to \infty$, i.e.,
\[ \lim_{i \to \infty} v_i(e_k) = u^*(e_k). \]

3.3. The ε-optimal control algorithm

According to Theorems 1–3 and Corollary 1, we should run the iterative ADP algorithm (14)–(17) until $i \to \infty$ to obtain the optimal cost function $J^*(e_k)$, and then obtain a control vector $v_\infty(e_k)$, based on which we can construct a control sequence $\underline{u}_\infty(e_k) = (v_\infty(e_k), v_\infty(e_{k+1}), \ldots, v_\infty(e_{k+i}), \ldots)$ to control the state to reach the target. Obviously, $\underline{u}_\infty(e_k)$ has infinite length. Though this is feasible in theory, it is generally not practical, because most real-world systems need to be effectively controlled within a finite horizon. Therefore, in this section, we will propose a novel ε-optimal control strategy using the iterative ADP algorithm to deal with the problem. The idea is that, for a given error bound $\varepsilon > 0$, the iteration number $i$ is chosen so that the error between $V_i(e_k)$ and $J^*(e_k)$ is within the bound.

Let $\varepsilon > 0$ be any small number, $e_k$ be any controllable state, and $J^*(e_k)$ be the optimal value of the cost function sequence defined as in (17).
From Theorem 3, it is clear that there exists a finite $i$ such that
\[ |V_i(e_k) - J^*(e_k)| \le \varepsilon. \tag{31} \]
The length of the optimal control sequence starting from $e_k$ with respect to $\varepsilon$ is defined as
\[ K_\varepsilon(e_k) = \min \{ i : |V_i(e_k) - J^*(e_k)| \le \varepsilon \}. \tag{32} \]
The corresponding control law
\[ v_{i-1}(e_k) = \arg\min_{u_k} \{ U(e_k, u_k) + V_{i-1}(e_{k+1}) \} = -\frac{1}{2} R^{-1} g^T(e_k + r_k) \frac{\partial V_{i-1}(e_{k+1})}{\partial e_{k+1}} \tag{33} \]
is called the ε-optimal control and is denoted as $\mu_\varepsilon^*(e_k)$.

In this sense, we can see that an error $\varepsilon$ between $V_i(e_k)$ and $J^*(e_k)$ is introduced into the iterative ADP algorithm, which makes the cost function sequence $\{V_i(e_k)\}$ converge in a finite number of iteration steps.

However, the optimal criterion (31) is difficult to verify because the optimal cost function $J^*(e_k)$ is unknown in general. Consequently, we will use an equivalent criterion, i.e.,
\[ |V_i(e_k) - V_{i+1}(e_k)| \le \varepsilon \tag{34} \]
to replace (31).

In fact, if $|V_i(e_k) - J^*(e_k)| \le \varepsilon$ holds, we have $V_i(e_k) \le J^*(e_k) + \varepsilon$. Combining with $J^*(e_k) \le V_{i+1}(e_k) \le V_i(e_k)$, we can find that
\[ 0 \le V_i(e_k) - V_{i+1}(e_k) \le \varepsilon, \]
which means
\[ |V_i(e_k) - V_{i+1}(e_k)| \le \varepsilon. \]
On the other hand, according to Theorem 3, $|V_i(e_k) - V_{i+1}(e_k)| \to 0$ implies that $V_i(e_k) \to J^*(e_k)$. As a result, if $|V_i(e_k) - V_{i+1}(e_k)| \le \varepsilon$ holds for any given small $\varepsilon$, we can derive the conclusion that $|V_i(e_k) - J^*(e_k)| \le \varepsilon$ holds if $i$ is sufficiently large.

3.4. Design procedure of the finite-horizon optimal tracking control scheme using iterative ADP algorithm

In this section, we will give the detailed design procedure for the finite-horizon nonlinear optimal tracking control scheme using the iterative ADP algorithm.

Step 1 Specify an error bound $\varepsilon$ for the given initial state $x_0$. Choose $i_{\max}$, the reference trajectory $r_k$, and the matrices $Q$ and $R$.
Step 2 Compute $e_k$ according to (2) and (3).
Step 3 Set $i = 0$, $V_0(e_k) = 0$.
Obtain the initial finite-horizon admissible vector $v_0(e_k)$ by (14) and update the cost function $V_1(e_k)$ by (15).
Step 4 Set $i = i + 1$.
Step 5 Compute $v_i(e_k)$ by (16) and the corresponding cost function $V_{i+1}(e_k)$ by (17).
Step 6 If $|V_i(e_k) - V_{i+1}(e_k)| \le \varepsilon$, then go to Step 8; otherwise, go to Step 7.
Step 7 If $i > i_{\max}$, then go to Step 8; otherwise, go to Step 4.
Step 8 Stop.

After the optimal control law $u^*(e_k)$ for system (6) is derived under the given error bound $\varepsilon$, we can compute the optimal tracking control input for original system (1) by
\[ u_{pk} = u^*(e_k) + u_{dk} = u^*(e_k) + g^{-1}(r_k)(\varphi(r_k) - f(r_k)). \tag{35} \]
In the following section, we will describe the implementation of the iterative ADP algorithm based on NNs in detail.

4. NN implementation of the iterative ADP algorithm via HDP technique

Now, we implement the iterative HDP algorithm in (14)–(17) using NNs. In the iterative HDP algorithm, there are three networks, which are the model network, the critic network and the action network. All the networks are chosen as three-layer feedforward NNs. The inputs of the critic network and the action network are $e_k$, while the inputs of the model network are $e_k$ and $\hat{v}_i(e_k)$. The structure diagram of the iterative HDP algorithm is shown in Fig. 1.

4.1. The model network

The purpose of designing the model network is to approximate the error dynamics. We should train the model network before carrying out the iterative HDP algorithm.
For given $e_k$ and $\hat{v}_i(e_k)$, we can obtain the output of the model network as
\[ \hat{e}_{k+1} = \omega_m^T \sigma(\nu_m^T z_k), \tag{36} \]
where
\[ z_k = [\, e_k^T \;\; \hat{v}_i^T(e_k) \,]^T. \]
We define the error function of the model network as
\[ e_{mk} = \hat{e}_{k+1} - e_{k+1}. \tag{37} \]
The weights in the model network are updated to minimize the following performance measure:
\[ E_{mk} = \frac{1}{2} e_{mk}^T e_{mk}. \tag{38} \]
Using the gradient-based adaptation rule, the weights can be updated as
\[ \omega_m(j+1) = \omega_m(j) - \alpha_m \left[ \frac{\partial E_{mk}}{\partial \omega_m(j)} \right], \tag{39} \]
\[ \nu_m(j+1) = \nu_m(j) - \alpha_m \left[ \frac{\partial E_{mk}}{\partial \nu_m(j)} \right], \tag{40} \]
where $\alpha_m > 0$ is the learning rate of the model network, and $j$ is the iterative step for updating the weight parameters.

After the model network is trained, its weights are kept unchanged.
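The update rules (36)–(40) can be illustrated with a minimal sketch in plain Python. Several choices below are assumptions made only for illustration and do not come from the paper: $\sigma$ is taken to be `tanh`, the training data come from a hypothetical scalar error dynamics `toy_error_dynamics`, and the hidden-layer size, learning rate, sampling ranges and number of steps are arbitrary.

```python
import math
import random

def forward(omega, nu, z):
    """Return (hidden activations, output) of e_hat = omega^T tanh(nu^T z), cf. (36)."""
    hidden = [math.tanh(sum(nu[r][p] * z[r] for r in range(len(z))))
              for p in range(len(omega))]
    out = [sum(omega[p][q] * hidden[p] for p in range(len(omega)))
           for q in range(len(omega[0]))]
    return hidden, out

def train_step(omega, nu, z, target, lr):
    """One gradient step on E = 0.5 * ||e_hat - target||^2, cf. (37)-(40)."""
    hidden, out = forward(omega, nu, z)
    err = [o - t for o, t in zip(out, target)]           # e_mk in (37)
    # Back-propagate through tanh using the weights *before* they are updated.
    delta = [(1.0 - hidden[p] ** 2)
             * sum(omega[p][q] * err[q] for q in range(len(err)))
             for p in range(len(hidden))]
    for p in range(len(hidden)):                         # update (39)
        for q in range(len(err)):
            omega[p][q] -= lr * hidden[p] * err[q]
    for r in range(len(z)):                              # update (40)
        for p in range(len(hidden)):
            nu[r][p] -= lr * z[r] * delta[p]
    return 0.5 * sum(e * e for e in err)                 # E_mk in (38)

def toy_error_dynamics(e, v):
    """Hypothetical scalar error dynamics, used only to generate training data."""
    return 0.8 * math.sin(e) + v

def mean_loss(omega, nu, samples):
    """Average of E_mk over a fixed set of (e, v) pairs."""
    total = 0.0
    for e, v in samples:
        _, out = forward(omega, nu, [e, v])
        total += 0.5 * (out[0] - toy_error_dynamics(e, v)) ** 2
    return total / len(samples)

def train_model_network(hidden_units=8, steps=3000, lr=0.05, seed=0):
    rng = random.Random(seed)
    # nu maps the 2-dimensional input z = [e, v] to the hidden layer;
    # omega maps the hidden layer to the 1-dimensional output e_hat.
    nu = [[rng.uniform(-0.5, 0.5) for _ in range(hidden_units)] for _ in range(2)]
    omega = [[rng.uniform(-0.5, 0.5)] for _ in range(hidden_units)]
    held_out = [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(200)]
    before = mean_loss(omega, nu, held_out)
    for _ in range(steps):
        e, v = rng.uniform(-1, 1), rng.uniform(-1, 1)
        train_step(omega, nu, [e, v], [toy_error_dynamics(e, v)], lr)
    after = mean_loss(omega, nu, held_out)
    return omega, nu, before, after

omega_m, nu_m, loss_before, loss_after = train_model_network()
print(f"held-out loss before: {loss_before:.4f}, after: {loss_after:.4f}")
```

Once training ends, `omega_m` and `nu_m` would be frozen and `forward` used as the one-step predictor of $e_{k+1}$ inside the iterative HDP loop, in line with the statement that the model weights are kept unchanged afterwards.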