Proceedings of IMECE07, 2007 ASME International Mechanical Engineering Congress and Exposition, November 11-15, 2007, Seattle, Washington, USA

IMECE2007-41258

A STATE-SPACE REPRESENTATION MODEL AND LEARNING ALGORITHM FOR REAL-TIME DECISION-MAKING UNDER UNCERTAINTY

Andreas A. Malikopoulos, Department of Mechanical Engineering, University of Michigan, Ann Arbor, Michigan 48109, U.S.A., amaliko@umich.edu
Panos Y. Papalambros, Department of Mechanical Engineering, University of Michigan, Ann Arbor, Michigan 48109, U.S.A., pyp@umich.edu
Dennis N. Assanis, Department of Mechanical Engineering, University of Michigan, Ann Arbor, Michigan 48109, U.S.A., assanis@umich.edu

ABSTRACT
Modeling dynamic systems incurring stochastic disturbances for deriving a control policy is a ubiquitous task in engineering. However, in some instances obtaining a model of a system may be impractical or impossible. Alternative approaches have been developed using a simulation-based stochastic framework, in which the system interacts with its environment in real time and obtains information that can be processed to produce an optimal control policy. In this context, the problem of developing a policy for controlling the system's behavior is formulated as a sequential decision-making problem under uncertainty. This paper considers real-time sequential decision-making under uncertainty modeled as a Markov Decision Process (MDP). A state-space representation model is constructed through a learning mechanism and is used to improve system performance over time. The model allows decision making based on gradually enhanced knowledge of system response as it transitions from one state to another, in conjunction with actions taken at each state. A learning algorithm is implemented to realize in real time the optimal control policy associated with the state transitions. The proposed method is demonstrated on the single cart-pole balancing problem and a vehicle cruise-control problem.

Keywords: sequential decision-making under uncertainty, Markov Decision Process (MDP), reinforcement learning, learning algorithms, inverted pendulum, vehicle cruise control

1. INTRODUCTION
Deriving a control policy for dynamic systems is traditionally an offline process in which various methods from control theory are applied iteratively. These methods aim to determine the policy that satisfies the system's physical constraints while optimizing specific performance criteria. A challenging task in this process is to derive a mathematical model of the system's dynamics that can adequately predict the response of the physical system to all anticipated inputs. Exact modeling of complex engineering systems, however, may be infeasible or expensive. Viable alternative methods have been developed that enable the real-time implementation of control policies when an accurate model is not available. In this framework, the system interacts with its environment and obtains information enabling it to improve its future performance by means of rewards associated with the control actions taken.
This interaction portrays the learning process, conveyed by the progressive enhancement of the system's "knowledge" of the course of action (control policy) that maximizes the accumulated rewards with respect to the system's operating point (state). The environment is assumed to be non-deterministic; namely, taking the same action in the same state at two different stages, the system may transit to a different state and receive a dissimilar reward in the subsequent stage. Consequently, the problem of developing a policy for controlling the system's behavior is formulated as a sequential decision-making problem under uncertainty.

Dynamic programming (DP) has been widely employed as the principal method for analysis of sequential decision-making problems [1]. Algorithms such as value iteration and policy iteration have been extensively utilized in solving deterministic and stochastic optimal control problems, Markov and semi-Markov decision problems, min-max control problems, and sequential games. However, the computational complexity of these algorithms may in some cases be prohibitive and can grow intractably with the size of the problem and its related data, referred to as the DP "curse of dimensionality" [2]. In addition, DP algorithms require the realization of the conditional probabilities of state transitions and the associated rewards, implying a priori knowledge of the system dynamics.

Alternative approaches for solving sequential decision-making problems under uncertainty have been primarily developed in the field of Reinforcement Learning (RL) [3, 4]. RL aims to provide simulation-based algorithms, founded on DP, for learning control policies of complex systems where exact modeling is infeasible or expensive [5]. A major influence on research leading to current RL algorithms has been Samuel's method [6, 7], used to modify a heuristic evaluation function for deriving optimal board positions in the game of checkers. In this algorithm, Samuel represented the evaluation function as a weighted sum of numerical features and adjusted the weights based on an error derived from comparing evaluations of current and predicted board positions. This approach was refined and extended by Sutton [8, 9] to introduce a class of incremental learning algorithms, Temporal Difference (TD). TD algorithms are specialized for deriving optimal policies for incompletely known systems, using past experience to predict their future behavior. Watkins [10, 11] extended Sutton's TD algorithms and developed Q-learning, an algorithm that enables systems to learn how to act optimally in controlled Markov domains by explicitly utilizing the theory of DP. A strong condition implicit in the convergence of Q-learning to an optimal control policy is that the sequence of stages that forms the basis of learning must include an infinite number of stages for each initial state and action. Nevertheless, Q-learning is considered the most popular and efficient model-free learning algorithm for deriving optimal control policies in Markov domains [12]. Schwartz [13] explored the potential of adapting Q-learning to an average-reward framework with his R-learning algorithm; Bertsekas and Tsitsiklis [3] presented a similar average-reward algorithm based on Q-learning. Mahadevan [14] surveyed average-reward reinforcement-learning algorithms and showed that these algorithms do not always produce bias-optimal control policies.

Although many of these algorithms are eventually guaranteed to find optimal policies in sequential decision-making problems under uncertainty, their use of the data accumulated over the learning process is inefficient, and they require a significant amount of experience to achieve good performance [12]. This requirement arises because these algorithms derive optimal policies without learning the system models en route. Algorithms that compute optimal policies by learning the models are especially important in applications in which real-world experience is considered expensive. Sutton's Dyna architecture [15, 16] exploits strategies that simultaneously utilize experience in building the model and adjusting the derived policy.
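For concreteness, the tabular Q-learning update discussed above can be stated in a few lines. The sketch below is a generic illustration rather than code from the paper: the environment interface (`reset`, `step`) and all parameter values are assumptions.

```python
import numpy as np

def q_learning(env, n_states, n_actions, episodes=500,
               alpha=0.1, gamma=0.95, epsilon=0.1, rng=None):
    """Tabular Q-learning: Q(s,a) <- Q(s,a) + alpha*(r + gamma*max_a' Q(s',a') - Q(s,a)).

    `env` is assumed to expose reset() -> state and step(action) -> (state, reward, done),
    with integer-coded states and actions.
    """
    rng = rng or np.random.default_rng(0)
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy action selection
            a = rng.integers(n_actions) if rng.random() < epsilon else int(np.argmax(Q[s]))
            s_next, r, done = env.step(a)
            # one-step temporal-difference update; no model of the dynamics is kept
            Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
            s = s_next
    return Q  # greedy policy: pi(s) = argmax_a Q[s, a]
```

The convergence condition mentioned above concerns the greedy policy recovered from this table, which is guaranteed to be optimal only if every state-action pair is visited infinitely often.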
Prioritized sweeping [17] and Queue-Dyna [18] are similar methods that concentrate on the interesting subspaces of the state-action space. Barto et al. [19] developed another method, called Real-Time Dynamic Programming (RTDP), referring to cases in which concurrently executing DP and control processes influence one another. RTDP focuses the computational effort on the state subspace that the system is most likely to occupy. This method is specific to problems in which the system needs to achieve particular goal states and the initial cost of every goal state is zero.

This paper considers the problem of deriving optimal policies in real-time sequential decision-making under uncertainty that can be modeled as a Markov Decision Process (MDP). A state-space representation model is constructed through a learning mechanism and is used to improve system performance over time. The model accumulates gradually enhanced knowledge of system response as it transitions from one state to another, in conjunction with actions taken at each state. A learning algorithm is implemented to realize in real time the optimal course of action (control policy) associated with the state transitions. The major difference between the proposed method and existing RL algorithms is that the latter consist of evaluation functions attempting to successively approximate the Bellman equation [20]. The proposed real-time learning method, on the contrary, utilizes an evaluation function that considers the expected reward that can be achieved by state transitions forward in time. This approach is especially appealing to learning engineering systems in which the initial state is not fixed, and in which recursive updates of the evaluation functions to approximate the Bellman equation would demand a significant amount of experience to achieve the desired system performance.

In the following section the mathematical framework for modeling sequential decision-making problems under uncertainty is presented and the predictive optimal decision-making learning method is introduced. The performance of the proposed method is demonstrated on the single cart-pole balancing problem in Section 3 and on a vehicle cruise-control problem in Section 4. Conclusions are presented in Section 5.

2. PROBLEM FORMULATION
A large class of sequential decision-making problems under uncertainty can be modeled as a Markov Decision Process (MDP). The MDP, extensively covered by Puterman [21], provides the mathematical framework for modeling decision making in situations where outcomes are partly random and partly under the control of the decision maker. Decisions are made at points of time referred to as decision epochs, and the time domain can be either discrete or continuous. The focus of this paper is on discrete-time decision-making problems.

The Markov decision process model consists of five elements: (a) decision epochs; (b) states; (c) actions; (d) the transition probability matrix; and (e) the transition reward matrix. In this framework, the decision maker is faced with the problem of influencing system behavior as it evolves over time by making decisions (choosing actions).
The objective of the decision maker is to select the course of action (control policy) which causes the system to perform optimally with respect to some predetermined optimization criterion. Decisions must anticipate rewards (or costs) associated with future system states and actions.

At each decision epoch, the system occupies a state $s_i$ from the finite set of all possible system states

$$S = \{ s_i \mid i = 1, 2, \dots, N \}, \quad N \in \mathbb{N}. \qquad (1)$$

In this state $s_i \in S$, the decision maker has available a set of allowable actions, $\alpha \in A(s_i)$, $A(s_i) \subseteq A$, where $A$ is the finite action space

$$A = \bigcup_{s_i \in S} A(s_i). \qquad (2)$$

The decision-making process occurs at each of a sequence of decision epochs $T = 0, 1, 2, \dots, M$, $M \in \mathbb{N}$. At each epoch, the decision maker observes the system's state $s_i \in S$ and executes an action $\alpha \in A(s_i)$ from the feasible set of actions $A(s_i) \subseteq A$ at this state. At the next epoch, the system transits to the state $s_j \in S$ imposed by the conditional probabilities $p(s_j \mid s_i, \alpha)$, designated by the transition probability matrix $P(\cdot,\cdot)$. The conditional probabilities of $P(\cdot,\cdot)$, $p: S \times A \to [0,1]$, satisfy the constraint

$$\sum_{j=1}^{N} p(s_j \mid s_i, \alpha) = 1. \qquad (3)$$

Following this state transition, the decision maker receives a reward associated with the action $\alpha$, $R(s_j \mid s_i, \alpha)$, $R: S \times A \to \mathbb{R}$, as imposed by the transition reward matrix $R(\cdot,\cdot)$. The states of an MDP possess the Markov property, stating that the conditional probability distribution of future states of the process depends only upon the current state and not on any past states; i.e., it is conditionally independent of the past states (the path of the process) given the present state. Mathematically, the Markov property states that

$$p(X_{n+1} = s_j \mid X_n = s_i, X_{n-1} = s_{i-1}, \dots, X_0 = s_0) = p(X_{n+1} = s_j \mid X_n = s_i). \qquad (4)$$

2.1 Optimal Policies and Performance Criteria
The solution to an MDP can be expressed as an admissible control policy such that a given performance criterion is optimized over all admissible policies $\Pi$. An admissible policy consists of a sequence of functions

$$\pi = \{ \mu_0, \mu_1, \dots, \mu_M \}, \qquad (5)$$

where $\mu_T$ maps states $s_i$ into actions $\alpha = \mu_T(s_i)$ and is such that $\mu_T(s_i) \in A(s_i)$, $\forall s_i \in S$.

Consequently, given an initial state $s_i$ at decision epoch $T = 0$ and an admissible policy $\pi = \{ \mu_0, \mu_1, \dots, \mu_M \}$, the expected accumulated undiscounted value of the rewards of the decision maker is given by the Bellman optimality equation

$$V^{\pi}(s_i) = E\Big\{ R\big(s_N \mid s_{N-1}, \mu_{N-1}(s_{N-1})\big) + \sum_{T=0}^{M-1} V^{\pi}\big(s_j \mid s_i, \mu_T(s_i)\big) \Big\}, \qquad (6)$$
$$\forall s_i, s_j \in S, \quad i, j = 1, 2, \dots, N, \quad N \in \mathbb{N}.$$

In the finite-horizon context the decision maker should maximize the accumulated value for the next $M$ epochs. An optimal policy $\pi^* \in \Pi$ is one that maximizes the overall expected accumulated value of the rewards

$$V^{\pi^*}(s_i) = \max_{\pi \in \Pi} V^{\pi}(s_i). \qquad (7)$$

Consequently, the optimal policy $\pi^* = \{ \mu_0^*, \mu_1^*, \dots, \mu_M^* \}$ for the $M$-decision-epoch sequence is

$$\pi^* = \arg\max_{\pi \in \Pi} V^{\pi}(s_i). \qquad (8)$$

The finite-horizon model is appropriate when the decision maker's "lifetime," namely the terminal epoch of the decision-making sequence, is known. For problems with a very large number of decision epochs, however, the infinite-horizon model is more appropriate.
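As a concrete illustration of the finite-horizon criterion in Eqs. (6)-(8), the sketch below computes an optimal policy for a small MDP specified by the transition and reward matrices of elements (d) and (e) using backward induction. The toy numbers and the array layout are illustrative assumptions, not values from the paper.

```python
import numpy as np

def finite_horizon_policy(P, R, M):
    """Backward induction for an M-epoch, undiscounted MDP.

    P[a, i, j] = p(s_j | s_i, a)   (each row sums to 1, cf. Eq. (3))
    R[a, i, j] = R(s_j | s_i, a)
    Returns V[i], the optimal expected accumulated reward from s_i,
    and policy[t, i], the optimal action at epoch t in state s_i.
    """
    n_actions, n_states, _ = P.shape
    V = np.zeros(n_states)
    policy = np.zeros((M, n_states), dtype=int)
    for t in reversed(range(M)):
        # Q[a, i] = expected immediate reward + expected value of the successor state
        Q = np.einsum('aij,aij->ai', P, R) + P @ V
        policy[t] = np.argmax(Q, axis=0)
        V = np.max(Q, axis=0)
    return V, policy

# Toy 2-state, 2-action example (values are illustrative only)
P = np.array([[[0.8, 0.2], [0.3, 0.7]],
              [[0.5, 0.5], [0.9, 0.1]]])
R = np.array([[[1.0, 0.0], [0.0, 2.0]],
              [[0.5, 0.5], [1.5, 0.0]]])
V, policy = finite_horizon_policy(P, R, M=5)
```

Note that this computation presumes the matrices P and R are known in advance, which is exactly the a priori knowledge that the learning method of Section 2.2 avoids.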
In this context, the overall expected undiscounted reward is

$$V^{\pi^*}(s_i) = \lim_{M \to \infty} \max_{\pi \in \Pi} V^{\pi}(s_i). \qquad (9)$$

This relation is valuable computationally and analytically, and it holds under certain conditions [22].

2.2 Predictive Optimal Decision-Making State-Space Representation
The predictive optimal decision-making (POD) learning method consists of a new state-space system representation and a learning algorithm. The state-space representation accumulates gradually enhanced knowledge of the system's transition from each state to another in conjunction with the actions taken at each state. This knowledge is expressed in terms of an expected evaluation function associated with each state. While the model's knowledge is advanced, the learning algorithm realizes, at each decision epoch, the course of action that guarantees at least a minimum value of the overall reward for both the current state and the subsequent states.

The major difference between the proposed learning method and existing RL methods is that the latter consist of evaluation functions attempting to successively approximate the Bellman equation, Eq. (6). These evaluation functions assign to each state the total reward expected to accumulate over time starting from a given state when a policy $\pi$ is employed. The proposed real-time learning method, on the contrary, utilizes an evaluation function that considers the expected reward that can be achieved by state transitions forward in time. This approach is especially appealing to learning engineering systems in which the initial state is not fixed [23, 24], and in which recursive updates of the evaluation functions to approximate Eq. (6) would demand a huge number of iterations to achieve the desired system performance.

The new state-space representation defines the POD domain $\mathcal{S}$. It is implemented by a mapping $H$ from the Cartesian product of the finite state space and action space of the MDP:

$$H: (S \times A) \times (S \times A) \to \mathcal{S}, \qquad (10)$$

where $S = \{ s_i \mid i = 1, 2, \dots, N \}$, $N \in \mathbb{N}$, denotes the Markov state space and $A = \bigcup_{s_i \in S} A(s_i)$, $\forall s_i \in S$, stands for the finite action space. This mapping generates an indexed family of subsets, $\mathcal{S}_{s_i}$, for each state $s_i \in S$, defined as Predictive Representation Nodes (PRNs). Each PRN is constituted by a set of POD states, $s_m^i \in \mathcal{S}$,

$$\mathcal{S}_{s_i} = \{ s_m^i \mid m = 1, 2, \dots, N, \ N = |S| \}. \qquad (11)$$

Each POD state $s_m^i \in \mathcal{S}_{s_i}$ essentially represents a Markov state transition from $s_i \in S$ to $s_m \in S$. PRNs partition the POD domain insofar as the POD underlying structure captures the state transitions in the Markov domain. Consequently, a PRN is defined as

$$\mathcal{S}_{s_i} = \Big\{ s_m^i \ \Big| \ s_m^i \equiv s_i \to s_m, \ \sum_{m=1}^{N} p\big(s_m \mid s_i, \mu(s_i)\big) = 1, \ N = |S| \Big\}, \quad \forall s_i, s_m \in S, \ \forall \mu(s_i) \in A(s_i), \qquad (12)$$

the union of which defines the POD domain

$$\mathcal{S} = \bigcup_{s_i \in S} \mathcal{S}_{s_i}, \qquad (13)$$

with

$$\mathcal{S}_{s_i} \cap \mathcal{S}_{s_m} = \varnothing, \quad i \neq m. \qquad (14)$$

Each PRN, $\mathcal{S}_{s_i}$, corresponds to a Markov state, $s_i \in S$, and portrays all possible transitions occurring from this state $s_i$ to the other states $s_m \in S$. PRNs, constituting the fundamental aspect of the POD state representation, provide an assessment of the Markov state transitions along with the actions executed at each state. This assessment aims to establish a necessary embedded property of the new state representation so as to consider the potential transitions that can occur in subsequent decision epochs. The assessment is expressed by means of the PRN value, $\bar{R}_{s_i}\big(s_m^i \mid \mu(s_i)\big)$, which accounts for the maximum average expected reward that can be achieved by transitions occurring inside a PRN. Consequently, the PRN value is defined as

$$\bar{R}_{s_i}\big(s_m^i \mid \mu(s_i)\big) = \max_{\mu(s_i) \in A(s_i)} \left( \frac{\sum_{m=1}^{N} p\big(s_m \mid s_i, \mu(s_i)\big) \cdot R\big(s_m \mid s_i, \mu(s_i)\big)}{N} \right), \qquad (15)$$
$$\forall s_m^i \in \mathcal{S}, \ \forall s_i, s_m \in S, \ \forall \mu(s_i) \in A(s_i), \ N = |S|.$$

The PRN value is exploited by the POD state representation as an evaluation metric to estimate the subsequent Markov state transitions. The estimation property is founded on the assessment of POD states by means of an expected evaluation function, $R_{PRN}\big(s_m^i, \mu(s_i)\big)$, defined as

$$R_{PRN}\big(s_m^i, \mu(s_i)\big) = p\big(s_m \mid s_i, \mu(s_i)\big) \cdot R\big(s_m \mid s_i, \mu(s_i)\big) + \bar{R}_{s_m}\big(s_j^m \mid \mu(s_m)\big), \qquad (16)$$
$$\forall s_m^i, s_j^m \in \mathcal{S}, \ \forall s_i, s_m \in S, \ \forall \mu(s_i) \in A(s_i), \ \forall \mu(s_m) \in A(s_m).$$

Consequently, employing the POD evaluation function through Eq. (16), each POD state $s_m^i \in \mathcal{S}_{s_i}$ carries an overall reward corresponding to: (a) the expected reward of transiting from state $s_i$ to $s_m$ (implying also the transition from the PRN $\mathcal{S}_{s_i}$ to $\mathcal{S}_{s_m}$); and (b) the maximum average expected reward when transiting from $s_m$ to any other Markov state (transitions occurring into $\mathcal{S}_{s_m}$).

While the system interacts with its environment, the POD model learns the system dynamics in terms of the Markov state transitions. The POD state representation attempts to provide a process for realizing the sequences of state transitions that occurred in the Markov domain, as infused in the PRNs. The different sequences of Markov state transitions are captured by the POD states and evaluated through the expected evaluation functions given in Eq. (16). Consequently, the highest value of the expected evaluation function at each POD state essentially estimates the subsequent Markov state transitions with respect to the actions taken.
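To make Eqs. (15)-(16) concrete, the sketch below evaluates the PRN value and the POD evaluation function from estimated transition probabilities and rewards for a single Markov state. The array layout and function names are illustrative assumptions; the authors' actual implementation is not described at this level of detail.

```python
import numpy as np

def prn_value(p_hat, r_hat):
    """PRN value, Eq. (15): maximum over actions of the average, over the N
    successor states, of p(s_m | s_i, a) * R(s_m | s_i, a) inside the PRN of s_i.

    p_hat, r_hat: arrays of shape (n_actions, n_states) holding the estimated
    transition probabilities and rewards out of one Markov state s_i.
    """
    n_states = p_hat.shape[1]
    return np.max(np.sum(p_hat * r_hat, axis=1) / n_states)

def pod_evaluation(p_hat_i, r_hat_i, prn_values, action):
    """POD evaluation function, Eq. (16), for every POD state s^i_m reachable
    from s_i under `action`: expected one-step reward of the transition
    s_i -> s_m plus the PRN value of the destination node S_{s_m}.

    prn_values[m] is the PRN value of state s_m, precomputed with prn_value().
    """
    return p_hat_i[action] * r_hat_i[action] + prn_values
```

In this layout, the action whose POD states carry the highest evaluation identifies the most promising transition out of $s_i$, one and two steps ahead, without recursive sweeps over the whole state space.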
As the process is stochastic, however, it is still necessary for the real-time learning method to build a decision-making mechanism for how to select actions.

The learning performance is closely related to the exploration-exploitation strategy over the action space. More precisely, the decision maker has to exploit what is already known regarding the admissible state-action pairs that maximize the rewards, and also to explore those actions that have not yet been tried for these pairs to assess whether they may result in higher rewards. A balance between an exhaustive exploration of the environment and the exploitation of the learned policy is fundamental to reaching nearly optimal solutions in few decision epochs and, thus, to enhancing the learning performance. This exploration-exploitation dilemma has been extensively reported in the literature. Iwata et al. [25] proposed a model-based learning method extending Q-learning and introducing two separate functions, based on statistics and on information, by applying exploration and exploitation strategies. Ishii et al. [26] developed a model-based reinforcement learning method utilizing a balance parameter, controlled through variation of action rewards and perception of environmental change. Chan-Geon et al. [27] proposed an exploration-exploitation policy in Q-learning consisting of an auxiliary Markov process and the original Markov process. Miyazaki et al. [28] developed a unified learning system realizing the tradeoff between exploration and exploitation. Hernandez-Aguirre et al. [29] analyzed the problem of exploration-exploitation in the context of the probably approximately correct (PAC) framework and studied whether it is possible to give bounds on the complexity of the exploration process.
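The specific exploration rule used by POD is not given in the excerpt above, so the sketch below shows one common way to strike the balance just described: Boltzmann (softmax) action selection with a temperature that decays as a state is visited more often. The function name and all parameter values are illustrative assumptions.

```python
import numpy as np

def softmax_action(values, visits, rng, tau0=1.0, tau_min=0.05):
    """Boltzmann exploration: actions with higher estimated value are chosen more
    often, and the temperature (degree of exploration) decays with the number of
    visits to the state, shifting gradually from exploration to exploitation.

    values: 1-D array of estimated action values for the current state
    visits: number of times the current state has been visited so far
    rng:    a numpy Generator, e.g. np.random.default_rng(0)
    """
    tau = max(tau_min, tau0 / (1.0 + visits))
    prefs = (values - np.max(values)) / tau            # shift for numerical stability
    probs = np.exp(prefs) / np.sum(np.exp(prefs))
    return rng.choice(len(values), p=probs)
```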


Zhidong et al. [32] developed a "neural-fuzzy BOXES" control system implemented by neural networks and utilized reinforcement learning for the training. Jeen-Shing et al. [33] proposed a defuzzification method incorporating a genetic algorithm to learn the defuzzification factors. Mustapha et al. [34] developed an actor-critic reinforcement learning algorithm represented by two adaptive neural-fuzzy systems. Si et al. [35] proposed a generic on-line learning control system, similar to Anderson's, utilizing neural networks and evaluated it through its application to both the single and double cart-pole balancing problems. The system utilizes two neural networks and employs action-dependent heuristic dynamic programming to adapt the weights of the networks.

In the implementation of POD on the single inverted pendulum presented here, two major variations are considered: (a) a single look-up table-based representation is employed for the controller to develop the mapping from the system's Markov states to optimal actions, and (b) two of the system's state variables are selected to represent the Markov state. The latter introduces uncertainty and thus a conditional probability distribution associating the state transitions with the actions taken. Consequently, the POD method is evaluated in deriving the optimal policy (balance control policy) in a sequential decision-making problem under uncertainty.

The governing equations, derived from the free body diagram of the system shown in Figure 2, are

$$(M + m)\ddot{x} + b\dot{x} + mL\ddot{\varphi}\cos\varphi - mL\dot{\varphi}^2\sin\varphi = U, \qquad (19)$$
$$mL\ddot{x}\cos\varphi + (I + mL^2)\ddot{\varphi} + mgL\sin\varphi = 0, \qquad (20)$$

where $M = 0.5$ kg, $m = 0.2$ kg, $b = 0.1$ N·sec/m, $I = 0.006$ kg·m², $g = 9.81$ m/sec², and $L = 0.3$ m.

Figure 2. Free body diagram of the cart-pole system.

The goal of the learning controller is to realize in real time the force, $U$, of a fixed magnitude to be applied in either the right or the left direction so that the pendulum stands balanced when released from any angle, $\varphi$, between 3° and -3°. The system is simulated by numerically solving the nonlinear differential equations (19) and (20) employing the explicit Runge-Kutta method with a time step of $\tau = 0.02$ sec. The simulation is conducted by observing the system's states and executing actions (control force $U$) with a sample rate $T = 0.02$ sec (50 Hz). This sample rate defines a sequence of decision-making epochs, $T = 0, 1, 2, \dots, M$, $M \in \mathbb{N}$.

The system is fully specified by four state variables: (a) the position of the cart on the track, $x(t)$; (b) the cart velocity, $\dot{x}(t)$; (c) the pendulum's angle with respect to the vertical position, $\varphi(t)$; and (d) the angular velocity, $\dot{\varphi}(t)$. However, to incorporate uncertainty, the Markov states are selected to be only the pair of the pendulum's angle and angular velocity; namely, the finite state space $S$ in Eq. (1) is defined as

$$S = \{ s_i \mid s_i = (\varphi, \dot{\varphi}) \}. \qquad (21)$$

Consequently, at state $s_i \in S$ and executing a control force value, $U(s_i)$, the system will end up at state $s_j \in S$ with a conditional probability $p(s_j \mid s_i, U(s_i))$. The control force, $U(s_i)$, takes values from the finite set $A$, defined as

$$A = A(s_i) = [-3\ \mathrm{N},\ 3\ \mathrm{N}], \quad i = 1, 2, \dots, N, \ N = |S|. \qquad (22)$$

The decision-making process occurs at each of a sequence of epochs $T = 0, 1, 2, \dots, M$, $M \in \mathbb{N}$. At each decision epoch, the learning controller observes the system's state $s_i \in S$ and executes a control force value $U(s_i) \in A(s_i)$. At the next decision epoch, the system transits to another state $s_j \in S$ imposed by the conditional probabilities $p(s_j \mid s_i, U(s_i))$ and receives a numerical reward (the pendulum's angle $\varphi$).

The inverted pendulum is simulated repeatedly for different initial angles, $\varphi$, between 3° and -3° utilizing the POD learning method. The simulation lasts for 50 sec and each complete simulation defines one iteration. If at any instant during the simulation the pendulum's angle, $\varphi$, becomes greater than 3° or less than -3°, this constitutes a failure, and the run is counted as an iteration associated with a failure. If no failure occurs during the simulation, the run is counted as an iteration with no failure.

3.2 Simulation Results
After completing the learning process, the controller employing the POD learning method realizes the balance control policy of the pendulum, as illustrated in Figure 3. In some instances, however, the system's response demonstrates overshoots or delays during the transient period, shown in Figure 4. This can be handled by a denser parameterization of the state space or by adding a penalty on long transient responses. The efficiency of the POD learning method in deriving the optimal balance control policy that stabilizes the system is illustrated in Figure 5. It is noted that POD realizes the optimal policy after 749 failures; afterwards, as the number of iterations continues, no further failures occur.
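The simulation loop described above can be reproduced by integrating Eqs. (19)-(20) directly. The sketch below solves the two coupled equations for the accelerations at each step and advances the state with an explicit fourth-order Runge-Kutta scheme at τ = 0.02 sec, applying a force of ±3 N as in Eq. (22). The simple lean-opposing rule at the bottom is only a placeholder for the learned POD policy.

```python
import numpy as np

# Parameters from Section 3 (cart-pole)
M, m, b, I, g, L = 0.5, 0.2, 0.1, 0.006, 9.81, 0.3
TAU = 0.02  # integration / sampling step [s]

def dynamics(state, U):
    """Right-hand side of Eqs. (19)-(20), solved for (x_ddot, phi_ddot)."""
    x, x_dot, phi, phi_dot = state
    # Mass matrix and forcing terms of the two coupled equations
    A = np.array([[M + m,               m * L * np.cos(phi)],
                  [m * L * np.cos(phi), I + m * L**2       ]])
    rhs = np.array([U - b * x_dot + m * L * phi_dot**2 * np.sin(phi),
                    -m * g * L * np.sin(phi)])
    x_dd, phi_dd = np.linalg.solve(A, rhs)
    return np.array([x_dot, x_dd, phi_dot, phi_dd])

def rk4_step(state, U, dt=TAU):
    """One explicit Runge-Kutta (RK4) step with the control force held constant."""
    k1 = dynamics(state, U)
    k2 = dynamics(state + 0.5 * dt * k1, U)
    k3 = dynamics(state + 0.5 * dt * k2, U)
    k4 = dynamics(state + dt * k3, U)
    return state + dt / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4)

# One 50 sec iteration with a placeholder policy: push against the lean (Eq. (22))
state = np.array([0.0, 0.0, np.deg2rad(2.0), 0.0])   # released at phi = 2 deg
for _ in range(int(50.0 / TAU)):
    U = -3.0 if state[2] > 0 else 3.0                # +/- 3 N, fixed magnitude
    state = rk4_step(state, U)
    if abs(np.rad2deg(state[2])) > 3.0:              # failure: |phi| > 3 deg
        break
```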


Figure 3. Simulation of the system after learning the balance control policy with POD for different initial conditions (pendulum angle in degrees versus time).
Figure 4. Simulation of the system after learning the balance control policy with POD for different initial conditions (zoom in on the transient period).
Figure 5. Number of failures until POD derives the balance control policy.

4. CASE STUDY TWO
4.1 Vehicle Cruise Control
The POD real-time learning method introduced in the previous sections is now applied to a vehicle cruise-control problem. Cruise control automatically regulates the vehicle's longitudinal velocity by suitably adjusting the gas pedal position. A vehicle cruise-control system is activated by a driver who desires to maintain a constant speed in long highway driving. The driver activates the cruise controller while driving at a particular speed, which is then recorded as the desired or set-point speed to be maintained by the controller. The main goal in designing a cruise-control algorithm is to maintain vehicle speed smoothly but accurately, even under large variation of plant parameters (e.g., the vehicle's varying mass with the number of passengers) and road grade. In the case of passenger cars, however, vehicle mass may change noticeably but within a small range; therefore, powertrain behavior might not vary significantly.

The objective of the POD learning cruise controller is to realize in real time the control policy (gas pedal position) that maintains the vehicle speed as set by the driver over a great range of road grades. Learning vehicle cruise controllers have been addressed previously employing learning and active control approaches. Zhang et al. [36] implemented learning control based on pattern recognition to regulate in real time the parameters of a PID cruise controller. Shahdi et al. [37] proposed an active learning method to extract the driver's behavior and to derive control rules for a cruise-control system. However, no attempt has been reported to implement a learning automotive vehicle cruise controller utilizing the principle of reinforcement learning, i.e., enabling the controller to improve its performance over time by learning from its own failures through a reinforcement signal from the external environment and, thus, attempting to improve future performance.

The software package enDYNA by TESIS [38], suitable for real-time simulation of internal combustion engines, is used to evaluate the performance of the POD learning cruise controller. The software simulates the longitudinal vehicle dynamics with a highly variable drive train including the modules of starter, brake, clutch, converter, and transmission. In driving mode the engine is operated by means of the usual vehicle control elements just as a driver would do. In addition, a mechanical parking lock and the uphill grade can be set. The driver model is designed to operate the vehicle at given speed profiles (driving cycles).
It actuates the starter, accelerator, clutch, and brake pedals according to the profile specification, and also shifts gears. In this example, an existing vehicle model is selected representing a midsize passenger car carrying a 1.9 L turbocharged diesel engine.

When activated, the learning cruise controller bypasses the driver model and takes over the vehicle's cruising. The Markov states are defined to be the pair of the transmission gear and the difference between the desired and actual vehicle speed, $\Delta V$, namely,

$$S = \{ s_i \mid s_i = (\mathrm{gear}, \Delta V) \}. \qquad (23)$$

The actions, $\alpha$, correspond to the gas pedal position and can take values from the feasible set $A$, defined as

$$A = A(s_i) = [0, 0.7], \quad i = 1, 2, \dots, N, \ N = |S|. \qquad (24)$$

To incorporate uncertainty, the vehicle is simulated over a great range of road grades from 0° to 10°. Consequently, at state $s_i \in S$ and executing a control action (gas pedal position), the system transits to another state $s_j \in S$ with a conditional probability $p(s_j \mid s_i, \alpha(s_i))$, since the acceleration capability of a vehicle varies at different road grades. As a consequence of this state transition, the system receives a numerical reward (the difference between the desired and actual vehicle speed).

4.2 Simulation Results
After completing four simulations at each road grade, the POD cruise controller realizes the control policy (gas pedal position) that maintains the vehicle's speed at the desired set point. The vehicle model was initiated from zero speed. The driver model, following the driving cycle, accelerated the vehicle up to 40 mph and at 10 sec activated the POD cruise controller. The desired and actual vehicle speeds for three different road grades, as well as the gas pedal rates of the POD controller, are illustrated in Figure 6. The small discrepancy between the desired and actual vehicle speed before the cruise controller activation is due to the steady-state error of the driver model. However, since the desired driving cycle sets the vehicle's speed to 40 mph, the POD cruise controller corrects this error when activated and afterwards maintains the vehicle's actual speed at the set point. The accelerator pedal position settles at different values because for road grades of 2° and 6° the selected transmission gear is 2, as shown in Figure 7, while for a road grade of 10° the selected transmission gear is 1; at different selected gears, the accelerator pedal position varies to maintain constant vehicle speed. In Figure 8, the performance of the POD cruise controller is evaluated in a severe driving scenario in which the road grade changes from 0° to 10° while the POD cruise controller is active. In this scenario, POD is activated again at 10 sec when the road grade is 0°, and at 14 sec the road grade becomes 10°. The engine speed and the selected transmission gear for this scenario are shown in Figure 9. While the vehicle is cruising at constant speed and the road grade changes from 0° to 10°, the vehicle's speed starts decreasing after some time. Once this occurs, the self-learning cruise controller senses the discrepancy between the desired and actual vehicle speed and commands the accelerator pedal so as to correct the error. Consequently, there is a small time delay in the accelerator pedal command, illustrated in Figure 8, which depends on vehicle inertia.
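To make the state and action definitions of Eqs. (23)-(24) concrete, the sketch below maps a (gear, ΔV) pair to a row of a look-up table and selects the gas-pedal position greedily from that row. The bin width, pedal discretization, gear range, and table layout are illustrative assumptions; the paper does not report these details.

```python
import numpy as np

# Illustrative discretization of the Markov state (gear, delta_v), Eq. (23)
GEARS = [1, 2, 3, 4, 5]
DV_BINS = np.arange(-10.0, 10.5, 0.5)      # speed-error bins [mph] (assumed width)
PEDAL_LEVELS = np.linspace(0.0, 0.7, 15)   # discretized action set, Eq. (24)

def state_index(gear, delta_v):
    """Map the current gear and continuous speed error to a table row."""
    dv_idx = int(np.clip(np.digitize(delta_v, DV_BINS), 0, len(DV_BINS)))
    return GEARS.index(gear) * (len(DV_BINS) + 1) + dv_idx

def pedal_command(value_table, gear, delta_v):
    """Pick the gas-pedal position with the highest learned value for this state."""
    row = value_table[state_index(gear, delta_v)]
    return PEDAL_LEVELS[int(np.argmax(row))]

# Example query against a (still untrained) table of zeros
table = np.zeros((len(GEARS) * (len(DV_BINS) + 1), len(PEDAL_LEVELS)))
print(pedal_command(table, gear=2, delta_v=3.0))
```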
Figure 6. Vehicle speed and accelerator pedal rate for different road grades by self-learning cruise control with POD.
Figure 7. Engine speed and transmission gear selection for different road grades by self-learning cruise control with POD.
Figure 8. Vehicle speed and accelerator pedal rate for a road grade increase from 0° to 10°.


Figure 9. Engine speed and transmission gear selection for a road grade increase from 0° to 10°.

5. CONCLUDING REMARKS
This paper has presented a real-time learning method, consisting of a new state-space representation model and a learning algorithm, suited to solving sequential decision-making problems under uncertainty in real time. The method aims to build autonomous engineering systems that can learn to improve their performance over time in stochastic environments. Such systems must be able to sense their environments, estimate the consequences of their actions, and learn in real time the course of action (control policy) that optimizes a specified objective. The major difference between the POD learning method presented in this paper and existing reinforcement learning methods is that the former employs an evaluation function which does not require recursive iterations to approximate the Bellman equation. This approach has been shown to be especially appealing to systems in which the initial state is not fixed, and in which recursive updates of the evaluation functions to approximate the Bellman equation would demand a significant amount of experience to achieve the desired system performance.

The overall performance of the POD method in deriving an optimal control policy was evaluated by its application to two examples: (a) the cart-pole balancing problem, and (b) a real-time vehicle cruise-controller development. In the first problem, POD realized the balancing control policy for an inverted pendulum released from any angle between 3° and -3°. In the real-time cruise controller, POD was able to maintain the desired vehicle speed at any road grade between 0° and 10°. Future research should investigate the potential of advancing the POD method to accommodate more than one decision maker in sequential decision-making problems under uncertainty, known as multi-agent systems [39]. These problems are found in systems in which many intelligent decision makers (agents) interact with each other. The agents are considered to be autonomous entities. Their interactions can be either cooperative or selfish; i.e., the agents can share a common goal, e.g., control of vehicles operating in platoons to improve throughput on congested highways by allowing groups of vehicles to travel together in a tightly spaced platoon at high speeds. Alternatively, the agents can pursue their own interests.

ACKNOWLEDGMENTS
This research was partially supported by the Automotive Research Center (ARC), a U.S. Army Center of Excellence in Modeling and Simulation of Ground Vehicles at the University of Michigan. The engine simulation package enDYNA was provided by TESIS DYNAware GmbH. This support is gratefully acknowledged.

REFERENCES
[1] Bertsekas, D. P. and Shreve, S. E., Stochastic Optimal Control: The Discrete-Time Case, 1st edition, Athena Scientific, February 2007.
[2] Gosavi, A., "Reinforcement Learning for Long-Run Average Cost," European Journal of Operational Research, vol. 155, pp. 654-674, 2004.
[3] Bertsekas, D. P. and Tsitsiklis, J. N., Neuro-Dynamic Programming (Optimization and Neural Computation Series, 3), 1st edition, Athena Scientific, May 1996.
[4] Sutton, R. S. and Barto, A. G., Reinforcement Learning: An Introduction (Adaptive Computation and Machine Learning), The MIT Press, March 1998.
[5] Borkar, V. S., "A Learning Algorithm for Discrete-Time Stochastic Control," Probability in the Engineering and Informational Sciences, vol. 14, pp. 243-258, 2000.
[6] Samuel, A. L., "Some Studies in Machine Learning Using the Game of Checkers," IBM Journal of Research and Development, vol. 3, pp. 210-229, 1959.
[7] Samuel, A. L., "Some Studies in Machine Learning Using the Game of Checkers. II - Recent Progress," IBM Journal of Research and Development, vol. 11, pp. 601-617, 1967.
[8] Sutton, R. S., Temporal Credit Assignment in Reinforcement Learning, PhD Thesis, University of Massachusetts, Amherst, MA, 1984.
[9] Sutton, R. S., "Learning to Predict by the Methods of Temporal Differences," Machine Learning, vol. 3, pp. 9-44, 1988.
[10] Watkins, C. J., Learning from Delayed Rewards, PhD Thesis, King's College, Cambridge, England, May 1989.
[11] Watkins, C. J. C. H. and Dayan, P., "Q-learning," Machine Learning, vol. 8, pp. 279-292, 1992.
[12] Kaelbling, L. P., Littman, M. L., and Moore, A. W., "Reinforcement Learning: A Survey," Journal of Artificial Intelligence Research, vol. 4, pp. 237-285, 1996.
[13] Schwartz, A., "A Reinforcement Learning Method for Maximizing Undiscounted Rewards," Proceedings of the Tenth International Conference on Machine Learning, pp. 298-305, Amherst, Massachusetts, 1993.
[14] Mahadevan, S., "Average Reward Reinforcement Learning: Foundations, Algorithms, and Empirical Results," Machine Learning, vol. 22, pp. 159-195, 1996.


[15] Sutton, R. S., "Integrated Architectures for Learning, Planning, and Reacting Based on Approximating Dynamic Programming," Proceedings of the Seventh International Conference on Machine Learning, pp. 216-224, Austin, TX, USA, 1990.
[16] Sutton, R. S., "Planning by Incremental Dynamic Programming," Proceedings of the Eighth International Workshop on Machine Learning (ML91), pp. 353-357, Evanston, IL, USA, 1991.
[17] Moore, A. W. and Atkeson, C. G., "Prioritized Sweeping: Reinforcement Learning with Less Data and Less Time," Machine Learning, vol. 13, pp. 103-130, 1993.
[18] Peng, J. and Williams, R. J., "Efficient Learning and Planning Within the Dyna Framework," Proceedings of the IEEE International Conference on Neural Networks, pp. 168-174, San Francisco, CA, USA, 1993.
[19] Barto, A. G., Bradtke, S. J., and Singh, S. P., "Learning to Act Using Real-Time Dynamic Programming," Artificial Intelligence, vol. 72, pp. 81-138, 1995.
[20] Bellman, R., Dynamic Programming, Princeton University Press, Princeton, NJ, 1957.
[21] Puterman, M. L., Markov Decision Processes: Discrete Stochastic Dynamic Programming, 2nd rev. edition, Wiley-Interscience, 2005.
[22] Bertsekas, D. P., Dynamic Programming and Optimal Control (Volumes 1 and 2), Athena Scientific, September 2001.
[23] Malikopoulos, A. A., Papalambros, P. Y., and Assanis, D. N., "A Learning Algorithm for Optimal Internal Combustion Engine Calibration in Real Time," to be presented in the ASME 2007 International Design Engineering Technical Conferences and Computers and Information in Engineering Conference, Las Vegas, Nevada, September 4-7, 2007, DETC2007-34718.
[24] Malikopoulos, A. A., Assanis, D. N., and Papalambros, P. Y., "Real-Time, Self-Learning Optimization of Diesel Engine Calibration," to be presented in the 2007 Fall Technical Conference of the ASME Internal Combustion Engine Division, Charleston, South Carolina, October 14-17, 2007, ICEF2007-1603.
[25] Iwata, K., Ito, N., Yamauchi, K., and Ishii, N., "Combining Exploitation-Based and Exploration-Based Approach in Reinforcement Learning," Proceedings of Intelligent Data Engineering and Automated Learning - IDEAL 2000, pp. 326-331, Hong Kong, China, 2000.
[26] Ishii, S., Yoshida, W., and Yoshimoto, J., "Control of Exploitation-Exploration Meta-Parameter in Reinforcement Learning," Neural Networks, vol. 15, pp. 665-687, 2002.
[27] Chan-Geon, P. and Sung-Bong, Y., "Implementation of the Agent Using Universal On-Line Q-learning by Balancing Exploration and Exploitation in Reinforcement Learning," Journal of KISS: Software and Applications, vol. 30, pp. 672-680, 2003.
[28] Miyazaki, K. and Yamamura, M., "Marco Polo: A Reinforcement Learning System Considering Tradeoff Exploitation and Exploration under Markovian Environments," Journal of the Japanese Society for Artificial Intelligence, vol. 12, pp. 78-89, 1997.
[29] Hernandez-Aguirre, A., Buckles, B. P., and Martinez-Alcantara, A., "The Probably Approximately Correct (PAC) Population Size of a Genetic Algorithm," Proceedings of the 12th IEEE International Conference on Tools with Artificial Intelligence, pp. 199-202, Vancouver, BC, Canada, 2000.
[30] Anderson, C. W., "Learning to Control an Inverted Pendulum Using Neural Networks," IEEE Control Systems Magazine, vol. 9, pp. 31-37, 1989.
[31] Williams, V. and Matsuoka, K., "Learning to Balance the Inverted Pendulum Using Neural Networks," Proceedings of the 1991 IEEE International Joint Conference on Neural Networks, pp. 214-219, Singapore, 1991.
[32] Zhidong, D., Zaixing, Z., and Peifa, J., "A Neural-Fuzzy BOXES Control System with Reinforcement Learning and its Applications to Inverted Pendulum," Proceedings of the 1995 IEEE International Conference on Systems, Man and Cybernetics, pp. 1250-1254, Vancouver, BC, Canada, 1995.
[33] Jeen-Shing, W. and McLaren, R., "A Modified Defuzzifier for Control of the Inverted Pendulum Using Learning," Proceedings of the 1997 Annual Meeting of the North American Fuzzy Information Processing Society - NAFIPS, pp. 118-123, Syracuse, NY, USA, 1997.
[34] Mustapha, S. M. and Lachiver, G., "A Modified Actor-Critic Reinforcement Learning Algorithm," Proceedings of the 2000 Canadian Conference on Electrical and Computer Engineering, pp. 605-609, Halifax, NS, Canada, 2000.
[35] Si, J. and Wang, Y. T., "On-line Learning Control by Association and Reinforcement," IEEE Transactions on Neural Networks, vol. 12, pp. 264-276, 2001.
[36] Zhang, B. S., Leigh, I., and Leigh, J. R., "Learning Control Based on Pattern Recognition Applied to Vehicle Cruise Control Systems," Proceedings of the American Control Conference, pp. 3101-3105, Seattle, WA, USA, 1995.
[37] Shahdi, S. A. and Shouraki, S. B., "Use of Active Learning Method to Develop an Intelligent Stop and Go Cruise Control," Proceedings of the IASTED International Conference on Intelligent Systems and Control, pp. 87-90, Salzburg, Austria, 2003.
[38] TESIS DYNAware GmbH, enDYNA real-time engine and vehicle simulation software.
[39] Panait, L. and Luke, S., "Cooperative Multi-Agent Learning: The State of the Art," Autonomous Agents and Multi-Agent Systems, vol. 11, pp. 387-434, 2005.

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!