Cartesian product of the finite state space and action space of the MDP:

$$H : (S \times A) \times (S \times A) \to \mathcal{S},$$   (10)

where $S = \{ s_i \mid i = 1, 2, \ldots, N \}$, $N \in \mathbb{N}$, denotes the Markov state space, and $A = \bigcup_{s_i \in S} A(s_i)$, $\forall s_i \in S$, stands for a finite action space. This mapping generates an indexed family of subsets, $S_{s_i}$, for each state $s_i \in S$, defined as Predictive Representation Nodes (PRNs). Each PRN is constituted by a set of POD states, $s_m^i \in \mathcal{S}$:

$$S_{s_i} = \{ s_m^i \mid m = 1, 2, \ldots, N,\; N = |S| \in \mathbb{N} \}.$$   (11)

Each POD state $s_m^i \in \mathcal{S}$ essentially represents a Markov state transition from $s_i \in S$ to $s_m \in S$. PRNs partition the POD domain insofar as the POD underlying structure captures the state transitions in the Markov domain. Consequently, a PRN is defined as

$$S_{s_i} = \Big\{ s_m^i \;\Big|\; s_m^i \equiv s_i \to s_m,\; \sum_{m=1}^{N} p(s_m \mid s_i, \mu(s_i)) = 1,\; N = |S| \Big\},\quad \forall s_i, s_m \in S,\; \forall \mu(s_i) \in A(s_i),$$   (12)

the union of which defines the POD domain

$$\mathcal{S} = \bigcup_{s_i \in S} S_{s_i},$$   (13)

with

$$S_{s_i} \cap S_{s_m} = \emptyset, \quad i \neq m.$$   (14)

Each PRN, $S_{s_i}$, corresponds to a Markov state, $s_i \in S$, and portrays all possible transitions occurring from this state $s_i$ to the other states $s_m \in S$. PRNs, constituting the fundamental aspect of the POD state representation, provide an assessment of the Markov state transitions along with the actions executed at each state. This assessment aims to establish a necessary embedded property of the new state representation so as to consider the potential transitions that can occur in subsequent decision epochs. The assessment is expressed by means of the PRN value, $R(s_m^i \mid \mu(s_i))$, which accounts for the maximum average expected reward that can be achieved by transitions occurring inside a PRN. Consequently, the PRN value is defined as

$$R(s_m^i \mid \mu(s_i)) = \max_{\mu(s_i) \in A(s_i)} \left( \frac{\sum_{m=1}^{N} p(s_m \mid s_i, \mu(s_i)) \cdot R(s_m \mid s_i, \mu(s_i))}{N} \right),$$
$$\forall s_m^i \in \mathcal{S},\; \forall s_i, s_m \in S,\; \forall \mu(s_i) \in A(s_i), \text{ and } N = |S|.$$   (15)
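As a concrete illustration, the PRN value of Eq. (15) can be sketched numerically. The array layout, the random toy MDP, and all names below (`P`, `R`, `prn_value`) are illustrative assumptions for this sketch, not definitions from the paper.

```python
import numpy as np

# Toy MDP for illustration: 3 Markov states, 2 admissible actions per state.
N_STATES, N_ACTIONS = 3, 2
rng = np.random.default_rng(0)

# P[a, i, m] = p(s_m | s_i, a); rows are normalized so the transition
# probabilities out of each state sum to 1, matching the constraint in Eq. (12).
P = rng.random((N_ACTIONS, N_STATES, N_STATES))
P /= P.sum(axis=2, keepdims=True)

# R[a, i, m] = R(s_m | s_i, a), the expected reward of the transition s_i -> s_m.
R = rng.random((N_ACTIONS, N_STATES, N_STATES))

def prn_value(i: int, P: np.ndarray, R: np.ndarray) -> float:
    """Eq. (15): max over actions mu(s_i) of the average expected reward
    (1/N) * sum_m p(s_m | s_i, mu) * R(s_m | s_i, mu) inside the PRN S_{s_i}."""
    n = P.shape[1]
    avg = (P[:, i, :] * R[:, i, :]).sum(axis=1) / n  # one entry per action
    return float(avg.max())

values = [prn_value(i, P, R) for i in range(N_STATES)]
```

The inner sum is the expected one-step reward under a given action; dividing by $N$ and maximizing over actions yields the maximum average expected reward that Eq. (15) assigns to the PRN.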
The PRN value is exploited by the POD state representation as an evaluation metric to estimate the subsequent Markov state transitions. The estimation property is founded on the assessment of POD states by means of an expected evaluation function, $R_{PRN}(s_m^i, \mu(s_i))$, defined as

$$R_{PRN}(s_m^i, \mu(s_i)) = p(s_m \mid s_i, \mu(s_i)) \cdot R(s_m \mid s_i, \mu(s_i)) + R(s_j^m \mid \mu(s_m)),$$
$$\forall s_m^i, s_j^m \in \mathcal{S},\; \forall s_i, s_m \in S,\; \forall \mu(s_i) \in A(s_i),\; \forall \mu(s_m) \in A(s_m).$$   (16)

Consequently, employing the POD evaluation function through Eq. (16), each POD state, $s_m^i \in S_{s_i}$, is comprised of an overall reward corresponding to: (a) the expected reward of transiting from state $s_i$ to $s_m$ (implying also the transition from the PRN $S_{s_i}$ to $S_{s_m}$); and (b) the maximum average expected reward when transiting from $s_m$ to any other Markov state (transition occurring into $S_{s_m}$).

While the system interacts with its environment, the POD model learns the system dynamics in terms of the Markov state transitions. The POD state representation attempts to provide a process for realizing the sequences of state transitions that occurred in the Markov domain, as infused in PRNs. The different sequences of the Markov state transitions are captured by the POD states and evaluated through the expected evaluation functions given in Eq. (16). Consequently, the highest value of the expected evaluation function at each POD state essentially estimates the subsequent Markov state transitions with respect to the actions taken. As the process is stochastic, however, it is still necessary for the real-time learning method to build a decision-making mechanism of how to select actions.

The learning performance is closely related to the exploration-exploitation strategy of the action space.
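The two terms of Eq. (16) can be sketched as follows: an immediate expected-reward term for the transition $s_i \to s_m$, plus the PRN value of the successor PRN $S_{s_m}$. As before, the array layout and names (`P`, `R`, `r_prn`) are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

# Toy MDP, same assumed layout as the previous sketch:
# P[a, i, m] = p(s_m | s_i, a), R[a, i, m] = R(s_m | s_i, a).
N_STATES, N_ACTIONS = 3, 2
rng = np.random.default_rng(1)
P = rng.random((N_ACTIONS, N_STATES, N_STATES))
P /= P.sum(axis=2, keepdims=True)
R = rng.random((N_ACTIONS, N_STATES, N_STATES))

def prn_value(m: int, P: np.ndarray, R: np.ndarray) -> float:
    """Eq. (15): maximum average expected reward of transitions leaving s_m."""
    return float(((P[:, m, :] * R[:, m, :]).sum(axis=1) / P.shape[1]).max())

def r_prn(m: int, i: int, a: int, P: np.ndarray, R: np.ndarray) -> float:
    """Eq. (16): expected reward of s_i -> s_m under action a (term (a)),
    plus the PRN value of the successor PRN S_{s_m} (term (b))."""
    return P[a, i, m] * R[a, i, m] + prn_value(m, P, R)
```

Because the first term is nonnegative, each POD state's evaluation is bounded below by the PRN value of its successor node, which is what lets the highest-valued POD state act as a one-step-ahead estimate of the next Markov transition.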
More precisely, the decision maker has to exploit what is already known regarding the correlation involving the admissible state-action pairs that maximize the rewards, and also to explore those actions that have not yet been tried for these pairs to assess whether they may result in higher rewards. A balance between an exhaustive exploration of the environment and the exploitation of the learned policy is fundamental to reaching nearly optimal solutions in few decision epochs and, thus, to enhancing the learning performance. This exploration-exploitation dilemma has been extensively reported in the literature. Iwata et al. [25] proposed a model-based learning method extending Q-learning and introducing two separate functions based on statistics and on information by applying exploration and exploitation strategies. Ishii et al. [26] developed a model-based reinforcement learning method utilizing a balance parameter, controlled through variation of action rewards and perception of environmental change. Chan-Geon et al. [27] proposed an exploration-exploitation policy in Q-learning consisting of an auxiliary Markov process and the original Markov process. Miyazaki et al. [28] developed a unified learning system realizing the tradeoff between exploration and exploitation. Hernandez-Aguirre et al. [29] analyzed the problem of exploration-exploitation in the context of the probably approximately correct framework and studied whether it is possible to give bounds on the complexity of the

4                                                                Copyright © 2007 by ASME
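A common, simple way to realize the exploration-exploitation balance described above is an epsilon-greedy rule. This is a generic illustration of the tradeoff, not the selection mechanism proposed in this paper or in the cited works [25-29].

```python
import random

def epsilon_greedy(values, epsilon=0.1, rng=random):
    """With probability epsilon, explore: pick a random admissible action.
    Otherwise, exploit: pick the action with the highest current value estimate."""
    if rng.random() < epsilon:
        return rng.randrange(len(values))
    return max(range(len(values)), key=lambda a: values[a])

# With epsilon = 0 the rule is purely exploitative and returns the argmax.
action = epsilon_greedy([0.2, 0.9, 0.4], epsilon=0.0)
```

Annealing `epsilon` toward zero over decision epochs shifts the agent from exploration early on to exploitation of the learned policy, which is the balance the passage identifies as critical to learning performance.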