Cartesian product of the finite state space and action space of the MDP:

$$H : (S \times A) \times (S \times A) \to \mathcal{S},$$   (10)

where $S = \{ s_i \mid i = 1, 2, \ldots, N \}$, $N \in \mathbb{N}$, denotes the Markov state space, and $A = \bigcup_{s_i \in S} A(s_i)$, $\forall s_i \in S$, stands for a finite action space. This mapping generates an indexed family of subsets, $S_{s_i}$, for each state $s_i \in S$, defined as Predictive Representation Nodes (PRNs). Each PRN is constituted by a set of POD states, $s_m^i \in \mathcal{S}$:

$$S_{s_i} = \{ s_m^i \mid m = 1, 2, \ldots, N,\; N = |S| \in \mathbb{N} \}.$$   (11)

Each POD state $s_m^i \in \mathcal{S}$ essentially represents a Markov state transition from $s_i \in S$ to $s_m \in S$. PRNs partition the POD domain insofar as the POD underlying structure captures the state transitions in the Markov domain. Consequently, a PRN is defined as

$$S_{s_i} = \Big\{ s_m^i \;\Big|\; s_m^i \equiv s_i \to s_m,\; \sum_{m=1}^{N} p(s_m \mid s_i, \mu(s_i)) = 1,\; N = |S| \Big\},\quad \forall s_i, s_m \in S,\; \forall \mu(s_i) \in A(s_i),$$   (12)

the union of which defines the POD domain

$$\mathcal{S} = \bigcup_{s_i \in S} S_{s_i},$$   (13)

with

$$S_{s_i} \cap S_{s_m} = \emptyset, \quad i \neq m.$$   (14)

Each PRN, $S_{s_i}$, corresponds to a Markov state, $s_i \in S$, and portrays all possible transitions occurring from this state $s_i$ to the other states $s_m \in S$. PRNs, constituting the fundamental aspect of the POD state representation, provide an assessment of the Markov state transitions along with the actions executed at each state. This assessment aims to establish a necessary embedded property of the new state representation so as to consider the potential transitions that can occur in subsequent decision epochs. The assessment is expressed by means of the PRN value, $R(s_m^i \mid \mu(s_i))$, which accounts for the maximum average expected reward that can be achieved by transitions occurring inside a PRN. Consequently, the PRN value is defined as

$$R(s_m^i \mid \mu(s_i)) = \max_{\mu(s_i) \in A(s_i)} \left( \frac{\sum_{m=1}^{N} p(s_m \mid s_i, \mu(s_i)) \cdot R(s_m \mid s_i, \mu(s_i))}{N} \right),$$
$$\forall s_m^i \in \mathcal{S},\; \forall s_i, s_m \in S,\; \forall \mu(s_i) \in A(s_i), \text{ and } N = |S|.$$   (15)
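As a concrete illustration, the PRN value of Eq. (15) can be sketched numerically. The array layout, the random toy MDP, and all names below (`P`, `R`, `prn_value`) are illustrative assumptions for this sketch, not definitions from the paper.

```python
import numpy as np

# Toy MDP for illustration: 3 Markov states, 2 admissible actions per state.
N_STATES, N_ACTIONS = 3, 2
rng = np.random.default_rng(0)

# P[a, i, m] = p(s_m | s_i, a); rows are normalized so the transition
# probabilities out of each state sum to 1, matching the constraint in Eq. (12).
P = rng.random((N_ACTIONS, N_STATES, N_STATES))
P /= P.sum(axis=2, keepdims=True)

# R[a, i, m] = R(s_m | s_i, a), the expected reward of the transition s_i -> s_m.
R = rng.random((N_ACTIONS, N_STATES, N_STATES))

def prn_value(i: int, P: np.ndarray, R: np.ndarray) -> float:
    """Eq. (15): max over actions mu(s_i) of the average expected reward
    (1/N) * sum_m p(s_m | s_i, mu) * R(s_m | s_i, mu) inside the PRN S_{s_i}."""
    n = P.shape[1]
    avg = (P[:, i, :] * R[:, i, :]).sum(axis=1) / n  # one entry per action
    return float(avg.max())

values = [prn_value(i, P, R) for i in range(N_STATES)]
```

The inner sum is the expected one-step reward under a given action; dividing by $N$ and maximizing over actions yields the maximum average expected reward that Eq. (15) assigns to the PRN.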
The PRN value is exploited by the POD state representation as an evaluation metric to estimate the subsequent Markov state transitions. The estimation property is founded on the assessment of POD states by means of an expected evaluation function, $R_{PRN}(s_m^i, \mu(s_i))$, defined as

$$R_{PRN}(s_m^i, \mu(s_i)) = p(s_m \mid s_i, \mu(s_i)) \cdot R(s_m \mid s_i, \mu(s_i)) + R(s_j^m \mid \mu(s_m)),$$
$$\forall s_m^i, s_j^m \in \mathcal{S},\; \forall s_i, s_m \in S,\; \forall \mu(s_i) \in A(s_i),\; \forall \mu(s_m) \in A(s_m).$$   (16)

Consequently, employing the POD evaluation function through Eq. (16), each POD state, $s_m^i \in S_{s_i}$, is comprised of an overall reward corresponding to: (a) the expected reward of transiting from state $s_i$ to $s_m$ (implying also the transition from the PRN $S_{s_i}$ to $S_{s_m}$); and (b) the maximum average expected reward when transiting from $s_m$ to any other Markov state (transition occurring into $S_{s_m}$).

While the system interacts with its environment, the POD model learns the system dynamics in terms of the Markov state transitions. The POD state representation attempts to provide a process for realizing the sequences of state transitions that occurred in the Markov domain, as infused in PRNs. The different sequences of the Markov state transitions are captured by the POD states and evaluated through the expected evaluation functions given in Eq. (16). Consequently, the highest value of the expected evaluation function at each POD state essentially estimates the subsequent Markov state transitions with respect to the actions taken. As the process is stochastic, however, it is still necessary for the real-time learning method to build a decision-making mechanism of how to select actions.

The learning performance is closely related to the exploration-exploitation strategy of the action space.
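The two terms of Eq. (16) can be sketched as follows: an immediate expected-reward term for the transition $s_i \to s_m$, plus the PRN value of the successor PRN $S_{s_m}$. As before, the array layout and names (`P`, `R`, `r_prn`) are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

# Toy MDP, same assumed layout as the previous sketch:
# P[a, i, m] = p(s_m | s_i, a), R[a, i, m] = R(s_m | s_i, a).
N_STATES, N_ACTIONS = 3, 2
rng = np.random.default_rng(1)
P = rng.random((N_ACTIONS, N_STATES, N_STATES))
P /= P.sum(axis=2, keepdims=True)
R = rng.random((N_ACTIONS, N_STATES, N_STATES))

def prn_value(m: int, P: np.ndarray, R: np.ndarray) -> float:
    """Eq. (15): maximum average expected reward of transitions leaving s_m."""
    return float(((P[:, m, :] * R[:, m, :]).sum(axis=1) / P.shape[1]).max())

def r_prn(m: int, i: int, a: int, P: np.ndarray, R: np.ndarray) -> float:
    """Eq. (16): expected reward of s_i -> s_m under action a (term (a)),
    plus the PRN value of the successor PRN S_{s_m} (term (b))."""
    return P[a, i, m] * R[a, i, m] + prn_value(m, P, R)
```

Because the first term is nonnegative, each POD state's evaluation is bounded below by the PRN value of its successor node, which is what lets the highest-valued POD state act as a one-step-ahead estimate of the next Markov transition.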
More precisely, the decision maker has to exploit what is already known regarding the correlation involving the admissible state-action pairs that maximize the rewards, and also to explore those actions that have not yet been tried for these pairs to assess whether they may result in higher rewards. A balance between an exhaustive exploration of the environment and the exploitation of the learned policy is fundamental to reaching nearly optimal solutions in few decision epochs and, thus, to enhancing the learning performance. This exploration-exploitation dilemma has been extensively reported in the literature. Iwata et al. [25] proposed a model-based learning method extending Q-learning and introducing two separate functions based on statistics and on information by applying exploration and exploitation strategies. Ishii et al. [26] developed a model-based reinforcement learning method utilizing a balance parameter, controlled through variation of action rewards and perception of environmental change. Chan-Geon et al. [27] proposed an exploration-exploitation policy in Q-learning consisting of an auxiliary Markov process and the original Markov process. Miyazaki et al. [28] developed a unified learning system realizing the tradeoff between exploration and exploitation. Hernandez-Aguirre et al. [29] analyzed the problem of exploration-exploitation in the context of the probably approximately correct framework and studied whether it is possible to give bounds on the complexity of the

4                                                                Copyright © 2007 by ASME
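A common, simple way to realize the exploration-exploitation balance described above is an epsilon-greedy rule. This is a generic illustration of the tradeoff, not the selection mechanism proposed in this paper or in the cited works [25-29].

```python
import random

def epsilon_greedy(values, epsilon=0.1, rng=random):
    """With probability epsilon, explore: pick a random admissible action.
    Otherwise, exploit: pick the action with the highest current value estimate."""
    if rng.random() < epsilon:
        return rng.randrange(len(values))
    return max(range(len(values)), key=lambda a: values[a])

# With epsilon = 0 the rule is purely exploitative and returns the argmax.
action = epsilon_greedy([0.2, 0.9, 0.4], epsilon=0.0)
```

Annealing `epsilon` toward zero over decision epochs shifts the agent from exploration early on to exploitation of the learned policy, which is the balance the passage identifies as critical to learning performance.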