Infinite-Horizon Average Reward Markov Decision Processes

Dan Zhang, Spring 2012

Outline

- The average reward
- Classification of MDPs
- Optimality equations
- Value iteration in unichain models
- Policy iteration in unichain models
- Linear programming in unichain models


Assumptions

- Stationary rewards and transition probabilities: r(s, a) and p(j|s, a) do not vary with time.
- Bounded rewards: |r(s, a)| ≤ M < ∞.
- Finite state space.
- Unichain: the transition matrix corresponding to every deterministic stationary policy is unichain, i.e., it consists of a single recurrent class plus a possibly empty set of transient states.
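To make these assumptions concrete, here is a minimal sketch in Python/NumPy (the arrays r and P and their numerical values are illustrative, not from the slides) of a two-state, two-action unichain MDP with stationary, bounded rewards; the later algorithm sketches assume this array layout.

import numpy as np

# r[s, a] = r(s, a): stationary, bounded one-period rewards (illustrative values).
r = np.array([[ 5.0, 10.0],
              [-1.0,  1.0]])

# P[s, a, j] = p(j | s, a): stationary transition probabilities.
P = np.array([[[0.5, 0.5],    # state 0, action 0
               [0.0, 1.0]],   # state 0, action 1
              [[0.8, 0.2],    # state 1, action 0
               [0.4, 0.6]]])  # state 1, action 1

assert np.allclose(P.sum(axis=2), 1.0)  # each transition row sums to 1

# Under every deterministic stationary policy each state can reach the other,
# so every induced chain has a single recurrent class: the model is unichain.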


The Average Reward Optimality Equation – Unichain Models

For unichain models, it can be shown that all stationary policies have constant gain g.

Optimality equations:

   0 = max_{a ∈ A_s} [ r(s, a) − g + ∑_{j ∈ S} p(j|s, a) h(j) − h(s) ].

In matrix notation:

   0 = max_{d ∈ D} { r_d − g e + (P_d − I) h } ≡ B(g, h).
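For finite S and A_s the operator B(g, h) can be evaluated componentwise; the sketch below (Python/NumPy, assuming the r[s, a] and P[s, a, j] arrays from the earlier setup) is one way to check a candidate pair (g, h) against the optimality equation.

import numpy as np

def B(g, h, r, P):
    # B(g, h)(s) = max_a [ r(s, a) - g + sum_j p(j|s, a) h(j) - h(s) ]
    q = r - g + P @ h          # q[s, a] = r(s, a) - g + sum_j p(j|s, a) h(j)
    return q.max(axis=1) - h

By the theorem on the next slide, a pair with B(g, h) = 0 identifies g as the optimal gain; the algorithms on the following slides construct such a pair.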


The Average Reward Optimality Equation – Unichain Models

Theorem. Suppose S is countable.
(i) If there exist a scalar g and an h ∈ V which satisfy B(g, h) ≤ 0, then g e ≥ g*_+;
(ii) If there exist a scalar g and an h ∈ V which satisfy B(g, h) ≥ 0, then g e ≤ sup_{d ∈ D^MD} g_−^{d^∞} ≤ g*_−;
(iii) If there exist a scalar g and an h ∈ V which satisfy B(g, h) = 0, then g e = g* = g*_+ = g*_−.


Existence of Solutions to the Optimality Equation – Unichain Models

Theorem. Suppose S and A_s are finite, |r(s, a)| ≤ M < ∞ for all s and a, and the model is unichain.
(i) There exist a g ∈ R^1 and an h ∈ V for which

   0 = max_{d ∈ D} { r_d − g e + (P_d − I) h };

(ii) If (g′, h′) is any other solution of the average reward optimality equation, then g = g′.


Existence of Optimal Policies – Unichain Models

A decision rule d_h is h-improving if d_h ∈ argmax_{d ∈ D} { r_d + P_d h }.

Theorem. Suppose there exist a scalar g* and an h* ∈ V for which B(g*, h*) = 0. Then if d* is h*-improving, (d*)^∞ is average optimal.


Existence of Optimal Policies – Unichain Models

Theorem. Suppose S and A_s are finite, r(s, a) is bounded, and the model is unichain. Then
(i) there exists a stationary average optimal policy;
(ii) there exist a scalar g* and an h* ∈ V for which B(g*, h*) = 0;
(iii) any stationary policy derived from an h*-improving decision rule is average optimal;
(iv) g* e = g*_+ = g*_−.


Value Iteration

1. Select v^0 ∈ V, specify ε > 0, and set n = 0.
2. For each s ∈ S, compute v^{n+1}(s) by
   v^{n+1}(s) = max_{a ∈ A_s} [ r(s, a) + ∑_{j ∈ S} p(j|s, a) v^n(j) ].
3. If sp(v^{n+1} − v^n) < ε, go to step 4. Otherwise, increment n by 1 and return to step 2.
4. For each s ∈ S, choose
   d_ε(s) ∈ argmax_{a ∈ A_s} [ r(s, a) + ∑_{j ∈ S} p(j|s, a) v^{n+1}(j) ]
   and stop.
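A minimal sketch of this scheme in Python/NumPy, assuming the r[s, a] and P[s, a, j] arrays from the earlier setup; np.ptp computes max minus min, i.e. the span seminorm sp(·) used in the stopping rule.

import numpy as np

def value_iteration(r, P, eps=1e-8, max_iter=100_000):
    v = np.zeros(r.shape[0])                    # step 1: v^0 = 0
    for _ in range(max_iter):
        v_next = (r + P @ v).max(axis=1)        # step 2: Bellman update
        if np.ptp(v_next - v) < eps:            # step 3: span stopping test
            break
        v = v_next
    d_eps = (r + P @ v_next).argmax(axis=1)     # step 4: eps-optimal decision rule
    # Once the span test is met, every component of v_next - v is within eps
    # of the optimal gain; take the midpoint as an estimate.
    g_est = 0.5 * (np.max(v_next - v) + np.min(v_next - v))
    return d_eps, g_est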


Relative Value Iteration

1. Select u^0 ∈ V, choose s* ∈ S, specify ε > 0, set w^0 = u^0 − u^0(s*) e, and set n = 0.
2. For each s ∈ S, compute u^{n+1}(s) by
   u^{n+1}(s) = max_{a ∈ A_s} [ r(s, a) + ∑_{j ∈ S} p(j|s, a) w^n(j) ].
   Let w^{n+1} = u^{n+1} − u^{n+1}(s*) e.
3. If sp(u^{n+1} − u^n) < ε, go to step 4. Otherwise, increment n by 1 and return to step 2.
4. For each s ∈ S, choose
   d_ε(s) ∈ argmax_{a ∈ A_s} [ r(s, a) + ∑_{j ∈ S} p(j|s, a) u^n(j) ]
   and stop.
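A corresponding sketch of relative value iteration (Python/NumPy, same assumed arrays). Normalizing at the reference state s* keeps the iterates bounded, and u^{n+1}(s*) approximates the gain once the span test is met.

import numpy as np

def relative_value_iteration(r, P, s_star=0, eps=1e-8, max_iter=100_000):
    u = np.zeros(r.shape[0])                    # step 1: u^0 = 0
    w = u - u[s_star]                           # w^0 = u^0 - u^0(s*) e
    for _ in range(max_iter):
        u_next = (r + P @ w).max(axis=1)        # step 2
        w_next = u_next - u_next[s_star]
        if np.ptp(u_next - u) < eps:            # step 3: span stopping test
            break
        u, w = u_next, w_next
    d_eps = (r + P @ u).argmax(axis=1)          # step 4: uses u^n, as on the slide
    g_est = u_next[s_star]                      # u^{n+1}(s*) approximates the gain
    return d_eps, g_est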




Policy Iteration

1. Set n = 0 and select an arbitrary decision rule d_0 ∈ D.
2. (Policy evaluation) Obtain a scalar g^n and an h^n ∈ V by solving
   0 = r_{d_n} − g e + (P_{d_n} − I) h.
3. (Policy improvement) Choose d_{n+1} ∈ argmax_{d ∈ D} [ r_d + P_d h^n ], setting d_{n+1} = d_n if possible.
4. If d_{n+1} = d_n, stop and set d* = d_n. Otherwise increment n by 1 and return to step 2.

Practical consideration: set h^n(s_0) = 0 for some fixed s_0 ∈ S.
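A minimal sketch of policy iteration for the same assumed arrays, with the practical normalization h^n(s_0) = 0 built into the evaluation step; in unichain models the evaluation equations plus this normalization row have a unique solution.

import numpy as np

def policy_iteration(r, P, s0=0, max_iter=1000):
    n_states = r.shape[0]
    states = np.arange(n_states)
    d = np.zeros(n_states, dtype=int)              # step 1: arbitrary rule d_0
    for _ in range(max_iter):
        P_d = P[states, d]                         # P_d[s, j] = p(j | s, d(s))
        r_d = r[states, d]                         # r_d[s] = r(s, d(s))
        # Step 2 (evaluation): solve 0 = r_d - g e + (P_d - I) h with h(s0) = 0.
        # Unknowns x = (g, h(0), ..., h(S-1)).
        A = np.zeros((n_states + 1, n_states + 1))
        A[:n_states, 0] = 1.0                      # coefficient of g
        A[:n_states, 1:] = np.eye(n_states) - P_d  # coefficients of h
        A[n_states, 1 + s0] = 1.0                  # normalization h(s0) = 0
        b = np.concatenate([r_d, [0.0]])
        sol = np.linalg.solve(A, b)
        g, h = sol[0], sol[1:]
        # Step 3 (improvement): argmax over actions, keeping d_n where still maximal.
        q = r + P @ h                              # q[s, a] = r(s, a) + sum_j p(j|s,a) h(j)
        d_next = np.where(np.isclose(q[states, d], q.max(axis=1)),
                          d, q.argmax(axis=1))
        if np.array_equal(d_next, d):              # step 4: stop at a fixed rule
            return d, g, h
        d = d_next
    return d, g, h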


Linear Programming

The primal linear program is

   min_{g, h} g
   subject to g + h(s) − ∑_{j ∈ S} p(j|s, a) h(j) ≥ r(s, a), ∀ s ∈ S, a ∈ A_s.

The dual linear program is

   max_x ∑_{s ∈ S} ∑_{a ∈ A_s} r(s, a) x(s, a)
   subject to ∑_{a ∈ A_j} x(j, a) − ∑_{s ∈ S} ∑_{a ∈ A_s} p(j|s, a) x(s, a) = 0, ∀ j ∈ S,
              ∑_{s ∈ S} ∑_{a ∈ A_s} x(s, a) = 1,
              x(s, a) ≥ 0, ∀ s ∈ S, a ∈ A_s.
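A sketch of solving the primal linear program with scipy.optimize.linprog (the choice of SciPy's HiGHS solver is an assumption, not part of the slides), again for the r[s, a] and P[s, a, j] arrays defined earlier; the optimal objective value is the optimal gain g*.

import numpy as np
from scipy.optimize import linprog

def solve_primal_lp(r, P):
    n_states, n_actions = r.shape
    # Variables x = (g, h(0), ..., h(S-1)); minimize g.
    c = np.zeros(1 + n_states)
    c[0] = 1.0
    A_ub, b_ub = [], []
    for s in range(n_states):
        for a in range(n_actions):
            # g + h(s) - sum_j p(j|s,a) h(j) >= r(s,a), rewritten as <= for linprog.
            row = np.zeros(1 + n_states)
            row[0] = -1.0
            row[1:] = P[s, a] - (np.arange(n_states) == s)
            A_ub.append(row)
            b_ub.append(-r[s, a])
    res = linprog(c, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
                  bounds=[(None, None)] * (1 + n_states), method="highs")
    g_star, h = res.x[0], res.x[1:]
    return g_star, h

The dual could be handled the same way with equality constraints (A_eq, b_eq) and nonnegativity bounds; an optimal x(s, a) then gives stationary state-action frequencies from which an optimal decision rule can be read off.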
