Infinite-Horizon Average Reward Markov Decision Processes

Dan Zhang, Spring 2012

Outline

- The average reward
- Classification of MDPs
- Optimality equations
- Value iteration in unichain models
- Policy iteration in unichain models
- Linear programming in unichain models


Assumptions

- Stationary rewards and transition probabilities: r(s, a) and p(j|s, a) do not vary with time.
- Bounded rewards: |r(s, a)| ≤ M < ∞.
- Finite state space.
- Unichain: the transition matrix corresponding to every deterministic stationary policy is unichain, i.e., it consists of a single recurrent class plus a possibly empty set of transient states.
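To make these assumptions concrete, here is a minimal sketch in Python/NumPy (the arrays r and P and their numerical values are illustrative, not from the slides) of a two-state, two-action unichain MDP with stationary, bounded rewards; the later algorithm sketches assume this array layout.

import numpy as np

# r[s, a] = r(s, a): stationary, bounded one-period rewards (illustrative values).
r = np.array([[ 5.0, 10.0],
              [-1.0,  1.0]])

# P[s, a, j] = p(j | s, a): stationary transition probabilities.
P = np.array([[[0.5, 0.5],    # state 0, action 0
               [0.0, 1.0]],   # state 0, action 1
              [[0.8, 0.2],    # state 1, action 0
               [0.4, 0.6]]])  # state 1, action 1

assert np.allclose(P.sum(axis=2), 1.0)  # each transition row sums to 1

# Under every deterministic stationary policy each state can reach the other,
# so every induced chain has a single recurrent class: the model is unichain.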


The Average Reward Optimality Equation – Unichain Models

For unichain models, it can be shown that all stationary policies have constant gain g.

Optimality equations:

   0 = max_{a ∈ A_s} [ r(s, a) − g + ∑_{j ∈ S} p(j|s, a) h(j) − h(s) ].

In matrix notation:

   0 = max_{d ∈ D} { r_d − g e + (P_d − I) h } ≡ B(g, h).
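For finite S and A_s the operator B(g, h) can be evaluated componentwise; the sketch below (Python/NumPy, assuming the r[s, a] and P[s, a, j] arrays from the earlier setup) is one way to check a candidate pair (g, h) against the optimality equation.

import numpy as np

def B(g, h, r, P):
    # B(g, h)(s) = max_a [ r(s, a) - g + sum_j p(j|s, a) h(j) - h(s) ]
    q = r - g + P @ h          # q[s, a] = r(s, a) - g + sum_j p(j|s, a) h(j)
    return q.max(axis=1) - h

By the theorem on the next slide, a pair with B(g, h) = 0 identifies g as the optimal gain; the algorithms on the following slides construct such a pair.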


The Average Reward Optimality Equation – Unichain Models

Theorem. Suppose S is countable.
(i) If there exist a scalar g and an h ∈ V which satisfy B(g, h) ≤ 0, then g e ≥ g*_+;
(ii) If there exist a scalar g and an h ∈ V which satisfy B(g, h) ≥ 0, then g e ≤ sup_{d ∈ D^MD} g_−^{d^∞} ≤ g*_−;
(iii) If there exist a scalar g and an h ∈ V which satisfy B(g, h) = 0, then g e = g* = g*_+ = g*_−.


Existence of Solutions to the Optimality Equation – Unichain Models

Theorem. Suppose S and A_s are finite, |r(s, a)| ≤ M < ∞ for all s and a, and the model is unichain.
(i) There exist a g ∈ R^1 and an h ∈ V for which

   0 = max_{d ∈ D} { r_d − g e + (P_d − I) h };

(ii) If (g′, h′) is any other solution of the average reward optimality equation, then g = g′.


Existence of Optimal Policies – Unichain Models

A decision rule d_h is h-improving if d_h ∈ argmax_{d ∈ D} { r_d + P_d h }.

Theorem. Suppose there exist a scalar g* and an h* ∈ V for which B(g*, h*) = 0. Then if d* is h*-improving, (d*)^∞ is average optimal.


Existence of Optimal Policies – Unichain Models

Theorem. Suppose S and A_s are finite, r(s, a) is bounded, and the model is unichain. Then
(i) there exists a stationary average optimal policy;
(ii) there exist a scalar g* and an h* ∈ V for which B(g*, h*) = 0;
(iii) any stationary policy derived from an h*-improving decision rule is average optimal;
(iv) g* e = g*_+ = g*_−.


Value Iteration

1. Select v^0 ∈ V, specify ε > 0, and set n = 0.
2. For each s ∈ S, compute v^{n+1}(s) by
   v^{n+1}(s) = max_{a ∈ A_s} [ r(s, a) + ∑_{j ∈ S} p(j|s, a) v^n(j) ].
3. If sp(v^{n+1} − v^n) < ε, go to step 4. Otherwise, increment n by 1 and return to step 2.
4. For each s ∈ S, choose
   d_ε(s) ∈ argmax_{a ∈ A_s} [ r(s, a) + ∑_{j ∈ S} p(j|s, a) v^{n+1}(j) ]
   and stop.
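A minimal sketch of this scheme in Python/NumPy, assuming the r[s, a] and P[s, a, j] arrays from the earlier setup; np.ptp computes max minus min, i.e. the span seminorm sp(·) used in the stopping rule.

import numpy as np

def value_iteration(r, P, eps=1e-8, max_iter=100_000):
    v = np.zeros(r.shape[0])                    # step 1: v^0 = 0
    for _ in range(max_iter):
        v_next = (r + P @ v).max(axis=1)        # step 2: Bellman update
        if np.ptp(v_next - v) < eps:            # step 3: span stopping test
            break
        v = v_next
    d_eps = (r + P @ v_next).argmax(axis=1)     # step 4: eps-optimal decision rule
    # Once the span test is met, every component of v_next - v is within eps
    # of the optimal gain; take the midpoint as an estimate.
    g_est = 0.5 * (np.max(v_next - v) + np.min(v_next - v))
    return d_eps, g_est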


Relative Value Iteration

1. Select u^0 ∈ V, choose s* ∈ S, specify ε > 0, set w^0 = u^0 − u^0(s*) e, and set n = 0.
2. For each s ∈ S, compute u^{n+1}(s) by
   u^{n+1}(s) = max_{a ∈ A_s} [ r(s, a) + ∑_{j ∈ S} p(j|s, a) w^n(j) ].
   Let w^{n+1} = u^{n+1} − u^{n+1}(s*) e.
3. If sp(u^{n+1} − u^n) < ε, go to step 4. Otherwise, increment n by 1 and return to step 2.
4. For each s ∈ S, choose
   d_ε(s) ∈ argmax_{a ∈ A_s} [ r(s, a) + ∑_{j ∈ S} p(j|s, a) u^n(j) ]
   and stop.
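A corresponding sketch of relative value iteration (Python/NumPy, same assumed arrays). Normalizing at the reference state s* keeps the iterates bounded, and u^{n+1}(s*) approximates the gain once the span test is met.

import numpy as np

def relative_value_iteration(r, P, s_star=0, eps=1e-8, max_iter=100_000):
    u = np.zeros(r.shape[0])                    # step 1: u^0 = 0
    w = u - u[s_star]                           # w^0 = u^0 - u^0(s*) e
    for _ in range(max_iter):
        u_next = (r + P @ w).max(axis=1)        # step 2
        w_next = u_next - u_next[s_star]
        if np.ptp(u_next - u) < eps:            # step 3: span stopping test
            break
        u, w = u_next, w_next
    d_eps = (r + P @ u).argmax(axis=1)          # step 4: uses u^n, as on the slide
    g_est = u_next[s_star]                      # u^{n+1}(s*) approximates the gain
    return d_eps, g_est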




Policy Iteration

1. Set n = 0 and select an arbitrary decision rule d_0 ∈ D.
2. (Policy evaluation) Obtain a scalar g^n and an h^n ∈ V by solving
   0 = r_{d_n} − g e + (P_{d_n} − I) h.
3. (Policy improvement) Choose d_{n+1} ∈ argmax_{d ∈ D} [ r_d + P_d h^n ], setting d_{n+1} = d_n if possible.
4. If d_{n+1} = d_n, stop and set d* = d_n. Otherwise increment n by 1 and return to step 2.

Practical consideration: set h^n(s_0) = 0 for some fixed s_0 ∈ S.
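A minimal sketch of policy iteration for the same assumed arrays, with the practical normalization h^n(s_0) = 0 built into the evaluation step; in unichain models the evaluation equations plus this normalization row have a unique solution.

import numpy as np

def policy_iteration(r, P, s0=0, max_iter=1000):
    n_states = r.shape[0]
    states = np.arange(n_states)
    d = np.zeros(n_states, dtype=int)              # step 1: arbitrary rule d_0
    for _ in range(max_iter):
        P_d = P[states, d]                         # P_d[s, j] = p(j | s, d(s))
        r_d = r[states, d]                         # r_d[s] = r(s, d(s))
        # Step 2 (evaluation): solve 0 = r_d - g e + (P_d - I) h with h(s0) = 0.
        # Unknowns x = (g, h(0), ..., h(S-1)).
        A = np.zeros((n_states + 1, n_states + 1))
        A[:n_states, 0] = 1.0                      # coefficient of g
        A[:n_states, 1:] = np.eye(n_states) - P_d  # coefficients of h
        A[n_states, 1 + s0] = 1.0                  # normalization h(s0) = 0
        b = np.concatenate([r_d, [0.0]])
        sol = np.linalg.solve(A, b)
        g, h = sol[0], sol[1:]
        # Step 3 (improvement): argmax over actions, keeping d_n where still maximal.
        q = r + P @ h                              # q[s, a] = r(s, a) + sum_j p(j|s,a) h(j)
        d_next = np.where(np.isclose(q[states, d], q.max(axis=1)),
                          d, q.argmax(axis=1))
        if np.array_equal(d_next, d):              # step 4: stop at a fixed rule
            return d, g, h
        d = d_next
    return d, g, h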


Linear Programming

The primal linear program is

   min_{g, h} g
   subject to g + h(s) − ∑_{j ∈ S} p(j|s, a) h(j) ≥ r(s, a), ∀ s ∈ S, a ∈ A_s.

The dual linear program is

   max_x ∑_{s ∈ S} ∑_{a ∈ A_s} r(s, a) x(s, a)
   subject to ∑_{a ∈ A_j} x(j, a) − ∑_{s ∈ S} ∑_{a ∈ A_s} p(j|s, a) x(s, a) = 0, ∀ j ∈ S,
              ∑_{s ∈ S} ∑_{a ∈ A_s} x(s, a) = 1,
              x(s, a) ≥ 0, ∀ s ∈ S, a ∈ A_s.
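A sketch of solving the primal linear program with scipy.optimize.linprog (the choice of SciPy's HiGHS solver is an assumption, not part of the slides), again for the r[s, a] and P[s, a, j] arrays defined earlier; the optimal objective value is the optimal gain g*.

import numpy as np
from scipy.optimize import linprog

def solve_primal_lp(r, P):
    n_states, n_actions = r.shape
    # Variables x = (g, h(0), ..., h(S-1)); minimize g.
    c = np.zeros(1 + n_states)
    c[0] = 1.0
    A_ub, b_ub = [], []
    for s in range(n_states):
        for a in range(n_actions):
            # g + h(s) - sum_j p(j|s,a) h(j) >= r(s,a), rewritten as <= for linprog.
            row = np.zeros(1 + n_states)
            row[0] = -1.0
            row[1:] = P[s, a] - (np.arange(n_states) == s)
            A_ub.append(row)
            b_ub.append(-r[s, a])
    res = linprog(c, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
                  bounds=[(None, None)] * (1 + n_states), method="highs")
    g_star, h = res.x[0], res.x[1:]
    return g_star, h

The dual could be handled the same way with equality constraints (A_eq, b_eq) and nonnegativity bounds; an optimal x(s, a) then gives stationary state-action frequencies from which an optimal decision rule can be read off.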
