Sequential Decision Problems (2)

Deterministic version: All actions always lead to the next square in the selected direction, except that moving into a wall results in no change in position.

Stochastic version: Each action achieves the intended effect with probability 0.8; the rest of the time, the agent moves at right angles to the intended direction (probability 0.1 for each side).

[Figure: the intended direction is reached with probability 0.8; each of the two perpendicular directions with probability 0.1.]

Markov Decision Problem (MDP)

Given a set of states in an accessible, stochastic environment, an MDP is defined by

- an initial state S_0,
- a transition model T(s, a, s'),
- a reward function R(s).

Transition model: T(s, a, s') is the probability that state s' is reached if action a is executed in state s.

Policy: a complete mapping π that specifies for each state s which action π(s) to take.

Wanted: the optimal policy π* is the policy that maximizes the expected utility.

(University of Freiburg) Foundations of AI, July 18, 2012

Optimal Policies (1)

Given the optimal policy, the agent uses its current percept, which tells it its current state.

It then executes the action π*(s).

We obtain a simple reflex agent that is computed from the information used for a utility-based agent.

Optimal Policies (2)

Optimal policy for the stochastic MDP with R(s) = −0.04:

[Figure: 4 × 3 grid showing the optimal action in each state; terminal states +1 at (4,3) and −1 at (4,2).]

The optimal policy changes with the choice of the transition cost R(s).

How to compute optimal policies?
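The 0.8 / 0.1 / 0.1 transition model above can be sketched directly in code. This is a minimal illustration, not part of the slides: the 1-indexed (column, row) coordinates, the blocked square at (2,2), and all helper names are assumptions; only the probability split and the wall-bump rule come from the text.

```python
# Sketch of T(s, a, s') for the stochastic 4x3 grid world.
# Assumptions (not on the slide): 1-indexed (column, row) coordinates,
# a blocked square at (2, 2).

WALL = (2, 2)
STATES = [(c, r) for c in range(1, 5) for r in range(1, 4) if (c, r) != WALL]
MOVES = {'N': (0, 1), 'S': (0, -1), 'W': (-1, 0), 'E': (1, 0)}
SIDEWAYS = {'N': ('W', 'E'), 'S': ('W', 'E'), 'W': ('N', 'S'), 'E': ('N', 'S')}

def move(s, d):
    """Deterministic effect: moving into a wall leaves the position unchanged."""
    nxt = (s[0] + MOVES[d][0], s[1] + MOVES[d][1])
    return nxt if nxt in STATES else s

def T(s, a, s_next):
    """Probability that executing a in s reaches s_next:
    0.8 for the intended direction, 0.1 for each right angle."""
    p = 0.0
    for prob, d in [(0.8, a), (0.1, SIDEWAYS[a][0]), (0.1, SIDEWAYS[a][1])]:
        if move(s, d) == s_next:
            p += prob
    return p

# Sanity check: for every state-action pair the outgoing probabilities sum to 1.
assert all(abs(sum(T(s, a, t) for t in STATES) - 1.0) < 1e-9
           for s in STATES for a in MOVES)
```

For example, from (1,1) the action N reaches (1,2) with probability 0.8, bumps into the outer wall and stays in (1,1) with probability 0.1, and slips east to (2,1) with probability 0.1.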
Finite and Infinite Horizon Problems

The performance of the agent is measured by the sum of rewards for the states visited.

To determine an optimal policy, we will first calculate the utility of each state and then use these state utilities to select the optimal action for each state.

The result depends on whether we have a finite or an infinite horizon problem.

Utility function for state sequences: U_h([s_0, s_1, ..., s_n])

Finite horizon: U_h([s_0, s_1, ..., s_{N+k}]) = U_h([s_0, s_1, ..., s_N]) for all k > 0.

For finite horizon problems, the optimal policy depends on the current state and the number of steps remaining. It therefore depends on time and is called nonstationary.

In infinite horizon problems, the optimal policy depends only on the current state and is therefore stationary.

Assigning Utilities to State Sequences

For stationary systems there are just two coherent ways to assign utilities to state sequences.

Additive rewards:

    U_h([s_0, s_1, s_2, ...]) = R(s_0) + R(s_1) + R(s_2) + ...

Discounted rewards:

    U_h([s_0, s_1, s_2, ...]) = R(s_0) + γ R(s_1) + γ^2 R(s_2) + ...

The factor γ ∈ [0, 1) is called the discount factor.

With discounted rewards, the utility of an infinite state sequence is always finite. The discount factor expresses that future rewards have less value than current rewards.

Utilities of States

The utility of a state depends on the utility of the state sequences that follow it.

Let U^π(s) be the utility of a state under policy π.

Let s_t be the state of the agent after executing π for t steps.
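The two reward schemes can be compared on a short sequence. A minimal sketch; the reward values below are illustrative (three −0.04 steps followed by a +1 exit) and are not taken from the slides:

```python
# Additive vs. discounted utility of a finite state sequence.

def additive_utility(rewards):
    """U_h = R(s0) + R(s1) + R(s2) + ..."""
    return sum(rewards)

def discounted_utility(rewards, gamma):
    """U_h = R(s0) + gamma*R(s1) + gamma^2*R(s2) + ..."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

rewards = [-0.04, -0.04, -0.04, 1.0]      # illustrative sequence of rewards
print(additive_utility(rewards))           # ≈ 0.88
print(discounted_utility(rewards, 0.9))    # ≈ 0.6206: the late +1 counts for less
```

With γ < 1 and bounded rewards, even an infinite sequence has finite utility (bounded by R_max / (1 − γ)), which is why discounting makes infinite horizon problems well defined.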
Thus, the utility of s under π is

    U^π(s) = E[ Σ_{t=0}^∞ γ^t R(s_t) | π, s_0 = s ]

The true utility U(s) of a state is U^{π*}(s).

R(s) is the short-term reward for being in s, whereas U(s) is the long-term total reward from s onwards.

Example

The utilities of the states in our 4 × 3 world with γ = 1 and R(s) = −0.04 for non-terminal states:

[Figure: 4 × 3 grid of state utilities.
 Row 3: 0.812  0.868  0.918  +1
 Row 2: 0.762  (wall) 0.660  −1
 Row 1: 0.705  0.655  0.611  0.388]
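Utilities like those in the figure can be reproduced numerically. The slide itself does not give the algorithm; the sketch below uses the standard Bellman update U(s) ← R(s) + γ max_a Σ_{s'} T(s, a, s') U(s'), together with the usual assumptions about the 4 × 3 layout (blocked square at (2,2), absorbing terminals at (4,3) and (4,2)):

```python
# Value-iteration sketch for the 4x3 world with gamma = 1, R(s) = -0.04.
# Layout assumptions (not stated here): 1-indexed (column, row) coordinates,
# wall at (2, 2), absorbing terminals +1 at (4, 3) and -1 at (4, 2).

GAMMA, R_STEP = 1.0, -0.04
WALL = (2, 2)
TERMINALS = {(4, 3): 1.0, (4, 2): -1.0}
STATES = [(c, r) for c in range(1, 5) for r in range(1, 4) if (c, r) != WALL]
MOVES = {'N': (0, 1), 'S': (0, -1), 'W': (-1, 0), 'E': (1, 0)}
SIDEWAYS = {'N': 'WE', 'S': 'WE', 'W': 'NS', 'E': 'NS'}

def move(s, d):
    nxt = (s[0] + MOVES[d][0], s[1] + MOVES[d][1])
    return nxt if nxt in STATES else s            # bumping a wall: stay put

def outcomes(s, a):
    """(probability, successor) pairs: 0.8 intended, 0.1 per right angle."""
    yield 0.8, move(s, a)
    for side in SIDEWAYS[a]:
        yield 0.1, move(s, side)

util = {s: 0.0 for s in STATES}
for _ in range(100):                              # enough sweeps to converge here
    util = {s: TERMINALS[s] if s in TERMINALS
               else R_STEP + GAMMA * max(sum(p * util[t] for p, t in outcomes(s, a))
                                         for a in MOVES)
            for s in STATES}

print(round(util[(1, 1)], 3))                     # ≈ 0.705, as in the figure
```

Even with γ = 1 this converges, because every run eventually reaches one of the absorbing terminal states, so the sums of rewards stay finite.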