
13. Acting under Uncertainty: Maximizing Expected Utility


Application Example

[Figure: utility estimates for the states (4,3), (3,3), (1,1), (3,1), and (4,1) plotted against the number of iterations, and the maximum error and policy loss plotted against the number of iterations.]

In practice, the policy often becomes optimal before the utility estimates have converged.
Value iteration computes the optimal policy even at a stage when the utility function estimate has not yet converged.
If one action is clearly better than all others, then the exact values of the states involved need not be known.

Policy Iteration

Policy iteration alternates the following two steps, beginning with an initial policy $\pi_0$:

Policy evaluation: given a policy $\pi_t$, calculate $U_t = U^{\pi_t}$, the utility of each state if $\pi_t$ were executed.

Policy improvement: calculate a new maximum expected utility policy $\pi_{t+1}$ according to
$$\pi_{t+1}(s) = \arg\max_{a} \sum_{s'} T(s, a, s')\, U_t(s').$$

The Policy Iteration Algorithm (Chapter 17, Making Complex Decisions)

function POLICY-ITERATION(mdp) returns a policy
  inputs: mdp, an MDP with states S, actions A(s), transition model P(s' | s, a)
  local variables: U, a vector of utilities for states in S, initially zero
                   π, a policy vector indexed by state, initially random
  repeat
    U ← POLICY-EVALUATION(π, U, mdp)
    unchanged? ← true
    for each state s in S do
      if max_{a ∈ A(s)} Σ_{s'} P(s' | s, a) U[s'] > Σ_{s'} P(s' | s, π[s]) U[s'] then
        π[s] ← argmax_{a ∈ A(s)} Σ_{s'} P(s' | s, a) U[s']
        unchanged? ← false
  until unchanged?
  return π

Figure 17.7: The policy iteration algorithm for calculating an optimal policy. (A short Python sketch of this procedure follows the summary below.)

Summary

Rational agents can be developed on the basis of probability theory and utility theory.
Agents that make decisions according to the axioms of utility theory possess a utility function.
Sequential problems in uncertain environments (MDPs) can be solved by calculating a policy.
Value iteration is a process for calculating optimal policies.
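To make the alternation of policy evaluation and policy improvement concrete, here is a minimal Python sketch of the procedure in Figure 17.7. It is not the lecture's reference implementation: the representation of the MDP (a set of states, a function actions(s), a transition function T(s, a, s'), a reward function R(s), and a discount factor gamma) and the function names are assumptions chosen for illustration, and policy evaluation is approximated by a fixed number of Bellman sweeps rather than by solving the linear system exactly.

def policy_evaluation(pi, U, states, T, R, gamma, k=20):
    """Approximate U^pi with k sweeps of the simplified Bellman update
    U(s) <- R(s) + gamma * sum_{s'} T(s, pi[s], s') * U(s')."""
    for _ in range(k):
        U = {s: R(s) + gamma * sum(T(s, pi[s], s2) * U[s2] for s2 in states)
             for s in states}
    return U

def expected_utility(a, s, U, states, T):
    """Expected utility of executing action a in state s, given utilities U."""
    return sum(T(s, a, s2) * U[s2] for s2 in states)

def policy_iteration(states, actions, T, R, gamma):
    """Alternate policy evaluation and policy improvement until the policy is stable."""
    U = {s: 0.0 for s in states}                      # utilities, initially zero
    pi = {s: next(iter(actions(s))) for s in states}  # some initial policy
    while True:
        U = policy_evaluation(pi, U, states, T, R, gamma)
        unchanged = True
        for s in states:
            best = max(actions(s),
                       key=lambda a: expected_utility(a, s, U, states, T))
            # Improve only on a strict gain, as in Figure 17.7.
            if expected_utility(best, s, U, states, T) > \
               expected_utility(pi[s], s, U, states, T):
                pi[s] = best
                unchanged = False
        if unchanged:
            return pi

Because policy evaluation is only approximated here (a fixed number of sweeps), each iteration stays cheap; an exact alternative would solve the |S| linear equations U(s) = R(s) + γ Σ_{s'} T(s, π(s), s') U(s') for the current policy.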
