13. Acting under Uncertainty: Maximizing Expected Utility

Choosing Actions using the Maximum Expected Utility Principle

The agent simply chooses the action that maximizes the expected utility of the subsequent state:

    \pi(s) = \operatorname{argmax}_a \sum_{s'} T(s, a, s') U(s')

The utility of a state is the immediate reward for that state plus the expected discounted utility of the next state, assuming that the agent chooses the optimal action:

    U(s) = R(s) + \gamma \max_a \sum_{s'} T(s, a, s') U(s')

Bellman Equation

The equation

    U(s) = R(s) + \gamma \max_a \sum_{s'} T(s, a, s') U(s')

is also called the Bellman equation.

Bellman Equation: Example

In our 4 × 3 world the equation for the state (1,1) is

    U(1,1) = -0.04 + \gamma \max\{ 0.8 U(1,2) + 0.1 U(2,1) + 0.1 U(1,1),     (Up)
                                   0.9 U(1,1) + 0.1 U(1,2),                  (Left)
                                   0.9 U(1,1) + 0.1 U(2,1),                  (Down)
                                   0.8 U(2,1) + 0.1 U(1,2) + 0.1 U(1,1) \}   (Right)

           = -0.04 + \gamma \max\{ 0.8 \cdot 0.762 + 0.1 \cdot 0.655 + 0.1 \cdot 0.705,     (Up)
                                   0.9 \cdot 0.705 + 0.1 \cdot 0.762,                       (Left)
                                   0.9 \cdot 0.705 + 0.1 \cdot 0.655,                       (Down)
                                   0.8 \cdot 0.655 + 0.1 \cdot 0.762 + 0.1 \cdot 0.705 \}   (Right)

           = -0.04 + 1.0 \cdot (0.6096 + 0.0655 + 0.0705)   (Up)

           = -0.04 + 0.7456 = 0.7056

→ Up is the optimal action in (1,1).

Utilities of the states in the 4 × 3 world (the state (2,2) is a wall):

    y = 3:   0.812   0.868   0.918    +1
    y = 2:   0.762    ---    0.660    -1
    y = 1:   0.705   0.655   0.611   0.388
             x = 1   x = 2   x = 3   x = 4

Value Iteration (1)

An algorithm to calculate an optimal strategy.

Basic idea: Calculate the utility of each state. Then use the state utilities to select an optimal action for each state.

A sequence of actions generates a branch in the tree of possible states (histories). A utility function on histories U_h is separable iff there exists a function f such that

    U_h([s_0, s_1, \ldots, s_n]) = f(s_0, U_h([s_1, \ldots, s_n]))

The simplest form is an additive reward function R:

    U_h([s_0, s_1, \ldots, s_n]) = R(s_0) + U_h([s_1, \ldots, s_n])

In the example, R((4,3)) = +1, R((4,2)) = -1, R(other) = -1/25.
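As a concrete illustration, here is a minimal Python sketch of the Bellman backup and of value iteration on this 4 × 3 world. The transition model (0.8 in the intended direction, 0.1 to each side, staying in place when bumping into the wall or the grid boundary), the rewards R = -0.04 / +1 / -1, and γ = 1.0 are taken from the slides above; the function and variable names (value_iteration, q_value, best_action, ...) are illustrative choices, not code from the lecture.

```python
# Bellman backup and value iteration for the 4 x 3 grid world from the slides.
# Moves succeed with probability 0.8 and slip to each perpendicular direction
# with probability 0.1; bumping into the wall at (2,2) or the grid boundary
# leaves the state unchanged. R = -0.04 for non-terminal states, +1 at (4,3),
# -1 at (4,2), gamma = 1.0. Names below are illustrative, not the lecture's code.

GAMMA = 1.0
ACTIONS = {'Up': (0, 1), 'Down': (0, -1), 'Left': (-1, 0), 'Right': (1, 0)}
STATES = [(x, y) for x in range(1, 5) for y in range(1, 4) if (x, y) != (2, 2)]
TERMINALS = {(4, 3): +1.0, (4, 2): -1.0}


def reward(s):
    """R(s): -0.04 everywhere except the two terminal states."""
    return TERMINALS.get(s, -0.04)


def move(s, delta):
    """Apply a direction; stay in place if the result is off-grid or the wall."""
    target = (s[0] + delta[0], s[1] + delta[1])
    return target if target in STATES else s


def transitions(s, a):
    """T(s, a, s') as a list of (probability, successor) pairs."""
    dx, dy = ACTIONS[a]
    return [(0.8, move(s, (dx, dy))),      # intended direction
            (0.1, move(s, (dy, dx))),      # slip to one side
            (0.1, move(s, (-dy, -dx)))]    # slip to the other side


def q_value(U, s, a):
    """Expected utility of the successor: sum_{s'} T(s, a, s') U(s')."""
    return sum(p * U[s2] for p, s2 in transitions(s, a))


def value_iteration(eps=1e-4):
    """Iterate U(s) = R(s) + gamma * max_a Q(s, a) until the values converge."""
    U = {s: 0.0 for s in STATES}
    while True:
        U_new, delta = {}, 0.0
        for s in STATES:
            if s in TERMINALS:
                U_new[s] = TERMINALS[s]
            else:
                U_new[s] = reward(s) + GAMMA * max(q_value(U, s, a) for a in ACTIONS)
            delta = max(delta, abs(U_new[s] - U[s]))
        U = U_new
        if delta < eps:
            return U


def best_action(U, s):
    """pi(s) = argmax_a sum_{s'} T(s, a, s') U(s')."""
    return max(ACTIONS, key=lambda a: q_value(U, s, a))


if __name__ == '__main__':
    U = value_iteration()
    print(round(U[(1, 1)], 3), best_action(U, (1, 1)))  # roughly: 0.705 Up
```

Running the sketch should reproduce the utilities shown in the grid above, e.g. U(1,1) ≈ 0.705 with Up as the optimal action in (1,1).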
