2.4. Approaches based on Markov decision processes

For an MDP there is a value function $V$, optimal regardless of the starting state [Howard 1960], which satisfies the following equation:

$$V(s) = \max_a \Big\{ R(s, a) + \gamma \sum_{u \in S} \Pr(a, s, u)\, V(u) \Big\}$$

Two popular methods for solving this equation and finding an optimal policy for an MDP are value iteration and policy iteration [Puterman 1994].

In policy iteration, the current policy is repeatedly improved by finding some action in each state that has a higher value than the action chosen by the current policy for that state. The policy is initially chosen at random, and the process terminates when no improvement can be found. The algorithm is shown in Table 2.1. This process converges to an optimal policy [Puterman 1994].

Policy-Iteration($S, A, \Pr, R, \gamma$):
1. For each $s \in S$, $\pi(s) = \text{RandomElement}(A)$
2. Compute $V_\pi(\cdot)$
3. For each $s \in S$ {
4.   Find some action $a$ such that $R(s, a) + \gamma \sum_{u \in S} \Pr(a, s, u)\, V_\pi(u) > V_\pi(s)$
5.   Set $\pi'(s) = a$ if such an $a$ exists,
6.   otherwise set $\pi'(s) = \pi(s)$. }
7. If $\pi'(s) \neq \pi(s)$ for some $s \in S$, set $\pi = \pi'$ and go to 2.
8. Return $\pi$

Table 2.1: The policy iteration algorithm

In value iteration, optimal policies are produced for successively longer finite horizons, until they converge. It is relatively simple to find an optimal policy over $n$ steps, $\pi_n(\cdot)$, with value function $V_n(\cdot)$, using the recurrence relation:

$$\pi_n(s) = \arg\max_a \Big\{ R(s, a) + \gamma \sum_{u \in S} \Pr(a, s, u)\, V_{n-1}(u) \Big\}$$

with starting condition $V_0(s) = 0$ for all $s \in S$, where $V_m$ is derived from the policy $\pi_m$ as described above. Table 2.2 shows the value iteration algorithm, which takes an MDP, a discount value $\gamma$ and a parameter $\epsilon$, and produces successive finite-horizon optimal policies, terminating when the maximum change in values between the current and previous value functions is below $\epsilon$. It can also be shown that the algorithm converges to the optimal policy for the discounted infinite-horizon case in a number of steps that is polynomial in $|S|$, $|A|$, $\log \max_{s,a} |R(s, a)|$ and $1/(1 - \gamma)$.

2.4.2 Planning under uncertainty with MDPs

The algorithms described above can find optimal policies in polynomial time in the size of the state space of the MDP. However, this state space is usually exponentially
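To make the procedure in Table 2.1 concrete, here is a minimal Python sketch of policy iteration for a small tabular MDP. The dictionary-based encoding (Pr[a][s][u] for the transition probability, R[s][a] for the reward), the helper names evaluate_policy and q_value, and the toy two-state example are illustrative assumptions, not the thesis's own code; policy evaluation is done here by successive approximation rather than by solving the linear system exactly.

```python
import random

def evaluate_policy(S, Pr, R, gamma, pi, sweeps=10000, tol=1e-10):
    """Successive-approximation evaluation of V_pi: iterate
    V(s) <- R(s, pi(s)) + gamma * sum_u Pr(pi(s), s, u) * V(u)."""
    V = {s: 0.0 for s in S}
    for _ in range(sweeps):
        delta = 0.0
        for s in S:
            a = pi[s]
            v = R[s][a] + gamma * sum(Pr[a][s][u] * V[u] for u in S)
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < tol:
            break
    return V

def q_value(S, Pr, R, gamma, V, s, a):
    """One-step lookahead value of taking action a in state s."""
    return R[s][a] + gamma * sum(Pr[a][s][u] * V[u] for u in S)

def policy_iteration(S, A, Pr, R, gamma):
    # Step 1: start from a random policy.
    pi = {s: random.choice(A) for s in S}
    while True:
        # Step 2: compute V_pi for the current policy.
        V = evaluate_policy(S, Pr, R, gamma, pi)
        # Steps 3-6: in each state, switch to any strictly better action.
        new_pi = {}
        for s in S:
            better = [a for a in A if q_value(S, Pr, R, gamma, V, s, a) > V[s] + 1e-9]
            new_pi[s] = better[0] if better else pi[s]
        # Steps 7-8: stop when no state changed its action, otherwise repeat.
        if new_pi == pi:
            return pi
        pi = new_pi

# Tiny illustrative 2-state, 2-action MDP (numbers chosen arbitrarily).
S = ["s0", "s1"]
A = ["stay", "move"]
Pr = {
    "stay": {"s0": {"s0": 1.0, "s1": 0.0}, "s1": {"s0": 0.0, "s1": 1.0}},
    "move": {"s0": {"s0": 0.1, "s1": 0.9}, "s1": {"s0": 0.9, "s1": 0.1}},
}
R = {"s0": {"stay": 0.0, "move": 1.0}, "s1": {"stay": 2.0, "move": 0.0}}
print(policy_iteration(S, A, Pr, R, gamma=0.9))
```

The small tolerance in the improvement test guards against spurious policy changes caused by the approximate evaluation step.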

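Table 2.2 itself falls outside this excerpt, so the following is only a generic sketch of the value iteration recurrence described above, under the same assumed dict-based MDP encoding; the function name value_iteration, the epsilon parameter and the toy MDP are again illustrative choices rather than the thesis's algorithm verbatim.

```python
def value_iteration(S, A, Pr, R, gamma, epsilon):
    """Produce successive finite-horizon optimal policies, stopping when the
    maximum change between consecutive value functions falls below epsilon."""
    V = {s: 0.0 for s in S}            # starting condition: V_0(s) = 0 for all s
    while True:
        new_V, pi = {}, {}
        for s in S:
            # pi_n(s) = argmax_a { R(s,a) + gamma * sum_u Pr(a,s,u) * V_{n-1}(u) }
            q = {a: R[s][a] + gamma * sum(Pr[a][s][u] * V[u] for u in S) for a in A}
            pi[s] = max(q, key=q.get)
            new_V[s] = q[pi[s]]
        if max(abs(new_V[s] - V[s]) for s in S) < epsilon:
            return pi, new_V
        V = new_V

# Same toy MDP as in the previous sketch.
S = ["s0", "s1"]
A = ["stay", "move"]
Pr = {
    "stay": {"s0": {"s0": 1.0, "s1": 0.0}, "s1": {"s0": 0.0, "s1": 1.0}},
    "move": {"s0": {"s0": 0.1, "s1": 0.9}, "s1": {"s0": 0.9, "s1": 0.1}},
}
R = {"s0": {"stay": 0.0, "move": 1.0}, "s1": {"stay": 2.0, "move": 0.0}}
pi, V = value_iteration(S, A, Pr, R, gamma=0.9, epsilon=1e-6)
print(pi, {s: round(v, 2) for s, v in V.items()})
```

Each sweep is a contraction with factor $\gamma$, which is why the number of iterations needed grows with the $1/(1-\gamma)$ term in the convergence bound quoted above.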