Planning under Uncertainty in Dynamic Domains
2.4. Approaches based on Markov decision processes

reduction can lead to considerable time savings over the original MDP. Boutilier and Dearden prove bounds on the difference in value of the abstract policy compared with an optimal policy in the original MDP.

Lin and Dean further refine this idea by splitting the MDP into subsets and allowing a different abstraction of the states to be considered in each one [Dean & Lin 1995]. This approach can have extra power because typically different literals may be relevant in different parts of the state space. However, there is an added cost to re-combining the separate pieces unless they happen to decompose very cleanly. Lin and Dean assume the partition of the state space is given by some external oracle.

Boutilier et al. extend modified policy iteration to propose a technique called structured policy iteration that makes use of a structured action representation in the form of 2-stage Bayesian networks [Boutilier, Dearden, & Goldszmidt 1995]. The representations of the policy and utility functions are also structured in their approach, using decision trees. In standard policy iteration, the value of the candidate policy is computed on each iteration by solving a system of |S| linear equations (step 2 in Table 2.1), which is computationally prohibitive for large real-world planning problems. Modified policy iteration replaces this step with an iterative approximation of the value function V by a series of value functions V_0, V_1, ..., given by

    V_i(s) = R(s) + Σ_{u ∈ S} Pr(π(s), s, u) V_{i-1}(u)

Stopping criteria are given in [Puterman 1994].

In structured policy iteration, the value function is again built in a series of approximations, but in each one it is represented as a decision tree over the domain literals. Similarly, the policy is built up as a decision tree. On each iteration, new literals might be added to these trees as a result of examining the literals mentioned in the action specification and the utility function R. In this way the algorithm avoids explicitly enumerating the state space.

Similar work has also been done with partially observable Markov decision processes, or POMDPs, in which the assumption of complete observability is relaxed. In a POMDP there is a set of observation labels O and a set of conditional probabilities P(o | a, s), o ∈ O, a ∈ A, s ∈ S, such that if the system makes a transition to state s with action a it receives the observation label o with probability P(o | a, s). Cassandra et al. introduce the witness algorithm for solving POMDPs [Cassandra, Kaelbling, & Littman 1994]. A standard technique for finding an optimal policy for a POMDP is to construct the MDP whose states are the belief states of the original POMDP, i.e. each state is a probability distribution over states in the POMDP, with beliefs maintained based on the observation labels using Bayes' rule.
A form of value iteration can be performed in this space, making use of the fact that each finite-horizon value function will be piecewise-linear and convex. The witness algorithm includes an improved technique for updating the basis of the convex value function on each iteration. Parr and Russell use a smooth approximation of the value function that can be updated with gradient descent.
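To make the piecewise-linear, convex representation concrete, here is a short Python sketch (with illustrative names; it is not the witness algorithm itself) of how a finite-horizon value function stored as a set of alpha-vectors is evaluated at a belief state, and how a greedy action can be read off. The difficult step, which the witness algorithm improves, is constructing and pruning this set of vectors on each dynamic-programming backup; the sketch does not attempt that.

    import numpy as np

    def value_and_action(belief, alpha_vectors):
        """Evaluate a piecewise-linear convex value function at a belief state.

        belief        -- probability distribution over the POMDP's states
        alpha_vectors -- list of (action, vector) pairs; each vector holds the
                         value of one linear piece of the value function in
                         every underlying state (pairing each piece with an
                         action is an assumption of this sketch)
        """
        # The finite-horizon value function is the upper surface of finitely
        # many hyperplanes: V(b) = max over vectors of sum_s b(s) * vector(s).
        action, vector = max(alpha_vectors,
                             key=lambda av: float(np.dot(belief, av[1])))
        return float(np.dot(belief, vector)), action

Evaluating the function in this way is cheap; the cost of exact POMDP value iteration lies in generating the set of vectors needed after each backup, which is where the witness algorithm and the smooth approximation of Parr and Russell differ.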
