Planning under Uncertainty in Dynamic Domains
2.4. Approaches based on Markov decision processes

reduction can lead to considerable time savings over the original MDP. Boutilier and Dearden prove bounds on the difference in value of the abstract policy compared with an optimal policy in the original MDP.

Lin and Dean further refine this idea by splitting the MDP into subsets and allowing a different abstraction of the states to be considered in each one [Dean & Lin 1995]. This approach can have extra power because typically different literals may be relevant in different parts of the state space. However, there is an added cost to re-combining the separate pieces unless they happen to decompose very cleanly. Lin and Dean assume the partition of the state space is given by some external oracle.

Boutilier et al. extend modified policy iteration to propose a technique called structured policy iteration that makes use of a structured action representation in the form of 2-stage Bayesian networks [Boutilier, Dearden, & Goldszmidt 1995]. The representations of the policy and utility functions are also structured in their approach, using decision trees. In standard policy iteration, the value of the candidate policy is computed on each iteration by solving a system of |S| linear equations (step 2 in Table 2.1), which is computationally prohibitive for large real-world planning problems. Modified policy iteration replaces this step with an iterative approximation of the value function V by a series of value functions V_0, V_1, ..., given by

    V_i(s) = R(s) + Σ_{u ∈ S} Pr(π(s), s, u) V_{i-1}(u)

Stopping criteria are given in [Puterman 1994].

In structured policy iteration, the value function is again built in a series of approximations, but in each one it is represented as a decision tree over the domain literals. Similarly, the policy is built up as a decision tree. On each iteration, new literals might be added to these trees as a result of examining the literals mentioned in the action specification and the utility function R. In this way the algorithm avoids explicitly enumerating the state space.

Similar work has also been done with partially observable Markov decision processes, or POMDPs, in which the assumption of complete observability is relaxed. In a POMDP there is a set of observation labels O and a set of conditional probabilities P(o | a, s), o ∈ O, a ∈ A, s ∈ S, such that if the system makes a transition to state s with action a it receives the observation label o with probability P(o | a, s). Cassandra et al. introduce the witness algorithm for solving POMDPs [Cassandra, Kaelbling, & Littman 1994]. A standard technique for finding an optimal policy for a POMDP is to construct the MDP whose states are the belief states of the original POMDP, i.e. each state is a probability distribution over states in the POMDP, with beliefs maintained based on the observation labels using Bayes' rule.
A form of value iteration can be performed in this space, making use of the fact that each finite-horizon value function will be piecewise-linear and convex. The witness algorithm includes an improved technique for updating the basis of the convex value function on each iteration. Parr and Russell use a smooth approximation of the value function that can be updated with gradient descent.
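To make the piecewise-linear, convex representation concrete, here is a short Python sketch (with illustrative names; it is not the witness algorithm itself) of how a finite-horizon value function stored as a set of alpha-vectors is evaluated at a belief state, and how a greedy action can be read off. The difficult step, which the witness algorithm improves, is constructing and pruning this set of vectors on each dynamic-programming backup; the sketch does not attempt that.

    import numpy as np

    def value_and_action(belief, alpha_vectors):
        """Evaluate a piecewise-linear convex value function at a belief state.

        belief        -- probability distribution over the POMDP's states
        alpha_vectors -- list of (action, vector) pairs; each vector holds the
                         value of one linear piece of the value function in
                         every underlying state (pairing each piece with an
                         action is an assumption of this sketch)
        """
        # The finite-horizon value function is the upper surface of finitely
        # many hyperplanes: V(b) = max over vectors of sum_s b(s) * vector(s).
        action, vector = max(alpha_vectors,
                             key=lambda av: float(np.dot(belief, av[1])))
        return float(np.dot(belief, vector)), action

Evaluating the function in this way is cheap; the cost of exact POMDP value iteration lies in generating the set of vectors needed after each backup, which is where the witness algorithm and the smooth approximation of Parr and Russell differ.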
