2.4. Approaches based on Markov decision processes

For an MDP there is a value function $V$, optimal regardless of the starting state [Howard 1960], which satisfies the following equation:

$$V(s) = \max_a \Big\{ R(s, a) + \gamma \sum_{u \in S} \Pr(a, s, u)\, V(u) \Big\}$$

Two popular methods for solving this equation and finding an optimal policy for an MDP are value iteration and policy iteration [Puterman 1994].

In policy iteration, the current policy is repeatedly improved by finding some action in each state that has a higher value than the action chosen by the current policy for that state. The policy is initially chosen at random, and the process terminates when no improvement can be found. The algorithm is shown in Table 2.1. This process converges to an optimal policy [Puterman 1994].

Policy-Iteration($S, A, \Pr, R, \gamma$):
1. For each $s \in S$, $\pi(s) = \text{RandomElement}(A)$
2. Compute $V_\pi(\cdot)$
3. For each $s \in S$ {
4.   Find some action $a$ such that $R(s, a) + \gamma \sum_{u \in S} \Pr(a, s, u)\, V_\pi(u) > V_\pi(s)$
5.   Set $\pi'(s) = a$ if such an $a$ exists,
6.   otherwise set $\pi'(s) = \pi(s)$. }
7. If $\pi'(s) \neq \pi(s)$ for some $s \in S$, set $\pi = \pi'$ and go to 2.
8. Return $\pi$

Table 2.1: The policy iteration algorithm

In value iteration, optimal policies are produced for successively longer finite horizons, until they converge. It is relatively simple to find an optimal policy over $n$ steps, $\pi_n(\cdot)$, with value function $V_n(\cdot)$, using the recurrence relation:

$$\pi_n(s) = \arg\max_a \Big\{ R(s, a) + \gamma \sum_{u \in S} \Pr(a, s, u)\, V_{n-1}(u) \Big\}$$

with starting condition $V_0(s) = 0$ for all $s \in S$, where $V_m$ is derived from the policy $\pi_m$ as described above. Table 2.2 shows the value iteration algorithm, which takes an MDP, a discount value $\gamma$ and a parameter $\epsilon$, and produces successive finite-horizon optimal policies, terminating when the maximum change in values between the current and previous value functions is below $\epsilon$. It can also be shown that the algorithm converges to the optimal policy for the discounted infinite-horizon case in a number of steps that is polynomial in $|S|$, $|A|$, $\log \max_{s,a} |R(s, a)|$ and $1/(1 - \gamma)$.

2.4.2 Planning under uncertainty with MDPs

The algorithms described above can find optimal policies in polynomial time in the size of the state space of the MDP. However, this state space is usually exponentially
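To make the procedure in Table 2.1 concrete, here is a minimal Python sketch of policy iteration for a small tabular MDP. The dictionary-based encoding (Pr[a][s][u] for the transition probability, R[s][a] for the reward), the helper names evaluate_policy and q_value, and the toy two-state example are illustrative assumptions, not the thesis's own code; policy evaluation is done here by successive approximation rather than by solving the linear system exactly.

```python
import random

def evaluate_policy(S, Pr, R, gamma, pi, sweeps=10000, tol=1e-10):
    """Successive-approximation evaluation of V_pi: iterate
    V(s) <- R(s, pi(s)) + gamma * sum_u Pr(pi(s), s, u) * V(u)."""
    V = {s: 0.0 for s in S}
    for _ in range(sweeps):
        delta = 0.0
        for s in S:
            a = pi[s]
            v = R[s][a] + gamma * sum(Pr[a][s][u] * V[u] for u in S)
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < tol:
            break
    return V

def q_value(S, Pr, R, gamma, V, s, a):
    """One-step lookahead value of taking action a in state s."""
    return R[s][a] + gamma * sum(Pr[a][s][u] * V[u] for u in S)

def policy_iteration(S, A, Pr, R, gamma):
    # Step 1: start from a random policy.
    pi = {s: random.choice(A) for s in S}
    while True:
        # Step 2: compute V_pi for the current policy.
        V = evaluate_policy(S, Pr, R, gamma, pi)
        # Steps 3-6: in each state, switch to any strictly better action.
        new_pi = {}
        for s in S:
            better = [a for a in A if q_value(S, Pr, R, gamma, V, s, a) > V[s] + 1e-9]
            new_pi[s] = better[0] if better else pi[s]
        # Steps 7-8: stop when no state changed its action, otherwise repeat.
        if new_pi == pi:
            return pi
        pi = new_pi

# Tiny illustrative 2-state, 2-action MDP (numbers chosen arbitrarily).
S = ["s0", "s1"]
A = ["stay", "move"]
Pr = {
    "stay": {"s0": {"s0": 1.0, "s1": 0.0}, "s1": {"s0": 0.0, "s1": 1.0}},
    "move": {"s0": {"s0": 0.1, "s1": 0.9}, "s1": {"s0": 0.9, "s1": 0.1}},
}
R = {"s0": {"stay": 0.0, "move": 1.0}, "s1": {"stay": 2.0, "move": 0.0}}
print(policy_iteration(S, A, Pr, R, gamma=0.9))
```

The small tolerance in the improvement test guards against spurious policy changes caused by the approximate evaluation step.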

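Table 2.2 itself falls outside this excerpt, so the following is only a generic sketch of the value iteration recurrence described above, under the same assumed dict-based MDP encoding; the function name value_iteration, the epsilon parameter and the toy MDP are again illustrative choices rather than the thesis's algorithm verbatim.

```python
def value_iteration(S, A, Pr, R, gamma, epsilon):
    """Produce successive finite-horizon optimal policies, stopping when the
    maximum change between consecutive value functions falls below epsilon."""
    V = {s: 0.0 for s in S}            # starting condition: V_0(s) = 0 for all s
    while True:
        new_V, pi = {}, {}
        for s in S:
            # pi_n(s) = argmax_a { R(s,a) + gamma * sum_u Pr(a,s,u) * V_{n-1}(u) }
            q = {a: R[s][a] + gamma * sum(Pr[a][s][u] * V[u] for u in S) for a in A}
            pi[s] = max(q, key=q.get)
            new_V[s] = q[pi[s]]
        if max(abs(new_V[s] - V[s]) for s in S) < epsilon:
            return pi, new_V
        V = new_V

# Same toy MDP as in the previous sketch.
S = ["s0", "s1"]
A = ["stay", "move"]
Pr = {
    "stay": {"s0": {"s0": 1.0, "s1": 0.0}, "s1": {"s0": 0.0, "s1": 1.0}},
    "move": {"s0": {"s0": 0.1, "s1": 0.9}, "s1": {"s0": 0.9, "s1": 0.1}},
}
R = {"s0": {"stay": 0.0, "move": 1.0}, "s1": {"stay": 2.0, "move": 0.0}}
pi, V = value_iteration(S, A, Pr, R, gamma=0.9, epsilon=1e-6)
print(pi, {s: round(v, 2) for s, v in V.items()})
```

Each sweep is a contraction with factor $\gamma$, which is why the number of iterations needed grows with the $1/(1-\gamma)$ term in the convergence bound quoted above.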