
INTRODUCTION TO DYNAMIC PROGRAMMING

Inherent in the proof of the theorem is a procedure for constructing an optimal policy. Indeed, much of the merit of dynamic programming lies in the efficiency of this algorithm. For the algorithm, it will be convenient to describe the return function in slightly different notation. Define the function h(x, d_x, v_δ) by

h(x, d_x, v_δ) = v_γ(x), where γ_x = d_x and γ_z = δ_z for all z ≠ x.
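As a concrete reading of this definition, the following sketch (Python, with a hypothetical policy_value routine standing in for the computation of v_γ; both the routine and the data layout are assumptions for illustration, not part of the text) forms γ from δ by replacing only the decision at state x and then evaluates the resulting return at x.

```python
def h(x, d_x, delta, policy_value):
    """h(x, d_x, v_delta) = v_gamma(x), where the policy gamma agrees with
    delta at every state except x and uses decision d_x at x.

    delta        -- dict mapping each state z to its decision delta_z
    policy_value -- hypothetical routine returning the return function v of a
                    policy as a dict {state: value}
    """
    gamma = dict(delta)   # gamma_z = delta_z for every z != x ...
    gamma[x] = d_x        # ... and gamma_x = d_x
    # By the lemma, the result depends on delta only through v_delta.
    return policy_value(gamma)[x]
```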

Two notions are present in this definition. First, the dependence on δ has been suppressed; only v_δ remains. This is justified by the lemma, which assures that if δ and γ are two policies such that v_δ = v_γ, then v_η(x) = v_λ(x), where η and λ are defined by η_x = λ_x = d_x and, for all z ≠ x, η_z = δ_z and λ_z = γ_z. Second, the dependence on d_x in h(x, d_x, v_δ) has been made explicit. Note that h(x, d_x, v_δ) is not a function of δ_x: that decision δ_x is immaterial, and it is not required that δ_x = d_x. In economic terms, h(x, d_x, v_δ) might be interpreted as the cumulative return obtained by starting at state x and choosing decision d_x with the prospect of receiving the terminating reward v_δ(z) if transition occurs to state z. The algorithm is now displayed in the form of a corollary, whose proof replicates that of the theorem and is left as an exercise for the reader (Exercise CR.-25).
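For the stochastic model suggested by the interpretation above, one natural concrete form (an assumption made here for illustration, not a quotation of the text) is h(x, d_x, v_δ) = r(x, d_x) + Σ_z p(z | x, d_x) v_δ(z), and the procedure referred to earlier would pick, at each state, a decision maximizing this quantity. The sketch below uses hypothetical reward and transition tables r and p; it illustrates that reading and is not the corollary itself.

```python
def h_stochastic(x, d_x, v_delta, r, p):
    """One-step return of decision d_x at state x, plus the terminating
    reward v_delta(z) received if transition occurs to state z.

    r[(x, d)]     -- immediate reward of decision d at state x   (assumed data)
    p[(x, d)][z]  -- probability of a transition from x to z     (assumed data)
    """
    return r[(x, d_x)] + sum(prob * v_delta[z]
                             for z, prob in p[(x, d_x)].items())


def improve_policy(states, decisions, v_delta, r, p):
    """At each state x, choose a decision maximizing h(x, d_x, v_delta).
    This is one plausible reading of the policy-construction step; the
    corollary below should be consulted for the exact statement."""
    return {x: max(decisions[x],
                   key=lambda d: h_stochastic(x, d, v_delta, r, p))
            for x in states}
```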

Corollary 1 Suppose the monotonicity and termination assumptions are satisfied. Then for each n, the policy
