Lecture Notes - Department of Mathematics and Statistics - Queen's ...
CHAPTER 2. CONTROLLED MARKOV CHAINS
is independent of the initial distribution (or initial condition) on x_0.

The last two results are computationally very important, as there are powerful computational algorithms that allow one to develop such stationary policies.
In the following set of notes, we will first consider further properties of Markov chains, since under a Markov control policy the controlled state becomes a Markov chain. We will then return to controlled Markov chains and the development of optimal control policies.

The classification of Markov chains in the next topic will implicitly characterize the set of problems for which stationary policies contain optimal admissible policies.
2.3.1 Partially Observed Model

Consider the following model:

x_{t+1} = f(x_t, u_t, w_t),   y_t = g(x_t, v_t).
Here, as before, x_t is the state, u_t ∈ U is the control, and (w_t, v_t) ∈ W × V are second-order, zero-mean, i.i.d. noise processes with w_t independent of v_t. In addition to the previous fully observed model, y_t denotes an observation variable taking values in Y, a subset of R^n in the context of this review. The controller only has causal access to the second component {y_t} of the process. An admissible policy Π consists of control functions measurable with respect to σ({y_s, s ≤ t}). We denote the observed history spaces by H_0 := P, H_t = H_{t−1} × Y × U. Hence, the set of (wide-sense) causal control policies is such that P(u(h_t) ∈ U | h_t) = 1 for all h_t ∈ H_t.
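As a concrete illustration of the model above, the following sketch simulates a scalar instance in which the controller sees only y_t and must choose u_t causally from the observations. The specific functions f and g, the noise variances, and the feedback gain are all assumptions made purely for illustration, not part of the notes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical scalar instance of x_{t+1} = f(x_t, u_t, w_t), y_t = g(x_t, v_t):
# f and g are taken linear here only so the sketch is concrete.
def f(x, u, w):
    return 0.9 * x + u + w

def g(x, v):
    return x + v

T = 20
x = 0.0
y_history = []
for t in range(T):
    v = rng.normal(0.0, 0.5)      # observation noise v_t
    y = g(x, v)                   # the controller sees only y_t, not x_t
    y_history.append(y)
    u = -0.5 * y                  # admissible policy: u_t depends on y_{[0,t]} only
    w = rng.normal(0.0, 0.5)      # process noise w_t, independent of v_t
    x = f(x, u, w)

print(len(y_history))
```

The key structural point is that the policy line uses only the observation history, never x directly, matching the measurability requirement σ({y_s, s ≤ t}).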
One can transform a partially observed Markov decision problem into a fully observed Markov decision problem via an enlargement of the state space. In particular, we obtain via Bayes' rule and the properties of total probability the following dynamical recursion:

π_t(A) := P(x_t ∈ A | y_{[0,t]}, u_{[0,t−1]})
        = [ ∫_A ∫_X π_{t−1}(dx_{t−1}) r(y_t | x_t) P(dx_t | x_{t−1}, u_{t−1}) ] / [ ∫_X ∫_X π_{t−1}(dx_{t−1}) r(y_t | x_t) P(dx_t | x_{t−1}, u_{t−1}) ],
where we assume that ∫_B r(y|x) dy = P(y_t ∈ B | x_t = x) for any B ∈ B(Y), with r denoting the conditional density of the observation given the state. The conditional measure process becomes a controlled Markov chain in P(X), which we endow with the topology of weak convergence.
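For finite X and Y the integrals in the recursion become sums, and the filter update can be sketched directly. The transition matrices and observation likelihoods below are illustrative assumptions; P[u] plays the role of P(dx_t | x_{t−1}, u), and R[x, y] plays the role of r(y | x).

```python
import numpy as np

# Illustrative two-state, two-observation model (numbers are assumptions).
P = {0: np.array([[0.9, 0.1],
                  [0.2, 0.8]]),     # P(x_t | x_{t-1}, u=0)
     1: np.array([[0.5, 0.5],
                  [0.5, 0.5]])}     # P(x_t | x_{t-1}, u=1)
R = np.array([[0.8, 0.2],
              [0.3, 0.7]])          # R[x, y] ~ r(y | x)

def filter_update(pi_prev, u_prev, y):
    # Prediction: sum over x_{t-1} of pi_{t-1}(x_{t-1}) P(x_t | x_{t-1}, u)
    pred = pi_prev @ P[u_prev]
    # Correction: multiply by the observation likelihood r(y | x_t)
    unnorm = pred * R[:, y]
    # Normalization is the denominator of the recursion
    return unnorm / unnorm.sum()

pi = np.array([0.5, 0.5])
pi = filter_update(pi, u_prev=0, y=1)
print(pi)          # a probability vector: the updated belief pi_t
```

The three lines of `filter_update` are exactly the numerator and denominator of the recursion above, specialized to counting measure.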
Theorem 2.3.1 The process {π_t, u_t} is a controlled Markov chain. That is, under any admissible control policy, given π_t and the action u_t at time t ≥ 0, π_{t+1} is conditionally independent of {π_s, u_s, s ≤ t − 1}.
Let the cost function to be minimized be

E^Π_{x_0} [ Σ_{t=0}^{T−1} c(x_t, u_t) ],

where E^Π_{x_0}[·] denotes the expectation over all sample paths with initial state given by x_0 under policy Π.
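This expectation can be approximated by averaging the accumulated stage cost over sampled paths. The sketch below does this for an assumed two-state chain under a stationary Markov policy; the transition matrices, stage cost, and policy are all hypothetical choices for the example.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative two-state controlled chain (numbers are assumptions).
P = {0: np.array([[0.9, 0.1], [0.2, 0.8]]),
     1: np.array([[0.5, 0.5], [0.5, 0.5]])}
c = lambda x, u: float(x) + 0.1 * u      # assumed stage cost c(x, u)
policy = lambda x: 0 if x == 0 else 1    # a stationary Markov policy

def run_cost(x0, T):
    """Sample one path and return sum_{t=0}^{T-1} c(x_t, u_t)."""
    x, total = x0, 0.0
    for t in range(T):
        u = policy(x)
        total += c(x, u)
        x = rng.choice(2, p=P[u][x])     # sample x_{t+1} ~ P(. | x_t, u_t)
    return total

# Monte Carlo estimate of E^Pi_{x0}[ sum c(x_t, u_t) ]
costs = [run_cost(x0=0, T=10) for _ in range(5000)]
print(np.mean(costs))
```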
We transform the system into a fully observed Markov model as follows. Define the new cost as

˜c(π, u) = ∫_X c(x, u) π(dx),   π ∈ P(X).

The stochastic transition kernel q is given by

q(dx, dy | π, u) = ∫_X P(dx, dy | x′, u) π(dx′),   π ∈ P(X),

and this kernel can be decomposed as q(dx, dy | π, u) = P(dy | π, u) P(dx | π, u, y).
The second term here is the filtering equation, mapping (π, u, y) ∈ P(X) × U × Y to P(X). It follows that (P(X), U, K, ˜c) defines a completely observable controlled Markov process. Here, we have

K(B | π, u) = ∫_Y 1_{(P(· | π, u, y) ∈ B)} P(dy | π, u),   ∀B ∈ B(P(X)),

with 1_{(·)} denoting the indicator function.
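For finite X and Y the kernel K places mass on finitely many next beliefs, one per observation y, with weight P(y | π, u); the new cost ˜c(π, u) is just a weighted average of the stage cost under π. The sketch below enumerates this support, reusing the same kind of illustrative matrices as before (all numbers are assumptions).

```python
import numpy as np

# Illustrative model (numbers are assumptions, as in the filter sketch).
P = {0: np.array([[0.9, 0.1], [0.2, 0.8]])}   # P(x_t | x_{t-1}, u=0)
R = np.array([[0.8, 0.2], [0.3, 0.7]])        # R[x, y] ~ r(y | x)
c = lambda x, u: float(x)                     # assumed stage cost

def belief_kernel(pi, u):
    """Return the support of K(. | pi, u): pairs (P(y|pi,u), next belief)."""
    pred = pi @ P[u]                          # P(x_{t+1} | pi, u)
    out = []
    for y in range(R.shape[1]):
        p_y = float(pred @ R[:, y])           # P(y | pi, u)
        next_pi = pred * R[:, y] / p_y        # filtering map (pi, u, y) -> P(X)
        out.append((p_y, next_pi))
    return out

pi = np.array([0.5, 0.5])
# New cost: tilde_c(pi, u) = sum_x c(x, u) pi(x)
tilde_c = sum(c(x, 0) * pi[x] for x in range(2))
support = belief_kernel(pi, 0)
print(tilde_c, sum(p for p, _ in support))    # weights over y sum to 1
```

Each pair returned by `belief_kernel` is one atom of K(· | π, u): the decomposition q = P(dy | π, u) P(dx | π, u, y) made explicit, with the second factor realized by the filter.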