CHAPTER 2. CONTROLLED MARKOV CHAINS

is independent of the initial distribution (or initial condition) on x_0.

The last two results are computationally very important, as there are powerful computational algorithms that allow one to compute such stationary policies.
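
For instance, with a finite state and action space and a discounted cost, value iteration is one such algorithm. The following is only a minimal sketch under assumed placeholder inputs (a transition matrix P[u] for each action u, a stage-cost matrix c, and a discount factor beta); it is not the formal development given later in these notes.

```python
import numpy as np

def value_iteration(P, c, beta=0.9, tol=1e-8):
    # P[u]: (n, n) transition matrix under action u; c[x, u]: stage cost; beta: discount factor.
    n_states, n_actions = c.shape
    V = np.zeros(n_states)
    while True:
        # Q[x, u] = c(x, u) + beta * E[ V(x_{t+1}) | x_t = x, u_t = u ]
        Q = c + beta * np.stack([P[u] @ V for u in range(n_actions)], axis=1)
        V_new = Q.min(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return Q.argmin(axis=1), V_new  # stationary (state-feedback) policy and its value
        V = V_new
```

The returned policy is a map from states to actions, i.e., exactly a stationary policy of the kind discussed above.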

In the following set of notes, we will first consider further properties of Markov chains, since under a Markov control policy the controlled state becomes a Markov chain. We will then return to controlled Markov chains and the development of optimal control policies.

The classification of Markov chains in the next topic will implicitly characterize the set of problems for which stationary policies contain optimal admissible policies.

2.3.1 Partially Observed Model

Consider the following model:

x_{t+1} = f(x_t, u_t, w_t),    y_t = g(x_t, v_t)

Here, as before, x_t is the state, u_t ∈ U is the control, (w_t, v_t) ∈ W × V are second-order, zero-mean, i.i.d. noise processes, and w_t is independent of v_t. In addition to the previous fully observed model, y_t denotes an observation variable taking values in Y, a subset of R^n in the context of this review. The controller only has causal access to the observation component {y_t} of the process. An admissible policy Π = {u_t} is such that each u_t is measurable with respect to σ({y_s, s ≤ t}).

We denote the observed history spaces as H_0 := P, H_t = H_{t-1} × Y × U. Hence, the set of (wide-sense) causal control policies is such that P(u(h_t) ∈ U | h_t) = 1 for all h_t ∈ H_t.
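
As a purely illustrative instance of this model (not one treated in these notes), one may take f and g to be scalar linear maps with additive Gaussian noise; the short sketch below simulates such a system under an admissible policy that uses only the observations.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical scalar linear instance: x_{t+1} = a*x_t + b*u_t + w_t,  y_t = x_t + v_t,
# with w_t, v_t zero-mean, i.i.d., and mutually independent.
a, b = 0.9, 1.0
x, T = 0.0, 20
for t in range(T):
    y = x + rng.normal(0.0, 0.5)              # observation y_t = g(x_t, v_t)
    u = -0.5 * y                              # admissible: u_t depends only on the observations
    x = a * x + b * u + rng.normal(0.0, 0.1)  # state update x_{t+1} = f(x_t, u_t, w_t)
```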

One can transform a partially observed Markov decision problem into a fully observed Markov decision problem via an enlargement of the state space. In particular, we obtain, via the properties of total probability, the following dynamical recursion:

π_t(A) := P(x_t ∈ A | y_{[0,t]}, u_{[0,t-1]})
        = ( ∫_A ∫_X π_{t-1}(dx_{t-1}) r(y_t | x_t) P(dx_t | x_{t-1}, u_{t-1}) ) / ( ∫_X ∫_X π_{t-1}(dx_{t-1}) r(y_t | x_t) P(dx_t | x_{t-1}, u_{t-1}) ),

where we assume that ∫_B r(y|x) dy = P(y_t ∈ B | x_t = x) for any B ∈ B(Y), and r denotes the corresponding observation density.
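
For a finite state space X (so that the integrals above become sums), this recursion is a standard Bayesian belief update. The sketch below is one possible implementation under assumed inputs: P[u] is a transition matrix and r(y, j) an observation density, both hypothetical placeholders.

```python
import numpy as np

def filter_update(pi_prev, u_prev, y, P, r):
    """One step of the belief recursion pi_{t-1} -> pi_t on a finite state space.

    pi_prev: belief pi_{t-1} as an array of shape (n,)
    P:       P[u][i, j] = P(x_t = j | x_{t-1} = i, u_{t-1} = u)
    r:       r(y, j) = observation density of y_t given x_t = j
    """
    # Prediction: sum over x_{t-1} of pi_{t-1}(x_{t-1}) P(dx_t | x_{t-1}, u_{t-1})
    predict = pi_prev @ P[u_prev]
    # Correction: weight by r(y_t | x_t) and normalize (the denominator of the recursion)
    unnorm = np.array([r(y, j) for j in range(len(pi_prev))]) * predict
    return unnorm / unnorm.sum()
```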

The conditional measure process becomes a controlled Markov chain in P(X), which we endow with the weak convergence topology.

Theorem 2.3.1 The process {π_t, u_t} is a controlled Markov chain. That is, under any admissible control policy, given the action at time t ≥ 0 and π_t, π_{t+1} is conditionally independent of {π_s, u_s, s ≤ t − 1}.

Let the cost function to be minimized be

Σ_{t=0}^{T−1} E^Π_{x_0}[c(x_t, u_t)],

where E^Π_{x_0}[·] denotes the expectation over all sample paths with initial state given by x_0 under policy Π.

We transform the system into a fully observed Markov model as follows. Define the new cost as

c̃(π, u) = ∫_X c(x, u) π(dx),    π ∈ P(X).

The stochastic transition kernel q is given by

q(dx, dy | π, u) = ∫_X P(dx, dy | x′, u) π(dx′),    π ∈ P(X),

and this kernel can be decomposed as q(dx, dy | π, u) = P(dy | π, u) P(dx | π, u, y).

The second term here is the filtering equation, mapping (π, u, y) ∈ P(X) × U × Y to P(X). It follows that (P(X), U, K, c̃) defines a completely observable controlled Markov process. Here, we have

K(B | π, u) = ∫_Y 1_{(P(· | π, u, y) ∈ B)} P(dy | π, u),    ∀B ∈ B(P(X)),

with 1_{(·)} denoting the indicator function.
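
In the same finite setting, the kernel K can be sampled by first drawing y from P(dy | π, u) and then applying the filtering map (π, u, y) ↦ P(X). The sketch below does this for a hypothetical model with a finite observation grid (the transition matrices P[u] and density r are placeholders, as before), and also evaluates the new cost c̃(π, u) = ∫_X c(x, u) π(dx).

```python
import numpy as np

def sample_next_belief(pi, u, P, r, y_grid, rng):
    """Draw the next belief according to K(. | pi, u) on a finite state/observation model."""
    predict = pi @ P[u]                                    # distribution of x_t given (pi, u)
    likes = np.array([[r(y, x) for x in range(len(pi))] for y in y_grid])
    p_y = likes @ predict                                  # P(y | pi, u) over the grid
    i = rng.choice(len(y_grid), p=p_y / p_y.sum())         # sample y_t ~ P(dy | pi, u)
    post = likes[i] * predict                              # filtering map (pi, u, y) -> P(X)
    return post / post.sum()

def cost_tilde(pi, u, c):
    """New cost on beliefs: c~(pi, u) = sum_x c(x, u) pi(x)."""
    return pi @ c[:, u]
```

Together, these two functions are a sampled version of the completely observed model (P(X), U, K, c̃) on beliefs.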
