Lecture Notes - Department of Mathematics and Statistics - Queen's ...
CHAPTER 2. CONTROLLED MARKOV CHAINS
is independent of the initial distribution (or initial condition) on x_0.

The last two results are computationally very important, as there are powerful computational algorithms that allow one to develop such stationary policies.
In the following set of notes, we will first consider further properties of Markov chains, since under a Markov control policy the controlled state becomes a Markov chain. We will then return to controlled Markov chains and the development of optimal control policies.

The classification of Markov chains in the next topic will implicitly characterize the set of problems for which stationary policies contain optimal admissible policies.
2.3.1 Partially Observed Model

Consider the following model:

x_{t+1} = f(x_t, u_t, w_t),   y_t = g(x_t, v_t).
Here, as before, x_t is the state, u_t ∈ U is the control, and (w_t, v_t) ∈ W × V are second-order, zero-mean, i.i.d. noise processes with w_t independent of v_t. In addition to the previous fully observed model, y_t denotes an observation variable taking values in Y, a subset of R^n in the context of this review. The controller only has causal access to the second component {y_t} of the process. An admissible policy Π consists of control functions measurable with respect to σ({y_s, s ≤ t}). We denote the observed history spaces by H_0 := P, H_t = H_{t−1} × Y × U. Hence, the set of (wide-sense) causal control policies is such that P(u(h_t) ∈ U | h_t) = 1 for all h_t ∈ H_t.
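As a concrete illustration of the model above, the following sketch simulates a scalar instance in which the controller sees only y_t and must choose u_t causally from the observations. The specific functions f and g, the noise variances, and the feedback gain are all assumptions made purely for illustration, not part of the notes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical scalar instance of x_{t+1} = f(x_t, u_t, w_t), y_t = g(x_t, v_t):
# f and g are taken linear here only so the sketch is concrete.
def f(x, u, w):
    return 0.9 * x + u + w

def g(x, v):
    return x + v

T = 20
x = 0.0
y_history = []
for t in range(T):
    v = rng.normal(0.0, 0.5)      # observation noise v_t
    y = g(x, v)                   # the controller sees only y_t, not x_t
    y_history.append(y)
    u = -0.5 * y                  # admissible policy: u_t depends on y_{[0,t]} only
    w = rng.normal(0.0, 0.5)      # process noise w_t, independent of v_t
    x = f(x, u, w)

print(len(y_history))
```

The key structural point is that the policy line uses only the observation history, never x directly, matching the measurability requirement σ({y_s, s ≤ t}).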
One can transform a partially observed Markov decision problem into a fully observed Markov decision problem via an enlargement of the state space. In particular, we obtain via Bayes' rule and the properties of total probability the following dynamical recursion:

π_t(A) := P(x_t ∈ A | y_{[0,t]}, u_{[0,t−1]})
        = [ ∫_A ∫_X π_{t−1}(dx_{t−1}) r(y_t | x_t) P(dx_t | x_{t−1}, u_{t−1}) ] / [ ∫_X ∫_X π_{t−1}(dx_{t−1}) r(y_t | x_t) P(dx_t | x_{t−1}, u_{t−1}) ],
where we assume that ∫_B r(y|x) dy = P(y_t ∈ B | x_t = x) for any B ∈ B(Y), with r denoting the conditional density of the observation given the state. The conditional measure process becomes a controlled Markov chain in P(X), which we endow with the topology of weak convergence.
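For finite X and Y the integrals in the recursion become sums, and the filter update can be sketched directly. The transition matrices and observation likelihoods below are illustrative assumptions; P[u] plays the role of P(dx_t | x_{t−1}, u), and R[x, y] plays the role of r(y | x).

```python
import numpy as np

# Illustrative two-state, two-observation model (numbers are assumptions).
P = {0: np.array([[0.9, 0.1],
                  [0.2, 0.8]]),     # P(x_t | x_{t-1}, u=0)
     1: np.array([[0.5, 0.5],
                  [0.5, 0.5]])}     # P(x_t | x_{t-1}, u=1)
R = np.array([[0.8, 0.2],
              [0.3, 0.7]])          # R[x, y] ~ r(y | x)

def filter_update(pi_prev, u_prev, y):
    # Prediction: sum over x_{t-1} of pi_{t-1}(x_{t-1}) P(x_t | x_{t-1}, u)
    pred = pi_prev @ P[u_prev]
    # Correction: multiply by the observation likelihood r(y | x_t)
    unnorm = pred * R[:, y]
    # Normalization is the denominator of the recursion
    return unnorm / unnorm.sum()

pi = np.array([0.5, 0.5])
pi = filter_update(pi, u_prev=0, y=1)
print(pi)          # a probability vector: the updated belief pi_t
```

The three lines of `filter_update` are exactly the numerator and denominator of the recursion above, specialized to counting measure.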
Theorem 2.3.1 The process {π_t, u_t} is a controlled Markov chain. That is, under any admissible control policy, given π_t and the action u_t at time t ≥ 0, π_{t+1} is conditionally independent of {π_s, u_s, s ≤ t − 1}.
Let the cost function to be minimized be

E^Π_{x_0} [ Σ_{t=0}^{T−1} c(x_t, u_t) ],

where E^Π_{x_0}[·] denotes the expectation over all sample paths with initial state given by x_0 under policy Π.
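This expectation can be approximated by averaging the accumulated stage cost over sampled paths. The sketch below does this for an assumed two-state chain under a stationary Markov policy; the transition matrices, stage cost, and policy are all hypothetical choices for the example.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative two-state controlled chain (numbers are assumptions).
P = {0: np.array([[0.9, 0.1], [0.2, 0.8]]),
     1: np.array([[0.5, 0.5], [0.5, 0.5]])}
c = lambda x, u: float(x) + 0.1 * u      # assumed stage cost c(x, u)
policy = lambda x: 0 if x == 0 else 1    # a stationary Markov policy

def run_cost(x0, T):
    """Sample one path and return sum_{t=0}^{T-1} c(x_t, u_t)."""
    x, total = x0, 0.0
    for t in range(T):
        u = policy(x)
        total += c(x, u)
        x = rng.choice(2, p=P[u][x])     # sample x_{t+1} ~ P(. | x_t, u_t)
    return total

# Monte Carlo estimate of E^Pi_{x0}[ sum c(x_t, u_t) ]
costs = [run_cost(x0=0, T=10) for _ in range(5000)]
print(np.mean(costs))
```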
We transform the system into a fully observed Markov model as follows. Define the new cost as

˜c(π, u) = ∫_X c(x, u) π(dx),   π ∈ P(X).

The stochastic transition kernel q is given by

q(dx, dy | π, u) = ∫_X P(dx, dy | x′, u) π(dx′),   π ∈ P(X),

and this kernel can be decomposed as q(dx, dy | π, u) = P(dy | π, u) P(dx | π, u, y).
The second term here is the filtering equation, mapping (π, u, y) ∈ P(X) × U × Y to P(X). It follows that (P(X), U, K, ˜c) defines a completely observable controlled Markov process. Here, we have

K(B | π, u) = ∫_Y 1_{(P(· | π, u, y) ∈ B)} P(dy | π, u),   ∀B ∈ B(P(X)),

with 1_{(·)} denoting the indicator function.
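For finite X and Y the kernel K places mass on finitely many next beliefs, one per observation y, with weight P(y | π, u); the new cost ˜c(π, u) is just a weighted average of the stage cost under π. The sketch below enumerates this support, reusing the same kind of illustrative matrices as before (all numbers are assumptions).

```python
import numpy as np

# Illustrative model (numbers are assumptions, as in the filter sketch).
P = {0: np.array([[0.9, 0.1], [0.2, 0.8]])}   # P(x_t | x_{t-1}, u=0)
R = np.array([[0.8, 0.2], [0.3, 0.7]])        # R[x, y] ~ r(y | x)
c = lambda x, u: float(x)                     # assumed stage cost

def belief_kernel(pi, u):
    """Return the support of K(. | pi, u): pairs (P(y|pi,u), next belief)."""
    pred = pi @ P[u]                          # P(x_{t+1} | pi, u)
    out = []
    for y in range(R.shape[1]):
        p_y = float(pred @ R[:, y])           # P(y | pi, u)
        next_pi = pred * R[:, y] / p_y        # filtering map (pi, u, y) -> P(X)
        out.append((p_y, next_pi))
    return out

pi = np.array([0.5, 0.5])
# New cost: tilde_c(pi, u) = sum_x c(x, u) pi(x)
tilde_c = sum(c(x, 0) * pi[x] for x in range(2))
support = belief_kernel(pi, 0)
print(tilde_c, sum(p for p, _ in support))    # weights over y sum to 1
```

Each pair returned by `belief_kernel` is one atom of K(· | π, u): the decomposition q = P(dy | π, u) P(dx | π, u, y) made explicit, with the second factor realized by the filter.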