Lecture Notes - Department of Mathematics and Statistics - Queen's ...
Queen's University

Mathematics and Engineering and Mathematics and Statistics

Control of Stochastic Systems

MATH/MTHE 472/872 Lecture Notes

Serdar Yüksel

December 19, 2013
ii<br />
This document is a set <strong>of</strong> supplemental lecture notes that has been used for Math 472/872: Control <strong>of</strong> Stochastic<br />
Systems, at Queen’s University, since 2009.<br />
Serdar Yüksel
Contents

1 Review of Probability
  1.1 Introduction
  1.2 Measurable Space
    1.2.1 Borel σ-field
    1.2.2 Measurable Function
    1.2.3 Measure
    1.2.4 The Extension Theorem
    1.2.5 Integration
    1.2.6 Fatou's Lemma, the Monotone Convergence Theorem and the Dominated Convergence Theorem
  1.3 Probability Space and Random Variables
    1.3.1 Discrete-Time Stochastic Processes
    1.3.2 More on Random Variables and Probability density functions
    1.3.3 Independence and Conditional Probability
  1.4 Markov Chains
  1.5 Exercises

2 Controlled Markov Chains
  2.1 Controlled Markov Models
    2.1.1 Fully Observed Markov Control Problem Model
    2.1.2 Classes of Control Policies
  2.2 Markov Chain Induced by a Markov Policy
  2.3 Performance of Policies
    2.3.1 Partially Observed Model
  2.4 Exercises

3 Classification of Markov Chains
  3.1 Countable State Space Markov Chains
    3.1.1 Recurrence and Transience
  3.2 Stability and Invariant Distributions
    3.2.1 Invariant Measures via an Occupational Characterization
  3.3 Uncountable (Complete, Separable, Metric) State Spaces
    3.3.1 Chapman-Kolmogorov Equations
    3.3.2 Invariant Distributions for Uncountable Spaces
    3.3.3 Existence of an Invariant Distribution
    3.3.4 Dobrushin's Ergodic Coefficient for Uncountable State Space Chains
    3.3.5 Ergodic Theorem for Uncountable State Space Chains
  3.4 Further Results on the Existence of an Invariant Probability Measure
    3.4.1 Case with the Feller condition
    3.4.2 Case without the Feller condition
  3.5 Exercises

4 Martingales and Foster-Lyapunov Criteria for Stabilization of Markov Chains
  4.1 Martingales
    4.1.1 More on Expectations and Conditional Probability
    4.1.2 Some Properties of Conditional Expectation
    4.1.3 Discrete-Time Martingales
    4.1.4 Doob's Optional Sampling Theorem
    4.1.5 An Important Martingale Convergence Theorem
    4.1.6 Proof of the Birkhoff Individual Ergodic Theorem
    4.1.7 This section is optional: Further Martingale Theorems
    4.1.8 Azuma-Hoeffding Inequality for Martingales with Bounded Increments
  4.2 Stability of Markov Chains: Foster-Lyapunov Techniques
    4.2.1 Criterion for Positive Harris Recurrence
    4.2.2 Criterion for Finite Expectations
    4.2.3 Criterion for Recurrence
    4.2.4 On small and petite sets
    4.2.5 Criterion for Transience
    4.2.6 State Dependent Drift Criteria: Deterministic and Random-Time
  4.3 Convergence Rates to Equilibrium
    4.3.1 Lyapunov conditions: Geometric ergodicity
    4.3.2 Subgeometric ergodicity
  4.4 Conclusion
  4.5 Exercises

5 Dynamic Programming
  5.1 Bellman's Principle of Optimality
  5.2 Discussion: Why are Markov Policies Important? Why are optimal policies deterministic for such settings?
  5.3 Existence of Minimizing Selectors and Measurability
  5.4 Infinite Horizon Optimal Control Problems
    5.4.1 Discounted Cost
  5.5 Linear Quadratic Gaussian Problem
  5.6 Exercises

6 Partially Observed Markov Decision Processes
  6.1 Enlargement of the State-Space and the Construction of a Controlled Markov Chain
  6.2 Linear Quadratic Gaussian Case and Kalman Filtering
    6.2.1 Separation of Estimation and Control
  6.3 Estimation and Kalman Filtering
    6.3.1 Optimal Control of Partially Observed LQG Systems
  6.4 Partially Observed Markov Decision Processes in General Spaces
  6.5 Exercises

7 The Average Cost Problem
  7.1 Average Cost and the Average Cost Optimality Equation (ACOE)
  7.2 The Discounted Cost Approach to the Average Cost Problem
    7.2.1 Borel State and Action Spaces [Optional]
  7.3 Linear Programming Approach to Average Cost Markov Decision Problems
  7.4 Discussion for More General State Spaces
    7.4.1 Extreme Points and the Optimality of Deterministic Policies
    7.4.2 Sample-Path Optimality
  7.5 Exercises

8 Team Decision Theory and Information Structures
  8.1 Exercises

A On the Convergence of Random Variables
  A.1 Limit Events and Continuity of Probability Measures
  A.2 Borel-Cantelli Lemma
  A.3 Convergence of Random Variables
    A.3.1 Convergence almost surely (with probability 1)
    A.3.2 Convergence in Probability
    A.3.3 Convergence in Mean-square
    A.3.4 Convergence in Distribution
Chapter 1

Review of Probability

1.1 Introduction
Before discussing controlled Markov chains, we first review some preliminaries from probability theory.

Many events in the physical world are uncertain; that is, given the knowledge available up to a certain time, the entire process is not accurately predictable. Probability theory attempts to develop an understanding of such uncertainty in a consistent way, subject to a number of properties that a measure of uncertainty should satisfy. The predictions of probability theory may not be physically correct, but they are mathematically precise and consistent.
Examples of stochastic processes include: a) the temperature in a city at noon on each day of October 2013 (this process takes values in R^31); b) the sequence of outputs of a communication channel modeled by additive scalar Gaussian noise when the input is given by x = {x_1, ..., x_n} (the output process lives in R^n); c) infinitely many copies of a coin flip process (living in {H, T}^∞); d) the trajectory of a plane flying from point A to point B (taking values in C([t_0, t_f]), the space of continuous paths in R^3 with x_{t_0} = A and x_{t_f} = B); e) the exchange rate between the Canadian dollar and the American dollar over a given time index set T.
Some of these processes take values in countable spaces, and some do not. When the state space in which a random variable takes values is uncountable, further technical intricacies arise. These are best addressed with a precise characterization of probability and random variables.

Hence, probability theory can be used to model uncertainty in the real world in a consistent way, according to some properties that we expect such measures should admit. The theory may not model the physical world accurately, however. In the following, we develop a rigorous definition of probability.
1.2 Measurable Space

Let X be a collection of points. Let F be a collection of subsets of X with the following properties, so that F is a σ-field:

• X ∈ F

• If A ∈ F, then X \ A ∈ F

• If A_k ∈ F, k = 1, 2, 3, ..., then ∪_{k=1}^∞ A_k ∈ F

By De Morgan's laws, F then has to be closed under countable intersections as well.
If the third item above is required to hold only for finitely many unions or intersections, then the collection of subsets is said to be a field.

With the above, (X, F) is termed a measurable space (that is, we can associate a measure with this space, which we will discuss shortly). For example, the full power set of any set is a σ-field.
Recall the following: a set is countable if its elements can be put in one-to-one correspondence with a subset of the integers. Examples of countable spaces are finite sets and the set Q of rational numbers. An example of an uncountable set is the set R of real numbers.
A σ-field J is generated by a collection of sets A if J is the smallest σ-field containing the sets in A; in this case, we write J = σ(A). Such a σ-field always exists.

A σ-field is to be considered as an abstract collection of sets. It is in general not possible to obtain the elements of a σ-field through countably many operations of unions, intersections and complements starting from a generating set.
In this course, we will be interested in σ-fields which are countably generated. Such σ-fields have certain desirable properties related to the existence of conditional probability measures, as well as properties related to the extension theorem and the construction of measures.

Sub-σ-fields can be interpreted as representing information about an underlying process. We will discuss this interpretation further; it will be a recurring theme in our discussions in the context of stochastic control problems.
1.2.1 Borel σ-field

An important class of σ-fields is the Borel σ-field on a metric (or, more generally, topological) space. Such a σ-field is the one generated by the open sets. The term open naturally depends on the space being considered. For this course, we will mainly consider the space of real numbers and complete, separable metric spaces. Recall that in a metric space with metric d, a set U is open if for every x ∈ U there exists some ε > 0 such that {y : d(x, y) < ε} ⊂ U.

The Borel σ-field on R is the one generated by sets of the form (a, b), that is, the one generated by open intervals. We will denote the Borel σ-field on a space X by B(X). We may take a, b to be rational numbers, so that the σ-field is countably generated.

Not all subsets of R are Borel sets, that is, elements of the Borel σ-field.
We can also define a Borel σ-field on a product space. Let X be a complete, separable, metric space such as R. Let X^Z denote the infinite product of X. If this space is endowed with the product metric, the open sets are sets of the form ∏_{i∈Z} A_i, where only finitely many of these sets are not equal to X and the remaining ones are open. We define cylinder sets in this product space as

B_{[A_m, m∈I]} = {x ∈ X^Z : x_m ∈ A_m, m ∈ I}, A_m ∈ B(X),

where I ⊂ Z with |I| < ∞, that is, I has finitely many elements. Thus, in the above, if x ∈ B_{[A_m, m∈I]}, then x_m ∈ A_m for each m ∈ I and the remaining coordinates are arbitrary. The σ-field generated by such cylinder sets is the Borel σ-field on the product space. Such a construction is very important for stochastic processes.
1.2.2 Measurable Function

If (X, B(X)) and (Y, B(Y)) are measurable spaces, we say a mapping h : X → Y is measurable if

h^{-1}(B) := {x : h(x) ∈ B} ∈ B(X), ∀B ∈ B(Y).
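On finite spaces this definition can be tested directly: list a σ-field on the domain that is strictly smaller than the power set, and check whether every preimage lands in it. A sketch with hypothetical maps h1, h2 (not from the notes):

```python
def preimage(h, domain, B):
    """h^{-1}(B) = {x in domain : h(x) in B}."""
    return frozenset(x for x in domain if h[x] in B)

def is_measurable(h, domain, F_X, F_Y):
    """h is measurable iff the preimage of every set in F_Y lies in F_X."""
    return all(preimage(h, domain, B) in F_X for B in F_Y)

X = {1, 2, 3, 4}
# A sigma-field on X strictly smaller than the power set (atoms {1,2} and {3,4}):
F_X = {frozenset(), frozenset(X), frozenset({1, 2}), frozenset({3, 4})}
# The power set of {'a', 'b'} on the range:
F_Y = {frozenset(), frozenset({'a', 'b'}), frozenset({'a'}), frozenset({'b'})}

h1 = {1: 'a', 2: 'a', 3: 'b', 4: 'b'}   # constant on the atoms of F_X
h2 = {1: 'a', 2: 'b', 3: 'b', 4: 'b'}   # splits the atom {1, 2}
print(is_measurable(h1, X, F_X, F_Y))   # True
print(is_measurable(h2, X, F_X, F_Y))   # False: h2^{-1}({'a'}) = {1} is not in F_X
```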
1.2.3 Measure

A positive measure µ on (X, B(X)) is a function from B(X) to [0, ∞] which is countably additive, that is, for A_k ∈ B(X) with A_k ∩ A_j = ∅ for k ≠ j:

µ(∪_{k=1}^∞ A_k) = ∑_{k=1}^∞ µ(A_k).

Definition 1.2.1 µ is a probability measure if it is positive and µ(X) = 1.

Definition 1.2.2 A measure µ is finite if µ(X) < ∞, and σ-finite if there exists a collection of subsets A_k such that X = ∪_{k=1}^∞ A_k with µ(A_k) < ∞ for all k.
On the real line R, the Lebesgue measure is defined on the Borel σ-field (in fact, on a somewhat larger σ-field) so that for A = (a, b), µ(A) = b − a. The Borel σ-field is a subset of the collection of Lebesgue measurable sets; that is, there exist Lebesgue measurable sets which are not Borel sets. For a definition and the construction of Lebesgue measurable sets, see [4].
1.2.4 The Extension Theorem

Theorem 1.2.1 (Carathéodory's Extension Theorem) Let M be a field of subsets of X, and suppose that there exists a set function P on M satisfying: (i) P(∅) = 0; (ii) there exist countably many sets A_n ∈ M such that X = ∪_n A_n with P(A_n) < ∞; (iii) for pairwise disjoint sets A_n ∈ M, if ∪_n A_n ∈ M, then P(∪_n A_n) = ∑_n P(A_n). Then, there exists a unique measure P′ on the σ-field generated by M, σ(M), which is consistent with P on M.
The above is useful since, when one wishes to verify that two measures are equal, it suffices to check that they agree on a generating field, and not necessarily on the entire σ-field.

When we apply the above to probability measures defined on a product space, and consider the field consisting of finitely many unions and intersections of finite-dimensional Borel sets together with the empty set, we obtain the following.
Theorem 1.2.2 (Kolmogorov's Extension Theorem) Let X be a complete, separable metric space and let µ_n be a probability measure on X^n, the n-fold product of X, for each n = 1, 2, ..., such that

µ_n(A_1 × A_2 × · · · × A_n) = µ_{n+1}(A_1 × A_2 × · · · × A_n × X)

for every n and every sequence of Borel sets A_k. Then, there exists a unique probability measure µ on (X^∞, B(X^∞)) which is consistent with each of the µ_n's.
For example, given a consistent family of finite-dimensional distributions on a product space, one can define a measure on the product space which is consistent with all the finite-dimensional distributions. Likewise, we can construct the Lebesgue measure by defining it on finitely many unions and intersections of intervals of the form (a, b] or [a, b) and extending this to the Borel σ-field.
1.2.5 Integration

Let h be a non-negative measurable function from (X, B(X)) to (R, B(R)). The Lebesgue integral of h with respect to a measure µ can be defined in three steps.

First, for A ∈ B(X), define 1_{x∈A} (also written 1_{(x∈A)} or 1_A(x)) as the indicator function of the event x ∈ A; that is, the function takes the value 1 if x ∈ A, and 0 otherwise. In this case, ∫_X 1_{x∈A} µ(dx) =: µ(A).

Next, call h a simple function if there exist A_1, A_2, ..., A_n, all in B(X), and positive numbers b_1, b_2, ..., b_n such that h(x) = ∑_{k=1}^n b_k 1_{x∈A_k}; for such h, set ∫_X h(x)µ(dx) = ∑_{k=1}^n b_k µ(A_k).

Finally, for any given non-negative measurable h, there exists a sequence of simple functions h_n such that h_n(x) → h(x) monotonically, that is, h_{n+1}(x) ≥ h_n(x). We define the Lebesgue integral as the limit:

∫_X h(x)µ(dx) := lim_{n→∞} ∫_X h_n(x)µ(dx).

We note that the space of simple functions is dense in the space of bounded measurable functions on R under the supremum norm.
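The construction above can be illustrated numerically. A sketch, assuming X = [0, 1] with the Lebesgue measure and h(x) = x² (illustrative choices): the standard dyadic simple functions h_n(x) = ⌊2ⁿh(x)⌋/2ⁿ increase monotonically to h, and their integrals approach ∫₀¹ x² dx = 1/3.

```python
from math import sqrt

def simple_integral(n):
    """Integral of the dyadic simple approximation h_n of h(x) = x^2
    on [0, 1] with respect to the Lebesgue measure.

    h_n takes the value k/2^n on A_k = {x : k/2^n <= x^2 < (k+1)/2^n};
    since h is increasing, mu(A_k) has the closed form
    sqrt((k+1)/2^n) - sqrt(k/2^n) (capped at the right endpoint 1)."""
    total = 0.0
    levels = 2 ** n
    for k in range(levels):             # h <= 1, so no truncation at level n is needed
        lo = sqrt(k / levels)           # left endpoint of A_k
        hi = sqrt(min((k + 1) / levels, 1.0))
        total += (k / levels) * (hi - lo)
    return total

for n in (2, 4, 8, 12):
    print(n, simple_integral(n))        # increases monotonically toward 1/3
```

Since h_n ≤ h pointwise, every value printed stays below 1/3, and the error at level n is at most 2⁻ⁿ.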
There are three important convergence theorems which we will not discuss in detail; their statements are given next.
1.2.6 Fatou's Lemma, the Monotone Convergence Theorem and the Dominated Convergence Theorem
Theorem 1.2.3 (Monotone Convergence Theorem) If µ is a σ-finite positive measure on (X, B(X)) and {f_n, n ∈ Z_+} is a sequence of measurable functions from X to R which converge pointwise and monotonically to f, that is, 0 ≤ f_n(x) ≤ f_{n+1}(x) for all n and

lim_{n→∞} f_n(x) = f(x)

for µ-almost every x, then

∫_X f(x)µ(dx) = lim_{n→∞} ∫_X f_n(x)µ(dx).
Theorem 1.2.4 (Fatou's Lemma) If µ is a σ-finite positive measure on (X, B(X)) and {f_n, n ∈ Z_+} is a sequence of non-negative measurable functions from X to R, then

∫_X liminf_{n→∞} f_n(x)µ(dx) ≤ liminf_{n→∞} ∫_X f_n(x)µ(dx).
Theorem 1.2.5 (Dominated Convergence Theorem) If (i) µ is a σ-finite positive measure on (X, B(X)); (ii) g(x) ≥ 0 is a Borel measurable function such that

∫_X g(x)µ(dx) < ∞;

and (iii) {f_n, n ∈ Z_+} is a sequence of measurable functions from X to R which satisfy |f_n(x)| ≤ g(x) for µ-almost every x and lim_{n→∞} f_n(x) = f(x), then

∫_X f(x)µ(dx) = lim_{n→∞} ∫_X f_n(x)µ(dx).

Note that the monotone convergence theorem imposes no boundedness restriction, whereas the dominated convergence theorem requires a domination condition. On the other hand, for the dominated convergence theorem, the pointwise convergence does not have to be monotone.
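The inequality in Fatou's lemma can be strict. A sketch with the standard example f_n = 1_{[n, n+1)} on (R, B(R), Lebesgue): each ∫ f_n dµ = 1, yet f_n(x) → 0 for every x, so the integral of the liminf is 0 < 1 (the numerical stand-in for the integral below is an assumption of the sketch, not part of the lemma):

```python
def f(n, x):
    """f_n = indicator of [n, n+1): non-negative, and f_n(x) -> 0 for every fixed x."""
    return 1.0 if n <= x < n + 1 else 0.0

def lebesgue_integral(n, step=0.001, upper=200.0):
    """Riemann-sum stand-in for the Lebesgue integral of f_n over R
    (valid here since f_n vanishes outside [0, upper])."""
    return sum(f(n, k * step) * step for k in range(int(upper / step)))

# Right-hand side of Fatou: every integral equals 1, so liminf of integrals is 1.
print([round(lebesgue_integral(n), 2) for n in range(5)])

# Left-hand side: liminf_n f_n(x) = 0 at every x, since f_n(x) = 0 once n > x.
xs = [0.0, 2.5, 10.0, 99.9]
tail_inf = [min(f(n, x) for n in range(int(x) + 1, int(x) + 50)) for x in xs]
print(tail_inf)   # all zero, so the integral of the liminf is 0 < 1
```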
1.3 Probability Space and Random Variables

Let (Ω, F) be a measurable space. If P is a probability measure on it, then the triple (Ω, F, P) forms a probability space. Here Ω is a set called the sample space, and F is called the event space; it is a σ-field of subsets of Ω.

Let (E, E) be another measurable space and X : (Ω, F, P) → (E, E) be a measurable map. We call X an E-valued random variable. The image of P under X is a probability measure on (E, E), called the law of X. The σ-field generated by the events {{ω : X(ω) ∈ A}, A ∈ E} is called the σ-field generated by X and is denoted by σ(X).
Consider a coin flip process, with possible outcomes {H, T}, heads or tails. We have a good intuitive understanding of the setting when someone tells us that a coin flip takes the value H with probability 1/2. Based on the definition of a random variable, we then view a coin flip process as a deterministic function from some space (Ω, F, P) to the space of heads and tails. Here, P denotes the uncertainty measure (think of the initial condition of the coin when it is being flipped, the flow of air, the surface where the coin lands, etc.; we summarize all these aspects, and all the uncertainty in the universe, by the abstract space (Ω, F, P)).
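This viewpoint can be made concrete with a standard (illustrative) choice of probability space: take Ω = [0, 1) with the Lebesgue (uniform) measure, so that the coin flip is the deterministic function X(ω) = H for ω < 1/2 and T otherwise, and all randomness resides in the draw of ω.

```python
import random

def X(omega):
    """A coin flip as a deterministic, measurable function of omega in [0, 1)."""
    return 'H' if omega < 0.5 else 'T'

# P(X = H) = Lebesgue measure of X^{-1}({H}) = [0, 1/2), which is 1/2.
# Monte Carlo check: draw omega uniformly and record the frequency of heads.
random.seed(0)
n = 100_000
heads = sum(X(random.random()) == 'H' for _ in range(n))
print(heads / n)   # close to 0.5
```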
1.3.1 Discrete-Time Stochastic Processes

One can define a sequence of random variables as a single random variable living in the product set; that is, we can consider {x_1, x_2, ..., x_N} as an individual random variable X which is (E × E × · · · × E)-valued in the measurable space (E × E × · · · × E, B(E × E × · · · × E)), where now the σ-fields are defined on the product set. In this case, the probability measure induced by a random sequence is defined on the σ-field B(E × E × · · · × E). We can extend the above to the case where the index set is Z_+, that is, the process is infinite dimensional.
Let X be a complete, separable, metric space, and let B(X) denote the Borel σ-field of subsets of X. Let Σ = X^T denote the sequence space of all one-sided (T = Z_+) or two-sided (T = Z) infinite sequences drawn from X. Thus, if T = Z and x ∈ Σ, then x = {..., x_{−1}, x_0, x_1, ...} with x_i ∈ X.

Let X_n : Σ → X denote the coordinate function, so that X_n(x) = x_n. Let B(Σ) denote the smallest σ-field containing all cylinder sets of the form {x : x_i ∈ B_i, m ≤ i ≤ n}, where B_i ∈ B(X), for all integers m, n. By the extension theorem, we can define a probability measure on B(Σ) through its characterization on these finite-dimensional cylinder sets.
A similar characterization also applies for continuous-time stochastic processes, where T is uncountable. The extension requires more delicate arguments, since finite-dimensional characterizations alone are too weak to uniquely determine a probability measure on a sufficiently rich σ-field. Such technicalities arise in the discussion of continuous-time Markov chains and controlled processes, typically requiring a construction from a separable product space.
1.3.2 More on Random Variables and Probability density functions

Consider a probability space (X, B(X), P) and an R-valued random variable U measurable with respect to (X, B(X)). This random variable induces a probability measure µ on B(R) such that for every interval (a, b]:

µ((a, b]) = P(U ∈ (a, b]) = P({x : U(x) ∈ (a, b]}).
When U is R-valued, the expectation of U is defined as

E[U] = ∫_R x µ(dx).

We define F(x) = µ((−∞, x]) as the cumulative distribution function of U. If the derivative of F exists, we call this derivative the density function of U, and denote it by p(x) (the distribution function is not always differentiable, for example when the random variable takes values only in the integers). If a density function exists, we can also write

E[U] = ∫_R x p(x) dx.
If a probability density function p exists, the law µ is said to be absolutely continuous with respect to the Lebesgue measure λ. In particular, the density function p is the Radon-Nikodym derivative of µ with respect to the Lebesgue measure, in the sense that for all Borel A: ∫_A p(x)λ(dx) = µ(A).
A probability density does not always exist. In particular, whenever there is a probability mass at a given point, a density does not exist: in R, if for some x, µ({x}) > 0, then we say there is a probability mass at x, and a density function does not exist.
Some examples of commonly encountered random variables are as follows:

• Gaussian (N(µ, σ²)):

p(x) = (1/(σ√(2π))) e^{−(x−µ)²/(2σ²)}.

• Exponential (with rate λ > 0):

F(x) = 1 − e^{−λx}, p(x) = λe^{−λx}, x ≥ 0.

• Uniform on [a, b] (U([a, b])):

F(x) = (x − a)/(b − a), p(x) = 1/(b − a), x ∈ [a, b].

• Binomial (B(n, p)):

p(k) = (n choose k) p^k (1 − p)^{n−k}, k = 0, 1, ..., n.

If n = 1, we also call a binomial random variable a Bernoulli random variable.
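As a quick sanity check on the binomial expression (with illustrative parameters n = 10, p = 0.3), the probabilities sum to 1 and the mean comes out to the standard value np:

```python
from math import comb

def binomial_pmf(n, p):
    """p(k) = (n choose k) p^k (1-p)^(n-k), for k = 0, ..., n."""
    return [comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)]

pmf = binomial_pmf(10, 0.3)
print(sum(pmf))                                  # probabilities sum to 1
print(sum(k * pk for k, pk in enumerate(pmf)))   # mean equals n*p = 3
print(binomial_pmf(1, 0.3))                      # n = 1: the Bernoulli case
```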
1.3.3 Independence and Conditional Probability

Consider A, B ∈ B(X) such that P(B) > 0. The quantity

P(A|B) = P(A ∩ B) / P(B)

is called the conditional probability of event A given B. As a function of A, this is itself a probability measure. If

P(A|B) = P(A),

then A and B are said to be independent events.

A countable collection of events {A_n} is independent if for any finite sub-collection A_{i_1}, A_{i_2}, ..., A_{i_m}, we have

P(A_{i_1}, A_{i_2}, ..., A_{i_m}) = P(A_{i_1})P(A_{i_2}) · · · P(A_{i_m}).

Here, we use the notation P(A, B) = P(A ∩ B). A sequence of events is said to be pairwise independent if for any pair (A_m, A_n), m ≠ n: P(A_m, A_n) = P(A_m)P(A_n).

Pairwise independence is weaker than independence.
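A classical example of this gap, checked exhaustively: two independent fair coin flips, with A = "first flip is heads", B = "second flip is heads", and C = "the two flips agree". Each pair of events is independent, but the triple is not.

```python
from fractions import Fraction
from itertools import product

# Sample space: two independent fair coin flips, each outcome with probability 1/4.
Omega = list(product('HT', repeat=2))
P = {w: Fraction(1, 4) for w in Omega}

A = {w for w in Omega if w[0] == 'H'}      # first flip is heads
B = {w for w in Omega if w[1] == 'H'}      # second flip is heads
C = {w for w in Omega if w[0] == w[1]}     # the two flips agree

def prob(E):
    return sum(P[w] for w in E)

# Each pair is independent:
for E, F in [(A, B), (A, C), (B, C)]:
    print(prob(E & F) == prob(E) * prob(F))   # True for all three pairs

# But the triple is not: P(A ∩ B ∩ C) = 1/4, while P(A)P(B)P(C) = 1/8.
print(prob(A & B & C), prob(A) * prob(B) * prob(C))
```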
1.4 Markov Chains

As discussed in Section 1.3.1, a sequence of random variables can be viewed as a single random variable taking values in the product set, with the induced probability measure defined on the σ-field B(E × E × · · · × E). If the probability measure for this sequence is such that

P(x_{k+1} ∈ A_{k+1} | x_k, x_{k−1}, ..., x_0) = P(x_{k+1} ∈ A_{k+1} | x_k), P-a.s.,

then {x_k} is said to be a Markov chain. Thus, for a Markov chain, the current state is sufficient to predict the future; past variables are not needed.
As an example, consider the following linear system:<br />
x t+1 = ax t + w t ,
where {w_t} is a sequence of independent random variables. This sequence is Markov. Another way to construct a Markov chain is the following: Let {x_t, t ≥ 0} be a random sequence with state space (X, B(X)), defined on a probability space (Ω, F, P), where B(X) denotes the Borel σ−field on X, Ω is the sample space, F a sigma field of subsets of Ω, and P a probability measure. For x ∈ X and D ∈ B(X), we let P(x, D) := P(x_{t+1} ∈ D | x_t = x) denote the transition probability from x to D, that is, the probability of the event {x_{t+1} ∈ D} given that x_t = x. If the probability of the same event given some history of the past (and the present) does not depend on the past, and hence is given by the same quantity, the chain is a Markov chain.
Thus, the Markov chain is completely determined by the transition probability and the probability of the initial state, P(x_0) = p_0. Hence, the probability of the event {x_{t+1} ∈ D} for any t can be computed recursively by starting at t = 0, with P(x_1 ∈ D) = ∑_x P(x_1 ∈ D | x_0 = x)P(x_0 = x), and iterating with a similar formula for t = 1, 2, . . ..
The above requires some justification: Kolmogorov's extension theorem is one way to successively extend the probability measures to the infinite product space.
We will continue our discussion on Markov chains after discussing controlled Markov chains.
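The recursive computation above is easy to carry out for a finite state space; the sketch below propagates the distribution of x_t by repeated application of the recursion (the 2-state transition matrix and initial distribution are illustrative choices, not from the notes):

```python
# Propagate the distribution of a finite-state Markov chain:
# P(x_{t+1} = j) = sum_i P(x_{t+1} = j | x_t = i) P(x_t = i).
# The transition matrix P and initial distribution p0 are made-up examples.
P = [[0.9, 0.1],
     [0.4, 0.6]]
p0 = [1.0, 0.0]  # start deterministically in state 0

def step(p, P):
    # one application of the recursion: p_{t+1}(j) = sum_i p(i) P[i][j]
    n = len(P)
    return [sum(p[i] * P[i][j] for i in range(n)) for j in range(n)]

p = p0
for t in range(3):
    p = step(p, P)
print(p)  # distribution of x_3
```

Each step is a row-vector/matrix product, so the distribution at any time is computable from p_0 and P alone, as the text asserts.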
1.5 Exercises
Exercise 1.5.1 a) Let F_β be a σ−field of subsets of some space X for all β ∈ Γ, where Γ is a family of indices. Let
F = ⋂_{β∈Γ} F_β.
Show that F is also a σ−field.
For a space X on which a distance metric is defined, the Borel σ−field is generated by the collection of open sets. This means that the Borel σ−field is the smallest σ−field containing the open sets, and as such it is the intersection of all σ-fields containing the open sets. On R, the Borel σ−field is the smallest σ−field generated by the open intervals.
b) Is the set of rational numbers an element of the Borel σ-field on R? Is the set of irrational numbers an element?
c) Let X be a countable set. On this set, let us define a metric as follows:
d(x, y) = 0 if x = y, and d(x, y) = 1 if x ≠ y.
Show that the Borel σ−field on X is generated by the individual sets {x}, x ∈ X.
d) Consider the distance function d(x, y) defined as above, now defined on R. Is the σ−field generated by the open sets of this metric the same as the Borel σ−field on R?
Exercise 1.5.2 a) Let X and Y be real-valued random variables defined on a given probability space. Show that X² and X + Y are also random variables.
b) Let F be a σ−field of subsets of a set X and let A ∈ F. Prove that {A ∩ B, B ∈ F} is a σ−field over A (that is, a σ−field of subsets of A).
Exercise 1.5.3 Consider a random variable X which takes values in [0, 1] and is uniformly distributed. We know that for every set S = [a, b), 0 ≤ a ≤ b ≤ 1, the probability of the random variable X taking values in S is P(X ∈ [a, b)) = U([a, b]) = b − a. Consider now the following question: Does every subset S ⊂ [0, 1] admit a probability in the form above? In the following we will provide a counterexample, known as the Vitali set.
We can show that a set picked as follows does not admit a measure: Let us define an equivalence relation such that x ≡ y if x − y ∈ Q. This equivalence relation divides [0, 1] into disjoint equivalence classes. Let A be a subset which picks exactly one element from each equivalence class (here, we adopt what is known as the Axiom of Choice). Since A contains an element from each equivalence class, each point of [0, 1] is contained in the union ∪_{q∈Q} (A ⊕ q).
For otherwise there would be a point x not covered by any equivalence class; but x is equivalent at least to the element of A chosen from its own class. Furthermore, since A contains only one point from each equivalence class, the sets A ⊕ q are disjoint: for otherwise two of them, A ⊕ q and A ⊕ q′ with q ≠ q′, would include a common point x, so that x − q and x − q′ would both be in A; this is a contradiction, since x − q and x − q′ are distinct points belonging to the same equivalence class, and A contains at most one point from each class. We expect the uniform distribution to be shift-invariant, therefore P(A) = P(A ⊕ q). But [0, 1] = ∪_q (A ⊕ q). Since a countable sum of identical non-negative terms can only be ∞ or 0, the contradiction follows: we cannot associate a probability with this set.
Chapter 2
Controlled Markov Chains
In the following, we discuss controlled Markov models.
2.1 Controlled Markov Models
Consider the following model:
x_{t+1} = f(x_t, u_t, w_t), (2.1)
where x_t is an X-valued state variable, u_t is a U-valued control action variable, {w_t} is a W-valued i.i.d. noise process, and f is a measurable function. We assume that X, U, W are subsets of Polish spaces. The model above in (2.1) contains (see [8]) the class of all stochastic processes which satisfy the following for all Borel sets B ∈ B(X), t ≥ 0, and all realizations x_[0,t], u_[0,t]:
P(x_{t+1} ∈ B | x_[0,t] = a_[0,t], u_[0,t] = b_[0,t]) = T(x_{t+1} ∈ B | a_t, b_t), (2.2)
where T(· | x, u) is a stochastic kernel from X × U to X. A stochastic process which satisfies (2.2) is called a controlled Markov chain.
2.1.1 Fully Observed Markov Control Problem Model
A Fully Observed Markov Control Problem is a five-tuple
(X, U, {U(x), x ∈ X}, T, c),
where
• X is the state space, a subset of a Polish space.
• U is the action space, a subset of a Polish space.
• K = {(x, u) : u ∈ U(x) ∈ B(U), x ∈ X} is the set of state-control pairs that are feasible. There might be different states where different control actions are possible.
• T is a state transition kernel, that is, T(A | x_t, u_t) = P(x_{t+1} ∈ A | x_t, u_t).
• c : K → R is the cost.
Consider for now that the objective to be minimized is given by J(x_0, Π) := E^Π_{ν_0}[∑_{t=0}^{T−1} c(x_t, u_t)], where ν_0 is the initial probability measure, that is, x_0 ∼ ν_0. The goal is to find a policy Π* (to be defined next) such that
J(x_0, Π*) ≤ J(x_0, Π), ∀Π,
where Π ranges over the admissible policies to be defined below. Such a Π* is an optimal policy. Here Π can also be called the strategy, or law.
2.1.2 Classes of Control Policies
Admissible Control Policies
Let H_0 := X, H_t = H_{t−1} × K for t = 1, 2, . . .. We let h_t denote an element of H_t, where h_t = {x_[0,t], u_[0,t−1]}. A deterministic admissible control policy Π is a sequence of functions {γ_t} with γ_t : H_t → U; in this case u_t = γ_t(h_t). A randomized control policy is a sequence Π = {Π_t, t ≥ 0} with Π_t : H_t → P(U) (where P(U) is the space of probability measures on U) such that
Π_t(u_t ∈ U(x_t) | h_t) = 1, ∀h_t ∈ H_t.
Markov Control Policies
A policy is randomized Markov if
P^Π_{x_0}(u_t ∈ C | h_t) = Π_t(u_t ∈ C | x_t), C ∈ B(U).
Hence, the control action only depends on the state and the time, and not on the past history. If the control strategy is deterministic, that is, if
Π_t(u_t = f_t(x_t) | x_t) = 1
for some function f_t, the control policy is said to be deterministic Markov.
Stationary Control Policies
A policy is randomized stationary if
P^Π_{x_0}(u_t ∈ C | h_t) = Π(u_t ∈ C | x_t), C ∈ B(U).
Hence, the control action only depends on the state, and not on the past history or on time. If the control strategy is deterministic, that is, if Π(u_t = f(x_t) | x_t) = 1 for some function f, the control policy is said to be deterministic stationary.
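The nesting of these policy classes can be seen schematically from the arguments each class is allowed to use; the binary-valued maps below are hypothetical examples of ours, not from the notes:

```python
# Hypothetical deterministic policies on a binary state/action space,
# distinguished only by what they are allowed to look at.

def admissible_policy(h_t):
    # h_t = (x_0, u_0, x_1, u_1, ..., x_t): may use the full history
    return sum(h_t) % 2

def markov_policy(t, x_t):
    # may use only the current state and the time index
    return (x_t + t) % 2

def stationary_policy(x_t):
    # may use only the current state; the same map at every time
    return x_t % 2

# A stationary policy induces a Markov policy (ignore t), and a Markov
# policy induces an admissible policy (ignore everything but x_t and t):
as_markov = lambda t, x_t: stationary_policy(x_t)
as_admissible = lambda h_t: markov_policy((len(h_t) - 1) // 2, h_t[-1])

h = (1, 0, 1)          # x_0 = 1, u_0 = 0, x_1 = 1, so here t = 1
print(as_markov(1, 1), as_admissible(h))
```

The two wrapper lambdas are the containment Π_S ⊂ Π_M ⊂ Π_A made concrete: restricting the information a policy uses never takes it outside the larger class.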
2.2 Markov Chain Induced by a Markov Policy
The following is an important result:
Theorem 2.2.1 Let the control policy be randomized Markov. Then, the controlled Markov chain becomes a Markov chain in X, that is, the state process itself becomes a Markov chain:
P^Π_{x_0}(x_{t+1} ∈ B | x_t, x_{t−1}, . . . , x_0) = Q^Π_t(x_{t+1} ∈ B | x_t), B ∈ B(X), t ≥ 1, P-a.s.
Proof: Let us consider the case where U is countable. Let B ∈ B(X). It follows that
P^Π_{x_0}(x_{t+1} ∈ B | x_t, x_{t−1}, . . . , x_0)
= ∑_{u_t} P^Π_{x_0}(x_{t+1} ∈ B, u_t | x_t, x_{t−1}, . . . , x_0)
= ∑_{u_t} P^Π_{x_0}(x_{t+1} ∈ B | u_t, x_t, x_{t−1}, . . . , x_0) P^Π_{x_0}(u_t | x_t, x_{t−1}, . . . , x_0)
= ∑_{u_t} Q^Π_{x_0}(x_{t+1} ∈ B | u_t, x_t) Π_t(u_t | x_t)
= ∑_{u_t} Q^Π_{x_0}(x_{t+1} ∈ B, u_t | x_t)
= Q^Π_t(x_{t+1} ∈ B | x_t). (2.3)
The essential point here is that the control only depends on x_t, and since x_{t+1} depends stochastically only on x_t and u_t, the desired result follows. In the above, we used the law of total probability and the properties of conditioning. ⋄
We note that, if the applied control policy is not only Markov but also stationary, then the induced Markov chain in X is time-homogeneous, that is, the stochastic kernel Q^Π_t(·|·) does not depend on time.
2.3 Performance of Policies
Consider a Markov control problem with an objective given as the minimization of
J(ν_0, Π) = E^Π_{ν_0}[∑_{t=0}^{T−1} c(x_t, u_t)],
where ν_0 denotes the distribution of x_0. If x_0 = x, we then write
J(x_0, Π) = E^Π_{x_0}[∑_{t=0}^{T−1} c(x_t, u_t)] = E^Π[∑_{t=0}^{T−1} c(x_t, u_t) | x_0].
Such a cost problem is known as a Finite Horizon Optimal Control problem.
We will also study costs of the following form:
J(ν_0, Π) = (1/T) E^Π_{ν_0}[∑_{t=0}^{T−1} c(x_t, u_t)].
Such a problem is known as the Average Cost Optimal Control Problem.
Finally, we will consider costs of the following form:
J(ν_0, Π) = E^Π_{ν_0}[∑_{t=0}^{T−1} β^t c(x_t, u_t)],
for some β ∈ (0, 1). This is called a Discounted Optimal Control Problem.
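The three criteria can be compared numerically by simulation; the Monte Carlo sketch below uses a made-up 2-state, 2-action model (the kernel, stage cost, policy, and parameter values are our own illustrative choices, not from the notes):

```python
import random

# Monte Carlo estimates of the finite-horizon, average, and discounted
# costs of a fixed deterministic stationary policy, for a toy model.
random.seed(0)

def transition(x, u):
    # P(x_{t+1} = 1 | x, u): the bad state is stickier under action 0
    p1 = 0.8 if u == 0 else 0.3
    return 1 if random.random() < p1 else 0

def c(x, u):
    # stage cost in [0, 1]: state 1 is costly, action 1 costs a little extra
    return 0.1 * u if x == 0 else 1.0

policy = lambda x: 1 if x == 1 else 0   # a deterministic stationary policy
T, beta, runs = 50, 0.9, 2000

total = disc = 0.0
for _ in range(runs):
    x, J, Jd = 0, 0.0, 0.0
    for t in range(T):
        u = policy(x)
        J += c(x, u)
        Jd += beta ** t * c(x, u)
        x = transition(x, u)
    total += J
    disc += Jd
finite_horizon = total / runs
average_cost = finite_horizon / T
discounted = disc / runs
print(finite_horizon, average_cost, discounted)
```

Since the stage cost lies in [0, 1], the average cost is bounded by 1 and the discounted cost by 1/(1 − β), which is a useful sanity check on the estimates.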
Let Π_A denote the class of admissible policies, Π_M the class of Markov policies, and Π_S the class of stationary policies. These policies can be either randomized or deterministic. Since the sets of policies are progressively shrinking,
Π_S ⊂ Π_M ⊂ Π_A,
in a general setting we have the following:
inf_{Π∈Π_A} J(ν_0, Π) ≤ inf_{Π∈Π_M} J(ν_0, Π) ≤ inf_{Π∈Π_S} J(ν_0, Π).
We will show, however, that for the optimal control of a Markov chain, under mild conditions, Markov policies are always optimal (that is, there is no loss in optimality in restricting the policies to be Markov); that is, it is sufficient to consider Markov policies. This is an important result in stochastic control. That is,
inf_{Π∈Π_A} J(ν_0, Π) = inf_{Π∈Π_M} J(ν_0, Π).
We will also show that, under somewhat more restrictive conditions, stationary policies are optimal (that is, there is no loss in optimality in restricting the policies to be stationary). This will typically require T → ∞:
inf_{Π∈Π_A} J(ν_0, Π) = inf_{Π∈Π_S} J(ν_0, Π).
Furthermore, we will show that, under some conditions,
inf_{Π∈Π_S} J(ν_0, Π)
is independent of the initial distribution (or initial condition) on x_0.
The last two results are computationally very important, as there are powerful computational algorithms that allow one to obtain such stationary policies.
In the following set of notes, we will first consider further properties of Markov chains, since under a Markov control policy, the controlled state process becomes a Markov chain. We will then return to controlled Markov chains and the development of optimal control policies.
The classification of Markov chains in the next topic will implicitly characterize the set of problems for which stationary policies contain optimal admissible policies.
2.3.1 Partially Observed Model
Consider the following model:
x_{t+1} = f(x_t, u_t, w_t), y_t = g(x_t, v_t).
Here, as before, x_t is the state, u_t ∈ U is the control, (w_t, v_t) ∈ W × V are second order, zero-mean, i.i.d. noise processes, and w_t is independent of v_t. In addition to the previous fully observed model, y_t denotes an observation variable taking values in Y, a subset of R^n in the context of this review. The controller only has causal access to the observation component {y_t} of the process. An admissible policy {Π_t} is measurable with respect to σ({y_s, s ≤ t}). We denote the observed history space as H_0 := Y, H_t = H_{t−1} × Y × U. Hence, the set of (wide-sense) causal control policies are such that P(u(h_t) ∈ U | h_t) = 1, ∀h_t ∈ H_t.
One could transform a partially observed Markov Decision Problem into a Fully Observed Markov Decision Problem via an enlargement of the state space. In particular, we obtain via the properties of total probability the following dynamical recursion:
π_t(A) := P(x_t ∈ A | y_[0,t], u_[0,t−1])
= [∫_X π_{t−1}(dx_{t−1}) ∫_A r(y_t | x_t) P(dx_t | x_{t−1}, u_{t−1})] / [∫_X π_{t−1}(dx_{t−1}) ∫_X r(y_t | x_t) P(dx_t | x_{t−1}, u_{t−1})],
where we assume that ∫_B r(y|x)dy = P(y_t ∈ B | x_t = x) for any B ∈ B(Y), with r denoting the observation density. The conditional measure process becomes a controlled Markov chain in P(X), which we endow with the weak convergence topology.
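For a finite state space the recursion above reduces to a finite sum followed by a normalization; a minimal sketch (the two-state kernels P, observation probabilities r, and the action/observation sequence below are illustrative assumptions of ours):

```python
# A discrete-state instance of the filtering recursion:
# pi_t(x) is proportional to r(y_t | x) * sum_{x'} pi_{t-1}(x') P(x | x', u_{t-1}).
P = {0: [[0.9, 0.1], [0.2, 0.8]],   # P[u][x_prev][x_next], one matrix per action
     1: [[0.5, 0.5], [0.5, 0.5]]}
r = [[0.7, 0.3],                    # r[x][y]: observation probabilities
     [0.2, 0.8]]

def filter_step(pi_prev, u_prev, y):
    # unnormalized update (the numerator), then divide by the normalizer
    unnorm = [sum(pi_prev[xp] * P[u_prev][xp][x] for xp in range(2)) * r[x][y]
              for x in range(2)]
    Z = sum(unnorm)   # the denominator of the recursion
    return [w / Z for w in unnorm]

pi = [0.5, 0.5]       # prior on x_0
for u, y in [(0, 1), (0, 0), (1, 1)]:
    pi = filter_step(pi, u, y)
print(pi)             # posterior P(x_3 = . | observations, actions)
```

The denominator Z is exactly the conditional likelihood of the new observation, so each step keeps π_t a probability vector, mirroring the normalization in the displayed recursion.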
Theorem 2.3.1 The process {π_t, u_t} is a controlled Markov chain. That is, under any admissible control policy, given the action at time t ≥ 0 and π_t, π_{t+1} is conditionally independent of {π_s, u_s, s ≤ t − 1}.
Let the cost function to be minimized be
E^Π_{x_0}[∑_{t=0}^{T−1} c(x_t, u_t)],
where E^Π_{x_0}[·] denotes the expectation over all sample paths with initial state given by x_0 under policy Π.
We transform the system into a fully observed Markov model as follows. Define the new cost as
˜c(π, u) = ∫_X c(x, u)π(dx), π ∈ P(X).
The stochastic transition kernel q is given by:
q(dx, dy | π, u) = ∫_X P(dx, dy | x′, u)π(dx′), π ∈ P(X),
and this kernel can be decomposed as q(dx, dy | π, u) = P(dy | π, u)P(dx | π, u, y).
The second term here is the filtering equation, mapping (π, u, y) ∈ (P(X) × U × Y) to P(X). It follows that (P(X), U, K, ˜c) defines a completely observable controlled Markov process. Here, we have
K(B | π, u) = ∫_Y 1_{(P(·|π,u,y)∈B)} P(dy | π, u), ∀B ∈ B(P(X)),
with 1_{(·)} denoting the indicator function.
2.4 Exercises
Exercise 2.4.1 Suppose that there are two decision makers DM1 and DM2. Suppose that the information available to DM1 is a random variable Y_1 and the information available to DM2 is Y_2, where these random variables are measurable on a probability space (Ω, F, P).
Suppose that the sigma-field generated by Y_1 is a subset of the sigma-field generated by Y_2, that is, σ(Y_1) ⊂ σ(Y_2). Further, suppose that the decision makers wish to minimize the following cost function:
E[c(ω, u_i)],
where c : Ω × U → R_+ is a measurable cost function, with u_i ∈ U, a Borel space. Here, for i = 1, 2, u_i = γ_i(y_i) is generated by a measurable function γ_i on the sigma-field generated by the random variable Y_i. Let Γ_i denote the space of all such measurable policies.
Prove that
inf_{γ_1∈Γ_1} E[c(ω, U_1)] ≥ inf_{γ_2∈Γ_2} E[c(ω, U_2)].
Exercise 2.4.2 Consider a line of customers at a service station (such as at an airport, a grocery store, or a communication network where customers are packets). Let L_t be the length of the line, that is, the total number of customers waiting in the line.
Let there be M_t servers serving the customers at time t, and let there be a manager (controller) who decides on the number of servers to be present. Let each of the servers be able to serve N customers per time-stage. The dynamics of the line can be expressed as follows:
L_{t+1} = L_t + A_t − M_t N 1_{(L_t ≥ NM_t)},
where 1_{(E)} is the indicator function for event E, i.e., it is equal to zero if E does not occur and is equal to 1 otherwise. In the equation above, A_t is the number of customers that have just arrived at time t. We assume {A_t} to be an independent process with a Poisson distribution with mean λ, that is,
P(A_t = k) = (λ^k e^{−λ}) / k!, k ∈ {0, 1, 2, . . .}.
The manager only has access to the information vector
I_t = {L_0, L_1, . . . , L_t; A_0, A_1, . . . , A_{t−1}}
while implementing his policies. A consultant proposes a number of possible policies to be adopted by the manager:
According to Policy A, the number of servers is given by
M_t = (L_t + L_{t−1} + L_{t−3}) / 2.
According to Policy B, the number of servers is
M_t = (L_{t+1} + L_t + L_{t−1}) / 2.
According to Policy C,
M_t = ⌈λ + 0.1⌉ 1_{(t≥10)}.
Finally, according to Policy D, the update is
M_t = ⌈λ + 0.1⌉.
a) Which of these policies are admissible, that is, measurable with respect to the σ−field generated by I_t? Which are Markov, or stationary?
b) Consider the model above, and suppose policy D is adopted by the manager. Consider the Markov chain induced by the policy, that is, a Markov chain with the following dynamics:
L_{t+1} = L_t + A_t − ⌈λ + 0.1⌉ N 1_{(L_t ≥ ⌈λ+0.1⌉N)},
with A_t as described in the earlier part.
Is this Markov chain irreducible?
Is there an absorbing set in this Markov chain? If so, what is it?
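One can experiment with the induced dynamics numerically before answering; the sketch below (with illustrative values of λ and N, and Knuth's classical sampler for Poisson arrivals) is only a simulation aid, not a solution to the exercise:

```python
import math
import random

# Simulate the queue under Policy D with Poisson(lambda) arrivals.
# lam, N, and the horizon T are illustrative choices.
random.seed(2)
lam, N, T = 3.0, 2, 10000

def poisson(lam):
    # Knuth's method for sampling a Poisson(lam) random variable
    L, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= random.random()
        if p <= L:
            return k
        k += 1

M = math.ceil(lam + 0.1)   # Policy D: a constant number of servers
L_t = 0
max_len = 0
for t in range(T):
    A = poisson(lam)
    # service of M*N customers occurs only when the line is long enough
    L_t = L_t + A - M * N * (1 if L_t >= N * M else 0)
    max_len = max(max_len, L_t)
print(M, max_len)
```

Tracking quantities such as the maximum line length over a long run gives a feel for which sets of states are revisited, which is the question parts b) asks about.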
Chapter 3
Classification of Markov Chains
3.1 Countable State Space Markov Chains
In this chapter, we first review Markov chains where the state takes values in a finite set or a countable space.
A Markov chain {x_t} on a countable state space is a collection of random variables {x_t, t ∈ Z_+}, where each of the variables takes values in a countable set X. As such, we will call the collection of such variables a chain Φ which takes values in the (one-sided or two-sided) infinite-product space X^∞. The Borel σ−field generated on the infinite space is the one generated by the finite dimensional sets of the form
{x ∈ X^∞ : x_[m,n] = {a_m, a_{m+1}, . . . , a_n}, a_k ∈ X, m ≤ k ≤ n, m, n ∈ Z_+}.
We assume that ν_0 is an initial distribution for the Markov chain, that is, the distribution of x_0.
As such, the process Φ = {x_0, x_1, . . . , x_n, . . .} is a (time-homogeneous) Markov chain and satisfies:
P_{ν_0}(x_0 = a_0, x_1 = a_1, x_2 = a_2, . . . , x_n = a_n)
= ν_0(x_0 = a_0)P(x_1 = a_1 | x_0 = a_0)P(x_2 = a_2 | x_1 = a_1) · · · P(x_n = a_n | x_{n−1} = a_{n−1}). (3.1)
If the initial condition is known to be x_0, we use P_{x_0}(···) in place of P_{ν_0}(···).
We could also write the evolution in terms of a matrix:
P(i, j) = P(x_{t+1} = j | x_t = i) ≥ 0, ∀i, j ∈ X.
Here P(·, ·) is also a probability transition kernel, that is, for every i ∈ X, P(i, ·) is a probability measure on X. Thus, ∑_j P(i, j) = 1.
By induction, we can verify that
P^k(i, j) := P(x_{t+k} = j | x_t = i) = ∑_{m∈X} P(i, m)P^{k−1}(m, j).
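The k-step recursion is simply matrix multiplication, P^k = P · P^{k−1}; a quick check on an illustrative 2-state matrix (the entries are our own example):

```python
# Compute multi-step transition probabilities by the recursion
# P^k(i, j) = sum_m P(i, m) P^{k-1}(m, j), i.e., matrix multiplication.
P = [[0.9, 0.1],
     [0.4, 0.6]]

def matmul(A, B):
    n = len(A)
    return [[sum(A[i][m] * B[m][j] for m in range(n)) for j in range(n)]
            for i in range(n)]

P2 = matmul(P, P)        # two-step transition probabilities
P3 = matmul(P, P2)       # three-step, via P . P^2
P3_alt = matmul(P2, P)   # the same matrix, via P^2 . P
print(P2)
```

That P · P² and P² · P agree is the Chapman-Kolmogorov consistency of the k-step kernels, and each row of P^k remains a probability distribution.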
In the following, we characterize Markov chains based on transience, recurrence, and communication. We then consider the issue of the existence of an invariant distribution. Later, we will extend the analysis to uncountable state space Markov chains.
Let us consider a discrete-time, time-homogeneous Markov process living in a countable state space X.
Communication
If there exists an integer k such that P(x_{t+k} = j | x_t = i) = P^k(i, j) > 0, and another integer l such that P(x_{t+l} = i | x_t = j) = P^l(j, i) > 0, then state i communicates with state j.
A set C ⊂ X is said to be communicating if every two elements (states) of C communicate with each other. If every state of X communicates with every other state, the chain is said to be irreducible.
The period of a state i is defined to be the greatest common divisor of {k > 0 : P^k(i, i) > 0}. A Markov chain is called aperiodic if the period of every state is 1.
In the following, throughout the notes, we will assume that the Markov process under consideration is aperiodic unless mentioned otherwise.
Absorbing Set
A set C is called absorbing if P(i, C) = 1 for all i ∈ C. That is, if the state is in C, then the state cannot leave the set C.
A Markov chain on a finite space X is irreducible if the smallest absorbing set is the entire space X itself.
A Markov chain in X is indecomposable if X does not contain two disjoint absorbing sets.
Occupation, Hitting and Stopping Times
For any set A ⊂ X, the occupation time η_A is the number of visits of Φ to the set A:
η_A = ∑_{t=1}^∞ 1_{(x_t∈A)},
where 1_{(E)} denotes the indicator function for an event E, that is, it takes the value 1 when E takes place, and is otherwise 0.
Define
τ_A = min{k > 0 : x_k ∈ A};
this is the first time that the state visits the set A, known as the hitting time.
We define also a return time:
τ_A = min{k > 0 : x_k ∈ A, x_0 ∈ A}.
A Brief Discussion on Stopping Times
The variable τ_A defined above is an example of a stopping time:
Definition 3.1.1 A function τ from a measurable space (X^∞, B(X^∞)) to (N_+, B(N_+)) is a stopping time if for all n ∈ N_+, the event {τ = n} ∈ σ(x_0, x_1, x_2, . . . , x_n), that is, the event is in the sigma-field generated by the random variables up to time n.
Any realistic decision takes place at a time which is measurable in this sense. For example, if an investor wants to stop investing when he stops profiting, he can stop at the time when the investment loses value; this is a stopping time. But if an investor claims to stop investing when the investment is at its peak, this is not a stopping time, because to find out whether the investment is at its peak, the next state value should be known, and this is not measurable in a causal fashion.
One important property of Markov chains is the so-called Strong Markov Property. This says the following: Consider a Markov chain which evolves on a countable set. If we sample this chain according to a stopping time rule, the chain starting from the sampled instant is again a Markov chain:
Proposition 3.1.1 For a (time-homogeneous) Markov chain in a countable state space X, the strong Markov property always holds. That is, the sampled chain itself is a Markov chain: if τ is a stopping time with P(τ < ∞) = 1, then for any m ∈ N_+:
P(x_{τ+m} = a | x_τ = b_0, x_{τ−1} = b_1, . . .) = P(x_{τ+m} = a | x_τ = b_0) = P^m(b_0, a),
for all a, b_0, b_1, · · ·.
3.1.1 Recurrence and Transience
Let us define
U(x, A) := E_x[∑_{t=1}^∞ 1_{(x_t∈A)}] = ∑_{t=1}^∞ P^t(x, A),
and let us define
L(x, A) := P_x(τ_A < ∞),
which is the probability of the chain visiting the set A once the process starts at state x.
These two are important characterizations for Markov chains.
These two are important characterizations for Markov chains.<br />
Definition 3.1.2 A set A ⊂ X is recurrent if the Markov chain visits A infinitely <strong>of</strong>ten (in expectation), when<br />
the process starts in A. This is equivalent to<br />
E x [η A ] = ∞, ∀x ∈ A (3.2)<br />
That is, if the chain starts at a given state x ∈ A, it comes back to the set A, <strong>and</strong> does so infinitely <strong>of</strong>ten. If a<br />
state is not recurrent, it is transient.<br />
Definition 3.1.3 A set A ⊂ X is positive recurrent if the Markov chain visits A infinitely <strong>of</strong>ten, when the<br />
process starts in A <strong>and</strong> in addition:<br />
E x [τ A ] < ∞, ∀x ∈ A<br />
Definition 3.1.4 A state α is transient if<br />
The above is equivalent to the condition that<br />
which in turn is implied by<br />
U(α, α) = E α [η α ] < ∞, (3.3)<br />
∞∑<br />
P i (α, α) < ∞,<br />
i=1<br />
P i (τ i < ∞) < 1.<br />
While discussing recurrence, there is an equivalent, <strong>and</strong> <strong>of</strong>ten easier to check condition: If E i [τ i ] < ∞, then the<br />
state {i} is positive recurrent. If L(i, i) = P i (τ i < ∞) = 1, but E i [τ i ] = ∞, then i is recurrent (also called, null<br />
recurrent).<br />
The reader should connect the above with the strong Markov property: once the process hits a state, it starts afresh from that state as if the past never happened; the process recurs.
We have the following; the proof is presented later in the chapter, see Theorem 3.3.1.
Theorem 3.1.1 If P_i(τ_i < ∞) = 1, then P_i(η_i = ∞) = 1.
There is a more general notion of recurrence, named Harris recurrence, which is exactly the condition that P_i(η_i = ∞) = 1. We will investigate this further below while studying uncountable state space Markov chains; however, one needs to note that even for countable state space chains Harris recurrence is stronger than recurrence.
For a finite state Markov chain it is not difficult to verify that (3.3) is equivalent to L(i, i) < 1, since P(τ_i(k) < ∞) = P(τ_i(k − 1) < ∞)P(τ_i(1) < ∞) and E[η] = ∑_k P(η ≥ k).
If every state in X is recurrent, then the chain is recurrent; if all are transient, then the chain is transient.
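The criterion (3.3) can be checked numerically on a toy chain with one transient and one absorbing state (the 2-state matrix below is an illustrative choice of ours):

```python
# State 0 stays with prob. 1/2 and is otherwise absorbed at state 1.
# Then U(0,0) = sum_{t>=1} P^t(0,0) = sum_{t>=1} (1/2)^t = 1 < infinity,
# so state 0 is transient; P^t(1,1) = 1 for all t, so state 1 is recurrent.
P = [[0.5, 0.5],
     [0.0, 1.0]]

def matpow_entry(P, k, i, j):
    # (P^k)(i, j) by repeated left-multiplication of the i-th unit row vector
    row = [1.0 if m == i else 0.0 for m in range(len(P))]
    for _ in range(k):
        row = [sum(row[m] * P[m][c] for m in range(len(P)))
               for c in range(len(P))]
    return row[j]

U00 = sum(matpow_entry(P, t, 0, 0) for t in range(1, 60))        # converges to 1
partial_U11 = sum(matpow_entry(P, t, 1, 1) for t in range(1, 60))  # grows like t
print(U00, partial_U11)
```

The truncated sum for state 0 is already essentially its finite limit, while the partial sums for state 1 grow without bound, matching the transient/recurrent dichotomy of Definition 3.1.4.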
3.2 Stability and Invariant Distributions
Stability is an important concept, but it has different meanings in different contexts. In a stochastic setting, stability means that the state does not grow large too often, and the average over time (temporal average) converges to a statistical average.
If a Markov chain starts at a given time, in the long run the chain may forget its initial condition; that is, the probability distribution at a later time will be less and less dependent on the initial distribution. Given an initial state distribution π_0, the probability distribution of the state at time 1 is given by
π_1 = π_0 P,
and for t > 1:
π_{t+1} = π_t P = π_0 P^{t+1}.
One important property of Markov chains is whether the above iteration leads to a fixed point in the set of probability measures. Such a fixed point π is called an invariant distribution. A distribution of a countable state Markov chain is invariant if
π = πP.
This is equivalent to
π(j) = ∑_{i∈X} π(i)P(i, j), ∀j ∈ X.
We note that, if such a π exists, it must be expressible as π = π_0 lim_{t→∞} P^t, for some π_0. Clearly, π_0 can be π itself, but often π_0 can be any initial distribution under irreducibility conditions which will be discussed further. Invariant distributions are especially important in networking problems and stochastic control, due to the Ergodic Theorem (which shows that temporal averages converge to statistical averages), which we will discuss later in the semester.
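For a finite chain the fixed point π = πP can be approximated by simply iterating the distribution recursion; the 2-state matrix below is an illustrative choice whose invariant distribution is (0.8, 0.2):

```python
# Approximate the invariant distribution pi = pi P by iterating pi_{t+1} = pi_t P.
P = [[0.9, 0.1],
     [0.4, 0.6]]

def step(p):
    # one iteration of the distribution recursion
    return [sum(p[i] * P[i][j] for i in range(2)) for j in range(2)]

pi = [1.0, 0.0]           # an arbitrary initial distribution
for _ in range(500):
    pi = step(pi)
print(pi)                  # approx [0.8, 0.2]
```

For this chain the second eigenvalue of P is 0.5, so the iteration converges geometrically and the limit is independent of the initial distribution, in line with the irreducibility discussion above.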
3.2.1 Invariant Measures via an Occupational Characterization
Theorem 3.2.1 For a Markov chain, if there exists an element i such that E_i[τ_i] < ∞, the following is an invariant measure:
µ(j) = E[∑_{k=0}^{τ_i−1} 1_{(x_k=j)} | x_0 = i] / E_i[τ_i], ∀j ∈ X.
The invariant measure is unique if the chain is irreducible.
Proof: We show that
E[∑_{k=0}^{τ_i−1} 1_{(x_k=j)} | x_0 = i] / E_i[τ_i] = ∑_s P(s, j) E[∑_{k=0}^{τ_i−1} 1_{(x_k=s)} | x_0 = i] / E_i[τ_i].
Note that E[1_{(x_{t+1}=j)}] = P(x_{t+1} = j). Hence,
∑_s P(s, j) E[∑_{k=0}^{τ_i−1} 1_{(x_k=s)} | x_0 = i] / E_i[τ_i]
= E[∑_{k=0}^{τ_i−1} ∑_s P(s, j) 1_{(x_k=s)} | x_0 = i] / E_i[τ_i]
= E[∑_{k=0}^{τ_i−1} ∑_s 1_{(x_k=s)} E[1_{(x_{k+1}=j)} | x_k = s, x_0 = i] | x_0 = i] / E_i[τ_i]
= E[∑_{k=0}^{τ_i−1} E[∑_s 1_{(x_k=s)} 1_{(x_{k+1}=j)} | x_k, x_0 = i] | x_0 = i] / E_i[τ_i]
= E[∑_{k=0}^{τ_i−1} ∑_s 1_{(x_k=s)} 1_{(x_{k+1}=j)} | x_0 = i] / E_i[τ_i] (3.4)
= E[∑_{k=0}^{τ_i−1} 1_{(x_{k+1}=j)} | x_0 = i] / E_i[τ_i] = E_i[∑_{k=1}^{τ_i} 1_{(x_k=j)}] / E_i[τ_i]
= E_i[∑_{k=0}^{τ_i−1} 1_{(x_k=j)}] / E_i[τ_i] = µ(j),
where we use the fact that the number of visits to a given state does not change whether we include or exclude time k = 0 or k = τ_i, since any state j ≠ i is not visited at these times (and x_0 = x_{τ_i} = i). Here, (3.4) follows from the law of iterated expectations, see Theorem 4.1.3. This concludes the proof. ⊓⊔
Remark: If E_i[τ_i] < ∞, then the above measure becomes a probability measure, since

∑_j μ(j) = ∑_j E[∑_{k=0}^{τ_i−1} 1_{x_k=j} | x_0 = i] / E_i[τ_i] = E_i[τ_i] / E_i[τ_i] = 1,

where we use ∑_j ∑_{k=0}^{τ_i−1} 1_{x_k=j} = τ_i. For example, for a (symmetric, one-dimensional) random walk, E_i[τ_i] = ∞, and hence there does not exist an invariant probability measure. But the walk has an invariant measure: μ(i) = K for a constant K. The reader can verify this. ⊓⊔
Theorem 3.2.2 For an irreducible Markov chain, positive recurrence implies the existence of an invariant distribution with π(i) = 1/E_i[τ_i]. Furthermore, the invariant distribution is unique.

Proof: From Theorem 3.2.1, it follows that π(i) = 1/E_i[τ_i] holds for an invariant measure; finiteness of E_i[τ_i] is precisely positive recurrence (for an irreducible finite state Markov chain this holds automatically). We will prove the uniqueness result for the case where P(i, j) > ε for some ε > 0, for all i, j ∈ X.
First, we will show that the support of an invariant distribution has to be the entire state space; that is, the invariant measure must be non-zero on every state. Suppose i is such that π(i) = 0 for an invariant distribution π. Then, for X \ {i},

π(X \ {i}) = ∑_{j∈X\{i}} π(j)P(j, X \ {i}) + 0 · P(i, X \ {i}) ≤ (∑_{j∈X\{i}} π(j)) max_{j∈X\{i}} P(j, X \ {i}) < π(X \ {i})(1 − ε),

since P(j, i) > ε for all j ∈ X, so that P(j, X \ {i}) < 1 − ε. As π(X \ {i}) = 1 when π(i) = 0, this is a contradiction.

We now show that there cannot be two different invariant distributions. Let π and π′ be two different invariant distributions. By the previous argument, they have the same support set (all of X). It must be that for some linear functions F, G:

Fπ(i) = G[π(j), j ∈ X \ {i}]
and

Fπ′(i) = G[π′(j), j ∈ X \ {i}].

Let us define J := {i : π(i) ≥ π′(i)}. Then, for some linear functions with positive coefficients,

F_J[π(j), j ∈ J] = G_{−J}[π(j), j ∈ X \ J],
F_J[π′(j), j ∈ J] = G_{−J}[π′(j), j ∈ X \ J],

and by linearity, the same holds for the difference:

F_J[π(j) − π′(j), j ∈ J] = G_{−J}[π(j) − π′(j), j ∈ X \ J].

But the entries of the vector on the left hand side are all non-negative, while those on the right hand side are negative: since each of π and π′ sums to 1, if the differences on the left were not all zero, the differences on the right would have to be strictly negative. Since the matrices multiplying the vectors have all-positive entries (this is where irreducibility is used; we skip the proof that the matrix elements are all positive, which follows from the equation π = πP and the assumption that all entries of P are bounded below), this is a contradiction. As such, the two distributions must be identical.
⊓⊔
Invariant Distribution via Dobrushin's Ergodic Coefficient

One interesting result we have observed is the following: for every positive recurrent state i,

lim_{n→∞} P^n(i, i) = 1/E_i[τ_i].

Consider the iteration π_{t+1} = π_t P. We would like to know when this iteration converges to a limit. We first review some notions from analysis.
Review of Vector Spaces

Definition 3.2.1 A linear space is a space which is closed under addition and multiplication by a scalar.

Definition 3.2.2 A normed linear space X is a linear vector space on which a functional (a mapping from X to R, that is, a member of R^X) called a norm is defined such that:

• ||x|| ≥ 0 for all x ∈ X, and ||x|| = 0 if and only if x is the null element of X;
• ||x + y|| ≤ ||x|| + ||y||, for all x, y ∈ X;
• ||αx|| = |α| ||x||, for all α ∈ R and all x ∈ X.

Definition 3.2.3 In a normed linear space X, an infinite sequence of elements {x_n} converges to an element x if the sequence {||x_n − x||} converges to zero.

Definition 3.2.4 A metric defined on a set X is a function d : X × X → R such that:

• d(x, y) ≥ 0 for all x, y ∈ X, and d(x, y) = 0 if and only if x = y;
• d(x, y) = d(y, x) for all x, y ∈ X;
• d(x, y) ≤ d(x, z) + d(z, y) for all x, y, z ∈ X.

Definition 3.2.5 A metric space (X, d) is a set X equipped with a metric d.

A normed linear space is also a metric space, with metric d(x, y) = ||x − y||.
Definition 3.2.6 A sequence {x_n} in a normed space X is Cauchy if for every ε > 0, there exists an N such that ||x_n − x_m|| ≤ ε for all n, m ≥ N.

The important observation on Cauchy sequences is that every convergent sequence is Cauchy; however, not every Cauchy sequence is convergent, because the limit might not live in the space where the sequence itself lives. This brings up the issue of completeness:

Definition 3.2.7 A normed linear space X is complete if every Cauchy sequence in X has a limit in X. A complete normed linear space is called a Banach space.

A map T from a complete normed linear space X to itself is called a contraction if for some 0 ≤ ρ < 1,

||T(x) − T(y)|| ≤ ρ||x − y||, ∀x, y ∈ X.

Theorem 3.2.3 A contraction map on a Banach space has a unique fixed point.

Proof: {T^n(x)} forms a Cauchy sequence and, by completeness, has a limit x*; since a contraction is continuous, T(x*) = lim_n T^{n+1}(x) = x*. Uniqueness follows since two fixed points x*, y* would satisfy ||x* − y*|| = ||T(x*) − T(y*)|| ≤ ρ||x* − y*||, forcing x* = y*.
⊓⊔
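As a toy numerical example (not from the notes), T(x) = x/2 + 1 on R is a contraction with modulus ρ = 1/2; iterating it from any starting point converges to the unique fixed point x = 2:

```python
def T(x):
    """A toy contraction on R with modulus rho = 1/2: |T(x) - T(y)| = |x - y|/2."""
    return 0.5 * x + 1.0

x = 100.0                # any starting point works
for _ in range(60):      # the iterates T^n(x) form a Cauchy sequence
    x = T(x)
# The unique fixed point solves x = x/2 + 1, i.e. x = 2.
```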
Contraction Mapping via Dobrushin's Ergodic Coefficient

Consider a countable state Markov chain. Define

δ(P) = min_{i,k} ∑_j min(P(i, j), P(k, j)).

Observe that for two scalars a, b,

|a − b| = a + b − 2 min(a, b).

Let us define for a vector v the l_1 norm

||v||_1 = ∑_i |v_i|.

It is known that the set of all countable index real-valued vectors (that is, functions which map Z → R) with a finite l_1 norm,

{v : ||v||_1 < ∞},

is a complete normed linear space, and as such is a Banach space (every Cauchy sequence has a limit in the space).

With these observations, we state the following:
Theorem 3.2.4 (Dobrushin) For any probability measures π, π′, it follows that

||πP − π′P||_1 ≤ (1 − δ(P)) ||π − π′||_1.

Proof: Let ψ(i) = π(i) − min(π(i), π′(i)) for all i, and let ψ′(i) = π′(i) − min(π(i), π′(i)). Then

||π − π′||_1 = ||ψ − ψ′||_1 = 2||ψ||_1 = 2||ψ′||_1,

since the sum ∑_i (π(i) − π′(i)) = ∑_i (ψ(i) − ψ′(i)) is zero, and

∑_i |π(i) − π′(i)| = ||ψ||_1 + ||ψ′||_1.
Now, noting that ψ and ψ′ are componentwise non-negative with ||ψ||_1 = ||ψ′||_1,

||πP − π′P||_1 = ||ψP − ψ′P||_1
= ∑_j |∑_i ψ(i)P(i, j) − ∑_k ψ′(k)P(k, j)|
= (1/||ψ′||_1) ∑_j |∑_k ∑_i ψ(i)ψ′(k)P(i, j) − ∑_i ∑_k ψ(i)ψ′(k)P(k, j)|   (3.5)
≤ (1/||ψ′||_1) ∑_j ∑_k ∑_i ψ(i)ψ′(k) |P(i, j) − P(k, j)|   (3.6)
= (1/||ψ′||_1) ∑_k ∑_i ψ(i)ψ′(k) ∑_j |P(i, j) − P(k, j)|
= (1/||ψ′||_1) ∑_k ∑_i ψ(i)ψ′(k) {∑_j P(i, j) + P(k, j) − 2 min(P(i, j), P(k, j))}   (3.7)
≤ (1/||ψ′||_1) ∑_k ∑_i ψ(i)ψ′(k) (2 − 2δ(P))   (3.8)
= ||ψ||_1 (2 − 2δ(P))   (3.9)
= ||π − π′||_1 (1 − δ(P)).   (3.10)

In the above, (3.5) follows from multiplying and dividing by ||ψ′||_1 = ∑_k ψ′(k), (3.6) from taking the absolute value inside the sums, (3.7) from the relation |a − b| = a + b − 2 min(a, b), (3.8) from the definition of δ(P), and finally (3.9) from the l_1 norms of ψ, ψ′.

As such, the map π ↦ πP is a contraction if δ(P) > 0. In essence, one then proves that the sequence {π_0 P^m} is Cauchy; as every Cauchy sequence in a Banach space has a limit, this sequence has a limit, and the limit is the invariant distribution. The contraction property also guarantees that the limit is unique. ⊓⊔
It should also be noted that Dobrushin's theorem tells us how fast the sequence of probability distributions {π_0 P^n} converges to the invariant distribution for an arbitrary π_0: the l_1 distance contracts by a factor of at least (1 − δ(P)) at every step, so convergence is geometric.
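The coefficient δ(P) and the contraction bound can be checked numerically; the kernel below is an illustrative example with all entries positive, so δ(P) > 0:

```python
# Illustrative kernel (not from the notes) with all entries positive.
P = [[0.6, 0.3, 0.1],
     [0.2, 0.5, 0.3],
     [0.3, 0.3, 0.4]]
n = len(P)

def dobrushin(P):
    """delta(P) = min_{i,k} sum_j min(P(i,j), P(k,j))."""
    return min(sum(min(P[i][j], P[k][j]) for j in range(n))
               for i in range(n) for k in range(n))

def apply_kernel(pi):
    """pi -> pi P."""
    return [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]

def l1(v):
    return sum(abs(c) for c in v)

delta = dobrushin(P)                       # here delta(P) = 0.6
pi, pi2 = [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]
lhs = l1([a - b for a, b in zip(apply_kernel(pi), apply_kernel(pi2))])
rhs = (1 - delta) * l1([a - b for a, b in zip(pi, pi2)])
# lhs <= rhs confirms ||pi P - pi' P||_1 <= (1 - delta(P)) ||pi - pi'||_1.
```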
Ergodic Theorem for Countable State Space Chains

For a Markov chain which has a unique invariant distribution μ, we have that, almost surely,

lim_{T→∞} (1/T) ∑_{t=1}^T f(x_t) = ∑_i f(i)μ(i), ∀ bounded f : X → R.

This is called the ergodic theorem, due to Birkhoff, and is a very powerful result: in essence, this property is what makes the connection between stochastic control and the long-horizon behaviour of Markov chains. In particular, for a stationary control policy leading to a unique invariant distribution μ(x, u) on state-action pairs, with bounded costs, it follows that, almost surely,

lim_{T→∞} (1/T) ∑_{t=1}^T c(x_t, u_t) = ∑_{x,u} c(x, u)μ(x, u),

for every real-valued bounded c. The ergodic theorem is what makes a dynamic optimization problem equivalent to a static optimization problem under mild technical conditions. This will set the core ideas in the convex analytic approach and the linear programming approach that will be discussed later.
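The equality of temporal and statistical averages can be seen in simulation. The two-state chain below is a hypothetical example (with P(0,0) = 0.7, P(1,1) = 0.8, whose invariant distribution is μ = (0.4, 0.6)) and f is an arbitrary bounded function:

```python
import random

random.seed(1)
# Illustrative two-state chain; move to state 0 with probability p0[x].
p0 = {0: 0.7, 1: 0.2}
f = {0: 5.0, 1: -1.0}      # arbitrary bounded function on the state space

x, total, T = 0, 0.0, 300_000
for _ in range(T):
    total += f[x]
    x = 0 if random.random() < p0[x] else 1

temporal_avg = total / T                      # (1/T) sum_t f(x_t)
statistical_avg = 0.4 * f[0] + 0.6 * f[1]     # sum_i f(i) mu(i) = 1.4
```

For long horizons the two averages agree up to Monte Carlo error, which is the content of the ergodic theorem.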
3.3 Uncountable (Complete, Separable, Metric) State Spaces

We now briefly extend the above definitions and discussions to an uncountable state space setting.

Let {x_t, t ∈ Z_+} be a Markov chain with a complete, separable, metric state space (X, B(X)), defined on a probability space (Ω, F, P), where B(X) denotes the Borel σ-field on X, Ω is the sample space, F a sigma field of subsets of Ω, and P a probability measure. Let P(x, D) := P(x_{t+1} ∈ D | x_t = x) denote the transition probability from x to D, that is, the probability of the event {x_{t+1} ∈ D} given that x_t = x.
3.3.1 Chapman-Kolmogorov Equations

Consider a Markov chain with transition probability P(x, D), that is,

P(x_{t+1} ∈ D | x_t = x) = P(x, D).

We can compute P(x_{t+k} ∈ D | x_t = x) inductively as follows:

P(x_{t+k} ∈ D | x_t = x) = ∫ ... ∫ P(x_t, dx_{t+1}) P(x_{t+1}, dx_{t+2}) ... P(x_{t+k−2}, dx_{t+k−1}) P(x_{t+k−1}, D).

As such, we have for all n ≥ 1,

P^n(x, A) = P(x_{t+n} ∈ A | x_t = x) = ∫_X P^{n−1}(x, dy) P(y, A).

The justification for the above construction will be discussed further with regard to conditional expectations in the next chapter.
Definition 3.3.1 A Markov chain is μ-irreducible if for any set B ∈ B(X) with μ(B) > 0 and any x ∈ X, there exists some integer n > 0, possibly depending on B and x, such that P^n(x, B) > 0, where P^n(x, B) is the transition probability in n stages, that is, P(x_{t+n} ∈ B | x_t = x).

One popular case uses the Lebesgue measure:

Definition 3.3.2 A real-valued Markov chain is Lebesgue-irreducible if for any non-empty open set B ⊂ R and any x ∈ R, there exists some integer n > 0, possibly depending on B and x, such that P^n(x, B) > 0.

Example: Linear system with a drift. Consider the linear system

x_{t+1} = a x_t + w_t.

This chain is Lebesgue-irreducible if w_t is a Gaussian random variable, since a Gaussian density is positive everywhere.

A ϕ-irreducible Markov chain is aperiodic if for any x ∈ X and any B ∈ B(X) satisfying ϕ(B) > 0, there exists n_0 = n_0(x, B) such that

P^n(x, B) > 0 for all n ≥ n_0.

See Meyn and Tweedie [23] for a detailed discussion on periodicity of Markov chains and its connection with small sets, to be discussed shortly.

The definitions for recurrence and transience follow those in the countable state space setting.
Definition 3.3.3 A Markov chain is called recurrent if it is μ-irreducible and

E_x[∑_{t=1}^∞ 1_{x_t∈A}] = ∑_{t=1}^∞ P^t(x, A) = ∞, ∀x ∈ X,

whenever μ(A) > 0 and A ∈ B(X).

While studying chains in a general (uncountable) state space, we use a stronger recurrence definition: a set A ∈ B(X) is Harris recurrent if the Markov chain visits A infinitely often with probability 1 when the process starts in A.

Definition 3.3.4 A set A ∈ B(X) is Harris recurrent if

P_x(η_A = ∞) = 1, ∀x ∈ A, (3.11)

where η_A = ∑_{t=1}^∞ 1_{x_t∈A} denotes the number of visits to A.

Theorem 3.3.1 Harris recurrence of a set A is equivalent to

P_x[τ_A < ∞] = 1, ∀x ∈ A.
Proof: We now prove one direction of this claim ([Meyn-Tweedie, Chapter 9]). Let τ_A(1) be the first time the state hits A, and τ_A(k) the time of the k-th visit. By the strong Markov property, the Markov chain sampled at the successive visit times τ_A(1), τ_A(2), and so on is also a Markov chain; let Q be the transition kernel of this sampled chain. Now, for x ∈ A, the probability of τ_A(2) < ∞ can be computed as

P_x(τ_A(2) < ∞) = ∫_A Q(x, dy) P_y(τ_A(1) < ∞) = ∫_A Q(x, dy) = 1, (3.12)

using the hypothesis P_y(τ_A(1) < ∞) = 1 for all y ∈ A. By induction, for every n ∈ Z_+,

P_x(τ_A(n+1) < ∞) = ∫_A Q(x, dy) P_y(τ_A(n) < ∞) = 1. (3.13)

Now,

P_x(η_A ≥ k) = P_x(τ_A(k) < ∞),

since visiting a set at least k times is the same as returning to it k times when the initial state x is in the set. As such,

P_x(η_A ≥ k) = 1, ∀k ∈ Z_+.

Define B_k = {ω ∈ Ω : η_A(ω) ≥ k}; then B_{k+1} ⊂ B_k, and by the continuity of probability, P_x(η_A = ∞) = P_x(∩_k B_k) = lim_{k→∞} P_x(B_k) = 1.

The proof of the other direction of the equivalence is left as an exercise to the reader.
⊓⊔
Definition 3.3.5 A Markov chain is Harris recurrent if the chain is μ-irreducible and every set A ∈ B(X) with μ(A) > 0 is Harris recurrent. If the chain admits an invariant probability measure, then the chain is called positive Harris recurrent.

Remark: Harris recurrence is stronger than recurrence: recurrence constrains an expectation (the expected number of visits), while Harris recurrence constrains a probability (the event of infinitely many visits). Consider the following example on the positive integers: let P(1, 1) = 1 and, for x ≥ 2, P(x, x+1) = 1 − 1/x² and P(x, 1) = 1/x². Then P_x(τ_1 = ∞) = ∏_{t≥x} (1 − 1/t²) > 0, so the chain is not Harris recurrent. It is ϕ-irreducible for the measure ϕ(A) = 1_{1∈A}, and the expected number of visits to 1 is infinite. Hence, this chain is recurrent but not Harris recurrent.
⊓⊔
If a set is not recurrent, it is transient. A set A is transient if

U(x, A) := E_x[η_A] < ∞, ∀x ∈ A. (3.14)

Note that this is equivalent to

∑_{i=1}^∞ P^i(x, A) < ∞, ∀x ∈ A.
3.3.2 Invariant Distributions for Uncountable Spaces

Definition 3.3.6 For a Markov chain with transition probability defined as before, a probability measure π is invariant on the Borel space (X, B(X)) if

π(D) = ∫_X P(x, D) π(dx), ∀D ∈ B(X).

A Harris recurrent chain always admits an invariant measure, but this measure might not be a probability measure.

Uncountable state space chains act like countable ones when there is an atom α ∈ X: a set on which the transition kernel does not depend on the starting point, so that the atom behaves like a single state of a countable chain, carrying a probability mass.

Definition 3.3.7 A set α is called an atom if there exists a probability measure ν such that

P(x, A) = ν(A), ∀x ∈ α, ∀A ∈ B(X).

If the chain is μ-irreducible and μ(α) > 0, then α is called an accessible atom.
In case there is an accessible atom α, we have the following:

Theorem 3.3.2 For a μ-irreducible Markov chain for which E_α[τ_α] < ∞, the following is the invariant probability measure:

π(A) = E_α[∑_{k=0}^{τ_α−1} 1_{x_k∈A}] / E_α[τ_α], ∀A ∈ B(X).
In case an atom is not present, one needs further machinery:

Definition 3.3.8 (Tweedie) A set A ⊂ X is n-small on (X, B(X)) if for some positive measure μ,

P^n(x, B) ≥ μ(B), ∀x ∈ A, ∀B ∈ B(X),

where B(X) denotes the (Borel) sigma-field on X.

Definition 3.3.9 (Meyn-Tweedie) A set A ⊂ X is μ-petite on (X, B(X)) if for some distribution T on N (the set of natural numbers) and some positive measure μ,

∑_{n=0}^∞ P^n(x, B) T(n) ≥ μ(B), ∀x ∈ A, ∀B ∈ B(X).
Remark 3.3.1 Small sets are very important in generalizing the results for countable state space Markov chains to more general spaces. Meyn and Tweedie present a discussion of periodicity based on the greatest common divisor of the set of possible return times to a given small set; such an approach lets one generalize the definitions of aperiodicity and periodicity.
Theorem 3.3.3 (Theorem 5.5.3 of [25], Lemma 17 of [28]) For an aperiodic and irreducible Markov chain {x_t}, every petite set is small.

A maximal irreducibility measure ψ is an irreducibility measure such that for every other irreducibility measure φ, we have ψ(B) = 0 ⇒ φ(B) = 0 for all B ∈ B(X). Define B^+(X) = {A ∈ B(X) : ψ(A) > 0}, where ψ is a maximal irreducibility measure.

Theorem 3.3.4 For an aperiodic and irreducible Markov chain, every petite set is petite with a maximal irreducibility measure as the minorizing measure and a sampling distribution T with finite mean.
Nummelin's Splitting Technique

The results on recurrence apply to uncountable state space chains with no atom, provided there is a small set or a petite set. Suppose a set A is 1-small with minorizing measure ν, and let δ = ν(X). Define a process z_t = (x_t, a_t), z_t ∈ X × {0, 1}; that is, we enlarge the probability space. When x_t ∉ A, (a_t, x_t) evolve independently from each other. When x_t ∈ A, we pick a Bernoulli random variable: with probability δ the state visits A × {1}, and with probability 1 − δ it visits A × {0}. From A × {1}, the transition for the next time stage is given by ν(dx_{t+1})/δ, and from A × {0} by

(P(dx_{t+1} | x_t) − ν(dx_{t+1})) / (1 − δ).

With this choice of δ = ν(X), the set A × {1} is an accessible atom (every point in it makes the same transition), and one can verify that the marginal distribution of the original Markov process {x_t} has not been altered.

The following can be established using the above construction:

Proposition 3.3.1 If

sup_{x∈A} E[min(t > 0 : x_t ∈ A) | x_0 = x] < ∞,

then

sup_{z∈A×{1}} E[min(t > 0 : z_t ∈ A × {1}) | z_0 = z] < ∞.
3.3.3 Existence of an Invariant Distribution

We state the final theorem on the invariant distributions of Markov chains:

Theorem 3.3.5 (Meyn-Tweedie) Consider a Markov process {x_t} taking values in X. If there is a Harris recurrent set A which is also a μ-petite set for some positive measure μ, and if the set satisfies

sup_{x∈A} E[min(t > 0 : x_t ∈ A) | x_0 = x] < ∞,

then the Markov chain is positive Harris recurrent and it admits a unique invariant distribution.

In this case, the invariant measure satisfies the following, which is a generalization of Kac's Lemma [14]:

Theorem 3.3.6 For a μ-irreducible Markov chain with a unique invariant probability measure π, the following holds for any C ∈ B(X) with π(C) > 0:

π(A) = ∫_C π(dx) E_x[∑_{k=0}^{τ_C−1} 1_{x_k∈A}], ∀A ∈ B(X), μ(A) > 0.
The above can also be extended to compute expected values of functions of the Markov state. The formula can be verified along the same lines as the countable state space case (see Theorem 3.2.1).

Example: Linear system with a drift. The linear system

x_{t+1} = a x_t + w_t,

with w_t an i.i.d. zero-mean Gaussian noise sequence, is positive Harris recurrent if |a| < 1, null recurrent if |a| = 1, and transient if |a| > 1. In the positive Harris recurrent case, every open set is visited with probability 1 infinitely often.
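The stable case |a| < 1 is easy to see in simulation (parameters below are illustrative): the invariant law is N(0, σ²/(1 − a²)), and empirical moments along a long trajectory match it, as the ergodic theorem predicts.

```python
import random

random.seed(2)
a, sigma = 0.5, 1.0        # illustrative parameters with |a| < 1
# For |a| < 1 and Gaussian w_t, the chain x_{t+1} = a x_t + w_t is
# positive Harris recurrent with invariant law N(0, sigma^2 / (1 - a^2)).
x, n = 0.0, 200_000
s1 = s2 = 0.0
for _ in range(n):
    x = a * x + random.gauss(0.0, sigma)
    s1 += x
    s2 += x * x

mean_hat = s1 / n                       # empirical mean along the path
var_hat = s2 / n - (s1 / n) ** 2        # empirical variance along the path
var_theory = sigma ** 2 / (1 - a ** 2)  # = 4/3 for these parameters
```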
3.3.4 Dobrushin's Ergodic Coefficient for Uncountable State Space Chains

We can extend Dobrushin's contraction result to the uncountable state space case. By the identity |a − b| = a + b − 2 min(a, b), Dobrushin's coefficient can also be written (for the countable state space case) as

δ(P) = 1 − (1/2) max_{i,k} ∑_j |P(i, j) − P(k, j)|.

In case P(x, dy) is the stochastic transition kernel of a real-valued Markov process whose transitions admit densities (that is, P(x, A) = ∫_A P(x, y) dy for a density P(x, ·)), the expression

δ(P) = 1 − (1/2) sup_{x,z} ∫ |P(x, y) − P(z, y)| dy

is Dobrushin's ergodic coefficient for R-valued Markov processes.

As such, if δ(P) > 0, then the iterations π_t(·) = ∫_R π_{t−1}(z) P(z, ·) dz converge to a unique fixed point.
3.3.5 Ergodic Theorem for Uncountable State Space Chains

Likewise, for the uncountable setting, if μ(·) is the invariant probability distribution of a positive Harris recurrent Markov chain, it follows that

lim_{T→∞} (1/T) ∑_{t=1}^T c(x_t) = ∫ c(x) μ(dx), ∀c ∈ L_1(μ) := {f : ∫ |f(x)| μ(dx) < ∞},

almost surely. We will prove this theorem after we discuss martingales.
3.4 Further Results on the Existence of an Invariant Probability Measure

3.4.1 Case with the Feller condition

A Markov chain is weak Feller if ∫_X P(dz|x) v(z) is continuous in x for every continuous and bounded v on X.

Theorem 3.4.1 Let {x_t} be a weak Feller Markov process living in a compact subset of a complete, separable metric space. Then {x_t} admits an invariant distribution.

Proof. The proof follows from the observation that the space of probability measures on such a compact set is tight, hence weakly sequentially compact. Consider the sequence of expected occupation measures μ_T = (1/T) ∑_{t=0}^{T−1} μ_0 P^t, for an arbitrary initial distribution μ_0. There exists a subsequence μ_{T_k} which converges weakly to some μ*. It follows that for every continuous and bounded function f, along this subsequence,

⟨μ_T, f⟩ := ∫ μ_T(dx) f(x) → ⟨μ*, f⟩.
Likewise,

⟨μ_T, Pf⟩ := ∫ μ_T(dx) (∫ P(dy|x) f(y)) → ⟨μ*, Pf⟩.

Now, by a telescoping cancellation,

(μ_T − μ_T P)(f) = (1/T) E_{μ_0}(∑_{k=0}^{T−1} P^k_x f − ∑_{k=0}^{T−1} P^{k+1}_x f) = (1/T) E_{μ_0}(f(x_0) − f(x_T)) → 0. (3.15)

Thus,

(μ_T − μ_T P)(f) = ⟨μ_T, f⟩ − ⟨μ_T, Pf⟩ → ⟨μ* − μ*P, f⟩ = 0,

so μ* is an invariant probability measure.
⊓⊔

Lasserre [19] gives the following example to emphasize the importance of the Feller property: consider a Markov chain evolving in [0, 1] given by P(x, {x/2}) = 1 for all x ≠ 0 and P(0, {1}) = 1. This chain does not admit an invariant probability measure; note that the kernel fails to be weak Feller at x = 0.

Remark 3.4.1 Lasserre presents a class of chains, quasi-Feller chains, which are weak Feller except on a closed set of points N such that P(x, N) = 0 for all x ∈ D = X \ N and P(x, N) < 1 for x ∈ N; on D, the chain is weak Feller.
3.4.2 Case without the Feller condition

One can also relax the weak Feller condition.

Theorem 3.4.2 [20] For a Markov chain satisfying a uniform countable additivity condition, as in (4.7) in the next chapter, there exists an invariant probability measure if and only if there exist a non-negative function g with

lim_{n→∞} inf_{x∉K_n} g(x) = ∞

and an initial probability measure ν_0 such that sup_t E_{ν_0}[g(x_t)] < ∞.

The proof of this result follows from an observation similar to (3.15), but with weak convergence replaced by setwise convergence:

Lemma 3.4.3 [7, Thm. 4.7.25] Let μ be a finite measure on a measurable space (T, A). Assume a set of probability measures Ψ ⊂ P(T) satisfies

P ≤ μ, for all P ∈ Ψ.

Then Ψ is setwise precompact.

A sequence of occupation measures under (4.7) has a subsequence which converges setwise, and the limit of this subsequence is invariant. As an example, consider a system of the form

x_{t+1} = f(x_t) + w_t, (3.16)

where w_t admits a distribution with a bounded density function which is positive everywhere. If the state process has a uniformly bounded second moment (so that g(x) = x² works in Theorem 3.4.2), then the system admits an invariant probability measure, which is unique.
3.5 Exercises

Exercise 3.5.1 Prove that if {x_t} is irreducible, then all states have the same period.

Exercise 3.5.2 Prove that

P_x(τ_A = 1) = P(x, A),

and for n > 1,

P_x(τ_A = n) = ∑_{i∉A} P(x, i) P_i(τ_A = n − 1).

Exercise 3.5.3 Show that irreducibility of a Markov chain in a finite state space implies that every non-empty set A and every x satisfy U(x, A) = ∞.

Exercise 3.5.4 Show that for an irreducible Markov chain, either the entire chain is transient, or it is recurrent.

Exercise 3.5.5 If P_x(τ_A < ∞) < 1 for some x ∈ A, show that A is transient.

Exercise 3.5.6 If P_x(τ_A < ∞) = 1 for all x ∈ A, show that A is Harris recurrent.

Exercise 3.5.7 Prove that the discrete-time random walk on Z^k defined by the transition kernel P(i, j) = (1/2k) 1_{|i−j|=1} is recurrent for k = 1 and k = 2.
Chapter 4

Martingales and Foster-Lyapunov Criteria for Stabilization of Markov Chains

4.1 Martingales

In this chapter, we first discuss a number of martingale theorems. Only a few of these will be essential for the course; some others are presented for the sake of completeness and will not be needed within our scope.

These results are very important for understanding the stabilization of controlled stochastic systems, and they will also pave the way to the optimization of dynamical systems. The second half of this chapter is on the stabilization of Markov chains.
4.1.1 More on Expectations and Conditional Probability

Let (Ω, F, P) be a probability space and let G be a sub-σ-field of F. Let X be an R-valued random variable measurable with respect to (Ω, F) with a finite absolute expectation, that is,

E[|X|] = ∫_Ω |X(ω)| P(dω) < ∞,

where ω ∈ Ω. We call such random variables integrable.

We say that Ξ is the conditional expectation random variable E[X|G] of X given G (Ξ is also called a version of the conditional expectation) if

i) Ξ is G-measurable;
ii) for every A ∈ G,

E[1_A Ξ] = E[1_A X],

where

E[1_A X] = ∫_Ω X(ω) 1_{ω∈A} P(dω).

For example, if the information that we know about a process is whether an event A ∈ F happened or not, then

X_A := E[X|A] = (1/P(A)) ∫_A P(dω) X(ω).

If the information we have is that A did not take place,

X_{A^C} := E[X|A^C] = (1/P(Ω \ A)) ∫_{Ω\A} P(dω) X(ω).
Thus, the conditional expectation given the sigma-field generated by A (which is F_A = {∅, Ω, A, Ω \ A}) is given by

E[X|F_A] = X_A 1_{ω∈A} + X_{A^C} 1_{ω∉A}.

We note that conditional probability can be written as

P(A|G) = E[1_A | G];

hence, conditional probability is a special case of conditional expectation.

Let {B_i} be a finite (or countable) partition of Ω which generates the σ-field G. The conditional expectation E[X|G] must take the same value at any two points in the same cell B_i, since otherwise E[X|G] would not be measurable on G. As such, for every cell B of the partition, we have

∫_B X(ω) P(dω) = ∫_B E[X|B] P(dω) = E[X|B] ∫_B P(dω) = P(B) E[X|B].

The notion of conditional expectation is key for the development of stochastic processes which evolve according to a transition kernel. It is also useful for optimal decision making when only partial information is available with regard to a random variable.
The following discussion is optional until the next subsection.<br />
Theorem 4.1.1 (Radon-Nikodym) Let µ and ν be two σ-finite positive measures on (Ω, F) such that ν(A) = 0 implies µ(A) = 0 (that is, µ is absolutely continuous with respect to ν). Then, there exists a measurable function f : Ω → R such that for every A ∈ F:<br />
µ(A) = ∫_A f(ω) ν(dω).<br />
The representation above is unique up to sets of measure zero. With the above discussion, the conditional expectation E[X|F′] exists for any sub-σ-field F′ ⊂ F.<br />
It is a useful exercise now to consider the σ-field generated by an observation variable, and what a conditional expectation means in this case.<br />
Theorem 4.1.2 Let X be an X-valued random variable, where X is a complete, separable, metric space, and let Y be another Y-valued random variable. Then, the random variable X is F_Y -measurable (where F_Y is the σ-field generated by Y) if and only if there exists a measurable function f : Y → X such that X(ω) = f(Y(ω)).<br />
Proof: We prove the more difficult direction. Suppose first that X takes values in a countable space {a_1, a_2, ..., a_n, ...}. Write A_n = X^{−1}(a_n). Since X is measurable on F_Y, A_n ∈ F_Y. Let us now make these sets disjoint: define C_1 = A_1 and C_n = A_n \ ∪_{i=1}^{n−1} A_i. These sets are disjoint, and they all live in F_Y. Thus, we can construct a measurable function defined by f(ω) = a_n if ω ∈ C_n.<br />
The issue is now what happens when X lives in an uncountable, complete, separable, metric space such as R. We can define countably many sets which cover the entire state space (such as intervals with rational endpoints, to cover R). Let us construct a sequence of partitions Γ_n = {B_{i,n}}, with ∪_i B_{i,n} = X, which get finer and finer (such as, for R, the intervals [k/n, (k + 1)/n), k ∈ Z). As above, we can define a measurable function f_n for every such partition. Furthermore, f_n(ω) converges pointwise to a function f(ω), and the pointwise limit of measurable functions is measurable.<br />
⋄<br />
With the above, the expectation E[X|Y = y_0] can be defined; this conditional expectation is a measurable function of Y.<br />
4.1.2 Some Properties of Conditional Expectation<br />
One very important property is given by the following.<br />
Iterated expectations:
Theorem 4.1.3 If H ⊂ G ⊂ F and X is F-measurable, then it follows that:<br />
E[E[X|G]|H] = E[X|H].<br />
Proof: The proof follows by taking a set A ∈ H, which is then also in G and F. Let η be the conditional expectation variable with respect to H. Then it follows that<br />
E[1_A η] = E[1_A X].<br />
Now let η′ = E[X|G]. Then, it must be that E[1_A η′] = E[1_A X] for all A ∈ G, and hence for all A ∈ H. Thus, the two conditional expectations agree.<br />
⋄<br />
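The iterated-expectations identity can be verified exactly on a small finite space. The following sketch (an arbitrary example with three fair coin flips, using exact rational arithmetic) compares E[E[X|G]|H] with E[X|H] for nested σ-fields generated by partitions:

```python
from itertools import product
from fractions import Fraction

# Omega = all outcomes of 3 fair coin flips; X = total number of heads.
omega = list(product([0, 1], repeat=3))
X = {w: sum(w) for w in omega}

def cond_exp_given(X, blocks):
    """E[X | sigma-field generated by the partition `blocks`] (uniform measure)."""
    Xi = {}
    for B in blocks:
        m = sum(X[w] for w in B) * Fraction(1, len(B))
        for w in B:
            Xi[w] = m
    return Xi

def blocks_by(key):
    """Partition of Omega into cells on which `key` is constant."""
    cells = {}
    for w in omega:
        cells.setdefault(key(w), []).append(w)
    return list(cells.values())

G = blocks_by(lambda w: w[:2])  # G: knows the first two flips
H = blocks_by(lambda w: w[:1])  # H: knows only the first flip, so H ⊂ G

inner = cond_exp_given(X, G)        # E[X|G]
lhs = cond_exp_given(inner, H)      # E[E[X|G]|H]
rhs = cond_exp_given(X, H)          # E[X|H]
assert lhs == rhs                   # tower property holds exactly
```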
4.1.3 Discrete-Time Martingales<br />
Let (Ω, F, P) be a probability space. An increasing family {F_n} of sub-σ-fields of F is called a filtration. A sequence of random variables {X_n} on (Ω, F, P) is said to be adapted to {F_n} if X_n is F_n-measurable, that is,<br />
X_n^{−1}(D) = {ω ∈ Ω : X_n(ω) ∈ D} ∈ F_n for all Borel D.<br />
This holds, for example, if F_n = σ(X_m, m ≤ n), n ≥ 0.<br />
Given a filtration {F_n} and a sequence of real random variables adapted to it, (X_n, F_n) is said to be a martingale if<br />
E[|X_n|] < ∞<br />
and<br />
E[X_{n+1}|F_n] = X_n.<br />
We will occasionally take the σ-fields to be F_n = σ(X_1, X_2, ..., X_n). For example, M_t = Σ_{k=0}^{t} ε_k, where {ε_k} is a zero-mean i.i.d. random sequence, is a martingale for every t ∈ N.<br />
A fair game is a Martingale when it is played up to a finite time.<br />
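The martingale property of the partial-sum example can be checked by exact enumeration. The sketch below (with ±1 increments, an arbitrary choice of zero-mean i.i.d. sequence) groups all sample paths by their prefix — the information in F_t — and verifies E[M_{t+1}|F_t] = M_t:

```python
from itertools import product
from fractions import Fraction

# M_t = sum_{k<=t} eps_k with eps_k = ±1 equally likely (zero-mean, i.i.d.).
T = 5
paths = list(product([-1, 1], repeat=T))  # each path has probability 2^-T

# Group paths by their prefix (the information in F_t = sigma(eps_1..eps_t))
# and check E[M_{t+1} | F_t] = M_t exactly on every prefix.
for t in range(T - 1):
    prefixes = {}
    for path in paths:
        prefixes.setdefault(path[:t + 1], []).append(path)
    for prefix, group in prefixes.items():
        m_t = sum(prefix)  # M_t is determined by the prefix
        m_next = sum(Fraction(sum(path[:t + 2])) for path in group) / len(group)
        assert m_next == m_t
```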
Let n > m ∈ Z_+. Then, since F_m ⊂ F_n, any A ∈ F_m is also in F_n. Thus, if X_n is a martingale sequence,<br />
E[1_A X_n] = E[1_A X_{n−1}] = · · · = E[1_A X_m].<br />
Thus, E[X_n|F_m] = X_m.<br />
If we have that<br />
E[X_n|F_m] ≥ X_m,<br />
then {X_n} is called a submartingale. And, if<br />
E[X_n|F_m] ≤ X_m,<br />
then {X_n} is called a supermartingale.<br />
A useful concept related to filtrations is that of a stopping time, which we discussed while studying Markov chains. A stopping time is a random time T whose occurrence is measurable with respect to the filtration, in the sense that for each n ∈ N, {T ≤ n} ∈ F_n.<br />
4.1.4 Doob’s Optional Sampling Theorem<br />
Theorem 4.1.4 Suppose (X_n, F_n) is a martingale sequence, and ρ, τ ≤ n are bounded stopping times with ρ ≤ τ. Then,<br />
E[X_τ|F_ρ] = X_ρ.<br />
Proof: We observe that<br />
E[X_τ − X_ρ|F_ρ] = E[ Σ_{k=ρ}^{τ−1} (X_{k+1} − X_k) | F_ρ ]<br />
= E[ Σ_{k=ρ}^{τ−1} E[X_{k+1} − X_k|F_k] | F_ρ ] = E[ Σ_{k=ρ}^{τ−1} 0 | F_ρ ] = 0. (4.1)<br />
⋄<br />
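The optional sampling conclusion E[X_τ] = E[X_0] can be checked exactly for a simple random walk and a bounded stopping time. In the sketch below (an arbitrary example), τ is the first time the walk hits level ±2, truncated at N, so {τ ≤ n} depends only on the first n steps:

```python
from itertools import product
from fractions import Fraction

# Simple random walk X_n (a martingale) and the bounded stopping time
# tau = min(first time |X_n| = 2, N); optional sampling gives E[X_tau] = E[X_0] = 0.
N = 6
paths = list(product([-1, 1], repeat=N))

def x_tau(path):
    x = 0
    for step in path:
        x += step
        if abs(x) == 2:      # {tau <= n} depends only on the first n steps
            return x
    return x                  # tau = N if level ±2 is never reached

expectation = sum(Fraction(x_tau(p)) for p in paths) / len(paths)
assert expectation == 0       # exact, by enumeration over all 2^N paths
```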
Theorem 4.1.5 Let {X_n} be a sequence of F_n-adapted integrable real random variables. Then, the following are equivalent: (i) (X_n, F_n) is a submartingale. (ii) If T, S are bounded stopping times with T ≥ S (almost surely), then E[X_S] ≤ E[X_T].<br />
Proof: Let S ≤ T ≤ n. Now, note that<br />
E[X_T − X_S|F_S] = Σ_{k=1}^{n} E[ 1_{(T≥k)} 1_{(S<k)} (X_k − X_{k−1}) | F_S ]<br />
= Σ_{k=1}^{n} E[ 1_{(T≥k)} 1_{(S<k)} E[X_k − X_{k−1}|F_{k−1}] | F_S ] ≥ 0, (4.2)<br />
where the second equality holds since {T ≥ k} ∩ {S < k} ∈ F_{k−1}, and the inequality follows from the submartingale property. Taking expectations yields (ii); the converse direction follows by taking S and T to be deterministic times.<br />
⋄<br />
4.1.5 Doob's Upcrossing Lemma<br />
For reals a < b, define the successive crossing times T_1 = min{n ≥ 0 : X_n ≤ a}, T_2 = min{n > T_1 : X_n ≥ b}, and inductively T_{2m+1} = min{n > T_{2m} : X_n ≤ a}, T_{2m+2} = min{n > T_{2m+1} : X_n ≥ b}, each truncated at N. Let ζ_N(a, b) denote the number of upcrossings of [a, b] completed by time N.<br />
Theorem 4.1.6 Let {X_n} be a supermartingale sequence. Then,<br />
E[ζ_N(a, b)] ≤ (E[|X_N|] + |a|) / (b − a).<br />
Proof: At time N there are three possibilities: the process may end below a, between a and b, or above b. Each time it rises above b after having been below a, an upcrossing is completed. In view of this, we may proceed as follows. Let β_N := max{m : T_{2m−1} ≤ N}; by the truncation, either T_{2β_N} = N or T_{2β_N−1} = N. Here, β_N is measurable on σ(X_1, X_2, ..., X_N).<br />
Then, by the supermartingale property and the optional sampling theorem,<br />
0 ≥ E[ Σ_{i=1}^{β_N} (X_{T_{2i}} − X_{T_{2i−1}}) ]<br />
= E[ ( Σ_{i=1}^{β_N} (X_{T_{2i}} − X_{T_{2i−1}}) ) 1_{{T_{2β_N}=N}} ] + E[ ( Σ_{i=1}^{β_N} (X_{T_{2i}} − X_{T_{2i−1}}) ) 1_{{T_{2β_N−1}=N}} ]<br />
= E[ ( Σ_{i=1}^{β_N−1} (X_{T_{2i}} − X_{T_{2i−1}}) + X_N − X_{T_{2β_N−1}} ) 1_{{T_{2β_N}=N}} ] + E[ ( Σ_{i=1}^{β_N−1} (X_{T_{2i}} − X_{T_{2i−1}}) ) 1_{{T_{2β_N−1}=N}} ]<br />
= E[ Σ_{i=1}^{β_N−1} (X_{T_{2i}} − X_{T_{2i−1}}) ] + E[ (X_N − X_{T_{2β_N−1}}) 1_{{T_{2β_N}=N}} ]. (4.3)<br />
Thus,<br />
E[ Σ_{i=1}^{β_N−1} (X_{T_{2i}} − X_{T_{2i−1}}) ] ≤ −E[ (X_N − X_{T_{2β_N−1}}) 1_{{T_{2β_N}=N}} ] ≤ E[(a − X_N)], (4.4)<br />
since X_{T_{2β_N−1}} ≤ a. On the other hand, each completed upcrossing satisfies X_{T_{2i}} − X_{T_{2i−1}} ≥ b − a, so that<br />
E[ Σ_{i=1}^{β_N−1} (X_{T_{2i}} − X_{T_{2i−1}}) ] ≥ E[β_N − 1](b − a).<br />
It follows that ζ_N(a, b) = β_N − 1 satisfies<br />
E[ζ_N(a, b)](b − a) ≤ E[(a − X_N)] ≤ |a| + E[|X_N|],<br />
and the result follows.<br />
⋄<br />
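The upcrossing bound can be checked exactly by enumeration for the simple random walk (a martingale, hence a supermartingale). In the sketch below, N, a, and b are arbitrary choices, and both sides of the inequality are computed in exact arithmetic:

```python
from itertools import product
from fractions import Fraction

# Exact check of E[zeta_N(a,b)] <= (E|X_N| + |a|)/(b-a) for the ±1 random walk.
N, a, b = 8, -1, 1

def upcrossings(path, a, b):
    """Completed upcrossings of [a, b]: moves from a level <= a up to >= b."""
    count, below, x = 0, False, 0
    values = [0]
    for step in path:
        x += step
        values.append(x)
    for v in values:
        if v <= a:
            below = True
        elif v >= b and below:
            count += 1
            below = False
    return count

paths = list(product([-1, 1], repeat=N))
n = Fraction(len(paths))
mean_zeta = sum(Fraction(upcrossings(p, a, b)) for p in paths) / n
mean_absXN = sum(Fraction(abs(sum(p))) for p in paths) / n
assert mean_zeta <= (mean_absXN + abs(a)) / Fraction(b - a)
```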
Recall that a sequence of random variables {X_n} defined on a probability space (Ω, F, P) converges to X almost surely (a.s.) if<br />
P(ω : lim_{n→∞} X_n(ω) = X(ω)) = 1.<br />
Theorem 4.1.7 Suppose X_n is a supermartingale and sup_{n≥0} E[|X_n|] < ∞. Then lim_{n→∞} X_n = X exists (almost surely) and E[|X|] < ∞. The same result applies to submartingales, by considering −X_n, which is a supermartingale, under the condition sup_{n≥0} E[max(0, X_n)] < ∞.<br />
Proof: The proof follows from Doob's upcrossing lemma. Suppose X_n does not converge almost surely. Then, with positive probability, limsup X_n ≠ liminf X_n; in particular, there exist reals b > a such that limsup X_n ≥ b and liminf X_n ≤ a with positive probability. By the upcrossing lemma, for every N,<br />
E[ζ_N(a, b)] ≤ (E[|X_N|] + |a|) / (b − a) < ∞.<br />
This is true for every N. Since ζ_N(a, b) is monotonically increasing in N, by the monotone convergence theorem it follows that<br />
lim_{N→∞} E[ζ_N(a, b)] = E[ lim_{N→∞} ζ_N(a, b) ] ≤ (sup_N E[|X_N|] + |a|) / (b − a) < ∞.<br />
Hence the total number of upcrossings of [a, b] is finite with probability one (it cannot be infinite with non-zero probability, as its expectation is bounded). Since this holds for every pair a < b, X_n must converge almost surely; otherwise, there would be infinitely many upcrossings of some interval.<br />
⋄<br />
Theorem 4.1.8 (Submartingale Convergence Theorem) Suppose X_n is a submartingale and sup_{n≥0} E[|X_n|] < ∞. Then X := lim_{n→∞} X_n exists (almost surely) and E[|X|] < ∞.<br />
Proof: Note that sup_{n≥0} E[|X_n|] < ∞ is a sufficient condition in Theorem 4.1.7 both for a submartingale and for a supermartingale. Hence X_n → X almost surely. For finiteness, suppose E[|X|] = ∞. Then liminf |X_n| = lim |X_n| = |X| = ∞ a.s., and by Fatou's lemma,<br />
limsup E[|X_n|] ≥ E[liminf |X_n|] = ∞.<br />
But this is a contradiction, as we assumed that sup_n E[|X_n|] < ∞.<br />
⋄<br />
4.1.6 Proof of the Birkhoff Individual Ergodic Theorem<br />
This will be discussed in class.<br />
4.1.7 This section is optional: Further Martingale Theorems<br />
This section is optional. If you wish not to read it, please proceed to the discussion on stabilization of Markov chains.<br />
Theorem 4.1.9 Let X_n be a martingale such that X_n converges to X in L_1, that is, E[|X_n − X|] → 0. Then,<br />
X_n = E[X|F_n], n ∈ N.<br />
We will use the following while studying the convex analytic method. Let us define uniform integrability.<br />
Definition 4.1.1 A sequence of random variables {X_n} is uniformly integrable if<br />
lim_{K→∞} sup_n E[ |X_n| 1_{{|X_n|≥K}} ] = 0.<br />
This implies that<br />
sup_n E[|X_n|] < ∞.<br />
Suppose that, for some ε > 0,<br />
sup_n E[|X_n|^{1+ε}] < ∞.<br />
This implies that the sequence is uniformly integrable, since<br />
sup_n E[ |X_n| 1_{{|X_n|≥K}} ] ≤ sup_n E[ (|X_n|/K)^ε |X_n| 1_{{|X_n|≥K}} ] ≤ (1/K^ε) sup_n E[|X_n|^{1+ε}] → 0 as K → ∞.<br />
The following is a very important result:<br />
Theorem 4.1.10 If X_n is a uniformly integrable martingale, then X = lim_{n→∞} X_n exists almost surely and in L_1, and X_n = E[X|F_n].<br />
Optional Sampling Theorem For Uniformly Integrable Martingales<br />
Theorem 4.1.11 Suppose (X_n, F_n) is a uniformly integrable martingale sequence, and ρ, τ are stopping times with ρ ≤ τ. Then,<br />
E[X_τ|F_ρ] = X_ρ.<br />
Proof: By uniform integrability and Theorem 4.1.10, {X_t} has an almost sure and L_1 limit; call it X_∞. It follows that E[X_∞|F_τ] = X_τ, and<br />
E[E[X_∞|F_τ]|F_ρ] = E[X_∞|F_ρ] = X_ρ,<br />
where the left-hand side also equals E[X_τ|F_ρ]. Hence E[X_τ|F_ρ] = X_ρ.<br />
⋄<br />
4.1.8 Azuma-Hoeffding Inequality for Martingales with Bounded Increments<br />
The following is an important concentration result:<br />
Theorem 4.1.12 Let X_t be a martingale sequence such that |X_t − X_{t−1}| ≤ c for every t, almost surely. Then for any x > 0,<br />
P( |X_t − X_0| / t ≥ x ) ≤ 2 e^{−tx²/(2c²)}.<br />
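The concentration bound can be compared against the exact tail of the ±1 random walk (bounded increments with c = 1). The sketch below, with arbitrary choices of t and x, computes the exact probability from binomial coefficients and checks it against the Azuma-Hoeffding bound:

```python
from math import exp, comb

# Exact tail P((X_t - X_0)/t >= x) for the ±1 random walk (c = 1),
# compared against the Azuma-Hoeffding bound 2*exp(-t*x**2/(2*c**2)).
t, x, c = 10, 0.6, 1.0

# S_t = 2k - t when k of the t steps are +1, so S_t >= t*x iff 2k - t >= t*x.
tail = sum(comb(t, k) for k in range(t + 1) if 2 * k - t >= t * x) / 2 ** t
bound = 2.0 * exp(-t * x ** 2 / (2.0 * c ** 2))
assert tail <= bound  # the (one-sided) tail is below the two-sided bound
```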
4.2 Stability of Markov Chains: Foster-Lyapunov Techniques<br />
A Markov chain’s stability can be characterized by drift conditions, as we discuss below in detail.<br />
4.2.1 Criterion for Positive Harris Recurrence<br />
Theorem 4.2.1 (Foster-Lyapunov for Positive Recurrence) [23] Let S be a petite set, b ∈ R, and V(.) be a mapping from X to R_+. Let {x_n} be an irreducible Markov chain on X. If the following is satisfied for all x ∈ X:<br />
E[V(x_{t+1})|x_t = x] = ∫_X P(x, dy)V(y) ≤ V(x) − 1 + b 1_{{x∈S}}, (4.5)<br />
then a unique invariant probability measure exists for the Markov chain.<br />
Proof: We will provide a proof assuming that S is such that sup_{x∈S} V(x) < ∞. (The proof is applicable without this assumption as well, with further details on replacing S by a petite set which satisfies it; that is, for every petite set we can find one on which V is bounded. See [23] for details.) Define M_0 := V(x_0) and, for t ≥ 1,<br />
M_t := V(x_t) − Σ_{i=0}^{t−1} (−1 + b 1_{(x_i∈S)}).<br />
It follows that<br />
E[M_{t+1}|x_s, s ≤ t] ≤ M_t, ∀t ≥ 0.<br />
Furthermore, it can be shown that E[|M_t|] < ∞ for all t. Thus, {M_t} is a supermartingale. Define a stopping time τ_N = min(N, τ), where τ = min{i > 0 : x_i ∈ S}. This stopping time is bounded. Hence, we have, by the martingale optional sampling theorem:<br />
E[M_{τ_N}] ≤ E[M_0].<br />
Hence, we obtain<br />
E_{x_0}[ Σ_{i=0}^{τ_N−1} 1 ] ≤ V(x_0) + b E_{x_0}[ Σ_{i=0}^{τ_N−1} 1_{(x_i∈S)} ].<br />
Thus, E_{x_0}[τ_N] ≤ V(x_0) + b (the sum of indicators contributes at most the single term at i = 0), and by the monotone convergence theorem,<br />
lim_{N→∞} E_{x_0}[τ_N] = E_{x_0}[τ] ≤ V(x_0) + b.<br />
Now, if we had that<br />
sup_{x∈S} V(x) < ∞,<br />
the proof would be complete in view of Theorem 3.3.5 and the fact that E_x[τ] < ∞ for any x, leading to the Harris recurrence of S.<br />
Typically, this condition is satisfied. However, when it is not easy to verify directly, we may need additional steps to construct a suitable petite set. In the following, we consider this. Following [23], Chapter 11, define for some l ∈ Z_+<br />
V_C(l) = {x : V(x) ≤ l}.<br />
We will show that B := V_C(l) is itself a petite set, which is recurrent and satisfies the uniform finite-mean-return property. Since S is petite with some sampling distribution a and measure ν, we have that<br />
K_a(x, B) ≥ 1_{{x∈S}} ν(B), x ∈ X,<br />
where K_a(x, B) = Σ_i a(i) P^i(x, B), and hence<br />
1_{{x∈S}} ≤ (1/ν(B)) K_a(x, B).<br />
Now,<br />
E_x[τ_B] ≤ V(x) + b E_x[ Σ_{k=0}^{τ_B−1} 1_{{x_k∈S}} ]<br />
≤ V(x) + b E_x[ Σ_{k=0}^{τ_B−1} (1/ν(B)) K_a(x_k, B) ]<br />
= V(x) + (b/ν(B)) E_x[ Σ_{k=0}^{τ_B−1} Σ_i a(i) P^i(x_k, B) ]<br />
= V(x) + (b/ν(B)) Σ_i a(i) E_x[ Σ_{k=0}^{τ_B−1} 1_{{x_{k+i}∈B}} ]<br />
≤ V(x) + (b/ν(B)) Σ_i a(i)(1 + i),<br />
where the last inequality follows since the process can be in B at most once between times 1 and τ_B − 1, so that x_{k+i} ∈ B for at most 1 + i indices k ∈ {0, ..., τ_B − 1}. Now, the sampling distribution a in the petiteness property can be adjusted so that Σ_i a(i) i < ∞ (by Proposition 5.5.6 of [23]), leading to the result that<br />
sup_{x∈B} E_x[τ_B] ≤ sup_{x∈B} V(x) + (b/ν(B)) Σ_i a(i)(1 + i) =: K < ∞,<br />
so that B can serve as the recurrent set. This set is furthermore recurrent since K < ∞, and as a result P_x(τ_B > n) ≤ K/n for any n, by Markov's inequality. Finally, since S is petite, so is B.<br />
⋄<br />
There are other versions of Foster-Lyapunov criteria.<br />
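Drift condition (4.5) can be verified in closed form for simple models. The sketch below uses an arbitrary example — a reflected random walk with negative-mean ±1 increments — where V, p, and S = {0} are choices made here, not part of the theorem; V is scaled so the drift off S is exactly −1:

```python
# Drift condition (4.5) for a reflected random walk x_{t+1} = max(x_t + w_t, 0),
# with w_t = +1 w.p. p and -1 w.p. 1-p, p = 0.3 (negative mean increments).
# With V(x) = x / 0.4 and S = {0}: E[V(x_{t+1}) | x_t = x] <= V(x) - 1 + b*1{x in S}.
p = 0.3

def V(x):
    return x / 0.4            # scaled so the drift off S equals -1 exactly

def drift(x):
    """E[V(x_{t+1}) | x_t = x] for the reflected walk."""
    up, down = max(x + 1, 0), max(x - 1, 0)
    return p * V(up) + (1 - p) * V(down)

b = drift(0) - (V(0) - 1)     # smallest b making the inequality hold on S = {0}
for x in range(0, 200):       # check (4.5) on a range of states
    indicator = 1 if x == 0 else 0
    assert drift(x) <= V(x) - 1 + b * indicator + 1e-9
```

Here b = 1.75, and the theorem then guarantees a unique invariant probability measure for this chain.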
4.2.2 Criterion for Finite Expectations<br />
Theorem 4.2.2 (Comparison Theorem) Let V(.) : X → R_+ and f, g : X → R_+ be such that E[f(x_t)] < ∞ and E[g(x_t)] < ∞ for each t. Let {x_n} be a Markov chain on X. If the following is satisfied:<br />
∫_X P(x, dy)V(y) ≤ V(x) − f(x) + g(x), ∀x ∈ X,<br />
then, for any stopping time τ:<br />
E[ Σ_{t=0}^{τ−1} f(x_t) ] ≤ V(x_0) + E[ Σ_{t=0}^{τ−1} g(x_t) ].<br />
The proof of the above follows from the construction of a supermartingale sequence and the monotone convergence theorem. The above also allows the computation of useful bounds. For example, if g(x) = b 1_{{x∈A}} and τ = τ_A is the return time to A, then one obtains that E[Σ_{t=0}^{τ−1} f(x_t)] ≤ V(x_0) + b. In view of the invariant distribution properties, if f(x) ≥ 1, this provides a bound on ∫ π(dx)f(x).<br />
Theorem 4.2.3 (Foster-Lyapunov Moment Bound) [23] Let S be a petite set, b ∈ R, V(.) : X → R_+, and f(.) : X → [1, ∞). Let {x_n} be a Markov chain on X. If the following is satisfied:<br />
∫_X P(x, dy)V(y) ≤ V(x) − f(x) + b 1_{{x∈S}}, ∀x ∈ X,<br />
then,<br />
lim_{T→∞} (1/T) Σ_{t=0}^{T−1} f(x_t) = lim_{t→∞} E[f(x_t)] = ∫ µ(dx)f(x) < ∞,<br />
almost surely, where µ(.) is the invariant probability measure on X.<br />
Recall that the invariant distribution π satisfies, for any D ∈ B(X) (see Theorem 3.3.6),<br />
π(D) = ∫_A π(dx) E_x[ Σ_{t=0}^{τ_A−1} 1_{{x_t∈D}} ].<br />
Thus, for a bounded measurable function f, by the property that simple functions are dense in the space of measurable, bounded functions under the supremum norm, it follows that<br />
∫ π(dx)f(x) = ∫_A π(dx) E_x[ Σ_{t=0}^{τ_A−1} f(x_t) ].<br />
Thus, if we can verify that E_x[Σ_{t=0}^{τ_A−1} f(x_t)] is uniformly bounded for all x ∈ A, we can obtain a bound on π(f) = ∫ f(x)π(dx).<br />
Another approach would be the following. By Theorem 4.2.2, taking the deterministic time T as the stopping time,<br />
limsup_T (1/T) E_{x_0}[ Σ_{k=0}^{T} f(x_k) ] ≤ limsup_T (1/T)(V(x_0) + bT) = b.<br />
It follows, by Fatou's lemma and the monotone convergence theorem, that<br />
limsup_T (1/T) E_π[ Σ_{k=0}^{T} f(x_k) ] ≤ ∫ π(dx) E_x[ limsup_T (1/T) Σ_{k=0}^{T} f(x_k) ] ≤ b.<br />
We have the following corollary to Theorem 4.2.3.<br />
Corollary 4.2.1 In particular, under the conditions of Theorem 4.2.3,<br />
∫ π(dx)f(x) = ∫_S π(dx) E_x[ Σ_{t=0}^{τ_S−1} f(x_t) ] ≤ ∫_S π(dx) (V(x) + b),<br />
which lets one obtain a bound on the expectation of f(x).<br />
See Chapter 14 of [23] for further details.<br />
We need to ensure, however, that there exists an invariant measure; this is why we require that f : X → [1, ∞), where 1 can be replaced with any positive number.<br />
4.2.3 Criterion for Recurrence<br />
Theorem 4.2.4 (Foster-Lyapunov for Recurrence) Let S be a compact set, b < ∞, and V(.) be an inf-compact function on X, that is, {x : V(x) ≤ α} is compact for all α ∈ R_+. Let {x_n} be an irreducible Markov chain on X. If the following is satisfied:<br />
∫_X P(x, dy)V(y) ≤ V(x) + b 1_{{x∈S}}, ∀x ∈ X, (4.6)<br />
then, with τ_S = min{t > 0 : x_t ∈ S}, P_x(τ_S < ∞) = 1 for all x ∈ X; that is, the Markov chain is Harris recurrent.<br />
Proof: Define two stopping times, τ_S and τ_{B_N}, where B_N = {x : V(x) ≥ N}. Note that the sequence M_t = V(x_t) behaves as a supermartingale for t = 0, 1, ..., min(τ_S, τ_{B_N}) − 1, and is uniformly integrable up to τ_{B_N}; a variation of the optional sampling theorem applies for stopping times which are not necessarily bounded by a given finite number with probability one. Note that, due to irreducibility, min(τ_S, τ_{B_N}) < ∞ with probability 1. Now, it follows that for x ∉ S ∪ B_N, since the minimum value of the Lyapunov function when exiting into B_N is N,<br />
V(x) ≥ E_x[ V(x_{min(τ_S, τ_{B_N})}) ] ≥ P_x(τ_{B_N} < τ_S) N + P_x(τ_{B_N} ≥ τ_S) M<br />
for some finite M ≥ 0. Hence,<br />
P_x(τ_{B_N} < τ_S) ≤ V(x)/N.<br />
We also have that P(min(τ_S, τ_{B_N}) = ∞) = 0, since the chain is irreducible and will leave any compact set in finite time. As a consequence, we have that<br />
P_x(τ_S = ∞) ≤ P_x(τ_{B_N} < τ_S) ≤ V(x)/N,<br />
and taking the limit as N → ∞, P_x(τ_S = ∞) = 0.<br />
⊓⊔<br />
If S is furthermore petite, then once the petite set is visited, any other set with a positive measure (under the irreducibility measure) is visited with probability 1 infinitely often.<br />
4.2.4 On small and petite sets<br />
Petiteness may be difficult to verify directly. In the following, we present two conditions that may be used to establish petiteness.<br />
By [23], p. 131: for a Markov chain with transition kernel P and a probability measure K on the natural numbers, if there exists a sub-stochastic kernel N(·, ·) such that N(·, E) is lower semi-continuous for every E ∈ B(X) and Σ_{n=0}^{∞} P^n(x, E) K(n) ≥ N(x, E), then the chain is called a T-chain.<br />
Theorem 4.2.5 [23] For a T −chain which is irreducible, every compact set is petite.
For a countable state space, under irreducibility, every finite set S in (4.5) is petite.<br />
The assumptions of petiteness and irreducibility can be relaxed for the existence of an invariant probability measure. Tweedie [31] considers the following. If S is such that the uniform countable additivity condition<br />
lim_{n→∞} sup_{x∈S} P(x, B_n) = 0 (4.7)<br />
is satisfied for B_n ↓ ∅, then (4.5) implies the existence of an invariant probability measure, and there exist at most finitely many invariant probability measures. By Proposition 5.5.5(iii) of Meyn-Tweedie [23], under irreducibility, the Harris recurrent component of the space can be expressed as a countable union of petite sets C_n. By Lemma 4 of Tweedie (2001), under uniform countable additivity, any finite union ∪_{i=1}^{M} C_i is uniformly accessible from S. Therefore, if the Markov chain is irreducible, condition (4.7) implies that the set S is petite. For a large class of applications, this may be easier to verify than establishing the T-chain property. Under further conditions (such as when S is compact and the V used in a drift criterion has compact level sets), one can take the B_n to be subsets of a compact set, making the verification even simpler.<br />
4.2.5 Criterion for Transience<br />
Criteria for transience are somewhat more difficult to establish. One convenient way is to construct a stopping time sequence and show that the state does not return to some set infinitely often. We state the following.<br />
Theorem 4.2.6 ([23], [16]) Let V : X → R_+. If there exists a set A such that E[V(x_{t+1})|x_t = x] ≤ V(x) for all x ∉ A, and ∃ x̄ ∉ A such that V(x̄) < inf_{z∈A} V(z), then {x_t} is not recurrent, in the sense that P_{x̄}(τ_A < ∞) < 1.<br />
Proof: Let x = x̄. The proof follows from observing that V(x) ≥ ∫_y V(y)P(x, dy) ≥ (inf_{z∈A} V(z)) P(x, A) + ∫_{y∉A} V(y)P(x, dy) ≥ (inf_{z∈A} V(z)) P(x, A). It thus follows that<br />
P_x(τ_A < 2) = P(x, A) ≤ V(x) / (inf_{z∈A} V(z)).<br />
Likewise,<br />
V(x̄) ≥ ∫_y V(y)P(x̄, dy)<br />
≥ (inf_{z∈A} V(z)) P(x̄, A) + ∫_{y∉A} ( ∫_s V(s)P(y, ds) ) P(x̄, dy)<br />
≥ (inf_{z∈A} V(z)) P(x̄, A) + ∫_{y∉A} P(x̄, dy) ( (inf_{s∈A} V(s)) P(y, A) + ∫_{s∉A} V(s)P(y, ds) )<br />
≥ (inf_{z∈A} V(z)) P(x̄, A) + ∫_{y∉A} P(x̄, dy) (inf_{s∈A} V(s)) P(y, A)<br />
= (inf_{z∈A} V(z)) ( P(x̄, A) + ∫_{y∉A} P(x̄, dy) P(y, A) ). (4.8)<br />
Thus, observing that P({ω : τ_A(ω) < 3}) = P(x̄, A) + ∫_{y∉A} P(x̄, dy)P(y, A), we obtain<br />
P_{x̄}(τ_A < 3) ≤ V(x̄) / (inf_{z∈A} V(z)).<br />
Iterating, this holds for any n:<br />
P_{x̄}(τ_A < n) ≤ V(x̄) / (inf_{z∈A} V(z)) < 1.<br />
Continuity of probability measures (defining B_n = {ω : τ_A < n}, observing B_n ⊂ B_{n+1}, and that lim_n P(τ_A < n) = P(∪_n B_n) = P(τ_A < ∞) < 1) now leads to transience.<br />
⋄
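Theorem 4.2.6 can be verified in closed form for the upward-biased random walk on Z. In the sketch below (an arbitrary example; the choices of p, A, and V are made here), V(x) = (q/p)^x is harmonic for the walk, so the drift condition holds with equality off A, and V(1) lies strictly below inf_A V:

```python
# Transience criterion for the asymmetric random walk on Z with upward bias:
# P(x, x+1) = p = 0.7, P(x, x-1) = q = 0.3. Take A = {x : x <= 0} and
# V(x) = (q/p)**x, which is harmonic: E[V(x_{t+1}) | x_t = x] = V(x).
p, q = 0.7, 0.3

def V(x):
    return (q / p) ** x

# Check E[V(x_{t+1}) | x_t = x] <= V(x) for x outside A (here: exact equality).
for x in range(1, 50):
    assert abs(p * V(x + 1) + q * V(x - 1) - V(x)) < 1e-12

# V is decreasing, so inf_{z in A} V(z) = V(0) = 1, and x_bar = 1 gives V(1) < 1;
# by Theorem 4.2.6 the chain started at 1 therefore avoids A forever with
# positive probability.
inf_A = V(0)
assert V(1) < inf_A
```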
Observe the difference between the inf-compactness condition, leading to recurrence, and the above condition, leading to non-recurrence.<br />
We finally note that a convenient way to verify instability or transience is to construct an appropriate martingale sequence.<br />
4.2.6 State-Dependent Drift Criteria: Deterministic and Random-Time<br />
In many applications, the controllers act on a system intermittently. For this case, we have the following results [33], which extend the deterministic state-dependent results presented in [23], [24].<br />
Let τ_z, z ≥ 0, be a sequence of stopping times, measurable on a filtration, possibly generated by the state process.<br />
Theorem 4.2.7 [33] Suppose that x is a ϕ-irreducible Markov chain. Suppose moreover that there are functions V : X → (0, ∞), δ : X → [1, ∞), f : X → [1, ∞), a petite set C on which V is bounded, and a constant b ∈ R, such that the following hold:<br />
E[V(x_{τ_{z+1}}) | F_{τ_z}] ≤ V(x_{τ_z}) − δ(x_{τ_z}) + b 1_{{x_{τ_z}∈C}},<br />
E[ Σ_{k=τ_z}^{τ_{z+1}−1} f(x_k) | F_{τ_z} ] ≤ δ(x_{τ_z}), z ≥ 0. (4.9)<br />
Then x is positive Harris recurrent, and moreover π(f) < ∞, with π being the invariant distribution. ⊓⊔<br />
By taking f(x) = 1 for all x ∈ X, we obtain the following corollary to Theorem 4.2.7.<br />
Corollary 4.2.2 [33] Suppose that x is a ϕ-irreducible Markov chain. Suppose moreover that there is a function V : X → (0, ∞), a petite set C on which V is bounded, and a constant b ∈ R, such that the following hold:<br />
E[V(x_{τ_{z+1}}) | F_{τ_z}] ≤ V(x_{τ_z}) − 1 + b 1_{{x_{τ_z}∈C}},<br />
sup_{z≥0} E[τ_{z+1} − τ_z | F_{τ_z}] < ∞. (4.10)<br />
Then x is positive Harris recurrent. ⊓⊔<br />
More on invariant probability measures<br />
Without the irreducibility condition, if the chain is weak Feller and (4.5) holds with S compact, then there exists at least one invariant probability measure, as discussed in Section 3.4.<br />
Theorem 4.2.8 [33] Suppose that x is a Feller Markov chain, not necessarily ϕ-irreducible. If (4.9) holds with C compact, then there exists at least one invariant probability measure. Moreover, there exists c < ∞ such that, under any invariant probability measure π,<br />
E_π[f(x)] = ∫_X π(dx)f(x) ≤ c. (4.11)<br />
4.3 Convergence Rates to Equilibrium<br />
In addition to obtaining bounds on the rate of convergence through Dobrushin's coefficient, one powerful approach is through the Foster-Lyapunov drift conditions.<br />
Regularity and ergodicity are closely related concepts, through the work of Meyn and Tweedie [25], [26] and Tuominen and Tweedie [30].<br />
Definition 4.3.1 A set A ∈ B(X) is called (f, r)-regular if<br />
sup_{x∈A} E_x[ Σ_{k=0}^{τ_B−1} r(k) f(x_k) ] < ∞<br />
for all B ∈ B^+(X). A finite measure ν on B(X) is called (f, r)-regular if<br />
E_ν[ Σ_{k=0}^{τ_B−1} r(k) f(x_k) ] < ∞<br />
for all B ∈ B^+(X), and a point x is called (f, r)-regular if the measure δ_x is (f, r)-regular.<br />
This leads to a lemma relating regular distributions to regular atoms.<br />
Lemma 4.3.1 If a Markov chain {x_t} has an atom α ∈ B^+(X) and an (f, r)-regular distribution λ, then α is an (f, r)-regular set.<br />
Definition 4.3.2 (f-norm) For a function f : X → [1, ∞), the f-norm of a measure µ defined on (X, B(X)) is given by<br />
‖µ‖_f = sup_{g:|g|≤f} | ∫ µ(dx)g(x) |.<br />
The total variation norm ‖.‖_TV is the f-norm with f ≡ 1.<br />
Coupling inequality and moments of return times to a small set. The main idea behind the coupling inequality is to bound the total variation distance between the distributions of two random variables by the probability that they differ; the more coupled the two random variables are, the more their distributions match. Let X and Y be two jointly distributed random variables on a space X with marginal distributions µ_x and µ_y, respectively. Then we can bound the total variation between the distributions by the probability that the two variables are not equal:<br />
‖µ_x − µ_y‖_TV = sup_A |µ_x(A) − µ_y(A)|<br />
= sup_A |P(X ∈ A, X = Y) + P(X ∈ A, X ≠ Y) − P(Y ∈ A, X = Y) − P(Y ∈ A, X ≠ Y)|<br />
= sup_A |P(X ∈ A, X ≠ Y) − P(Y ∈ A, X ≠ Y)|<br />
≤ P(X ≠ Y).<br />
The coupling inequality is useful in discussions of ergodicity when used in conjunction with parallel Markov chains, as in Theorem 4.1 of [28]. One creates two Markov chains having the same one-step transition probabilities. Let {x_n} and {x′_n} be two Markov chains that have probability transition kernel P(x, ·), and let C be an (m, δ, ν)-small set. We use the coupling construction provided by Roberts and Rosenthal.<br />
Let x_0 = x and x′_0 ∼ π(·), where π(·) is the invariant distribution of both Markov chains.<br />
1. If x_n = x′_n, then x_{n+1} = x′_{n+1} ∼ P(x_n, ·).<br />
2. Else, if (x_n, x′_n) ∈ C × C, then with probability δ, x_{n+m} = x′_{n+m} ∼ ν(·); with probability 1 − δ, independently<br />
x_{n+m} ∼ (1/(1 − δ)) (P^m(x_n, ·) − δν(·)),<br />
x′_{n+m} ∼ (1/(1 − δ)) (P^m(x′_n, ·) − δν(·)).<br />
3. Else, independently x_{n+m} ∼ P^m(x_n, ·) and x′_{n+m} ∼ P^m(x′_n, ·).<br />
The in-between states x_{n+1}, ..., x_{n+m−1}, x′_{n+1}, ..., x′_{n+m−1} are distributed conditionally given x_n, x_{n+m}, x′_n, x′_{n+m}.<br />
By the coupling inequality and the previous discussion of Nummelin's splitting technique, we have ‖P^n(x, ·) − π(·)‖_TV ≤ P(x_n ≠ x′_n).<br />
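The coupling inequality itself can be verified exactly on a finite space. The sketch below uses an arbitrarily chosen joint distribution for (X, Y), computes both marginals, evaluates the total variation distance as a supremum over all events, and checks it against P(X ≠ Y):

```python
from itertools import combinations
from fractions import Fraction

# Coupling inequality on a finite space: ||mu_x - mu_y||_TV <= P(X != Y).
# The joint distribution of (X, Y) below is an arbitrary example.
space = [0, 1, 2]
joint = {(0, 0): Fraction(3, 10), (1, 1): Fraction(2, 10),
         (0, 2): Fraction(1, 10), (2, 1): Fraction(2, 10),
         (2, 2): Fraction(2, 10)}

mu_x = {s: sum(p for (a, b), p in joint.items() if a == s) for s in space}
mu_y = {s: sum(p for (a, b), p in joint.items() if b == s) for s in space}

# Total variation distance: the supremum of |mu_x(A) - mu_y(A)| over events A.
tv = max(abs(sum(mu_x[s] - mu_y[s] for s in A))
         for r in range(len(space) + 1)
         for A in combinations(space, r))
p_diff = sum(p for (a, b), p in joint.items() if a != b)
assert tv <= p_diff
```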
4.3.1 Lyapunov conditions: Geometric ergodicity

Roberts and Rosenthal use this approach to establish geometric ergodicity.

An irreducible Markov chain satisfies the univariate drift condition if there are constants λ ∈ (0, 1) and b < ∞, along with a function V : X → [1, ∞) and a small set C, such that

PV ≤ λV + b1_C.   (4.12)
One can now define a bivariate drift condition for two independent copies of a Markov chain with a small set C. This condition requires that there exist a function h : X × X → [1, ∞) and α > 1 such that

P̄h(x, y) ≤ h(x, y)/α,   (x, y) ∉ C × C,

where

P̄h(x, y) := ∫_X ∫_X h(z, w) P(x, dz) P(y, dw).

If the univariate drift condition (4.12) holds and C = {x : V(x) ≤ d} for some d > b/(1 − λ) − 1, then the bivariate drift condition is satisfied for h(x, y) = (1/2)(V(x) + V(y)) and α^{−1} = λ + b/(d + 1) < 1.

The bivariate condition ensures that the process (x_n, x′_n) hits the set C × C, from where the two chains may be coupled. The coupling inequality then leads to the desired conclusion.
Theorem 4.3.3 (Theorem 9 of [28]) Suppose {x_t} is an aperiodic, irreducible Markov chain with invariant distribution π(·). Suppose C is a (1, ε, ν)-small set and V : X → [1, ∞) satisfies the univariate drift condition (4.12) with constants λ ∈ (0, 1) and b < ∞, with V(x) < ∞ for some x ∈ X. Then {x_t} is geometrically ergodic.
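As a sanity check of the univariate drift condition, the following sketch (my own illustration, not part of the notes) verifies PV ≤ λV + b·1_C numerically for a reflected random walk on {0, …, N} with downward drift, using the exponential Lyapunov function V(x) = z^x. The values z = 1.2, λ = 0.95, b = 0.2 and C = {0} are trial choices.

```python
import numpy as np

N, p, q = 50, 0.3, 0.7              # up/down probabilities; q > p gives negative drift
P = np.zeros((N + 1, N + 1))
for x in range(N + 1):
    P[x, min(x + 1, N)] += p        # step up (reflected at N)
    P[x, max(x - 1, 0)] += q        # step down (reflected at 0)

z = 1.2
V = z ** np.arange(N + 1)           # Lyapunov function V(x) = z^x >= 1
lam, b = 0.95, 0.2
C = (np.arange(N + 1) == 0)         # candidate small set C = {0}

PV = P @ V                          # (PV)(x) = sum_y P(x, y) V(y)
drift_holds = bool(np.all(PV <= lam * V + b * C + 1e-12))
# Interior states: (PV)(x)/V(x) = p*z + q/z ~ 0.943 < 0.95 = lambda;
# at x = 0 the excess is absorbed by the b * 1_C term.
```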
As a side remark, we note the following observation. If the univariate drift condition holds, then the sequence of random variables

M_n = λ^{−n} V(x_n) − Σ_{k=0}^{n−1} λ^{−(k+1)} b 1_C(x_k)

is a supermartingale, and thus we have the inequality

E_{x_0}[λ^{−n} V(x_n)] ≤ V(x_0) + E_{x_0}[ Σ_{k=0}^{n−1} λ^{−(k+1)} b 1_C(x_k) ]   (4.13)

for all n and x_0 ∈ X, and with Theorem 3.3.4 we also have

E_x[λ^{−τ_B}] ≤ V(x) + c(B)

for all B ∈ B^+(X).
4.3.2 Subgeometric ergodicity

Here, we review the class of subgeometric rate functions (see Section 4 of [16], Section 5 of [12], [23], [13], [30]). Let Λ_0 be the family of functions r : N → R_{>0} such that

r is non-decreasing, r(1) ≥ 2,

and

log r(n) / n ↓ 0 as n → ∞.
The second condition implies that for all r ∈ Λ_0, if n > m > 0, then

n log r(n + m) ≤ n log r(n) + m log r(n) ≤ n log r(n) + n log r(m),

so that

r(m + n) ≤ r(m) r(n) for all m, n ∈ N.   (4.14)
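For a concrete instance, r(n) = 2(n + 1)² is a polynomial (hence subgeometric) rate: it is non-decreasing, r(1) = 8 ≥ 2, log r(n)/n decreases to 0, and the submultiplicativity (4.14) can be checked directly. The sketch below (an illustration of mine, not from the notes) verifies these properties numerically over a finite range.

```python
import math

def r(n):
    # Candidate subgeometric rate function r(n) = 2 (n+1)^2.
    return 2.0 * (n + 1) ** 2

# r is non-decreasing and r(1) >= 2.
monotone = all(r(n) <= r(n + 1) for n in range(1, 200))
base = r(1) >= 2

# log r(n) / n decreases (to 0).
ratios = [math.log(r(n)) / n for n in range(1, 200)]
decreasing = all(a >= b for a, b in zip(ratios, ratios[1:]))

# Submultiplicativity (4.14): r(m + n) <= r(m) r(n).
submult = all(r(m + n) <= r(m) * r(n) + 1e-9
              for m in range(1, 100) for n in range(1, 100))
```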
The class of subgeometric rate functions Λ defined in [30] is the class of sequences r for which there exists a sequence r_0 ∈ Λ_0 such that

0 < liminf_{n→∞} r(n)/r_0(n) ≤ limsup_{n→∞} r(n)/r_0(n) < ∞.

The main theorem we cite on subgeometric rates of convergence is due to Tuominen and Tweedie [30].
Theorem 4.3.4 (Theorem 2.1 of [30]) Suppose that {x_t}_{t∈N} is an irreducible and aperiodic Markov chain on state space X with stationary transition probabilities given by P. Let f : X → [1, ∞) and r ∈ Λ be given. The following are equivalent:

(i) there exists a petite set C ∈ B(X) such that

sup_{x∈C} E_x[ Σ_{k=0}^{τ_C − 1} r(k) f(x_k) ] < ∞;

(ii) there exists a sequence (V_n) of functions V_n : X → [0, ∞], a petite set C ∈ B(X) and b ∈ R_+ such that V_0 is bounded on C,

V_0(x) = ∞ ⇒ V_1(x) = ∞,

and

P V_{n+1} ≤ V_n − r(n) f + b r(n) 1_C,  n ∈ N;

(iii) there exists an (f, r)-regular set A ∈ B^+(X);

(iv) there exists a full absorbing set S which can be covered by a countable number of (f, r)-regular sets.
Theorem 4.3.5 [30] If a Markov chain {x_t} satisfies Theorem 4.3.4 for (f, r), then r(n) ‖P^n(x_0, ·) − π(·)‖_f → 0 as n → ∞.

Thus, we observe that the hitting time to a small set is a very important quantity in characterizing not only the existence of an invariant probability measure, but also how fast a Markov chain converges to equilibrium.
4.4 Conclusion

This concludes our discussion of controlled Markov chains via martingale methods. We will revisit one more application of martingales while discussing the convex analytic approach to controlled Markov problems.
4.5 Exercises

Exercise 4.5.1 Let G = {Ω, ∅}. Then, show that E[X|G] = E[X]; and if G = F, then E[X|F] = X.

Exercise 4.5.2 Let X be a random variable such that P(|X| < K) = 1 for some K ∈ R, that is, X takes values in a compact set. Let Y_n = X + W_n for n ∈ Z_+, where the W_n are independent noise variables.

Let F_n be the σ-field generated by Y_0, Y_1, …, Y_n.

Does lim_{n→∞} E[X|F_n] exist almost surely? Prove your statement.

Can you replace the boundedness assumption on X with E[|X|] < ∞?
Exercise 4.5.3 Consider the following two-server system:

x¹_{t+1} = max(x¹_t + A¹_t − u¹_t, 0)
x²_{t+1} = max(x²_t + A²_t + u¹_t 1_{(u¹_t ≤ x¹_t + A¹_t)} − u²_t, 0),   (4.15)

where 1_{(·)} denotes the indicator function and A¹_t, A²_t are independent and identically distributed (i.i.d.) random variables with geometric distributions, that is, for i = 1, 2,

P(A^i_t = k) = p_i (1 − p_i)^k,  k ∈ {0, 1, 2, …},

for some scalars p_1, p_2 such that E[A¹_t] = 1.5 and E[A²_t] = 1.

a) Suppose the control actions u¹_t, u²_t are such that u¹_t + u²_t ≤ 5 for all t ∈ N_+ and u¹_t, u²_t ∈ R_+. At any given time t, the controller has to decide on u¹_t and u²_t knowing {x¹_s, x²_s, s ≤ t} but not knowing A¹_t, A²_t.

Is this server system stochastically stabilizable by some policy, that is, does there exist an invariant distribution under some control policy?

If your answer is positive, provide a control policy and show that there exists a unique invariant distribution. If your answer is negative, precisely state why.

b) Repeat part a where now we restrict u¹_t, u²_t ∈ Z_+, that is, the actions are to be integer valued.
Exercise 4.5.4 a) Consider a controlled Markov chain with the following dynamics:

x_{t+1} = a x_t + b u_t + w_t,

where w_t is a zero-mean Gaussian noise with finite variance and a, b ∈ R are the system dynamics coefficients. One admissible controller policy (that is, the policy at time t is measurable with respect to σ(x_0, x_1, …, x_t) and is a mapping to R) is the following:

u_t = − ((a + 0.5)/b) x_t.

Show that {x_t}, under this policy, has a unique invariant distribution.

b) Consider a similar setup to the one earlier, with b = 1:

x_{t+1} = a x_t + u_t + w_t,

where w_t is a zero-mean Gaussian noise with finite variance, and a ∈ R is a known number.

This time, suppose we would like to find a control policy such that

lim_{t→∞} E[x²_t] < ∞.

Further, suppose we restrict the set of control policies to be linear and time-invariant, that is, of the form u(x_t) = k x_t for some k ∈ R.

Find the set of all k values for which the state has a finite second moment, that is, find

{k ∈ R : lim_{t→∞} E[x²_t] < ∞},

as a function of a.

Hint: Use the Foster-Lyapunov criteria.
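For part a), the given policy yields the closed-loop recursion x_{t+1} = −0.5 x_t + w_t, a stable AR(1) process whose stationary variance is σ_w²/(1 − 0.25). A quick simulation sketch (the values a = 1.7, b = 2, σ_w = 1 are my own illustrative choices) confirms this numerically; it is not a substitute for the requested proof.

```python
import numpy as np

rng = np.random.default_rng(3)
a, b, sigma_w, T = 1.7, 2.0, 1.0, 200_000

w = rng.normal(0.0, sigma_w, size=T)
x = 0.0
xs = np.empty(T)
for t in range(T):
    u = -(a + 0.5) / b * x                 # the policy u_t = -((a + 0.5)/b) x_t
    x = a * x + b * u + w[t]               # closed loop: x_{t+1} = -0.5 x_t + w_t
    xs[t] = x

emp_var = xs.var()
target = sigma_w**2 / (1 - 0.25)           # stationary variance 4/3 of AR(1) with pole -0.5
```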
Exercise 4.5.5 Prove Birkhoff's Ergodic Theorem for a countable state space; that is, the result that for a Markov chain {x_t} living in a countable space X, which has a unique invariant distribution µ, the following holds almost surely:

lim_{T→∞} (1/T) Σ_{t=1}^T f(x_t) = Σ_i f(i) µ(i)
for all bounded functions f : X → R.

Hint: You may proceed as follows. Define the occupation measures, for all T and A ∈ B(X):

v_T(A) = (1/T) Σ_{t=0}^{T−1} 1_{(x_t ∈ A)},  ∀A ∈ B(X).

Now, define:

F_t(A) = Σ_{s=1}^t 1_{(x_s ∈ A)} − t Σ_{x∈X} P^π(A|x) v_t(x)
       = Σ_{s=1}^t 1_{(x_s ∈ A)} − Σ_{s=0}^{t−1} Σ_{x∈X} P^π(A|x) 1_{(x_s = x)}.   (4.16)
And verify that, for t ≥ 2,

E[F_t(A) | F_{t−1}]
= E[ ( Σ_{s=1}^t 1_{(x_s ∈ A)} − Σ_{s=0}^{t−1} Σ_{x∈X} P^π(A|x) 1_{(x_s = x)} ) | F_{t−1} ]
= E[ ( 1_{(x_t ∈ A)} − Σ_{x∈X} P^π(A|x) 1_{(x_{t−1} = x)} ) | F_{t−1} ]
  + ( Σ_{s=1}^{t−1} 1_{(x_s ∈ A)} − Σ_{s=0}^{t−2} Σ_{x∈X} P^π(A|x) 1_{(x_s = x)} )
= 0 + ( Σ_{s=1}^{t−1} 1_{(x_s ∈ A)} − Σ_{s=0}^{t−2} Σ_{x∈X} P^π(A|x) 1_{(x_s = x)} )   (4.17)
= F_{t−1}(A),   (4.18)

where the last equality follows from the fact that E[1_{(x_t ∈ A)} | F_{t−1}] = P(x_t ∈ A | F_{t−1}).
Furthermore,

|F_t(A) − F_{t−1}(A)| ≤ 1.

Now, we have a martingale sequence with bounded increments. We will invoke a martingale convergence theorem which is applicable to such martingales. By a version of the martingale stability theorem, it follows that

lim_{t→∞} (1/t) F_t(A) = 0.

You now need to complete the remaining steps. You can use the Azuma-Hoeffding inequality and the Borel-Cantelli Lemma to complete them, or you may use any other method you wish.
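To see the statement in action before proving it, the following sketch (my own illustrative example, not part of the exercise) runs a two-state chain with P(0→1) = 0.1 and P(1→0) = 0.2, whose unique invariant distribution is µ = (2/3, 1/3), and compares the time average of f(x) = x with the spatial average Σ_i f(i)µ(i) = 1/3.

```python
import numpy as np

rng = np.random.default_rng(1)
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])       # invariant distribution mu = (2/3, 1/3)
T = 200_000

x, total = 0, 0
for _ in range(T):
    # stay with probability P[x, x], otherwise switch states
    x = x if rng.random() < P[x, x] else 1 - x
    total += x                   # accumulate f(x_t) with f(i) = i

time_avg = total / T             # should approach sum_i f(i) mu(i) = 1/3
```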
Exercise 4.5.6 Consider a queueing process with i.i.d. Poisson arrivals and departures, with arrival mean λ and service mean µ, and suppose the process is such that when a customer leaves the queue, with probability p (independent of time) it comes back to the queue. That is, the dynamics of the system satisfy

L_{t+1} = max(L_t + A_t − N_t + p_t N_t, 0),  t ∈ N,

where E[A_t] = λ, E[N_t] = µ and E[p_t] = p.

For what values of µ, λ is such a system stochastically stable? Prove your statement.
Figure 4.1: A router (with action u) directing incoming customers to Station 1 and Station 2.
Exercise 4.5.7 Consider a two-station server network, where a router routes the incoming traffic, as depicted in Figure 4.1.

Let L¹_t, L²_t denote the number of customers in stations 1 and 2 at time t. Let the dynamics be given by the following:

L¹_{t+1} = max(L¹_t + u_t A_t − N¹_t, 0),  t ∈ N,
L²_{t+1} = max(L²_t + (1 − u_t) A_t − N²_t, 0),  t ∈ N.

Customers arrive according to an independent Bernoulli process A_t with mean λ; that is, P(A_t = 1) = λ and P(A_t = 0) = 1 − λ. Here u_t ∈ [0, 1] is the router action.

Station 1 has a Bernoulli service process N¹_t with mean n_1, and Station 2 has a Bernoulli service process N²_t with mean n_2.

Suppose that the router decides on u_t by the following algorithm: if a customer arrives, the router simply sends the incoming customer to the shortest queue.

Find necessary and sufficient conditions (on λ, n_1, n_2) for this algorithm to lead to a stochastically stable system which satisfies lim_{t→∞} E[L¹_t + L²_t] < ∞.
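A simulation sketch of this routing system follows (the parameter values λ = 0.5, n₁ = n₂ = 0.4 are my own illustrative choices satisfying λ < n₁ + n₂; this is an experiment, not the requested proof):

```python
import numpy as np

rng = np.random.default_rng(7)
lam, n1, n2, T = 0.5, 0.4, 0.4, 100_000

L1 = L2 = 0
total = 0
for t in range(T):
    A  = rng.random() < lam          # Bernoulli arrival
    N1 = rng.random() < n1           # Bernoulli service at station 1
    N2 = rng.random() < n2           # Bernoulli service at station 2
    u  = 1 if L1 <= L2 else 0        # route the arrival to the shorter queue
    L1 = max(L1 + u * A - N1, 0)
    L2 = max(L2 + (1 - u) * A - N2, 0)
    total += L1 + L2

mean_total = total / T               # time-average queue length stays small when stable
```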
Chapter 5

Dynamic Programming

In this chapter, we introduce the notion of dynamic programming for stochastic systems. Let us first recall a few notions discussed earlier.

Recall that a Markov control model is a five-tuple

(X, U, {U_t(x), x ∈ X}, Q, c_t)

such that X is a (complete and separable) metric space, U is the control action space, and U_t(x) is the set of admissible control actions when the state is x at time t, so that

K_t = {(x, u) : x ∈ X, u ∈ U_t(x)},

a subset of X × U, is the set of feasible state-control pairs; Q is a stochastic kernel on X given K_t. Finally, c_t : K_t → R is the cost function.

In case U_t(x) and c_t(·) do not depend on t, we drop the time index; the Markov model is then called a stationary Markov model.
If Π = {Π_t} is a randomized Markov policy, we define the cost function

c_t(x_t, Π) = ∫_{U(x_t)} c(x_t, u) Π_t(du|x_t)

and the transition function

Q_t(dz|x_t, Π) = ∫_{U(x_t)} Q(dz|x_t, u) Π_t(du|x_t).

Let, as in Chapter 2, Π_A denote the set of all admissible policies.
Recall that Π = {Π_t, 0 ≤ t ≤ N − 1} is a policy. Consider the following expected cost:

J(Π, x) = E^Π_x[ Σ_{t=0}^{N−1} c(x_t, u_t) + c_N(x_N) ],

where c_N(·) is the terminal cost function. Define

J*(x) := inf_{Π ∈ Π_A} J(Π, x).

As earlier, let h_t = {x_[0,t], u_[0,t−1]} denote the history, or the information process.

The goal is to find an admissible policy, if one exists, such that J*(x) is attained. We note that the infimum value is not always attained.
Before we proceed further, we note that we could express the cost as

J(Π, x) = E^Π_x[ c(x_0, u_0) + E^Π[ c(x_1, u_1) + E^Π[ c(x_2, u_2) + ⋯ + E^Π[ c(x_{N−1}, u_{N−1}) + c_N(x_N) | h_{N−1} ] ⋯ | h_1 ] | h_0 ] ].   (5.1)
Thus,

J(Π, x) = E^Π_x[ Σ_{t=0}^{N−1} c(x_t, u_t) + c_N(x_N) ]
= E^Π_x[ Σ_{t=0}^{k−1} c(x_t, u_t) + E^Π[ c(x_k, u_k) + Σ_{t=k+1}^{N−1} c(x_t, u_t) + c_N(x_N) | h_k ] | h_0 ].   (5.2)

The above follows from the properties of conditioning discussed in Chapters 1 and 4. Also, observe that for all t and measurable functions g_t,

E[g_t(x_{t+1}) | h_t, u_t] = E[g_t(x_{t+1}) | x_t, u_t].

This follows from the controlled Markov property.
5.1 Bellman's Principle of Optimality

Let {J_t(x_t)} be a sequence of functions on X defined by

J_N(x) = c_N(x),

and for 0 ≤ t ≤ N − 1,

J_t(x) = min_{u ∈ U_t(x)} { c_t(x, u) + ∫_X J_{t+1}(z) Q_t(dz|x, u) }.

Suppose these functions admit a minimum and are measurable. We will discuss a number of sufficient conditions for when this is possible. Let there be minimizing selectors which are deterministic and denoted by {f_t(x)}, and let the minimum cost be equal to

J_t(x) = c_t(x, f_t(x)) + ∫_X J_{t+1}(z) Q_t(dz|x, f_t(x)).

Then we have the following:
Theorem 5.1.1 The policy Π* = {f_0, f_1, …, f_{N−1}} is optimal, and the optimal cost is equal to

J*(x) = J_0(x).

We compare the cost generated by the above policy with the cost obtained by any other policy.
Proof: We provide the proof by a backwards induction method. Consider the time stage t = N − 1. For this stage, the optimal cost is equal to

J_{N−1}(x) = min_u { c_{N−1}(x, u) + ∫_X J_N(z) Q_{N−1}(dz|x, u) }.

Suppose a cost C*_{N−1}(x_{N−1}) is achieved by some policy η_{N−1}; we claim that C*_{N−1}(x_{N−1}) ≥ J_{N−1}(x_{N−1}). Indeed,

C*_{N−1}(x_{N−1}) = c_{N−1}(x_{N−1}, η_{N−1}(x_{N−1})) + ∫_X J_N(z) Q_{N−1}(dz|x_{N−1}, η_{N−1}(x_{N−1}))
≥ min_u { c_{N−1}(x_{N−1}, u) + ∫_X J_N(z) Q_{N−1}(dz|x_{N−1}, u) } = J_{N−1}(x_{N−1}).   (5.3)
Now, we move to time stage N − 2. In this case, the cost to be minimized satisfies

c_{N−2}(x_{N−2}, u_{N−2}) + ∫_X C*_{N−1}(z) Q_{N−2}(dz|x_{N−2}, u_{N−2})
≥ min_u { c_{N−2}(x_{N−2}, u) + ∫_X J_{N−1}(z) Q_{N−2}(dz|x_{N−2}, u) } =: J_{N−2}(x_{N−2}).   (5.4)

Thus, the cost achieved by any other policy will keep being greater than or equal to the one achieved by the minimizing policy. We can show, by induction, that the recursion holds for all 0 ≤ t ≤ N − 2. ⊓⊔
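The backward induction above can be checked directly on a small finite model. The sketch below (the random costs and kernels are illustrative choices of mine) computes J_0 by the dynamic programming recursion and verifies that it matches the minimum over all deterministic Markov policies evaluated by brute force.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
nS, nA, N = 3, 2, 3                               # states, actions, horizon

Q = rng.random((nA, nS, nS))
Q /= Q.sum(axis=2, keepdims=True)                 # Q[a, x, :] = next-state distribution
c = rng.random((nS, nA))                          # stage cost c(x, a)
cN = rng.random(nS)                               # terminal cost

# Backward induction: J_N = c_N, J_t(x) = min_a { c(x,a) + sum_y Q(a,x,y) J_{t+1}(y) }.
J = cN.copy()
for _ in range(N):
    J = np.min(c + np.einsum('axy,y->xa', Q, J), axis=1)
J_dp = J                                          # this is J_0

# Brute force over all deterministic Markov policies (one map x -> a per stage).
best = np.full(nS, np.inf)
for flat in itertools.product(range(nA), repeat=nS * N):
    f = np.array(flat).reshape(N, nS)
    V = cN.copy()
    for t in reversed(range(N)):
        a = f[t]
        V = np.array([c[x, a[x]] + Q[a[x], x] @ V for x in range(nS)])
    best = np.minimum(best, V)

matches = bool(np.allclose(J_dp, best))
```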
5.2 Discussion: Why are Markov Policies Important? Why are Optimal Policies Deterministic for such Settings?

We observe that when there is an optimal solution, there is an optimal solution which is Markov. Let us provide a more insightful way to derive this property.

In the following, we will follow David Blackwell's [5] and Hans Witsenhausen's ideas:
Theorem 5.2.1 (Blackwell's Principle of Irrelevant Information) Let X, Y, U be complete, separable metric spaces, let P be a probability measure on B(X × Y), and let c : X × U → R be a Borel measurable cost function. Then, for any Borel measurable function γ : X × Y → U, there exists another Borel measurable function γ* : X → U such that

∫_X c(x, γ*(x)) P(dx) ≤ ∫_{X×Y} c(x, γ(x, y)) P(dx, dy).

In particular, policies that depend only on x are optimal.
Proof: We will construct a γ* given γ. Let us write

∫_{X×Y} c(x, γ(x, y)) P(dx, dy) = ∫_X ( ∫_U c(x, u) P_γ(du|x) ) P(dx),
where P_γ(u ∈ D|x) = ∫_Y 1_{(γ(x,y) ∈ D)} P(dy|x). Consider h_γ(x) := ∫_U c(x, u) P_γ(du|x). Let

D = {(x, u) ∈ X × U : c(x, u) ≤ h_γ(x)};

D is a Borel set since c(x, u) − h_γ(x) is a Borel measurable function.
a) Suppose the space U is countable, and enumerate its elements as {u_k, k = 1, 2, …}. In this case, we could define

D_i = {x ∈ X : (x, u_i) ∈ D},  i = 1, 2, …,

and define

γ*(x) = u_k  if x ∈ D_k \ (∪_{i=1}^{k−1} D_i),  k = 1, 2, … .

Such a function is measurable, by construction.
b) Suppose now the space U is separable, but assume that c(x, u) is continuous in u.

Suppose that the space U is partitioned into a countable collection of disjoint sets K_n = {K¹_n, K²_n, …, K^m_n, …}, with elements u^k_n ∈ K^k_n in the sets. In this case, we could define, by an enumeration of the control actions in K_n,

D_k = {x ∈ X : (x, u^k_n) ∈ D},

and define

γ_n(x) = u^k_n  if x ∈ D_k \ (∪_{i=1}^{k−1} D_i).

We can make K_n finer by replacing the above with K_{n+1} such that K^i_{n+1} ⊂ K^m_n for some m (for example, if U = R, let K_n = { 2^{−n} [⌊u 2^n⌋, ⌈u 2^n⌉) : u ∈ R }).

Each of the γ_n functions is measurable. Furthermore, by a careful enumeration, since the image partitions become finer, lim_{n→∞} γ_n(x) exists pointwise. Being a pointwise limit of measurable functions, the limit is also measurable; define this limit as γ*. By continuity, it then follows that (x, γ*(x)) ∈ D, since lim_{n→∞} c(x, γ_n(x)) = c(x, γ*(x)) ≤ h_γ(x).

This completes the proof for the case when c is continuous in u.
c) For the general case, when c is not necessarily continuous, we define D_x = {u : (x, u) ∈ D}. It follows that P_γ(D_x|x) > 0, for otherwise we would have ∫_U c(x, u) P_γ(du|x) > h_γ(x), a contradiction. It then follows from Theorem 2 of [6] that we can construct a measurable function γ* whose graph satisfies (x, γ*(x)) ∈ D. We do not provide the detailed proof for this case. ⊓⊔
We also have the following discussion.

Theorem 5.2.2 Let {(x_t, u_t)} be a controlled Markov chain. Consider the minimization of E[Σ_{t=0}^{T−1} c(x_t, u_t)] over all control policies which are sequences of causally measurable functions of {x_s, s ≤ t}, for all t ≥ 0. Suppose further that the cost function is measurable and the transition kernel is such that ∫ Φ(dx_{t+1}|x_t, u_t) f(x_{t+1}) is measurable on U for every measurable and bounded function f on X. If an optimal control policy exists, there is no loss in restricting policies to be Markov, that is, policies which only use the current state x_t and the time information t.
Proof: Let us define a history process {h_t}, where h_t = {x_0, …, x_t} for t ≥ 0. Let there be an optimal policy given by {η_t, t ≥ 0} such that u_t = η_t(h_t). It follows that

P(x_{t+1} ∈ B | x_t)
= ∫_U ∫_{X^t} Φ(x_{t+1} ∈ B | u_t, x_t) 1_{(u_t = η_t(h_t))} P(dh_t | x_t)
= ∫_U Φ(x_{t+1} ∈ B | u, x_t) ( ∫_{X^t} 1_{(η_t(h_t) ∈ du)} P(dh_t | x_t) )
= ∫_U Φ(x_{t+1} ∈ B | u, x_t) P̃(du | x_t),   (5.5)
where

P̃(du | x_t) = ∫_{X^t} 1_{(η_t(h_t) ∈ du)} P(dh_t | x_t).
Thus, we may replace any history-dependent policy with a possibly randomized Markov policy.

By hypothesis, an optimal solution exists. If the solution is randomized, then there exists some distribution κ_t on U such that P(u_t ∈ B | x_t) = κ_t(B | x_t), and the optimal cost can be expressed as a dynamic programming recursion for some sequence of functions {J_t, t ≥ 0} on X with

J_t(x_t) = ∫_U κ_t(du | x_t) ( c(x_t, u) + ∫_X Φ(dx_{t+1} | x_t, u) J_{t+1}(x_{t+1}) ).

Define

M(x_t, u_t) = c(x_t, u_t) + ∫_X Φ(dx_{t+1} | x_t, u_t) J_{t+1}(x_{t+1}).

If inf_{u_t ∈ U} M(x_t, u_t) admits a solution, the distribution κ_t can be taken to be a Dirac-delta distribution at a minimizing value. By hypothesis, a solution exists P-a.s. Thus, an optimal policy depends only on the state value and the time variable (the dependency on time is due to the time-dependent nature of J_t). ⊓⊔
5.3 Existence of Minimizing Selectors and Measurability

The above dynamic programming arguments hold when there exist minimizing control policies (selectors measurable with respect to the Borel field on X).

Measurable Selection Hypothesis: There exists a sequence of functions J_t : X → R satisfying

J_t(x) = min_{u ∈ U_t(x)} ( c(x, u) + ∫_X J_{t+1}(y) Q(dy|x, u) ),

for all x ∈ X and t ∈ {0, 1, 2, …, N − 1}, with

J_N(x_N) = c_N(x_N).

Furthermore, there exist measurable functions f_t such that

J_t(x) = c(x, f_t(x)) + ∫_X J_{t+1}(y) Q(dy|x, f_t(x)).

⊓⊔
Recall that a set in a normed linear space is (sequentially) compact if every sequence in the set has a converging subsequence.

Condition 1: The cost function to be minimized, c(x, u), is continuous on X × U; U_t(x) = U is compact; and ∫_X Q(dy|x, u) v(y) is a continuous function on X × U for every continuous and bounded v on X. ⊓⊔

Condition 2: For every x ∈ X, the cost function to be minimized, c(x, u), is continuous on U; U_t(x) is compact; and ∫_X Q(dy|x, u) v(y) is a continuous function on U for every bounded, measurable function v on X. ⊓⊔

Theorem 5.3.1 Under Condition 1 or Condition 2, there exists an optimal solution, the Measurable Selection Hypothesis applies, and there exists a minimizing control policy f_t with f_t(x) ∈ U_t(x).

Furthermore, under Condition 1, the function J_t(x) is continuous if c_N(x_N) is continuous.
The result follows from the two lemmas below.

Lemma 5.3.1 A continuous function f : X → R over a compact set A ⊂ X admits a minimum.

Proof: Let δ = inf_{x∈A} f(x). Let {x_i} be a sequence such that f(x_i) converges to δ. Since A is compact, {x_i} must have a converging subsequence {x_{i(n)}}. Let the limit of this subsequence be x_0. Then it follows that {x_{i(n)}} → x_0 and thus, by continuity, {f(x_{i(n)})} → f(x_0). As such, f(x_0) = δ. ⊓⊔

To see why compactness is important, consider

inf_{x∈A} 1/x

for A = [1, 2). How about for A = [1, ∞)? In both cases there does not exist an x value in the specified set which attains the infimum.
Lemma 5.3.2 Let U be compact, and let c(x, u) be continuous on X × U. Then min_u c(x, u) is continuous on X.

Proof: Let x_n → x, let u_n be optimal for x_n, and let u be optimal for x. Such optimal action values exist as a result of the compactness of U and continuity. Now,

| min_u c(x_n, u) − min_u c(x, u) | ≤ max( c(x_n, u) − c(x, u), c(x, u_n) − c(x_n, u_n) ).   (5.6)

The first term above converges to zero since c is continuous in (x, u). The second converges as well. Suppose otherwise. Then, for some ε > 0, there exists a subsequence such that

c(x, u_{k_n}) − c(x_{k_n}, u_{k_n}) ≥ ε.

Consider the sequence (x_{k_n}, u_{k_n}). Since U is compact, there exists a further subsequence (x_{k′_n}, u_{k′_n}) which converges to (x, u′) for some u′. Hence, along this subsequence, c(x_{k′_n}, u_{k′_n}) and c(x, u_{k′_n}) both converge to c(x, u′), leading to a contradiction. ⋄

Compactness is a crucial component for this result.
The textbook by Hernandez-Lerma and Lasserre provides more general conditions. We did not address here the issue of the existence of measurable functions; we refer the reader again to Schäl [29]. We, however, state the following result:

Lemma 5.3.3 Let c(x, u) be a continuous function in u for every x, and let U be a compact set. Then, there exists a Borel measurable function f : X → U such that

c(x, f(x)) = min_{u∈U} c(x, u).

Proof: Let

˜c(x) := min_{u∈U} c(x, u).

We first observe that ˜c is Borel measurable: it is sufficient to prove that {x : ˜c(x) ≤ α} is Borel for every α ∈ R, which follows from the continuity of c and the compactness of U. Define D = {(x, u) : c(x, u) ≤ ˜c(x)}; this set is a Borel set. We can now construct a measurable function which lives in this set, using the property that the control action space is a separable, complete metric space, as discussed in the proof of Theorem 5.2.1. Note that the continuity of c in u ensures that the graph issue discussed there does not occur here, since lim_{n→∞} c(x, γ_n(x)) = c(x, γ*(x)) ≤ ˜c(x). ⊓⊔
We can replace the compactness condition with an inf-compactness condition, and modify Condition 1 as below:

Condition 3: For every x ∈ X, the cost function to be minimized, c(x, u), is continuous on X × U and non-negative; {u : c(x, u) ≤ α} is compact for all α > 0 and all x ∈ X; and ∫_X Q(dy|x, u) v(y) is a continuous function on X × U for every continuous and bounded v. ⊓⊔

Theorem 5.3.2 Under Condition 3, the Measurable Selection Hypothesis applies.

Remark: We could relax the continuity condition and replace it with lower semi-continuity. A function f is lower semi-continuous at x_0 if liminf_{x→x_0} f(x) ≥ f(x_0). ⊓⊔
Essentially what is needed is the existence of a minimizing control function, or a selector, which is optimal for every x_t ∈ X. In class we solved the following problem with q > 0, r > 0, p_T > 0:

inf_Π E^Π_x[ Σ_{t=0}^{T−1} (q x²_t + r u²_t) + p_T x²_T ]

for a linear system

x_{t+1} = a x_t + u_t + w_t,

where w_t is a zero-mean, i.i.d. Gaussian random variable with variance σ²_w. We showed, by the method of completing the squares, that

J_t(x_t) = P_t x²_t + Σ_{k=t}^{T−1} P_{k+1} σ²_w,

where

P_t = q + P_{t+1} a² − (P_{t+1} a)²/(P_{t+1} + r),  P_T = p_T,

and the optimal control policy is

u_t = − (P_{t+1} a)/(P_{t+1} + r) x_t.

Note that the optimal control policy is Markov (as it uses only the current state).
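The backward Riccati recursion is immediate to implement. The sketch below (the values a = 1.2, q = r = p_T = 1 are my own illustrative choices) iterates P_t backwards, checks that it converges to the fixed point of the stationary Riccati equation, and checks that the resulting closed-loop gain is stabilizing.

```python
a, q, r, pT, T = 1.2, 1.0, 1.0, 1.0, 50

P = pT                                 # P_T
gains = []
for _ in range(T):                     # iterate t = T-1 down to 0
    gains.append(a * P / (P + r))      # gain at time t uses P_{t+1}
    P = q + a * a * P - (a * P) ** 2 / (P + r)

P0 = P                                 # P_0 after T backward steps
# Stationary Riccati fixed-point residual: P = q + a^2 P - (aP)^2 / (P + r).
residual = abs(q + a * a * P0 - (a * P0) ** 2 / (P0 + r) - P0)
# Closed-loop coefficient a + k_0 with k_0 = -a P_1 / (P_1 + r); |.| < 1 means stable.
closed_loop = a - gains[-1]
```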
5.4 Infinite Horizon Optimal Control Problems

When the time horizon becomes unbounded, we cannot directly invoke dynamic programming in the form considered earlier. In such settings, we look for conditions for optimality. We will investigate such conditions in this section.

The infinite horizon problems that we will consider belong to two classes: discounted cost and average cost problems. We first discuss the discounted cost problem. The average cost problem is discussed in Chapter 7.

5.4.1 Discounted Cost

One popular class of problems is that in which future costs are discounted: the future is less important than today, in part due to the uncertainty in the future, as well as due to the economic understanding that the current value of a good is more important than its value in the future.
The discounted cost function is given as

J(ν_0, Π) = E^Π_{ν_0}[ Σ_{t=0}^{T−1} β^t c(x_t, u_t) ],

for some β ∈ (0, 1).
If there exists a policy Π* which minimizes this cost, the policy is said to be optimal.

We can consider the infinite horizon problem by taking the limit (when c is non-negative)

J(ν_0, Π) = lim_{T→∞} E^Π_{ν_0}[ Σ_{t=0}^{T−1} β^t c(x_t, u_t) ]

and invoking the Monotone Convergence Theorem:

J(ν_0, Π) = E^Π_{ν_0}[ Σ_{t=0}^∞ β^t c(x_t, u_t) ],

for some β ∈ (0, 1).

Note that

J(x_0, Π) = E^Π_{x_0}[ c(x_0, u_0) + E^Π[ Σ_{t=1}^∞ β^t c(x_t, u_t) | x_0, u_0 ] ],

or

J(x_0, Π) = E^Π_{x_0}[ c(x_0, u_0) + β E^Π[ Σ_{t=1}^∞ β^{t−1} c(x_t, u_t) | x_0, u_0 ] ],

or

J(x_0, Π) = E^Π_{x_0}[ c(x_0, u_0) + β E^Π[ J(x_1, Π) | x_0, u_0 ] ].

A function v : X → R is a solution to the infinite horizon problem if it satisfies

v(x) = min_{u∈U} { c(x, u) + β ∫_X v(y) Q(dy|x, u) }

for all x ∈ X. We call v(·) a value function.
Now, let J(x) = inf_{Π∈Π_A} J(x, Π). We observe that the cost of starting at any position, at any time k, leads to a recursion where the future cost is multiplied by β. This can be observed by applying the Dynamic Programming equations discussed earlier: for a given finite horizon truncation, we work with the sequential updates:<br />
\[ J_t(x_t) = \min_{u_t \in U}\Big\{c(x_t,u_t) + \beta \int_X J_{t+1}(x_{t+1}) Q(dx_{t+1}|x_t,u_t)\Big\}, \]<br />
and hope that this converges to a limit.<br />
Under measurable selection conditions and a boundedness condition on the cost function, there exists a solution to the discounted cost problem. In particular, the following holds:<br />
Theorem 5.4.1 Suppose the cost function is bounded, non-negative, and one of the measurable selection conditions (Condition 1, Condition 2 or Condition 3) applies. Then, there exists a solution to the discounted cost problem. Furthermore, the optimal cost (value function) is obtained by a successive iteration of policies:<br />
\[ v_n(x) = \min_{u \in U}\Big\{c(x,u) + \beta \int_X v_{n-1}(y) Q(dy|x,u)\Big\}, \quad \forall x,\ n \ge 1, \]<br />
with v_0(x) = 0.<br />
Before presenting the proof, we state the following two lemmas. The first one is on the exchange of the order of the minimum and limits.<br />
Lemma 5.4.1 [Hernandez-Lerma and Lasserre] Let V_n(x,u) ↑ V(x,u) pointwise. Suppose that V_n and V are continuous in u and that U(x) is compact for every x ∈ X. Then,<br />
\[ \lim_{n\to\infty} \min_{u \in U(x)} V_n(x,u) = \min_{u \in U(x)} V(x,u). \]<br />
Proof. The proof follows from essentially the same arguments as in the proof of Lemma 5.3.2. Let u*_n solve min_{u∈U(x)} V_n(x,u). Note that<br />
\[ \Big| \min_{u\in U(x)} V_n(x,u) - \min_{u\in U(x)} V(x,u) \Big| \le V(x,u^*_n) - V_n(x,u^*_n), \quad (5.7) \]<br />
since V_n(x,u) ↑ V(x,u). Now, suppose that<br />
\[ V(x,u^*_n) - V_n(x,u^*_n) > \epsilon \quad (5.8) \]<br />
along a subsequence n_k. There exists a further subsequence n'_k such that u*_{n'_k} → ū for some ū. Now, for every n'_k ≥ n, since V_n is monotonically increasing:<br />
\[ V(x,u^*_{n'_k}) - V_{n'_k}(x,u^*_{n'_k}) \le V(x,u^*_{n'_k}) - V_{n}(x,u^*_{n'_k}). \]<br />
However, V(x,·) and, for a fixed n, V_n(x,·) are continuous, hence the two terms on the right hand side converge to:<br />
\[ V(x,\bar u) - V_n(x,\bar u). \]<br />
For every fixed x and ū, and every ε > 0, we can find a sufficiently large n such that V(x,ū) − V_n(x,ū) ≤ ε/2. Hence (5.8) cannot hold.<br />
⊓⊔<br />
Lemma 5.4.2 The space of bounded measurable functions X → R endowed with the ||·||_∞ norm, that is,<br />
\[ L_\infty(X) = \{f : \|f\|_\infty = \sup_x |f(x)| < \infty\}, \]<br />
is a Banach space.<br />
Proof of Theorem 5.4.1. We obtain the optimal solution as a limit of discounted cost problems with a finite horizon T. By dynamic programming, we obtain the recursion for every T as:<br />
\[ J^T_T(x) = 0, \qquad J^T_t(x) = T(J^T_{t+1})(x) = \min_{u\in U}\Big\{c(x,u) + \beta \int_X J^T_{t+1}(y) Q(dy|x,u)\Big\}. \]<br />
This sequence will lead to a solution for a T-stage discounted cost problem. We then look for a solution as T → ∞ and invoke Lemma 5.4.1, since J^T_t(x) ↑ J^∞_t(x) for any t ∈ Z_+. Note also that J^∞_1(x) = J^∞_2(x) = J^∞_3(x) and so on. Thus, the sequence of such solutions leads to a solution for the infinite horizon problem. This will satisfy:<br />
\[ J^\infty_0(x) = T(J^\infty)(x) = \min_{u\in U}\Big\{c(x,u) + \beta \int_X J^\infty(y) Q(dy|x,u)\Big\}. \]<br />
Now, the next step is to obtain a solution to this fixed point equation. We will obtain the solution using an iteration, known as the value iteration algorithm. We observe that the function J^∞(·) lives in L_∞(X) (since the cost is bounded, there is a uniform bound for every x). We show that the iteration given by<br />
\[ T(v)(x) = \min_{u\in U}\Big\{c(x,u) + \beta \int_X v(y) Q(dy|x,u)\Big\} \]<br />
is a contraction in L_∞(X). Indeed,<br />
\[ \|T(v) - T(v')\|_\infty = \sup_{x\in X} |T(v)(x) - T(v')(x)| \]<br />
\[ = \sup_{x\in X} \Big| \min_{u\in U}\Big\{c(x,u) + \beta\int_X v(y)Q(dy|x,u)\Big\} - \min_{u\in U}\Big\{c(x,u) + \beta\int_X v'(y)Q(dy|x,u)\Big\} \Big| \]<br />
\[ = \sup_{x\in X} \bigg( 1_{A_1}\Big( \min_{u\in U}\Big\{c(x,u)+\beta\int_X v(y)Q(dy|x,u)\Big\} - \min_{u\in U}\Big\{c(x,u)+\beta\int_X v'(y)Q(dy|x,u)\Big\} \Big) + 1_{A_2}\Big( -\min_{u\in U}\Big\{c(x,u)+\beta\int_X v(y)Q(dy|x,u)\Big\} + \min_{u\in U}\Big\{c(x,u)+\beta\int_X v'(y)Q(dy|x,u)\Big\} \Big) \bigg) \]<br />
\[ \le \sup_{x\in X} \bigg( 1_{A_1}\Big( c(x,u^*) + \beta\int_X v(y)Q(dy|x,u^*) - c(x,u^*) - \beta\int_X v'(y)Q(dy|x,u^*) \Big) + 1_{A_2}\Big( -c(x,u^{**}) - \beta\int_X v(y)Q(dy|x,u^{**}) + c(x,u^{**}) + \beta\int_X v'(y)Q(dy|x,u^{**}) \Big) \bigg) \]<br />
\[ \le \sup_{x\in X} \Big( 1_{A_1}\, \beta\int_X \big(v(y) - v'(y)\big)Q(dy|x,u^*) \Big) + \sup_{x\in X} \Big( 1_{A_2}\, \beta\int_X \big(v'(y) - v(y)\big)Q(dy|x,u^{**}) \Big) \]<br />
\[ \le \beta \|v - v'\|_\infty \sup_{x\in X}\Big( 1_{A_1}\int_X Q(dy|x,u^*) + 1_{A_2}\int_X Q(dy|x,u^{**}) \Big) = \beta\|v - v'\|_\infty. \quad (5.9) \]<br />
Here A_1 denotes the event that<br />
\[ \min_{u\in U}\Big\{c(x,u) + \beta\int_X v(y)Q(dy|x,u)\Big\} \ge \min_{u\in U}\Big\{c(x,u) + \beta\int_X v'(y)Q(dy|x,u)\Big\}, \]<br />
and A_2 denotes the complementary event; u** is the minimizing control for \(\{c(x,u)+\beta\int_X v(y)Q(dy|x,u)\}\) and u* is the minimizer for \(\{c(x,u)+\beta\int_X v'(y)Q(dy|x,u)\}\).<br />
⊓⊔<br />
This iteration is known as the value iteration algorithm.<br />
Remark 5.4.1 We note that if the cost is not bounded, one can define a weighted sup-norm: ||c||_f = sup_x |c(x)/f(x)|, where f is a positive function. The discussions above apply in this context as well.<br />
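As an illustration, the following is a minimal sketch of the value iteration algorithm on a small finite model; the states, actions, costs and kernel below are invented for the example. The successive differences of the iterates shrink geometrically, consistent with the contraction estimate (5.9).<br />

```python
import numpy as np

# A hypothetical 3-state, 2-action model (for illustration only):
# c[x, u] is the stage cost and Q[u][x, y] = Q(y | x, u) is the kernel.
beta = 0.9
c = np.array([[1.0, 2.0],
              [0.5, 1.5],
              [2.0, 0.2]])
Q = np.array([[[0.8, 0.2, 0.0],
               [0.1, 0.8, 0.1],
               [0.0, 0.3, 0.7]],
              [[0.5, 0.5, 0.0],
               [0.2, 0.6, 0.2],
               [0.1, 0.1, 0.8]]])

def T(v):
    """Bellman operator: (Tv)(x) = min_u { c(x,u) + beta * sum_y Q(y|x,u) v(y) }."""
    # Q[u] @ v is, for each action u, the vector of expected continuation costs.
    return np.min(c + beta * np.stack([Q[u] @ v for u in range(2)], axis=1), axis=1)

v = np.zeros(3)                  # v_0 = 0, as in Theorem 5.4.1
for n in range(500):
    v_new = T(v)
    if np.max(np.abs(v_new - v)) < 1e-10:
        break
    v = v_new

# The limit is a fixed point of T (the value function).
assert np.allclose(T(v), v, atol=1e-8)
```

The contraction property itself can be checked numerically: for any two vectors v, v', one has ||T(v) − T(v')||_∞ ≤ β||v − v'||_∞.<br />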
5.5 Linear Quadratic Gaussian Problem<br />
Consider the following linear system:<br />
\[ x_{t+1} = A x_t + B u_t + w_t, \]<br />
where x ∈ R^n, u ∈ R^m and w ∈ R^n. Suppose {w_t} is i.i.d. Gaussian with a given covariance matrix E[w_t w_t^T] = W for all t ≥ 0.<br />
The goal is to obtain<br />
\[ \inf_{\Pi} J(\Pi, x), \]<br />
where<br />
\[ J(\Pi, x) = E^{\Pi}_{x}\bigg[\sum_{t=0}^{T-1}\big(x_t^T Q x_t + u_t^T R u_t\big) + x_T^T Q_T x_T\bigg], \]<br />
with R, Q, Q_T > 0 (that is, these matrices are positive definite).<br />
In class, we obtained the Dynamic Programming recursion for the optimal control problem.<br />
Theorem 5.5.1 The optimal control is linear and has the form:<br />
\[ u_t = -(B^T P_{t+1} B + R)^{-1} B^T P_{t+1} A\, x_t, \]<br />
where P_t solves the Discrete-Time Riccati Equation:<br />
\[ P_t = Q + A^T P_{t+1} A - A^T P_{t+1} B (B^T P_{t+1} B + R)^{-1} B^T P_{t+1} A, \]<br />
with final condition P_T = Q_T.<br />
Theorem 5.5.2 As T → ∞ in the optimization problem above, if (A, B) is controllable and, with Q = C^T C, (A, C) is observable, then the optimal policy is stationary. Under the stated controllability and observability assumptions, the Riccati recursion<br />
\[ P_t = Q + A^T P_{t+1} A - A^T P_{t+1} B (B^T P_{t+1} B + R)^{-1} B^T P_{t+1} A \]<br />
converges to the unique fixed point P, which is furthermore positive definite.<br />
We observe that, if the system has noise and we consider an average cost optimization problem, the effect of the noise will stay bounded, and the same solution with P, and the induced control policy, will be optimal for the minimization of the expected average cost:<br />
\[ \lim_{T\to\infty} \frac{1}{T}\, E^{\Pi}_{x}\bigg[\sum_{t=0}^{T-1}\big(x_t^T Q x_t + u_t^T R u_t\big) + x_T^T Q_T x_T\bigg]. \]<br />
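A minimal numerical sketch of the backward Riccati recursion of Theorem 5.5.1 on an invented two-dimensional example (the matrices A, B, Q, R, Q_T below are assumptions for illustration). Iterated backward long enough, P_t approaches the stationary solution P of Theorem 5.5.2.<br />

```python
import numpy as np

# Hypothetical system and cost matrices for illustration.
A = np.array([[1.0, 0.1],
              [0.0, 1.0]])
B = np.array([[0.0],
              [0.1]])
Q = np.eye(2)            # Q = C^T C with C = I, so (A, C) is observable
R = np.array([[1.0]])
Q_T = np.eye(2)

def riccati_step(P_next):
    """One backward step of the Discrete-Time Riccati Equation."""
    S = B.T @ P_next @ B + R
    return Q + A.T @ P_next @ A - A.T @ P_next @ B @ np.linalg.inv(S) @ B.T @ P_next @ A

def gain(P_next):
    """Optimal feedback gain K, so that u_t = -K x_t."""
    return np.linalg.inv(B.T @ P_next @ B + R) @ B.T @ P_next @ A

P = Q_T                  # final condition P_T = Q_T
for _ in range(500):     # backward in time
    P = riccati_step(P)

# P has (numerically) converged to the unique positive definite fixed point,
# and the resulting stationary policy stabilizes the plant.
assert np.allclose(P, riccati_step(P), atol=1e-8)
assert np.all(np.linalg.eigvalsh(P) > 0)
K = gain(P)
assert np.all(np.abs(np.linalg.eigvals(A - B @ K)) < 1.0)
```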
5.6 Exercises<br />
Exercise 5.6.1 Consider the following problem considered in class. An investor's wealth dynamics is given by the following:<br />
x t+1 = u t w t ,<br />
where {w_t} is an i.i.d. R_+-valued stochastic process with E[w_t] = 1. The investor has access to the past and current wealth information and his previous actions. The goal is to maximize:<br />
\[ J(x_0, \Pi) = E^{\Pi}_{x_0}\bigg[\sum_{t=0}^{T-1} \sqrt{x_t - u_t}\bigg]. \]<br />
The investor’s action set for any given x is: U(x) = [0, x].<br />
Formulate the problem as an optimal stochastic control problem by clearly identifying the state, the control action spaces, the information available at the controller, the transition kernel and a cost functional mapping the actions and states to R.<br />
Find an optimal policy.<br />
Hint: For α ≥ 0, \(\sqrt{x-u} + \alpha\sqrt{u}\) is a concave function of u for 0 ≤ u ≤ x, and its maximum is computed by setting the derivative of \(\sqrt{x-u} + \alpha\sqrt{u}\) to zero.<br />
Exercise 5.6.2 Consider an investment problem in a stock market. Suppose there are N possible goods and suppose these goods have prices {p^i_t, i = 1, …, N} at time t taking values in a countable set P. Further, suppose that the stocks evolve stochastically according to a Markov transition such that for every 1 ≤ i ≤ N<br />
\[ P(p^i_{t+1} = m \,|\, p^i_t = n) = P_i(n, m). \]<br />
Suppose that there is a transaction cost of c for every buying and selling of each good. Suppose that the investor can only hold on to at most M units of stock. At time t, the investor has access to his current and past price information and his investment allocations up to time t.<br />
When a transaction is made, the following might occur. If the investor sells one unit of good i, he spends c dollars for the transaction and adds p^i_t dollars to his account. If he buys one unit of good j, then he takes out p^j_t dollars per unit from his account and also takes out c dollars. For every such transaction per unit, there is a cost of c dollars. If he does nothing, there is no cost. Assume that the investor can make at most one transaction per time stage.<br />
Suppose the investor initially has an unlimited amount of money and a given stock allocation.<br />
The goal is to maximize the total worth of the stock at the final time stage t = T ∈ Z_+, minus the total transaction costs, plus the money earned in between.<br />
a) Formulate the problem as a stochastic control problem by clearly identifying the state, the control actions, the information available at the controller, the transition kernel and a cost functional mapping the actions and states to R.<br />
b) By Dynamic Programming, generate a recursion which will lead to an optimal control policy.<br />
Exercise 5.6.3 Consider a special case of the above problem, where the objective is to maximize the final worth of the investment and minimize the transaction cost. Furthermore, suppose there is only one uniform stock price with a transition kernel given by<br />
\[ P(p_{t+1} = m \,|\, p_t = n) = P(n, m). \]<br />
The investor can again hold at most M units of the good, but he might wish to purchase more, or sell some.<br />
a) Formulate the problem as a stochastic control problem by clearly identifying the state, the control actions, the transition kernel and a cost functional mapping the actions and states to R.<br />
b) By Dynamic Programming, generate a recursion which will lead to an optimal control policy.<br />
c) What is the optimal control policy?<br />
Exercise 5.6.4 A fishery manager annually has x_t units of fish and sells u_t x_t of them, where u_t ∈ [0, 1]. With the remaining stock, the next year's production is given by the following model:<br />
\[ x_{t+1} = w_t x_t (1 - u_t) + w_t, \]<br />
with x_0 given and {w_t} an independent, identically distributed sequence of random variables with w_t ≥ 0 for all t, so that E[w_t] = \(\tilde w\) ≥ 0.<br />
The goal is to maximize the profit over the time horizon 0 ≤ t ≤ T − 1. At time T, he sells all of the fish.<br />
a) Formulate the problem as an optimal stochastic control problem by clearly identifying the state, the control actions, the information available at the controller, the transition kernel and a cost functional mapping the actions and states to R.<br />
b) Does there exist an optimal policy? If it does, compute the optimal control policy as a dynamic programming recursion.<br />
Exercise 5.6.5 Consider the following linear system:<br />
\[ x_{t+1} = A x_t + B u_t + w_t, \]<br />
where x ∈ R^n, u ∈ R^m and w ∈ R^n. Suppose {w_t} is i.i.d. Gaussian with a given covariance matrix E[w_t w_t^T] = W for all t ≥ 0.<br />
The goal is to obtain<br />
\[ \inf_{\Pi} J(\Pi, x), \]<br />
where<br />
\[ J(\Pi, x) = E^{\Pi}_{x}\bigg[\sum_{t=0}^{T-1}\big(x_t^T Q x_t + u_t^T R u_t\big) + x_T^T Q_T x_T\bigg], \]<br />
with R, Q, Q_T > 0 (that is, these matrices are positive definite).<br />
a) Show that there exists an optimal policy.<br />
b) Obtain the Dynamic Programming recursion for the optimal control problem. Is the optimal control policy Markov? Is it stationary?<br />
c) For T → ∞, if (A, B) is controllable and, with Q = C^T C, (A, C) is observable, prove that the optimal policy is stationary.<br />
Exercise 5.6.6 Consider an unemployed person who will have to work for years t = 1, 2, …, 10 if she takes a job at any given t.<br />
Suppose that in each year in which she remains unemployed, she may be offered a good job that pays 10 dollars per year (with probability 1/4); she may be offered a bad job that pays 4 dollars per year (with probability 1/4); or she may not be offered a job (with probability 1/2). These job offer events are independent from year to year (that is, the job market is represented by an independent sequence of random variables for every year).<br />
Once she accepts a job, she will remain in that job for the rest of the ten years. That is, for example, she cannot switch from the bad job to the good job.<br />
Suppose the goal is to maximize the expected total earnings in ten years, starting from year 1 up to year 10 (including year 10).<br />
Should this person accept a good job or a bad job at any given year? What is her best policy?<br />
State the problem as a Markov Decision Problem, identifying the state space, the action space and the transition kernel.<br />
Obtain the optimal solution.<br />
Chapter 6<br />
Partially Observed Markov Decision<br />
Processes<br />
As discussed earlier in Chapter 2, consider a system of the form:<br />
\[ x_{t+1} = f(x_t, u_t, w_t), \qquad y_t = g(x_t, v_t). \]<br />
Here, x_t is the state, u_t ∈ U is the control, and (w_t, v_t) ∈ W × V are second order, zero-mean, i.i.d. noise processes with w_t independent of v_t. We also assume that the state noise w_t either has a probability mass function, or admits a probability measure which is absolutely continuous with respect to the Lebesgue measure; this will ensure that the probability measure admits a density function. Hence, the notation p(x) will denote either the probability mass for discrete-valued spaces or the probability density function for uncountable spaces. The controller only has causal access to the second component {y_t} of the process: a policy Π is a sequence of functions measurable with respect to σ(I_t), where I_t = {y_{[0,t]}, u_{[0,t-1]}}. We denote the observed history space as: H_0 := P, H_t = H_{t-1} × Y × U. Here P denotes the space of probability measures on X, which we assume to be a Borel space. Hence, the set of randomized causal control policies is such that P(u(h_t) ∈ U | I_t) = 1 for all I_t ∈ H_t.<br />
6.1 Enlargement of the State-Space and the Construction of a Controlled Markov Chain<br />
One can transform a partially observed Markov Decision Problem into a fully observed Markov Decision Problem via an enlargement of the state space. In particular, we obtain via the properties of total probability the following dynamical recursion:<br />
\[ \pi_t(x) := P(x_t = x \,|\, y_{[0,t]}, u_{[0,t-1]}) = \frac{\sum_{x_{t-1}\in X} \pi_{t-1}(x_{t-1}) P(y_t|x_t) P(x_t|x_{t-1},u_{t-1}) P(u_{t-1}|y_{[0,t-1]},u_{[0,t-2]})}{\sum_{x_t\in X}\sum_{x_{t-1}\in X} \pi_{t-1}(x_{t-1}) P(y_t|x_t) P(x_t|x_{t-1},u_{t-1}) P(u_{t-1}|y_{[0,t-1]},u_{[0,t-2]})}. \]<br />
Here, the summations need to be exchanged with integrals in case the variables live in an uncountable space.<br />
The conditional measure process becomes a controlled Markov chain in P.<br />
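A minimal sketch of this belief (conditional measure) update for a finite state space, on an invented two-state, two-action, two-observation example (the kernels below are assumptions for illustration). The policy term P(u_{t-1} | ·) cancels between the numerator and the denominator, so it is omitted.<br />

```python
import numpy as np

# Hypothetical finite model: P_trans[u][x, x'] = P(x_t = x' | x_{t-1} = x, u_{t-1} = u),
# P_obs[x, y] = P(y_t = y | x_t = x).
P_trans = np.array([[[0.9, 0.1],
                     [0.2, 0.8]],
                    [[0.5, 0.5],
                     [0.5, 0.5]]])
P_obs = np.array([[0.8, 0.2],
                  [0.3, 0.7]])

def belief_update(pi, u, y):
    """pi_t from pi_{t-1}, the applied control u and the new observation y."""
    # Numerator: sum_{x_{t-1}} pi_{t-1}(x_{t-1}) P(x_t|x_{t-1},u) P(y_t|x_t).
    unnormalized = (pi @ P_trans[u]) * P_obs[:, y]
    # Denominator: the same quantity summed over x_t.
    return unnormalized / unnormalized.sum()

pi = np.array([0.5, 0.5])          # prior
pi = belief_update(pi, u=0, y=1)   # one filtering step
assert abs(pi.sum() - 1.0) < 1e-12 # the update is again a probability measure on X
```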
Theorem 6.1.1 The process {π t , u t } is a controlled Markov chain. That is, under any admissible control policy,<br />
given the action at time t ≥ 0 <strong>and</strong> π t , π t+1 is conditionally independent from {π s , u s , s ≤ t − 1}.<br />
Proof: Suppose X is countable. Let D ∈ B(P(X)). Writing π_{t+1} via the filtering recursion,<br />
\[ P(\pi_{t+1} \in D \,|\, \pi_s, u_s, s \le t) = P\bigg( \frac{\sum_{x_t} \pi_t(x_t) P(y_{t+1}|x_{t+1}) P(x_{t+1}|x_t,u_t) P(u_t|y_{[0,t]},u_{[0,t-1]})}{\sum_{x_{t+1}}\sum_{x_t} \pi_t(x_t) P(y_{t+1}|x_{t+1}) P(x_{t+1}|x_t,u_t) P(u_t|y_{[0,t]},u_{[0,t-1]})} \in D \,\bigg|\, \pi_s, u_s, s \le t \bigg) \]<br />
\[ = P\bigg( \frac{\sum_{x_t} \pi_t(x_t) P(y_{t+1}|x_{t+1}) P(x_{t+1}|x_t,u_t)}{\sum_{x_{t+1}}\sum_{x_t} \pi_t(x_t) P(y_{t+1}|x_{t+1}) P(x_{t+1}|x_t,u_t)} \in D \,\bigg|\, \pi_s, u_s, s \le t \bigg) \]<br />
\[ = P\bigg( \frac{\sum_{x_t} \pi_t(x_t) P(y_{t+1}|x_{t+1}) P(x_{t+1}|x_t,u_t)}{\sum_{x_{t+1}}\sum_{x_t} \pi_t(x_t) P(y_{t+1}|x_{t+1}) P(x_{t+1}|x_t,u_t)} \in D \,\bigg|\, \pi_t, u_t \bigg), \]<br />
where the second equality follows since the policy terms cancel between the numerator and the denominator, and the last equality follows since the expression depends on the past only through (π_t, u_t). ⋄<br />
Let the cost function to be minimized be<br />
\[ E^{\Pi}_{x_0}\bigg[\sum_{t=0}^{T-1} c(x_t, u_t)\bigg], \]<br />
where E^Π_{x_0}[·] denotes the expectation over all sample paths with initial state given by x_0 under policy Π. We can transform the system into a fully observed Markov model as follows. Since<br />
\[ E^{\Pi}_{x_0}\bigg[\sum_{t=0}^{T-1} c(x_t, u_t)\bigg] = E^{\Pi}_{x_0}\bigg[\sum_{t=0}^{T-1} E[c(x_t,u_t)|I_t]\bigg], \]<br />
we can write the per-stage cost function as<br />
\[ \tilde c(\pi, u) = \sum_{x \in X} c(x,u)\pi(x), \qquad \pi \in P. \quad (6.1) \]<br />
The stochastic transition kernel q is given by:<br />
\[ q(x, y \,|\, \pi, u) = \sum_{x' \in X} P(x, y | x', u)\, \pi(x'), \qquad \pi \in P, \]<br />
and this kernel can be decomposed as q(x, y|π, u) = P(y|π, u)P(x|π, u, y).<br />
The second term here is the filtering equation, mapping (π, u, y) ∈ (P × U × Y) to P. It follows that (P, U, K, ˜c) defines a completely observable controlled Markov process. Here, we have<br />
\[ K(B | \pi, u) = \sum_{y \in Y} 1_{\{P(\,\cdot\,|\pi,u,y) \in B\}}\, P(y | \pi, u), \qquad \forall B \in \sigma(P). \]<br />
As such, one can obtain the optimal solution by using the filtering equation as a sufficient statistic in a centralized setting, as Markov policies (policies that use the Markov state as their sufficient statistic) are optimal for the control of Markov chains, under well-studied sufficiency conditions for the existence of optimal selectors.<br />
We call control policies which use π as their information to generate the control separated control policies. This will become clearer in the context of Linear Quadratic Gaussian control systems.<br />
Since in our analysis earlier, we had assumed that the state belongs to a complete, separable, metric space, the<br />
analysis here follows the analysis in the previous chapter, provided that the measurability, continuity, compactness<br />
conditions are verified.<br />
6.2 Linear Quadratic Gaussian Case and Kalman Filtering<br />
Consider the following linear system:<br />
\[ x_{t+1} = A x_t + B u_t + w_t, \]<br />
and<br />
\[ y_t = C x_t + v_t, \]<br />
where x ∈ R^n, u ∈ R^m, w ∈ R^n, y ∈ R^p and v ∈ R^p. Suppose {w_t} and {v_t} are i.i.d. Gaussian random vectors with given covariance matrices E[w_t w_t^T] = W and E[v_t v_t^T] = V for all t ≥ 0.<br />
The goal is to obtain<br />
\[ \inf_{\Pi} J(\Pi, \mu_0), \]<br />
where<br />
\[ J(\Pi, \mu_0) = E^{\Pi}_{\mu_0}\bigg[\sum_{t=0}^{T-1}\big(x_t^T Q x_t + u_t^T R u_t\big) + x_T^T Q_T x_T\bigg], \]<br />
with R > 0 and Q, Q_T ≥ 0 (that is, R is positive definite and Q, Q_T are positive semi-definite).<br />
We observe that the optimal control is linear and has the form:<br />
\[ u_t = -(B^T K_{t+1} B + R)^{-1} B^T K_{t+1} A\, E[x_t | I_t], \]<br />
where K_t solves the Discrete-Time Riccati Equation:<br />
\[ K_t = Q + A^T K_{t+1} A - A^T K_{t+1} B (B^T K_{t+1} B + R)^{-1} B^T K_{t+1} A, \]<br />
with final condition K_T = Q_T. We obtained<br />
\[ J_t(I_t) = E[x_t^T K_t x_t] + \sum_{k=t}^{T-1}\Big( E\big[(x_k - E[x_k|I_k])^T Q_k (x_k - E[x_k|I_k])\big] + E[\tilde w_k^T K_{k+1} \tilde w_k] \Big), \quad (6.2) \]<br />
where ˜w_t = Σ_{t|t−1} C^T (C Σ_{t|t−1} C^T + V)^{-1} (y_t − C E[x_t|I_{t−1}]) and Σ is to be derived below.<br />
6.2.1 Separation of Estimation and Control<br />
In the above problem, we observed that the optimal control has a separation structure: the controller first estimates the state, and then applies its control action, regarding the estimate as the state itself.<br />
We say that control has a dual effect if the control affects the estimation quality. Typically, when the dual effect is absent, the separation principle observed above applies. In the most general problems, however, the dual effect of the control is present; that is, depending on the control, the estimation quality at the controller will be affected. As an example, consider a linear system controlled over the Internet, where the controller applies a control but does not know whether the packet reaches the destination or not. In this case, the control signal which was intended to be sent does affect the estimation error quality.<br />
6.3 Estimation and Kalman Filtering<br />
Lemma 6.3.1 Let x, y be random variables with finite second moments and let R > 0. The function which solves inf_{g(·)} E[(x − g(y))^T R (x − g(y))] is G(y) = E[x|y].<br />
Proof: Let G(y) = E[x|y] + h(y) for some measurable h. Then, P-a.s., writing x − G(y) = (x − E[x|y]) − h(y),<br />
\[ E\big[(x - E[x|y] - h(y))^T R (x - E[x|y] - h(y)) \,\big|\, y\big] \]<br />
\[ = E\big[(x - E[x|y])^T R (x - E[x|y]) \,\big|\, y\big] + h(y)^T R\, h(y) - 2\, E\big[(x - E[x|y])^T \,\big|\, y\big] R\, h(y) \]<br />
\[ = E\big[(x - E[x|y])^T R (x - E[x|y]) \,\big|\, y\big] + h(y)^T R\, h(y) \]<br />
\[ \ge E\big[(x - E[x|y])^T R (x - E[x|y]) \,\big|\, y\big], \quad (6.3) \]<br />
since E[x − E[x|y] | y] = 0 and R > 0. ⋄<br />
We note that the above also admits a Hilbert space formulation. Let H denote the space of random variables with finite second moments, on which the inner product 〈x, y〉 = E[x^T R y] is defined. Let M_H be the closed subspace of H consisting of the random variables which are measurable on σ(y) and have finite second moments. Then, the Projection Theorem leads to the observation that an optimal estimate g(y) minimizing ||x − g(y)||², denoted here by G(y), is one which satisfies:<br />
\[ \langle x - G(y), h(y) \rangle = E\big[(x - G(y))^T R\, h(y)\big] = 0, \qquad \forall h \in M_H. \]<br />
The conditional expectation satisfies this since:<br />
\[ E\big[(x - E[x|y])^T R\, h(y)\big] = E\Big[E\big[(x - E[x|y])^T \,\big|\, y\big] R\, h(y)\Big] = 0, \]<br />
since, P-a.s., E[(x − E[x|y])^T | y] = 0.<br />
For a linear Gaussian system, the state process has a Gaussian probability measure. A Gaussian probability measure can be uniquely identified by its mean and covariance. This makes the analysis for estimating a Gaussian random variable particularly simple to perform, since the conditional estimate of a partially observed (through an additive Gaussian noise) Gaussian random variable is a linear function of the observed variable.<br />
Recall that a Gaussian measure with mean µ and covariance matrix K_{XX} has the following density:<br />
\[ p(x) = \frac{1}{(2\pi)^{n/2} |K_{XX}|^{1/2}}\, e^{-\frac{1}{2}(x-\mu)^T K_{XX}^{-1} (x-\mu)}. \]<br />
Lemma 6.3.2 Let x, y be zero-mean, jointly Gaussian random vectors; E[x|y] is linear in y (if not zero-mean, then the conditional expectation is affine). Let<br />
\[ K := \begin{bmatrix} \Sigma_{XX} & \Sigma_{XY} \\ \Sigma_{YX} & \Sigma_{YY} \end{bmatrix}, \]<br />
where Σ_{XX} = E[xx^T], Σ_{XY} = E[xy^T] and Σ_{YY} = E[yy^T]. Then, E[x|y] = Σ_{XY} Σ_{YY}^{-1} y and<br />
\[ E\big[(x - E[x|y])(x - E[x|y])^T\big] = \Sigma_{XX} - \Sigma_{XY} \Sigma_{YY}^{-1} \Sigma_{XY}^T =: D. \]<br />
Proof: By Bayes' rule and the fact that the processes admit densities, p(x|y) = p(x,y)/p(y). The inverse K^{-1} is also symmetric (since the eigenvectors are the same and the eigenvalues are inverted); write it as<br />
\[ K^{-1} = \begin{bmatrix} \Psi_{XX} & \Psi_{XY} \\ \Psi_{YX} & \Psi_{YY} \end{bmatrix}. \]<br />
Thus,<br />
\[ \frac{p(x,y)}{p(y)} = C\, \frac{e^{-\frac{1}{2}\left(x^T \Psi_{XX} x + 2 x^T \Psi_{XY} y + y^T \Psi_{YY} y\right)}}{e^{-\frac{1}{2} y^T \Sigma_{YY}^{-1} y}}. \]<br />
By the method of completing the squares, one obtains a quadratic exponent of the form<br />
\[ p(x|y) = Q(y)\, e^{-\frac{1}{2}(x - Hy)^T D^{-1} (x - Hy)}, \]<br />
where Q(y) depends only on y. Since ∫ p(x|y) dx = 1, it follows that<br />
\[ Q(y) = \frac{1}{(2\pi)^{n/2} |D|^{1/2}}, \]<br />
which is in fact independent of y. ⋄<br />
Remark 6.3.1 Even if the random variables are not Gaussian (but zero-mean), through another Hilbert space formulation and an application of the Projection Theorem, the expression Σ_{XY} Σ_{YY}^{-1} y is the best linear estimate. One can naturally generalize this to random variables with non-zero mean.<br />
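A small numerical sketch of Lemma 6.3.2, with an invented scalar joint covariance for illustration: sampling jointly Gaussian (x, y), the estimation error of the linear estimate Σ_{XY} Σ_{YY}^{-1} y is (up to Monte Carlo error) orthogonal to y, and its variance matches D.<br />

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented joint covariance for zero-mean (x, y), both scalar for simplicity.
Sxx, Sxy, Syy = 2.0, 0.8, 1.0
K = np.array([[Sxx, Sxy],
              [Sxy, Syy]])

# Sample zero-mean jointly Gaussian pairs (x, y).
z = rng.multivariate_normal(np.zeros(2), K, size=200_000)
x, y = z[:, 0], z[:, 1]

x_hat = (Sxy / Syy) * y            # E[x|y] = Sigma_XY Sigma_YY^{-1} y
err = x - x_hat
D = Sxx - Sxy**2 / Syy             # predicted error covariance

# Orthogonality: the error is uncorrelated with y ...
assert abs(np.mean(err * y)) < 0.02
# ... and its variance matches D = Sigma_XX - Sigma_XY Sigma_YY^{-1} Sigma_XY^T.
assert abs(np.var(err) - D) < 0.05
```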
We will derive the Kalman filter in the following. The following two lemmas are instrumental.<br />
Lemma 6.3.3 If E[x] = 0 and z_1, z_2 are independent, zero-mean Gaussian random vectors, then E[x|z_1, z_2] = E[x|z_1] + E[x|z_2].<br />
Proof: The proof follows by writing z = [z_1, z_2] and E[x|z] = Σ_{XZ} Σ_{ZZ}^{-1} z, where Σ_{ZZ} is block diagonal by independence. ⋄<br />
Lemma 6.3.4 E[(x − E[x|y])(x − E[x|y])^T] does not depend on y and is given by D above.<br />
Now, we can move on to the derivation of the Kalman filter. Consider:<br />
\[ x_{t+1} = A x_t + w_t, \qquad y_t = C x_t + v_t, \]<br />
with E[w_t w_t^T] = W and E[v_t v_t^T] = V.<br />
Define:<br />
\[ m_t = E[x_t | y_{[0,t-1]}], \qquad \Sigma_{t|t-1} = E\big[(x_t - m_t)(x_t - m_t)^T \,\big|\, y_{[0,t-1]}\big]. \]<br />
Theorem 6.3.1 The following holds:<br />
\[ m_{t+1} = A m_t + A \Sigma_{t|t-1} C^T (C \Sigma_{t|t-1} C^T + V)^{-1} (y_t - C m_t), \]<br />
\[ \Sigma_{t+1|t} = A \Sigma_{t|t-1} A^T + W - (A \Sigma_{t|t-1} C^T)(C \Sigma_{t|t-1} C^T + V)^{-1} (C \Sigma_{t|t-1} A^T), \]<br />
with m_0 = E[x_0] and Σ_{0|−1} = E[x_0 x_0^T].<br />
Proof: Since x_{t+1} = A x_t + w_t,<br />
\[ m_{t+1} = E[A x_t + w_t \,|\, y_{[0,t]}] = E\big[A x_t + w_t \,\big|\, y_{[0,t-1]},\, y_t - E[y_t|y_{[0,t-1]}]\big] \]<br />
\[ = A m_t + E\big[A(x_t - m_t) \,\big|\, y_{[0,t-1]}\big] + E\big[A(x_t - m_t) \,\big|\, y_t - E[y_t|y_{[0,t-1]}]\big] + E\big[w_t \,\big|\, y_t - E[y_t|y_{[0,t-1]}]\big] \]<br />
\[ = A m_t + E\big[A(x_t - m_t) + w_t \,\big|\, y_t - E[y_t|y_{[0,t-1]}]\big], \]<br />
where we used Lemma 6.3.3 with the independent components y_{[0,t-1]} and the innovation y_t − E[y_t|y_{[0,t-1]}] = y_t − C m_t, together with E[A(x_t − m_t)|y_{[0,t-1]}] = 0.<br />
We now apply Lemma 6.3.2: with x = A(x_t − m_t) + w_t and y = y_t − C m_t, the formula E[x|y] = Σ_{xy} Σ_{yy}^{-1} y leads to:<br />
\[ m_{t+1} = A m_t + A \Sigma_{t|t-1} C^T (C \Sigma_{t|t-1} C^T + V)^{-1} (y_t - C m_t). \]<br />
Likewise, the error recursion<br />
\[ x_{t+1} - m_{t+1} = A(x_t - m_t) + w_t - A \Sigma_{t|t-1} C^T (C \Sigma_{t|t-1} C^T + V)^{-1} (y_t - C m_t) \]<br />
leads to, after a few lines of calculations, the covariance update:<br />
\[ \Sigma_{t+1|t} = A \Sigma_{t|t-1} A^T + W - (A \Sigma_{t|t-1} C^T)(C \Sigma_{t|t-1} C^T + V)^{-1} (C \Sigma_{t|t-1} A^T). \]<br />
⋄<br />
The above is the celebrated Kalman filter.<br />
Define now<br />
\[ \tilde m_t = E[x_t | y_{[0,t]}] = m_t + E[x_t - m_t \,|\, y_{[0,t]}] = m_t + E\big[x_t - m_t \,\big|\, y_t - E[y_t|y_{[0,t-1]}]\big]. \]<br />
It follows from m_t = A \tilde m_{t-1} that<br />
\[ \tilde m_t = A \tilde m_{t-1} + \Sigma_{t|t-1} C^T (C \Sigma_{t|t-1} C^T + V)^{-1} (y_t - C A \tilde m_{t-1}). \]<br />
We observe that the zero-mean variable x_t − ˜m_t is orthogonal to I_t, in the sense that the error is independent of the information available at the controller; since the information available is Gaussian, independence and orthogonality are equivalent.<br />
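A minimal numerical sketch of the recursions in Theorem 6.3.1, propagating m_t and Σ_{t|t−1} along a simulated trajectory; the system matrices below are invented for the example.<br />

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical system with a scalar observation, for illustration.
A = np.array([[1.0, 0.1],
              [0.0, 0.9]])
C = np.array([[1.0, 0.0]])
W = 0.01 * np.eye(2)                 # E[w_t w_t^T]
V = np.array([[0.1]])                # E[v_t v_t^T]

def kalman_step(m, Sigma, y):
    """One step of Theorem 6.3.1: (m_t, Sigma_{t|t-1}, y_t) -> (m_{t+1}, Sigma_{t+1|t})."""
    S = C @ Sigma @ C.T + V
    gain = A @ Sigma @ C.T @ np.linalg.inv(S)
    m_next = A @ m + gain @ (y - C @ m)
    Sigma_next = A @ Sigma @ A.T + W - gain @ (C @ Sigma @ A.T)
    return m_next, Sigma_next

x = np.zeros((2, 1))                 # true state
m = np.zeros((2, 1))                 # m_0 = E[x_0]
Sigma = np.eye(2)                    # Sigma_{0|-1}
for t in range(500):
    y = C @ x + rng.normal(0.0, np.sqrt(V[0, 0]), size=(1, 1))
    m, Sigma = kalman_step(m, Sigma, y)
    x = A @ x + rng.multivariate_normal([0.0, 0.0], W).reshape(2, 1)

# Consistent with Theorem 6.3.2, the covariance recursion has converged.
_, Sigma2 = kalman_step(m, Sigma, C @ x)
assert np.allclose(Sigma2, Sigma, atol=1e-6)
```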
We observe that the recursions in Theorem 6.3.1 remind us of the recursions in Theorem 5.5.2. This leads to the following result.<br />
Theorem 6.3.2 Suppose that W = BB^T, that (A, B) is controllable and (A, C) is observable, and that V > 0. Then, the recursions for the covariance matrices Σ_{(·)} in Theorem 6.3.1 converge to a unique solution which is positive definite.<br />
6.3.1 Optimal Control of Partially Observed LQG Systems<br />
With this observation, in class we reformulated the quadratic optimization problem in terms of ˜m_t, u_t and x_t − ˜m_t; the optimal control problem is equivalent to the control of the fully observed state ˜m_t, with an additive, time-varying, independent Gaussian noise process.<br />
In particular, the cost:<br />
\[ J(\Pi, \mu_0) = E^{\Pi}_{\mu_0}\bigg[\sum_{t=0}^{T-1}\big(x_t^T Q x_t + u_t^T R u_t\big) + x_T^T Q_T x_T\bigg] \]<br />
writes as:<br />
\[ E^{\Pi}_{\mu_0}\bigg[\sum_{t=0}^{T-1}\big(\tilde m_t^T Q \tilde m_t + u_t^T R u_t\big) + \tilde m_T^T Q_T \tilde m_T\bigg] + E^{\Pi}_{\mu_0}\bigg[\sum_{t=0}^{T-1}(x_t - \tilde m_t)^T Q (x_t - \tilde m_t) + (x_T - \tilde m_T)^T Q_T (x_T - \tilde m_T)\bigg] \]<br />
for the fully observed system:<br />
\[ \tilde m_t = A \tilde m_{t-1} + \bar w_t, \]<br />
with<br />
\[ \bar w_t = \Sigma_{t|t-1} C^T (C \Sigma_{t|t-1} C^T + V)^{-1} (y_t - C A \tilde m_{t-1}). \]<br />
That we can take the error term (x_t − ˜m_t) out of the summation is a consequence of what is known as the separation of estimation and control; a more special version is known as the certainty equivalence principle. In class, we observed that the absence of a dual effect plays a key part in this analysis, leading to the separation of estimation and control principle, in taking E[(x_t − ˜m_t)^T Q (x_t − ˜m_t)] out of the conditioning on I_t.<br />
Thus, the optimal control has the following form for all time stages:

u_t = −( B^T K_{t+1} B + R )^{−1} B^T K_{t+1} A E[x_t | I_t],

and the optimal cost will be

˜m_0^T P_0 ˜m_0 + E[ ∑_{k=0}^{T−1} ¯w_k^T P_{k+1} ¯w_k ].
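A minimal simulation sketch of the separated controller for a scalar system of the assumed standard form x_{t+1} = A x_t + B u_t + w_t, y_t = C x_t + v_t (all numerical values are illustrative assumptions, not from the notes): the backward Riccati recursion produces the gains K_t, a Kalman filter produces E[x_t | I_t], and the control applies the LQR gain to the estimate.

```python
import random

# Sketch of the separated LQG controller for a scalar system of the assumed form
#   x_{t+1} = A x_t + B u_t + w_t,   y_t = C x_t + v_t.
# All numerical values are illustrative assumptions.
A, B, C, Q, R, W, V, T = 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 20

# Backward control Riccati recursion; terminal weight Q_T = Q here.
K = [0.0] * (T + 1)
K[T] = Q
for t in range(T - 1, -1, -1):
    K[t] = Q + A * K[t + 1] * A - (B * K[t + 1] * A) ** 2 / (B * K[t + 1] * B + R)

def lqr_gain(t):
    # scalar form of u_t = -(B^T K_{t+1} B + R)^{-1} B^T K_{t+1} A E[x_t | I_t]
    return (B * K[t + 1] * A) / (B * K[t + 1] * B + R)

def simulate(seed=0):
    """Certainty-equivalent control: the LQR gain acts on the Kalman estimate."""
    rng = random.Random(seed)
    x, m, sigma, cost = rng.gauss(0.0, 1.0), 0.0, 1.0, 0.0
    for t in range(T):
        u = -lqr_gain(t) * m                        # uses only E[x_t | I_t]
        cost += Q * x * x + R * u * u
        x = A * x + B * u + rng.gauss(0.0, W ** 0.5)
        mp, sp = A * m + B * u, A * sigma * A + W   # predict
        y = C * x + rng.gauss(0.0, V ** 0.5)
        gain = sp * C / (C * sp * C + V)            # Kalman gain
        m, sigma = mp + gain * (y - C * mp), (1.0 - gain * C) * sp
    return cost + Q * x * x

print(simulate() >= 0.0)  # a nonnegative sample-path cost
```

Note that the controller never sees x_t directly; the estimator and the control gains are computed entirely separately, as the separation principle asserts.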
6.4 Partially Observed Markov Decision Processes in General Spaces<br />
Topological Properties <strong>of</strong> Space <strong>of</strong> Probability Measures<br />
In class we discussed three convergence notions: weak convergence, setwise convergence, and convergence under total variation.
Let P(R^N) denote the family of all probability measures on (R^N, B(R^N)) for some N ∈ N. Let {µ_n, n ∈ N} be a sequence in P(R^N).
A sequence {µ_n} is said to converge to µ ∈ P(R^N) weakly if

∫_{R^N} c(x) µ_n(dx) → ∫_{R^N} c(x) µ(dx)

for every continuous and bounded c : R^N → R. On the other hand, {µ_n} is said to converge to µ ∈ P(R^N) setwise if

∫_{R^N} c(x) µ_n(dx) → ∫_{R^N} c(x) µ(dx)
for every measurable <strong>and</strong> bounded c : R N → R. Setwise convergence can also be defined through pointwise<br />
convergence on Borel subsets <strong>of</strong> R N (see, e.g., [18]), that is<br />
µ n (A) → µ(A), for all A ∈ B(R N )<br />
since the space <strong>of</strong> simple functions are dense in the space <strong>of</strong> bounded <strong>and</strong> measurable functions under the<br />
supremum norm.<br />
For two probability measures µ, ν ∈ P(R^N), the total variation metric is given by

‖µ − ν‖_TV := 2 sup_{B∈B(R^N)} |µ(B) − ν(B)| = sup_{f: ‖f‖_∞≤1} | ∫ f(x) µ(dx) − ∫ f(x) ν(dx) |,   (6.5)

where the supremum in the last expression is over all measurable real f such that ‖f‖_∞ = sup_{x∈R^N} |f(x)| ≤ 1. A sequence {µ_n} is said to converge to µ ∈ P(R^N) in total variation if ‖µ_n − µ‖_TV → 0.
Setwise convergence is equivalent to pointwise convergence on Borel sets whereas total variation requires uniform<br />
convergence on Borel sets. Thus these three convergence notions are in increasing order <strong>of</strong> strength: convergence<br />
in total variation implies setwise convergence, which in turn implies weak convergence.<br />
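The gaps between the three notions can be seen already on point masses: δ_{1/n} → δ_0 weakly, but not setwise (test the set A = {0}) and not in total variation (‖δ_{1/n} − δ_0‖_TV = 2 for every n). A small numerical illustration follows; integration against a point mass is simply evaluation at the point.

```python
import math

# mu_n = delta_{1/n} converges to mu = delta_0 weakly, but neither setwise
# nor in total variation.
def integrate(c, point):
    # integral of c against the point mass at `point`
    return c(point)

c_cont = math.cos                    # a continuous and bounded test function

def c_meas(x):                       # bounded, measurable, discontinuous at 0
    return 1.0 if x == 0 else 0.0    # this is the indicator of A = {0}

ns = (1, 10, 100, 1000)
weak_gap = [abs(integrate(c_cont, 1.0 / n) - integrate(c_cont, 0.0)) for n in ns]
setwise_gap = [abs(integrate(c_meas, 1.0 / n) - integrate(c_meas, 0.0)) for n in ns]
tv_distance = 2.0  # 2 sup_B |mu_n(B) - mu(B)| is attained at B = {0} for every n

print(weak_gap[-1] < 1e-6)   # True: the weak-convergence gap vanishes
print(setwise_gap[-1])       # stays at 1.0: no setwise convergence
```

Since the setwise gap on A = {0} never decreases, neither setwise convergence nor (a fortiori) total variation convergence holds, while every continuous bounded test function certifies weak convergence.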
Weak convergence may not be strong enough for certain measurable selection conditions. One could use other<br />
metrics, such as total variation, or other topologies, such as the one generated by setwise convergence.<br />
Total variation is a stringent notion of convergence. For example, a sequence of discrete probability measures never converges in total variation to a probability measure which admits a density function with respect to the Lebesgue measure; furthermore, the space of probability measures under the total variation metric is not separable. Setwise convergence also induces a topology on the space of probability measures and channels, which is not easy to work with, since the space under this convergence is not metrizable ([15, p. 59]).
However, the space <strong>of</strong> probability measures on a complete, separable, metric (Polish) space endowed with the<br />
topology <strong>of</strong> weak convergence is itself a complete, separable, metric space [3]. The Prohorov metric, for example,<br />
can be used to metrize this space.<br />
In the following, we will verify the sufficiency <strong>of</strong> weak convergence.<br />
The discussion in the previous chapter for countable spaces applies also to Polish spaces. In particular, with π denoting a conditional probability measure, π_t(A) = P(x_t ∈ A | I_t), we can define a new cost as

˜c(π, u) = ∑_{x∈X} c(x, u) π(x),  π ∈ P(X).
Let Γ(S) be the set of all Borel measurable and bounded functions from S to R. We first wish to find a topology on P(S) under which functions of the form

Θ := { π ↦ ∫_{x∈S} π(dx) f(x), f ∈ Γ(S) }

are measurable on P(S).
By the above definitions, it is evident that both setwise convergence and total variation are sufficient for measurability of cost functions of the form ∫ π(dx) f(x), since these are (sequentially) continuous on P(S) for every f ∈ Γ(S). However, as we state in the following, the cost function ˜c : P(X) × U → R defined in (6.1) is measurable under the weak convergence topology as well (for a proof, see Bogachev (p. 215 [7])).
Theorem 6.4.1 Let S be a metric space and let M be the set of all measurable and bounded functions f : S → R for which

∫ π(dx) f(x)

defines a measurable function on P(S) under the weak convergence topology. Then, M is the same as the set Γ(S) of all bounded and measurable functions.
This discussion is useful to establish that, under the weak convergence topology, (π_t, u_t) forms a controlled Markov chain for dynamic programming purposes.
6.5 Exercises<br />
Exercise 6.5.1 Consider a linear system with the following dynamics:<br />
x t+1 = ax t + u t + w t ,<br />
<strong>and</strong> let the controller have access to the observations given by:<br />
y t = x t + v t .<br />
Here {w t , v t , t ∈ Z} are independent, zero-mean, Gaussian r<strong>and</strong>om variables, with variances E[w 2 ] <strong>and</strong> E[v 2 ].<br />
The controller at time t ∈ Z has access to I t = {y s , u s , s ≤ t − 1} ∪ {y t }.<br />
The initial state has a Gaussian distribution with zero mean and variance E[x_0^2]; we denote this distribution by ν_0. We wish to find, for some r > 0,

inf_Π J(x_0, Π) = E^Π_{ν_0}[ ∑_{t=0}^{3} ( x_t^2 + r u_t^2 ) ].

Compute the optimal control policy and the optimal cost. It suffices to provide a recursive form.

Hint: Show that the optimal control has a separation structure. Compute the conditional estimate through the Kalman Filter.
Exercise 6.5.2 Consider a Markov Decision Problem set up as follows. Let there be two possible links S^1, S^2 that a decision maker can use to send information packets to a destination. Each link is Markov, and for each of the links we have S^i ∈ {0, 1}, i = 1, 2; here, 0 denotes a good link and 1 denotes a bad link. If the decision maker picks one link, he can find out the state of that link. Suppose the goal is to maximize

E[ ∑_{t=0}^{T−1} ( 1_{{U_t=1}} 1_{{S^1_t=1}} + 1_{{U_t=2}} 1_{{S^2_t=1}} ) ].

Each of the links evolves according to a Markov model, independent of the other link.
The only information the decision maker has is the initial probability measure on the links and the actions it has taken. Obtain a dynamic programming recursion by clearly identifying the controlled Markov chain and the cost function. Is the value function convex? Can you deduce any structural properties of the optimal policies?
Chapter 7<br />
The Average Cost Problem<br />
Consider the following average cost problem:

J(x, Π) = limsup_{T→∞} (1/T) E^Π_x[ ∑_{t=0}^{T−1} c(x_t, u_t) ].
This is an important problem in applications where one is concerned with the long-term behavior, unlike the discounted cost setup, where the emphasis is effectively on the costs of the earlier stages.
7.1 Average Cost <strong>and</strong> the Average Cost Optimality Equation (ACOE)<br />
To study this problem, one approach is to establish the existence of a solution to an Average Cost Optimality Equation. We will discuss conditions under which the ACOE can be obtained. Later, we will discuss alternative approaches for obtaining optimal solutions.
We now discuss the Average Cost Optimality Equation (ACOE) for infinite horizon problems.<br />
Considering a sequence <strong>of</strong> normalized finite horizon optimization problems, a dynamic programming recursion<br />
<strong>of</strong> the form:<br />
J T (x 0 ) = inf<br />
u 0<br />
(c(x 0 , u 0 )) + (T − 1)E[J T −1 (x 1 )|x 0 , u 0 ])<br />
with J 0 (·) = 0, leads to the following result. If one takes T to infinity, the following discussion seems natural.<br />
Theorem 7.1.1 a) [Ross] If there exists a bounded function f and a constant g such that

g + f(x) = min_{u∈U} ( c(x, u) + ∫ f(y) P(dy|x, u) ),  x ∈ X,

then there exists a stationary deterministic policy Π* so that

g = J(x, Π*) = min_{Π∈Π_A} J(x, Π),

where

J(x, Π) = limsup_{T→∞} (1/T) E^Π_x[ ∑_{t=0}^{T−1} c(x_t, u_t) ].

b) If g, considered above, is not a constant and depends on x, and we have

g(x) + f(x) = min_{u∈U} ( c(x, u) + ∫ f(y) P(dy|x, u) ),  x ∈ X,
with

g(x) = min_u ∫ g(y) P(dy|x, u),

and the pointwise minimizations are achieved by a measurable function Π*, then

limsup_{T→∞} (1/T) E^{Π*}_x[ ∑_{t=0}^{T−1} g(x_t) ] ≤ inf_Π limsup_{T→∞} (1/T) E^Π_x[ ∑_{t=0}^{T−1} c(x_t, u_t) ].
Proof: We prove a). For b), which follows from a similar construction, see the text of Hernandez-Lerma and Lasserre. For any policy Π,

E^Π_x[ ∑_{t=1}^{n} ( f(x_t) − E^Π[ f(x_t) | x_{[0,t−1]}, u_{[0,t−1]} ] ) ] = 0.

Now,
E^Π[ f(x_t) | x_{[0,t−1]}, u_{[0,t−1]} ] = ∫_y f(y) P(x_t ∈ dy | x_{t−1}, u_{t−1})   (7.1)

= c(x_{t−1}, u_{t−1}) + ∫_y f(y) P(dy | x_{t−1}, u_{t−1}) − c(x_{t−1}, u_{t−1})   (7.2)

≥ min_{u_{t−1}∈U} ( c(x_{t−1}, u_{t−1}) + ∫_y f(y) P(dy | x_{t−1}, u_{t−1}) ) − c(x_{t−1}, u_{t−1})   (7.3)

= g + f(x_{t−1}) − c(x_{t−1}, u_{t−1}).   (7.4)
The above hold with equality if Π* is adopted, since Π* achieves the pointwise minimum. Hence,

0 ≤ (1/n) E^Π_x[ ∑_{t=1}^{n} ( f(x_t) − g − f(x_{t−1}) + c(x_{t−1}, u_{t−1}) ) ],

and

g ≤ E^Π_x[f(x_n)]/n − E^Π_x[f(x_0)]/n + (1/n) ∑_{t=1}^{n} E^Π_x[ c(x_{t−1}, u_{t−1}) ],

with equality under Π*.
⊓⊔<br />
We note that the boundedness condition on f can be relaxed, provided that E^Π_x[f(x_n)]/n → 0 under every policy and every x ∈ X.
7.2 The Discounted Cost Approach to the Average Cost Problem<br />
The average cost criterion emphasizes the asymptotic behavior of the per-stage costs, whereas the discounted cost criterion emphasizes the costs of the earlier stages. However, under technical restrictions, one can show that, in the limit as the discount factor converges to 1, one can obtain a solution to the average cost optimization problem. We now state one such condition below.
Theorem 7.2.1 Consider a controlled Markov chain where the state and action spaces are finite, and for every deterministic policy the entire state space is a recurrent set. Let

J_β(x) = inf_{Π∈Π_A} E^Π_x[ ∑_{t=0}^{∞} β^t c(x_t, u_t) ],

and suppose that Π*_n is an optimal deterministic policy for J_{β_n}(x). Then, there exists some Π* which is optimal for every β sufficiently close to 1, and is also optimal for the average cost

J(x) = inf_{Π∈Π_A} limsup_{T→∞} (1/T) E^Π_x[ ∑_{t=0}^{T−1} c(x_t, u_t) ].
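The statement can be checked numerically on a small assumed two-state, two-action model: as β ↑ 1, the discounted-optimal deterministic policy stops changing, and (1 − β) J_β(x) approaches the average cost of that policy. All model data below are illustrative assumptions.

```python
# Vanishing-discount check on an assumed two-state, two-action model:
# as beta -> 1, the discounted-optimal deterministic policy stabilizes and
# (1 - beta) * J_beta(x) approaches the average cost of that policy.
X, U = (0, 1), (0, 1)
P = {(0, 0): [0.9, 0.1], (0, 1): [0.2, 0.8],
     (1, 0): [0.5, 0.5], (1, 1): [0.1, 0.9]}
c = {(0, 0): 1.0, (0, 1): 0.2, (1, 0): 2.0, (1, 1): 1.5}

def discounted(beta, iters=20000):
    """Value iteration for J_beta and the induced deterministic policy."""
    J = [0.0, 0.0]
    for _ in range(iters):
        J = [min(c[(x, u)] + beta * sum(P[(x, u)][y] * J[y] for y in X)
                 for u in U) for x in X]
    policy = tuple(min(U, key=lambda u: c[(x, u)]
                       + beta * sum(P[(x, u)][y] * J[y] for y in X)) for x in X)
    return J, policy

def average_cost(policy):
    """Average cost via the closed-form invariant measure of a 2-state chain."""
    p01 = P[(0, policy[0])][1]
    p10 = P[(1, policy[1])][0]
    pi0 = p10 / (p01 + p10)
    return pi0 * c[(0, policy[0])] + (1 - pi0) * c[(1, policy[1])]

betas = (0.9, 0.99, 0.999)
results = [discounted(beta) for beta in betas]
policies = [pol for _, pol in results]
scaled = [(1 - beta) * J[0] for beta, (J, _) in zip(betas, results)]

g = average_cost(policies[-1])
print(policies[-1] == policies[-2])   # the optimal policy has stabilized
print(abs(scaled[-1] - g) < 1e-2)     # (1 - beta) J_beta(0) is close to g
```

Since every deterministic policy here induces a recurrent chain, the policy optimal for β near 1 is also average cost optimal, as the theorem asserts.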
7.2.1 Borel State <strong>and</strong> Action Spaces [Optional]<br />
In the following, we assume that c is bounded, measurable selection hypothesis hold <strong>and</strong> X is a compact subset<br />
<strong>of</strong> a Polish space <strong>and</strong> U is a compact subset <strong>of</strong> a Polish space.<br />
Consider the value function for a discounted cost problem as discussed in Section 5.4.1:

J_β(x) = min_{u∈U} { c(x, u) + β ∫_X J_β(y) Q(dy|x, u) },  ∀x ∈ X.

Let x_0 be an arbitrary state and for all x ∈ X consider

J_β(x) − J_β(x_0) = min_{u∈U} ( c(x, u) + β ∫ P(dx′|x, u) ( J_β(x′) − J_β(x_0) ) ) − (1 − β) J_β(x_0).   (7.5)
As argued in Section 5.4.1, this has a solution for every β ∈ (0, 1).<br />
Now suppose that J_β(x) − J_β(x_0) is uniformly (over β) equi-continuous and X is compact. By the Arzelà-Ascoli Theorem, taking β ↑ 1 along some sequence {β_n}, for some subsequence {β_{n_k}} we have J_{β_{n_k}}(x) − J_{β_{n_k}}(x_0) → η(x), and by a Tauberian theorem [1], (1 − β_{n_k}) J_{β_{n_k}}(x_0) → ζ*, possibly along a further subsequence. This limit will then satisfy the Average Cost Optimality Equation (ACOE)

η(x) = min_{u∈U} ( c(x, u) + ∫ P(dx′|x, u) η(x′) − ζ* ).   (7.6)
Since under the hypothesis of the setting we have that lim_{T→∞} sup_Π (1/T) E_x[c(x_T, u_T)] = 0 for every admissible x, the ACOE implies that ζ* is the optimal cost and η is the value function (see Theorem 7.1.1 and Theorem 5.2.4 in [17]).
Remark 7.2.1 If ∑_{n=0}^{∞} ‖P^n(x, ·) − P^n(x′, ·)‖_TV ≤ K(x, x′) < ∞ under every optimal policy for β* ≤ β < 1, for some 0 ≤ β* < 1, and K is uniformly equi-continuous, then the condition for the ACOE is satisfied. It is worth noting that if, under an optimal policy, there exists an atom α (or a pseudo-atom in view of the splitting technique discussed in Chapter 3) so that in some finite expected time, for any two initial conditions x, x′, the processes are coupled in the sense described in Chapter 3, then the condition holds, since after hitting α the processes will be coupled and equal, leading to a difference in the costs only until hitting the atom. If this difference is further continuous in x, x′, so that the Arzelà-Ascoli Theorem applies, the optimality condition holds. For example, the univariate drift condition (4.12) is sufficient for the ergodicity condition (by Theorem 15.0.1 of [23], geometric ergodicity is equivalent to the satisfaction of this drift condition).
We also note that for the above arguments to hold, there does not need to be a single invariant distribution.<br />
Here in (7.6), the pair x <strong>and</strong> x 0 should be picked as a function <strong>of</strong> the reachable set under a given sequence <strong>of</strong><br />
policies. The analysis for such a condition is tedious in general since for every β a different optimal policy will<br />
typically be adopted; however, for certain applications the reachable set from a given point may be independent<br />
<strong>of</strong> the control policy applied.<br />
7.3 Linear Programming Approach to Average Cost Markov Decision<br />
Problems<br />
The convex analytic approach of Manne [21] and Borkar [10] (later generalized by Hernandez-Lerma and Lasserre [17]) is a powerful approach to the optimization of infinite-horizon problems. It is particularly effective in proving results on the optimality of stationary policies, and it leads to a (possibly infinite-dimensional) linear program. The approach is particularly effective for constrained optimization problems and infinite horizon average cost optimization problems. It avoids the use of dynamic programming.
In this chapter, we will consider average cost Markov Decision Problems. In particular, we are interested in the minimization

inf_{Π∈Π_A} limsup_{T→∞} (1/T) E^Π_{x_0}[ ∑_{t=1}^{T} c(x_t, u_t) ],   (7.7)

where E^Π_{x_0}[·] denotes the expectation over all sample paths with initial state given by x_0 under the admissible policy Π.

In class, we considered the finite space setting where both X and U were finite sets. We will restrict the analysis to such a setup, and briefly discuss later the generalization to Polish spaces.

We study the limit distribution of the following occupation measures:

v_T(D) = (1/T) ∑_{t=1}^{T} 1_{{(x_t, u_t) ∈ D}},  D ∈ B(X × U).
Define an F_t-measurable process, with A ∈ B(X):

F_t(A) = ∑_{s=1}^{t} 1_{{x_s ∈ A}} − t ∑_{X×U} P(A|x, u) v_t(x, u).

We have that

E[ F_t(A) | F_{t−1} ] = F_{t−1}(A),  ∀t ≥ 0,

and {F_t(A)} is a martingale sequence.
Furthermore, F_t(A) is a bounded-increment martingale since |F_t(A) − F_{t−1}(A)| ≤ 1. Hence, for every T > 2, {F_1(A), . . ., F_T(A)} forms a martingale sequence with uniformly bounded increments, and we can invoke the Azuma-Hoeffding inequality to show that for all x > 0,

P( | F_t(A)/t | ≥ x ) ≤ 2 e^{−2x²t}.
Finally, invoking the Borel-Cantelli Lemma [9] for the summability of the estimate above, that is,

∑_{t=1}^{∞} 2 e^{−2x²t} < ∞,  ∀x > 0,

we deduce that

lim_{t→∞} F_t(A)/t = 0  a.s.

Thus,

lim_{T→∞} ( v_T(A) − ∑_{X×U} P(A|x, u) v_T(x, u) ) = 0,  A ⊂ X.
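This convergence can be observed by simulation: run a finite chain under an arbitrary admissible (here, time-varying and randomized) policy and monitor the residual v_T(A) − ∑_{x,u} P(A|x, u) v_T(x, u); by the martingale argument above, it decays at rate roughly 1/√T. The kernel and the policy below are assumed examples.

```python
import random

# Empirical occupation measures asymptotically satisfy the constraint
# defining G:  v_T(A) - sum_{x,u} P(A|x,u) v_T(x,u) -> 0.
# The kernel P and the time-varying randomized policy are assumed examples.
P = {(0, 0): [0.9, 0.1], (0, 1): [0.2, 0.8],
     (1, 0): [0.5, 0.5], (1, 1): [0.1, 0.9]}

def residual(T, seed=1):
    """|v_T({0}) - sum_{x,u} P({0}|x,u) v_T(x,u)| along a simulated path."""
    rng = random.Random(seed)
    x = 0
    pair_counts = {key: 0 for key in P}   # T * v_T(x, u)
    visits_0 = 0                          # T * v_T({0})
    for t in range(T):
        u = rng.choice((0, 1)) if t % 3 else 1   # an arbitrary admissible policy
        pair_counts[(x, u)] += 1
        visits_0 += (x == 0)
        x = 0 if rng.random() < P[(x, u)][0] else 1
    v_T_0 = visits_0 / T
    drift = sum(P[key][0] * n / T for key, n in pair_counts.items())
    return abs(v_T_0 - drift)

print(residual(100_000) < 0.02)  # small residual for large T
```

Note that the policy used is neither stationary nor deterministic; the residual vanishes under any admissible policy, which is exactly the content of the martingale argument.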
Define

G_X = { v ∈ P(X × U) : v(B × U) = ∑_{x,u} P(x_{t+1} ∈ B | x, u) v(x, u), B ∈ B(X) }.

Further define

G = { v ∈ P(X × U) : ∃ Π ∈ Π_S, v(A) = ∑_{x,u} P^Π( (x_{t+1}, u_{t+1}) ∈ A | x ) v(x, u), A ∈ B(X × U) }.

It is evident that G ⊂ G_X since there are fewer restrictions for G_X. We can show that these two sets are indeed equal. In particular, for v ∈ G_X, if we write v(x, u) = π(x) P(u|x), then we can construct a consistent element of G via v(B × C) = ∑_{x∈B} π(x) P(C|x).
7.3. LINEAR PROGRAMMING APPROACH TO AVERAGE COST MARKOV DECISION PROBLEMS 75<br />
Thus, every converging sequence v_t will satisfy the above equation (note that we do not claim that every sequence converges), and hence any converging sequence of v_T(·) will asymptotically live in the set G.

This is where finiteness is helpful: if the state space were countable, we would also have to consider sequences which possibly converge to infinity.

Thus, under any sequence of admissible policies, every converging occupation measure sequence converges to the set G.
Lemma 7.3.1 Under any admissible policy, with probability 1, any converging sequence {v t } will converge to<br />
the set G.<br />
Let us define

γ* = inf_{v∈G} ∑ v(x, u) c(x, u).
The above optimality also holds in a stronger sense under positive recurrence conditions. Consider the following:

inf_Π limsup_{T→∞} (1/T) ∑_{t=1}^{T} c(x_t, u_t),   (7.8)

where there is no expectation. The above is known as the sample path cost.

Now, we have that

liminf_{T→∞} ⟨v_T, c⟩ ≥ γ*,

since for any sequence v_{T_k} which converges to the liminf value, there exists a further subsequence (due to the weak compactness of the space of occupation measures) which has a weak limit, and this weak limit is in G. By Fatou's Lemma:

lim_{T_k→∞} ⟨v_{T_k}, c⟩ = ⟨ lim_{T_k→∞} v_{T_k}, c ⟩ ≥ γ*.

Here, ⟨v, c⟩ := ∑ v(x, u) c(x, u).

Likewise, for the average cost problem,

liminf_{T→∞} E[⟨v_T, c⟩] ≥ E[ liminf_{T→∞} ⟨v_T, c⟩ ] ≥ γ*.
Lemma 7.3.2 [Proposition 9.2.5 in [22]] The space G is convex and closed. Let G_e denote the set of extreme points of G. A measure is in G_e if and only if two conditions are satisfied: i) the control policy inducing it is deterministic; ii) under this deterministic policy, the induced Markov chain is irreducible on the support set of the invariant measure (there cannot be two invariant measures under this deterministic policy).
Proof:

Consider η, an invariant measure in G which is non-extremal. Then, this measure can be written as a convex combination of two measures:

η(x, u) = Θ η_1(x, u) + (1 − Θ) η_2(x, u),

π(x) = Θ π_1(x) + (1 − Θ) π_2(x).

Let φ_1(u|x), φ_2(u|x) be two policies leading to ergodic occupation measures η_1 and η_2 and invariant measures π_1 and π_2, respectively. Then, there exists γ(·) such that

φ(u|x) = γ(x) φ_1(u|x) + (1 − γ(x)) φ_2(u|x),

where under φ the invariant occupation measure η is attained.

There are two ways this can be satisfied: i) φ is randomized, or ii) if φ is not randomized, with φ_1 = φ_2 deterministic for all x, then it must be that π_1 and π_2 are different measures with different support sets. In this case, the measures η_1 and η_2 do not correspond to ergodic occupation measures (in the sense of positive Harris recurrence).
We can show that for the countable state space case a converse also holds; that is, if a policy is randomized, it cannot lead to an extreme point unless it has the same invariant measure under some deterministic policy. That is, extreme points are achieved by deterministic policies. Let there be a policy P which randomizes between two deterministic policies P_1 and P_2 at a state α, such that with probability θ, P_1 is chosen, and with probability 1 − θ, P_2 is chosen. Let v, v_1, v_2 be the invariant measures corresponding to P, P_1, P_2. Here, α is an accessible atom if E^{P_i}_α[τ_α] < ∞ and P^{P_i}(x, B) = P^{P_i}(y, B) for all x, y ∈ α, where τ_α = min(k > 0 : x_k = α) is the return time to α. In this case, the expected occupation measure can be shown to be equal to

v(x, u) = ( E^{P_1}_α[τ_α] θ v_1(x, u) + E^{P_2}_α[τ_α] (1 − θ) v_2(x, u) ) / ( E^{P_1}_α[τ_α] θ + E^{P_2}_α[τ_α] (1 − θ) ),   (7.9)

which is a convex combination of v_1 and v_2.

Hence, a randomized policy cannot lead to an extreme point in the space of invariant occupation measures. We note that, under irreducibility assumptions, the above argument applies even if one randomizes at only a single state.
The above does not easily extend to uncountable settings. In Section 3.2 <strong>of</strong> [10], there is a discussion for a<br />
specialized setting.<br />
⊓⊔

We can deduce that, for such a finite state space, an optimal policy is deterministic:
Theorem 7.3.1 Let X, U be finite spaces. Furthermore, let, for every stationary policy, there be a unique invariant distribution supported on a single communicating class. In this case, a sample path optimal policy and an average cost optimal policy exist. These policies are deterministic and stationary.
Proof: G is a compact set, so there exists an optimal v. Furthermore, by some further study, one realizes that the set G is closed and convex, and that its extreme points are obtained by deterministic policies. ⊓⊔
The set <strong>of</strong> extreme points will contain optimal solutions for this case. However, one needs to be careful in<br />
characterizing the set <strong>of</strong> extreme points <strong>of</strong> the set <strong>of</strong> invariant occupation measures.<br />
Let µ* be the optimal occupation measure (this exists, as discussed above, since the state space is finite, and thus G is compact, and ∑_{X×U} µ(x, u) c(x, u) is continuous in µ(·, ·)). This induces an optimal policy π(u|x) as

π(u|x) = µ*(x, u) / ∑_{u∈U} µ*(x, u).
Thus, we can find the optimal policy conveniently, without using dynamic programming.<br />
In class, a simple example was provided.<br />
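Since the extreme points of G are induced by deterministic policies, for a small finite model the program inf_{v∈G} ⟨v, c⟩ can be solved by enumerating the deterministic stationary policies, computing the invariant occupation measure each one induces, and reading off the minimizer. A sketch on an assumed two-state, two-action model (kernel and cost are illustrative, not from the notes):

```python
from itertools import product

# Convex-analytic approach on an assumed two-state, two-action model:
# enumerate deterministic stationary policies (the extreme points of G),
# compute each induced invariant occupation measure, and minimize <v, c>.
P = {(0, 0): [0.9, 0.1], (0, 1): [0.2, 0.8],
     (1, 0): [0.5, 0.5], (1, 1): [0.1, 0.9]}
c = {(0, 0): 1.0, (0, 1): 0.2, (1, 0): 2.0, (1, 1): 1.5}

def occupation_measure(policy):
    # Invariant distribution of the induced two-state chain, in closed form.
    p01 = P[(0, policy[0])][1]
    p10 = P[(1, policy[1])][0]
    pi = (p10 / (p01 + p10), p01 / (p01 + p10))
    # v(x, u) = pi(x) 1{u = policy(x)} is the extreme point of G it induces.
    return {(x, u): pi[x] * (u == policy[x]) for x, u in P}

best_policy = min(product((0, 1), repeat=2),
                  key=lambda pol: sum(occupation_measure(pol)[k] * c[k] for k in c))
gamma_star = sum(occupation_measure(best_policy)[k] * c[k] for k in c)
print(best_policy, round(gamma_star, 4))  # → (0, 0) 1.1667
```

The deterministic policy attaining the minimum, together with γ* = ⟨v, c⟩ at its occupation measure, is obtained without any dynamic programming recursion.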
7.4 Discussion for More General State Spaces<br />
We now discuss a more general setting where these spaces are Polish spaces. The main difference is that, while in class we only considered the occupation measures

v_T(A) = (1/T) ∑_{t=0}^{T−1} 1_{{(x_t, u_t) ∈ A}},  ∀A ∈ B(X × U),

and the convergence of v_T to some set, for an uncountable setting we use test functions to arrive at the same result. Let φ : X → R be a continuous and bounded function. Define

v_T(φ) = (1/T) ∑_{t=1}^{T} φ(x_t).
Define an F_t-measurable process, with π an admissible control policy (not necessarily stationary or Markov):

F_t(φ) = ∑_{s=1}^{t} φ(x_s) − t ∫_{X×U} ( ∫_X φ(x′) P(dx′|x, u) ) v_t(dx, du).   (7.10)

We define G_X to be the following set in this case:

G_X = { η ∈ P(X × U) : η(D × U) = ∫_{X×U} P(D|z) η(dz),  ∀D ∈ B(X) }.
The above holds under a class of technical conditions:

1. (A) The state process lives in a compact space X. The control space U is also compact.

2. (A′) The cost function satisfies the following condition: lim_{K_n↑X} inf_{u, x∉K_n} c(x, u) = ∞ (here, we assume that the space X is locally compact, or σ-compact).

3. (B) There exists a policy leading to a finite cost.

4. (C) The cost function is continuous in x and u.

5. (D) The transition kernel is weakly continuous in the sense that ∫ Q(dz|x, u) v(z) is continuous in both x and u, for every continuous and bounded function v.
Theorem 7.4.1 Under the above Assumptions A, B, C, <strong>and</strong> D (or A’, B, C, <strong>and</strong> D) there exists a solution to the<br />
optimal control problem given in (7.8) for almost every initial condition under the optimal invariant probability<br />
measure.<br />
The key step is the observation that every expected occupation measure sequence has a weakly<br />
converging subsequence. If this exists, then the limit <strong>of</strong> such a converging sequence will be in G <strong>and</strong> the<br />
analysis presented for the finite state space case will be applicable.<br />
Let ν_T(A) = E[v_T(A)]. For C, without the boundedness assumption, the argument follows from the observation that, for ν_n → ν weakly, for every lower semi-continuous function c (see [17]),

liminf_{n→∞} ∫ ν_n(dx, du) c(x, u) ≥ ∫ ν(dx, du) c(x, u).

However, by sequential compactness, every sequence has a weakly converging subsequence, and the result follows.
The issue here is that a sequence ν_n of measures on the space of invariant measures may have a limiting probability measure, but this limit may not correspond to an invariant measure under a stationary policy. The weak continuity of the kernel, and the separability of the space of continuous functions on a compact set, allow for this generalization. This ensures that every sequence has a weakly converging subsequence; in particular, there exists an optimal occupation measure.
Lemma 7.4.1 There exists an optimal occupation measure in G under the Assumptions above.<br />
Proof: The problem has now reduced to

min_µ ∫ µ(dx, du) c(x, u)  s.t.  µ ∈ G.
78 CHAPTER 7. THE AVERAGE COST PROBLEM<br />
The set <strong>of</strong> invariant measures is weakly sequentially compact <strong>and</strong> the integral is lower semi-continuous on the<br />
set <strong>of</strong> measures.<br />
⊓⊔<br />
Following the Lemma above, there exists an optimal occupation measure, say v(dx, du). This defines the optimal stationary control policy through the decomposition

µ(du|x) = v(dx, du) / ∫_U v(dx, du),

for all x such that ∫_U v(dx, du) > 0.
There is a final consideration of reachability; that is, whether, from any initial state or an initial occupation set, the region on which the optimal policy is defined is reached (see [1]).

If the Markov chain is stable and irreducible, then the optimal cost is independent of where the chain starts, since in finite time the states on which the optimal occupation measure has support can be reached.

If this Markov chain is not irreducible, then the optimal stationary policy is only optimal when the state process starts at locations from which, in finite time, the optimal stationary policy becomes applicable. This is particularly useful for problems where one has control over in which state to start the system.
7.4.1 Extreme Points <strong>and</strong> the Optimality <strong>of</strong> Deterministic Policies<br />
Let µ be an invariant measure in G which is not extreme. This means that there exist κ ∈ (0, 1) and invariant occupation measures µ_1, µ_2 ∈ G such that

µ(dx, du) = κ µ_1(dx, du) + (1 − κ) µ_2(dx, du).

Let µ(dx, du) = P(du|x) µ(dx) and, for i = 1, 2, µ_i(dx, du) = P^i(du|x) µ_i(dx). Then,

µ(dx) = κ µ_1(dx) + (1 − κ) µ_2(dx).

Note that µ(dx) = 0 ⟹ µ_i(dx) = 0 for i = 1, 2. As a consequence, the Radon-Nikodym derivative of µ_i with respect to µ exists. Let (dµ_i/dµ)(x) = f^i(x). Then

P(du|x) = P^1(du|x) f^1(x) κ + P^2(du|x) f^2(x) (1 − κ)

is well-defined, with f^1(x) κ + (1 − κ) f^2(x) = 1. As a result, we have that

P(du|x) = P^1(du|x) η(x) + P^2(du|x) (1 − η(x)),

with the randomization kernel η(x) = f^1(x) κ. As in Meyn [22], there are two ways such a representation is possible: either P(du|x) is randomized, or P^1(du|x) = P^2(du|x) is deterministic but there are multiple invariant probability measures under P^1.
The converse direction can also be obtained for the countable state space case: if a measure in G is extreme, then it corresponds to a deterministic policy. The result also holds for the uncountable state space case under further technical conditions. See Section 3.2 of Borkar [10] for further discussion.
Remark 7.4.1 If there is an atom (or a pseudo-atom constructed through Nummelin’s splitting technique) which<br />
is visited under every stationary policy, then, as in the arguments leading to (7.9), one can deduce that on this<br />
atom there cannot be a randomized policy which leads to an extreme point of the set of invariant occupation<br />
measures.
7.4.2 Sample-Path Optimality<br />
To apply the convex analytic approach, we require that under any admissible policy the set of sample-path<br />
occupation measures be tight, for almost every sample-path realization. If this can be established, then<br />
the result goes through not only for the expected cost but also for the sample-path average cost.<br />
7.5 Exercises<br />
Exercise 7.5.1 Consider the convex analytic method we discussed in class. Let X, U be countable sets and<br />
consider the occupation measure:<br />
v_T(A) = (1/T) ∑_{t=0}^{T−1} 1_{(x_t ∈ A)}, ∀A ∈ B(X).<br />
While proving that the limit of such a measure process lives in a specific set, we used the following result. Let F_t<br />
be the σ-field generated by {x_s, u_s, s ≤ t}. Define the F_t-measurable process<br />
F_t(A) = ∑_{s=1}^{t} 1_{(x_s ∈ A)} − t ∑_{(x,u) ∈ X×U} P(A|x, u) v_t(x, u),<br />
where the process evolves under some arbitrary but admissible control policy Π. Show that<br />
{F_t(A), t ∈ {1, 2, . . ., T}} is a martingale sequence for every T ∈ Z_+.<br />
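The martingale property comes from the increments: E[1_{(x_s ∈ A)} | F_{s−1}] = P(A | x_{s−1}, u_{s−1}), so each increment has zero conditional mean. A Monte Carlo sanity check of the consequence E[F_t(A)] = 0 for all t (the kernel, the policy, and the set A below are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative controlled kernel P[u, x, x'] = P(x' | x, u) and a stationary
# randomized policy pi[x, u]; both are assumptions for this sketch.
P = np.array([[[0.9, 0.1], [0.5, 0.5]],
              [[0.2, 0.8], [0.3, 0.7]]])
pi = np.array([[0.7, 0.3], [0.4, 0.6]])

A = [0]                  # the event {x_t in A}
T, N = 25, 5000          # horizon and number of independent sample paths
F = np.zeros((N, T + 1))
for n in range(N):
    x = 0
    for t in range(1, T + 1):
        u = rng.choice(2, p=pi[x])
        x_next = rng.choice(2, p=P[u, x])
        # martingale increment: 1_{(x_t in A)} - P(A | x_{t-1}, u_{t-1})
        F[n, t] = F[n, t - 1] + float(x_next in A) - P[u, x, A].sum()
        x = x_next

# For a martingale started at 0, E[F_t(A)] = 0 for every t; the empirical
# means should be near zero (of order 1/sqrt(N)).
print(np.abs(F.mean(axis=0)).max())
```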
Exercise 7.5.2 Let, for a Markov control problem, x_t ∈ X, u_t ∈ U, where X and U are finite sets denoting the<br />
state space and the action space, respectively. Consider the optimal control problem of the minimization of<br />
lim_{T→∞} (1/T) E^Π_x [ ∑_{t=0}^{T−1} c(x_t, u_t) ],<br />
where c is a bounded function. Further assume that under any stationary control policy, the state transition<br />
kernel P(x_{t+1}|x_t, u_t) leads to an irreducible Markov chain.<br />
Does there exist an optimal control policy? Can you propose a way to find the optimal policy?<br />
Is the optimal policy also sample-path optimal?
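Since X and U are finite and every stationary policy yields an irreducible chain, one concrete way to find an optimal policy (which, by the convex analytic approach, may be taken deterministic and stationary) is to enumerate the finitely many deterministic stationary policies, compute each induced invariant distribution, and compare average costs. A sketch with an assumed two-state, two-action kernel and cost:

```python
import itertools
import numpy as np

# Illustrative finite average-cost MDP (kernel and cost are assumptions):
# P[u, x, x'] = P(x' | x, u), c[x, u] the one-stage cost.
P = np.array([[[0.9, 0.1], [0.5, 0.5]],
              [[0.2, 0.8], [0.3, 0.7]]])
c = np.array([[1.0, 3.0], [0.0, 2.0]])

def invariant(M):
    """Invariant distribution of an irreducible stochastic matrix M."""
    w, v = np.linalg.eig(M.T)
    pi = np.real(v[:, np.argmin(np.abs(w - 1.0))])
    return pi / pi.sum()

best = None
# Enumerate deterministic stationary policies d: X -> U.
for d in itertools.product(range(2), repeat=2):
    M = np.array([P[d[x], x] for x in range(2)])   # chain induced by d
    pi = invariant(M)
    J = sum(pi[x] * c[x, d[x]] for x in range(2))  # average cost of d
    if best is None or J < best[0]:
        best = (J, d)

print(best)   # minimal average cost and a policy achieving it
```

By irreducibility, the ergodic theorem for Markov chains makes the sample-path average cost under the chosen policy equal its expected average cost almost surely, which is the key observation for the sample-path optimality question.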
Chapter 8<br />
Team Decision Theory and Information Structures<br />
We will primarily follow Chapters 2, 3 and 4 of [32] for this topic.<br />
8.1 Exercises<br />
Exercise 8.1.1 Consider a common probability space on which the information available to two decision makers<br />
DM 1 and DM 2 is defined, with I_1 available at DM 1 and I_2 available at DM 2.<br />
R. J. Aumann [2] defines an event E to be common knowledge between two decision makers DM 1 and<br />
DM 2 if, whenever E happens, DM 1 knows E, DM 2 knows E, DM 1 knows that DM 2 knows E, DM 2 knows that<br />
DM 1 knows E, and so on.<br />
Suppose that one claims that an event E is common knowledge if and only if E ∈ σ(I_1) ∩ σ(I_2), where σ(I_1)<br />
denotes the σ-field generated by the information I_1, and likewise for σ(I_2).<br />
Is this argument correct? Provide an answer with precise arguments. You may wish to consult [2], [27], [11] and<br />
Chapter 12 of [32].<br />
Exercise 8.1.2 Consider the following static team decision problem with dynamics:<br />
x_1 = a x_0 + b_1 u^1_0 + b_2 u^2_0 + w_0,<br />
y^1_0 = x_0 + v^1_0,<br />
y^2_0 = x_0 + v^2_0.<br />
Here v^1_0, v^2_0, w_0 are independent, Gaussian, zero-mean with finite variances.<br />
Let γ^i : R → R be the policies of the controllers: u^1_0 = γ^1_0(y^1_0), u^2_0 = γ^2_0(y^2_0).<br />
Find<br />
min_{γ^1, γ^2} E^{γ^1, γ^2}_{ν_0} [ x_1^2 + ρ_1 (u^1_0)^2 + ρ_2 (u^2_0)^2 ],<br />
where ν_0 is a zero-mean Gaussian distribution and ρ_1, ρ_2 > 0.<br />
Find an optimal team policy γ = {γ 1 , γ 2 }.<br />
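By Radner’s theorem (the cost is strictly convex quadratic and the primitives are jointly Gaussian), the unique optimal team policy is linear, u^i_0 = g_i y^i_0, and person-by-person optimality reduces to a 2×2 linear system in (g_1, g_2). A numerical sketch, where all parameter values are assumptions chosen for illustration:

```python
import numpy as np

# Radner: strictly convex quadratic cost + jointly Gaussian primitives =>
# the unique optimal team policy is linear, u^i_0 = g_i * y^i_0.
# All parameter values below are assumptions for illustration.
a, b1, b2 = 1.0, 1.0, 1.0
rho1, rho2 = 1.0, 1.0
sx2, s12, s22, sw2 = 1.0, 0.5, 0.5, 1.0  # Var(x0), Var(v1), Var(v2), Var(w0)

# Person-by-person optimality: d/dg_i E[cost] = 0 gives a 2x2 linear system,
# using E[y_i^2] = sx2 + si2 and E[x0 y_i] = E[y1 y2] = sx2.
M = np.array([[(b1**2 + rho1) * (sx2 + s12), b1 * b2 * sx2],
              [b1 * b2 * sx2, (b2**2 + rho2) * (sx2 + s22)]])
r = -np.array([a * b1 * sx2, a * b2 * sx2])
g = np.linalg.solve(M, r)   # optimal gains (-0.25 each for these parameters)

# Monte Carlo sanity check: perturbing a gain should increase the cost.
rng = np.random.default_rng(1)
def cost(g1, g2, n=200_000):
    x0 = rng.normal(0.0, np.sqrt(sx2), n)
    y1 = x0 + rng.normal(0.0, np.sqrt(s12), n)
    y2 = x0 + rng.normal(0.0, np.sqrt(s22), n)
    w0 = rng.normal(0.0, np.sqrt(sw2), n)
    x1 = a * x0 + b1 * g1 * y1 + b2 * g2 * y2 + w0
    return np.mean(x1**2 + rho1 * (g1 * y1)**2 + rho2 * (g2 * y2)**2)

assert cost(*g) < cost(g[0] + 0.2, g[1])
```

Note that convexity and smoothness are what allow the stationarity conditions above to characterize the global team optimum, which is exactly the point of Exercise 8.1.5 below.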
Exercise 8.1.3 Consider a linear Gaussian system with mutually independent and i.i.d. noises:<br />
x_{t+1} = A x_t + ∑_{j=1}^{L} B^j u^j_t + w_t,<br />
y^i_t = C^i x_t + v^i_t, 1 ≤ i ≤ L, (8.1)<br />
with the one-step delayed observation sharing pattern.<br />
Construct a controlled Markov chain for the team decision problem: first show that one could have<br />
{y^1_t, y^2_t, . . . , y^L_t, P(dx_t | y^1_{[0,t−1]}, y^2_{[0,t−1]}, . . . , y^L_{[0,t−1]})}<br />
as the state of the controlled Markov chain.<br />
Consider the following problem:<br />
E^γ_{ν_0} [ ∑_{t=0}^{T−1} c(x_t, u^1_t, · · · , u^L_t) ].<br />
For this problem, if at time t ≥ 0 each of the decision makers (say DM i) has access to<br />
P(dx_t | y^1_{[0,t−1]}, y^2_{[0,t−1]}, . . . , y^L_{[0,t−1]}) and its local observations y^i_{[0,t]}, show that the decision<br />
makers can obtain a solution in which the optimal decision rules use only<br />
{P(dx_t | y^1_{[0,t−1]}, y^2_{[0,t−1]}, . . . , y^L_{[0,t−1]}), y^i_t}.<br />
What if they do not have access to P(dx_t | y^1_{[0,t−1]}, y^2_{[0,t−1]}, . . . , y^L_{[0,t−1]}), and only have access to<br />
y^i_{[0,t]}? What would be a sufficient statistic for each decision maker at each time stage?<br />
Exercise 8.1.4 Two decision makers, Alice and Bob, wish to control a system:<br />
x_{t+1} = a x_t + u^a_t + u^b_t + w_t,<br />
y^a_t = x_t + v^a_t,<br />
y^b_t = x_t + v^b_t,<br />
where u^a_t, y^a_t are the control actions and the observations of Alice, u^b_t, y^b_t are those for Bob, and<br />
v^a_t, v^b_t, w_t are independent zero-mean Gaussian random variables with finite variance. Suppose the goal is<br />
to minimize, for some T ∈ Z_+:<br />
E^{Π^a, Π^b}_{x_0} [ ∑_{t=0}^{T−1} x_t^2 + r_a (u^a_t)^2 + r_b (u^b_t)^2 ],<br />
for r_a, r_b > 0, where Π^a, Π^b denote the policies adopted by Alice and Bob. Let the local information available<br />
to Alice be I^a_t = {y^a_s, u^a_s, s ≤ t − 1} ∪ {y^a_t}, and let I^b_t = {y^b_s, u^b_s, s ≤ t − 1} ∪ {y^b_t} be the<br />
information available at Bob at time t.<br />
Consider an n-step delayed information sharing pattern: the information available at Alice at time t is<br />
I^a_t ∪ I^b_{t−n},<br />
and the information available at Bob is<br />
I^b_t ∪ I^a_{t−n}.<br />
State whether the following are true or false:<br />
a) If Alice and Bob share all the information they have (with n = 0), then the optimal controls must be<br />
linear.<br />
b) Typically, in such problems, Bob can try to send information to Alice through his actions to improve her<br />
estimate of the state. When is it the case that Alice cannot benefit from such signaling, that is, for what values<br />
of n is there no need for Bob to signal information this way?<br />
c) If Alice and Bob share all information they have with a delay of 2, then their optimal control policies can be<br />
written as<br />
u^a_t = f_a(E[x_t | I^a_{t−2}, I^b_{t−2}], y^a_{t−1}, y^a_t),<br />
u^b_t = f_b(E[x_t | I^a_{t−2}, I^b_{t−2}], y^b_{t−1}, y^b_t),<br />
for some functions f_a, f_b. Here, E[·|·] denotes the conditional expectation.<br />
d) If Alice and Bob share all information they have with a delay of 0, then their optimal control policies can be<br />
written as<br />
u^a_t = f_a(E[x_t | I^a_t, I^b_t]),<br />
u^b_t = f_b(E[x_t | I^a_t, I^b_t]),<br />
for some functions f_a, f_b. Here, E[·|·] denotes the conditional expectation.<br />
Exercise 8.1.5 In Radner’s theorem, explain why continuous differentiability of the cost function (in addition<br />
to its convexity) is needed for person-by-person optimality of a team policy to imply global optimality.
Appendix A<br />
On the Convergence of Random Variables<br />
A.1 Limit Events and Continuity of Probability Measures<br />
Given a sequence of events A_1, A_2, . . . ∈ F, define:<br />
limsup_n A_n = ∩_{n=1}^{∞} ∪_{k=n}^{∞} A_k,<br />
liminf_n A_n = ∪_{n=1}^{∞} ∩_{k=n}^{∞} A_k.<br />
An element belongs to the superior limit if and only if it is in infinitely many of the A_n; it belongs to the<br />
inferior limit if and only if it is in all but finitely many of the A_n.<br />
The limit of a sequence of sets exists if the above limits are equal.<br />
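A quick illustration of the two limits (the alternating sequence here is an example chosen for these notes' purposes, not from the text):

```latex
% Take A_n = [0,1) for n even and A_n = [1,2) for n odd. Then
\limsup_n A_n = \bigcap_{n=1}^{\infty}\bigcup_{k=n}^{\infty} A_k = [0,2),
\qquad
\liminf_n A_n = \bigcup_{n=1}^{\infty}\bigcap_{k=n}^{\infty} A_k = \emptyset,
```

since every point of [0, 2) lies in infinitely many of the A_n, while no point lies in all but finitely many of them; the limit of this sequence of sets does not exist.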
We have the following result:<br />
Theorem A.1.1 For a sequence of events A_n:<br />
P(liminf_n A_n) ≤ liminf_n P(A_n) ≤ limsup_n P(A_n) ≤ P(limsup_n A_n).<br />
We have the following regarding continuity of probability measures:<br />
Theorem A.1.2 (i) For a sequence of events A_n with A_n ⊂ A_{n+1},<br />
lim_{n→∞} P(A_n) = P(∪_{n=1}^{∞} A_n).<br />
(ii) For a sequence of events A_n with A_{n+1} ⊂ A_n,<br />
lim_{n→∞} P(A_n) = P(∩_{n=1}^{∞} A_n).<br />
A.2 Borel-Cantelli Lemma<br />
Theorem A.2.1 (i) If ∑_n P(A_n) converges, then P(limsup_n A_n) = 0. (ii) If {A_n} are independent and if<br />
∑_n P(A_n) = ∞, then P(limsup_n A_n) = 1.<br />
Exercise A.2.1 Let {A_n} be a sequence of independent events, where A_n is the event that the nth coin flip is<br />
heads. What is the probability that there are infinitely many heads if P(A_n) = 1/n^2?<br />
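Since ∑_n 1/n² = π²/6 < ∞, the first Borel–Cantelli lemma gives P(limsup_n A_n) = 0: with probability one, only finitely many heads occur. A small simulation (the horizon and number of trials are arbitrary choices for this sketch) is consistent with this, with total head counts staying small:

```python
import random

random.seed(0)

# P(A_n) = 1/n^2 and sum_n 1/n^2 = pi^2/6 < infinity, so by the first
# Borel-Cantelli lemma only finitely many heads occur almost surely.
def total_heads(n_max=10_000):
    return sum(random.random() < 1.0 / n**2 for n in range(1, n_max + 1))

counts = [total_heads() for _ in range(100)]
print(min(counts), max(counts))   # totals stay small in every trial
```

Note that P(A_1) = 1, so every trial records at least one head, and the expected total over all n is about 1.64.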
An important application <strong>of</strong> the above is the following:<br />
Theorem A.2.2 Let Z_n, n ∈ N, and Z be random variables such that for every ǫ > 0,<br />
∑_n P(|Z_n − Z| ≥ ǫ) < ∞.<br />
Then,<br />
P({ω : lim_{n→∞} Z_n(ω) = Z(ω)}) = 1.<br />
That is, Z_n converges to Z with probability 1.<br />
A.3 Convergence <strong>of</strong> R<strong>and</strong>om Variables<br />
A.3.1 Convergence almost surely (with probability 1)<br />
Definition A.3.1 A sequence <strong>of</strong> r<strong>and</strong>om variables X n converges almost surely to a r<strong>and</strong>om variable X if P({ω :<br />
lim n→∞ X n (ω) = X(ω)}) = 1.<br />
A.3.2 Convergence in Probability<br />
Definition A.3.2 A sequence of random variables X_n converges in probability to a random variable X if<br />
lim_{n→∞} P(|X_n − X| ≥ ǫ) = 0 for every ǫ > 0.<br />
A.3.3 Convergence in Mean-square<br />
Definition A.3.3 A sequence <strong>of</strong> r<strong>and</strong>om variables X n converges in the mean-square sense to a r<strong>and</strong>om variable<br />
X if lim n→∞ E[|X n − X| 2 ] = 0.<br />
A.3.4 Convergence in Distribution<br />
Definition A.3.4 Let X n be a r<strong>and</strong>om variable with cumulative distribution function F n , <strong>and</strong> X be a r<strong>and</strong>om<br />
variable with cumulative distribution function F. A sequence <strong>of</strong> r<strong>and</strong>om variables X n converges in distribution<br />
(or weakly) to a r<strong>and</strong>om variable X if lim n→∞ F n (x) = F(x) for all points <strong>of</strong> continuity <strong>of</strong> F.<br />
Theorem A.3.1 a) Convergence in the almost sure sense implies convergence in probability. b) Convergence in<br />
the mean-square sense implies convergence in probability. c) If X_n → X in probability, then X_n → X in distribution.<br />
We also have partial converses for the above results:<br />
Theorem A.3.2 a) If P(|X_n| ≤ Y) = 1 for some random variable Y with E[Y^2] < ∞, and if X_n → X in<br />
probability, then X_n → X in mean-square. b) If X_n → X in probability, then there exists a subsequence X_{n_k}<br />
which converges to X almost surely. c) If X_n → X and X_n → Y in probability, in mean-square, or almost surely,<br />
then P(X = Y) = 1.<br />
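The converse of a) in Theorem A.3.1 fails, and part b) above is sharp, as the classical "typewriter" sequence on ([0, 1], Borel, Lebesgue) shows: indicator blocks of shrinking length sweep across [0, 1], so X_n → 0 in probability, yet every ω is hit once per sweep, so X_n(ω) converges for no ω, while the subsequence X_{2^k} taken at the start of each sweep does converge almost surely. A small sketch (the construction is the standard example, coded here for illustration):

```python
import math
import random

random.seed(2)

# "Typewriter" sequence: for n in the k-th sweep (2^k <= n < 2^{k+1}),
# X_n is the indicator of a block of length 2^{-k} sliding across [0,1).
# P(|X_n| >= eps) = 2^{-k} -> 0 (convergence in probability), yet every
# omega is hit exactly once per sweep, so X_n(omega) = 1 infinitely often.
def X(n, omega):
    k = int(math.log2(n))        # sweep index
    j = n - 2**k                 # block position within sweep k
    return 1 if j / 2**k <= omega < (j + 1) / 2**k else 0

omega = random.random()
hits_per_level = [sum(X(n, omega) for n in range(2**k, 2**(k + 1)))
                  for k in range(12)]
print(hits_per_level)            # exactly one hit in every sweep

# The subsequence X_{2^k} (first block of each sweep) converges a.s. to 0,
# since X_{2^k}(omega) = 1 only if omega < 2^{-k}.
```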
A sequence of random variables is uniformly integrable if:<br />
lim_{K→∞} sup_n E[|X_n| 1_{(|X_n| ≥ K)}] = 0.<br />
Note that, if {X_n} is uniformly integrable, then sup_n E[|X_n|] < ∞.<br />
Theorem A.3.3 Under uniform integrability, convergence in the almost sure sense implies convergence in mean-square.<br />
Theorem A.3.4 If X_n → X in probability, there exists some subsequence X_{n_k} which converges to X almost<br />
surely.<br />
Theorem A.3.5 (Skorohod’s representation theorem) Let X_n → X in distribution. Then there exist a<br />
sequence of random variables Y_n and a random variable Y such that X_n and Y_n have the same cumulative<br />
distribution functions, X and Y have the same cumulative distribution functions, and Y_n → Y almost surely.<br />
With the above, we can prove the following result.<br />
Theorem A.3.6 The following are equivalent: i) X_n converges to X in distribution. ii) E[f(X_n)] → E[f(X)]<br />
for all continuous and bounded functions f. iii) The characteristic functions Φ_n(u) := E[e^{iuX_n}] converge to<br />
Φ(u) := E[e^{iuX}] for every u ∈ R.
Bibliography<br />
[1] A. Arapostathis, V. S. Borkar, E. Fernandez-Gaucherand, M. K. Ghosh, and S. I. Marcus. Discrete-time controlled Markov processes with average cost criterion: a survey. SIAM J. Control and Optimization, 31:282–344, 1993.<br />
[2] R. J. Aumann. Agreeing to disagree. Annals of Statistics, 4:1236–1239, 1976.<br />
[3] P. Billingsley. Convergence of Probability Measures. John Wiley, New York, 1968.<br />
[4] P. Billingsley. Probability and Measure. Wiley, 3rd ed., New York, 1995.<br />
[5] D. Blackwell. Memoryless strategies in finite-stage dynamic programming. Annals of Mathematical Statistics, 35:863–865, 1964.<br />
[6] D. Blackwell and C. Ryll-Nardzewski. Non-existence of everywhere proper conditional distributions. Annals of Mathematical Statistics, 34:223–225, 1963.<br />
[7] V. I. Bogachev. Measure Theory. Springer-Verlag, Berlin, 2007.<br />
[8] V. S. Borkar. White-noise representations in stochastic realization theory. SIAM J. on Control and Optimization, 31:1093–1102, 1993.<br />
[9] V. S. Borkar. Probability Theory: An Advanced Course. Springer, New York, 1995.<br />
[10] V. S. Borkar. Convex analytic methods in Markov decision processes. In Handbook of Markov Decision Processes, E. A. Feinberg, A. Shwartz (Eds.), pages 347–375. Kluwer, Boston, MA, 2001.<br />
[11] A. Brandenburger and E. Dekel. Common knowledge with probability 1. J. Mathematical Economics, 16:237–245, 1987.<br />
[12] S. B. Connor and G. Fort. State-dependent Foster–Lyapunov criteria for subgeometric convergence of Markov chains. Stoch. Process. Appl., 119:4176–4193, 2009.<br />
[13] R. Douc, G. Fort, E. Moulines, and P. Soulier. Practical drift conditions for subgeometric rates of convergence. Ann. Appl. Probab., 14:1353–1377, 2004.<br />
[14] R. Durrett. Probability: Theory and Examples, volume 3. Cambridge University Press, 2010.<br />
[15] J. K. Ghosh and R. V. Ramamoorthi. Bayesian Nonparametrics. Springer, New York, 2003.<br />
[16] M. Hairer. Convergence of Markov processes. Lecture Notes, 2010.<br />
[17] O. Hernandez-Lerma and J. B. Lasserre. Discrete-Time Markov Control Processes. Springer, 1996.<br />
[18] O. Hernandez-Lerma and J. B. Lasserre. Markov Chains and Invariant Probabilities. Birkhäuser, Basel, 2003.<br />
[19] J. B. Lasserre. Invariant probabilities for Markov chains on a metric space. Statistics and Probability Letters, 34:259–265, 1997.<br />
[20] J. Liu and E. Susko. On strict stationarity and ergodicity of a non-linear ARMA model. Journal of Applied Probability, pages 363–373, 1992.<br />
[21] A. S. Manne. Linear programming and sequential decisions. Management Science, 6:259–267, April 1960.<br />
[22] S. P. Meyn. Control Techniques for Complex Networks. Cambridge University Press, 2007.<br />
[23] S. P. Meyn and R. L. Tweedie. Markov Chains and Stochastic Stability. Springer-Verlag, London, 1993.<br />
[24] S. P. Meyn and R. L. Tweedie. State-dependent criteria for convergence of Markov chains. Ann. Appl. Prob., 4:149–168, 1994.<br />
[25] S. P. Meyn and R. L. Tweedie. Markov Chains and Stochastic Stability. Springer, 1993.<br />
[26] S. P. Meyn and R. L. Tweedie. State-dependent criteria for convergence of Markov chains. Annals Appl. Prob., 4(1):149–168, 1994.<br />
[27] L. Nielsen. Common knowledge, communication and convergence of beliefs. Mathematical Social Sciences, 8:1–14, 1984.<br />
[28] G. O. Roberts and J. S. Rosenthal. General state space Markov chains and MCMC algorithms. Probability Surveys, 1:20–71, 2004.<br />
[29] M. Schäl. Conditions for optimality in dynamic programming and for the limit of n-stage optimal policies to be optimal. Z. Wahrscheinlichkeitstheorie verw. Gebiete, 32:179–196, 1975.<br />
[30] P. Tuominen and R. L. Tweedie. Subgeometric rates of convergence of f-ergodic Markov chains. Adv. Appl. Prob., 26(3):775–798, September 1994.<br />
[31] R. L. Tweedie. Drift conditions and invariant measures for Markov chains. Stochastic Processes and Their Applications, 92:345–354, 2001.<br />
[32] S. Yüksel and T. Başar. Stochastic Networked Control Systems: Stabilization and Optimization under Information Constraints. Birkhäuser, Boston, MA, 2013.<br />
[33] S. Yüksel and S. P. Meyn. Random-time, state-dependent stochastic drift for Markov chains and application to stochastic stabilization over erasure channels. IEEE Transactions on Automatic Control, 58:47–59, January 2013.