
Queen’s University

Mathematics and Engineering and Mathematics and Statistics

Control of Stochastic Systems

MATH/MTHE 472/872 Lecture Notes

Serdar Yüksel

December 19, 2013

This document is a set of supplemental lecture notes that has been used for Math 472/872: Control of Stochastic Systems, at Queen’s University, since 2009.

Serdar Yüksel


Contents

1 Review of Probability
1.1 Introduction
1.2 Measurable Space
1.2.1 Borel σ-field
1.2.2 Measurable Function
1.2.3 Measure
1.2.4 The Extension Theorem
1.2.5 Integration
1.2.6 Fatou's Lemma, the Monotone Convergence Theorem and the Dominated Convergence Theorem
1.3 Probability Space and Random Variables
1.3.1 Discrete-Time Stochastic Processes
1.3.2 More on Random Variables and Probability Density Functions
1.3.3 Independence and Conditional Probability
1.4 Markov Chains
1.5 Exercises

2 Controlled Markov Chains
2.1 Controlled Markov Models
2.1.1 Fully Observed Markov Control Problem Model
2.1.2 Classes of Control Policies
2.2 Markov Chain Induced by a Markov Policy
2.3 Performance of Policies
2.3.1 Partially Observed Model
2.4 Exercises

3 Classification of Markov Chains
3.1 Countable State Space Markov Chains
3.1.1 Recurrence and Transience
3.2 Stability and Invariant Distributions
3.2.1 Invariant Measures via an Occupational Characterization
3.3 Uncountable (Complete, Separable, Metric) State Spaces
3.3.1 Chapman-Kolmogorov Equations
3.3.2 Invariant Distributions for Uncountable Spaces
3.3.3 Existence of an Invariant Distribution
3.3.4 Dobrushin's Ergodic Coefficient for Uncountable State Space Chains
3.3.5 Ergodic Theorem for Uncountable State Space Chains
3.4 Further Results on the Existence of an Invariant Probability Measure
3.4.1 Case with the Feller condition
3.4.2 Case without the Feller condition
3.5 Exercises

4 Martingales and Foster-Lyapunov Criteria for Stabilization of Markov Chains
4.1 Martingales
4.1.1 More on Expectations and Conditional Probability
4.1.2 Some Properties of Conditional Expectation
4.1.3 Discrete-Time Martingales
4.1.4 Doob's Optional Sampling Theorem
4.1.5 An Important Martingale Convergence Theorem
4.1.6 Proof of the Birkhoff Individual Ergodic Theorem
4.1.7 This section is optional: Further Martingale Theorems
4.1.8 Azuma-Hoeffding Inequality for Martingales with Bounded Increments
4.2 Stability of Markov Chains: Foster-Lyapunov Techniques
4.2.1 Criterion for Positive Harris Recurrence
4.2.2 Criterion for Finite Expectations
4.2.3 Criterion for Recurrence
4.2.4 On small and petite sets
4.2.5 Criterion for Transience
4.2.6 State Dependent Drift Criteria: Deterministic and Random-Time
4.3 Convergence Rates to Equilibrium
4.3.1 Lyapunov conditions: Geometric ergodicity
4.3.2 Subgeometric ergodicity
4.4 Conclusion
4.5 Exercises

5 Dynamic Programming
5.1 Bellman's Principle of Optimality
5.2 Discussion: Why are Markov Policies Important? Why are optimal policies deterministic for such settings?
5.3 Existence of Minimizing Selectors and Measurability
5.4 Infinite Horizon Optimal Control Problems
5.4.1 Discounted Cost
5.5 Linear Quadratic Gaussian Problem
5.6 Exercises

6 Partially Observed Markov Decision Processes
6.1 Enlargement of the State-Space and the Construction of a Controlled Markov Chain
6.2 Linear Quadratic Gaussian Case and Kalman Filtering
6.2.1 Separation of Estimation and Control
6.3 Estimation and Kalman Filtering
6.3.1 Optimal Control of Partially Observed LQG Systems
6.4 Partially Observed Markov Decision Processes in General Spaces
6.5 Exercises

7 The Average Cost Problem
7.1 Average Cost and the Average Cost Optimality Equation (ACOE)
7.2 The Discounted Cost Approach to the Average Cost Problem
7.2.1 Borel State and Action Spaces [Optional]
7.3 Linear Programming Approach to Average Cost Markov Decision Problems
7.4 Discussion for More General State Spaces
7.4.1 Extreme Points and the Optimality of Deterministic Policies
7.4.2 Sample-Path Optimality
7.5 Exercises

8 Team Decision Theory and Information Structures
8.1 Exercises

A On the Convergence of Random Variables
A.1 Limit Events and Continuity of Probability Measures
A.2 Borel-Cantelli Lemma
A.3 Convergence of Random Variables
A.3.1 Convergence almost surely (with probability 1)
A.3.2 Convergence in Probability
A.3.3 Convergence in Mean-square
A.3.4 Convergence in Distribution


Chapter 1

Review of Probability

1.1 Introduction

Before discussing controlled Markov chains, we first review some preliminaries from probability theory.

Many events in the physical world are uncertain; that is, given the knowledge available up to a certain time, the entire process is not accurately predictable. Probability theory attempts to develop an understanding of such uncertainty in a consistent way, subject to a number of properties that such a description should satisfy. The model suggested by probability theory may not be physically correct, but it is mathematically precise and consistent.

Examples of stochastic processes include: a) the temperature in a city at noon on each day of October 2013 (this process takes values in R^31); b) the sequence of outputs of a communication channel modeled by additive scalar Gaussian noise when the input is x = {x_1, ..., x_n} (the output process lives in R^n); c) infinitely many copies of a coin flip process (living in {H, T}^∞); d) the trajectory of a plane flying from point A to point B (taking values in C([t_0, t_f]), the space of continuous paths in R^3 with x_{t_0} = A and x_{t_f} = B); e) the exchange rate between the Canadian dollar and the American dollar over a given time index set T.

Some of these processes take values in countable spaces, some do not. When the state space in which a random variable takes its values is uncountable, further technical intricacies arise. These are best addressed with a precise characterization of probability and random variables.

Hence, probability theory can be used to model uncertainty in the real world in a consistent way, according to properties that we expect such measures should admit. The theory may not model the physical world accurately, however. In the following, we develop a rigorous definition of probability.

1.2 Measurable Space

Let X be a collection of points. Let F be a collection of subsets of X with the following properties, so that F is a σ-field:

• X ∈ F
• If A ∈ F, then X \ A ∈ F
• If A_k ∈ F, k = 1, 2, 3, ..., then

⋃_{k=1}^∞ A_k ∈ F

By De Morgan's laws, such a collection is also closed under countable intersections.


If the third item above is only required to hold for finitely many unions (equivalently, intersections), then the collection of subsets is said to be a field.

With the above, (X, F) is termed a measurable space (that is, we can associate a measure with this space, as we will discuss shortly). For example, the full power set of any set is a σ-field.

Recall the following: a set is countable if its elements can be put into one-to-one correspondence with (a subset of) the integers. Examples of countable spaces are finite sets and the set Q of rational numbers. An example of an uncountable set is the set R of real numbers.

A σ-field J is generated by a collection of sets A if J is the smallest σ-field containing the sets in A, and in this case we write J = σ(A). Such a σ-field always exists.

A σ-field is to be considered as an abstract collection of sets. It is in general not possible to formulate the elements of a σ-field through countably many operations of unions, intersections and complements, starting from a generating set.

In this course we will be interested in σ-fields which are countably generated. Such a σ-field has certain desirable properties related to the existence of conditional probability measures, as well as properties related to the extension theorem and the construction of measures.

Sub-σ-fields can be interpreted as representing information about an underlying process. We will discuss this interpretation further, and it will be a recurring theme in our discussions of stochastic control problems.

1.2.1 Borel σ-field

An important class of σ-fields is the Borel σ-field on a metric (or more generally topological) space. Such a σ-field is the one generated by the open sets. The meaning of "open" naturally depends on the space being considered. For this course, we will mainly consider the space of real numbers and, more generally, complete separable metric spaces. Recall that in a metric space with metric d, a set U is open if for every x ∈ U there exists some ε > 0 such that {y : d(x, y) < ε} ⊂ U.

The Borel σ-field on R is the one generated by sets of the form (a, b), that is, by open intervals. We will denote the Borel σ-field on a space X by B(X).

We can take a, b to be rational numbers, so that this σ-field is countably generated.

Not all subsets of R are Borel sets, that is, elements of the Borel σ-field.

We can also define a Borel σ-field on a product space. Let X be a complete, separable, metric space such as R. Let X^Z denote the infinite product of X. If this space is endowed with the product topology, the basic open sets are of the form ∏_{i∈Z} A_i, where all but finitely many of the A_i are equal to X and the remaining (finitely many) ones are open.

We define cylinder sets in this product space as

B_{[A_m, m∈I]} = {x ∈ X^Z : x_m ∈ A_m, A_m ∈ B(X)},

where I ⊂ Z with |I| < ∞, that is, I has finitely many elements. Thus, if x ∈ B_{[A_m, m∈I]}, then x_m ∈ A_m for m ∈ I and the remaining coordinates are arbitrary. The σ-field generated by such cylinder sets is the Borel σ-field on the product space. Such a construction is very important for stochastic processes.

1.2.2 Measurable Function

If (X, B(X)) and (Y, B(Y)) are measurable spaces, we say a mapping h : X → Y is measurable if

h^{-1}(B) = {x : h(x) ∈ B} ∈ B(X),  ∀B ∈ B(Y).



1.2.3 Measure

A positive measure µ on (X, B(X)) is a function from B(X) to [0, ∞] which is countably additive; that is, for A_k ∈ B(X) with A_k ∩ A_j = ∅ for k ≠ j:

µ(⋃_{k=1}^∞ A_k) = ∑_{k=1}^∞ µ(A_k).

Definition 1.2.1 µ is a probability measure if it is positive and µ(X) = 1.

Definition 1.2.2 A measure µ is finite if µ(X) < ∞, and σ-finite if there exists a collection of subsets such that X = ⋃_{k=1}^∞ A_k with µ(A_k) < ∞ for all k.

On the real line R, the Lebesgue measure is defined on the Borel σ-field (in fact on a somewhat larger σ-field) such that for A = (a, b), µ(A) = b − a. The Borel σ-field is a strict sub-collection of the Lebesgue measurable sets; that is, there exist Lebesgue measurable sets which are not Borel sets. For a definition and the construction of Lebesgue measurable sets, see [4].

1.2.4 The Extension Theorem

Theorem 1.2.1 (Carathéodory's Extension Theorem) Let M be a field of subsets of X, and suppose that there exists a measure P on M satisfying: (i) P(∅) = 0; (ii) there exist countably many sets A_n ∈ M such that X = ⋃_n A_n with P(A_n) < ∞; (iii) for pairwise disjoint sets A_n, if ⋃_n A_n ∈ M, then P(⋃_n A_n) = ∑_n P(A_n). Then, there exists a unique measure P′ on σ(M), the σ-field generated by M, which is consistent with P on M.

The above is useful since, to verify that two measures are equal, it suffices to check that they agree on a collection of sets generating the σ-field, rather than on the entire σ-field.

When we apply the above to probability measures defined on a product space, and consider the field of finitely many unions and intersections of finite dimensional Borel sets together with the empty set, we obtain the following.

Theorem 1.2.2 (Kolmogorov's Extension Theorem) Let X be a complete, separable metric space and let µ_n be a probability measure on X^n, the n-fold product of X, for each n = 1, 2, ..., such that

µ_n(A_1 × A_2 × ··· × A_n) = µ_{n+1}(A_1 × A_2 × ··· × A_n × X),

for every n and every sequence of Borel sets A_k. Then, there exists a unique probability measure µ on (X^∞, B(X^∞)) which is consistent with each of the µ_n's.

For example, since the σ-field on a product space is generated by the finite dimensional cylinder sets, one can define a measure on the product space which is consistent with a given consistent family of finite dimensional distributions. Likewise, we can construct the Lebesgue measure by defining it on finitely many unions and intersections of intervals of the form (a, b] or [a, b) and extending it to the Borel σ-field.

1.2.5 Integration

Let h be a non-negative measurable function from (X, B(X)) to (R, B(R)). The Lebesgue integral of h with respect to a measure µ can be defined in three steps.

First, for A ∈ B(X), define 1_{x∈A} (also written 1_{(x∈A)} or 1_A(x)) as the indicator function of the event x ∈ A; that is, the function takes the value 1 if x ∈ A, and 0 otherwise. In this case, ∫_X 1_{x∈A} µ(dx) =: µ(A).

Next, define simple functions as functions of the form h_n(x) = ∑_{k=1}^n b_k 1_{x∈A_k}, where A_1, A_2, ..., A_n all belong to B(X) and b_1, b_2, ..., b_n are positive numbers. For such a simple function, ∫_X h_n(x) µ(dx) = ∑_{k=1}^n b_k µ(A_k).

Finally, for any given non-negative measurable h, there exists a sequence of simple functions h_n such that h_n(x) → h(x) monotonically, that is, h_{n+1}(x) ≥ h_n(x). We define the Lebesgue integral as the limit:

∫_X h(x) µ(dx) := lim_{n→∞} ∫_X h_n(x) µ(dx) = lim_{n→∞} ∑_{k=1}^n b_k µ(A_k).

We note that the space of simple functions is dense in the space of bounded measurable functions on R under the supremum norm.

There are three important convergence theorems which we will not discuss in detail; the statements will be given in class.
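
As a supplementary illustration (not part of the original notes), the following Python sketch approximates the Lebesgue integral of h(x) = x^2 on [0, 1] with respect to the Lebesgue (uniform) measure, using dyadic simple functions h_n; the grid-based estimate of µ(A_k) and the function names are assumptions made for this example only.

```python
import numpy as np

def simple_function_integral(h, n, grid):
    """Approximate the integral of a bounded non-negative h on [0, 1] (Lebesgue measure)
    by integrating the dyadic simple function h_n = sum_k (k/2^n) 1{k/2^n <= h < (k+1)/2^n}."""
    values = h(grid)
    levels = np.floor(values * 2 ** n) / 2 ** n   # h_n(x) <= h(x), nondecreasing in n
    # mu(A_k) is approximated by the fraction of grid points in each level set,
    # so sum_k b_k mu(A_k) is simply the average of the levels over the grid.
    return levels.mean()

h = lambda x: x ** 2
grid = np.linspace(0.0, 1.0, 200001)
for n in (2, 4, 8, 12):
    print(n, simple_function_integral(h, n, grid))   # increases toward 1/3
```

The printed values increase monotonically toward 1/3, mirroring the monotone approximation used in the definition.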

1.2.6 Fatou's Lemma, the Monotone Convergence Theorem and the Dominated Convergence Theorem

Theorem 1.2.3 (Monotone Convergence Theorem) If µ is a σ-finite positive measure on (X, B(X)) and {f_n, n ∈ Z_+} is a sequence of measurable functions from X to R which converge to f pointwise and monotonically, that is, 0 ≤ f_n(x) ≤ f_{n+1}(x) for all n and

lim_{n→∞} f_n(x) = f(x)

for µ-almost every x, then

∫_X f(x) µ(dx) = lim_{n→∞} ∫_X f_n(x) µ(dx).

Theorem 1.2.4 (Fatou's Lemma) If µ is a σ-finite positive measure on (X, B(X)) and {f_n, n ∈ Z_+} is a sequence of non-negative measurable functions from X to R, then

∫_X liminf_{n→∞} f_n(x) µ(dx) ≤ liminf_{n→∞} ∫_X f_n(x) µ(dx).

Theorem 1.2.5 (Dominated Convergence Theorem) If (i) µ is a σ-finite positive measure on (X, B(X)), (ii) g(x) ≥ 0 is a Borel measurable function such that

∫_X g(x) µ(dx) < ∞,

and (iii) {f_n, n ∈ Z_+} is a sequence of measurable functions from X to R which satisfy |f_n(x)| ≤ g(x) for µ-almost every x and lim_{n→∞} f_n(x) = f(x), then:

∫_X f(x) µ(dx) = lim_{n→∞} ∫_X f_n(x) µ(dx).

Note that for the monotone convergence theorem there is no boundedness restriction, whereas the dominated convergence theorem requires a domination (boundedness) condition. On the other hand, for the dominated convergence theorem the pointwise convergence does not have to be monotone.

1.3 Probability Space and Random Variables

Let (Ω, F) be a measurable space. If P is a probability measure, the triple (Ω, F, P) forms a probability space. Here Ω is a set called the sample space; F is called the event space, and it is a σ-field of subsets of Ω.

Let (E, E) be another measurable space and X : (Ω, F, P) → (E, E) be a measurable map. We call X an E-valued random variable. The image of P under X is a probability measure on (E, E), called the law of X.

The σ-field generated by the events {{ω : X(ω) ∈ A}, A ∈ E} is called the σ-field generated by X and is denoted by σ(X).



Consider a coin flip process with possible outcomes {H, T}, heads or tails. We have a good intuitive understanding of what it means when someone tells us that a coin flip takes the value H with probability 1/2. Based on the definition of a random variable, we then view a coin flip process as a deterministic function from some space (Ω, F, P) to the space of heads and tails. Here, P captures the uncertainty (think of the initial condition of the coin when it is being flipped, the flow of air, the surface on which the coin lands, etc.; we summarize all these aspects and all the uncertainty in the universe by the abstract space (Ω, F, P)).

1.3.1 Discrete-Time Stochastic Processes

One can define a sequence of random variables as a single random variable living in a product space; that is, we can consider {x_1, x_2, ..., x_N} as a single (E × E × ··· × E)-valued random variable X in the measurable space (E × E × ··· × E, B(E × E × ··· × E)), where the σ-fields are now defined on the product set. In this case, the probability measure induced by a random sequence is defined on the σ-field B(E × E × ··· × E). We can extend the above to the case N = Z_+, that is, to an infinite-dimensional process.

Let X be a complete, separable, metric space. Let B(X) denote the Borel σ-field of subsets of X. Let Σ = X^T denote the sequence space of all one-sided or two-sided (countably) infinite sequences of variables drawn from X. Thus, if T = Z and x ∈ Σ, then x = {..., x_{-1}, x_0, x_1, ...} with x_i ∈ X.

Let X_n : Σ → X denote the coordinate function such that X_n(x) = x_n. Let B(Σ) denote the smallest σ-field containing all cylinder sets of the form {x : x_i ∈ B_i, m ≤ i ≤ n} where B_i ∈ B(X), for all integers m, n. By the extension theorem, we can define a probability measure on B(Σ) by characterizing it on these finite dimensional cylinder sets.

A similar characterization also applies to continuous-time stochastic processes, where T is uncountable. The extension requires more delicate arguments, since finite-dimensional characterizations are too weak to uniquely define a σ-field which is consistent with such distributions. Such technicalities arise in the discussion of continuous-time Markov chains and controlled processes, typically requiring a construction on a separable product space.

1.3.2 More on Random Variables and Probability Density Functions

Consider a probability space (X, B(X), P) and an R-valued random variable U measurable with respect to (X, B(X)).

This random variable induces a probability measure µ on B(R) such that for sets of the form (a, b] ∈ B(R):

µ((a, b]) = P(U ∈ (a, b]) = P({x : U(x) ∈ (a, b]}).

When U is R-valued, the expectation of U is defined as:

E[U] = ∫_R x µ(dx).

We define F(x) = µ((−∞, x]) as the cumulative distribution function of U. If the derivative of F(x) exists, we call this derivative the density function of U and denote it by p(x) (the distribution function is not always differentiable, for example when the random variable only takes integer values). If a density function exists, we can also write:

E[U] = ∫_R x p(x) dx.

If a probability density function p exists, the law µ is said to be absolutely continuous with respect to the Lebesgue measure. In particular, the density function p is the Radon-Nikodym derivative of µ with respect to the Lebesgue measure λ, in the sense that for all Borel A: ∫_A p(x) λ(dx) = µ(A).

A probability density does not always exist. In particular, whenever there is a positive probability mass at a given point, a density does not exist; hence in R, if P(U = x) > 0 for some x, we say there is a probability mass at x, and a density function does not exist.



Some examples of commonly encountered random variables are as follows:

• Gaussian (N(µ, σ²)):

f(x) = (1 / (√(2π) σ)) e^{-(x−µ)² / 2σ²}.

• Exponential:

F(x) = 1 − e^{−λx},  p(x) = λ e^{−λx}.

• Uniform on [a, b] (U([a, b])):

F(x) = (x − a) / (b − a),  p(x) = 1 / (b − a),  x ∈ [a, b].

• Binomial (B(n, p)):

p(k) = C(n, k) p^k (1 − p)^{n−k}.

If n = 1, we also call a binomial random variable a Bernoulli random variable.
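
As a quick sanity check (added here, not part of the original notes), one can sample from these distributions and compare empirical means with the analytical ones implied by the densities above; the parameter values are arbitrary and numpy is assumed to be available.

```python
import numpy as np

rng = np.random.default_rng(0)
num = 200_000

gaussian = rng.normal(loc=1.0, scale=2.0, size=num)      # N(1, 4): mean 1
exponential = rng.exponential(scale=1 / 3.0, size=num)   # rate lambda = 3: mean 1/3
uniform = rng.uniform(2.0, 5.0, size=num)                 # U([2, 5]): mean 3.5
binomial = rng.binomial(n=10, p=0.25, size=num)           # B(10, 0.25): mean 2.5

for name, sample, mean in [("Gaussian", gaussian, 1.0),
                           ("Exponential", exponential, 1 / 3),
                           ("Uniform", uniform, 3.5),
                           ("Binomial", binomial, 2.5)]:
    print(f"{name:12s} empirical {sample.mean():.4f}  analytical {mean:.4f}")
```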

1.3.3 Independence and Conditional Probability

Consider A, B ∈ B(X) such that P(B) > 0. The quantity

P(A|B) = P(A ∩ B) / P(B)

is called the conditional probability of event A given B. As a function of A, this is itself a probability measure. If

P(A|B) = P(A),

then A, B are said to be independent events.

A countable collection of events {A_n} is independent if for any finite sub-collection A_{i_1}, A_{i_2}, ..., A_{i_m} we have that

P(A_{i_1}, A_{i_2}, ..., A_{i_m}) = P(A_{i_1}) P(A_{i_2}) ··· P(A_{i_m}).

Here, we use the notation P(A, B) = P(A ∩ B). A sequence of events is said to be pairwise independent if for any pair (A_m, A_n) with m ≠ n: P(A_m, A_n) = P(A_m) P(A_n).

Pairwise independence is weaker than independence.

1.4 Markov Chains

Recall from Section 1.3.1 that a sequence of random variables {x_1, x_2, ..., x_N} can be viewed as a single (E × E × ··· × E)-valued random variable, with the induced probability measure defined on the σ-field B(E × E × ··· × E). If the probability measure for this sequence is such that

P(x_{k+1} ∈ A_{k+1} | x_k, x_{k-1}, ..., x_0) = P(x_{k+1} ∈ A_{k+1} | x_k)  P-a.s.,

then {x_k} is said to be a Markov chain. Thus, for a Markov chain, the current state is sufficient to predict the future; past variables are not needed.

As an example, consider the following linear system:

x_{t+1} = a x_t + w_t,

where {w_t} is an independent (noise) sequence. This sequence is Markov. Another way to describe a Markov chain is the following. Let {x_t, t ≥ 0} be a random sequence with state space (X, B(X)), defined on a probability space (Ω, F, P), where B(X) denotes the Borel σ-field on X, Ω is the sample space, F a σ-field of subsets of Ω, and P a probability measure. For x ∈ X and D ∈ B(X), let P(x, D) := P(x_{t+1} ∈ D | x_t = x) denote the transition probability from x to D, that is, the probability of the event {x_{t+1} ∈ D} given that x_t = x. If the probability of the same event given some history of the past (and the present) does not depend on the past, and hence is given by the same quantity, the chain is a Markov chain.

Thus, a Markov chain is completely determined by its transition probability and the distribution of the initial state, P(x_0) = p_0. Hence, the probability of the event {x_{t+1} ∈ D} for any t can be computed recursively starting at t = 0, with P(x_1 ∈ D) = ∑_x P(x_1 ∈ D | x_0 = x) P(x_0 = x), and iterating with a similar formula for t = 1, 2, ....

The above requires some justification; Kolmogorov's extension theorem is one way to build up the probability measures successively.

We will continue our discussion of Markov chains after discussing controlled Markov chains.
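
To make the recursion above concrete, the following is a small sketch (an addition to these notes, with a made-up three-state chain) that propagates the marginal distribution P(x_t = ·) by repeated application of P(x_{t+1} = j) = ∑_i P(x_t = i) P(i, j).

```python
import numpy as np

# Transition matrix P(i, j) = P(x_{t+1} = j | x_t = i) for a hypothetical 3-state chain.
P = np.array([[0.5, 0.5, 0.0],
              [0.1, 0.6, 0.3],
              [0.0, 0.4, 0.6]])
p0 = np.array([1.0, 0.0, 0.0])          # initial distribution P(x_0 = i)

# P(x_{t+1} = j) = sum_i P(x_t = i) P(i, j) is a vector-matrix product.
pt = p0
for t in range(5):
    pt = pt @ P
    print(f"t = {t + 1}, distribution = {np.round(pt, 4)}")

# P(x_t in D) for a set D of states is then the corresponding partial sum.
D = [1, 2]
print("P(x_5 in D) =", pt[D].sum())
```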

1.5 Exercises

Exercise 1.5.1 a) Let F_β be a σ-field of subsets of some space X for all β ∈ Γ, where Γ is a family of indices. Let

F = ⋂_{β∈Γ} F_β.

Show that F is also a σ-field.

For a space X on which a distance metric is defined, the Borel σ-field is generated by the collection of open sets. This means that the Borel σ-field is the smallest σ-field containing the open sets, and as such it is the intersection of all σ-fields containing the open sets. On R, the Borel σ-field is the smallest σ-field containing the open intervals.

b) Is the set of rational numbers an element of the Borel σ-field on R? Is the set of irrational numbers an element?

c) Let X be a countable set. On this set, let us define a metric as follows:

d(x, y) = 0 if x = y, and d(x, y) = 1 if x ≠ y.

Show that the Borel σ-field on X is generated by the individual sets {x}, x ∈ X.

d) Consider the distance function d(x, y) defined as above, now on R. Is the σ-field generated by the open sets of this metric the same as the Borel σ-field on R?

Exercise 1.5.2 a) Let X and Y be real-valued random variables defined on a given probability space. Show that X² and X + Y are also random variables.

b) Let F be a σ-field of subsets of a set X and let A ∈ F. Prove that {A ∩ B, B ∈ F} is a σ-field over A (that is, a σ-field of subsets of A).

Exercise 1.5.3 Consider a random variable X which takes values in [0, 1] and is uniformly distributed. We know that for every set S = [a, b), 0 ≤ a ≤ b ≤ 1, the probability of the random variable X taking values in S is P(X ∈ [a, b)) = U([a, b]) = b − a. Consider now the following question: does every subset S ⊂ [0, 1] admit a probability of the form above? In the following we provide a counterexample, known as the Vitali set.

We can show that a set picked as follows does not admit a measure. Let us define an equivalence relation such that x ≡ y if x − y ∈ Q. This equivalence relation divides [0, 1] into disjoint equivalence classes. Let A be a subset which picks exactly one element from each equivalence class (here, we invoke what is known as the Axiom of Choice). Since A contains an element from each equivalence class, each point of [0, 1] is contained in the union ⋃_{q∈Q} (A ⊕ q), where ⊕ denotes addition modulo 1; for otherwise there would be a point x not belonging to any equivalence class, but x is equivalent to itself at least. Furthermore, since A contains only one point from each equivalence class, the sets A ⊕ q are disjoint: if A ⊕ q and A ⊕ q′ included a common point x, then x − q and x − q′ would both be in A, a contradiction, since A contains at most one point from the equivalence class of x. We expect the uniform distribution to be shift-invariant, therefore P(A) = P(A ⊕ q). But [0, 1] = ⋃_q (A ⊕ q). Since a countable sum of identical non-negative terms can only be ∞ or 0, the contradiction follows: we cannot associate a probability with this set.


Chapter 2

Controlled Markov Chains

In the following, we discuss controlled Markov models.

2.1 Controlled Markov Models

Consider the following model:

x_{t+1} = f(x_t, u_t, w_t),  (2.1)

where x_t is an X-valued state variable, u_t is a U-valued control action variable, w_t is a W-valued i.i.d. noise process, and f is a measurable function. We assume that X, U, W are subsets of Polish spaces. The model in (2.1) contains (see [8]) the class of all stochastic processes which satisfy the following for all Borel sets B ∈ B(X), t ≥ 0, and all realizations x_{[0,t]}, u_{[0,t]}:

P(x_{t+1} ∈ B | x_{[0,t]} = a_{[0,t]}, u_{[0,t]} = b_{[0,t]}) = T(x_{t+1} ∈ B | a_t, b_t),  (2.2)

where T(·|x, u) is a stochastic kernel from X × U to X. A stochastic process which satisfies (2.2) is called a controlled Markov chain.

2.1.1 Fully Observed Markov Control Problem Model

A Fully Observed Markov Control Problem is a five-tuple

(X, U, {U(x), x ∈ X}, T, c),

where

• X is the state space, a subset of a Polish space.
• U is the action space, a subset of a Polish space.
• K = {(x, u) : u ∈ U(x), x ∈ X}, with each U(x) ∈ B(U), is the set of feasible state-control pairs; different control actions may be available in different states.
• T is a state transition kernel, that is, T(A | x_t, u_t) = P(x_{t+1} ∈ A | x_t, u_t).
• c : K → R is the cost function.

Consider for now that the objective to be minimized is given by

J(x_0, Π) := E^Π_{ν_0}[ ∑_{t=0}^{T−1} c(x_t, u_t) ],

where ν_0 is the initial probability measure, that is, x_0 ∼ ν_0. The goal is to find a policy Π* (to be defined next) such that

J(x_0, Π*) ≤ J(x_0, Π),  ∀Π,

where Π ranges over the admissible policies defined below. Such a Π* is an optimal policy. Here Π may also be called the strategy, or the law.


2.1.2 Classes of Control Policies

Admissible Control Policies

Let H_0 := X and H_t := H_{t−1} × K for t = 1, 2, .... We let h_t denote an element of H_t, where h_t = {x_{[0,t]}, u_{[0,t−1]}}.

A deterministic admissible control policy Π is a sequence of functions {γ_t} from H_t to U; in this case u_t = γ_t(h_t). A randomized control policy is a sequence Π = {Π_t, t ≥ 0} such that Π_t : H_t → P(U) (with P(U) being the space of probability measures on U) and

Π_t(u_t ∈ U(x_t) | h_t) = 1,  ∀h_t ∈ H_t.

Markov Control Policies

A policy is randomized Markov if

P^Π_{x_0}(u_t ∈ C | h_t) = Π_t(u_t ∈ C | x_t),  C ∈ B(U).

Hence, the control action depends only on the current state and the time, and not on the past history. If the control strategy is deterministic, that is, if

Π_t(u_t = f_t(x_t) | x_t) = 1

for some function f_t, the control policy is said to be deterministic Markov.

Stationary Control Policies

A policy is randomized stationary if

P^Π_{x_0}(u_t ∈ C | h_t) = Π(u_t ∈ C | x_t),  C ∈ B(U).

Hence, the control action depends only on the current state, and not on the past history or on time. If the control strategy is deterministic, that is, if Π(u_t = f(x_t) | x_t) = 1 for some function f, the control policy is said to be deterministic stationary.
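
The distinction between these policy classes can be illustrated with a small, hypothetical sketch (not from the notes): an admissible policy may use the whole history h_t, a Markov policy only the pair (t, x_t), and a stationary policy only x_t. The thresholds and the history layout below are illustrative assumptions.

```python
def admissible_policy(history):
    """u_t = gamma_t(h_t): may use the whole history h_t = (x_0, u_0, x_1, u_1, ..., x_t)."""
    states = history[::2]                      # assumes h_t interleaves states and actions
    return 1 if sum(states) > 10 else 0

def markov_policy(t, x):
    """u_t = f_t(x_t): depends only on the current state and the time index."""
    return 1 if (x > 5 and t < 100) else 0

def stationary_policy(x):
    """u_t = f(x_t): depends only on the current state."""
    return 1 if x > 5 else 0
```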

2.2 Markov Chain Induced by a Markov Policy

The following is an important result.

Theorem 2.2.1 Let the control policy be randomized Markov. Then, the controlled Markov chain becomes a Markov chain in X, that is, the state process itself becomes a Markov chain:

P^Π_{x_0}(x_{t+1} ∈ B | x_t, x_{t−1}, ..., x_0) = Q^Π_t(x_{t+1} ∈ B | x_t),  B ∈ B(X), t ≥ 1, P-a.s.

Proof: Let us consider the case where U is countable. Let B ∈ B(X). It follows that

P^Π_{x_0}(x_{t+1} ∈ B | x_t, x_{t−1}, ..., x_0)
= ∑_{u_t} P^Π_{x_0}(x_{t+1} ∈ B, u_t | x_t, x_{t−1}, ..., x_0)
= ∑_{u_t} P^Π_{x_0}(x_{t+1} ∈ B | u_t, x_t, x_{t−1}, ..., x_0) P^Π_{x_0}(u_t | x_t, x_{t−1}, ..., x_0)
= ∑_{u_t} Q^Π_{x_0}(x_{t+1} ∈ B | u_t, x_t) Π_t(u_t | x_t)
= ∑_{u_t} Q^Π_{x_0}(x_{t+1} ∈ B, u_t | x_t)
= Q^Π_t(x_{t+1} ∈ B | x_t).  (2.3)

The essential point is that the control depends only on x_t, and since x_{t+1} depends stochastically only on x_t and u_t, the desired result follows. In the above, we used the law of total probability and the properties of conditioning. ⋄

We note that if the applied control policy is not only Markov but also stationary, then the induced Markov chain in X is time-homogeneous, that is, the stochastic kernel Q^Π_t(·|·) does not depend on time.

2.3 Performance of Policies

Consider a Markov control problem with the objective of minimizing

J(ν_0, Π) = E^Π_{ν_0}[ ∑_{t=0}^{T−1} c(x_t, u_t) ],

where ν_0 denotes the distribution of x_0. If x_0 = x, we write

J(x_0, Π) = E^Π_{x_0}[ ∑_{t=0}^{T−1} c(x_t, u_t) ] = E^Π[ ∑_{t=0}^{T−1} c(x_t, u_t) | x_0 ].

Such a cost problem is known as a Finite Horizon Optimal Control Problem.

We will also study costs of the following form:

J(ν_0, Π) = (1/T) E^Π_{ν_0}[ ∑_{t=0}^{T−1} c(x_t, u_t) ].

Such a problem is known as the Average Cost Optimal Control Problem.

Finally, we will consider costs of the following form:

J(ν_0, Π) = E^Π_{ν_0}[ ∑_{t=0}^{T−1} β^t c(x_t, u_t) ],

for some β ∈ (0, 1). This is called a Discounted Optimal Control Problem.
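
As an illustration added here (the two-state model, the kernels T and the stage cost c below are hypothetical), the discounted cost of a fixed deterministic stationary policy can be estimated by Monte Carlo simulation; the horizon is truncated since β^t becomes negligible.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical controlled chain: 2 states, 2 actions. T[u][i, j] = P(x_{t+1}=j | x_t=i, u_t=u).
T = {0: np.array([[0.9, 0.1], [0.5, 0.5]]),
     1: np.array([[0.4, 0.6], [0.1, 0.9]])}
c = lambda x, u: float(x) + 0.5 * u            # per-stage cost c(x, u)
policy = lambda x: 0 if x == 0 else 1          # a fixed deterministic stationary policy

def discounted_cost(beta=0.9, horizon=200, runs=5000, x0=0):
    """Monte Carlo estimate of E[sum_t beta^t c(x_t, u_t)] under the stationary policy."""
    total = 0.0
    for _ in range(runs):
        x, J = x0, 0.0
        for t in range(horizon):               # beta**horizon is negligible here
            u = policy(x)
            J += beta ** t * c(x, u)
            x = rng.choice(2, p=T[u][x])
        total += J
    return total / runs

print(discounted_cost())
```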

Let Π_A denote the class of admissible policies, Π_M the class of Markov policies, and Π_S the class of stationary policies. These policies may be either randomized or deterministic.

In a general setting, since the sets of policies are progressively shrinking, Π_S ⊂ Π_M ⊂ Π_A, we have:

inf_{Π∈Π_A} J(ν_0, Π) ≤ inf_{Π∈Π_M} J(ν_0, Π) ≤ inf_{Π∈Π_S} J(ν_0, Π).

We will show, however, that for the optimal control of a Markov chain, under mild conditions, Markov policies are always optimal (that is, there is no loss of optimality in restricting the policies to be Markov); it is sufficient to consider Markov policies. This is an important result in stochastic control. That is,

inf_{Π∈Π_A} J(ν_0, Π) = inf_{Π∈Π_M} J(ν_0, Π).

We will also show that, under somewhat more restrictive conditions, stationary policies are optimal (that is, there is no loss of optimality in restricting the policies to be stationary). This will typically require T → ∞:

inf_{Π∈Π_A} J(ν_0, Π) = inf_{Π∈Π_S} J(ν_0, Π).

Furthermore, we will show that, under some conditions,

inf_{Π∈Π_S} J(ν_0, Π)

is independent of the initial distribution (or initial condition) of x_0.

The last two results are computationally very important, as there are powerful computational algorithms that allow one to obtain such stationary policies.

In the following set of notes, we will first consider further properties of Markov chains, since under a Markov control policy the controlled state process becomes a Markov chain. We will then return to controlled Markov chains and the development of optimal control policies.

The classification of Markov chains in the next chapter will implicitly characterize the set of problems for which stationary policies contain optimal admissible policies.

2.3.1 Partially Observed Model

Consider the following model:

x_{t+1} = f(x_t, u_t, w_t),  y_t = g(x_t, v_t).

Here, as before, x_t is the state, u_t ∈ U is the control, (w_t, v_t) ∈ W × V are second order, zero-mean, i.i.d. noise processes, and w_t is independent of v_t. In addition to the fully observed model considered earlier, y_t denotes an observation variable taking values in Y, a subset of R^n in the context of this review. The controller has only causal access to the second component {y_t} of the process. An admissible policy Π is measurable with respect to σ({y_s, s ≤ t}). We denote the observed history space by H_0 := P, H_t = H_{t−1} × Y × U. Hence, the set of (wide-sense) causal control policies is such that P(u(h_t) ∈ U | h_t) = 1 for all h_t ∈ H_t.

One can transform a partially observed Markov decision problem into a fully observed Markov decision problem via an enlargement of the state space. In particular, via the properties of total probability we obtain the following dynamical recursion for the conditional (filtering) measure:

π_t(A) := P(x_t ∈ A | y_{[0,t]}, u_{[0,t−1]}) = [ ∫_A ∫_X π_{t−1}(dx_{t−1}) r(y_t | x_t) P(dx_t | x_{t−1}, u_{t−1}) ] / [ ∫_X ∫_X π_{t−1}(dx_{t−1}) r(y_t | x_t) P(dx_t | x_{t−1}, u_{t−1}) ],

where we assume that ∫_B r(y|x) dy = P(y_t ∈ B | x_t = x) for any B ∈ B(Y), so that r denotes the observation density. The conditional measure process becomes a controlled Markov chain in P(X), which we endow with the topology of weak convergence.
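
For a finite state and observation space, the recursion above reduces to a matrix computation. The following sketch (an illustration added to these notes; the model matrices and names are hypothetical) performs one filter update: a prediction step through the transition kernel followed by multiplication by the observation likelihood and normalization.

```python
import numpy as np

def belief_update(pi_prev, u_prev, y, T, O):
    """One step of the filter recursion for a finite model.

    pi_prev : previous belief, shape (n_states,)
    T       : T[u][i, j] = P(x_{t+1} = j | x_t = i, u_t = u)
    O       : O[j, y]    = P(y_t = y | x_t = j)
    Returns the normalized posterior P(x_t = . | y_{[0,t]}, u_{[0,t-1]}).
    """
    predicted = pi_prev @ T[u_prev]            # sum_{x'} pi_{t-1}(x') P(x | x', u_{t-1})
    unnormalized = predicted * O[:, y]         # multiply by the observation likelihood
    return unnormalized / unnormalized.sum()   # the denominator of the recursion

# Hypothetical 2-state, 2-observation example.
T = {0: np.array([[0.8, 0.2], [0.3, 0.7]])}
O = np.array([[0.9, 0.1],
              [0.2, 0.8]])
pi = np.array([0.5, 0.5])
pi = belief_update(pi, u_prev=0, y=1, T=T, O=O)
print(pi)
```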

Theorem 2.3.1 The process {π_t, u_t} is a controlled Markov chain. That is, under any admissible control policy, given the action at time t ≥ 0 and π_t, the measure π_{t+1} is conditionally independent of {π_s, u_s, s ≤ t − 1}.

Let the cost function to be minimized be

E^Π_{x_0}[ ∑_{t=0}^{T−1} c(x_t, u_t) ],

where E^Π_{x_0}[·] denotes the expectation over all sample paths with initial state x_0 under policy Π.

We transform the system into a fully observed Markov model as follows. Define the new cost as

c̃(π, u) = ∫_X c(x, u) π(dx),  π ∈ P(X).

The stochastic transition kernel q is given by:

q(dx, dy | π, u) = ∫_X P(dx, dy | x′, u) π(dx′),  π ∈ P(X),

and this kernel can be decomposed as q(dx, dy | π, u) = P(dy | π, u) P(dx | π, u, y). The second term here is the filtering equation, mapping (π, u, y) ∈ (P(X) × U × Y) to P(X). It follows that (P(X), U, K, c̃) defines a completely observable controlled Markov process. Here, we have

K(B | π, u) = ∫_Y 1_{(P(·|π,u,y) ∈ B)} P(dy | π, u),  ∀B ∈ B(P(X)),

with 1_{(·)} denoting the indicator function.



2.4 Exercises

Exercise 2.4.1 Suppose that there are two decision makers DM 1 and DM 2. Suppose that the information available to DM 1 is a random variable Y_1 and the information available to DM 2 is Y_2, where these random variables are measurable on a probability space (Ω, F, P).

Suppose that the σ-field generated by Y_1 is a subset of the σ-field generated by Y_2, that is, σ(Y_1) ⊂ σ(Y_2). Further, suppose that the decision makers wish to minimize the following cost function:

E[c(ω, u_i)],

where c : Ω × U → R_+ is a measurable cost function, with u_i ∈ U and U a Borel space. Here, for i = 1, 2, u_i = γ_i(Y_i) is generated by a function γ_i measurable on the σ-field generated by the random variable Y_i. Let Γ_i denote the space of all such measurable policies.

Prove that

inf_{γ_1∈Γ_1} E[c(ω, U_1)] ≥ inf_{γ_2∈Γ_2} E[c(ω, U_2)].

Exercise 2.4.2 Consider a line of customers at a service station (such as at an airport, a grocery store, or a communication network where customers are packets). Let L_t be the length of the line, that is, the total number of customers waiting in the line.

Let there be M_t servers serving the customers at time t. Let there be a manager (controller) who decides on the number of servers to be present. Let each of the servers be able to serve N customers in every time stage. The dynamics of the line can be expressed as follows:

L_{t+1} = L_t + A_t − M_t N 1_{(L_t ≥ N M_t)},

where 1_{(E)} is the indicator function for event E, i.e., it is equal to zero if E does not occur and is equal to 1 otherwise. In the equation above, A_t is the number of customers that have just arrived at time t. We assume {A_t} to be an independent process with a Poisson distribution with mean λ, that is, for all k ∈ N:

P(A_t = k) = λ^k e^{−λ} / k!,  k ∈ {0, 1, 2, ...}.

The manager has access only to the information vector

I_t = {L_0, L_1, ..., L_t; A_0, A_1, ..., A_{t−1}}

while implementing his policies. A consultant proposes a number of possible policies to be adopted by the manager.

According to Policy A, the number of servers is given by:

M_t = (L_t + L_{t−1} + L_{t−3}) / 2.

According to Policy B, the number of servers is:

M_t = (L_{t+1} + L_t + L_{t−1}) / 2.

According to Policy C,

M_t = ⌈λ + 0.1⌉ 1_{(t ≥ 10)}.

Finally, according to Policy D, the update is

M_t = ⌈λ + 0.1⌉.

a) Which of these policies are admissible, that is, measurable with respect to the σ-field generated by I_t? Which are Markov, or stationary?

b) Consider the model above, and suppose policy D is adopted by the manager. Consider the Markov chain induced by this policy, that is, a Markov chain with the following dynamics:

L_{t+1} = L_t + A_t − ⌈λ + 0.1⌉ N 1_{(L_t ≥ ⌈λ+0.1⌉ N)},

with A_t as described above.

Is this Markov chain irreducible? Is there an absorbing set in this Markov chain? If so, what is it?


Chapter 3

Classification of Markov Chains

3.1 Countable State Space Markov Chains

In this chapter, we first review Markov chains where the state takes values in a finite or countable set.

A Markov chain {x_t} on a countable state space is a collection of random variables {x_t, t ∈ Z_+}, where each of the variables takes values in a countable set X. We will also refer to the collection of such variables as a chain Φ which takes values in the (one-sided or two-sided) infinite product space X^∞. The Borel σ-field on this infinite product space is the one generated by the finite dimensional cylinder sets of the form

{x ∈ X^∞ : x_{[m,n]} = {a_m, a_{m+1}, ..., a_n}, a_k ∈ X, m ≤ k ≤ n, m, n ∈ Z_+}.

We assume that ν_0 is the initial distribution of the Markov chain, that is, the distribution of x_0.

As such, the process Φ = {x_0, x_1, ..., x_n, ...} is a (time-homogeneous) Markov chain and satisfies:

P_{ν_0}(x_0 = a_0, x_1 = a_1, x_2 = a_2, ..., x_n = a_n) = ν_0(x_0 = a_0) P(x_1 = a_1 | x_0 = a_0) P(x_2 = a_2 | x_1 = a_1) ··· P(x_n = a_n | x_{n−1} = a_{n−1}).  (3.1)

If the initial condition is known to be x_0, we use P_{x_0}(···) in place of P_{ν_0}(···).

We can also write the evolution in terms of a matrix:

P(i, j) = Pr(x_{t+1} = j | x_t = i) ≥ 0,  ∀i, j ∈ X.

Here P(·, ·) is also a probability transition kernel, that is, for every i ∈ X, P(i, ·) is a probability measure on X. Thus, ∑_j P(i, j) = 1.

By induction, we can verify that

P^k(i, j) := P(x_{t+k} = j | x_t = i) = ∑_{m∈X} P(i, m) P^{k−1}(m, j).
In the following, we characterize Markov Chains based on transience, recurrence <strong>and</strong> communication. We then<br />

consider the issue <strong>of</strong> the existence <strong>of</strong> an invariant distribution. Later, we will extend the analysis to uncountable<br />

space Markov chains.<br />

Let us consider a discrete-time, time-homogeneous Markov process living in a countable state space X.<br />

Communication<br />

If there exists an integer k such that P(x t+k = j|x t = i) = P k (i, j) > 0, <strong>and</strong> another integer l such that<br />

P(x t+l = i|x t = j) = P l (j, i) > 0 then state i communicates with state j.<br />


A set C ⊂ X is said to be communicating if every two elements (states) of C communicate with each other. If every state of X communicates with every other state, the chain is said to be irreducible.

The period of a state i is defined to be the greatest common divisor of {k > 0 : P^k(i, i) > 0}. A Markov chain is called aperiodic if the period of every state is 1. In the following, throughout the notes, we will assume that the Markov process under consideration is aperiodic unless mentioned otherwise.

Absorbing Set

A set C is called absorbing if P(i, C) = 1 for all i ∈ C. That is, once the state is in C, it cannot leave the set C.

A Markov chain on a finite state space X is irreducible if the smallest absorbing set is X itself.

A Markov chain in X is indecomposable if X does not contain two disjoint absorbing sets.
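These definitions are straightforward to check numerically for a finite chain. The sketch below, using a hypothetical three-state kernel, tests whether every pair of states communicates and computes the period of each state as the gcd of the return times that have positive probability; the finite horizon used here is a practical truncation, not part of the definitions.

```python
import numpy as np
from math import gcd
from functools import reduce

P = np.array([[0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0],
              [0.5, 0.0, 0.5]])   # a hypothetical kernel

def is_irreducible(P, horizon=None):
    """Check that P^k(i, j) > 0 for some k, for every pair (i, j)."""
    n = P.shape[0]
    horizon = horizon or n
    reach = np.zeros_like(P, dtype=bool)
    Pk = np.eye(n)
    for _ in range(horizon):
        Pk = Pk @ P
        reach |= Pk > 0
    return reach.all()

def period(P, i, horizon=50):
    """gcd of {k > 0 : P^k(i, i) > 0}, truncated at a finite horizon."""
    times = []
    Pk = np.eye(P.shape[0])
    for k in range(1, horizon + 1):
        Pk = Pk @ P
        if Pk[i, i] > 0:
            times.append(k)
    return reduce(gcd, times) if times else 0

print(is_irreducible(P))                   # True: every state communicates with every other
print([period(P, i) for i in range(3)])    # all periods equal 1 here, so the chain is aperiodic
```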

Occupation, Hitting and Stopping Times

For any set A ⊂ X, the occupation time η_A is the number of visits of the chain to the set A:

η_A = Σ_{t=1}^∞ 1_{(x_t ∈ A)},

where 1_{(E)} denotes the indicator function of an event E, that is, it takes the value 1 when E takes place and is 0 otherwise.

Define

τ_A = min{k > 0 : x_k ∈ A},

the first time that the state visits the set A, known as the hitting time. When the process starts in A, the same quantity

τ_A = min{k > 0 : x_k ∈ A, x_0 ∈ A}

is called a return time.
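Hitting and return times are also easy to estimate by simulation. The sketch below, reusing the hypothetical three-state kernel from above, estimates E_i[τ_i] by averaging simulated return times; the quantity 1/E_i[τ_i] will reappear below as the invariant probability of state i for irreducible positive recurrent chains.

```python
import numpy as np

rng = np.random.default_rng(0)
P = np.array([[0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0],
              [0.5, 0.0, 0.5]])   # the same hypothetical kernel as above

def sample_return_time(P, i, rng):
    """Simulate tau_i = min{k > 0 : x_k = i} starting from x_0 = i."""
    x, k = i, 0
    while True:
        x = rng.choice(P.shape[0], p=P[x])
        k += 1
        if x == i:
            return k

estimates = [np.mean([sample_return_time(P, i, rng) for _ in range(20000)])
             for i in range(3)]
print(estimates)   # Monte Carlo estimates of E_i[tau_i] for each state i
```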

A Brief Discussion on Stopping Times

The variable τ_A defined above is an example of a stopping time:

Definition 3.1.1 A function τ from a measurable space (X^∞, B(X^∞)) to (N_+, B(N_+)) is a stopping time if for all n ∈ N_+ the event {τ = n} is in σ(x_0, x_1, x_2, ..., x_n), that is, the event is in the sigma-field generated by the random variables up to time n.

Any realistic decision takes place at a time which is measurable in this sense. For example, if an investor wants to stop investing when he stops profiting, he can stop at the time when the investment loses value; this is a stopping time. But if an investor claims to stop investing when the investment is at its peak, this is not a stopping time, because to find out whether the investment is at its peak, the next state value would have to be known, and this is not measurable in a causal fashion.

One important property of Markov chains is the so-called Strong Markov Property. This says the following: consider a Markov chain which evolves on a countable set. If we sample this chain according to a stopping time rule, then from the sampled instant onwards the sampled process again evolves as a Markov chain:

Proposition 3.1.1 For a (time-homogeneous) Markov chain in a countable state space X, the strong Markov property always holds. That is, the sampled chain itself is a Markov chain; if τ is a stopping time with P(τ < ∞) = 1, then for any m ∈ N_+,

P(x_{τ+m} = a | x_τ = b_0, x_{τ−1} = b_1, ...) = P(x_{τ+m} = a | x_τ = b_0) = P^m(b_0, a),

for all a, b_0, b_1, ···.

3.1.1 Recurrence and Transience

Let us define

U(x, A) := E_x[ Σ_{t=1}^∞ 1_{(x_t ∈ A)} ] = Σ_{t=1}^∞ P^t(x, A),

and let us define

L(x, A) := P_x(τ_A < ∞),

which is the probability that the chain ever visits the set A when the process starts at state x. These two are important characterizations for Markov chains.

Definition 3.1.2 A set A ⊂ X is recurrent if the Markov chain visits A infinitely often (in expectation) when the process starts in A. This is equivalent to

E_x[η_A] = ∞, ∀x ∈ A.   (3.2)

That is, if the chain starts at a given state x ∈ A, it comes back to the set A, and does so infinitely often (in expectation). If a state is not recurrent, it is transient.

Definition 3.1.3 A set A ⊂ X is positive recurrent if the Markov chain visits A infinitely often when the process starts in A and, in addition,

E_x[τ_A] < ∞, ∀x ∈ A.

Definition 3.1.4 A state α is transient if

U(α, α) = E_α[η_α] < ∞.   (3.3)

The above is equivalent to the condition that

Σ_{i=1}^∞ P^i(α, α) < ∞,

which in turn is implied by

P_α(τ_α < ∞) < 1.

While discussing recurrence, there is an equivalent, and often easier to check, condition: if E_i[τ_i] < ∞, then the state {i} is positive recurrent. If L(i, i) = P_i(τ_i < ∞) = 1 but E_i[τ_i] = ∞, then i is recurrent (also called null recurrent).

The reader should connect the above with the strong Markov property: once the process hits a state, it starts afresh from that state as if the past never happened; the process renews itself.

We have the following. The proof is presented later in the chapter; see Theorem 3.3.1.

Theorem 3.1.1 If P_i(τ_i < ∞) = 1, then P_i(η_i = ∞) = 1.

There is a more general notion of recurrence, named Harris recurrence, which is exactly the condition that P_i(η_i = ∞) = 1. We will investigate this further below while studying uncountable state space Markov chains; however, one should note that even for countable state space chains Harris recurrence is stronger than recurrence.

For a finite state Markov chain it is not difficult to verify that (3.3) is equivalent to L(i, i) < 1, since P(τ_i(k) < ∞) = P(τ_i(k − 1) < ∞) P(τ_i(1) < ∞) and E[η] = Σ_k P(η ≥ k).

If every state in X is recurrent, then the chain is recurrent; if all states are transient, then the chain is transient.



3.2 Stability and Invariant Distributions

Stability is an important concept, but it has different meanings in different contexts. In a stochastic setting, stability means that the state does not grow large too often, and that the average over time (the temporal average) converges to a statistical average.

If a Markov chain starts at a given time, then in the long run the chain may forget its initial condition, that is, the probability distribution at a later time becomes less and less dependent on the initial distribution. Given an initial state distribution, the probability distribution of the state at time 1 is given by

π_1 = π_0 P,

and for t > 1,

π_{t+1} = π_t P = π_0 P^{t+1}.

One important property of Markov chains is whether the above iteration leads to a fixed point in the set of probability measures. Such a fixed point π is called an invariant distribution. A distribution on a countable state space is invariant if

π = πP,

which is equivalent to

π(j) = Σ_{i∈X} π(i) P(i, j), ∀j ∈ X.

We note that, if such a π exists, it can be written as π = π_0 lim_{t→∞} P^t for some π_0. Clearly, π_0 can be π itself, but often π_0 can be any initial distribution under irreducibility conditions which will be discussed further. Invariant distributions are especially important in networking problems and stochastic control, due to the Ergodic Theorem (which shows that temporal averages converge to statistical averages), which we will discuss later in the semester.
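For a finite chain the fixed point π = πP can be computed directly, either as a left eigenvector of P associated with the eigenvalue 1, or by iterating π_{t+1} = π_t P. The sketch below illustrates both for the hypothetical three-state kernel used earlier; comparing 1/π(i) with the simulated return times above illustrates the relation π(i) = 1/E_i[τ_i] discussed in the next subsection.

```python
import numpy as np

P = np.array([[0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0],
              [0.5, 0.0, 0.5]])

# Left eigenvector of P associated with eigenvalue 1, normalized to sum to 1.
w, V = np.linalg.eig(P.T)
pi = np.real(V[:, np.argmin(np.abs(w - 1.0))])
pi = pi / pi.sum()
print(pi)                      # invariant distribution

# The same fixed point reached by iterating pi_{t+1} = pi_t P.
mu = np.array([1.0, 0.0, 0.0])
for _ in range(200):
    mu = mu @ P
print(mu)                      # agrees with pi up to numerical error
print(1.0 / pi)                # equals E_i[tau_i]; compare with the Monte Carlo estimates above
```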

3.2.1 Invariant Measures via an Occupational Characterization

Theorem 3.2.1 For a Markov chain, if there exists an element i such that E_i[τ_i] < ∞, then the following is an invariant measure:

μ(j) = E[ Σ_{k=0}^{τ_i − 1} 1_{(x_k = j)} | x_0 = i ] / E_i[τ_i], ∀j ∈ X.

The invariant measure is unique if the chain is irreducible.

Proof:

We show that

E[ Σ_{k=0}^{τ_i − 1} 1_{(x_k = j)} | x_0 = i ] / E_i[τ_i] = Σ_s P(s, j) E[ Σ_{k=0}^{τ_i − 1} 1_{(x_k = s)} | x_0 = i ] / E_i[τ_i].

Note that E[1_{(x_{t+1} = j)}] = P(x_{t+1} = j). Hence,

Σ_s P(s, j) E[ Σ_{k=0}^{τ_i − 1} 1_{(x_k = s)} | x_0 = i ] / E_i[τ_i]
= E[ Σ_{k=0}^{τ_i − 1} Σ_s P(s, j) 1_{(x_k = s)} | x_0 = i ] / E_i[τ_i]
= E[ Σ_{k=0}^{τ_i − 1} Σ_s 1_{(x_k = s)} E[1_{(x_{k+1} = j)} | x_k = s, x_0 = i] | x_0 = i ] / E_i[τ_i]   (3.4)
= E[ Σ_{k=0}^{τ_i − 1} E[ Σ_s 1_{(x_k = s)} 1_{(x_{k+1} = j)} | x_k = s, x_0 = i ] | x_0 = i ] / E_i[τ_i]
= E[ Σ_{k=0}^{τ_i − 1} Σ_s 1_{(x_k = s)} 1_{(x_{k+1} = j)} | x_0 = i ] / E_i[τ_i]
= E[ Σ_{k=0}^{τ_i − 1} 1_{(x_{k+1} = j)} | x_0 = i ] / E_i[τ_i]
= E_i[ Σ_{k=1}^{τ_i} 1_{(x_k = j)} ] / E_i[τ_i]
= E_i[ Σ_{k=0}^{τ_i − 1} 1_{(x_k = j)} ] / E_i[τ_i] = μ(j),

where we use the fact that the number of visits to a given state does not change whether we include or exclude time t = 0 or τ_i, since any state j ≠ i is not visited at these times (and state i is visited exactly once under either convention). Here, (3.4) follows from the law of iterated expectations, see Theorem 4.1.3. This concludes the proof.

⊓⊔

Remark: If E_i[τ_i] < ∞, then the above measure is in fact a probability measure, since it follows that

Σ_j μ(j) = Σ_j E[ Σ_{k=0}^{τ_i − 1} 1_{(x_k = j)} | x_0 = i ] / E_i[τ_i] = 1.

For example, for a random walk E_i[τ_i] = ∞, hence there does not exist a unique invariant probability measure. But it has an invariant measure: μ(i) = K for a constant K. The reader can verify this.

⊓⊔

Theorem 3.2.2 For an irreducible Markov chain, positive recurrence implies the existence of an invariant distribution with π(i) = 1/E_i[τ_i]. Furthermore, the invariant distribution is unique.

Proof: From Theorem 3.2.1, it follows that π(i) = 1/E_i[τ_i] is satisfied for an invariant measure. That E_i[τ_i] is finite follows from the irreducibility of a finite state Markov chain. We will prove the uniqueness result for the case where P(i, j) > ǫ for some ǫ > 0 for all i, j ∈ X.

First, we show that the absorbing set has to be the entire state space, and that an invariant measure must be non-zero at every state. Let i be such that π(i) = 0 for an invariant distribution. Then, for X \ {i}, it follows that

π(X \ {i}) = Σ_{j∈X\{i}} π(j) P(j, X \ {i}) + 0 · P(i, X \ {i}) ≤ ( Σ_{j∈X\{i}} π(j) ) (1 − ǫ) = π(X \ {i}) (1 − ǫ),

since there exists some ǫ > 0 such that P(j, i) > ǫ for all j ∈ X \ {i}, so that P(j, X \ {i}) ≤ 1 − ǫ; this is a contradiction, since π(X \ {i}) = 1 > 0.

We now show that there cannot be two different invariant distributions. Let π and π′ be two different invariant distributions. By the previous argument, they have the same support set. It must be that for some linear functions F, G,

F π(i) = G[π(j), j ∈ X \ {i}]

and

F π′(i) = G[π′(j), j ∈ X \ {i}].

Let us define J := {i : π(i) ≥ π′(i)}; then, for some linear functions with positive coefficients,

F_J[π(j), j ∈ J] = G_{−J}[π(j), j ∈ X \ J],
F_J[π′(j), j ∈ J] = G_{−J}[π′(j), j ∈ X \ J],

and by linearity the same must hold for the difference:

F_J[π(j) − π′(j), j ∈ J] = G_{−J}[π(j) − π′(j), j ∈ X \ J].

But the terms in the vector on the left-hand side are all non-negative, while the terms on the right-hand side are strictly negative (since both π and π′ sum to 1, the differences outside J should be strictly negative). Since the matrices multiplying the vectors have all positive entries (this is where irreducibility is used), this is a contradiction. We skip the proof of why the matrix entries are all positive; this follows from the equation π = πP and the assumption that all entries of P are non-zero. As such, the entries of the two measures must be identical to each other.

⊓⊔

Invariant Distribution via Dobrushin's Ergodic Coefficient

One interesting result we have observed is the following: for every positive recurrent state i,

lim_{n→∞} P^n(i, i) = 1/E_i[τ_i].

Consider the iteration π_{t+1} = π_t P. We would like to know when this iteration converges to a limit. We first review some notions from analysis.

Review of Vector Spaces

Definition 3.2.1 A linear space is a space which is closed under addition and multiplication by a scalar.

Definition 3.2.2 A normed linear space X is a linear vector space on which a functional (a mapping from X to R, that is, a member of R^X) called a norm is defined such that:

• ||x|| ≥ 0 ∀x ∈ X, and ||x|| = 0 if and only if x is the null element (under addition and multiplication) of X.

• ||x + y|| ≤ ||x|| + ||y||.

• ||αx|| = |α| ||x||, ∀α ∈ R, ∀x ∈ X.

Definition 3.2.3 In a normed linear space X, an infinite sequence of elements {x_n} converges to an element x if the sequence {||x_n − x||} converges to zero.

Definition 3.2.4 A metric defined on a set X is a function d : X × X → R such that:

• d(x, y) ≥ 0, ∀x, y ∈ X, and d(x, y) = 0 if and only if x = y.

• d(x, y) = d(y, x), ∀x, y ∈ X.

• d(x, y) ≤ d(x, z) + d(z, y), ∀x, y, z ∈ X.

Definition 3.2.5 A metric space (X, d) is a set equipped with a metric d.

A normed linear space is also a metric space, with metric d(x, y) = ||x − y||.



Definition 3.2.6 A sequence {x_n} in a normed space X is Cauchy if for every ǫ > 0 there exists an N such that ||x_n − x_m|| ≤ ǫ for all n, m ≥ N.

The important observation about Cauchy sequences is that every convergent sequence is Cauchy, but not every Cauchy sequence is convergent: the limit might not live in the original space in which the sequence lives. This brings up the issue of completeness:

Definition 3.2.7 A normed linear space X is complete if every Cauchy sequence in X has a limit in X. A complete normed linear space is called a Banach space.

A map T from a complete normed linear space X to itself is called a contraction if for some 0 ≤ ρ < 1,

||T(x) − T(y)|| ≤ ρ ||x − y||, ∀x, y ∈ X.

Theorem 3.2.3 A contraction map in a Banach space has a unique fixed point.

Proof: {T^n(x)} forms a Cauchy sequence, and by completeness this Cauchy sequence has a limit; by continuity of T the limit is a fixed point, and the contraction property rules out two distinct fixed points.

⊓⊔

Contraction Mapping via Dobrushin's Ergodic Coefficient

Consider a countable state space Markov chain. Define

δ(P) = min_{i,k} Σ_j min(P(i, j), P(k, j)).

Observe that for two scalars a, b,

|a − b| = a + b − 2 min(a, b).

Let us define for a vector v the l_1 norm

||v||_1 = Σ_i |v_i|.

It is known that the set of all countably indexed real-valued vectors (that is, functions mapping Z → R) with a finite l_1 norm,

{v : ||v||_1 < ∞},

is a complete normed linear space, and as such is a Banach space (every Cauchy sequence has a limit).

With these observations, we state the following:

Theorem 3.2.4 (Dobrushin) For any probability measures π, π′, it follows that

||πP − π′P||_1 ≤ (1 − δ(P)) ||π − π′||_1.

Proof: Let ψ(i) = π(i) − min(π(i), π′(i)) for all i. Further, let ψ′(i) = π′(i) − min(π(i), π′(i)). Then,

||π − π′||_1 = ||ψ − ψ′||_1 = 2||ψ||_1 = 2||ψ′||_1,

since the sum

Σ_i (π(i) − π′(i)) = Σ_i (ψ(i) − ψ′(i))

is zero, and

Σ_i |π(i) − π′(i)| = ||ψ||_1 + ||ψ′||_1.


Now,

||πP − π′P||_1 = ||ψP − ψ′P||_1
= Σ_j | Σ_i ψ(i) P(i, j) − Σ_k ψ′(k) P(k, j) |
= (1/||ψ′||_1) Σ_j | Σ_k Σ_i ψ(i) ψ′(k) P(i, j) − Σ_i Σ_k ψ(i) ψ′(k) P(k, j) |   (3.5)
≤ (1/||ψ′||_1) Σ_j Σ_k Σ_i ψ(i) ψ′(k) |P(i, j) − P(k, j)|   (3.6)
= (1/||ψ′||_1) Σ_k Σ_i ψ(i) ψ′(k) Σ_j |P(i, j) − P(k, j)|
= (1/||ψ′||_1) Σ_k Σ_i |ψ(i)| |ψ′(k)| { Σ_j P(i, j) + P(k, j) − 2 min(P(i, j), P(k, j)) }   (3.7)
≤ (1/||ψ′||_1) Σ_k Σ_i |ψ(i)| |ψ′(k)| (2 − 2δ(P))   (3.8)
= ||ψ′||_1 (2 − 2δ(P))   (3.9)
= ||π − π′||_1 (1 − δ(P)).   (3.10)

In the above, (3.5) follows from inserting the factor Σ_k ψ′(k) = ||ψ′||_1 (equivalently Σ_i ψ(i) = ||ψ||_1 = ||ψ′||_1) into the summation, (3.6) from taking the absolute value inside the sums, (3.7) from the relation |a − b| = a + b − 2 min(a, b), (3.8) from the definition of δ(P), and finally (3.9)-(3.10) from the l_1 norms of ψ, ψ′.

As such, the map π ↦ πP is a contraction if δ(P) > 0. In essence, one then shows that {π_0 P^m} is a Cauchy sequence, and as every Cauchy sequence in a Banach space has a limit, this sequence has a limit. The limit is the invariant distribution, and Dobrushin's theorem also guarantees that this limit is unique.

⊓⊔

It should also be noted that Dobrushin's theorem tells us how fast the sequence of probability distributions {π_0 P^n} converges to the invariant distribution for any arbitrary π_0.
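The contraction can be checked numerically: for a (hypothetical) strictly positive kernel, the sketch below computes δ(P) = min_{i,k} Σ_j min(P(i, j), P(k, j)) and verifies that ||π_0 P^n − π||_1 stays below (1 − δ(P))^n ||π_0 − π||_1.

```python
import numpy as np

P = np.array([[0.5, 0.3, 0.2],
              [0.2, 0.6, 0.2],
              [0.3, 0.3, 0.4]])   # a hypothetical strictly positive kernel

def dobrushin_delta(P):
    n = P.shape[0]
    return min(np.minimum(P[i], P[k]).sum() for i in range(n) for k in range(n))

delta = dobrushin_delta(P)

# Invariant distribution via the left eigenvector for eigenvalue 1.
w, V = np.linalg.eig(P.T)
pi = np.real(V[:, np.argmin(np.abs(w - 1.0))]); pi /= pi.sum()

mu0 = np.array([1.0, 0.0, 0.0])       # an arbitrary initial distribution
mu = mu0.copy()
for n in range(1, 11):
    mu = mu @ P
    bound = (1 - delta) ** n * np.abs(mu0 - pi).sum()
    print(n, np.abs(mu - pi).sum(), "<=", bound)   # geometric convergence to pi
```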

Ergodic Theorem for Countable State Space Chains

For a Markov chain which has a unique invariant distribution μ(i), we have, almost surely,

lim_{T→∞} (1/T) Σ_{t=1}^T f(x_t) = Σ_i f(i) μ(i),

for every bounded f : X → R. This is called the ergodic theorem, due to Birkhoff, and it is a very powerful result: in essence, this property is what makes the connection between stochastic control and Markov chains over a long time horizon. In particular, for a stationary control policy leading to a unique invariant distribution with bounded costs, it follows that, almost surely,

lim_{T→∞} (1/T) Σ_{t=1}^T c(x_t, u_t) = Σ_{x,u} c(x, u) μ(x, u),

for every real-valued bounded c. The ergodic theorem is what makes a dynamic optimization problem equivalent to a static optimization problem under mild technical conditions. This will set up the core ideas in the convex analytic approach and the linear programming approach that will be discussed later.
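The ergodic theorem is also easy to observe in simulation: the sketch below (hypothetical kernel and an arbitrary function f) compares the temporal average (1/T) Σ_t f(x_t) along a single sample path with the statistical average Σ_i f(i) μ(i).

```python
import numpy as np

rng = np.random.default_rng(1)
P = np.array([[0.5, 0.3, 0.2],
              [0.2, 0.6, 0.2],
              [0.3, 0.3, 0.4]])
f = np.array([1.0, -2.0, 5.0])        # an arbitrary bounded function f : X -> R

# Invariant distribution mu.
w, V = np.linalg.eig(P.T)
mu = np.real(V[:, np.argmin(np.abs(w - 1.0))]); mu /= mu.sum()

# One long sample path started from state 0.
T, x, total = 200_000, 0, 0.0
for _ in range(T):
    x = rng.choice(3, p=P[x])
    total += f[x]

print(total / T)        # temporal average along the path
print(f @ mu)           # statistical average sum_i f(i) mu(i); the two agree for large T
```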



3.3 Uncountable (Complete, Separable, Metric) State Spaces

We now briefly extend the above definitions and discussions to an uncountable state space setting.

Let {x_t, t ∈ Z_+} be a Markov chain with a complete, separable, metric state space (X, B(X)), defined on a probability space (Ω, F, P), where B(X) denotes the Borel σ-field on X, Ω is the sample space, F a sigma-field of subsets of Ω, and P a probability measure. Let P(x, D) := P(x_{t+1} ∈ D | x_t = x) denote the transition probability from x to D, that is, the probability of the event {x_{t+1} ∈ D} given that x_t = x.

3.3.1 Chapman-Kolmogorov Equations

Consider a Markov chain with transition probability given by P(x, D), that is,

P(x_{t+1} ∈ D | x_t = x) = P(x, D).

We can compute P(x_{t+k} ∈ D | x_t = x) inductively as follows:

P(x_{t+k} ∈ D | x_t = x) = ∫ ... ∫ P(x_t, dx_{t+1}) ··· P(x_{t+k−2}, dx_{t+k−1}) P(x_{t+k−1}, D).

As such, we have for all n ≥ 1,

P^n(x, A) = P(x_{t+n} ∈ A | x_t = x) = ∫_X P^{n−1}(x, dy) P(y, A).

The justification for the above construction will be discussed further with regard to conditional expectations in the next chapter.

Definition 3.3.1 A Markov chain is µ-irreducible if for any set B ∈ B(X) with µ(B) > 0 and any x ∈ X, there exists some integer n > 0, possibly depending on B and x, such that P^n(x, B) > 0, where P^n(x, B) is the transition probability in n stages, that is, P(x_{t+n} ∈ B | x_t = x).

One popular case is with the Lebesgue measure:

Definition 3.3.2 A Markov chain is Lebesgue-irreducible if for any non-empty open set B ⊂ R and any x ∈ R, there exists some integer n > 0, possibly depending on B and x, such that P^n(x, B) > 0, where P^n(x, B) is the transition probability in n stages, that is, P(x_{t+n} ∈ B | x_t = x).

Example: Linear system with a drift. Consider the following linear system:

x_{t+1} = a x_t + w_t.

This chain is Lebesgue-irreducible if {w_t} is an i.i.d. Gaussian noise sequence.

A ϕ-irreducible Markov chain is aperiodic if for any x ∈ X and any B ∈ B(X) satisfying ϕ(B) > 0, there exists n_0 = n_0(x, B) such that

P^n(x, B) > 0 for all n ≥ n_0.

See Meyn and Tweedie [23] for a detailed discussion on periodicity of Markov chains and its connection with small sets, to be discussed shortly.

The definitions for recurrence and transience follow those in the countable state space setting.



Definition 3.3.3 A Markov chain is called recurrent if it is µ-irreducible and

E_x[ Σ_{t=1}^∞ 1_{(x_t ∈ A)} ] = Σ_{t=1}^∞ P^t(x, A) = ∞, ∀x ∈ X,

whenever µ(A) > 0 and A ∈ B(X).

While studying chains in an infinite state space, we use a stronger recurrence definition: a set A ∈ B(X) is Harris recurrent if the Markov chain visits A infinitely often with probability 1 when the process starts in A:

Definition 3.3.4 A set A ∈ B(X) is Harris recurrent if

P_x(η_A = ∞) = 1, ∀x ∈ A.   (3.11)

Theorem 3.3.1 Harris recurrence of a set A is equivalent to

P_x[τ_A < ∞] = 1, ∀x ∈ A.

Proof: We now prove this statement ([Meyn-Tweedie, Chapter 9]). Let τ_A(1) be the first time the state hits A, and τ_A(2), τ_A(3), ... the successive return times to A. By the Strong Markov Property, the Markov chain sampled at the times τ_A(1), τ_A(2), and so on is also a Markov chain. Let Q be the transition kernel of this sampled Markov chain. Now, the probability that τ_A(2) < ∞ can be computed as

P_x(τ_A(2) < ∞) = ∫_A Q(x, dy) P_y(τ_A(1) < ∞) = ∫_A Q(x, dy) = 1,   (3.12)

since, by hypothesis, P_y(τ_A(1) < ∞) = 1 for every y ∈ A. By induction, for every n ∈ Z_+,

P_x(τ_A(n + 1) < ∞) = ∫_A Q(x, dy) P_y(τ_A(n) < ∞) = 1.   (3.13)

Now,

P_x(η_A ≥ k) = P_x(τ_A(k) < ∞),

since visiting a set k times requires returning to the set k times when the initial state x is in the set. As such, P_x(η_A ≥ k) = 1 for all k ∈ Z_+. Define B_k = {ω ∈ Ω : η_A(ω) ≥ k}; it follows that B_{k+1} ⊂ B_k. By the continuity of probability, P(∩_k B_k) = lim_{k→∞} P(B_k) = 1, and it follows that P_x(η_A = ∞) = 1.

The proof of the other direction of the equivalence is left as an exercise to the reader.

⊓⊔

Definition 3.3.5 A Markov chain is Harris recurrent if the chain is µ-irreducible and every set A ∈ B(X) with µ(A) > 0 is Harris recurrent. If the chain admits an invariant probability measure, then the chain is called positive Harris recurrent.

Remark: Harris recurrence is stronger than recurrence: in one, an expectation is considered; in the other, a probability is considered. Consider the following example. Let P(1, 1) = 1, P(x, x + 1) = 1 − 1/x² and P(x, 1) = 1/x² for x > 1. Then P_x(τ_1 = ∞) = Π_{t≥x} (1 − 1/t²) > 0, so this chain is not Harris recurrent. It is π-irreducible for the measure π(A) = 1 if 1 ∈ A. Hence, this chain is recurrent but not Harris recurrent.

⊓⊔



If a set is not recurrent, it is transient. A set A is transient if

U(x, A) = E_x[η_A] < ∞, ∀x ∈ A.   (3.14)

Note that this is equivalent to

Σ_{i=1}^∞ P^i(x, A) < ∞, ∀x ∈ A.

3.3.2 Invariant Distributions for Uncountable Spaces

Definition 3.3.6 For a Markov chain with transition probability defined as before, a probability measure π is invariant on the Borel space (X, B(X)) if

π(D) = ∫_X P(x, D) π(dx), ∀D ∈ B(X).

A Harris recurrent chain always admits an invariant measure. But this measure might not be a probability measure.

Uncountable chains act like countable ones when there is a single atom α ∈ X. An atom is a set which carries a positive probability mass and from which all transitions are identical.

Definition 3.3.7 A set α is called an atom if there exists a probability measure ν such that

P(x, A) = ν(A), ∀x ∈ α, ∀A ∈ B(X).

If the chain is µ-irreducible and µ(α) > 0, then α is called an accessible atom.

In case there is an accessible atom α, we have the following:

Theorem 3.3.2 For a µ-irreducible Markov chain for which E_α[τ_α] < ∞, the following is the invariant probability measure:

π(A) = E_α[ Σ_{k=0}^{τ_α − 1} 1_{(x_k ∈ A)} | x_0 = α ] / E_α[τ_α], ∀A ∈ B(X), µ(A) > 0.

In case an atom is not present, one needs further machinery:

Definition 3.3.8 (Tweedie) A set A ⊂ X is n-small on (X, B(X)) if for some positive measure µ,

P^n(x, B) ≥ µ(B), ∀x ∈ A, and B ∈ B(X),

where B(X) denotes the (Borel) sigma-field on X.

Definition 3.3.9 (Meyn-Tweedie) A set A ⊂ X is µ-petite on (X, B(X)) if for some distribution T on N (the set of natural numbers) and some measure µ,

Σ_{n=0}^∞ P^n(x, B) T(n) ≥ µ(B), ∀x ∈ A, and B ∈ B(X),

where B(X) denotes the (Borel) sigma-field on X.

For an aperiodic Markov chain which is irreducible, every petite set is also small.

Remark 3.3.1 Small sets are very important in generalizing the results for countable state space Markov chains to more general spaces. Meyn and Tweedie present a discussion on periodicity based on the greatest common divisor of the set of possible return times from a given small set; such an approach lets one generalize the definitions of aperiodicity and periodicity.



Theorem 3.3.3 (Theorem 5.5.3 of [25], Lemma 17 of [28]) For an aperiodic and irreducible Markov chain {x_t}, every petite set is small.

A maximal irreducibility measure ψ is an irreducibility measure such that for all other irreducibility measures φ we have ψ(B) = 0 ⇒ φ(B) = 0 for any B ∈ B(X). Define B^+(X) = {A ∈ B(X) : ψ(A) > 0}, where ψ is a maximal irreducibility measure.

Theorem 3.3.4 For an aperiodic and irreducible Markov chain, every petite set is petite with respect to a maximal irreducibility measure and a sampling distribution with finite mean.

Nummelin's Splitting Technique

The results on recurrence apply to uncountable chains with no atom provided there is a small set or a petite set. Suppose a set A is 1-small, so that P(x, ·) ≥ ν(·) for all x ∈ A, and set δ = ν(X). Define a process z_t = (x_t, a_t), z_t ∈ X × {0, 1}; that is, we enlarge the probability space. When x_t ∉ A, a_t and x_t evolve independently from each other. However, when x_t ∈ A, we pick a Bernoulli random variable, and with probability δ the state visits A × {1}, while with probability 1 − δ it visits A × {0}. From A × {1}, the transition to the next time stage is given by ν(dx_{t+1})/δ, and from A × {0} it is given by

(P(dx_{t+1} | x_t) − ν(dx_{t+1})) / (1 − δ).

With the choice δ = ν(X), the set A × {1} is an accessible atom, and one can verify that the marginal distribution of the original Markov process {x_t} has not been altered.

The following can be established using the above construction.

Proposition 3.3.1 If

sup_{x∈A} E[min(t > 0 : x_t ∈ A) | x_0 = x] < ∞,

then

sup_{z∈A×{1}} E[min(t > 0 : z_t ∈ A × {1}) | z_0 = z] < ∞.

3.3.3 Existence of an Invariant Distribution

We state the final theorem on the invariant distributions of Markov chains:

Theorem 3.3.5 (Meyn-Tweedie) Consider a Markov process {x_t} taking values in X. If there is a Harris recurrent set A which is also a µ-petite set for some positive measure µ, and if the set satisfies

sup_{x∈A} E[min(t > 0 : x_t ∈ A) | x_0 = x] < ∞,

then the Markov chain is positive Harris recurrent and it admits a unique invariant distribution.

In this case, the invariant measure satisfies the following, which is a generalization of Kac's Lemma [14]:

Theorem 3.3.6 For a µ-irreducible Markov chain with a unique invariant probability measure π, and a set C with π(C) > 0, the invariant probability measure satisfies

π(A) = ∫_C π(dx) E_x[ Σ_{k=0}^{τ_C − 1} 1_{(x_k ∈ A)} ], ∀A ∈ B(X), µ(A) > 0.



The above can also be extended to compute the expected values of functions of the Markov state. The result can be verified along the same lines as in the countable state space case (see Theorem 3.2.1).

Example: Linear system with a drift. The linear system

x_{t+1} = a x_t + w_t,

with {w_t} i.i.d. Gaussian, is positive Harris recurrent if |a| < 1, null recurrent if |a| = 1, and transient if |a| > 1. That is, when recurrent, every open set will be visited with probability 1 infinitely often.
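In the Gaussian case the positive Harris recurrent regime can be seen numerically: for |a| < 1 and w_t i.i.d. N(0, σ²), the invariant distribution is N(0, σ²/(1 − a²)), and the empirical moments along a long trajectory match it. A small simulation sketch:

```python
import numpy as np

rng = np.random.default_rng(2)
a, sigma, T = 0.8, 1.0, 500_000

x = 0.0
xs = np.empty(T)
for t in range(T):
    x = a * x + sigma * rng.standard_normal()   # x_{t+1} = a x_t + w_t
    xs[t] = x

print(xs.mean(), xs.var())                  # empirical mean and variance along the path
print(0.0, sigma**2 / (1 - a**2))           # stationary N(0, sigma^2/(1-a^2)) moments
```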

3.3.4 Dobrushin's Ergodic Coefficient for Uncountable State Space Chains

We can extend Dobrushin's contraction result to the uncountable state space case. By the property that |a − b| = a + b − 2 min(a, b), Dobrushin's coefficient can also be written (for the countable state space case) as

δ(P) = 1 − (1/2) max_{i,k} Σ_j |P(i, j) − P(k, j)|.

In case P(x, dy) is the stochastic transition kernel of a real-valued Markov process whose transition kernels admit densities (that is, P(x, A) = ∫_A P(x, y) dy for a density P(x, ·)), the expression

δ(P) = 1 − (1/2) sup_{x,z} ∫ |P(x, y) − P(z, y)| dy

is Dobrushin's ergodic coefficient for R-valued Markov processes.

As such, if δ(P) > 0, then the iterations π_t(·) = ∫_R π_{t−1}(z) P(z, ·) dz converge to a unique fixed point.

3.3.5 Ergodic Theorem for Uncountable State Space Chains

Likewise, for the uncountable setting, if µ(·) is the invariant probability distribution of a positive Harris recurrent Markov chain, it follows that

lim_{T→∞} (1/T) Σ_{t=1}^T c(x_t) = ∫ c(x) µ(dx), ∀c ∈ L_1(µ) := {f : ∫ |f(x)| µ(dx) < ∞},

almost surely.

We will prove this theorem after we discuss martingales.

3.4 Further Results on the Existence of an Invariant Probability Measure

3.4.1 Case with the Feller condition

A Markov chain is weak Feller if ∫_X P(dz|x) v(z) is continuous in x for every continuous and bounded v on X.

Theorem 3.4.1 Let {x_t} be a weak Feller Markov process living in a compact subset of a complete, separable metric space. Then {x_t} admits an invariant distribution.

Proof. The proof follows from the observation that the space of probability measures on such a compact set is tight. Consider the sequence of expected occupation measures µ_T = (1/T) Σ_{t=0}^{T−1} µ_0 P^t, for an arbitrary initial distribution µ_0. There exists a subsequence µ_{T_k} which converges weakly to some µ*. It follows that for every continuous and bounded function f,

⟨µ_T, f⟩ := ∫ µ_T(dx) f(x) → ⟨µ*, f⟩.

Likewise, since Pf is also continuous and bounded by the weak Feller property,

⟨µ_T, Pf⟩ := ∫ µ_T(dx) ( ∫ P(dy|x) f(y) ) → ⟨µ*, Pf⟩.

Now,

(µ_T − µ_T P)(f) = (1/T) E_{µ_0}[ Σ_{k=0}^{T−1} (P^k f)(x_0) − Σ_{k=0}^{T−1} (P^{k+1} f)(x_0) ] = (1/T) E_{µ_0}[ f(x_0) − f(x_T) ] → 0.   (3.15)

Thus,

(µ_T − µ_T P)(f) = ⟨µ_T, f⟩ − ⟨µ_T P, f⟩ = ⟨µ_T, f⟩ − ⟨µ_T, Pf⟩ → ⟨µ* − µ*P, f⟩ = 0.

Thus, µ* is an invariant probability measure.

⊓⊔

Lasserre [19] gives the following example to emphasize the importance of the Feller property: consider a Markov chain evolving in [0, 1] given by P(x, x/2) = 1 for all x ≠ 0 and P(0, 1) = 1. This chain does not admit an invariant measure.

Remark 3.4.1 Lasserre presents a class of chains, Quasi-Feller chains, which are weak Feller except on a closed set of points N such that P(x, N) = 0 for all x ∈ D = X \ N, P(x, N) < 1 for x ∈ N, and the chain is weak Feller at every x ∈ D.

3.4.2 Case without the Feller condition

One can also relax the weak Feller condition.

Theorem 3.4.2 [20] For a Markov chain satisfying a uniform countable additivity condition as in (4.7) in the next chapter, there exists an invariant probability measure if and only if there exist a non-negative function g with

lim_{n→∞} inf_{x∉K_n} g(x) = ∞

and an initial probability measure ν_0 such that sup_t E_{ν_0}[g(x_t)] < ∞.

The proof of this result follows from a similar observation as (3.15), but with weak convergence replaced by setwise convergence:

Lemma 3.4.3 [7, Thm. 4.7.25] Let µ be a finite measure on a measurable space (T, A). Assume a set of probability measures Ψ ⊂ P(T) satisfies

P ≤ µ, for all P ∈ Ψ.

Then Ψ is setwise precompact.

A sequence of occupation measures under (4.7) has a subsequence which converges setwise, and the limit of this subsequence is invariant. As an example, consider a system of the form

x_{t+1} = f(x_t) + w_t,   (3.16)

where w_t admits a distribution with a bounded density function which is positive everywhere. If the system has a finite second moment, then the system admits an invariant probability measure, which is unique.



3.5 Exercises

Exercise 3.5.1 Prove that if {x_t} is irreducible, then all states have the same period.

Exercise 3.5.2 Prove that

P_x(τ_A = 1) = P(x, A),

and for n ≥ 2,

P_x(τ_A = n) = Σ_{i∉A} P(x, i) P_i(τ_A = n − 1).

Exercise 3.5.3 Show that irreducibility of a Markov chain in a finite state space implies that every set A and every x satisfy U(x, A) = ∞.

Exercise 3.5.4 Show that for an irreducible Markov chain, either the entire chain is transient, or it is recurrent.

Exercise 3.5.5 If P_x(τ_A < ∞) < 1, show that A is transient.

Exercise 3.5.6 If P_x(τ_A < ∞) = 1, show that A is Harris recurrent.

Exercise 3.5.7 Prove that a discrete-time random walk on Z^k defined by the transition kernel P(i, j) = (1/2k) 1_{(|i−j|=1)} is recurrent for k = 1 and k = 2.




Chapter 4

Martingales and Foster-Lyapunov Criteria for Stabilization of Markov Chains

4.1 Martingales

In this chapter, we first discuss a number of martingale theorems. Only a few of these will be essential for us; some others are presented for the sake of completeness, and we will not be concerned with those within the scope of our course.

These results are very important for understanding the stabilization of controlled stochastic systems, and they will also pave the way to the optimization of dynamical systems. The second half of this chapter is on the stabilization of Markov chains.

4.1.1 More on Expectations and Conditional Probability

Let (Ω, F, P) be a probability space and let G be a sub-σ-field of F. Let X be an R-valued random variable measurable with respect to (Ω, F) with a finite absolute expectation, that is,

E[|X|] = ∫_Ω |X(ω)| P(dω) < ∞,

where ω ∈ Ω. We call such random variables integrable.

We say that Ξ is the conditional expectation random variable (also called a version of the conditional expectation) E[X|G] of X given G if

i) Ξ is G-measurable; ii) for every A ∈ G,

E[1_A Ξ] = E[1_A X],

where

E[1_A X] = ∫_Ω X(ω) 1_{(ω∈A)} P(dω).

For example, if the information that we know about a process is whether an event A ∈ F happened or not, then

X_A := E[X|A] = (1/P(A)) ∫_A P(dω) X(ω).

If the information we have is that A did not take place,

X_{A^C} := E[X|A^C] = (1/P(Ω \ A)) ∫_{Ω\A} P(dω) X(ω).

Thus, the conditional expectation given the sigma-field generated by A (which is F_A = {∅, Ω, A, Ω \ A}) is given by

E[X|F_A] = X_A 1_{(ω∈A)} + X_{A^C} 1_{(ω∉A)}.

We note that conditional probability can be written as

P(A|G) = E[1_A | G];

hence, conditional probability is a special case of conditional expectation.
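A concrete finite-sample illustration (with a made-up random variable and event): conditioning on the σ-field generated by a single event A produces a random variable that is constant on A and constant on A^C, and averaging it over either set reproduces the corresponding average of X, which is exactly property (ii) above.

```python
import numpy as np

rng = np.random.default_rng(3)
omega = rng.standard_normal(100_000)        # sample points
X = omega**2 + omega                        # an integrable random variable X(omega)
A = omega > 0                               # the event A

X_A  = X[A].mean()                          # E[X | A]
X_Ac = X[~A].mean()                         # E[X | A^C]
Xi = np.where(A, X_A, X_Ac)                 # E[X | F_A](omega): constant on A and on A^C

# Property (ii): E[1_A Xi] = E[1_A X], and similarly on A^C (exact up to rounding).
print((A * Xi).mean(), (A * X).mean())
print(((~A) * Xi).mean(), ((~A) * X).mean())
```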

Let {B_i} be a finite (or countable) partition of Ω which generates the σ-field G. The conditional expectation E[X|G] must take the same value at every ω in a given B_i, since otherwise E[X|G] would not be measurable with respect to G. As such, for every B = B_i ∈ G, we have that

∫_B X(ω) P(dω) = ∫_B E[X|B] P(dω) = E[X|B] ∫_B P(dω) = P(B) E[X|B].

The notion of conditional expectation is key for the development of stochastic processes which evolve according to a transition kernel. It is also useful for optimal decision making when partial information is available with regard to a random variable.

The following discussion is optional until the next subsection.

Theorem 4.1.1 (Radon-Nikodym) Let µ and ν be two σ-finite positive measures on (Ω, F) such that ν(A) = 0 implies µ(A) = 0 (that is, µ is absolutely continuous with respect to ν). Then, there exists a measurable function f : Ω → R such that for every A,

µ(A) = ∫_A f(ω) ν(dω).

The representation above is unique up to sets of measure zero. With the above discussion, the conditional expectation E[X|F′] exists for any sub-σ-field F′ of F.

It is a useful exercise to now consider the σ-field generated by an observation variable, and what a conditional expectation means in this case.

Theorem 4.1.2 Let X be an X-valued random variable, where X is a complete, separable, metric space, and let Y be another Y-valued random variable. Then, the random variable X is F_Y-measurable (F_Y being the σ-field generated by Y) if and only if there exists a measurable function f : Y → X such that X = f(Y(ω)).

Proof: We prove the more difficult direction. Suppose first that X lives in a countable space {a_1, a_2, ..., a_n, ...}. We can write A_n = X^{−1}(a_n). Since X is measurable with respect to F_Y, A_n ∈ F_Y. Let us now make these sets disjoint: define C_1 = A_1 and C_n = A_n \ ∪_{i=1}^{n−1} A_i. These sets are disjoint, and they all live in F_Y. Thus, we can construct a measurable function defined by f(ω) = a_n if ω ∈ C_n.

The issue is now what happens when X lives in an uncountable, complete, separable, metric space such as R. We can define countably many sets which cover the entire state space (such as intervals with rational endpoints covering R). Let us construct a sequence of partitions Γ_n = {B_{i,n}, ∪_i B_{i,n} = X} which get finer and finer (for R, for instance, intervals of length 1/n). We can define a sequence of measurable functions f_n, one for every such partition, as above. Furthermore, f_n(ω) converges pointwise to a function f(ω), and a pointwise limit of measurable functions is measurable.

⋄

With the above, the expectation E[X|Y = y_0] can be defined, and this expectation is a measurable function of Y.

4.1.2 Some Properties of Conditional Expectation

One very important property is given by the following.

Iterated expectations:

Theorem 4.1.3 If H ⊂ G ⊂ F and X is F-measurable, then it follows that

E[E[X|G] | H] = E[X|H].

Proof: Take a set A ∈ H, which is also in G and in F. Let η = E[X|H] be the conditional expectation of X with respect to H, so that E[1_A η] = E[1_A X]. Now let η′ = E[X|G]. Then E[1_A η′] = E[1_A X] for all A ∈ G, and hence in particular for all A ∈ H. Thus E[1_A E[η′|H]] = E[1_A η′] = E[1_A X] = E[1_A η] for all A ∈ H, and since both E[η′|H] and η are H-measurable, the two conditional expectations are the same (almost surely).

⋄

4.1.3 Discrete-Time Martingales

Let (Ω, F, P) be a probability space. An increasing family {F_n} of sub-σ-fields of F is called a filtration. A sequence of random variables on (Ω, F, P) is said to be adapted to {F_n} if X_n is F_n-measurable, that is,

X_n^{−1}(D) = {ω ∈ Ω : X_n(ω) ∈ D} ∈ F_n for all Borel D.

This holds, for example, if F_n = σ(X_m, m ≤ n), n ≥ 0.

Given a filtration {F_n} and a sequence of real random variables adapted to it, (X_n, F_n) is said to be a martingale if

E[|X_n|] < ∞

and

E[X_{n+1} | F_n] = X_n.

We will occasionally take the sigma-fields to be F_n = σ(X_1, X_2, ..., X_n). For example, M_t = Σ_{k=0}^t ǫ_k, where {ǫ_k} is a zero-mean, integrable, i.i.d. random sequence, is a martingale for every t ∈ N.

A fair game is a martingale when it is played up to a finite time.
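This example is easy to check by simulation. One consequence of the martingale property is that the increment M_{t+1} − M_t is uncorrelated with any bounded function of the past; the sketch below verifies this numerically for partial sums of centered i.i.d. increments (the increment distribution below is arbitrary).

```python
import numpy as np

rng = np.random.default_rng(4)
n_paths, n_steps = 100_000, 20

eps = rng.uniform(-1.0, 1.0, size=(n_paths, n_steps))   # zero-mean i.i.d. increments
M = np.cumsum(eps, axis=1)                               # partial sums M_t

# Martingale property implies E[(M_{t+1} - M_t) g(past)] = 0 for any bounded g of the past.
increment = M[:, 10] - M[:, 9]
for g in (np.sign(M[:, 9]), np.cos(M[:, 5])):
    print(np.mean(increment * g))    # both averages are near 0 (Monte Carlo error only)

print(np.mean(M[:, 10]), np.mean(M[:, 9]))   # both sample means are near 0, consistent with E[M_t] = 0
```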

Let n > m, n, m ∈ Z_+. Since F_m ⊂ F_n, any A ∈ F_m is also in F_n. Thus, if {X_n} is a martingale sequence,

E[1_A X_n] = E[1_A X_{n−1}] = ··· = E[1_A X_m],

and therefore E[X_n | F_m] = X_m.

If instead we have

E[X_n | F_m] ≥ X_m,

then {X_n} is called a submartingale, and if

E[X_n | F_m] ≤ X_m,

then {X_n} is called a supermartingale.

A useful concept related to filtrations is that of a stopping time, which we discussed while studying Markov chains. A stopping time is a random time whose occurrence is measurable with respect to the filtration, in the sense that for each n ∈ N, {T ≤ n} ∈ F_n.

4.1.4 Doob's Optional Sampling Theorem

Theorem 4.1.4 Suppose (X_n, F_n) is a martingale sequence, and ρ, τ < n are bounded stopping times with ρ ≤ τ. Then,

E[X_τ | F_ρ] = X_ρ.



Proof: We observe that

E[X_τ − X_ρ | F_ρ] = E[ Σ_{k=ρ}^{τ−1} (X_{k+1} − X_k) | F_ρ ] = E[ Σ_{k=ρ}^{τ−1} E[X_{k+1} − X_k | F_k] | F_ρ ] = E[ Σ_{k=ρ}^{τ−1} 0 | F_ρ ] = 0.   (4.1)

⋄

Theorem 4.1.5 Let {X_n} be a sequence of F_n-adapted integrable real random variables. Then, the following are equivalent: (i) (X_n, F_n) is a submartingale. (ii) If T, S are bounded stopping times with T ≥ S (almost surely), then E[X_S] ≤ E[X_T].

Proof: Let S ≤ T ≤ n. Now, note that

E[X_T − X_S | F_S] = Σ_{k=1}^n E[ 1_{(T ≥ k)} 1_{(S < k)} (X_k − X_{k−1}) | F_S ].



Theorem 4.1.6 Let {X_n} be a supermartingale sequence, and let ζ_N(a, b) denote the number of upcrossings of the interval [a, b] by X_1, ..., X_N. Then,

E[ζ_N(a, b)] ≤ (E[|X_N|] + |a|) / (b − a).

Proof:

There are three possibilities for where the process may end: below a, between a and b, or above b. If it crosses above b, then an upcrossing has been completed. In view of this, let T_1, T_2, ... denote the successive times at which the process crosses below a and then above b, and let β_N := max{m : T_{2m−1} ≤ N} (so that either T_{2β_N} = N or T_{2β_N−1} = N, with the last crossing time truncated at N). Here, β_N is measurable on σ(X_1, X_2, ..., X_N).

Then,

0 ≥ E[ Σ_{i=1}^{β_N} (X_{T_{2i}} − X_{T_{2i−1}}) ]
= E[ ( Σ_{i=1}^{β_N} (X_{T_{2i}} − X_{T_{2i−1}}) ) 1_{(T_{2β_N} = N)} ] + E[ ( Σ_{i=1}^{β_N} (X_{T_{2i}} − X_{T_{2i−1}}) ) 1_{(T_{2β_N−1} = N)} ]
= E[ ( Σ_{i=1}^{β_N−1} (X_{T_{2i}} − X_{T_{2i−1}}) + X_N − X_{T_{2β_N−1}} ) 1_{(T_{2β_N} = N)} ] + E[ ( Σ_{i=1}^{β_N−1} (X_{T_{2i}} − X_{T_{2i−1}}) ) 1_{(T_{2β_N−1} = N)} ]
= E[ Σ_{i=1}^{β_N−1} (X_{T_{2i}} − X_{T_{2i−1}}) ] + E[ (X_N − X_{T_{2β_N−1}}) 1_{(T_{2β_N} = N)} ].   (4.3)

Thus,

E[ Σ_{i=1}^{β_N−1} (X_{T_{2i}} − X_{T_{2i−1}}) ] ≤ −E[ (X_N − X_{T_{2β_N−1}}) 1_{(T_{2β_N} = N)} ] ≤ E[(a − X_N)].   (4.4)

Since each completed upcrossing raises the process from below a to above b, E[ Σ_{i=1}^{β_N−1} (X_{T_{2i}} − X_{T_{2i−1}}) ] ≥ E[β_N − 1](b − a), and it follows that

ζ_N(a, b) = β_N − 1

satisfies

E[ζ_N(a, b)](b − a) ≤ E[(a − X_N)] ≤ |a| + E[|X_N|],

and the result follows.

Recall that a sequence of random variables defined on a probability space (Ω, F, P) converges to X almost surely (a.s.) if

P(ω : lim_{n→∞} X_n(ω) = X(ω)) = 1.

Theorem 4.1.7 Suppose {X_n} is a supermartingale and sup_{n≥0} E[|X_n|] < ∞. Then lim_{n→∞} X_n = X exists (almost surely) and E[|X|] < ∞. Note that the same result applies for submartingales, by considering −X_n regarded as a supermartingale, under the condition sup_{n≥0} E[max(0, X_n)] < ∞.

Proof: The proof follows from Doob's upcrossing lemma. Suppose X_n does not converge almost surely. This means that, with positive probability, limsup X_n ≠ liminf X_n; in particular, there exist reals a < b such that limsup X_n ≥ b and liminf X_n ≤ a, so that the interval [a, b] is upcrossed infinitely many times. By the upcrossing lemma, for every N,

E[ζ_N(a, b)] ≤ (E[|X_N|] + |a|) / (b − a) < ∞,

and the right-hand side is bounded uniformly in N since sup_N E[|X_N|] < ∞. Since ζ_N(a, b) is a monotonically increasing sequence in N, by the monotone convergence theorem

lim_{N→∞} E[ζ_N(a, b)] = E[ lim_{N→∞} ζ_N(a, b) ] < ∞.

Hence the number of upcrossings of [a, b] is finite with probability one (it cannot be infinite with non-zero probability), which contradicts the existence of infinitely many crossings on a set of positive probability. Thus X_n must converge almost surely.

⋄

Theorem 4.1.8 (Submartingale Convergence Theorem) Suppose {X_n} is a submartingale and sup_{n≥0} E[|X_n|] < ∞. Then X := lim_{n→∞} X_n exists (almost surely) and E[|X|] < ∞.

Proof: Note that sup_{n≥0} E[|X_n|] < ∞ is a sufficient condition both for a submartingale and for a supermartingale in Theorem 4.1.7. Hence X_n → X almost surely. For finiteness, suppose E[|X|] = ∞. Since X_n → X a.s., liminf |X_n| = |X| a.s., and by Fatou's lemma,

limsup E[|X_n|] ≥ liminf E[|X_n|] ≥ E[liminf |X_n|] = E[|X|] = ∞.

But this is a contradiction, as we had assumed that sup_n E[|X_n|] < ∞.

⋄

4.1.6 Proof of the Birkhoff Individual Ergodic Theorem

This will be discussed in class.

4.1.7 This section is optional: Further Martingale Theorems

This section is optional. If you wish not to read it, please proceed to the discussion on stabilization of Markov chains.

Theorem 4.1.9 Let {X_n} be a martingale such that X_n converges to X in L_1, that is, E[|X_n − X|] → 0. Then,

X_n = E[X | F_n], n ∈ N.

We will use the following while studying the convex analytic method. Let us define uniform integrability:

Definition 4.1.1 A sequence of random variables {X_n} is uniformly integrable if

lim_{K→∞} sup_n ∫_{(|X_n| ≥ K)} |X_n| dP = 0.

This implies that

sup_n E[|X_n|] < ∞.

Suppose that for some ǫ > 0,

sup_n E[|X_n|^{1+ǫ}] < ∞.

This implies that the sequence is uniformly integrable, as

sup_n ∫_{(|X_n| ≥ K)} |X_n| dP ≤ sup_n ∫_{(|X_n| ≥ K)} (|X_n|/K)^ǫ |X_n| dP ≤ (1/K^ǫ) sup_n E[|X_n|^{1+ǫ}] → 0 as K → ∞.

The following is a very important result:

Theorem 4.1.10 If {X_n} is a uniformly integrable martingale, then X = lim_{n→∞} X_n exists almost surely (for all sequences with probability 1) and in L_1, and X_n = E[X | F_n].



Optional Sampling Theorem for Uniformly Integrable Martingales

Theorem 4.1.11 Suppose (X_n, F_n) is a uniformly integrable martingale sequence, and ρ, τ are stopping times with ρ ≤ τ. Then,

E[X_τ | F_ρ] = X_ρ.

Proof: By uniform integrability, it follows that {X_t} has a limit; call this limit X_∞. It follows that E[X_∞ | F_τ] = X_τ and

E[E[X_∞ | F_τ] | F_ρ] = X_ρ,

which is also equal to E[X_τ | F_ρ], so that E[X_τ | F_ρ] = X_ρ.

⋄

4.1.8 Azuma-Hoeffding Inequality for Martingales with Bounded Increments

The following is an important concentration result:

Theorem 4.1.12 Let {X_t} be a martingale sequence such that |X_t − X_{t−1}| ≤ c for every t, almost surely. Then, for any x > 0,

P( |X_t − X_0| / t ≥ x ) ≤ 2 e^{−t x² / (2c²)}.
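A numerical sanity check of this bound (written with the c² form of the exponent used above): simulate a martingale with increments bounded by c and compare the empirical tail probability with 2 exp(−t x²/(2c²)).

```python
import numpy as np

rng = np.random.default_rng(5)
n_paths, t, c, x = 20_000, 400, 1.0, 0.1

# A martingale with increments in [-c, c]: symmetric +/- c steps, X_0 = 0.
steps = rng.choice([-c, c], size=(n_paths, t))
X_t = steps.sum(axis=1)                 # X_t - X_0 for each path

empirical = np.mean(np.abs(X_t) / t >= x)
bound = 2 * np.exp(-t * x**2 / (2 * c**2))
print(empirical, "<=", bound)           # the empirical tail lies below the Azuma-Hoeffding bound
```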

4.2 Stability of Markov Chains: Foster-Lyapunov Techniques

A Markov chain's stability can be characterized by drift conditions, as we discuss below in detail.

4.2.1 Criterion for Positive Harris Recurrence

Theorem 4.2.1 (Foster-Lyapunov for Positive Recurrence) [23] Let S be a petite set, b ∈ R, and V(·) a mapping from X to R_+. Let {x_n} be an irreducible Markov chain on X. If the following is satisfied for all x ∈ X:

E[V(x_{t+1}) | x_t = x] = ∫_X P(x, dy) V(y) ≤ V(x) − 1 + b 1_{(x∈S)},   (4.5)

then a unique invariant probability measure exists for the Markov chain.

Pro<strong>of</strong>: We will provide a pro<strong>of</strong> assuming that S is such that sup x∈S V (x) < ∞ (The pro<strong>of</strong> is applicable without<br />

such an assumption also, with further details on replacing S with one which satisfies the above assumption. That<br />

is, for every petite set, we can find one on which V is bounded. See [23] for details on this). Define M 0 := V (x 0 ),<br />

<strong>and</strong> for t ≥ 1<br />

∑t−1<br />

M t := V (x t ) − (−1 + b1 (xt∈S))<br />

It follows that<br />

i=0<br />

E[M (t+1) |x s , s ≤ t] ≤ M t , ∀t ≥ 0.<br />

Furthermore, it can be shown that E[|M t |] ≤ ∞ for all t. Thus, {M t } is a supermartingale. Define a stopping<br />

time: τ N = min(N, τ), where τ = min{i > 0 : x i ∈ S}. The stopping time is bounded. Hence, we have, by the<br />

martingale optional sampling theorem:<br />

E[M τ N] ≤ E[M 0 ].<br />

Hence, we obtain<br />

τ∑<br />

N −1<br />

E x0 [ ] ≤ V (x 0 ) + bE x0 [<br />

i=0<br />

∑<br />

τ N −1<br />

1 (xt∈S)]<br />

i=0


38CHAPTER 4. MARTINGALES AND FOSTER-LYAPUNOV CRITERIA FOR STABILIZATION OF MARKOV CHAINS<br />

Thus, $E_{x_0}[\tau^N] = E_{x_0}[(\tau^N - 1) + 1] \le V(x_0) + b$, and by the Monotone Convergence Theorem,
$$\lim_{N \to \infty} E_{x_0}[\tau^N] = E_{x_0}[\tau] \le V(x_0) + b.$$
Now, if we had that
$$\sup_{x \in S} V(x) < \infty,$$
the proof would be complete in view of Theorem 3.3.5 and the fact that $E_x[\tau] < \infty$ for any $x$, leading to the Harris recurrence of $S$.

Typically, this condition is satisfied. However, in case it is not easy to verify directly, we may need additional steps to construct a suitable petite set. In the following, we consider this. Following [23], Chapter 11, define for some $l \in \mathbb{Z}_+$
$$V_C(l) = \{x : V(x) \le l\}.$$
We will show that $B := V_C(l)$ is itself a petite set which is recurrent and satisfies the uniform finite-mean-return property. Writing $C$ for the petite set $S$ in (4.5), petiteness of $C$ with some measure $\nu$ gives
$$K_a(x, B) \ge 1_{\{x \in C\}}\,\nu(B), \qquad x \in X,$$
where $K_a(x, B) = \sum_i a(i) P^i(x, B)$, and hence
$$1_{\{x \in C\}} \le \frac{1}{\nu(B)} K_a(x, B).$$
Now,
$$
E_x[\tau_B] \le V(x) + b\,E_x\Big[\sum_{k=0}^{\tau_B - 1} 1_{\{x_k \in C\}}\Big]
\le V(x) + b\,E_x\Big[\sum_{k=0}^{\tau_B - 1} \frac{1}{\nu(B)} K_a(x_k, B)\Big]
= V(x) + \frac{b}{\nu(B)}\, E_x\Big[\sum_{k=0}^{\tau_B - 1} \sum_i a(i) P^i(x_k, B)\Big]
$$
$$
= V(x) + \frac{b}{\nu(B)} \sum_i a(i)\, E_x\Big[\sum_{k=0}^{\tau_B - 1} 1_{\{x_{k+i} \in B\}}\Big]
\le V(x) + \frac{b}{\nu(B)} \sum_i a(i)(1 + i),
$$
where the last inequality follows since the process can hit $B$ at most once between times $0$ and $\tau_B - 1$. Now, the petiteness measure can be adjusted such that $\sum_i a(i)\,i < \infty$ (by Proposition 5.5.6 of [23]), leading to the result that
$$\sup_{x \in B} E_x[\tau_B] \le \sup_{x \in B} V(x) + \frac{b}{\nu(B)} \sum_i a(i)(1 + i) < \infty,$$
so that $B$ can serve as the recurrent set. This set is furthermore recurrent, since
$$\sup_{x \in B} E_x[\tau_C] \le \sup_{x \in B} V(x) + 1 =: K < \infty,$$
and as a result $P_x(\tau_C > n) \le K/n$ for any $n$. Finally, since $C$ is petite, so is $B$.
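As an illustration of how (4.5) and the resulting return-time bound can be checked in a concrete model, the following short Python sketch (not part of the original notes; the queueing model and the choice of Lyapunov function are assumptions made here for illustration) considers the single-server queue $x_{t+1} = \max(x_t + A_t - 1, 0)$ with Bernoulli($\lambda$) arrivals, $\lambda < 1$. With $V(x) = x/(1-\lambda)$ and $S = \{0\}$, one can verify (4.5) with $b = 1/(1-\lambda)$, so Theorem 4.2.1 yields $E_x[\tau_S] \le V(x) + b$; the simulation compares this bound with an empirical estimate of the mean return time.

import random

# Assumed illustrative model: x_{t+1} = max(x_t + A_t - 1, 0), A_t ~ Bernoulli(lam).
# With V(x) = x/(1-lam) and S = {0}, the drift condition (4.5) holds with b = 1/(1-lam),
# so Theorem 4.2.1 gives E_x[tau_S] <= V(x) + b.
lam = 0.6
b = 1.0 / (1.0 - lam)
V = lambda x: x / (1.0 - lam)

def return_time_to_zero(x0, rng):
    """Sample tau_S = min{t > 0 : x_t in S} for S = {0}, starting from x_0 = x0."""
    x, t = x0, 0
    while True:
        a = 1 if rng.random() < lam else 0
        x = max(x + a - 1, 0)
        t += 1
        if x == 0:
            return t

rng = random.Random(0)
for x0 in [0, 5, 10]:
    runs = 20000
    est = sum(return_time_to_zero(x0, rng) for _ in range(runs)) / runs
    print(f"x0={x0:2d}  estimated E_x[tau_S] = {est:6.2f}   bound V(x0)+b = {V(x0) + b:6.2f}")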

There are other versions of Foster-Lyapunov criteria.



4.2.2 Criterion for Finite Expectations

Theorem 4.2.2 [Comparison Theorem] Let $V(\cdot) : X \to \mathbb{R}_+$ and $f, g : X \to \mathbb{R}_+$ be such that $E[f(x)] < \infty$ and $E[g(x)] < \infty$. Let $\{x_n\}$ be a Markov chain on $X$. If the following is satisfied:
$$\int_X P(x,dy)\,V(y) \le V(x) - f(x) + g(x), \qquad \forall x \in X,$$
then, for any stopping time $\tau$,
$$E\Big[\sum_{t=0}^{\tau - 1} f(x_t)\Big] \le V(x_0) + E\Big[\sum_{t=0}^{\tau - 1} g(x_t)\Big].$$
The proof follows from the construction of a supermartingale sequence and the monotone convergence theorem. The above also allows the computation of useful bounds. For example, if $g(x) = b\,1_{\{x \in A\}}$ and $\tau = \tau_A$ is the return time to $A$, then one obtains $E[\sum_{t=0}^{\tau - 1} f(x_t)] \le V(x_0) + b$. In view of the invariant distribution properties, if $f(x) \ge 1$, this provides a bound on $\int \pi(dx) f(x)$.

Theorem 4.2.3 [Foster-Lyapunov Moment Bound] [23] Let $S$ be a petite set, $b \in \mathbb{R}$, $V(\cdot) : X \to \mathbb{R}_+$ and $f(\cdot) : X \to [1, \infty)$. Let $\{x_n\}$ be a Markov chain on $X$. If the following is satisfied:
$$\int_X P(x,dy)\,V(y) \le V(x) - f(x) + b\,1_{\{x \in S\}}, \qquad \forall x \in X,$$
then
$$\lim_{T \to \infty} \frac{1}{T} \sum_{t=0}^{T-1} f(x_t) = \lim_{t \to \infty} E[f(x_t)] = \int \mu(dx) f(x) < \infty$$
almost surely, where $\mu(\cdot)$ is the invariant probability measure on $X$.

Recall that the invariant distribution $\pi$ satisfies, for any $D \in \mathcal{B}(X)$ (see Theorem 3.3.6),
$$\pi(D) = \int_A \pi(dx)\, E_x\Big[\sum_{t=0}^{\tau_A - 1} 1_{\{x_t \in D\}}\Big].$$
Thus, for a bounded measurable function $f$, by the property that simple functions are dense in the space of measurable, bounded functions under the supremum norm, it follows that
$$\int \pi(dx) f(x) = \int_A \pi(dx)\, E_x\Big[\sum_{t=0}^{\tau_A - 1} f(x_t)\Big].$$
Thus, if we can verify that $E_x[\sum_{t=0}^{\tau_A - 1} f(x_t)]$ is uniformly bounded over $x \in A$, we obtain a bound on $\pi(f) = \int f(x)\,\pi(dx)$.

Another approach would be the following. By Theorem 4.2.2, taking the deterministic times $T$ as stopping times,
$$\limsup_{T} \frac{1}{T} E_{x_0}\Big[\sum_{k=0}^{T} f(x_k)\Big] \le \limsup_{T} \frac{1}{T}\big(V(x_0) + bT\big) = b.$$
It follows by Fatou's Lemma and the monotone convergence theorem that
$$\limsup_{T} \frac{1}{T} E_{\pi}\Big[\sum_{k=0}^{T} f(x_k)\Big] \le \int \pi(dx)\, E_x\Big[\limsup_{T} \frac{1}{T} \sum_{k=0}^{T} f(x_k)\Big] \le b.$$
We have the following corollary to Theorem 4.2.3.



Corollary 4.2.1 In particular, under the conditions of Theorem 4.2.3,
$$\int \pi(dx) f(x) = \int_S \pi(dx)\, E_x\Big[\sum_{t=0}^{\tau_S - 1} f(x_t)\Big] \le \int_S \pi(dx)\,\big(V(x) + b\big),$$
which lets one obtain a bound on the expectation of $f(x)$. See Chapter 14 of [23] for further details.

We need to ensure that an invariant measure exists, however. This is why we require that $f : X \to [1, \infty)$, where $1$ can be replaced with any positive number.
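To make the bound $\pi(f) \le b$ concrete, the following Python sketch (an illustration added here, not part of the original notes; the AR(1) model and the particular choices of $V$, $f$, $S$ and $b$ are assumptions made for the example) verifies it by simulation for $x_{t+1} = a x_t + w_t$ with $|a| < 1$ and Gaussian noise.

import random

# Assumed illustrative model: x_{t+1} = a x_t + w_t, w_t ~ N(0, sigma^2), |a| < 1.
# With V(x) = x^2 and f(x) = 1 + ((1-a^2)/2) x^2, the drift of Theorem 4.2.3 holds with
# S = {|x| <= R}, R^2 = 2(sigma^2+1)/(1-a^2), and b = sigma^2 + 1, so pi(f) <= b and hence
# E_pi[x^2] <= 2 sigma^2 / (1-a^2).  The simulation compares time averages with these bounds.
a, sigma = 0.8, 1.0
b = sigma**2 + 1.0
f = lambda x: 1.0 + 0.5 * (1.0 - a * a) * x * x

rng = random.Random(1)
x, T = 0.0, 200000
sum_f, sum_x2 = 0.0, 0.0
for _ in range(T):
    sum_f += f(x)
    sum_x2 += x * x
    x = a * x + rng.gauss(0.0, sigma)

print("time average of f  :", sum_f / T, "  (bound b =", b, ")")
print("time average of x^2:", sum_x2 / T,
      "  (drift bound =", 2 * sigma**2 / (1 - a * a),
      ", exact stationary value =", sigma**2 / (1 - a * a), ")")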

4.2.3 Criterion for Recurrence

Theorem 4.2.4 (Foster-Lyapunov for Recurrence) Let $S$ be a compact set, $b < \infty$, and let $V(\cdot)$ be an inf-compact function on $X$, that is, $\{x : V(x) \le \alpha\}$ is compact for all $\alpha \in \mathbb{R}_+$. Let $\{x_n\}$ be an irreducible Markov chain on $X$. If the following is satisfied:
$$\int_X P(x,dy)\,V(y) \le V(x) + b\,1_{\{x \in S\}}, \qquad \forall x \in X, \qquad (4.6)$$
then, with $\tau_S = \min\{t > 0 : x_t \in S\}$, we have $P_x(\tau_S < \infty) = 1$ for all $x \in X$; that is, the Markov chain is Harris recurrent.

Proof: Let $\tau_S = \min\{t > 0 : x_t \in S\}$. Define two stopping times: $\tau_S$ and $\tau_{B_N}$, where $B_N = \{x : V(x) \ge N\}$. Note that the sequence defined by $M_t = V(x_t)$ (which behaves as a supermartingale for $t = 0, 1, \dots, \min(\tau_S, \tau_{B_N}) - 1$) is uniformly integrable until $\tau_{B_N}$, and a variation of the optional sampling theorem applies for stopping times which are not necessarily bounded by a given finite number with probability one. Note that, due to irreducibility, $\min(\tau_S, \tau_{B_N}) < \infty$ with probability $1$. Now, for $x \notin S \cup B_N$, since the minimum value of the Lyapunov function upon exiting into $B_N$ is $N$,
$$V(x) \ge E_x\big[V(x_{\min(\tau_S, \tau_{B_N})})\big] \ge P_x(\tau_{B_N} < \tau_S)\,N + P_x(\tau_{B_N} \ge \tau_S)\,M,$$
for some finite positive $M$. Hence,
$$P_x(\tau_{B_N} < \tau_S) \le V(x)/N.$$
We also have that $P(\min(\tau_S, \tau_{B_N}) = \infty) = 0$, since the chain is irreducible and will escape any compact set in finite time. As a consequence,
$$P_x(\tau_S = \infty) \le P_x(\tau_{B_N} < \tau_S) \le V(x)/N,$$
and taking the limit as $N \to \infty$, $P_x(\tau_S = \infty) = 0$. ⊓⊔

If $S$ is further petite, then once the petite set is visited, any other set with positive measure (under the irreducibility measure) is visited with probability $1$ infinitely often. ⋄

4.2.4 On small and petite sets

Petiteness may be difficult to verify directly. In the following, we present two conditions that may be used to establish petiteness.

By [23], p. 131: For a Markov chain with transition kernel $P$ and $K$ a probability measure on the natural numbers, if for every $E \in \mathcal{B}(X)$ there exists a lower semi-continuous function $N(\cdot, E)$ such that $\sum_{n=0}^{\infty} P^n(x, E) K(n) \ge N(x, E)$, for a sub-stochastic kernel $N(\cdot, \cdot)$, the chain is called a $T$-chain.

Theorem 4.2.5 [23] For a $T$-chain which is irreducible, every compact set is petite.



For a countable state space, under irreducibility, every finite set $S$ in (4.5) is petite.

The assumptions on petite sets and irreducibility can be relaxed for the existence of an invariant probability measure. Tweedie [31] considers the following. If $S$ is such that the uniform countable additivity condition
$$\lim_{n \to \infty} \sup_{x \in S} P(x, B_n) = 0 \qquad (4.7)$$
is satisfied for $B_n \downarrow \emptyset$, then (4.5) implies the existence of an invariant probability measure, and there exist at most finitely many invariant probability measures. By Proposition 5.5.5 (iii) of Meyn-Tweedie [23], under irreducibility the Harris recurrent component of the space can be expressed as a countable union $\cup_{n=1}^{\infty} C_n$ of petite sets $C_n$. By Lemma 4 of Tweedie (2001), under uniform countable additivity, any finite union $\cup_{i=1}^{M} C_i$ is uniformly accessible from $S$. Therefore, if the Markov chain is irreducible, condition (4.7) implies that the set $S$ is petite. This may be easier to verify for a large class of applications than establishing the $T$-chain property. Under further conditions (such as when $S$ is compact and the $V$ used in a drift criterion has compact level sets), one can take the $B_n$ to be subsets of a compact set, making the verification even simpler.

4.2.5 Criterion for Transience

Criteria for transience are somewhat more difficult to establish. One convenient way is to construct a stopping time sequence and show that the state does not come back to some set infinitely often. We state the following.

Theorem 4.2.6 ([23], [16]) Let $V : X \to \mathbb{R}_+$. If there exists a set $A$ such that $E[V(x_{t+1}) \,|\, x_t = x] \le V(x)$ for all $x \notin A$, and there exists $\bar{x} \notin A$ such that $V(\bar{x}) < \inf_{z \in A} V(z)$, then $\{x_t\}$ is not recurrent, in the sense that $P_{\bar{x}}(\tau_A < \infty) < 1$.

Proof: Let $x = \bar{x}$. The proof follows from observing that
$$V(x) \ge \int_y V(y)\, P(x, dy) \ge \Big(\inf_{z \in A} V(z)\Big) P(x, A) + \int_{y \notin A} V(y)\, P(x, dy) \ge \Big(\inf_{z \in A} V(z)\Big) P(x, A).$$
It thus follows that
$$P(\tau_A < 2) = P(x, A) \le \frac{V(x)}{\inf_{z \in A} V(z)}.$$
Likewise,
$$
V(\bar{x}) \ge \int_y V(y)\, P(\bar{x}, dy)
\ge \Big(\inf_{z \in A} V(z)\Big) P(\bar{x}, A) + \int_{y \notin A} \Big(\int_s V(s)\, P(y, ds)\Big) P(\bar{x}, dy)
$$
$$
\ge \Big(\inf_{z \in A} V(z)\Big) P(\bar{x}, A) + \int_{y \notin A} P(\bar{x}, dy)\Big(\Big(\inf_{s \in A} V(s)\Big) P(y, A) + \int_{s \notin A} V(s)\, P(y, ds)\Big)
$$
$$
\ge \Big(\inf_{z \in A} V(z)\Big) P(\bar{x}, A) + \int_{y \notin A} P(\bar{x}, dy)\Big(\inf_{s \in A} V(s)\Big) P(y, A)
= \Big(\inf_{z \in A} V(z)\Big)\Big(P(\bar{x}, A) + \int_{y \notin A} P(\bar{x}, dy)\, P(y, A)\Big). \qquad (4.8)
$$
Thus, observing that $P(\{\omega : \tau_A(\omega) < 3\}) = P(\bar{x}, A) + \int_{y \notin A} P(\bar{x}, dy)\, P(y, A)$, we obtain
$$P_{\bar{x}}(\tau_A < 3) \le \frac{V(\bar{x})}{\inf_{z \in A} V(z)}.$$
By induction, this holds for any $n$: $P_{\bar{x}}(\tau_A < n) \le \frac{V(\bar{x})}{\inf_{z \in A} V(z)} < 1$. Continuity of probability measures (defining $B_n = \{\omega : \tau_A < n\}$, observing $B_n \subset B_{n+1}$ and that $\lim_n P(\tau_A < n) = P(\cup_n B_n) = P(\tau_A < \infty) < 1$) now leads to transience.



Observe the difference between the inf-compactness condition leading to recurrence and the above condition, leading to non-recurrence. We finally note that a convenient way to verify instability or transience is to construct an appropriate martingale sequence.
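As a simple worked example (added here for illustration; it is not taken from the notes), consider the random walk on $\mathbb{Z}_+$ with $P(x, x+1) = p$ and $P(x, x-1) = 1 - p$ for $x \ge 1$, where $p > 1/2$, and take $A = \{0\}$. With $V(x) = \big(\frac{1-p}{p}\big)^x$ one checks that $E[V(x_{t+1}) \,|\, x_t = x] = p\,V(x+1) + (1-p)\,V(x-1) = V(x)$ for all $x \ge 1$, i.e., for all $x \notin A$, while $V(\bar{x}) = \big(\frac{1-p}{p}\big)^{\bar{x}} < 1 = \inf_{z \in A} V(z)$ for any $\bar{x} \ge 1$. Theorem 4.2.6 (and its proof) then gives $P_{\bar{x}}(\tau_A < \infty) \le \big(\frac{1-p}{p}\big)^{\bar{x}} < 1$, so the walk with positive drift is transient.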

4.2.6 State Dependent Drift Criteria: Deterministic and Random-Time

In many applications the controller acts on a system intermittently. For this case, we have the following results [33], which extend the deterministic state-dependent results presented in [23], [24]. Let $\tau_z$, $z \ge 0$, be a sequence of stopping times, measurable on a filtration, possibly generated by the state process.

Theorem 4.2.7 [33] Suppose that $x$ is a $\varphi$-irreducible Markov chain. Suppose moreover that there are functions $V : X \to (0, \infty)$, $\delta : X \to [1, \infty)$, $f : X \to [1, \infty)$, a petite set $C$ on which $V$ is bounded, and a constant $b \in \mathbb{R}$, such that the following hold:
$$E[V(x_{\tau_{z+1}}) \,|\, \mathcal{F}_{\tau_z}] \le V(x_{\tau_z}) - \delta(x_{\tau_z}) + b\,1_{\{x_{\tau_z} \in C\}},$$
$$E\Big[\sum_{k=\tau_z}^{\tau_{z+1} - 1} f(x_k) \,\Big|\, \mathcal{F}_{\tau_z}\Big] \le \delta(x_{\tau_z}), \qquad z \ge 0. \qquad (4.9)$$
Then the chain is positive Harris recurrent, and moreover $\pi(f) < \infty$, with $\pi$ being the invariant distribution. ⊓⊔

By taking $f(x) = 1$ for all $x \in X$, we obtain the following corollary to Theorem 4.2.7.

Corollary 4.2.2 [33] Suppose that $x$ is a $\varphi$-irreducible Markov chain. Suppose moreover that there are a function $V : X \to (0, \infty)$, a petite set $C$ on which $V$ is bounded, and a constant $b \in \mathbb{R}$, such that the following hold:
$$E[V(x_{\tau_{z+1}}) \,|\, \mathcal{F}_{\tau_z}] \le V(x_{\tau_z}) - 1 + b\,1_{\{x_{\tau_z} \in C\}},$$
$$\sup_{z \ge 0} E[\tau_{z+1} - \tau_z \,|\, \mathcal{F}_{\tau_z}] < \infty. \qquad (4.10)$$
Then the chain is positive Harris recurrent. ⊓⊔

More on invariant probability measures

Without the irreducibility condition, if the chain is weak Feller and (4.5) holds with $S$ compact, then there exists at least one invariant probability measure, as discussed in Section 3.4.

Theorem 4.2.8 [33] Suppose that $x$ is a Feller Markov chain, not necessarily $\varphi$-irreducible. If (4.9) holds with $C$ compact, then there exists at least one invariant probability measure. Moreover, there exists $c < \infty$ such that, under any invariant probability measure $\pi$,
$$E_{\pi}[f(x)] = \int_X \pi(dx) f(x) \le c. \qquad (4.11)$$

4.3 Convergence Rates to Equilibrium

In addition to obtaining bounds on the rate of convergence through Dobrushin's coefficient, one powerful approach is through Foster-Lyapunov drift conditions. Regularity and ergodicity are concepts closely related through the work of Meyn and Tweedie [25], [26] and Tuominen and Tweedie [30].



Definition 4.3.1 A set $A \in \mathcal{B}(X)$ is called $(f, r)$-regular if
$$\sup_{x \in A} E_x\Big[\sum_{k=0}^{\tau_B - 1} r(k) f(x_k)\Big] < \infty$$
for all $B \in \mathcal{B}^+(X)$. A finite measure $\nu$ on $\mathcal{B}(X)$ is called $(f, r)$-regular if
$$E_{\nu}\Big[\sum_{k=0}^{\tau_B - 1} r(k) f(x_k)\Big] < \infty$$
for all $B \in \mathcal{B}^+(X)$, and a point $x$ is called $(f, r)$-regular if the measure $\delta_x$ is $(f, r)$-regular.

This leads to a lemma relating regular distributions to regular atoms.

Lemma 4.3.1 If a Markov chain $\{x_t\}$ has an atom $\alpha \in \mathcal{B}^+(X)$ and an $(f, r)$-regular distribution $\lambda$, then $\alpha$ is an $(f, r)$-regular set.

Definition 4.3.2 ($f$-norm) For a function $f : X \to [1, \infty)$, the $f$-norm of a measure $\mu$ defined on $(X, \mathcal{B}(X))$ is given by
$$\|\mu\|_f = \sup_{g \le f} \Big| \int \mu(dx)\, g(x) \Big|.$$
The total variation norm is the $f$-norm with $f \equiv 1$, denoted by $\|\cdot\|_{TV}$.

Coupling inequality and moments of return times to a small set. The main idea behind the coupling inequality is to bound the total variation distance between the distributions of two random variables by the probability that they differ; the more coupled the two random variables are, the more their distributions match up. Let $X$ and $Y$ be two jointly distributed random variables on a common space with distributions $\mu_x$, $\mu_y$ respectively. Then we can bound the total variation between the distributions by the probability that the two variables are not equal:
$$
\|\mu_x - \mu_y\|_{TV} = \sup_A |\mu_x(A) - \mu_y(A)|
= \sup_A |P(X \in A, X = Y) + P(X \in A, X \ne Y) - P(Y \in A, X = Y) - P(Y \in A, X \ne Y)|
$$
$$
\le \sup_A |P(X \in A, X \ne Y) - P(Y \in A, X \ne Y)| \le P(X \ne Y).
$$
The coupling inequality is useful in discussions of ergodicity when used in conjunction with parallel Markov chains, as in Theorem 4.1 of [28]. One creates two Markov chains having the same one-step transition probabilities. Let $\{x_n\}$ and $\{x'_n\}$ be two Markov chains with transition kernel $P(x, \cdot)$, and let $C$ be an $(m, \delta, \nu)$-small set. We use the coupling construction provided by Roberts and Rosenthal. Let $x_0 = x$ and $x'_0 \sim \pi(\cdot)$, where $\pi(\cdot)$ is the invariant distribution of both Markov chains.

1. If $x_n = x'_n$, then $x_{n+1} = x'_{n+1} \sim P(x_n, \cdot)$.

2. Else, if $(x_n, x'_n) \in C \times C$, then with probability $\delta$, $x_{n+m} = x'_{n+m} \sim \nu(\cdot)$; with probability $1 - \delta$, independently
$$x_{n+m} \sim \tfrac{1}{1-\delta}\big(P^m(x_n, \cdot) - \delta\nu(\cdot)\big), \qquad x'_{n+m} \sim \tfrac{1}{1-\delta}\big(P^m(x'_n, \cdot) - \delta\nu(\cdot)\big).$$

3. Else, independently $x_{n+m} \sim P^m(x_n, \cdot)$ and $x'_{n+m} \sim P^m(x'_n, \cdot)$.

The in-between states $x_{n+1}, \dots, x_{n+m-1}, x'_{n+1}, \dots, x'_{n+m-1}$ are distributed conditionally given $x_n, x_{n+m}, x'_n, x'_{n+m}$. By the coupling inequality and the previous discussion with Nummelin's splitting technique, we have $\|P^n(x, \cdot) - \pi(\cdot)\|_{TV} \le P(x_n \ne x'_n)$.
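The following Python sketch (added for illustration; the two-state kernel and the minorization constants are assumptions made here, with $m = 1$ and the small set $C$ taken to be the whole space) simulates this coupling and compares the estimated $P(x_n \ne x'_n)$ with the exact total variation distance computed from matrix powers.

import random

# Assumed illustrative two-state chain; the whole space is (1, delta, nu)-small with
# delta * nu(j) = min_x P(x, j).  The coupling bound gives ||P^n(x,.) - pi(.)||_TV <= P(x_n != x'_n).
P = [[0.9, 0.1],
     [0.2, 0.8]]
pi = [2.0 / 3.0, 1.0 / 3.0]              # invariant distribution of P
mins = [min(P[0][j], P[1][j]) for j in (0, 1)]
delta = sum(mins)                         # = 0.3
nu = [m / delta for m in mins]            # minorizing measure

def sample(dist, rng):
    return 0 if rng.random() < dist[0] else 1

def residual(x):
    # (P(x,.) - delta*nu(.)) / (1 - delta)
    return [(P[x][j] - delta * nu[j]) / (1.0 - delta) for j in (0, 1)]

def coupled_step(x, xp, rng):
    if x == xp:                           # step 1: move together once coupled
        y = sample(P[x], rng)
        return y, y
    if rng.random() < delta:              # step 2: coalesce via nu with probability delta
        y = sample(nu, rng)
        return y, y
    return sample(residual(x), rng), sample(residual(xp), rng)

rng = random.Random(2)
n_steps, runs = 10, 50000
mismatch = [0] * (n_steps + 1)
for _ in range(runs):
    x, xp = 0, sample(pi, rng)            # x_0 = 0, x'_0 ~ pi
    for n in range(1, n_steps + 1):
        x, xp = coupled_step(x, xp, rng)
        mismatch[n] += (x != xp)

dist = [1.0, 0.0]                          # exact ||P^n(0,.) - pi||_TV for comparison
for n in range(1, n_steps + 1):
    dist = [dist[0] * P[0][j] + dist[1] * P[1][j] for j in (0, 1)]
    tv = 0.5 * (abs(dist[0] - pi[0]) + abs(dist[1] - pi[1]))
    print(f"n={n:2d}  P(x_n != x'_n) ~ {mismatch[n] / runs:.4f}   exact TV = {tv:.4f}")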



4.3.1 Lyapunov conditions: Geometric ergodicity

Roberts and Rosenthal use this approach to establish geometric ergodicity. An irreducible Markov chain satisfies the univariate drift condition if there are constants $\lambda \in (0, 1)$ and $b < \infty$, along with a function $V : X \to [1, \infty)$ and a small set $C$, such that
$$PV \le \lambda V + b\,1_C. \qquad (4.12)$$
One can now define a bivariate drift condition for two independent copies of a Markov chain with a small set $C$. This condition requires that there exist a function $h : X \times X \to [1, \infty)$ and $\alpha > 1$ such that
$$\bar{P}h(x, y) \le h(x, y)/\alpha, \qquad (x, y) \notin C \times C,$$
where $\bar{P}h(x, y) = \int\int h(z, w)\, P(x, dz)\, P(y, dw)$ is the transition operator of the two independent copies. If the univariate drift condition (4.12) holds and $d := \inf_{x \notin C} V(x) > \frac{b}{1-\lambda} - 1$, then the bivariate drift condition is satisfied for $h(x, y) = \frac{1}{2}(V(x) + V(y))$ and $\alpha^{-1} = \lambda + b/(d + 1) < 1$.

The bivariate condition ensures that the process $(x, x')$ hits the set $C \times C$, from which the two chains may be coupled. The coupling inequality then leads to the desired conclusion.

Theorem 4.3.3 (Theorem 9 of [28]) Suppose $\{x_t\}$ is an aperiodic, irreducible Markov chain with invariant distribution $\pi(\cdot)$. Suppose $C$ is a $(1, \epsilon, \nu)$-small set and $V : X \to [1, \infty)$ satisfies the univariate drift condition (4.12) with constants $\lambda \in (0, 1)$ and $b < \infty$, with $V(x) < \infty$ for some $x \in X$. Then $\{x_t\}$ is geometrically ergodic.

As a side remark, we note the following observation. If the univariate drift condition holds, then the sequence of random variables $M_n = \lambda^{-n} V(x_n) - \sum_{k=0}^{n-1} b\,1_C(x_k)$ is a supermartingale, and thus we have the inequality
$$E_{x_0}[\lambda^{-n} V(x_n)] \le V(x_0) + E_{x_0}\Big[\sum_{k=0}^{n-1} b\,1_C(x_k)\Big] \qquad (4.13)$$
for all $n$ and $x_0 \in X$, and with Theorem 3.3.4 we also have
$$E_x[\lambda^{-\tau_B}] \le V(x) + c(B)$$
for all $B \in \mathcal{B}^+(X)$.
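As a concrete check of the univariate drift condition (4.12), the following short Python sketch (an added illustration; the AR(1) model and the particular choices of $V$, $\lambda$, $b$ and $C$ are assumptions made here) verifies the inequality on a grid for $x_{t+1} = a x_t + w_t$ with $|a| < 1$ and Gaussian noise.

import math

# Assumed illustrative model: x_{t+1} = a x_t + w_t, w_t ~ N(0, sigma^2), |a| < 1.
# With V(x) = 1 + x^2 one has PV(x) = 1 + a^2 x^2 + sigma^2 exactly, and for any lam in (a^2, 1),
# b = 1 - lam + sigma^2 and C = {|x| <= R}, R^2 = (1 - lam + sigma^2)/(lam - a^2),
# the univariate drift condition PV <= lam V + b 1_C holds.
a, sigma = 0.8, 1.0
lam = 0.5 * (1.0 + a * a)                       # any value in (a^2, 1)
b = 1.0 - lam + sigma**2
R = math.sqrt((1.0 - lam + sigma**2) / (lam - a * a))

V = lambda x: 1.0 + x * x
PV = lambda x: 1.0 + a * a * x * x + sigma**2   # exact, since E[w] = 0, E[w^2] = sigma^2

worst = max(PV(x) - (lam * V(x) + (b if abs(x) <= R else 0.0))
            for x in [i * 0.01 for i in range(-2000, 2001)])
print(f"lambda={lam:.3f}, b={b:.3f}, R={R:.3f}, max violation on grid = {worst:.3e} (should be <= 0)")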

4.3.2 Subgeometric ergodicity

Here, we review the class of subgeometric rate functions (see Section 4 of [16], Section 5 of [12], [23], [13], [30]). Let $\Lambda_0$ be the family of functions $r : \mathbb{N} \to \mathbb{R}_{>0}$ such that $r$ is non-decreasing, $r(1) \ge 2$, and
$$\frac{\log r(n)}{n} \downarrow 0 \quad \text{as } n \to \infty.$$



The second condition implies that for all $r \in \Lambda_0$, if $n > m > 0$ then
$$n \log r(n + m) \le n \log r(n) + m \log r(n) \le n \log r(n) + n \log r(m),$$
so that
$$r(m + n) \le r(m)\, r(n) \quad \text{for all } m, n \in \mathbb{N}. \qquad (4.14)$$
The class of subgeometric rate functions $\Lambda$ defined in [30] is the class of sequences $r$ for which there exists a sequence $r_0 \in \Lambda_0$ such that
$$0 < \liminf_{n \to \infty} \frac{r(n)}{r_0(n)} \le \limsup_{n \to \infty} \frac{r(n)}{r_0(n)} < \infty.$$

The main theorem we cite on subgeometric rates of convergence is due to Tuominen and Tweedie [30].

Theorem 4.3.4 (Theorem 2.1 of [30]) Suppose that $\{x_t\}_{t \in \mathbb{N}}$ is an irreducible and aperiodic Markov chain on state space $X$ with stationary transition probabilities given by $P$. Let $f : X \to [1, \infty)$ and $r \in \Lambda$ be given. The following are equivalent:

(i) there exists a petite set $C \in \mathcal{B}(X)$ such that
$$\sup_{x \in C} E_x\Big[\sum_{k=0}^{\tau_C - 1} r(k) f(x_k)\Big] < \infty;$$

(ii) there exist a sequence $(V_n)$ of functions $V_n : X \to [0, \infty]$, a petite set $C \in \mathcal{B}(X)$ and $b \in \mathbb{R}_+$ such that $V_0$ is bounded on $C$,
$$V_0(x) = \infty \Rightarrow V_1(x) = \infty,$$
and
$$PV_{n+1} \le V_n - r(n) f + b\, r(n)\, 1_C, \qquad n \in \mathbb{N};$$

(iii) there exists an $(f, r)$-regular set $A \in \mathcal{B}^+(X)$;

(iv) there exists a full absorbing set $S$ which can be covered by a countable number of $(f, r)$-regular sets.

Theorem 4.3.5 [30] If a Markov chain $\{x_t\}$ satisfies the conditions of Theorem 4.3.4 for a pair $(f, r)$, then $r(n)\,\|P^n(x_0, \cdot) - \pi(\cdot)\|_f \to 0$ as $n \to \infty$.

Thus, we observe that hitting times to a small set are a very important quantity in characterizing not only the existence of an invariant probability measure, but also how fast a Markov chain converges to equilibrium.

4.4 Conclusion

This concludes our discussion of controlled Markov chains via martingale methods. We will revisit one more application of martingales while discussing the convex analytic approach to controlled Markov problems.

4.5 Exercises

Exercise 4.5.1 Let $\mathcal{G} = \{\Omega, \emptyset\}$. Show that $E[X|\mathcal{G}] = E[X]$, and that if $\mathcal{G} = \mathcal{F}$, then $E[X|\mathcal{F}] = X$.

Exercise 4.5.2 Let $X$ be a random variable such that $P(|X| < K) = 1$ for some $K \in \mathbb{R}$; that is, $X$ takes values in a compact set. Let $Y_n = X + W_n$ for $n \in \mathbb{Z}_+$, where $W_n$ is an independent noise variable for every $n$. Let $\mathcal{F}_n$ be the $\sigma$-field generated by $Y_0, Y_1, \dots, Y_n$. Does $\lim_{n \to \infty} E[X|\mathcal{F}_n]$ exist almost surely? Prove your statement. Can you replace the boundedness assumption on $X$ with $E[|X|] < \infty$?



Exercise 4.5.3 Consider the following two-server system:
$$x^1_{t+1} = \max(x^1_t + A^1_t - u^1_t, 0),$$
$$x^2_{t+1} = \max\big(x^2_t + A^2_t + u^1_t\, 1_{(u^1_t \le x^1_t + A^1_t)} - u^2_t, 0\big), \qquad (4.15)$$
where $1_{(\cdot)}$ denotes the indicator function and $A^1_t, A^2_t$ are independent and identically distributed (i.i.d.) random variables with geometric distributions; that is, for $i = 1, 2$,
$$P(A^i_t = k) = p_i (1 - p_i)^k, \qquad k \in \{0, 1, 2, \dots\},$$
for some scalars $p_1, p_2$ such that $E[A^1_t] = 1.5$ and $E[A^2_t] = 1$.

a) Suppose the control actions $u^1_t, u^2_t$ are such that $u^1_t + u^2_t \le 5$ for all $t \in \mathbb{N}_+$ and $u^1_t, u^2_t \in \mathbb{R}_+$. At any given time $t$, the controller has to decide on $u^1_t$ and $u^2_t$ knowing $\{x^1_s, x^2_s, s \le t\}$ but not knowing $A^1_t, A^2_t$. Is this server system stochastically stabilizable by some policy; that is, does there exist an invariant distribution under some control policy? If your answer is positive, provide a control policy and show that there exists a unique invariant distribution. If your answer is negative, precisely state why.

b) Repeat part a) where now we restrict $u^1_t, u^2_t \in \mathbb{Z}_+$, that is, the actions are integer valued.

Exercise 4.5.4 a) Consider a controlled Markov chain with the following dynamics:
$$x_{t+1} = a x_t + b u_t + w_t,$$
where $w_t$ is a zero-mean Gaussian noise with finite variance and $a, b \in \mathbb{R}$ are the system dynamics coefficients. One admissible controller policy (that is, the policy at time $t$ is measurable with respect to $\sigma(x_0, x_1, \dots, x_t)$ and is a mapping to $\mathbb{R}$) is the following:
$$u_t = -\frac{a + 0.5}{b}\, x_t.$$
Show that $\{x_t\}$, under this policy, has a unique invariant distribution.

b) Consider a similar setup to the one earlier, with $b = 1$:
$$x_{t+1} = a x_t + u_t + w_t,$$
where $w_t$ is a zero-mean Gaussian noise with finite variance, and $a \in \mathbb{R}$ is a known number. This time, suppose we would like to find a control policy such that
$$\lim_{t \to \infty} E[x^2_t] < \infty.$$
Further, suppose we restrict the set of control policies to be linear and time-invariant, that is, of the form $u(x_t) = k x_t$ for some $k \in \mathbb{R}$. Find the set of all $k$ values for which the state has a finite second moment, that is, find
$$\{k : k \in \mathbb{R},\ \lim_{t \to \infty} E[x^2_t] < \infty\},$$
as a function of $a$.

Hint: Use the Foster-Lyapunov criteria.

Exercise 4.5.5 Prove Birkhoff's Ergodic Theorem for a countable state space; that is, the result that for a Markov chain $\{x_t\}$ living in a countable space $X$ which has a unique invariant distribution $\mu$, the following holds almost surely for all bounded functions $f : X \to \mathbb{R}$:
$$\lim_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} f(x_t) = \sum_i f(i)\, \mu(i).$$

Hint: You may proceed as follows. Define the occupation measures, for all $T$ and $A \in \mathcal{B}(X)$:
$$v_T(A) = \frac{1}{T} \sum_{t=0}^{T-1} 1_{(x_t \in A)}, \qquad \forall A \in \sigma(X).$$
Now, define
$$F_t(A) = \Big(\sum_{s=1}^{t} 1_{(x_s \in A)} - t \sum_{X} P^{\pi}(A|x)\, v_t(x)\Big) = \Big(\sum_{s=1}^{t} 1_{(x_s \in A)} - \sum_{s=0}^{t-1} \sum_{X} P^{\pi}(A|x)\, 1_{(x_s = x)}\Big) \qquad (4.16)$$
and verify that, for $t \ge 2$,
$$
E[F_t(A) \,|\, \mathcal{F}_{t-1}]
= E\Big[\Big(\sum_{s=1}^{t} 1_{(x_s \in A)} - \sum_{s=0}^{t-1} \sum_{X} P^{\pi}(A|x)\, 1_{(x_s = x)}\Big) \,\Big|\, \mathcal{F}_{t-1}\Big]
$$
$$
= E\Big[\Big(1_{(x_t \in A)} - \sum_{X} P^{\pi}(A|x)\, 1_{(x_{t-1} = x)}\Big) \,\Big|\, \mathcal{F}_{t-1}\Big]
+ \Big(\sum_{s=1}^{t-1} 1_{(x_s \in A)} - \sum_{s=0}^{t-2} \sum_{X} P^{\pi}(A|x)\, 1_{(x_s = x)}\Big)
$$
$$
= 0 + \Big(\sum_{s=1}^{t-1} 1_{(x_s \in A)} - \sum_{s=0}^{t-2} \sum_{X} P^{\pi}(A|x)\, 1_{(x_s = x)}\Big) \qquad (4.17)
$$
$$
= F_{t-1}(A), \qquad (4.18)
$$
where the last equality follows from the fact that $E[1_{(x_t \in A)} \,|\, \mathcal{F}_{t-1}] = P(x_t \in A \,|\, \mathcal{F}_{t-1})$. Furthermore,
$$|F_t(A) - F_{t-1}(A)| \le 1.$$
Thus we have a martingale sequence. We will invoke a martingale convergence theorem which is applicable for martingales with bounded increments. By a version of the martingale stability theorem, it follows that
$$\lim_{t \to \infty} \frac{1}{t} F_t(A) = 0.$$
You need to now complete the remaining steps. You can use the Azuma-Hoeffding inequality and the Borel-Cantelli Lemma to complete the steps, or you may use any other method you wish.

Exercise 4.5.6 Consider a queuing process with i.i.d. Poisson arrivals and departures, with arrival mean $\lambda$ and service mean $\mu$, and suppose the process is such that when a customer leaves the queue, with probability $p$ (independent of time) it comes back to the queue. That is, the dynamics of the system satisfy
$$L_{t+1} = \max(L_t + A_t - N_t + p_t N_t, 0), \qquad t \in \mathbb{N},$$
where $E[A_t] = \lambda$, $E[N_t] = \mu$ and $E[p_t] = p$. For what values of $\mu, \lambda$ is such a system stochastically stable? Prove your statement.



[Figure 4.1: A router splits incoming traffic between Station 1 and Station 2 according to the routing decision $u$.]

Exercise 4.5.7 Consider a two-station network, where a router routes the incoming traffic, as depicted in Figure 4.1. Let $L^1_t, L^2_t$ denote the number of customers in stations 1 and 2 at time $t$. Let the dynamics be given by the following:
$$L^1_{t+1} = \max(L^1_t + u_t A_t - N^1_t, 0), \qquad t \in \mathbb{N},$$
$$L^2_{t+1} = \max(L^2_t + (1 - u_t) A_t - N^2_t, 0), \qquad t \in \mathbb{N}.$$
Customers arrive according to an independent Bernoulli process $A_t$ with mean $\lambda$; that is, $P(A_t = 1) = \lambda$ and $P(A_t = 0) = 1 - \lambda$. Here $u_t \in [0, 1]$ is the router action. Station 1 has a Bernoulli service process $N^1_t$ with mean $n_1$, and Station 2 one with mean $n_2$.

Suppose that the router decides on $u_t$ with the following algorithm: if a customer arrives, the router simply sends the incoming customer to the shortest queue. Find necessary and sufficient conditions (on $\lambda, n_1, n_2$) for this algorithm to lead to a stochastically stable system which satisfies $\lim_{t \to \infty} E[L^1_t + L^2_t] < \infty$.


Chapter 5

Dynamic Programming

In this chapter, we introduce the notion of dynamic programming for stochastic systems. Let us first recall a few notions discussed earlier.

Recall that a Markov control model is a five-tuple
$$(X, U, \{U(x), x \in X\}, Q, c_t)$$
such that $X$ is a (complete and separable) metric space, $U$ is the control action space, $U_t(x)$ is the set of admissible control actions when the state is $x$ at time $t$, so that
$$K_t = \{(x, u) : x \in X,\ u \in U_t(x)\} \subset X \times U$$
is the set of feasible state-control pairs, $Q$ is a stochastic kernel on $X$ given $K_t$, and $c_t : K_t \to \mathbb{R}$ is the cost function. In case $U_t(x)$ and $c_t(\cdot)$ do not depend on $t$, we drop the time index; the Markov model is then called a stationary Markov model.

If $\Pi = \{\Pi_t\}$ is a randomized Markov policy, we define the cost function
$$c_t(x_t, \Pi) = \int_{U(x_t)} c(x_t, u)\, \Pi_t(du|x_t)$$
and the transition function
$$Q_t(dz|x_t, \Pi) = \int_{U(x_t)} Q(dz|x_t, u)\, \Pi_t(du|x_t).$$
Let, as in Chapter 2, $\Pi_A$ denote the set of all admissible policies. Recall that $\Pi = \{\Pi_t, 0 \le t \le N-1\}$ is a policy. Consider the following expected cost:
$$J(\Pi, x) = E^{\Pi}_x\Big[\sum_{t=0}^{N-1} c(x_t, u_t) + c_N(x_N)\Big],$$
where $c_N(\cdot)$ is the terminal cost function. Define
$$J^*(x) := \inf_{\Pi \in \Pi_A} J(\Pi, x).$$
As earlier, let $h_t = \{x_{[0,t]}, u_{[0,t-1]}\}$ denote the history or the information process. The goal is to find an admissible policy, if one exists, such that $J^*(x)$ is attained. We note that the infimum is not always attained.




Before we proceed further, we note that we could express the cost as
$$
J(\Pi, x) = E^{\Pi}_x\Big[ c(x_0, u_0) + E^{\Pi}_{x_1}\big[ c(x_1, u_1) + E^{\Pi}_{x_2}\big[ c(x_2, u_2) + \dots + E^{\Pi}_{x_{N-1}}[ c(x_{N-1}, u_{N-1}) + c_N(x_N) \,|\, h_{N-1}] \,\big|\, h_{N-2}\big] \dots \,\big|\, h_1\big] \,\Big|\, h_0\Big]. \qquad (5.1)
$$
Thus,
$$
J(\Pi, x) = E^{\Pi}_x\Big[\sum_{t=0}^{N-1} c(x_t, u_t) + c_N(x_N)\Big]
= E^{\Pi}_x\Big[\sum_{t=0}^{k-1} c(x_t, u_t) + E^{\Pi}_{x_k}\Big[ c(x_k, u_k) + \sum_{t=k+1}^{N-1} c(x_t, u_t) + c_N(x_N) \,\Big|\, h_k\Big] \,\Big|\, h_0\Big]. \qquad (5.2)
$$
The above follows from the properties of conditioning discussed in Chapters 1 and 4. Also, observe that for all $t$ and measurable functions $g_t$,
$$E[g_t(x_{t+1}) \,|\, h_t, u_t] = E[g_t(x_{t+1}) \,|\, x_t, u_t].$$
This follows from the controlled Markov property.

5.1 Bellman's Principle of Optimality

Let $\{J_t(x_t)\}$ be a sequence of functions on $X$ defined by
$$J_N(x) = c_N(x)$$
and, for $0 \le t \le N-1$,
$$J_t(x) = \min_{u \in U_t(x)} \Big\{c_t(x, u) + \int_X J_{t+1}(z)\, Q_t(dz|x, u)\Big\}.$$
Suppose these functions admit a minimum and are measurable; we will discuss a number of sufficient conditions for when this is possible. Let there be minimizing selectors which are deterministic and denoted by $\{f_t(x)\}$, so that the minimum cost equals
$$J_t(x) = c_t(x, f_t(x)) + \int_X J_{t+1}(z)\, Q_t(dz|x, f_t(x)).$$
Then we have the following.



Theorem 5.1.1 The policy $\Pi^* = \{f_0, f_1, \dots, f_{N-1}\}$ is optimal, and the optimal cost is equal to
$$J^*(x) = J_0(x).$$
Proof: We compare the cost generated by the above policy with the cost obtained by any other policy, via backwards induction. Consider the time stage $t = N-1$. For this stage, the optimal cost-to-go is
$$J_{N-1}(x) = \min_u \Big\{c_{N-1}(x_{N-1}, u_{N-1}) + \int_X J_N(z)\, Q_N(dz|x_{N-1}, u_{N-1})\Big\}.$$
Suppose there is a cost $C^*_{N-1}(x_{N-1})$ achieved by some policy $\eta_{N-1}(\cdot)$. Then
$$
C^*_{N-1}(x_{N-1}) = c_{N-1}(x_{N-1}, \eta_{N-1}(x_{N-1})) + \int_X J_N(z)\, Q_N(dz|x_{N-1}, \eta_{N-1}(x_{N-1}))
$$
$$
\ge \min_u \Big\{c_{N-1}(x_{N-1}, u_{N-1}) + \int_X J_N(z)\, Q_N(dz|x_{N-1}, u_{N-1})\Big\} = J_{N-1}(x_{N-1}). \qquad (5.3)
$$
Now, we move to time stage $N-2$. In this case, the cost to be minimized satisfies
$$
c_{N-2}(x_{N-2}, u_{N-2}) + \int_X C^*_{N-1}(z)\, Q_{N-1}(dz|x_{N-2}, u_{N-2})
\ge \min_u \Big\{c_{N-2}(x_{N-2}, u_{N-2}) + \int_X J_{N-1}(z)\, Q_{N-1}(dz|x_{N-2}, u_{N-2})\Big\} =: J_{N-2}(x_{N-2}). \qquad (5.4)
$$
Thus, the cost achieved by any policy remains greater than or equal to the one achieved by the minimizing policy. We can show by induction that the recursion holds for all $0 \le t \le N-2$. ⊓⊔
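The backward recursion of Theorem 5.1.1 is easy to carry out numerically on a finite model. The following Python sketch (added for illustration; the small finite model, its kernel $Q$ and its costs are assumptions made up here) implements $J_N = c_N$ and $J_t(x) = \min_u \{c(x,u) + \sum_z Q(z|x,u) J_{t+1}(z)\}$ together with the minimizing selectors.

# Assumed illustrative finite-horizon model: 3 states, 2 actions, horizon 5.
N_STATES, N_ACTIONS, HORIZON = 3, 2, 5

# Q[u][x][z] = Q(z | x, u): one transition kernel per action.
Q = [
    [[0.7, 0.2, 0.1], [0.3, 0.5, 0.2], [0.1, 0.3, 0.6]],   # action 0
    [[0.4, 0.4, 0.2], [0.1, 0.8, 0.1], [0.2, 0.2, 0.6]],   # action 1
]
c = lambda x, u: float(x) + 0.5 * u      # per-stage cost
c_N = lambda x: 2.0 * x                  # terminal cost

J = [c_N(x) for x in range(N_STATES)]    # J_N
policy = []                              # policy[t][x] = minimizing u at stage t
for t in reversed(range(HORIZON)):
    J_new, f_t = [], []
    for x in range(N_STATES):
        values = [c(x, u) + sum(Q[u][x][z] * J[z] for z in range(N_STATES))
                  for u in range(N_ACTIONS)]
        u_star = min(range(N_ACTIONS), key=lambda u: values[u])
        J_new.append(values[u_star])
        f_t.append(u_star)
    J, policy = J_new, [f_t] + policy

print("J_0 =", [round(v, 3) for v in J])
print("optimal first-stage selector f_0 =", policy[0])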

5.2 Discussion: Why are Markov Policies Important? Why are Optimal Policies Deterministic for such Settings?

We observe that when there is an optimal solution, the optimal solution is Markov. Let us provide a more insightful way to derive this property. In the following, we will follow David Blackwell's [5] and Hans Witsenhausen's ideas.

Theorem 5.2.1 (Blackwell's Principle of Irrelevant Information) Let $X, Y, U$ be complete, separable metric spaces, let $P$ be a probability measure on $\mathcal{B}(X \times Y)$, and let $c : X \times U \to \mathbb{R}$ be a Borel measurable cost function. Then, for any Borel measurable function $\gamma : X \times Y \to U$, there exists another Borel measurable function $\gamma^* : X \to U$ such that
$$\int_X c(x, \gamma^*(x))\, P(dx) \le \int_{X \times Y} c(x, \gamma(x, y))\, P(dx, dy).$$
Furthermore, policies based only on $x$ are optimal.

Proof: We will construct a $\gamma^*$ given $\gamma$. Let us write
$$\int_{X \times Y} c(x, \gamma(x, y))\, P(dx, dy) = \int_X \Big( \int_U c(x, u)\, P^{\gamma}(du|x) \Big) P(dx),$$



where $P^{\gamma}(u \in D|x) = \int 1_{\{\gamma(x,y) \in D\}}\, P(dy|x)$. Consider $h_{\gamma}(x) := \int_U c(x, u)\, P^{\gamma}(du|x)$, and let
$$D = \{(x, u) \in X \times U : c(x, u) \le h_{\gamma}(x)\}.$$
$D$ is a Borel set since $c(x, u) - h_{\gamma}(x)$ is a Borel measurable function.

a) Suppose the space $U$ is countable. In this case, let us enumerate the elements of $U$ as $\{u_k, k = 1, 2, \dots\}$, define
$$D_i = \{x \in X : (x, u_i) \in D\}, \qquad i = 1, 2, \dots,$$
and define
$$\gamma^*(x) = u_k \quad \text{if } x \in D_k \setminus (\cup_{i=1}^{k-1} D_i), \qquad k = 1, 2, \dots.$$
Such a function is measurable by construction.

b) Suppose now that the space $U$ is separable, and assume that $c(x, u)$ is continuous in $u$. Suppose that the space $U$ is partitioned into a countable collection of disjoint sets $K_n = \{K^1_n, K^2_n, \dots, K^m_n, \dots\}$, with elements $u^k_n \in K^k_n$ selected from the sets. In this case, we could define, by an enumeration of the control actions in $\{K_n\}$,
$$D_k = \{x \in X : (x, u^k_n) \in D\}$$
and define
$$\gamma_n(x) = u^k_n \quad \text{if } x \in D_k \setminus (\cup_{i=1}^{k-1} D_i).$$
We can make $K_n$ finer by replacing it with $K_{n+1}$ such that $K^i_{n+1} \subset K^m_n$ for some $m$ (for example, if $U = \mathbb{R}$, let $K_n = \{[\frac{1}{2^n}\lfloor u 2^n \rfloor, \frac{1}{2^n}\lceil u 2^n \rceil),\ u \in \mathbb{R}\}$). Each of the $\gamma_n$ functions is measurable. Furthermore, by a careful enumeration, since the image partitions become finer, $\lim_{n \to \infty} \gamma_n(x)$ exists pointwise. Being a pointwise limit of measurable functions, the limit is also measurable; define it as $\gamma^*$. By continuity, it then follows that $\{(x, \gamma^*(x))\} \in D$, since $\lim_{n \to \infty} c(x, \gamma_n(x)) = c(x, \gamma^*(x)) \le h_{\gamma}(x)$. This completes the proof for the case when $c$ is continuous in $u$.

c) For the general case, when $c$ is not necessarily continuous, we define $D_x = \{u : (x, u) \in D\}$. It follows that $P^{\gamma}(D_x|x) > 0$, for otherwise we would arrive at a contradiction. It then follows from Theorem 2 of [6] that we can construct a measurable function $\gamma^*$ whose graph $\{(x, \gamma^*(x))\}$ lies in $D$. We do not provide the detailed proof for this case. ⊓⊔
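A tiny numerical illustration of the principle (added here; the finite spaces, the joint distribution $P$ and the cost table are assumptions made up for the example, not part of the notes): since the cost depends only on $(x, u)$, a selector $\gamma^*(x)$ that minimizes $c(x, \cdot)$ pointwise does at least as well as any $\gamma(x, y)$, which the brute-force comparison below confirms.

import itertools, random

# Assumed toy setting: finite X, Y, U, random joint P(x, y) and cost c(x, u).
X, Y, U = range(3), range(2), range(3)
rng = random.Random(3)
P = {(x, y): rng.random() for x in X for y in Y}
Z = sum(P.values()); P = {k: v / Z for k, v in P.items()}          # normalize joint distribution
c = {(x, u): rng.uniform(0.0, 10.0) for x in X for u in U}          # cost depends on (x, u) only

gamma_star = {x: min(U, key=lambda u: c[(x, u)]) for x in X}        # selector using x alone

def expected_cost(policy):   # policy maps (x, y) -> u
    return sum(P[(x, y)] * c[(x, policy(x, y))] for x in X for y in Y)

best_xy = float("inf")
for assignment in itertools.product(U, repeat=len(X) * len(Y)):     # all maps gamma: X x Y -> U
    gamma = dict(zip(itertools.product(X, Y), assignment))
    best_xy = min(best_xy, expected_cost(lambda x, y: gamma[(x, y)]))

print("best cost over all gamma(x, y):", round(best_xy, 4))
print("cost of gamma*(x)             :", round(expected_cost(lambda x, y: gamma_star[x]), 4))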

We also have the following.

Theorem 5.2.2 Let $\{(x_t, u_t)\}$ be a controlled Markov chain. Consider the minimization of $E[\sum_{t=0}^{T-1} c(x_t, u_t)]$ over all control policies which are sequences of causally measurable functions of $\{x_s, s \le t\}$, for all $t \ge 0$. Suppose further that the cost function is measurable and the transition kernel is such that $\int \Phi(dx_{t+1}|x_t, u_t)\, f(x_{t+1})$ is measurable on $U$ for every measurable and bounded function $f$ on $X$. If an optimal control policy exists, there is no loss in restricting policies to be Markov, that is, policies which only use the current state $x_t$ and the time information $t$.

Proof: Let us define a history process $\{h_t\}$, where $h_t = \{x_0, \dots, x_t\}$ for $t \ge 0$. Let there be an optimal policy given by $\{\eta_t, t \ge 0\}$ such that $u_t = \eta_t(h_t)$. It follows that
$$
P(x_{t+1} \in B \,|\, x_t)
= \int_U \int_{X^t} \Phi(x_{t+1} \in B \,|\, u_t, x_t)\, 1_{\{u_t = \eta(h_t)\}}\, P(dh_t | x_t)
$$
$$
= \int_U \Phi(x_{t+1} \in B \,|\, u_t, x_t) \int_{X^t} 1_{\{u_t = \eta(h_t) \in du\}}\, P(dh_t | x_t)
= \int_U \Phi(x_{t+1} \in B \,|\, u, x_t)\, \tilde{P}(du | x_t), \qquad (5.5)
$$



where
$$\tilde{P}(du | x_t) = \int_{X^t} 1_{\{u = \eta(h_t) \in du\}}\, P(dh_t | x_t).$$
Thus, we may replace any history dependent policy with a possibly randomized Markov policy. By hypothesis, there exists a solution almost surely. If the solution is randomized, then there exists some distribution on $U$ such that $P(u_t \in B | x_t) = \kappa_t(B)$, and the optimal cost can be expressed as a dynamic programming recursion for some sequence of functions $\{J_t, t \ge 0\}$ on $X$ with
$$J_t(x_t) = \int_U \kappa_t(du) \Big( c(x_t, u) + \int_X \Phi(dx_{t+1}|x_t, u)\, J_{t+1}(x_{t+1}) \Big).$$
Define
$$M(x_t, u_t) = c(x_t, u_t) + \int_X \Phi(dx_{t+1}|x_t, u_t)\, J_{t+1}(x_{t+1}).$$
If $\inf_{u_t \in U} M(x_t, u_t)$ admits a solution, the distribution $\kappa_t$ can be taken to be a Dirac delta ($\delta$) distribution at the minimizing value. By hypothesis, a solution exists $P$-a.s. Thus, an optimal policy depends only on the state value and the time variable (the dependence on time is due to the time-dependent nature of $J_t$). ⊓⊔

5.3 Existence of Minimizing Selectors and Measurability

The above dynamic programming arguments hold when there exist minimizing control policies (selectors measurable with respect to the Borel $\sigma$-field on $X$).

Measurable Selection Hypothesis: Given a sequence of functions $J_t : X \to \mathbb{R}$, the minimum
$$J_t(x) = \min_{u_t \in U_t(x)} \Big( c(x, u_t) + \int_X J_{t+1}(y)\, Q(dy|x, u_t) \Big)$$
exists for all $x \in X$ and $t \in \{0, 1, 2, \dots, N-1\}$, with
$$J_N(x_N) = c_N(x_N).$$
Furthermore, there exist measurable functions $f_t$ such that
$$J_t(x) = c(x, f_t(x)) + \int_X J_{t+1}(y)\, Q(dy|x, f_t(x)).$$
⊓⊔

Recall that a set in a normed linear space is (sequentially) compact if every sequence in the set has a converging subsequence.

Condition 1: The cost function to be minimized $c(x_t, u_t)$ is continuous on both $U$ and $X$; $U_t(x) = U$ is compact; and $\int_X Q(dy|x, u)\, v(y)$ is a continuous function on $X \times U$ for every continuous and bounded $v$ on $X$. ⊓⊔

Condition 2: For every $x \in X$, the cost function to be minimized $c(x_t, u_t)$ is continuous on $U$; $U_t(x)$ is compact; and $\int_X Q(dy|x, u)\, v(y)$ is a continuous function on $U$ for every bounded, measurable function $v$ on $X$. ⊓⊔

Theorem 5.3.1 Under Condition 1 or Condition 2, there exists an optimal solution, the Measurable Selection Hypothesis applies, and there exists a minimizing control policy $f_t : X \to U_t(x_t)$. Furthermore, under Condition 1, the function $J_t(x)$ is continuous if $c_N(x_N)$ is continuous.
Furthermore, under Condition 1, the function J t (x) is continuous, if c N (x N ) is continuous.



The result follows from the two lemmas below.

Lemma 5.3.1 A continuous function $f : X \to \mathbb{R}$ over a compact set $A \subset X$ admits a minimum.

Proof: Let $\delta = \inf_{x \in A} f(x)$. Let $\{x_i\}$ be a sequence such that $f(x_i)$ converges to $\delta$. Since $A$ is compact, $\{x_i\}$ must have a converging subsequence $\{x_{i(n)}\}$; let the limit of this subsequence be $x_0$. Then $\{x_{i(n)}\} \to x_0$ and thus, by continuity, $\{f(x_{i(n)})\} \to f(x_0)$. As such, $f(x_0) = \delta$. ⊓⊔

To see why compactness is important, consider
$$\inf_{x \in A} \frac{1}{x}$$
for $A = [1, 2)$. How about for $A = \mathbb{R}$? In both cases there does not exist an $x$ value in the specified set which attains the infimum.

Lemma 5.3.2 Let $U$ be compact, and let $c(x, u)$ be continuous on $X \times U$. Then $\min_u c(x, u)$ is continuous on $X$.

Proof: Let $x_n \to x$, let $u_n$ be optimal for $x_n$ and $u$ optimal for $x$. Such optimal action values exist as a result of the compactness of $U$ and continuity. Now,
$$\Big| \min_u c(x_n, u) - \min_u c(x, u) \Big| \le \max\Big( c(x_n, u) - c(x, u),\ c(x, u_n) - c(x_n, u_n) \Big). \qquad (5.6)$$
The first term above converges to zero since $c$ is continuous in $(x, u)$. The second converges also. Suppose otherwise. Then, for some $\epsilon > 0$, there exists a subsequence such that
$$c(x, u_{k_n}) - c(x_{k_n}, u_{k_n}) \ge \epsilon.$$
Consider the sequence $(x_{k_n}, u_{k_n})$. Since $U$ is compact, there exists a further subsequence $(x_{k'_n}, u_{k'_n})$ which converges to $(x, u')$ for some $u'$. Hence, along this subsequence, we have convergence of $c(x_{k'_n}, u_{k'_n})$ as well as of $c(x, u_{k'_n})$, leading to a contradiction. ⋄

Compactness is a crucial component of this result.

The textbook by Hernandez-Lerma and Lasserre provides more general conditions. We did not address here the issue of the existence of measurable selectors in full generality; we refer the reader again to Schäl [29]. We do, however, state the following result.

Lemma 5.3.3 Let $c(x, u)$ be a continuous function of $u$ for every $x$, and let $U$ be a compact set. Then there exists a Borel measurable function $f : X \to U$ such that
$$c(x, f(x)) = \min_{u \in U} c(x, u).$$
Proof: We first observe that the function
$$\tilde{c}(x) := \min_{u \in U} c(x, u)$$
is Borel measurable. This follows from the observation that it is sufficient to prove that $\{x : \tilde{c}(x) \le \alpha\}$ is Borel for every $\alpha \in \mathbb{R}$; by continuity of $c$ and compactness of $U$, the result follows. Define the set $\{(x, u) : c(x, u) \le \tilde{c}(x)\}$; this is a Borel set. We can now construct a measurable function which lives in this set, using the property that the control action space is a separable, complete metric space, as discussed in the proof of Theorem 5.2.1. Note that the continuity of $c$ in $u$ ensures that the graph issue discussed there does not occur here, since $\lim_{n \to \infty} c(x, \gamma_n(x)) = c(x, \gamma^*(x)) \le \tilde{c}(x)$. ⊓⊔



We can replace the compactness condition with an inf-compactness condition, and modify Condition 1 as below.

Condition 3: For every $x \in X$, the cost function to be minimized $c(x_t, u_t)$ is continuous on $X \times U$ and non-negative; $\{u : c(x, u) \le \alpha\}$ is compact for all $\alpha > 0$ and all $x \in X$; and $\int_X Q(dy|x, u)\, v(y)$ is a continuous function on $X \times U$ for every continuous and bounded $v$. ⊓⊔

Theorem 5.3.2 Under Condition 3, the Measurable Selection Hypothesis applies.

Remark: We could relax the continuity condition and replace it with lower semi-continuity. A function $f$ is lower semi-continuous at $x_0$ if $\liminf_{x \to x_0} f(x) \ge f(x_0)$. ⊓⊔

Essentially what is needed is the existence of a minimizing control function, that is, a selector which is optimal for every $x_t \in X$. In class we solved the following problem, with $q > 0$, $r > 0$, $p_T > 0$:
$$\inf_{\Pi} E^{\Pi}_x\Big[ \sum_{t=0}^{T-1} \big(q x_t^2 + r u_t^2\big) + p_T x_T^2 \Big]$$
for a linear system
$$x_{t+1} = a x_t + u_t + w_t,$$
where $w_t$ is a zero-mean, i.i.d. Gaussian random variable with variance $\sigma_w^2$. We showed, by the method of completing the squares, that
$$J_t(x_t) = P_t x_t^2 + \sum_{k=t}^{T-1} P_{k+1} \sigma_w^2,$$
where
$$P_t = q + P_{t+1} a^2 - \frac{(P_{t+1} a)^2}{P_{t+1} + r},$$
and the optimal control policy is
$$u_t = -\frac{P_{t+1} a}{P_{t+1} + r}\, x_t.$$
Note that the optimal control policy is Markov (as it uses only the current state).
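The backward recursion above is straightforward to evaluate numerically. The following Python sketch (added here; the numerical values of $q$, $r$, $a$, $\sigma_w^2$, $p_T$ and the horizon are assumptions chosen for illustration) computes $P_t$, the resulting feedback gains, and the constant term $\sum_{k=0}^{T-1} P_{k+1}\sigma_w^2$ appearing in $J_0$.

# Assumed illustrative parameters for the scalar LQ recursion derived above.
q, r, a, sigma_w2, p_T, T = 1.0, 1.0, 1.2, 1.0, 1.0, 10

P = [0.0] * (T + 1)
P[T] = p_T
gains = [0.0] * T
for t in reversed(range(T)):
    P[t] = q + P[t + 1] * a * a - (P[t + 1] * a) ** 2 / (P[t + 1] + r)
    gains[t] = P[t + 1] * a / (P[t + 1] + r)     # u_t = -gains[t] * x_t

J0_noise_term = sigma_w2 * sum(P[k + 1] for k in range(T))   # sum_{k=0}^{T-1} P_{k+1} sigma_w^2
print("P_0 =", round(P[0], 4), "  first gain =", round(gains[0], 4))
print("J_0(x) = P_0 x^2 +", round(J0_noise_term, 4))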

5.4 Infinite Horizon Optimal Control Problems

When the time horizon becomes unbounded, we cannot directly invoke dynamic programming in the form considered earlier. In such settings, we look for conditions for optimality; we will investigate such conditions in this section. The infinite horizon problems we consider belong to two classes: discounted cost and average cost problems. We first discuss the discounted cost problem; the average cost problem is discussed in Chapter 7.

5.4.1 Discounted Cost

One popular class of problems consists of those in which future costs are discounted: the future is less important than today, in part due to the uncertainty in the future, as well as the economic understanding that the current value of a good is more important than its value in the future. The discounted cost function is given as
$$J(\nu_0, \Pi) = E^{\Pi}_{\nu_0}\Big[ \sum_{t=0}^{T-1} \beta^t c(x_t, u_t) \Big],$$



for some $\beta \in (0, 1)$. If there exists a policy $\Pi^*$ which minimizes this cost, the policy is said to be optimal. We can consider an infinite horizon problem by taking the limit (when $c$ is non-negative)
$$J(\nu_0, \Pi) = \lim_{T \to \infty} E^{\Pi}_{\nu_0}\Big[ \sum_{t=0}^{T-1} \beta^t c(x_t, u_t) \Big]$$
and invoking the Monotone Convergence Theorem:
$$J(\nu_0, \Pi) = E^{\Pi}_{\nu_0}\Big[ \sum_{t=0}^{\infty} \beta^t c(x_t, u_t) \Big],$$
for some $\beta \in (0, 1)$. Note that
$$J(x_0, \Pi) = E^{\Pi}_{x_0}\Big[ c(x_0, u_0) + E^{\Pi}_{x_1}\Big[ \sum_{t=1}^{\infty} \beta^t c(x_t, u_t) \Big] \Big| x_0, u_0 \Big],$$
or
$$J(x_0, \Pi) = E^{\Pi}_{x_0}\Big[ c(x_0, u_0) + \beta E^{\Pi}_{x_1}\Big[ \sum_{t=1}^{\infty} \beta^{t-1} c(x_t, u_t) \Big| x_0, u_0 \Big] \Big],$$
or
$$J(x_0, \Pi) = E^{\Pi}_{x_0}\big[ c(x_0, u_0) + \beta E^{\Pi}_{x_1}[J(x_1, \Pi)] \big| x_0, u_0 \big].$$
A function $v : X \to \mathbb{R}$ is a solution to the infinite horizon problem if it satisfies
$$v(x) = \min_{U} \Big\{ c(x, u) + \beta \int_X v(y)\, Q(dy|x, u) \Big\}$$
for all $x \in X$. We call $v(\cdot)$ a value function.

Now, let $J(x) = \inf_{\Pi \in \Pi_A} J(x, \Pi)$. We observe that the cost of starting at any position, at any time $k$, leads to a recursion where the future cost is multiplied by $\beta$. This can be observed by applying the dynamic programming equations discussed earlier: for a given finite horizon truncation, we work with the sequential updates
$$J_t(x_t) = \min_{U} \Big\{ c(x_t, u_t) + \beta \int_X J_{t+1}(x_{t+1})\, Q(dx_{t+1}|x_t, u_t) \Big\},$$
and hope that this converges to a limit. Under measurable selection conditions, and a boundedness condition for the cost function, there exists a solution to the discounted cost problem. In particular, the following holds.

Theorem 5.4.1 Suppose the cost function is bounded and non-negative, and one of the measurable selection conditions (Condition 1, Condition 2 or Condition 3) applies. Then there exists a solution to the discounted cost problem. Furthermore, the optimal cost (value function) is obtained by a successive iteration:
$$v_n(x) = \min_{U} \Big\{ c(x, u) + \beta \int_X v_{n-1}(y)\, Q(dy|x, u) \Big\}, \qquad \forall x,\ n \ge 1,$$
with $v_0(x) = 0$.

Before presenting the proof, we state the following two lemmas. The first one is on the exchange of the order of minima and limits.

Lemma 5.4.1 [Hernandez-Lerma and Lasserre] Let $V_n(x, u) \uparrow V(x, u)$ pointwise. Suppose that $V_n$ and $V$ are continuous in $u$ and that $U(x)$ is compact for every $x \in X$. Then
$$\lim_{n \to \infty} \min_{u \in U(x)} V_n(x, u) = \min_{u \in U(x)} V(x, u).$$



Proof. The proof follows from essentially the same arguments as in the proof of Lemma 5.3.2. Let $u^*_n$ solve $\min_{u \in U(x)} V_n(x, u)$. Note that
$$\Big| \min_{u \in U(x)} V_n(x, u) - \min_{u \in U(x)} V(x, u) \Big| \le V(x, u^*_n) - V_n(x, u^*_n), \qquad (5.7)$$
since $V_n(x, u) \uparrow V(x, u)$. Now, suppose that
$$V(x, u^*_n) - V_n(x, u^*_n) > \epsilon \qquad (5.8)$$
along a subsequence $n_k$. There exists a further subsequence $n'_k$ such that $u^*_{n'_k} \to \bar{u}$ for some $\bar{u}$. Now, for every $n'_k \ge n$, since $V_n$ is monotonically increasing,
$$V(x, u^*_{n'_k}) - V_{n'_k}(x, u^*_{n'_k}) \le V(x, u^*_{n'_k}) - V_n(x, u^*_{n'_k}).$$
However, $V(x, \cdot)$ and, for a fixed $n$, $V_n(x, \cdot)$ are continuous; hence these two terms converge to
$$V(x, \bar{u}) - V_n(x, \bar{u}).$$
For every fixed $x$ and $\bar{u}$, and every $\epsilon > 0$, we can find a sufficiently large $n$ such that $V(x, \bar{u}) - V_n(x, \bar{u}) \le \epsilon/2$. Hence (5.8) cannot hold. ⊓⊔

Lemma 5.4.2 The space of measurable functions $X \to \mathbb{R}$ with finite supremum norm, endowed with the $\|\cdot\|_{\infty}$ norm, that is,
$$L_{\infty}(X) = \{f : \|f\|_{\infty} = \sup_x |f(x)| < \infty\},$$
is a Banach space.

Proof of Theorem 5.4.1. We obtain the optimal solution as a limit of discounted cost problems with a finite horizon $T$. By dynamic programming, we obtain the recursion for every $T$ as
$$J^T_T(x) = 0,$$
$$J^T_t(x) = T(J^T_{t+1})(x) = \min_{U} \Big\{ c(x, u) + \beta \int_X J^T_{t+1}(y)\, Q(dy|x, u) \Big\}.$$
This sequence leads to a solution for the $T$-stage discounted cost problem. We then look for a solution as $T \to \infty$ and invoke Lemma 5.4.1, since $J^T_t(x) \uparrow J^{\infty}_t(x)$ for any $t \in \mathbb{Z}_+$. Note also that $J^{\infty}_1(x) = J^{\infty}_2(x) = J^{\infty}_3(x)$ and so on. Thus, the sequence of such solutions leads to a solution for the infinite horizon problem, which satisfies
$$J^{\infty}_0(x) = T(J^{\infty})(x) = \min_{U} \Big\{ c(x, u) + \beta \int_X J^{\infty}(y)\, Q(dy|x, u) \Big\}.$$
Now, the next step is to obtain a solution to this fixed point equation. We will obtain the solution using an iteration, known as the value iteration algorithm. We observe that the function $J^{\infty}(\cdot)$ lives in $L_{\infty}(X)$ (since the cost is bounded, there is a uniform bound for every $x$). We show that the iteration given by
$$T(v)(x) = \min_{U} \Big\{ c(x, u) + \beta \int_X v(y)\, Q(dy|x, u) \Big\}$$



is a contraction on $L_{\infty}(X)$. Indeed,
$$
\|T(v) - T(v')\|_{\infty} = \sup_{x \in X} |T(v)(x) - T(v')(x)|
= \sup_{x \in X} \Big| \min_U \Big\{c(x,u) + \beta \int_X v(y)\, Q(dy|x,u)\Big\} - \min_U \Big\{c(x,u) + \beta \int_X v'(y)\, Q(dy|x,u)\Big\} \Big|
$$
$$
= \sup_{x \in X} \Bigg( 1_{A_1}\Big\{ \min_U \Big\{c(x,u) + \beta \int_X v(y)\, Q(dy|x,u)\Big\} - \min_U \Big\{c(x,u) + \beta \int_X v'(y)\, Q(dy|x,u)\Big\} \Big\}
+ 1_{A_2}\Big\{ - \min_U \Big\{c(x,u) + \beta \int_X v(y)\, Q(dy|x,u)\Big\} + \min_U \Big\{c(x,u) + \beta \int_X v'(y)\, Q(dy|x,u)\Big\} \Big\} \Bigg)
$$
$$
\le \sup_{x \in X} \Bigg( 1_{A_1}\Big\{ c(x, u^*) + \beta \int_X v(y)\, Q(dy|x,u^*) - c(x, u^*) - \beta \int_X v'(y)\, Q(dy|x,u^*) \Big\}
+ 1_{A_2}\Big\{ - c(x, u^{**}) - \beta \int_X v(y)\, Q(dy|x,u^{**}) + c(x, u^{**}) + \beta \int_X v'(y)\, Q(dy|x,u^{**}) \Big\} \Bigg)
$$
$$
= \sup_{x \in X} \Big( 1_{A_1}\Big\{ \beta \int_X (v(y) - v'(y))\, Q(dy|x,u^*) \Big\} \Big) + \sup_{x \in X} \Big( 1_{A_2}\Big\{ \beta \int_X (v'(y) - v(y))\, Q(dy|x,u^{**}) \Big\} \Big)
$$
$$
\le \beta \|v - v'\|_{\infty} \Big\{ 1_{A_1} \int_X Q(dy|x,u^*) + 1_{A_2} \int_X Q(dy|x,u^{**}) \Big\}
= \beta \|v - v'\|_{\infty}. \qquad (5.9)
$$
Here $A_1$ denotes the event that
$$\min_U \Big\{c(x,u) + \beta \int_X v(y)\, Q(dy|x,u)\Big\} \ge \min_U \Big\{c(x,u) + \beta \int_X v'(y)\, Q(dy|x,u)\Big\},$$
$A_2$ denotes the complementary event, $u^{**}$ is the minimizing control for $\{c(x,u) + \beta \int_X v(y)\, Q(dy|x,u)\}$, and $u^*$ is the minimizer for $\{c(x,u) + \beta \int_X v'(y)\, Q(dy|x,u)\}$. ⊓⊔

This iteration is known as the value iteration algorithm.
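A minimal value iteration sketch in Python follows (added for illustration; the small finite model, its kernel $Q$ and its cost $c$ are assumptions made up here). Since $T$ is a $\beta$-contraction in the sup norm, successive differences satisfy $\|v_{n+1} - v_n\|_{\infty} \le \beta \|v_n - v_{n-1}\|_{\infty}$, which the printed gaps illustrate.

# Assumed illustrative discounted model: 3 states, 2 actions, beta = 0.9.
beta = 0.9
N_STATES, N_ACTIONS = 3, 2
Q = [
    [[0.7, 0.2, 0.1], [0.3, 0.5, 0.2], [0.1, 0.3, 0.6]],   # Q[u][x][z] for action 0
    [[0.4, 0.4, 0.2], [0.1, 0.8, 0.1], [0.2, 0.2, 0.6]],   # action 1
]
c = lambda x, u: float(x) + 0.5 * u

def bellman(v):
    """One application of T(v)(x) = min_u { c(x,u) + beta * sum_z Q(z|x,u) v(z) }."""
    return [min(c(x, u) + beta * sum(Q[u][x][z] * v[z] for z in range(N_STATES))
                for u in range(N_ACTIONS))
            for x in range(N_STATES)]

v = [0.0] * N_STATES          # v_0 = 0
prev_gap = None
for n in range(1, 60):
    v_new = bellman(v)
    gap = max(abs(a - b) for a, b in zip(v_new, v))
    if n <= 5 or n % 10 == 0:
        ratio = "" if prev_gap is None else f"   gap ratio = {gap / prev_gap:.3f}"
        print(f"n={n:2d}  ||v_n - v_(n-1)||_inf = {gap:.6f}{ratio}")
    prev_gap, v = gap, v_new
print("value function v* approximately", [round(x, 4) for x in v])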

Remark 5.4.1 We note that if the cost is not bounded, one can define a weighted sup-norm $\|c\|_v = \sup_x \big|\frac{c(x)}{v(x)}\big|$, where $v$ is a positive function. The discussion above applies to this context as well.

5.5 Linear Quadratic Gaussian Problem

Consider the following linear system:
$$x_{t+1} = A x_t + B u_t + w_t,$$
where $x \in \mathbb{R}^n$, $u \in \mathbb{R}^m$ and $w \in \mathbb{R}^n$. Suppose $\{w_t\}$ is i.i.d. Gaussian with a given covariance matrix $E[w_t w_t^T] = W$ for all $t \ge 0$. The goal is to obtain
$$\inf_{\Pi} J(\Pi, x),$$
where
$$J(\Pi, x) = E^{\Pi}_x\Big[ \sum_{t=0}^{T-1} \big( x_t^T Q x_t + u_t^T R u_t \big) + x_T^T Q_T x_T \Big],$$
with $R, Q, Q_T > 0$ (that is, these matrices are positive definite). In class, we obtained the dynamic programming recursion for this optimal control problem.



Theorem 5.5.1 The optimal control is linear and has the form
$$u_t = -(B^T P_{t+1} B + R)^{-1} B^T P_{t+1} A\, x_t,$$
where $P_t$ solves the discrete-time Riccati equation
$$P_t = Q + A^T P_{t+1} A - A^T P_{t+1} B (B^T P_{t+1} B + R)^{-1} B^T P_{t+1} A,$$
with final condition $P_T = Q_T$.

Theorem 5.5.2 As $T \to \infty$ in the optimization problem above, if $(A, B)$ is controllable and, with $Q = C^T C$, $(A, C)$ is observable, the optimal policy is stationary. Under the stated assumptions of observability and controllability, the Riccati recursion
$$P_t = Q + A^T P_{t+1} A - A^T P_{t+1} B (B^T P_{t+1} B + R)^{-1} B^T P_{t+1} A$$
has a unique fixed point $P$, which is furthermore positive definite.

We observe that, if the system has noise and we consider an average cost optimization problem, the effect of the noise stays bounded, and the same solution P with the induced control policy is also optimal for the minimization of the expected average cost:

lim_{T→∞} (1/T) E_x^Π [ Σ_{t=0}^{T−1} ( x_t^T Q x_t + u_t^T R u_t ) + x_T^T Q_T x_T ].
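The backward Riccati recursion of Theorem 5.5.1 is straightforward to compute numerically; here is a short sketch in which the matrices A, B, Q, R and the terminal weight are illustrative assumptions. For a long horizon, the gain at t = 0 approximates the stationary gain of Theorem 5.5.2.

```python
import numpy as np

# Illustrative system matrices (assumptions, not from the notes).
A = np.array([[1.0, 1.0], [0.0, 1.0]])
B = np.array([[0.0], [1.0]])
Q = np.eye(2); R = np.array([[1.0]]); QT = np.eye(2)
T = 50

P = QT
gains = []
for t in reversed(range(T)):
    # u_t = -(B^T P_{t+1} B + R)^{-1} B^T P_{t+1} A x_t
    K = np.linalg.solve(B.T @ P @ B + R, B.T @ P @ A)
    gains.append(K)
    # Discrete-time Riccati recursion
    P = Q + A.T @ P @ A - A.T @ P @ B @ K

print(gains[-1])   # gain at t = 0; close to the stationary gain for large T
```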

5.6 Exercises

Exercise 5.6.1 Consider the following problem considered in class. An investor's wealth dynamics is given by the following:

x_{t+1} = u_t w_t,

where {w_t} is an i.i.d. R_+-valued stochastic process with E[w_t] = 1. The investor has access to the past and current wealth information and his previous actions. The goal is to maximize

J(x_0, Π) = E_{x_0}^Π [ Σ_{t=0}^{T−1} √(x_t − u_t) ].

The investor's action set for any given x is U(x) = [0, x].

Formulate the problem as an optimal stochastic control problem by clearly identifying the state, the control action spaces, the information available at the controller, the transition kernel and a cost functional mapping the actions and states to R.

Find an optimal policy.

Hint: For α ≥ 0, √(x − u) + α√u is a concave function of u for 0 ≤ u ≤ x, and its maximum is computed by setting the derivative of √(x − u) + α√u to zero.

Exercise 5.6.2 Consider an investment problem in a stock market. Suppose there are N possible goods and suppose these goods have prices {p_t^i, i = 1, ..., N} at time t taking values in a countable set P. Further, suppose that the stocks evolve stochastically according to a Markov transition such that for every 1 ≤ i ≤ N

P(p_{t+1}^i = m | p_t^i = n) = P^i(n, m).

Suppose that there is a transaction cost of c for every purchase and sale of each good. Suppose that the investor can only hold on to at most M units of stock. At time t, the investor has access to his current and past price information and his investment allocations up to time t.

When a transaction is made, the following might occur. If the investor sells one unit of good i, the investor spends c dollars for the transaction and adds p_t^i to his account. If he buys one unit of good j, then he takes out p_t^j dollars per unit from his account and also takes out c dollars. For every such transaction per unit, there is a cost of c dollars. If he does not do anything, then there is no cost. Assume that the investor can make at most one transaction per time stage.

Suppose the investor initially has an unlimited amount of money and a given stock allocation.

The goal is to maximize the total worth of the stock at the final time stage t = T ∈ Z_+ minus the total transaction costs plus the money earned in between.

a) Formulate the problem as a stochastic control problem by clearly identifying the state, the control actions, the information available at the controller, the transition kernel and a cost functional mapping the actions and states to R.

b) By Dynamic Programming, generate a recursion which will lead to an optimal control policy.

Exercise 5.6.3 Consider a special case of the above problem, where the objective is to maximize the final worth of the investment and minimize the transaction cost. Furthermore, suppose there is only one uniform stock price with a transition kernel given by

P(p_{t+1} = m | p_t = n) = P(n, m).

The investor can again hold at most M units of goods, but might wish to purchase more, or sell them.

a) Formulate the problem as a stochastic control problem by clearly identifying the state, the control actions, the transition kernel and a cost functional mapping the actions and states to R.

b) By Dynamic Programming, generate a recursion which will lead to an optimal control policy.

c) What is the optimal control policy?

Exercise 5.6.4 A fishery manager annually has x_t units of fish and sells u_t x_t of these, where u_t ∈ [0, 1]. With the remaining ones, the next year's production is given by the following model:

x_{t+1} = w_t x_t (1 − u_t) + w_t,

where x_0 is given and {w_t} is an independent, identically distributed sequence of random variables with w_t ≥ 0 for all t, so that E[w_t] = w̃ ≥ 0.

The goal is to maximize the profit over the time horizon 0 ≤ t ≤ T − 1. At time T, he sells all of the fish.

a) Formulate the problem as an optimal stochastic control problem by clearly identifying the state, the control actions, the information available at the controller, the transition kernel and a cost functional mapping the actions and states to R.

b) Does there exist an optimal policy? If it does, compute the optimal control policy as a dynamic programming recursion.

Exercise 5.6.5 Consider the following linear system:

x_{t+1} = A x_t + B u_t + w_t,

where x ∈ R^n, u ∈ R^m and w ∈ R^n. Suppose {w_t} is i.i.d. Gaussian with a given covariance matrix E[w_t w_t^T] = W for all t ≥ 0.

The goal is to obtain

inf_Π J(Π, x),

where

J(Π, x) = E_x^Π [ Σ_{t=0}^{T−1} ( x_t^T Q x_t + u_t^T R u_t ) + x_T^T Q_T x_T ],

with R, Q, Q_T > 0 (that is, these matrices are positive definite).

a) Show that there exists an optimal policy.

b) Obtain the Dynamic Programming recursion for the optimal control problem. Is the optimal control policy Markov? Is it stationary?

c) For T → ∞, if (A, B) is controllable and, with Q = C^T C, (A, C) is observable, prove that the optimal policy is stationary.

Exercise 5.6.6 Consider an unemployed person who will have to work for years t = 1, 2, ..., 10 if she takes a job at any given t.

Suppose that in each year in which she remains unemployed, she may be offered a good job that pays 10 dollars per year (with probability 1/4); she may be offered a bad job that pays 4 dollars per year (with probability 1/4); or she may not be offered a job (with probability 1/2). These job offer events are independent from year to year (that is, the job market is represented by an independent sequence of random variables for every year).

Once she accepts a job, she will remain in that job for the rest of the ten years. That is, for example, she cannot switch from the bad job to the good job.

Suppose the goal is to maximize the expected total earnings in ten years, starting from year 1 up to year 10 (including year 10).

Should this person accept good jobs, or bad jobs, at any given year? What is her best policy?

State the problem as a Markov Decision Problem, identify the state space, the action space and the transition kernel.

Obtain the optimal solution.




Chapter 6

Partially Observed Markov Decision Processes

As discussed earlier in Chapter 2, consider a system of the form

x_{t+1} = f(x_t, u_t, w_t),  y_t = g(x_t, v_t).

Here, x_t is the state, u_t ∈ U is the control, (w_t, v_t) ∈ W × V are second order, zero-mean, i.i.d. noise processes, and w_t is independent of v_t. We also assume that the state noise w_t either has a probability mass function or admits a probability measure which is absolutely continuous with respect to the Lebesgue measure; this will ensure that the probability measure admits a density function. Hence, the notation p(x) will denote either the probability mass function for discrete-valued spaces or the probability density function for uncountable spaces. The controller only has causal access to the second component {y_t} of the process. A policy Π is measurable with respect to σ(I_t) with I_t = {y_{[0,t]}, u_{[0,t−1]}}. We denote the observed history space as H_0 := P, H_t = H_{t−1} × Y × U. Here P denotes the space of probability measures on X, which we assume to be a Borel space. Hence, the set of randomized causal control policies is such that P(u(h_t) ∈ U | I_t) = 1 for all I_t ∈ H_t.

6.1 Enlargement of the State-Space and the Construction of a Controlled Markov Chain

One could transform a partially observable Markov Decision Problem into a Fully Observed Markov Decision Problem via an enlargement of the state space. In particular, we obtain via the properties of total probability the following dynamical recursion:

π_t(x) := P(x_t = x | y_{[0,t]}, u_{[0,t−1]})
= [ Σ_{x_{t−1}∈X} π_{t−1}(x_{t−1}) P(y_t | x_t = x) P(x_t = x | x_{t−1}, u_{t−1}) P(u_{t−1} | y_{[0,t−1]}, u_{[0,t−2]}) ]
  / [ Σ_{x_t∈X} Σ_{x_{t−1}∈X} π_{t−1}(x_{t−1}) P(y_t | x_t) P(x_t | x_{t−1}, u_{t−1}) P(u_{t−1} | y_{[0,t−1]}, u_{[0,t−2]}) ].

Here, the summations need to be exchanged with integrals in case the variables live in an uncountable space. The conditional measure process becomes a controlled Markov chain in P.
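For a finite state space, the filter recursion above amounts to a prediction step, a Bayes (likelihood) update, and a normalization. The sketch below assumes hypothetical kernels P(x′|x,u) and P(y|x); they are placeholders for illustration only.

```python
import numpy as np

def belief_update(pi, u, y, P_trans, P_obs):
    """One filter step: pi_t -> pi_{t+1} given the applied u_t and observed y_{t+1}.

    P_trans[u][x, x'] = P(x_{t+1}=x' | x_t=x, u_t=u)   (assumed kernel)
    P_obs[x', y]      = P(y=y | x=x')                   (assumed kernel)
    """
    predicted = pi @ P_trans[u]                 # sum_x pi(x) P(x'|x,u)
    unnormalized = predicted * P_obs[:, y]      # multiply by the likelihood of y
    return unnormalized / unnormalized.sum()    # normalize (total probability)

# Tiny illustrative example: 2 states, 2 actions, 2 observations.
P_trans = [np.array([[0.9, 0.1], [0.2, 0.8]]), np.array([[0.5, 0.5], [0.6, 0.4]])]
P_obs = np.array([[0.7, 0.3], [0.1, 0.9]])
pi = np.array([0.5, 0.5])
print(belief_update(pi, u=0, y=1, P_trans=P_trans, P_obs=P_obs))
```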

Theorem 6.1.1 The process {π_t, u_t} is a controlled Markov chain. That is, under any admissible control policy, given the action at time t ≥ 0 and π_t, π_{t+1} is conditionally independent from {π_s, u_s, s ≤ t − 1}.

Proof: Suppose X is countable. Let D ∈ B(P(X)). Then

P(π_{t+1} ∈ D | π_s, u_s, s ≤ t)
= P( [ Σ_X π_t(x_t) P(y_{t+1}|x_{t+1}) P(x_{t+1}|x_t, u_t) P(u_t|y_{[0,t]}, u_{[0,t−1]}) ] / [ Σ_X Σ_X π_t(x_t) P(y_{t+1}|x_{t+1}) P(x_{t+1}|x_t, u_t) P(u_t|y_{[0,t]}, u_{[0,t−1]}) ] ∈ D | π_s, u_s, s ≤ t )
= P( [ Σ_X π_t(x_t) P(y_{t+1}|x_{t+1}) P(x_{t+1}|x_t, u_t) ] / [ Σ_X Σ_X π_t(x_t) P(y_{t+1}|x_{t+1}) P(x_{t+1}|x_t, u_t) ] ∈ D | π_s, u_s, s ≤ t )
= P( [ Σ_X π_t(x_t) P(y_{t+1}|x_{t+1}) P(x_{t+1}|x_t, u_t) ] / [ Σ_X Σ_X π_t(x_t) P(y_{t+1}|x_{t+1}) P(x_{t+1}|x_t, u_t) ] ∈ D | π_t, u_t ). ⋄

Let the cost function to be minimized be

E_{x_0}^Π [ Σ_{t=0}^{T−1} c(x_t, u_t) ],

where E_{x_0}^Π[·] denotes the expectation over all sample paths with initial state given by x_0 under policy Π. We can transform the system into a fully observed Markov model as follows. Since

E_{x_0}^Π [ Σ_{t=0}^{T−1} c(x_t, u_t) ] = E_{x_0}^Π [ Σ_{t=0}^{T−1} E[c(x_t, u_t) | I_t] ],

we can write the per-stage cost function as

˜c(π, u) = Σ_X c(x, u) π(x),  π ∈ P.   (6.1)

The stochastic transition kernel q is given by

q(x, y | π, u) = Σ_X P(x, y | x′, u) π(x′),  π ∈ P,

and this kernel can be decomposed as q(x, y | π, u) = P(y | π, u) P(x | π, u, y). The second term here is the filtering equation, mapping (π, u, y) ∈ P × U × Y to P. It follows that (P, U, K, ˜c) defines a completely observable controlled Markov process. Here, we have

K(B | π, u) = Σ_Y 1_{{P(· | π, u, y) ∈ B}} P(y | π, u),  ∀B ∈ σ(P).

As such, one can obtain the optimal solution by using the filtering equation as a sufficient statistic in a centralized setting, since Markov policies (policies that use the Markov state as their sufficient statistic) are optimal for the control of Markov chains, under well-studied sufficiency conditions for the existence of optimal selectors.

We call control policies which use π as their information to generate the control separated control policies. This will become clearer in the context of Linear Quadratic Gaussian control systems.

Since in our analysis earlier we had assumed that the state belongs to a complete, separable, metric space, the analysis here follows the analysis in the previous chapter, provided that the measurability, continuity and compactness conditions are verified.

6.2 Linear Quadratic Gaussian Case and Kalman Filtering

Consider the following linear system:

x_{t+1} = A x_t + B u_t + w_t,

and

y_t = C x_t + v_t,

where x ∈ R^n, u ∈ R^m, w ∈ R^n, y ∈ R^p, v ∈ R^p. Suppose {w_t, v_t} are i.i.d. Gaussian random vectors with given covariance matrices E[w_t w_t^T] = W and E[v_t v_t^T] = V for all t ≥ 0.

The goal is to obtain

inf_Π J(Π, µ_0),

where

J(Π, µ_0) = E_{µ_0}^Π [ Σ_{t=0}^{T−1} ( x_t^T Q x_t + u_t^T R u_t ) + x_T^T Q_T x_T ],

with R > 0 and Q, Q_T ≥ 0 (that is, these matrices are positive definite and positive semi-definite, respectively).

We observe that the optimal control is linear and has the form

u_t = −(B^T K_{t+1} B + R)^{−1} B^T K_{t+1} A E[x_t | I_t],

where K_t solves the Discrete-Time Riccati Equation

K_t = Q + A^T K_{t+1} A − A^T K_{t+1} B (B^T K_{t+1} B + R)^{−1} B^T K_{t+1} A,

with final condition K_T = Q_T. We obtained

J_t(I_t) = E[x_t^T K_t x_t] + Σ_{k=t}^{T−1} ( E[(x_k − E[x_k | I_k])^T Q_k (x_k − E[x_k | I_k])] + E[ w̃_k^T K_{k+1} w̃_k ] ),   (6.2)

where w̃_t = Σ_{t|t−1} C^T (C Σ_{t|t−1} C^T + V)^{−1} (y_t − C E[x_t | I_{t−1}]) and Σ is to be derived below.

6.2.1 Separation of Estimation and Control

In the above problem, we observed that the optimal control has a separation structure: the controller first estimates the state, and then applies its control action, regarding the estimate as the state itself.

We say that the control has a dual effect if the control affects the estimation quality. Typically, when the dual effect is absent, the separation principle observed above applies. In more general problems, however, the dual effect of the control is present; that is, depending on the control, the estimation quality at the controller will be affected. As an example, consider a linear system controlled over the Internet, where the controller applies a control but does not know whether the packet reaches the destination or not. In this case, the control signal which was intended to be sent does affect the estimation error quality.

6.3 Estimation and Kalman Filtering

Lemma 6.3.1 Let x, y be random variables with finite second moments and R > 0. The function which solves inf_{g(·)} E[(x − g(y))^T R (x − g(y))] is G(y) = E[x|y].

Proof: Let g(y) = E[x|y] + h(y) for some measurable h. Then, P-a.s., we have the following:

E[(x − E[x|y] − h(y))^T R (x − E[x|y] − h(y)) | y]
= E[(x − E[x|y])^T R (x − E[x|y]) | y] + h(y)^T R h(y) − 2 E[(x − E[x|y])^T | y] R h(y)
= E[(x − E[x|y])^T R (x − E[x|y]) | y] + h(y)^T R h(y)
≥ E[(x − E[x|y])^T R (x − E[x|y]) | y].   (6.3)

We note that the above also admits a Hilbert space formulation. Let H denote the space of random variables with finite second moments on which the inner product 〈x, y〉 = E[x^T R y] is defined. Let M_H be the closed subspace of H consisting of random variables which are measurable with respect to σ(y). Then, the Projection Theorem leads to the observation that an optimal estimate g(y) minimizing ||x − g(y)||_2^2, denoted here by G(y), is one which satisfies

〈x − G(y), h(y)〉 = E[(x − G(y))^T R h(y)] = 0,  ∀h ∈ M_H.

The conditional expectation satisfies this since

E[(x − E[x|y])^T R h(y)] = E[ E[(x − E[x|y])^T R h(y) | y] ] = E[ E[(x − E[x|y])^T | y] R h(y) ] = 0,

because P-a.s., E[(x − E[x|y])^T | y] = 0.

For a linear Gaussian system, the state process has a Gaussian probability measure. A Gaussian probability measure is uniquely identified by the mean and the covariance of the Gaussian random variable. This makes the analysis for estimating a Gaussian random variable particularly simple to perform, since the conditional estimate of a partially observed (through additive Gaussian noise) Gaussian random variable is a linear function of the observed variable.

Recall that a Gaussian measure with mean µ and covariance matrix K_XX has the density

P(x) = (1 / ((2π)^{n/2} |K_XX|^{1/2})) e^{−(1/2)(x−µ)^T K_XX^{−1} (x−µ)}.

Lemma 6.3.2 Let x, y be zero-mean jointly Gaussian random vectors. Then E[x|y] is linear in y (if the means are not zero, the conditional expectation is affine). Let

K_XY := [ Σ_XX  Σ_XY ; Σ_YX  Σ_YY ],  where Σ_XX = E[x x^T].

Then E[x|y] = Σ_XY Σ_YY^{−1} y and

E[(x − E[x|y])(x − E[x|y])^T] = Σ_XX − Σ_XY Σ_YY^{−1} Σ_XY^T =: D.

Proof: By Bayes' rule and the fact that the processes admit densities, P(x|y) = P(x,y)/P(y). The inverse K_XY^{−1} is also symmetric (since the eigenvectors are the same and the eigenvalues are inverted) and can be written in block form as

K_XY^{−1} = [ Ψ_XX  Ψ_XY ; Ψ_YX  Ψ_YY ].

Thus,

p(x,y)/p(y) = C e^{−(1/2)( x^T Ψ_XX x + 2 x^T Ψ_XY y + y^T Ψ_YY y )} / e^{−(1/2) y^T K_YY^{−1} y}.

By completing the squares, one obtains a quadratic exponent of the form

p(x|y) = Q(y) e^{−(1/2)(x − Hy)^T D^{−1} (x − Hy)},

where Q(y) depends only on y. Since ∫ p(x|y) dx = 1, it follows that Q(y) = 1 / ((2π)^{n/2} |D|^{1/2}), which is in fact independent of y. ⋄
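As a sanity check of Lemma 6.3.2, one can draw jointly Gaussian samples and compare the empirical behavior of the error x − Σ_XY Σ_YY^{−1} y with D. The scalar covariance values below are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative joint covariance of scalar (x, y) (assumed values).
Sxx, Sxy, Syy = 2.0, 1.2, 1.5
K = np.array([[Sxx, Sxy], [Sxy, Syy]])

samples = rng.multivariate_normal([0.0, 0.0], K, size=200000)
x, y = samples[:, 0], samples[:, 1]

H = Sxy / Syy                       # E[x|y] = Sxy Syy^{-1} y
D = Sxx - Sxy**2 / Syy              # error covariance, independent of y

err = x - H * y
print(H, D)
print(err.var())                    # should be close to D
print(np.corrcoef(err, y)[0, 1])    # close to 0: error orthogonal to y
```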

Remark 6.3.1 Even if the random variables are not Gaussian (but zero-mean), through another Hilbert space formulation and an application of the Projection Theorem, the expression Σ_XY Σ_YY^{−1} y is the best linear estimate. One can naturally generalize this to random variables with non-zero mean.

We will derive the Kalman filter in the following. The following two lemmas are instrumental.

Lemma 6.3.3 If E[x] = 0 and z_1, z_2 are independent zero-mean Gaussian random vectors (jointly Gaussian with x), then E[x|z_1, z_2] = E[x|z_1] + E[x|z_2].

Proof: The proof follows by writing z = [z_1, z_2] and E[x|z] = Σ_XZ Σ_ZZ^{−1} z. ⋄

Lemma 6.3.4 E[(x − E[x|y])(x − E[x|y])^T] does not depend on y and is given by D above.


Now, we can move on to the derivation of the Kalman Filter.

Consider

x_{t+1} = A x_t + w_t,  y_t = C x_t + v_t,

with E[w_t w_t^T] = W and E[v_t v_t^T] = V.

Define:

m_t = E[x_t | y_{[0,t−1]}],
Σ_{t|t−1} = E[ (x_t − E[x_t | y_{[0,t−1]}]) (x_t − E[x_t | y_{[0,t−1]}])^T | y_{[0,t−1]} ].

Theorem 6.3.1 The following holds:

m_{t+1} = A m_t + A Σ_{t|t−1} C^T (C Σ_{t|t−1} C^T + V)^{−1} (y_t − C m_t),
Σ_{t+1|t} = A Σ_{t|t−1} A^T + W − (A Σ_{t|t−1} C^T)(C Σ_{t|t−1} C^T + V)^{−1} (C Σ_{t|t−1} A^T),

with m_0 = E[x_0] and Σ_{0|−1} = E[x_0 x_0^T].

Proof: Since x_{t+1} = A x_t + w_t,

m_{t+1} = E[A x_t + w_t | y_{[0,t]}] = E[A x_t + w_t | y_{[0,t−1]}, y_t − E[y_t | y_{[0,t−1]}]]
= A m_t + E[A(x_t − m_t) | y_{[0,t−1]}, y_t − E[y_t | y_{[0,t−1]}]] + E[w_t | y_{[0,t−1]}, y_t − E[y_t | y_{[0,t−1]}]]
= A m_t + E[A(x_t − m_t) | y_{[0,t−1]}] + E[A(x_t − m_t) | y_t − E[y_t | y_{[0,t−1]}]] + E[w_t | y_t − E[y_t | y_{[0,t−1]}]]
= A m_t + E[A(x_t − m_t) | y_t − E[y_t | y_{[0,t−1]}]] + E[w_t | y_t − E[y_t | y_{[0,t−1]}]]
= A m_t + E[A(x_t − m_t) + w_t | y_t − E[y_t | y_{[0,t−1]}]].   (6.4)

We now apply the lemmas above: with x = A(x_t − m_t) + w_t and y = y_t − C m_t, the relation E[x|y] = Σ_xy Σ_yy^{−1} y leads to

m_{t+1} = A m_t + A Σ_{t|t−1} C^T (C Σ_{t|t−1} C^T + V)^{−1} (y_t − C m_t).

Likewise, writing the estimation error as

x_{t+1} − m_{t+1} = A(x_t − m_t) + w_t − A Σ_{t|t−1} C^T (C Σ_{t|t−1} C^T + V)^{−1} (y_t − C m_t),

a few lines of calculation lead to the covariance recursion

Σ_{t+1|t} = A Σ_{t|t−1} A^T + W − (A Σ_{t|t−1} C^T)(C Σ_{t|t−1} C^T + V)^{−1} (C Σ_{t|t−1} A^T).

The above is the celebrated Kalman filter.
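The recursions of Theorem 6.3.1 are written out below as a short simulation sketch; the model matrices and noise covariances are assumptions chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative model x_{t+1} = A x_t + w_t, y_t = C x_t + v_t (assumed values).
A = np.array([[1.0, 1.0], [0.0, 1.0]])
C = np.array([[1.0, 0.0]])
W = 0.01 * np.eye(2)
V = np.array([[0.25]])

m = np.zeros(2)                 # m_0 = E[x_0]
Sigma = np.eye(2)               # Sigma_{0|-1}
x = rng.multivariate_normal(np.zeros(2), np.eye(2))

for t in range(100):
    y = C @ x + rng.multivariate_normal(np.zeros(1), V)
    # Gain and one-step-ahead predictor update (Theorem 6.3.1)
    S = C @ Sigma @ C.T + V
    gain = A @ Sigma @ C.T @ np.linalg.inv(S)
    m = A @ m + gain @ (y - C @ m)
    Sigma = A @ Sigma @ A.T + W - (A @ Sigma @ C.T) @ np.linalg.inv(S) @ (C @ Sigma @ A.T)
    x = A @ x + rng.multivariate_normal(np.zeros(2), W)

print(m, Sigma)    # Sigma converges under the conditions of Theorem 6.3.2
```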

Define now

m̃_t = E[x_t | y_{[0,t]}] = m_t + E[x_t − m_t | y_{[0,t]}] = m_t + E[x_t − m_t | y_t − E[y_t | y_{[0,t−1]}]].

It follows from m_t = A m̃_{t−1} that

m̃_t = A m̃_{t−1} + Σ_{t|t−1} C^T (C Σ_{t|t−1} C^T + V)^{−1} (y_t − C A m̃_{t−1}).

We observe that the zero-mean variable x_t − m̃_t is orthogonal to I_t, in the sense that the error is independent of the information available at the controller; since the information available is Gaussian, independence and orthogonality are equivalent.

We observe that the recursions in Theorem 6.3.1 resemble the recursions in Theorem 5.5.2. This leads to the following result.



Theorem 6.3.2 Suppose that W = BB^T and that (B^T, A) is observable, (A^T, C^T) is controllable and V > 0. Then, the recursions for the covariance matrices Σ_{(·)} in Theorem 6.3.1 converge to a unique solution, which is positive definite.

6.3.1 Optimal Control of Partially Observed LQG Systems

With this observation, in class we reformulated the quadratic optimization problem as a function of m̃_t, u_t and x_t − m̃_t; the optimal control problem is equivalent to the control of the fully observed state m̃_t, with an additive time-varying independent Gaussian noise process.

In particular, the cost

J(Π, µ_0) = E_{µ_0}^Π [ Σ_{t=0}^{T−1} ( x_t^T Q x_t + u_t^T R u_t ) + x_T^T Q_T x_T ]

writes as

E_{µ_0}^Π [ Σ_{t=0}^{T−1} ( m̃_t^T Q m̃_t + u_t^T R u_t ) + m̃_T^T Q_T m̃_T ] + E_{µ_0}^Π [ Σ_{t=0}^{T−1} (x_t − m̃_t)^T Q (x_t − m̃_t) + (x_T − m̃_T)^T Q_T (x_T − m̃_T) ]

for the fully observed system

m̃_t = A m̃_{t−1} + B u_{t−1} + w̄_t,

with

w̄_t = Σ_{t|t−1} C^T (C Σ_{t|t−1} C^T + V)^{−1} (y_t − C A m̃_{t−1}).

That we can take the error term (x_t − m̃_t) out of the summation is a consequence of what is known as the separation of estimation and control; a more special version is known as the certainty equivalence principle. In class, we observed that the absence of a dual effect plays a key part in this analysis leading to the separation of estimation and control principle, in taking E[(x_t − m̃_t)^T Q (x_t − m̃_t)] out of the conditioning on I_t.

Thus, the optimal control has the following form for all time stages:

u_t = −(B^T K_{t+1} B + R)^{−1} B^T K_{t+1} A E[x_t | I_t],

and the optimal cost will be

m̃_0^T K_0 m̃_0 + E[ Σ_{k=0}^{T−1} w̄_k^T K_{k+1} w̄_k ].
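Below is a sketch of the separated (certainty-equivalent) controller: a backward control Riccati pass producing the gains, and a forward Kalman filtering pass producing m̃_t, with the control applied to the estimate. All numerical values are assumptions, and the terminal weight is taken equal to Q for brevity.

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative LQG data (assumed): x_{t+1} = A x_t + B u_t + w_t, y_t = C x_t + v_t.
A = np.array([[1.0, 0.2], [0.0, 1.0]]); B = np.array([[0.0], [1.0]])
C = np.array([[1.0, 0.0]]); W = 0.01 * np.eye(2); V = np.array([[0.1]])
Q = np.eye(2); R = np.array([[1.0]]); T = 30

# Backward pass: control Riccati recursion (does not depend on the data).
Ks = [Q]                                     # terminal weight taken as Q here
for _ in range(T):
    K = Ks[0]
    Ks.insert(0, Q + A.T @ K @ A - A.T @ K @ B @ np.linalg.solve(B.T @ K @ B + R, B.T @ K @ A))

# Forward pass: Kalman filter for m_t = E[x_t | I_t], control u_t = -L_t m_t.
x = np.zeros(2); m = np.zeros(2); Sigma = np.eye(2); cost = 0.0
for t in range(T):
    y = C @ x + rng.multivariate_normal(np.zeros(1), V)
    S = C @ Sigma @ C.T + V
    m = m + Sigma @ C.T @ np.linalg.solve(S, y - C @ m)           # measurement update
    L = np.linalg.solve(B.T @ Ks[t + 1] @ B + R, B.T @ Ks[t + 1] @ A)
    u = -L @ m                                                    # certainty equivalence
    cost += x @ Q @ x + u @ R @ u
    x = A @ x + B @ u + rng.multivariate_normal(np.zeros(2), W)
    m = A @ m + B @ u                                             # time update of estimate
    Sigma = A @ (Sigma - Sigma @ C.T @ np.linalg.solve(S, C @ Sigma)) @ A.T + W
print(cost / T)
```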

6.4 Partially Observed Markov Decision Processes in General Spaces

Topological Properties of the Space of Probability Measures

In class we discussed three convergence notions: weak convergence, setwise convergence and convergence under total variation.

Let P(R^N) denote the family of all probability measures on (R^N, B(R^N)) for some N ∈ N. Let {µ_n, n ∈ N} be a sequence in P(R^N).

A sequence {µ_n} is said to converge to µ ∈ P(R^N) weakly if

∫_{R^N} c(x) µ_n(dx) → ∫_{R^N} c(x) µ(dx)

for every continuous and bounded c : R^N → R. On the other hand, {µ_n} is said to converge to µ ∈ P(R^N) setwise if

∫_{R^N} c(x) µ_n(dx) → ∫_{R^N} c(x) µ(dx)

for every measurable and bounded c : R^N → R. Setwise convergence can also be defined through pointwise convergence on Borel subsets of R^N (see, e.g., [18]), that is,

µ_n(A) → µ(A), for all A ∈ B(R^N),

since the space of simple functions is dense in the space of bounded and measurable functions under the supremum norm.

For two probability measures µ, ν ∈ P(R^N), the total variation metric is given by

‖µ − ν‖_TV := 2 sup_{B∈B(R^N)} |µ(B) − ν(B)| = sup_{f: ‖f‖_∞ ≤ 1} | ∫ f(x) µ(dx) − ∫ f(x) ν(dx) |,   (6.5)

where the supremum on the right is over all measurable real f such that ‖f‖_∞ = sup_{x∈R^N} |f(x)| ≤ 1. A sequence {µ_n} is said to converge to µ ∈ P(R^N) in total variation if ‖µ_n − µ‖_TV → 0.

Setwise convergence is equivalent to pointwise convergence on Borel sets, whereas total variation requires uniform convergence on Borel sets. Thus these three convergence notions are in increasing order of strength: convergence in total variation implies setwise convergence, which in turn implies weak convergence.

Weak convergence may not be strong enough for certain measurable selection conditions. One could use other metrics, such as total variation, or other topologies, such as the one generated by setwise convergence.

Total variation is a stringent notion of convergence. For example, a sequence of discrete probability measures never converges in total variation to a probability measure which admits a density function with respect to the Lebesgue measure, and the space of probability measures under total variation is not separable. Setwise convergence also induces a topology on the space of probability measures which is not easy to work with, since the space under this convergence is not metrizable ([15, p. 59]).

However, the space of probability measures on a complete, separable, metric (Polish) space endowed with the topology of weak convergence is itself a complete, separable, metric space [3]. The Prohorov metric, for example, can be used to metrize this space.
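A small illustration of the gap between the notions: the point masses δ_{1/n} converge weakly to δ_0 (test against any bounded continuous function), while the total variation distance stays equal to 2 in the convention of (6.5), since the singleton {1/n} separates the two measures. The test function below is an arbitrary bounded continuous choice.

```python
import numpy as np

# mu_n = point mass at 1/n, mu = point mass at 0.
def weak_gap(n, f=lambda x: np.cos(x)):
    # |∫ f dmu_n − ∫ f dmu| for a bounded continuous f
    return abs(f(1.0 / n) - f(0.0))

def tv_distance(n):
    # 2 sup_B |mu_n(B) − mu(B)| = 2, attained at B = {1/n}.
    return 2.0

for n in [1, 10, 100, 1000]:
    print(n, weak_gap(n), tv_distance(n))   # weak gap -> 0, TV stays 2
```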

In the following, we will verify the sufficiency of weak convergence.

The discussion in the previous chapter for countable spaces applies also to Polish spaces. In particular, with π denoting the conditional probability measure π_t(A) = P(x_t ∈ A | I_t), we can define a new cost as

˜c(π, u) = ∫_X c(x, u) π(dx),  π ∈ P.

Let Γ(S) be the set of all Borel measurable and bounded functions from S to R. We first wish to find a topology on P(S) under which functions of the form

Θ := { ∫_{x∈S} π(dx) f(x), f ∈ Γ(S) }

are measurable on P(S).

By the above definitions, it is evident that both setwise convergence and total variation are sufficient for the measurability of cost functions of the form ∫ π(dx) f(x), since these are (sequentially) continuous on P(S) for every f ∈ Γ(S). However, as we state in the following, the cost function ˜c : P(X) × U → R defined in (6.1) is also a measurable cost function under the weak convergence topology (for a proof, see Bogachev (p. 215, [7])).

Theorem 6.4.1 Let S be a metric space and let M be the set of all measurable and bounded functions f : S → R for which

∫ π(dx) f(x)

defines a measurable function on P(S) under the weak convergence topology. Then, M is the same as the set Γ(S) of all bounded and measurable functions.

This discussion is useful to establish that, under the weak convergence topology, (π_t, u_t) forms a controlled Markov chain for dynamic programming purposes.



6.5 Exercises

Exercise 6.5.1 Consider a linear system with the following dynamics:

x_{t+1} = a x_t + u_t + w_t,

and let the controller have access to the observations given by

y_t = x_t + v_t.

Here {w_t, v_t, t ∈ Z} are independent, zero-mean, Gaussian random variables, with variances E[w^2] and E[v^2]. The controller at time t ∈ Z has access to I_t = {y_s, u_s, s ≤ t − 1} ∪ {y_t}.

The initial state has a Gaussian distribution with zero mean and variance E[x_0^2], which we denote by ν_0. We wish to find, for some r > 0,

inf_Π J(x_0, Π) = E_{ν_0}^Π [ Σ_{t=0}^{3} ( x_t^2 + r u_t^2 ) ].

Compute the optimal control policy and the optimal cost. It suffices to provide a recursive form.

Hint: Show that the optimal control has a separation structure. Compute the conditional estimate through the Kalman Filter.

Exercise 6.5.2 Consider a Markov Decision Problem set up as follows. Let there be two possible links S^1, S^2 a decision maker can use to send information packets to a destination. Each of these links is Markov, and for each of the links we have that S^i ∈ {0, 1}, i = 1, 2. Here, 0 denotes a good link and 1 denotes a bad link. If the decision maker picks one link, he observes the state of that link. Suppose the goal is to maximize

E[ Σ_{t=0}^{T−1} 1_{{u_t = 1}} 1_{{S_t^1 = 1}} + 1_{{u_t = 2}} 1_{{S_t^2 = 1}} ].

Each of the links evolves according to a Markov model, independent of the other link.

The only information the decision maker has is the initial probability measure on the links and the actions it has taken. Obtain a dynamic programming recursion, by clearly identifying the controlled Markov chain and the cost function. Is the value function convex? Can you deduce any structural properties of the optimal policies?


Chapter 7

The Average Cost Problem

Consider the following average cost problem:

J(x, Π) = limsup_{T→∞} (1/T) E_x^Π [ Σ_{t=0}^{T−1} c(x_t, u_t) ].

This is an important problem in applications where one is concerned with the long-term behavior, unlike the discounted cost setup where the primary interest is in finite horizon problems.

7.1 Average Cost and the Average Cost Optimality Equation (ACOE)

To study this problem, one approach is to establish the existence of an Average Cost Optimality Equation. We will discuss further conditions under which the ACOE holds. Later, we will discuss alternative approaches for obtaining optimal solutions.

We now discuss the Average Cost Optimality Equation (ACOE) for infinite horizon problems.

Considering a sequence of normalized finite horizon optimization problems, a dynamic programming recursion of the form

J_T(x_0) = (1/T) inf_{u_0} ( c(x_0, u_0) + (T − 1) E[J_{T−1}(x_1) | x_0, u_0] ),

with J_0(·) = 0, leads to the following result. If one takes T to infinity, the following discussion seems natural.

Theorem 7.1.1 a) [Ross] If there exists a bounded function f and a constant g such that

g + f(x) = min_{u∈U} ( c(x, u) + ∫ f(y) P(dy|x, u) ),  x ∈ X,

then there exists a stationary deterministic policy Π* so that

g = J(x, Π*) = min_{Π∈Π_A} J(x, Π),

where

J(x, Π) = limsup_{T→∞} (1/T) E_x^Π [ Σ_{t=0}^{T−1} c(x_t, u_t) ].

b) If g, considered above, is not a constant but depends on x, and we have

g(x) + f(x) = min_{u∈U} ( c(x, u) + ∫ f(y) P(dy|x, u) ),  x ∈ X,

with

g(x) = min_u ∫ g(y) P(dy|x, u),

and the pointwise minimizations are achieved by a measurable function Π*, then

limsup_{T→∞} (1/T) E_x^{Π*} [ Σ_{t=0}^{T−1} g(x_t) ] ≤ inf_Π limsup_{N→∞} (1/N) E_x^Π [ Σ_{t=0}^{N−1} c(x_t, u_t) ].

Proof: We prove a). For b), which follows from a similar construction, see the text of Hernandez-Lerma and Lasserre.

For any policy Π,

E_x^Π [ Σ_{t=1}^{n} ( f(x_t) − E^Π[f(x_t) | x_{[0,t−1]}, u_{[0,t−1]}] ) ] = 0.

Now,

E^Π[f(x_t) | x_{[0,t−1]}, u_{[0,t−1]}] = ∫_y f(y) P(x_t ∈ dy | x_{t−1}, u_{t−1})   (7.1)
= c(x_{t−1}, u_{t−1}) + ∫_y f(y) P(dy | x_{t−1}, u_{t−1}) − c(x_{t−1}, u_{t−1})   (7.2)
≥ min_{u_{t−1}∈U} ( c(x_{t−1}, u_{t−1}) + ∫_y f(y) P(dy | x_{t−1}, u_{t−1}) ) − c(x_{t−1}, u_{t−1})   (7.3)
= g + f(x_{t−1}) − c(x_{t−1}, u_{t−1}).   (7.4)

The above holds with equality if the minimizing selector Π* is adopted, since Π* provides the pointwise minimum. Hence,

0 ≤ (1/n) E_x^Π [ Σ_{t=1}^{n} ( f(x_t) − g − f(x_{t−1}) + c(x_{t−1}, u_{t−1}) ) ]

and

g ≤ E_x^Π[f(x_n)]/n − E_x^Π[f(x_0)]/n + (1/n) E_x^Π [ Σ_{t=1}^{n} c(x_{t−1}, u_{t−1}) ],

with equality under Π*. ⊓⊔

We note that the boundedness condition can be relaxed, provided that E_x^Π[f(x_n)]/n → 0 under every policy and any x ∈ X.
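For a finite model, one can compute a pair (g, f) satisfying the ACOE by relative value iteration (subtracting the value at a reference state at each step); under suitable ergodicity assumptions the iteration converges. The toy model below is an assumption for illustration only.

```python
import numpy as np

# Assumed toy model: 3 states, 2 actions.
c = np.array([[1.0, 2.0], [0.5, 3.0], [2.0, 0.3]])          # c[x, u]
P = np.array([[[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]],
              [[0.4, 0.4, 0.2], [0.2, 0.2, 0.6]],
              [[0.3, 0.3, 0.4], [0.5, 0.4, 0.1]]])           # P[x, u, y]

f = np.zeros(3)
ref = 0                                                      # reference state
for _ in range(10000):
    Tf = np.min(c + P @ f, axis=1)                           # undiscounted Bellman update
    g = Tf[ref]                                              # estimate of the constant g
    f_new = Tf - g                                           # relative values, f(ref) = 0
    if np.max(np.abs(f_new - f)) < 1e-12:
        break
    f = f_new

policy = np.argmin(c + P @ f, axis=1)
print(g, f, policy)   # g + f(x) ≈ min_u { c(x,u) + Σ_y P(y|x,u) f(y) }
```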

7.2 The Discounted Cost Approach to the Average Cost Problem

The average cost emphasizes the asymptotic values of the cost function, whereas the discounted cost emphasizes the short-term costs. However, under technical restrictions, one can show that, in the limit as the discount factor converges to 1, one obtains a solution for the average cost optimization problem. We now state one such condition below.

Theorem 7.2.1 Consider a controlled Markov chain where the state and action spaces are finite, and for every deterministic policy the entire state space is a recurrent set. Let

J_β(x) = inf_{Π∈Π_A} E_x^Π [ Σ_{t=0}^{∞} β^t c(x_t, u_t) ]

and suppose that Π*_n is an optimal deterministic policy for J_{β_n}(x). Then, there exists some Π* which is optimal for every β sufficiently close to 1, and is also optimal for the average cost

J(x) = inf_{Π∈Π_A} limsup_{T→∞} (1/T) E_x^Π [ Σ_{t=0}^{T−1} c(x_t, u_t) ].
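A numerical check of the vanishing discount idea on the same assumed toy model as above: (1 − β)J_β(x_0) approaches the optimal average cost, and the discounted-optimal policy stabilizes, as β ↑ 1.

```python
import numpy as np

# Same assumed toy model as in the previous sketch (c[x,u], P[x,u,y]).
c = np.array([[1.0, 2.0], [0.5, 3.0], [2.0, 0.3]])
P = np.array([[[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]],
              [[0.4, 0.4, 0.2], [0.2, 0.2, 0.6]],
              [[0.3, 0.3, 0.4], [0.5, 0.4, 0.1]]])

def discounted_value(beta, iters=20000):
    # Plain value iteration for the beta-discounted problem.
    v = np.zeros(3)
    for _ in range(iters):
        v = np.min(c + beta * P @ v, axis=1)
    return v

for beta in [0.9, 0.99, 0.999]:
    J = discounted_value(beta)
    print(beta, (1 - beta) * J[0], np.argmin(c + beta * P @ J, axis=1))
```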


7.2.1 Borel State and Action Spaces [Optional]

In the following, we assume that c is bounded, the measurable selection hypotheses hold, and X and U are compact subsets of Polish spaces.

Consider the value function for a discounted cost problem as discussed in Section 5.4.1:

J_β(x) = min_{u∈U} { c(x, u) + β ∫_X J_β(y) Q(dy|x, u) },  ∀x ∈ X.

Let x_0 be an arbitrary state and for all x ∈ X consider

J_β(x) − J_β(x_0) = min_{u∈U} ( c(x, u) + β ∫ P(dx′|x, u) ( J_β(x′) − J_β(x_0) ) ) − (1 − β) J_β(x_0).   (7.5)

As argued in Section 5.4.1, this has a solution for every β ∈ (0, 1).

Now suppose that J_β(x) − J_β(x_0) is uniformly (over β) equi-continuous and X is compact. By the Arzelà-Ascoli Theorem, taking β ↑ 1 along some sequence, for some subsequence J_β(x) − J_β(x_0) → η(x), and by a Tauberian theorem [1], (1 − β_{n_k}) J_β(x_0) → ζ* along a possibly further subsequence. This limit will then satisfy the Average Cost Optimality Equation (ACOE)

η(x) = min_{u∈U} ( c(x, u) + ∫ P(dx′|x, u) η(x′) ) − ζ*.   (7.6)

Since under the hypothesis of the setting we have that lim_{T→∞} sup_Π (1/T) E_x[c(x_T, u_T)] = 0 for every admissible x, the ACOE implies that ζ* is the optimal cost and η is the value function (see Theorem 7.1.1 and Theorem 5.2.4 in [17]).

Remark 7.2.1 If Σ_{n=0}^{∞} ||P^n(x, ·) − P^n(x′, ·)||_TV ≤ K(x, x′) < ∞ under every optimal policy for β* ≤ β < 1, for some β* ∈ [0, 1), and K is uniformly equi-continuous, then the condition for the ACOE is satisfied. It is worth noting that if, under an optimal policy, there exists an atom α (or a pseudo-atom in view of the splitting technique discussed in Chapter 3) so that in some finite expected time, for any two initial conditions x, x′, the processes are coupled in the sense described in Chapter 3, then the condition holds, since after hitting α the processes will be coupled and equal, leading to a difference in the costs only until hitting the atom. If this difference is further continuous in x, x′, so that the Arzelà-Ascoli Theorem applies, the optimality condition holds. For example, the univariate drift condition (4.12) is sufficient for the ergodicity condition (by Theorem 15.0.1 of [23], geometric ergodicity is equivalent to the satisfaction of this drift condition).

We also note that for the above arguments to hold, there does not need to be a single invariant distribution. Here in (7.6), the pair x and x_0 should be picked as a function of the reachable set under a given sequence of policies. The analysis for such a condition is tedious in general, since for every β a different optimal policy will typically be adopted; however, for certain applications the reachable set from a given point may be independent of the control policy applied.

7.3 Linear Programming Approach to Average Cost Markov Decision Problems

The convex analytic approach of Manne [21] and Borkar [10] (later generalized by Hernandez-Lerma and Lasserre [17]) is a powerful approach to the optimization of infinite-horizon problems. It is particularly effective in proving results on the optimality of stationary policies, and it leads to an infinite-dimensional linear program. This approach is particularly effective for constrained optimization problems and infinite horizon average cost optimization problems. It avoids the use of dynamic programming.

In this chapter, we will consider average cost Markov Decision Problems. In particular, we are interested in the minimization

inf_{Π∈Π_A} limsup_{T→∞} (1/T) E_{x_0}^Π [ Σ_{t=1}^{T} c(x_t, u_t) ],   (7.7)

where E_{x_0}^Π[·] denotes the expectation over all sample paths with initial state given by x_0 under the admissible policy Π.

In class, we considered the finite space setting where both X and U were finite sets. We will restrict the analysis to such a setup, and briefly discuss later the generalization to Polish spaces.

We study the limit distribution of the following occupation measures:

v_T(D) = (1/T) Σ_{t=1}^{T} 1_{{(x_t, u_t) ∈ D}},  D ∈ B(X × U).

Define an F_t-measurable process, with A ∈ B(X):

F_t(A) = Σ_{s=1}^{t} 1_{{x_s ∈ A}} − t Σ_{X×U} P(A|x, u) v_t(x, u).

We have that

E[F_t(A) | F_{t−1}] = F_{t−1}(A),  ∀t ≥ 0,

and {F_t(A)} is a martingale sequence.

Furthermore, F_t(A) is a bounded-increment martingale since |F_t(A) − F_{t−1}(A)| ≤ 1. Hence, for every T > 2, {F_1(A), ..., F_T(A)} forms a martingale sequence with uniformly bounded increments, and we can invoke the Azuma-Hoeffding inequality to show that for all x > 0

P( |F_t(A)/t| ≥ x ) ≤ 2 e^{−2x²t}.

Finally, invoking the Borel-Cantelli Lemma [9] for the summability of the estimate above, that is,

Σ_{t=1}^{∞} 2 e^{−2x²t} < ∞,  ∀x > 0,

we deduce that

lim_{t→∞} F_t(A)/t = 0 a.s.

Thus,

lim_{T→∞} ( v_T(A × U) − Σ_{X×U} P(A|x, u) v_T(x, u) ) = 0,  A ⊂ X.

Define

G_X = { v ∈ P(X × U) : v(B × U) = Σ_{x,u} P(x_{t+1} ∈ B | x, u) v(x, u),  B ∈ B(X) }.

Further define

G = { v ∈ P(X × U) : ∃Π ∈ Π_S, v(A) = Σ_{x,u} P^Π( (x_{t+1}, u_{t+1}) ∈ A | x ) v(x, u),  A ∈ B(X × U) }.

It is evident that G ⊂ G_X, since there are fewer restrictions for G_X. We can show that these two sets are indeed equal. In particular, for v ∈ G_X, if we write v(x, u) = π(x) P(u|x), then we can construct a consistent element of G via the stationary policy P(·|x), with v(B × C) = Σ_{x∈B} π(x) P(C|x).


Thus, every converging sequence v_T will satisfy the above equation (note that we do not claim that every sequence converges). Hence, any converging sequence of v_T(·) will asymptotically live in the set G.

This is where finiteness is helpful: if the state space were merely countable, we would also have to consider sequences which may escape to infinity.

Thus, under any sequence of admissible policies, every converging occupation measure sequence converges to the set G.

Lemma 7.3.1 Under any admissible policy, with probability 1, any converging sequence {v_t} converges to the set G.

Let us define

γ* = inf_{v∈G} Σ v(x, u) c(x, u).

The above optimality also holds in a stronger sense under positive recurrence conditions. Consider the following:

inf_Π limsup_{T→∞} (1/T) Σ_{t=1}^{T} c(x_t, u_t),   (7.8)

where there is no expectation. The above is known as the sample path cost.

Now, we have that

liminf_{T→∞} 〈v_T, c〉 ≥ γ*,

since for any sequence v_{T_k} which converges to the liminf value, there exists a further subsequence (due to the weak compactness of the space of occupation measures) which has a weak limit, and this weak limit is in G. By Fatou's Lemma:

lim_{T_k→∞} 〈v_{T_k}, c〉 = 〈 lim_{T_k→∞} v_{T_k}, c 〉 ≥ γ*.

Here, 〈v, c〉 := Σ v(x, u) c(x, u).

Likewise, for the average cost problem,

liminf_{T→∞} E[〈v_T, c〉] ≥ E[ liminf_{T→∞} 〈v_T, c〉 ] ≥ γ*.

Lemma 7.3.2 [Proposition 9.2.5 in [22]] The space G is convex and closed. Let G_e denote the set of extreme points of G. A measure is in G_e if and only if two conditions are satisfied: i) the control policy inducing it is deterministic; ii) under this deterministic policy, the induced Markov chain is irreducible on the support set of the invariant measure (there cannot be two invariant measures under this deterministic policy).

Proof: Consider η, an invariant measure in G which is non-extremal. Then, this measure can be written as a convex combination of two measures:

η(x, u) = Θ η_1(x, u) + (1 − Θ) η_2(x, u),
π(x) = Θ π_1(x) + (1 − Θ) π_2(x).

Let φ_1(u|x), φ_2(u|x) be two policies leading to ergodic occupation measures η_1 and η_2 and invariant measures π_1 and π_2, respectively. Then, there exists γ(·) such that

φ(u|x) = γ(x) φ_1(u|x) + (1 − γ(x)) φ_2(u|x),

where under φ the invariant measure η is attained.

There are two ways this can be satisfied: i) φ is randomized, or ii) if φ is not randomized, with φ_1 = φ_2 for all x and both deterministic, then it must be that π_1 and π_2 are different measures with different support sets. In this case, the measures η_1 and η_2 do not correspond to ergodic occupation measures (in the sense of positive Harris recurrence).
Harris recurrence).


We can show that for the countable state space case a converse also holds; that is, if a policy is randomized, it cannot lead to an extreme point unless it has the same invariant measure under some deterministic policy. That is, extreme points are achieved under deterministic policies. Let there be a policy P which randomizes between two deterministic policies P_1 and P_2 at a state α, such that with probability θ, P_1 is chosen, and with probability 1 − θ, P_2 is chosen. Let v, v_1, v_2 be the invariant measures corresponding to P, P_1, P_2. Here, α is an accessible atom if E_α^{P_i}[τ_α] < ∞ and P^{P_i}(x, B) = P^{P_i}(y, B) for all x, y ∈ α, where τ_α = min(k > 0 : x_k ∈ α) is the return time to α. In this case, the expected occupation measure can be shown to be equal to

v(x, u) = [ E_α^{P_1}[τ_α] θ v_1(x, u) + E_α^{P_2}[τ_α] (1 − θ) v_2(x, u) ] / [ E_α^{P_1}[τ_α] θ + E_α^{P_2}[τ_α] (1 − θ) ],   (7.9)

which is a convex combination of v_1 and v_2.

Hence, a randomized policy cannot lead to an extreme point in the space of invariant occupation measures. We note that, even if one randomizes at a single state, under irreducibility assumptions the above argument applies.

The above does not easily extend to uncountable settings. In Section 3.2 of [10], there is a discussion for a specialized setting.

We can deduce that, for such a finite state space, an optimal policy is deterministic. ⊓⊔

Theorem 7.3.1 Let X, U be finite spaces. Furthermore, suppose that for every stationary policy there is a unique invariant distribution supported on a single communicating class. In this case, a sample path optimal policy and an average cost optimal policy exist. These policies are deterministic and stationary.

Proof: G is a compact set, so there exists an optimal v. Furthermore, by some further study, one realizes that the set G is closed and convex, and its extreme points are obtained by deterministic policies. ⊓⊔

The set of extreme points will contain optimal solutions for this case. However, one needs to be careful in characterizing the set of extreme points of the set of invariant occupation measures.

Let µ* be an optimal occupation measure (this exists, as discussed above, since the state space is finite, thus G is compact, and Σ_{X×U} µ(x, u) c(x, u) is continuous in µ(·,·)). This induces an optimal policy π(u|x) as

π(u|x) = µ*(x, u) / Σ_U µ*(x, u).

Thus, we can find the optimal policy conveniently, without using dynamic programming. In class, a simple example was provided; a numerical sketch is given below.
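The following is a numerical sketch of this linear program over occupation measures, using scipy.optimize.linprog. The model is an assumed toy example, and the optimal policy is recovered via π(u|x) = µ*(x,u)/Σ_U µ*(x,u).

```python
import numpy as np
from scipy.optimize import linprog

# Assumed finite model: nx states, nu actions.
nx, nu = 3, 2
c = np.array([[1.0, 2.0], [0.5, 3.0], [2.0, 0.3]])
P = np.array([[[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]],
              [[0.4, 0.4, 0.2], [0.2, 0.2, 0.6]],
              [[0.3, 0.3, 0.4], [0.5, 0.4, 0.1]]])           # P[x, u, y]

# Decision variable: v(x,u) >= 0, flattened to length nx*nu (index x*nu + u).
# Constraints: sum_{x,u} v(x,u) = 1 and, for each y,
#   sum_u v(y,u) = sum_{x,u} P(y|x,u) v(x,u)    (i.e. v in G).
A_eq = np.zeros((nx + 1, nx * nu))
b_eq = np.zeros(nx + 1)
for y in range(nx):
    for x in range(nx):
        for u in range(nu):
            A_eq[y, x * nu + u] = (1.0 if x == y else 0.0) - P[x, u, y]
A_eq[nx, :] = 1.0
b_eq[nx] = 1.0

res = linprog(c.flatten(), A_eq=A_eq, b_eq=b_eq, bounds=[(0, None)] * (nx * nu))
v = res.x.reshape(nx, nu)
# Policy extraction; states with zero occupation mass would need an arbitrary choice.
policy = v / v.sum(axis=1, keepdims=True)
print(res.fun, policy)
```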

7.4 Discussion for More General State Spaces

We now discuss a more general setting where these spaces are Polish spaces. The main difference is that, while in class we only considered the occupation measures

v_T(A) = (1/T) Σ_{t=0}^{T−1} 1_{{(x_t, u_t) ∈ A}},  ∀A ∈ B(X × U),

and the convergence of v_T to some set, for an uncountable setting we use test functions to arrive at the same result. Let φ : X → R be a continuous and bounded function. Define

v_T(φ) = (1/T) Σ_{t=1}^{T} φ(x_t).


Define an F_t-measurable process, with Π an admissible control policy (not necessarily stationary or Markov):

F_t(φ) = Σ_{s=1}^{t} φ(x_s) − t ∫_{X×U} ( ∫_X φ(x′) P(dx′|x, u) ) v_t(dx, du).   (7.10)

We define G_X to be the following set in this case:

G_X = { η ∈ P(X × U) : η(D × U) = ∫_{X×U} P(D|z) η(dz),  ∀D ∈ B(X) }.

The above holds under a class of technical conditions:

1. (A) The state process lives in a compact space X. The control space U is also compact.
2. (A') The cost function satisfies lim_{K_n↑X} inf_{u, x∉K_n} c(x, u) = ∞ (here, we assume that the space X is locally compact, or σ-compact).
3. (B) There exists a policy leading to a finite cost.
4. (C) The cost function is continuous in x and u.
5. (D) The transition kernel is weakly continuous, in the sense that ∫ Q(dz|x, u) v(z) is continuous in both x and u, for every continuous and bounded function v.

Theorem 7.4.1 Under Assumptions A, B, C and D (or A', B, C and D), there exists a solution to the optimal control problem given in (7.8) for almost every initial condition under the optimal invariant probability measure.

The key step is the observation that every expected occupation measure sequence has a weakly converging subsequence. If this holds, then the limit of such a converging sequence will be in G and the analysis presented for the finite state space case will be applicable.

Let ν(A) = E[v_T(A)]. For condition C, without the boundedness assumption, the argument follows from the observation that, for ν_n → ν weakly and for every lower semi-continuous function c (see [17]),

liminf_{n→∞} ∫ ν_n(dx, du) c(x, u) ≥ ∫ ν(dx, du) c(x, u).

By sequential compactness, every sequence has a weakly converging subsequence and the result follows. The issue here is that a sequence ν_n in the space of invariant measures may have a limiting probability measure, but this limit may not correspond to an invariant measure under a stationary policy. The weak continuity of the kernel, and the separability of the space of continuous functions on a compact set, allow for this generalization.

This ensures that every sequence has a weakly converging subsequence. In particular, there exists an optimal occupation measure.

Lemma 7.4.1 There exists an optimal occupation measure in G under the Assumptions above.

Proof: The problem has now been reduced to

min_µ ∫ µ(dx, du) c(x, u),  s.t. µ ∈ G.

The set of invariant measures is weakly sequentially compact and the integral is lower semi-continuous on this set of measures. ⊓⊔

Following the Lemma above, there exists an optimal occupation measure, say v(dx, du). This defines an optimal stationary control policy through the decomposition

µ(du|x) = v(dx, du) / ∫_U v(dx, du),

for all x such that ∫_U v(dx, du) > 0.

There is a final consideration of reachability; that is, whether from any initial state, or an initial occupation set, the region where the optimal policy is defined is attracted (see [1]).

If the Markov chain is stable and irreducible, then the optimal cost is independent of where the chain starts, since in finite time the states on which the optimal occupation measure has support can be reached.

If the Markov chain is not irreducible, then the optimal stationary policy is only optimal when the state process starts at locations from which the region where the optimal stationary policy is applicable is reached in finite time. This is particularly useful for problems where one has control over in which state to start the system.

7.4.1 Extreme Points and the Optimality of Deterministic Policies

Let µ be an invariant measure in G which is not extreme. This means that there exist κ ∈ (0, 1) and invariant occupation measures µ_1, µ_2 ∈ G, with µ_1 ≠ µ_2, such that
\[
\mu(dx, du) = \kappa \mu_1(dx, du) + (1 - \kappa) \mu_2(dx, du).
\]
Write µ(dx, du) = P(du|x) µ(dx) and, for i = 1, 2, µ_i(dx, du) = P_i(du|x) µ_i(dx). Then
\[
\mu(dx) = \kappa \mu_1(dx) + (1 - \kappa) \mu_2(dx).
\]
Note that µ(A) = 0 implies µ_i(A) = 0 for i = 1, 2, so that µ_i ≪ µ. As a consequence, the Radon–Nikodym derivative of µ_i with respect to µ exists; let (dµ_i/dµ)(x) = f_i(x). Then
\[
P(du|x) = \kappa f_1(x) P_1(du|x) + (1 - \kappa) f_2(x) P_2(du|x)
\]
is well-defined, with κ f_1(x) + (1 − κ) f_2(x) = 1 (µ-almost everywhere). As a result, we have
\[
P(du|x) = \eta(x) P_1(du|x) + (1 - \eta(x)) P_2(du|x),
\]
with the randomization kernel η(x) = κ f_1(x). As discussed in Meyn [22], such a representation is possible in only two ways: either P(du|x) is genuinely randomized, or P_1(du|x) = P_2(du|x) is deterministic but admits multiple invariant probability measures.
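As an illustration of the second possibility (a simple example added here, not from the notes), let X = {0, 1} and U = {u^*}, and suppose the kernel satisfies P(x_{t+1} = x_t | x_t, u^*) = 1, so that under the (necessarily deterministic) policy both states are absorbing. Then δ_{(0,u^*)} and δ_{(1,u^*)} are both invariant occupation measures, and for any κ ∈ (0, 1),
\[
\mu(dx,du) = \kappa\, \delta_{(0,u^*)}(dx,du) + (1-\kappa)\, \delta_{(1,u^*)}(dx,du)
\]
is invariant but not extreme, even though no randomization is present.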

The converse direction can also be obtained for the countable state space case: if a measure in G is extreme, then it is induced by a deterministic stationary policy. The result also holds for the uncountable state space case under further technical conditions. See Section 3.2 of Borkar [10] for further discussion.

Remark 7.4.1 If there is an atom (or a pseudo-atom constructed through Nummelin's splitting technique) which is visited under every stationary policy, then, as in the arguments leading to (7.9), one can deduce that on this atom there cannot be a randomized policy which leads to an extreme point of the set of invariant occupation measures.



7.4.2 Sample-Path Optimality

To apply the convex analytic approach, we require that under any admissible policy the sequence of sample-path occupation measures is tight for almost every sample path realization. If this can be established, then the result goes through not only for the expected average cost, but also for the sample-path average cost.

7.5 Exercises

Exercise 7.5.1 Consider the convex analytic method we discussed in class. Let X, U be countable sets and consider the occupation measure
\[
v_T(A) = \frac{1}{T} \sum_{t=0}^{T-1} 1_{\{x_t \in A\}}, \qquad \forall A \in \mathcal{B}(X).
\]
While proving that the limit of such a measure process lives in a specific set, we used the following result. Let F_t be the σ-field generated by {x_s, u_s, s ≤ t}, and define the F_t-measurable process
\[
F_t(A) = \sum_{s=1}^{t} 1_{\{x_s \in A\}} \;-\; t \sum_{(x,u) \in X \times U} P(A|x,u)\, v_t(x,u),
\]
where Π is some arbitrary but admissible control policy generating {x_t, u_t}. Show that {F_t(A), t ∈ {1, 2, . . ., T}} is a martingale sequence for every T ∈ Z_+.

Exercise 7.5.2 Let, for a Markov control problem, x_t ∈ X, u_t ∈ U, where X and U are finite sets denoting the state space and the action space, respectively. Consider the optimal control problem of minimizing
\[
\lim_{T \to \infty} \frac{1}{T} E^{\Pi}_{x}\bigg[\sum_{t=0}^{T-1} c(x_t, u_t)\bigg],
\]
where c is a bounded function. Further assume that under any stationary control policy, the state transition kernel P(x_{t+1}|x_t, u_t) leads to an irreducible Markov chain.

Does there exist an optimal control policy? Can you propose a way to find the optimal policy?

Is the optimal policy also sample-path optimal?




Chapter 8

Team Decision Theory and Information Structures

We will primarily follow Chapters 2, 3 and 4 of [32] for this topic.

8.1 Exercises

Exercise 8.1.1 Consider a common probability space on which the information available to two decision makers DM 1 and DM 2 is defined, such that I^1 is available at DM 1 and I^2 is available at DM 2.

R. J. Aumann [2] defines an event E to be common knowledge between two decision makers DM 1 and DM 2 if, whenever E happens, DM 1 knows E, DM 2 knows E, DM 1 knows that DM 2 knows E, DM 2 knows that DM 1 knows E, and so on.

Suppose that one claims that an event E is common knowledge if and only if E ∈ σ(I^1) ∩ σ(I^2), where σ(I^1) denotes the σ-field generated by the information I^1, and likewise for σ(I^2).

Is this argument correct? Provide an answer with precise arguments. You may wish to consult [2], [27], [11] and Chapter 12 of [32].

Exercise 8.1.2 Consider the following static team decision problem with dynamics:
\[
x_1 = a x_0 + b_1 u^1_0 + b_2 u^2_0 + w_0,
\]
\[
y^1_0 = x_0 + v^1_0,
\]
\[
y^2_0 = x_0 + v^2_0.
\]
Here v^1_0, v^2_0, w_0 are independent, Gaussian, zero-mean with finite variances.

Let γ^i : R → R be the policies of the controllers: u^1_0 = γ^1_0(y^1_0), u^2_0 = γ^2_0(y^2_0).

Find
\[
\min_{\gamma^1, \gamma^2} E^{\gamma^1, \gamma^2}_{\nu_0}\big[x_1^2 + \rho_1 (u^1_0)^2 + \rho_2 (u^2_0)^2\big],
\]
where ν_0 is a zero-mean Gaussian distribution (for x_0) and ρ_1, ρ_2 > 0.

Find an optimal team policy γ = {γ^1, γ^2}.




Exercise 8.1.3 Consider a linear Gaussian system with mutually independent and i.i.d. noises:
\[
x_{t+1} = A x_t + \sum_{j=1}^{L} B^j u^j_t + w_t,
\]
\[
y^i_t = C^i x_t + v^i_t, \qquad 1 \leq i \leq L, \qquad (8.1)
\]
with the one-step delayed observation sharing pattern.

Construct a controlled Markov chain for the team decision problem: first show that one could take
\[
\{y^1_t, y^2_t, \ldots, y^L_t,\; P(dx_t \,|\, y^1_{[0,t-1]}, y^2_{[0,t-1]}, \ldots, y^L_{[0,t-1]})\}
\]
as the state of the controlled Markov chain.

Consider the minimization of
\[
E^{\gamma}_{\nu_0}\bigg[\sum_{t=0}^{T-1} c(x_t, u^1_t, \cdots, u^L_t)\bigg].
\]
For this problem, if at time t ≥ 0 each of the decision makers (say DM i) has access to P(dx_t | y^1_{[0,t-1]}, y^2_{[0,t-1]}, \ldots, y^L_{[0,t-1]}) and its local observations y^i_{[0,t]}, show that there is an optimal solution in which the decision rule of DM i uses only {P(dx_t | y^1_{[0,t-1]}, y^2_{[0,t-1]}, \ldots, y^L_{[0,t-1]}), y^i_t}.

What if the decision makers do not have access to P(dx_t | y^1_{[0,t-1]}, y^2_{[0,t-1]}, \ldots, y^L_{[0,t-1]}) and only have access to y^i_{[0,t]}? What would be a sufficient statistic for each decision maker at each time stage?

Exercise 8.1.4 Two decision makers, Alice and Bob, wish to control a system:
\[
x_{t+1} = a x_t + u^a_t + u^b_t + w_t,
\]
\[
y^a_t = x_t + v^a_t,
\]
\[
y^b_t = x_t + v^b_t,
\]
where u^a_t, y^a_t are the control actions and the observations of Alice, u^b_t, y^b_t are those of Bob, and v^a_t, v^b_t, w_t are independent zero-mean Gaussian random variables with finite variance. Suppose the goal is to minimize, for some T ∈ Z_+,
\[
E^{\Pi^a, \Pi^b}_{x_0}\bigg[\sum_{t=0}^{T-1} x_t^2 + r_a (u^a_t)^2 + r_b (u^b_t)^2\bigg],
\]
for r_a, r_b > 0, where Π^a, Π^b denote the policies adopted by Alice and Bob. Let the local information available to Alice be I^a_t = {y^a_s, u^a_s, s ≤ t − 1} ∪ {y^a_t}, and let I^b_t = {y^b_s, u^b_s, s ≤ t − 1} ∪ {y^b_t} be the information available to Bob at time t.

Consider an n-step delayed information sharing pattern: the information available to Alice at time t is I^a_t ∪ I^b_{t−n}, and the information available to Bob is I^b_t ∪ I^a_{t−n}.

State whether the following are true or false:

a) If Alice and Bob share all the information they have (with n = 0), it must be that the optimal controls are linear.

b) Typically, for such problems, Bob can try to send information to Alice through his actions to improve her estimate of the state. When is it the case that Alice cannot benefit from such information from Bob; that is, for what values of n is there no need for Bob to signal information this way?

c) If Alice and Bob share all the information they have with a delay of 2, then their optimal control policies can be written as
\[
u^a_t = f^a(E[x_t | I^a_{t-2}, I^b_{t-2}],\, y^a_{t-1},\, y^a_t),
\]
\[
u^b_t = f^b(E[x_t | I^a_{t-2}, I^b_{t-2}],\, y^b_{t-1},\, y^b_t),
\]
for some functions f^a, f^b. Here E[·|·] denotes conditional expectation.

d) If Alice and Bob share all the information they have with a delay of 0, then their optimal control policies can be written as
\[
u^a_t = f^a(E[x_t | I^a_t, I^b_t]),
\]
\[
u^b_t = f^b(E[x_t | I^a_t, I^b_t]),
\]
for some functions f^a, f^b. Here E[·|·] denotes conditional expectation.

Exercise 8.1.5 In Radner's theorem, explain why continuous differentiability of the cost function (in addition to its convexity) is needed for person-by-person optimality of a team policy to imply global optimality.




Appendix A

On the Convergence of Random Variables

A.1 Limit Events and Continuity of Probability Measures

Given a sequence of events A_1, A_2, . . . ∈ F, define:
\[
\limsup_{n} A_n = \cap_{n=1}^{\infty} \cup_{k=n}^{\infty} A_k,
\]
\[
\liminf_{n} A_n = \cup_{n=1}^{\infty} \cap_{k=n}^{\infty} A_k.
\]
For the limit superior, an element belongs to this set if and only if it belongs to infinitely many of the A_n. For the limit inferior, an element belongs to the set if and only if it belongs to all but finitely many of the A_n.

The limit of a sequence of sets exists if the above limits are equal.
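For instance (an illustrative example added here, not in the original notes), if A_n = A for even n and A_n = B for odd n, then limsup_n A_n = A ∪ B while liminf_n A_n = A ∩ B, so the limit exists only when A = B.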

We have the following result:

Theorem A.1.1 For a sequence of events A_n,
\[
P(\liminf_{n} A_n) \;\leq\; \liminf_{n} P(A_n) \;\leq\; \limsup_{n} P(A_n) \;\leq\; P(\limsup_{n} A_n).
\]

We have the following result regarding continuity of probability measures:

Theorem A.1.2 (i) For a sequence of events A_n with A_n ⊂ A_{n+1},
\[
\lim_{n \to \infty} P(A_n) = P(\cup_{n=1}^{\infty} A_n).
\]
(ii) For a sequence of events A_n with A_{n+1} ⊂ A_n,
\[
\lim_{n \to \infty} P(A_n) = P(\cap_{n=1}^{\infty} A_n).
\]
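As a quick illustration (an example added here, not from the notes), take Lebesgue measure on (0, 1) and A_n = (0, 1 − 1/n]; the events are increasing, P(A_n) = 1 − 1/n, and indeed lim_n P(A_n) = 1 = P(∪_n A_n) = P((0, 1)).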

A.2 Borel–Cantelli Lemma

Theorem A.2.1 (i) If ∑_n P(A_n) converges, then P(limsup_n A_n) = 0. (ii) If {A_n} are independent and ∑_n P(A_n) = ∞, then P(limsup_n A_n) = 1.




Exercise A.2.1 Let {A_n} be a sequence of independent events where A_n is the event that the nth coin flip comes up heads. What is the probability that there are infinitely many heads if P(A_n) = 1/n^2?

An important application of the above is the following:

Theorem A.2.2 Let Z_n, n ∈ N, and Z be random variables such that for every ǫ > 0,
\[
\sum_{n} P(|Z_n - Z| \geq \epsilon) < \infty.
\]
Then
\[
P(\{\omega : \lim_{n \to \infty} Z_n(\omega) = Z(\omega)\}) = 1,
\]
that is, Z_n converges to Z with probability 1. (Indeed, by the Borel–Cantelli lemma, for each fixed ǫ > 0 the event {|Z_n − Z| ≥ ǫ} occurs only finitely often with probability one; applying this along a countable sequence ǫ_k ↓ 0 yields the result.)

A.3 Convergence of Random Variables

A.3.1 Convergence almost surely (with probability 1)

Definition A.3.1 A sequence of random variables X_n converges almost surely to a random variable X if P({ω : lim_{n→∞} X_n(ω) = X(ω)}) = 1.

A.3.2 Convergence in Probability

Definition A.3.2 A sequence of random variables X_n converges in probability to a random variable X if lim_{n→∞} P(|X_n − X| ≥ ǫ) = 0 for every ǫ > 0.

A.3.3 Convergence in Mean-square

Definition A.3.3 A sequence of random variables X_n converges in the mean-square sense to a random variable X if lim_{n→∞} E[|X_n − X|^2] = 0.

A.3.4 Convergence in Distribution

Definition A.3.4 Let X_n be random variables with cumulative distribution functions F_n, and let X be a random variable with cumulative distribution function F. The sequence X_n converges in distribution (or weakly) to X if lim_{n→∞} F_n(x) = F(x) at all points of continuity of F.

Theorem A.3.1 a) Convergence in the almost sure sense implies convergence in probability. b) Convergence in the mean-square sense implies convergence in probability. c) If X_n → X in probability, then X_n → X in distribution.

We also have partial converses to the above results:

Theorem A.3.2 a) If P(|X_n| ≤ Y) = 1 for some random variable Y with E[Y^2] < ∞, and if X_n → X in probability, then X_n → X in mean-square. b) If X_n → X in probability, then there exists a subsequence X_{n_k} which converges to X almost surely. c) If X_n → X and X_n → Y in probability, in mean-square, or almost surely, then P(X = Y) = 1.
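A classical example (added here for illustration, not in the original notes) shows that convergence in probability does not imply almost sure convergence, so in general only a subsequence converges almost surely: on [0, 1] with Lebesgue measure, let X_1 = 1_{[0,1]}, X_2 = 1_{[0,1/2]}, X_3 = 1_{[1/2,1]}, X_4 = 1_{[0,1/3]}, X_5 = 1_{[1/3,2/3]}, X_6 = 1_{[2/3,1]}, and so on, sweeping through ever shorter intervals. Then X_n → 0 in probability and in mean-square, since the interval lengths shrink to zero, but for every ω the sequence X_n(ω) takes the value 1 infinitely often, so X_n does not converge to 0 almost surely.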

A sequence of random variables {X_n} is uniformly integrable if
\[
\lim_{K \to \infty} \sup_{n} E\big[\,|X_n| \, 1_{\{|X_n| \geq K\}}\big] = 0.
\]
Note that if {X_n} is uniformly integrable, then sup_n E[|X_n|] < ∞.
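A standard example of the failure of uniform integrability (added here for illustration, not in the original notes): let X_n = n with probability 1/n and X_n = 0 otherwise. Then X_n → 0 in probability, but E[|X_n| 1_{\{|X_n| ≥ K\}}] = 1 for every n ≥ K, so the supremum over n does not vanish as K → ∞; correspondingly, E[X_n] = 1 for all n, which does not converge to 0.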



Theorem A.3.3 Under uniform integrability, convergence in the almost sure sense implies convergence in mean-square.

Theorem A.3.4 If X_n → X in probability, then there exists some subsequence X_{n_k} which converges to X almost surely.

Theorem A.3.5 (Skorohod's representation theorem) Let X_n → X in distribution. Then there exist random variables Y_n and Y such that X_n and Y_n have the same cumulative distribution functions, X and Y have the same cumulative distribution functions, and Y_n → Y almost surely.

With the above, we can prove the following result.

Theorem A.3.6 The following are equivalent: i) X_n converges to X in distribution. ii) E[f(X_n)] → E[f(X)] for all continuous and bounded functions f. iii) The characteristic functions Φ_n(u) := E[e^{iuX_n}] converge pointwise to Φ(u) := E[e^{iuX}] for every u ∈ R.
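As a quick numerical illustration of criterion ii) (a Monte Carlo sketch added here, not part of the original notes; the test function, sample sizes, and use of NumPy are illustrative choices), take X_n to be the normalized sample mean of n uniform random variables, which converges in distribution to a standard normal X by the central limit theorem, and compare E[f(X_n)] with E[f(X)] for the bounded continuous function f(x) = cos(x), whose limiting value is E[cos(X)] = e^{−1/2} ≈ 0.6065.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    # A bounded continuous test function.
    return np.cos(x)

# X_n = sqrt(12 n) * (mean of n Uniform(0,1) draws - 1/2) -> N(0,1) in distribution (CLT).
n, trials = 200, 100_000
samples = rng.uniform(size=(trials, n))
X_n = np.sqrt(12 * n) * (samples.mean(axis=1) - 0.5)
X = rng.standard_normal(trials)

print("E[f(X_n)] approx:", f(X_n).mean())   # close to exp(-1/2) ~ 0.6065
print("E[f(X)]   approx:", f(X).mean())
```

Both Monte Carlo estimates should be close to e^{−1/2}, consistent with convergence in distribution.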




Bibliography

[1] A. Arapostathis, V. S. Borkar, E. Fernández-Gaucherand, M. K. Ghosh, and S. I. Marcus. Discrete-time controlled Markov processes with average cost criterion: A survey. SIAM J. Control and Optimization, 31:282–344, 1993.

[2] R. J. Aumann. Agreeing to disagree. Annals of Statistics, 4:1236–1239, 1976.

[3] P. Billingsley. Convergence of Probability Measures. John Wiley, New York, NY, 1968.

[4] P. Billingsley. Probability and Measure. Wiley, 3rd edition, New York, 1995.

[5] D. Blackwell. Memoryless strategies in finite-stage dynamic programming. Annals of Mathematical Statistics, 35:863–865, 1964.

[6] D. Blackwell and C. Ryll-Nardzewski. Non-existence of everywhere proper conditional distributions. Annals of Mathematical Statistics, pages 223–225, 1963.

[7] V. I. Bogachev. Measure Theory. Springer-Verlag, Berlin, 2007.

[8] V. S. Borkar. White-noise representations in stochastic realization theory. SIAM J. on Control and Optimization, 31:1093–1102, 1993.

[9] V. S. Borkar. Probability Theory: An Advanced Course. Springer, New York, 1995.

[10] V. S. Borkar. Convex analytic methods in Markov decision processes. In Handbook of Markov Decision Processes, E. A. Feinberg and A. Shwartz (Eds.), pages 347–375. Kluwer, Boston, MA, 2001.

[11] A. Brandenburger and E. Dekel. Common knowledge with probability 1. J. Mathematical Economics, 16:237–245, 1987.

[12] S. B. Connor and G. Fort. State-dependent Foster-Lyapunov criteria for subgeometric convergence of Markov chains. Stoch. Process. Appl., 119:4176–4193, 2009.

[13] R. Douc, G. Fort, E. Moulines, and P. Soulier. Practical drift conditions for subgeometric rates of convergence. Ann. Appl. Probab., 14:1353–1377, 2004.

[14] R. Durrett. Probability: Theory and Examples, volume 3. Cambridge University Press, 2010.

[15] J. K. Ghosh and R. V. Ramamoorthi. Bayesian Nonparametrics. Springer, New York, 2003.

[16] M. Hairer. Convergence of Markov processes. Lecture Notes, 2010.

[17] O. Hernández-Lerma and J. B. Lasserre. Discrete-Time Markov Control Processes. Springer, 1996.

[18] O. Hernández-Lerma and J. B. Lasserre. Markov Chains and Invariant Probabilities. Birkhäuser, Basel, 2003.

[19] J. B. Lasserre. Invariant probabilities for Markov chains on a metric space. Statistics and Probability Letters, 34:259–265, 1997.

[20] J. Liu and E. Susko. On strict stationarity and ergodicity of a non-linear ARMA model. Journal of Applied Probability, pages 363–373, 1992.

[21] A. S. Manne. Linear programming and sequential decisions. Management Science, 6:259–267, April 1960.

[22] S. P. Meyn. Control Techniques for Complex Networks. Cambridge University Press, 2007.

[23] S. P. Meyn and R. L. Tweedie. Markov Chains and Stochastic Stability. Springer-Verlag, London, 1993.

[24] S. P. Meyn and R. L. Tweedie. State-dependent criteria for convergence of Markov chains. Ann. Appl. Prob., 4:149–168, 1994.

[25] S. P. Meyn and R. L. Tweedie. Markov Chains and Stochastic Stability. Springer, 1993.

[26] S. P. Meyn and R. L. Tweedie. State-dependent criteria for convergence of Markov chains. Ann. Appl. Prob., 4(1):149–168, 1994.

[27] L. Nielsen. Common knowledge, communication and convergence of beliefs. Mathematical Social Sciences, 8:1–14, 1984.

[28] G. O. Roberts and J. S. Rosenthal. General state space Markov chains and MCMC algorithms. Probability Surveys, 1:20–71, 2004.

[29] M. Schäl. Conditions for optimality in dynamic programming and for the limit of n-stage optimal policies to be optimal. Z. Wahrscheinlichkeitstheorie verw. Gebiete, 32:179–196, 1975.

[30] P. Tuominen and R. L. Tweedie. Subgeometric rates of convergence of f-ergodic Markov chains. Adv. Appl. Prob., 26(3):775–798, September 1994.

[31] R. L. Tweedie. Drift conditions and invariant measures for Markov chains. Stochastic Processes and Their Applications, 92:345–354, 2001.

[32] S. Yüksel and T. Başar. Stochastic Networked Control Systems: Stabilization and Optimization under Information Constraints. Birkhäuser, Boston, MA, 2013.

[33] S. Yüksel and S. P. Meyn. Random-time, state-dependent stochastic drift for Markov chains and application to stochastic stabilization over erasure channels. IEEE Transactions on Automatic Control, 58:47–59, January 2013.
