
Foundations of Data Science


The expectation here is with respect to the "coin tosses" of the algorithm, not with respect to the underlying distribution p. Let $f_{\max}$ denote the maximum absolute value of f. It is easy to see that

$$\Bigl|\sum_i f_i p_i - E(\gamma)\Bigr| \le f_{\max} \sum_i |p_i - a_{ti}| = f_{\max}\,\|p - a_t\|_1 \qquad (5.2)$$

where the quantity $\|p - a_t\|_1$ is the $l_1$ distance between the probability distributions p and $a_t$ and is often called the "total variation distance" between the distributions. We will build tools to upper bound $\|p - a_t\|_1$. Since p is the stationary distribution, the t for which $\|p - a_t\|_1$ becomes small is determined by the rate of convergence of the Markov chain to its steady state.
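The bound (5.2) is an instance of a general fact: for any two distributions p and q and any function f bounded by $f_{\max}$, the expectations of f under p and q differ by at most $f_{\max}\,\|p - q\|_1$. A minimal numerical sanity check of that generic form (the specific distributions and f below are arbitrary, not from the text):

```python
import numpy as np

rng = np.random.default_rng(0)

# Two arbitrary probability distributions on 5 states;
# q plays the role of a_t in the bound (5.2).
p = rng.random(5); p /= p.sum()
q = rng.random(5); q /= q.sum()

f = rng.uniform(-3, 3, size=5)    # any bounded function on the states
f_max = np.abs(f).max()

lhs = abs(f @ p - f @ q)          # |sum_i f_i p_i - sum_i f_i q_i|
l1 = np.abs(p - q).sum()          # ||p - q||_1

# The difference of expectations is bounded by f_max * ||p - q||_1.
assert lhs <= f_max * l1 + 1e-12
```

The inequality follows term by term: each state contributes at most $f_{\max}|p_i - q_i|$ to the difference.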

The following proposition is often useful.

Proposition 5.4 For two probability distributions p and q,

$$\|p - q\|_1 = 2\sum_i (p_i - q_i)^+ = 2\sum_i (q_i - p_i)^+$$

where $x^+ = x$ if $x \ge 0$ and $x^+ = 0$ if $x < 0$.

The proof is left as an exercise.
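As a hint toward the exercise: since both p and q sum to 1, $\sum_i (p_i - q_i) = 0$, so the total positive excess of p over q equals the total negative excess. A quick numerical check of the proposition (the distributions below are arbitrary examples):

```python
import numpy as np

rng = np.random.default_rng(1)
p = rng.random(6); p /= p.sum()
q = rng.random(6); q /= q.sum()

l1 = np.abs(p - q).sum()                      # ||p - q||_1
pos_part = 2 * np.clip(p - q, 0, None).sum()  # 2 * sum_i (p_i - q_i)^+
neg_part = 2 * np.clip(q - p, 0, None).sum()  # 2 * sum_i (q_i - p_i)^+

# All three quantities agree, as Proposition 5.4 states.
assert np.isclose(l1, pos_part)
assert np.isclose(l1, neg_part)
```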

5.2.1 Metropolis-Hastings Algorithm

The Metropolis-Hastings algorithm is a general method to design a Markov chain whose stationary distribution is a given target distribution p. Start with a connected undirected graph G on the set of states. If the states are the lattice points $(x_1, x_2, \ldots, x_d)$ in $R^d$ with $x_i \in \{0, 1, 2, \ldots, n\}$, then G could be the lattice graph with 2d coordinate edges at each interior vertex. In general, let r be the maximum degree of any vertex of G. The transitions of the Markov chain are defined as follows. At state i select neighbor j with probability $\frac{1}{r}$. Since the degree of i may be less than r, with some probability no edge is selected and the walk remains at i. If a neighbor j is selected and $p_j \ge p_i$, go to j. If $p_j < p_i$, go to j with probability $p_j/p_i$ and stay at i with probability $1 - \frac{p_j}{p_i}$. Intuitively, this favors "heavier" states with higher p values. So, for i adjacent to j in G,

$$p_{ij} = \frac{1}{r}\min\Bigl(1, \frac{p_j}{p_i}\Bigr)$$

and

$$p_{ii} = 1 - \sum_{j \ne i} p_{ij}.$$

Thus,

$$p_i p_{ij} = \frac{p_i}{r}\min\Bigl(1, \frac{p_j}{p_i}\Bigr) = \frac{1}{r}\min(p_i, p_j) = \frac{p_j}{r}\min\Bigl(1, \frac{p_i}{p_j}\Bigr) = p_j p_{ji}.$$

By Lemma 5.3, the stationary probabilities are indeed $p_i$ as desired.
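The transition rule above can be sketched as a short simulation. The graph (a 4-vertex path), the target p, and the helper `step` are illustrative choices, not from the text; running the walk long enough, the empirical state frequencies should approach p:

```python
import numpy as np

rng = np.random.default_rng(2)

# A small connected undirected graph on 4 states (a path 0-1-2-3),
# given as adjacency lists; r is the maximum degree.
nbrs = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
r = max(len(v) for v in nbrs.values())

p = np.array([0.1, 0.2, 0.3, 0.4])   # target stationary distribution

def step(i):
    # Pick one of r "slots" uniformly; if the slot names no neighbor
    # (degree of i is less than r), the walk stays at i.
    k = rng.integers(r)
    if k >= len(nbrs[i]):
        return i
    j = nbrs[i][k]
    # Accept the move with probability min(1, p_j / p_i), else stay.
    return j if rng.random() < min(1.0, p[j] / p[i]) else i

# Run the walk and compare empirical state frequencies to p.
counts = np.zeros(4)
state = 0
for _ in range(200_000):
    state = step(state)
    counts[state] += 1
freq = counts / counts.sum()
# freq should be close to p = [0.1, 0.2, 0.3, 0.4]
```

Note that the endpoints of the path have degree 1 < r, so the "no edge selected" self-loop fires there; this laziness is what the definition of $p_{ii}$ accounts for.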
