
5.2 Markov Chain Monte Carlo

The Markov Chain Monte Carlo (MCMC) method is a technique for sampling a multivariate probability distribution $p(x)$, where $x = (x_1, x_2, \ldots, x_d)$. The MCMC method is used to estimate the expected value of a function $f(x)$,
\[
E(f) = \sum_{x} f(x)\, p(x).
\]
If each $x_i$ can take on two or more values, then there are at least $2^d$ values for $x$, so an explicit summation requires exponential time. Instead, one could draw a set of samples, each sample $x$ with probability $p(x)$. Averaging $f$ over these samples provides an estimate of the sum.
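As a minimal illustration of this idea (the distribution, the function $f$, and the sample size below are arbitrary choices, not taken from the text), the following Python sketch compares the explicit sum with a sample average on a small example where direct sampling is still possible; MCMC is needed precisely when $p$ can be evaluated but not sampled from directly.

import itertools
import random

random.seed(0)
d = 10                                   # 2**d = 1024 states, small enough to enumerate here
states = list(itertools.product([0, 1], repeat=d))

# Illustrative distribution: probability proportional to 2**(number of ones).
weights = [2 ** sum(x) for x in states]
Z = sum(weights)
p = [w / Z for w in weights]

def f(x):                                # example function: number of coordinates equal to 1
    return sum(x)

# Explicit summation over all 2**d states (exponential in d in general).
exact = sum(f(x) * px for x, px in zip(states, p))

# Estimate: average f over samples drawn with probability p(x).
# Here we sample directly for illustration; an MCMC walk would replace this step.
samples = random.choices(states, weights=p, k=5000)
estimate = sum(f(x) for x in samples) / len(samples)

print(exact, estimate)                   # the two numbers should be close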

To sample according to $p(x)$, design a Markov chain whose states correspond to the possible values of $x$ and whose stationary probability distribution is $p(x)$. There are two general techniques to design such a Markov chain: the Metropolis-Hastings algorithm and Gibbs sampling, which we will describe in the next section. The Fundamental Theorem of Markov Chains, Theorem 5.2, states that the average of $f$ over states seen in a sufficiently long run is a good estimate of $E(f)$. The harder task is to show that the number of steps needed before the long-run average probabilities are close to the stationary distribution grows polynomially in $d$, even though the total number of states may grow exponentially in $d$. This phenomenon, known as rapid mixing, happens for a number of interesting examples. Section 5.4 presents a crucial tool used to show rapid mixing.
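As a preview of the kind of chain the Metropolis-Hastings algorithm produces (the algorithm itself is described in the next section; this is only a sketch under the assumption that $p$ can be evaluated up to a constant factor), one step of a single-bit-flip walk on $\{0,1\}^d$ might look as follows. Note that only the ratio $p(y)/p(x)$ is needed, so the normalizing constant of $p$ never has to be computed.

import random

def metropolis_step(x, p):
    # Propose flipping one randomly chosen coordinate of x in {0,1}^d.
    i = random.randrange(len(x))
    y = list(x)
    y[i] = 1 - y[i]
    y = tuple(y)
    # Accept the proposal with probability min(1, p(y)/p(x)); otherwise stay at x.
    if random.random() < min(1.0, p(y) / p(x)):
        return y
    return x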

We used $x \in \mathbb{R}^d$ to emphasize that distributions are multivariate. From a Markov chain perspective, each value $x$ can take on is a state, i.e., a vertex of the graph on which the random walk takes place. Henceforth, we will use the subscripts $i, j, k, \ldots$ to denote states and will use $p_i$ instead of $p(x_1, x_2, \ldots, x_d)$ to denote the probability of the state corresponding to a given set of values for the variables. Recall that in Markov chain terminology, vertices of the graph are called states.

Recall the notation that $p_t$ is the row vector of probabilities of the random walk being at each state (vertex of the graph) at time $t$. So, $p_t$ has as many components as there are states and its $i^{\text{th}}$ component, $p_{ti}$, is the probability of being in state $i$ at time $t$. Recall the long-term $t$-step average is
\[
a_t = \frac{1}{t}\left[\, p_0 + p_1 + \cdots + p_{t-1} \,\right]. \tag{5.1}
\]
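A short sketch of equation (5.1): starting from an initial row vector $p_0$, each $p_{j+1}$ is obtained by multiplying $p_j$ by the transition matrix, and $a_t$ accumulates their average. The 3-state transition matrix below is an arbitrary example chosen only to make the computation concrete.

# p_{j+1} = p_j P as row vectors; a_t = (1/t)(p_0 + p_1 + ... + p_{t-1}).
P = [[0.5, 0.5, 0.0],                    # illustrative transition matrix (rows sum to 1)
     [0.25, 0.5, 0.25],
     [0.0, 0.5, 0.5]]
n = len(P)
p = [1.0, 0.0, 0.0]                      # p_0: start in state 0
a = [0.0] * n
t = 1000
for _ in range(t):
    a = [ai + pi / t for ai, pi in zip(a, p)]                        # add p_j / t
    p = [sum(p[i] * P[i][j] for i in range(n)) for j in range(n)]    # p_{j+1} = p_j P
print(a)                                 # a_t; for large t this approaches the stationary distribution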

The expected value of the function $f$ under the probability distribution $p$ is $E(f) = \sum_i f_i p_i$ where $f_i$ is the value of $f$ at state $i$. Our estimate of this quantity will be the average value of $f$ at the states seen in a $t$-step walk. Call this estimate $\gamma$. Clearly, the expected value of $\gamma$ is
\[
E(\gamma) = \sum_i f_i \left( \frac{1}{t} \sum_{j=1}^{t} \operatorname{Prob}\big(\text{walk is in state } i \text{ at time } j\big) \right) = \sum_i f_i\, a_{ti}.
\]
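To connect this formula with practice, one can run a single walk on the same illustrative 3-state chain as above and average $f$ over the visited states; the values of $f$ below are again arbitrary. For that chain the stationary probabilities are $(0.25, 0.5, 0.25)$, so with $f = (0, 1, 2)$ we have $E(f) = \sum_i f_i p_i = 1$, and by the Fundamental Theorem the estimate $\gamma$ from a long walk should be close to this value.

import random

random.seed(1)
P = [[0.5, 0.5, 0.0],                    # same illustrative chain as above
     [0.25, 0.5, 0.25],
     [0.0, 0.5, 0.5]]
f = [0.0, 1.0, 2.0]                      # value of f at each of the three states

state, total, t = 0, 0.0, 100000
for _ in range(t):
    total += f[state]                                    # accumulate f at the current state
    state = random.choices([0, 1, 2], weights=P[state])[0]
gamma = total / t                        # the estimate: average of f over states seen in the walk
print(gamma)                             # should be close to E(f) = 1 for this chain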

