Metropolis-Hastings algorithm

Multiple parameters

• Problem: Prior beliefs about θ might not be adequately represented by a beta distribution, or by any function that yields an analytically solvable posterior function.
• One possible solution: grid approximation.
  – Useful when estimating one parameter, e.g. θ.
  – However, typical models involve multiple parameters, not just one.
  – Grid approximations not very useful for models involving multiple parameters.
  – Parameter space can't be spanned by a grid with a reasonable number of points.
  – E.g., model with 6 parameters:
    • If each parameter is represented by 50 points, the 6-dimensional parameter space has 50^6 ≈ 1.56 × 10^10 points.
• We will consider the new method in the context of 1 parameter.
  – Not very useful for 1 parameter, but extrapolates to any number.


Monte Carlo approach

• Initial assumptions:
  – Prior distribution is specified by a function (continuous or discrete) that is easily evaluated (by computer):
    • If we specify θ, then p(θ) is easily determined.
  – Likelihood function, p(D | θ), can be determined for any specified values of D and θ.
  – (Actually, all that is required is that the product of the prior and the likelihood be easily determined.)
• Method produces an approximation of the posterior distribution, p(θ | D):
  – Provides a large number of θ values sampled from the posterior distribution.
  – Can be used to estimate:
    • Mean, median, standard deviation of posterior.
    • Credible regions (HDIs).
    • Etc.
• Example of a Monte Carlo method.


Monte Carlo approach

• Basic goal in Bayesian inference: describe the posterior distribution over the parameters.
• Monte Carlo approach:
  – Sample a large number of representative points from the posterior.
  – From these points, calculate descriptive statistics.
• E.g., consider the beta(θ | a, b) distribution:
  – Mean and standard deviation can be analytically derived.
    • Expressed exactly in terms of parameters a and b.
  – Cumulative probability distribution can be computed, and inverted (quantile function, qbeta in R).
    • Used to determine credible intervals.
• But: suppose we didn't know the analytical formulas or the cdf.
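Before giving up the analytic route, a minimal R sketch of what it provides; the a and b values here are arbitrary illustrations:

a <- 3; b <- 9
a / (a + b)                              # analytic mean of beta(a, b)
sqrt(a * b / ((a + b)^2 * (a + b + 1)))  # analytic standard deviation
qbeta(c(0.025, 0.975), a, b)             # 95% credible interval from the inverse cdf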


Monte Carlo approach

• Use a spinner:
  – Marked on its circumference with possible θ values.
  – Biased to point at θ values exactly according to a beta(θ | a, b) distribution.
• E.g., beta(θ | 1, 1):
  – Spin 1000s of times, record the values.
  – Calculate statistics (mean, standard deviation, etc.) and percentiles.
  – Should closely approximate properties of the ‘underlying’ distribution.
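A minimal R sketch of the spinner idea, replacing the physical spinner with rbeta; the sample size of 10,000 and the a = b = 1 values are arbitrary choices:

# Draw many "spins" from beta(theta | a, b) and summarize the samples.
set.seed(1)
a <- 1; b <- 1
theta <- rbeta(10000, a, b)

mean(theta)                       # approximates the analytic mean a / (a + b) = 0.5
sd(theta)                         # approximates sqrt(a * b / ((a + b)^2 * (a + b + 1)))
quantile(theta, c(0.025, 0.975))  # approximate 95% credible interval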


Metropolis-Hastings algorithm

• Imagine a politician visiting a string of islands:
  – Wants to visit each island a number of times proportional to its population.
  – Doesn't know how many islands there are.
  – Each day: might move to a new neighboring island, or stay on the current island.
• Develops a simple heuristic to decide which way to move:
  – Flips a fair coin to decide whether to move east or west.
  – If the proposed island has a larger population than the current island, moves there.
  – If the proposed island has a smaller population than the current island, moves there probabilistically, based on a uniform spinner:
    • Probability of moving depends on relative population sizes:

      p_move = P(proposed) / P(current)

• End result: probability that the politician is on any one of the islands exactly matches the relative population of that island.


Metropolis-Hastings algorithm

• E.g., have 7 islands (θ = 1–7) with relative populations given by P(θ).
• Initial island (θ = 4) chosen randomly (or centered).
• Each ‘day’ corresponds to one time increment.
• Trajectory path is a random walk on a grid.
• One possible instance of a trajectory:

[Figure: target distribution P(θ), one random-walk trajectory, and the resulting frequency distribution of visited positions.]
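A minimal R sketch of this island-hopping walk; the relative populations P(θ) ∝ θ for θ = 1–7 are an assumed illustration, and any non-negative P(θ) works the same way:

set.seed(2)
P <- function(theta) ifelse(theta >= 1 & theta <= 7, theta, 0)  # zero off the ends of the chain

n_steps <- 50000
traj <- numeric(n_steps)
traj[1] <- 4                                  # start at the middle island

for (t in 1:(n_steps - 1)) {
  current  <- traj[t]
  proposed <- current + sample(c(-1, 1), 1)   # fair coin: east or west
  p_move   <- min(P(proposed) / P(current), 1)
  traj[t + 1] <- if (runif(1) < p_move) proposed else current
}

# Relative visit frequencies should approximate P(theta) / sum(P(1:7))
table(traj) / n_steps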


Metropolis-Hastings algorithm

• Probability of being at position θ, as a function of time t, when the Metropolis algorithm is applied to the target distribution (lower right):


Metropolis-Hastings algorithm

• Summary of algorithm:
  – Currently at position θ_current.
  – Range of possible proposed moves is the proposal distribution (= trial distribution).
    • Sampled uniformly.
    • Island example: 2 values with 50-50 probabilities.
  – Given a proposed move, must decide whether to accept or reject it.
    • Based on the value of the target distribution at θ_proposed relative to its value at θ_current.

    • If the target value at θ_proposed is > the value at θ_current, move.
    • If the target value at θ_proposed is < the value at θ_current, move with probability
      p_move = P(θ_proposed) / P(θ_current)

    • Combined movement rule (= jumping rule): move with probability
      p_move = min( P(θ_proposed) / P(θ_current), 1 )


Metropolis-Hastings algorithm

• What we must be able to do:
  – Select a random value from the proposal distribution (to create θ_proposed).
  – Evaluate the target distribution at any proposed position θ (to calculate P(θ_proposed) / P(θ_current)).
  – Select a random value from a uniform distribution (to decide whether to move according to p_move).
• Can then do something indirectly that can't necessarily be done directly: generate random samples from the target distribution.
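A sketch of just those three operations in R; the dnorm target, the normal proposal (introduced more generally on a later slide), and all function names here are hypothetical stand-ins:

target  <- function(theta) dnorm(theta)                 # evaluate target at any theta (assumed density)
propose <- function(cur) rnorm(1, mean = cur, sd = 1)   # draw from a proposal centered on current theta

current  <- 0
proposed <- propose(current)                            # 1. random value from the proposal distribution
p_move   <- min(target(proposed) / target(current), 1)  # 2. evaluate the target ratio
if (runif(1) < p_move) current <- proposed              # 3. uniform draw decides the move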


Metropolis-Hastings algorithm

• Note: target distribution does not need to be normalized.
  – Based on the ratio P(θ_proposed) / P(θ_current).
  – Useful when target distribution P(θ) is a posterior distribution proportional to p(D | θ) p(θ).

    Bayes’ rule:  p(θ | D) = p(D | θ) p(θ) / p(D)  ∝  p(D | θ) p(θ)

  – Need only evaluate the product p(D | θ) p(θ), not the likelihood and prior separately.
  – By evaluating p(D | θ) p(θ), can generate random representative values from the posterior distribution.
  – Don't need to evaluate the evidence, p(D).
• Can do Bayesian inference with rich and complex models.
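A minimal R sketch of this point: a Metropolis sampler whose target is only the product of likelihood and prior, so the evidence p(D) never has to be computed. The coin-flip data (14 heads in 20 flips), the beta(1, 1) prior, and the normal proposal with sd 0.2 are all assumed for illustration:

set.seed(3)
z <- 14; N <- 20                              # assumed coin-flip data

# Unnormalized target: likelihood * prior; the evidence p(D) never appears.
target <- function(theta) {
  if (theta <= 0 || theta >= 1) return(0)     # zero outside the prior's support
  theta^z * (1 - theta)^(N - z) * dbeta(theta, 1, 1)
}

n_steps <- 20000
chain <- numeric(n_steps)
chain[1] <- 0.5                               # arbitrary starting value
for (t in 1:(n_steps - 1)) {
  proposed <- rnorm(1, mean = chain[t], sd = 0.2)
  p_move   <- min(target(proposed) / target(chain[t]), 1)
  chain[t + 1] <- if (runif(1) < p_move) proposed else chain[t]
}

mean(chain)  # should be near the analytic posterior mean (z + 1) / (N + 2) ≈ 0.68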


Metropolis-Hastings algorithm

• Algorithm named after:
  – Nicholas Metropolis (et al., 1953), physicist who was first author (of 5) on the paper that proposed it. (Other coauthors actually did more of the work.)
  – W. Keith Hastings (1970), mathematician who extended it to the general case.
• Commonly used Markov-chain Monte Carlo (MCMC) method.
• Why “Markov”?
  – Markov process: process in which the probability of passing from the current state to another particular state is constant, and depends only on the current state.
  – Markov chain: probabilistic instance of a Markov process, based on a random walk through the probabilistic decisions.


MCMC methods

• Markov chain Monte Carlo methods:
  – Class of algorithms for sampling from a probability distribution.
  – Based on constructing a Markov chain.
  – Desired distribution is the equilibrium distribution.
    • State of the Markov chain after many steps is then used as a sample of the desired distribution.
  – Quality of the sample improves as the number of steps increases.
• Difficult problem is to determine how many steps are needed to converge to the stationary distribution within an acceptable error.
  – A good algorithm will have rapid mixing: the stationary distribution is reached quickly starting from an arbitrary position.
  – Metropolis-Hastings algorithm has rapid-mixing properties.


General Metropolis-Hastings algorithm

• Specific procedure just described was a special case:
  – Discrete positions (θ);
  – One dimension;
  – Moves that proposed one position left or right.
• General algorithm applies to:
  – Continuous values;
  – Any number of dimensions;
  – More general proposal distributions.
• Essentials are the same as for the special case:
  – Have some target distribution, P(θ), over a multidimensional continuous parameter space.
  – Would like to generate representative samples.
  – Must be able to find the value of P(θ) for any candidate value of θ.
  – Distribution P(θ) does not need to be normalized, merely non-negative.
  – Typical Bayesian application: P(θ) is the product of likelihood and prior.


General Metropolis-Hastings algorithm

• Sample values from the target distribution are generated by taking a random walk through the multidimensional parameter space.
  – Begins at an arbitrary point, specified by the user, where P(θ) is non-zero.
  – At each time step:
    • A move to a new position in parameter space is proposed.
    • Decide whether or not to accept the move to the new position.
  – Proposal distributions can be of various forms.
• Goal is to efficiently explore the regions of parameter space where P(θ) has greatest mass.
• Simplest case: proposal distribution is normal, centered on the current position.
• Proposed move will typically be near the present position, with the probability of more distant positions decreasing with distance:

[Figure: normal proposal distribution, centered on the current position, plotted over −3 to 3.]


General Metropolis-Hastings algorithm

• Sample values from the target distribution are generated by taking a random walk through the multidimensional parameter space.
  – At each time step:
    • Having generated a proposed new position, a decision is made whether to accept or reject it, based on the movement rule:

      p_move = min( P(θ_proposed) / P(θ_current), 1 )
    • Random number r is generated from the uniform interval [0, 1].
    • If r is between 0 and p_move, the move is accepted.
  – Process repeated.
  – In the long run: positions visited by the random walk will closely approximate the target distribution.

[Figure: example random-walk trajectory in a two-dimensional (θ1, θ2) parameter space.]
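A minimal R sketch of the general algorithm in two continuous dimensions; the unnormalized bivariate-normal target and the proposal sd of 0.5 are assumed purely for illustration:

set.seed(4)
target <- function(theta) exp(-0.5 * sum(theta^2))   # unnormalized 2D standard normal

n_steps <- 20000
chain <- matrix(0, nrow = n_steps, ncol = 2)         # columns: theta1, theta2
chain[1, ] <- c(2, -2)                               # arbitrary starting point

for (t in 1:(n_steps - 1)) {
  proposed <- chain[t, ] + rnorm(2, mean = 0, sd = 0.5)  # normal proposal around current position
  p_move   <- min(target(proposed) / target(chain[t, ]), 1)
  chain[t + 1, ] <- if (runif(1) < p_move) proposed else chain[t, ]
}

colMeans(chain)      # both coordinates should be near 0
apply(chain, 2, sd)  # both should be near 1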


Visualization of Metropolis algorithm in 1D

• R code for an excellent visualization of the Metropolis algorithm for a 1D problem at:
  – http://www.r-bloggers.com/visualising-the-metropolis-hastings-algorithm/
  – Minor modifications, code saved as: Chivers_ms_viz.R

[Figure: screenshot of the visualization, showing the Metropolis-Hastings trace alongside the target normal density over x.]


Efficiency, “burn-in”, and convergence

• If the proposal distribution is narrow relative to the target distribution, it will take a long time for the random walk to cover the distribution.
  – Algorithm will not be efficient: takes too many steps to accumulate a representative sample.
  – Particular problem if the initial position of the random walk is in a region of the target distribution that is flat and low.
    • Random walk moves only slowly away from the starting position into denser regions.
  – Unrepresentative starting position can lead to low efficiency even if the proposal distribution isn't narrow.
• Solution: early steps of the random walk are excluded from the portion of the Markov chain considered to be representative of the target distribution.
  – Excluded initial steps: the burn-in period.
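A minimal R sketch of discarding burn-in, reusing the chain vector from the earlier coin-flip sketch; the burn-in length of 1,000 steps is an arbitrary assumed choice:

burn_in <- 1000
kept <- chain[-(1:burn_in)]      # discard the first 1,000 steps of the chain
mean(kept); sd(kept)             # summaries from the retained samples only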


Efficiency, “burn-in”, and convergence

• Efficiency also decreases if the proposal distribution is too broad:
  – If too broad, proposed positions are far away from the main mass of the distribution.
  – Probability of accepting a move will be small.
  – Random walk rejects too many proposals before accepting one.
• Simulation for 1–64 dimensional target distributions:

[Figure: % efficiency versus standard deviation of the proposal distribution.]
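A minimal R sketch of this effect: measure the acceptance rate for several proposal widths against a 1D standard-normal target; the particular sd values tried are assumptions for illustration:

set.seed(5)
target <- function(theta) exp(-0.5 * theta^2)    # unnormalized 1D standard normal

acceptance_rate <- function(proposal_sd, n_steps = 10000) {
  theta <- 0
  accepted <- 0
  for (t in 1:n_steps) {
    proposed <- theta + rnorm(1, sd = proposal_sd)
    if (runif(1) < min(target(proposed) / target(theta), 1)) {
      theta <- proposed
      accepted <- accepted + 1
    }
  }
  accepted / n_steps
}

# Narrow proposals accept almost everything (but move slowly);
# broad proposals are rejected most of the time.
sapply(c(0.1, 1, 10, 100), acceptance_rate)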


Efficiency, “burn-in”, and convergence

• Even after the random walk has meandered for a while, we can't be sure that it's really exploring the main regions of the target distribution.
  – Especially if the target distribution is complex over many dimensions.
  – But we don't know what the target distribution looks like.
• Various methods are available to assess convergence of the random walk.
