Metropolis-Hastings algorithm

Multiple parameters

• Problem: Prior beliefs about θ might not be adequately represented by a beta distribution, or by any function that yields an analytically solvable posterior.
• One possible solution: grid approximation (see the sketch after this slide).
  – Useful when estimating one parameter, e.g. θ.
  – However, typical models involve multiple parameters, not just one.
  – Grid approximations are not very useful for models involving multiple parameters.
  – Parameter space can't be spanned by a grid with a reasonable number of points.
  – E.g., model with 6 parameters:
    • If each parameter is represented by 50 points, the 6-dimensional parameter space has 50^6 ≈ 1.56 × 10^10 points.
• We will consider the new method in the context of 1 parameter.
  – Not very useful for 1 parameter, but it extrapolates to any number.
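A minimal R sketch of grid approximation for a single parameter (the Bernoulli likelihood, flat prior, and data of z = 14 successes in N = 20 trials are illustrative assumptions, not from the slides):

```r
# Grid approximation of a one-parameter posterior (illustrative example)
theta_grid <- seq(0, 1, length.out = 50)     # 50 grid points for theta
prior      <- rep(1, length(theta_grid))     # flat prior (assumption)
z <- 14; N <- 20                             # hypothetical data: 14 successes in 20 trials
likelihood <- theta_grid^z * (1 - theta_grid)^(N - z)
posterior  <- prior * likelihood
posterior  <- posterior / sum(posterior)     # normalize over the grid

sum(theta_grid * posterior)                  # posterior mean from the grid
```

With 50 points in one dimension this is trivial; the slide's point is that the same resolution in 6 dimensions would need 50^6 evaluations.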


Monte Carlo approach

• Initial assumptions:
  – Prior distribution is specified by a function (continuous or discrete) that is easily evaluated (by computer):
    • If we specify θ, then p(θ) is easily determined.
  – Likelihood function, p(D | θ), can be determined for any specified values of D and θ.
  – (Actually, all that is required is that the product of the prior and the likelihood be easily determined.)
• Method produces an approximation of the posterior distribution, p(θ | D):
  – Provides a large number of θ values sampled from the posterior distribution.
  – Can be used to estimate:
    • Mean, median, standard deviation of the posterior.
    • Credible regions (HDIs).
    • Etc.
• Example of a Monte Carlo method.


Monte Carlo approach

• Basic goal in Bayesian inference: describe the posterior distribution over the parameters.
• Monte Carlo approach:
  – Sample a large number of representative points from the posterior.
  – From those points, calculate descriptive statistics.
• E.g., consider the beta(θ | a, b) distribution:
  – Mean and standard deviation can be analytically derived.
    • Expressed exactly in terms of parameters a and b.
  – Quantile function (inverse of the cdf; qbeta in R) can be computed.
    • Used to determine credible intervals.
• But: suppose we didn't know the analytical formulas or the cdf.


Monte Carlo approach

• Use a spinner:
  – Marked on circumference with possible θ values.
  – Biased to point at θ values exactly according to a beta(θ | a, b) distribution.
• E.g., beta(θ | 1, 1):
  – Spin 1000s of times, record the values.
  – Calculate statistics (mean, standard deviation, etc.) and percentiles.
  – Should closely approximate properties of the ‘underlying’ distribution.
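A minimal R sketch of this spinner idea, using rbeta as the 'spinner' (a = b = 1 as on the slide) and comparing the sample statistics with the exact values:

```r
# 'Spin' many times: draw samples from beta(theta | a, b)
a <- 1; b <- 1
theta <- rbeta(100000, a, b)

# Statistics and percentiles from the samples
mean(theta); sd(theta)
quantile(theta, c(0.025, 0.975))

# Exact values for comparison
a / (a + b)                                  # analytical mean
sqrt(a * b / ((a + b)^2 * (a + b + 1)))      # analytical standard deviation
qbeta(c(0.025, 0.975), a, b)                 # exact quantiles
```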


Metropolis-Hastings algorithm

• Imagine a politician visiting a string of islands:
  – Wants to visit each island a number of times proportional to its population.
  – Doesn't know how many islands there are.
  – Each day: might move to a new neighboring island, or stay on the current island.
• Develops a simple heuristic to decide which way to move:
  – Flips a fair coin to decide whether to move east or west.
  – If the proposed island has a larger population than the current island, moves there.
  – If the proposed island has a smaller population than the current island, moves there probabilistically, based on a uniform spinner:
    • Probability of moving depends on relative population sizes:
      p_move = P(proposed) / P(current)
• End result: the probability that the politician is on any one of the islands exactly matches that island's relative population.


Metropolis-Hastings algorithm

• E.g., have 7 islands (θ = 1–7) with relative populations given by P(θ).
• Initial island (θ = 4) chosen randomly (or centered).
• Each ‘day’ corresponds to one time increment.
• Trajectory path is a random walk on a grid.
• One possible instance of a trajectory:

[Figure: one random-walk trajectory, the resulting frequency distribution, and the target distribution.]
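A minimal R sketch of the island example (the relative populations P(θ) ∝ θ are an illustrative assumption; the slides only say populations are given by P(θ)):

```r
# Discrete Metropolis random walk over 7 'islands' (theta = 1..7)
# Relative populations P(theta) proportional to theta (illustrative choice)
P <- function(theta) ifelse(theta >= 1 & theta <= 7, theta, 0)

n_steps <- 50000
traj <- numeric(n_steps)
traj[1] <- 4                                  # start in the middle

for (t in 2:n_steps) {
  current  <- traj[t - 1]
  proposed <- current + sample(c(-1, 1), 1)   # fair coin: east or west
  p_move   <- min(P(proposed) / P(current), 1)
  traj[t]  <- if (runif(1) < p_move) proposed else current
}

# Relative visit frequencies should approximate P(theta) / sum(P(1:7))
table(traj) / n_steps
P(1:7) / sum(P(1:7))
```

The long-run visit frequencies match the relative populations even though the walker never needs to know the total population.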


Metropolis-Hastings algorithm

• Probability of being at position θ, as a function of time t, when the Metropolis algorithm is applied to the target distribution (lower right):

[Figure: distribution of position θ at successive time steps, converging to the target distribution.]


Metropolis-Hastings algorithm

• Summary of algorithm:
  – Currently at position θ_current.
  – Range of possible proposed moves is the proposal distribution (= trial distribution).
    • Sampled uniformly.
    • Island example: 2 values with 50-50 probabilities.
  – Given a proposed move to θ_proposed, must decide whether to accept or reject it.
    • Based on the value of the target distribution at θ_proposed relative to its value at θ_current.
    • If the target value at θ_proposed is greater than at θ_current, move.
    • If the target value at θ_proposed is less than at θ_current, move with probability
      p_move = P(θ_proposed) / P(θ_current)
    • Combined movement rule (= jumping rule): move with probability
      p_move = min( P(θ_proposed) / P(θ_current), 1 )


Metropolis-Hastings algorithm

• What we must be able to do:
  – Select a random value from the proposal distribution (to create θ_proposed).
  – Evaluate the target distribution at any proposed position θ (to calculate P(θ_proposed) / P(θ_current)).
  – Select a random value from a uniform distribution (to decide a move according to p_move).
• Can then do something indirectly that can't necessarily be done directly: generate random samples from the target distribution.


Metropolis-Hastings algorithm

• Note: target distribution does not need to be normalized.
  – Based only on the ratio P(θ_proposed) / P(θ_current).
  – Useful when the target distribution P(θ) is a posterior distribution proportional to p(D | θ) p(θ).

    Bayes' rule: p(θ | D) = p(D | θ) p(θ) / p(D), so p(θ | D) ∝ p(D | θ) p(θ)

  – Need only evaluate the product p(D | θ) p(θ); the normalizing constant cancels in the ratio.
  – By evaluating p(D | θ) p(θ), can generate random representative values from the posterior distribution.
  – Don't need to evaluate the evidence, p(D).
• Can do Bayesian inference with rich and complex models.
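A minimal R sketch of Metropolis sampling from an unnormalized posterior, assuming a Bernoulli likelihood with hypothetical data (z = 14 successes in N = 20 trials) and a beta(1, 1) prior; note that p(D) is never computed:

```r
# Unnormalized posterior: p(D | theta) * p(theta); evidence p(D) never needed
z <- 14; N <- 20                                  # hypothetical data
target <- function(theta) {
  if (theta <= 0 || theta >= 1) return(0)         # outside the support
  theta^z * (1 - theta)^(N - z) * dbeta(theta, 1, 1)
}

n_steps <- 50000
chain <- numeric(n_steps)
chain[1] <- 0.5                                   # arbitrary starting position
for (t in 2:n_steps) {
  proposed <- chain[t - 1] + rnorm(1, 0, 0.1)     # symmetric normal proposal
  p_move   <- min(target(proposed) / target(chain[t - 1]), 1)
  chain[t] <- if (runif(1) < p_move) proposed else chain[t - 1]
}

mean(chain)   # should approximate the exact posterior mean (z + 1) / (N + 2) ≈ 0.68
```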


Metropolis-Hastings algorithm

• Algorithm named after:
  – Nicholas Metropolis (et al., 1953), physicist who was first author (of 5) on the paper that proposed it. (The other coauthors actually did more of the work.)
  – W. Keith Hastings (1970), mathematician who extended it to the general case.
• Commonly used Markov-chain Monte Carlo (MCMC) method.
• Why “Markov”?
  – Markov process: process in which the probability of passing from the current state to another particular state is constant, and depends only on the current state.
  – Markov chain: probabilistic instance of a Markov process, based on a random walk through the probabilistic decisions.


MCMC methods

• Markov chain Monte Carlo methods:
  – Class of algorithms for sampling from a probability distribution.
  – Based on constructing a Markov chain.
  – Desired distribution is the equilibrium distribution.
    • State of the Markov chain after many steps is then used as a sample of the desired distribution.
  – Quality of the sample improves as the number of steps increases.
• Difficult problem: determining how many steps are needed to converge to the stationary distribution within an acceptable error.
  – A good algorithm will have rapid mixing: the stationary distribution is reached quickly starting from an arbitrary position.
  – The Metropolis-Hastings algorithm has rapid-mixing properties.


General Metropolis-Hastings algorithm

• Specific procedure just described was a special case:
  – Discrete positions (θ);
  – One dimension;
  – Moves that proposed one position left or right.
• General algorithm applies to:
  – Continuous values;
  – Any number of dimensions;
  – More general proposal distributions.
• Essentials are the same as for the special case:
  – Have some target distribution, P(θ), over a multidimensional continuous parameter space.
  – Would like to generate representative samples.
  – Must be able to find the value of P(θ) for any candidate value of θ.
  – Distribution P(θ) does not need to be normalized, merely non-negative.
  – Typical Bayesian application: P(θ) is the product of likelihood and prior.


General Metropolis-Hastings algorithm

• Sample values from the target distribution are generated by taking a random walk through the multidimensional parameter space.
  – Begins at an arbitrary point, specified by the user, where P(θ) is non-zero.
  – At each time step:
    • A move to a new position in parameter space is proposed.
    • A decision is made whether or not to accept the move to the new position.
  – Proposal distributions can be of various forms.
• Goal is to efficiently explore the regions of parameter space where P(θ) has greatest mass.
• Simplest case: proposal distribution is normal, centered on the current position.
• Proposed move will typically be near the present position, with the probability of more distant positions decreasing with distance:

[Figure: normal proposal density, centered on the current position, plotted from -3 to 3.]


General Metropolis-Hastings algorithm

• Sample values from the target distribution are generated by taking a random walk through the multidimensional parameter space.
  – At each time step:
    • Having generated a proposed new position, a decision is made whether to accept or reject it, based on the movement rule:
      p_move = min( P(θ_proposed) / P(θ_current), 1 )
    • Random number r is generated from the uniform interval [0, 1].
    • If r is between 0 and p_move, the move is accepted.
  – Process repeated.
  – In the long run: positions visited by the random walk will closely approximate the target distribution.

[Figure: random-walk trajectory through a two-dimensional parameter space (θ1, θ2).]
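A minimal R sketch of the general case in two dimensions, with an unnormalized correlated bivariate normal as an illustrative target (not from the slides) and a normal proposal centered on the current position:

```r
# 2D Metropolis sampler with a symmetric normal proposal
# Target: unnormalized bivariate normal with correlation 0.5 (illustrative choice)
target <- function(theta) {
  exp(-(theta[1]^2 - 2 * 0.5 * theta[1] * theta[2] + theta[2]^2) / (2 * (1 - 0.5^2)))
}

n_steps <- 20000
chain <- matrix(0, n_steps, 2)                 # start at the origin
for (t in 2:n_steps) {
  current  <- chain[t - 1, ]
  proposed <- current + rnorm(2, 0, 0.5)       # proposal sd = 0.5 in each dimension
  p_move   <- min(target(proposed) / target(current), 1)
  chain[t, ] <- if (runif(1) < p_move) proposed else current
}

colMeans(chain)   # should approximate (0, 0)
cor(chain)        # off-diagonal should approximate 0.5
```

The proposal here is symmetric, so the simple Metropolis acceptance ratio applies; asymmetric proposals require Hastings' correction to the ratio.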


Visualization of Metropolis algorithm in 1D

• R code for an excellent visualization of the Metropolis algorithm for a 1D problem at:
  – http://www.r-bloggers.com/visualising-the-metropolis-hastings-algorithm/
  – With minor modifications, code saved as: Chivers_ms_viz.R

[Figure: Metropolis-Hastings trace plot and sampled density for a 1D normal target over x from -3 to 3.]


Efficiency, “burn-in”, and convergence

• If the proposal distribution is narrow relative to the target distribution, it will take a long time for the random walk to cover the distribution.
  – Algorithm will not be efficient: takes too many steps to accumulate a representative sample.
  – A particular problem if the initial position of the random walk is in a region of the target distribution that is flat and low.
    • Random walk moves only slowly away from the starting position into the denser region.
  – An unrepresentative starting position can lead to low efficiency even if the proposal distribution isn't narrow.
• Solution: early steps of the random walk are excluded from the portion of the Markov chain considered representative of the target distribution (see the sketch below).
  – Excluded initial steps: the burn-in period.


Efficiency, “burn-in”, and convergence

• Efficiency also decreases if the proposal distribution is too broad:
  – If too broad, proposed positions fall far from the main mass of the distribution.
  – Probability of accepting a move will be small.
  – Random walk rejects too many proposals before accepting one (see the sketch below).
• Simulation for 1–64 dimensional target distributions:

[Figure: % efficiency as a function of the standard deviation of the proposal distribution.]
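A minimal R sketch of how proposal width affects the acceptance rate, using an unnormalized standard normal target (an illustrative assumption, not the slides' simulation):

```r
# Effect of proposal width on acceptance rate (1D unnormalized standard normal target)
target <- function(theta) exp(-theta^2 / 2)

acceptance_rate <- function(proposal_sd, n_steps = 20000) {
  current <- 0; accepted <- 0
  for (t in 1:n_steps) {
    proposed <- current + rnorm(1, 0, proposal_sd)
    if (runif(1) < min(target(proposed) / target(current), 1)) {
      current  <- proposed
      accepted <- accepted + 1
    }
  }
  accepted / n_steps
}

sapply(c(0.01, 0.1, 1, 10, 100), acceptance_rate)
# Narrow proposals: acceptance near 1, but tiny steps (slow coverage).
# Broad proposals: most moves rejected (low efficiency).
```

Moderate widths trade off these two extremes; the figure's efficiency curve reflects the same trade-off.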


Efficiency, “burn-in”, and convergence

• Even after the random walk has meandered for a while, can't be sure that it's really exploring the main regions of the target distribution.
  – Especially if the target distribution is complex over many dimensions.
  – But we don't know what the target distribution looks like.
• Various methods are available to assess the convergence of the random walk.
