The HMC Algorithm with Overrelaxation and Adaptive-Step Discretization
Numerical Experiments with Gaussian Targets

M. Alfaki, S. Subbey, and D. Haugland
Thiele Conference
July 17, 2008
Talk Outline

1 Background
    Bayes Theorem
    MCMC Algorithms
2 Aim of Talk
3 The Hamiltonian Monte Carlo (HMC) Algorithm
    Practical Implementation
4 Improving Performance of HMC Algorithm
    Improving Phase-Space Sampling
    Improvement Strategies
5 Numerical Experiments & Results
    The improved HMC algorithm
Bayes Theorem
Given a model m(C), C ∈ R^k, and data O.

Bayes Theorem: Prior belief × Likelihood → Posterior,

    p(m|O) = p(m) p(O|m) / [ ∫_{R^k} p(O|m) p(m) dm ].    (1)

The posterior pdf is used in inference, e.g., the expectation of J, ⟨J⟩:

    ⟨J⟩ = ∫_{R^k} J(m) p(m|O) dm ≈ (1/n) Σ_{j=1}^{n} J(m_j) p(m_j|O) / h(m_j),    (2)

        ≈ (1/n) Σ_{j=1}^{n} J(m_j),  for h(m_j) ≈ p(m_j|O),    (3)

where the ensemble m_1, m_2, ..., m_n is drawn from a distribution h.
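Equation (3) says that once the ensemble is drawn from h ≈ p(m|O), the posterior expectation reduces to a plain sample average. A minimal Python sketch, using a toy N(2, 1) "posterior" and J(m) = m² (both illustrative assumptions, not from the talk):

```python
import numpy as np

# Eq. (3) as code: with m_j drawn from h(m) ~= p(m|O), the posterior
# expectation <J> becomes a plain sample average.  Toy setup (assumed):
# "posterior" N(2, 1) and J(m) = m^2, so the exact value is
# Var + mean^2 = 1 + 4 = 5.
rng = np.random.default_rng(0)
n = 200_000
m = rng.normal(loc=2.0, scale=1.0, size=n)  # m_j ~ h(m) = p(m|O)
J_hat = np.mean(m**2)                       # (1/n) * sum_j J(m_j)
print(J_hat)                                # close to 5
```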
MCMC Algorithms
Avoid calculating the (intractable) integral ∫_{R^k} p(O|m) p(m) dm.
Generate an ensemble of models m_1, m_2, ..., m_n, with m_j ≡ m(C_j),
such that the distribution of {m_j}_{j=1}^{n} ∼ h(m).
If h(m) ≈ p(m|O), then ⟨J⟩ is an average over m_1, m_2, ..., m_n.
A popular implementation: the Metropolis-Hastings algorithm.
Some example drawbacks:
    long burn-in time
    slow convergence (especially in high dimensions)
Recent developments attempt to address these drawbacks.
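For concreteness, a minimal random-walk Metropolis-Hastings sketch on a 1D standard-normal target (a toy illustration, not the talk's implementation); its diffusive, random-walk behavior is exactly the drawback HMC targets:

```python
import numpy as np

# Minimal random-walk Metropolis-Hastings (toy sketch): target
# h(m) proportional to exp(-m^2/2).  The chain moves by small random
# steps, which is why burn-in and mixing are slow in high dimensions.
rng = np.random.default_rng(1)

def log_h(m):
    return -0.5 * m * m  # log target, up to a constant

m, chain = 0.0, []
for _ in range(50_000):
    prop = m + rng.normal(scale=0.5)   # symmetric random-walk proposal
    if np.log(rng.uniform()) < log_h(prop) - log_h(m):
        m = prop                       # accept
    chain.append(m)                    # on reject, repeat current state

chain = np.array(chain)
print(chain.mean(), chain.var())       # ~0 and ~1
```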
Aim of Talk
Present:
    The Hamiltonian Monte Carlo (HMC) Algorithm
        a variant Monte Carlo algorithm
        incorporates gradient information in distribution space
    Investigated strategies for improving performance
    Numerical experimental results
Algorithm Description-I
A type of Markov chain algorithm:
    Combines the advantages of Hamiltonian dynamics & Metropolis MC
    Incorporates gradients in the dynamic trajectories
Given a vector of parameters C ∈ R^k:
    Augment with a conjugate momentum vector P ∈ R^k
    Introduce a function H(C, P) on the phase space (C, P),
    H(C, P) ≡ the Hamiltonian function (classical dynamics):

        H(C, P) = V(C) + K(P),    (4)

        V(C) = −log π(C),  K(P) = (1/2)|P|^2,    (5)

    where V, K, and π(C) are the potential energy, kinetic energy, and target distribution.
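Equations (4)-(5) can be written down directly for a concrete target. The sketch below assumes a standard Gaussian π(C), so V(C) = |C|²/2 up to a constant and ∇V(C) = C (assumptions for illustration):

```python
import numpy as np

# Eqs. (4)-(5) for an assumed standard Gaussian target pi(C) in R^k:
# V(C) = -log pi(C) = |C|^2/2 + const (the constant drops out of the
# dynamics), K(P) = |P|^2/2, and H = V + K.
def V(C):
    return 0.5 * np.dot(C, C)   # potential energy

def grad_V(C):
    return C                    # gradient of V for this target

def K(P):
    return 0.5 * np.dot(P, P)   # kinetic energy

def H(C, P):
    return V(C) + K(P)          # the Hamiltonian, Eq. (4)

C = np.array([1.0, -2.0])
P = np.array([0.5, 0.0])
print(H(C, P))                  # 0.5*5 + 0.5*0.25 = 2.625
```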
Algorithm Description-II
If V(C) induces a Boltzmann distribution over C,

    p(C) = e^{−V(C)} / ∫_{R^k} e^{−V(C)} dC,    (6)

then H(C, P) induces a similar distribution on (C, P),

    p(C, P) = e^{−H(C,P)} / ∫_{R^k} ∫_{R^k} e^{−H(C,P)} dC dP = p(C) p(P),    (7)

    p(P) = (2π)^{−k/2} e^{−(1/2)|P|^2}.    (8)

Simulate an ergodic Markov chain with stationary distribution (7).
Estimate ⟨J⟩ using the values of C from successive Markov chain states, whose marginal distribution is given by (6).
Algorithm Description-III
Stochastic transition:
    Draw a random variable P ∼ p(P) = (2π)^{−k/2} e^{−(1/2)|P|^2}.
Dynamic transition:
    Propose a new pair (C, P) ∼ p(C, P), starting from the current C;
    samples regions of constant H without bias;
    ensures ergodicity of the Markov chain.
Dynamic transitions are governed by the Hamiltonian equations

    dC/dτ = +∂H/∂P = P,    dP/dτ = −∂H/∂C = −∇V(C).    (9)

Hamiltonian dynamic transitions satisfy:
    time reversibility (invariance under τ → −τ, P → −P),
    conservation of energy (H(C, P) invariant with τ),
    Liouville's theorem (conservation of phase-space volume).
Leapfrog HMC
Choose the chain length N & the number of leapfrog steps L.
Simulate the Hamiltonian dynamics with a finite step size ε:

    P(τ + ε/2) = P(τ) − (ε/2) ∇V(C(τ)),    (10)

    C(τ + ε) = C(τ) + ε P(τ + ε/2),    (11)

    P(τ + ε) = P(τ + ε/2) − (ε/2) ∇V(C(τ + ε)).    (12)

The transition is volume-preserving and time-reversible.
A finite ε does not keep H constant → systematic error.
Eliminate the systematic error using a Metropolis rule.
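A sketch of the leapfrog steps (10)-(12), applied to a toy Gaussian target with ∇V(C) = C (an assumption for illustration; any ∇V plugs in); it checks numerically that H is nearly conserved and that the map is time-reversible:

```python
import numpy as np

# Leapfrog integration of Eqs. (10)-(12) on a toy target with
# grad V(C) = C (assumed for illustration).
def leapfrog(C, P, grad_V, eps, L):
    C, P = C.copy(), P.copy()
    P -= 0.5 * eps * grad_V(C)        # half momentum step, Eq. (10)
    for _ in range(L):
        C += eps * P                  # full position step, Eq. (11)
        P -= eps * grad_V(C)          # two fused half momentum steps
    P += 0.5 * eps * grad_V(C)        # undo the extra half step, Eq. (12)
    return C, P

grad_V = lambda C: C
H = lambda C, P: 0.5 * (C @ C) + 0.5 * (P @ P)

C0, P0 = np.array([1.0, 0.0]), np.array([0.0, 1.0])
C1, P1 = leapfrog(C0, P0, grad_V, eps=0.1, L=20)
dH = abs(H(C1, P1) - H(C0, P0))
print(dH)                             # small: H is nearly conserved

# Time reversibility: negate P, integrate again, negate back.
C2, P2 = leapfrog(C1, -P1, grad_V, eps=0.1, L=20)
print(np.allclose(C2, C0), np.allclose(-P2, P0))   # True True
```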
The Algorithm: Example Implementation
Algorithm
    Initialize C^(0)
    for i = 0 to N − 1
        Sample u ∼ U[0,1] and P* ∼ N(0, I)
        C_0 = C^(i), P_0 = P*
        for l = 1 to L
            P_{l−1/2} = P_{l−1} − (ε/2) ∇V(C_{l−1})
            C_l = C_{l−1} + ε P_{l−1/2}
            P_l = P_{l−1/2} − (ε/2) ∇V(C_l)
        end for
        dH = H(C_L, P_L) − H(C^(i), P*)
        if u < min{1, exp(−dH)}
            (C^(i+1), P^(i+1)) = (C_L, P_L)
        else
            (C^(i+1), P^(i+1)) = (C^(i), P*)
    end for
    return C = [C^(1), C^(2), ..., C^(N)]
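The loop above can be sketched in Python for a one-dimensional standard-normal target (V(C) = C²/2 is an illustrative assumption; ε and L are tuning choices, and the slide's algorithm is target-agnostic):

```python
import numpy as np

# The slide's HMC loop, sketched for a 1D standard-normal target.
rng = np.random.default_rng(2)
grad_V = lambda C: C
H = lambda C, P: 0.5 * C * C + 0.5 * P * P

def hmc(N=5000, L=20, eps=0.2):
    C, chain = 0.0, []
    for _ in range(N):
        P = rng.normal()                      # stochastic transition
        c, p = C, P - 0.5 * eps * grad_V(C)   # start leapfrog
        for _ in range(L):
            c = c + eps * p
            p = p - eps * grad_V(c)
        p = p + 0.5 * eps * grad_V(c)         # final half step
        dH = H(c, p) - H(C, P)
        if np.log(rng.uniform()) < -dH:       # Metropolis rule
            C = c
        chain.append(C)
    return np.array(chain)

chain = hmc()
print(chain.mean(), chain.var())              # ~0 and ~1
```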
Example

[Figure: phase-space plot (P vs C) and sample histogram against the true π(C) — phase-space and distribution plots for a 2D correlated Gaussian distribution.]
Issues with Implementation
Example

Given a chain of length N, the choices of L & ε are decisive.

[Figure: δH along the chain for (a) ε = 0.025, L = 20; (b) ε = 0.1, L = 20; (c) ε = 1.9, L = 30.]
Effect of Gibbs Sampling
The momentum variable P ∼ Gibbs sampler → random walks.
This can lead to sub-optimal sampling of phase space:
doubling back on previous movement leads to extra cost in CPU time.
Illustration

[Figure: sample points over a 2D target δπ(x_1, x_2) for (a) the HMC walker and (b) an ideal walker.]
Effect of Constant Step-size
In usual implementations, ε is constant.
This is inefficient when the trajectory dynamics vary between phase-space regions,
and leads to extra cost in CPU time.
[Figure: sample histograms against π(C) for (a) constant ε = 0.1, L = 30 and (b) variable ε, L = 30; (c) the variable step sizes ε_j along the trajectory.]
Investigate Two Approaches
Proposal 1: Suppress the random walk in the Gibbs sampling
    Ordered over-relaxation (R. Neal)
Proposal 2: Use a variable step size for the dynamics
    Investigate a Runge-Kutta type (symplectic) integrator
Applying over-relaxation to P: Over-relaxed HMC (OHMC)
Ordered over-relaxation
    To over-relax P ∈ R^n, P ∼ N(P; 0, I):
    for i = 1 to n
        Generate K values from N(q_i | {q_j}_{j≠i}).
        Order the K values together with the old value P_i:
            q_i^(0) ≤ · · · ≤ q_i^(r) = P_i ≤ · · · ≤ q_i^(K).
        Set P'_i = q_i^(K−r).
    end for
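A sketch of ordered over-relaxation for an N(0, I) momentum, where each conditional N(q_i | {q_j}) is simply N(0, 1) because the components are independent (the K = 10 default and function names are assumptions for illustration):

```python
import numpy as np

# Ordered over-relaxation (Neal) for P ~ N(0, I): for each i, draw K
# fresh values, find the rank r of the current P_i among all K+1
# values, and replace P_i by the value of mirrored rank K - r.
rng = np.random.default_rng(3)

def ordered_overrelax(P, K=10):
    P_new = np.empty_like(P)
    for i, p in enumerate(P):
        q = np.sort(rng.normal(size=K))  # K draws from the conditional
        r = int(np.searchsorted(q, p))   # rank of p among the K+1 values
        s = K - r                        # mirrored rank
        if s == r:
            P_new[i] = p
        elif s < r:
            P_new[i] = q[s]              # mirrored value below p
        else:
            P_new[i] = q[s - 1]          # mirrored value above p
    return P_new

# The update leaves N(0, I) invariant: repeated application keeps the
# marginal mean ~0 and variance ~1 while anti-correlating successive P.
P = rng.normal(size=1000)
for _ in range(5):
    P = ordered_overrelax(P)
print(P.mean(), P.var())
```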
Example

[Figure: sample points over a 2D target δπ(x_1, x_2) for OHMC and for plain HMC.]
Variable Step-Size HMC Algorithm (SVHMC)
Explicit variable step size using a Runge-Kutta scheme.

Adaptive Störmer-Verlet
    for l = 1 to L steps

        P_{l+1/2} = P_l − (ε / 2ρ_l) ∇V(C_l),
        C_{l+1/2} = C_l + (ε / 2ρ_l) P_{l+1/2},
        ρ_{l+1} + ρ_l = 2 U(C_{l+1/2}, P_{l+1/2}),
        C_{l+1} = C_{l+1/2} + (ε / 2ρ_{l+1}) P_{l+1/2},
        P_{l+1} = P_{l+1/2} − (ε / 2ρ_{l+1}) ∇V(C_{l+1}).

    end for
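A sketch of one pass of the adaptive Störmer-Verlet scheme, with the control function U specialized to a toy quadratic V(C) = |C|²/2, so ∇V = C and ∇²V = I (the specialization and parameter values are assumptions for illustration):

```python
import numpy as np

# One pass of the adaptive Stormer-Verlet scheme: the step size eps
# is rescaled by the control variable rho, updated from U(C, P).
grad_V = lambda C: C

def U(C, P):
    # sqrt(|grad V|^2 + P^T [Hess V]^2 P) for this toy quadratic target
    return np.sqrt(C @ C + P @ P)

def adaptive_sv(C, P, eps, L):
    rho = U(C, P)                      # initialize the control variable
    for _ in range(L):
        P_half = P - eps / (2 * rho) * grad_V(C)
        C_half = C + eps / (2 * rho) * P_half
        rho_new = 2 * U(C_half, P_half) - rho   # rho_{l+1} + rho_l = 2U
        C = C_half + eps / (2 * rho_new) * P_half
        P = P_half - eps / (2 * rho_new) * grad_V(C)
        rho = rho_new
    return C, P

H = lambda C, P: 0.5 * (C @ C + P @ P)
C0, P0 = np.array([1.0, 0.0]), np.array([0.0, 1.0])
C1, P1 = adaptive_sv(C0, P0, eps=0.05, L=50)
dH = abs(H(C1, P1) - H(C0, P0))
print(dH)    # small: the rescaled step keeps H nearly constant
```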
Adaptive Step-size
Example: Adaptive Störmer-Verlet

Adaptive ε reduces ΔH. The parameter ε depends on

    U(C, P) = √( ‖∇V(C)‖² + Pᵀ [∇²V(C)]² P ).

[Figure: log(ΔH/H) against τ for the leapfrog scheme and the SV method.]

Observed ≈ theoretical acceptance rates; ρ_0 is a fictive parameter.

[Figure: observed and theoretical acceptance rates as a function of ρ_0.]
Numerical Experiments
Gaussian targets with uncorrelated covariates in 64 & 128 dimensions:

    π(C) = 1 / ( (2π)^{D/2} det(Σ)^{1/2} ) · exp( −(1/2) Cᵀ Σ⁻¹ C ).    (13)

Compare the HMC, SVHMC & OSVHMC algorithms based on:
    degree of chain autocorrelation,
    effective number of samples in a given chain,
    variance of the sample means C̄ of a finite chain,
    convergence rate/ratio,
    dimensionless efficiency.
Evaluation Criteria
Suppose {c_i}_{i=1}^{N} is the chain generated by an algorithm.

1 Degree-of-correlation criteria
    Autocorrelation function ρ(l) = Cov(c_i, c_{i+l}) / Var(c_i)
    Integrated autocorrelation time τ_int = 1/2 + Σ_{t=1}^{∞} ρ(t)
    Effective sample size N_eff = N / (2 τ_int)

2 Spectral-analysis criteria
    Compute P̃_j = |C̃(κ)* C̃(κ)|, with C̃(κ) = DFT(c)
    Fit the template P(κ) = P_0 (κ*/κ)^α / [ (κ*/κ)^α + 1 ] to P̃_j,
    where α, P(0) & κ* are the parameters to be estimated
    Sample-mean variance σ²_c̄ ≈ P(κ = 0) / N
    Convergence ratio r = σ²_c̄ / σ_0²
    Dimensionless efficiency E = lim_{N→∞} (σ_0² / N) / σ²_c̄(N)
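The degree-of-correlation criteria can be sketched on a synthetic AR(1) chain, whose true ρ(l) = φ^l gives τ_int = 1/2 + φ/(1 − φ) as a check (the AR(1) chain and the lag cutoff are illustrative assumptions):

```python
import numpy as np

# Degree-of-correlation criteria on a synthetic AR(1) chain
# x_i = phi*x_{i-1} + noise, for which rho(l) = phi^l exactly, so
# tau_int = 1/2 + phi/(1 - phi) = 1.5 for phi = 0.5.
rng = np.random.default_rng(4)
phi, N = 0.5, 200_000
noise = rng.normal(size=N) * np.sqrt(1 - phi**2)  # unit marginal variance
x = np.empty(N)
x[0] = 0.0
for i in range(1, N):
    x[i] = phi * x[i - 1] + noise[i]

def rho(xc, l):
    # empirical autocorrelation at lag l >= 1
    return float(xc[:-l] @ xc[l:]) / float(xc @ xc)

xc = x - x.mean()
tau_int = 0.5 + sum(rho(xc, l) for l in range(1, 51))  # truncated sum
N_eff = N / (2 * tau_int)
print(tau_int, N_eff)   # ~1.5 and ~N/3
```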
Evaluation Criteria: Geometric Illustration
Degree of correlation | Spectral analysis

[Figure: left, the autocorrelation function ρ(τ) and the integrated autocorrelation ρ_int(τ), whose curve turns over; right, the power spectrum P(κ) with the P(0) plateau, the P(κ) ≈ κ^{−α} tail, and the turn-over frequency κ*.]
Comparing OHMC vs HMC
Gaussian target

    π(x) = 1 / ( (2π)^{n/2} |Σ|^{1/2} ) exp( −(1/2) xᵀ Σ⁻¹ x ),  Σ = I.

Results (n = 64, N = 2000)

                    OHMC      HMC       Ideal
    Accept. rate    0.99      0.99      1
    P(0)            1.35      1.51      1
    κ*              1.65      1.45
    CPU time [sec]  561.22    557.38
    E               0.74      0.66      1
    r               6.7e−4    7.6e−4    < 0.01
    τ_int           1.63      1.85      0.5
    N_eff           614       542       2000
Graphical Illustration

[Figure: P(κ) and ρ_int(τ) for OHMC and HMC, with the turn-over points (τ*, τ*_int) marked for each.]
Comparing SVHMC vs HMC

Numerical Results (n = 128, N = 2000)
                    SVHMC     HMC       Ideal
    Accept. rate    0.92      0.98      1
    P(0)            1.09      3.13      1
    κ*              3.15      1.55
    CPU time [sec]  1568.01   1117.78
    E               0.92      0.67      1
    r               5.6e−4    7.4e−4    < 0.01
    τ_int           0.86      1.80      0.5
    N_eff           1167      554       2000
Graphical Illustration

[Figure: P(κ) and ρ_int(τ) for SVHMC and HMC, with the turn-over points (τ*, τ*_int) marked for each.]
Comparing OSVHMC vs SVHMC
Numerical Results (n = 64, N = 2000)

                    OSVHMC    SVHMC     Ideal
    Accept. rate    0.92      0.94      1
    P(0)            1.02      1.11      1
    κ*              4.15      3.19
    CPU time [sec]  639.58    669.40
    E               0.98      0.90      1
    r               5.1e−4    5.6e−4    < 0.01
    τ_int           0.71      0.94      0.5
    N_eff           1400      1059      2000
Graphical Illustration

[Figure: P(κ) and ρ_int(τ) for OSVHMC and SVHMC, with the turn-over points (τ, τ_int) marked for each.]
Summary and Conclusion
1 Over-relaxation in the Gibbs sampling improves the dimensionless efficiency by ∼12%:

      E_OHMC / E_HMC ≈ 1.1.

2 The Störmer-Verlet discretization outperforms the leapfrog HMC, roughly doubling the effective sample size:

      N_eff^SV / N_eff^leapfrog ≈ 2.0.

3 The hybrid OSVHMC (over-relaxing the momentum & adaptive ε) outperforms the SVHMC.