final report - probability.ca

More documents

Recommendations

Info

takes the form σ 2 (d) = l 2 /d α , where α is the smallest number satisfying lim d→∞ d λ1 d γi c (J (i, d)) < ∞ and lim dα d→∞ d α < ∞, (2.3) for i = 1, . . . , m. For this to hold, it follows that at least one of the m + 1 limits above must converge to some positive constant, and the rest converge to zero. Since the scaling term of the component of interest is taken to be equal to 1, the largest possible form for the proposal scaling is σ 2 = l 2 , and thus it will not diverge as d → ∞. Note 2.4. When we proceed to prove the convergence of the limiting process of the Metropolis algorithm to a Langevin diffusion, it will be useful to consider the reciprocal of equation (2.3). This equivalent condition is c (J (i, d)) ( K 1 d λ 1 lim + · · · + dλn + · · · d→∞ d λ1 K 1 K n d γ1 K n+1 + · · · + c (J (m, d)) ) d γm K n+m = ∞. (2.4) Since all the distinct terms in the first n positions appear only a finite number of times, it follows that for (2.4) to K be satisfied, there must exist at least one i ∈ {1, . . . , m} such that lim 1 d→∞ d c (J (i, d)) dγ i λ 1 K n+i = ∞. Furthermore, d this stipulates that lim λ 1 d→∞ d = 0 and hence the scaling term θ −2 α 1 (d) is immaterial for the choice of α. As was the case in the papers above, a time-accelerating adjustment is necessary in order to ensure the limiting process of the Metropolis algorithm is non-trivial. For this purpose, a new sped-up continuous ( time version of the original algorithm is introduced, paramterized to make jumps every d −α time units, i.e., Z d (t)= X (d) 1 ([d α t]) , . . . , X (d) d where [·] is the integer function. As was the case earlier, the only way to preserve the Markov property of the discretetime process of the Metropolis algorithm while making it a continuous process is to resort to the memoryless property of the exponential distribution. By allowing the process to jump according to a Poisson process with rate d α , the new sped up process Z d (t) moves about d α times in every time unit, and furthermore, behaves asymptotically like a continuous Markov process. 2.3.3 Optimal proposal and scaling values As has been noted in the preceding paragraphs, the major difference between the target (2.2) and the targets considered earlier is the inclusion of the various scaling terms. Since the preceding sections proved results for targets where the scaling terms are all identical or identically distributed, the question that is of interest is how big a discrepancy between these various scaling terms is required in order to effect the limiting behavior of the algorithm. The results following show that if none of the first n scaling terms are significantly different than the remaining scaling terms, then the limiting behavior is unaffected. Theorem 2.5. (Theorem 1 in [Béd07]) Consider a RWM algorithm with proposals Y (d) (t + 1) ∼ N ( 0, l 2 I d /d α) , with α satisfying (2.3), applied to a target distribution of the form (2.2), which satisfies the specified conditions on f. Consider the i ∗th { component of the process Z (d)} { } { } i.e., Z t≥0, (d) i (t) = X (d) ∗ i ∗ ([dα t]) , and let the algorithm{ start in } stationarity (i.e., X (d) (0) ∼ π). Then Z (d) i (t) ⇒ {Z (t)} ∗ t≥0 , where Z (0) is distributed according to the density f and {Z (t)} t≥0 satisfies t≥0 the Langevin SDE dZ (t) = v (l) 1 2 h (l) ∇ log f (Z (t)) dB t + dt, 2 if and only if d lim ∑ λ1 n d→∞ j=1 dλj + ∑ m i=1 c (J (i, d)) = 0. (2.5) dγi Here h (l) = 2l 2 Φ ( −l √ E R /2 ) , and E R = lim m∑ d→∞ i=1 c (J (i, d)) d α d γi K n+1 E t≥0 [ (f ′ ) ] 2 (X) . f (X) Furthermore, we have lim d→∞ a (d, l) = 2Φ ( −l √ E R /2 ) ≡ a (l). The speed measure v (l) is maximized at the unique value ˆl = 2.38/ √ ) E R , where a (ˆl = 0.234. t≥0 ) ([d α t]) , 18
Note 2.6. Theorem 2.5 provides a condition that when the asymptotically smallest scaling term of the distinct component is not significantly smaller than the remaining scaling components, than the limiting process is the same as the one found in [RGG97]. However, it is possible that θ1 −2 (d) is not the smallest scaling term, it could be that θn+1 −2 (d) is the smallest scaling term. This situation is verified in the theorem above with c (J (i, d)) θ2 n+1 (d) in the denominator. Remark 2.7. The function E R has an inverse effect on the optimal value of the scaling parameter ˆl. As was the case in the section above, for smoother values of f, the smaller is E R and thus the larger is ˆl. The added component of the scaling vector Θ 2 (d) effects the optimal value of ˆl through E R by the following: for larger values of K n+i , and thus the variance of the component, the larger is the value of ˆl. Proof. (of Theorem 2.5). As is shown in the appendix (give citation to“core”section), to prove the weak convergence result of the rescaled RWM to the Langevin diffusion, it suffices to show that for a test function h ∈ Cc ∞ , the following limit holds: [∣ ( ∣∣Gh lim E d, X (d)) ] − G L h (X i ∗) ∣ = 0, d→∞ where ⎡ ⎛ ⎧ ( Gh d, X (d)) ⎨ π (d, Y (d)) ⎫⎞⎤ ⎬ = d α E Y (d) ⎣(h (Y i ∗) − h (X i ∗)) ⎝min ⎩ 1, ( π d, X (d)) ⎠⎦ ⎭ denotes the discrete-time generator of the rescaled Metropolis algorithm, and [ 1 G L h (X i ∗) = v (l) 2 h′′ (X i ∗) + 1 ] 2 h′ (X i ∗) (log f (X i ∗)) ′ is the generator of a Langevin diffusion process with speed measure ( v (l) as in Theorem 2.5. Instead of proving the d, X (d)) ( that is equivalent to Gh d, X (d)) in the convergence of these generators, one can define a generator ˜Gh sense that [∣ ( ∣∣Gh lim E d, X (d)) − ˜Gh (d, X (d))∣ ] ∣ = 0. d→∞ ( d, X (d)) to the The result thus follows if one is able to show the L 1 convergence of this equivalent generator ˜Gh generator of a Langevin diffusion. The equivalent generator is defined as ⎧ ⎛ ⎞⎫ ( ˜Gh d, X (d)) = 1 ⎨ d∑ ⎬ 2 l2 h ′′ (X i ∗) E Y (d)− min ⎩ 1, exp ⎝ ε (d, X j , Y j ) ⎠ ⎭ + j=1,j≠i ⎛ ⎛ ∗ ⎞⎞ d∑ +l 2 h ′ (X i ∗) (log f (X i ∗)) ′ E Y (d)− ⎝exp ⎝ ε (d, X j , Y j ) ⎠⎠ ∣ ∑d , j=1,j≠i ∗ j=1,j≠i ∗ ε(d,Xj,Yj)
Page 1 and 2: Weak Convergence and Optimal Propos
Page 3 and 4: parameter of the proposal distribut
Page 5 and 6: walk Metropolis algorithm, one prop
Page 7 and 8: Using (1.5), we have ( ) e (d) g ld
Page 9 and 10: particular component of interest wi
Page 11 and 12: Proof. Since the sequence starts in
Page 13 and 14: As was the case two equations ago,
Page 15 and 16: { Theorem 1.14. Let process U (d) t
Page 17: 2.3 Bédard: Various Scaling Terms
Page 21 and 22: 2.3.5 Generalizing the K j paramete
Page 23 and 24: Figure 2: Example 2.13 2 * ell * es
Page 25 and 26: Theorem 2.14. (Theorem 3.1 in [NR05
Page 27 and 28: 3 Nonproduct Target Laws and Infini
Page 29 and 30: 3.3 Optimal scaling for the RWM alg
Page 31 and 32: The main result of this paper will
Page 33 and 34: 3.4.2 Main Theorem and Outline of P
Page 35 and 36: Alternatively, for λ > 0 , we set
Page 37 and 38: for h ∈ H. Lastly, if the determi
Page 39 and 40: Thus, for any x ∈ H, we may set x
Page 41 and 42: Note A.11. For any ε < 1 λ 1 , th
Page 43 and 44: A.5 Lack of a Canonical Gaussian Di
Page 45 and 46: denotes the operator norm. The foll
Page 47 and 48: C.2 Code for Example 2.13 > # This
Page 49 and 50: References [Béd06] Bédard, M. (20

final report - probability.ca

Create successful ePaper yourself

Delete template?

Save as template?