
Monte Carlo integration with Markov chain - Department of Statistics

1968 Z. Tan / Journal of Statistical Planning and Inference 138 (2008) 1967–1980

…without requiring the value of its normalizing constant Z. A common approach is that, if n is sufficiently large, these points are used as an approximate and dependent sample from the distribution p(x) for Monte Carlo integration. For example, the expectation E_p(φ) of a function φ(x) with respect to p(x) can be estimated by the sample average, or crude Monte Carlo estimator,

$$\bar E(\varphi) = \frac{1}{n}\sum_{i=1}^{n} \varphi(x_i).$$

However, there are two challenging problems for this inferential approach. First, the Monte Carlo variability of Ē(φ) is generally underestimated by the sample variance

$$\frac{1}{n}\sum_{i=1}^{n} \left[\varphi(x_i) - \bar E(\varphi)\right]^2$$

divided by n, and the amount of underestimation can sometimes be substantial due to serial dependency. Specialized methods, such as running multiple chains or using batch means, are needed to assess the Monte Carlo error (e.g. Gelman and Rubin, 1992; Geyer, 1992). Second, the crude Monte Carlo estimator is not directly applicable to the normalizing constant Z, which may be of interest and part of the computational goal. For example, if q(x) is the product of a prior and a likelihood in Bayesian inference, Z is the predictive density of the data. As a result, various methods have been proposed for computing normalizing constants, but most of them involve a matching density or additional simulation (e.g. DiCiccio et al., 1997; Gelman and Meng, 1998).

We consider a general Markov chain scheme and develop a new method for estimating simultaneously the normalizing constant Z and the expectation E_p(φ). Suppose that a Markov chain [(ξ_1, x_1), …, (ξ_n, x_n)] is generated as follows.

General Markov chain scheme: at each time t, update ξ_t given (ξ_{t−1}, x_{t−1}) and sample x_t from p(·; ξ_t).

It is helpful to think of ξ_t as an index and x_t as a draw given the index.
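The first problem above, the naive variance understating the true Monte Carlo error under serial dependence, can be seen numerically. A minimal sketch (not from the paper) using an AR(1) chain as a stand-in for MCMC output, with batch means as the corrective; all names and the choice of φ(x) = x² are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# AR(1) chain with standard-normal stationary distribution;
# rho controls the serial dependence of the draws.
rho = 0.9
n = 20_000
x = np.empty(n)
x[0] = rng.standard_normal()
for t in range(1, n):
    x[t] = rho * x[t - 1] + np.sqrt(1 - rho**2) * rng.standard_normal()

phi = x**2  # estimate E[phi(X)] = E[X^2] = 1

# Crude Monte Carlo estimator and the naive variance estimate,
# which ignores autocorrelation.
crude = phi.mean()
naive_var = phi.var(ddof=1) / n

# Batch-means estimate of the Monte Carlo variance (cf. Geyer, 1992):
# variance of nonoverlapping batch averages, divided by the number of batches.
b = 100
means = phi[: (n // b) * b].reshape(-1, b).mean(axis=1)
batch_var = means.var(ddof=1) / len(means)

print(crude, naive_var, batch_var)
```

Under this level of dependence the batch-means variance comes out several times larger than the naive estimate, quantifying the underestimation described above.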
Denote by Ξ the index set. The details of updating ξ_t are not needed for estimation, even though these details are essential to simulation. In contrast, the transition density p(·; ξ) is assumed to be completely known on X with respect to μ_0 for ξ ∈ Ξ. This scheme is very flexible and encompasses the following individual cases:

(i) a Markov chain (x_1, x_2, …) with transition density p(·; x_{t−1}), by letting ξ_t = x_{t−1};
(ii) a subsampled chain (x_1, x_{b+1}, x_{2b+1}, …), by letting (ξ_t, x_t) be (x_{tb}, x_{tb+1}) in the original Markov chain;
(iii) Gibbs sampling, where the chain (ξ_t, x_t) has a joint stationary distribution p(ξ, x), and ξ_t is sampled from the conditional distribution p(ξ; x_{t−1}) and x_t from the conditional distribution p(x; ξ_t);
(iv) generalized Gibbs sampling, where a stochastic transformation g_t of ξ_t is inserted after ξ_t is sampled and then x_t is sampled from p(·; g_t(ξ_t)); the index ξ_t is expanded to (ξ_t, g_t);
(v) Metropolis–Hastings sampling, by identifying the current state as ξ_t and the proposal as x_t, even though the commonly referred-to Metropolis–Hastings chain, that is (ξ_1, …, ξ_n), does not admit a transition density.

Case (i) is conceptually important but may be restrictive in practice, as we shall explain. In case (ii), the subsampled chain (x_1, x_{b+1}, x_{2b+1}, …) is Markovian, but its transition density is complicated, being the b-step convolution of the original transition density. In case (iii), the joint chain is Markovian on Ξ × X with transition density p(x; ξ)p(ξ; x_{t−1}), but the integrals of interest are defined on X. The marginal chain (x_1, …, x_n) is also Markovian, but its transition density involves integrating ξ out of p(x; ξ)p(ξ; x_{t−1}), which is typically difficult.
A familiar example of case (iii) is data augmentation in Bayesian computation, where x is a parameter, ξ is a latent variable, and q(x) is the product of a prior and a likelihood (Gelfand and Smith, 1990; Tanner and Wong, 1987). Case (iv) is represented by a recent development in data augmentation, where ξ includes not only a latent variable but also a working parameter (Liu and Sabatti, 2000; Liu and Wu, 1999; van Dyk and Meng, 2001).
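Case (iii) can be made concrete with a toy two-component Gibbs sampler. A minimal sketch (not from the paper), assuming a bivariate normal target with correlation r, so that the transition density p(x; ξ) is the fully known conditional N(rξ, 1 − r²) required by the scheme; all names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)

# Two-component Gibbs sampler (case (iii)) for a bivariate normal
# with correlation r. Both conditional densities are N(r*other, 1 - r^2),
# so p(x; xi) is completely known, as the general scheme assumes.
r = 0.8
n = 5_000
xi = np.empty(n)
x = np.empty(n)
x_prev = 0.0
for t in range(n):
    # xi_t ~ p(xi; x_{t-1}), then x_t ~ p(x; xi_t)
    xi[t] = r * x_prev + np.sqrt(1 - r**2) * rng.standard_normal()
    x[t] = r * xi[t] + np.sqrt(1 - r**2) * rng.standard_normal()
    x_prev = x[t]

print(x.mean(), x.var())  # marginal chain settles near N(0, 1)
```

The marginal chain (x_1, …, x_n) is Markovian here, but, as noted above, its own transition density would require integrating ξ out; the scheme sidesteps this by working with the known p(x; ξ).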

The general description of the scheme allows the situation where x_t is sampled component by component: for example, sample x_t^1 from p_1(x^1; ξ_t) and x_t^2 from p_{2|1}(x^2; x_t^1, ξ_t). Then p(x; ξ_t) is the product p_1(x^1; ξ_t) p_{2|1}(x^2; x^1, ξ_t). As a result, case (iii) or (iv) is not restricted to two-component Gibbs sampling, as it may appear to be.

Assume that the sequence (x_1, …, x_n) converges to a probability distribution as n → ∞. Denote by p*(x) the stationary density with respect to μ_0. It is not necessary that p*(x) be identical to the target density p(x). The sequence (x_1, …, x_n) typically converges to p*(x) = p(x) in Gibbs sampling and its generalization [cases (iii) and (iv)]. In contrast, this sequence does not converge to p(x), i.e. p*(x) ≠ p(x), in Metropolis–Hastings sampling, where (ξ_1, …, ξ_n) converges to p(x) [case (v)].

We develop a general method in Section 2 and illustrate its application to Gibbs sampling and its variation in Section 3. Applications of the general method to rejection sampling and Metropolis–Hastings sampling are presented in Tan (2006). We take the likelihood approach of Kong et al. (2003) in developing the method; see Tan (2004) for the optimality of the likelihood approach in two situations of independence sampling. In that approach, the baseline measure is treated as the parameter in a model and estimated as a discrete measure by maximum likelihood. Consequently, integrals of interest are estimated as finite sums by substituting the estimated measure. Kong et al. (2003) considered independence sampling and the simplest individual case (i) of Markov chain sampling. We extend the likelihood approach to the general Markov chain scheme by using partial likelihood (Cox, 1975).
The approximate maximum likelihood estimator of the baseline measure is

$$\tilde\mu(\{x\}) = \frac{\hat P(\{x\})}{n^{-1}\sum_{j=1}^{n} p(x;\xi_j)},$$

where P̂ is the empirical distribution placing mass n^{-1} at each of the points x_1, …, x_n. The integral Z = ∫ q(x) dμ_0 is estimated by

$$\tilde Z = \int q(x)\,\mathrm{d}\tilde\mu = \sum_{i=1}^{n} \frac{q(x_i)}{\sum_{j=1}^{n} p(x_i;\xi_j)}.$$

Note that the same estimator also holds for a real-valued integrand q(x). The expectation E_p(φ) = ∫ φ(x)q(x) dμ_0 / ∫ q(x) dμ_0 is estimated by

$$\tilde E(\varphi) = \frac{\int \varphi(x) q(x)\,\mathrm{d}\tilde\mu}{\int q(x)\,\mathrm{d}\tilde\mu} = \sum_{i=1}^{n} \frac{\varphi(x_i)\,q(x_i)}{\sum_{j=1}^{n} p(x_i;\xi_j)} \Bigg/ \sum_{i=1}^{n} \frac{q(x_i)}{\sum_{j=1}^{n} p(x_i;\xi_j)}.$$

Further, we modify the basic estimator μ̃ and propose a subsampled estimator μ̃_b, a regulated estimator μ̃_δ, an amplified estimator μ̃_a, and their combined versions. Finally, we introduce approximate variance estimators for the point estimators. Our method can require less total computational time (for simulating the Markov chain and computing the estimators) than statistically inefficient estimation with a brute-force increase of sample size to achieve the same degree of accuracy. In three examples, we find that the basic estimator log Z̃ has smaller variance than Chib's estimator by a factor of 100, and the basic estimator Ẽ(φ) has smaller variance than the crude Monte Carlo estimator by a factor of 1–100. Subsampling helps to reduce computational cost, while regulation or amplification helps to reinforce statistical efficiency, so that our method achieves overall computational efficiency. We also find that the approximate variance estimators agree with the empirical variances of the corresponding point estimators in all our examples.
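The basic estimators Z̃ and Ẽ(φ) are simple to compute once the transition densities p(x_i; ξ_j) are tabulated. A minimal self-contained sketch (not the paper's examples), assuming a two-component Gibbs chain for a bivariate normal with correlation r, unnormalized target q(x) = exp(−x²/2) with true Z = √(2π), and φ(x) = x²; all names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)

# Gibbs draws (xi_t, x_t) for a bivariate normal with correlation r,
# so the transition density p(x; xi) = N(r*xi, 1 - r^2) is known.
r = 0.8
n = 2_000
xi = np.empty(n)
x = np.empty(n)
x_prev = 0.0
for t in range(n):
    xi[t] = r * x_prev + np.sqrt(1 - r**2) * rng.standard_normal()
    x[t] = r * xi[t] + np.sqrt(1 - r**2) * rng.standard_normal()
    x_prev = x[t]

def q(x):
    # Unnormalized N(0, 1) density; true normalizing constant Z = sqrt(2*pi).
    return np.exp(-0.5 * x**2)

def p(x, xi):
    # Transition density p(x; xi), completely known by construction.
    s2 = 1 - r**2
    return np.exp(-0.5 * (x - r * xi) ** 2 / s2) / np.sqrt(2 * np.pi * s2)

# Denominator of the estimated measure: sum over j of p(x_i; xi_j).
denom = p(x[:, None], xi[None, :]).sum(axis=1)

Z_tilde = np.sum(q(x) / denom)                  # estimate of Z
phi = x**2
E_tilde = np.sum(phi * q(x) / denom) / Z_tilde  # estimate of E_p(phi)

print(Z_tilde, E_tilde)
```

With these settings, Z̃ lands close to √(2π) ≈ 2.507 and Ẽ(φ) close to 1. The O(n²) cost of the denominator is what the subsampled estimator μ̃_b mentioned above is designed to reduce.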
