
ECE 275B – Homework #2 Emended – Due Tuesday 2-10-09

Reading

MIDTERM is scheduled for Thursday, February 12, 2009.

• Read and understand the Newton-Raphson and Method of Scores MLE procedures given in Kay, Example 7.11, pp. 178-182. Also read and understand the material in Section 7.8 of Kay.

• Read and understand the Expectation–Maximization (EM) Algorithm described in Moon and Stirling, Sections 17.1-17.6.

• The following papers are recommended:

– “Eliminating Multiple Root Problems in Estimation,” C.G. Small, J. Wang, and Z. Yang, Statistical Science, Vol. 15, No. 4 (Nov., 2000), pp. 313-332.

– “Maximum Likelihood from Incomplete Data via the EM Algorithm,” A.P. Dempster, N.M. Laird, and D.B. Rubin, Journal of the Royal Statistical Society, Series B (Methodological), Vol. 39, No. 1 (1977), pp. 1-38. Highly Recommended.

– “On the Convergence Properties of the EM Algorithm,” C.F. Jeff Wu, The Annals of Statistics, Vol. 11, No. 1 (Mar., 1983), pp. 95-103.

– “On the Convergence of the EM Algorithm,” R.A. Boyles, Journal of the Royal Statistical Society, Series B (Methodological), Vol. 45, No. 1 (1983), pp. 47-50.

– “Mixture Densities, Maximum Likelihood and the EM Algorithm,” R.A. Redner and H.F. Walker, SIAM Review, Vol. 26, No. 2 (Apr., 1984), pp. 195-239. Highly Recommended.

– “Direct Calculation of the Information Matrix via the EM Algorithm,” D. Oakes, Journal of the Royal Statistical Society, Series B (Statistical Methodology), Vol. 61, No. 2 (1999), pp. 479-482.

– “The Variational Approximation for Bayesian Inference: Life After the EM Algorithm,” D.G. Tzikas, A.C. Likas, and N.P. Galatsanos, IEEE Signal Processing Magazine, November 2008, pp. 131-146.

Comments on the Above Cited Papers

As mentioned in class, for highly nonlinear problems of the type considered by Kay and for parametric mixture distributions 1 there will generally be multiple roots of the likelihood equation, corresponding to multiple local maxima of the likelihood function. The question of how to deal with this problem, and in particular how to track the correct zero of the score function (which might not correspond to the global maximum) that has the desired asymptotic property of efficiency, is a difficult one of keen interest. In this regard, you might want to read the paper “Eliminating Multiple Root Problems in Estimation” cited above.

1 In both cases we are leaving the nice world of regular exponential family distributions.

Note that the Newton-Raphson procedure uses the Sample Information Matrix (SIM), while the Method of Scores uses the actual Fisher Information Matrix (FIM), which is the expected value of the SIM. As discussed in Kay, in the limit of an infinitely large number of samples it is the case that FIM ≈ SIM. Thus a perceived merit of the Newton-like numerical approaches is that an asymptotically valid estimate of the parameter error covariance matrix is provided by the availability of the SIM. 2
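To make the distinction concrete, the short Python sketch below (not part of the assignment; the Cauchy location model, the function names, and the constants are choices of mine) runs a Newton-Raphson iteration, which uses the SIM, alongside a Method of Scores iteration, which uses the FIM, on the same data. The Cauchy likelihood equation can also exhibit the multiple-root behavior discussed above. The last line uses the converged SIM to form the asymptotic error bars mentioned in footnote 2.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_cauchy(200) + 3.0       # samples from a Cauchy with location 3, scale 1

def score(t):                            # derivative of the Cauchy log-likelihood
    r = x - t
    return np.sum(2.0 * r / (1.0 + r**2))

def sample_info(t):                      # SIM: negative second derivative of the log-likelihood
    r = x - t
    return np.sum(2.0 * (1.0 - r**2) / (1.0 + r**2)**2)

fim = len(x) / 2.0                       # FIM of a unit-scale Cauchy location parameter

t_nr = t_sc = np.median(x)               # start from a consistent initial estimate
for _ in range(50):
    t_nr = t_nr + score(t_nr) / sample_info(t_nr)   # Newton-Raphson step (uses the SIM)
    t_sc = t_sc + score(t_sc) / fim                 # Method of Scores step (uses the FIM)

half_width = 1.96 / np.sqrt(sample_info(t_nr))      # asymptotic 95% error bars from the SIM
print(t_nr, t_sc, (t_nr - half_width, t_nr + half_width))
```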

One claimed potential drawback of the basic EM algorithm is the general lack of availability of the SIM or FIM for use as a parameter error covariance matrix. 3 This issue is addressed in the above-cited paper “Direct Calculation of the Information Matrix via the EM Algorithm,” which discusses the additional computations (over and above the basic EM algorithm computations) needed to compute an estimate of the information matrix.

Important and useful background material on the history, applications, and convergence properties of the EM algorithm as a procedure for performing missing-data maximum likelihood estimation can be found in the classic paper by Dempster, Laird, and Rubin (DLR) cited above. The subsequent papers by Wu and Boyles, both cited above, correct errors in the convergence proof given in the original DLR paper. The DLR paper discusses a variety of applications, including mixture-density estimation and the example given in Moon. In the DLR paper the complete data space is not assumed to be a simple cartesian product of the actual data and hidden data spaces.

Also cited above is the excellent and highly recommended SIAM Review paper by Redner & Walker on applying the EM algorithm to the problem of mixture density estimation. This paper is well worth reading.

Finally, the recent review paper by Tzikas et al. on the relationship between the EM algorithm and the variational approximation is a nice, short synopsis of material to be found in the textbook Pattern Recognition and Machine Learning, C. Bishop, Springer, 2006. The derivation of the EM algorithm given in this paper is slightly simpler than the one I give in lecture as (unlike DLR) they assume that the complete data space is the cartesian product of the actual data and hidden data spaces (Bishop makes this same simplifying assumption).

2 Recall that for regular statistical families the MLE asymptotically attains the C-R lower bound, which is the inverse of the FIM. The availability of the error covariance matrix, coupled with the fact that the MLE is asymptotically normal, allows one to easily construct confidence intervals (error bars) about the estimated parameter values.

3 On the one hand, the ability to avoid the explicit computation of the FIM (usually required to be updated at every iteration step) typically provides a huge computational savings relative to the Newton algorithm. On the other hand, having this matrix available once the algorithm has finally converged to an estimate of the unknown parameters allows one to construct confidence intervals (parameter error bars) for these parameters.



Homework Problems and Matlab Programming Exercises

1. Let the scalar random variables X1, . . . , Xn and Y1, . . . , Yn all be mutually independent, where Yi ∼ Poisson(β τi) and Xi ∼ Poisson(τi), i = 1, . . . , n. The two data sets {X1, . . . , Xn} and {Y1, . . . , Yn} taken together comprise the complete data. The parameters β and τi, i = 1, . . . , n, are deterministic and unknown. 4

This provides a simple model of the incidence of a disease Yi seen in hospital i, where the underlying rate is a function of an overall (hospital-independent) effect β due to the intrinsic virulence of the disease and an additional hospital-specific factor τi. The availability of the measurements Xi allows one to directly estimate the hospital-specific factor τi. 5

(a) Find the complete information maximum likelihood estimate (MLE) of the unknown parameters.

(b) Assume that the observation for X1 is missing. (Thus each hospital i, except for hospital 1, is able to provide a measurement of the quantity Xi.) Determine an EM algorithm for finding the actual information MLE of the unknown parameters.

4 Recall that a Poisson(λ) distribution has the form

$$p_Z(z \mid \lambda) = \frac{e^{-\lambda}\,\lambda^{z}}{z!}\,, \qquad z = 0, 1, \ldots$$

with mean E{z} = λ.

5 For example, the parameter τi could depend upon the size and overall health of the population in the region served by hospital i and Xi could be the waiting time in minutes before an entering patient is first examined by a trained medical professional. Or τi could depend on the quality of the health care provided by hospital i and Xi could be the number of nurses with advanced training. Et cetera.
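A small simulation harness along the following lines (a sketch only; the parameter values and names are hypothetical choices of mine) is convenient for checking both the complete-information MLE of part (a) and the EM iteration of part (b).

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20
beta_true = 2.5                          # overall virulence effect, shared by all hospitals
tau_true = rng.uniform(1.0, 10.0, n)     # hospital-specific factors

x = rng.poisson(tau_true)                # X_i ~ Poisson(tau_i)
y = rng.poisson(beta_true * tau_true)    # Y_i ~ Poisson(beta * tau_i)

x_missing = x.astype(float)              # observed data for part (b): X_1 is missing
x_missing[0] = np.nan
```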

2. Let A = [aij] be an arbitrary real m × n matrix with ij-th component aij. For f a differentiable real-valued function of A, we define the matrix derivative and the matrix gradient of f(A) in a component-wise manner respectively as

$$\left[\frac{\partial}{\partial A} f(A)\right]_{ij} \triangleq \frac{\partial}{\partial a_{ji}}\, f(A)
\qquad\text{and}\qquad
\nabla_A f(A) \triangleq \left(\frac{\partial}{\partial A} f(A)\right)^{T} = \left[\frac{\partial}{\partial a_{ij}}\, f(A)\right].$$

Note that this convention is consistent with the convention used to define the vector derivative and (cartesian) gradient in ECE 275A.

For Σ square and invertible (but not necessarily symmetric) prove that

$$\frac{\partial}{\partial \Sigma} \log\det\Sigma = \Sigma^{-1}
\qquad\text{and}\qquad
\frac{\partial}{\partial \Sigma}\, \mathrm{tr}\!\left(\Sigma^{-1} W\right) = -\,\Sigma^{-1} W\, \Sigma^{-1}.$$
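Before proving the identities, they can be sanity-checked numerically. The sketch below (function names and tolerances are mine) builds the matrix derivative entry by entry using the component-wise convention defined above and compares it with the claimed closed forms.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 4
Sigma = rng.normal(size=(n, n)) + n * np.eye(n)   # well-conditioned, not symmetric
W = rng.normal(size=(n, n))
eps = 1e-6

def matrix_derivative(f, A):
    """Finite-difference [d f / d A] whose (i, j) entry is d f / d a_{ji} (the convention above)."""
    D = np.zeros_like(A)
    for i in range(A.shape[0]):
        for j in range(A.shape[1]):
            Ap = A.copy(); Ap[j, i] += eps
            Am = A.copy(); Am[j, i] -= eps
            D[i, j] = (f(Ap) - f(Am)) / (2 * eps)
    return D

logdet = lambda S: np.linalg.slogdet(S)[1]            # log |det S|
trinv  = lambda S: np.trace(np.linalg.solve(S, W))    # tr(S^{-1} W)

Si = np.linalg.inv(Sigma)
print(np.allclose(matrix_derivative(logdet, Sigma), Si, atol=1e-4))           # ~ Sigma^{-1}
print(np.allclose(matrix_derivative(trinv, Sigma), -Si @ W @ Si, atol=1e-4))  # ~ -Sigma^{-1} W Sigma^{-1}
```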

3. Let y be a gaussian random vector with unknown mean µ and unknown covariance matrix Σ which is assumed to be full rank. Given a collection of N iid samples of y, use the result of the previous homework problem to find the maximum likelihood estimates of µ and Σ. 6 Why is just determining a stationary point of the log-likelihood function generally sufficient to claim uniqueness and optimality of the solution?

6 Do NOT use the gradient formulas for matrices constrained to be symmetric as is done by Moon and Stirling on page 558. Although the symmetrically-constrained derivative formulas are sometimes useful, they are not needed here because the stationary point of the likelihood just happens to turn out to be symmetric (a “happy accident”!) without the need to impose this constraint. This means that the (“accidentally” symmetric) solution is optimal in the space of all matrices, not just on the manifold of symmetric matrices. I have never seen the symmetry constraint applied when taking the derivative except in Moon and Stirling.
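A direct numerical check of a candidate stationary point is to evaluate the iid gaussian log-likelihood and confirm that small perturbations of the claimed estimates of µ and Σ do not increase it. The helper below is a sketch with an interface of my choosing.

```python
import numpy as np

def gaussian_loglik(Y, mu, Sigma):
    """Y is an N x p array of iid samples; returns the gaussian log-likelihood at (mu, Sigma)."""
    N, p = Y.shape
    d = Y - mu
    _, logdet = np.linalg.slogdet(Sigma)
    quad = np.einsum('ij,ji->', d @ np.linalg.inv(Sigma), d.T)   # sum_i d_i^T Sigma^{-1} d_i
    return -0.5 * (N * p * np.log(2 * np.pi) + N * logdet + quad)
```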

4. (i) In order to model the pdf of a scalar random variable y, completely set up the M-component Gaussian mixture density problem as a hidden-data problem and then derive the EM algorithm for identifying the unknown mixture parameters. Assume that the individual mixture parameters are all independent. (ii) Write a program (in Matlab, or your favorite programming language) to implement your algorithm as requested in the problems given below.
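For reference, one standard formulation of the scalar Gaussian-mixture EM iteration is sketched below; treat it as a template to check against your own derivation rather than as the assigned answer (the function and variable names are mine). It also records the actual-data log-likelihood at every iteration, the quantity Problems 7 and 9 ask you to verify is monotonically increasing.

```python
import numpy as np

def gmm_em(y, M, iters=200, seed=0):
    """EM for a scalar M-component gaussian mixture; y is a 1-D numpy array of samples."""
    rng = np.random.default_rng(seed)
    N = len(y)
    alpha = np.full(M, 1.0 / M)              # mixture weights
    mu = rng.choice(y, M, replace=False)     # initialize the means at distinct samples
    var = np.full(M, np.var(y))              # common initial variance
    loglik = []
    for _ in range(iters):
        # E-step: responsibilities x_hat[j, m] = P(component m | y_j, current parameters)
        d = y[:, None] - mu[None, :]
        comp = alpha * np.exp(-0.5 * d**2 / var) / np.sqrt(2.0 * np.pi * var)
        p_y = comp.sum(axis=1)               # mixture density evaluated at each sample
        loglik.append(np.log(p_y).sum())     # actual-data log-likelihood (should be monotone)
        x_hat = comp / p_y[:, None]
        # M-step: responsibility-weighted sample statistics
        Nm = x_hat.sum(axis=0)
        alpha = Nm / N
        mu = (x_hat * y[:, None]).sum(axis=0) / Nm
        var = (x_hat * (y[:, None] - mu[None, :])**2).sum(axis=0) / Nm
    return alpha, mu, var, np.array(loglik)
```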

5. Note that the algorithm derived in Problem 4 above requires no derivatives, Hessians, or step-size parameters in its implementation. Is this true for a complete implementation of the EM solution given by Kay as equations (7.58) and (7.59)? Why? Can one always avoid derivatives when using the EM algorithm?

6. On pages 290, 291, 293, and 294 (culminating in Example 9.3), Kay shows how to determine consistent estimates for the parameters of a zero-mean, unimodal gaussian mixture. What utility could these parameters have in determining MLEs of the parameters via numerical means?

7. Consider a two-gaussian mixture density p(y | Θ), with true mixture parameters α = α1 = 0.6 and ᾱ = α2 = 1 − α = 0.4, means µ1 = 0 and µ2 = 10, and variances σ1^2 and σ2^2. Thus Θ = {α, ᾱ, θ} with θ = {µ1, µ2, σ1^2, σ2^2}.

The computer experiments described below are to be performed for the two cases σ1^2 = σ2^2 = 1 (i.e., both component standard deviations equal to 1) and σ1^2 = σ2^2 = 25 (i.e., both component standard deviations equal to 5). Note that in the first case the means of the two component densities are located 10 standard deviations apart (probably a rarely seen situation in practice) whereas in the second case the means of the two component densities are located 2 standard deviations apart (probably a not so rare situation).

(a) Plot the true pdf p(y | Θ) as a function of y. Generate and save 400 iid samples from the pdf (a sampling sketch is given after this problem). Show frequency histograms of the data for 25 samples, 50 samples, 100 samples, 200 samples, and 400 samples, plotted on the same plot as the pdf.

(b) (i) Derive an analytic expression for the mean µ and variance σ^2 of y as a function of the mixture parameters, and evaluate this expression using the true parameter values. (ii) Compute the sample mean µ and sample variance σ^2 of y using the 400 data samples and compare to the analytically determined true values. (iii) Show that the sample mean and sample variance are the ML estimates for the mean and variance of a (single) gaussian model for the data. Plot this estimated single gaussian model against the true two-gaussian mixture density.

(c) (i) Assume that the true density is unknown and estimate the parameters for a two-component gaussian mixture 7 using the EM algorithm starting from the following initial conditions:

µ1 = µ2 = µ ,   σ1^2 = σ2^2 = σ^2 ,   α = 1/2 .

What happens and why?

(ii) Perturb the initial conditions slightly to destroy their symmetry and describe the resulting behavior of the EM algorithm in words.

(iii) Again using all 400 samples, estimate the parameters for a variety of initial conditions and try to trap the algorithm in a false local optimum. Describe the results.

(iv) Once you have determined a good set of initial conditions, estimate the mixture parameters for the two-component model via the EM algorithm for 25, 50, 100, 200, and 400 samples. Show that the actual data log-likelihood function increases monotonically with each iteration of the EM algorithm. Plot your final estimated parameter values versus log2(Sample Number).

(d) Discuss the difference between the case where the component standard deviations both have the value 1 and the case where they both have the value 5.

7 By visual inspection of the data frequency histograms, one can argue that whatever the unknown distribution is, it is smooth with bimodal symmetric “humps” and hence should be adequately modelled by a two-component gaussian mixture.
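The sampling sketch referred to in part (a); the defaults follow the problem statement, while the function name and seed are mine.

```python
import numpy as np

def sample_two_gaussian_mixture(N, var, alpha=0.6, mu1=0.0, mu2=10.0, seed=0):
    """Draw N iid samples from alpha*N(mu1, var) + (1 - alpha)*N(mu2, var)."""
    rng = np.random.default_rng(seed)
    from_first = rng.random(N) < alpha          # latent component labels
    means = np.where(from_first, mu1, mu2)
    return rng.normal(means, np.sqrt(var))

y = sample_two_gaussian_mixture(400, var=1.0)   # the 400 samples for the sigma1^2 = sigma2^2 = 1 case
```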

8. (i) For the two-component gaussian mixture case derive the EM algorithm when the constraint σ1^2 = σ2^2 = σ^2 is known to hold. (I.e., incorporate this known constraint into the algorithm.) Compare the resulting algorithm to the one derived in Problem 4 above. (ii) Estimate the unknown parameters of a two-component gaussian mixture using the synthetic data generated in Problem 7 above using your modified algorithm and compare the results.
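For comparison with your own derivation of part (i): the usual effect of the common-variance constraint is a pooled variance in the M-step. The sketch below states this as an assumption to be checked against your derivation, not as the assigned answer.

```python
import numpy as np

def constrained_m_step(y, x_hat):
    """M-step for a two-component scalar gaussian mixture with a common variance.
    y: (N,) samples; x_hat: (N, 2) responsibilities from the E-step."""
    N = len(y)
    Nm = x_hat.sum(axis=0)
    alpha = Nm / N
    mu = (x_hat * y[:, None]).sum(axis=0) / Nm
    var_common = (x_hat * (y[:, None] - mu[None, :])**2).sum() / N   # pooled over both components
    return alpha, mu, np.full(2, var_common)
```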

9. EM computer experiment when one of the components is non-regular. Consider a two-component mixture model where the first component is uniformly distributed between 0 and θ1, U(0, θ1), with mixture parameter α = 0.55, and the second component is normally distributed with mean µ and variance σ^2, θ2 = (µ, σ^2)^T, with mixture parameter ᾱ = 1 − α = 0.45. Assume that all of the parameters are independent of each other. Assume that 0 · log 0 = 0.



(a) i. Generate 1000 samples from the distribution described above for θ1 = 1, µ = 3/2, and σ^2 = 1/16. Plot a histogram of the data and the density function on the same graph. (A sampling sketch is given at the end of this problem.)

ii. Now assume that you do not know how the samples were generated and learn the unknown distribution using 1, 2, 3, and 10 component gaussian mixture models by training on the synthesized data. Plot the true mixture distribution and the four learned gaussian mixture models on the same graph. Recall that the MLE-based density estimate is asymptotically closest to the true density in the Kullback-Leibler divergence sense. Note that our learned models approximate an unknown non-regular mixture density by regular gaussian mixtures. Show that the actual data log-likelihood function increases monotonically with each iteration of the EM algorithm.

iii. Repeat the above for the values µ = 1, 1/2, and 2.

(b) Suppose we now try to solve the non-regular two-component mixture problem described above by actually modeling a uniform-plus-gaussian two-density mixture and applying the EM algorithm in an attempt to learn the component parameters θ1, µ, and σ^2 directly. Convince yourself that the gaussian component can be handled as has already been done above, so that the novelty, and potential difficulty, is in dealing with the non-regular uniform distribution component.

i. In the M-step of the EM algorithm show that if x̂ℓ,1 ≠ 0 for some sample yℓ > θ1 then the function

$$\sum_{j=1}^{N} \hat{x}_{j,1} \log p_1(y_j;\, \theta_1)$$

is not bounded from below. What does this say about the admissible values of θ̂1^+ ?

ii. Show that maximizing

$$\sum_{j=1}^{N} \hat{x}_{j,1} \log p_1(y_j;\, \theta_1)$$

with respect to θ1 is equivalent to maximizing

$$\left(\frac{1}{\theta_1}\right)^{\hat{N}_1} \prod_{j'=1}^{N_1} 1_{(0,\theta_1)}(y_{j'}) \quad\text{with}\quad \theta_1 \ge y_{j'} \text{ for all } y_{j'} \qquad (1)$$

where {yj′} are the N1 data samples for which x̂j′,1 ≠ 0, and 1_{(0,θ1)}(yj′) is the indicator function that indicates whether or not yj′ is in the interval (0, θ1). Maximize the expression (1) to obtain the estimate θ̂1^+. (Hint: Compare to the full information MLE solution you derived last quarter.)



Explain why the estimate provided by the EM algorithm remains stuck at the value of the very first update θ̂1^+, and never changes thereafter, and why the value one gets stuck at depends on the initialization value of θ̂1. Verify these facts in simulation (see below for additional simulation requests). Try various initialization values for θ̂1, including θ̂1 > max_j yj. (Note that only initialization values that have an actual data log-likelihood that is bounded from below need be considered.) Show that the actual data log-likelihood function monotonically increases with each iteration of the EM algorithm (i.e., that we have monotonic hill-climbing on the actual data log-likelihood function).

iii. Let us attempt to heuristically fix the “stickiness” problem of the EM algorithm. There is no guarantee that the procedure outlined below will work, as we are not conforming to the theory of the EM algorithm. Thus, monotonic hill-climbing on the actual data log-likelihood function is not guaranteed.

First put a floor on the value of x̂j,1 for positive yj:

If x̂j,1 < ε and yj > 0, set x̂j,1 = ε and x̂j,2 = 1 − ε.

Choose ε to be a very small value. In particular choose ε ≪ 1/θ̂1, where θ̂1 > max_j yj is the initialization value of the estimate of θ1.

Next define two intermediate estimates, θ̂1′ and θ̂1″, of θ1 as follows:

$$\hat{\theta}_1' = \frac{2}{\hat{N}_1} \sum_{j=1}^{N} \hat{x}_{j,1}\, y_j \qquad\text{with}\qquad \hat{N}_1 = \sum_{j=1}^{N} \hat{x}_{j,1}$$

$$\hat{\theta}_1'' = \max_{j}\; \hat{x}_{j,1}\, y_j\, e^{-\frac{\left(y_j - \hat{\theta}_1'\right)^2}{\tau}}$$

In the complete information case, which corresponds to setting x̂j,1 = xj,1 and N̂1 = N1, the estimate θ̂1′ corresponds to the BLUE (and method of moments) estimate of θ1. We can refer to θ̂1′ as the “pseudo-BLUE” estimate.

The estimate θ̂1″ modifies the EM update estimate by adding the factor x̂j,1, which weights yj according to the probability that it was generated by the uniform distribution, and an exponential factor which weights yj according to how far it is from the pseudo-BLUE estimate of θ1. We can refer to θ̂1″ as the “weighted EM update” of θ1. The parameter τ determines how aggressively we penalize the deviation of yj from θ̂1′.

We construct our final “Modified EM update” of θ1 as

$$\hat{\theta}_1^{+} = \beta\, \hat{\theta}_1' + \bar{\beta}\, \hat{\theta}_1'' \qquad\text{for } 0 \le \beta \le 1 \text{ and } \bar{\beta} = 1 - \beta$$



For example, setting β = 1/2 yields

$$\hat{\theta}_1^{+} = \frac{1}{2}\left(\hat{\theta}_1' + \hat{\theta}_1''\right)$$

We see that there are three free parameters to be set when implementing the Modified EM update step, namely ε, τ, and β.

Attempt to run the modified EM algorithm on 1000 samples generated from data for the cases θ1 = 1, σ^2 = 1/16, and µ = 1/2, 1, 3/2, and 2. Plot the actual data log-likelihood function to see if it fails to increase monotonically with each iteration of the EM algorithm.

If the algorithm doesn’t work, don’t worry; I thought up this heuristic fix “on the fly” and I’m just curious to see if it has any chance of working. If you can get the procedure to work, describe its performance relative to the pure gaussian mixture cases investigated above and show its performance for various choices of β, including β = 0, 0.5, and 1.
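The sampling sketch referred to in part (a)i, together with a direct transcription of the Modified EM update defined above (the E-step and the gaussian-component updates are assumed to come from your own EM code; the function names, defaults, and seed are mine).

```python
import numpy as np

def sample_uniform_gaussian(N, theta1=1.0, mu=1.5, var=1.0/16, alpha=0.55, seed=0):
    """Draw N iid samples from alpha*U(0, theta1) + (1 - alpha)*N(mu, var)."""
    rng = np.random.default_rng(seed)
    from_uniform = rng.random(N) < alpha
    return np.where(from_uniform,
                    rng.uniform(0.0, theta1, N),
                    rng.normal(mu, np.sqrt(var), N))

def modified_em_update(y, x1, eps, tau, beta):
    """x1[j] is the current responsibility of the uniform component for sample y[j];
    the gaussian responsibility is 1 - x1[j]."""
    x1 = np.where((x1 < eps) & (y > 0), eps, x1)                        # floor the responsibilities
    N1_hat = x1.sum()
    theta_blue = (2.0 / N1_hat) * np.sum(x1 * y)                        # pseudo-BLUE estimate
    theta_wem = np.max(x1 * y * np.exp(-(y - theta_blue)**2 / tau))     # weighted EM update
    return beta * theta_blue + (1.0 - beta) * theta_wem                 # blended Modified EM update
```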

