
Introduction to the EM algorithm

Camila P. E. de Souza
PhD student, Department of Statistics, UBC
Student Seminar Series, November 2010



Introduction

Suppose we have data $y_1, \dots, y_n$ that come from a mixture of $M$ distributions, that is,

$$p(y_i \mid \theta) = \sum_{j=1}^{M} \alpha_j \, p_j(y_i \mid \lambda_j),$$

where

- $\lambda_j$ is the vector of parameters of the $j$th distribution;
- $0 \le \alpha_j \le 1$ for each $j$, and $\sum_{j=1}^{M} \alpha_j = 1$;
- $\theta = (\alpha_1, \dots, \alpha_M, \lambda_1, \dots, \lambda_M)$.
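To make the definition concrete, here is a minimal R sketch of this mixture density for normal components; the function name and argument names (alpha, mu, sigma) are illustrative, not from the slides.

```r
# Mixture density p(y | theta) = sum_j alpha_j * p_j(y | lambda_j)
# for M normal components; y is a numeric vector.
mixture_density <- function(y, alpha, mu, sigma) {
  dens <- vapply(seq_along(alpha),
                 function(j) alpha[j] * dnorm(y, mu[j], sigma[j]),
                 numeric(length(y)))
  rowSums(matrix(dens, nrow = length(y)))  # sum over the M components for each y_i
}

# Example call with the parameter values used in the next slide:
mixture_density(c(-1, 0, 7), alpha = c(0.7, 0.3), mu = c(0, 7), sigma = c(1, 0.5))
```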



Example: Mixture of 2 normal distributions

[Figure: density curve of the mixture over $y$; see caption.]

Figure: $\lambda_1 = (\mu_1 = 0,\ \sigma_1 = 1)$, $\lambda_2 = (\mu_2 = 7,\ \sigma_2 = 0.5)$, $\alpha_1 = 0.7$, $\alpha_2 = 0.3$.
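The figure itself is not reproduced here, but a plot with the caption's parameter values can be sketched in R as follows (the plotting range is chosen to roughly match the slide's axis):

```r
# Plot the two-component mixture density from the caption:
# 0.7 * N(0, 1) + 0.3 * N(7, 0.5^2)
curve(0.7 * dnorm(x, mean = 0, sd = 1) + 0.3 * dnorm(x, mean = 7, sd = 0.5),
      from = -4, to = 10, xlab = "y", ylab = "Density")
```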



Maximum likelihood estimator of θ

Our goal is to find $\hat\theta$ (i.e., the $\hat\alpha_j$'s and $\hat\lambda_j$'s) that maximizes the log-likelihood of the data w.r.t. $\theta$:

$$\begin{aligned}
\log p(y \mid \theta) &= \log \prod_{i=1}^{n} p(y_i \mid \theta) \\
&= \sum_{i=1}^{n} \log p(y_i \mid \theta) \\
&= \sum_{i=1}^{n} \log \left[ \sum_{j=1}^{M} \alpha_j \, p_j(y_i \mid \lambda_j) \right].
\end{aligned}$$

The optimization of this log-likelihood is a complicated task, as it involves the logarithm of a summation.



In order to obtain the parameter estimates for this kind of problem, it is common to apply numerical methods such as the EM algorithm.

In this example we can suppose the existence of independent latent variables $z_1, \dots, z_n$ that tell us from which distribution each $y_i$ was generated, that is,

$$y_i \sim p_j(y_i \mid \lambda_j) \text{ if } z_i = j, \quad \text{where } p(z_i = j) = \alpha_j.$$

Note: It is very common to call $y$ the incomplete data and $(y, z)$ the complete data.
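A short R sketch of this data-generating mechanism, using the running example's parameter values (the seed and sample size are arbitrary choices, not from the slides):

```r
set.seed(42)                 # arbitrary seed for reproducibility
n <- 500
alpha <- c(0.7, 0.3); mu <- c(0, 7); sigma <- c(1, 0.5)
z <- sample(1:2, n, replace = TRUE, prob = alpha)  # latent labels: p(z_i = j) = alpha_j
y <- rnorm(n, mean = mu[z], sd = sigma[z])         # y_i drawn from component z_i
# y alone is the incomplete data; the pair (y, z) is the complete data.
```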



In our example,

$$y_i \sim N(\mu_1, \sigma_1^2) \text{ if } z_i = 1, \quad \text{with } p(z_i = 1) = \alpha_1 = 0.7;$$

$$y_i \sim N(\mu_2, \sigma_2^2) \text{ if } z_i = 2, \quad \text{with } p(z_i = 2) = 1 - \alpha_1 = 0.3.$$



The EM algorithm

The derivation of the EM algorithm consists of two main steps:

1. Express the log-likelihood in terms of the distribution of the latent data;
2. Use this expression to come up with a sequence of $\theta^{(k)}$'s which does not decrease $\log p(y \mid \theta)$, that is, with $\log p(y \mid \theta^{(k+1)}) \ge \log p(y \mid \theta^{(k)})$.

Note: Various authors have arrived at the EM algorithm while trying to manipulate the log-likelihood equations into a form that can be solved neatly.



Derivation of the EM algorithm

The joint distribution of $y$ and $z$ (also called the complete data distribution) can be written as

$$p(y, z \mid \theta) = p(z \mid y, \theta) \, p(y \mid \theta).$$

Rearranging this we get

$$p(y \mid \theta) = \frac{p(y, z \mid \theta)}{p(z \mid y, \theta)}.$$

Taking the logarithm, we have

$$\log p(y \mid \theta) = \log p(y, z \mid \theta) - \log p(z \mid y, \theta).$$



Note that

$$\log p(y \mid \theta) = \log p(y, z \mid \theta) - \log p(z \mid y, \theta) \quad (1)$$

holds for any value of $z$ generated according to the distribution $p(z \mid y, \theta)$ for a chosen value of $\theta$, say $\theta^{(k)}$.

Therefore, the expected value of the right-hand side of (1) under $p(z \mid y, \theta^{(k)})$ is also equal to $\log p(y \mid \theta)$.



Taking the expectation on both sides of the previous equation (note that the left-hand side does not depend on $z$), we have that

$$\int \log p(y \mid \theta) \, p(z \mid y, \theta^{(k)}) \, dz = \int \log p(y, z \mid \theta) \, p(z \mid y, \theta^{(k)}) \, dz - \int \log p(z \mid y, \theta) \, p(z \mid y, \theta^{(k)}) \, dz.$$

Therefore,

$$\log p(y \mid \theta) = Q(\theta, \theta^{(k)}) - H(\theta, \theta^{(k)}),$$

where $Q(\theta, \theta^{(k)})$ and $H(\theta, \theta^{(k)})$ denote the first and second integrals on the right-hand side, respectively.



How can we find a value $\theta^{(k+1)}$ so that $\log p(y \mid \theta^{(k+1)}) \ge \log p(y \mid \theta^{(k)})$?

Using $\log p(y \mid \theta) = Q(\theta, \theta^{(k)}) - H(\theta, \theta^{(k)})$, this inequality is equivalent to

$$\begin{aligned}
&Q(\theta^{(k+1)}, \theta^{(k)}) - Q(\theta^{(k)}, \theta^{(k)}) \\
&\quad{} - H(\theta^{(k+1)}, \theta^{(k)}) + H(\theta^{(k)}, \theta^{(k)}) \ge 0. \quad (2)
\end{aligned}$$

A remarkable result shows that the second line of (2) is nonnegative for any $\theta^{(k+1)}$ (see McLachlan and Krishnan, 2008).

Thus, we can guarantee that (2) holds if we guarantee that $Q(\theta^{(k+1)}, \theta^{(k)}) - Q(\theta^{(k)}, \theta^{(k)}) \ge 0$.



To do that, just choose $\theta^{(k+1)}$ as the value of $\theta$ which maximizes $Q(\theta, \theta^{(k)})$ for fixed $\theta^{(k)}$. If $\theta^{(k+1)}$ is the maximizing value, then $Q(\theta^{(k+1)}, \theta^{(k)}) \ge Q(\theta^{(k)}, \theta^{(k)})$.

Note: This maximization can be a difficult task, and it is okay to choose a $\theta^{(k+1)}$ that at least does not decrease $Q(\theta, \theta^{(k)})$ from its current value at $\theta = \theta^{(k)}$.



Therefore, we can guarantee the inequality $\log p(y \mid \theta^{(k+1)}) \ge \log p(y \mid \theta^{(k)})$ by choosing

$$\theta^{(k+1)} = \arg\max_{\theta} Q(\theta, \theta^{(k)}). \quad (3)$$



The EM algorithm has two steps (see the sketch after this list):

1. Expectation step (E-step): calculate $Q(\theta, \theta^{(k)})$, the expectation of $\log p(y, z \mid \theta)$ conditional on the data and using the current parameter estimates.
2. Maximization step (M-step): find the $\theta^{(k+1)}$ that maximizes $Q(\theta, \theta^{(k)})$ with respect to $\theta$.
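As a sketch, the two steps fit into a generic loop like the one below; `e_step` and `m_step` are placeholder functions to be filled in for a concrete model (normal-mixture versions appear later), and the convergence check on the log-likelihood is one common choice among several.

```r
em <- function(y, theta, e_step, m_step, max_iter = 200, tol = 1e-8) {
  ll_old <- -Inf
  for (k in 1:max_iter) {
    es    <- e_step(y, theta)   # E-step: quantities defining Q(., theta^(k))
    theta <- m_step(y, es)      # M-step: theta^(k+1) maximizes Q(theta, theta^(k))
    if (es$loglik - ll_old < tol) break  # monotone: log-likelihood never decreases
    ll_old <- es$loglik
  }
  theta
}
```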



Going back to our example

Remember that our goal is to find $\hat\theta$ (i.e., the $\hat\alpha_j$'s and $\hat\lambda_j$'s) that maximizes the log-likelihood of the data w.r.t. $\theta$:

$$\begin{aligned}
\log p(y \mid \theta) &= \log \prod_{i=1}^{n} p(y_i \mid \theta) \\
&= \sum_{i=1}^{n} \log p(y_i \mid \theta) \\
&= \sum_{i=1}^{n} \log \left[ \sum_{j=1}^{M} \alpha_j \, p_j(y_i \mid \lambda_j) \right].
\end{aligned}$$



Applying the EM algorithm

E-step: Calculate $Q(\theta, \theta^{(k)}) = E(\log p(y, z \mid \theta) \mid y, \theta^{(k)})$.

Note: Maybe a better notation for this expectation would be $E_{\theta^{(k)}}(\log p(y, z \mid \theta) \mid y)$.

The complete data log-likelihood can be written as

$$\begin{aligned}
\log p(y, z \mid \theta) &= \log p(y \mid z, \theta) + \log p(z \mid \alpha\text{'s}) \\
&= \sum_{i=1}^{n} \log p(y_i \mid z_i, \lambda_{z_i}) + \sum_{i=1}^{n} \log \alpha_{z_i} \\
&= L_1(\lambda\text{'s}) + L_2(\alpha\text{'s}). \quad (4)
\end{aligned}$$
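For the normal mixture, equation (4) translates directly into R when the labels $z$ are observed; this hypothetical helper just evaluates the two terms (all names are illustrative):

```r
complete_loglik <- function(y, z, mu, sigma2, alpha) {
  L1 <- sum(dnorm(y, mean = mu[z], sd = sqrt(sigma2[z]), log = TRUE))  # L1(lambda's)
  L2 <- sum(log(alpha[z]))                                             # L2(alpha's)
  L1 + L2
}
```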



Since

$$L_1(\lambda\text{'s}) = \sum_{i=1}^{n} \sum_{j=1}^{M} I\{z_i = j\} \log p(y_i \mid z_i = j, \lambda_j),$$

we have

$$E(L_1(\lambda\text{'s}) \mid y, \theta^{(k)}) = \sum_{i=1}^{n} \sum_{j=1}^{M} p(z_i = j \mid y_i, \theta^{(k)}) \log p(y_i \mid z_i = j, \lambda_j),$$

where

$$p(z_i = j \mid y_i, \theta^{(k)}) = \frac{\alpha_j^{(k)} \, p(y_i \mid z_i = j, \lambda_j^{(k)})}{\sum_{l=1}^{M} \alpha_l^{(k)} \, p(y_i \mid z_i = l, \lambda_l^{(k)})}.$$

Let us use $p_{ij}^{(k)}$ as a short notation for $p(z_i = j \mid y_i, \theta^{(k)})$.
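In R, the $p_{ij}^{(k)}$ for the normal case can be computed in a vectorized way; this sketch also returns the observed-data log-likelihood as a by-product for monitoring (a convenience, not something the slides require):

```r
e_step_normal <- function(y, theta) {
  # theta is a list with elements alpha, mu, sigma2 (length-M vectors)
  dens <- sapply(seq_along(theta$alpha), function(j)
    theta$alpha[j] * dnorm(y, theta$mu[j], sqrt(theta$sigma2[j])))
  list(p = dens / rowSums(dens),          # p_ij^(k): each row normalized to sum to 1
       loglik = sum(log(rowSums(dens))))  # log p(y | theta^(k))
}
```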



Considering a mixture of normal distributions ($\lambda_j = (\mu_j, \sigma_j^2)$), we have:

$$E(L_1(\lambda\text{'s}) \mid y, \theta^{(k)}) = -\frac{n}{2} \log(2\pi) - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{M} p_{ij}^{(k)} \log \sigma_j^2 - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{M} p_{ij}^{(k)} \frac{(y_i - \mu_j)^2}{\sigma_j^2}.$$



Since

$$L_2(\alpha\text{'s}) = \sum_{i=1}^{n} \sum_{j=1}^{M} I\{z_i = j\} \log \alpha_j,$$

we have

$$E(L_2(\alpha\text{'s}) \mid y, \theta^{(k)}) = \sum_{i=1}^{n} \sum_{j=1}^{M} p(z_i = j \mid y_i, \theta^{(k)}) \log \alpha_j = \sum_{i=1}^{n} \sum_{j=1}^{M} p_{ij}^{(k)} \log \alpha_j.$$



Therefore, considering the normal case, we get:

$$Q(\theta, \theta^{(k)}) = -\frac{n}{2} \log(2\pi) - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{M} p_{ij}^{(k)} \log \sigma_j^2 - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{M} p_{ij}^{(k)} \frac{(y_i - \mu_j)^2}{\sigma_j^2} + \sum_{i=1}^{n} \sum_{j=1}^{M} p_{ij}^{(k)} \log \alpha_j.$$
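A direct R transcription of $Q$ for the normal case, useful for checking that the M-step really does not decrease it (a sketch; `p` is the $n \times M$ matrix of $p_{ij}^{(k)}$ from the E-step):

```r
Q_value <- function(y, p, mu, sigma2, alpha) {
  n  <- length(y)
  sq <- outer(y, mu, "-")^2                       # (y_i - mu_j)^2, an n x M matrix
  -n / 2 * log(2 * pi) -
    0.5 * sum(sweep(p, 2, log(sigma2), "*")) -    # -(1/2) sum p_ij log sigma_j^2
    0.5 * sum(p * sweep(sq, 2, sigma2, "/")) +    # -(1/2) sum p_ij (y_i - mu_j)^2 / sigma_j^2
    sum(sweep(p, 2, log(alpha), "*"))             # sum p_ij log alpha_j
}
```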



M-step: Holding $\alpha_j$ and $\sigma_j^2$ fixed and maximizing $Q$ with respect to $\mu_j$, we get:

$$\hat\mu_j = \frac{\sum_{i=1}^{n} y_i \, p_{ij}^{(k)}}{\sum_{i=1}^{n} p_{ij}^{(k)}}.$$

Now holding $\alpha_j$ and $\mu_j$ fixed and maximizing w.r.t. $\sigma_j^2$, we have:

$$\hat\sigma_j^2 = \frac{\sum_{i=1}^{n} (y_i - \mu_j)^2 \, p_{ij}^{(k)}}{\sum_{i=1}^{n} p_{ij}^{(k)}}.$$



And finally, in the same way, using Lagrange multipliers and the restriction that $\sum_{j=1}^{M} \alpha_j = 1$, we have:

$$\hat\alpha_j = \frac{1}{n} \sum_{i=1}^{n} p_{ij}^{(k)}.$$

The updates are as follows:

- $\mu_j^{(k+1)} = \hat\mu_j$;
- $\sigma_j^{2\,(k+1)}$ is $\hat\sigma_j^2$ with $\mu_j = \mu_j^{(k+1)}$;
- $\alpha_j^{(k+1)} = \hat\alpha_j$.

We iterate the E-step and M-step until convergence (see the sketch below).
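Putting the three updates together gives the normal-mixture M-step, which plugs into the earlier skeleton; the starting values below are arbitrary guesses (a sketch, not a robust implementation — e.g., there is no guard against a component collapsing):

```r
m_step_normal <- function(y, es) {
  p  <- es$p
  nj <- colSums(p)                                 # effective count per component
  mu <- colSums(p * y) / nj                        # mu_j^(k+1)
  sigma2 <- colSums(p * outer(y, mu, "-")^2) / nj  # sigma_j^2(k+1), using the new mu
  list(alpha = nj / length(y), mu = mu, sigma2 = sigma2)
}

# Example run with the simulated y from earlier and rough starting values:
theta0 <- list(alpha = c(0.5, 0.5), mu = c(1, 5), sigma2 = c(1, 1))
fit <- em(y, theta0, e_step_normal, m_step_normal)
```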



In R, I found two packages for working with mixtures of distributions: mixdist and mixtools.
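For instance, mixtools provides normalmixEM for exactly this model; the call below reflects its documented interface, though the argument set should be checked against the installed version:

```r
library(mixtools)
fit <- normalmixEM(y, k = 2)   # EM fit of a 2-component normal mixture
fit$lambda                      # estimated mixing proportions (the alpha_j's)
fit$mu                          # estimated component means
fit$sigma                       # estimated component standard deviations
```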



References

McLachlan, G. J. and Krishnan, T. (2008). The EM Algorithm and Extensions, 2nd ed. John Wiley & Sons.

Tagare, H. (1998). A Gentle Introduction to the EM Algorithm.



Thank you!
