Introduction to the EM algorithm - Department of Statistics
Introduction to the EM algorithm

Camila P. E. de Souza
PhD student, Department of Statistics, UBC
Student Seminar Series

Camila Souza (UBC), Introduction to the EM algorithm, November 2010
Introduction

Suppose we have data y_1, ..., y_n that come from a mixture of M distributions, that is,

p(y_i|θ) = ∑_{j=1}^M α_j p_j(y_i|λ_j),

where
- λ_j is the vector of parameters of the jth distribution;
- 0 ≤ α_j ≤ 1 for each j and ∑_{j=1}^M α_j = 1;
- θ = (α_1, ..., α_M, λ_1, ..., λ_M).
Example: Mixture of 2 normal distributions

[Figure: density of the mixture with λ_1 = (μ_1 = 0, σ_1 = 1), λ_2 = (μ_2 = 7, σ_2 = 0.5), α_1 = 0.7, α_2 = 0.3.]
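The density in the figure can be reproduced with a short sketch; the parameter values come from the caption, while the function names are mine:

```python
import numpy as np

def normal_pdf(y, mu, sigma):
    # Density of N(mu, sigma^2) evaluated at y
    return np.exp(-0.5 * ((y - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def mixture_density(y):
    # 0.7 * N(0, 1) + 0.3 * N(7, 0.5^2), as in the figure
    return 0.7 * normal_pdf(y, 0.0, 1.0) + 0.3 * normal_pdf(y, 7.0, 0.5)

grid = np.linspace(-4, 10, 400)
density = mixture_density(grid)
```

Plotting `density` against `grid` reproduces the bimodal shape in the figure.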
Maximum likelihood estimator of θ

Our goal is to find θ̂ (i.e., the α̂_j's and λ̂_j's) that maximizes the log-likelihood of the data with respect to θ:

log p(y|θ) = log ∏_{i=1}^n p(y_i|θ)
           = ∑_{i=1}^n log p(y_i|θ)
           = ∑_{i=1}^n log [ ∑_{j=1}^M α_j p_j(y_i|λ_j) ].

The optimization of this log-likelihood is a complicated task, as it involves the logarithm of a summation.
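To make the difficulty concrete, here is a sketch (function and variable names are mine) that evaluates this log-likelihood for a normal mixture; the log wraps the inner sum over components, so setting the derivative to zero yields no closed-form solution:

```python
import numpy as np

def log_likelihood(y, alphas, mus, sigmas):
    # log p(y|theta) = sum_i log sum_j alpha_j * N(y_i | mu_j, sigma_j^2)
    y = np.asarray(y, dtype=float)[:, None]            # shape (n, 1)
    comp = np.exp(-0.5 * ((y - mus) / sigmas) ** 2) / (sigmas * np.sqrt(2 * np.pi))
    return float(np.sum(np.log(comp @ alphas)))        # the log of a sum over j
```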
In order to obtain the parameter estimates for this kind of problem, it is common to apply numerical methods such as the EM algorithm.

In this example we can suppose the existence of independent latent variables z_1, ..., z_n that tell us from which distribution each y_i was generated, that is,

y_i ∼ p_j(y_i|λ_j) if z_i = j, where p(z_i = j) = α_j.

Note: It is very common to call y the incomplete data and (y, z) the complete data.
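A hypothetical sketch of how the complete data (y, z) would be generated, using the parameter values of the running example (everything here is illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
alphas = np.array([0.7, 0.3])
mus = np.array([0.0, 7.0])
sigmas = np.array([1.0, 0.5])

z = rng.choice(len(alphas), size=n, p=alphas)   # latent labels: p(z_i = j) = alpha_j
y = rng.normal(mus[z], sigmas[z])               # y_i | z_i = j  ~  N(mu_j, sigma_j^2)
```

In practice only y is observed (the incomplete data); z is exactly what the EM algorithm averages over.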
Equivalently, in the two-component normal example:

y_i ∼ N(μ_1, σ_1²) if z_i = 1, with p(z_i = 1) = α_1 = 0.7;
y_i ∼ N(μ_2, σ_2²) if z_i = 2, with p(z_i = 2) = α_2 = 1 − α_1.

[Figure: the mixture density shown earlier.]
The EM algorithm

The derivation of the EM algorithm consists of two main steps:

1. Express the log-likelihood in terms of the distribution of the latent data;
2. Use this expression to come up with a sequence of θ^(k)'s that does not decrease log p(y|θ), that is, with log p(y|θ^(k+1)) ≥ log p(y|θ^(k)).

Note: Various authors have arrived at the EM algorithm while trying to manipulate the log-likelihood equations so as to solve them in a tractable way.
Derivation of the EM algorithm

The joint distribution of y and z (also called the complete-data distribution) can be written as

p(y, z|θ) = p(z|y, θ) p(y|θ).

Rearranging this we get

p(y|θ) = p(y, z|θ) / p(z|y, θ).

Taking the logarithm, we have

log p(y|θ) = log p(y, z|θ) − log p(z|y, θ).
Note that

log p(y|θ) = log p(y, z|θ) − log p(z|y, θ)   (1)

holds for any value of z generated according to the distribution p(z|y, θ) for a chosen value of θ, say θ^(k).

Therefore, the expected value of the right-hand side of (1) under that distribution is also equal to log p(y|θ).
Taking the expectation of both sides of the previous equation with respect to p(z|y, θ^(k)), we have

∫ log p(y|θ) p(z|y, θ^(k)) dz = ∫ log p(y, z|θ) p(z|y, θ^(k)) dz − ∫ log p(z|y, θ) p(z|y, θ^(k)) dz.

Since the left-hand side does not depend on z, this gives

log p(y|θ) = Q(θ, θ^(k)) − H(θ, θ^(k)),

where Q(θ, θ^(k)) denotes the first integral on the right-hand side and H(θ, θ^(k)) the second.
How can we find a value θ^(k+1) so that log p(y|θ^(k+1)) ≥ log p(y|θ^(k))?

Using log p(y|θ) = Q(θ, θ^(k)) − H(θ, θ^(k)), this inequality becomes

Q(θ^(k+1), θ^(k)) − Q(θ^(k), θ^(k))
− H(θ^(k+1), θ^(k)) + H(θ^(k), θ^(k)) ≥ 0.   (2)

A remarkable result shows that the second line of (2) is non-negative for any θ^(k+1): by Jensen's inequality, H(θ, θ^(k)) ≤ H(θ^(k), θ^(k)) for all θ (see McLachlan and Krishnan, 2008).

Thus, we can guarantee that (2) holds if we guarantee that

Q(θ^(k+1), θ^(k)) − Q(θ^(k), θ^(k)) ≥ 0.
To do that, just choose θ^(k+1) as the value of θ which maximizes Q(θ, θ^(k)) for fixed θ^(k). If θ^(k+1) is the maximizing value, then

Q(θ^(k+1), θ^(k)) ≥ Q(θ^(k), θ^(k)).

Note: This maximization can be a difficult task, and it is enough to choose a θ^(k+1) that at least does not decrease Q(θ, θ^(k)) from its current value at θ^(k).
Therefore, we can guarantee the inequality log p(y|θ^(k+1)) ≥ log p(y|θ^(k)) by choosing

θ^(k+1) = arg max_θ Q(θ, θ^(k)).   (3)
The EM algorithm has two steps:

1. Expectation step (E-step): calculate Q(θ, θ^(k)), the expectation of log p(y, z|θ) conditional on the data and using the current parameter estimates.
2. Maximization step (M-step): find θ^(k+1) that maximizes Q(θ, θ^(k)) with respect to θ.
Going back to our example

Remember that our goal is to find θ̂ (i.e., the α̂_j's and λ̂_j's) that maximizes the log-likelihood of the data with respect to θ:

log p(y|θ) = log ∏_{i=1}^n p(y_i|θ)
           = ∑_{i=1}^n log p(y_i|θ)
           = ∑_{i=1}^n log [ ∑_{j=1}^M α_j p_j(y_i|λ_j) ].
Applying the EM algorithm

E-step: Calculate Q(θ, θ^(k)) = E(log p(y, z|θ)|y, θ^(k)).

Note: Maybe a better notation for this expectation would be E_{θ^(k)}(log p(y, z|θ)|y).

The complete-data log-likelihood can be written as

log p(y, z|θ) = log p(y|z, θ) + log p(z|α's)
             = ∑_{i=1}^n log p(y_i|z_i, λ_{z_i}) + ∑_{i=1}^n log α_{z_i}
             = L_1(λ's) + L_2(α's).   (4)
Since

L_1(λ's) = ∑_{i=1}^n ∑_{j=1}^M I{z_i = j} log p(y_i|z_i = j, λ_j),

we have

E(L_1(λ's)|y, θ^(k)) = ∑_{i=1}^n ∑_{j=1}^M p(z_i = j|y_i, θ^(k)) log p(y_i|z_i = j, λ_j),

where, by Bayes' rule,

p(z_i = j|y_i, θ^(k)) = [p(y_i|z_i = j, λ_j^(k)) × α_j^(k)] / [∑_{l=1}^M p(y_i|z_i = l, λ_l^(k)) × α_l^(k)].

Let us use p_ij^(k) as a short notation for p(z_i = j|y_i, θ^(k)).
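For a normal mixture, the responsibilities p_ij^(k) can be computed exactly as in the display above. A sketch (the function name is mine):

```python
import numpy as np

def responsibilities(y, alphas, mus, sigmas):
    """p_ij^(k) = p(z_i = j | y_i, theta^(k)) via Bayes' rule."""
    y = np.asarray(y, dtype=float)[:, None]           # shape (n, 1)
    comp = np.exp(-0.5 * ((y - mus) / sigmas) ** 2) / (sigmas * np.sqrt(2 * np.pi))
    w = comp * alphas                 # numerator: p(y_i | z_i = j, lambda_j) * alpha_j
    return w / w.sum(axis=1, keepdims=True)           # normalize over j
```

Each row sums to one: every observation distributes its unit of weight across the M components.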
Considering a mixture of normal distributions (λ_j = (μ_j, σ_j²)) we have:

E(L_1(λ's)|y, θ^(k)) = −(n/2) log(2π)
                     − (1/2) ∑_{i=1}^n ∑_{j=1}^M p_ij^(k) log σ_j²
                     − (1/2) ∑_{i=1}^n ∑_{j=1}^M p_ij^(k) (y_i − μ_j)² / σ_j².
Since

L_2(α's) = ∑_{i=1}^n ∑_{j=1}^M log α_j × I{z_i = j},

we have

E(L_2(α's)|y, θ^(k)) = ∑_{i=1}^n ∑_{j=1}^M log α_j × p(z_i = j|y_i, θ^(k))
                     = ∑_{i=1}^n ∑_{j=1}^M log α_j × p_ij^(k).
Therefore, considering the normal case, we get:

Q(θ, θ^(k)) = −(n/2) log(2π)
            − (1/2) ∑_{i=1}^n ∑_{j=1}^M p_ij^(k) log σ_j²
            − (1/2) ∑_{i=1}^n ∑_{j=1}^M p_ij^(k) (y_i − μ_j)² / σ_j²
            + ∑_{i=1}^n ∑_{j=1}^M log α_j × p_ij^(k).
M-step: Holding the α_j's and σ_j²'s fixed and maximizing Q with respect to μ_j, we get:

μ̂_j = ∑_{i=1}^n y_i p_ij^(k) / ∑_{i=1}^n p_ij^(k).

Now holding the α_j's and μ_j's fixed and maximizing w.r.t. σ_j², we have:

σ̂_j² = ∑_{i=1}^n (y_i − μ_j)² × p_ij^(k) / ∑_{i=1}^n p_ij^(k).
And finally, in the same way, using Lagrange multipliers and the restriction that ∑_{j=1}^M α_j = 1, we have:

α̂_j = (1/n) ∑_{i=1}^n p_ij^(k).

The updates are as follows:
- μ_j^(k+1) = μ̂_j;
- σ_j²^(k+1) is σ̂_j² with μ_j = μ_j^(k+1);
- α_j^(k+1) = α̂_j.

We iterate the E-step and M-step until convergence.
In R I found two packages for working with mixtures of distributions: mixdist and mixtools.
References

McLachlan, G. J. and Krishnan, T. (2008), The EM Algorithm and Extensions, 2nd ed., John Wiley & Sons.

Tagare, H. (1998), A Gentle Introduction to the EM Algorithm.
Thank you!