Assignment 1 Solutions
Instructor: Murat Dundar<br />
Date: 2/18/2013<br />
In this homework you will derive the predictive distribution for the Normal × Normal × Inverted Wishart<br />
model and use it in a K-class classification problem to demonstrate its benefits over the standard data<br />
distribution.<br />
Part A (50 points)<br />
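Data Model: \(x \sim N(\mu, \Sigma)\), \(\mu \sim N(\mu_0, \kappa^{-1}\Sigma)\), \(\Sigma \sim W^{-1}(\Sigma_0, m)\).<br />
Given a data set \(D = \{x_1, \ldots, x_n\}\), \(x_i \in \mathbb{R}^d\), generated by this model, let the sample mean be \(\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i\)<br />
and the sample covariance matrix be \(S = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})(x_i - \bar{x})^T\) (normalized by \(n-1\), so that \((n-1)S\) is the scatter matrix used in the hint of part (b)).<br />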
(a) Show that \(\bar{x} \sim N(\mu, n^{-1}\Sigma)\) (10 points)<br />
Solution:<br />
Note that \(\bar{x}\) is a linear combination of \(n\) iid Normally distributed RVs, so \(\bar{x}\) itself is also Normally distributed. The mean of this distribution is<br />
\[ E\{\bar{x}\} = \frac{1}{n}\sum_{i=1}^{n} E\{x_i\} = \frac{1}{n}\, n\mu = \mu. \]<br />
Its covariance is<br />
\[ E\{(\bar{x}-\mu)(\bar{x}-\mu)^T\} = E\Big\{\Big(\frac{1}{n}\sum_{i=1}^{n}(x_i-\mu)\Big)\Big(\frac{1}{n}\sum_{j=1}^{n}(x_j-\mu)\Big)^T\Big\} = \frac{1}{n^2}\sum_{i=1}^{n} E\{(x_i-\mu)(x_i-\mu)^T\} = \frac{n\Sigma}{n^2} = \frac{\Sigma}{n}, \]<br />
where the cross terms \(E\{(x_i-\mu)(x_j-\mu)^T\}\), \(i \neq j\), vanish because the \(x_i\) are independent.<br />
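The result of part (a) can be sanity-checked by simulation. The following is a minimal sketch for the scalar case d = 1 only, using the Python standard library; the true mean, the sample size, and the number of trials are arbitrary illustration values, not part of the assignment.

```python
import random
import statistics

def sample_mean_variance(n, sigma, trials, seed=0):
    """Empirical variance of the mean of n iid N(mu, sigma^2) draws (d = 1)."""
    rng = random.Random(seed)
    mu = 1.5  # arbitrary true mean, chosen only for illustration
    means = []
    for _ in range(trials):
        draws = [rng.gauss(mu, sigma) for _ in range(n)]
        means.append(sum(draws) / n)
    return statistics.pvariance(means)

n, sigma = 5, 2.0
empirical = sample_mean_variance(n, sigma, trials=20000)
theoretical = sigma ** 2 / n  # Sigma / n in the scalar case
print(empirical, theoretical)
```

The two printed numbers should agree up to Monte Carlo error, which shrinks as the number of trials grows.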
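(b) Show that \(p(\mu, \Sigma \mid \bar{x}, S) = N\big(\frac{n\bar{x}+\kappa\mu_0}{n+\kappa}, \frac{\Sigma}{n+\kappa}\big) \times W^{-1}\big(\Sigma_0 + (n-1)S + \frac{n\kappa}{n+\kappa}(\bar{x}-\mu_0)(\bar{x}-\mu_0)^T,\; n+m\big)\) (hint: \((n-1)S \sim W(\Sigma, n-1)\)) (20 points)<br />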
Solution:<br />
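The class likelihoods of the samples follow \(x \mid \mu, \Sigma \sim N(\mu, \Sigma)\), with \(\mu \mid \Sigma \sim N(\mu_0, \Sigma/\kappa)\) and \(\Sigma \sim W^{-1}(\Sigma_0, m)\). By Bayes' rule,<br />
\[ P(\mu, \Sigma \mid D) = \frac{P(D \mid \mu, \Sigma)\,P(\mu \mid \Sigma)\,P(\Sigma)}{P(D)} \tag{1} \]<br />
where D represents the data. The sample mean and sample covariance, \((\bar{x}, S)\), are sufficient statistics for multivariate Normally distributed data, so conditioning on D is equivalent to conditioning on \((\bar{x}, S)\).<br />
Here \(\bar{x} \sim N(\mu, \Sigma/n)\), and we use \(A = (n-1)S\) in the probability equations. The joint probability of \(\mu\), \(\Sigma\), \(\bar{x}\) and \(A\) is:<br />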
\[ \begin{aligned}
P(\mu, \Sigma, \bar{x}, A) ={}& \frac{1}{(2\pi)^{d/2}}\,|\Sigma/\kappa|^{-1/2} \exp\Big(-\tfrac{1}{2}(\mu-\mu_0)^T(\Sigma/\kappa)^{-1}(\mu-\mu_0)\Big) \\
&\times \frac{|\Sigma_0|^{\frac{1}{2}m}\,|\Sigma|^{-\frac{1}{2}(m+d+1)}}{2^{\frac{1}{2}md}\,\Gamma_d(\tfrac{1}{2}m)} \exp\Big(-\tfrac{1}{2}\operatorname{tr}\Sigma_0\Sigma^{-1}\Big) \\
&\times \frac{1}{(2\pi)^{d/2}}\,|\Sigma/n|^{-1/2} \exp\Big(-\tfrac{1}{2}(\bar{x}-\mu)^T(\Sigma/n)^{-1}(\bar{x}-\mu)\Big) \\
&\times \frac{|A|^{\frac{1}{2}(n-d-2)}\,|\Sigma|^{-\frac{1}{2}(n-1)}}{2^{\frac{1}{2}d(n-1)}\,\Gamma_d(\tfrac{1}{2}(n-1))} \exp\Big(-\tfrac{1}{2}\operatorname{tr}A\Sigma^{-1}\Big) \\
={}& \frac{\kappa^{d/2}n^{d/2}\,|\Sigma_0|^{\frac{1}{2}m}\,|A|^{\frac{1}{2}(n-d-2)}\,|\Sigma|^{-\frac{1}{2}(n+m+d+2)}}{2^{\frac{1}{2}d(n+m+1)}\,\pi^{d}\,\Gamma_d(\tfrac{1}{2}m)\,\Gamma_d(\tfrac{1}{2}(n-1))} \exp\Big(-\tfrac{1}{2}T_1\Big)
\end{aligned} \tag{2} \]<br />
where the term \(T_1\) inside the exponential is manipulated as follows:<br />
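\[ \begin{aligned}
T_1 &= \kappa(\mu-\mu_0)^T\Sigma^{-1}(\mu-\mu_0) + n(\bar{x}-\mu)^T\Sigma^{-1}(\bar{x}-\mu) + \operatorname{tr}(\Sigma_0+A)\Sigma^{-1} \\
&= (\kappa+n)\mu^T\Sigma^{-1}\mu - 2(\kappa\mu_0+n\bar{x})^T\Sigma^{-1}\mu + \underbrace{\kappa\mu_0^T\Sigma^{-1}\mu_0 + n\bar{x}^T\Sigma^{-1}\bar{x}}_{T_2} + \operatorname{tr}(\Sigma_0+A)\Sigma^{-1}
\end{aligned} \tag{3} \]<br />
We multiply and divide the terms involving \(\mu\), \(\mu_0\) and \(\bar{x}\) by \(\kappa+n\) and complete the quadratic form as follows:<br />
\[ T_1 = (\kappa+n)\Bigg[\underbrace{\Big(\mu - \frac{\kappa\mu_0+n\bar{x}}{\kappa+n}\Big)^T\Sigma^{-1}\Big(\mu - \frac{\kappa\mu_0+n\bar{x}}{\kappa+n}\Big)}_{Q_1} + \Delta\Bigg] + \operatorname{tr}(\Sigma_0+A)\Sigma^{-1} \tag{4} \]<br />
The term denoted by \(\Delta\) collects what remains after forming \(Q_1\) in (4) and will be defined after we expand \(Q_1\) as follows:<br />
\[ Q_1 = \mu^T\Sigma^{-1}\mu - \frac{2}{\kappa+n}\,\mu^T\Sigma^{-1}(\kappa\mu_0+n\bar{x}) + \underbrace{\frac{1}{(\kappa+n)^2}(\kappa\mu_0+n\bar{x})^T\Sigma^{-1}(\kappa\mu_0+n\bar{x})}_{T_3} \tag{5} \]<br />
Since the bracket in (4) is multiplied by \(\kappa+n\), \(\Delta\) equals \(T_2\) from (3) divided by \(\kappa+n\), minus \(T_3\) from (5):<br />
\[ \begin{aligned}
\Delta = \frac{T_2}{\kappa+n} - T_3 &= \frac{\kappa\mu_0^T\Sigma^{-1}\mu_0\big((\kappa+n)-\kappa\big) - 2n\kappa\,\mu_0^T\Sigma^{-1}\bar{x} + n\bar{x}^T\Sigma^{-1}\bar{x}\big((\kappa+n)-n\big)}{(\kappa+n)^2} \\
&= \frac{n\kappa}{(\kappa+n)^2}(\mu_0-\bar{x})^T\Sigma^{-1}(\mu_0-\bar{x})
\end{aligned} \]<br />
Taking \(\Delta\) outside the brackets in (4), multiplied by \((\kappa+n)\), \(T_1\) becomes:<br />
\[ T_1 = (\kappa+n)\Big(\mu - \frac{\kappa\mu_0+n\bar{x}}{\kappa+n}\Big)^T\Sigma^{-1}\Big(\mu - \frac{\kappa\mu_0+n\bar{x}}{\kappa+n}\Big) + \frac{n\kappa}{\kappa+n}(\mu_0-\bar{x})^T\Sigma^{-1}(\mu_0-\bar{x}) + \operatorname{tr}(\Sigma_0+A)\Sigma^{-1} \tag{6} \]<br />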
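The completed-square form of T1 can be checked numerically against its original definition. The sketch below does this for the scalar case d = 1 only, where Σ, Σ0 and A reduce to positive numbers (named `s`, `sigma0` and `a` here); all numeric values are arbitrary and chosen purely for illustration.

```python
import math

def t1_direct(mu, mu0, xbar, kappa, n, s, sigma0, a):
    """T1 as first defined: two quadratic forms plus the trace term (d = 1)."""
    return (kappa * (mu - mu0) ** 2 + n * (xbar - mu) ** 2 + sigma0 + a) / s

def t1_completed(mu, mu0, xbar, kappa, n, s, sigma0, a):
    """T1 after completing the square in mu (d = 1)."""
    mu_hat = (kappa * mu0 + n * xbar) / (kappa + n)
    quad_mu = (kappa + n) * (mu - mu_hat) ** 2
    quad_prior = n * kappa / (kappa + n) * (mu0 - xbar) ** 2
    return (quad_mu + quad_prior + sigma0 + a) / s

# Arbitrary illustration values (not taken from the assignment).
args = dict(mu=0.7, mu0=-1.2, xbar=2.5, kappa=0.1, n=12, s=3.0, sigma0=8.0, a=5.5)
direct = t1_direct(**args)
completed = t1_completed(**args)
print(direct, completed)
```

Both functions should return the same value for any admissible inputs, since the second is only an algebraic rearrangement of the first.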
We insert (6) into (2) and obtain \(P(\mu, \Sigma, \bar{x}, A)\). In order to obtain \(P(\bar{x}, A) = \iint P(\mu, \Sigma, \bar{x}, A)\,d\mu\,d\Sigma\), we<br />
rewrite the joint in a form suitable for integrating out \(\mu\) and \(\Sigma\), respectively:<br />
\[ \begin{aligned}
P(\bar{x}, A) = \int & \frac{\kappa^{d/2}n^{d/2}\,|\Sigma_0|^{m/2}\,|A|^{\frac{1}{2}(n-d-2)}\,|\Sigma|^{-\frac{1}{2}(n+m+d+1)}}{2^{\frac{1}{2}d(n+m)}\,\pi^{d/2}\,\Gamma_d(\tfrac{1}{2}m)\,\Gamma_d(\tfrac{1}{2}(n-1))} \exp\Big(-\tfrac{1}{2}\Big[\frac{n\kappa}{\kappa+n}(\mu_0-\bar{x})^T\Sigma^{-1}(\mu_0-\bar{x}) + \operatorname{tr}(\Sigma_0+A)\Sigma^{-1}\Big]\Big) \\
&\times \frac{1}{(\kappa+n)^{d/2}} \underbrace{\int \frac{(\kappa+n)^{d/2}\,|\Sigma|^{-1/2}}{(2\pi)^{d/2}} \exp\Big(-\tfrac{1}{2}\Big(\mu - \frac{\kappa\mu_0+n\bar{x}}{\kappa+n}\Big)^T\Big(\frac{\Sigma}{\kappa+n}\Big)^{-1}\Big(\mu - \frac{\kappa\mu_0+n\bar{x}}{\kappa+n}\Big)\Big)\,d\mu}_{=1}\; d\Sigma
\end{aligned} \]<br />
The integrand w.r.t. \(\mu\) is a multivariate Normal density and integrates to 1.<br />
We express the remaining terms to handle the integration w.r.t. \(\Sigma\) as follows:<br />
\[ \begin{aligned}
={}& \frac{\kappa^{d/2}n^{d/2}\,|\Sigma_0|^{m/2}\,|A|^{\frac{1}{2}(n-d-2)}}{(n+\kappa)^{d/2}\,\pi^{d/2}\,\Gamma_d(\tfrac{1}{2}m)\,\Gamma_d(\tfrac{1}{2}(n-1))} \times \frac{\Gamma_d(\tfrac{1}{2}(n+m))}{|\Sigma_0+A+Q_2|^{\frac{1}{2}(n+m)}} \\
&\times \underbrace{\int \frac{|\Sigma_0+A+Q_2|^{\frac{1}{2}(n+m)}\,|\Sigma|^{-\frac{1}{2}(n+m+d+1)}}{\Gamma_d(\tfrac{1}{2}(n+m))\,2^{\frac{1}{2}d(n+m)}} \exp\Big(-\tfrac{1}{2}\operatorname{tr}(\Sigma_0+A+Q_2)\Sigma^{-1}\Big)\,d\Sigma}_{=1}
\end{aligned} \tag{7} \]<br />
where \(Q_2 = \frac{n\kappa}{n+\kappa}(\mu_0-\bar{x})(\mu_0-\bar{x})^T\).<br />
The integrand w.r.t. \(\Sigma\) is an Inverse-Wishart density, so the integral equals 1 and<br />
\[ P(\bar{x}, A) = \frac{\kappa^{d/2}n^{d/2}\,|\Sigma_0|^{m/2}\,|A|^{\frac{1}{2}(n-d-2)}}{(n+\kappa)^{d/2}\,\pi^{d/2}\,\Gamma_d(\tfrac{1}{2}m)\,\Gamma_d(\tfrac{1}{2}(n-1))} \times \frac{\Gamma_d(\tfrac{1}{2}(n+m))}{\big|\Sigma_0+A+\frac{n\kappa}{n+\kappa}(\mu_0-\bar{x})(\mu_0-\bar{x})^T\big|^{\frac{1}{2}(n+m)}} \tag{8} \]<br />
Having obtained \(P(\mu, \Sigma, \bar{x}, A)\) and \(P(\bar{x}, A)\), the posterior \(P(\mu, \Sigma \mid \bar{x}, A)\) is obtained as follows:<br />
\[ P(\mu, \Sigma \mid \bar{x}, A) = \frac{P(\mu, \Sigma, \bar{x}, A)}{P(\bar{x}, A)} \tag{9} \]<br />
We cancel out the common terms and the remaining equation is:<br />
\[ \begin{aligned}
={}& \frac{\big|\Sigma_0+A+\frac{n\kappa}{n+\kappa}(\mu_0-\bar{x})(\mu_0-\bar{x})^T\big|^{\frac{1}{2}(n+m)}\,|\Sigma|^{-\frac{1}{2}(n+m+d+2)}\,(n+\kappa)^{d/2}}{2^{\frac{1}{2}d(n+m+1)}\,\pi^{d/2}\,\Gamma_d(\tfrac{1}{2}(n+m))} \\
&\times \exp\Big(-\tfrac{1}{2}\Big[(\kappa+n)\Big(\mu - \frac{\kappa\mu_0+n\bar{x}}{\kappa+n}\Big)^T\Sigma^{-1}\Big(\mu - \frac{\kappa\mu_0+n\bar{x}}{\kappa+n}\Big) + \operatorname{tr}\Big(\Sigma_0+A+\frac{n\kappa}{\kappa+n}(\mu_0-\bar{x})(\mu_0-\bar{x})^T\Big)\Sigma^{-1}\Big]\Big)
\end{aligned} \tag{10} \]<br />
\[ \begin{aligned}
={}& \frac{1}{(2\pi)^{d/2}}\Big|\frac{\Sigma}{n+\kappa}\Big|^{-1/2} \exp\Big(-\tfrac{1}{2}\Big(\mu - \frac{\kappa\mu_0+n\bar{x}}{\kappa+n}\Big)^T\Big(\frac{\Sigma}{\kappa+n}\Big)^{-1}\Big(\mu - \frac{\kappa\mu_0+n\bar{x}}{\kappa+n}\Big)\Big) \\
&\times \frac{\big|\Sigma_0+A+\frac{n\kappa}{n+\kappa}(\mu_0-\bar{x})(\mu_0-\bar{x})^T\big|^{\frac{1}{2}(n+m)}\,|\Sigma|^{-\frac{1}{2}(n+m+d+1)}}{2^{\frac{1}{2}d(n+m)}\,\Gamma_d(\tfrac{1}{2}(n+m))} \exp\Big(-\tfrac{1}{2}\operatorname{tr}\Big(\Sigma_0+A+\frac{n\kappa}{\kappa+n}(\mu_0-\bar{x})(\mu_0-\bar{x})^T\Big)\Sigma^{-1}\Big)
\end{aligned} \tag{11} \]<br />
Recognizing that the first factor in (11) is a multivariate Normal density and the second factor is an Inverse-Wishart density<br />
with the respective parameters, the posterior \(P(\mu, \Sigma \mid \bar{x}, A)\) turns out to be:<br />
\[ P(\mu, \Sigma \mid \bar{x}, A) = N\Big(\mu \,\Big|\, \frac{1}{\kappa+n}(\kappa\mu_0+n\bar{x}),\; \frac{1}{\kappa+n}\Sigma\Big) \times W^{-1}\Big(\Sigma \,\Big|\, \Sigma_0+A+\frac{n\kappa}{\kappa+n}(\mu_0-\bar{x})(\mu_0-\bar{x})^T,\; n+m\Big) \tag{12} \]<br />
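The posterior parameters in (12) are simple functions of the data. The sketch below computes them with plain Python lists; the function name `posterior_params` and the three-point data set are hypothetical and only illustrate the update, they are not taken from the course files.

```python
def posterior_params(X, mu0, Sigma0, kappa, m):
    """Return (mu_hat, kappa_n, B, nu_n) of the Normal x Inverse-Wishart posterior.

    X is a list of d-dimensional samples, Sigma0 a d x d list of lists.
    """
    n, d = len(X), len(mu0)
    xbar = [sum(x[j] for x in X) / n for j in range(d)]
    # Scatter matrix A = sum_i (x_i - xbar)(x_i - xbar)^T, i.e. (n - 1) S.
    A = [[sum((x[i] - xbar[i]) * (x[j] - xbar[j]) for x in X)
          for j in range(d)] for i in range(d)]
    # Posterior mean of mu: weighted average of prior mean and sample mean.
    mu_hat = [(kappa * mu0[j] + n * xbar[j]) / (kappa + n) for j in range(d)]
    c = n * kappa / (n + kappa)
    # Inverse-Wishart scale: Sigma0 + A + c (mu0 - xbar)(mu0 - xbar)^T.
    B = [[Sigma0[i][j] + A[i][j] + c * (mu0[i] - xbar[i]) * (mu0[j] - xbar[j])
          for j in range(d)] for i in range(d)]
    return mu_hat, kappa + n, B, n + m

# Hypothetical toy data, with the Part B hyperparameters used as an example.
X = [[1.0, 2.0], [3.0, 0.0], [2.0, 4.0]]
mu_hat, kappa_n, B, nu_n = posterior_params(
    X, mu0=[0.0, 0.0], Sigma0=[[8.0, 0.0], [0.0, 8.0]], kappa=0.1, m=20)
print(mu_hat, kappa_n, B, nu_n)
```

With a small κ, the posterior mean stays close to the sample mean, while Σ0 and m dominate the scale matrix when n is small.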
(c) Show that \(p(x^* \mid \bar{x}, S) = \text{stu-t}\Big(\frac{\kappa\mu_0+n\bar{x}}{n+\kappa},\; \frac{n+\kappa+1}{(n+\kappa)(n+m+1-d)}\Big[\Sigma_0 + (n-1)S + \frac{n\kappa}{\kappa+n}(\mu_0-\bar{x})(\mu_0-\bar{x})^T\Big],\; n+m+1-d\Big)\) (hint: \(|A + xx^T| = |A|(1 + x^TA^{-1}x)\)) (20 points)<br />
Solution:<br />
Having obtained the posterior for \(\mu\) and \(\Sigma\), the predictive distribution for a new sample \(x^*\) is obtained as<br />
follows:<br />
\[ P(x^* \mid \bar{x}, A) = \iint P(x^* \mid \mu, \Sigma)\,P(\mu, \Sigma \mid \bar{x}, A)\,d\mu\,d\Sigma \tag{13} \]<br />
In rewriting \(P(\mu, \Sigma \mid \bar{x}, A)\) we will replace \(\frac{1}{n+\kappa}(\kappa\mu_0+n\bar{x})\) with \(\hat{\mu}\) and \(\Sigma_0+A+\frac{n\kappa}{\kappa+n}(\mu_0-\bar{x})(\mu_0-\bar{x})^T\)<br />
with \(B\):<br />
\[ \begin{aligned}
= \iint & (2\pi)^{-d/2}\,|\Sigma|^{-1/2} \exp\Big(-\tfrac{1}{2}(x^*-\mu)^T\Sigma^{-1}(x^*-\mu)\Big) \times \frac{|B|^{\frac{1}{2}(n+m)}\,|\Sigma|^{-\frac{1}{2}(n+m+d+2)}\,(n+\kappa)^{d/2}}{(2\pi)^{d/2}\,2^{\frac{1}{2}d(n+m)}\,\Gamma_d(\tfrac{1}{2}(n+m))} \\
&\times \exp\Big(-\tfrac{1}{2}\Big[(n+\kappa)(\mu-\hat{\mu})^T\Sigma^{-1}(\mu-\hat{\mu}) + \operatorname{tr}B\Sigma^{-1}\Big]\Big)\,d\mu\,d\Sigma
\end{aligned} \tag{14} \]<br />
\[ \begin{aligned}
= \int & \frac{|B|^{\frac{1}{2}(n+m)}\,|\Sigma|^{-\frac{1}{2}(n+m+d+1)}}{2^{\frac{1}{2}d(n+m)}\,\Gamma_d(\tfrac{1}{2}(n+m))} \exp\Big(-\tfrac{1}{2}\operatorname{tr}B\Sigma^{-1}\Big) \\
&\times \int \underbrace{(2\pi)^{-d/2}|\Sigma|^{-1/2}\,(2\pi)^{-d/2}|\Sigma/(n+\kappa)|^{-1/2} \exp\Big(-\tfrac{1}{2}\Big[(x^*-\mu)^T\Sigma^{-1}(x^*-\mu) + (\mu-\hat{\mu})^T(\Sigma/(n+\kappa))^{-1}(\mu-\hat{\mu})\Big]\Big)}_{P(x^*,\,\mu \mid \Sigma)}\,d\mu\; d\Sigma
\end{aligned} \tag{15} \]<br />
Marginalizing the jointly Normal \(P(x^*, \mu \mid \Sigma)\) in (15) w.r.t. \(\mu\) gives another multivariate<br />
Normal with updated parameters as follows:<br />
\[ P(x^* \mid \Sigma) \sim N\Big(\hat{\mu},\; \frac{\Sigma}{n+\kappa} + \Sigma\Big) = (2\pi)^{-d/2}\,\Big|\frac{\Sigma}{n+\kappa} + \Sigma\Big|^{-1/2} \exp\Big(-\tfrac{1}{2}\,\frac{n+\kappa}{n+\kappa+1}(x^*-\hat{\mu})^T\Sigma^{-1}(x^*-\hat{\mu})\Big) \tag{16} \]<br />
We continue by incorporating (16) into (15):<br />
\[ \begin{aligned}
={}& \int \frac{|B|^{\frac{1}{2}(n+m)}\,|\Sigma|^{-\frac{1}{2}(n+m+d+2)}\,(n+\kappa)^{d/2}}{2^{\frac{1}{2}d(n+m+1)}\,\pi^{d/2}\,\Gamma_d(\tfrac{1}{2}(n+m))\,(n+\kappa+1)^{d/2}} \exp\Big(-\tfrac{1}{2}\operatorname{tr}\Big(B + \frac{n+\kappa}{n+\kappa+1}(x^*-\hat{\mu})(x^*-\hat{\mu})^T\Big)\Sigma^{-1}\Big)\,d\Sigma \\
={}& \frac{|B|^{\frac{1}{2}(n+m)}\,(n+\kappa)^{d/2}}{\pi^{d/2}\,\Gamma_d(\tfrac{1}{2}(n+m))\,(n+\kappa+1)^{d/2}} \times \frac{\Gamma_d(\tfrac{1}{2}(n+m+1))}{\underbrace{\Big|B + \frac{n+\kappa}{n+\kappa+1}(x^*-\hat{\mu})(x^*-\hat{\mu})^T\Big|}_{TT}{}^{\frac{1}{2}(n+m+1)}} \\
&\times \underbrace{\int \frac{|\Sigma|^{-\frac{1}{2}(n+(m+1)+d+1)}\,\Big|B + \frac{n+\kappa}{n+\kappa+1}(x^*-\hat{\mu})(x^*-\hat{\mu})^T\Big|^{\frac{1}{2}(n+(m+1))}}{2^{\frac{1}{2}d(n+m+1)}\,\Gamma_d(\tfrac{1}{2}(n+m+1))} \exp\Big(-\tfrac{1}{2}\operatorname{tr}\Big(B + \frac{n+\kappa}{n+\kappa+1}(x^*-\hat{\mu})(x^*-\hat{\mu})^T\Big)\Sigma^{-1}\Big)\,d\Sigma}_{=1}
\end{aligned} \]<br />
The integration w.r.t. \(\Sigma\) above equals 1 since it is the integral of an Inverse-Wishart density.<br />
By Corollary A.3.1 in [Anderson], \(|B + xx^T| = |B|(1 + x^TB^{-1}x)\). So the term TT above becomes<br />
\(|B|\big(1 + \frac{n+\kappa}{n+\kappa+1}(x^*-\hat{\mu})^TB^{-1}(x^*-\hat{\mu})\big)\), and we finalize the predictive distribution, \(P(x^* \mid \bar{x}, A)\), as:<br />
\[ P(x^* \mid \bar{x}, A) = \frac{\Gamma_d(\tfrac{1}{2}(n+m+1))\,(n+\kappa)^{d/2}\,|B|^{-1/2}}{\pi^{d/2}\,\Gamma_d(\tfrac{1}{2}(n+m))\,(n+\kappa+1)^{d/2}\,\Big[1 + \frac{n+\kappa}{n+\kappa+1}(x^*-\hat{\mu})^TB^{-1}(x^*-\hat{\mu})\Big]^{\frac{1}{2}(n+m+1)}} \tag{17} \]<br />
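The determinant identity |B + xx^T| = |B|(1 + x^T B^{-1} x) used for the term TT can be verified on a small example. The following sketch uses a hand-picked 2 × 2 positive definite matrix and vector (both hypothetical) and closed-form 2 × 2 determinant and inverse, so it needs no external libraries.

```python
def det2(M):
    """Determinant of a 2 x 2 matrix given as a list of lists."""
    return M[0][0] * M[1][1] - M[0][1] * M[1][0]

def inv2(M):
    """Inverse of a 2 x 2 matrix via the adjugate formula."""
    det = det2(M)
    return [[M[1][1] / det, -M[0][1] / det],
            [-M[1][0] / det, M[0][0] / det]]

# Hypothetical illustration values.
B = [[4.0, 1.0], [1.0, 3.0]]
x = [0.5, -2.0]

# Left-hand side: determinant of the rank-one update B + x x^T.
updated = [[B[i][j] + x[i] * x[j] for j in range(2)] for i in range(2)]
lhs = det2(updated)

# Right-hand side: |B| (1 + x^T B^{-1} x).
B_inv = inv2(B)
quad = sum(x[i] * B_inv[i][j] * x[j] for i in range(2) for j in range(2))
rhs = det2(B) * (1.0 + quad)
print(lhs, rhs)
```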
There are two identities for the multivariate gamma function:<br />
\[ \Gamma_d(\alpha) = \pi^{(d-1)/2}\,\Gamma_{d-1}(\alpha)\,\Gamma(\alpha + (1-d)/2) = \pi^{(d-1)/2}\,\Gamma(\alpha)\,\Gamma_{d-1}(\alpha - 1/2) \]<br />
Replacing \(\Gamma_d(\tfrac{1}{2}(n+m+1))\) in the numerator of (17) with the latter and \(\Gamma_d(\tfrac{1}{2}(n+m))\) in the denominator<br />
with the former, so that the \(\Gamma_{d-1}\) factors match, the R.H.S. of (17) becomes:<br />
\[ = \frac{\pi^{(d-1)/2}\,\Gamma_{d-1}(\tfrac{n+m}{2})\,\Gamma(\tfrac{n+m+1}{2})\,(n+\kappa)^{d/2}\,|B|^{-1/2}}{\pi^{(d-1)/2}\,\Gamma_{d-1}(\tfrac{n+m}{2})\,\Gamma(\tfrac{n+m+1-d}{2})\,\pi^{d/2}\,(n+\kappa+1)^{d/2}\,\Big[1 + \frac{n+\kappa}{n+\kappa+1}(x^*-\hat{\mu})^TB^{-1}(x^*-\hat{\mu})\Big]^{\frac{1}{2}(n+m+1)}} \]<br />
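The simplification of the multivariate gamma ratio can be checked numerically in log space. This sketch assumes the standard product definition of the multivariate gamma function and uses `math.lgamma`; the function names and the test values of n, m and d are illustrative only.

```python
import math

def log_mv_gamma(alpha, d):
    """log Gamma_d(alpha) = d(d-1)/4 * log(pi) + sum_{j=1..d} lgamma(alpha + (1-j)/2)."""
    return (d * (d - 1) / 4.0 * math.log(math.pi)
            + sum(math.lgamma(alpha + (1 - j) / 2.0) for j in range(1, d + 1)))

def log_ratio_direct(n, m, d):
    """log of Gamma_d((n+m+1)/2) / Gamma_d((n+m)/2), computed from the definition."""
    return log_mv_gamma((n + m + 1) / 2.0, d) - log_mv_gamma((n + m) / 2.0, d)

def log_ratio_simplified(n, m, d):
    """log of Gamma((n+m+1)/2) / Gamma((n+m+1-d)/2), the form after cancellation."""
    return math.lgamma((n + m + 1) / 2.0) - math.lgamma((n + m + 1 - d) / 2.0)

print(log_ratio_direct(10, 20, 3), log_ratio_simplified(10, 20, 3))
```

The two values agree whenever n + m + 1 − d is positive, which is also the condition for the degrees of freedom of the resulting Student-t to be positive.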
Cancelling out the common terms, dividing and multiplying the above equation by \((n+m+1-d)^{d/2}\),<br />
and rearranging the terms in the denominator according to the definition of the multivariate Student-t<br />
distribution, we obtain:<br />
\[ P(x^* \mid \bar{x}, A) = \frac{\Gamma(\tfrac{\nu+d}{2})}{\Gamma(\tfrac{\nu}{2})\,\pi^{d/2}\,\nu^{d/2}\,|B\tau|^{1/2}\,\Big[1 + \frac{1}{\nu}(x^*-\hat{\mu})^T(B\tau)^{-1}(x^*-\hat{\mu})\Big]^{\frac{1}{2}(\nu+d)}} \tag{18} \]<br />
where \(B = \Sigma_0+A+\frac{n\kappa}{\kappa+n}(\mu_0-\bar{x})(\mu_0-\bar{x})^T\), \(\hat{\mu} = \frac{\kappa\mu_0+n\bar{x}}{n+\kappa}\), \(\tau = \frac{n+\kappa+1}{\nu(n+\kappa)}\), \(\nu = n+m+1-d\). Hence<br />
\[ P(x^* \mid \bar{x}, A) = \text{stu-t}(\hat{\mu}, B\tau, \nu) \tag{19} \]<br />
where the first parameter is the location vector, the second one is the positive definite scale matrix and<br />
the third one is the degrees of freedom for the multivariate t distribution.<br />
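To use the predictive distribution in practice, one needs the log-density of a multivariate Student-t with a given location, scale matrix and degrees of freedom. The sketch below is one possible standard-library implementation based on a Cholesky factorization; the function names are hypothetical, and the caller is assumed to pass the location µ̂, the scale Bτ and the degrees of freedom ν from the derivation above.

```python
import math

def cholesky(M):
    """Lower-triangular L with L L^T = M, for a symmetric positive definite M."""
    d = len(M)
    L = [[0.0] * d for _ in range(d)]
    for i in range(d):
        for j in range(i + 1):
            s = M[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return L

def student_t_logpdf(x, loc, scale, nu):
    """Log-density of a d-dimensional Student-t(loc, scale, nu) at x."""
    d = len(x)
    L = cholesky(scale)
    # Forward substitution solves L z = x - loc, so z.z is the quadratic form
    # (x - loc)^T scale^{-1} (x - loc).
    z = []
    for i in range(d):
        z.append((x[i] - loc[i] - sum(L[i][k] * z[k] for k in range(i))) / L[i][i])
    quad = sum(v * v for v in z)
    log_det = 2.0 * sum(math.log(L[i][i]) for i in range(d))
    return (math.lgamma((nu + d) / 2.0) - math.lgamma(nu / 2.0)
            - d / 2.0 * math.log(nu * math.pi) - 0.5 * log_det
            - (nu + d) / 2.0 * math.log1p(quad / nu))

# With d = 1, scale 1 and nu = 1 the density is the standard Cauchy, 1 / (pi (1 + x^2)).
print(math.exp(student_t_logpdf([0.0], [0.0], [[1.0]], 1.0)))
```

Working in log space avoids underflow when the density is evaluated far from the location, which matters when class scores are compared in a classifier.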
Part B (50 points) Download MultiClassGaussianDataset.mat from the course web page.<br />
(a) Implement the maximum a posteriori (MAP) classifier using the standard likelihood model and evaluate<br />
the accuracy of the classifier on the test data (15 points)
Please see Assignment1SolutionPartBa.m.<br />
(b) Implement the maximum a posteriori (MAP) classifier using the predictive likelihood model you<br />
derived in Part A and evaluate the accuracy of the classifier on the test data (30 points)<br />
Please see Assignment1SolutionPartBb.m.<br />
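Parts (a) and (b) share the same MAP decision rule and differ only in the class-conditional density that is plugged in. The sketch below illustrates that rule in Python rather than in the MATLAB of the referenced solution files, and it is not a transcription of them: the toy one-dimensional data, the function names, and the use of a plug-in Gaussian likelihood (the part (a) style model) are all assumptions made for illustration.

```python
import math

def gaussian_logpdf_1d(x, mean, var):
    """Log-density of a univariate Normal with the given mean and variance."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)

def map_classify(x, log_priors, log_likelihoods):
    """Return the label maximizing log prior + log class-conditional density."""
    return max(log_priors, key=lambda k: log_priors[k] + log_likelihoods[k](x))

def plug_in_likelihood(values):
    """Standard likelihood model: Normal with the sample mean and sample variance."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / (len(values) - 1)
    return lambda x: gaussian_logpdf_1d(x, mean, var)

# Hypothetical two-class, one-dimensional training set (not the course dataset).
train = {"a": [0.1, -0.4, 0.3, 0.2], "b": [2.9, 3.4, 3.1]}
n_total = sum(len(v) for v in train.values())
log_priors = {k: math.log(len(v) / n_total) for k, v in train.items()}
log_likelihoods = {k: plug_in_likelihood(v) for k, v in train.items()}

label_near_a = map_classify(0.0, log_priors, log_likelihoods)
label_near_b = map_classify(3.0, log_priors, log_likelihoods)
print(label_near_a, label_near_b)
```

Swapping `plug_in_likelihood` for a function that returns the Student-t predictive log-density of each class would give the part (b) style classifier without changing `map_classify`.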
(c) Discuss results from part a and b (5 points)<br />
The accuracy is 46% for part (a) and 61% for part (b). Having access to an informative prior model improves<br />
classification accuracy, and the improvement grows as the number of training<br />
samples per class decreases.<br />
For Part B please use the following model parameters: \(\Sigma_0 = 8I\) (\(I\) is the identity matrix), \(m = 20\),<br />
\(\kappa = 0.1\), \(\mu_0 = [0, \ldots, 0]^T\).