
Assignment 1 Solutions

Instructor: Murat Dundar

Date: 2/18/2013

In this homework you will derive the predictive distribution for the Normal × Normal × Inverse-Wishart model and use it in a K-class classification problem to demonstrate its benefits over the standard data distribution.

Part A (50 points)

Data Model: x ∼ N(µ, Σ), µ ∼ N(µ0, κ^{-1}Σ), Σ ∼ W^{-1}(Σ0, m).

Given a data set D = {x1, . . . , xn}, xi ∈ R^d, generated by this model, let the sample mean be x̄ = (1/n) ∑_{i=1}^n xi and the sample covariance matrix be S = (1/(n − 1)) ∑_{i=1}^n (xi − x̄)(xi − x̄)^T (so that (n − 1)S is the scatter matrix).

(a) Show that x̄ ∼ N(µ, n^{-1}Σ) (10 points)

Solution:

Note that x̄ is a weighted sum of n i.i.d. Normally distributed random variables, so x̄ itself is also Normally distributed. The mean of this distribution is

E{x̄} = (1/n) ∑_{i=1}^n E{xi} = (1/n) nµ = µ.

Its covariance is

E{(x̄ − µ)(x̄ − µ)^T} = E{((1/n) ∑_i xi − µ)((1/n) ∑_i xi − µ)^T} = (1/n^2) E{(∑_i (xi − µ))(∑_i (xi − µ))^T} = (1/n^2) ∑_i E{(xi − µ)(xi − µ)^T} = (1/n^2) nΣ = Σ/n,

where the cross terms vanish because the xi are independent.
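As a quick numerical sanity check (not part of the original solution), a short Monte Carlo sketch in Python/NumPy, with an arbitrary illustrative µ and Σ, confirms that the sample mean has covariance Σ/n:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, trials = 3, 5, 200_000
mu = np.array([1.0, -2.0, 0.5])
L = np.array([[1.0, 0.0, 0.0],
              [0.5, 1.0, 0.0],
              [-0.3, 0.2, 1.0]])
Sigma = L @ L.T                      # an arbitrary SPD covariance

# Draw `trials` data sets of size n and compute each sample mean
x = mu + rng.standard_normal((trials, n, d)) @ L.T
xbar = x.mean(axis=1)

# The empirical covariance of the sample means should approach Sigma / n
emp_cov = np.cov(xbar.T)
print(np.max(np.abs(emp_cov - Sigma / n)))  # small
```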

(b) Show that p(µ, Σ|x̄, S) = N((nx̄ + κµ0)/(n + κ), Σ/(n + κ)) × W^{-1}(Σ0 + (n − 1)S + (nκ/(n + κ))(x̄ − µ0)(x̄ − µ0)^T, n + m) (hint: (n − 1)S ∼ W(Σ, n − 1)) (20 points)

Solution:

The class likelihood of samples is x|µ, Σ ∼ N(µ, Σ), with priors Σ ∼ W^{-1}(Σ0, m) and µ|Σ ∼ N(µ0, Σ/κ). By Bayes' rule,

P(µ, Σ|D) = P(D|µ, Σ) P(µ|Σ) P(Σ) / P(D)   (1)

where D represents the data. The sample mean and sample covariance, (x̄, S), are the sufficient statistics for multivariate Normally distributed data, so conditioning on D is equivalent to conditioning on (x̄, S). Here x̄ ∼ N(µ, Σ/n), and we will use A = (n − 1)S in the probability equations. The joint probability of µ, Σ, x̄ and A is:

P(µ, Σ, x̄, A) = (2π)^{-d/2} |Σ/κ|^{-1/2} exp(-(1/2)(µ − µ0)^T (Σ/κ)^{-1}(µ − µ0))
  × [|Σ0|^{m/2} |Σ|^{-(m+d+1)/2} / (2^{md/2} Γ_d(m/2))] exp(-(1/2) tr(Σ0 Σ^{-1}))
  × (2π)^{-d/2} |Σ/n|^{-1/2} exp(-(1/2)(x̄ − µ)^T (Σ/n)^{-1}(x̄ − µ))
  × [|A|^{(n−d−2)/2} |Σ|^{-(n−1)/2} / (2^{d(n−1)/2} Γ_d((n−1)/2))] exp(-(1/2) tr(A Σ^{-1}))

= [κ^{d/2} n^{d/2} |Σ0|^{m/2} |A|^{(n−d−2)/2} |Σ|^{-(n+m+d+2)/2} / (2^{d(n+m+1)/2} π^d Γ_d(m/2) Γ_d((n−1)/2))] exp(-(1/2) T1)   (2)

where the term T1 inside the exponential is manipulated as follows:


T1 = κ(µ − µ0)^T Σ^{-1}(µ − µ0) + n(x̄ − µ)^T Σ^{-1}(x̄ − µ) + tr((Σ0 + A)Σ^{-1})
   = (κ + n)µ^T Σ^{-1}µ − 2(κµ0 + nx̄)^T Σ^{-1}µ + κµ0^T Σ^{-1}µ0 + nx̄^T Σ^{-1}x̄ + tr((Σ0 + A)Σ^{-1})   (3)

where T2 denotes the sum κµ0^T Σ^{-1}µ0 + nx̄^T Σ^{-1}x̄. We multiply the above term by (κ + n)/(κ + n) and complete the quadratic form as follows:

T1 = (κ + n)[(µ − (κµ0 + nx̄)/(κ + n))^T Σ^{-1}(µ − (κµ0 + nx̄)/(κ + n)) + ∆] + tr((Σ0 + A)Σ^{-1})   (4)

where Q1 denotes the quadratic term inside the brackets. The term ∆ collects what remains after forming Q1 in (4) and is identified by expanding Q1:

Q1 = µ^T Σ^{-1}µ − (2/(κ + n)) µ^T Σ^{-1}(κµ0 + nx̄) + (1/(κ + n)^2)(κµ0 + nx̄)^T Σ^{-1}(κµ0 + nx̄)   (5)

where T3 denotes the last term. Now ∆ in (4) equals T2/(κ + n) from (3) minus T3 from (5):

∆ = T2/(κ + n) − T3
  = [κµ0^T Σ^{-1}µ0((κ + n) − κ) − 2nκ µ0^T Σ^{-1}x̄ + nx̄^T Σ^{-1}x̄((κ + n) − n)] / (κ + n)^2
  = (nκ/(κ + n)^2)(µ0 − x̄)^T Σ^{-1}(µ0 − x̄)

Taking ∆ outside the brackets in (4), with (κ + n) multiplying it, T1 becomes:

T1 = (κ + n)(µ − (κµ0 + nx̄)/(κ + n))^T Σ^{-1}(µ − (κµ0 + nx̄)/(κ + n)) + (nκ/(κ + n))(µ0 − x̄)^T Σ^{-1}(µ0 − x̄) + tr((Σ0 + A)Σ^{-1})   (6)
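The completing-the-square identity behind (3)–(6) can be verified numerically; the following Python sketch (my own check, with arbitrary random values) compares the two forms of T1:

```python
import numpy as np

rng = np.random.default_rng(1)
d, n, kappa = 4, 9, 0.7
A0 = rng.standard_normal((d, d))
Sigma = A0 @ A0.T + d * np.eye(d)        # random SPD covariance
Si = np.linalg.inv(Sigma)
mu = rng.standard_normal(d)
mu0 = rng.standard_normal(d)
xbar = rng.standard_normal(d)
Sigma0 = np.eye(d)
A = np.eye(d)  # stands in for (n-1)S; its value does not affect the identity

def quad(v, M):
    return v @ M @ v

# Left-hand side: T1 before completing the square, as in (3)
lhs = (kappa * quad(mu - mu0, Si) + n * quad(xbar - mu, Si)
       + np.trace((Sigma0 + A) @ Si))
# Right-hand side: the completed-square form (6)
muhat = (kappa * mu0 + n * xbar) / (kappa + n)
rhs = ((kappa + n) * quad(mu - muhat, Si)
       + n * kappa / (kappa + n) * quad(mu0 - xbar, Si)
       + np.trace((Sigma0 + A) @ Si))
print(abs(lhs - rhs))  # effectively zero
```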

We insert (6) into (2) and obtain P(µ, Σ, x̄, A). In order to obtain P(x̄, A) = ∫∫ P(µ, Σ, x̄, A) dµ dΣ we proceed as follows and turn it into a suitable form to integrate out µ and Σ, respectively:

P(x̄, A) = ∫ [κ^{d/2} n^{d/2} |Σ0|^{m/2} |A|^{(n−d−2)/2} |Σ|^{-(n+m+d+1)/2} / ((κ + n)^{d/2} 2^{d(n+m)/2} π^{d/2} Γ_d(m/2) Γ_d((n−1)/2))]
  × exp(-(1/2)[(nκ/(κ + n))(µ0 − x̄)^T Σ^{-1}(µ0 − x̄) + tr((Σ0 + A)Σ^{-1})])
  × [∫ ((κ + n)^{d/2} |Σ|^{-1/2} / (2π)^{d/2}) exp(-(1/2)(µ − (κµ0 + nx̄)/(κ + n))^T (Σ/(κ + n))^{-1}(µ − (κµ0 + nx̄)/(κ + n))) dµ] dΣ   (7)

The integration w.r.t. µ is of a multivariate Normal density and integrates to 1. We express the remaining terms to handle the integration w.r.t. Σ as follows:

= [κ^{d/2} n^{d/2} |Σ0|^{m/2} |A|^{(n−d−2)/2} / ((n + κ)^{d/2} π^{d/2} Γ_d(m/2) Γ_d((n−1)/2))] × [Γ_d((n+m)/2) / |Σ0 + A + Q2|^{(n+m)/2}]
  × ∫ [|Σ0 + A + Q2|^{(n+m)/2} |Σ|^{-(n+m+d+1)/2} / (2^{d(n+m)/2} Γ_d((n+m)/2))] exp(-(1/2) tr((Σ0 + A + Q2)Σ^{-1})) dΣ

where Q2 = (nκ/(n + κ))(µ0 − x̄)(µ0 − x̄)^T. The integration w.r.t. Σ is in Inverse-Wishart form and is therefore equal to 1, so

P(x̄, A) = [κ^{d/2} n^{d/2} |Σ0|^{m/2} |A|^{(n−d−2)/2} Γ_d((n+m)/2)] / [(n + κ)^{d/2} π^{d/2} Γ_d(m/2) Γ_d((n−1)/2) |Σ0 + A + (nκ/(n+κ))(µ0 − x̄)(µ0 − x̄)^T|^{(n+m)/2}]   (8)


Having obtained P(µ, Σ, x̄, A) and P(x̄, A), the posterior P(µ, Σ|x̄, A) is obtained as follows:

P(µ, Σ|x̄, A) = P(µ, Σ, x̄, A) / P(x̄, A)   (9)

We cancel out the common terms and the remaining expression is:

= [|Σ0 + A + (nκ/(n+κ))(µ0 − x̄)(µ0 − x̄)^T|^{(n+m)/2} |Σ|^{-(n+m+d+2)/2} (n + κ)^{d/2} / (2^{d(n+m+1)/2} π^{d/2} Γ_d((n+m)/2))]
  × exp(-(1/2)[(κ + n)(µ − (κµ0 + nx̄)/(κ + n))^T Σ^{-1}(µ − (κµ0 + nx̄)/(κ + n)) + tr((Σ0 + A + (nκ/(κ + n))(µ0 − x̄)(µ0 − x̄)^T)Σ^{-1})])   (10)

= [(2π)^{-d/2} |Σ/(n + κ)|^{-1/2} exp(-(1/2)(µ − (κµ0 + nx̄)/(κ + n))^T (Σ/(κ + n))^{-1}(µ − (κµ0 + nx̄)/(κ + n)))]
  × [|Σ0 + A + (nκ/(n+κ))(µ0 − x̄)(µ0 − x̄)^T|^{(n+m)/2} |Σ|^{-(n+m+d+1)/2} / (2^{d(n+m)/2} Γ_d((n+m)/2))] exp(-(1/2) tr((Σ0 + A + (nκ/(κ + n))(µ0 − x̄)(µ0 − x̄)^T)Σ^{-1}))   (11)

Recognizing that the first factor in (11) is a multivariate Normal and the second an Inverse-Wishart with the respective parameters, the posterior P(µ, Σ|x̄, A) turns out to be:

P(µ, Σ|x̄, A) = N(µ | (κµ0 + nx̄)/(κ + n), Σ/(κ + n)) × W^{-1}(Σ | Σ0 + A + (nκ/(κ + n))(µ0 − x̄)(µ0 − x̄)^T, n + m)   (12)
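A small Python helper (my own sketch, not the course code) that computes the posterior parameters of (12) directly from a data matrix; with many samples the posterior mean is pulled toward the sample mean, as expected:

```python
import numpy as np

def niw_posterior(X, mu0, kappa, Sigma0, m):
    """Posterior parameters of (mu, Sigma) under the Normal x Normal x
    Inverse-Wishart model, following Eq. (12). X has one sample per row."""
    n, d = X.shape
    xbar = X.mean(axis=0)
    A = (X - xbar).T @ (X - xbar)            # scatter matrix, A = (n-1)S
    mu_n = (kappa * mu0 + n * xbar) / (kappa + n)
    kappa_n = kappa + n
    Psi_n = Sigma0 + A + (n * kappa / (kappa + n)) * np.outer(mu0 - xbar, mu0 - xbar)
    m_n = n + m
    return mu_n, kappa_n, Psi_n, m_n

# With many samples the posterior mean approaches the sample mean
rng = np.random.default_rng(2)
X = rng.standard_normal((500, 2)) + np.array([3.0, -1.0])
mu_n, kappa_n, Psi_n, m_n = niw_posterior(X, np.zeros(2), 0.1, 8 * np.eye(2), 20)
print(mu_n)  # close to [3, -1]
```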

(c) Show that p(x*|x̄, S) = Stu-t((κµ0 + nx̄)/(n + κ), ((n + κ + 1)/((n + κ)(n + m + 1 − d)))[Σ0 + (n − 1)S + (nκ/(κ + n))(µ0 − x̄)(µ0 − x̄)^T], n + m + 1 − d) (hint: |A + xx^T| = |A|(1 + x^T A^{-1} x)) (20 points)

Solution:

Having obtained the posterior for µ and Σ, the predictive distribution for a new sample x* is obtained as follows:

P(x*|x̄, A) = ∫∫ P(x*|µ, Σ) P(µ, Σ|x̄, A) dµ dΣ   (13)

In rewriting P(µ, Σ|x̄, A) we replace (1/(n + κ))(κµ0 + nx̄) with µ̂ and Σ0 + A + (nκ/(κ + n))(µ0 − x̄)(µ0 − x̄)^T with B:

= ∫∫ (2π)^{-d/2} |Σ|^{-1/2} exp(-(1/2)(x* − µ)^T Σ^{-1}(x* − µ))
  × [|B|^{(n+m)/2} |Σ|^{-(n+m+d+2)/2} (n + κ)^{d/2} / ((2π)^{d/2} 2^{d(n+m)/2} Γ_d((n+m)/2))]
  × exp(-(1/2)[(n + κ)(µ − µ̂)^T Σ^{-1}(µ − µ̂) + tr(BΣ^{-1})]) dµ dΣ   (14)

= ∫ [|B|^{(n+m)/2} |Σ|^{-(n+m+d+1)/2} / (2^{d(n+m)/2} Γ_d((n+m)/2))] exp(-(1/2) tr(BΣ^{-1}))
  × [∫ (2π)^{-d/2} |Σ|^{-1/2} (2π)^{-d/2} |Σ/(n + κ)|^{-1/2} exp(-(1/2)[(x* − µ)^T Σ^{-1}(x* − µ) + (µ − µ̂)^T (Σ/(n + κ))^{-1}(µ − µ̂)]) dµ] dΣ   (15)

where the inner integrand is P(x*, µ|Σ). We know from the solution of Part A(b) that integrating P(x*, µ|Σ) in (15) w.r.t. µ yields another multivariate Normal with updated parameters:


P(x*|Σ) ∼ N(µ̂, Σ/(n + κ) + Σ)
= (2π)^{-d/2} |Σ/(n + κ) + Σ|^{-1/2} exp(-(1/2)((n + κ)/(n + κ + 1))(x* − µ̂)^T Σ^{-1}(x* − µ̂))   (16)

Note that Σ/(n + κ) + Σ = ((n + κ + 1)/(n + κ))Σ, so |Σ/(n + κ) + Σ|^{-1/2} = ((n + κ)/(n + κ + 1))^{d/2} |Σ|^{-1/2}. We continue by incorporating (16) into (15):

= ∫ [|B|^{(n+m)/2} |Σ|^{-(n+m+d+2)/2} (n + κ)^{d/2} / (2^{d(n+m+1)/2} π^{d/2} Γ_d((n+m)/2)(n + κ + 1)^{d/2})] exp(-(1/2) tr((B + ((n + κ)/(n + κ + 1))(x* − µ̂)(x* − µ̂)^T)Σ^{-1})) dΣ
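The marginalization step in (16) can be checked by numerical integration in one dimension, assuming SciPy is available (my own sketch; the parameter values are arbitrary):

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

sigma2, n, kappa, muhat = 1.7, 6, 0.4, 0.9   # arbitrary illustrative values
x = 1.3

# Left side: integrate N(x | mu, sigma2) N(mu | muhat, sigma2/(n+kappa)) over mu
integrand = lambda mu: (norm.pdf(x, mu, np.sqrt(sigma2))
                        * norm.pdf(mu, muhat, np.sqrt(sigma2 / (n + kappa))))
lhs, _ = quad(integrand, -20, 20, limit=200)

# Right side: N(x | muhat, sigma2 (n+kappa+1)/(n+kappa)) as in (16)
rhs = norm.pdf(x, muhat, np.sqrt(sigma2 * (n + kappa + 1) / (n + kappa)))
print(lhs, rhs)
```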

= [Γ_d((n+m+1)/2) |B|^{(n+m)/2} (n + κ)^{d/2} / (π^{d/2} Γ_d((n+m)/2)(n + κ + 1)^{d/2} |B + ((n + κ)/(n + κ + 1))(x* − µ̂)(x* − µ̂)^T|^{(n+m+1)/2})]
  × ∫ [|B + ((n + κ)/(n + κ + 1))(x* − µ̂)(x* − µ̂)^T|^{(n+(m+1))/2} |Σ|^{-(n+(m+1)+d+1)/2} / (2^{d(n+m+1)/2} Γ_d((n+m+1)/2))] exp(-(1/2) tr((B + ((n + κ)/(n + κ + 1))(x* − µ̂)(x* − µ̂)^T)Σ^{-1})) dΣ

where the determinant ratio |B|^{(n+m)/2} / |B + ((n + κ)/(n + κ + 1))(x* − µ̂)(x* − µ̂)^T|^{(n+m+1)/2} is the term denoted TT.

The integration w.r.t. Σ above equals 1 since it corresponds to the integration of an Inverse-Wishart distribution. By Corollary A.3.1 in [Anderson], |B + xx^T| = |B|(1 + x^T B^{-1} x). So the determinant in the term TT above becomes |B|(1 + ((n + κ)/(n + κ + 1))(x* − µ̂)^T B^{-1}(x* − µ̂)), and we finalize the predictive distribution P(x*|x̄, A) as:

P(x*|x̄, A) = [Γ_d((n+m+1)/2)(n + κ)^{d/2} |B|^{-1/2}] / [π^{d/2} Γ_d((n+m)/2)(n + κ + 1)^{d/2} (1 + ((n + κ)/(n + κ + 1))(x* − µ̂)^T B^{-1}(x* − µ̂))^{(n+m+1)/2}]   (17)
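The determinant identity used for TT is easy to confirm numerically (a quick sketch of my own, with a random positive definite B):

```python
import numpy as np

rng = np.random.default_rng(3)
d = 5
M = rng.standard_normal((d, d))
B = M @ M.T + np.eye(d)            # random SPD matrix
x = rng.standard_normal(d)

# |B + x x^T| versus |B| (1 + x^T B^{-1} x)
lhs = np.linalg.det(B + np.outer(x, x))
rhs = np.linalg.det(B) * (1 + x @ np.linalg.inv(B) @ x)
print(lhs, rhs)  # equal up to rounding
```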

There are two identities for the multivariate gamma function:

Γ_d(α) = π^{(d−1)/2} Γ_{d−1}(α) Γ(α + (1 − d)/2)
       = π^{(d−1)/2} Γ(α) Γ_{d−1}(α − 1/2)
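Both identities can be verified with SciPy's `multigammaln` (the log of Γ_d); a quick check of my own with arbitrary α and d:

```python
import numpy as np
from scipy.special import gammaln, multigammaln

d, a = 4, 6.3          # arbitrary; need a > (d - 1)/2
lhs = multigammaln(a, d)
# First identity: peel off the Gamma(a + (1-d)/2) factor
rhs1 = (d - 1) / 2 * np.log(np.pi) + multigammaln(a, d - 1) + gammaln(a + (1 - d) / 2)
# Second identity: peel off the Gamma(a) factor
rhs2 = (d - 1) / 2 * np.log(np.pi) + gammaln(a) + multigammaln(a - 0.5, d - 1)
print(lhs, rhs1, rhs2)  # all equal
```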

Replacing Γ_d((n+m+1)/2) in the numerator of (17) using the latter identity and Γ_d((n+m)/2) in the denominator using the former, the R.H.S. of (17) becomes:

= [π^{(d−1)/2} Γ_{d−1}((n+m)/2) Γ((n+m+1)/2)(n + κ)^{d/2} |B|^{-1/2}] / [π^{(d−1)/2} Γ_{d−1}((n+m)/2) Γ((n+m+1−d)/2) π^{d/2} (n + κ + 1)^{d/2} (1 + ((n + κ)/(n + κ + 1))(x* − µ̂)^T B^{-1}(x* − µ̂))^{(n+m+1)/2}]   (18)

Cancelling out the common terms, dividing and multiplying the above equation by (n + m + 1 − d)^{d/2}, and rearranging the terms in the denominator according to the definition of the multivariate Student-t distribution, we obtain:

P(x*|x̄, A) = [Γ((ν + d)/2) / (Γ(ν/2) π^{d/2} ν^{d/2} |Bτ|^{1/2})] × (1 + (1/ν)(x* − µ̂)^T (Bτ)^{-1}(x* − µ̂))^{-(ν+d)/2}

where B = Σ0 + A + (nκ/(κ + n))(µ0 − x̄)(µ0 − x̄)^T, µ̂ = (κµ0 + nx̄)/(n + κ), τ = (n + κ + 1)/(ν(n + κ)), and ν = n + m + 1 − d. That is,

P(x*|x̄, A) = Stu-t(µ̂, Bτ, ν)   (19)

where the first parameter is the location vector, the second the positive definite scale matrix, and the third the degrees of freedom of the multivariate t distribution.
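Putting (19) together, a Python sketch (my own, not the course code) of the predictive log-density; as a sanity check, in one dimension it integrates to 1 over a wide grid:

```python
import numpy as np
from scipy.special import gammaln

def predictive_logpdf(xs, xbar, S, n, mu0, kappa, Sigma0, m):
    """Log of the predictive density (19): a multivariate Student-t with
    location muhat, scale B*tau, and nu = n + m + 1 - d degrees of freedom."""
    d = len(xbar)
    A = (n - 1) * S
    muhat = (kappa * mu0 + n * xbar) / (n + kappa)
    B = Sigma0 + A + (n * kappa / (n + kappa)) * np.outer(mu0 - xbar, mu0 - xbar)
    nu = n + m + 1 - d
    scale = (n + kappa + 1) / (nu * (n + kappa)) * B        # B * tau
    diff = xs - muhat
    q = np.einsum('...i,ij,...j->...', diff, np.linalg.inv(scale), diff)
    return (gammaln((nu + d) / 2) - gammaln(nu / 2)
            - d / 2 * np.log(nu * np.pi) - 0.5 * np.linalg.slogdet(scale)[1]
            - (nu + d) / 2 * np.log1p(q / nu))

# 1-d sanity check: the density should integrate to ~1 over a wide grid
grid = np.linspace(-40.0, 40.0, 200001)[:, None]
logp = predictive_logpdf(grid, np.array([0.5]), np.array([[2.0]]),
                         10, np.zeros(1), 0.1, 8 * np.eye(1), 20)
integral = np.exp(logp).sum() * (grid[1, 0] - grid[0, 0])
print(integral)  # close to 1
```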

Part B (50 points)

Download the MultiClassGaussianDataset.mat from the course web page. For Part B please use the following model parameters: Σ0 = 8I (I is the identity matrix), m = 20, κ = 0.1, µ0 = [0, . . . , 0]^T.

(a) Implement the maximum a posteriori (MAP) classifier using the standard likelihood model and evaluate the accuracy of the classifier on the test data (15 points)

Please see Assignment1SolutionPartBa.m.

(b) Implement the maximum a posteriori (MAP) classifier using the predictive likelihood model you derived in Part A and evaluate the accuracy of the classifier on the test data (30 points)

Please see Assignment1SolutionPartBb.m.

(c) Discuss the results from parts (a) and (b) (5 points)

The accuracy for part (a) is 46% and for part (b) 61%. Having access to an informative prior model improves classification accuracy, and the degree of improvement grows as the number of training samples per class shrinks.
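The referenced MATLAB scripts are not reproduced here; the following Python sketch (my own, using synthetic Gaussian data in place of MultiClassGaussianDataset.mat, with the Part B hyperparameters) illustrates the two MAP classifiers being compared:

```python
import numpy as np
from scipy.special import gammaln

rng = np.random.default_rng(4)
d, K, n_train, n_test = 2, 3, 5, 200
means = rng.normal(0.0, 3.0, (K, d))                 # synthetic class means
Xtr = {k: means[k] + rng.standard_normal((n_train, d)) for k in range(K)}
Xte = np.vstack([means[k] + rng.standard_normal((n_test, d)) for k in range(K)])
yte = np.repeat(np.arange(K), n_test)

Sigma0, m, kappa, mu0 = 8 * np.eye(d), 20, 0.1, np.zeros(d)

def gauss_logpdf(X, mu, Sigma):
    """Plug-in Gaussian log-likelihood (the 'standard' model of part (a))."""
    diff = X - mu
    q = np.einsum('ni,ij,nj->n', diff, np.linalg.inv(Sigma), diff)
    return -0.5 * (d * np.log(2 * np.pi) + np.linalg.slogdet(Sigma)[1] + q)

def pred_logpdf(X, data):
    """Per-class predictive Student-t log-likelihood of Eq. (19)."""
    n = len(data)
    xbar = data.mean(axis=0)
    A = (data - xbar).T @ (data - xbar)
    muhat = (kappa * mu0 + n * xbar) / (n + kappa)
    B = Sigma0 + A + n * kappa / (n + kappa) * np.outer(mu0 - xbar, mu0 - xbar)
    nu = n + m + 1 - d
    scale = (n + kappa + 1) / (nu * (n + kappa)) * B
    diff = X - muhat
    q = np.einsum('ni,ij,nj->n', diff, np.linalg.inv(scale), diff)
    return (gammaln((nu + d) / 2) - gammaln(nu / 2) - d / 2 * np.log(nu * np.pi)
            - 0.5 * np.linalg.slogdet(scale)[1] - (nu + d) / 2 * np.log1p(q / nu))

# MAP with equal class priors = pick the class with the highest likelihood
ml_scores = np.stack([gauss_logpdf(Xte, Xtr[k].mean(axis=0), np.cov(Xtr[k].T))
                      for k in range(K)])
pr_scores = np.stack([pred_logpdf(Xte, Xtr[k]) for k in range(K)])
acc_ml = np.mean(ml_scores.argmax(axis=0) == yte)
acc_pr = np.mean(pr_scores.argmax(axis=0) == yte)
print(acc_ml, acc_pr)  # accuracies of the plug-in and predictive classifiers
```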
