Assignment 1 Solutions


Instructor: Murat Dundar 

Date: 2/18/2013 

In this homework you will derive the predictive distribution for the Normal × Normal × Inverted Wishart model and use it in a K-class classification problem to demonstrate its benefits over the standard data distribution.

Part A (50 points) 

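Data Model: $x \sim N(\mu, \Sigma)$, $\mu \sim N(\mu_0, \kappa^{-1}\Sigma)$, $\Sigma \sim W^{-1}(\Sigma_0, m)$.

Given a data set $D = \{x_1, \ldots, x_n\}$, $x_i \in \mathbb{R}^d$, generated by this model, let the sample mean be $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ and the sample covariance matrix be $S = \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})(x_i - \bar{x})^T$.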

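(a) Show that $\bar{x} \sim N(\mu, n^{-1}\Sigma)$ (10 points)

Solution:

Note that $\bar{x}$ is a weighted sum of $n$ iid Normally distributed random variables, so $\bar{x}$ is itself Normally distributed. The mean of this distribution is

\[
E\{\bar{x}\} = \frac{1}{n}\sum_{i=1}^{n} E\{x_i\} = \frac{1}{n}\, n\mu = \mu .
\]

Its covariance is

\[
\begin{aligned}
E\{(\bar{x}-\mu)(\bar{x}-\mu)^T\}
&= E\Big\{\Big(\frac{\sum_{i} x_i}{n}-\mu\Big)\Big(\frac{\sum_{i} x_i}{n}-\mu\Big)^T\Big\}
 = \frac{1}{n^2}\, E\Big\{\Big(\sum_{i=1}^{n}(x_i-\mu)\Big)\Big(\sum_{j=1}^{n}(x_j-\mu)\Big)^T\Big\} \\
&= \frac{1}{n^2}\sum_{i=1}^{n} E\{(x_i-\mu)(x_i-\mu)^T\}
 = \frac{n\Sigma}{n^2} = \frac{\Sigma}{n},
\end{aligned}
\]

where the cross terms with $i \neq j$ vanish because the $x_i$ are independent.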
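The result of part (a) can be checked numerically. The following is a minimal sketch, assuming NumPy is available; the mean, covariance, sample size and number of trials are arbitrary illustrative values, not part of the assignment.

```python
import numpy as np

def sample_mean_covariance(mu, Sigma, n, trials=20000, seed=0):
    """Empirical covariance of the sample mean of n iid N(mu, Sigma) draws."""
    rng = np.random.default_rng(seed)
    # Shape (trials, n, d): `trials` independent data sets of size n.
    draws = rng.multivariate_normal(mu, Sigma, size=(trials, n))
    means = draws.mean(axis=1)           # one sample mean per data set
    return np.cov(means, rowvar=False)   # d x d empirical covariance

# Illustrative values (assumptions, not from the assignment).
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])
n = 5

empirical = sample_mean_covariance(mu, Sigma, n)
print(np.round(empirical, 3))  # Monte Carlo estimate
print(Sigma / n)               # value predicted by part (a)
```

The two printed matrices should agree up to Monte Carlo error, which shrinks as `trials` grows.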

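(b) Show that $p(\mu, \Sigma \mid \bar{x}, S) = N\!\left(\frac{n\bar{x}+\kappa\mu_0}{n+\kappa},\ \frac{\Sigma}{n+\kappa}\right) \times W^{-1}\!\left(\Sigma_0 + (n-1)S + \frac{n\kappa}{n+\kappa}(\bar{x}-\mu_0)(\bar{x}-\mu_0)^T,\ n+m\right)$ (hint: $(n-1)S \sim W(\Sigma, n-1)$) (20 points)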

Solution: 

Class likelihoods of samples: $x \mid \mu, \Sigma \sim N(\mu, \Sigma)$, $\Sigma \sim W^{-1}(\Sigma_0, m)$, $\mu \mid \Sigma \sim N(\mu_0, \Sigma/\kappa)$. By Bayes' rule,

\[
P(\mu, \Sigma \mid D) = \frac{P(D \mid \mu, \Sigma)\, P(\mu \mid \Sigma)\, P(\Sigma)}{P(D)} \tag{1}
\]

where $D$ represents the data. The sample mean and sample covariance, $(\bar{x}, S)$, are the sufficient statistics for multivariate Normally distributed data.

Here $\bar{x} \sim N(\mu, \Sigma/n)$, and we will use $A = (n-1)S$ in the probability equations. The joint probability of $\mu$, $\Sigma$, $\bar{x}$ and $A$ is:

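\[
\begin{aligned}
P(\mu, \Sigma, \bar{x}, A) ={}& \frac{1}{(2\pi)^{d/2}}\, |\Sigma/\kappa|^{-1/2} \exp\Big\{-\tfrac{1}{2}(\mu-\mu_0)^T(\Sigma/\kappa)^{-1}(\mu-\mu_0)\Big\} \\
&\times \frac{|\Sigma_0|^{\frac{1}{2}m}\, |\Sigma|^{-\frac{1}{2}(m+d+1)} \exp\big\{-\tfrac{1}{2}\mathrm{tr}\,\Sigma_0\Sigma^{-1}\big\}}{2^{\frac{1}{2}md}\, \Gamma_d(\tfrac{1}{2}m)} \\
&\times \frac{1}{(2\pi)^{d/2}}\, |\Sigma/n|^{-1/2} \exp\Big\{-\tfrac{1}{2}(\bar{x}-\mu)^T(\Sigma/n)^{-1}(\bar{x}-\mu)\Big\} \\
&\times \frac{|A|^{\frac{1}{2}(n-d-2)}\, |\Sigma|^{-\frac{1}{2}(n-1)} \exp\big(-\tfrac{1}{2}\mathrm{tr}\,A\Sigma^{-1}\big)}{2^{\frac{1}{2}d(n-1)}\, \Gamma_d(\tfrac{1}{2}(n-1))} \\
={}& \frac{\kappa^{d/2} n^{d/2}\, |\Sigma_0|^{\frac{1}{2}m}\, |A|^{\frac{1}{2}(n-d-2)}\, |\Sigma|^{-\frac{1}{2}(n+m+d+2)} \exp\big(-\tfrac{1}{2}T_1\big)}{2^{\frac{1}{2}d(n+m+1)}\, \pi^{d}\, \Gamma_d(\tfrac{1}{2}m)\, \Gamma_d(\tfrac{1}{2}(n-1))}
\end{aligned}
\tag{2}
\]

where the term $T_1$ inside the exponential is manipulated as follows: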

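\[
\begin{aligned}
T_1 &= \kappa(\mu-\mu_0)^T\Sigma^{-1}(\mu-\mu_0) + n(\bar{x}-\mu)^T\Sigma^{-1}(\bar{x}-\mu) + \mathrm{tr}(\Sigma_0+A)\Sigma^{-1} \\
&= (\kappa+n)\mu^T\Sigma^{-1}\mu - 2(\kappa\mu_0+n\bar{x})^T\Sigma^{-1}\mu + \underbrace{\kappa\mu_0^T\Sigma^{-1}\mu_0 + n\bar{x}^T\Sigma^{-1}\bar{x}}_{T_2} + \mathrm{tr}(\Sigma_0+A)\Sigma^{-1}
\end{aligned}
\tag{3}
\]

We multiply the above term by $\frac{\kappa+n}{\kappa+n}$ and complete the quadratic form as follows:

\[
T_1 = (\kappa+n)\Bigg[\underbrace{\Big(\mu - \frac{\kappa\mu_0+n\bar{x}}{\kappa+n}\Big)^T \Sigma^{-1} \Big(\mu - \frac{\kappa\mu_0+n\bar{x}}{\kappa+n}\Big)}_{Q_1} + \Delta\Bigg] + \mathrm{tr}(\Sigma_0+A)\Sigma^{-1} \tag{4}
\]

The term denoted by $\Delta$ collects the terms remaining after forming $Q_1$ in (4) and will be defined after we expand $Q_1$ as follows:

\[
Q_1 = \mu^T\Sigma^{-1}\mu - \frac{2}{\kappa+n}\,\mu^T\Sigma^{-1}(\kappa\mu_0+n\bar{x}) + \underbrace{\frac{1}{(\kappa+n)^2}(\kappa\mu_0+n\bar{x})^T\Sigma^{-1}(\kappa\mu_0+n\bar{x})}_{T_3} \tag{5}
\]

Now, $\Delta$ in (4) equals $T_2$ from (3), divided by $(\kappa+n)$, minus $T_3$ from (5):

\[
\begin{aligned}
\Delta = \frac{T_2}{\kappa+n} - T_3
&= \frac{\kappa\mu_0^T\Sigma^{-1}\mu_0\,((\kappa+n)-\kappa) - 2n\kappa\,\mu_0^T\Sigma^{-1}\bar{x} + n\bar{x}^T\Sigma^{-1}\bar{x}\,((\kappa+n)-n)}{(\kappa+n)^2} \\
&= \frac{n\kappa}{(\kappa+n)^2}(\mu_0-\bar{x})^T\Sigma^{-1}(\mu_0-\bar{x})
\end{aligned}
\]

Taking $\Delta$ outside the brackets in (4), multiplied by $(\kappa+n)$, $T_1$ becomes:

\[
T_1 = (\kappa+n)\Big(\mu - \frac{\kappa\mu_0+n\bar{x}}{\kappa+n}\Big)^T \Sigma^{-1} \Big(\mu - \frac{\kappa\mu_0+n\bar{x}}{\kappa+n}\Big) + \frac{n\kappa}{\kappa+n}(\mu_0-\bar{x})^T\Sigma^{-1}(\mu_0-\bar{x}) + \mathrm{tr}(\Sigma_0+A)\Sigma^{-1} \tag{6}
\]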
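The completed square can be sanity-checked numerically. The sketch below, assuming NumPy, compares the two quadratic terms of $T_1$ before and after completing the square (the trace term is identical on both sides and is omitted). All values are random illustrative inputs.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n, kappa = 3, 7, 0.4                      # illustrative sizes (assumptions)
mu, mu0, xbar = rng.normal(size=(3, d))      # three arbitrary d-vectors
M = rng.normal(size=(d, d))
Sigma_inv = np.linalg.inv(M @ M.T + d * np.eye(d))  # inverse of an SPD matrix

def quad(v):
    """Quadratic form v^T Sigma^{-1} v."""
    return float(v @ Sigma_inv @ v)

# Quadratic part of T1 as first written.
lhs = kappa * quad(mu - mu0) + n * quad(xbar - mu)

# Quadratic part of T1 after completing the square in mu.
mu_hat = (kappa * mu0 + n * xbar) / (kappa + n)
rhs = (kappa + n) * quad(mu - mu_hat) + n * kappa / (kappa + n) * quad(mu0 - xbar)

print(lhs, rhs)
```

Both printed numbers should coincide up to floating-point rounding.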

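We insert (6) into (2) and obtain $P(\mu, \Sigma, \bar{x}, A)$. In order to obtain $P(\bar{x}, A) = \int\!\!\int P(\mu, \Sigma, \bar{x}, A)\, d\mu\, d\Sigma$ we proceed as follows and turn it into a suitable form to integrate out $\mu$ and $\Sigma$, respectively:

\[
\begin{aligned}
P(\bar{x}, A) = \int\!\!\int & \frac{\kappa^{d/2} n^{d/2}\, |\Sigma_0|^{m/2}\, |A|^{\frac{1}{2}(n-d-2)}\, |\Sigma|^{-\frac{1}{2}(n+m+d+1)}}{2^{\frac{1}{2}d(n+m)}\, \pi^{d/2}\, \Gamma_d(\tfrac{1}{2}m)\, \Gamma_d(\tfrac{1}{2}(n-1))}
\exp\Big\{-\tfrac{1}{2}\Big[\frac{n\kappa}{\kappa+n}(\mu_0-\bar{x})^T\Sigma^{-1}(\mu_0-\bar{x}) + \mathrm{tr}(\Sigma_0+A)\Sigma^{-1}\Big]\Big\} \\
&\times \frac{1}{(\kappa+n)^{d/2}}
\times \underbrace{\frac{(\kappa+n)^{d/2}\, |\Sigma|^{-1/2}}{(2\pi)^{d/2}} \exp\Big\{-\tfrac{1}{2}(\kappa+n)\Big(\mu - \frac{\kappa\mu_0+n\bar{x}}{\kappa+n}\Big)^T \Sigma^{-1} \Big(\mu - \frac{\kappa\mu_0+n\bar{x}}{\kappa+n}\Big)\Big\}}_{\text{integrates to } 1 \text{ over } \mu}\, d\mu\, d\Sigma
\end{aligned}
\tag{7}
\]

The factor under the brace is a multivariate Normal density in $\mu$, so the integration w.r.t. $\mu$ equals 1.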

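We express the remaining terms to handle the integration w.r.t. $\Sigma$ as follows:

\[
\begin{aligned}
P(\bar{x}, A) ={}& \frac{\kappa^{d/2} n^{d/2}\, |\Sigma_0|^{m/2}\, |A|^{\frac{1}{2}(n-d-2)}}{(n+\kappa)^{d/2}\, \pi^{d/2}\, \Gamma_d(\tfrac{1}{2}m)\, \Gamma_d(\tfrac{1}{2}(n-1))}
\times \frac{\Gamma_d(\tfrac{1}{2}(n+m))}{|\Sigma_0+A+Q_2|^{\frac{1}{2}(n+m)}} \\
&\times \int \underbrace{\frac{|\Sigma_0+A+Q_2|^{\frac{1}{2}(n+m)}\, |\Sigma|^{-\frac{1}{2}(n+m+d+1)}}{\Gamma_d(\tfrac{1}{2}(n+m))\, 2^{\frac{1}{2}d(n+m)}} \exp\Big\{-\tfrac{1}{2}\mathrm{tr}\big[(\Sigma_0+A+Q_2)\Sigma^{-1}\big]\Big\}}_{=1}\, d\Sigma
\end{aligned}
\]

where $Q_2 = \frac{n\kappa}{n+\kappa}(\mu_0-\bar{x})(\mu_0-\bar{x})^T$.

The integrand w.r.t. $\Sigma$ is in Inverse-Wishart form, so the integral is equal to 1 and

\[
P(\bar{x}, A) = \frac{\kappa^{d/2} n^{d/2}\, |\Sigma_0|^{m/2}\, |A|^{\frac{1}{2}(n-d-2)}}{(n+\kappa)^{d/2}\, \pi^{d/2}\, \Gamma_d(\tfrac{1}{2}m)\, \Gamma_d(\tfrac{1}{2}(n-1))}
\times \frac{\Gamma_d(\tfrac{1}{2}(n+m))}{\big|\Sigma_0 + A + \frac{n\kappa}{n+\kappa}(\mu_0-\bar{x})(\mu_0-\bar{x})^T\big|^{\frac{1}{2}(n+m)}} \tag{8}
\]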

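Having obtained $P(\mu, \Sigma, \bar{x}, A)$ and $P(\bar{x}, A)$, the posterior $P(\mu, \Sigma \mid \bar{x}, A)$ is obtained as follows:

\[
P(\mu, \Sigma \mid \bar{x}, A) = \frac{P(\mu, \Sigma, \bar{x}, A)}{P(\bar{x}, A)} \tag{9}
\]

We cancel out the common terms and, writing $Q_2 = \frac{n\kappa}{n+\kappa}(\mu_0-\bar{x})(\mu_0-\bar{x})^T$, the remaining equation is:

\[
\begin{aligned}
={}& \frac{|\Sigma_0+A+Q_2|^{\frac{1}{2}(n+m)}\, |\Sigma|^{-\frac{1}{2}(n+m+d+2)}\, (n+\kappa)^{d/2}}{2^{\frac{1}{2}d(n+m+1)}\, \pi^{d/2}\, \Gamma_d(\tfrac{1}{2}(n+m))} \\
&\times \exp\Big\{-\tfrac{1}{2}\Big[(\kappa+n)\Big(\mu - \frac{\kappa\mu_0+n\bar{x}}{\kappa+n}\Big)^T \Sigma^{-1} \Big(\mu - \frac{\kappa\mu_0+n\bar{x}}{\kappa+n}\Big) + \mathrm{tr}\big[(\Sigma_0+A+Q_2)\Sigma^{-1}\big]\Big]\Big\}
\end{aligned}
\tag{10}
\]

\[
\begin{aligned}
={}& \frac{1}{(2\pi)^{d/2}} \Big|\frac{\Sigma}{n+\kappa}\Big|^{-1/2} \exp\Big\{-\tfrac{1}{2}\Big(\mu - \frac{\kappa\mu_0+n\bar{x}}{\kappa+n}\Big)^T \Big(\frac{\Sigma}{\kappa+n}\Big)^{-1} \Big(\mu - \frac{\kappa\mu_0+n\bar{x}}{\kappa+n}\Big)\Big\} \\
&\times \frac{|\Sigma_0+A+Q_2|^{\frac{1}{2}(n+m)}\, |\Sigma|^{-\frac{1}{2}(n+m+d+1)}}{2^{\frac{1}{2}d(n+m)}\, \Gamma_d(\tfrac{1}{2}(n+m))} \exp\Big\{-\tfrac{1}{2}\mathrm{tr}\big[(\Sigma_0+A+Q_2)\Sigma^{-1}\big]\Big\}
\end{aligned}
\tag{11}
\]

Recognizing that the first factor in (11) is a multivariate Normal density and the second factor is an Inverse-Wishart density with the respective parameters, the posterior $P(\mu, \Sigma \mid \bar{x}, A)$ turns out to be:

\[
P(\mu, \Sigma \mid \bar{x}, A) = N\Big(\mu \,\Big|\, \frac{1}{\kappa+n}(\kappa\mu_0+n\bar{x}),\ \frac{1}{\kappa+n}\Sigma\Big) \times W^{-1}\Big(\Sigma \,\Big|\, \Sigma_0 + A + \frac{n\kappa}{\kappa+n}(\mu_0-\bar{x})(\mu_0-\bar{x})^T,\ n+m\Big) \tag{12}
\]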
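The posterior parameters can be computed directly from data. The following is a minimal sketch, assuming NumPy; the function name `niw_posterior` and the toy data are hypothetical, and $A$ is taken to be the scatter matrix $\sum_i (x_i-\bar{x})(x_i-\bar{x})^T$, which is what the Wishart hint requires of $(n-1)S$.

```python
import numpy as np

def niw_posterior(X, mu0, kappa, Sigma0, m):
    """Posterior parameters of the Normal x Inverse-Wishart model.

    Returns (mu_hat, kappa_n, B, dof): the posterior is
    N(mu | mu_hat, Sigma / kappa_n) x W^{-1}(Sigma | B, dof).
    """
    X = np.asarray(X, dtype=float)
    mu0 = np.asarray(mu0, dtype=float)
    n, d = X.shape
    xbar = X.mean(axis=0)
    centered = X - xbar
    A = centered.T @ centered                  # scatter matrix (assumed A)
    diff = (mu0 - xbar).reshape(-1, 1)
    mu_hat = (kappa * mu0 + n * xbar) / (kappa + n)
    B = Sigma0 + A + (n * kappa / (n + kappa)) * (diff @ diff.T)
    return mu_hat, kappa + n, B, n + m

# Toy data: four points on a square (illustrative only).
X = [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]]
mu_hat, kappa_n, B, dof = niw_posterior(X, mu0=[0.0, 0.0], kappa=1.0,
                                        Sigma0=np.eye(2), m=4)
print(mu_hat, kappa_n, dof)
print(B)
```

For this toy input, $\bar{x} = (1, 1)$ and $A = 4I$, so the posterior mean is pulled from $\bar{x}$ toward $\mu_0$ with weight $\kappa/(\kappa+n) = 1/5$.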

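(c) Show that $p(x^* \mid \bar{x}, S) = \text{stu-t}\Big(\frac{\kappa\mu_0+n\bar{x}}{n+\kappa},\ \frac{n+\kappa+1}{(n+\kappa)(n+m+1-d)}\Big[\Sigma_0 + (n-1)S + \frac{n\kappa}{\kappa+n}(\mu_0-\bar{x})(\mu_0-\bar{x})^T\Big],\ n+m+1-d\Big)$ (hint: $|A + xx^T| = |A|(1 + x^TA^{-1}x)$) (20 points)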

Solution: 

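Having obtained the posterior for $\mu$ and $\Sigma$, the predictive distribution for a new sample $x^*$ is obtained as follows:

\[
P(x^* \mid \bar{x}, A) = \int\!\!\int P(x^* \mid \mu, \Sigma)\, P(\mu, \Sigma \mid \bar{x}, A)\, d\mu\, d\Sigma \tag{13}
\]

In rewriting $P(\mu, \Sigma \mid \bar{x}, A)$ we will replace $\frac{1}{n+\kappa}(\kappa\mu_0+n\bar{x})$ with $\hat{\mu}$ and $\Sigma_0 + A + \frac{n\kappa}{\kappa+n}(\mu_0-\bar{x})(\mu_0-\bar{x})^T$ with $B$:

\[
\begin{aligned}
= \int\!\!\int & (2\pi)^{-d/2}\, |\Sigma|^{-1/2} \exp\Big\{-\tfrac{1}{2}(x^*-\mu)^T\Sigma^{-1}(x^*-\mu)\Big\}
\times \frac{|B|^{\frac{1}{2}(n+m)}\, |\Sigma|^{-\frac{1}{2}(n+m+d+2)}\, (n+\kappa)^{d/2}}{(2\pi)^{d/2}\, 2^{\frac{1}{2}d(n+m)}\, \Gamma_d(\tfrac{1}{2}(n+m))} \\
&\times \exp\Big\{-\tfrac{1}{2}\Big[(n+\kappa)(\mu-\hat{\mu})^T\Sigma^{-1}(\mu-\hat{\mu}) + \mathrm{tr}\,B\Sigma^{-1}\Big]\Big\}\, d\mu\, d\Sigma
\end{aligned}
\tag{14}
\]

\[
\begin{aligned}
= \int & \frac{|B|^{\frac{1}{2}(n+m)}\, |\Sigma|^{-\frac{1}{2}(n+m+d+1)}}{2^{\frac{1}{2}d(n+m)}\, \Gamma_d(\tfrac{1}{2}(n+m))} \exp\Big\{-\tfrac{1}{2}\mathrm{tr}\,B\Sigma^{-1}\Big\} \\
&\times \int \underbrace{(2\pi)^{-d/2}|\Sigma|^{-1/2}\,(2\pi)^{-d/2}|\Sigma/(n+\kappa)|^{-1/2} \exp\Big\{-\tfrac{1}{2}\Big[(x^*-\mu)^T\Sigma^{-1}(x^*-\mu) + (\mu-\hat{\mu})^T(\Sigma/(n+\kappa))^{-1}(\mu-\hat{\mu})\Big]\Big\}}_{P(x^*,\, \mu \mid \Sigma)}\, d\mu\, d\Sigma
\end{aligned}
\tag{15}
\]

We know from the solution to Part A(b) that integrating the part $P(x^*, \mu \mid \Sigma)$ in (15) w.r.t. $\mu$ gives another multivariate Normal with updated parameters as follows: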

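\[
P(x^* \mid \Sigma) = N\Big(x^* \,\Big|\, \hat{\mu},\ \frac{\Sigma}{n+\kappa} + \Sigma\Big)
= (2\pi)^{-d/2}\, \Big|\frac{\Sigma}{n+\kappa} + \Sigma\Big|^{-1/2} \exp\Big\{-\tfrac{1}{2}\,\frac{n+\kappa}{n+\kappa+1}(x^*-\hat{\mu})^T\Sigma^{-1}(x^*-\hat{\mu})\Big\} \tag{16}
\]

We continue by incorporating (16) into (15):

\[
\begin{aligned}
P(x^* \mid \bar{x}, A) ={}& \int \frac{|B|^{\frac{1}{2}(n+m)}\, |\Sigma|^{-\frac{1}{2}(n+m+d+2)}\, (n+\kappa)^{d/2}}{2^{\frac{1}{2}d(n+m+1)}\, \pi^{d/2}\, \Gamma_d(\tfrac{1}{2}(n+m))\, (n+\kappa+1)^{d/2}}
\exp\Big\{-\tfrac{1}{2}\mathrm{tr}\Big[\Big(B + \frac{n+\kappa}{n+\kappa+1}(x^*-\hat{\mu})(x^*-\hat{\mu})^T\Big)\Sigma^{-1}\Big]\Big\}\, d\Sigma \\
={}& \frac{|B|^{\frac{1}{2}(n+m)}\, (n+\kappa)^{d/2}}{\pi^{d/2}\, \Gamma_d(\tfrac{1}{2}(n+m))\, (n+\kappa+1)^{d/2}}
\times \frac{\Gamma_d(\tfrac{1}{2}(n+m+1))}{\big|\underbrace{B + \tfrac{n+\kappa}{n+\kappa+1}(x^*-\hat{\mu})(x^*-\hat{\mu})^T}_{TT}\big|^{\frac{1}{2}(n+m+1)}} \\
&\times \int \underbrace{\frac{|\Sigma|^{-\frac{1}{2}(n+(m+1)+d+1)}\, \big|B + \tfrac{n+\kappa}{n+\kappa+1}(x^*-\hat{\mu})(x^*-\hat{\mu})^T\big|^{\frac{1}{2}(n+(m+1))}}{2^{\frac{1}{2}d(n+m+1)}\, \Gamma_d(\tfrac{1}{2}(n+m+1))}
\exp\Big\{-\tfrac{1}{2}\mathrm{tr}\Big[\Big(B + \frac{n+\kappa}{n+\kappa+1}(x^*-\hat{\mu})(x^*-\hat{\mu})^T\Big)\Sigma^{-1}\Big]\Big\}}_{=1}\, d\Sigma
\end{aligned}
\]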

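The integration w.r.t. $\Sigma$ above equals 1 since it corresponds to the integration of an Inverse-Wishart density.

By Corollary A.3.1 in [Anderson], $|B + xx^T| = |B|(1 + x^TB^{-1}x)$. So the determinant of the term $TT$ above becomes $|B|\big(1 + \frac{n+\kappa}{n+\kappa+1}(x^*-\hat{\mu})^TB^{-1}(x^*-\hat{\mu})\big)$ and we finalize the predictive distribution, $P(x^* \mid \bar{x}, A)$, as:

\[
P(x^* \mid \bar{x}, A) = \frac{\Gamma_d(\tfrac{1}{2}(n+m+1))\, (n+\kappa)^{d/2}\, |B|^{-1/2}}{\pi^{d/2}\, \Gamma_d(\tfrac{1}{2}(n+m))\, (n+\kappa+1)^{d/2}\, \Big(1 + \frac{n+\kappa}{n+\kappa+1}(x^*-\hat{\mu})^TB^{-1}(x^*-\hat{\mu})\Big)^{\frac{1}{2}(n+m+1)}} \tag{17}
\]

There are two equalities for the multivariate gamma function as follows:

\[
\begin{aligned}
\Gamma_d(\alpha) &= \pi^{(d-1)/2}\, \Gamma_{d-1}(\alpha)\, \Gamma(\alpha + (1-d)/2) \\
&= \pi^{(d-1)/2}\, \Gamma(\alpha)\, \Gamma_{d-1}(\alpha - 1/2)
\end{aligned}
\]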
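Both equalities for the multivariate gamma function can be verified numerically from its product definition, $\Gamma_d(\alpha) = \pi^{d(d-1)/4}\prod_{j=1}^{d}\Gamma(\alpha + (1-j)/2)$. The sketch below uses only the Python standard library; the helper names and the test values of $\alpha$ and $d$ are illustrative assumptions.

```python
import math

def log_multigamma(a, d):
    """log of the multivariate gamma function Gamma_d(a); needs a > (d - 1) / 2."""
    return d * (d - 1) / 4 * math.log(math.pi) + sum(
        math.lgamma(a + (1 - j) / 2) for j in range(1, d + 1)
    )

def identity_gaps(a, d):
    """Log-scale differences between Gamma_d(a) and each right-hand side."""
    lhs = log_multigamma(a, d)
    half_log_pi = (d - 1) / 2 * math.log(math.pi)
    # First equality: Gamma_{d-1}(a) * Gamma(a + (1 - d) / 2)
    first = half_log_pi + log_multigamma(a, d - 1) + math.lgamma(a + (1 - d) / 2)
    # Second equality: Gamma(a) * Gamma_{d-1}(a - 1/2)
    second = half_log_pi + math.lgamma(a) + log_multigamma(a - 0.5, d - 1)
    return lhs - first, lhs - second

gaps = identity_gaps(a=6.5, d=4)
print(gaps)  # both entries should be zero up to rounding
```

Working on the log scale with `math.lgamma` avoids overflow for the large arguments that arise when $n + m$ is large.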

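Replacing $\Gamma_d(\tfrac{1}{2}(n+m+1))$ in the numerator of (17) with the latter equality and $\Gamma_d(\tfrac{1}{2}(n+m))$ in the denominator with the former, the R.H.S. of (17) becomes:

\[
= \frac{\pi^{(d-1)/2}\, \Gamma_{d-1}\big(\tfrac{n+m}{2}\big)\, \Gamma\big(\tfrac{n+m+1}{2}\big)\, (n+\kappa)^{d/2}\, |B|^{-1/2}}{\pi^{(d-1)/2}\, \Gamma_{d-1}\big(\tfrac{n+m}{2}\big)\, \Gamma\big(\tfrac{n+m+1-d}{2}\big)\, \pi^{d/2}\, (n+\kappa+1)^{d/2}\, \Big(1 + \frac{n+\kappa}{n+\kappa+1}(x^*-\hat{\mu})^TB^{-1}(x^*-\hat{\mu})\Big)^{\frac{1}{2}(n+m+1)}}
\]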

Cancelling out the common terms, dividing and multiplying the above equation by $(n+m+1-d)^{d/2}$, and rearranging the terms in the denominator according to the definition of the multivariate Student-t distribution, we obtain:

\[
P(x^* \mid \bar{x}, A) = \frac{\Gamma\big(\tfrac{\nu+d}{2}\big)}{\Gamma\big(\tfrac{\nu}{2}\big)\, \pi^{d/2}\, \nu^{d/2}\, |B\tau|^{1/2}\, \Big(1 + \frac{1}{\nu}(x^*-\hat{\mu})^T(B\tau)^{-1}(x^*-\hat{\mu})\Big)^{\frac{1}{2}(\nu+d)}} \tag{18}
\]

where $B = \Sigma_0 + A + \frac{n\kappa}{\kappa+n}(\mu_0-\bar{x})(\mu_0-\bar{x})^T$, $\hat{\mu} = \frac{\kappa\mu_0+n\bar{x}}{n+\kappa}$, $\tau = \frac{n+\kappa+1}{\nu(n+\kappa)}$, $\nu = n+m+1-d$. That is,

\[
P(x^* \mid \bar{x}, A) = \text{stu-t}(\hat{\mu},\ B\tau,\ \nu) \tag{19}
\]

where the first parameter is the location vector, the second one is the positive definite scale matrix and the third one is the degrees of freedom for the multivariate t distribution.
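The predictive density in (18) and (19) can be evaluated on the log scale. The following is a minimal sketch, assuming NumPy; the function names are hypothetical, and the inputs `mu_hat`, `kappa_n` (that is, $n+\kappa$), `B` and `dof` (that is, $n+m$) are assumed to come from the posterior of part (b).

```python
import math
import numpy as np

def student_t_logpdf(x, loc, scale, nu):
    """Log density of a d-variate Student-t(loc, scale, nu) at a single point x."""
    x = np.asarray(x, dtype=float)
    loc = np.asarray(loc, dtype=float)
    scale = np.asarray(scale, dtype=float)
    d = loc.shape[0]
    diff = x - loc
    maha = float(diff @ np.linalg.solve(scale, diff))  # diff^T scale^{-1} diff
    _, logdet = np.linalg.slogdet(scale)               # log |scale|
    return (math.lgamma((nu + d) / 2) - math.lgamma(nu / 2)
            - 0.5 * d * math.log(nu * math.pi) - 0.5 * logdet
            - 0.5 * (nu + d) * math.log1p(maha / nu))

def predictive_logpdf(x_new, mu_hat, kappa_n, B, dof):
    """log P(x* | xbar, A) = log stu-t(mu_hat, B * tau, nu), as in (19)."""
    d = len(mu_hat)
    nu = dof + 1 - d                      # nu = n + m + 1 - d
    tau = (kappa_n + 1) / (nu * kappa_n)  # tau = (n + kappa + 1) / (nu (n + kappa))
    return student_t_logpdf(x_new, mu_hat, tau * np.asarray(B, dtype=float), nu)

# One-dimensional illustration with made-up posterior parameters.
print(predictive_logpdf([0.0], [0.0], 1.0, [[2.0]], 2))
```

With $\nu = 1$ and $d = 1$ the Student-t reduces to the Cauchy distribution, which gives a convenient check of `student_t_logpdf` against the known density $1/(\pi(1+x^2))$.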

Part B (50 points) Download the MultiClassGaussianDataset.mat from the course web page.

(a) Implement the maximum a posteriori (MAP) classifier using the standard likelihood model and evaluate the accuracy of the classifier on the test data (15 points)

Please see Assignment1SolutionPartBa.m. 
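For readers without access to the MATLAB file, the idea of part (a) can be sketched in Python. This is not the contents of Assignment1SolutionPartBa.m: it is a minimal illustration, assuming NumPy, that plugs each class's sample mean and sample covariance into a Gaussian likelihood and uses training frequencies as class priors. The synthetic data stands in for the course dataset, whose variable names are not given here.

```python
import math
import numpy as np

def gaussian_map_predict(X_train, y_train, X_test):
    """MAP labels using a plug-in Gaussian likelihood per class."""
    classes = np.unique(y_train)
    d = X_train.shape[1]
    scores = []
    for c in classes:
        Xc = X_train[y_train == c]
        mean = Xc.mean(axis=0)
        cov = np.cov(Xc, rowvar=False)   # needs more samples than dimensions
        diff = X_test - mean
        # Row-wise Mahalanobis distances diff^T cov^{-1} diff.
        maha = np.einsum("ij,ij->i", diff, np.linalg.solve(cov, diff.T).T)
        _, logdet = np.linalg.slogdet(cov)
        loglik = -0.5 * (d * math.log(2 * math.pi) + logdet + maha)
        scores.append(math.log(len(Xc) / len(X_train)) + loglik)
    return classes[np.argmax(np.vstack(scores), axis=0)]

# Synthetic stand-in data: three well-separated 2-D classes (assumption).
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])

def make_data(n_per_class):
    X = np.vstack([rng.normal(c, 1.0, size=(n_per_class, 2)) for c in centers])
    y = np.repeat(np.arange(len(centers)), n_per_class)
    return X, y

X_train, y_train = make_data(10)
X_test, y_test = make_data(50)
accuracy = float(np.mean(gaussian_map_predict(X_train, y_train, X_test) == y_test))
print(accuracy)
```

The sample covariance becomes singular when a class has fewer training samples than dimensions, which is one situation where the plug-in likelihood breaks down.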

(b) Implement the maximum a posteriori (MAP) classifier using the predictive likelihood model you derived in Part A and evaluate the accuracy of the classifier on the test data (30 points)

Please see Assignment1SolutionPartBb.m. 

(c) Discuss results from part a and b (5 points) 

The accuracy for part (a) is 46% and for part (b) is 61%. Having access to an informative prior model improves classification accuracy, and the improvement is larger when fewer training samples are available for each class.

For Part B please use the following model parameters: $\Sigma_0 = 8I$ ($I$ is the identity matrix), $m = 20$, $\kappa = 0.1$, $\mu_0 = [0, \ldots, 0]^T$.
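A classifier based on the predictive likelihood with these parameters can be sketched as follows. This is not the contents of Assignment1SolutionPartBb.m: it is a minimal illustration, assuming NumPy, with hypothetical function names, class priors taken from training frequencies, and synthetic data standing in for the course dataset.

```python
import math
import numpy as np

def fit_class(X, mu0, kappa, Sigma0, m):
    """Predictive Student-t parameters (location, scale matrix, dof) for one class."""
    n, d = X.shape
    xbar = X.mean(axis=0)
    centered = X - xbar
    diff = (mu0 - xbar).reshape(-1, 1)
    # B = Sigma0 + A + n*kappa/(n+kappa) (mu0 - xbar)(mu0 - xbar)^T, A = scatter matrix
    B = Sigma0 + centered.T @ centered + n * kappa / (n + kappa) * (diff @ diff.T)
    nu = n + m + 1 - d
    tau = (n + kappa + 1) / (nu * (n + kappa))
    return (kappa * mu0 + n * xbar) / (kappa + n), tau * B, nu

def t_logpdf(X, loc, scale, nu):
    """Row-wise log density of the multivariate Student-t."""
    d = loc.shape[0]
    diff = X - loc
    maha = np.einsum("ij,ij->i", diff, np.linalg.solve(scale, diff.T).T)
    _, logdet = np.linalg.slogdet(scale)
    const = (math.lgamma((nu + d) / 2) - math.lgamma(nu / 2)
             - 0.5 * d * math.log(nu * math.pi) - 0.5 * logdet)
    return const - 0.5 * (nu + d) * np.log1p(maha / nu)

def map_predict(X_train, y_train, X_test, mu0, kappa, Sigma0, m):
    """MAP labels using the predictive likelihood of each class."""
    classes = np.unique(y_train)
    scores = []
    for c in classes:
        Xc = X_train[y_train == c]
        log_prior = math.log(len(Xc) / len(X_train))
        scores.append(log_prior + t_logpdf(X_test, *fit_class(Xc, mu0, kappa, Sigma0, m)))
    return classes[np.argmax(np.vstack(scores), axis=0)]

# Synthetic stand-in data: three well-separated 2-D classes (assumption).
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])

def make_data(n_per_class):
    X = np.vstack([rng.normal(c, 1.0, size=(n_per_class, 2)) for c in centers])
    y = np.repeat(np.arange(len(centers)), n_per_class)
    return X, y

X_train, y_train = make_data(10)
X_test, y_test = make_data(50)

# Model parameters stated for Part B.
d = 2
pred = map_predict(X_train, y_train, X_test,
                   mu0=np.zeros(d), kappa=0.1, Sigma0=8 * np.eye(d), m=20)
accuracy = float(np.mean(pred == y_test))
print(accuracy)
```

Because $B$ always includes the positive definite $\Sigma_0$, the scale matrix stays invertible even for classes with very few training samples.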
