
Fisher information matrix for Gaussian and categorical distributions

Jakub M. Tomczak

November 28, 2012

1 Notations

Let x be a random variable. Consider a parametric distribution of x with parameters θ, p(x|θ). The continuous random variable x ∈ R can be modelled by the normal (Gaussian) distribution:

\[
p(x|\theta) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{ -\frac{(x-\mu)^2}{2\sigma^2} \right\} = \mathcal{N}(x|\mu, \sigma^2), \tag{1}
\]

where θ = (µ, σ²)ᵀ.

A discrete (categorical) variable x ∈ X, where X is a finite set of K values, can be modelled by the categorical distribution:¹

\[
p(x|\theta) = \prod_{k=1}^{K} \theta_k^{x_k} = \mathrm{Cat}(x|\theta), \tag{2}
\]

where 0 ≤ θ_k ≤ 1 and Σ_k θ_k = 1.

For X = {0, 1} we get a special case of the categorical distribution, the Bernoulli distribution:

\[
p(x|\theta) = \theta^x (1-\theta)^{1-x} = \mathrm{Bern}(x|\theta). \tag{3}
\]

¹ We use the 1-of-K encoding [1].

2 Fisher information matrix

2.1 Definition

The Fisher score is defined as follows [1]:

\[
g(\theta, x) = \nabla_\theta \ln p(x|\theta). \tag{4}
\]

The Fisher information matrix is defined as follows [1]:

\[
F = \mathbb{E}_x\left[ g(\theta, x)\, g(\theta, x)^T \right]. \tag{5}
\]
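Definition (5) lends itself to a direct Monte Carlo check: averaging the outer product of scores over samples from p(x|θ) approximates F. Below is a minimal NumPy sketch; the helper names `fisher_monte_carlo` and `gaussian_score` are ours, not from any library, and the score used is the Gaussian one derived later in (20):

```python
import numpy as np

# Monte Carlo estimate of (5): F ≈ (1/N) Σ_n g(θ, x_n) g(θ, x_n)^T
# with x_n ~ p(x|θ). `score` is any function returning ∇_θ ln p(x|θ).
def fisher_monte_carlo(score, samples, theta):
    G = np.stack([score(theta, x) for x in samples])  # shape (N, dim θ)
    return G.T @ G / len(samples)                     # empirical E_x[g g^T]

def gaussian_score(theta, x):
    # Score of N(x|µ, σ²) with respect to θ = (µ, σ²), cf. (20).
    mu, var = theta
    return np.array([(x - mu) / var,
                     -0.5 / var + 0.5 * (x - mu) ** 2 / var ** 2])

rng = np.random.default_rng(0)
mu, var = 1.0, 2.0
xs = rng.normal(mu, np.sqrt(var), size=100_000)
F_hat = fisher_monte_carlo(gaussian_score, xs, (mu, var))
# F_hat approaches [[1/σ², 0], [0, 1/(2σ⁴)]] as N grows.
```

The estimator converges at the usual O(1/√N) Monte Carlo rate, so it is a sanity check rather than a substitute for the closed forms derived below.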



2.2 Example 1: Bernoulli distribution

Let us calculate the Fisher matrix for the Bernoulli distribution (3). First, we need to take the logarithm:

\[
\ln \mathrm{Bern}(x|\theta) = x \ln \theta + (1-x) \ln(1-\theta). \tag{6}
\]

Second, we need to calculate the derivative:

\[
\frac{d}{d\theta} \ln \mathrm{Bern}(x|\theta) = \frac{x}{\theta} - \frac{1-x}{1-\theta} = \frac{x-\theta}{\theta(1-\theta)}. \tag{7}
\]

Hence, we get the following Fisher score for the Bernoulli distribution:

\[
g(\theta, x) = \frac{x-\theta}{\theta(1-\theta)}. \tag{8}
\]

The Fisher information matrix (here it is a scalar) for the Bernoulli distribution is as follows:

\begin{align*}
F &= \mathbb{E}_x[g(\theta, x)\, g(\theta, x)] \\
  &= \mathbb{E}_x\left[ \frac{(x-\theta)^2}{(\theta(1-\theta))^2} \right] \\
  &= \frac{1}{(\theta(1-\theta))^2} \mathbb{E}_x[x^2 - 2x\theta + \theta^2] \\
  &= \frac{1}{(\theta(1-\theta))^2} \left\{ \mathbb{E}_x[x^2] - 2\theta \mathbb{E}_x[x] + \theta^2 \right\} \\
  &= \frac{1}{(\theta(1-\theta))^2} \left\{ \theta - 2\theta^2 + \theta^2 \right\} \\
  &= \frac{\theta(1-\theta)}{(\theta(1-\theta))^2} \\
  &= \frac{1}{\theta(1-\theta)}. \tag{9}
\end{align*}
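The scalar result (9) can be verified numerically: the sample average of g(θ, x)² over Bernoulli draws should approach 1/(θ(1−θ)). A minimal NumPy sketch, with an arbitrary parameter value:

```python
import numpy as np

# Numerical check of (9): the empirical mean of g(θ, x)² over
# Bernoulli samples should approach F = 1/(θ(1-θ)).
theta = 0.3
rng = np.random.default_rng(1)
x = rng.binomial(1, theta, size=500_000)
g = (x - theta) / (theta * (1.0 - theta))  # Fisher score (8)
F_hat = np.mean(g ** 2)
F_exact = 1.0 / (theta * (1.0 - theta))
```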

2.3 Example 2: Categorical distribution

Let us calculate the Fisher matrix for the categorical distribution (2). First, we need to take the logarithm:

\[
\ln \mathrm{Cat}(x|\theta) = \sum_{k=1}^{K} x_k \ln \theta_k. \tag{10}
\]

Second, we need to calculate the partial derivatives:

\[
\frac{\partial}{\partial \theta_k} \ln \mathrm{Cat}(x|\theta) = \frac{x_k}{\theta_k}. \tag{11}
\]

Hence, we get the following Fisher score for the categorical distribution:

\[
g(\theta, x) = \begin{bmatrix} \frac{x_1}{\theta_1} \\ \vdots \\ \frac{x_K}{\theta_K} \end{bmatrix}. \tag{12}
\]



Now, let us calculate the product of the Fisher score and its transposition:

\[
\begin{bmatrix} \frac{x_1}{\theta_1} \\ \vdots \\ \frac{x_K}{\theta_K} \end{bmatrix}
\begin{bmatrix} \frac{x_1}{\theta_1} & \cdots & \frac{x_K}{\theta_K} \end{bmatrix}
=
\begin{bmatrix}
\left(\frac{x_1}{\theta_1}\right)^2 & \frac{x_1 x_2}{\theta_1 \theta_2} & \cdots & \frac{x_1 x_K}{\theta_1 \theta_K} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{x_K x_1}{\theta_K \theta_1} & \frac{x_K x_2}{\theta_K \theta_2} & \cdots & \left(\frac{x_K}{\theta_K}\right)^2
\end{bmatrix}
=
\begin{bmatrix}
g_{11} & g_{12} & \cdots & g_{1K} \\
\vdots & \vdots & \ddots & \vdots \\
g_{K1} & g_{K2} & \cdots & g_{KK}
\end{bmatrix}. \tag{13}
\]

Therefore, for g_{kk} we have:

\[
\mathbb{E}_x[g_{kk}] = \mathbb{E}_x\left[ \left( \frac{x_k}{\theta_k} \right)^2 \right] = \frac{1}{\theta_k^2} \mathbb{E}_x[x_k^2] = \frac{1}{\theta_k}, \tag{14}
\]

since x_k ∈ {0, 1} implies E_x[x_k²] = E_x[x_k] = θ_k, and for g_{ij}, i ≠ j:

\[
\mathbb{E}_x[g_{ij}] = \mathbb{E}_x\left[ \frac{x_i x_j}{\theta_i \theta_j} \right] = \frac{1}{\theta_i \theta_j} \mathbb{E}_x[x_i x_j] = 0, \tag{15}
\]

because in the 1-of-K encoding x_i x_j = 0 whenever i ≠ j. Finally, we get:

\[
F = \mathrm{diag}\left\{ \frac{1}{\theta_1}, \ldots, \frac{1}{\theta_K} \right\}. \tag{16}
\]
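The diagonal form (16) can be checked empirically: with 1-of-K samples, the average outer product of scores should approach diag(1/θ₁, …, 1/θ_K). A short NumPy sketch with arbitrary probabilities:

```python
import numpy as np

# Numerical check of (14)-(16): the empirical E_x[g g^T] over 1-of-K
# samples should approach diag(1/θ_1, ..., 1/θ_K).
theta = np.array([0.2, 0.3, 0.5])
rng = np.random.default_rng(2)
X = rng.multinomial(1, theta, size=400_000)  # each row is a 1-of-K vector
G = X / theta                                # row-wise Fisher scores (12)
F_hat = G.T @ G / len(X)
F_exact = np.diag(1.0 / theta)
```

Note that the off-diagonal entries of `F_hat` are exactly zero here, since every row of `X` has a single non-zero component.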

2.4 Example 3: Normal distribution

Let us calculate the Fisher matrix for the univariate normal distribution (1). First, we need to take the logarithm:

\[
\ln \mathcal{N}(x|\mu, \sigma^2) = -\frac{1}{2} \ln 2\pi - \frac{1}{2} \ln \sigma^2 - \frac{1}{2\sigma^2}(x-\mu)^2. \tag{17}
\]

Second, we need to calculate the partial derivatives:

\begin{align*}
\frac{\partial}{\partial \mu} \ln \mathcal{N}(x|\mu, \sigma^2) &= \frac{1}{\sigma^2}(x-\mu), \tag{18} \\
\frac{\partial}{\partial \sigma^2} \ln \mathcal{N}(x|\mu, \sigma^2) &= -\frac{1}{2\sigma^2} + \frac{1}{2\sigma^4}(x-\mu)^2. \tag{19}
\end{align*}

Hence, we get the following Fisher score for the normal distribution:

\[
g(\theta, x) = \begin{bmatrix} \frac{\partial}{\partial \mu} \ln \mathcal{N}(x|\mu, \sigma^2) \\ \frac{\partial}{\partial \sigma^2} \ln \mathcal{N}(x|\mu, \sigma^2) \end{bmatrix}
= \begin{bmatrix} \frac{1}{\sigma^2}(x-\mu) \\ -\frac{1}{2\sigma^2} + \frac{1}{2\sigma^4}(x-\mu)^2 \end{bmatrix}. \tag{20}
\]



Now, let us calculate the product of the Fisher score and its transposition:

\[
\begin{bmatrix} \frac{1}{\sigma^2}(x-\mu) \\ -\frac{1}{2\sigma^2} + \frac{1}{2\sigma^4}(x-\mu)^2 \end{bmatrix}
\begin{bmatrix} \frac{1}{\sigma^2}(x-\mu) & -\frac{1}{2\sigma^2} + \frac{1}{2\sigma^4}(x-\mu)^2 \end{bmatrix} =
\]
\[
\begin{bmatrix}
\frac{1}{\sigma^4}(x-\mu)^2 & -\frac{1}{2\sigma^4}(x-\mu) + \frac{1}{2\sigma^6}(x-\mu)^3 \\
-\frac{1}{2\sigma^4}(x-\mu) + \frac{1}{2\sigma^6}(x-\mu)^3 & \frac{1}{4\sigma^4} - \frac{1}{2\sigma^6}(x-\mu)^2 + \frac{1}{4\sigma^8}(x-\mu)^4
\end{bmatrix}
= \begin{bmatrix} g_{11} & g_{12} \\ g_{21} & g_{22} \end{bmatrix}, \tag{21}
\]

where g_{12} = g_{21}.

In order to calculate the Fisher information matrix we need to determine the expected value of all g_{ij}. Hence,² for g_{11}:

\begin{align*}
\mathbb{E}_x[g_{11}] &= \mathbb{E}_x\left[ \frac{1}{\sigma^4}(x-\mu)^2 \right] \\
&= \frac{1}{\sigma^4}\left( \mathbb{E}_x[x^2] - 2\mu \mathbb{E}_x[x] + \mu^2 \right) \\
&= \frac{1}{\sigma^4}\left( \mu^2 + \sigma^2 - 2\mu^2 + \mu^2 \right) \\
&= \frac{1}{\sigma^2}, \tag{22}
\end{align*}

and for g_{12}:

\begin{align*}
\mathbb{E}_x[g_{12}] &= \mathbb{E}_x\left[ -\frac{1}{2\sigma^4}(x-\mu) + \frac{1}{2\sigma^6}(x-\mu)^3 \right] \\
&= -\frac{1}{2\sigma^4}\left( \mathbb{E}_x[x] - \mu \right) + \frac{1}{2\sigma^6} \mathbb{E}_x[x^3 - 3x^2\mu + 3x\mu^2 - \mu^3] \\
&= \frac{1}{2\sigma^6}\left( \mathbb{E}_x[x^3] - 3\mu \mathbb{E}_x[x^2] + 3\mu^2 \mathbb{E}_x[x] - \mu^3 \right) \\
&= \frac{1}{2\sigma^6}\left( \mu^3 + 3\mu\sigma^2 - 3\mu(\mu^2 + \sigma^2) + 3\mu^3 - \mu^3 \right) \\
&= 0, \tag{23}
\end{align*}

and for g_{22}:

\begin{align*}
\mathbb{E}_x[g_{22}] &= \mathbb{E}_x\left[ \frac{1}{4\sigma^4} - \frac{1}{2\sigma^6}(x-\mu)^2 + \frac{1}{4\sigma^8}(x-\mu)^4 \right] \\
&= \frac{1}{4\sigma^4} - \frac{1}{2\sigma^6} \mathbb{E}_x[x^2 - 2x\mu + \mu^2] + \frac{1}{4\sigma^8} \mathbb{E}_x[x^4 - 4x^3\mu + 6x^2\mu^2 - 4x\mu^3 + \mu^4] \\
&= \frac{1}{4\sigma^4} - \frac{1}{2\sigma^6}\left( \mathbb{E}_x[x^2] - 2\mu \mathbb{E}_x[x] + \mu^2 \right) + \frac{1}{4\sigma^8}\left( \mathbb{E}_x[x^4] - 4\mu \mathbb{E}_x[x^3] + 6\mu^2 \mathbb{E}_x[x^2] - 4\mu^3 \mathbb{E}_x[x] + \mu^4 \right) \\
&= \frac{1}{4\sigma^4} - \frac{1}{2\sigma^6}\sigma^2 + \frac{1}{4\sigma^8} \cdot 3\sigma^4 \\
&= \frac{1}{2\sigma^4}. \tag{24}
\end{align*}

Finally, we get:

\[
F = \begin{bmatrix} \frac{1}{\sigma^2} & 0 \\ 0 & \frac{1}{2\sigma^4} \end{bmatrix}. \tag{25}
\]

² See Section 3 for raw moments of the univariate normal distribution.
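The diagonal matrix (25) can be cross-checked through the identity F = −E_x[∇²_θ ln p(x|θ)], which holds for regular models such as the Gaussian and is equivalent to definition (5). Below is a sketch with the second derivatives of (17) computed analytically; the variable names are our own and the parameter values arbitrary:

```python
import numpy as np

# Cross-check of (25) via F = -E_x[∇²_θ ln p(x|θ)] with θ = (µ, σ²).
mu, var = 0.5, 1.5
rng = np.random.default_rng(3)
x = rng.normal(mu, np.sqrt(var), size=300_000)

h_mm = np.full_like(x, -1.0 / var)                # ∂²/∂µ² ln N
h_mv = -(x - mu) / var ** 2                       # ∂²/(∂µ ∂σ²) ln N
h_vv = 0.5 / var ** 2 - (x - mu) ** 2 / var ** 3  # ∂²/∂(σ²)² ln N

F_hat = -np.array([[h_mm.mean(), h_mv.mean()],
                   [h_mv.mean(), h_vv.mean()]])
F_exact = np.array([[1.0 / var, 0.0],
                    [0.0, 0.5 / var ** 2]])       # matrix (25)
```

Averaging the negative Hessian gives the same limit as averaging the score outer product, but with lower variance here, since the (µ, µ) entry is constant in x.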



2.5 Summary

The Fisher information matrix for each distribution:

• Bernoulli distribution:
\[
F = \frac{1}{\theta(1-\theta)},
\]

• Categorical distribution:
\[
F = \mathrm{diag}\left\{ \frac{1}{\theta_1}, \ldots, \frac{1}{\theta_K} \right\},
\]

• Normal distribution:
\[
F = \begin{bmatrix} \frac{1}{\sigma^2} & 0 \\ 0 & \frac{1}{2\sigma^4} \end{bmatrix}.
\]

3 Appendix: Raw moments

Table 1: The raw moments of the univariate normal distribution.

Order  Expression   Raw moment
1      E_x[x]       µ
2      E_x[x²]      µ² + σ²
3      E_x[x³]      µ³ + 3µσ²
4      E_x[x⁴]      µ⁴ + 6µ²σ² + 3σ⁴
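Table 1 can be sanity-checked by comparing empirical raw moments of Gaussian samples against the closed-form expressions; the parameter values below are arbitrary:

```python
import numpy as np

# Sanity check of Table 1: empirical raw moments of N(µ, σ²) versus
# the closed-form expressions.
mu, var = 1.2, 0.8
rng = np.random.default_rng(4)
x = rng.normal(mu, np.sqrt(var), size=1_000_000)

exact = [mu,
         mu ** 2 + var,
         mu ** 3 + 3 * mu * var,
         mu ** 4 + 6 * mu ** 2 * var + 3 * var ** 2]
empirical = [np.mean(x ** k) for k in range(1, 5)]
```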

References

[1] C. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.
