PDF of Lecture Notes - School of Mathematical Sciences
MATHEMATICAL STATISTICS III

Lecturer: Associate Professor Patty Solomon

SCHOOL OF MATHEMATICAL SCIENCES
Contents

1 Distribution Theory
  1.1 Discrete distributions
    1.1.1 Bernoulli distribution
    1.1.2 Binomial distribution
    1.1.3 Geometric distribution
    1.1.4 Negative Binomial distribution
    1.1.5 Poisson distribution
    1.1.6 Hypergeometric distribution
  1.2 Continuous Distributions
    1.2.1 Uniform Distribution
    1.2.2 Exponential Distribution
    1.2.3 Gamma distribution
    1.2.4 Beta density function
    1.2.5 Normal distribution
    1.2.6 Cauchy distribution
  1.3 Transformations of a single RV
  1.4 CDF transformation
  1.5 Non-monotonic transformations
  1.6 Moments of transformed RVs
  1.7 Multivariate distributions
    1.7.1 Trinomial distribution
    1.7.2 Multinomial distribution
    1.7.3 Marginal and conditional distributions
    1.7.4 Continuous multivariate distributions
  1.8 Transformations of several RVs
    1.8.1 Method of regular transformations
  1.9 Moments
    1.9.1 Moment generating functions
    1.9.2 Marginal distributions and the MGF
    1.9.3 Vector notation
    1.9.4 Properties of variance matrices
  1.10 The multivariate normal distribution
    1.10.1 The multivariate normal MGF
    1.10.2 Independence and normality
  1.11 Limit Theorems
    1.11.1 Convergence of random variables

2 Statistical Inference
  2.1 Basic definitions and terminology
  2.2 Minimum Variance Unbiased Estimation
  2.3 Methods of Estimation
    2.3.1 Method of Moments
    2.3.2 Maximum Likelihood Estimation
  2.4 Hypothesis Tests and Confidence Intervals
1 Distribution Theory
A discrete RV is described by its probability function

    p(x) = P({X = x}).

A continuous RV is described by its probability density function, f(x), for which

    P({a ≤ X ≤ b}) = ∫_{a}^{b} f(x) dx,   for all a ≤ b.

In particular,

    P(X = a) = ∫_{a}^{a} f(x) dx = 0

for any continuous RV.
Some RVs are neither discrete nor continuous. For example, daily precipitation has a point mass at zero (dry days) but is continuously distributed over positive amounts.
The cumulative distribution function (CDF) is defined by:

    F(x) = P({X ≤ x})   (the probability "to the left" of x),

computed as

    F(x) = Σ_{t: t ≤ x} p(t)      (X discrete)
    F(x) = ∫_{−∞}^{x} f(t) dt     (X continuous).

The expected value E(X) of a discrete RV is given by:

    E(X) = Σ_x x p(x),

provided that Σ_x |x| p(x) < ∞ (i.e., the sum must converge absolutely). Otherwise the expectation is not defined.
The expected value of a continuous RV is given by:

    E(X) = ∫_{−∞}^{∞} x f(x) dx,

provided that ∫_{−∞}^{∞} |x| f(x) dx < ∞; otherwise it is not defined. (For example, the Cauchy distribution has no moments.)
More generally,

    E{h(X)} = Σ_x h(x) p(x),            if Σ_x |h(x)| p(x) < ∞             (X discrete)
    E{h(X)} = ∫_{−∞}^{∞} h(x) f(x) dx,  if ∫_{−∞}^{∞} |h(x)| f(x) dx < ∞   (X continuous).
The moment generating function of a RV X is defined to be:

    M_X(t) = E[e^{tX}] = Σ_x e^{tx} p(x)            (X discrete)
                       = ∫_{−∞}^{∞} e^{tx} f(x) dx  (X continuous).

M_X(0) = 1 always; the MGF may or may not be defined for other values of t.

If M_X(t) is defined for all t in some open interval containing 0, then:

1. moments of all orders exist;
2. E[X^r] = M_X^{(r)}(0) (the r-th order derivative at 0);
3. M_X(t) uniquely determines the distribution of X.

In particular,

    M′(0) = E(X),   M′′(0) = E(X²),

and so on.
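The link between derivatives of M_X at 0 and the moments can be checked numerically. A minimal sketch (not part of the notes; the step size h and the choice of the Bernoulli MGF are ours), using M_X(t) = 1 + p(e^t − 1):

```python
import math

def mgf_bernoulli(t, p):
    """Bernoulli(p) MGF: M(t) = 1 + p(e^t - 1)."""
    return 1 + p * (math.exp(t) - 1)

def first_two_moments(mgf, h=1e-5):
    """Central-difference estimates of M'(0) and M''(0)."""
    m1 = (mgf(h) - mgf(-h)) / (2 * h)
    m2 = (mgf(h) - 2 * mgf(0.0) + mgf(-h)) / h ** 2
    return m1, m2

p = 0.3
m1, m2 = first_two_moments(lambda t: mgf_bernoulli(t, p))
# For Bernoulli(p): E(X) = p and E(X^2) = p, since X^2 = X.
```

Both estimates should be close to p = 0.3, matching M′(0) = E(X) and M′′(0) = E(X²).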
1.1 Discrete distributions

1.1.1 Bernoulli distribution

Parameter: 0 ≤ p ≤ 1
Possible values: {0, 1}
Prob. function:

    p(x) = p^x (1 − p)^{1−x} = p for x = 1, and 1 − p for x = 0.

    E(X) = p
    Var(X) = p(1 − p)
    M_X(t) = 1 + p(e^t − 1).
1.1.2 Binomial distribution

Parameters: 0 ≤ p ≤ 1; n > 0 an integer.

Consider a sequence of n independent Bern(p) trials. If X = total number of successes, then X ∼ B(n, p).

Prob. function:

    p(x) = C(n, x) p^x (1 − p)^{n−x},   x = 0, 1, . . . , n,

where C(n, x) = n!/(x!(n − x)!). This is a valid probability function: p(x) ≥ 0 for x = 0, 1, . . . , n, and Σ_{x=0}^{n} p(x) = 1 by the binomial theorem.

    M_X(t) = {1 + p(e^t − 1)}^n.

Note: if Y_1, . . . , Y_n are the underlying Bernoulli RVs, then X = Y_1 + Y_2 + · · · + Y_n (the same as counting the number of successes).
1.1.3 Geometric distribution

Suppose a sequence of independent Bernoulli trials is performed and let X be the number of failures preceding the first success. Then X ∼ Geom(p), with

    p(x) = p(1 − p)^x,   x = 0, 1, 2, . . .
1.1.4 Negative Binomial distribution

Suppose a sequence of independent Bernoulli trials is conducted. If X is the number of failures preceding the n-th success, then X has a negative binomial distribution.

Prob. function:

    p(x) = C(n + x − 1, n − 1) p^n (1 − p)^x,   x = 0, 1, 2, . . . ,

where the binomial coefficient counts the ways of allocating the failures and successes among the first n + x − 1 trials.

    E(X) = n(1 − p)/p,
    Var(X) = n(1 − p)/p²,
    M_X(t) = [p / (1 − e^t(1 − p))]^n,   for e^t(1 − p) < 1.

1. If n = 1, we obtain the geometric distribution.
2. The negative binomial is also seen to arise as the sum of n independent geometric variables.
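The negative binomial mean and variance can be verified by direct (truncated) summation of the probability function. A quick sketch, not part of the notes (the truncation point 500 and the parameter values are arbitrary choices):

```python
import math

def nb_pmf(x, n, p):
    """P(X = x) for X = number of failures before the n-th success."""
    return math.comb(n + x - 1, n - 1) * p ** n * (1 - p) ** x

n, p = 3, 0.4
xs = range(500)                      # truncation; the tail mass is negligible here
total = sum(nb_pmf(x, n, p) for x in xs)
mean = sum(x * nb_pmf(x, n, p) for x in xs)
var = sum(x ** 2 * nb_pmf(x, n, p) for x in xs) - mean ** 2
# Theory: E(X) = n(1-p)/p = 4.5 and Var(X) = n(1-p)/p^2 = 11.25.
```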
1.1.5 Poisson distribution

Parameter: λ > 0
Possible values: x = 0, 1, 2, . . .
Prob. function:

    p(x) = e^{−λ} λ^x / x!

    M_X(t) = e^{λ(e^t − 1)}.

The Poisson distribution arises as the distribution for the number of "point events" observed from a Poisson process. For example: the number of incoming calls to a certain exchange in a given hour.

Figure 1: Poisson example

The Poisson distribution is the limiting form of the binomial distribution as

    n → ∞,   p → 0,   np → λ.

The derivation of the Poisson distribution (via the binomial) is underpinned by a Poisson process, i.e., a point process on [0, ∞); see Figure 1.

AXIOMS for a Poisson process of rate λ > 0:

(A) The numbers of occurrences in disjoint intervals are independent.
(B) The probability of 1 or more occurrences in any interval [t, t + h) is λh + o(h) as h → 0 (i.e., approximately λ times the length of the interval).

(C) The probability of more than one occurrence in [t, t + h) is o(h) as h → 0 (i.e., the probability is small and negligible).

Note: o(h), pronounced "small order h", is standard notation for any function r(h) with the property

    lim_{h→0} r(h)/h = 0.
Figure 2: Small order h: h⁴ is o(h); h itself is not.
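The binomial-to-Poisson limit (n → ∞, p → 0 with np → λ) can be illustrated numerically. A sketch, not part of the notes (the grid of n values and λ = 2 are arbitrary choices):

```python
import math

def binom_pmf(x, n, p):
    return math.comb(n, x) * p ** x * (1 - p) ** (n - x)

def poisson_pmf(x, lam):
    return math.exp(-lam) * lam ** x / math.factorial(x)

lam = 2.0
# Largest pointwise gap over x = 0..10, for increasing n with np = lam held fixed.
errs = {n: max(abs(binom_pmf(x, n, lam / n) - poisson_pmf(x, lam))
               for x in range(11))
        for n in (10, 100, 1000)}
```

The gap shrinks as n grows, consistent with the limiting-form statement above.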
1.1.6 Hypergeometric distribution

Consider an urn containing M black and N white balls. Suppose n balls are sampled randomly without replacement, and let X be the number of black balls chosen. Then X has a hypergeometric distribution.

Parameters: M, N > 0; 0 < n ≤ M + N
Possible values: max(0, n − N) ≤ x ≤ min(n, M)
Prob. function:

    p(x) = C(M, x) C(N, n − x) / C(M + N, n),

    E(X) = nM/(M + N),

    Var(X) = [(M + N − n)/(M + N − 1)] · nMN/(M + N)².

The MGF exists, but there is no useful expression available.
1. The hypergeometric probability is simply

       (# selections with x black balls) / (# possible selections)
           = C(M, x) C(N, n − x) / C(M + N, n).

2. To see how the limits arise, observe we must have x ≤ n (no more black balls than the sample size) and x ≤ M; i.e., x ≤ min(n, M). Similarly, we must have x ≥ 0 (cannot have < 0 black balls in the sample) and n − x ≤ N (cannot have more white balls than the number in the urn), i.e., x ≥ n − N. Together, x ≥ max(0, n − N).

3. If we sampled with replacement, we would get X ∼ B(n, p) with p = M/(M + N). It is interesting to compare moments:

       hypergeometric:  E(X) = np,  Var(X) = [(M + N − n)/(M + N − 1)] np(1 − p)
       binomial:        E(X) = np,  Var(X) = np(1 − p)

   The factor (M + N − n)/(M + N − 1) is the finite population correction: when we sample all the balls in the urn (n = M + N), Var(X) = 0.

4. When M, N ≫ n, the difference between sampling with and without replacement should be small.
Figure 3: p = 1/3. If a white ball is taken out:
Figure 4: p = 1/2 (without replacement); Figure 5: p = 1/3 (with replacement).
Intuitively, this implies that for M, N ≫ n the hypergeometric and binomial probabilities should be very similar, and this can be verified for fixed n, x:

    lim_{M,N→∞, M/(M+N)→p} C(M, x) C(N, n − x) / C(M + N, n) = C(n, x) p^x (1 − p)^{n−x}.
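This limit can be checked numerically by scaling the urn up while holding M/(M + N) fixed. A sketch, not part of the notes (the urn sizes and p = 1/3 are arbitrary choices):

```python
import math

def hypergeom_pmf(x, M, N, n):
    return math.comb(M, x) * math.comb(N, n - x) / math.comb(M + N, n)

def binom_pmf(x, n, p):
    return math.comb(n, x) * p ** x * (1 - p) ** (n - x)

n, p = 5, 1 / 3
gaps = []
for M in (10, 100, 10_000):
    N = 2 * M                        # keeps M/(M+N) = 1/3 fixed as the urn grows
    gaps.append(max(abs(hypergeom_pmf(x, M, N, n) - binom_pmf(x, n, p))
                    for x in range(n + 1)))
```

The largest pointwise gap shrinks steadily as the urn grows.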
1.2 Continuous Distributions

1.2.1 Uniform Distribution

Parameters: a < b
PDF: f(x) = 1/(b − a) for a < x < b, and 0 otherwise.

CDF, for a < x < b:

    F(x) = ∫_{−∞}^{x} f(t) dt = ∫_{a}^{x} 1/(b − a) dt = (x − a)/(b − a),
that is,

    F(x) = 0 for x ≤ a;   F(x) = (x − a)/(b − a) for a < x < b;   F(x) = 1 for x ≥ b.

Figure 6: Uniform distribution CDF

    M_X(t) = (e^{tb} − e^{ta}) / (t(b − a)),   t ≠ 0.

A special case is the U(0, 1) distribution:

    f(x) = 1 for 0 < x < 1, and 0 otherwise;
    F(x) = x for 0 < x < 1;
    E(X) = 1/2,   Var(X) = 1/12,   M(t) = (e^t − 1)/t.
1.2.2 Exponential Distribution

Parameter: λ > 0
PDF: f(x) = λe^{−λx}, x > 0
CDF: F(x) = 1 − e^{−λx}, x > 0

    M_X(t) = λ/(λ − t),   t < λ.

This is the distribution of the waiting time until the first occurrence in a Poisson process with rate parameter λ.

1. If X ∼ Exp(λ), then

       P(X ≥ t + x | X ≥ t) = P(X ≥ x)   (the memoryless property).

2. It can be obtained as a limiting form of the geometric distribution.
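The memoryless property follows from the survival function S(u) = P(X ≥ u) = e^{−λu}, since S(t + x)/S(t) = S(x). A quick numerical confirmation (a sketch, not part of the notes; the grid of t and x values is an arbitrary choice):

```python
import math

def survival(u, lam):
    """P(X >= u) = e^{-lam u} for X ~ Exp(lam)."""
    return math.exp(-lam * u)

lam = 1.5
# Pairs (P(X >= t+x | X >= t), P(X >= x)) over a small grid of t and x.
checks = [(survival(t + x, lam) / survival(t, lam), survival(x, lam))
          for t in (0.5, 1.0, 3.0) for x in (0.2, 1.0, 2.5)]
```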
1.2.3 Gamma distribution

gamma function:  Γ(α) = ∫_{0}^{∞} t^{α−1} e^{−t} dt

PDF:  f(x) = (λ^α / Γ(α)) x^{α−1} e^{−λx},   x > 0

MGF:  M_X(t) = (λ/(λ − t))^α,   t < λ.

Suppose Y_1, . . . , Y_K are independent Exp(λ) RVs and let X = Y_1 + · · · + Y_K. Then X ∼ Gamma(K, λ) for integer K. In general, X ∼ Gamma(α, λ) with α > 0.

1. α is the shape parameter and λ is the scale parameter.

   Figure 7: Gamma distribution

   Note: if Y ∼ Gamma(α, 1) and X = Y/λ, then X ∼ Gamma(α, λ); that is, λ acts as a scale parameter.

2. The Gamma(p/2, 1/2) distribution is also called the χ²_p (chi-square with p df) distribution when p is an integer; in particular, χ²_2 is the Exp(1/2) (exponential) distribution.
3. The Gamma(K, λ) distribution can be interpreted as the waiting time until the K-th occurrence in a Poisson process of rate λ.
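The sum-of-exponentials construction can be illustrated by simulation: summing K independent Exp(λ) waiting times should give a sample with mean K/λ and variance K/λ². A seeded simulation sketch, not part of the notes (the parameter values and sample size are arbitrary choices):

```python
import random

random.seed(0)
K, lam, reps = 4, 2.0, 200_000
# Each sample is the waiting time until the K-th occurrence: a sum of K
# independent Exp(lam) inter-arrival times.
samples = [sum(random.expovariate(lam) for _ in range(K)) for _ in range(reps)]
mean = sum(samples) / reps
var = sum((s - mean) ** 2 for s in samples) / reps
# Theory for Gamma(K, lam): E(X) = K/lam = 2.0 and Var(X) = K/lam^2 = 1.0.
```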
1.2.4 Beta density function

Suppose Y_1 ∼ Gamma(α, λ) and Y_2 ∼ Gamma(β, λ) independently. Then

    X = Y_1/(Y_1 + Y_2) ∼ Beta(α, β),   0 ≤ x ≤ 1.

Figure 8: Beta distribution
1.2.5 Normal distribution

X ∼ N(µ, σ²):  f(x) = (1/(σ√(2π))) e^{−(x−µ)²/(2σ²)};   M_X(t) = e^{tµ + t²σ²/2}.
1.2.6 Cauchy distribution

Possible values: x ∈ ℝ
PDF:  f(x) = (1/π) · 1/(1 + x²)   (location parameter θ = 0)
CDF:  F(x) = 1/2 + (1/π) arctan x

E(X), Var(X) and M_X(t) do not exist: the Cauchy is a bell-shaped distribution, symmetric about zero, for which no moments are defined.
Figure 9: Cauchy distribution (more sharply peaked than the normal distribution, with tails that go to zero much more slowly).

If Z_1 ∼ N(0, 1) and Z_2 ∼ N(0, 1) independently, then X = Z_1/Z_2 has the Cauchy distribution.
1.3 Transformations of a single RV

If X is a RV and Y = h(X), then Y is also a RV. If we know the distribution of X, then for a given function h(x) we should be able to find the distribution of Y = h(X).

Theorem 1.3.1 (Discrete case)
Suppose X is a discrete RV with prob. function p_X(x) and let Y = h(X), where h is any function. Then:

    p_Y(y) = Σ_{x: h(x)=y} p_X(x)

(the sum over all values x for which h(x) = y).

Proof.

    p_Y(y) = P(Y = y) = P{h(X) = y}
           = Σ_{x: h(x)=y} P(X = x)
           = Σ_{x: h(x)=y} p_X(x).
Theorem 1.3.2
Suppose X is a continuous RV with PDF f_X(x) and let Y = h(X), where h(x) is differentiable and monotonic, i.e., either strictly increasing or strictly decreasing. Then the PDF of Y is given by:

    f_Y(y) = f_X{h⁻¹(y)} |(h⁻¹)′(y)|.

Proof. Assume h is increasing. Then

    F_Y(y) = P(Y ≤ y) = P{h(X) ≤ y}
           = P{X ≤ h⁻¹(y)}
           = F_X{h⁻¹(y)}.

Figure 10: h increasing.

Differentiating (use the chain rule),

    f_Y(y) = (d/dy) F_Y(y) = (d/dy) F_X{h⁻¹(y)} = f_X{h⁻¹(y)} (h⁻¹)′(y).   (1)

Now consider the case of h(x) decreasing:

    F_Y(y) = P(Y ≤ y) = P{h(X) ≤ y}
           = P{X ≥ h⁻¹(y)}
           = 1 − F_X{h⁻¹(y)}

    ⇒ f_Y(y) = −f_X{h⁻¹(y)} (h⁻¹)′(y).   (2)

Figure 11: h decreasing.

Finally, observe that if h is increasing then h′(x), and hence (h⁻¹)′(y), must be positive. Similarly, for h decreasing, (h⁻¹)′(y) < 0. Hence (1) and (2) can be combined to give:

    f_Y(y) = f_X{h⁻¹(y)} |(h⁻¹)′(y)|.
Examples:

1. Discrete transformation of a single RV

   X ∼ Po(λ) and Y is X rounded to the nearest multiple of 10. Possible values of Y are: 0, 10, 20, . . .

       P(Y = 0) = P(X = 0) + P(X = 1) + P(X = 2) + P(X = 3) + P(X = 4)
                = e^{−λ} + e^{−λ}λ + e^{−λ}λ²/2! + e^{−λ}λ³/3! + e^{−λ}λ⁴/4!;

       P(Y = 10) = P(X = 5) + P(X = 6) + · · · + P(X = 14),

   and so on.
2. Continuous transformation of a single RV

   Let h(X) = aX + b, a ≠ 0. To find h⁻¹(y) we solve for x in the equation y = h(x), i.e.,

       y = ax + b  ⇒  x = (y − b)/a
                   ⇒  h⁻¹(y) = (y − b)/a
                   ⇒  (h⁻¹)′(y) = 1/a.

   Hence

       f_Y(y) = (1/|a|) f_X((y − b)/a).

   Specifically:

   1. Suppose Z ∼ N(0, 1) and let X = µ + σZ, σ > 0. Recall that Z has PDF

          φ(z) = (1/√(2π)) e^{−z²/2}.

      Hence

          f_X(x) = (1/σ) φ((x − µ)/σ) = (1/(σ√(2π))) e^{−(x−µ)²/(2σ²)},

      which is the N(µ, σ²) PDF.

   2. Suppose X ∼ N(µ, σ²) and let Z = (X − µ)/σ. Find the PDF of Z.
   Solution. Observe that Z = h(X) = (1/σ)X − µ/σ, so

       f_Z(z) = σ f_X(σz + µ)
              = σ · (1/(σ√(2π))) e^{−(µ+σz−µ)²/(2σ²)}
              = (1/√(2π)) e^{−σ²z²/(2σ²)}
              = (1/√(2π)) e^{−z²/2} = φ(z),

   i.e., Z ∼ N(0, 1).
3. Suppose X ∼ Gamma(α, 1) and let Y = X/λ, λ > 0. Find the PDF of Y.

   Solution. Since Y = (1/λ)X is a linear function, we have

       f_Y(y) = λ f_X(λy) = λ · (1/Γ(α)) (λy)^{α−1} e^{−λy}
              = (λ^α/Γ(α)) y^{α−1} e^{−λy},

   which is the Gamma(α, λ) PDF.
1.4 CDF transformation

Suppose X is a continuous RV with CDF F_X(x), which is increasing over the range of X. If U = F_X(X), then U ∼ U(0, 1).

Proof.

    F_U(u) = P(U ≤ u)
           = P{F_X(X) ≤ u}
           = P{X ≤ F_X⁻¹(u)}
           = F_X{F_X⁻¹(u)}
           = u,   for 0 < u < 1.

This is simply the CDF of U(0, 1), so the result is proved.

The converse also applies. If U ∼ U(0, 1) and F is any CDF that is strictly increasing on its range, then X = F⁻¹(U) has CDF F(x), i.e.,

    F_X(x) = P(X ≤ x) = P{F⁻¹(U) ≤ x}
           = P(U ≤ F(x))
           = F(x),   as required.
1.5 Non-monotonic transformations

Theorem 1.3.2 applies to monotonic (either strictly increasing or strictly decreasing) transformations of a continuous RV.

In general, if h(x) is not monotonic, then Y = h(X) may not even be a continuous RV. For example, if h(x) = [x] (the integer part of x), then the possible values for Y = h(X) are the integers, so Y is discrete.

However, it can be shown that h(X) is continuous if X is continuous and h(x) is piecewise monotonic.
1.6 Moments of transformed RVs

Suppose X is a RV and let Y = h(X). If we want to find E(Y), we can proceed as follows:

1. Find the distribution of Y = h(X) using the preceding methods.

2. Find

       E(Y) = Σ_y y p(y)             (Y discrete)
       E(Y) = ∫_{−∞}^{∞} y f(y) dy   (Y continuous)

   (forget X ever existed!).

Theorem 1.6.1
If X is a RV of either discrete or continuous type and h(x) is any transformation (not necessarily monotonic), then E{h(X)} (provided it exists) is given by:

    E{h(X)} = Σ_x h(x) p(x)             (X discrete)
    E{h(X)} = ∫_{−∞}^{∞} h(x) f(x) dx   (X continuous)

Proof. Not examinable.
Examples:

1. CDF transformation

   Suppose U ∼ U(0, 1). How can we transform U to get an Exp(λ) RV?

   Solution. Take X = F⁻¹(U), where F is the Exp(λ) CDF. Recall F(x) = 1 − e^{−λx} for the Exp(λ) distribution.

   To find F⁻¹(u), we solve for x in F(x) = u, i.e.,

       u = 1 − e^{−λx}
       ⇒ 1 − u = e^{−λx}
       ⇒ ln(1 − u) = −λx
       ⇒ x = −ln(1 − u)/λ.

   Hence if U ∼ U(0, 1), it follows that X = −ln(1 − U)/λ ∼ Exp(λ).

   Note: Y = −(ln U)/λ ∼ Exp(λ) as well, since both 1 − U and U have the U(0, 1) distribution.

   This type of result is used to generate random numbers. That is, there are good methods for producing U(0, 1) (pseudo-random) numbers; to obtain Exp(λ) random numbers, we can just generate U(0, 1) numbers and calculate X = −(log U)/λ.
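A seeded sketch of this inverse-CDF recipe (not part of the notes; the helper names, sample size and check points are our choices). It draws Exp(λ) values from uniforms and compares the empirical CDF with F(x) = 1 − e^{−λx}:

```python
import bisect
import math
import random

def exp_sample(lam, rng):
    """One Exp(lam) draw from one U(0,1) draw via X = -ln(1-U)/lam.

    Using 1 - U (also uniform, as noted above) avoids log(0) since
    rng.random() lies in [0, 1).
    """
    return -math.log(1.0 - rng.random()) / lam

rng = random.Random(42)
lam = 2.0
xs = sorted(exp_sample(lam, rng) for _ in range(100_000))

def ecdf(t):
    """Empirical CDF of the sample at t."""
    return bisect.bisect_right(xs, t) / len(xs)

# Largest gap between the empirical CDF and the Exp(lam) CDF at a few points.
gap = max(abs(ecdf(t) - (1 - math.exp(-lam * t))) for t in (0.1, 0.5, 1.0, 2.0))
```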
2. Non-monotonic transformations

   Suppose Z ∼ N(0, 1) and let X = Z²; h(z) = z² is not monotonic, so Theorem 1.3.2 does not apply. However, we can proceed as follows:

       F_X(x) = P(X ≤ x)
              = P(Z² ≤ x)
              = P(−√x ≤ Z ≤ √x)
              = Φ(√x) − Φ(−√x),

   where Φ(a) = ∫_{−∞}^{a} (1/√(2π)) e^{−t²/2} dt is the N(0, 1) CDF. That is,

       f_X(x) = (d/dx) F_X(x) = φ(√x)(½x^{−1/2}) − φ(−√x)(−½x^{−1/2}),

   where φ(a) = Φ′(a) = (1/√(2π)) e^{−a²/2} is the N(0, 1) PDF. Hence

       f_X(x) = ½x^{−1/2} [(1/√(2π)) e^{−x/2} + (1/√(2π)) e^{−x/2}]
              = ((1/2)^{1/2}/√π) x^{1/2−1} e^{−x/2},

   which is the Gamma(1/2, 1/2) PDF.

   On the other hand, the distribution of Z² is also called the χ²₁ distribution. We have proved that it is the same as the Gamma(1/2, 1/2) distribution.
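The derivation can be checked numerically: differentiating F_X(x) = Φ(√x) − Φ(−√x) should reproduce the Gamma(1/2, 1/2) PDF. A sketch, not part of the notes (the check points and step size h are arbitrary choices; Φ is written via the error function, Φ(a) = (1 + erf(a/√2))/2):

```python
import math

def Phi(a):
    """N(0,1) CDF via the error function."""
    return 0.5 * (1 + math.erf(a / math.sqrt(2)))

def F_X(x):
    """CDF of X = Z^2 for Z ~ N(0,1)."""
    return Phi(math.sqrt(x)) - Phi(-math.sqrt(x))

def gamma_half_half_pdf(x):
    # (1/2)^{1/2} / Gamma(1/2) * x^{-1/2} * e^{-x/2}, with Gamma(1/2) = sqrt(pi)
    return math.sqrt(0.5) / math.sqrt(math.pi) * x ** -0.5 * math.exp(-x / 2)

h = 1e-6
# Central-difference derivative of F_X versus the Gamma(1/2, 1/2) density.
diffs = [abs((F_X(x + h) - F_X(x - h)) / (2 * h) - gamma_half_half_pdf(x))
         for x in (0.25, 1.0, 2.0, 5.0)]
```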
3. Moments of transformed RVs:

   If U ∼ U(0, 1) and Y = −(log U)/λ, then Y ∼ Exp(λ), so f(y) = λe^{−λy}, y > 0. Can check:

       E(Y) = ∫_{0}^{∞} λy e^{−λy} dy = 1/λ.

4. Based on Theorem 1.6.1:

   If U ∼ U(0, 1) and Y = −(log U)/λ, then according to Theorem 1.6.1,

       E(Y) = ∫_{0}^{1} (−log u/λ)(1) du
            = −(1/λ) ∫_{0}^{1} log u du
            = −(1/λ) (u log u − u) |_{0}^{1}
            = −(1/λ) [u log u |_{0}^{1} − u |_{0}^{1}]
            = −(1/λ)[0 − 1]
            = 1/λ,   as required.
There are some important consequences of Theorem 1.6.1:
1. If E(X) = µ and Var(X) = σ², and Y = aX + b for constants a, b, then E(Y) = aµ + b and Var(Y) = a²σ².

   Proof. (Continuous case)

       E(Y) = E(aX + b) = ∫_{−∞}^{∞} (ax + b) f(x) dx
            = a ∫_{−∞}^{∞} x f(x) dx + b ∫_{−∞}^{∞} f(x) dx
            = aµ + b,

   since the first integral is E(X) = µ and the second is 1.

       Var(Y) = E[{Y − E(Y)}²]
              = E[{aX + b − (aµ + b)}²]
              = E{a²(X − µ)²}
              = a² E{(X − µ)²}
              = a² Var(X)
              = a²σ².
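A quick check of E(aX + b) = aµ + b and Var(aX + b) = a²σ² on a small discrete distribution (a sketch, not part of the notes; the toy pmf and the constants a, b are arbitrary choices):

```python
pmf = {0: 0.2, 1: 0.5, 2: 0.3}       # an arbitrary toy distribution
mu = sum(x * p for x, p in pmf.items())
sigma2 = sum((x - mu) ** 2 * p for x, p in pmf.items())

a, b = 3.0, -1.0
pmf_Y = {a * x + b: p for x, p in pmf.items()}   # distribution of Y = aX + b
mu_Y = sum(y * p for y, p in pmf_Y.items())
var_Y = sum((y - mu_Y) ** 2 * p for y, p in pmf_Y.items())
# mu_Y should equal a*mu + b, and var_Y should equal a^2 * sigma2.
```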
2. If X is a RV and h(X) is any function, then the MGF of Y = h(X), provided it exists, is

       M_Y(t) = Σ_x e^{t h(x)} p(x)             (X discrete)
       M_Y(t) = ∫_{−∞}^{∞} e^{t h(x)} f(x) dx   (X continuous)

   This gives us another way to find the distribution of Y = h(X): find M_Y(t), and if we recognise M_Y(t), then by uniqueness we can conclude that Y has that distribution.
Examples

1. Suppose X is continuous with CDF F(x), and F(a) = 0, F(b) = 1 (a, b can be ±∞ respectively). Let U = F(X). Observe that

       M_U(t) = ∫_{a}^{b} e^{tF(x)} f(x) dx
              = (1/t) e^{tF(x)} |_{a}^{b}
              = (e^{tF(b)} − e^{tF(a)})/t
              = (e^t − 1)/t,

   which is the U(0, 1) MGF.

2. Suppose X ∼ U(0, 1), and let Y = −(log X)/λ. Then

       M_Y(t) = ∫_{0}^{1} e^{t(−log x/λ)} (1) dx
              = ∫_{0}^{1} x^{−t/λ} dx
              = (1/(1 − t/λ)) x^{1−t/λ} |_{0}^{1}
              = 1/(1 − t/λ)
              = λ/(λ − t),

   which is the MGF for the Exp(λ) distribution. Hence we can conclude that Y = −(log X)/λ ∼ Exp(λ).
1.7 Multivariate distributions

Definition 1.7.1
If X_1, X_2, . . . , X_r are discrete RVs, then X = (X_1, X_2, . . . , X_r)ᵀ is called a discrete random vector. The probability function P(x) is:

    P(x) = P(X = x) = P({X_1 = x_1} ∩ {X_2 = x_2} ∩ · · · ∩ {X_r = x_r});

we write P(x) = P(x_1, x_2, . . . , x_r).
1.7.1 Trinomial distribution

(The case r = 2.) Consider a sequence of n independent trials where each trial produces:

    Outcome 1: with prob π_1
    Outcome 2: with prob π_2
    Outcome 3: with prob 1 − π_1 − π_2

If X_1, X_2 are the numbers of occurrences of outcomes 1 and 2 respectively, then (X_1, X_2) have the trinomial distribution.

Parameters: π_1 > 0, π_2 > 0 and π_1 + π_2 < 1; n > 0 fixed
Possible values: integers (x_1, x_2) s.t. x_1 ≥ 0, x_2 ≥ 0, x_1 + x_2 ≤ n
Probability function:

    P(x_1, x_2) = [n! / (x_1! x_2! (n − x_1 − x_2)!)] π_1^{x_1} π_2^{x_2} (1 − π_1 − π_2)^{n−x_1−x_2}

for x_1, x_2 ≥ 0, x_1 + x_2 ≤ n.
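As a sanity check, the trinomial probabilities should sum to 1 over the support {(x_1, x_2) : x_1, x_2 ≥ 0, x_1 + x_2 ≤ n}. A sketch, not part of the notes (the parameter values are arbitrary choices):

```python
import math

def trinomial_pmf(x1, x2, n, pi1, pi2):
    c = (math.factorial(n) //
         (math.factorial(x1) * math.factorial(x2) * math.factorial(n - x1 - x2)))
    return c * pi1 ** x1 * pi2 ** x2 * (1 - pi1 - pi2) ** (n - x1 - x2)

n, pi1, pi2 = 6, 0.2, 0.5
# Sum over the whole support: x1 >= 0, x2 >= 0, x1 + x2 <= n.
total = sum(trinomial_pmf(x1, x2, n, pi1, pi2)
            for x1 in range(n + 1) for x2 in range(n + 1 - x1))
```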
1.7.2 Multinomial distribution

Parameters: n > 0, π = (π_1, π_2, . . . , π_r)ᵀ with π_i > 0 and Σ_{i=1}^{r} π_i = 1
Possible values: integer-valued (x_1, x_2, . . . , x_r) s.t. x_i ≥ 0 and Σ_{i=1}^{r} x_i = n
Probability function:

    P(x) = (n choose x_1, x_2, . . . , x_r) π_1^{x_1} π_2^{x_2} · · · π_r^{x_r}

for x_i ≥ 0, Σ_{i=1}^{r} x_i = n.

Remarks

1. Note (n choose x_1, x_2, . . . , x_r) := n! / (x_1! x_2! · · · x_r!) is the multinomial coefficient.

2. The multinomial distribution is the generalisation of the binomial distribution to r types of outcome.

3. The formulation differs from the binomial and trinomial cases in that the redundant count x_r = n − (x_1 + x_2 + · · · + x_{r−1}) is included as an argument of P(x).
1.7.3 Marginal and conditional distributions

Consider a discrete random vector X = (X_1, X_2, . . . , X_r)ᵀ and let

    X_1 = (X_1, X_2, . . . , X_{r_1})ᵀ   and   X_2 = (X_{r_1+1}, X_{r_1+2}, . . . , X_r)ᵀ,

so that X = (X_1, X_2)ᵀ, a partition into two sub-vectors.

Definition 1.7.2
If X has joint probability function P_X(x) = P_X(x_1, x_2), then the marginal probability function for X_1 is:

    P_{X_1}(x_1) = Σ_{x_2} P_X(x_1, x_2).

Observe that:

    P_{X_1}(x_1) = Σ_{x_2} P_X(x_1, x_2) = Σ_{x_2} P({X_1 = x_1} ∩ {X_2 = x_2})
                 = P(X_1 = x_1)

by the law of total probability. Hence the marginal probability function for X_1 is just the probability function we would have if X_2 were not observed.

The marginal probability function for X_2 is:

    P_{X_2}(x_2) = Σ_{x_1} P_X(x_1, x_2).
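Definition 1.7.2 can be illustrated concretely with the trinomial distribution: summing the joint probability function over x_2 should recover the B(n, π_1) marginal. A sketch, not part of the notes (the parameter values are arbitrary choices):

```python
import math

def trinomial_pmf(x1, x2, n, pi1, pi2):
    c = (math.factorial(n) //
         (math.factorial(x1) * math.factorial(x2) * math.factorial(n - x1 - x2)))
    return c * pi1 ** x1 * pi2 ** x2 * (1 - pi1 - pi2) ** (n - x1 - x2)

def binom_pmf(x, n, p):
    return math.comb(n, x) * p ** x * (1 - p) ** (n - x)

n, pi1, pi2 = 5, 0.3, 0.4
# Marginalise the joint over x2 for each value of x1.
marginal = [sum(trinomial_pmf(x1, x2, n, pi1, pi2) for x2 in range(n - x1 + 1))
            for x1 in range(n + 1)]
# Each entry should match the B(n, pi1) probability function.
```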
Definition 1.7.3
Suppose X = (X_1, X_2)ᵀ. If P_{X_1}(x_1) > 0, we define the conditional probability function of X_2 | X_1 by:

    P_{X_2|X_1}(x_2|x_1) = P_X(x_1, x_2) / P_{X_1}(x_1).

Remarks

1.  P_{X_2|X_1}(x_2|x_1) = P({X_1 = x_1} ∩ {X_2 = x_2}) / P(X_1 = x_1)
                         = P(X_2 = x_2 | X_1 = x_1).

2. It is easy to check that P_{X_2|X_1}(x_2|x_1) is a proper probability function with respect to x_2 for each fixed x_1 such that P_{X_1}(x_1) > 0.

3. P_{X_1|X_2}(x_1|x_2) is defined by:

       P_{X_1|X_2}(x_1|x_2) = P_X(x_1, x_2) / P_{X_2}(x_2).

Definition 1.7.4 (Independence)
Discrete RVs X_1, X_2, . . . , X_r are said to be independent if their joint probability function satisfies

    P(x_1, x_2, . . . , x_r) = p_1(x_1) p_2(x_2) · · · p_r(x_r)

for some functions p_1, p_2, . . . , p_r and all (x_1, x_2, . . . , x_r).
Remarks

1. Observe that:

       P_{X_1}(x_1) = Σ_{x_2} Σ_{x_3} · · · Σ_{x_r} P(x_1, . . . , x_r)
                    = p_1(x_1) Σ_{x_2} Σ_{x_3} · · · Σ_{x_r} p_2(x_2) · · · p_r(x_r)
                    = c_1 p_1(x_1).

   Hence p_1(x_1) ∝ P_{X_1}(x_1), and likewise p_i(x_i) ∝ P_{X_i}(x_i): each factor is proportional to the corresponding marginal probability function. Also,

       P_{X_1|X_2...X_r}(x_1 | x_2, . . . , x_r)
           = P(x_1, . . . , x_r) / P_{X_2...X_r}(x_2, . . . , x_r)
           = p_1(x_1) p_2(x_2) · · · p_r(x_r) / [Σ_{x_1} p_1(x_1) p_2(x_2) · · · p_r(x_r)]
           = p_1(x_1) p_2(x_2) · · · p_r(x_r) / [p_2(x_2) p_3(x_3) · · · p_r(x_r) Σ_{x_1} p_1(x_1)]
           = p_1(x_1) / Σ_{x_1} p_1(x_1)
           = (1/c_1) P_{X_1}(x_1) / [(1/c_1) Σ_{x_1} P_{X_1}(x_1)]
           = P_{X_1}(x_1),

   using Σ_{x_1} P_{X_1}(x_1) = 1 in the last step. That is,

       P_{X_1|X_2...X_r}(x_1 | x_2, . . . , x_r) = P_{X_1}(x_1).

2. Clearly independence implies

       P_{X_i|X_1...X_{i−1},X_{i+1}...X_r}(x_i | x_1, . . . , x_{i−1}, x_{i+1}, . . . , x_r) = P_{X_i}(x_i).

   Moreover, we have P_{X_1|X_2}(x_1|x_2) = P_{X_1}(x_1) for any partitioning of X = (X_1, X_2)ᵀ if X_1, . . . , X_r are independent.
1.7.4 Continuous multivariate distributions

Definition 1.7.5
The random vector (X_1, . . . , X_r)ᵀ is said to have a continuous multivariate distribution with PDF f(x) if

    P(X ∈ A) = ∫ · · · ∫_A f(x_1, . . . , x_r) dx_1 . . . dx_r

for any measurable set A.
Examples

1. Suppose (X_1, X_2) have the trinomial distribution with parameters n, π_1, π_2 (Outcome 1: π_1; Outcome 2: π_2; Outcome 3: 1 − π_1 − π_2). Then the marginal distribution of X_1 can be seen to be B(n, π_1), and the conditional distribution of X_2 | X_1 = x_1 is B(n − x_1, π_2/(1 − π_1)).
Examples <strong>of</strong> Definition 1.7.5<br />
1. If X 1 , X 2 have <strong>PDF</strong><br />
⎧<br />
1 0 < x 1 < 1<br />
⎪⎨<br />
0 < x 2 < 1<br />
f(x 1 , x 2 ) =<br />
⎪⎩<br />
0 otherwise<br />
The distribution is called uniform on (0, 1) × (0, 1).<br />
It follows that P (X ∈ A) = Area (A) for any A ∈ (0, 1) × (0, 1):<br />
2. Uniform distribution on unit disk:<br />
f(x_1, x_2) = { 1/π   x_1² + x_2² < 1
             { 0     otherwise.
Figure 12: Area A<br />
3. The Dirichlet distribution is defined by the PDF
f(x_1, x_2, ..., x_r) = [ Γ(α_1 + α_2 + ··· + α_r + α_{r+1}) / (Γ(α_1)Γ(α_2)...Γ(α_r)Γ(α_{r+1})) ] × x_1^{α_1−1} x_2^{α_2−1} ... x_r^{α_r−1} (1 − x_1 − ··· − x_r)^{α_{r+1}−1}
for x_1, x_2, ..., x_r > 0, ∑ x_i < 1, and parameters α_1, α_2, ..., α_{r+1} > 0.
Recall the joint PDF:
P(X ∈ A) = ∫ ... ∫_A f(x_1, ..., x_r) dx_1 ... dx_r.
Note: the joint PDF must satisfy:
1. f(x) ≥ 0 for all x;
2. ∫_{−∞}^{∞} ... ∫_{−∞}^{∞} f(x_1, ..., x_r) dx_1 ... dx_r = 1.
Definition. 1.7.6<br />
If X = (X_1, X_2)ᵀ has joint PDF f(x) = f(x_1, x_2), then the marginal PDF of X_1 is given by:
f_{X_1}(x_1) = ∫_{−∞}^{∞} ... ∫_{−∞}^{∞} f(x_1, ..., x_{r_1}, x_{r_1+1}, ..., x_r) dx_{r_1+1} dx_{r_1+2} ... dx_r,
and for f_{X_1}(x_1) > 0, the conditional PDF is given by
f_{X_2|X_1}(x_2|x_1) = f(x_1, x_2) / f_{X_1}(x_1).
Remarks:<br />
1. f_{X_2|X_1}(x_2|x_1) cannot be interpreted directly as the conditional PDF of X_2 given the event {X_1 = x_1}, because P(X_1 = x_1) = 0 for any continuous distribution. The proper interpretation is as the limit, as δ → 0, of the distribution of X_2 | X_1 ∈ B(x_1, δ).
2. f_{X_2}(x_2) and f_{X_1|X_2}(x_1|x_2) are defined analogously.
Definition. 1.7.7 (Independence)<br />
Continuous RVs X_1, X_2, ..., X_r are said to be independent if their joint PDF satisfies:
f(x_1, x_2, ..., x_r) = f_1(x_1) f_2(x_2) ... f_r(x_r),
for some functions f_1, f_2, ..., f_r and all x_1, ..., x_r.
Remarks<br />
1. It is easy to check that if X_1, ..., X_r are independent, then each f_i(x_i) = c_i f_{X_i}(x_i), with c_1 c_2 ... c_r = 1.
2. If X 1 , X 2 , . . . , X r are independent, then it can be checked that:<br />
f_{X_1|X_2}(x_1|x_2) = f_{X_1}(x_1)
for any partitioning of X = (X_1, X_2)ᵀ.
Examples<br />
1. If (X_1, X_2) has the uniform distribution on the unit disk, find:
(a) the marginal PDF f_{X_1}(x_1);
(b) the conditional PDF f_{X_2|X_1}(x_2|x_1).
Solution. Recall
f(x_1, x_2) = { 1/π   x_1² + x_2² < 1
             { 0     otherwise.

(a) f_{X_1}(x_1) = ∫_{−∞}^{∞} f(x_1, x_2) dx_2
  = ∫_{−∞}^{−√(1−x_1²)} 0 dx_2 + ∫_{−√(1−x_1²)}^{√(1−x_1²)} (1/π) dx_2 + ∫_{√(1−x_1²)}^{∞} 0 dx_2
  = 0 + [ x_2/π ] evaluated from −√(1−x_1²) to √(1−x_1²) + 0
  = { 2√(1−x_1²)/π   −1 < x_1 < 1
    { 0              otherwise,
i.e., a semicircular distribution.
Figure 13: A graphic showing the conditional distribution and a semi-circular distribution.<br />
(b) The conditional density of X_2|X_1 is:
f_{X_2|X_1}(x_2|x_1) = f(x_1, x_2) / f_{X_1}(x_1)
  = { 1/(2√(1−x_1²))   for −√(1−x_1²) < x_2 < √(1−x_1²)
    { 0                otherwise,
which is uniform: U(−√(1−x_1²), √(1−x_1²)).
2. If X_1, ..., X_r are any independent continuous RVs, then their joint PDF is:
f(x_1, ..., x_r) = f_{X_1}(x_1) f_{X_2}(x_2) ... f_{X_r}(x_r).
E.g. if X_1, X_2, ..., X_r are independent Exp(λ) RVs, then
f(x_1, ..., x_r) = ∏_{i=1}^{r} λe^{−λx_i} = λ^r e^{−λ ∑_{i=1}^{r} x_i},
for x_i > 0, i = 1, ..., r.
3. Suppose (X_1, X_2) is uniformly distributed on (0, 1) × (0, 1). That is,
f(x_1, x_2) = { 1   0 < x_1 < 1, 0 < x_2 < 1
             { 0   otherwise.
Claim: X_1, X_2 are independent.
Proof.
Let
f_1(x_1) = { 1   0 < x_1 < 1
          { 0   otherwise,
f_2(x_2) = { 1   0 < x_2 < 1
          { 0   otherwise;
then f(x_1, x_2) = f_1(x_1) f_2(x_2), and the two variables are independent.
4. (X_1, X_2) is uniform on the unit disk. Are X_1, X_2 independent? No.
Proof.
We know:
f_{X_2|X_1}(x_2|x_1) = { 1/(2√(1−x_1²))   −√(1−x_1²) < x_2 < √(1−x_1²)
                      { 0                otherwise,
i.e., U(−√(1−x_1²), √(1−x_1²)).
On the other hand,
f_{X_2}(x_2) = { 2√(1−x_2²)/π   −1 < x_2 < 1
             { 0              otherwise,
i.e., a semicircular distribution.
Hence f_{X_2}(x_2) ≠ f_{X_2|X_1}(x_2|x_1), so the variables cannot be independent.
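Both facts about the unit-disk example (the semicircular marginal and the failure of independence) are easy to probe by simulation. The following is a minimal sketch assuming NumPy is available; the sample size, band widths, and tolerances are arbitrary choices, not part of the notes:

```python
import numpy as np

rng = np.random.default_rng(0)

# Rejection-sample points uniformly on the unit disk.
pts = rng.uniform(-1, 1, size=(200_000, 2))
pts = pts[pts[:, 0]**2 + pts[:, 1]**2 < 1]
x1, x2 = pts[:, 0], pts[:, 1]

# The marginal density of X1 at 0 should be 2*sqrt(1-0)/pi = 2/pi.
est = (np.abs(x1) < 0.05).mean() / 0.1   # P(|X1| < 0.05) / band width
print(abs(est - 2 / np.pi) < 0.05)        # True

# Dependence: the spread of X2 given X1 near 0.9 is much smaller
# than the overall spread of X2 (conditional support shrinks).
cond_sd = x2[np.abs(x1 - 0.9) < 0.05].std()
print(cond_sd < x2.std())                 # True
```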
Definition. 1.7.8
The joint CDF of RVs X_1, X_2, ..., X_r is defined by:
F(x_1, x_2, ..., x_r) = P({X_1 ≤ x_1} ∩ {X_2 ≤ x_2} ∩ ... ∩ {X_r ≤ x_r}).
Remarks:<br />
1. The definition applies to RVs of any type, i.e., discrete, continuous, or "hybrid".
2. The marginal CDF of the subvector X_1 = (X_1, ..., X_{r_1})ᵀ is F_{X_1}(x_1, ..., x_{r_1}) = F_X(x_1, ..., x_{r_1}, ∞, ∞, ..., ∞).
3. RVs X_1, ..., X_r are defined to be independent if
F(x_1, ..., x_r) = F_{X_1}(x_1) F_{X_2}(x_2) ... F_{X_r}(x_r).
4. The definitions above are completely general. However, in practice it is usually<br />
easier to work with the <strong>PDF</strong>/probability function for any given example.<br />
1.8 Transformations <strong>of</strong> several RVs<br />
Theorem. 1.8.1
If X_1, X_2 are discrete with joint probability function P(x_1, x_2), and Y = X_1 + X_2, then:
1. Y has probability function
P_Y(y) = ∑_x P(x, y − x).
2. If X_1, X_2 are independent,
P_Y(y) = ∑_x P_{X_1}(x) P_{X_2}(y − x).
Proof. (1)
P({Y = y}) = ∑_x P({Y = y} ∩ {X_1 = x})   (law of total probability: if A is an event and B_1, B_2, ... are events s.t. ∪_i B_i = S and B_i ∩ B_j = ∅ for j ≠ i, then P(A) = ∑_i P(A ∩ B_i) = ∑_i P(A|B_i)P(B_i))
  = ∑_x P({X_1 + X_2 = y} ∩ {X_1 = x})
  = ∑_x P({X_2 = y − x} ∩ {X_1 = x})
  = ∑_x P(x, y − x).

Proof. (2)
Just substitute
P(x, y − x) = P_{X_1}(x) P_{X_2}(y − x).
Theorem. 1.8.2
Suppose X_1, X_2 are continuous with PDF f(x_1, x_2), and let Y = X_1 + X_2. Then
1. f_Y(y) = ∫_{−∞}^{∞} f(x, y − x) dx.
2. If X_1, X_2 are independent, then
f_Y(y) = ∫_{−∞}^{∞} f_{X_1}(x) f_{X_2}(y − x) dx.

Proof. (1)
F_Y(y) = P(Y ≤ y) = P(X_1 + X_2 ≤ y)
  = ∫_{x_1=−∞}^{∞} ∫_{x_2=−∞}^{y−x_1} f(x_1, x_2) dx_2 dx_1.
Let x_2 = t − x_1, so that dx_2/dt = 1, i.e., dx_2 = dt:
  = ∫_{−∞}^{∞} ∫_{−∞}^{y} f(x_1, t − x_1) dt dx_1
  = ∫_{−∞}^{y} { ∫_{−∞}^{∞} f(x_1, t − x_1) dx_1 } dt
⇒ f_Y(y) = F′_Y(y) = ∫_{−∞}^{∞} f(x, y − x) dx.

Proof. (2)
Take f(x, y − x) = f_{X_1}(x) f_{X_2}(y − x) if X_1, X_2 independent.
Figure 14: Theorem 1.8.2<br />
Examples<br />
1. Suppose X_1 ∼ B(n_1, p) and X_2 ∼ B(n_2, p) independently. Find the probability function for Y = X_1 + X_2.
Solution.
P_Y(y) = ∑_x P_{X_1}(x) P_{X_2}(y − x)
  = ∑_{x=max(0, y−n_2)}^{min(n_1, y)} (n_1 choose x) p^x (1−p)^{n_1−x} (n_2 choose y−x) p^{y−x} (1−p)^{n_2+x−y}
  = p^y (1−p)^{n_1+n_2−y} ∑_{x=max(0, y−n_2)}^{min(n_1, y)} (n_1 choose x)(n_2 choose y−x)
  = (n_1+n_2 choose y) p^y (1−p)^{n_1+n_2−y} ∑_{x=max(0, y−n_2)}^{min(n_1, y)} [ (n_1 choose x)(n_2 choose y−x) / (n_1+n_2 choose y) ]
    (the sum is a hypergeometric probability function summed over its support, so it equals 1)
  = (n_1+n_2 choose y) p^y (1−p)^{n_1+n_2−y},
i.e., Y ∼ B(n_1 + n_2, p).
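Part 2 of Theorem 1.8.1 can be verified exactly for this example. The sketch below, using only the standard library, convolves the two binomial probability functions and compares with B(n_1 + n_2, p); the helper names and parameter values are ours, chosen for illustration:

```python
from math import comb

def binom_pmf(n, p, x):
    # B(n, p) probability function
    return comb(n, x) * p**x * (1 - p)**(n - x)

def convolve_pmf(p1, p2, support1, support2):
    # P_Y(y) = sum_x P_X1(x) P_X2(y - x)  (Theorem 1.8.1, part 2)
    out = {}
    for x in support1:
        for z in support2:
            out[x + z] = out.get(x + z, 0.0) + p1(x) * p2(z)
    return out

n1, n2, p = 4, 6, 0.3
py = convolve_pmf(lambda x: binom_pmf(n1, p, x),
                  lambda x: binom_pmf(n2, p, x),
                  range(n1 + 1), range(n2 + 1))

# Compare with B(n1 + n2, p) at every point of the support.
ok = all(abs(py[y] - binom_pmf(n1 + n2, p, y)) < 1e-12
         for y in range(n1 + n2 + 1))
print(ok)  # True
```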
2. Suppose Z_1 ∼ N(0, 1) and Z_2 ∼ N(0, 1) independently. Let X = Z_1 + Z_2. Find the PDF of X.
Solution.
f_X(x) = ∫_{−∞}^{∞} φ(z) φ(x − z) dz
  = ∫_{−∞}^{∞} (1/√(2π)) e^{−z²/2} (1/√(2π)) e^{−(x−z)²/2} dz
  = ∫_{−∞}^{∞} (1/2π) e^{−(z² + (x−z)²)/2} dz
  = ∫_{−∞}^{∞} (1/2π) e^{−(z − x/2)²} e^{−x²/4} dz
  = (1/(2√π)) e^{−x²/4} ∫_{−∞}^{∞} (1/√π) e^{−(z − x/2)²/(2 × (1/2))} dz
    (the remaining integrand is the N(x/2, 1/2) PDF, so the integral is 1)
  = (1/(2√π)) e^{−x²/4},
i.e., X ∼ N(0, 2).
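The convolution integral above can also be checked numerically. A small sketch assuming NumPy, evaluating ∫ φ(z)φ(x − z) dz by a Riemann sum on a truncated grid (the grid and test points are arbitrary):

```python
import numpy as np

# Standard normal density on a fine grid; tails beyond |z| = 12 are negligible.
phi = lambda z: np.exp(-z**2 / 2) / np.sqrt(2 * np.pi)
z = np.linspace(-12, 12, 4801)
dz = z[1] - z[0]

results = []
for x in (0.0, 1.0, 2.5):
    fx = np.sum(phi(z) * phi(x - z)) * dz          # numerical convolution
    target = np.exp(-x**2 / 4) / (2 * np.sqrt(np.pi))  # claimed N(0, 2) PDF
    results.append(abs(fx - target))
print(all(r < 1e-8 for r in results))  # True
```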
Theorem. 1.8.3 (Ratio of continuous RVs)
Suppose X_1, X_2 are continuous with joint PDF f(x_1, x_2), and let Y = X_2/X_1. Then Y has PDF
f_Y(y) = ∫_{−∞}^{∞} |x| f(x, yx) dx.
If X_1, X_2 are independent, we obtain
f_Y(y) = ∫_{−∞}^{∞} |x| f_{X_1}(x) f_{X_2}(yx) dx.
Proof.
F_Y(y) = P({Y ≤ y})
  = P({Y ≤ y} ∩ {X_1 < 0}) + P({Y ≤ y} ∩ {X_1 > 0})
  = P({X_2 ≥ yX_1} ∩ {X_1 < 0}) + P({X_2 ≤ yX_1} ∩ {X_1 > 0})
  = ∫_{−∞}^{0} ∫_{x_1 y}^{∞} f(x_1, x_2) dx_2 dx_1 + ∫_{0}^{∞} ∫_{−∞}^{x_1 y} f(x_1, x_2) dx_2 dx_1.
Substitute x_2 = t x_1 in both inner integrals; dx_2 = x_1 dt:
  = ∫_{−∞}^{0} ∫_{y}^{−∞} x_1 f(x_1, t x_1) dt dx_1 + ∫_{0}^{∞} ∫_{−∞}^{y} x_1 f(x_1, t x_1) dt dx_1
  = ∫_{−∞}^{0} ∫_{−∞}^{y} (−x_1) f(x_1, t x_1) dt dx_1 + ∫_{0}^{∞} ∫_{−∞}^{y} x_1 f(x_1, t x_1) dt dx_1
  = ∫_{−∞}^{0} ∫_{−∞}^{y} |x_1| f(x_1, t x_1) dt dx_1 + ∫_{0}^{∞} ∫_{−∞}^{y} |x_1| f(x_1, t x_1) dt dx_1
⇒ f_Y(y) = F′_Y(y) = ∫_{−∞}^{∞} |x_1| f(x_1, y x_1) dx_1.
Example<br />
1. If Z ∼ N(0, 1) and X ∼ χ²_k independently, then
T = Z / √(X/k)
is said to have the t-distribution with k degrees of freedom. Derive the PDF of T.
Solution. Step 1:
Let V = √(X/k). We need the PDF of V. Recall χ²_k is Gamma(k/2, 1/2), so
f_X(x) = ((1/2)^{k/2} / Γ(k/2)) x^{k/2−1} e^{−x/2}, for x > 0.
Now,
v = h(x) = √(x/k) ⇒ h^{−1}(v) = kv² and (h^{−1})′(v) = 2kv
⇒ f_V(v) = f_X(h^{−1}(v)) |(h^{−1})′(v)|
  = ((1/2)^{k/2} / Γ(k/2)) (kv²)^{k/2−1} e^{−kv²/2} (2kv)
  = (2(k/2)^{k/2} / Γ(k/2)) v^{k−1} e^{−kv²/2}, v > 0.

Step 2:
Apply Theorem 1.8.3 to find the PDF of T = Z/V:
f_T(t) = ∫_{−∞}^{∞} |v| f_V(v) f_Z(vt) dv
  = ∫_{0}^{∞} v (2(k/2)^{k/2} / Γ(k/2)) v^{k−1} e^{−kv²/2} (1/√(2π)) e^{−t²v²/2} dv
  = ((k/2)^{k/2} / (Γ(k/2)√(2π))) ∫_{0}^{∞} v^{k−1} e^{−(k+t²)v²/2} 2v dv;
substitute u = v² ⇒ du = 2v dv:
  = ((k/2)^{k/2} / (Γ(k/2)√(2π))) ∫_{0}^{∞} u^{(k−1)/2} e^{−(k+t²)u/2} du
  [ using ∫_{0}^{∞} u^{α−1} e^{−λu} du = Γ(α)/λ^α, with α = (k+1)/2 and λ = (k+t²)/2 ]
  = ((k/2)^{k/2} / (Γ(k/2)√(2π))) Γ((k+1)/2) / ((k+t²)/2)^{(k+1)/2}
  = (Γ((k+1)/2) / (Γ(k/2)√(2π))) (2/k)^{1/2} (1 + t²/k)^{−(k+1)/2}
  = (Γ((k+1)/2) / (Γ(k/2)√(kπ))) (1 + t²/k)^{−(k+1)/2}.
Remarks:<br />
1. If we take k = 1, we obtain
f(t) ∝ 1/(1 + t²);
that is, t_1 is the Cauchy distribution. One can also check that Γ(1)/(Γ(1/2)√π) = 1/π as required, so that f(t) = (1/π) × 1/(1 + t²).
2. As k → ∞, t_k → N(0, 1). To see this directly, consider the limit of the t-density as k → ∞, for t fixed:
lim_{k→∞} (1 + t²/k)^{−(k+1)/2} = lim_{k→∞} (1 + t²/k)^{−k/2};   let l = k/2
  = lim_{l→∞} (1 + (t²/2)/l)^{−l}
  = 1/e^{t²/2} = e^{−t²/2}.
Recall the standard limit:
lim_{n→∞} (1 + a/n)^n = e^a
⇒ f_T(t) → (1/√(2π)) e^{−t²/2} as k → ∞.
Suppose h : ℝ^r → ℝ^r is continuously differentiable. That is,
h(x_1, x_2, ..., x_r) = (h_1(x_1, ..., x_r), h_2(x_1, ..., x_r), ..., h_r(x_1, ..., x_r)),
where each h_i(x) is continuously differentiable. Let H be the r × r matrix of partial derivatives, with (i, j)th entry ∂h_i/∂x_j:
H = [ ∂h_1/∂x_1  ∂h_1/∂x_2  ...  ∂h_1/∂x_r
      ∂h_2/∂x_1  ∂h_2/∂x_2  ...  ∂h_2/∂x_r
      ...
      ∂h_r/∂x_1  ∂h_r/∂x_2  ...  ∂h_r/∂x_r ].
If H is invertible for all x, then there exists an inverse mapping g : ℝ^r → ℝ^r with the property that:
g(h(x)) = x.
It can be proved that the matrix of partial derivatives
G = (∂g_i/∂y_j) satisfies G = H^{−1}.
Theorem. 1.8.4
Suppose X_1, X_2, ..., X_r have joint PDF f_X(x_1, ..., x_r), and let h, g, G be as above. If Y = h(X), then Y has joint PDF
f_Y(y_1, y_2, ..., y_r) = f_X(g(y)) |det G(y)|.
Remark:
One can sometimes use H^{−1} instead of G, but care is needed to evaluate H^{−1}(x) at x = h^{−1}(y) = g(y).
1.8.1 Method of regular transformations
Suppose X = (X_1, ..., X_r)ᵀ and h : ℝ^r → ℝ^s, where s < r, and let
Y = (Y_1, ..., Y_s)ᵀ = h(X).
One approach is as follows:
1. Choose a function d : ℝ^r → ℝ^{r−s} and let Z = d(X) = (Z_1, Z_2, ..., Z_{r−s})ᵀ.
2. Apply Theorem 1.8.4 to (Y_1, Y_2, ..., Y_s, Z_1, Z_2, ..., Z_{r−s})ᵀ.
3. Integrate over Z_1, Z_2, ..., Z_{r−s} to get the marginal PDF for Y.
Examples<br />
Suppose Z_1 ∼ N(0, 1), Z_2 ∼ N(0, 1) independently. Consider h(z_1, z_2) = (r, θ)ᵀ, where
r = h_1(z_1, z_2) = √(z_1² + z_2²), and
θ = h_2(z_1, z_2) = { arctan(z_2/z_1)       for z_1 > 0
                    { arctan(z_2/z_1) + π   for z_1 < 0
                    { (π/2) sgn z_2         for z_1 = 0, z_2 ≠ 0
                    { 0                     for z_1 = z_2 = 0.
h maps ℝ² → [0, ∞) × [−π/2, 3π/2).
The inverse mapping, g, can be seen to be:
g(r, θ) = (g_1(r, θ), g_2(r, θ))ᵀ = (r cos θ, r sin θ)ᵀ.
To find the distribution of (R, Θ):
Step 1:
G(r, θ) = [ ∂g_1/∂r  ∂g_1/∂θ
            ∂g_2/∂r  ∂g_2/∂θ ]
with
∂g_1/∂r = ∂(r cos θ)/∂r = cos θ,   ∂g_1/∂θ = ∂(r cos θ)/∂θ = −r sin θ,
∂g_2/∂r = ∂(r sin θ)/∂r = sin θ,   ∂g_2/∂θ = ∂(r sin θ)/∂θ = r cos θ
⇒ G = [ cos θ   −r sin θ
        sin θ    r cos θ ]
⇒ det(G) = r cos²θ − (−r sin²θ) = r cos²θ + r sin²θ = r.
Step 2:
Now apply Theorem 1.8.4. Recall,
f_Z(z_1, z_2) = (1/√(2π)) e^{−z_1²/2} × (1/√(2π)) e^{−z_2²/2} = (1/2π) e^{−(z_1² + z_2²)/2}.
f_{R,Θ}(r, θ) = f_Z(g(r, θ)) |det G(r, θ)|
  = f_Z(r cos θ, r sin θ) |r|
  = { (1/2π) r e^{−r²/2}   for r ≥ 0, −π/2 ≤ θ < 3π/2
    { 0                    otherwise
  = f_Θ(θ) f_R(r),
where
f_Θ(θ) = { 1/2π   −π/2 ≤ θ < 3π/2
         { 0      otherwise,
f_R(r) = r e^{−r²/2} for r ≥ 0.
Hence, if Z_1, Z_2 are i.i.d. N(0, 1) and we transform to polar coordinates (R, Θ), we find:
1. R, Θ are independent.
2. Θ ∼ U(−π/2, 3π/2).
3. R has the distribution called the Rayleigh distribution.
It is easy to check that R² ∼ Exp(1/2) ≡ Gamma(1, 1/2) ≡ χ²_2.
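These conclusions are easy to probe by simulation; a minimal sketch assuming NumPy (sample size and tolerances are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
z1 = rng.standard_normal(200_000)
z2 = rng.standard_normal(200_000)
r2 = z1**2 + z2**2

# Exp(1/2) has mean 2 and survival function P(R^2 > c) = exp(-c/2).
print(abs(r2.mean() - 2) < 0.05)                   # True
print(abs((r2 > 3).mean() - np.exp(-1.5)) < 0.01)  # True

# The angle is uniform: e.g. the upper half-plane gets probability 1/2.
theta = np.arctan2(z2, z1)
print(abs((theta > 0).mean() - 0.5) < 0.01)        # True
```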
Example<br />
Suppose X_1 ∼ Gamma(α, λ) and X_2 ∼ Gamma(β, λ), independently. Find the distribution of Y = X_1/(X_1 + X_2).
Solution. Step 1: try using Z = X_1 + X_2.
Step 2: If h(x_1, x_2) = (y, z)ᵀ with y = x_1/(x_1 + x_2) and z = x_1 + x_2, then
g(y, z) = (yz, (1 − y)z)ᵀ,
G = [ z    y
      −z   1 − y ]
⇒ det(G) = z(1 − y) − (−zy) = z(1 − y) + zy = z,
where z > 0. (Why?)
f_{Y,Z}(y, z) = f_{X_1}(yz) f_{X_2}((1 − y)z) z
  = (λ^α/Γ(α)) (yz)^{α−1} e^{−λyz} (λ^β/Γ(β)) ((1 − y)z)^{β−1} e^{−λ(1−y)z} z
  = (1/(Γ(α)Γ(β))) λ^{α+β} y^{α−1} (1 − y)^{β−1} z^{α+β−1} e^{−λz}.

Step 3:
f_Y(y) = ∫_0^∞ f(y, z) dz
  = (Γ(α+β)/(Γ(α)Γ(β))) y^{α−1} (1 − y)^{β−1} ∫_0^∞ (λ^{α+β} z^{α+β−1}/Γ(α+β)) e^{−λz} dz
  = (Γ(α+β)/(Γ(α)Γ(β))) y^{α−1} (1 − y)^{β−1}, for 0 < y < 1,
since the remaining integrand is the Gamma(α+β, λ) PDF, which integrates to 1; i.e., Y ∼ Beta(α, β).
Exercise: Justify the range of values for y.
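A Monte Carlo check of this conclusion, assuming NumPy; note that NumPy's gamma sampler is parameterised by shape and scale, so scale = 1/λ here, and the parameter values are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(2)
alpha, beta, lam = 2.0, 5.0, 3.0
x1 = rng.gamma(shape=alpha, scale=1 / lam, size=300_000)
x2 = rng.gamma(shape=beta, scale=1 / lam, size=300_000)
y = x1 / (x1 + x2)

# Beta(a, b) has mean a/(a+b) and variance ab/((a+b)^2 (a+b+1)).
m = alpha / (alpha + beta)
v = alpha * beta / ((alpha + beta)**2 * (alpha + beta + 1))
print(abs(y.mean() - m) < 0.005)  # True
print(abs(y.var() - v) < 0.005)   # True
```

Note that the answer does not involve λ, which matches the derivation: λ cancels between the Gamma factors.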
1.9 Moments
Suppose X_1, X_2, ..., X_r are RVs. If Y = h(X) is defined by a real-valued function h, then to find E(Y) we can:
1. Find the distribution of Y.
2. Calculate
E(Y) = { ∫_{−∞}^{∞} y f_Y(y) dy   continuous
       { ∑_y y p(y)              discrete.
Theorem. 1.9.1
If h and X_1, ..., X_r are as above, then provided it exists,
E{h(X)} = { ∑_{x_1} ∑_{x_2} ... ∑_{x_r} h(x_1, ..., x_r) p(x_1, ..., x_r)   if X_1, ..., X_r discrete
          { ∫_{−∞}^{∞} ... ∫_{−∞}^{∞} h(x_1, ..., x_r) f(x_1, ..., x_r) dx_1 ... dx_r   if X_1, ..., X_r continuous.

Theorem. 1.9.2
Suppose X_1, ..., X_r are RVs, h_1, ..., h_k are real-valued functions and a_1, ..., a_k are constants. Then, provided it exists,
E{a_1 h_1(X) + a_2 h_2(X) + ··· + a_k h_k(X)} = a_1 E[h_1(X)] + a_2 E[h_2(X)] + ··· + a_k E[h_k(X)].
Proof. (Continuous case)
E{a_1 h_1(X) + ··· + a_k h_k(X)} = ∫_{−∞}^{∞} ... ∫_{−∞}^{∞} {a_1 h_1(x) + ··· + a_k h_k(x)} f_X(x) dx_1 dx_2 ... dx_r
  = a_1 ∫_{−∞}^{∞} ... ∫_{−∞}^{∞} h_1(x) f_X(x) dx_1 ... dx_r + a_2 ∫_{−∞}^{∞} ... ∫_{−∞}^{∞} h_2(x) f_X(x) dx_1 ... dx_r + ··· + a_k ∫_{−∞}^{∞} ... ∫_{−∞}^{∞} h_k(x) f_X(x) dx_1 ... dx_r
  = a_1 E[h_1(X)] + a_2 E[h_2(X)] + ··· + a_k E[h_k(X)],
as required.
Corollary.
Provided it exists,
E(a_1 X_1 + a_2 X_2 + ··· + a_r X_r) = a_1 E[X_1] + a_2 E[X_2] + ··· + a_r E[X_r].
Definition. 1.9.1
If X_1, X_2 are RVs with E[X_i] = µ_i and Var(X_i) = σ_i², i = 1, 2, we define
σ_12 = Cov(X_1, X_2) = E[(X_1 − µ_1)(X_2 − µ_2)] = E[X_1 X_2] − µ_1 µ_2,
ρ_12 = Corr(X_1, X_2) = σ_12/(σ_1 σ_2).
Remark:
Cov(X, X) = Var(X) = E[X²] − (E[X])².
In some contexts, it is convenient to use the notation σ_ii = Var(X_i) instead of σ_i².
Theorem. 1.9.3
Suppose X_1, X_2, ..., X_r are RVs with E[X_i] = µ_i and Cov(X_i, X_j) = σ_ij, and let a_1, a_2, ..., a_r, b_1, b_2, ..., b_r be constants. Then
Cov( ∑_{i=1}^{r} a_i X_i, ∑_{j=1}^{r} b_j X_j ) = ∑_{i=1}^{r} ∑_{j=1}^{r} a_i b_j σ_ij.
Proof.
Cov( ∑_i a_i X_i, ∑_j b_j X_j ) = E[ {∑_i a_i X_i − E(∑_i a_i X_i)} {∑_j b_j X_j − E(∑_j b_j X_j)} ]
  = E{ (∑_i a_i X_i − ∑_i a_i µ_i)(∑_j b_j X_j − ∑_j b_j µ_j) }
  = E[ {∑_i a_i (X_i − µ_i)} {∑_j b_j (X_j − µ_j)} ]
  = E{ ∑_i ∑_j a_i b_j (X_i − µ_i)(X_j − µ_j) }
  = ∑_i ∑_j a_i b_j E{(X_i − µ_i)(X_j − µ_j)}   (Theorem 1.9.2)
  = ∑_{i=1}^{r} ∑_{j=1}^{r} a_i b_j σ_ij,
as required.
Corollary. 1
Under the above assumptions,
Var( ∑_{i=1}^{r} a_i X_i ) = ∑_{i=1}^{r} ∑_{j=1}^{r} a_i a_j σ_ij = ∑_{i=1}^{r} a_i² σ_i² + 2 ∑_{i<j} a_i a_j σ_ij,
since the variance is the covariance of the sum with itself.

Corollary. 2
For RVs X_1, X_2 with correlation ρ = ρ_12, we have |ρ| ≤ 1, with |ρ| = 1 only if X_1 is a linear function of X_2 with probability 1.
Proof.
Consider 0 ≤ Var(X_1 − tX_2) (for any constant t)
  = σ_1² + t² σ_2² − 2t σ_12 (by Corollary 1).
Hence the quadratic
q(t) = σ_2² t² − 2σ_12 t + σ_1² ≥ 0, for all t,
so q(t) has either no real roots (∆ = b² − 4ac < 0) or one real root (∆ = b² − 4ac = 0). Hence ∆ ≤ 0:
4σ_12² − 4σ_2² σ_1² ≤ 0
⇒ 4σ_12² ≤ 4σ_1² σ_2²
⇒ σ_12²/(σ_1² σ_2²) ≤ 1   (σ's positive)
⇒ ρ² ≤ 1 ⇒ |ρ| ≤ 1.
Suppose |ρ| = 1. Then ∆ = 0, so there is a single t_0 such that
0 = q(t_0) = Var(X_1 − t_0 X_2),
i.e., X_1 = t_0 X_2 + c with probability 1.
Theorem. 1.9.4
If X_1, X_2 are independent, and Var(X_1), Var(X_2) exist, then Cov(X_1, X_2) = 0.
Proof. (Continuous case)
Cov(X_1, X_2) = E{(X_1 − µ_1)(X_2 − µ_2)}
  = ∫_{−∞}^{∞} ∫_{−∞}^{∞} (x_1 − µ_1)(x_2 − µ_2) f(x_1, x_2) dx_1 dx_2
  = ∫_{−∞}^{∞} ∫_{−∞}^{∞} (x_1 − µ_1)(x_2 − µ_2) f_{X_1}(x_1) f_{X_2}(x_2) dx_1 dx_2   (since X_1, X_2 independent)
  = ( ∫_{−∞}^{∞} (x_1 − µ_1) f_{X_1}(x_1) dx_1 )( ∫_{−∞}^{∞} (x_2 − µ_2) f_{X_2}(x_2) dx_2 )
  = 0 × 0
  = 0.
Remark:<br />
But the converse does NOT hold, in general!
That is, Cov(X_1, X_2) = 0 ⇏ X_1, X_2 independent.
Definition. 1.9.2
If X_1, X_2 are RVs, we define the symbol E[X_1|X_2] to be the expectation of X_1 calculated with respect to the conditional distribution of X_1|X_2, i.e.,
E[X_1|X_2] = { ∑_{x_1} x_1 P_{X_1|X_2}(x_1|x_2)            X_1|X_2 discrete
            { ∫_{−∞}^{∞} x_1 f_{X_1|X_2}(x_1|x_2) dx_1     X_1|X_2 continuous.

Theorem. 1.9.5
Provided the relevant moments exist,
E(X_1) = E_{X_2}{E(X_1|X_2)},
Var(X_1) = E_{X_2}{Var(X_1|X_2)} + Var_{X_2}{E(X_1|X_2)}.
Proof.
1. (Continuous case)
E_{X_2}{E(X_1|X_2)} = ∫_{−∞}^{∞} E(X_1|X_2 = x_2) f_{X_2}(x_2) dx_2
  = ∫_{−∞}^{∞} ( ∫_{−∞}^{∞} x_1 f(x_1|x_2) dx_1 ) f_{X_2}(x_2) dx_2
  = ∫_{−∞}^{∞} ∫_{−∞}^{∞} x_1 f(x_1|x_2) f_{X_2}(x_2) dx_1 dx_2
  = ∫_{−∞}^{∞} x_1 ( ∫_{−∞}^{∞} f(x_1, x_2) dx_2 ) dx_1   (the inner integral is f_{X_1}(x_1))
  = ∫_{−∞}^{∞} x_1 f_{X_1}(x_1) dx_1
  = E(X_1).
1.9.1 Moment generating functions<br />
Definition. 1.9.3
If X_1, X_2, ..., X_r are RVs, then the joint MGF (provided it exists) is given by
M_X(t) = M_X(t_1, t_2, ..., t_r) = E[ e^{t_1 X_1 + t_2 X_2 + ··· + t_r X_r} ].

Theorem. 1.9.6
If X_1, X_2, ..., X_r are mutually independent, then the joint MGF satisfies
M_X(t_1, ..., t_r) = M_{X_1}(t_1) M_{X_2}(t_2) ... M_{X_r}(t_r),
provided it exists.
Proof. (Theorem 1.9.6, continuous case)
M_X(t_1, ..., t_r) = ∫_{−∞}^{∞} ... ∫_{−∞}^{∞} e^{t_1 x_1 + t_2 x_2 + ··· + t_r x_r} f_X(x_1, ..., x_r) dx_1 ... dx_r
  = ∫_{−∞}^{∞} ... ∫_{−∞}^{∞} e^{t_1 x_1} e^{t_2 x_2} ... e^{t_r x_r} f_{X_1}(x_1) f_{X_2}(x_2) ... f_{X_r}(x_r) dx_1 ... dx_r
  = ( ∫_{−∞}^{∞} e^{t_1 x_1} f_{X_1}(x_1) dx_1 )( ∫_{−∞}^{∞} e^{t_2 x_2} f_{X_2}(x_2) dx_2 ) ... ( ∫_{−∞}^{∞} e^{t_r x_r} f_{X_r}(x_r) dx_r )
  = M_{X_1}(t_1) M_{X_2}(t_2) ... M_{X_r}(t_r), as required.
We saw previously that if Y = h(X), then we can find M_Y(t) = E[e^{t h(X)}] from the joint distribution of X_1, ..., X_r without calculating f_Y(y) explicitly. A simple, but important, case is
Y = X_1 + X_2 + ··· + X_r.

Theorem. 1.9.7
Suppose X_1, ..., X_r are RVs and let Y = X_1 + X_2 + ··· + X_r. Then (assuming the MGFs exist):
1. M_Y(t) = M_X(t, t, t, ..., t).
2. If X_1, ..., X_r are independent, then M_Y(t) = M_{X_1}(t) M_{X_2}(t) ... M_{X_r}(t).
3. If X_1, ..., X_r are independent and identically distributed with common MGF M_X(t), then:
M_Y(t) = [M_X(t)]^r.
Proof.
M_Y(t) = E[e^{tY}]
  = E[e^{t(X_1 + X_2 + ··· + X_r)}]
  = E[e^{tX_1 + tX_2 + ··· + tX_r}]
  = M_X(t, t, t, ..., t)   (using Definition 1.9.3).
For parts (2) and (3), substitute into (1) and use Theorem 1.9.6.
Examples
Consider RVs X, V defined by V ∼ Gamma(α, λ) and the conditional distribution X|V ∼ Poisson(V).
Find E(X) and Var(X).
Solution. Use
E(X) = E{E(X|V)},
Var(X) = E_V{Var(X|V)} + Var_V{E(X|V)}.
Since X|V is Poisson with mean V,
E(X|V) = V and Var(X|V) = V,
so E(X) = E{E(X|V)} = E(V) = α/λ, and
Var(X) = E_V{Var(X|V)} + Var_V{E(X|V)}
  = E_V(V) + Var_V(V)
  = α/λ + α/λ²
  = (α/λ)(1 + 1/λ) = α(1 + λ)/λ².
Remark<br />
The marginal distribution of X is sometimes called the negative binomial distribution. In particular, when α is an integer, it corresponds to the definition given previously, with
p = λ/(1 + λ).
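The two moment formulas derived above can be checked by simulating the Gamma mixture of Poissons; a sketch assuming NumPy (note rng.gamma is parameterised by shape and scale, so scale = 1/λ, and the parameter values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
alpha, lam = 3.0, 2.0

# Draw V ~ Gamma(alpha, lam), then X | V ~ Poisson(V).
v = rng.gamma(shape=alpha, scale=1 / lam, size=400_000)
x = rng.poisson(v)

# Theorem 1.9.5 gives E(X) = alpha/lam and Var(X) = alpha(1 + lam)/lam^2.
print(abs(x.mean() - alpha / lam) < 0.02)                # True
print(abs(x.var() - alpha * (1 + lam) / lam**2) < 0.05)  # True
```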
Examples
1. Suppose X ∼ Bernoulli with parameter p. Then M_X(t) = 1 + p(e^t − 1). Now suppose X_1, X_2, ..., X_n are i.i.d. Bernoulli with parameter p and
Y = X_1 + X_2 + ··· + X_n;
then M_Y(t) = (1 + p(e^t − 1))^n, which agrees with the formula previously given for the binomial distribution.
2. Suppose X_1 ∼ N(µ_1, σ_1²) and X_2 ∼ N(µ_2, σ_2²) independently. Find the MGF of Y = X_1 + X_2.
Solution. Recall that
M_{X_1}(t) = e^{tµ_1} e^{t²σ_1²/2},   M_{X_2}(t) = e^{tµ_2} e^{t²σ_2²/2}
⇒ M_Y(t) = M_{X_1}(t) M_{X_2}(t) = e^{t(µ_1+µ_2)} e^{t²(σ_1²+σ_2²)/2}
⇒ Y ∼ N(µ_1 + µ_2, σ_1² + σ_2²).
1.9.2 Marginal distributions and the MGF
To find the moment generating function of the marginal distribution of any set of components of X, set to 0 the complementary elements of t in M_X(t).
Let X = (X_1, X_2)ᵀ and t = (t_1, t_2)ᵀ. Then
M_{X_1}(t_1) = M_X((t_1, 0)ᵀ).
To see this result: note that if A is a constant matrix, and b is a constant vector, then
M_{AX+b}(t) = e^{tᵀb} M_X(Aᵀt).
Proof.
M_{AX+b}(t) = E[e^{tᵀ(AX+b)}] = e^{tᵀb} E[e^{tᵀAX}]
  = e^{tᵀb} E[e^{(Aᵀt)ᵀX}] = e^{tᵀb} M_X(Aᵀt).
Now partition
t_{r×1} = (t_1, t_2)ᵀ
and
A_{l×r} = (I_{l×l}  0_{l×m}).
Note that
AX = (I_{l×l}  0_{l×m}) (X_1, X_2)ᵀ = X_1,
and
Aᵀt_1 = (I_{l×l}; 0_{m×l}) t_1 = (t_1, 0)ᵀ.
Hence,
M_{X_1}(t_1) = M_{AX}(t_1) = M_X(Aᵀt_1) = M_X((t_1, 0)ᵀ),
as required.
Note that similar results hold for more than two random subvectors.
Note that similar results hold for more than two random subvectors.<br />
The major limitation of the MGF is that it may not exist. The characteristic function, on the other hand, is defined for all distributions. Its definition is similar to that of the MGF, with it replacing t, where i = √−1; the properties of the characteristic function are similar to those of the MGF, but using it requires some familiarity with complex analysis.
1.9.3 Vector notation
Consider the random vector
X = (X_1, X_2, ..., X_r)ᵀ,
with E[X_i] = µ_i, Var(X_i) = σ_i² = σ_ii, and Cov(X_i, X_j) = σ_ij.
Define the mean vector by:
µ = E(X) = (µ_1, µ_2, ..., µ_r)ᵀ
and the variance matrix (covariance matrix) by:
Σ = Var(X) = [σ_ij], i = 1, ..., r, j = 1, ..., r.
Finally, the correlation matrix is defined by:
R = [ 1   ρ_12   ...   ρ_1r
          1      ...   ...
                 ...   ρ_{r−1,r}
                       1 ]   (symmetric),
where ρ_ij = σ_ij/(√σ_ii √σ_jj).
Now suppose that a = (a_1, ..., a_r)ᵀ is a vector of constants and observe that:
aᵀx = a_1 x_1 + a_2 x_2 + ··· + a_r x_r = ∑_{i=1}^{r} a_i x_i.
Theorem. 1.9.8
Suppose X has E(X) = µ and Var(X) = Σ, and let a, b ∈ ℝ^r be fixed vectors. Then,
1. E(aᵀX) = aᵀµ
2. Var(aᵀX) = aᵀΣa
3. Cov(aᵀX, bᵀX) = aᵀΣb
Remark
It is easy to check that this is just a re-statement of Theorem 1.9.3 using matrix notation.
Theorem. 1.9.9<br />
Suppose X is a random vector with E(X) = µ and Var(X) = Σ, and let A_{p×r} and b ∈ ℝ^p be fixed.
If Y = AX + b, then
E(Y) = Aµ + b and Var(Y) = AΣAᵀ.
Remark<br />
This is also a re-statement <strong>of</strong> previously established results. To see this, observe that<br />
if a T i is the i th row <strong>of</strong> A, we see that Y i = a T i X and, moreover, the (i, j) th element <strong>of</strong><br />
AΣA T = a T i Σa j = Cov ( a T i X, a T j X )<br />
= Cov(Y i , Y j ).<br />
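Theorem 1.9.9 can be checked empirically for one concrete choice of A, b, µ, Σ (all invented here for illustration; normality is used only as a convenient way to sample with a prescribed mean vector and variance matrix):

```python
import numpy as np

rng = np.random.default_rng(4)

mu = np.array([1.0, -2.0, 0.5])
Sigma = np.array([[2.0, 0.3, 0.0],
                  [0.3, 1.0, -0.4],
                  [0.0, -0.4, 1.5]])   # positive definite
A = np.array([[1.0, 2.0, 0.0],
              [0.0, 1.0, -1.0]])
b = np.array([0.5, -1.0])

X = rng.multivariate_normal(mu, Sigma, size=500_000)
Y = X @ A.T + b                         # Y = AX + b, row by row

# Theorem 1.9.9: E(Y) = A mu + b and Var(Y) = A Sigma A^T.
print(np.allclose(Y.mean(axis=0), A @ mu + b, atol=0.02))    # True
print(np.allclose(np.cov(Y.T), A @ Sigma @ A.T, atol=0.05))  # True
```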
1.9.4 Properties of variance matrices
If Σ = Var(X) for some random vector X = (X_1, X_2, ..., X_r)ᵀ, then it must satisfy certain properties.
Since Cov(X_i, X_j) = Cov(X_j, X_i), it follows that Σ is a square (r × r) symmetric matrix.
Definition. 1.9.4<br />
The square, symmetric matrix M is said to be positive definite [non-negative definite]<br />
if a T Ma > 0 [≥ 0] for every vector a ∈ R r s.t. a ≠ 0.<br />
=⇒ It is necessary and sufficient that Σ be non-negative definite in order that it can<br />
be a variance matrix.<br />
=⇒ To see the necessity <strong>of</strong> this condition, consider the linear combination a T X.<br />
By Theorem 1.9.8, we have<br />
=⇒ Σ must be non-negative definite.<br />
0 ≤ Var(a T X) = a T Σa for every a;<br />
Suppose $\lambda$ is an eigenvalue of $\Sigma$, and let $a$ be the corresponding eigenvector. Then
\[ \Sigma a = \lambda a \implies a^T \Sigma a = \lambda a^T a = \lambda \|a\|^2. \]
Hence $\Sigma$ is non-negative definite (positive definite) iff its eigenvalues are all non-negative (positive).
If $\Sigma$ is non-negative definite but not positive definite, then there must be at least one zero eigenvalue. Let $a$ be the corresponding eigenvector.
$\implies$ $a \neq 0$ but $a^T \Sigma a = 0$. That is, $\operatorname{Var}(a^T X) = 0$ for that $a$.
$\implies$ the distribution of $X$ is degenerate, in the sense that one of the $X_i$'s is either constant or a linear combination of the other components.
Finally, recall that if $\lambda_1, \ldots, \lambda_r$ are the eigenvalues of $\Sigma$, then $\det(\Sigma) = \prod_{i=1}^r \lambda_i$, and so
\[ \det(\Sigma) > 0 \text{ for } \Sigma \text{ positive definite}, \qquad \det(\Sigma) = 0 \text{ for } \Sigma \text{ non-negative definite but not positive definite}. \]
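The eigenvalue criterion gives a practical test of whether a candidate matrix can be a variance matrix. A minimal sketch (the matrices below are illustrative), including a degenerate case of the kind just described:

```python
import numpy as np

def is_valid_variance_matrix(M, tol=1e-10):
    """A symmetric matrix is a valid variance matrix iff it is non-negative
    definite, i.e. all its eigenvalues are >= 0."""
    if not np.allclose(M, M.T):
        return False
    return bool(np.all(np.linalg.eigvalsh(M) >= -tol))

Sigma_pd = np.array([[2.0, 0.5], [0.5, 1.0]])     # positive definite
# Degenerate case: X2 = -X1, so a = (1, 1)^T gives Var(a^T X) = 0.
Sigma_deg = np.array([[1.0, -1.0], [-1.0, 1.0]])  # eigenvalues 2 and 0
Sigma_bad = np.array([[1.0, 2.0], [2.0, 1.0]])    # eigenvalues 3 and -1

print(is_valid_variance_matrix(Sigma_pd))   # True
print(is_valid_variance_matrix(Sigma_deg))  # True (non-negative definite, det = 0)
print(is_valid_variance_matrix(Sigma_bad))  # False
```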
1.10 The multivariate normal distribution
Definition. 1.10.1
The random vector $X = (X_1, \ldots, X_r)^T$ is said to have the $r$-dimensional multivariate normal distribution with parameters $\mu \in \mathbb{R}^r$ and $\Sigma_{r \times r}$ positive definite, if it has PDF
\[ f_X(x) = \frac{1}{(2\pi)^{r/2} |\Sigma|^{1/2}}\, e^{-\frac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu)}. \]
We write $X \sim N_r(\mu, \Sigma)$.
Examples
1. $r = 2$: the bivariate normal distribution.
Let
\[ \mu = \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}, \qquad \Sigma = \begin{bmatrix} \sigma_1^2 & \rho\sigma_1\sigma_2 \\ \rho\sigma_1\sigma_2 & \sigma_2^2 \end{bmatrix} \]
\[ \implies |\Sigma| = \sigma_1^2 \sigma_2^2 (1 - \rho^2) \quad \text{and} \quad \Sigma^{-1} = \frac{1}{\sigma_1^2 \sigma_2^2 (1 - \rho^2)} \begin{pmatrix} \sigma_2^2 & -\rho\sigma_1\sigma_2 \\ -\rho\sigma_1\sigma_2 & \sigma_1^2 \end{pmatrix} \]
\[ \implies f(x_1, x_2) = \frac{1}{2\pi\sigma_1\sigma_2\sqrt{1-\rho^2}} \exp\left\{ \frac{-1}{2(1-\rho^2)} \left[ \left(\frac{x_1-\mu_1}{\sigma_1}\right)^2 + \left(\frac{x_2-\mu_2}{\sigma_2}\right)^2 - 2\rho\left(\frac{x_1-\mu_1}{\sigma_1}\right)\left(\frac{x_2-\mu_2}{\sigma_2}\right) \right] \right\}. \]
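As a sanity check, the expanded bivariate form can be compared with the general matrix form of the density from Definition 1.10.1 at a point; the parameter values below are arbitrary.

```python
import numpy as np

def mvn_pdf(x, mu, Sigma):
    """General N_r(mu, Sigma) density from Definition 1.10.1."""
    r = len(mu)
    d = x - mu
    return np.exp(-0.5 * d @ np.linalg.solve(Sigma, d)) / \
           np.sqrt((2 * np.pi) ** r * np.linalg.det(Sigma))

def bvn_pdf(x1, x2, mu1, mu2, s1, s2, rho):
    """Bivariate normal density in the expanded (mu1, mu2, sigma1, sigma2, rho) form."""
    z1, z2 = (x1 - mu1) / s1, (x2 - mu2) / s2
    q = (z1**2 + z2**2 - 2 * rho * z1 * z2) / (1 - rho**2)
    return np.exp(-0.5 * q) / (2 * np.pi * s1 * s2 * np.sqrt(1 - rho**2))

mu1, mu2, s1, s2, rho = 0.0, 1.0, 1.5, 0.8, 0.6
Sigma = np.array([[s1**2, rho * s1 * s2], [rho * s1 * s2, s2**2]])
x = np.array([0.7, 1.3])

print(np.isclose(bvn_pdf(x[0], x[1], mu1, mu2, s1, s2, rho),
                 mvn_pdf(x, np.array([mu1, mu2]), Sigma)))  # True
```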
2. If $Z_1, Z_2, \ldots, Z_r$ are i.i.d. $N(0,1)$ and $Z = (Z_1, \ldots, Z_r)^T$, then
\[ f_Z(z) = \prod_{i=1}^r \frac{1}{\sqrt{2\pi}} e^{-z_i^2/2} = \frac{1}{(2\pi)^{r/2}} e^{-\frac{1}{2}\sum_{i=1}^r z_i^2} = \frac{1}{(2\pi)^{r/2}} e^{-\frac{1}{2} z^T z}, \]
which is the $N_r(0_r, I_{r \times r})$ PDF.
Theorem. 1.10.1
Suppose $X \sim N_r(\mu, \Sigma)$ and let $Y = AX + b$, where $b \in \mathbb{R}^r$ and $A_{r \times r}$ invertible are fixed. Then $Y \sim N_r(A\mu + b, A\Sigma A^T)$.
Proof.
We use the transformation rule for PDFs: if $X$ has joint PDF $f_X(x)$ and $Y = g(X)$ for $g : \mathbb{R}^r \to \mathbb{R}^r$ invertible and continuously differentiable, then $f_Y(y) = f_X(h(y))\,|H|$, where $h = g^{-1}$ and $H = \left( \frac{\partial h_i}{\partial y_j} \right)$.
In this case, we have $g(x) = Ax + b$
\[ \implies h(y) = A^{-1}(y - b) \quad [\text{solving for } x \text{ in } y = Ax + b] \quad \implies H = A^{-1}. \]
Hence,
\begin{align*}
f_Y(y) &= \frac{1}{(2\pi)^{r/2} |\Sigma|^{1/2}}\, e^{-\frac{1}{2}(A^{-1}(y-b)-\mu)^T \Sigma^{-1} (A^{-1}(y-b)-\mu)}\, |A^{-1}| \\
&= \frac{1}{(2\pi)^{r/2} |\Sigma|^{1/2} |AA^T|^{1/2}}\, e^{-\frac{1}{2}(y-(A\mu+b))^T (A^{-1})^T \Sigma^{-1} A^{-1} (y-(A\mu+b))} \\
&= \frac{1}{(2\pi)^{r/2} |A\Sigma A^T|^{1/2}}\, e^{-\frac{1}{2}(y-(A\mu+b))^T (A\Sigma A^T)^{-1} (y-(A\mu+b))},
\end{align*}
which is exactly the PDF for $N_r(A\mu + b, A\Sigma A^T)$.
Now suppose $\Sigma_{r \times r}$ is any symmetric positive definite matrix and recall that we can write $\Sigma = E \Lambda E^T$, where $\Lambda = \operatorname{diag}(\lambda_1, \lambda_2, \ldots, \lambda_r)$ and $E_{r \times r}$ is such that $EE^T = E^T E = I$.
Since $\Sigma$ is positive definite, we must have $\lambda_i > 0$, $i = 1, \ldots, r$, and we can define the symmetric square-root matrix by
\[ \Sigma^{1/2} = E \Lambda^{1/2} E^T, \quad \text{where } \Lambda^{1/2} = \operatorname{diag}(\sqrt{\lambda_1}, \sqrt{\lambda_2}, \ldots, \sqrt{\lambda_r}). \]
Note that $\Sigma^{1/2}$ is symmetric and satisfies
\[ \left(\Sigma^{1/2}\right)^2 = \Sigma^{1/2} \left(\Sigma^{1/2}\right)^T = \Sigma. \]
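The symmetric square root can be computed exactly as above from the eigendecomposition; a minimal NumPy sketch with an illustrative $\Sigma$:

```python
import numpy as np

def sym_sqrt(Sigma):
    """Symmetric square root: Sigma^{1/2} = E Lambda^{1/2} E^T."""
    lam, E = np.linalg.eigh(Sigma)        # Sigma = E diag(lam) E^T, E orthogonal
    return E @ np.diag(np.sqrt(lam)) @ E.T

Sigma = np.array([[4.0, 1.0], [1.0, 2.0]])
S_half = sym_sqrt(Sigma)

print(np.allclose(S_half, S_half.T))        # True: Sigma^{1/2} is symmetric
print(np.allclose(S_half @ S_half, Sigma))  # True: (Sigma^{1/2})^2 = Sigma
```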
Now recall that if $Z_1, Z_2, \ldots, Z_r$ are i.i.d. $N(0,1)$ then $Z = (Z_1, \ldots, Z_r)^T \sim N_r(0, I)$. Because of the i.i.d. $N(0,1)$ assumption, we know in this case that $E(Z) = 0$ and $\operatorname{Var}(Z) = I$.
Now, let $X = \Sigma^{1/2} Z + \mu$:
1. From Theorem 1.9.9 we have
\[ E(X) = \Sigma^{1/2} E(Z) + \mu = \mu \quad \text{and} \quad \operatorname{Var}(X) = \Sigma^{1/2} \operatorname{Var}(Z) \left(\Sigma^{1/2}\right)^T = \Sigma. \]
2. From Theorem 1.10.1,
\[ X \sim N_r(\mu, \Sigma). \]
Since this construction is valid for any symmetric positive definite $\Sigma$ and any $\mu \in \mathbb{R}^r$, we have proved:
Theorem. 1.10.2
If $X \sim N_r(\mu, \Sigma)$ then $E(X) = \mu$ and $\operatorname{Var}(X) = \Sigma$.
Theorem. 1.10.3
If $X \sim N_r(\mu, \Sigma)$ then $Z = \Sigma^{-1/2}(X - \mu) \sim N_r(0, I)$.
Proof.
Use Theorem 1.10.1.
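The construction $X = \Sigma^{1/2} Z + \mu$ is also how multivariate normal samples can be generated in practice, and Theorem 1.10.3 says that standardising recovers $N_r(0, I)$. A simulation sketch with illustrative $\mu$ and $\Sigma$:

```python
import numpy as np

rng = np.random.default_rng(1)

mu = np.array([2.0, -1.0])
Sigma = np.array([[4.0, 1.0], [1.0, 2.0]])

# Sigma^{1/2} from the eigendecomposition, then X = Sigma^{1/2} Z + mu
lam, E = np.linalg.eigh(Sigma)
S_half = E @ np.diag(np.sqrt(lam)) @ E.T

Z = rng.standard_normal((100_000, 2))   # rows are i.i.d. N_2(0, I) draws
X = Z @ S_half + mu                     # S_half is symmetric, so S_half.T = S_half

print(np.allclose(X.mean(axis=0), mu, atol=0.03))            # True
print(np.allclose(np.cov(X, rowvar=False), Sigma, atol=0.05))  # True

# Standardising recovers N_2(0, I) (Theorem 1.10.3)
Zback = (X - mu) @ np.linalg.inv(S_half)
print(np.allclose(np.cov(Zback, rowvar=False), np.eye(2), atol=0.02))  # True
```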
Suppose
\[ \begin{pmatrix} X_1 \\ X_2 \end{pmatrix} \sim N_{r_1 + r_2} \left( \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}, \begin{bmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{bmatrix} \right), \]
where $X_i, \mu_i$ are of dimension $r_i$ and $\Sigma_{ij}$ is $r_i \times r_j$. In order that $\Sigma$ be symmetric, we must have $\Sigma_{11}, \Sigma_{22}$ symmetric and $\Sigma_{12} = \Sigma_{21}^T$.
We will derive the marginal distribution of $X_2$ and the conditional distribution of $X_1 | X_2$.
Lemma. 1.10.1
If $M = \begin{bmatrix} B & A \\ O & I \end{bmatrix}$ is a square, partitioned matrix, then $|M| = |B|$ (where $|\cdot|$ denotes the determinant).
Lemma. 1.10.2
If $M = \begin{bmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{bmatrix}$ then $|M| = |\Sigma_{22}|\, |\Sigma_{11} - \Sigma_{12} \Sigma_{22}^{-1} \Sigma_{21}|$.
Proof. (of Lemma 1.10.2)
Let
\[ C = \begin{bmatrix} I & -\Sigma_{12}\Sigma_{22}^{-1} \\ 0 & \Sigma_{22}^{-1} \end{bmatrix} \]
and observe
\[ CM = \begin{bmatrix} \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21} & 0 \\ \Sigma_{22}^{-1}\Sigma_{21} & I \end{bmatrix}, \]
and
\[ |C| = \frac{1}{|\Sigma_{22}|}, \qquad |CM| = |\Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}|. \]
Finally, observe $|CM| = |C||M|$
\[ \implies |M| = \frac{|CM|}{|C|} = |\Sigma_{22}|\, |\Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}|. \]
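Lemma 1.10.2 can be verified numerically; the blocks below are illustrative values forming a symmetric partitioned matrix.

```python
import numpy as np

# Blocks of a partitioned symmetric matrix (illustrative values)
S11 = np.array([[2.0, 0.3], [0.3, 1.5]])
S12 = np.array([[0.4], [0.1]])
S22 = np.array([[1.0]])

M = np.block([[S11, S12], [S12.T, S22]])

lhs = np.linalg.det(M)
rhs = np.linalg.det(S22) * np.linalg.det(S11 - S12 @ np.linalg.inv(S22) @ S12.T)

print(np.isclose(lhs, rhs))  # True: |M| = |S22| |S11 - S12 S22^{-1} S21|
```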
Theorem. 1.10.4
Suppose that
\[ \begin{pmatrix} X_1 \\ X_2 \end{pmatrix} \sim N_{r_1 + r_2} \left( \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}, \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix} \right). \]
Then the marginal distribution of $X_2$ is $X_2 \sim N_{r_2}(\mu_2, \Sigma_{22})$ and the conditional distribution of $X_1 | X_2$ is
\[ N_{r_1}\left( \mu_1 + \Sigma_{12}\Sigma_{22}^{-1}(x_2 - \mu_2),\; \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21} \right). \]
Proof.
We will exhibit $f(x_1, x_2)$ in the form
\[ f(x_1, x_2) = h(x_1, x_2)\, g(x_2), \]
where $h(x_1, x_2)$ is a PDF with respect to $x_1$ for each (given) $x_2$. It follows then that $f_{X_1|X_2}(x_1|x_2) = h(x_1, x_2)$ and $g(x_2) = f_{X_2}(x_2)$.
Now observe that
\[ f(x_1, x_2) = \frac{1}{(2\pi)^{(r_1+r_2)/2} \left| \begin{smallmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{smallmatrix} \right|^{1/2}} \exp\left[ -\frac{1}{2} \begin{pmatrix} x_1 - \mu_1 \\ x_2 - \mu_2 \end{pmatrix}^T \begin{bmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{bmatrix}^{-1} \begin{pmatrix} x_1 - \mu_1 \\ x_2 - \mu_2 \end{pmatrix} \right], \]
and, writing $V = \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}$:
1. By Lemma 1.10.2,
\[ (2\pi)^{(r_1+r_2)/2} \left| \begin{smallmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{smallmatrix} \right|^{1/2} = \left( (2\pi)^{r_1/2}\, |V|^{1/2} \right) \left( (2\pi)^{r_2/2}\, |\Sigma_{22}|^{1/2} \right). \]
2. Using the partitioned-inverse formula,
\[ \begin{bmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{bmatrix}^{-1} = \begin{bmatrix} V^{-1} & -V^{-1}\Sigma_{12}\Sigma_{22}^{-1} \\ -\Sigma_{22}^{-1}\Sigma_{21}V^{-1} & \Sigma_{22}^{-1} + \Sigma_{22}^{-1}\Sigma_{21}V^{-1}\Sigma_{12}\Sigma_{22}^{-1} \end{bmatrix}, \]
so the quadratic form expands as
\begin{align*}
& \begin{pmatrix} x_1 - \mu_1 \\ x_2 - \mu_2 \end{pmatrix}^T \begin{bmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{bmatrix}^{-1} \begin{pmatrix} x_1 - \mu_1 \\ x_2 - \mu_2 \end{pmatrix} \\
&= (x_1-\mu_1)^T V^{-1} (x_1-\mu_1) - (x_1-\mu_1)^T V^{-1}\Sigma_{12}\Sigma_{22}^{-1}(x_2-\mu_2) \\
&\quad - (x_2-\mu_2)^T \Sigma_{22}^{-1}\Sigma_{21} V^{-1} (x_1-\mu_1) + (x_2-\mu_2)^T \Sigma_{22}^{-1}\Sigma_{21} V^{-1} \Sigma_{12}\Sigma_{22}^{-1} (x_2-\mu_2) \\
&\quad + (x_2-\mu_2)^T \Sigma_{22}^{-1} (x_2-\mu_2).
\end{align*}
[How did this come about:
\[ (x, y) \begin{bmatrix} a & b \\ c & d \end{bmatrix} \begin{pmatrix} x \\ y \end{pmatrix} = (x, y) \begin{bmatrix} ax + by \\ cx + dy \end{bmatrix} = x(ax+by) + y(cx+dy) = ax^2 + bxy + cyx + dy^2. \,] \]
Completing the square, the quadratic form equals
\[ \left( (x_1-\mu_1) - \Sigma_{12}\Sigma_{22}^{-1}(x_2-\mu_2) \right)^T V^{-1} \left( (x_1-\mu_1) - \Sigma_{12}\Sigma_{22}^{-1}(x_2-\mu_2) \right) + (x_2-\mu_2)^T \Sigma_{22}^{-1}(x_2-\mu_2) \]
[Note: $(x-y)^T A (x-y) = (x-y)^T(Ax - Ay) = x^T A x - x^T A y - y^T A x + y^T A y$; also $\Sigma_{12}^T = \Sigma_{21}$ and $V = \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}$.]
\[ = \left( x_1 - (\mu_1 + \Sigma_{12}\Sigma_{22}^{-1}(x_2-\mu_2)) \right)^T V^{-1} \left( x_1 - (\mu_1 + \Sigma_{12}\Sigma_{22}^{-1}(x_2-\mu_2)) \right) + (x_2-\mu_2)^T \Sigma_{22}^{-1}(x_2-\mu_2). \]
Combining (1) and (2) we obtain:
\begin{align*}
f(x_1, x_2) &= \frac{1}{(2\pi)^{(r_1+r_2)/2} \left| \begin{smallmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{smallmatrix} \right|^{1/2}} \exp\left\{ -\frac{1}{2} \begin{pmatrix} x_1-\mu_1 \\ x_2-\mu_2 \end{pmatrix}^T \begin{bmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{bmatrix}^{-1} \begin{pmatrix} x_1-\mu_1 \\ x_2-\mu_2 \end{pmatrix} \right\} \\
&= \frac{1}{(2\pi)^{r_1/2} |\Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}|^{1/2}} \exp\left\{ -\frac{1}{2} \left( x_1 - (\mu_1 + \Sigma_{12}\Sigma_{22}^{-1}(x_2-\mu_2)) \right)^T \left( \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21} \right)^{-1} \left( x_1 - (\mu_1 + \Sigma_{12}\Sigma_{22}^{-1}(x_2-\mu_2)) \right) \right\} \\
&\quad \times \frac{1}{(2\pi)^{r_2/2} |\Sigma_{22}|^{1/2}} \exp\left\{ -\frac{1}{2} (x_2-\mu_2)^T \Sigma_{22}^{-1} (x_2-\mu_2) \right\} \\
&= h(x_1, x_2)\, g(x_2),
\end{align*}
where, for fixed $x_2$, $h(x_1, x_2)$ is the $N_{r_1}\left( \mu_1 + \Sigma_{12}\Sigma_{22}^{-1}(x_2-\mu_2),\; \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21} \right)$ PDF, and $g(x_2)$ is the $N_{r_2}(\mu_2, \Sigma_{22})$ PDF.
Hence, the result is proved.
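The factorisation in Theorem 1.10.4 can be checked numerically at a single point; scalar blocks ($r_1 = r_2 = 1$) with assumed parameter values are used here for simplicity.

```python
import numpy as np

def norm_pdf(x, m, v):
    """Univariate N(m, v) density."""
    return np.exp(-0.5 * (x - m)**2 / v) / np.sqrt(2 * np.pi * v)

mu1, mu2 = 1.0, -1.0
S11, S12, S22 = 2.0, 0.8, 1.0   # scalar blocks; S21 = S12
Sigma = np.array([[S11, S12], [S12, S22]])

x1, x2 = 0.3, 0.5
d = np.array([x1 - mu1, x2 - mu2])
joint = np.exp(-0.5 * d @ np.linalg.solve(Sigma, d)) / \
        (2 * np.pi * np.sqrt(np.linalg.det(Sigma)))

# Theorem 1.10.4: joint = (conditional of x1 given x2) * (marginal of x2)
cond = norm_pdf(x1, mu1 + S12 / S22 * (x2 - mu2), S11 - S12**2 / S22)
marg = norm_pdf(x2, mu2, S22)

print(np.isclose(joint, cond * marg))  # True
```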
Theorem. 1.10.5
Suppose that $X \sim N_r(\mu, \Sigma)$ and $Y = AX + b$, where $A_{p \times r}$ with linearly independent rows and $b \in \mathbb{R}^p$ are fixed. Then $Y \sim N_p(A\mu + b, A\Sigma A^T)$. [Note: $p \leq r$.]
Proof. Use the method of regular transformations.
Since the rows of $A$ are linearly independent, we can find $B_{(r-p) \times r}$ such that the $r \times r$ matrix $\begin{pmatrix} B \\ A \end{pmatrix}$ is invertible.
If we take $Z = BX$, we have
\[ \begin{pmatrix} Z \\ Y \end{pmatrix} = \begin{pmatrix} B \\ A \end{pmatrix} X + \begin{pmatrix} 0 \\ b \end{pmatrix} \sim N\left( \begin{bmatrix} B\mu \\ A\mu + b \end{bmatrix}, \begin{bmatrix} B\Sigma B^T & B\Sigma A^T \\ A\Sigma B^T & A\Sigma A^T \end{bmatrix} \right) \]
by Theorem 1.10.1. Hence, from Theorem 1.10.4, the marginal distribution for $Y$ is $N_p(A\mu + b, A\Sigma A^T)$.
1.10.1 The multivariate normal MGF
The moment generating function of a random vector $X \sim N(\mu, \Sigma)$ is given by
\[ M_X(t) = e^{t^T \mu + \frac{1}{2} t^T \Sigma t}. \]
Prove this result as an exercise!
The characteristic function of $X$ is
\[ E[\exp(i t^T X)] = \exp\left( i t^T \mu - \frac{1}{2} t^T \Sigma t \right). \]
The marginal distribution of $X_1$ (or $X_2$) is easy to derive using the multivariate normal MGF. Let
\[ t = \begin{pmatrix} t_1 \\ t_2 \end{pmatrix}, \qquad \mu = \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}. \]
Then the marginal distribution of $X_1$ is obtained by setting $t_2 = 0$ in the expression for the MGF of $X$.
Proof:
\begin{align*}
M_X(t) &= \exp\left( t^T \mu + \tfrac{1}{2} t^T \Sigma t \right) \\
&= \exp\left( t_1^T \mu_1 + t_2^T \mu_2 + \tfrac{1}{2} t_1^T \Sigma_{11} t_1 + t_1^T \Sigma_{12} t_2 + \tfrac{1}{2} t_2^T \Sigma_{22} t_2 \right).
\end{align*}
Now,
\[ M_{X_1}(t_1) = M_X \begin{pmatrix} t_1 \\ 0 \end{pmatrix} = \exp\left( t_1^T \mu_1 + \tfrac{1}{2} t_1^T \Sigma_{11} t_1 \right), \]
which is the MGF of $X_1 \sim N_{r_1}(\mu_1, \Sigma_{11})$. Similarly for $X_2$.
This means that all marginal distributions of a multivariate normal distribution are themselves multivariate normal. Note, though, that in general the opposite implication is not true: there are examples of non-normal multivariate distributions whose marginal distributions are normal.
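The "set $t_2 = 0$" device is easy to verify numerically: the joint MGF evaluated at $(t_1, 0)$ agrees with the MGF of $N(\mu_1, \Sigma_{11})$. The parameter values below are illustrative.

```python
import numpy as np

def mvn_mgf(t, mu, Sigma):
    """M_X(t) = exp(t^T mu + t^T Sigma t / 2)."""
    return np.exp(t @ mu + 0.5 * t @ Sigma @ t)

mu = np.array([1.0, -2.0, 0.5])
Sigma = np.array([[2.0, 0.3, 0.1],
                  [0.3, 1.0, -0.2],
                  [0.1, -0.2, 0.5]])

# Setting t2 = 0 picks out the marginal MGF of X1 (the first two coordinates)
t1 = np.array([0.2, -0.1])
t_full = np.array([0.2, -0.1, 0.0])

print(np.isclose(mvn_mgf(t_full, mu, Sigma),
                 mvn_mgf(t1, mu[:2], Sigma[:2, :2])))  # True
```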
1.10.2 Independence and normality
We have seen previously that $X, Y$ independent $\implies \operatorname{Cov}(X, Y) = 0$, but not vice-versa. An exception is when the data are (jointly) normally distributed. In particular, if $(X, Y)$ has the bivariate normal distribution, then $\operatorname{Cov}(X, Y) = 0 \iff X, Y$ are independent.
Theorem. 1.10.6
Suppose $X_1, X_2, \ldots, X_r$ have a multivariate normal distribution. Then $X_1, X_2, \ldots, X_r$ are independent if and only if $\operatorname{Cov}(X_i, X_j) = 0$ for all $i \neq j$.
Proof.
($\implies$) That $X_1, \ldots, X_r$ independent $\implies \operatorname{Cov}(X_i, X_j) = 0$ for $i \neq j$ has already been proved.
($\impliedby$) Suppose $\operatorname{Cov}(X_i, X_j) = 0$ for $i \neq j$.
\[ \implies \operatorname{Var}(X) = \Sigma = \begin{bmatrix} \sigma_{11} & 0 & \ldots & 0 \\ 0 & \sigma_{22} & & \vdots \\ \vdots & & \ddots & \vdots \\ 0 & \ldots & 0 & \sigma_{rr} \end{bmatrix} \implies \Sigma^{-1} = \begin{bmatrix} \sigma_{11}^{-1} & 0 & \ldots & 0 \\ 0 & \sigma_{22}^{-1} & & \vdots \\ \vdots & & \ddots & \vdots \\ 0 & \ldots & 0 & \sigma_{rr}^{-1} \end{bmatrix} \quad \text{and} \quad |\Sigma| = \sigma_{11}\sigma_{22}\cdots\sigma_{rr}. \]
Hence
\begin{align*}
f_X(x) &= \frac{1}{(2\pi)^{r/2} |\Sigma|^{1/2}}\, e^{-\frac{1}{2}(x-\mu)^T \Sigma^{-1}(x-\mu)} \\
&= \frac{1}{(2\pi)^{r/2} (\sigma_{11}\sigma_{22}\cdots\sigma_{rr})^{1/2}}\, e^{-\frac{1}{2}\sum_{i=1}^r \frac{(x_i-\mu_i)^2}{\sigma_{ii}}} \\
&= \left( \frac{1}{\sqrt{2\pi}\sqrt{\sigma_{11}}}\, e^{-\frac{1}{2}\frac{(x_1-\mu_1)^2}{\sigma_{11}}} \right) \left( \frac{1}{\sqrt{2\pi}\sqrt{\sigma_{22}}}\, e^{-\frac{1}{2}\frac{(x_2-\mu_2)^2}{\sigma_{22}}} \right) \cdots \left( \frac{1}{\sqrt{2\pi}\sqrt{\sigma_{rr}}}\, e^{-\frac{1}{2}\frac{(x_r-\mu_r)^2}{\sigma_{rr}}} \right) \\
&= f_1(x_1) f_2(x_2) \cdots f_r(x_r)
\end{align*}
\[ \implies X_1, \ldots, X_r \text{ are independent.} \]
Note:
\[ (x-\mu)^T \Sigma^{-1} (x-\mu) = (x_1-\mu_1, \ldots, x_r-\mu_r) \begin{bmatrix} \sigma_{11}^{-1} & 0 & \ldots & 0 \\ 0 & \sigma_{22}^{-1} & & \vdots \\ \vdots & & \ddots & \vdots \\ 0 & \ldots & 0 & \sigma_{rr}^{-1} \end{bmatrix} \begin{pmatrix} x_1-\mu_1 \\ x_2-\mu_2 \\ \vdots \\ x_r-\mu_r \end{pmatrix} = \sum_{i=1}^r \frac{(x_i-\mu_i)^2}{\sigma_{ii}}. \]
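The diagonal-$\Sigma$ factorisation used in the proof can be confirmed at a point; the values below are illustrative.

```python
import numpy as np

def mvn_pdf(x, mu, Sigma):
    """N_r(mu, Sigma) density."""
    r = len(mu)
    d = x - mu
    return np.exp(-0.5 * d @ np.linalg.solve(Sigma, d)) / \
           np.sqrt((2 * np.pi) ** r * np.linalg.det(Sigma))

def norm_pdf(x, m, v):
    return np.exp(-0.5 * (x - m)**2 / v) / np.sqrt(2 * np.pi * v)

mu = np.array([1.0, -1.0, 0.0])
Sigma = np.diag([2.0, 0.5, 1.5])      # all covariances zero

x = np.array([0.3, -0.7, 0.2])
product = np.prod([norm_pdf(x[i], mu[i], Sigma[i, i]) for i in range(3)])

# The joint density factorises into the product of the marginals => independence
print(np.isclose(mvn_pdf(x, mu, Sigma), product))  # True
```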
Remark
The same methods can be used to establish a similar result for block diagonal matrices. The simplest case, which is most easily proved using moment generating functions, is the following: $X_1, X_2$ are independent if and only if $\Sigma_{12} = 0$, i.e.,
\[ \begin{pmatrix} X_1 \\ X_2 \end{pmatrix} \sim N_{r_1+r_2}\left( \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}, \begin{pmatrix} \Sigma_{11} & 0 \\ 0 & \Sigma_{22} \end{pmatrix} \right). \]
Theorem. 1.10.7
Suppose $X_1, X_2, \ldots, X_n$ are i.i.d. $N(\mu, \sigma^2)$ RVs and let
\[ \bar{X} = \frac{1}{n} \sum_{i=1}^n X_i, \qquad S^2 = \frac{1}{n-1} \sum_{i=1}^n (X_i - \bar{X})^2. \]
Then $\bar{X} \sim N(\mu, \sigma^2/n)$ and $\dfrac{(n-1)S^2}{\sigma^2} \sim \chi^2_{n-1}$, independently.
(Note: $S^2$ here is a random variable, and is not to be confused with the sample covariance matrix, which is also often denoted by $S$. Hopefully, the meaning of the notation will be clear from the context in which it is used.)
Proof. Observe first that if $X = (X_1, \ldots, X_n)^T$ then the i.i.d. assumption may be written as
\[ X \sim N_n(\mu \mathbf{1}_n, \sigma^2 I_{n \times n}). \]
1. $\bar{X} \sim N(\mu, \sigma^2/n)$:
Observe that $\bar{X} = BX$, where
\[ B = \begin{bmatrix} \frac{1}{n} & \frac{1}{n} & \ldots & \frac{1}{n} \end{bmatrix}. \]
Hence, by Theorem 1.10.5,
\[ \bar{X} \sim N(B \mu \mathbf{1}, \sigma^2 B B^T) \equiv N(\mu, \sigma^2/n), \]
as required.
2. Independence of $\bar{X}$ and $S^2$: consider
\[ A = \begin{bmatrix} \frac{1}{n} & \frac{1}{n} & \ldots & \frac{1}{n} & \frac{1}{n} \\ 1 - \frac{1}{n} & -\frac{1}{n} & \ldots & -\frac{1}{n} & -\frac{1}{n} \\ -\frac{1}{n} & 1 - \frac{1}{n} & \ldots & -\frac{1}{n} & -\frac{1}{n} \\ \vdots & & \ddots & & \vdots \\ -\frac{1}{n} & -\frac{1}{n} & \ldots & 1 - \frac{1}{n} & -\frac{1}{n} \end{bmatrix} \]
and observe
\[ AX = \begin{pmatrix} \bar{X} \\ X_1 - \bar{X} \\ X_2 - \bar{X} \\ \vdots \\ X_{n-1} - \bar{X} \end{pmatrix}. \]
We can check that $\operatorname{Var}(AX) = \sigma^2 A A^T$ has the form
\[ \sigma^2 \begin{bmatrix} \frac{1}{n} & 0 & \ldots & 0 \\ 0 & & & \\ \vdots & & \Sigma_{22} & \\ 0 & & & \end{bmatrix} \]
\[ \implies \bar{X} \text{ and } (X_1 - \bar{X}, X_2 - \bar{X}, \ldots, X_{n-1} - \bar{X}) \text{ are independent} \]
(by multivariate normality).
Finally, since
\[ X_n - \bar{X} = -\left( (X_1 - \bar{X}) + (X_2 - \bar{X}) + \cdots + (X_{n-1} - \bar{X}) \right), \]
it follows that $S^2$ is a function of $(X_1 - \bar{X}, X_2 - \bar{X}, \ldots, X_{n-1} - \bar{X})$ and hence is independent of $\bar{X}$.
3. Proof that $\dfrac{(n-1)S^2}{\sigma^2} \sim \chi^2_{n-1}$:
Consider the identity
\[ \sum_{i=1}^n (X_i - \mu)^2 = \sum_{i=1}^n (X_i - \bar{X})^2 + n(\bar{X} - \mu)^2 \]
(subtract and add $\bar{X}$ inside the first term)
\[ \implies \sum_{i=1}^n \frac{(X_i - \mu)^2}{\sigma^2} = \sum_{i=1}^n \frac{(X_i - \bar{X})^2}{\sigma^2} + \left( \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \right)^2, \]
and let
\[ R_1 = \sum_{i=1}^n \frac{(X_i - \mu)^2}{\sigma^2}, \qquad R_2 = \sum_{i=1}^n \frac{(X_i - \bar{X})^2}{\sigma^2} = \frac{(n-1)S^2}{\sigma^2}, \qquad R_3 = \left( \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \right)^2. \]
If $M_1, M_2, M_3$ are the MGFs of $R_1, R_2, R_3$ respectively, then
\[ M_1(t) = M_2(t)\, M_3(t), \quad \text{since } R_1 = R_2 + R_3 \text{ with } R_2 \text{ and } R_3 \text{ independent} \]
($R_2$ depends only on $S^2$, and $R_3$ depends only on $\bar{X}$)
\[ \implies M_2(t) = \frac{M_1(t)}{M_3(t)}. \]
Next, observe that
\[ R_3 = \left( \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \right)^2 \sim \chi^2_1 \implies M_3(t) = \frac{1}{(1-2t)^{1/2}}, \]
and
\[ R_1 = \sum_{i=1}^n \frac{(X_i - \mu)^2}{\sigma^2} \sim \chi^2_n \implies M_1(t) = \frac{1}{(1-2t)^{n/2}}. \]
\[ \therefore M_2(t) = \frac{1}{(1-2t)^{(n-1)/2}}, \]
which is the MGF for $\chi^2_{n-1}$. Hence,
\[ \frac{(n-1)S^2}{\sigma^2} \sim \chi^2_{n-1}. \]
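Theorem 1.10.7 can be illustrated by simulation; $n$, $\mu$ and $\sigma$ below are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(3)

n, mu, sigma = 10, 5.0, 2.0
reps = 200_000
X = rng.normal(mu, sigma, size=(reps, n))

xbar = X.mean(axis=1)
S2 = X.var(axis=1, ddof=1)            # sample variance with divisor n - 1
Q = (n - 1) * S2 / sigma**2           # should be chi^2 with n - 1 = 9 df

print(np.isclose(xbar.var(), sigma**2 / n, atol=0.05))   # Var(Xbar) = sigma^2/n
print(np.isclose(Q.mean(), n - 1, atol=0.1))             # E[chi^2_9] = 9
print(np.isclose(Q.var(), 2 * (n - 1), atol=0.5))        # Var[chi^2_9] = 18
print(abs(np.corrcoef(xbar, S2)[0, 1]) < 0.01)           # Xbar and S^2 uncorrelated
```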
Corollary.
If $X_1, \ldots, X_n$ are i.i.d. $N(\mu, \sigma^2)$, then
\[ T = \frac{\bar{X} - \mu}{S/\sqrt{n}} \sim t_{n-1}. \]
Proof.
Recall that $t_k$ is the distribution of $\dfrac{Z}{\sqrt{V/k}}$, where $Z \sim N(0,1)$ and $V \sim \chi^2_k$ independently.
Now observe that
\[ T = \frac{\bar{X} - \mu}{S/\sqrt{n}} = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \bigg/ \sqrt{\frac{(n-1)S^2/\sigma^2}{n-1}}, \]
and note that
\[ \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \sim N(0,1) \quad \text{and} \quad \frac{(n-1)S^2}{\sigma^2} \sim \chi^2_{n-1}, \]
independently.
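The corollary can likewise be checked by simulation: for $n = 6$, $T$ should follow $t_5$, which has mean 0 and variance $(n-1)/(n-3) = 5/3$. The values of $\mu$ and $\sigma$ are arbitrary, since $T$ is a pivot.

```python
import numpy as np

rng = np.random.default_rng(4)

n, mu, sigma = 6, 0.0, 3.0
reps = 400_000
X = rng.normal(mu, sigma, size=(reps, n))

T = (X.mean(axis=1) - mu) / np.sqrt(X.var(axis=1, ddof=1) / n)

# T should follow t with n - 1 = 5 df: mean 0, variance (n-1)/(n-3) = 5/3
print(np.isclose(T.mean(), 0.0, atol=0.01))
print(np.isclose(T.var(), 5 / 3, atol=0.05))
```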
1.11 Limit Theorems
1.11.1 Convergence of random variables
Let $X_1, X_2, X_3, \ldots$ be an infinite sequence of RVs. We will consider 4 different types of convergence.
1. Convergence in probability (weak convergence)
The sequence $\{X_n\}$ is said to converge to the constant $\alpha \in \mathbb{R}$ in probability if for every $\varepsilon > 0$,
\[ \lim_{n \to \infty} P(|X_n - \alpha| > \varepsilon) = 0. \]
2. Convergence in quadratic mean
The sequence $\{X_n\}$ is said to converge to $\alpha$ in quadratic mean if
\[ \lim_{n \to \infty} E\left( (X_n - \alpha)^2 \right) = 0. \]
3. Almost sure convergence (strong convergence)
The sequence $\{X_n\}$ is said to converge almost surely to $\alpha$ if for each $\varepsilon > 0$,
\[ |X_n - \alpha| > \varepsilon \quad \text{for only a finite number of } n \geq 1. \]
Remarks
(1), (2), (3) are related by:
[Figure 15: Relationships of convergence]
4. Convergence in distribution
The sequence of RVs $\{X_n\}$ with CDFs $\{F_n\}$ is said to converge in distribution to the RV $X$ with CDF $F(x)$ if
\[ \lim_{n \to \infty} F_n(x) = F(x) \quad \text{for all continuity points of } F. \]
[Figure 16: Discrete case]
[Figure 17: Normal approximation to the binomial (an application of the Central Limit Theorem): continuous case]
Remarks
1. If we take
\[ F(x) = \begin{cases} 0 & x < \alpha \\ 1 & x \geq \alpha \end{cases} \]
then we have $X = \alpha$ with probability 1, and convergence in distribution to this $F$ is the same thing as convergence in probability to $\alpha$.
2. Commonly used notation for convergence in distribution is either:
(i) $\mathcal{L}[X_n] \to \mathcal{L}[X]$
(ii) $X_n \xrightarrow{D} \mathcal{L}[X]$, or e.g., $X_n \xrightarrow{D} N(0,1)$.
3. An important result that we will use without proof is as follows:
Let $M_n(t)$ be the MGF of $X_n$ and $M(t)$ be the MGF of $X$. If $M_n(t) \to M(t)$ for each $t$ in some open interval containing 0, as $n \to \infty$, then $\mathcal{L}[X_n] \xrightarrow{D} \mathcal{L}[X]$.
(Sometimes called the Continuity Theorem.)
Theorem. 1.11.1 (Weak law of large numbers)
Suppose $X_1, X_2, X_3, \ldots$ is a sequence of i.i.d. RVs with $E[X_i] = \mu$, $\operatorname{Var}(X_i) = \sigma^2$, and let
\[ \bar{X}_n = \frac{1}{n} \sum_{i=1}^n X_i. \]
Then $\bar{X}_n$ converges to $\mu$ in probability.
Proof. We need to show for each $\epsilon > 0$ that
\[ \lim_{n \to \infty} P(|\bar{X}_n - \mu| > \epsilon) = 0. \]
Now observe that $E[\bar{X}_n] = \mu$ and $\operatorname{Var}(\bar{X}_n) = \dfrac{\sigma^2}{n}$. So by Chebyshev's inequality,
\[ P(|\bar{X}_n - \mu| > \epsilon) \leq \frac{\sigma^2}{n\epsilon^2} \to 0 \quad \text{as } n \to \infty, \text{ for any fixed } \epsilon > 0. \]
Remarks
1. The proof given for Theorem 1.11.1 is really a corollary to the fact that $\bar{X}_n$ also converges to $\mu$ in quadratic mean.
2. There is also a version of this theorem involving almost sure convergence (the strong law of large numbers). We will not discuss this.
3. The law of large numbers is one of the fundamental principles of statistical inference. That is, it is the formal justification for the claim that the "sample mean approaches the population mean for large n".
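The weak law can be illustrated with running means of uniform draws, together with the Chebyshev bound from the proof; the choices of distribution, $n$ and $\epsilon$ are illustrative.

```python
import numpy as np

rng = np.random.default_rng(5)

# Running means of iid Uniform(0, 1) draws (mu = 1/2, sigma^2 = 1/12)
x = rng.uniform(0, 1, size=100_000)
running_mean = np.cumsum(x) / np.arange(1, len(x) + 1)

eps = 0.01
print(abs(running_mean[-1] - 0.5) < eps)   # the final running mean is within eps of mu

# Chebyshev bound from the proof: P(|Xbar_n - mu| > eps) <= sigma^2 / (n eps^2)
print((1 / 12) / (len(x) * eps**2))        # bound ~ 0.0083 at n = 100000
```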
Lemma. 1.11.1
Suppose $a_n$ is a sequence of real numbers s.t. $\lim_{n \to \infty} n a_n = a$ with $|a| < \infty$. Then
\[ \lim_{n \to \infty} (1 + a_n)^n = e^a. \]
Proof.
Omitted (but not difficult).
Remark
This is a simple generalisation of the standard limit
\[ \lim_{n \to \infty} \left( 1 + \frac{x}{n} \right)^n = e^x. \]
Consider a sequence of i.i.d. RVs $X_1, X_2, \ldots$ with $E(X_i) = \mu$, $\operatorname{Var}(X_i) = \sigma^2$ and such that the MGF, $M_X(t)$, is defined for all $t$ in some open interval containing 0.
Let $S_n = \sum_{i=1}^n X_i$ and note that $E[S_n] = n\mu$ and $\operatorname{Var}(S_n) = n\sigma^2$.
Theorem. 1.11.2 (Central Limit Theorem)
Let $X_1, X_2, \ldots, S_n$ be as above and let $Z_n = \dfrac{S_n - n\mu}{\sigma\sqrt{n}}$. Then
\[ \mathcal{L}[Z_n] \to N(0, 1) \quad \text{as } n \to \infty. \]
Proof.
We will use the fact that it is sufficient to prove that
\[ M_{Z_n}(t) \to e^{t^2/2} \quad \text{for each fixed } t. \]
[Note: if $Z \sim N(0,1)$ then $M_Z(t) = e^{t^2/2}$.]
Now let $U_i = \dfrac{X_i - \mu}{\sigma}$ and observe that $Z_n = \dfrac{1}{\sqrt{n}} \sum_{i=1}^n U_i$.
Since the $U_i$ are independent, we have:
\[ M_{Z_n}(t) = \left\{ M_U(t/\sqrt{n}) \right\}^n \qquad \left[\, M_{aX}(t) = E[e^{taX}] = M_X(at) \,\right] \]
Now,
\[ E(U_i) = 0 \implies M_U'(0) = 0, \qquad \operatorname{Var}(U_i) = 1 \implies M_U''(0) = 1. \]
Consider the second-order Taylor expansion
\[ M_U(t) = M_U(0) + M_U'(0)\, t + M_U''(0)\, \frac{t^2}{2} + r(t) = 1 + 0 + \frac{t^2}{2} + r(t) \]
(using $M_U(0) = 1$ and $M_U'(0) = 0$), where $\lim_{s \to 0} \dfrac{r(s)}{s^2} = 0$. Hence
\[ M_{Z_n}(t) = \left\{ M_U(t/\sqrt{n}) \right\}^n = \left\{ 1 + \frac{t^2}{2n} + r(t/\sqrt{n}) \right\}^n = (1 + a_n)^n, \quad \text{where } a_n = \frac{t^2}{2n} + r(t/\sqrt{n}). \]
Next observe that $\lim_{n \to \infty} n a_n = \dfrac{t^2}{2}$ for fixed $t$.
[To check this, observe that
\[ \lim_{n \to \infty} \frac{n t^2}{2n} = \frac{t^2}{2} \quad \text{and} \quad \lim_{n \to \infty} n\, r(t/\sqrt{n}) = \lim_{n \to \infty} t^2\, \frac{r(t/\sqrt{n})}{(t/\sqrt{n})^2} = t^2 \lim_{s \to 0} \frac{r(s)}{s^2} = 0 \quad \text{for fixed } t. \]
Note: $s = t/\sqrt{n} \to 0$ as $n \to \infty$.]
Hence we have shown $M_{Z_n}(t) = (1 + a_n)^n$, where $\lim_{n \to \infty} n a_n = t^2/2$, so by Lemma 1.11.1,
\[ \lim_{n \to \infty} M_{Z_n}(t) = e^{t^2/2} \quad \text{for each fixed } t. \]
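The Central Limit Theorem can be illustrated with standardised sums of exponential RVs, which are far from normal individually; the choices of $n$ and the number of replications are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(6)

# Standardised sums of iid Exponential(1) RVs (mu = 1, sigma = 1)
n, reps = 400, 20_000
S = rng.exponential(1.0, size=(reps, n)).sum(axis=1)
Z = (S - n * 1.0) / (1.0 * np.sqrt(n))

# Z_n should be approximately N(0, 1)
print(np.isclose(Z.mean(), 0.0, atol=0.03))
print(np.isclose(Z.var(), 1.0, atol=0.05))
print(np.isclose(np.mean(Z < 1.645), 0.95, atol=0.015))  # P(Z < 1.645) close to 0.95
```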
Remarks
1. The Central Limit Theorem can be stated equivalently for $\bar{X}_n$, i.e.,
\[ \mathcal{L}\left( \frac{\bar{X}_n - \mu}{\sigma/\sqrt{n}} \right) \to N(0, 1) \]
(just note that $\dfrac{\bar{X}_n - \mu}{\sigma/\sqrt{n}} = \dfrac{S_n - n\mu}{\sigma\sqrt{n}}$).
2. The Central Limit Theorem holds under conditions more general than those given above. In particular, with suitable assumptions,
(i) $M_X(t)$ need not exist;
(ii) $X_1, X_2, \ldots$ need not be i.i.d.
3. Theorems 1.11.1 and 1.11.2 are concerned with the asymptotic behaviour of $\bar{X}_n$.
Theorem 1.11.1 states $\bar{X}_n \to \mu$ in probability as $n \to \infty$.
Theorem 1.11.2 states $\dfrac{\bar{X}_n - \mu}{\sigma/\sqrt{n}} \xrightarrow{D} N(0, 1)$ as $n \to \infty$.
These results are not contradictory, because $\operatorname{Var}(\bar{X}_n) \to 0$, while the Central Limit Theorem is concerned with the standardised quantity $\dfrac{\bar{X}_n - E(\bar{X}_n)}{\sqrt{\operatorname{Var}(\bar{X}_n)}}$.
2 Statistical Inference
2.1 Basic definitions and terminology
Probability is concerned partly with the problem of predicting the behaviour of the RV $X$, assuming we know its distribution. Statistical inference is concerned with the inverse problem: given data $x_1, x_2, \ldots, x_n$ with unknown CDF $F(x)$, what can we conclude about $F(x)$?
In this course, we are concerned with parametric inference. That is, we assume $F$ belongs to a given family of distributions, indexed by the parameter $\theta$:
\[ \mathcal{I} = \{ F(x; \theta) : \theta \in \Theta \}, \]
where $\Theta$ is the parameter space.
Examples
(1) $\mathcal{I}$ is the family of normal distributions:
\[ \theta = (\mu, \sigma^2), \qquad \Theta = \{ (\mu, \sigma^2) : \mu \in \mathbb{R}, \sigma^2 \in \mathbb{R}^+ \}, \]
so that
\[ \mathcal{I} = \left\{ F(x) : F(x) = \Phi\left( \frac{x - \mu}{\sigma} \right) \right\}. \]
(2) $\mathcal{I}$ is the family of Bernoulli distributions with success probability $\theta$:
\[ \Theta = \{ \theta : \theta \in [0, 1] \subset \mathbb{R} \}. \]
In this framework, the problem is then to use the data $x_1, \ldots, x_n$ to draw conclusions about $\theta$.
Definition. 2.1.1
A collection of i.i.d. RVs, $X_1, \ldots, X_n$, with common CDF $F(x; \theta)$, is said to be a random sample (from $F(x; \theta)$).
Definition. 2.1.2
Any function $T = T(x_1, x_2, \ldots, x_n)$ that can be calculated from the data (without knowledge of $\theta$) is called a statistic.
Example
The sample mean $\bar{x} = \frac{1}{n} \sum_{i=1}^n x_i$ is a statistic.
Definition. 2.1.3
A statistic $T$ with the property $T(x) \in \Theta$ for all $x$ is called an estimator for $\theta$.
Example
For $x_1, x_2, \ldots, x_n$ i.i.d. $N(\mu, \sigma^2)$, we have $\theta = (\mu, \sigma^2)$. The quantity $(\bar{x}, s^2)$ is an estimator for $\theta$, where
\[ \bar{x} = \frac{1}{n} \sum_{i=1}^n x_i, \qquad s^2 = \frac{1}{n-1} \sum_{i=1}^n (x_i - \bar{x})^2. \]
There are two important concepts here: the first is that estimators are random variables; the second is that you need to be able to distinguish between random variables and their realisations. In particular, an estimate is a realisation of a random variable.
For example, strictly speaking, $x_1, x_2, \ldots, x_n$ are realisations of random variables $X_1, X_2, \ldots, X_n$, and $\bar{x} = \frac{1}{n} \sum_{i=1}^n x_i$ is a realisation of $\bar{X}$; $\bar{X}$ is an estimator, and $\bar{x}$ is an estimate.
We will find from now on, however, that it is often more convenient, if less rigorous, to use the same symbol for estimator and estimate. This arises especially in the use of $\hat{\theta}$ as both estimator and estimate, as we shall see.
An unsatisfactory aspect of Definition 2.1.3 is that it gives no guidance on how to recognise (or construct) good estimators. Unless stated otherwise, we will assume that $\theta$ is a scalar parameter in what follows. In broad terms, we would like an estimator to be "as close as possible to" $\theta$ "with high probability".
Definition. 2.1.4
The mean squared error of the estimator $T$ of $\theta$ is defined by
\[ \operatorname{MSE}_T(\theta) = E\{ (T - \theta)^2 \}. \]
Example
Suppose $X_1, \ldots, X_n$ are i.i.d. Bernoulli($\theta$) RVs, and $T = \bar{X} =$ 'proportion of successes'. Since $nT \sim B(n, \theta)$ we have
\[ E(nT) = n\theta, \qquad \operatorname{Var}(nT) = n\theta(1-\theta) \]
\[ \implies E(T) = \theta, \qquad \operatorname{Var}(T) = \frac{\theta(1-\theta)}{n} \]
\[ \implies \operatorname{MSE}_T(\theta) = \operatorname{Var}(T) = \frac{\theta(1-\theta)}{n}. \]
Remark
This example shows that $\operatorname{MSE}_T(\theta)$ must be thought of as a function of $\theta$ rather than just a number; see Figure 18.
[Figure 18: $\operatorname{MSE}_T(\theta)$ as a function of $\theta$]
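The MSE curve from this example is a simple function of $\theta$; a minimal sketch (the sample size $n = 25$ is an arbitrary choice):

```python
import numpy as np

def mse_xbar_bernoulli(theta, n):
    """MSE of T = Xbar for iid Bernoulli(theta): T is unbiased, so
    MSE_T(theta) = Var(T) = theta(1 - theta)/n."""
    return theta * (1 - theta) / n

theta = np.linspace(0, 1, 101)
mse = mse_xbar_bernoulli(theta, n=25)

print(np.isclose(mse.max(), 0.25 / 25))   # worst case at theta = 1/2
print(mse[0] == 0.0 and mse[-1] == 0.0)   # zero at theta = 0 and theta = 1
```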
Intuitively, a good estimator is one for which the MSE is as small as possible. However, quantifying this idea is complicated, because the MSE is a function of $\theta$, not just a number. See Figure 19.
For this reason, it turns out that it is not possible to construct a minimum MSE estimator in general. To see why, suppose $T^*$ is a minimum MSE estimator for $\theta$, and consider the estimator $T_a = a$, where $a \in \mathbb{R}$ is arbitrary. Observe that $\operatorname{MSE}_{T_a}(a) = 0$; hence if $T^*$ exists, then we must have $\operatorname{MSE}_{T^*}(a) = 0$. As $a$ is arbitrary, we must have $\operatorname{MSE}_{T^*}(\theta) = 0$ for all $\theta \in \Theta$
\[ \implies T^* = \theta \text{ with probability 1.} \]
Therefore we conclude that (excluding trivial situations) no minimum MSE estimator can exist.
Definition. 2.1.5
The bias of the estimator $T$ is defined by
\[ b_T(\theta) = E(T) - \theta. \]
[Figure 19: $\operatorname{MSE}_T(\theta)$ as a function of $\theta$]
An estimator $T$ with $b_T(\theta) = 0$, i.e., $E(T) = \theta$, is said to be unbiased.
Remarks
(1) Although unbiasedness is an appealing property, not all commonly used estimators are unbiased, and in some situations it is impossible to construct unbiased estimators for the parameter of interest.
Example (Unbiasedness isn't everything)
We know $E(s^2) = \sigma^2$. If $E(s) = \sigma$, then
\[ \operatorname{Var}(s) = E(s^2) - \{E(s)\}^2 = 0, \]
which is not the case
\[ \implies E(s) < \sigma. \]
(2) Intuitively, unbiasedness would seem to be a pre-requisite for a good estimator. This is to some extent formalised as follows:
Theorem. 2.1.1
\[ \operatorname{MSE}_T(\theta) = \operatorname{Var}(T) + b_T(\theta)^2. \]
Remark
Restricting attention to unbiased estimators excludes estimators of the form $T_a = a$. We will see that this permits the construction of Minimum Variance Unbiased Estimators (MVUEs) in some cases.
2.2 Minimum Variance Unbiased Estimation
Consider data $x_1, x_2, \ldots, x_n$ assumed to be observations of RVs $X_1, X_2, \ldots, X_n$ with joint PDF $f_X(x; \theta)$ (or probability function $P_X(x; \theta)$).
Definition. 2.2.1
The likelihood function is defined by
\[ L(\theta; x) = \begin{cases} f_X(x; \theta) & X \text{ continuous} \\ P_X(x; \theta) & X \text{ discrete.} \end{cases} \]
The log-likelihood function is
\[ l(\theta; x) = \log L(\theta; x) \]
(log is the natural log, i.e., ln).
Remark
If x₁, x₂, …, xₙ are independent, the log-likelihood function can be written as
l(θ; x) = Σᵢ₌₁ⁿ log fᵢ(xᵢ; θ).
If x₁, x₂, …, xₙ are i.i.d., we have
l(θ; x) = Σᵢ₌₁ⁿ log f(xᵢ; θ).
Definition 2.2.2.
Consider a statistical problem with log-likelihood l(θ; x). The score is defined by
U(θ; x) = ∂l/∂θ
and the Fisher information is
I(θ) = E(−∂²l/∂θ²).
Remark
For a single observation with PDF f(x; θ), the information is
i(θ) = E(−∂²/∂θ² log f(X; θ)).
In the case of x₁, x₂, …, xₙ i.i.d., we have I(θ) = n i(θ).

Theorem 2.2.1.
Under suitable regularity conditions,
E{U(θ; X)} = 0 and Var{U(θ; X)} = I(θ).
Proof.
Observe that (all integrals run over (−∞, ∞))
E{U(θ; X)} = ∫···∫ U(θ; x) f(x; θ) dx₁···dxₙ
= ∫···∫ (∂/∂θ log f(x; θ)) f(x; θ) dx₁···dxₙ
= ∫···∫ ( (∂f(x; θ)/∂θ) / f(x; θ) ) f(x; θ) dx₁···dxₙ
= ∫···∫ ∂f(x; θ)/∂θ dx₁···dxₙ
= ∂/∂θ ∫···∫ f(x; θ) dx₁···dxₙ    (by regularity)
= ∂/∂θ (1)
= 0, as required.
To calculate Var{U(θ; X)}, observe
∂²l(θ; x)/∂θ² = ∂/∂θ { (∂f(x; θ)/∂θ) / f(x; θ) }
= { (∂²f(x; θ)/∂θ²) f(x; θ) − (∂f(x; θ)/∂θ)² } / f(x; θ)²
= (∂²f(x; θ)/∂θ²) / f(x; θ) − U²(θ; x)
=⇒ U²(θ; x) = (∂²f(x; θ)/∂θ²) / f(x; θ) − ∂²l/∂θ²
=⇒ Var{U(θ; X)} = E{U²(θ; X)}    (as E(U) = 0)
= E( (∂²f(X; θ)/∂θ²) / f(X; θ) ) + I(θ)    (by definition of I(θ)).
Finally observe that
E( (∂²f(X; θ)/∂θ²) / f(X; θ) ) = ∫···∫ ( (∂²f(x; θ)/∂θ²) / f(x; θ) ) f(x; θ) dx₁···dxₙ
= ∫···∫ ∂²f(x; θ)/∂θ² dx₁···dxₙ
= ∂²/∂θ² ∫···∫ f(x; θ) dx₁···dxₙ
= ∂²/∂θ² (1)
= 0.
Hence, we have proved Var{U(θ; X)} = I(θ).
Theorem 2.2.2 (Cramér-Rao Lower Bound).
If T is an unbiased estimator for θ, then Var(T) ≥ 1/I(θ).

Proof.
Observe (using E(U) = 0)
Cov{T(X), U(θ; X)} = E{T(X)U(θ; X)}
= ∫···∫ T(x) U(θ; x) f(x; θ) dx₁···dxₙ
= ∫···∫ T(x) ( (∂f(x; θ)/∂θ) / f(x; θ) ) f(x; θ) dx₁···dxₙ
= ∂/∂θ ∫···∫ T(x) f(x; θ) dx₁···dxₙ
= ∂/∂θ E{T(X)}
= ∂θ/∂θ = 1.
To summarize, Cov(T, U) = 1.
Recall that Cov²(T, U) ≤ Var(T) Var(U) [i.e. |ρ| ≤ 1]
=⇒ Var(T) ≥ Cov²(T, U)/Var(U) = 1/I(θ), as required.
Example
Suppose X₁, X₂, …, Xₙ are i.i.d. Po(λ) RVs, and let X̄ = (1/n) Σᵢ₌₁ⁿ Xᵢ. We will prove that X̄ is a MVUE for λ.
Proof.
(1) Recall if X ∼ Po(λ), then P(x) = e^{−λ}λˣ/x!, E(X) = λ and Var(X) = λ. Hence E(X̄) = λ and Var(X̄) = λ/n, so X̄ is unbiased for λ.
(2) To show that X̄ is MVUE, we will show that Var(X̄) = 1/I(λ).
Step 1: The log-likelihood is
l(λ; x) = log P(x; λ) = log Πᵢ₌₁ⁿ e^{−λ}λ^{xᵢ}/xᵢ!
= log( e^{−nλ} λ^{Σᵢxᵢ} Πᵢ₌₁ⁿ 1/xᵢ! )
= −nλ + Σᵢ₌₁ⁿ xᵢ log λ − log Πᵢ₌₁ⁿ xᵢ!
= −nλ + nx̄ log λ − log Πᵢ₌₁ⁿ xᵢ!.
Step 2: Find ∂²l/∂λ²:
∂l/∂λ = −n + nx̄/λ =⇒ ∂²l/∂λ² = −nx̄/λ².
Step 3:
I(λ) = −E(∂²l/∂λ²) = E(nX̄/λ²) = nE(X̄)/λ² = nλ/λ² = n/λ.
(3) Finally, observe that Var(X̄) = λ/n = 1/I(λ). By Theorem 2.2.2, any unbiased estimator T for λ must have
Var(T) ≥ 1/I(λ) = λ/n
=⇒ X̄ is a MVUE.
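The steps above can be checked by simulation; a minimal sketch (the values λ = 3, n = 50 are arbitrary): the observed information −∂²l/∂λ² = nx̄/λ² averages to n/λ, and Var(X̄) matches the bound 1/I(λ) = λ/n.

```python
import numpy as np

# Simulation sketch: for i.i.d. Po(lambda) samples, E(-d^2 l/d lambda^2)
# = n/lambda, and Var(Xbar) attains the CRLB lambda/n.
rng = np.random.default_rng(1)
lam, n, reps = 3.0, 50, 100_000
x = rng.poisson(lam, size=(reps, n))
xbar = x.mean(axis=1)

obs_info = n * xbar / lam**2                   # -d^2 l / d lambda^2 per sample
assert abs(obs_info.mean() - n / lam) < 0.05   # Fisher information n/lambda
assert abs(np.var(xbar) - lam / n) < 0.005     # Var(Xbar) = 1/I(lambda)
```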
Theorem 2.2.3.
An unbiased estimator T(x) can achieve the Cramér-Rao Lower Bound only if the joint PDF/probability function has the form
f(x; θ) = exp{A(θ)T(x) + B(θ) + h(x)},
where A, B are functions such that θ = −B′(θ)/A′(θ), and h is some function of x.
Proof. Recall from the proof of Theorem 2.2.2 that the bound arises from the inequality
Cor²(T(X), U(θ; X)) ≤ 1,
where U(θ; x) = ∂l/∂θ. Moreover, it is easy to see that the Cramér-Rao Lower Bound (CRLB) can be achieved only when
Cor²{T(X), U(θ; X)} = 1.
Hence the CRLB is achieved only if U is a linear function of T with probability 1 (correlation equals ±1):
U = aT + b, where a, b are constants that can depend on θ but not on x,
i.e.
∂l/∂θ = U(θ; x) = a(θ)T(x) + b(θ) for all x.
Integrating wrt θ, we obtain
log f(x; θ) = l(θ; x) = A(θ)T(x) + B(θ) + h(x),
where A′(θ) = a(θ), B′(θ) = b(θ). Hence,
f(x; θ) = exp{A(θ)T(x) + B(θ) + h(x)}.
Finally, to ensure that E(T) = θ, recall that E{U(θ; X)} = 0 and observe that in this case
E{U(θ; X)} = E[A′(θ)T(X) + B′(θ)] = A′(θ)E[T(X)] + B′(θ) = 0
=⇒ E[T(X)] = −B′(θ)/A′(θ).
Hence, in order that T(X) be unbiased for θ, we must have −B′(θ)/A′(θ) = θ.
Definition 2.2.3.
A probability density function/probability function is said to be a single-parameter exponential family if it has the form
f(x; θ) = exp{A(θ)t(x) + B(θ) + h(x)}
for all x ∈ D ⊆ R, where D does not depend on θ.
If x₁, …, xₙ is a random sample from an exponential family, the joint PDF/probability function becomes
f(x; θ) = Πᵢ₌₁ⁿ exp{A(θ)t(xᵢ) + B(θ) + h(xᵢ)}
= exp{ A(θ) Σᵢ₌₁ⁿ t(xᵢ) + nB(θ) + Σᵢ₌₁ⁿ h(xᵢ) }.
In this case, a minimum variance unbiased estimator that achieves the CRLB can be seen to be
T = (1/n) Σᵢ₌₁ⁿ t(xᵢ),
which is the MVUE for E(T) = −B′(θ)/A′(θ).
Example (Poisson example revisited)
Observe that the Poisson probability function is
p(x; λ) = e^{−λ}λˣ/x!
= exp{(log λ)x − λ − log x!}
= exp{A(λ)t(x) + B(λ) + h(x)},
where
A(λ) = log λ, B(λ) = −λ, t(x) = x, h(x) = −log x!.
Hence X̄ = (1/n) Σᵢ₌₁ⁿ t(xᵢ) is the MVUE for
−B′(λ)/A′(λ) = −(−1)/(1/λ) = λ.
Example (2)
Consider the Exp(λ) distribution. The PDF is
f(x) = λe^{−λx}, x > 0
= exp{−λx + log λ}
=⇒ f is an exponential family with
A(λ) = −λ, B(λ) = log λ, t(x) = x, h(x) = 0.
We can also check that E(X) = −B′(λ)/A′(λ). In particular, we have seen previously that E(X) = 1/λ for X ∼ Exp(λ). Now observe that A′(λ) = −1 and B′(λ) = 1/λ
=⇒ −B′(λ)/A′(λ) = 1/λ, as required.
It also follows that if x₁, x₂, …, xₙ are i.i.d. Exp(λ) observations, then X̄ = (1/n) Σᵢ₌₁ⁿ t(xᵢ) is the MVUE for 1/λ = E(X).

Definition 2.2.4.
Consider data with PDF/prob. function f(x; θ). A statistic S(x) is called a sufficient statistic for θ if f(x|s; θ) does not depend on θ for all s.
Remarks
(1) We will see that sufficient statistics capture all of the information in the data x that is relevant to θ.
(2) If we consider vector-valued statistics, then this definition admits trivial examples, such as s = x, since
P(X = x | S = s) = 1 if x = s, and 0 otherwise,
which does not depend on θ.
Example
Suppose x₁, x₂, …, xₙ are i.i.d. Bernoulli-θ and let s = Σᵢ₌₁ⁿ xᵢ. Then S is sufficient for θ.

Proof.
P(x) = Πᵢ₌₁ⁿ θ^{xᵢ}(1 − θ)^{1−xᵢ}
= θ^{Σxᵢ}(1 − θ)^{Σ(1−xᵢ)}
= θˢ(1 − θ)^{n−s}.
Next observe that S ∼ B(n, θ)
=⇒ P(s) = C(n, s) θˢ(1 − θ)^{n−s}
=⇒ P(x|s) = P({X = x} ∩ {S = s}) / P(S = s)
= P(X = x) / P(S = s)
= 1/C(n, s) if Σᵢ₌₁ⁿ xᵢ = s, and 0 otherwise.
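Sufficiency can also be seen in simulation: conditional on S = s, the distribution of the data is the same whatever θ generated it. A sketch (n = 10 and s = 4 are arbitrary choices): by exchangeability, P(X₁ = 1 | S = s) = s/n for every θ.

```python
import numpy as np

# Sketch: conditional on S = 4, P(X1 = 1 | S) = 4/10 no matter which
# theta generated the Bernoulli data -- the conditional law is free of theta.
rng = np.random.default_rng(2)
n, s, reps = 10, 4, 400_000

for theta in (0.3, 0.7):
    x = rng.random((reps, n)) < theta      # Bernoulli(theta) samples
    cond = x[x.sum(axis=1) == s]           # keep only samples with S = 4
    p1 = cond[:, 0].mean()                 # estimate of P(X1 = 1 | S = 4)
    assert abs(p1 - s / n) < 0.02
```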
Theorem 2.2.4 (The Factorization Theorem).
Suppose x₁, …, xₙ have joint PDF/prob function f(x; θ). Then S is a sufficient statistic for θ if and only if
f(x; θ) = g(s; θ)h(x)
for some functions g, h.
Proof.
Omitted.
Examples
(1) x₁, x₂, …, xₙ i.i.d. N(µ, σ²), σ² known. Then
f(x; µ) = Πᵢ₌₁ⁿ (1/√(2π)σ) e^{−(xᵢ−µ)²/(2σ²)}
= (2πσ²)^{−n/2} exp{ −(1/(2σ²)) Σᵢ₌₁ⁿ (xᵢ − µ)² }
= (2πσ²)^{−n/2} exp{ −(1/(2σ²)) ( Σᵢ₌₁ⁿ xᵢ² − 2µ Σᵢ₌₁ⁿ xᵢ + nµ² ) }
= (2πσ²)^{−n/2} exp{ (1/(2σ²)) ( 2µ Σᵢ₌₁ⁿ xᵢ − nµ² ) } exp{ −(1/(2σ²)) Σᵢ₌₁ⁿ xᵢ² }
= exp{ (1/(2σ²)) ( 2µ Σᵢ₌₁ⁿ xᵢ − nµ² ) } exp{ −(1/(2σ²)) Σᵢ₌₁ⁿ xᵢ² − (n/2) log(2πσ²) }
=⇒ S = Σᵢ₌₁ⁿ xᵢ is sufficient for µ.

(2) If x₁, x₂, …, xₙ are i.i.d. with
f(x) = exp{A(θ)t(x) + B(θ) + h(x)},
then
f(x; θ) = exp{ A(θ) Σᵢ₌₁ⁿ t(xᵢ) + nB(θ) } exp{ Σᵢ₌₁ⁿ h(xᵢ) }
=⇒ S = Σᵢ₌₁ⁿ t(xᵢ) is sufficient for θ in the exponential family by the Factorization Theorem.
Theorem 2.2.5 (Rao-Blackwell Theorem).
If T is an unbiased estimator for θ and S is a sufficient statistic for θ, then the quantity T* = E(T|S) is also an unbiased estimator for θ with Var(T*) ≤ Var(T). Moreover, Var(T*) = Var(T) iff T* = T with probability 1.

Proof.
(I) Unbiasedness:
θ = E(T) = E_S{E(T|S)} = E(T*)
=⇒ E(T*) = θ.
(II) Variance inequality:
Var(T) = E{Var(T|S)} + Var{E(T|S)}
= E{Var(T|S)} + Var(T*)
≥ Var(T*),
since E{Var(T|S)} ≥ 0.
Observe also that Var(T) = Var(T*)
=⇒ E{Var(T|S)} = 0
=⇒ Var(T|S) = 0 with prob. 1
=⇒ T = E(T|S) with prob. 1.
(III) T* is an estimator:
Since S is sufficient for θ,
T* = ∫ T(x) f(x|s) dx,
which does not depend on θ.
Remarks
(1) The Rao-Blackwell Theorem can occasionally be used to construct estimators.
(2) The theoretical importance of this result is the observation that T* always depends on x only through S. If T is already a MVUE then T* = T, so a MVUE depends on x only through S.
Example<br />
Suppose x 1 , . . . , x n ∼i.i.d. N(µ, σ 2 ), σ 2 known. We want to estimate µ. Take T = x 1 ,<br />
as an unbiased estimator for µ.<br />
n∑<br />
S = x i is a sufficient statistic for µ.<br />
i=1<br />
93
2. STATISTICAL INFERENCE<br />
According to Rao-Blackwell, T ∗ = E(T |S) will be an improved (unbiased) estimator<br />
for µ.<br />
( T<br />
If x = (x 1 , . . . , x n ) T for X ∼ N n (µ1, σ 2 I), then = Ax, where A is a matrix<br />
S)<br />
⎛<br />
A = ⎝<br />
1 0 . . . 0<br />
1 1 . . . 1<br />
⎞<br />
⎠<br />
Hence,<br />
⎛⎛<br />
( T<br />
∼ N 2 (A(µ1), σ<br />
S)<br />
2 AA T ) = N 2<br />
⎝⎝<br />
µ<br />
nµ<br />
⎞ ⎛<br />
⎠ , ⎝<br />
⎞⎞<br />
σ 2 σ 2<br />
⎠⎠<br />
nσ 2<br />
σ 2<br />
=⇒ T |S = s ∼<br />
)<br />
N<br />
(µ + σ2<br />
nσ (s − nµ), 2 σ2 − σ4<br />
nσ 2<br />
=<br />
( ( 1<br />
N<br />
n s, 1 − 1 ) )<br />
σ 2<br />
n<br />
∴ E(T |S) = 1 n<br />
is a better (or equal) estimator for µ.<br />
∑<br />
xi = ¯x = T ∗<br />
Finally, observe Var( ¯X) = σ2<br />
n ≤ σ2 = Var(X 1 ) with < for n > 1, σ 2 > 0.<br />
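The improvement from T = X₁ to T* = X̄ is visible directly in simulation; a minimal sketch (µ = 1, σ = 2, n = 10 are arbitrary values):

```python
import numpy as np

# Sketch of the Rao-Blackwell improvement: start from T = X1 (unbiased
# for mu) and condition on the sufficient statistic S = sum(X); the
# result T* = Xbar has variance sigma^2/n instead of sigma^2.
rng = np.random.default_rng(3)
mu, sigma, n, reps = 1.0, 2.0, 10, 200_000
x = rng.normal(mu, sigma, size=(reps, n))

T = x[:, 0]              # crude unbiased estimator
T_star = x.mean(axis=1)  # E(T | S) = Xbar

assert abs(T.mean() - mu) < 0.02 and abs(T_star.mean() - mu) < 0.02
assert np.var(T_star) < np.var(T)               # strict improvement
assert abs(np.var(T_star) - sigma**2 / n) < 0.01
```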
Remarks
(1) We have given only an incomplete outline of the theory.
(2) There are also the concepts of minimal sufficiency and complete statistics to be considered.
2.3 Methods of Estimation
2.3.1 Method of Moments
Consider a random sample X₁, X₂, …, Xₙ from F(θ) and let µ = µ(θ) = E(X). The method of moments estimator θ̃ is defined as the solution to the equation
X̄ = µ(θ̃).
Example
X₁, …, Xₙ ∼ i.i.d. Exp(λ) =⇒ E(Xᵢ) = 1/λ; the method of moments estimator is defined as the solution to the equation
X̄ = 1/λ̃ =⇒ λ̃ = 1/X̄.
Remark
The method of moments is appealing:
(1) it is simple;
(2) its rationale is that X̄ is BLUE for µ.
If θ = (θ₁, θ₂, …, θ_p), the method of moments can be adapted as follows:
(1) let µₖ = µₖ(θ) = E(Xᵏ), k = 1, …, p;
(2) let mₖ = (1/n) Σᵢ₌₁ⁿ xᵢᵏ.
The MoM estimator θ̃ is defined to be the solution to the system of equations
m₁ = µ₁(θ̃), m₂ = µ₂(θ̃), …, m_p = µ_p(θ̃).
Example
Suppose X₁, X₂, …, Xₙ are i.i.d. N(µ, σ²) and let θ = (µ, σ²). Here p = 2, so we have two equations in two unknowns:
µ₁(θ) = E(X) = µ, µ₂(θ) = E(X²) = σ² + µ²,
m₁ = (1/n) Σᵢ₌₁ⁿ xᵢ, m₂ = (1/n) Σᵢ₌₁ⁿ xᵢ².
∴ we need to solve for µ, σ²:
µ̃ = x̄,
σ̃² + µ̃² = (1/n) Σ xᵢ² =⇒ σ̃² = (1/n) Σ xᵢ² − x̄²
= (1/n) Σ (xᵢ − x̄)²
= ((n − 1)/n) s².
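The two-equation system above can be solved mechanically from sample moments; a sketch (the values µ = 5, σ = 3 are arbitrary):

```python
import numpy as np

# Method-of-moments estimates for N(mu, sigma^2): solve m1 = mu and
# m2 = sigma^2 + mu^2, giving mu~ = xbar and sigma^2~ = m2 - xbar^2.
rng = np.random.default_rng(4)
x = rng.normal(5.0, 3.0, size=200_000)

mu_mom = x.mean()
sigma2_mom = np.mean(x**2) - mu_mom**2   # the n-divisor sample variance

assert abs(mu_mom - 5.0) < 0.05
assert abs(sigma2_mom - 9.0) < 0.2
assert np.isclose(sigma2_mom, np.var(x))  # same as ((n-1)/n) s^2
```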
Remark
The method of moments estimators can be seen to have good statistical properties. Under fairly mild regularity conditions, the MoM estimator is
(1) consistent, i.e., θ̃ → θ in probability as n → ∞;
(2) asymptotically normal, i.e.,
(θ̃ − θ)/√Var(θ̃) →_D N(0, 1) as n → ∞.
However, the MoM estimator has one serious defect. Suppose X₁, …, Xₙ is a random sample from F_θ and let θ̃_X be the MoM estimator. Let Y = h(X) for some invertible function h; then Y₁, …, Yₙ should contain the same information as X₁, …, Xₙ, so we would like θ̃_Y ≡ θ̃_X. Unfortunately this does not hold.
Example
X₁, …, Xₙ i.i.d. Exp(λ). We saw previously that λ̃_X = 1/X̄. Suppose Yᵢ = Xᵢ² (an invertible transformation for Xᵢ > 0). To obtain λ̃_Y, observe E(Y) = E(X²) = 2/λ²
=⇒ λ̃_Y = √( 2n / Σ Xᵢ² ) ≠ 1/X̄.
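This non-invariance shows up immediately in a simulation sketch (λ = 2 and n = 1000 are arbitrary): matching E(X) and matching E(X²) give numerically different estimates from the same data.

```python
import numpy as np

# The MoM estimator is not invariant under transforming the data:
# for Exp(lambda), matching E(X) = 1/lambda gives lambda~_X = 1/xbar,
# while matching E(X^2) = 2/lambda^2 gives lambda~_Y = sqrt(2n/sum(x^2)).
rng = np.random.default_rng(5)
lam = 2.0
x = rng.exponential(1 / lam, size=1_000)

lam_x = 1 / x.mean()
lam_y = np.sqrt(2 * len(x) / np.sum(x**2))

assert not np.isclose(lam_x, lam_y)                        # estimates differ
assert abs(lam_x - lam) < 0.3 and abs(lam_y - lam) < 0.3   # both near 2
```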
2.3.2 Maximum Likelihood Estimation
Consider a statistical problem with log-likelihood function l(θ; x).
Definition. The maximum likelihood estimate θ̂ is the solution to the problem max_{θ∈Θ} l(θ; x), i.e. θ̂ = arg max_{θ∈Θ} l(θ; x).
Remark
In practice, maximum likelihood estimates are obtained by solving the score equation
∂l/∂θ = U(θ; x) = 0.

Example
If X₁, X₂, …, Xₙ are i.i.d. geometric-θ RVs, find θ̂.
Solution:
l(θ; x) = log Πᵢ₌₁ⁿ θ(1 − θ)^{xᵢ−1}
= n log θ + Σᵢ₌₁ⁿ (xᵢ − 1) log(1 − θ)
= n{ log θ + (x̄ − 1) log(1 − θ) }
∴ U(θ; x) = ∂l/∂θ = n{ 1/θ − (x̄ − 1)/(1 − θ) }
= n (1 − θx̄) / (θ(1 − θ)).
To find θ̂, we solve for θ in 1 − θx̄ = 0
=⇒ θ̂ = 1/x̄ [ = n / Σᵢ₌₁ⁿ xᵢ = (# of successes)/(# of trials) ].

Elementary Properties
(1) MLEs are invariant under invertible transformations of the data.
Proof. (Continuous i.i.d. RVs, strictly monotonic transformation)
Suppose X has PDF f(x; θ) and Y = h(X), where h is strictly monotonic;
=⇒ f_Y(y; θ) = f_X(h⁻¹(y); θ)|h⁻¹(y)′| when y = h(x).
Consider data x₁, x₂, …, xₙ and the transformed version y₁, y₂, …, yₙ:
l_Y(θ; y) = log Πᵢ₌₁ⁿ f_Y(yᵢ; θ)
= log Πᵢ₌₁ⁿ f_X(h⁻¹(yᵢ); θ)|h⁻¹(yᵢ)′|
= log Πᵢ₌₁ⁿ f_X(xᵢ; θ) + log Πᵢ₌₁ⁿ |h⁻¹(yᵢ)′|
= l_X(θ; x) + log( Πᵢ₌₁ⁿ |h⁻¹(yᵢ)′| ).
Since log( Πᵢ₌₁ⁿ |h⁻¹(yᵢ)′| ) does not depend on θ, it follows that θ̂ maximizes l_X(θ; x) iff it maximizes l_Y(θ; y).

(2) If φ = φ(θ) is a 1-1 transformation of θ, then the MLEs obey the transformation rule φ̂ = φ(θ̂).
Proof. It can be checked that if l_φ(φ; x) is the log-likelihood with respect to φ, then
l_θ(θ; x) = l_φ(φ(θ); x).
It follows that θ̂ maximizes l_θ(θ; x) iff φ̂ = φ(θ̂) maximizes l_φ(φ; x).
(3) If t(x) is a sufficient statistic for θ, then θ̂ depends on the data only as a function of t(x).
Proof. By the Factorization Theorem, t(x) is sufficient for θ iff
f(x; θ) = g(t(x); θ)h(x)
=⇒ l(θ; x) = log g(t(x); θ) + log h(x)
=⇒ θ̂ maximizes l(θ; x) ⇔ θ̂ maximizes log g(t(x); θ)
=⇒ θ̂ is a function of t(x).

Example
Suppose X₁, X₂, …, Xₙ are i.i.d. Exp(λ); then
f_X(x; λ) = Πᵢ₌₁ⁿ λe^{−λxᵢ} = λⁿ e^{−λΣᵢxᵢ} = λⁿ e^{−λnx̄}.
By the Factorization Theorem, x̄ is sufficient for λ. To get the MLE,
U(λ; x) = ∂l/∂λ = ∂/∂λ (n log λ − nλx̄) = n/λ − nx̄;
∂l/∂λ = 0 =⇒ 1/λ = x̄ =⇒ λ̂ = 1/x̄.
Note: as proved, λ̂ is a function of the sufficient statistic x̄.
Let Y₁, Y₂, …, Yₙ be defined by Yᵢ = log Xᵢ.
If X ∼ Exp(λ) and Y = log X, then we can find f_Y(y) by taking h(x) = log x and using
f_Y(y) = f_X(h⁻¹(y))|h⁻¹(y)′|    [h⁻¹(y) = e^y, h⁻¹(y)′ = e^y]
= λe^{−λe^y} e^y
=⇒ l_Y(λ; y) = log Πᵢ₌₁ⁿ λe^{−λe^{yᵢ}} e^{yᵢ}
= log( λⁿ e^{−λΣᵢe^{yᵢ}} e^{Σᵢyᵢ} )
= n log λ − λ Σᵢ₌₁ⁿ e^{yᵢ} + Σᵢ₌₁ⁿ yᵢ
=⇒ ∂/∂λ l_Y(λ; y) = n/λ − Σᵢ₌₁ⁿ e^{yᵢ}
=⇒ λ̂ = n / Σᵢ₌₁ⁿ e^{yᵢ} = n / Σᵢ₌₁ⁿ e^{log xᵢ} = n / Σᵢ₌₁ⁿ xᵢ = 1/x̄.
Finally, suppose we take θ = log λ =⇒ λ = e^θ
=⇒ f(x; θ) = e^θ e^{−e^θ x}    (= λe^{−λx})
=⇒ l_θ(θ; x) = log Πᵢ₌₁ⁿ e^θ e^{−e^θ xᵢ}
= log( e^{nθ} e^{−e^θ nx̄} )
= nθ − e^θ nx̄.
∂l/∂θ = n − nx̄e^θ
∴ ∂l/∂θ = 0 =⇒ 1 = x̄e^θ =⇒ e^θ = 1/x̄ =⇒ θ̂ = −log x̄.
But λ̂ = 1/x̄ =⇒ log λ̂ = log(1/x̄) = −log x̄ = θ̂, as required.
Remark (Not examinable)
Maximum likelihood estimation and the method of moments can both be generated by the use of estimating functions. An estimating function is a function H(x; θ) with the property E{H(x; θ)} = 0. H can be used to define an estimator θ̃ as a solution to the equation
H(x; θ) = 0.
For method of moments estimates, we can take
H(x; θ) = x̄ − E(X)    [x̄ − E(X̄) in the non-i.i.d. case].
For maximum likelihood, we can use
H(x; θ) = U(θ; x) = ∂l/∂θ;
recall we showed previously that E{U(θ; x)} = 0. Thus to calculate the MLE:
(1) find the log-likelihood, then
(2) calculate the score and set it equal to 0.
Asymptotic Properties of MLEs:
Suppose X₁, X₂, X₃, … is a sequence of i.i.d. RVs with common PDF (or probability function) f(x; θ).
Assume:
(1) θ₀ is the true value of the parameter θ;
(2) f is such that f(x; θ₁) = f(x; θ₂) for all x =⇒ θ₁ = θ₂.
We will show (in outline) that if θ̂ₙ is the MLE (maximum likelihood estimator) based on X₁, X₂, …, Xₙ, then:
(1) θ̂ₙ → θ₀ in probability, i.e., for each ε > 0,
lim_{n→∞} P(|θ̂ₙ − θ₀| > ε) = 0    (consistency);
(2) √(n i(θ₀)) (θ̂ₙ − θ₀) →_D N(0, 1) as n → ∞    (asymptotic normality).
Remark
The practical use of asymptotic normality is that for large n, approximately
θ̂ ∼ N( θ₀, 1/(n i(θ₀)) ).
Outline of consistency and asymptotic normality for MLEs:
Consider i.i.d. data X₁, X₂, … with common PDF/prob. function f(x; θ). If θ̂ₙ is the MLE based on X₁, X₂, …, Xₙ, we will show (in outline) that θ̂ₙ → θ₀ in probability as n → ∞, where θ₀ is the true value of θ.
Lemma 2.3.1.
Suppose f is such that f(x; θ₁) = f(x; θ₂) for all x =⇒ θ₁ = θ₂. Then l*(θ) = E{log f(X; θ)} is maximized uniquely by θ = θ₀.
Proof.
l*(θ) − l*(θ₀) = ∫ (log f(x; θ)) f(x; θ₀) dx − ∫ (log f(x; θ₀)) f(x; θ₀) dx
= ∫ log( f(x; θ)/f(x; θ₀) ) f(x; θ₀) dx
≤ ∫ ( f(x; θ)/f(x; θ₀) − 1 ) f(x; θ₀) dx    [since log u ≤ u − 1]
= ∫ ( f(x; θ) − f(x; θ₀) ) dx
= ∫ f(x; θ) dx − ∫ f(x; θ₀) dx
= 1 − 1 = 0.
Moreover, equality is achieved iff
f(x; θ)/f(x; θ₀) = 1 for all x
=⇒ f(x; θ) = f(x; θ₀) for all x
=⇒ θ = θ₀.
Lemma 2.3.2.
Let l̄ₙ(θ; x) = (1/n) l(θ; x₁, …, xₙ), i.e., 1/n × the log-likelihood based on x₁, x₂, …, xₙ. Then for each θ, l̄ₙ(θ; x) → l*(θ) in probability as n → ∞.
Proof. Observe that
l̄ₙ(θ; x) = (1/n) log Πᵢ₌₁ⁿ f(xᵢ; θ)
= (1/n) Σᵢ₌₁ⁿ log f(xᵢ; θ)
= (1/n) Σᵢ₌₁ⁿ Lᵢ(θ),
where Lᵢ(θ) = log f(xᵢ; θ). Since the Lᵢ(θ) are i.i.d. with E(Lᵢ(θ)) = l*(θ), it follows by the weak law of large numbers that l̄ₙ(θ; x) → l*(θ) in probability as n → ∞.
To summarise, we have proved that:
(1) l̄ₙ(θ; x) → l*(θ);
(2) l*(θ) is maximized when θ = θ₀.
Since θ̂ₙ maximizes l̄ₙ(θ; x), it follows that θ̂ₙ → θ₀ in probability.
Theorem 2.3.1.
Under the above assumptions, let Uₙ(θ; x) denote the score based on X₁, …, Xₙ. Then
Uₙ(θ₀; x) / √(n i(θ₀)) →_D N(0, 1).
Proof.
Uₙ(θ₀; x) = ∂/∂θ log Πᵢ₌₁ⁿ f(xᵢ; θ) |_{θ=θ₀}
= Σᵢ₌₁ⁿ ∂/∂θ log f(xᵢ; θ) |_{θ=θ₀}
= Σᵢ₌₁ⁿ Uᵢ, where Uᵢ = ∂/∂θ log f(xᵢ; θ) |_{θ=θ₀}.
Since U₁, U₂, … are i.i.d. with E(Uᵢ) = 0 and Var(Uᵢ) = i(θ₀), the Central Limit Theorem gives
( Σᵢ₌₁ⁿ Uᵢ − nE(U) ) / √(n Var(U)) = Uₙ(θ₀; x) / √(n i(θ₀)) →_D N(0, 1).
Theorem 2.3.2.
Under the above assumptions,
√(n i(θ₀)) (θ̂ₙ − θ₀) →_D N(0, 1).
Proof. Consider the first-order Taylor expansion of Uₙ(θ̂ₙ; x) about θ₀:
Uₙ(θ̂ₙ; x) ≈ Uₙ(θ₀; x) + Uₙ′(θ₀; x)(θ̂ₙ − θ₀).
Since θ̂ₙ solves the score equation, Uₙ(θ̂ₙ; x) = 0, so for large n
Uₙ(θ₀; x) ≈ −Uₙ′(θ₀; x)(θ̂ₙ − θ₀),
and since Uₙ(θ₀; x)/√(n i(θ₀)) →_D N(0, 1),
−Uₙ′(θ₀; x)(θ̂ₙ − θ₀) / √(n i(θ₀)) →_D N(0, 1).
Now observe that
−Uₙ′(θ₀; x)/n → i(θ₀) as n → ∞
by the weak law of large numbers, since
−Uₙ′(θ₀; x)/n = (1/n) Σᵢ₌₁ⁿ ( −∂²/∂θ² log f(xᵢ; θ) |_{θ=θ₀} )
=⇒ −Uₙ′(θ₀; x)/(n i(θ₀)) → 1 in probability.
Hence,
−Uₙ′(θ₀; x)(θ̂ₙ − θ₀) / √(n i(θ₀)) ≈ √(n i(θ₀)) (θ̂ₙ − θ₀) for large n
=⇒ √(n i(θ₀)) (θ̂ₙ − θ₀) →_D N(0, 1).
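Asymptotic normality can be seen in a simulation sketch for the Poisson model, where θ̂ = X̄ and i(λ) = 1/λ (the values λ₀ = 4 and n = 400 are arbitrary):

```python
import numpy as np

# Simulation sketch of Theorem 2.3.2 for Po(lambda): the MLE is
# theta_hat = Xbar and i(lambda) = 1/lambda, so
# Z = sqrt(n * i(lambda0)) * (Xbar - lambda0) is approximately N(0, 1).
rng = np.random.default_rng(8)
lam0, n, reps = 4.0, 400, 50_000
xbar = rng.poisson(lam0, size=(reps, n)).mean(axis=1)
z = np.sqrt(n / lam0) * (xbar - lam0)

assert abs(z.mean()) < 0.03 and abs(z.std() - 1) < 0.02
assert abs(np.mean(np.abs(z) > 1.96) - 0.05) < 0.006   # ~5% exceed 1.96
```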
Remark
The preceding theory can be generalized to vector-valued parameters. We will not discuss the details.
2.4 Hypothesis Tests and Confidence Intervals
Motivating Example:
Suppose X₁, X₂, …, Xₙ are i.i.d. N(µ, σ²), and consider H₀: µ = µ₀ vs. Hₐ: µ ≠ µ₀. If σ² is known, then the test of H₀ with significance level α is defined by the test statistic
Z = (X̄ − µ₀) / (σ/√n),
and the rule: reject H₀ if |z| ≥ z(α/2).
A 100(1 − α)% CI for µ is given by
( X̄ − z(α/2)σ/√n, X̄ + z(α/2)σ/√n ).
It is easy to check that the confidence interval contains all values of µ₀ that are acceptable null hypotheses.
In general, consider a statistical problem with parameter
θ ∈ Θ₀ ∪ Θ_A, where Θ₀ ∩ Θ_A = ∅.
We consider the null hypothesis
H₀: θ ∈ Θ₀
and the alternative hypothesis
H_A: θ ∈ Θ_A.
The hypothesis-testing setup can be represented as:

                        Actual status
Test result         H₀ true              H_A true
Accept H₀           correct              type II error (β)
Reject H₀           type I error (α)     correct

We would like both the type I and type II error rates to be as small as possible. However, these goals conflict with each other. To reduce the type I error rate we need to "make it harder to reject H₀"; to reduce the type II error rate we need to "make it easier to reject H₀".
The standard (Neyman-Pearson) approach to hypothesis testing is to control the type I error rate at a "small" value α and then use a test that makes the type II error as small as possible.
The equivalence between confidence intervals and hypothesis tests can be formulated as follows. Recall that a 100(1 − α)% CI for θ is a random interval (L, U) with the property
P((L, U) ∋ θ) = 1 − α.
It is easy to check that the test defined by the rule
"Accept H₀: θ = θ₀ iff θ₀ ∈ (L, U)"
has significance level α. Here
α = P(reject H₀ | H₀ true),
β = P(retain H₀ | H_A true),
1 − β = power = P(reject H₀ | H_A true), which is what we want to be high.
Conversely, given a hypothesis test of H₀: θ = θ₀ with significance level α, it can be proved that the set {θ₀ : H₀: θ = θ₀ is accepted} is a 100(1 − α)% confidence region for θ.
Large-Sample Tests and Confidence Intervals
Consider a statistical problem with data x₁, …, xₙ, log-likelihood l(θ; x), score U(θ; x) and information i(θ). Consider also a hypothesis H₀: θ = θ₀. The following three tests are often considered:
(1) The Wald Test:
Test statistic: W = √(n i(θ̂)) (θ̂ − θ₀)
Critical region: reject for |W| ≥ z(α/2)
(2) The Score Test:
Test statistic: V = U(θ₀; x) / √(n i(θ₀))
Critical region: reject for |V| ≥ z(α/2)
(3) The Likelihood-Ratio Test:
Test statistic: G² = 2(l(θ̂) − l(θ₀))
Critical region: reject for G² ≥ χ²₁,α

Example:
Suppose X₁, X₂, …, Xₙ are i.i.d. Po(λ), and consider H₀: λ = λ₀.
l(λ; x) = log Πᵢ₌₁ⁿ e^{−λ}λ^{xᵢ}/xᵢ!
= n(x̄ log λ − λ) − log Πᵢ₌₁ⁿ xᵢ!
U(λ; x) = ∂l/∂λ = n(x̄/λ − 1) = n(x̄ − λ)/λ =⇒ λ̂ = x̄
I(λ) = n i(λ) = E(−∂²l/∂λ²) = E(nX̄/λ²) = n/λ
=⇒ W = √(n i(λ̂)) (λ̂ − λ₀) = (λ̂ − λ₀) / √(λ̂/n)
V = U(λ₀; x) / √(n i(λ₀)) = n(x̄ − λ₀) / (λ₀√(n/λ₀)) = (x̄ − λ₀) / √(λ₀/n)
G² = 2(l(λ̂) − l(λ₀))
= 2n(x̄ log λ̂ − λ̂) − 2n(x̄ log λ₀ − λ₀)
= 2n( x̄ log(λ̂/λ₀) − (λ̂ − λ₀) ).
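As a worked illustration of the three statistics, take the hypothetical summary n = 200, x̄ = 3.1 and test H₀: λ₀ = 3 (numbers chosen for illustration, not from the notes):

```python
import numpy as np

# The three large-sample test statistics for H0: lambda = 3 with
# i.i.d. Poisson data, using the hypothetical summary n = 200, xbar = 3.1.
n, lam0, xbar = 200, 3.0, 3.1

W = (xbar - lam0) / np.sqrt(xbar / n)                      # Wald
V = (xbar - lam0) / np.sqrt(lam0 / n)                      # score
G2 = 2 * n * (xbar * np.log(xbar / lam0) - (xbar - lam0))  # LR

# None of the tests rejects at the 5% level here ...
assert max(abs(W), abs(V)) < 1.96 and G2 < 1.96**2
# ... and the three statistics nearly coincide, as the theory predicts.
assert abs(G2 - V**2) < 0.05 and abs(G2 - W**2) < 0.05
```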
Remarks
(1) It can be proved that the tests based on W, V, G² are asymptotically equivalent when H₀ is true.
(2) As a by-product, it follows that the null distribution of G² is χ²₁. (Recall from Theorems 2.3.1 and 2.3.2 that the null distributions of W and V are both N(0, 1).)
(3) To understand the motivation for the three tests, it is useful to consider their relation to the log-likelihood function. See Figure 20.
We have introduced the Wald test, score test and likelihood-ratio test for H₀: θ = θ₀ vs. H_A: θ = θₐ. These are large-sample tests in that asymptotic distributions for the test statistics under H₀ are available. It can also be proved that the LR statistic and the score test statistic are invariant under transformation of the parameter, but the Wald test is not.
Each of these three tests can be inverted to give a confidence interval (region) for θ:
Wald Test
Figure 20: Relationships of the tests to the log-likelihood function
Solve for θ 0 in W 2 ≤ z(α/2) 2 .<br />
Recall, W =<br />
√<br />
ni(ˆθ)(ˆθ − θ 0 )<br />
=⇒ ˆθ − √<br />
z(α/2)<br />
ni(ˆθ)<br />
≤<br />
θ 0 ≤ ˆθ + √<br />
z(α/2)<br />
ni(ˆθ)<br />
i.e., ˆθ ±<br />
z(α/2)<br />
√ .<br />
ni(ˆθ)<br />
Score Test<br />
Need to solve for $\theta_0$ in
$$V^2 = \left(\frac{U(\theta_0; \mathbf{x})}{\sqrt{n\,i(\theta_0)}}\right)^2 \le z(\alpha/2)^2.$$
LR Test<br />
Solve for $\theta_0$ in $2(l(\hat{\theta}; \mathbf{x}) - l(\theta_0; \mathbf{x})) \le \chi^2_{1,\alpha} = z(\alpha/2)^2$.
Example:<br />
$X_1, \ldots, X_n$ i.i.d. $\mathrm{Po}(\lambda)$.
Recall that the Wald statistic is $W = \dfrac{\hat{\lambda} - \lambda_0}{\sqrt{\hat{\lambda}/n}}$, which gives the interval
$$\hat{\lambda} \pm z(\alpha/2)\sqrt{\hat{\lambda}/n} \iff \bar{x} \pm z(\alpha/2)\sqrt{\bar{x}/n}.$$
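For a concrete sample, the Wald interval is a one-line computation; the sample mean, $n$ and the 95% level below are illustrative values, not from the notes:

```python
import math

# Sketch of the Wald interval xbar ± z(alpha/2) * sqrt(xbar/n) for
# Poisson data; xbar, n and the level are illustrative choices.
xbar, n = 4.1, 10
z = 1.959963985              # z(0.025), the upper 2.5% normal quantile
half = z * math.sqrt(xbar / n)
wald_ci = (xbar - half, xbar + half)
print(wald_ci)
```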
Score test:<br />
Recall that the test statistic is $V = \dfrac{\bar{x} - \lambda_0}{\sqrt{\lambda_0/n}}$. Hence to find a confidence interval, we need to solve for $\lambda_0$ in
$$V^2 \le z(\alpha/2)^2 \implies \frac{(\bar{x} - \lambda_0)^2}{\lambda_0/n} \le z(\alpha/2)^2 \implies (\bar{x} - \lambda_0)^2 \le \frac{\lambda_0\, z(\alpha/2)^2}{n}$$
$$\implies \lambda_0^2 - \left(2\bar{x} + \frac{z(\alpha/2)^2}{n}\right)\lambda_0 + \bar{x}^2 \le 0,$$
which is a quadratic in $\lambda_0$; its roots $\dfrac{-b \pm \sqrt{b^2 - 4ac}}{2a}$ are the endpoints of the interval.
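With the same illustrative values as before, the quadratic can be solved directly; the coefficients $a$, $b$, $c$ follow the inequality just derived:

```python
import math

# Sketch: invert the score test by finding the roots of
# lambda0^2 - (2*xbar + z^2/n)*lambda0 + xbar^2 = 0.
# xbar, n and the 95% level are illustrative choices.
xbar, n = 4.1, 10
z = 1.959963985              # z(0.025)
a, b, c = 1.0, -(2 * xbar + z * z / n), xbar * xbar
disc = math.sqrt(b * b - 4 * a * c)
score_ci = ((-b - disc) / (2 * a), (-b + disc) / (2 * a))
print(score_ci)
```

Unlike the Wald interval, this interval is not symmetric about $\bar{x}$.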
For the LR test, we have
$$G^2 = 2n\left\{\bar{x}\log\frac{\bar{x}}{\lambda_0} - (\bar{x} - \lambda_0)\right\},$$
and we can solve numerically for $\lambda_0$ in $G^2 \le z(\alpha/2)^2$. See Figure 21.
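The LR interval has no closed form, but a simple bisection (a sketch, again with illustrative data) finds the two points where $G^2(\lambda_0) = z(\alpha/2)^2$, one on each side of $\hat{\lambda} = \bar{x}$:

```python
import math

# Sketch: solve G2(lambda0) = z(alpha/2)^2 numerically by bisection,
# once on each side of the MLE; xbar and n are illustrative values.
xbar, n = 4.1, 10
zsq = 1.959963985 ** 2       # z(0.025)^2, the chi-squared(1) 95% point

def g2(lam0):
    return 2 * n * (xbar * math.log(xbar / lam0) - (xbar - lam0))

def bisect(lo, hi, steps=100):
    # find the root of g2(lam) - zsq in [lo, hi] by interval halving
    for _ in range(steps):
        mid = (lo + hi) / 2
        if (g2(lo) - zsq) * (g2(mid) - zsq) <= 0:
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2

lr_ci = (bisect(0.01, xbar), bisect(xbar, 20.0))
print(lr_ci)
```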
Optimal Tests<br />
Consider simple null and alternative hypotheses:<br />
$$H_0: \theta = \theta_0 \quad \text{and} \quad H_a: \theta = \theta_a.$$
Recall that the type I error rate is defined by
$$\alpha = P(\text{reject } H_0 \mid H_0 \text{ true}),$$
and the type II error rate by
$$\beta = P(\text{accept } H_0 \mid H_0 \text{ false}).$$
The power is defined by $1 - \beta = P(\text{reject } H_0 \mid H_0 \text{ false})$, so we want tests with high power.
Figure 21: The 3 tests<br />
Theorem 2.4.1 (Neyman–Pearson Lemma)
Consider the test of $H_0: \theta = \theta_0$ vs. $H_a: \theta = \theta_a$ defined by the rule
$$\text{reject } H_0 \text{ when } \frac{f(\mathbf{x}; \theta_0)}{f(\mathbf{x}; \theta_a)} \le k,$$
for some constant $k$.
Let $\alpha^*$ be the type I error rate and $1 - \beta^*$ the power of this test. Then any other test with $\alpha \le \alpha^*$ will have $(1 - \beta) \le (1 - \beta^*)$.
Proof. Let $C$ be the critical region for the LR test and let $D$ be the critical (rejection) region for any other test. Let $C_1 = C \cap D$ and let $C_2$, $D_2$ be such that
$$C = C_1 \cup C_2, \quad C_1 \cap C_2 = \emptyset,$$
$$D = C_1 \cup D_2, \quad C_1 \cap D_2 = \emptyset.$$
Figure 22: Neyman-Pearson Lemma<br />
See Figure 22 and observe that $\alpha \le \alpha^*$
$$\implies \int_D f(\mathbf{x}; \theta_0)\,dx_1 \cdots dx_n \le \int_C f(\mathbf{x}; \theta_0)\,dx_1 \cdots dx_n$$
$$\implies \int_{D_2} f(\mathbf{x}; \theta_0)\,dx_1 \cdots dx_n \le \int_{C_2} f(\mathbf{x}; \theta_0)\,dx_1 \cdots dx_n,$$
since the contribution of the common region $C_1$ cancels. Then, writing $C = C_1 \cup C_2$ and $D = C_1 \cup D_2$,
$$(1 - \beta^*) - (1 - \beta) = \int_C f(\mathbf{x}; \theta_a)\,dx_1 \cdots dx_n - \int_D f(\mathbf{x}; \theta_a)\,dx_1 \cdots dx_n$$
$$= \int_{C_1} f(\mathbf{x}; \theta_a)\,dx_1 \cdots dx_n + \int_{C_2} f(\mathbf{x}; \theta_a)\,dx_1 \cdots dx_n - \int_{C_1} f(\mathbf{x}; \theta_a)\,dx_1 \cdots dx_n - \int_{D_2} f(\mathbf{x}; \theta_a)\,dx_1 \cdots dx_n$$
$$= \int_{C_2} f(\mathbf{x}; \theta_a)\,dx_1 \cdots dx_n - \int_{D_2} f(\mathbf{x}; \theta_a)\,dx_1 \cdots dx_n$$
$$\ge \frac{1}{k}\int_{C_2} f(\mathbf{x}; \theta_0)\,dx_1 \cdots dx_n - \frac{1}{k}\int_{D_2} f(\mathbf{x}; \theta_0)\,dx_1 \cdots dx_n$$
$$\ge 0, \text{ as required,}$$
where the first inequality holds because $f(\mathbf{x}; \theta_a) \ge f(\mathbf{x}; \theta_0)/k$ on $C_2 \subseteq C$ and $f(\mathbf{x}; \theta_a) < f(\mathbf{x}; \theta_0)/k$ on $D_2$, which lies outside $C$.
See Figure 23.<br />
Moreover, equality is achieved only if $D_2$ is empty ($D_2 = \emptyset$).
Example<br />
Suppose $X_1, X_2, \ldots, X_n$ are i.i.d. $N(\mu, \sigma^2)$ with $\sigma^2$ given, and consider
$$H_0: \mu = \mu_0 \text{ vs. } H_a: \mu = \mu_a, \quad \mu_a > \mu_0.$$
Figure 23: Neyman-Pearson Lemma<br />
Then
$$\frac{f(\mathbf{x}; \mu_0)}{f(\mathbf{x}; \mu_a)} = \frac{\prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left\{-\frac{1}{2\sigma^2}(x_i - \mu_0)^2\right\}}{\prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left\{-\frac{1}{2\sigma^2}(x_i - \mu_a)^2\right\}}$$
$$= \frac{\exp\left\{-\frac{1}{2\sigma^2}\left(\sum_{i=1}^{n} x_i^2 - 2n\bar{x}\mu_0 + n\mu_0^2\right)\right\}}{\exp\left\{-\frac{1}{2\sigma^2}\left(\sum_{i=1}^{n} x_i^2 - 2n\bar{x}\mu_a + n\mu_a^2\right)\right\}}$$
$$= \exp\left\{\frac{1}{2\sigma^2}\left(2n\bar{x}(\mu_0 - \mu_a) - n\mu_0^2 + n\mu_a^2\right)\right\}.$$
For a constant $k$,
$$\frac{f(\mathbf{x}; \mu_0)}{f(\mathbf{x}; \mu_a)} \le k \iff (\mu_0 - \mu_a)\bar{x} \le k^* \iff \bar{x} \ge c,$$
for a suitably chosen $c$; the final inequality reverses because $\mu_0 - \mu_a < 0$, so we reject when $\bar{x}$ is too big.
To choose $c$, we use the fact that
$$Z = \frac{\bar{X} - \mu_0}{\sigma/\sqrt{n}} \sim N(0, 1) \text{ under } H_0.$$
Hence the usual $z$-test,
$$\text{reject } H_0 \text{ if } \hat{z} \ge z(\alpha),$$
is the Neyman–Pearson LR test in this case.
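As a sanity check (a sketch, with illustrative $n$, $\mu_0$ and $\sigma$, not values from the notes), simulation confirms that this test's type I error rate is close to the nominal $\alpha$:

```python
import math
import random

# Sketch: the one-sided z-test from the example. We estimate its type I
# error rate by simulating samples from H0; n, mu0, sigma, alpha are
# illustrative choices.
random.seed(0)
mu0, sigma, n, alpha = 0.0, 1.0, 25, 0.05
z_alpha = 1.6448536          # z(0.05), the upper 5% normal quantile

def z_stat(sample):
    xbar = sum(sample) / len(sample)
    return (xbar - mu0) / (sigma / math.sqrt(n))

trials, rejections = 20000, 0
for _ in range(trials):
    sample = [random.gauss(mu0, sigma) for _ in range(n)]
    if z_stat(sample) >= z_alpha:   # reject H0 when z-hat is too large
        rejections += 1
print(rejections / trials)          # should be close to alpha = 0.05
```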
Remarks<br />
(1) This result shows that the one-sided $z$ test is also uniformly most powerful for
$$H_0: \mu = \mu_0 \text{ vs. } H_A: \mu > \mu_0.$$
(2) This can be extended to the case of
$$H_0: \mu \le \mu_0 \text{ vs. } H_A: \mu > \mu_0.$$
In this case we take
$$\alpha = \max_{H_0} P(\text{rejecting} \mid \mu),$$
which here is attained at $\mu = \mu_0$.
(3) This construction fails when we consider two-sided alternatives, i.e.,
$$H_0: \mu = \mu_0 \text{ vs. } H_A: \mu \ne \mu_0$$
$\implies$ no uniformly most powerful test exists in that case.