
PDF of Lecture Notes - School of Mathematical Sciences


MATHEMATICAL STATISTICS III

Lecturer: Associate Professor Patty Solomon

SCHOOL OF MATHEMATICAL SCIENCES


Contents

1 Distribution Theory
   1.1 Discrete distributions
      1.1.1 Bernoulli distribution
      1.1.2 Binomial distribution
      1.1.3 Geometric distribution
      1.1.4 Negative Binomial distribution
      1.1.5 Poisson distribution
      1.1.6 Hypergeometric distribution
   1.2 Continuous Distributions
      1.2.1 Uniform Distribution
      1.2.2 Exponential Distribution
      1.2.3 Gamma distribution
      1.2.4 Beta density function
      1.2.5 Normal distribution
      1.2.6 Cauchy distribution
   1.3 Transformations of a single RV
   1.4 CDF transformation
   1.5 Non-monotonic transformations
   1.6 Moments of transformed RVs
   1.7 Multivariate distributions
      1.7.1 Trinomial distribution
      1.7.2 Multinomial distribution
      1.7.3 Marginal and conditional distributions
      1.7.4 Continuous multivariate distributions
   1.8 Transformations of several RVs
      1.8.1 Method of regular transformations
   1.9 Moments
      1.9.1 Moment generating functions
      1.9.2 Marginal distributions and the MGF
      1.9.3 Vector notation
      1.9.4 Properties of variance matrices
   1.10 The multivariate normal distribution
      1.10.1 The multivariate normal MGF
      1.10.2 Independence and normality
   1.11 Limit Theorems
      1.11.1 Convergence of random variables

2 Statistical Inference
   2.1 Basic definitions and terminology
   2.2 Minimum Variance Unbiased Estimation
   2.3 Methods of Estimation
      2.3.1 Method of Moments
      2.3.2 Maximum Likelihood Estimation
   2.4 Hypothesis Tests and Confidence Intervals



1 Distribution Theory

A discrete RV is described by its probability function

p(x) = P({X = x}).

A continuous RV is described by its probability density function f(x), for which

P({a ≤ X ≤ b}) = ∫_a^b f(x) dx, for all a ≤ b.

In particular, P(X = a) = ∫_a^a f(x) dx = 0 for any continuous RV.

Some RVs are neither discrete nor continuous: for example, daily precipitation has positive probability of being exactly zero (dry days) but is continuously distributed over positive amounts.

The cumulative distribution function (CDF) is defined by

F(x) = P({X ≤ x})    (the probability to the left of x),

computed as

F(x) = Σ_{t: t ≤ x} p(t)    (X discrete)
     = ∫_{−∞}^x f(t) dt     (X continuous).

The expected value E(X) of a discrete RV is given by

E(X) = Σ_x x p(x),

provided that Σ_x |x| p(x) < ∞ (i.e., the sum must converge absolutely). Otherwise the expectation is not defined.

The expected value of a continuous RV is given by

E(X) = ∫_{−∞}^∞ x f(x) dx,

provided that ∫_{−∞}^∞ |x| f(x) dx < ∞; otherwise it is not defined. (For example, the Cauchy distribution has no defined moments.)


More generally, for a function h,

E{h(X)} = Σ_x h(x) p(x)            if Σ_x |h(x)| p(x) < ∞              (X discrete)
        = ∫_{−∞}^∞ h(x) f(x) dx    if ∫_{−∞}^∞ |h(x)| f(x) dx < ∞     (X continuous)

The moment generating function (MGF) of a RV X is defined to be

M_X(t) = E[e^{tX}] = Σ_x e^{tx} p(x)            (X discrete)
                   = ∫_{−∞}^∞ e^{tx} f(x) dx    (X continuous)

M_X(0) = 1 always; the MGF may or may not be defined for other values of t.

If M_X(t) is defined for all t in some open interval containing 0, then:

1. moments of all orders exist;

2. E[X^r] = M_X^{(r)}(0) (the rth-order derivative at 0);

3. M_X(t) uniquely determines the distribution of X.

In particular, M_X′(0) = E(X), M_X″(0) = E(X²), and so on.
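Property 2 can be checked numerically; the following is a minimal sketch (Python, standard library only), using the Bernoulli MGF M_X(t) = 1 + p(e^t − 1) derived in the next subsection, with finite differences standing in for the exact derivatives:

```python
import math

def mgf(t, p=0.3):
    # Bernoulli(p) MGF: M_X(t) = 1 + p(e^t - 1)
    return 1 + p * (math.exp(t) - 1)

h = 1e-4
m1 = (mgf(h) - mgf(-h)) / (2 * h)                # ~ M'(0)  = E(X)   = p
m2 = (mgf(h) - 2 * mgf(0) + mgf(-h)) / h ** 2    # ~ M''(0) = E(X^2) = p  (X^2 = X here)
print(round(m1, 3), round(m2, 3))
```

Both approximations land on p = 0.3, matching E(X) = p and E(X²) = p for a Bernoulli RV.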

1.1 Discrete distributions

1.1.1 Bernoulli distribution

Parameter: 0 ≤ p ≤ 1

Possible values: {0, 1}

Prob. function:

p(x) = p        for x = 1
     = 1 − p    for x = 0,

i.e., p(x) = p^x (1 − p)^{1−x}.



E(X) = p

Var(X) = p(1 − p)

M_X(t) = 1 + p(e^t − 1).

1.1.2 Binomial distribution

Parameters: 0 ≤ p ≤ 1; n > 0

Prob. function: p(x) = C(n, x) p^x (1 − p)^{n−x}, x = 0, 1, . . . , n, where C(n, x) = n!/(x!(n − x)!) is the binomial coefficient.

M_X(t) = {1 + p(e^t − 1)}^n.

Consider a sequence of n independent Bern(p) trials. If X = total number of successes, then X ∼ B(n, p).

Note: if Y_1, . . . , Y_n are the underlying Bernoulli RVs, then X = Y_1 + Y_2 + · · · + Y_n (the same as counting the number of successes).

This is a valid probability function: p(x) ≥ 0 for x = 0, 1, . . . , n, and Σ_{x=0}^n p(x) = 1.

1.1.3 Geometric distribution

Suppose a sequence of Bernoulli trials is performed and let X be the number of failures preceding the first success. Then X ∼ Geom(p).

1.1.4 Negative Binomial distribution

Suppose a sequence of independent Bernoulli trials is conducted. If X is the number of failures preceding the nth success, then X has a negative binomial distribution.

Prob. function:

p(x) = C(n + x − 1, n − 1) p^n (1 − p)^x,    x = 0, 1, 2, . . . ,

where C(n + x − 1, n − 1) counts the ways of allocating the failures and successes.


E(X) = n(1 − p)/p,

Var(X) = n(1 − p)/p²,

M_X(t) = [ p / (1 − e^t(1 − p)) ]^n.

1. If n = 1, we obtain the geometric distribution.

2. The negative binomial can also be seen to arise as the sum of n independent geometric variables.


1.1.5 Poisson distribution

Parameter: λ > 0

Prob. function: p(x) = e^{−λ} λ^x / x!,    x = 0, 1, 2, . . .

M_X(t) = e^{λ(e^t − 1)}

The Poisson distribution arises as the distribution for the number of "point events" observed from a Poisson process. Example: the number of incoming calls to a certain exchange in a given hour.

Figure 1: Poisson Example

The Poisson distribution arises as the limiting form of the binomial distribution when

n → ∞,  p → 0,  np → λ.
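This limit is easy to see numerically; a quick sketch (Python, standard library only; the values λ = 2 and x = 3 are arbitrary choices for illustration):

```python
from math import comb, exp, factorial

lam, x = 2.0, 3
poisson = exp(-lam) * lam ** x / factorial(x)     # Poisson(lam) pmf at x

# Binomial(n, p = lam/n) pmf at x approaches the Poisson pmf as n grows
for n in (10, 100, 10000):
    p = lam / n
    binom = comb(n, x) * p ** x * (1 - p) ** (n - x)
    print(n, round(binom, 6))
print("poisson", round(poisson, 6))
```

By n = 10000 the binomial probability agrees with the Poisson probability to several decimal places.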

The derivation of the Poisson distribution (via the binomial) is underpinned by a Poisson process, i.e., a point process on [0, ∞); see Figure 1.

AXIOMS for a Poisson process of rate λ > 0:

(A) The numbers of occurrences in disjoint intervals are independent.


(B) The probability of 1 or more occurrences in any interval [t, t + h) is λh + o(h) as h → 0 (i.e., the probability is approximately the length of the interval × λ).

(C) The probability of more than one occurrence in [t, t + h) is o(h) as h → 0 (i.e., the probability is small, negligible).

Note: o(h), pronounced "small order h", is standard notation for any function r(h) with the property

lim_{h→0} r(h)/h = 0.

Figure 2: Small order h: functions h⁴ (yes) and h (no)

1.1.6 Hypergeometric distribution

Consider an urn containing M black and N white balls. Suppose n balls are sampled randomly without replacement and let X be the number of black balls chosen. Then X has a hypergeometric distribution.

Parameters: M, N > 0, 0 < n ≤ M + N

Possible values: max(0, n − N) ≤ x ≤ min(n, M)

Prob function:


p(x) = C(M, x) C(N, n − x) / C(M + N, n),

E(X) = nM / (M + N),

Var(X) = [(M + N − n) / (M + N − 1)] · nMN / (M + N)².

The MGF exists, but there is no useful expression available.

1. The hypergeometric probability is simply

(# selections with x black balls) / (# possible selections) = C(M, x) C(N, n − x) / C(M + N, n).

2. To see how the limits arise, observe that we must have x ≤ n (no more black balls than the sample size) and x ≤ M, i.e., x ≤ min(n, M). Similarly, we must have x ≥ 0 (cannot have < 0 black balls in the sample) and n − x ≤ N (cannot have more white balls than the number in the urn), i.e., x ≥ n − N; hence x ≥ max(0, n − N).

3. If we sample with replacement, we would get X ∼ B(n, p = M/(M + N)). It is interesting to compare moments:

hypergeometric:  E(X) = np,  Var(X) = [(M + N − n)/(M + N − 1)] · np(1 − p)
binomial:        E(X) = np,  Var(X) = np(1 − p)

The factor (M + N − n)/(M + N − 1) is the finite population correction: when we sample all the balls in the urn (n = M + N), Var(X) = 0.

4. When M, N ≫ n, the difference between sampling with and without replacement should be small.


Figure 3: p = 1/3

If a white ball is taken out:

Figure 4: p = 1/2 (without replacement)

Figure 5: p = 1/3 (with replacement)

Intuitively, this implies that for M, N ≫ n, the hypergeometric and binomial probabilities should be very similar, and this can be verified for fixed n, x:

lim_{M,N→∞, M/(M+N)→p} C(M, x) C(N, n − x) / C(M + N, n) = C(n, x) p^x (1 − p)^{n−x}.
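The convergence can be illustrated numerically; a sketch (Python, standard library only; the particular M, N, n, x values are arbitrary, chosen so that p = M/(M + N) = 1/3 throughout):

```python
from math import comb

def hyper_pmf(x, M, N, n):
    # hypergeometric pmf: C(M, x) C(N, n - x) / C(M + N, n)
    return comb(M, x) * comb(N, n - x) / comb(M + N, n)

def binom_pmf(x, n, p):
    return comb(n, x) * p ** x * (1 - p) ** (n - x)

n, x = 5, 2
for M, N in ((4, 8), (40, 80), (4000, 8000)):    # p = M/(M+N) = 1/3 throughout
    print(M, N, round(hyper_pmf(x, M, N, n), 5))
print("binomial", round(binom_pmf(x, n, 1 / 3), 5))
```

For the small urn the two probabilities differ noticeably; by M = 4000, N = 8000 they agree to about three decimal places.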

1.2 Continuous Distributions

1.2.1 Uniform Distribution

PDF: f(x) = 1/(b − a) for a < x < b (and 0 otherwise).

CDF, for a < x < b:

F(x) = ∫_{−∞}^x f(t) dt = ∫_a^x 1/(b − a) dt = (x − a)/(b − a),


that is,

F(x) = 0                  for x ≤ a
     = (x − a)/(b − a)    for a < x < b
     = 1                  for b ≤ x.

Figure 6: Uniform distribution CDF

M_X(t) = (e^{tb} − e^{ta}) / (t(b − a))

A special case is the U(0, 1) distribution:

f(x) = 1 for 0 < x < 1 (0 otherwise),

F(x) = x for 0 < x < 1,

E(X) = 1/2,  Var(X) = 1/12,  M(t) = (e^t − 1)/t.

1.2.2 Exponential Distribution

CDF: F(x) = 1 − e^{−λx}, x ≥ 0

M_X(t) = λ/(λ − t), t < λ.


This is the distribution for the waiting time until the first occurrence in a Poisson process with rate parameter λ.

1. If X ∼ Exp(λ), then P(X ≥ t + x | X ≥ t) = P(X ≥ x) (the memoryless property).

2. It can be obtained as a limiting form of the geometric distribution.
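The memoryless property follows directly from the survival function P(X ≥ s) = e^{−λs}; a quick numerical check (Python, standard library only; the values of λ, t and x are arbitrary):

```python
from math import exp

lam, t, x = 1.5, 0.7, 1.2

def surv(s):
    # P(X >= s) for X ~ Exp(lam)
    return exp(-lam * s)

cond = surv(t + x) / surv(t)    # P(X >= t + x | X >= t)
print(cond, surv(x))            # the two agree: memoryless
```

Since e^{−λ(t+x)}/e^{−λt} = e^{−λx}, the conditional survival probability does not depend on t at all.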

1.2.3 Gamma distribution

Gamma function: Γ(α) = ∫_0^∞ t^{α−1} e^{−t} dt

PDF: f(x) = (λ^α / Γ(α)) x^{α−1} e^{−λx}, x > 0

MGF: M_X(t) = ( λ/(λ − t) )^α, t < λ.

Suppose Y_1, . . . , Y_K are independent Exp(λ) RVs and let X = Y_1 + · · · + Y_K. Then X ∼ Gamma(K, λ), for K integer. In general, X ∼ Gamma(α, λ), α > 0.

1. α is the shape parameter and λ is the scale parameter.

Figure 7: Gamma Distribution

Note: if Y ∼ Gamma(α, 1) and X = Y/λ, then X ∼ Gamma(α, λ); that is, λ is a scale parameter.

2. The Gamma(p/2, 1/2) distribution is also called the χ²_p (chi-square with p df) distribution, for p an integer; χ²₂ is an exponential distribution (2 df).


3. The Gamma(K, λ) distribution can be interpreted as the waiting time until the Kth occurrence in a Poisson process.

1.2.4 Beta density function

Suppose Y_1 ∼ Gamma(α, λ) and Y_2 ∼ Gamma(β, λ) independently. Then

X = Y_1 / (Y_1 + Y_2) ∼ Beta(α, β), 0 ≤ x ≤ 1.

Figure 8: Beta Distribution

1.2.5 Normal distribution

X ∼ N(µ, σ²); M_X(t) = e^{tµ} e^{t²σ²/2}.

1.2.6 Cauchy distribution

Possible values: x ∈ R

PDF: f(x) = (1/π) · 1/(1 + x²)    (location parameter θ = 0)


CDF:

F(x) = 1/2 + (1/π) arctan x

E(X), Var(X) and M_X(t) do not exist: the Cauchy is a bell-shaped distribution symmetric about zero for which no moments are defined.


Figure 9: Cauchy Distribution (more peaked than the normal distribution, with tails that go to zero much more slowly)

If Z_1 ∼ N(0, 1) and Z_2 ∼ N(0, 1) independently, then X = Z_1/Z_2 has the Cauchy distribution.

1.3 Transformations of a single RV

If X is a RV and Y = h(X), then Y is also a RV. If we know the distribution of X, then for a given function h(x) we should be able to find the distribution of Y = h(X).

Theorem 1.3.1 (Discrete case)
Suppose X is a discrete RV with prob. function p_X(x) and let Y = h(X), where h is any function. Then

p_Y(y) = Σ_{x: h(x)=y} p_X(x)    (sum over all values x for which h(x) = y).

Proof.

p_Y(y) = P(Y = y) = P{h(X) = y}
       = Σ_{x: h(x)=y} P(X = x)
       = Σ_{x: h(x)=y} p_X(x).


Theorem 1.3.2
Suppose X is a continuous RV with PDF f_X(x) and let Y = h(X), where h(x) is differentiable and monotonic, i.e., either strictly increasing or strictly decreasing. Then the PDF of Y is given by

f_Y(y) = f_X{h^{−1}(y)} |h^{−1}(y)′|.

Proof. Assume h is increasing. Then

F_Y(y) = P(Y ≤ y) = P{h(X) ≤ y}
       = P{X ≤ h^{−1}(y)}
       = F_X{h^{−1}(y)}

Figure 10: h increasing.

⇒ f_Y(y) = (d/dy) F_Y(y) = (d/dy) F_X{h^{−1}(y)}    (use the Chain Rule)
         = f_X{h^{−1}(y)} h^{−1}(y)′.    (1)

Now consider the case of h(x) decreasing:


F_Y(y) = P(Y ≤ y) = P{h(X) ≤ y}
       = P{X ≥ h^{−1}(y)}
       = 1 − F_X{h^{−1}(y)}

⇒ f_Y(y) = −f_X{h^{−1}(y)} h^{−1}(y)′    (2)

Figure 11: h decreasing.

Finally, observe that if h is increasing then h′(x), and hence h^{−1}(y)′, must be positive. Similarly, for h decreasing, h^{−1}(y)′ < 0. Hence (1) and (2) can be combined to give:

f_Y(y) = f_X{h^{−1}(y)} |h^{−1}(y)′|

Examples:

1. Discrete transformation of a single RV

X ∼ Po(λ), and Y is X rounded to the nearest multiple of 10. Possible values of Y are: 0, 10, 20, . . .


P(Y = 0) = P(X = 0) + P(X = 1) + P(X = 2) + P(X = 3) + P(X = 4)
         = e^{−λ} + e^{−λ}λ + e^{−λ}λ²/2! + e^{−λ}λ³/3! + e^{−λ}λ⁴/4!;

P(Y = 10) = P(X = 5) + P(X = 6) + P(X = 7) + P(X = 8) + · · · + P(X = 14), and so on.
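The same bookkeeping is easy to automate; a sketch (Python, standard library only, with an arbitrary choice λ = 3):

```python
from math import exp, factorial

lam = 3.0

def pois(x):
    # Poisson(lam) probability function
    return exp(-lam) * lam ** x / factorial(x)

# Y = X rounded to the nearest multiple of 10
p_y0 = sum(pois(x) for x in range(0, 5))      # X in {0,...,4}  -> Y = 0
p_y10 = sum(pois(x) for x in range(5, 15))    # X in {5,...,14} -> Y = 10
print(round(p_y0, 4), round(p_y10, 4))
```

Together with P(Y = 20), P(Y = 30), . . . these probabilities sum to 1, since the groups {0, . . . , 4}, {5, . . . , 14}, . . . partition the possible values of X.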

2. Continuous transformation of a single RV

Let h(X) = aX + b, a ≠ 0. To find h^{−1}(y) we solve for x in the equation y = h(x), i.e.,

y = ax + b ⇒ x = (y − b)/a,

so h^{−1}(y) = (y − b)/a and h^{−1}(y)′ = 1/a.

Hence, f_Y(y) = (1/|a|) f_X((y − b)/a).

Specifically:

1. Suppose Z ∼ N(0, 1) and let X = µ + σZ, σ > 0. Recall that Z has PDF

φ(z) = (1/√(2π)) e^{−z²/2}.

Hence, f_X(x) = (1/|σ|) φ((x − µ)/σ) = (1/(√(2π) σ)) e^{−(x−µ)²/(2σ²)}, which is the N(µ, σ²) PDF.

2. Suppose X ∼ N(µ, σ²) and let Z = (X − µ)/σ. Find the PDF of Z.


Solution. Observe that Z = h(X) = (1/σ)X − µ/σ is linear with a = 1/σ, b = −µ/σ, so h^{−1}(z) = µ + σz and

f_Z(z) = σ f_X(µ + σz)
       = σ · (1/(√(2π) σ)) e^{−(µ+σz−µ)²/(2σ²)}
       = (1/√(2π)) e^{−σ²z²/(2σ²)}
       = (1/√(2π)) e^{−z²/2} = φ(z),

i.e., Z ∼ N(0, 1).

3. Suppose X ∼ Gamma(α, 1) and let Y = X/λ, λ > 0. Find the PDF of Y.

Solution. Since Y = (1/λ)X is a linear function, we have

f_Y(y) = λ f_X(λy) = λ · (1/Γ(α)) (λy)^{α−1} e^{−λy} = (λ^α/Γ(α)) y^{α−1} e^{−λy},

which is the Gamma(α, λ) PDF.

1.4 CDF transformation

Suppose X is a continuous RV with CDF F_X(x), which is increasing over the range of X. If U = F_X(X), then U ∼ U(0, 1).


Proof.

F_U(u) = P(U ≤ u)
       = P{F_X(X) ≤ u}
       = P{X ≤ F_X^{−1}(u)}
       = F_X{F_X^{−1}(u)}
       = u, for 0 < u < 1.

This is simply the CDF of U(0, 1), so the result is proved.

The converse also applies. If U ∼ U(0, 1) and F is any CDF that is strictly increasing on its range, then X = F^{−1}(U) has CDF F(x), i.e.,

F_X(x) = P(X ≤ x) = P{F^{−1}(U) ≤ x}
       = P(U ≤ F(x))
       = F(x), as required.

1.5 Non-monotonic transformations

Theorem 1.3.2 applies to monotonic (either strictly increasing or strictly decreasing) transformations of a continuous RV. In general, if h(x) is not monotonic, then h(X) may not even be continuous. For example, if h(x) = [x] (the integer part of x), then the possible values for Y = h(X) are the integers, so Y is discrete. However, it can be shown that h(X) is continuous if X is continuous and h(x) is piecewise monotonic.


1.6 Moments of transformed RVs

Suppose X is a RV and let Y = h(X). If we want to find E(Y), we can proceed as follows:

1. Find the distribution of Y = h(X) using the preceding methods.

2. Find E(Y) = Σ_y y p(y) (Y discrete) or ∫_{−∞}^∞ y f(y) dy (Y continuous). (Forget X ever existed!)

Theorem 1.6.1
If X is a RV of either discrete or continuous type and h(x) is any transformation (not necessarily monotonic), then E{h(X)} (provided it exists) is given by

E{h(X)} = Σ_x h(x) p(x)            (X discrete)
        = ∫_{−∞}^∞ h(x) f(x) dx    (X continuous)

Proof. Not examinable.

Examples:

1. CDF transformation

Suppose U ∼ U(0, 1). How can we transform U to get an Exp(λ) RV?

Solution. Take X = F^{−1}(U), where F is the Exp(λ) CDF. Recall F(x) = 1 − e^{−λx} for the Exp(λ) distribution. To find F^{−1}(u), we solve for x in F(x) = u,

i.e.,

u = 1 − e^{−λx}
⇒ 1 − u = e^{−λx}
⇒ ln(1 − u) = −λx
⇒ x = −ln(1 − u)/λ.

Hence if U ∼ U(0, 1), it follows that X = −ln(1 − U)/λ ∼ Exp(λ).

Note: Y = −(ln U)/λ ∼ Exp(λ) as well [both 1 − U and U have the U(0, 1) distribution].

This type of result is used to generate random numbers. That is, there are good methods for producing U(0, 1) (pseudo-random) numbers; to obtain Exp(λ) random numbers, we can just get U(0, 1) numbers and then calculate X = −(log U)/λ.
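A sketch of this recipe (Python, standard library only; the seed, rate λ = 2 and sample size are arbitrary):

```python
import math
import random

random.seed(0)
lam = 2.0

# inverse-transform sampling: U ~ U(0,1)  =>  -ln(1 - U)/lam ~ Exp(lam)
samples = [-math.log(1 - random.random()) / lam for _ in range(100_000)]
mean = sum(samples) / len(samples)
print(round(mean, 3))    # should be near E(X) = 1/lam = 0.5
```

The sample mean lands close to 1/λ, consistent with the exponential expectation computed in Section 1.6.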

2. Non-monotonic transformations

Suppose Z ∼ N(0, 1) and let X = Z²; h(Z) = Z² is not monotonic, so Theorem 1.3.2 does not apply. However, we can proceed as follows:

F_X(x) = P(X ≤ x)
       = P(Z² ≤ x)
       = P(−√x ≤ Z ≤ √x)
       = Φ(√x) − Φ(−√x),

where Φ(a) = ∫_{−∞}^a (1/√(2π)) e^{−t²/2} dt is the N(0, 1) CDF. That is,

⇒ f_X(x) = (d/dx) F_X(x) = φ(√x) ((1/2) x^{−1/2}) − φ(−√x) (−(1/2) x^{−1/2}),


where φ(a) = Φ′(a) = (1/√(2π)) e^{−a²/2} is the N(0, 1) PDF. Hence

f_X(x) = (1/2) x^{−1/2} [ (1/√(2π)) e^{−x/2} + (1/√(2π)) e^{−x/2} ]
       = ((1/2)^{1/2} / √π) x^{1/2 − 1} e^{−x/2},

which is the Gamma(1/2, 1/2) PDF.

On the other hand, the distribution of Z² is also called the χ²₁ distribution. We have proved that it is the same as the Gamma(1/2, 1/2) distribution.
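A simulation check of this identity (Python, standard library only): Gamma(1/2, 1/2) has mean α/λ = 1 and variance α/λ² = 2, so the squared standard normals should match those moments.

```python
import random

random.seed(1)
xs = [random.gauss(0.0, 1.0) ** 2 for _ in range(200_000)]    # X = Z^2

mean = sum(xs) / len(xs)                              # E(X) = alpha/lam = 1
var = sum((x - mean) ** 2 for x in xs) / len(xs)      # Var(X) = alpha/lam^2 = 2
print(round(mean, 2), round(var, 2))
```

The sample mean and variance land near 1 and 2 respectively, as the χ²₁ = Gamma(1/2, 1/2) identification predicts.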

3. Moments of transformed RVs:

If U ∼ U(0, 1) and Y = −(log U)/λ, then Y ∼ Exp(λ) ⇒ f(y) = λe^{−λy}, y > 0. We can check that

E(Y) = ∫_0^∞ λy e^{−λy} dy = 1/λ.
4. Based on Theorem 1.6.1: if U ∼ U(0, 1) and Y = −(log U)/λ, then according to Theorem 1.6.1,

E(Y) = ∫_0^1 (−(log u)/λ)(1) du
     = −(1/λ) ∫_0^1 log u du
     = −(1/λ) (u log u − u) |_0^1
     = −(1/λ) [ (u log u)|_0^1 − u|_0^1 ]
     = −(1/λ)[0 − 1]
     = 1/λ, as required.

There are some important consequences of Theorem 1.6.1:


1. If E(X) = µ and Var(X) = σ², and Y = aX + b for constants a, b, then E(Y) = aµ + b and Var(Y) = a²σ².

Proof. (Continuous case)

E(Y) = E(aX + b) = ∫_{−∞}^∞ (ax + b) f(x) dx
     = a ∫_{−∞}^∞ x f(x) dx + b ∫_{−∞}^∞ f(x) dx    [the first integral is E(X); the second is 1]
     = aµ + b.

Var(Y) = E[{Y − E(Y)}²]
       = E[{aX + b − (aµ + b)}²]
       = E{a²(X − µ)²}
       = a² E{(X − µ)²}
       = a² Var(X)
       = a²σ².

2. If X is a RV and h(X) is any function, then the MGF of Y = h(X), provided it exists, is

M_Y(t) = Σ_x e^{t h(x)} p(x)            (X discrete)
       = ∫_{−∞}^∞ e^{t h(x)} f(x) dx    (X continuous)

This gives us another way to find the distribution of Y = h(X): find M_Y(t), and if we recognise M_Y(t), then by uniqueness we can conclude that Y has that distribution.


Examples

1. Suppose X is continuous with CDF F(x), and F(a) = 0, F(b) = 1 (a, b can be ±∞ respectively). Let U = F(X). Observe that

M_U(t) = ∫_a^b e^{tF(x)} f(x) dx
       = (1/t) e^{tF(x)} |_a^b
       = (e^{tF(b)} − e^{tF(a)})/t
       = (e^t − 1)/t,

which is the U(0, 1) MGF.

2. Suppose X ∼ U(0, 1), and let Y = −(log X)/λ.

M_Y(t) = ∫_0^1 e^{t(−(log x)/λ)} (1) dx
       = ∫_0^1 x^{−t/λ} dx
       = (1/(1 − t/λ)) x^{1−t/λ} |_0^1
       = 1/(1 − t/λ)
       = λ/(λ − t),

which is the MGF for the Exp(λ) distribution. Hence we can conclude that Y = −(log X)/λ ∼ Exp(λ).


1.7 Multivariate distributions

Definition 1.7.1
If X_1, X_2, . . . , X_r are discrete RVs, then X = (X_1, X_2, . . . , X_r)^T is called a discrete random vector. The probability function P(x) is

P(x) = P(X = x) = P({X_1 = x_1} ∩ {X_2 = x_2} ∩ · · · ∩ {X_r = x_r});

P(x) = P(x_1, x_2, . . . , x_r).

1.7.1 Trinomial distribution

(r = 2.) Consider a sequence of n independent trials where each trial produces:

Outcome 1: with prob π_1
Outcome 2: with prob π_2
Outcome 3: with prob 1 − π_1 − π_2

If X_1, X_2 are the numbers of occurrences of outcomes 1 and 2 respectively, then (X_1, X_2) have the trinomial distribution.

Parameters: π_1 > 0, π_2 > 0 and π_1 + π_2 < 1; n > 0 fixed

Possible values: integers (x_1, x_2) s.t. x_1 ≥ 0, x_2 ≥ 0, x_1 + x_2 ≤ n

Probability function:

P(x_1, x_2) = [ n! / (x_1! x_2! (n − x_1 − x_2)!) ] π_1^{x_1} π_2^{x_2} (1 − π_1 − π_2)^{n−x_1−x_2}

for x_1, x_2 ≥ 0, x_1 + x_2 ≤ n.
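As a check that this is a proper probability function, the probabilities can be summed over all possible (x_1, x_2); a sketch (Python, standard library only; the values n = 4, π_1 = 0.2, π_2 = 0.5 are arbitrary):

```python
from math import factorial

def trinomial_pmf(x1, x2, n, pi1, pi2):
    # P(X1 = x1, X2 = x2) for the trinomial distribution
    x3 = n - x1 - x2
    coef = factorial(n) // (factorial(x1) * factorial(x2) * factorial(x3))
    return coef * pi1 ** x1 * pi2 ** x2 * (1 - pi1 - pi2) ** x3

n, p1, p2 = 4, 0.2, 0.5
total = sum(trinomial_pmf(a, b, n, p1, p2)
            for a in range(n + 1) for b in range(n + 1 - a))
print(total)    # sums to 1 (up to floating point)
```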

1.7.2 Multinomial distribution

Parameters: n > 0, π = (π_1, π_2, . . . , π_r)^T with π_i > 0 and Σ_{i=1}^r π_i = 1



Possible values: integer-valued (x_1, x_2, . . . , x_r) s.t. x_i ≥ 0 and Σ_{i=1}^r x_i = n

Probability function:

P(x) = [ n! / (x_1! x_2! · · · x_r!) ] π_1^{x_1} π_2^{x_2} · · · π_r^{x_r}

for x_i ≥ 0, Σ_{i=1}^r x_i = n.

Remarks

1. The coefficient n!/(x_1! x_2! · · · x_r!) is the multinomial coefficient.

2. The multinomial distribution is the generalisation of the binomial distribution to r types of outcome.

3. The formulation differs from the binomial and trinomial cases in that the redundant count x_r = n − (x_1 + x_2 + · · · + x_{r−1}) is included as an argument of P(x).

1.7.3 Marginal and conditional distributions

Consider a discrete random vector X = (X_1, X_2, . . . , X_r)^T and let

X_1 = (X_1, X_2, . . . , X_{r_1})^T  and  X_2 = (X_{r_1+1}, X_{r_1+2}, . . . , X_r)^T,

so that X = (X_1, X_2).

Definition 1.7.2
If X has joint probability function P_X(x) = P_X(x_1, x_2), then the marginal probability function for X_1 is

P_{X_1}(x_1) = Σ_{x_2} P_X(x_1, x_2).

Observe that

P_{X_1}(x_1) = Σ_{x_2} P_X(x_1, x_2) = Σ_{x_2} P({X_1 = x_1} ∩ {X_2 = x_2}) = P(X_1 = x_1),

by the law of total probability. Hence the marginal probability function for X_1 is just the probability function we would have if X_2 was not observed.

The marginal probability function for X_2 is

P_{X_2}(x_2) = Σ_{x_1} P_X(x_1, x_2).

Definition 1.7.3
Suppose X = (X_1, X_2). If P_{X_1}(x_1) > 0, we define the conditional probability function of X_2 | X_1 by

P_{X_2|X_1}(x_2 | x_1) = P_X(x_1, x_2) / P_{X_1}(x_1).

Remarks

1. P_{X_2|X_1}(x_2 | x_1) = P({X_1 = x_1} ∩ {X_2 = x_2}) / P(X_1 = x_1) = P(X_2 = x_2 | X_1 = x_1).

2. It is easy to check that P_{X_2|X_1}(x_2 | x_1) is a proper probability function with respect to x_2 for each fixed x_1 such that P_{X_1}(x_1) > 0.

3. P_{X_1|X_2}(x_1 | x_2) is defined by

P_{X_1|X_2}(x_1 | x_2) = P_X(x_1, x_2) / P_{X_2}(x_2).

Definition 1.7.4 (Independence)
Discrete RVs X_1, X_2, . . . , X_r are said to be independent if their joint probability function satisfies

P(x_1, x_2, . . . , x_r) = p_1(x_1) p_2(x_2) · · · p_r(x_r)

for some functions p_1, p_2, . . . , p_r and all (x_1, x_2, . . . , x_r).


Remarks

1. Observe that

P_{X_1}(x_1) = Σ_{x_2} Σ_{x_3} · · · Σ_{x_r} P(x_1, . . . , x_r)
             = p_1(x_1) Σ_{x_2} Σ_{x_3} · · · Σ_{x_r} p_2(x_2) · · · p_r(x_r)
             = c_1 p_1(x_1).

Hence p_1(x_1) ∝ P_{X_1}(x_1) (the sum of probabilities is proportional to the marginal probability); likewise p_i(x_i) ∝ P_{X_i}(x_i). Then

P_{X_1|X_2...X_r}(x_1 | x_2, . . . , x_r) = P(x_1, . . . , x_r) / P_{X_2...X_r}(x_2, . . . , x_r)
    = p_1(x_1) p_2(x_2) · · · p_r(x_r) / [ Σ_{x_1} p_1(x_1) p_2(x_2) · · · p_r(x_r) ]
    = p_1(x_1) p_2(x_2) · · · p_r(x_r) / [ p_2(x_2) p_3(x_3) · · · p_r(x_r) Σ_{x_1} p_1(x_1) ]
    = p_1(x_1) / Σ_{x_1} p_1(x_1)
    = (1/c) P_{X_1}(x_1) / [ (1/c) Σ_{x_1} P_{X_1}(x_1) ]
    = P_{X_1}(x_1).

That is, P_{X_1|X_2...X_r}(x_1 | x_2, . . . , x_r) = P_{X_1}(x_1).

2. Clearly independence ⟹

P_{X_i|X_1...X_{i−1},X_{i+1}...X_r}(x_i | x_1, . . . , x_{i−1}, x_{i+1}, . . . , x_r) = P_{X_i}(x_i).

Moreover, we have:


P_{X_1|X_2}(x_1 | x_2) = P_{X_1}(x_1) for any partitioning of X = (X_1, X_2), if X_1, . . . , X_r are independent.

1.7.4 Continuous multivariate distributions

Definition. 1.7.5

The random vector $(X_1, \ldots, X_r)^T$ is said to have a continuous multivariate distribution with PDF $f(\mathbf{x})$ if

$$P(\mathbf{X} \in A) = \int \cdots \int_A f(x_1, \ldots, x_r)\, dx_1 \ldots dx_r$$

for any measurable set $A$.

Examples

1. Suppose $(X_1, X_2)$ have the trinomial distribution with parameters $n, \pi_1, \pi_2$:

Outcome 1: probability $\pi_1$
Outcome 2: probability $\pi_2$
Outcome 3: probability $1 - \pi_1 - \pi_2$

Then the marginal distribution of $X_1$ can be seen to be $B(n, \pi_1)$, and the conditional distribution of $X_2|X_1 = x_1$ is $B\left(n - x_1, \dfrac{\pi_2}{1 - \pi_1}\right)$.
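The trinomial example can be checked by simulation; a minimal sketch (the values $n = 10$, $\pi_1 = 0.3$, $\pi_2 = 0.5$ are arbitrary choices, not from the notes):

```python
import numpy as np

# Monte Carlo check of the trinomial example: marginally X1 ~ B(n, pi1),
# and conditionally on X1 = x1, X2 ~ B(n - x1, pi2/(1 - pi1)).
rng = np.random.default_rng(0)
n, pi1, pi2 = 10, 0.3, 0.5
counts = rng.multinomial(n, [pi1, pi2, 1 - pi1 - pi2], size=200_000)
x1, x2 = counts[:, 0], counts[:, 1]

mean_x1, var_x1 = x1.mean(), x1.var()  # near n*pi1 = 3 and n*pi1*(1-pi1) = 2.1
cond_mean = x2[x1 == 2].mean()         # near (n-2)*pi2/(1-pi1) ≈ 5.71
print(mean_x1, var_x1, cond_mean)
```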

Examples of Definition 1.7.5

1. If $X_1, X_2$ have PDF

$$f(x_1, x_2) = \begin{cases} 1 & 0 < x_1 < 1,\ 0 < x_2 < 1 \\ 0 & \text{otherwise,} \end{cases}$$

the distribution is called uniform on $(0,1) \times (0,1)$. It follows that $P(\mathbf{X} \in A) = \text{Area}(A)$ for any $A \subseteq (0,1) \times (0,1)$:

2. Uniform distribution on the unit disk:

$$f(x_1, x_2) = \begin{cases} \dfrac{1}{\pi} & x_1^2 + x_2^2 < 1 \\ 0 & \text{otherwise.} \end{cases}$$

Figure 12: Area A

3. The Dirichlet distribution is defined by the PDF

$$f(x_1, x_2, \ldots, x_r) = \frac{\Gamma(\alpha_1 + \alpha_2 + \cdots + \alpha_r + \alpha_{r+1})}{\Gamma(\alpha_1)\Gamma(\alpha_2)\cdots\Gamma(\alpha_r)\Gamma(\alpha_{r+1})}\, x_1^{\alpha_1 - 1} x_2^{\alpha_2 - 1} \cdots x_r^{\alpha_r - 1} (1 - x_1 - \cdots - x_r)^{\alpha_{r+1} - 1}$$

for $x_1, x_2, \ldots, x_r > 0$, $\sum x_i < 1$, and parameters $\alpha_1, \alpha_2, \ldots, \alpha_{r+1} > 0$.

Recall the joint PDF:

$$P(\mathbf{X} \in A) = \int \cdots \int_A f(x_1, \ldots, x_r)\, dx_1 \ldots dx_r.$$

Note: the joint PDF must satisfy:

1. $f(\mathbf{x}) \geq 0$ for all $\mathbf{x}$;

2. $\displaystyle\int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} f(x_1, \ldots, x_r)\, dx_1 \ldots dx_r = 1.$

Definition. 1.7.6

If $X = \begin{pmatrix} X_1 \\ X_2 \end{pmatrix}$ has joint PDF $f(\mathbf{x}) = f(\mathbf{x}_1, \mathbf{x}_2)$, then the marginal PDF of $X_1$ is given by:

$$f_{X_1}(\mathbf{x}_1) = \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} f(x_1, \ldots, x_{r_1}, x_{r_1+1}, \ldots, x_r)\, dx_{r_1+1}\, dx_{r_1+2} \ldots dx_r,$$

and for $f_{X_1}(\mathbf{x}_1) > 0$, the conditional PDF is given by

$$f_{X_2|X_1}(\mathbf{x}_2|\mathbf{x}_1) = \frac{f(\mathbf{x}_1, \mathbf{x}_2)}{f_{X_1}(\mathbf{x}_1)}.$$

Remarks:

1. $f_{X_2|X_1}(\mathbf{x}_2|\mathbf{x}_1)$ cannot be interpreted as the conditional PDF of $X_2|\{X_1 = \mathbf{x}_1\}$, because $P(X_1 = \mathbf{x}_1) = 0$ for any continuous distribution. The proper interpretation is the limit as $\delta \to 0$ of $X_2 \mid X_1 \in B(\mathbf{x}_1, \delta)$.

2. $f_{X_2}(\mathbf{x}_2)$ and $f_{X_1|X_2}(\mathbf{x}_1|\mathbf{x}_2)$ are defined analogously.

Definition. 1.7.7 (Independence)

Continuous RV's $X_1, X_2, \ldots, X_r$ are said to be independent if their joint PDF satisfies:

$$f(x_1, x_2, \ldots, x_r) = f_1(x_1) f_2(x_2) \cdots f_r(x_r),$$

for some functions $f_1, f_2, \ldots, f_r$ and all $x_1, \ldots, x_r$.

Remarks

1. It is easy to check that if $X_1, \ldots, X_r$ are independent then each $f_i(x_i) = c_i f_{X_i}(x_i)$. Moreover $c_1 c_2 \cdots c_r = 1$.

2. If $X_1, X_2, \ldots, X_r$ are independent, then it can be checked that

$$f_{X_1|X_2}(x_1|x_2) = f_{X_1}(x_1)$$

for any partitioning of $X = \begin{pmatrix} X_1 \\ X_2 \end{pmatrix}$.

Examples

1. If $(X_1, X_2)$ has the uniform distribution on the unit disk, find:

(a) the marginal PDF $f_{X_1}(x_1)$;

(b) the conditional PDF $f_{X_2|X_1}(x_2|x_1)$.

Solution. Recall

$$f(x_1, x_2) = \begin{cases} \dfrac{1}{\pi} & x_1^2 + x_2^2 < 1 \\ 0 & \text{otherwise.} \end{cases}$$

(a)

$$\begin{aligned}
f_{X_1}(x_1) &= \int_{-\infty}^{\infty} f(x_1, x_2)\, dx_2 \\
&= \int_{-\infty}^{-\sqrt{1-x_1^2}} 0\, dx_2 + \int_{-\sqrt{1-x_1^2}}^{\sqrt{1-x_1^2}} \frac{1}{\pi}\, dx_2 + \int_{\sqrt{1-x_1^2}}^{\infty} 0\, dx_2 \\
&= 0 + \left.\frac{x_2}{\pi}\right|_{-\sqrt{1-x_1^2}}^{\sqrt{1-x_1^2}} + 0 \\
&= \begin{cases} \dfrac{2\sqrt{1 - x_1^2}}{\pi} & -1 < x_1 < 1 \\ 0 & \text{otherwise,} \end{cases}
\end{aligned}$$

i.e., a semi-circular distribution.

Figure 13: A graphic showing the conditional distribution and a semi-circular distribution.


(b) The conditional density of $X_2|X_1$ is:

$$f_{X_2|X_1}(x_2|x_1) = \frac{f(x_1, x_2)}{f_{X_1}(x_1)} = \begin{cases} \dfrac{1}{2\sqrt{1 - x_1^2}} & -\sqrt{1 - x_1^2} < x_2 < \sqrt{1 - x_1^2} \\ 0 & \text{otherwise,} \end{cases}$$

which is uniform: $U(-\sqrt{1 - x_1^2}, \sqrt{1 - x_1^2})$.
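These two densities can be verified by simulation; a short sketch (the rejection sampling scheme and sample size are arbitrary choices):

```python
import numpy as np

# Sample uniformly on the unit disk by rejection from the square (-1,1)^2,
# then compare moments of X1 with the semicircular marginal
# f(x) = 2*sqrt(1 - x^2)/pi, which has mean 0 and variance 1/4.
rng = np.random.default_rng(1)
pts = rng.uniform(-1, 1, size=(400_000, 2))
disk = pts[(pts ** 2).sum(axis=1) < 1]   # keep points inside the disk
x1 = disk[:, 0]
print(x1.mean(), x1.var())               # near 0 and 0.25
```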

2. If $X_1, \ldots, X_r$ are any independent continuous RV's then their joint PDF is:

$$f(x_1, \ldots, x_r) = f_{X_1}(x_1) f_{X_2}(x_2) \cdots f_{X_r}(x_r).$$

E.g. if $X_1, X_2, \ldots, X_r$ are independent $\text{Exp}(\lambda)$ RVs, then

$$f(x_1, \ldots, x_r) = \prod_{i=1}^{r} \lambda e^{-\lambda x_i} = \lambda^r e^{-\lambda \sum_{i=1}^r x_i},$$

for $x_i > 0$, $i = 1, \ldots, r$.

3. Suppose $(X_1, X_2)$ are uniformly distributed on $(0,1) \times (0,1)$. That is,

$$f(x_1, x_2) = \begin{cases} 1 & 0 < x_1 < 1,\ 0 < x_2 < 1 \\ 0 & \text{otherwise.} \end{cases}$$

Claim: $X_1, X_2$ are independent.

Proof. Let

$$f_1(x_1) = \begin{cases} 1 & 0 < x_1 < 1 \\ 0 & \text{otherwise,} \end{cases} \qquad
f_2(x_2) = \begin{cases} 1 & 0 < x_2 < 1 \\ 0 & \text{otherwise;} \end{cases}$$

then $f(x_1, x_2) = f_1(x_1) f_2(x_2)$, and the two variables are independent.

4. $(X_1, X_2)$ is uniform on the unit disk. Are $X_1, X_2$ independent? NO.

Proof. We know:

$$f_{X_2|X_1}(x_2|x_1) = \begin{cases} \dfrac{1}{2\sqrt{1 - x_1^2}} & -\sqrt{1 - x_1^2} < x_2 < \sqrt{1 - x_1^2} \\ 0 & \text{otherwise,} \end{cases}$$

i.e., $U(-\sqrt{1 - x_1^2}, \sqrt{1 - x_1^2})$.

On the other hand,

$$f_{X_2}(x_2) = \begin{cases} \dfrac{2\sqrt{1 - x_2^2}}{\pi} & -1 < x_2 < 1 \\ 0 & \text{otherwise,} \end{cases}$$

i.e., a semicircular distribution.

Hence $f_{X_2}(x_2) \neq f_{X_2|X_1}(x_2|x_1)$, so the variables cannot be independent.

Definition. 1.7.8

The joint CDF of RV's $X_1, X_2, \ldots, X_r$ is defined by:

$$F(x_1, x_2, \ldots, x_r) = P(\{X_1 \leq x_1\} \cap \{X_2 \leq x_2\} \cap \cdots \cap \{X_r \leq x_r\}).$$

Remarks:

1. The definition applies to RV's of any type, i.e., discrete, continuous, or "hybrid".

2. The marginal CDF of the subvector $X_1 = (X_1, \ldots, X_{r_1})^T$ is $F_{X_1}(x_1, \ldots, x_{r_1}) = F_X(x_1, \ldots, x_{r_1}, \infty, \infty, \ldots, \infty)$.

3. RV's $X_1, \ldots, X_r$ are defined to be independent if

$$F(x_1, \ldots, x_r) = F_{X_1}(x_1) F_{X_2}(x_2) \cdots F_{X_r}(x_r).$$

4. The definitions above are completely general. However, in practice it is usually easier to work with the PDF/probability function for any given example.


1.8 Transformations of several RVs

Theorem. 1.8.1

If $X_1, X_2$ are discrete with joint probability function $P(x_1, x_2)$, and $Y = X_1 + X_2$, then:

1. $Y$ has probability function

$$P_Y(y) = \sum_x P(x, y - x).$$

2. If $X_1, X_2$ are independent,

$$P_Y(y) = \sum_x P_{X_1}(x) P_{X_2}(y - x).$$

Proof. (1)

$$\begin{aligned}
P(\{Y = y\}) &= \sum_x P(\{Y = y\} \cap \{X_1 = x\}) \qquad \text{(law of total probability)} \\
&= \sum_x P(\{X_1 + X_2 = y\} \cap \{X_1 = x\}) \\
&= \sum_x P(\{X_2 = y - x\} \cap \{X_1 = x\}) \\
&= \sum_x P(x, y - x).
\end{aligned}$$

(Law of total probability: if $A$ is an event and $B_1, B_2, \ldots$ are events such that $\bigcup_i B_i = S$ and $B_i \cap B_j = \emptyset$ for $j \neq i$, then $P(A) = \sum_i P(A \cap B_i) = \sum_i P(A|B_i)P(B_i)$.)

Proof. (2)

Just substitute

$$P(x, y - x) = P_{X_1}(x) P_{X_2}(y - x).$$
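Theorem 1.8.1(2) is a discrete convolution and is easy to compute directly; a small sketch with binomial inputs (the choices $n_1 = 3$, $n_2 = 2$, $p = 0.5$ are arbitrary):

```python
import math

# Convolve the probability functions of independent X1 ~ B(3, 0.5) and
# X2 ~ B(2, 0.5) to get the probability function of Y = X1 + X2.
def binom_pmf(n, p, x):
    return math.comb(n, x) * p**x * (1 - p)**(n - x)

n1, n2, p = 3, 2, 0.5
pY = {y: sum(binom_pmf(n1, p, x) * binom_pmf(n2, p, y - x)
             for x in range(max(0, y - n2), min(n1, y) + 1))
      for y in range(n1 + n2 + 1)}
print(pY[2], binom_pmf(n1 + n2, p, 2))   # both 0.3125: Y ~ B(5, 0.5)
```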

Theorem. 1.8.2

Suppose $X_1, X_2$ are continuous with PDF $f(x_1, x_2)$, and let $Y = X_1 + X_2$. Then

1. $\displaystyle f_Y(y) = \int_{-\infty}^{\infty} f(x, y - x)\, dx.$

2. If $X_1, X_2$ are independent, then

$$f_Y(y) = \int_{-\infty}^{\infty} f_{X_1}(x) f_{X_2}(y - x)\, dx.$$

Proof. (1)

$$F_Y(y) = P(Y \leq y) = P(X_1 + X_2 \leq y)
= \int_{x_1=-\infty}^{\infty} \int_{x_2=-\infty}^{y - x_1} f(x_1, x_2)\, dx_2\, dx_1.$$

Let $x_2 = t - x_1$, so $\frac{dx_2}{dt} = 1$ and $dx_2 = dt$:

$$\begin{aligned}
F_Y(y) &= \int_{-\infty}^{\infty} \int_{-\infty}^{y} f(x_1, t - x_1)\, dt\, dx_1 \\
&= \int_{-\infty}^{y} \left\{ \int_{-\infty}^{\infty} f(x_1, t - x_1)\, dx_1 \right\} dt \\
\Rightarrow f_Y(y) &= F_Y'(y) = \int_{-\infty}^{\infty} f(x, y - x)\, dx.
\end{aligned}$$

Proof. (2)

Take $f(x, y - x) = f_{X_1}(x) f_{X_2}(y - x)$ if $X_1, X_2$ independent.

Figure 14: Theorem 1.8.2

Examples

1. Suppose $X_1 \sim B(n_1, p)$ and $X_2 \sim B(n_2, p)$ independently. Find the probability function of $Y = X_1 + X_2$.

Solution.

$$\begin{aligned}
P_Y(y) &= \sum_x P_{X_1}(x) P_{X_2}(y - x) \\
&= \sum_{x=\max(0,\, y - n_2)}^{\min(n_1,\, y)} \binom{n_1}{x} p^x (1-p)^{n_1 - x} \binom{n_2}{y - x} p^{y-x} (1-p)^{n_2 + x - y} \\
&= p^y (1-p)^{n_1 + n_2 - y} \sum_{x=\max(0,\, y - n_2)}^{\min(n_1,\, y)} \binom{n_1}{x} \binom{n_2}{y - x} \\
&= \binom{n_1 + n_2}{y} p^y (1-p)^{n_1 + n_2 - y} \underbrace{\sum_{x=\max(0,\, y - n_2)}^{\min(n_1,\, y)} \frac{\binom{n_1}{x} \binom{n_2}{y - x}}{\binom{n_1 + n_2}{y}}}_{\text{sum of hypergeometric probability function} \;=\; 1} \\
&= \binom{n_1 + n_2}{y} p^y (1-p)^{n_1 + n_2 - y},
\end{aligned}$$

i.e., $Y \sim B(n_1 + n_2, p)$.



2. Suppose $Z_1 \sim N(0,1)$ and $Z_2 \sim N(0,1)$ independently. Let $X = Z_1 + Z_2$. Find the PDF of $X$.

Solution.

$$\begin{aligned}
f_X(x) &= \int_{-\infty}^{\infty} \phi(z)\phi(x - z)\, dz \\
&= \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}} e^{-z^2/2}\, \frac{1}{\sqrt{2\pi}} e^{-(x-z)^2/2}\, dz \\
&= \int_{-\infty}^{\infty} \frac{1}{2\pi} e^{-\frac{1}{2}\left(z^2 + (x-z)^2\right)}\, dz \\
&= \int_{-\infty}^{\infty} \frac{1}{2\pi} e^{-\frac{1}{2}\cdot\frac{(z - x/2)^2}{1/2}}\, e^{-x^2/4}\, dz \\
&= \frac{1}{2\sqrt{\pi}}\, e^{-x^2/4} \underbrace{\int_{-\infty}^{\infty} \frac{1}{\sqrt{\pi}}\, e^{-\frac{(z - x/2)^2}{2 \times \frac{1}{2}}}\, dz}_{=\,1,\ \text{the } N\left(\frac{x}{2}, \frac{1}{2}\right) \text{ PDF}} \\
&= \frac{1}{2\sqrt{\pi}}\, e^{-x^2/4},
\end{aligned}$$

i.e., $X \sim N(0, 2)$.
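The convolution integral in this example can also be evaluated numerically; a sketch using a simple Riemann sum (the grid limits and step size are arbitrary):

```python
import numpy as np

# Numerically convolve two standard normal densities and compare with the
# N(0, 2) density (1/(2*sqrt(pi))) * exp(-x^2/4) at a few points.
def phi(z):
    return np.exp(-z**2 / 2) / np.sqrt(2 * np.pi)

z = np.linspace(-10.0, 10.0, 20001)
dz = z[1] - z[0]
for x in (0.0, 1.0, 2.5):
    conv = (phi(z) * phi(x - z)).sum() * dz           # ≈ ∫ φ(z) φ(x−z) dz
    exact = np.exp(-x**2 / 4) / (2 * np.sqrt(np.pi))  # N(0, 2) PDF
    print(x, conv, exact)
```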

Theorem. 1.8.3 (Ratio of continuous RVs)

Suppose $X_1, X_2$ are continuous with joint PDF $f(x_1, x_2)$, and let $Y = \dfrac{X_2}{X_1}$. Then $Y$ has PDF

$$f_Y(y) = \int_{-\infty}^{\infty} |x| f(x, yx)\, dx.$$

If $X_1, X_2$ are independent, we obtain

$$f_Y(y) = \int_{-\infty}^{\infty} |x| f_{X_1}(x) f_{X_2}(yx)\, dx.$$


Proof.

$$\begin{aligned}
F_Y(y) &= P(\{Y \leq y\}) \\
&= P(\{Y \leq y\} \cap \{X_1 < 0\}) + P(\{Y \leq y\} \cap \{X_1 > 0\}) \\
&= P(\{X_2 \geq y X_1\} \cap \{X_1 < 0\}) + P(\{X_2 \leq y X_1\} \cap \{X_1 > 0\}) \\
&= \int_{-\infty}^{0} \int_{x_1 y}^{\infty} f(x_1, x_2)\, dx_2\, dx_1 + \int_{0}^{\infty} \int_{-\infty}^{x_1 y} f(x_1, x_2)\, dx_2\, dx_1.
\end{aligned}$$

Substitute $x_2 = t x_1$ (so $dx_2 = x_1\, dt$) in both inner integrals:

$$\begin{aligned}
F_Y(y) &= \int_{-\infty}^{0} \int_{y}^{-\infty} x_1 f(x_1, t x_1)\, dt\, dx_1 + \int_{0}^{\infty} \int_{-\infty}^{y} x_1 f(x_1, t x_1)\, dt\, dx_1 \\
&= \int_{-\infty}^{0} \int_{-\infty}^{y} (-x_1) f(x_1, t x_1)\, dt\, dx_1 + \int_{0}^{\infty} \int_{-\infty}^{y} x_1 f(x_1, t x_1)\, dt\, dx_1 \\
&= \int_{-\infty}^{0} \int_{-\infty}^{y} |x_1| f(x_1, t x_1)\, dt\, dx_1 + \int_{0}^{\infty} \int_{-\infty}^{y} |x_1| f(x_1, t x_1)\, dt\, dx_1 \\
\Rightarrow f_Y(y) &= F_Y'(y) = \int_{-\infty}^{\infty} |x_1| f(x_1, y x_1)\, dx_1.
\end{aligned}$$

Example

1. If $Z \sim N(0,1)$ and $X \sim \chi^2_k$ independently, then $T = \dfrac{Z}{\sqrt{X/k}}$ is said to have the $t$-distribution with $k$ degrees of freedom. Derive the PDF of $T$.

Solution. Step 1:

Let $V = \sqrt{X/k}$. We need to find the PDF of $V$. Recall $\chi^2_k$ is $\text{Gamma}\left(\frac{k}{2}, \frac{1}{2}\right)$, so

$$f_X(x) = \frac{(1/2)^{k/2}}{\Gamma(k/2)}\, x^{k/2 - 1} e^{-x/2}, \quad \text{for } x > 0.$$

Now,

$$v = h(x) = \sqrt{x/k} \;\Rightarrow\; h^{-1}(v) = k v^2 \quad \text{and} \quad (h^{-1})'(v) = 2kv,$$

$$\begin{aligned}
\Rightarrow f_V(v) &= f_X(h^{-1}(v))\, |(h^{-1})'(v)| \\
&= \frac{(1/2)^{k/2}}{\Gamma(k/2)} (k v^2)^{k/2 - 1} e^{-k v^2/2} (2 k v) \\
&= \frac{2 (k/2)^{k/2}}{\Gamma(k/2)}\, v^{k-1} e^{-k v^2/2}, \quad v > 0.
\end{aligned}$$

Step 2:

Apply Theorem 1.8.3 to find the PDF of $T = \dfrac{Z}{V}$:

$$\begin{aligned}
f_T(t) &= \int_{-\infty}^{\infty} |v| f_V(v) f_Z(vt)\, dv \\
&= \int_{0}^{\infty} v\, \frac{2 (k/2)^{k/2}}{\Gamma(k/2)}\, v^{k-1} e^{-k v^2/2}\, \frac{1}{\sqrt{2\pi}}\, e^{-t^2 v^2/2}\, dv \\
&= \frac{(k/2)^{k/2}}{\Gamma(k/2)\sqrt{2\pi}} \int_{0}^{\infty} v^{k-1} e^{-\frac{1}{2}(k + t^2) v^2}\, 2v\, dv;
\end{aligned}$$

substitute $u = v^2$, so $du = 2v\, dv$:

$$\begin{aligned}
&= \frac{(k/2)^{k/2}}{\Gamma(k/2)\sqrt{2\pi}} \int_{0}^{\infty} u^{\frac{k-1}{2}} e^{-\frac{1}{2}(k + t^2) u}\, du \\
&= \frac{(k/2)^{k/2}}{\Gamma(k/2)\sqrt{2\pi}} \cdot \frac{\Gamma\left(\frac{k+1}{2}\right)}{\left(\frac{1}{2}(k + t^2)\right)^{(k+1)/2}} \qquad \left[\text{using } \int_0^\infty u^{\alpha-1} e^{-\lambda u}\, du = \frac{\Gamma(\alpha)}{\lambda^\alpha}\right] \\
&= \frac{\Gamma\left(\frac{k+1}{2}\right)}{\Gamma(k/2)\sqrt{2\pi}} \left(\frac{2}{k}\right)^{1/2} \left(1 + \frac{t^2}{k}\right)^{-\frac{k+1}{2}} \\
&= \frac{\Gamma\left(\frac{k+1}{2}\right)}{\Gamma\left(\frac{k}{2}\right)\sqrt{k\pi}} \left(1 + \frac{t^2}{k}\right)^{-\frac{k+1}{2}}.
\end{aligned}$$

Remarks:

1. If we take $k = 1$, we obtain

$$f(t) \propto \frac{1}{1 + t^2};$$

that is, $t_1$ is the Cauchy distribution. One can also check that $\dfrac{\Gamma(1)}{\Gamma\left(\frac{1}{2}\right)\sqrt{\pi}} = \dfrac{1}{\pi}$ as required, so that $f(t) = \dfrac{1}{\pi}\cdot\dfrac{1}{1 + t^2}$.

2. As $k \to \infty$, $t_k \to N(0,1)$. To see this directly, consider the limit of the $t$-density as $k \to \infty$ for $t$ fixed:

$$\lim_{k \to \infty} \left(1 + \frac{t^2}{k}\right)^{-\frac{k+1}{2}}
= \lim_{k \to \infty} \left(1 + \frac{t^2}{k}\right)^{-\frac{k}{2}}
= \lim_{l \to \infty} \left(1 + \frac{t^2/2}{l}\right)^{-l}
= e^{-t^2/2}, \quad \text{where } l = \frac{k}{2}.$$

Recall the standard limit:

$$\lim_{n \to \infty}\left(1 + \frac{a}{n}\right)^n = e^a.$$

Together with the limit of the normalising constant, this gives

$$f_T(t) \to \frac{1}{\sqrt{2\pi}}\, e^{-t^2/2} \quad \text{as } k \to \infty.$$
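The derived density and both remarks can be checked numerically; a small sketch (the evaluation points and the choice $k = 500$ for "large $k$" are arbitrary):

```python
import math

# Evaluate the t-density via log-gamma (to avoid overflow for large k):
# f_T(t) = Gamma((k+1)/2) / (Gamma(k/2) * sqrt(k*pi)) * (1 + t^2/k)^(-(k+1)/2)
def t_pdf(t, k):
    logc = (math.lgamma((k + 1) / 2) - math.lgamma(k / 2)
            - 0.5 * math.log(k * math.pi))
    return math.exp(logc - (k + 1) / 2 * math.log1p(t * t / k))

cauchy = t_pdf(1.0, 1)     # k = 1: Cauchy density at t = 1, i.e. 1/(2*pi)
normal0 = t_pdf(0.0, 500)  # large k: close to the N(0,1) density at 0
print(cauchy, 1 / (2 * math.pi))
print(normal0, 1 / math.sqrt(2 * math.pi))
```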

Suppose $h : \mathbb{R}^r \to \mathbb{R}^r$ is continuously differentiable. That is,

$$h(x_1, x_2, \ldots, x_r) = (h_1(x_1, \ldots, x_r), h_2(x_1, \ldots, x_r), \ldots, h_r(x_1, \ldots, x_r)),$$

where each $h_i(\mathbf{x})$ is continuously differentiable. Let

$$H = \begin{bmatrix}
\frac{\partial h_1}{\partial x_1} & \frac{\partial h_1}{\partial x_2} & \cdots & \frac{\partial h_1}{\partial x_r} \\
\frac{\partial h_2}{\partial x_1} & \frac{\partial h_2}{\partial x_2} & \cdots & \frac{\partial h_2}{\partial x_r} \\
\vdots & \vdots & & \vdots \\
\frac{\partial h_r}{\partial x_1} & \frac{\partial h_r}{\partial x_2} & \cdots & \frac{\partial h_r}{\partial x_r}
\end{bmatrix}.$$

If $H$ is invertible for all $\mathbf{x}$, then there exists an inverse mapping $g : \mathbb{R}^r \to \mathbb{R}^r$ with the property that

$$g(h(\mathbf{x})) = \mathbf{x}.$$

It can be proved that the matrix of partial derivatives

$$G = \left(\frac{\partial g_i}{\partial y_j}\right) \quad \text{satisfies} \quad G = H^{-1}.$$

Theorem. 1.8.4

Suppose $X_1, X_2, \ldots, X_r$ have joint PDF $f_X(x_1, \ldots, x_r)$, and let $h, g, G$ be as above. If $\mathbf{Y} = h(\mathbf{X})$, then $\mathbf{Y}$ has joint PDF

$$f_Y(y_1, y_2, \ldots, y_r) = f_X(g(\mathbf{y}))\, |\det G(\mathbf{y})|.$$

Remark:

One can sometimes use $H^{-1}$ instead of $G$, but care is needed to evaluate $H^{-1}(\mathbf{x})$ at $\mathbf{x} = h^{-1}(\mathbf{y}) = g(\mathbf{y})$.

1.8.1 Method of regular transformations

Suppose $\mathbf{X} = (X_1, \ldots, X_r)^T$ and $h : \mathbb{R}^r \to \mathbb{R}^s$, where $s < r$, with

$$\mathbf{Y} = (Y_1, \ldots, Y_s)^T = h(\mathbf{X}).$$

One approach is as follows:

1. Choose a function $d : \mathbb{R}^r \to \mathbb{R}^{r-s}$ and let $\mathbf{Z} = d(\mathbf{X}) = (Z_1, Z_2, \ldots, Z_{r-s})^T$.

2. Apply Theorem 1.8.4 to $(Y_1, Y_2, \ldots, Y_s, Z_1, Z_2, \ldots, Z_{r-s})^T$.

3. Integrate over $Z_1, Z_2, \ldots, Z_{r-s}$ to get the marginal PDF of $\mathbf{Y}$.

Examples

Suppose $Z_1 \sim N(0,1)$ and $Z_2 \sim N(0,1)$ independently. Consider $h(z_1, z_2) = (r, \theta)^T$, where

$$r = h_1(z_1, z_2) = \sqrt{z_1^2 + z_2^2}, \qquad
\theta = h_2(z_1, z_2) = \begin{cases}
\arctan\left(\dfrac{z_2}{z_1}\right) & \text{for } z_1 > 0 \\[4pt]
\arctan\left(\dfrac{z_2}{z_1}\right) + \pi & \text{for } z_1 < 0 \\[4pt]
\dfrac{\pi}{2}\,\text{sgn}\, z_2 & \text{for } z_1 = 0,\ z_2 \neq 0 \\[4pt]
0 & \text{for } z_1 = z_2 = 0.
\end{cases}$$

Thus $h$ maps $\mathbb{R}^2 \to [0, \infty) \times \left[-\frac{\pi}{2}, \frac{3\pi}{2}\right)$.

The inverse mapping, $g$, can be seen to be:

$$g(r, \theta) = \begin{pmatrix} g_1(r, \theta) \\ g_2(r, \theta) \end{pmatrix} = \begin{pmatrix} r\cos\theta \\ r\sin\theta \end{pmatrix}.$$

To find the distribution of $(R, \Theta)$:

Step 1

$$G(r, \theta) = \begin{bmatrix}
\frac{\partial}{\partial r} g_1(r, \theta) & \frac{\partial}{\partial \theta} g_1(r, \theta) \\[4pt]
\frac{\partial}{\partial r} g_2(r, \theta) & \frac{\partial}{\partial \theta} g_2(r, \theta)
\end{bmatrix}$$

$$\begin{aligned}
\frac{\partial}{\partial r}(r\cos\theta) &= \cos\theta, & \frac{\partial}{\partial \theta}(r\cos\theta) &= -r\sin\theta, \\
\frac{\partial}{\partial r}(r\sin\theta) &= \sin\theta, & \frac{\partial}{\partial \theta}(r\sin\theta) &= r\cos\theta,
\end{aligned}$$

$$\Rightarrow G = \begin{bmatrix} \cos\theta & -r\sin\theta \\ \sin\theta & r\cos\theta \end{bmatrix}
\;\Rightarrow\; \det(G) = r\cos^2\theta - (-r\sin^2\theta) = r\cos^2\theta + r\sin^2\theta = r.$$

Step 2

Now apply Theorem 1.8.4. Recall,

$$f_Z(z_1, z_2) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}z_1^2} \times \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}z_2^2} = \frac{1}{2\pi}\, e^{-\frac{1}{2}(z_1^2 + z_2^2)}.$$

$$\begin{aligned}
f_{R,\Theta}(r, \theta) &= f_Z(g(r, \theta))\, |\det G(r, \theta)| \\
&= f_Z(r\cos\theta, r\sin\theta)\, |r| \\
&= \begin{cases} \dfrac{1}{2\pi}\, r\, e^{-\frac{1}{2}r^2} & \text{for } r \geq 0,\ -\frac{\pi}{2} \leq \theta < \frac{3\pi}{2} \\ 0 & \text{otherwise} \end{cases} \\
&= f_\Theta(\theta) f_R(r),
\end{aligned}$$

where

$$f_\Theta(\theta) = \begin{cases} \dfrac{1}{2\pi} & -\frac{\pi}{2} \leq \theta < \frac{3\pi}{2} \\ 0 & \text{otherwise} \end{cases}
\qquad\text{and}\qquad
f_R(r) = r e^{-r^2/2} \quad \text{for } r \geq 0.$$

Hence, if $Z_1, Z_2$ are i.i.d. $N(0,1)$ and we transform to polar coordinates $(R, \Theta)$, we find:

1. $R$ and $\Theta$ are independent.

2. $\Theta \sim U\left(-\frac{\pi}{2}, \frac{3\pi}{2}\right)$.

3. $R$ has the Rayleigh distribution.

It is easy to check that $R^2 \sim \text{Exp}(1/2) \equiv \text{Gamma}(1, 1/2) \equiv \chi^2_2$.
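The conclusions of the polar-coordinates example can be checked by simulation; a sketch (note that numpy's `arctan2` returns angles in $(-\pi, \pi]$ rather than the $[-\frac{\pi}{2}, \frac{3\pi}{2})$ convention above, which only relabels the angle):

```python
import numpy as np

# Transform i.i.d. N(0,1) pairs to polar coordinates and check that
# R^2 = Z1^2 + Z2^2 behaves like chi^2_2 = Exp(1/2), which has mean 2.
rng = np.random.default_rng(2)
z1 = rng.standard_normal(200_000)
z2 = rng.standard_normal(200_000)
r2 = z1**2 + z2**2
theta = np.arctan2(z2, z1)        # angle, uniform over its range
print(r2.mean())                  # near 2
print(theta.min(), theta.max())   # spread over (-pi, pi]
```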

Example

Suppose $X_1 \sim \text{Gamma}(\alpha, \lambda)$ and $X_2 \sim \text{Gamma}(\beta, \lambda)$, independently. Find the distribution of $Y = X_1/(X_1 + X_2)$.

Solution. Step 1: try using $Z = X_1 + X_2$.

Step 2: If $h(x_1, x_2) = \begin{pmatrix} y = \frac{x_1}{x_1 + x_2} \\ z = x_1 + x_2 \end{pmatrix}$, then $g(y, z) = \begin{pmatrix} yz \\ (1-y)z \end{pmatrix}$.

$$G = \begin{pmatrix} z & y \\ -z & 1 - y \end{pmatrix}
\;\Rightarrow\; \det(G) = z(1 - y) - (-zy) = z(1-y) + zy = z,$$

where $z > 0$. (Why?)

$$\begin{aligned}
f_{Y,Z}(y, z) &= f_{X_1}(yz)\, f_{X_2}((1-y)z)\, z \\
&= \frac{\lambda^\alpha}{\Gamma(\alpha)} (yz)^{\alpha-1} e^{-\lambda yz}\, \frac{\lambda^\beta}{\Gamma(\beta)} ((1-y)z)^{\beta-1} e^{-\lambda(1-y)z}\, z \\
&= \frac{1}{\Gamma(\alpha)\Gamma(\beta)}\, \lambda^{\alpha+\beta}\, y^{\alpha-1} (1-y)^{\beta-1} z^{\alpha+\beta-1} e^{-\lambda z}.
\end{aligned}$$

Step 3:

$$\begin{aligned}
f_Y(y) &= \int_0^\infty f(y, z)\, dz \\
&= \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\, y^{\alpha-1} (1-y)^{\beta-1} \int_0^\infty \frac{\lambda^{\alpha+\beta} z^{\alpha+\beta-1}}{\Gamma(\alpha+\beta)}\, e^{-\lambda z}\, dz \\
&= \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\, y^{\alpha-1} (1-y)^{\beta-1}, \quad \text{for } 0 < y < 1,
\end{aligned}$$

i.e., $Y \sim \text{Beta}(\alpha, \beta)$.

Exercise: Justify the range of values for $y$.
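The Gamma-ratio example can be checked against the Beta moments by simulation; a sketch (the values $\alpha = 2$, $\beta = 3$, $\lambda = 1.5$ are arbitrary):

```python
import numpy as np

# With X1 ~ Gamma(alpha, lam) and X2 ~ Gamma(beta, lam) independent,
# Y = X1/(X1 + X2) should be Beta(alpha, beta): mean alpha/(alpha+beta),
# variance alpha*beta/((alpha+beta)^2 * (alpha+beta+1)).
rng = np.random.default_rng(3)
alpha, beta, lam = 2.0, 3.0, 1.5
x1 = rng.gamma(shape=alpha, scale=1 / lam, size=300_000)
x2 = rng.gamma(shape=beta, scale=1 / lam, size=300_000)
y = x1 / (x1 + x2)
print(y.mean(), y.var())   # near 0.4 and 0.04
```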

1.9 Moments

Suppose $X_1, X_2, \ldots, X_r$ are RVs. If $Y = h(\mathbf{X})$ is defined by a real-valued function $h$, then to find $E(Y)$ we can:

1. Find the distribution of $Y$.

2. Calculate

$$E(Y) = \begin{cases} \displaystyle\int_{-\infty}^{\infty} y f_Y(y)\, dy & \text{continuous} \\[6pt] \displaystyle\sum_y y\, p(y) & \text{discrete.} \end{cases}$$

Theorem. 1.9.1

If $h$ and $X_1, \ldots, X_r$ are as above, then provided it exists,

$$E\{h(\mathbf{X})\} = \begin{cases}
\displaystyle\sum_{x_1}\sum_{x_2}\cdots\sum_{x_r} h(x_1, \ldots, x_r)\, p(x_1, \ldots, x_r) & \text{if } X_1, \ldots, X_r \text{ discrete} \\[8pt]
\displaystyle\int_{-\infty}^{\infty}\cdots\int_{-\infty}^{\infty} h(x_1, \ldots, x_r) f(x_1, \ldots, x_r)\, dx_1 \ldots dx_r & \text{if } X_1, \ldots, X_r \text{ continuous.}
\end{cases}$$

Theorem. 1.9.2

Suppose $X_1, \ldots, X_r$ are RVs, $h_1, \ldots, h_k$ are real-valued functions and $a_1, \ldots, a_k$ are constants. Then, provided it exists,

$$E\{a_1 h_1(\mathbf{X}) + a_2 h_2(\mathbf{X}) + \cdots + a_k h_k(\mathbf{X})\} = a_1 E[h_1(\mathbf{X})] + a_2 E[h_2(\mathbf{X})] + \cdots + a_k E[h_k(\mathbf{X})].$$

Proof. (Continuous case)

$$\begin{aligned}
E\{a_1 h_1(\mathbf{X}) + \cdots + a_k h_k(\mathbf{X})\}
&= \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} \{a_1 h_1(\mathbf{x}) + \cdots + a_k h_k(\mathbf{x})\}\, f_X(\mathbf{x})\, dx_1\, dx_2 \ldots dx_r \\
&= a_1 \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} h_1(\mathbf{x}) f_X(\mathbf{x})\, dx_1 \ldots dx_r \\
&\quad + a_2 \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} h_2(\mathbf{x}) f_X(\mathbf{x})\, dx_1 \ldots dx_r + \cdots \\
&\quad + a_k \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} h_k(\mathbf{x}) f_X(\mathbf{x})\, dx_1 \ldots dx_r \\
&= a_1 E[h_1(\mathbf{X})] + a_2 E[h_2(\mathbf{X})] + \cdots + a_k E[h_k(\mathbf{X})],
\end{aligned}$$

as required.

Corollary.

Provided it exists,

$$E(a_1 X_1 + a_2 X_2 + \cdots + a_r X_r) = a_1 E[X_1] + a_2 E[X_2] + \cdots + a_r E[X_r].$$

45


1. DISTRIBUTION THEORY<br />

Definition. 1.9.1

If $X_1, X_2$ are RVs with $E[X_i] = \mu_i$ and $\text{Var}(X_i) = \sigma_i^2$, $i = 1, 2$, we define

$$\sigma_{12} = \text{Cov}(X_1, X_2) = E[(X_1 - \mu_1)(X_2 - \mu_2)] = E[X_1 X_2] - \mu_1\mu_2,$$

$$\rho_{12} = \text{Corr}(X_1, X_2) = \frac{\sigma_{12}}{\sigma_1 \sigma_2}.$$

Remark:

$$\text{Cov}(X, X) = \text{Var}(X) = E[X^2] - (E[X])^2.$$

In some contexts, it is convenient to use the notation $\sigma_{ii} = \text{Var}(X_i)$ instead of $\sigma_i^2$.

Theorem. 1.9.3

Suppose $X_1, X_2, \ldots, X_r$ are RVs with $E[X_i] = \mu_i$ and $\text{Cov}(X_i, X_j) = \sigma_{ij}$, and let $a_1, a_2, \ldots, a_r, b_1, b_2, \ldots, b_r$ be constants. Then

$$\text{Cov}\left(\sum_{i=1}^r a_i X_i,\ \sum_{j=1}^r b_j X_j\right) = \sum_{i=1}^r \sum_{j=1}^r a_i b_j \sigma_{ij}.$$


Proof.

$$\begin{aligned}
\text{Cov}\left(\sum_{i=1}^r a_i X_i,\ \sum_{j=1}^r b_j X_j\right)
&= E\left[\left\{\sum_{i=1}^r a_i X_i - E\left(\sum_{i=1}^r a_i X_i\right)\right\}\left\{\sum_{j=1}^r b_j X_j - E\left(\sum_{j=1}^r b_j X_j\right)\right\}\right] \\
&= E\left[\left(\sum_{i=1}^r a_i X_i - \sum_{i=1}^r a_i \mu_i\right)\left(\sum_{j=1}^r b_j X_j - \sum_{j=1}^r b_j \mu_j\right)\right] \\
&= E\left[\left\{\sum_{i=1}^r a_i (X_i - \mu_i)\right\}\left\{\sum_{j=1}^r b_j (X_j - \mu_j)\right\}\right] \\
&= E\left[\sum_{i=1}^r \sum_{j=1}^r a_i b_j (X_i - \mu_i)(X_j - \mu_j)\right] \\
&= \sum_{i=1}^r \sum_{j=1}^r a_i b_j E\{(X_i - \mu_i)(X_j - \mu_j)\} \qquad \text{(Theorem 1.9.2)} \\
&= \sum_{i=1}^r \sum_{j=1}^r a_i b_j \sigma_{ij},
\end{aligned}$$

as required.

Corollary. 1

Under the above assumptions,

$$\text{Var}\left(\sum_{i=1}^r a_i X_i\right) = \sum_{i=1}^r \sum_{j=1}^r a_i a_j \sigma_{ij} = \sum_{i=1}^r a_i^2 \sigma_i^2 + 2\sum_{i<j} a_i a_j \sigma_{ij},$$

since the variance of a sum is its covariance with itself.

Corollary. 2

$$-1 \leq \rho_{12} \leq 1,$$

with $|\rho_{12}| = 1$ only if $X_1$ and $X_2$ are linearly related with probability 1.

Proof.

Consider $0 \leq \text{Var}(X_1 - tX_2)$ for any constant $t$:

$$\text{Var}(X_1 - tX_2) = \sigma_1^2 + t^2\sigma_2^2 - 2t\sigma_{12} \qquad \text{(by Corollary 1)}.$$

Hence the quadratic

$$q(t) = \sigma_2^2 t^2 - 2\sigma_{12} t + \sigma_1^2 \geq 0, \quad \text{for all } t,$$

so $q(t)$ has either no real roots ($\Delta = b^2 - 4ac < 0$) or one real root ($\Delta = 0$). Hence $\Delta \leq 0$:

$$4\sigma_{12}^2 - 4\sigma_2^2\sigma_1^2 \leq 0
\;\Rightarrow\; 4\sigma_{12}^2 \leq 4\sigma_1^2\sigma_2^2
\;\Rightarrow\; \frac{\sigma_{12}^2}{\sigma_1^2\sigma_2^2} \leq 1 \quad (\sigma\text{'s positive})
\;\Rightarrow\; \rho^2 \leq 1
\;\Rightarrow\; |\rho| \leq 1.$$

Suppose $|\rho| = 1$. Then $\Delta = 0$, so there is a single $t_0$ such that

$$0 = q(t_0) = \text{Var}(X_1 - t_0 X_2),$$

i.e., $X_1 = t_0 X_2 + c$ with probability 1.
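Theorem 1.9.3 can be verified numerically for a particular covariance matrix; a sketch (the matrix $\Sigma$ and vectors $\mathbf{a}$, $\mathbf{b}$ below are arbitrary choices):

```python
import numpy as np

# Check Cov(a'X, b'X) = a' Sigma b empirically for multivariate normal X.
rng = np.random.default_rng(4)
Sigma = np.array([[2.0, 0.6, 0.3],
                  [0.6, 1.0, 0.2],
                  [0.3, 0.2, 1.5]])
X = rng.multivariate_normal(np.zeros(3), Sigma, size=500_000)
a = np.array([1.0, -2.0, 0.5])
b = np.array([0.5, 1.0, 1.0])
emp = np.cov(X @ a, X @ b)[0, 1]   # empirical covariance of a'X and b'X
print(emp, a @ Sigma @ b)          # close to each other
```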

Theorem. 1.9.4

If $X_1, X_2$ are independent, and $\text{Var}(X_1)$, $\text{Var}(X_2)$ exist, then $\text{Cov}(X_1, X_2) = 0$.

Proof. (Continuous case)

$$\begin{aligned}
\text{Cov}(X_1, X_2) &= E\{(X_1 - \mu_1)(X_2 - \mu_2)\} \\
&= \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} (x_1 - \mu_1)(x_2 - \mu_2) f(x_1, x_2)\, dx_1\, dx_2 \\
&= \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} (x_1 - \mu_1)(x_2 - \mu_2) f_{X_1}(x_1) f_{X_2}(x_2)\, dx_1\, dx_2 \quad (\text{since } X_1, X_2 \text{ independent}) \\
&= \left(\int_{-\infty}^{\infty} (x_1 - \mu_1) f_{X_1}(x_1)\, dx_1\right)\left(\int_{-\infty}^{\infty} (x_2 - \mu_2) f_{X_2}(x_2)\, dx_2\right) \\
&= 0 \times 0 = 0.
\end{aligned}$$

Remark:

But the converse does NOT apply, in general! That is, $\text{Cov}(X_1, X_2) = 0 \not\Rightarrow X_1, X_2$ independent.

Definition. 1.9.2

If $X_1, X_2$ are RVs, we define the symbol $E[X_1|X_2]$ to be the expectation of $X_1$ calculated with respect to the conditional distribution of $X_1|X_2$, i.e.,

$$E[X_1|X_2] = \begin{cases}
\displaystyle\sum_{x_1} x_1 P_{X_1|X_2}(x_1|x_2) & X_1|X_2 \text{ discrete} \\[8pt]
\displaystyle\int_{-\infty}^{\infty} x_1 f_{X_1|X_2}(x_1|x_2)\, dx_1 & X_1|X_2 \text{ continuous.}
\end{cases}$$

Theorem. 1.9.5

Provided the relevant moments exist,

$$E(X_1) = E_{X_2}\{E(X_1|X_2)\},$$

$$\text{Var}(X_1) = E_{X_2}\{\text{Var}(X_1|X_2)\} + \text{Var}_{X_2}\{E(X_1|X_2)\}.$$


Proof.

1. (Continuous case)

$$\begin{aligned}
E_{X_2}\{E(X_1|X_2)\} &= \int_{-\infty}^{\infty} E(X_1|x_2) f_{X_2}(x_2)\, dx_2 \\
&= \int_{-\infty}^{\infty} \left(\int_{-\infty}^{\infty} x_1 f(x_1|x_2)\, dx_1\right) f_{X_2}(x_2)\, dx_2 \\
&= \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} x_1 f(x_1|x_2) f_{X_2}(x_2)\, dx_1\, dx_2 \\
&= \int_{-\infty}^{\infty} x_1 \underbrace{\int_{-\infty}^{\infty} f(x_1, x_2)\, dx_2}_{=\, f_{X_1}(x_1)}\, dx_1 \\
&= \int_{-\infty}^{\infty} x_1 f_{X_1}(x_1)\, dx_1 = E(X_1).
\end{aligned}$$

1.9.1 Moment generating functions

Definition. 1.9.3

If $X_1, X_2, \ldots, X_r$ are RVs, then the joint MGF (provided it exists) is given by

$$M_X(\mathbf{t}) = M_X(t_1, t_2, \ldots, t_r) = E\left[e^{t_1 X_1 + t_2 X_2 + \cdots + t_r X_r}\right].$$

Theorem. 1.9.6

If $X_1, X_2, \ldots, X_r$ are mutually independent, then the joint MGF satisfies

$$M_X(t_1, \ldots, t_r) = M_{X_1}(t_1) M_{X_2}(t_2) \cdots M_{X_r}(t_r),$$

provided it exists.


Proof. (Theorem 1.9.6, continuous case)

$$\begin{aligned}
M_X(t_1, \ldots, t_r) &= \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} e^{t_1 x_1 + t_2 x_2 + \cdots + t_r x_r} f_X(x_1, \ldots, x_r)\, dx_1 \ldots dx_r \\
&= \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} e^{t_1 x_1} e^{t_2 x_2} \cdots e^{t_r x_r} f_{X_1}(x_1) f_{X_2}(x_2) \cdots f_{X_r}(x_r)\, dx_1 \ldots dx_r \\
&= \left(\int_{-\infty}^{\infty} e^{t_1 x_1} f_{X_1}(x_1)\, dx_1\right)\left(\int_{-\infty}^{\infty} e^{t_2 x_2} f_{X_2}(x_2)\, dx_2\right) \cdots \left(\int_{-\infty}^{\infty} e^{t_r x_r} f_{X_r}(x_r)\, dx_r\right) \\
&= M_{X_1}(t_1) M_{X_2}(t_2) \cdots M_{X_r}(t_r), \quad \text{as required.}
\end{aligned}$$

We saw previously that if $Y = h(\mathbf{X})$, then we can find $M_Y(t) = E[e^{t h(\mathbf{X})}]$ from the joint distribution of $X_1, \ldots, X_r$ without calculating $f_Y(y)$ explicitly. A simple, but important, case is

$$Y = X_1 + X_2 + \cdots + X_r.$$

Theorem. 1.9.7

Suppose $X_1, \ldots, X_r$ are RVs and let $Y = X_1 + X_2 + \cdots + X_r$. Then (assuming MGFs exist):

1. $M_Y(t) = M_X(t, t, \ldots, t)$.

2. If $X_1, \ldots, X_r$ are independent then $M_Y(t) = M_{X_1}(t) M_{X_2}(t) \cdots M_{X_r}(t)$.

3. If $X_1, \ldots, X_r$ are independent and identically distributed with common MGF $M_X(t)$, then:

$$M_Y(t) = \left[M_X(t)\right]^r.$$


Proof.

$$M_Y(t) = E[e^{tY}] = E[e^{t(X_1 + X_2 + \cdots + X_r)}] = E[e^{tX_1 + tX_2 + \cdots + tX_r}] = M_X(t, t, \ldots, t) \quad \text{(using Definition 1.9.3)}.$$

For parts (2) and (3), substitute into (1) and use Theorem 1.9.6.

Examples

Consider RVs $X, V$ defined by $V \sim \text{Gamma}(\alpha, \lambda)$ and the conditional distribution $X|V \sim \text{Po}(V)$, i.e., Poisson with mean $V$. Find $E(X)$ and $\text{Var}(X)$.

Solution. Use

$$E(X) = E\{E(X|V)\}, \qquad \text{Var}(X) = E_V\{\text{Var}(X|V)\} + \text{Var}_V\{E(X|V)\}.$$

Since $E(X|V) = V$ and $\text{Var}(X|V) = V$,

$$E(X) = E\{E(X|V)\} = E(V) = \frac{\alpha}{\lambda},$$

$$\begin{aligned}
\text{Var}(X) &= E_V\{\text{Var}(X|V)\} + \text{Var}_V\{E(X|V)\} \\
&= E_V(V) + \text{Var}_V(V) \\
&= \frac{\alpha}{\lambda} + \frac{\alpha}{\lambda^2}
= \frac{\alpha}{\lambda}\left(1 + \frac{1}{\lambda}\right)
= \alpha\left(\frac{1 + \lambda}{\lambda^2}\right).
\end{aligned}$$

Remark

The marginal distribution of $X$ is sometimes called the negative binomial distribution. In particular, when $\alpha$ is an integer, it corresponds to the definition given previously with

$$p = \frac{\lambda}{1 + \lambda}.$$
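The Poisson-Gamma example can be checked by simulation; a sketch (the values $\alpha = 3$, $\lambda = 2$ are arbitrary):

```python
import numpy as np

# Simulate V ~ Gamma(alpha, lam) and X|V ~ Poisson(V), then compare the
# moments of X with alpha/lam and alpha*(1 + lam)/lam^2.
rng = np.random.default_rng(5)
alpha, lam = 3.0, 2.0
v = rng.gamma(shape=alpha, scale=1 / lam, size=400_000)
x = rng.poisson(v)
print(x.mean())   # near alpha/lam = 1.5
print(x.var())    # near alpha*(1 + lam)/lam**2 = 2.25
```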

Examples

1. Suppose $X \sim$ Bernoulli with parameter $p$. Then $M_X(t) = 1 + p(e^t - 1)$. Now suppose $X_1, X_2, \ldots, X_n$ are i.i.d. Bernoulli with parameter $p$ and

$$Y = X_1 + X_2 + \cdots + X_n;$$

then $M_Y(t) = \left(1 + p(e^t - 1)\right)^n$, which agrees with the formula previously given for the binomial distribution.

2. Suppose $X_1 \sim N(\mu_1, \sigma_1^2)$ and $X_2 \sim N(\mu_2, \sigma_2^2)$ independently. Find the MGF of $Y = X_1 + X_2$.

Solution. Recall that $M_{X_1}(t) = e^{t\mu_1} e^{t^2\sigma_1^2/2}$ and $M_{X_2}(t) = e^{t\mu_2} e^{t^2\sigma_2^2/2}$, so

$$M_Y(t) = M_{X_1}(t) M_{X_2}(t) = e^{t(\mu_1 + \mu_2)} e^{t^2(\sigma_1^2 + \sigma_2^2)/2}$$

$$\Rightarrow Y \sim N(\mu_1 + \mu_2,\ \sigma_1^2 + \sigma_2^2).$$
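Example 1 above can also be checked empirically: the empirical MGF of a sum of i.i.d. Bernoulli variables should match $\left(1 + p(e^t - 1)\right)^n$; a sketch (the values $n = 8$, $p = 0.3$, $t = 0.4$ are arbitrary):

```python
import math
import numpy as np

# Empirical MGF of Y = X1 + ... + Xn for i.i.d. Bernoulli(p) Xi, versus
# the binomial MGF (1 + p*(e^t - 1))^n.
rng = np.random.default_rng(6)
n, p, t = 8, 0.3, 0.4
y = rng.binomial(1, p, size=(200_000, n)).sum(axis=1)
emp = np.exp(t * y).mean()
exact = (1 + p * (math.exp(t) - 1)) ** n
print(emp, exact)
```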

1.9.2 Marginal distributions and the MGF

To find the moment generating function of the marginal distribution of any set of components of $\mathbf{X}$, set to 0 the complementary elements of $\mathbf{t}$ in $M_X(\mathbf{t})$.

Let $\mathbf{X} = (\mathbf{X}_1, \mathbf{X}_2)^T$ and $\mathbf{t} = (\mathbf{t}_1, \mathbf{t}_2)^T$. Then

$$M_{\mathbf{X}_1}(\mathbf{t}_1) = M_X\begin{pmatrix} \mathbf{t}_1 \\ \mathbf{0} \end{pmatrix}.$$

To see this result, note that if $A$ is a constant matrix and $\mathbf{b}$ is a constant vector, then

$$M_{A\mathbf{X}+\mathbf{b}}(\mathbf{t}) = e^{\mathbf{t}^T \mathbf{b}} M_X(A^T \mathbf{t}).$$

Proof.

$$M_{A\mathbf{X}+\mathbf{b}}(\mathbf{t}) = E[e^{\mathbf{t}^T(A\mathbf{X}+\mathbf{b})}] = e^{\mathbf{t}^T\mathbf{b}} E[e^{\mathbf{t}^T A\mathbf{X}}] = e^{\mathbf{t}^T\mathbf{b}} E[e^{(A^T\mathbf{t})^T\mathbf{X}}] = e^{\mathbf{t}^T\mathbf{b}} M_X(A^T\mathbf{t}).$$

Now partition

$$\mathbf{t}_{r\times 1} = \begin{pmatrix} \mathbf{t}_1 \\ \mathbf{t}_2 \end{pmatrix}$$

and let

$$A_{l\times r} = (I_{l\times l}\ \ 0_{l\times m}).$$

Note that

$$A\mathbf{X} = (I_{l\times l}\ \ 0_{l\times m})\begin{pmatrix} \mathbf{X}_1 \\ \mathbf{X}_2 \end{pmatrix} = \mathbf{X}_1
\qquad\text{and}\qquad
A^T\mathbf{t}_1 = \begin{pmatrix} I_{l\times l} \\ 0_{m\times l} \end{pmatrix}\mathbf{t}_1 = \begin{pmatrix} \mathbf{t}_1 \\ \mathbf{0} \end{pmatrix}.$$

Hence,

$$M_{\mathbf{X}_1}(\mathbf{t}_1) = M_{A\mathbf{X}}(\mathbf{t}_1) = M_X(A^T\mathbf{t}_1) = M_X\begin{pmatrix} \mathbf{t}_1 \\ \mathbf{0} \end{pmatrix},$$

as required.

Note that similar results hold for more than two random subvectors.

The major limitation of the MGF is that it may not exist. The characteristic function, on the other hand, is defined for all distributions. Its definition is similar to that of the MGF, with $it$ replacing $t$, where $i = \sqrt{-1}$; the properties of the characteristic function are similar to those of the MGF, but using it requires some familiarity with complex analysis.

1.9.3 Vector notation

Consider the random vector

$$\mathbf{X} = (X_1, X_2, \ldots, X_r)^T,$$

with $E[X_i] = \mu_i$, $\mathrm{Var}(X_i) = \sigma_i^2 = \sigma_{ii}$, $\mathrm{Cov}(X_i, X_j) = \sigma_{ij}$.


Define the mean vector by:

$$\boldsymbol{\mu} = E(\mathbf{X}) = \begin{pmatrix} \mu_1 \\ \mu_2 \\ \vdots \\ \mu_r \end{pmatrix}$$

and the variance matrix (covariance matrix) by:

$$\Sigma = \mathrm{Var}(\mathbf{X}) = [\sigma_{ij}]_{i=1,\ldots,r;\; j=1,\ldots,r}.$$

Finally, the correlation matrix is defined by:

$$R = \begin{pmatrix} 1 & \rho_{12} & \ldots & \rho_{1r} \\ & \ddots & & \vdots \\ & & \ddots & \rho_{r-1,r} \\ & & & 1 \end{pmatrix},$$

where $\rho_{ij} = \dfrac{\sigma_{ij}}{\sqrt{\sigma_{ii}}\sqrt{\sigma_{jj}}}$.

Now suppose that $\mathbf{a} = (a_1, \ldots, a_r)^T$ is a vector of constants and observe that:

$$\mathbf{a}^T \mathbf{x} = a_1 x_1 + a_2 x_2 + \cdots + a_r x_r = \sum_{i=1}^r a_i x_i.$$

Theorem. 1.9.8

Suppose $\mathbf{X}$ has $E(\mathbf{X}) = \boldsymbol{\mu}$ and $\mathrm{Var}(\mathbf{X}) = \Sigma$, and let $\mathbf{a}, \mathbf{b} \in \mathbb{R}^r$ be fixed vectors. Then,

1. $E(\mathbf{a}^T \mathbf{X}) = \mathbf{a}^T \boldsymbol{\mu}$

2. $\mathrm{Var}(\mathbf{a}^T \mathbf{X}) = \mathbf{a}^T \Sigma \mathbf{a}$

3. $\mathrm{Cov}(\mathbf{a}^T \mathbf{X}, \mathbf{b}^T \mathbf{X}) = \mathbf{a}^T \Sigma \mathbf{b}$

Remark

It is easy to check that this is just a re-statement of Theorem 1.9.3 using matrix notation.

Theorem. 1.9.9

Suppose $\mathbf{X}$ is a random vector with $E(\mathbf{X}) = \boldsymbol{\mu}$, $\mathrm{Var}(\mathbf{X}) = \Sigma$, and let $A_{p\times r}$ and $\mathbf{b} \in \mathbb{R}^p$ be fixed. If $\mathbf{Y} = A\mathbf{X} + \mathbf{b}$, then

$$E(\mathbf{Y}) = A\boldsymbol{\mu} + \mathbf{b} \quad\text{and}\quad \mathrm{Var}(\mathbf{Y}) = A\Sigma A^T.$$
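A simulation sketch of Theorem 1.9.9 (assuming `numpy` is available; the particular $\boldsymbol{\mu}$, $\Sigma$, $A$, $\mathbf{b}$ below are arbitrary choices of mine, not from the notes):

```python
import numpy as np

# For Y = A X + b, compare empirical moments of Y with
# E(Y) = A mu + b and Var(Y) = A Sigma A^T.
rng = np.random.default_rng(1)
mu = np.array([1.0, -2.0, 0.5])
Sigma = np.array([[2.0, 0.5, 0.3],
                  [0.5, 1.0, 0.2],
                  [0.3, 0.2, 1.5]])
A = np.array([[1.0, 0.0, 1.0],
              [0.0, 2.0, -1.0]])
b = np.array([0.5, 1.0])

X = rng.multivariate_normal(mu, Sigma, size=500_000)
Y = X @ A.T + b

print(Y.mean(axis=0), A @ mu + b)        # these agree closely
print(np.cov(Y.T), A @ Sigma @ A.T)      # so do these
```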


Remark

This is also a re-statement of previously established results. To see this, observe that if $\mathbf{a}_i^T$ is the $i$th row of $A$, then $Y_i = \mathbf{a}_i^T \mathbf{X}$ and, moreover, the $(i,j)$th element of $A\Sigma A^T$ is

$$\mathbf{a}_i^T \Sigma \mathbf{a}_j = \mathrm{Cov}\left(\mathbf{a}_i^T \mathbf{X}, \mathbf{a}_j^T \mathbf{X}\right) = \mathrm{Cov}(Y_i, Y_j).$$

1.9.4 Properties of variance matrices

If $\Sigma = \mathrm{Var}(\mathbf{X})$ for some random vector $\mathbf{X} = (X_1, X_2, \ldots, X_r)^T$, then it must satisfy certain properties.

Since $\mathrm{Cov}(X_i, X_j) = \mathrm{Cov}(X_j, X_i)$, it follows that $\Sigma$ is a square ($r \times r$) symmetric matrix.

Definition. 1.9.4

The square, symmetric matrix $M$ is said to be positive definite [non-negative definite] if $\mathbf{a}^T M \mathbf{a} > 0$ [$\geq 0$] for every vector $\mathbf{a} \in \mathbb{R}^r$ s.t. $\mathbf{a} \neq \mathbf{0}$.

It is necessary and sufficient that $\Sigma$ be non-negative definite in order that it can be a variance matrix. To see the necessity of this condition, consider the linear combination $\mathbf{a}^T \mathbf{X}$. By Theorem 1.9.8, we have

$$0 \leq \mathrm{Var}(\mathbf{a}^T \mathbf{X}) = \mathbf{a}^T \Sigma \mathbf{a} \quad\text{for every } \mathbf{a};$$

$\Longrightarrow$ $\Sigma$ must be non-negative definite.

Suppose $\lambda$ is an eigenvalue of $\Sigma$, and let $\mathbf{a}$ be the corresponding eigenvector. Then

$$\Sigma\mathbf{a} = \lambda\mathbf{a} \;\Longrightarrow\; \mathbf{a}^T \Sigma \mathbf{a} = \lambda \mathbf{a}^T \mathbf{a} = \lambda\|\mathbf{a}\|^2.$$

Hence $\Sigma$ is non-negative definite (positive definite) iff its eigenvalues are all non-negative (positive).

If $\Sigma$ is non-negative definite but not positive definite, then there must be at least one zero eigenvalue. Let $\mathbf{a}$ be the corresponding eigenvector. Then $\mathbf{a} \neq \mathbf{0}$ but $\mathbf{a}^T \Sigma \mathbf{a} = 0$; that is, $\mathrm{Var}(\mathbf{a}^T \mathbf{X}) = 0$ for that $\mathbf{a}$. Hence the distribution of $\mathbf{X}$ is degenerate, in the sense that one of the $X_i$'s is either constant or a linear combination of the other components.
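The eigenvalue characterisation can be illustrated numerically. A minimal sketch (numpy assumed; the matrix is my own example of a degenerate covariance, corresponding to $X_2 = X_1$ almost surely):

```python
import numpy as np

# A valid covariance matrix has non-negative eigenvalues.  A singular
# one has a zero eigenvalue, and the matching eigenvector a gives a
# linear combination a^T X with zero variance.
Sigma = np.array([[1.0, 1.0],
                  [1.0, 1.0]])      # degenerate: X2 = X1 a.s.
lam, vecs = np.linalg.eigh(Sigma)   # eigenvalues in ascending order
print(lam)                          # one eigenvalue is 0, the other 2
a = vecs[:, 0]                      # eigenvector for the zero eigenvalue
print(a @ Sigma @ a)                # Var(a^T X) = 0
```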


Finally, recall that if $\lambda_1, \ldots, \lambda_r$ are the eigenvalues of $\Sigma$, then $\det(\Sigma) = \prod_{i=1}^r \lambda_i$, so

$$\det(\Sigma) > 0 \ \text{ for } \Sigma \text{ positive definite},$$
$$\det(\Sigma) = 0 \ \text{ for } \Sigma \text{ non-negative definite but not positive definite}.$$

1.10 The multivariate normal distribution

Definition. 1.10.1

The random vector $\mathbf{X} = (X_1, \ldots, X_r)^T$ is said to have the $r$-dimensional multivariate normal distribution with parameters $\boldsymbol{\mu} \in \mathbb{R}^r$ and $\Sigma_{r\times r}$ positive definite, if it has PDF

$$f_{\mathbf{X}}(\mathbf{x}) = \frac{1}{(2\pi)^{r/2}|\Sigma|^{1/2}}\, e^{-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^T \Sigma^{-1} (\mathbf{x}-\boldsymbol{\mu})}.$$

We write $\mathbf{X} \sim N_r(\boldsymbol{\mu}, \Sigma)$.

Examples

1. $r = 2$: the bivariate normal distribution.

Let

$$\boldsymbol{\mu} = \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}, \qquad \Sigma = \begin{pmatrix} \sigma_1^2 & \rho\sigma_1\sigma_2 \\ \rho\sigma_1\sigma_2 & \sigma_2^2 \end{pmatrix}$$

$$\Longrightarrow |\Sigma| = \sigma_1^2\sigma_2^2(1-\rho^2) \quad\text{and}\quad \Sigma^{-1} = \frac{1}{\sigma_1^2\sigma_2^2(1-\rho^2)}\begin{pmatrix} \sigma_2^2 & -\rho\sigma_1\sigma_2 \\ -\rho\sigma_1\sigma_2 & \sigma_1^2 \end{pmatrix}$$

$$\Longrightarrow f(x_1, x_2) = \frac{1}{2\pi\sigma_1\sigma_2\sqrt{1-\rho^2}} \exp\left\{\frac{-1}{2(1-\rho^2)}\left[\left(\frac{x_1-\mu_1}{\sigma_1}\right)^2 + \left(\frac{x_2-\mu_2}{\sigma_2}\right)^2 - 2\rho\left(\frac{x_1-\mu_1}{\sigma_1}\right)\left(\frac{x_2-\mu_2}{\sigma_2}\right)\right]\right\}.$$
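The closed-form determinant and inverse above are easy to verify numerically. A minimal sketch (numpy assumed; the parameter values are arbitrary choices of mine):

```python
import numpy as np

# Check the bivariate normal |Sigma| and Sigma^{-1} formulas against
# general-purpose numerical routines.
s1, s2, rho = 1.5, 0.8, 0.6
Sigma = np.array([[s1**2, rho * s1 * s2],
                  [rho * s1 * s2, s2**2]])
Sigma_inv = (1.0 / (s1**2 * s2**2 * (1 - rho**2))) * np.array(
    [[s2**2, -rho * s1 * s2],
     [-rho * s1 * s2, s1**2]])

inv_ok = np.allclose(Sigma_inv, np.linalg.inv(Sigma))
det_ok = np.isclose(s1**2 * s2**2 * (1 - rho**2), np.linalg.det(Sigma))
print(inv_ok, det_ok)
```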


2. If $Z_1, Z_2, \ldots, Z_r$ are i.i.d. $N(0,1)$ and $\mathbf{Z} = (Z_1, \ldots, Z_r)^T$, then

$$f_{\mathbf{Z}}(\mathbf{z}) = \prod_{i=1}^r \frac{1}{\sqrt{2\pi}} e^{-z_i^2/2} = \frac{1}{(2\pi)^{r/2}} e^{-\frac{1}{2}\sum_{i=1}^r z_i^2} = \frac{1}{(2\pi)^{r/2}} e^{-\frac{1}{2}\mathbf{z}^T\mathbf{z}},$$

which is the $N_r(\mathbf{0}_r, I_{r\times r})$ PDF.

Theorem. 1.10.1

Suppose $\mathbf{X} \sim N_r(\boldsymbol{\mu}, \Sigma)$ and let $\mathbf{Y} = A\mathbf{X} + \mathbf{b}$, where $\mathbf{b} \in \mathbb{R}^r$ and $A_{r\times r}$ invertible are fixed. Then $\mathbf{Y} \sim N_r(A\boldsymbol{\mu} + \mathbf{b},\ A\Sigma A^T)$.

Proof.

We use the transformation rule for PDFs: if $\mathbf{X}$ has joint PDF $f_{\mathbf{X}}(\mathbf{x})$ and $\mathbf{Y} = g(\mathbf{X})$ for $g : \mathbb{R}^r \to \mathbb{R}^r$ invertible and continuously differentiable, then $f_{\mathbf{Y}}(\mathbf{y}) = f_{\mathbf{X}}(h(\mathbf{y}))\,|H|$, where $h = g^{-1}$ and $H = \left(\dfrac{\partial h_i}{\partial y_j}\right)$.

In this case we have $g(\mathbf{x}) = A\mathbf{x} + \mathbf{b}$, so $h(\mathbf{y}) = A^{-1}(\mathbf{y} - \mathbf{b})$ [solving for $\mathbf{x}$ in $\mathbf{y} = A\mathbf{x} + \mathbf{b}$] and $H = A^{-1}$.

Hence,

$$\begin{aligned}
f_{\mathbf{Y}}(\mathbf{y}) &= \frac{1}{(2\pi)^{r/2}|\Sigma|^{1/2}}\, e^{-\frac{1}{2}(A^{-1}(\mathbf{y}-\mathbf{b})-\boldsymbol{\mu})^T \Sigma^{-1} (A^{-1}(\mathbf{y}-\mathbf{b})-\boldsymbol{\mu})}\, |A^{-1}| \\
&= \frac{1}{(2\pi)^{r/2}|\Sigma|^{1/2}|AA^T|^{1/2}}\, e^{-\frac{1}{2}(\mathbf{y}-(A\boldsymbol{\mu}+\mathbf{b}))^T (A^{-1})^T \Sigma^{-1} A^{-1} (\mathbf{y}-(A\boldsymbol{\mu}+\mathbf{b}))} \\
&= \frac{1}{(2\pi)^{r/2}|A\Sigma A^T|^{1/2}}\, e^{-\frac{1}{2}(\mathbf{y}-(A\boldsymbol{\mu}+\mathbf{b}))^T (A\Sigma A^T)^{-1} (\mathbf{y}-(A\boldsymbol{\mu}+\mathbf{b}))},
\end{aligned}$$

which is exactly the PDF for $N_r(A\boldsymbol{\mu} + \mathbf{b},\ A\Sigma A^T)$.


Now suppose $\Sigma_{r\times r}$ is any symmetric positive definite matrix and recall that we can write $\Sigma = E\Lambda E^T$, where $\Lambda = \mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_r)$ and $E_{r\times r}$ is such that $EE^T = E^T E = I$.

Since $\Sigma$ is positive definite, we must have $\lambda_i > 0$, $i = 1, \ldots, r$, and we can define the symmetric square-root matrix by:

$$\Sigma^{1/2} = E\Lambda^{1/2}E^T, \quad\text{where } \Lambda^{1/2} = \mathrm{diag}(\sqrt{\lambda_1}, \sqrt{\lambda_2}, \ldots, \sqrt{\lambda_r}).$$

Note that $\Sigma^{1/2}$ is symmetric and satisfies:

$$\left(\Sigma^{1/2}\right)^2 = \Sigma^{1/2}\left(\Sigma^{1/2}\right)^T = \Sigma.$$

Now recall that if $Z_1, Z_2, \ldots, Z_r$ are i.i.d. $N(0,1)$ then $\mathbf{Z} = (Z_1, \ldots, Z_r)^T \sim N_r(\mathbf{0}, I)$. Because of the i.i.d. $N(0,1)$ assumption, we know in this case that $E(\mathbf{Z}) = \mathbf{0}$ and $\mathrm{Var}(\mathbf{Z}) = I$.

Now let $\mathbf{X} = \Sigma^{1/2}\mathbf{Z} + \boldsymbol{\mu}$:

1. From Theorem 1.9.9 we have

$$E(\mathbf{X}) = \Sigma^{1/2}E(\mathbf{Z}) + \boldsymbol{\mu} = \boldsymbol{\mu} \quad\text{and}\quad \mathrm{Var}(\mathbf{X}) = \Sigma^{1/2}\,\mathrm{Var}(\mathbf{Z})\left(\Sigma^{1/2}\right)^T = \Sigma.$$

2. From Theorem 1.10.1, $\mathbf{X} \sim N_r(\boldsymbol{\mu}, \Sigma)$.

Since this construction is valid for any symmetric positive definite $\Sigma$ and any $\boldsymbol{\mu} \in \mathbb{R}^r$, we have proved:

Theorem. 1.10.2

If $\mathbf{X} \sim N_r(\boldsymbol{\mu}, \Sigma)$ then $E(\mathbf{X}) = \boldsymbol{\mu}$ and $\mathrm{Var}(\mathbf{X}) = \Sigma$.

Theorem. 1.10.3

If $\mathbf{X} \sim N_r(\boldsymbol{\mu}, \Sigma)$ then $\mathbf{Z} = \Sigma^{-1/2}(\mathbf{X} - \boldsymbol{\mu}) \sim N_r(\mathbf{0}, I)$.

Proof.

Use Theorem 1.10.1.
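The square-root construction above is also a practical recipe for generating multivariate normal variates. A minimal sketch (numpy assumed; $\boldsymbol{\mu}$ and $\Sigma$ are example values of mine):

```python
import numpy as np

# Build Sigma^{1/2} from the eigendecomposition Sigma = E Lambda E^T,
# then generate X = Sigma^{1/2} Z + mu with Z ~ N_r(0, I).
rng = np.random.default_rng(2)
mu = np.array([1.0, 2.0])
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])

lam, E = np.linalg.eigh(Sigma)
root = E @ np.diag(np.sqrt(lam)) @ E.T       # symmetric square root
print(np.allclose(root @ root, Sigma))       # (Sigma^{1/2})^2 = Sigma

Z = rng.standard_normal((100_000, 2))
X = Z @ root.T + mu
print(X.mean(axis=0))                        # approx mu
print(np.cov(X.T))                           # approx Sigma
```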

Suppose

$$\begin{pmatrix} \mathbf{X}_1 \\ \mathbf{X}_2 \end{pmatrix} \sim N_{r_1+r_2}\left(\begin{pmatrix} \boldsymbol{\mu}_1 \\ \boldsymbol{\mu}_2 \end{pmatrix}, \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}\right)$$


where $\mathbf{X}_i$, $\boldsymbol{\mu}_i$ are of dimension $r_i$ and $\Sigma_{ij}$ is $r_i \times r_j$. In order that $\Sigma$ be symmetric, we must have $\Sigma_{11}$, $\Sigma_{22}$ symmetric and $\Sigma_{12} = \Sigma_{21}^T$.

We will derive the marginal distribution of $\mathbf{X}_2$ and the conditional distribution of $\mathbf{X}_1 | \mathbf{X}_2$.

Lemma. 1.10.1

If $M = \begin{pmatrix} B & A \\ O & I \end{pmatrix}$ is a square, partitioned matrix, then $|M| = |B|$.

Lemma. 1.10.2

If $M = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}$ then $|M| = |\Sigma_{22}|\,|\Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}|$.

Proof. (of Lemma 1.10.2)

Let

$$C = \begin{pmatrix} I & -\Sigma_{12}\Sigma_{22}^{-1} \\ 0 & \Sigma_{22}^{-1} \end{pmatrix}$$

and observe

$$CM = \begin{pmatrix} \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21} & 0 \\ \Sigma_{22}^{-1}\Sigma_{21} & I \end{pmatrix},$$

so that, by Lemma 1.10.1,

$$|C| = \frac{1}{|\Sigma_{22}|}, \qquad |CM| = |\Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}|.$$

Finally, observe $|CM| = |C||M|$, so

$$|M| = \frac{|CM|}{|C|} = |\Sigma_{22}|\,|\Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}|.$$
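The lemma is easy to check on a concrete partitioned matrix. A minimal sketch (numpy assumed; the blocks below are arbitrary choices of mine):

```python
import numpy as np

# Numerical check of Lemma 1.10.2:
# det(M) = det(Sigma22) * det(Sigma11 - Sigma12 Sigma22^{-1} Sigma21).
S11 = np.array([[2.0, 0.3], [0.3, 1.5]])
S12 = np.array([[0.4, 0.1], [0.2, 0.5]])
S22 = np.array([[1.0, 0.2], [0.2, 2.0]])
M = np.block([[S11, S12], [S12.T, S22]])

schur = S11 - S12 @ np.linalg.inv(S22) @ S12.T
same = np.isclose(np.linalg.det(M),
                  np.linalg.det(S22) * np.linalg.det(schur))
print(same)
```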

Theorem. 1.10.4

Suppose that

$$\begin{pmatrix} \mathbf{X}_1 \\ \mathbf{X}_2 \end{pmatrix} \sim N_{r_1+r_2}\left(\begin{pmatrix} \boldsymbol{\mu}_1 \\ \boldsymbol{\mu}_2 \end{pmatrix}, \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}\right).$$

Then the marginal distribution of $\mathbf{X}_2$ is $\mathbf{X}_2 \sim N_{r_2}(\boldsymbol{\mu}_2, \Sigma_{22})$ and the conditional distribution of $\mathbf{X}_1 | \mathbf{X}_2$ is

$$N_{r_1}\left(\boldsymbol{\mu}_1 + \Sigma_{12}\Sigma_{22}^{-1}(\mathbf{x}_2 - \boldsymbol{\mu}_2),\ \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}\right).$$


Proof.

We will exhibit $f(\mathbf{x}_1, \mathbf{x}_2)$ in the form

$$f(\mathbf{x}_1, \mathbf{x}_2) = h(\mathbf{x}_1, \mathbf{x}_2)\, g(\mathbf{x}_2),$$

where $h(\mathbf{x}_1, \mathbf{x}_2)$ is a PDF with respect to $\mathbf{x}_1$ for each (given) $\mathbf{x}_2$. It follows then that $f_{\mathbf{X}_1|\mathbf{X}_2}(\mathbf{x}_1|\mathbf{x}_2) = h(\mathbf{x}_1, \mathbf{x}_2)$ and $g(\mathbf{x}_2) = f_{\mathbf{X}_2}(\mathbf{x}_2)$.

Now observe that

$$f(\mathbf{x}_1, \mathbf{x}_2) = \frac{1}{(2\pi)^{(r_1+r_2)/2}\left|\begin{smallmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{smallmatrix}\right|^{1/2}} \exp\left[-\frac{1}{2}\begin{pmatrix} \mathbf{x}_1 - \boldsymbol{\mu}_1 \\ \mathbf{x}_2 - \boldsymbol{\mu}_2 \end{pmatrix}^T \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}^{-1} \begin{pmatrix} \mathbf{x}_1 - \boldsymbol{\mu}_1 \\ \mathbf{x}_2 - \boldsymbol{\mu}_2 \end{pmatrix}\right].$$

Write $V = \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}$. We treat the normalising constant and the quadratic form in turn.

1. By Lemma 1.10.2,

$$(2\pi)^{(r_1+r_2)/2}\left|\begin{smallmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{smallmatrix}\right|^{1/2} = (2\pi)^{r_1/2}(2\pi)^{r_2/2}|\Sigma_{22}|^{1/2}|V|^{1/2} = \left[(2\pi)^{r_1/2}|V|^{1/2}\right]\left[(2\pi)^{r_2/2}|\Sigma_{22}|^{1/2}\right].$$

2. Using the standard inverse of a partitioned matrix,

$$\begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}^{-1} = \begin{pmatrix} V^{-1} & -V^{-1}\Sigma_{12}\Sigma_{22}^{-1} \\ -\Sigma_{22}^{-1}\Sigma_{21}V^{-1} & \Sigma_{22}^{-1} + \Sigma_{22}^{-1}\Sigma_{21}V^{-1}\Sigma_{12}\Sigma_{22}^{-1} \end{pmatrix},$$

the quadratic form expands as

$$\begin{aligned}
\begin{pmatrix} \mathbf{x}_1 - \boldsymbol{\mu}_1 \\ \mathbf{x}_2 - \boldsymbol{\mu}_2 \end{pmatrix}^T & \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}^{-1} \begin{pmatrix} \mathbf{x}_1 - \boldsymbol{\mu}_1 \\ \mathbf{x}_2 - \boldsymbol{\mu}_2 \end{pmatrix} \\
&= (\mathbf{x}_1 - \boldsymbol{\mu}_1)^T V^{-1}(\mathbf{x}_1 - \boldsymbol{\mu}_1) - (\mathbf{x}_1 - \boldsymbol{\mu}_1)^T V^{-1}\Sigma_{12}\Sigma_{22}^{-1}(\mathbf{x}_2 - \boldsymbol{\mu}_2) \\
&\quad - (\mathbf{x}_2 - \boldsymbol{\mu}_2)^T \Sigma_{22}^{-1}\Sigma_{21}V^{-1}(\mathbf{x}_1 - \boldsymbol{\mu}_1) + (\mathbf{x}_2 - \boldsymbol{\mu}_2)^T \Sigma_{22}^{-1}\Sigma_{21}V^{-1}\Sigma_{12}\Sigma_{22}^{-1}(\mathbf{x}_2 - \boldsymbol{\mu}_2) \\
&\quad + (\mathbf{x}_2 - \boldsymbol{\mu}_2)^T \Sigma_{22}^{-1}(\mathbf{x}_2 - \boldsymbol{\mu}_2).
\end{aligned}$$


How did this come about? For a block quadratic form,

$$(x\;\; y)\begin{pmatrix} a & b \\ c & d \end{pmatrix}\begin{pmatrix} x \\ y \end{pmatrix} = x(ax + by) + y(cx + dy) = ax^2 + bxy + cyx + dy^2,$$

and, for completing the square,

$$(\mathbf{x}-\mathbf{y})^T A(\mathbf{x}-\mathbf{y}) = \mathbf{x}^T A\mathbf{x} - \mathbf{x}^T A\mathbf{y} - \mathbf{y}^T A\mathbf{x} + \mathbf{y}^T A\mathbf{y}, \qquad \Sigma_{12}^T = \Sigma_{21}.$$

Grouping the first four terms of the expansion accordingly, the quadratic form equals

$$\left((\mathbf{x}_1 - \boldsymbol{\mu}_1) - \Sigma_{12}\Sigma_{22}^{-1}(\mathbf{x}_2 - \boldsymbol{\mu}_2)\right)^T V^{-1}\left((\mathbf{x}_1 - \boldsymbol{\mu}_1) - \Sigma_{12}\Sigma_{22}^{-1}(\mathbf{x}_2 - \boldsymbol{\mu}_2)\right) + (\mathbf{x}_2 - \boldsymbol{\mu}_2)^T \Sigma_{22}^{-1}(\mathbf{x}_2 - \boldsymbol{\mu}_2)$$

$$= \left(\mathbf{x}_1 - (\boldsymbol{\mu}_1 + \Sigma_{12}\Sigma_{22}^{-1}(\mathbf{x}_2 - \boldsymbol{\mu}_2))\right)^T \left(\Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}\right)^{-1}\left(\mathbf{x}_1 - (\boldsymbol{\mu}_1 + \Sigma_{12}\Sigma_{22}^{-1}(\mathbf{x}_2 - \boldsymbol{\mu}_2))\right) + (\mathbf{x}_2 - \boldsymbol{\mu}_2)^T \Sigma_{22}^{-1}(\mathbf{x}_2 - \boldsymbol{\mu}_2).$$

Combining (1) and (2) we obtain:


$$f(\mathbf{x}_1, \mathbf{x}_2) = \frac{1}{(2\pi)^{(r_1+r_2)/2}\left|\begin{smallmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{smallmatrix}\right|^{1/2}} \exp\left\{-\frac{1}{2}\begin{pmatrix} \mathbf{x}_1 - \boldsymbol{\mu}_1 \\ \mathbf{x}_2 - \boldsymbol{\mu}_2 \end{pmatrix}^T \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}^{-1} \begin{pmatrix} \mathbf{x}_1 - \boldsymbol{\mu}_1 \\ \mathbf{x}_2 - \boldsymbol{\mu}_2 \end{pmatrix}\right\}$$

$$= \frac{1}{(2\pi)^{r_1/2}|\Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}|^{1/2}} \exp\left\{-\frac{1}{2}\left(\mathbf{x}_1 - (\boldsymbol{\mu}_1 + \Sigma_{12}\Sigma_{22}^{-1}(\mathbf{x}_2 - \boldsymbol{\mu}_2))\right)^T \left(\Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}\right)^{-1}\left(\mathbf{x}_1 - (\boldsymbol{\mu}_1 + \Sigma_{12}\Sigma_{22}^{-1}(\mathbf{x}_2 - \boldsymbol{\mu}_2))\right)\right\}$$
$$\quad\times \frac{1}{(2\pi)^{r_2/2}|\Sigma_{22}|^{1/2}} \exp\left\{-\frac{1}{2}(\mathbf{x}_2 - \boldsymbol{\mu}_2)^T \Sigma_{22}^{-1}(\mathbf{x}_2 - \boldsymbol{\mu}_2)\right\}$$

$$= h(\mathbf{x}_1, \mathbf{x}_2)\, g(\mathbf{x}_2),$$

where for fixed $\mathbf{x}_2$, $h(\mathbf{x}_1, \mathbf{x}_2)$ is the $N_{r_1}\left(\boldsymbol{\mu}_1 + \Sigma_{12}\Sigma_{22}^{-1}(\mathbf{x}_2 - \boldsymbol{\mu}_2),\ \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}\right)$ PDF, and $g(\mathbf{x}_2)$ is the $N_{r_2}(\boldsymbol{\mu}_2, \Sigma_{22})$ PDF.

Hence, the result is proved.
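A simulation sketch of Theorem 1.10.4 (numpy assumed; the $\Sigma$ below is my own example): the residual $\mathbf{X}_1 - \Sigma_{12}\Sigma_{22}^{-1}\mathbf{X}_2$ has variance equal to the conditional covariance $\Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}$, which we can check empirically.

```python
import numpy as np

# The Schur complement Sigma11 - Sigma12 Sigma22^{-1} Sigma21 is the
# conditional covariance of X1 given X2; it equals the variance of the
# regression residual X1 - Sigma12 Sigma22^{-1} X2.
rng = np.random.default_rng(3)
mu = np.zeros(3)
Sigma = np.array([[2.0, 0.6, 0.4],
                  [0.6, 1.0, 0.3],
                  [0.4, 0.3, 1.5]])
S11, S12, S22 = Sigma[:1, :1], Sigma[:1, 1:], Sigma[1:, 1:]

X = rng.multivariate_normal(mu, Sigma, size=400_000)
X1, X2 = X[:, :1], X[:, 1:]
resid = X1 - X2 @ (S12 @ np.linalg.inv(S22)).T

cond_cov = S11 - S12 @ np.linalg.inv(S22) @ S12.T
print(np.var(resid), cond_cov[0, 0])   # these agree closely
```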

Theorem. 1.10.5

Suppose that $\mathbf{X} \sim N_r(\boldsymbol{\mu}, \Sigma)$ and $\mathbf{Y} = A\mathbf{X} + \mathbf{b}$, where $A_{p\times r}$ with linearly independent rows and $\mathbf{b} \in \mathbb{R}^p$ are fixed. Then $\mathbf{Y} \sim N_p(A\boldsymbol{\mu} + \mathbf{b},\ A\Sigma A^T)$. [Note: $p \leq r$.]

Proof. Use the method of regular transformations.

Since the rows of $A$ are linearly independent, we can find $B_{(r-p)\times r}$ such that the $r \times r$ matrix $\begin{pmatrix} B \\ A \end{pmatrix}$ is invertible.


If we take $\mathbf{Z} = B\mathbf{X}$, we have

$$\begin{pmatrix} \mathbf{Z} \\ \mathbf{Y} \end{pmatrix} = \begin{pmatrix} B \\ A \end{pmatrix}\mathbf{X} + \begin{pmatrix} \mathbf{0} \\ \mathbf{b} \end{pmatrix} \sim N\left(\begin{pmatrix} B\boldsymbol{\mu} + \mathbf{0} \\ A\boldsymbol{\mu} + \mathbf{b} \end{pmatrix}, \begin{pmatrix} B\Sigma B^T & B\Sigma A^T \\ A\Sigma B^T & A\Sigma A^T \end{pmatrix}\right)$$

by Theorem 1.10.1. Hence, from Theorem 1.10.4, the marginal distribution for $\mathbf{Y}$ is

$$N_p(A\boldsymbol{\mu} + \mathbf{b},\ A\Sigma A^T).$$

1.10.1 The multivariate normal MGF

The moment generating function of a random vector $\mathbf{X} \sim N(\boldsymbol{\mu}, \Sigma)$ is given by

$$M_{\mathbf{X}}(\mathbf{t}) = e^{\mathbf{t}^T\boldsymbol{\mu} + \frac{1}{2}\mathbf{t}^T \Sigma \mathbf{t}}.$$

Prove this result as an exercise!

The characteristic function of $\mathbf{X}$ is

$$E[\exp(i\mathbf{t}^T\mathbf{X})] = \exp\left(i\mathbf{t}^T\boldsymbol{\mu} - \tfrac{1}{2}\mathbf{t}^T \Sigma \mathbf{t}\right).$$

The marginal distribution of $\mathbf{X}_1$ (or $\mathbf{X}_2$) is easy to derive using the multivariate normal MGF. Let

$$\mathbf{t} = \begin{pmatrix} \mathbf{t}_1 \\ \mathbf{t}_2 \end{pmatrix}, \qquad \boldsymbol{\mu} = \begin{pmatrix} \boldsymbol{\mu}_1 \\ \boldsymbol{\mu}_2 \end{pmatrix}.$$

Then the marginal distribution of $\mathbf{X}_1$ is obtained by setting $\mathbf{t}_2 = \mathbf{0}$ in the expression for the MGF of $\mathbf{X}$.

Proof:

$$M_{\mathbf{X}}(\mathbf{t}) = \exp\left(\mathbf{t}^T\boldsymbol{\mu} + \tfrac{1}{2}\mathbf{t}^T \Sigma \mathbf{t}\right) = \exp\left(\mathbf{t}_1^T\boldsymbol{\mu}_1 + \mathbf{t}_2^T\boldsymbol{\mu}_2 + \tfrac{1}{2}\mathbf{t}_1^T \Sigma_{11}\mathbf{t}_1 + \mathbf{t}_1^T \Sigma_{12}\mathbf{t}_2 + \tfrac{1}{2}\mathbf{t}_2^T \Sigma_{22}\mathbf{t}_2\right).$$

Now,

$$M_{\mathbf{X}_1}(\mathbf{t}_1) = M_{\mathbf{X}}\begin{pmatrix} \mathbf{t}_1 \\ \mathbf{0} \end{pmatrix} = \exp\left(\mathbf{t}_1^T\boldsymbol{\mu}_1 + \tfrac{1}{2}\mathbf{t}_1^T \Sigma_{11}\mathbf{t}_1\right),$$


which is the MGF of $\mathbf{X}_1 \sim N_{r_1}(\boldsymbol{\mu}_1, \Sigma_{11})$.

Similarly for $\mathbf{X}_2$. This means that all marginal distributions of a multivariate normal distribution are themselves multivariate normal. Note, though, that in general the opposite implication is not true: there are examples of non-normal multivariate distributions whose marginal distributions are normal.
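A standard counterexample of this kind (not worked in the notes; the construction is my own illustration): take $X \sim N(0,1)$ and $Y = SX$, where $S = \pm 1$ with equal probability, independently of $X$. Both marginals are $N(0,1)$ and $\mathrm{Cov}(X, Y) = 0$, yet $|Y| = |X|$, so $X$ and $Y$ are dependent and $(X, Y)$ cannot be bivariate normal.

```python
import numpy as np

# Both X and Y = S*X are marginally N(0,1), and they are uncorrelated,
# but |Y| = |X| makes them strongly dependent.
rng = np.random.default_rng(4)
n = 500_000
X = rng.standard_normal(n)
S = rng.choice([-1.0, 1.0], size=n)
Y = S * X

cov_xy = np.cov(X, Y)[0, 1]
abs_corr = np.corrcoef(np.abs(X), np.abs(Y))[0, 1]
print(cov_xy)     # approx 0: uncorrelated
print(abs_corr)   # exactly 1: dependent
```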

1.10.2 Independence and normality

We have seen previously that $X$, $Y$ independent $\Rightarrow$ $\mathrm{Cov}(X, Y) = 0$, but not vice versa. An exception is when the data are (jointly) normally distributed. In particular, if $(X, Y)$ has the bivariate normal distribution, then $\mathrm{Cov}(X, Y) = 0 \iff X, Y$ are independent.

Theorem. 1.10.6

Suppose $X_1, X_2, \ldots, X_r$ have a multivariate normal distribution. Then $X_1, X_2, \ldots, X_r$ are independent if and only if $\mathrm{Cov}(X_i, X_j) = 0$ for all $i \neq j$.

Proof.

($\Longrightarrow$) That $X_1, \ldots, X_r$ independent implies $\mathrm{Cov}(X_i, X_j) = 0$ for $i \neq j$ has already been proved.

($\Longleftarrow$) Suppose $\mathrm{Cov}(X_i, X_j) = 0$ for $i \neq j$.



$$\Longrightarrow \mathrm{Var}(\mathbf{X}) = \Sigma = \begin{pmatrix} \sigma_{11} & 0 & \ldots & 0 \\ 0 & \sigma_{22} & & \vdots \\ \vdots & & \ddots & \vdots \\ 0 & \ldots & 0 & \sigma_{rr} \end{pmatrix}$$

$$\Longrightarrow \Sigma^{-1} = \begin{pmatrix} \sigma_{11}^{-1} & 0 & \ldots & 0 \\ 0 & \sigma_{22}^{-1} & & \vdots \\ \vdots & & \ddots & \vdots \\ 0 & \ldots & 0 & \sigma_{rr}^{-1} \end{pmatrix} \quad\text{and}\quad |\Sigma| = \sigma_{11}\sigma_{22}\cdots\sigma_{rr}$$

$$\begin{aligned}
\Longrightarrow f_{\mathbf{X}}(\mathbf{x}) &= \frac{1}{(2\pi)^{r/2}|\Sigma|^{1/2}}\, e^{-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^T \Sigma^{-1}(\mathbf{x}-\boldsymbol{\mu})} \\
&= \frac{1}{(2\pi)^{r/2}(\sigma_{11}\sigma_{22}\cdots\sigma_{rr})^{1/2}}\, e^{-\frac{1}{2}\sum_{i=1}^r \frac{(x_i-\mu_i)^2}{\sigma_{ii}}} \\
&= \left(\frac{1}{\sqrt{2\pi}\sqrt{\sigma_{11}}}\, e^{-\frac{1}{2}\frac{(x_1-\mu_1)^2}{\sigma_{11}}}\right)\left(\frac{1}{\sqrt{2\pi}\sqrt{\sigma_{22}}}\, e^{-\frac{1}{2}\frac{(x_2-\mu_2)^2}{\sigma_{22}}}\right)\cdots\left(\frac{1}{\sqrt{2\pi}\sqrt{\sigma_{rr}}}\, e^{-\frac{1}{2}\frac{(x_r-\mu_r)^2}{\sigma_{rr}}}\right) \\
&= f_1(x_1)f_2(x_2)\cdots f_r(x_r)
\end{aligned}$$

$\Longrightarrow X_1, \ldots, X_r$ are independent.

Note:

$$(\mathbf{x}-\boldsymbol{\mu})^T \Sigma^{-1}(\mathbf{x}-\boldsymbol{\mu}) = (x_1-\mu_1,\ \ldots,\ x_r-\mu_r)\begin{pmatrix} \sigma_{11}^{-1} & 0 & \ldots & 0 \\ 0 & \sigma_{22}^{-1} & & \vdots \\ \vdots & & \ddots & \vdots \\ 0 & \ldots & 0 & \sigma_{rr}^{-1} \end{pmatrix}\begin{pmatrix} x_1-\mu_1 \\ x_2-\mu_2 \\ \vdots \\ x_r-\mu_r \end{pmatrix} = \sum_{i=1}^r \frac{(x_i-\mu_i)^2}{\sigma_{ii}}.$$


Remark

The same methods can be used to establish a similar result for block diagonal matrices. The simplest case is the following, which is most easily proved using moment generating functions: $\mathbf{X}_1, \mathbf{X}_2$ are independent if and only if $\Sigma_{12} = 0$, i.e.,

$$\begin{pmatrix} \mathbf{X}_1 \\ \mathbf{X}_2 \end{pmatrix} \sim N_{r_1+r_2}\left(\begin{pmatrix} \boldsymbol{\mu}_1 \\ \boldsymbol{\mu}_2 \end{pmatrix}, \begin{pmatrix} \Sigma_{11} & 0 \\ 0 & \Sigma_{22} \end{pmatrix}\right).$$

Theorem. 1.10.7

Suppose $X_1, X_2, \ldots, X_n$ are i.i.d. $N(\mu, \sigma^2)$ RVs and let

$$\bar{X} = \frac{1}{n}\sum_{i=1}^n X_i, \qquad S^2 = \frac{1}{n-1}\sum_{i=1}^n (X_i - \bar{X})^2.$$

Then $\bar{X} \sim N(\mu, \sigma^2/n)$ and $\dfrac{(n-1)S^2}{\sigma^2} \sim \chi^2_{n-1}$ independently.

(Note: $S^2$ here is a random variable, and is not to be confused with the sample covariance matrix, which is also often denoted by $S$. Hopefully, the meaning of the notation will be clear from the context in which it is used.)

Proof. Observe first that if $\mathbf{X} = (X_1, \ldots, X_n)^T$ then the i.i.d. assumption may be written as:

$$\mathbf{X} \sim N_n(\mu\mathbf{1}_n,\ \sigma^2 I_{n\times n}).$$

1. $\bar{X} \sim N(\mu, \sigma^2/n)$:

Observe that $\bar{X} = B\mathbf{X}$, where

$$B = \begin{pmatrix} \tfrac{1}{n} & \tfrac{1}{n} & \ldots & \tfrac{1}{n} \end{pmatrix}.$$

Hence, by Theorem 1.10.5,

$$\bar{X} \sim N(B\mu\mathbf{1},\ \sigma^2 BB^T) \equiv N(\mu,\ \sigma^2/n),$$

as required.


2. Independence of $\bar{X}$ and $S^2$: consider

$$A = \begin{pmatrix} \frac{1}{n} & \frac{1}{n} & \ldots & \frac{1}{n} & \frac{1}{n} \\ 1-\frac{1}{n} & -\frac{1}{n} & \ldots & -\frac{1}{n} & -\frac{1}{n} \\ -\frac{1}{n} & 1-\frac{1}{n} & \ldots & -\frac{1}{n} & -\frac{1}{n} \\ \vdots & & \ddots & & \vdots \\ -\frac{1}{n} & -\frac{1}{n} & \ldots & 1-\frac{1}{n} & -\frac{1}{n} \end{pmatrix}$$

and observe

$$A\mathbf{X} = \begin{pmatrix} \bar{X} \\ X_1 - \bar{X} \\ X_2 - \bar{X} \\ \vdots \\ X_{n-1} - \bar{X} \end{pmatrix}.$$

We can check that $\mathrm{Var}(A\mathbf{X}) = \sigma^2 AA^T$ has the form:

$$\sigma^2\begin{pmatrix} \frac{1}{n} & 0 & \ldots & 0 \\ 0 & & & \\ \vdots & & \Sigma_{22} & \\ 0 & & & \end{pmatrix}$$

$\Rightarrow \bar{X}$ and $(X_1 - \bar{X}, X_2 - \bar{X}, \ldots, X_{n-1} - \bar{X})$ are independent (by multivariate normality).

Finally, since

$$X_n - \bar{X} = -\left((X_1 - \bar{X}) + (X_2 - \bar{X}) + \cdots + (X_{n-1} - \bar{X})\right),$$

it follows that $S^2$ is a function of $(X_1 - \bar{X}, X_2 - \bar{X}, \ldots, X_{n-1} - \bar{X})$ and hence is independent of $\bar{X}$.

3. Prove: $\dfrac{(n-1)S^2}{\sigma^2} \sim \chi^2_{n-1}$.


Consider the identity:

$$\sum_{i=1}^n (X_i - \mu)^2 = \sum_{i=1}^n (X_i - \bar{X})^2 + n(\bar{X} - \mu)^2$$

(subtract and add $\bar{X}$ in the first term)

$$\Rightarrow \sum_{i=1}^n \frac{(X_i - \mu)^2}{\sigma^2} = \sum_{i=1}^n \frac{(X_i - \bar{X})^2}{\sigma^2} + \left(\frac{\bar{X} - \mu}{\sigma/\sqrt{n}}\right)^2,$$

and let

$$R_1 = \sum_{i=1}^n \frac{(X_i - \mu)^2}{\sigma^2}, \qquad R_2 = \sum_{i=1}^n \frac{(X_i - \bar{X})^2}{\sigma^2} = \frac{(n-1)S^2}{\sigma^2}, \qquad R_3 = \left(\frac{\bar{X} - \mu}{\sigma/\sqrt{n}}\right)^2.$$

If $M_1, M_2, M_3$ are the MGFs of $R_1, R_2, R_3$ respectively, then

$$M_1(t) = M_2(t)M_3(t),$$

since $R_1 = R_2 + R_3$, with $R_2$ and $R_3$ independent ($R_2$ depends only on $S^2$, and $R_3$ only on $\bar{X}$), so

$$M_2(t) = \frac{M_1(t)}{M_3(t)}.$$

Next, observe that

$$R_3 = \left(\frac{\bar{X} - \mu}{\sigma/\sqrt{n}}\right)^2 \sim \chi^2_1 \;\Rightarrow\; M_3(t) = \frac{1}{(1-2t)^{1/2}},$$

and

$$R_1 = \sum_{i=1}^n \frac{(X_i - \mu)^2}{\sigma^2} \sim \chi^2_n \;\Rightarrow\; M_1(t) = \frac{1}{(1-2t)^{n/2}}.$$

$$\therefore\; M_2(t) = \frac{1}{(1-2t)^{(n-1)/2}},$$

which is the MGF for $\chi^2_{n-1}$. Hence,

$$\frac{(n-1)S^2}{\sigma^2} \sim \chi^2_{n-1}.$$
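A simulation sketch of Theorem 1.10.7 (numpy assumed; $n$, $\mu$, $\sigma$ below are arbitrary choices of mine): $(n-1)S^2/\sigma^2$ should have the $\chi^2_{n-1}$ mean $n-1$ and variance $2(n-1)$, and should be uncorrelated with $\bar{X}$.

```python
import numpy as np

# Each row is one sample of size n; s2 is the sample variance of a row.
rng = np.random.default_rng(5)
n, mu, sigma = 10, 3.0, 2.0
samples = rng.normal(mu, sigma, size=(200_000, n))
s2 = samples.var(axis=1, ddof=1)
q = (n - 1) * s2 / sigma**2

print(q.mean())   # approx n - 1 = 9
print(q.var())    # approx 2(n - 1) = 18

m = samples.mean(axis=1)
corr = np.corrcoef(m, s2)[0, 1]
print(corr)       # approx 0: consistent with independence
```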

Corollary.

If $X_1, \ldots, X_n$ are i.i.d. $N(\mu, \sigma^2)$, then:

$$T = \frac{\bar{X} - \mu}{S/\sqrt{n}} \sim t_{n-1}.$$

Proof.

Recall that $t_k$ is the distribution of $\dfrac{Z}{\sqrt{V/k}}$, where $Z \sim N(0,1)$ and $V \sim \chi^2_k$ independently.

Now observe that:

$$T = \frac{\bar{X} - \mu}{S/\sqrt{n}} = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}}\Big/\sqrt{\frac{(n-1)S^2/\sigma^2}{n-1}},$$

and note that

$$\frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \sim N(0,1) \quad\text{and}\quad \frac{(n-1)S^2}{\sigma^2} \sim \chi^2_{n-1}$$

independently.


1.11 Limit Theorems

1.11.1 Convergence of random variables

Let $X_1, X_2, X_3, \ldots$ be an infinite sequence of RVs. We will consider 4 different types of convergence.

1. Convergence in probability (weak convergence)

The sequence $\{X_n\}$ is said to converge to the constant $\alpha \in \mathbb{R}$ in probability if for every $\varepsilon > 0$,

$$\lim_{n\to\infty} P(|X_n - \alpha| > \varepsilon) = 0.$$

2. Convergence in quadratic mean

The sequence $\{X_n\}$ is said to converge to $\alpha$ in quadratic mean if

$$\lim_{n\to\infty} E\left((X_n - \alpha)^2\right) = 0.$$

3. Almost sure convergence (strong convergence)

The sequence $\{X_n\}$ is said to converge almost surely to $\alpha$ if, for each $\varepsilon > 0$, with probability 1 we have $|X_n - \alpha| > \varepsilon$ for only a finite number of $n \geq 1$.

Remarks

(1), (2), (3) are related by:

[Figure 15: Relationships of convergence]

4. Convergence in distribution

The sequence of RVs $\{X_n\}$ with CDFs $\{F_n\}$ is said to converge in distribution to the RV $X$ with CDF $F(x)$ if:

$$\lim_{n\to\infty} F_n(x) = F(x) \quad\text{for all continuity points of } F.$$


[Figure 16: Discrete case]

[Figure 17: Normal approximation to binomial (an application of the Central Limit Theorem), continuous case]

Remarks

1. If we take

$$F(x) = \begin{cases} 0 & x < \alpha \\ 1 & x \geq \alpha \end{cases}$$

then we have $X = \alpha$ with probability 1, and convergence in distribution to this $F$ is the same thing as convergence in probability to $\alpha$.

2. Commonly used notation for convergence in distribution is either:

(i) $\mathcal{L}[X_n] \to \mathcal{L}[X]$

(ii) $X_n \xrightarrow{D} \mathcal{L}[X]$, or e.g. $X_n \xrightarrow{D} N(0,1)$.

3. An important result that we will use without proof is as follows: let $M_n(t)$ be the MGF of $X_n$ and $M(t)$ be the MGF of $X$. If $M_n(t) \to M(t)$ for each $t$ in some open interval containing 0, as $n \to \infty$, then $\mathcal{L}[X_n] \xrightarrow{D} \mathcal{L}[X]$. (Sometimes called the Continuity Theorem.)


Theorem. 1.11.1 (Weak law of large numbers)

Suppose $X_1, X_2, X_3, \ldots$ is a sequence of i.i.d. RVs with $E[X_i] = \mu$, $\mathrm{Var}(X_i) = \sigma^2$, and let

$$\bar{X}_n = \frac{1}{n}\sum_{i=1}^n X_i.$$

Then $\bar{X}_n$ converges to $\mu$ in probability.

Proof. We need to show for each $\epsilon > 0$ that

$$\lim_{n\to\infty} P(|\bar{X}_n - \mu| > \epsilon) = 0.$$

Now observe that $E[\bar{X}_n] = \mu$ and $\mathrm{Var}(\bar{X}_n) = \dfrac{\sigma^2}{n}$. So by Chebyshev's inequality,

$$P(|\bar{X}_n - \mu| > \epsilon) \leq \frac{\sigma^2}{n\epsilon^2} \to 0 \quad\text{as } n \to \infty, \text{ for any fixed } \epsilon > 0.$$

Remarks

1. The proof given for Theorem 1.11.1 is really a corollary to the fact that $\bar{X}_n$ also converges to $\mu$ in quadratic mean.

2. There is also a version of this theorem involving almost sure convergence (the strong law of large numbers). We will not discuss this.

3. The law of large numbers is one of the fundamental principles of statistical inference. That is, it is the formal justification for the claim that the "sample mean approaches the population mean for large $n$".
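The weak law and the Chebyshev bound in the proof can be illustrated by simulation. A minimal sketch (numpy assumed; Uniform(0,1) draws are my choice of example, so $\mu = 1/2$ and $\sigma^2 = 1/12$):

```python
import numpy as np

# Estimate P(|X_bar_n - mu| > eps) for growing n and compare with the
# Chebyshev bound sigma^2 / (n eps^2).
rng = np.random.default_rng(6)
mu, var, eps = 0.5, 1.0 / 12.0, 0.05
probs, bounds = [], []
for n in [10, 100, 1000]:
    means = rng.random((20_000, n)).mean(axis=1)
    probs.append(np.mean(np.abs(means - mu) > eps))
    bounds.append(var / (n * eps**2))
print(probs)    # shrinks towards 0 as n grows
print(bounds)   # the (often crude) Chebyshev bound
```

Note that the Chebyshev bound is loose: for small $n$ it can exceed 1, but it still forces the probability to 0 as $n \to \infty$.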

Lemma. 1.11.1

Suppose $a_n$ is a sequence of real numbers s.t. $\lim_{n\to\infty} na_n = a$ with $|a| < \infty$. Then,

$$\lim_{n\to\infty} (1 + a_n)^n = e^a.$$

Proof.

Omitted (but not difficult).


Remarks

This is a simple generalisation of the standard limit,

$$\lim_{n\to\infty}\left(1 + \frac{x}{n}\right)^n = e^x.$$

Consider a sequence of i.i.d. RVs $X_1, X_2, \ldots$ with $E(X_i) = \mu$, $\mathrm{Var}(X_i) = \sigma^2$, and such that the MGF, $M_X(t)$, is defined for all $t$ in some open interval containing 0. Let $S_n = \sum_{i=1}^n X_i$ and note that $E[S_n] = n\mu$ and $\mathrm{Var}(S_n) = n\sigma^2$.

Theorem. 1.11.2 (Central Limit Theorem)

Let $X_1, X_2, \ldots$ and $S_n$ be as above, and let $Z_n = \dfrac{S_n - n\mu}{\sigma\sqrt{n}}$. Then

$$\mathcal{L}[Z_n] \to N(0,1) \quad\text{as } n \to \infty.$$

Proof.

We will use the fact that it is sufficient to prove that

$$M_{Z_n}(t) \to e^{t^2/2} \quad\text{for each fixed } t.$$

[Note: if $Z \sim N(0,1)$ then $M_Z(t) = e^{t^2/2}$.]

Now let $U_i = \dfrac{X_i - \mu}{\sigma}$ and observe that $Z_n = \dfrac{1}{\sqrt{n}}\sum_{i=1}^n U_i$.

Since the $U_i$ are independent we have:

$$M_{Z_n}(t) = \left\{M_U(t/\sqrt{n})\right\}^n \qquad \left[\,M_{aX}(t) = E[e^{taX}] = M_X(at)\,\right].$$

Now,

$$E(U_i) = 0 \Rightarrow M_U'(0) = 0, \qquad \mathrm{Var}(U_i) = 1 \Rightarrow M_U''(0) = 1.$$

Consider the second-order Taylor expansion,

$$M_U(t) = M_U(0) + M_U'(0)\,t + M_U''(0)\,\frac{t^2}{2} + r(t) = 1 + 0 + t^2/2 + r(t),$$

using $M_U(0) = 1$ and $M_U'(0) = 0$, where $\lim_{s\to 0} \dfrac{r(s)}{s^2} = 0$. Therefore,

$$M_{Z_n}(t) = \left\{M_U(t/\sqrt{n})\right\}^n = \left\{1 + t^2/2n + r(t/\sqrt{n})\right\}^n = (1 + a_n)^n,$$

where

$$a_n = \frac{t^2}{2n} + r(t/\sqrt{n}).$$

Next observe that $\lim_{n\to\infty} na_n = \dfrac{t^2}{2}$ for fixed $t$.

[To check this, observe that:

$$\lim_{n\to\infty} \frac{nt^2}{2n} = \frac{t^2}{2} \quad\text{and}\quad \lim_{n\to\infty} n\,r(t/\sqrt{n}) = \lim_{n\to\infty} \frac{t^2\, r(t/\sqrt{n})}{(t/\sqrt{n})^2} = t^2 \lim_{s\to 0}\frac{r(s)}{s^2} = 0 \quad\text{for fixed } t.$$

Note: $s = t/\sqrt{n} \to 0$ as $n \to \infty$.]

Hence we have shown $M_{Z_n}(t) = (1 + a_n)^n$, where $\lim_{n\to\infty} na_n = t^2/2$, so by Lemma 1.11.1,

$$\lim_{n\to\infty} M_{Z_n}(t) = e^{t^2/2} \quad\text{for each fixed } t.$$
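The theorem can be seen in action by simulation. A minimal sketch (numpy assumed; Exponential(1) summands are my own choice of a skewed example, with $\mu = \sigma = 1$):

```python
import numpy as np

# Standardised sums Z_n = (S_n - n*mu) / (sigma*sqrt(n)) of i.i.d.
# Exponential(1) variables should look approximately N(0,1) for large n.
rng = np.random.default_rng(7)
n, reps = 400, 100_000
x = rng.exponential(1.0, size=(reps, n))
z = (x.sum(axis=1) - n) / np.sqrt(n)

tail = np.mean(z < -1.0)
print(z.mean(), z.std())   # approx 0 and 1
print(tail)                # close to Phi(-1) = 0.1587
```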

Remarks

1. The Central Limit Theorem can be stated equivalently for $\bar{X}_n$, i.e.,

$$\mathcal{L}\left(\frac{\bar{X}_n - \mu}{\sigma/\sqrt{n}}\right) \to N(0,1)$$

(just note that $\dfrac{\bar{X}_n - \mu}{\sigma/\sqrt{n}} = \dfrac{S_n - n\mu}{\sigma\sqrt{n}}$).

2. The Central Limit Theorem holds under conditions more general than those given above. In particular, with suitable assumptions,

(i) $M_X(t)$ need not exist.

(ii) $X_1, X_2, \ldots$ need not be i.i.d.

3. Theorems 1.11.1 and 1.11.2 are concerned with the asymptotic behaviour of $\bar{X}_n$. Theorem 1.11.1 states $\bar{X}_n \to \mu$ in probability as $n \to \infty$; Theorem 1.11.2 states $\dfrac{\bar{X}_n - \mu}{\sigma/\sqrt{n}} \xrightarrow{D} N(0,1)$ as $n \to \infty$. These results are not contradictory, because $\mathrm{Var}(\bar{X}_n) \to 0$ while the Central Limit Theorem is concerned with $\dfrac{\bar{X}_n - E(\bar{X}_n)}{\sqrt{\mathrm{Var}(\bar{X}_n)}}$.



2 Statistical Inference

2.1 Basic definitions and terminology

Probability is concerned partly with the problem of predicting the behaviour of the RV $X$, assuming we know its distribution. Statistical inference is concerned with the inverse problem: given data $x_1, x_2, \ldots, x_n$ with unknown CDF $F(x)$, what can we conclude about $F(x)$?

In this course, we are concerned with parametric inference. That is, we assume $F$ belongs to a given family of distributions, indexed by the parameter $\theta$:

$$\mathcal{I} = \{F(x; \theta) : \theta \in \Theta\},$$

where $\Theta$ is the parameter space.

Examples

(1) $\mathcal{I}$ is the family of normal distributions:

$$\theta = (\mu, \sigma^2), \qquad \Theta = \{(\mu, \sigma^2) : \mu \in \mathbb{R},\ \sigma^2 \in \mathbb{R}^+\},$$

$$\text{then } \mathcal{I} = \left\{F(x) : F(x) = \Phi\left(\frac{x - \mu}{\sigma}\right)\right\}.$$

(2) $\mathcal{I}$ is the family of Bernoulli distributions with success probability $\theta$:

$$\Theta = \{\theta \in [0, 1] \subset \mathbb{R}\}.$$

In this framework, the problem is then to use the data $x_1, \ldots, x_n$ to draw conclusions about $\theta$.

Definition. 2.1.1

A collection of i.i.d. RVs, $X_1, \ldots, X_n$, with common CDF $F(x; \theta)$, is said to be a random sample (from $F(x; \theta)$).

Definition. 2.1.2

Any function $T = T(x_1, x_2, \ldots, x_n)$ that can be calculated from the data (without knowledge of $\theta$) is called a statistic.

Example

The sample mean $\bar{x}$ is a statistic ($\bar{x} = \frac{1}{n}\sum_{i=1}^n x_i$).


Definition. 2.1.3

A statistic $T$ with the property $T(\mathbf{x}) \in \Theta\ \forall\mathbf{x}$ is called an estimator for $\theta$.

Example

For $x_1, x_2, \ldots, x_n$ i.i.d. $N(\mu, \sigma^2)$, we have $\theta = (\mu, \sigma^2)$. The quantity $(\bar{x}, s^2)$ is an estimator for $\theta$, where $\bar{x} = \frac{1}{n}\sum_{i=1}^n x_i$ and $s^2 = \frac{1}{n-1}\sum_{i=1}^n (x_i - \bar{x})^2$.

There are two important concepts here: the first is that estimators are random variables; the second is that you need to be able to distinguish between random variables and their realisations. In particular, an estimate is a realisation of a random variable.

For example, strictly speaking, $x_1, x_2, \ldots, x_n$ are realisations of random variables $X_1, X_2, \ldots, X_n$, and $\bar{x} = \frac{1}{n}\sum_{i=1}^n x_i$ is a realisation of $\bar{X}$; $\bar{X}$ is an estimator, and $\bar{x}$ is an estimate.

We will find from now on, however, that it is often more convenient, if less rigorous, to use the same symbol for estimator and estimate. This arises especially in the use of $\hat{\theta}$ as both estimator and estimate, as we shall see.

An unsatisfactory aspect of Definition 2.1.3 is that it gives no guidance on how to recognize (or construct) good estimators.

Unless stated otherwise, we will assume that $\theta$ is a scalar parameter in the following. In broad terms, we would like an estimator to be "as close as possible to" $\theta$ "with high probability".

Definition 2.1.4
The mean squared error of the estimator $T$ of $\theta$ is defined by
$$\mathrm{MSE}_T(\theta) = E\{(T - \theta)^2\}.$$

Example
Suppose $X_1, \ldots, X_n$ are i.i.d. Bernoulli($\theta$) RVs, and let $T = \bar X$ = 'proportion of successes'. Since $nT \sim B(n, \theta)$ we have
$$E(nT) = n\theta, \qquad \mathrm{Var}(nT) = n\theta(1-\theta)$$
$$\implies E(T) = \theta, \qquad \mathrm{Var}(T) = \frac{\theta(1-\theta)}{n}$$
$$\implies \mathrm{MSE}_T(\theta) = \mathrm{Var}(T) = \frac{\theta(1-\theta)}{n}.$$

Remark
This example shows that $\mathrm{MSE}_T(\theta)$ must be thought of as a function of $\theta$ rather than just a number (see Figure 18: $\mathrm{MSE}_T(\theta)$ as a function of $\theta$).

Intuitively, a good estimator is one for which the MSE is as small as possible. However, quantifying this idea is complicated, because the MSE is a function of $\theta$, not just a number. See Figure 19.

For this reason, it turns out that it is not possible to construct a minimum MSE estimator in general.

To see why, suppose $T^*$ is a minimum MSE estimator for $\theta$. Now consider the constant estimator $T_a = a$, where $a \in \mathbb{R}$ is arbitrary. Note that for $a = \theta$, $T_\theta = \theta$ with $\mathrm{MSE}_{T_\theta}(\theta) = 0$. Observe $\mathrm{MSE}_{T_a}(a) = 0$; hence if $T^*$ exists, then we must have $\mathrm{MSE}_{T^*}(a) = 0$. As $a$ is arbitrary, we must have $\mathrm{MSE}_{T^*}(\theta) = 0$ for all $\theta \in \Theta$
$$\implies T^* = \theta \text{ with probability } 1.$$
Therefore we conclude that (excluding trivial situations) no minimum MSE estimator can exist.

Definition 2.1.5
The bias of the estimator $T$ is defined by
$$b_T(\theta) = E(T) - \theta.$$


Figure 19: $\mathrm{MSE}_T(\theta)$ as a function of $\theta$

An estimator $T$ with $b_T(\theta) = 0$, i.e. $E(T) = \theta$, is said to be unbiased.

Remarks
(1) Although unbiasedness is an appealing property, not all commonly used estimators are unbiased, and in some situations it is impossible to construct unbiased estimators for the parameter of interest.


Example (Unbiasedness isn't everything)
We know $E(s^2) = \sigma^2$. If $E(s) = \sigma$, then
$$\mathrm{Var}(s) = E(s^2) - \{E(s)\}^2 = 0,$$
which is not the case ($s$ is not constant)
$$\implies E(s) < \sigma.$$

(2) Intuitively, unbiasedness would seem to be a prerequisite for a good estimator. This is to some extent formalized as follows:

Theorem 2.1.1
$$\mathrm{MSE}_T(\theta) = \mathrm{Var}(T) + b_T(\theta)^2.$$

Remark
Restricting attention to unbiased estimators excludes estimators of the form $T_a = a$. We will see that this permits the construction of Minimum Variance Unbiased Estimators (MVUEs) in some cases.
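The decomposition in Theorem 2.1.1 can be checked exactly for a small discrete model. The following sketch (in Python, not part of the original notes) enumerates all $2^n$ Bernoulli samples, so the expectations are exact rather than simulated; the "shrunk" estimator is an arbitrary example of a biased estimator.

```python
from itertools import product

def moments(n, theta, estimator):
    """Exact E[T] and E[T^2] by enumerating all 2^n Bernoulli(theta) samples."""
    e1 = e2 = 0.0
    for xs in product([0, 1], repeat=n):
        p = theta**sum(xs) * (1 - theta)**(n - sum(xs))
        t = estimator(xs)
        e1 += p * t
        e2 += p * t * t
    return e1, e2

def mse_and_decomposition(n, theta, estimator):
    """Return (MSE computed directly, Var + bias^2); Theorem 2.1.1 says they agree."""
    e1, e2 = moments(n, theta, estimator)
    mse = e2 - 2 * theta * e1 + theta**2        # E[(T - theta)^2]
    var = e2 - e1**2
    bias = e1 - theta
    return mse, var + bias**2

# a deliberately biased (shrunk) estimator of theta, chosen just for illustration
shrunk = lambda xs: (sum(xs) + 1) / (len(xs) + 2)
mse, decomp = mse_and_decomposition(6, 0.3, shrunk)
print(mse, decomp)  # the two numbers agree
```

The sample mean, by contrast, is exactly unbiased here, so for it the MSE reduces to the variance alone.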

2.2 Minimum Variance Unbiased Estimation

Consider data $x_1, x_2, \ldots, x_n$ assumed to be observations of RVs $X_1, X_2, \ldots, X_n$ with joint PDF $f_X(x;\theta)$ (probability function $P_X(x;\theta)$).

Definition 2.2.1
The likelihood function is defined by
$$L(\theta; x) = \begin{cases} f_X(x;\theta) & X \text{ continuous} \\ P_X(x;\theta) & X \text{ discrete.} \end{cases}$$
The log-likelihood function is
$$l(\theta; x) = \log L(\theta; x)$$
(log is the natural log, i.e. $\ln$).


Remark
If $x_1, x_2, \ldots, x_n$ are independent, the log-likelihood function can be written as
$$l(\theta; x) = \sum_{i=1}^n \log f_i(x_i;\theta).$$
If $x_1, x_2, \ldots, x_n$ are i.i.d., we have
$$l(\theta; x) = \sum_{i=1}^n \log f(x_i;\theta).$$

Definition 2.2.2
Consider a statistical problem with log-likelihood $l(\theta; x)$. The score is defined by
$$U(\theta; x) = \frac{\partial l}{\partial\theta},$$
and the Fisher information is
$$I(\theta) = E\left(-\frac{\partial^2 l}{\partial\theta^2}\right).$$

Remark
For a single observation with PDF $f(x;\theta)$, the information is
$$i(\theta) = E\left(-\frac{\partial^2}{\partial\theta^2}\log f(X;\theta)\right).$$
In the case of $x_1, x_2, \ldots, x_n$ i.i.d., we have
$$I(\theta) = n\,i(\theta).$$

Theorem 2.2.1
Under suitable regularity conditions,
$$E\{U(\theta;X)\} = 0 \quad\text{and}\quad \mathrm{Var}\{U(\theta;X)\} = I(\theta).$$


Proof.
Observe that
$$E\{U(\theta;X)\} = \int_{-\infty}^{\infty}\cdots\int_{-\infty}^{\infty} U(\theta;x)\,f(x;\theta)\,dx_1\cdots dx_n
= \int\cdots\int \left(\frac{\partial}{\partial\theta}\log f(x;\theta)\right) f(x;\theta)\,dx_1\cdots dx_n$$
$$= \int\cdots\int \frac{\partial f(x;\theta)/\partial\theta}{f(x;\theta)}\, f(x;\theta)\,dx_1\cdots dx_n
= \int\cdots\int \frac{\partial}{\partial\theta} f(x;\theta)\,dx_1\cdots dx_n.$$
The regularity conditions allow the interchange of differentiation and integration, so this equals
$$\frac{\partial}{\partial\theta}\int\cdots\int f(x;\theta)\,dx_1\cdots dx_n = \frac{\partial}{\partial\theta}\,1 = 0,$$
as required.

To calculate $\mathrm{Var}\{U(\theta;X)\}$, observe that
$$\frac{\partial^2 l(\theta;x)}{\partial\theta^2} = \frac{\partial}{\partial\theta}\left\{\frac{\partial f(x;\theta)/\partial\theta}{f(x;\theta)}\right\}
= \frac{\dfrac{\partial^2 f(x;\theta)}{\partial\theta^2}\,f(x;\theta) - \left(\dfrac{\partial f(x;\theta)}{\partial\theta}\right)^2}{[f(x;\theta)]^2}
= \frac{\partial^2 f(x;\theta)/\partial\theta^2}{f(x;\theta)} - U^2(\theta;x)$$
$$\implies U^2(\theta;x) = \frac{\partial^2 f(x;\theta)/\partial\theta^2}{f(x;\theta)} - \frac{\partial^2 l}{\partial\theta^2}$$
$$\implies \mathrm{Var}\{U(\theta;X)\} = E\{U^2(\theta;X)\} \quad (\text{as } E(U) = 0)
= E\left(\frac{\partial^2 f(X;\theta)/\partial\theta^2}{f(X;\theta)}\right) + I(\theta)$$
(by the definition of $I(\theta)$).

Finally observe that
$$E\left(\frac{\partial^2 f(X;\theta)/\partial\theta^2}{f(X;\theta)}\right)
= \int_{-\infty}^{\infty}\cdots\int_{-\infty}^{\infty} \frac{\partial^2 f(x;\theta)/\partial\theta^2}{f(x;\theta)}\, f(x;\theta)\,dx_1\cdots dx_n$$
$$= \int\cdots\int \frac{\partial^2}{\partial\theta^2} f(x;\theta)\,dx_1\cdots dx_n
= \frac{\partial^2}{\partial\theta^2}\int\cdots\int f(x;\theta)\,dx_1\cdots dx_n
= \frac{\partial^2}{\partial\theta^2}\,1 = 0.$$
Hence, we have proved $\mathrm{Var}\{U(\theta;X)\} = I(\theta)$.
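Theorem 2.2.1 can be checked numerically for a concrete model. The sketch below (Python, not from the original notes) uses a single Poisson($\lambda$) observation, for which the score is $U = X/\lambda - 1$ and $i(\lambda) = 1/\lambda$; the expectations are computed by direct summation over the probability function, truncated where the tail is negligible.

```python
import math

def poisson_pmfs(lam, kmax=60):
    """Poisson probabilities p(0), ..., p(kmax-1), built iteratively (no factorials)."""
    p = math.exp(-lam)
    probs = [p]
    for x in range(1, kmax):
        p *= lam / x
        probs.append(p)
    return probs

def score_moments(lam):
    """E[U] and E[U^2] for the score of one Poisson observation, U = x/lam - 1."""
    probs = poisson_pmfs(lam)
    eu = sum((x / lam - 1) * p for x, p in enumerate(probs))
    eu2 = sum((x / lam - 1) ** 2 * p for x, p in enumerate(probs))
    return eu, eu2

lam = 2.5
eu, eu2 = score_moments(lam)
print(eu, eu2, 1 / lam)  # E[U] is (numerically) 0, and Var(U) = E[U^2] matches i(lam) = 1/lam
```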


Theorem 2.2.2 (Cramér-Rao Lower Bound)
If $T$ is an unbiased estimator for $\theta$, then
$$\mathrm{Var}(T) \ge \frac{1}{I(\theta)}.$$

Proof.
Observe that
$$\mathrm{Cov}\{T(X), U(\theta;X)\} = E\{T(X)U(\theta;X)\}
= \int_{-\infty}^{\infty}\cdots\int_{-\infty}^{\infty} T(x)\,U(\theta;x)\,f(x;\theta)\,dx_1\cdots dx_n$$
$$= \int\cdots\int T(x)\,\frac{\partial f(x;\theta)/\partial\theta}{f(x;\theta)}\,f(x;\theta)\,dx_1\cdots dx_n
= \frac{\partial}{\partial\theta}\int\cdots\int T(x)\,f(x;\theta)\,dx_1\cdots dx_n$$
$$= \frac{\partial}{\partial\theta}E\{T(X)\} = \frac{\partial}{\partial\theta}\theta = 1.$$
To summarize, $\mathrm{Cov}(T, U) = 1$.

Recall that $\mathrm{Cov}^2(T, U) \le \mathrm{Var}(T)\,\mathrm{Var}(U)$ [i.e. $|\rho| \le 1$, and divide both sides by the RHS]
$$\implies \mathrm{Var}(T) \ge \frac{\mathrm{Cov}^2(T, U)}{\mathrm{Var}(U)} = \frac{1}{I(\theta)},$$
as required.

Example
Suppose $X_1, X_2, \ldots, X_n$ are i.i.d. Po($\lambda$) RVs, and let $\bar X = \frac{1}{n}\sum_{i=1}^n X_i$. We will prove that $\bar X$ is a MVUE for $\lambda$.


Proof.
(1) Recall that if $X \sim$ Po($\lambda$), then $P(x) = \dfrac{e^{-\lambda}\lambda^x}{x!}$, $E(X) = \lambda$ and $\mathrm{Var}(X) = \lambda$. Hence $E(\bar X) = \lambda$ and $\mathrm{Var}(\bar X) = \dfrac{\lambda}{n}$, so $\bar X$ is unbiased for $\lambda$.

(2) To show that $\bar X$ is MVUE, we will show that $\mathrm{Var}(\bar X) = \dfrac{1}{I(\lambda)}$.

Step 1: The log-likelihood is
$$l(\lambda; x) = \log P(x;\lambda) = \log \prod_{i=1}^n \frac{e^{-\lambda}\lambda^{x_i}}{x_i!}
= \log\left(e^{-n\lambda}\,\lambda^{\sum_{i=1}^n x_i}\,\prod_{i=1}^n \frac{1}{x_i!}\right)$$
$$= -n\lambda + \sum_{i=1}^n x_i \log\lambda - \log\prod_{i=1}^n x_i!
= -n\lambda + n\bar x\log\lambda - \log\prod_{i=1}^n x_i!.$$

Step 2: Find $\dfrac{\partial^2 l}{\partial\lambda^2}$:
$$\frac{\partial l}{\partial\lambda} = -n + \frac{n\bar x}{\lambda} \implies \frac{\partial^2 l}{\partial\lambda^2} = -\frac{n\bar x}{\lambda^2}.$$

Step 3:
$$I(\lambda) = -E\left(\frac{\partial^2 l}{\partial\lambda^2}\right) = E\left(\frac{n\bar X}{\lambda^2}\right) = \frac{n\,E(\bar X)}{\lambda^2} = \frac{n\lambda}{\lambda^2} = \frac{n}{\lambda}.$$

(3) Finally, observe that $\mathrm{Var}(\bar X) = \dfrac{\lambda}{n} = \dfrac{1}{I(\lambda)}$. By Theorem 2.2.2, any unbiased estimator $T$ for $\lambda$ must have
$$\mathrm{Var}(T) \ge \frac{1}{I(\lambda)} = \frac{\lambda}{n}
\implies \bar X \text{ is a MVUE.}$$

Theorem 2.2.3
The unbiased estimator $T(x)$ can achieve the Cramér-Rao Lower Bound only if the joint PDF/probability function has the form
$$f(x;\theta) = \exp\{A(\theta)T(x) + B(\theta) + h(x)\},$$
where $A$, $B$ are functions such that $\theta = -\dfrac{B'(\theta)}{A'(\theta)}$, and $h$ is some function of $x$.

Proof. Recall from the proof of Theorem 2.2.2 that the bound arises from the inequality
$$\mathrm{Cor}^2(T(X), U(\theta;X)) \le 1,$$
where $U(\theta;x) = \dfrac{\partial l}{\partial\theta}$. Moreover, it is easy to see that the Cramér-Rao Lower Bound (CRLB) can be achieved only when
$$\mathrm{Cor}^2\{T(X), U(\theta;X)\} = 1.$$
Hence the CRLB is achieved only if $U$ is a linear function of $T$ with probability 1 (correlation equals 1):
$$U = aT + b, \quad a, b \text{ constants, which can depend on } \theta \text{ but not } x,$$
i.e.
$$\frac{\partial l}{\partial\theta} = U(\theta;x) = a(\theta)T(x) + b(\theta) \quad\text{for all } x.$$
Integrating with respect to $\theta$, we obtain
$$\log f(x;\theta) = l(\theta;x) = A(\theta)T(x) + B(\theta) + h(x),$$
where $A'(\theta) = a(\theta)$, $B'(\theta) = b(\theta)$, and $h(x)$ is the constant of integration. Hence,
$$f(x;\theta) = \exp\{A(\theta)T(x) + B(\theta) + h(x)\}.$$
Finally, to ensure that $E(T) = \theta$, recall that $E\{U(\theta;X)\} = 0$ and observe that in this case
$$E\{U(\theta;X)\} = E[A'(\theta)T(X) + B'(\theta)] = A'(\theta)E[T(X)] + B'(\theta) = 0
\implies E[T(X)] = -\frac{B'(\theta)}{A'(\theta)}.$$
Hence, in order that $T(X)$ be unbiased for $\theta$, we must have $-\dfrac{B'(\theta)}{A'(\theta)} = \theta$.

Definition 2.2.3
A probability density function/probability function is said to be a single-parameter exponential family if it has the form
$$f(x;\theta) = \exp\{A(\theta)t(x) + B(\theta) + h(x)\}$$
for all $x \in D \subseteq \mathbb{R}$, where $D$ does not depend on $\theta$.

If $x_1, \ldots, x_n$ is a random sample from an exponential family, the joint PDF/probability function becomes
$$f(x;\theta) = \prod_{i=1}^n \exp\{A(\theta)t(x_i) + B(\theta) + h(x_i)\}
= \exp\left\{A(\theta)\sum_{i=1}^n t(x_i) + nB(\theta) + \sum_{i=1}^n h(x_i)\right\}.$$


In this case, a minimum variance unbiased estimator that achieves the CRLB can be seen to be the function
$$T = \frac{1}{n}\sum_{i=1}^n t(x_i),$$
which is the MVUE for $E(T) = -\dfrac{B'(\theta)}{A'(\theta)}$.

Example (Poisson example revisited)
Observe that the Poisson probability function is
$$p(x;\lambda) = \frac{e^{-\lambda}\lambda^x}{x!} = \exp\{(\log\lambda)x - \lambda - \log x!\} = \exp\{A(\lambda)t(x) + B(\lambda) + h(x)\},$$
where
$$A(\lambda) = \log\lambda, \quad B(\lambda) = -\lambda, \quad t(x) = x, \quad h(x) = -\log x!.$$
Hence $\bar X = \frac{1}{n}\sum_{i=1}^n t(x_i)$ is the MVUE for
$$-\frac{B'(\lambda)}{A'(\lambda)} = \frac{-(-1)}{1/\lambda} = \lambda.$$

Example (2)
Consider the Exp($\lambda$) distribution. The PDF is
$$f(x) = \lambda e^{-\lambda x}, \quad x > 0, \qquad = \exp\{-\lambda x + \log\lambda\};$$
so $f$ is an exponential family with
$$A(\lambda) = -\lambda, \quad B(\lambda) = \log\lambda, \quad t(x) = x, \quad h(x) = 0.$$
We can also check that $E(X) = -\dfrac{B'(\lambda)}{A'(\lambda)}$. In particular, we have seen previously that $E(X) = \dfrac{1}{\lambda}$ for $X \sim$ Exp($\lambda$). Now observe that $A'(\lambda) = -1$, $B'(\lambda) = \dfrac{1}{\lambda}$
$$\implies -\frac{B'(\lambda)}{A'(\lambda)} = \frac{1}{\lambda},$$
as required.

It also follows that if $x_1, x_2, \ldots, x_n$ are i.i.d. Exp($\lambda$) observations, then $\bar X = \frac{1}{n}\sum_{i=1}^n t(x_i)$ is the MVUE for $\dfrac{1}{\lambda} = E(X)$.

Definition 2.2.4

Consider data with PDF/probability function $f(x;\theta)$. A statistic $S(x)$ is called a sufficient statistic for $\theta$ if $f(x \mid s; \theta)$ does not depend on $\theta$ for all $s$.

Remarks
(1) We will see that sufficient statistics capture all of the information in the data $x$ that is relevant to $\theta$.
(2) If we consider vector-valued statistics, then this definition admits trivial examples, such as $s = x$, since
$$P(X = x \mid S = s) = \begin{cases} 1 & x = s \\ 0 & \text{otherwise,} \end{cases}$$
which does not depend on $\theta$.


Example
Suppose $x_1, x_2, \ldots, x_n$ are i.i.d. Bernoulli($\theta$) and let $s = \sum_{i=1}^n x_i$. Then $S$ is sufficient for $\theta$.

Proof.
$$P(x) = \prod_{i=1}^n \theta^{x_i}(1-\theta)^{1-x_i} = \theta^{\sum x_i}(1-\theta)^{\sum(1-x_i)} = \theta^s(1-\theta)^{n-s}.$$
Next observe that $S \sim B(n,\theta)$
$$\implies P(s) = \binom{n}{s}\theta^s(1-\theta)^{n-s}$$
$$\implies P(x \mid s) = \frac{P(\{X = x\} \cap \{S = s\})}{P(S = s)} = \frac{P(X = x)}{P(S = s)}
= \begin{cases} \dfrac{1}{\binom{n}{s}} & \text{if } \sum_{i=1}^n x_i = s \\[1ex] 0 & \text{otherwise.} \end{cases}$$
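The defining property, that the conditional distribution given $S$ is free of $\theta$, can be checked directly. A small sketch (Python, illustrative values only):

```python
from math import comb

def cond_prob(xs, theta):
    """P(X = xs | S = s) for i.i.d. Bernoulli(theta) data."""
    n, s = len(xs), sum(xs)
    joint = theta**s * (1 - theta)**(n - s)              # P(X = xs)
    marginal = comb(n, s) * theta**s * (1 - theta)**(n - s)  # P(S = s), binomial
    return joint / marginal                              # = 1 / C(n, s), free of theta

xs = (1, 0, 1, 1, 0)
print(cond_prob(xs, 0.2), cond_prob(xs, 0.9))  # identical: 1 / C(5, 3) = 0.1
```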

Theorem 2.2.4 (The Factorization Theorem)
Suppose $x_1, \ldots, x_n$ have joint PDF/probability function $f(x;\theta)$. Then $S$ is a sufficient statistic for $\theta$ if and only if
$$f(x;\theta) = g(s;\theta)h(x)$$
for some functions $g$, $h$.

Proof. Omitted.


Examples
(1) $x_1, x_2, \ldots, x_n$ i.i.d. $N(\mu, \sigma^2)$, $\sigma^2$ known. Then
$$f(x;\mu) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-(x_i-\mu)^2/(2\sigma^2)}
= (2\pi\sigma^2)^{-n/2}\exp\left\{-\frac{1}{2\sigma^2}\sum_{i=1}^n (x_i-\mu)^2\right\}$$
$$= (2\pi\sigma^2)^{-n/2}\exp\left\{-\frac{1}{2\sigma^2}\left(\sum_{i=1}^n x_i^2 - 2\mu\sum_{i=1}^n x_i + n\mu^2\right)\right\}$$
$$= \exp\left\{\frac{1}{2\sigma^2}\left(2\mu\sum_{i=1}^n x_i - n\mu^2\right)\right\}\exp\left\{-\frac{1}{2\sigma^2}\sum_{i=1}^n x_i^2 - \frac{n}{2}\log(2\pi\sigma^2)\right\}$$
$$\implies S = \sum_{i=1}^n x_i \text{ is sufficient for } \mu.$$

(2) If $x_1, x_2, \ldots, x_n$ are i.i.d. with
$$f(x) = \exp\{A(\theta)t(x) + B(\theta) + h(x)\},$$
then
$$f(x;\theta) = \exp\left\{A(\theta)\sum_{i=1}^n t(x_i) + nB(\theta)\right\}\exp\left\{\sum_{i=1}^n h(x_i)\right\}$$
$$\implies S = \sum_{i=1}^n t(x_i)$$
is sufficient for $\theta$ in the exponential family by the Factorization Theorem.

Theorem 2.2.5 (Rao-Blackwell Theorem)
If $T$ is an unbiased estimator for $\theta$ and $S$ is a sufficient statistic for $\theta$, then the quantity $T^* = E(T \mid S)$ is also an unbiased estimator for $\theta$ with $\mathrm{Var}(T^*) \le \mathrm{Var}(T)$. Moreover, $\mathrm{Var}(T^*) = \mathrm{Var}(T)$ iff $T^* = T$ with probability 1.

Proof.
(I) Unbiasedness:
$$\theta = E(T) = E_S\{E(T \mid S)\} = E(T^*) \implies E(T^*) = \theta.$$


(II) Variance inequality:
$$\mathrm{Var}(T) = E\{\mathrm{Var}(T \mid S)\} + \mathrm{Var}\{E(T \mid S)\}
= E\{\mathrm{Var}(T \mid S)\} + \mathrm{Var}(T^*) \ge \mathrm{Var}(T^*),$$
since $E\{\mathrm{Var}(T \mid S)\} \ge 0$. Observe also that
$$\mathrm{Var}(T) = \mathrm{Var}(T^*) \implies E\{\mathrm{Var}(T \mid S)\} = 0
\implies \mathrm{Var}(T \mid S) = 0 \text{ with prob. } 1
\implies T = E(T \mid S) \text{ with prob. } 1.$$

(III) $T^*$ is an estimator: since $S$ is sufficient for $\theta$,
$$T^* = \int_{-\infty}^{\infty} T(x)\,f(x \mid s)\,dx,$$
which does not depend on $\theta$.

Remarks
(1) The Rao-Blackwell Theorem can occasionally be used to construct estimators.
(2) The theoretical importance of this result is to observe that $T^*$ will always depend on $x$ only through $S$. If $T$ is already a MVUE then $T^* = T$, so a MVUE depends on $x$ only through $S$.

Example
Suppose $x_1, \ldots, x_n$ are i.i.d. $N(\mu, \sigma^2)$, $\sigma^2$ known. We want to estimate $\mu$. Take $T = x_1$ as an unbiased estimator for $\mu$; $S = \sum_{i=1}^n x_i$ is a sufficient statistic for $\mu$.


According to Rao-Blackwell, $T^* = E(T \mid S)$ will be an improved (unbiased) estimator for $\mu$.

If $x = (x_1, \ldots, x_n)^T$ for $X \sim N_n(\mu\mathbf{1}, \sigma^2 I)$, then $\binom{T}{S} = Ax$, where $A$ is the matrix
$$A = \begin{pmatrix} 1 & 0 & \cdots & 0 \\ 1 & 1 & \cdots & 1 \end{pmatrix}.$$
Hence,
$$\binom{T}{S} \sim N_2(A\mu\mathbf{1},\ \sigma^2 AA^T) = N_2\left(\begin{pmatrix}\mu \\ n\mu\end{pmatrix},\ \begin{pmatrix}\sigma^2 & \sigma^2 \\ \sigma^2 & n\sigma^2\end{pmatrix}\right)$$
$$\implies T \mid S = s \sim N\left(\mu + \frac{\sigma^2}{n\sigma^2}(s - n\mu),\ \sigma^2 - \frac{\sigma^4}{n\sigma^2}\right)
= N\left(\frac{s}{n},\ \left(1 - \frac{1}{n}\right)\sigma^2\right)$$
$$\therefore\ E(T \mid S) = \frac{1}{n}\sum x_i = \bar x = T^*$$
is a better (or equal) estimator for $\mu$.

Finally, observe $\mathrm{Var}(\bar X) = \dfrac{\sigma^2}{n} \le \sigma^2 = \mathrm{Var}(X_1)$, with strict inequality for $n > 1$, $\sigma^2 > 0$.
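The improvement from Rao-Blackwellisation is easy to see by simulation. The sketch below (Python, with arbitrarily chosen $\mu$, $\sigma$ and a fixed seed) compares the crude unbiased estimator $T = x_1$ with $T^* = E(T \mid S) = \bar x$:

```python
import random
import statistics

def simulate(n=10, mu=3.0, sigma=2.0, reps=4000, seed=1):
    """Empirical variances of T = x_1 and its Rao-Blackwellised version T* = x-bar."""
    rng = random.Random(seed)
    t_vals, tstar_vals = [], []
    for _ in range(reps):
        xs = [rng.gauss(mu, sigma) for _ in range(n)]
        t_vals.append(xs[0])            # T  = x_1
        tstar_vals.append(sum(xs) / n)  # T* = E(T | S) = x-bar
    return statistics.variance(t_vals), statistics.variance(tstar_vals)

v_t, v_tstar = simulate()
print(v_t, v_tstar)  # roughly sigma^2 = 4 versus sigma^2 / n = 0.4
```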

Remarks
(1) We have given only an incomplete outline of the theory.
(2) There are also the concepts of minimal sufficient and complete statistics to be considered.

2.3 Methods Of Estimation

2.3.1 Method Of Moments

Consider a random sample $X_1, X_2, \ldots, X_n$ from $F(\theta)$ and let $\mu = \mu(\theta) = E(X)$. The method of moments estimator $\tilde\theta$ is defined as the solution to the equation
$$\bar X = \mu(\tilde\theta).$$


Example
$X_1, \ldots, X_n \sim$ i.i.d. Exp($\lambda$) $\implies E(X_i) = \dfrac{1}{\lambda}$; the method of moments estimator is defined as the solution to the equation
$$\bar X = \frac{1}{\tilde\lambda} \implies \tilde\lambda = \frac{1}{\bar X}.$$

Remark
The method of moments is appealing:
(1) it is simple;
(2) its rationale is that $\bar X$ is BLUE for $\mu$.

If $\theta = (\theta_1, \theta_2, \ldots, \theta_p)$, the method of moments can be adapted as follows:
(1) let $\mu_k = \mu_k(\theta) = E(X^k)$, $k = 1, \ldots, p$;
(2) let $m_k = \frac{1}{n}\sum_{i=1}^n x_i^k$.
The MoM estimator $\tilde\theta$ is defined to be the solution to the system of equations
$$m_k = \mu_k(\tilde\theta), \qquad k = 1, \ldots, p.$$

Example
Suppose $X_1, X_2, \ldots, X_n$ are i.i.d. $N(\mu, \sigma^2)$ and let $\theta = (\mu, \sigma^2)$. Here $p = 2$, giving two equations in two unknowns:
$$\mu_1(\theta) = E(X) = \mu, \qquad \mu_2(\theta) = E(X^2) = \sigma^2 + \mu^2,$$
$$m_1 = \frac{1}{n}\sum_{i=1}^n x_i, \qquad m_2 = \frac{1}{n}\sum_{i=1}^n x_i^2.$$
Therefore we need to solve for $\mu$, $\sigma^2$:
$$\tilde\mu = \bar x, \qquad \tilde\sigma^2 + \tilde\mu^2 = \frac{1}{n}\sum x_i^2
\implies \tilde\sigma^2 = \frac{1}{n}\sum x_i^2 - \bar x^2 = \frac{1}{n}\sum (x_i - \bar x)^2 = \frac{n-1}{n}\,s^2.$$
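The two-equation solve above is mechanical; a minimal sketch (Python, sample values arbitrary):

```python
def mom_normal(xs):
    """Method of moments estimates (mu~, sigma~^2) for N(mu, sigma^2):
    solve m1 = mu and m2 = sigma^2 + mu^2."""
    n = len(xs)
    m1 = sum(xs) / n
    m2 = sum(x * x for x in xs) / n
    return m1, m2 - m1**2   # mu~ = x-bar, sigma~^2 = m2 - m1^2

xs = [2.0, 4.0, 6.0, 8.0]
print(mom_normal(xs))  # (5.0, 5.0)
```

Note $\tilde\sigma^2$ equals $\frac{n-1}{n}s^2$, not the usual unbiased $s^2$.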

Remark
Method of moments estimators can be seen to have good statistical properties. Under fairly mild regularity conditions, the MoM estimator is
(1) consistent, i.e. $\tilde\theta \to \theta$ in probability as $n \to \infty$;
(2) asymptotically normal, i.e.
$$\frac{\tilde\theta - \theta}{\sqrt{\mathrm{Var}(\tilde\theta)}} \xrightarrow{D} N(0, 1) \quad\text{as } n \to \infty.$$
However, the MoM estimator has one serious defect. Suppose $X_1, \ldots, X_n$ is a random sample from $F_{\theta}$ and let $\tilde\theta_X$ be the MoM estimator. Let $Y = h(X)$ for some invertible function $h$; then $Y_1, \ldots, Y_n$ should contain the same information as $X_1, \ldots, X_n$, so we would like $\tilde\theta_Y \equiv \tilde\theta_X$. Unfortunately this does not hold.

Example
$X_1, \ldots, X_n$ i.i.d. Exp($\lambda$). We saw previously that $\tilde\lambda_X = \dfrac{1}{\bar X}$. Suppose $Y_i = X_i^2$ (an invertible transformation for $X_i > 0$). To obtain $\tilde\lambda_Y$, observe $E(Y) = E(X^2) = \dfrac{2}{\lambda^2}$
$$\implies \tilde\lambda_Y = \sqrt{\frac{2n}{\sum X_i^2}} \ne \frac{1}{\bar X}.$$
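The mismatch is concrete for any data set. A short sketch (Python, sample values arbitrary):

```python
import math

def mom_exp_from_x(xs):
    """lambda~_X = 1 / x-bar, from the first moment of X."""
    return len(xs) / sum(xs)

def mom_exp_from_y(xs):
    """lambda~_Y from Y = X^2: E(Y) = 2/lambda^2 gives lambda~_Y = sqrt(2n / sum x_i^2)."""
    return math.sqrt(2 * len(xs) / sum(x * x for x in xs))

xs = [0.5, 1.0, 1.5, 2.0]
print(mom_exp_from_x(xs), mom_exp_from_y(xs))  # 0.8 versus about 1.033: not equal
```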

2.3.2 Maximum Likelihood Estimation

Consider a statistical problem with log-likelihood function $l(\theta; x)$.

Definition. The maximum likelihood estimate $\hat\theta$ is the solution to the problem $\max_{\theta \in \Theta} l(\theta; x)$, i.e. $\hat\theta = \arg\max_{\theta \in \Theta} l(\theta; x)$.

Remark
In practice, maximum likelihood estimates are obtained by solving the score equation
$$\frac{\partial l}{\partial\theta} = U(\theta; x) = 0.$$

Example

If $X_1, X_2, \ldots, X_n$ are i.i.d. geometric($\theta$) RVs, find $\hat\theta$.

Solution:
$$l(\theta; x) = \log\left\{\prod_{i=1}^n \theta(1-\theta)^{x_i - 1}\right\}
= \sum_{i=1}^n \log\theta + \sum_{i=1}^n (x_i - 1)\log(1-\theta)
= n\{\log\theta + (\bar x - 1)\log(1-\theta)\}$$
$$\therefore\ U(\theta; x) = \frac{\partial l}{\partial\theta}
= n\left\{\frac{1}{\theta} - \frac{\bar x - 1}{1-\theta}\right\}
= n\left(\frac{1 - \theta\bar x}{\theta(1-\theta)}\right).$$
To find $\hat\theta$, we solve for $\theta$ in $1 - \theta\bar x = 0$
$$\implies \hat\theta = \frac{1}{\bar x} \left[= \frac{n}{\sum_{i=1}^n x_i} = \frac{\#\text{ of successes}}{\#\text{ of trials}}\right].$$

Elementary Properties

(1) MLEs are invariant under invertible transformations of the data.

Proof. (Continuous i.i.d. RVs, strictly monotonic transformation)
Suppose $X$ has PDF $f(x;\theta)$ and $Y = h(X)$, where $h$ is strictly monotonic;
$$\implies f_Y(y;\theta) = f_X(h^{-1}(y);\theta)\,|(h^{-1})'(y)| \quad\text{when } y = h(x).$$
Consider data $x_1, x_2, \ldots, x_n$ and the transformed version $y_1, y_2, \ldots, y_n$:
$$l_Y(\theta; y) = \log\left\{\prod_{i=1}^n f_Y(y_i;\theta)\right\}
= \log\left\{\prod_{i=1}^n f_X(h^{-1}(y_i);\theta)\,|(h^{-1})'(y_i)|\right\}$$
$$= \log\prod_{i=1}^n f_X(x_i;\theta) + \log\prod_{i=1}^n |(h^{-1})'(y_i)|
= l_X(\theta; x) + \log\left(\prod_{i=1}^n |(h^{-1})'(y_i)|\right).$$
Since $\log\left(\prod_{i=1}^n |(h^{-1})'(y_i)|\right)$ does not depend on $\theta$, it follows that $\hat\theta$ maximizes $l_X(\theta; x)$ iff it maximizes $l_Y(\theta; y)$.
(2) If $\varphi = \varphi(\theta)$ is a 1-1 transformation of $\theta$, then MLEs obey the transformation rule $\hat\varphi = \varphi(\hat\theta)$.

Proof. It can be checked that if $l_\varphi(\varphi; x)$ is the log-likelihood with respect to $\varphi$, then
$$l_\theta(\theta; x) = l_\varphi(\varphi(\theta); x).$$
It follows that $\hat\theta$ maximizes $l_\theta(\theta; x)$ iff $\hat\varphi = \varphi(\hat\theta)$ maximizes $l_\varphi(\varphi; x)$.


(3) If $t(x)$ is a sufficient statistic for $\theta$, then $\hat\theta$ depends on the data only as a function of $t(x)$.

Proof. By the Factorization Theorem, $t(x)$ is sufficient for $\theta$ iff
$$f(x;\theta) = g(t(x);\theta)h(x)$$
$$\implies l(\theta; x) = \log g(t(x);\theta) + \log h(x)$$
$$\implies \hat\theta \text{ maximizes } l(\theta; x) \iff \hat\theta \text{ maximizes } \log g(t(x);\theta)$$
$$\implies \hat\theta \text{ is a function of } t(x).$$

Example
Suppose $X_1, X_2, \ldots, X_n$ are i.i.d. Exp($\lambda$); then
$$f_X(x;\lambda) = \prod_{i=1}^n \lambda e^{-\lambda x_i} = \lambda^n e^{-\lambda\sum_{i=1}^n x_i} = \lambda^n e^{-\lambda n\bar x}.$$
By the Factorization Theorem, $\bar x$ is sufficient for $\lambda$. To get the MLE,
$$U(\lambda; x) = \frac{\partial l}{\partial\lambda} = \frac{\partial}{\partial\lambda}(n\log\lambda - n\lambda\bar x) = \frac{n}{\lambda} - n\bar x;$$
$$\frac{\partial l}{\partial\lambda} = 0 \implies \frac{1}{\lambda} = \bar x \implies \hat\lambda = \frac{1}{\bar x}.$$
Note: as proved, $\hat\lambda$ is a function of the sufficient statistic $\bar x$.
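Solving the score equation numerically gives the same answer as the closed form $\hat\lambda = 1/\bar x$. A sketch (Python, with arbitrary sample values; bisection works because the score is decreasing in $\lambda$):

```python
def score(lam, xs):
    """U(lambda; x) = n/lambda - n*x-bar for i.i.d. Exp(lambda) data."""
    return len(xs) / lam - sum(xs)

def mle_bisect(xs, lo=1e-6, hi=1e6, tol=1e-12):
    """Solve U(lambda) = 0 by bisection."""
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if score(mid, xs) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

xs = [0.2, 0.5, 1.3, 2.0]
xbar = sum(xs) / len(xs)
print(mle_bisect(xs), 1 / xbar)  # both equal 1.0 for these values
```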

Let $Y_1, Y_2, \ldots, Y_n$ be defined by $Y_i = \log X_i$.


If $X \sim$ Exp($\lambda$) and $Y = \log X$, then we can find $f_Y(y)$ by taking $h(x) = \log x$ and using
$$f_Y(y) = f_X(h^{-1}(y))\,|(h^{-1})'(y)| \qquad [h^{-1}(y) = e^y,\ (h^{-1})'(y) = e^y]$$
$$= \lambda e^{-\lambda e^y} e^y$$
$$\implies l_Y(\lambda; y) = \log\left\{\prod_{i=1}^n \lambda e^{-\lambda e^{y_i}} e^{y_i}\right\}
= \log\left\{\lambda^n e^{-\lambda\sum_{i=1}^n e^{y_i}}\, e^{\sum_{i=1}^n y_i}\right\}
= n\log\lambda - \lambda\sum_{i=1}^n e^{y_i} + \sum_{i=1}^n y_i$$
$$\implies \frac{\partial}{\partial\lambda} l_Y(\lambda; y) = \frac{n}{\lambda} - \sum_{i=1}^n e^{y_i}
\implies \hat\lambda = \frac{n}{\sum_{i=1}^n e^{y_i}} = \frac{n}{\sum_{i=1}^n e^{\log x_i}} = \frac{n}{\sum_{i=1}^n x_i} = \frac{1}{\bar x}\ \left(= \frac{n}{n\bar x}\right).$$

Finally, suppose we take $\theta = \log\lambda \implies \lambda = e^\theta$
$$\implies f(x;\theta) = e^\theta e^{-e^\theta x} \quad (= \lambda e^{-\lambda x})$$
$$\implies l_\theta(\theta; x) = \log\left(\prod_{i=1}^n e^\theta e^{-e^\theta x_i}\right)
= \log\left\{e^{n\theta} e^{-e^\theta n\bar x}\right\} = n\theta - e^\theta n\bar x.$$
$$\frac{\partial l}{\partial\theta} = n - n\bar x e^\theta;
\qquad \frac{\partial l}{\partial\theta} = 0 \implies 1 = \bar x e^\theta \implies e^\theta = \frac{1}{\bar x} \implies \hat\theta = -\log\bar x.$$
But $\hat\lambda = \dfrac{1}{\bar x} \implies \log\hat\lambda = \log\left(\dfrac{1}{\bar x}\right) = -\log\bar x = \hat\theta$, as required.
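The invariance property (2) can be checked directly: maximizing on the $\lambda$ scale and on the $\theta = \log\lambda$ scale give estimates linked by the transformation. A sketch (Python, sample values arbitrary):

```python
import math

def loglik_lambda(lam, xs):
    """l(lambda; x) = n*log(lambda) - lambda * sum(x_i) for i.i.d. Exp(lambda)."""
    return len(xs) * math.log(lam) - lam * sum(xs)

def loglik_theta(theta, xs):
    """Reparametrised log-likelihood: theta = log(lambda)."""
    return loglik_lambda(math.exp(theta), xs)

xs = [0.4, 0.8, 1.2, 1.6]
lam_hat = len(xs) / sum(xs)                 # MLE on the lambda scale: 1 / x-bar
theta_hat = -math.log(sum(xs) / len(xs))    # MLE on the theta scale: -log(x-bar)
print(math.log(lam_hat), theta_hat)         # equal: phi-hat = phi(theta-hat)

# sanity check: theta_hat is a local maximum of l_theta
eps = 1e-3
assert loglik_theta(theta_hat, xs) >= loglik_theta(theta_hat + eps, xs)
assert loglik_theta(theta_hat, xs) >= loglik_theta(theta_hat - eps, xs)
```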

Remark (Not examinable)
Maximum likelihood estimation and the method of moments can both be generated by the use of estimating functions. An estimating function is a function $H(x;\theta)$ with the property $E\{H(x;\theta)\} = 0$. $H$ can be used to define an estimator $\tilde\theta$ as a solution to the equation
$$H(x;\tilde\theta) = 0.$$
For method of moments estimates, we can take
$$H(x;\theta) = \bar x - E(X) \qquad [E(\bar X) \text{ in the non-i.i.d. case}].$$
For maximum likelihood, we can use
$$H(x;\theta) = U(\theta; x) = \frac{\partial l}{\partial\theta}.$$
Recall we showed previously that $E\{U(\theta;X)\} = 0$. (To calculate the MLE: (1) find the log-likelihood, then (2) calculate the score and set it equal to 0.)

Asymptotic Properties of MLEs:

Suppose $X_1, X_2, X_3, \ldots$ is a sequence of i.i.d. RVs with common PDF (or probability function) $f(x;\theta)$.

Assume:
(1) $\theta_0$ is the true value of the parameter $\theta$;
(2) $f$ is such that $f(x;\theta_1) = f(x;\theta_2)$ for all $x \implies \theta_1 = \theta_2$.

We will show (in outline) that if $\hat\theta_n$ is the MLE (maximum likelihood estimator) based on $X_1, X_2, \ldots, X_n$, then:
(1) $\hat\theta_n \to \theta_0$ in probability, i.e., for each $\epsilon > 0$,
$$\lim_{n\to\infty} P(|\hat\theta_n - \theta_0| > \epsilon) = 0;$$
(2)
$$\sqrt{n\,i(\theta_0)}\,(\hat\theta_n - \theta_0) \xrightarrow{D} N(0, 1) \quad\text{as } n \to \infty$$
(asymptotic normality).

Remark

The practical use of asymptotic normality is that for large $n$,
$$\hat\theta \mathrel{\dot\sim} N\left(\theta_0,\ \frac{1}{n\,i(\theta_0)}\right).$$

Outline of consistency and asymptotic normality for MLEs:

Consider i.i.d. data $X_1, X_2, \ldots$ with common PDF/probability function $f(x;\theta)$. If $\hat\theta_n$ is the MLE based on $X_1, X_2, \ldots, X_n$, we will show (in outline) that $\hat\theta_n \to \theta_0$ in probability as $n \to \infty$, where $\theta_0$ is the true value of $\theta$.

Lemma 2.3.1
Suppose $f$ is such that $f(x;\theta_1) = f(x;\theta_2)$ for all $x \implies \theta_1 = \theta_2$. Then $l^*(\theta) = E\{\log f(X;\theta)\}$ is maximized uniquely by $\theta = \theta_0$.

Proof.
$$l^*(\theta) - l^*(\theta_0)
= \int_{-\infty}^{\infty} (\log f(x;\theta))\,f(x;\theta_0)\,dx - \int_{-\infty}^{\infty} (\log f(x;\theta_0))\,f(x;\theta_0)\,dx$$
$$= \int_{-\infty}^{\infty} \log\left(\frac{f(x;\theta)}{f(x;\theta_0)}\right) f(x;\theta_0)\,dx$$
$$\le \int_{-\infty}^{\infty} \left(\frac{f(x;\theta)}{f(x;\theta_0)} - 1\right) f(x;\theta_0)\,dx \qquad (\text{since } \log t \le t - 1)$$
$$= \int_{-\infty}^{\infty} (f(x;\theta) - f(x;\theta_0))\,dx
= \int_{-\infty}^{\infty} f(x;\theta)\,dx - \int_{-\infty}^{\infty} f(x;\theta_0)\,dx = 0.$$


Moreover, equality is achieved iff
$$\frac{f(x;\theta)}{f(x;\theta_0)} = 1 \text{ for all } x
\implies f(x;\theta) = f(x;\theta_0) \text{ for all } x
\implies \theta = \theta_0.$$

Lemma 2.3.2
Let $\bar l_n(\theta; x) = \frac{1}{n} l(\theta; x_1, \ldots, x_n)$, i.e. $\frac{1}{n} \times$ the log-likelihood based on $x_1, x_2, \ldots, x_n$. Then for each $\theta$, $\bar l_n(\theta; x) \to l^*(\theta)$ in probability as $n \to \infty$.

Proof. Observe that
$$\bar l_n(\theta; x) = \frac{1}{n}\log\prod_{i=1}^n f(x_i;\theta)
= \frac{1}{n}\sum_{i=1}^n \log f(x_i;\theta)
= \frac{1}{n}\sum_{i=1}^n L_i(\theta),$$
where $L_i(\theta) = \log f(x_i;\theta)$. Since the $L_i(\theta)$ are i.i.d. with $E(L_i(\theta)) = l^*(\theta)$, it follows by the weak law of large numbers that $\bar l_n(\theta; x) \to l^*(\theta)$ in probability as $n \to \infty$.

To summarise, we have proved that:
(1) $\bar l_n(\theta; x) \to l^*(\theta)$;
(2) $l^*(\theta)$ is maximized when $\theta = \theta_0$.
Since $\hat\theta_n$ maximizes $\bar l_n(\theta; x)$, it follows that $\hat\theta_n \to \theta_0$ in probability.

Theorem 2.3.1
Under the above assumptions, let $U_n(\theta; x)$ denote the score based on $X_1, \ldots, X_n$. Then
$$\frac{U_n(\theta_0; x)}{\sqrt{n\,i(\theta_0)}} \xrightarrow{D} N(0, 1).$$

Proof.
$$U_n(\theta_0; x) = \frac{\partial}{\partial\theta}\log\prod_{i=1}^n f(x_i;\theta)\Big|_{\theta=\theta_0}
= \sum_{i=1}^n \frac{\partial}{\partial\theta}\log f(x_i;\theta)\Big|_{\theta=\theta_0}
= \sum_{i=1}^n U_i,$$
where $U_i = \frac{\partial}{\partial\theta}\log f(x_i;\theta)|_{\theta=\theta_0}$. Since $U_1, U_2, \ldots$ are i.i.d. with $E(U_i) = 0$ and $\mathrm{Var}(U_i) = i(\theta_0)$, by the Central Limit Theorem,
$$\frac{\sum_{i=1}^n U_i - nE(U)}{\sqrt{n\,\mathrm{Var}(U)}} = \frac{U_n(\theta_0; x)}{\sqrt{n\,i(\theta_0)}} \xrightarrow{D} N(0, 1).$$

Theorem 2.3.2
Under the above assumptions,
$$\sqrt{n\,i(\theta_0)}\,(\hat\theta_n - \theta_0) \xrightarrow{D} N(0, 1).$$

Proof. Consider the first-order Taylor expansion of $U_n(\hat\theta_n; x)$ about $\theta_0$:
$$U_n(\hat\theta_n; x) \approx U_n(\theta_0; x) + U_n'(\theta_0; x)(\hat\theta_n - \theta_0).$$
Since $U_n(\hat\theta_n; x) = 0$, for large $n$
$$U_n(\theta_0; x) \approx -U_n'(\theta_0; x)(\hat\theta_n - \theta_0),$$
and since $\dfrac{U_n(\theta_0; x)}{\sqrt{n\,i(\theta_0)}} \xrightarrow{D} N(0,1)$, it follows that
$$\frac{-U_n'(\theta_0; x)(\hat\theta_n - \theta_0)}{\sqrt{n\,i(\theta_0)}} \xrightarrow{D} N(0, 1).$$
Now observe that
$$\frac{-U_n'(\theta_0; x)}{n} \to i(\theta_0) \quad\text{as } n \to \infty$$
by the weak law of large numbers, since
$$\frac{-U_n'(\theta_0; x)}{n} = \frac{1}{n}\sum_{i=1}^n \left(-\frac{\partial^2}{\partial\theta^2}\log f(x_i;\theta)\Big|_{\theta=\theta_0}\right)$$
$$\implies \frac{-U_n'(\theta_0; x)}{n\,i(\theta_0)} \to 1 \text{ in probability.}$$
Hence,
$$\frac{-U_n'(\theta_0; x)}{\sqrt{n\,i(\theta_0)}}\,(\hat\theta_n - \theta_0) \approx \sqrt{n\,i(\theta_0)}\,(\hat\theta_n - \theta_0) \quad\text{for large } n$$
$$\implies \sqrt{n\,i(\theta_0)}\,(\hat\theta_n - \theta_0) \xrightarrow{D} N(0, 1).$$
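The asymptotic normality of the MLE can be seen by simulation. The sketch below (Python, with an arbitrarily chosen $\lambda_0$ and a fixed seed) uses Exp($\lambda_0$) data, for which $\hat\lambda = 1/\bar x$ and $i(\lambda) = 1/\lambda^2$; the standardized draws should look approximately $N(0,1)$:

```python
import math
import random
import statistics

def standardized_mle_draws(lam0=2.0, n=200, reps=3000, seed=7):
    """Draws of sqrt(n * i(lam0)) * (lam_hat - lam0) for i.i.d. Exp(lam0) samples,
    where lam_hat = 1 / x-bar and i(lam) = 1 / lam^2."""
    rng = random.Random(seed)
    zs = []
    for _ in range(reps):
        xbar = sum(rng.expovariate(lam0) for _ in range(n)) / n
        lam_hat = 1 / xbar
        zs.append(math.sqrt(n / lam0**2) * (lam_hat - lam0))
    return zs

zs = standardized_mle_draws()
print(statistics.mean(zs), statistics.stdev(zs))  # close to 0 and 1
```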

Remark
The preceding theory can be generalized to vector-valued parameters. We will not discuss the details.

2.4 Hypothesis Tests and Confidence Intervals

Motivating Example:
Suppose $X_1, X_2, \ldots, X_n$ are i.i.d. $N(\mu, \sigma^2)$, and consider $H_0: \mu = \mu_0$ vs. $H_A: \mu \ne \mu_0$. If $\sigma^2$ is known, then the test of $H_0$ with significance level $\alpha$ is defined by the test statistic
$$Z = \frac{\bar X - \mu_0}{\sigma/\sqrt{n}},$$
and the rule: reject $H_0$ if $|z| \ge z(\alpha/2)$.

A $100(1-\alpha)\%$ CI for $\mu$ is given by
$$\left(\bar X - z(\alpha/2)\frac{\sigma}{\sqrt{n}},\ \bar X + z(\alpha/2)\frac{\sigma}{\sqrt{n}}\right).$$
It is easy to check that the confidence interval contains all values of $\mu_0$ that are acceptable null hypotheses.
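That check can be made explicit in a few lines. The sketch below (Python, with arbitrary illustrative values $\bar x = 5.2$, $\sigma = 2$, $n = 25$) confirms that the $z$-test accepts $H_0: \mu = \mu_0$ exactly when $\mu_0$ lies inside the CI:

```python
import math
from statistics import NormalDist

def z_ci(xbar, sigma, n, alpha=0.05):
    """100(1 - alpha)% CI for mu when sigma is known."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    half = z * sigma / math.sqrt(n)
    return xbar - half, xbar + half

def z_test_accepts(xbar, sigma, n, mu0, alpha=0.05):
    """True iff H0: mu = mu0 is NOT rejected at level alpha."""
    z = abs(xbar - mu0) / (sigma / math.sqrt(n))
    return z < NormalDist().inv_cdf(1 - alpha / 2)

lo, hi = z_ci(xbar=5.2, sigma=2.0, n=25)
print(lo, hi)
for mu0 in [4.0, 4.5, 5.0, 5.5, 6.0]:
    # accepted exactly when mu0 is inside the interval
    assert z_test_accepts(5.2, 2.0, 25, mu0) == (lo < mu0 < hi)
```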

In general, consider a statistical problem with parameter
$$\theta \in \Theta_0 \cup \Theta_A, \quad\text{where } \Theta_0 \cap \Theta_A = \emptyset.$$
We consider the null hypothesis
$$H_0: \theta \in \Theta_0$$
and the alternative hypothesis
$$H_A: \theta \in \Theta_A.$$

The hypothesis testing setup can be represented as:

                            Actual status
    Test result        H_0 true             H_A true
    Accept H_0         correct (✓)          type II error (β)
    Reject H_0         type I error (α)     correct (✓)

We would like both the type I and type II error rates to be as small as possible. However, these goals conflict: to reduce the type I error rate we need to "make it harder to reject $H_0$", while to reduce the type II error rate we need to "make it easier to reject $H_0$".

The standard (Neyman-Pearson) approach to hypothesis testing is to control the type I error rate at a "small" value $\alpha$ and then use a test that makes the type II error rate as small as possible.

The equivalence between confidence intervals and hypothesis tests can be formulated as follows. Recall that a $100(1-\alpha)\%$ CI for $\theta$ is a random interval $(L, U)$ with the property

\[
P\big((L, U) \ni \theta\big) = 1 - \alpha.
\]

It is easy to check that the test defined by the rule "accept $H_0: \theta = \theta_0$ iff $\theta_0 \in (L, U)$" has significance level $\alpha$. Here

\[
\alpha = P(\text{reject } H_0 \mid H_0 \text{ true}), \qquad
\beta = P(\text{retain } H_0 \mid H_A \text{ true}),
\]

and $1 - \beta = \text{power} = P(\text{reject } H_0 \mid H_A \text{ true})$, which is what we want to be large.

Conversely, given a hypothesis test of $H_0: \theta = \theta_0$ with significance level $\alpha$, it can be proved that the set $\{\theta_0 : H_0: \theta = \theta_0 \text{ is accepted}\}$ is a $100(1-\alpha)\%$ confidence region for $\theta$.

Large Sample Tests and Confidence Intervals

Consider a statistical problem with data $x_1, \ldots, x_n$, log-likelihood $l(\theta; x)$, score $U(\theta; x)$ and information $i(\theta)$. Consider also a hypothesis $H_0: \theta = \theta_0$. The following three tests are often considered:

(1) The Wald Test:

Test statistic: $W = \sqrt{n\,i(\hat\theta)}\,(\hat\theta - \theta_0)$.
Critical region: reject for $|W| \geq z(\alpha/2)$.

(2) The Score Test:

Test statistic: $V = \dfrac{U(\theta_0; x)}{\sqrt{n\,i(\theta_0)}}$.
Critical region: reject for $|V| \geq z(\alpha/2)$.

(3) The Likelihood-Ratio Test:

Test statistic: $G^2 = 2\big(l(\hat\theta) - l(\theta_0)\big)$.
Critical region: reject for $G^2 \geq \chi^2_{1,\alpha}$.

Example:

Suppose $X_1, X_2, \ldots, X_n$ are i.i.d. $\mathrm{Po}(\lambda)$, and consider $H_0: \lambda = \lambda_0$.

\[
l(\lambda; x) = \log \prod_{i=1}^{n} \frac{e^{-\lambda}\lambda^{x_i}}{x_i!}
= n(\bar x \log \lambda - \lambda) - \log \prod_{i=1}^{n} x_i!
\]

\[
U(\lambda; x) = \frac{\partial l}{\partial \lambda}
= n\left(\frac{\bar x}{\lambda} - 1\right)
= \frac{n(\bar x - \lambda)}{\lambda}
\quad\Longrightarrow\quad \hat\lambda = \bar x
\]

\[
I(\lambda) = n\,i(\lambda) = E\left(-\frac{\partial^2 l}{\partial \lambda^2}\right)
= E\left(\frac{n\bar X}{\lambda^2}\right) = \frac{n}{\lambda}
\]

\[
\Longrightarrow\quad W = \sqrt{n\,i(\hat\lambda)}\,(\hat\lambda - \lambda_0)
= \frac{\hat\lambda - \lambda_0}{\sqrt{\hat\lambda/n}}
\]

\[
V = \frac{U(\lambda_0; x)}{\sqrt{n\,i(\lambda_0)}}
= \frac{n(\bar x - \lambda_0)/\lambda_0}{\sqrt{n/\lambda_0}}
= \frac{\bar x - \lambda_0}{\sqrt{\lambda_0/n}}
\]

\[
G^2 = 2\big(l(\hat\lambda) - l(\lambda_0)\big)
= 2n(\bar x \log \hat\lambda - \hat\lambda) - 2n(\bar x \log \lambda_0 - \lambda_0)
= 2n\left(\bar x \log \frac{\hat\lambda}{\lambda_0} - (\hat\lambda - \lambda_0)\right).
\]
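For a concrete sample summary the three statistics can be compared directly. A minimal sketch, not from the notes (the values $\bar x = 4.2$, $n = 100$, $\lambda_0 = 4$ are invented for illustration):

```python
import math

def poisson_test_stats(xbar, n, lam0):
    """Wald, score and LR statistics for H0: lambda = lambda_0, Poisson data."""
    W = (xbar - lam0) / math.sqrt(xbar / n)      # uses i(lambda_hat) = 1/xbar
    V = (xbar - lam0) / math.sqrt(lam0 / n)      # uses i(lambda_0)   = 1/lam0
    G2 = 2 * n * (xbar * math.log(xbar / lam0) - (xbar - lam0))
    return W, V, G2

W, V, G2 = poisson_test_stats(xbar=4.2, n=100, lam0=4.0)
# For data this close to H0, all three tests reach the same conclusion at
# the 5% level: |W|, |V| < 1.96 and G2 < 1.96**2
```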

Remarks

(1) It can be proved that the tests based on $W$, $V$ and $G^2$ are asymptotically equivalent when $H_0$ is true.

(2) As a by-product, it follows that the null distribution of $G^2$ is $\chi^2_1$. (Recall from the theorems of Section 2.3 that the null distributions of $W$ and $V$ are both $N(0,1)$.)

(3) To understand the motivation for the three tests, it is useful to consider their relation to the log-likelihood function. See Figure 20.

We have introduced the Wald test, the score test and the likelihood-ratio test for $H_0: \theta = \theta_0$ vs. $H_A: \theta = \theta_a$. These are large-sample tests in that asymptotic distributions for the test statistics under $H_0$ are available. It can also be proved that the LR statistic and the score statistic are invariant under transformations of the parameter, but the Wald statistic is not.

Each of these three tests can be inverted to give a confidence interval (region) for $\theta$:

Figure 20: Relationships of the tests to the log-likelihood function

Wald Test

Solve for $\theta_0$ in $W^2 \leq z(\alpha/2)^2$. Recall $W = \sqrt{n\,i(\hat\theta)}\,(\hat\theta - \theta_0)$, so

\[
\hat\theta - \frac{z(\alpha/2)}{\sqrt{n\,i(\hat\theta)}}
\;\leq\; \theta_0 \;\leq\;
\hat\theta + \frac{z(\alpha/2)}{\sqrt{n\,i(\hat\theta)}},
\qquad\text{i.e., } \hat\theta \pm \frac{z(\alpha/2)}{\sqrt{n\,i(\hat\theta)}}.
\]

Score Test

Need to solve for $\theta_0$ in

\[
V^2 = \left(\frac{U(\theta_0; x)}{\sqrt{n\,i(\theta_0)}}\right)^2 \leq z(\alpha/2)^2.
\]

LR Test

Solve for $\theta_0$ in

\[
2\big(l(\hat\theta; x) - l(\theta_0; x)\big) \leq \chi^2_{1,\alpha} = z(\alpha/2)^2.
\]

Example:

$X_1, \ldots, X_n$ i.i.d. $\mathrm{Po}(\lambda)$.

Recall that the Wald statistic is

\[
W = \frac{\hat\lambda - \lambda_0}{\sqrt{\hat\lambda/n}}
\quad\Longrightarrow\quad
\hat\lambda \pm z(\alpha/2)\sqrt{\hat\lambda/n}
\;\Longleftrightarrow\;
\bar x \pm z(\alpha/2)\sqrt{\bar x/n}.
\]
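As a quick numerical check (not from the notes; $\bar x = 4.2$, $n = 100$ are invented and $z(\alpha/2) \approx 1.96$ for $\alpha = 0.05$), the Wald interval is just $\bar x \pm z(\alpha/2)\sqrt{\bar x/n}$:

```python
import math

xbar, n, zcrit = 4.2, 100, 1.959964   # invented sample summary; z(0.025) ~ 1.96
half = zcrit * math.sqrt(xbar / n)
wald_ci = (xbar - half, xbar + half)
# an interval of roughly (3.80, 4.60), symmetric about xbar
```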

Score test:

Recall that the test statistic is $V = \dfrac{\bar x - \lambda_0}{\sqrt{\lambda_0/n}}$. Hence, to find a confidence interval we need to solve for $\lambda_0$ in the equation $V^2 \leq z(\alpha/2)^2$:

\[
\frac{(\bar x - \lambda_0)^2}{\lambda_0/n} \leq z(\alpha/2)^2
\;\Longrightarrow\;
(\bar x - \lambda_0)^2 \leq \frac{\lambda_0\,z(\alpha/2)^2}{n}
\;\Longrightarrow\;
\lambda_0^2 - \left(2\bar x + \frac{z(\alpha/2)^2}{n}\right)\lambda_0 + \bar x^2 \leq 0,
\]

which is a quadratic in $\lambda_0$ and hence can be solved: the endpoints are

\[
\frac{-b \pm \sqrt{b^2 - 4ac}}{2a}.
\]
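Solving the quadratic explicitly gives the score interval. A minimal sketch, not from the notes (invented $\bar x = 4.2$, $n = 100$); the endpoints satisfy $(\bar x - \lambda_0)^2 = \lambda_0\,z(\alpha/2)^2/n$ exactly:

```python
import math

def poisson_score_ci(xbar, n, zcrit=1.959964):
    # Coefficients of lambda0^2 - (2*xbar + z^2/n)*lambda0 + xbar^2 <= 0
    a = 1.0
    b = -(2 * xbar + zcrit ** 2 / n)
    c = xbar ** 2
    disc = math.sqrt(b * b - 4 * a * c)
    return ((-b - disc) / (2 * a), (-b + disc) / (2 * a))

lo, hi = poisson_score_ci(4.2, 100)
# Both endpoints solve (xbar - lam0)^2 = lam0 * z^2 / n
```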

For the LR test, we have

\[
G^2 = 2n\left\{\bar x \log\frac{\bar x}{\lambda_0} - (\bar x - \lambda_0)\right\},
\]

and we can solve numerically for $\lambda_0$ in $G^2 \leq z(\alpha/2)^2$. See Figure 21.
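Since $G^2(\lambda_0)$ is zero at $\lambda_0 = \bar x$ and increases as $\lambda_0$ moves away on either side, each endpoint can be found by bisection. A sketch, not from the notes (invented $\bar x = 4.2$, $n = 100$; $\chi^2_{1,0.05} = z(0.025)^2 \approx 1.96^2$):

```python
import math

def g2(lam0, xbar, n):
    # LR statistic for H0: lambda = lam0 with Poisson data
    return 2 * n * (xbar * math.log(xbar / lam0) - (xbar - lam0))

def bisect(f, a, b, tol=1e-10):
    # Assumes f changes sign on [a, b]
    while b - a > tol:
        m = (a + b) / 2
        if f(a) * f(m) <= 0:
            b = m
        else:
            a = m
    return (a + b) / 2

xbar, n, crit = 4.2, 100, 1.959964 ** 2
h = lambda lam0: g2(lam0, xbar, n) - crit
# bracket one root below xbar and one above
lo = bisect(h, xbar / 2, xbar)
hi = bisect(h, xbar, 2 * xbar)
```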

Optimal Tests

Consider simple null and alternative hypotheses:

\[
H_0: \theta = \theta_0 \quad\text{and}\quad H_a: \theta = \theta_a.
\]

Recall that the type I error rate is defined by

\[
\alpha = P(\text{reject } H_0 \mid H_0 \text{ true}),
\]

and the type II error rate by

\[
\beta = P(\text{accept } H_0 \mid H_0 \text{ false}).
\]

The power is defined by $1 - \beta = P(\text{reject } H_0 \mid H_0 \text{ false})$, so we want high power.
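For the one-sided z-test treated in the example below, the power has a closed form: under $\mu = \mu_a$, $Z \sim N(\sqrt n(\mu_a - \mu_0)/\sigma,\, 1)$, so $1-\beta = 1 - \Phi\big(z(\alpha) - \sqrt n(\mu_a-\mu_0)/\sigma\big)$. A minimal sketch, not from the notes (the values of $\mu_0$, $\mu_a$, $\sigma$, $n$ are invented):

```python
from math import sqrt
from statistics import NormalDist

def z_test_power(mu0, mua, sigma, n, alpha=0.05):
    """Power of the one-sided z-test of H0: mu = mu0 vs Ha: mu = mua > mu0."""
    nd = NormalDist()
    delta = sqrt(n) * (mua - mu0) / sigma   # shift of Z under the alternative
    return 1 - nd.cdf(nd.inv_cdf(1 - alpha) - delta)

p = z_test_power(mu0=0.0, mua=0.5, sigma=1.0, n=25, alpha=0.05)
# At mua = mu0 the "power" reduces to the significance level alpha,
# and it grows with the sample size n
```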


Figure 21: The three tests

Theorem 2.4.1 (Neyman-Pearson Lemma)

Consider the test of $H_0: \theta = \theta_0$ vs. $H_a: \theta = \theta_a$ defined by the rule

\[
\text{reject } H_0 \text{ if } \frac{f(x; \theta_0)}{f(x; \theta_a)} \leq k,
\]

for some constant $k$. Let $\alpha^*$ be the type I error rate and $1 - \beta^*$ the power for this test. Then any other test with $\alpha \leq \alpha^*$ will have $(1 - \beta) \leq (1 - \beta^*)$.

Proof. Let $C$ be the critical region for the LR test and let $D$ be the critical (rejection) region for any other test.

Let $C_1 = C \cap D$ and let $C_2$, $D_2$ be such that

\[
C = C_1 \cup C_2, \quad C_1 \cap C_2 = \emptyset;
\qquad
D = C_1 \cup D_2, \quad C_1 \cap D_2 = \emptyset.
\]

Figure 22: Neyman-Pearson Lemma

See Figure 22 and observe that $\alpha \leq \alpha^*$ implies

\[
\int_D f(x; \theta_0)\,dx_1 \cdots dx_n \;\leq\; \int_C f(x; \theta_0)\,dx_1 \cdots dx_n,
\]

and hence, cancelling the common region $C_1$,

\[
\int_{D_2} f(x; \theta_0)\,dx_1 \cdots dx_n \;\leq\; \int_{C_2} f(x; \theta_0)\,dx_1 \cdots dx_n.
\]

Now

\[
(1 - \beta^*) - (1 - \beta)
= \int_C f(x; \theta_a)\,dx_1 \cdots dx_n - \int_D f(x; \theta_a)\,dx_1 \cdots dx_n
\]
\[
= \int_{C_1} f(x; \theta_a)\,dx_1 \cdots dx_n + \int_{C_2} f(x; \theta_a)\,dx_1 \cdots dx_n
- \int_{C_1} f(x; \theta_a)\,dx_1 \cdots dx_n - \int_{D_2} f(x; \theta_a)\,dx_1 \cdots dx_n
\]
\[
= \int_{C_2} f(x; \theta_a)\,dx_1 \cdots dx_n - \int_{D_2} f(x; \theta_a)\,dx_1 \cdots dx_n
\]
\[
\geq \frac{1}{k}\int_{C_2} f(x; \theta_0)\,dx_1 \cdots dx_n
- \frac{1}{k}\int_{D_2} f(x; \theta_0)\,dx_1 \cdots dx_n,
\]

since $f(x; \theta_a) \geq f(x; \theta_0)/k$ on $C_2 \subseteq C$, while $f(x; \theta_a) < f(x; \theta_0)/k$ on $D_2$, which lies outside $C$. By the inequality above, the last expression is $\geq 0$, as required. See Figure 23.

Moreover, equality is achieved only if $D_2$ is empty ($= \emptyset$).

Example

Suppose $X_1, X_2, \ldots, X_n$ are i.i.d. $N(\mu, \sigma^2)$, with $\sigma^2$ given, and consider

\[
H_0: \mu = \mu_0 \quad\text{vs.}\quad H_a: \mu = \mu_a, \quad \mu_a > \mu_0.
\]

Figure 23: Neyman-Pearson Lemma

Then

\[
\frac{f(x; \mu_0)}{f(x; \mu_a)}
= \frac{\prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left\{-\frac{1}{2\sigma^2}(x_i - \mu_0)^2\right\}}
       {\prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left\{-\frac{1}{2\sigma^2}(x_i - \mu_a)^2\right\}}
= \frac{\exp\left\{-\frac{1}{2\sigma^2}\left(\sum_{i=1}^{n} x_i^2 - 2n\bar x\mu_0 + n\mu_0^2\right)\right\}}
       {\exp\left\{-\frac{1}{2\sigma^2}\left(\sum_{i=1}^{n} x_i^2 - 2n\bar x\mu_a + n\mu_a^2\right)\right\}}
= \exp\left\{\frac{1}{2\sigma^2}\big(2n\bar x(\mu_0 - \mu_a) - n\mu_0^2 + n\mu_a^2\big)\right\}.
\]

For a constant $k$,

\[
\frac{f(x; \mu_0)}{f(x; \mu_a)} \leq k
\;\Longleftrightarrow\;
(\mu_0 - \mu_a)\bar x \leq k^*
\;\Longleftrightarrow\;
\bar x \geq c,
\]

for a suitably chosen $c$ (since $\mu_0 - \mu_a < 0$, we reject when $\bar x$ is too big).

To choose $c$, we use the fact that

\[
Z = \frac{\bar X - \mu_0}{\sigma/\sqrt n} \sim N(0,1) \quad\text{under } H_0.
\]

Hence, the usual z-test,

\[
\text{reject } H_0 \text{ if } z \geq z(\alpha),
\]

is the Neyman-Pearson LR test in this case.
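The algebra above can be verified numerically: the closed form for the log likelihood ratio matches a direct evaluation of the two log-likelihoods, and it is decreasing in $\bar x$, which is why the rejection region takes the form $\bar x \geq c$. A sketch with invented data and parameter values (not from the notes):

```python
import math

def log_lik(xs, mu, sigma):
    # Normal log-likelihood evaluated term by term
    return sum(-0.5 * math.log(2 * math.pi * sigma ** 2)
               - (x - mu) ** 2 / (2 * sigma ** 2) for x in xs)

def log_lr_closed_form(xs, mu0, mua, sigma):
    # log of f(x; mu0)/f(x; mua) from the derivation above
    n = len(xs)
    xbar = sum(xs) / n
    return (2 * n * xbar * (mu0 - mua) - n * mu0 ** 2 + n * mua ** 2) / (2 * sigma ** 2)

xs = [1.2, 0.4, 2.1, 1.8, 0.9]          # invented data
mu0, mua, sigma = 1.0, 2.0, 1.0
direct = log_lik(xs, mu0, sigma) - log_lik(xs, mua, sigma)
closed = log_lr_closed_form(xs, mu0, mua, sigma)
# direct and closed agree; shifting the data up lowers the ratio
```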

Remarks

(1) This result shows that the one-sided z-test is also uniformly most powerful for $H_0: \mu = \mu_0$ vs. $H_A: \mu > \mu_0$.

(2) This can be extended to the case of $H_0: \mu \leq \mu_0$ vs. $H_A: \mu > \mu_0$. In this case we take

\[
\alpha = \max_{\mu \leq \mu_0} P(\text{reject } H_0 \mid \mu).
\]

(3) This construction fails when we consider two-sided alternatives, i.e., $H_0: \mu = \mu_0$ vs. $H_A: \mu \neq \mu_0$; no uniformly most powerful test exists for that case.
