PDF of Lecture Notes - School of Mathematical Sciences

MATHEMATICAL STATISTICS III

Lecturer: Associate Professor Patty Solomon

SCHOOL OF MATHEMATICAL SCIENCES


Contents

1 Distribution Theory
  1.1 Discrete distributions
    1.1.1 Bernoulli distribution
    1.1.2 Binomial distribution
    1.1.3 Geometric distribution
    1.1.4 Negative Binomial distribution
    1.1.5 Poisson distribution
    1.1.6 Hypergeometric distribution
  1.2 Continuous Distributions
    1.2.1 Uniform Distribution
    1.2.2 Exponential Distribution
    1.2.3 Gamma distribution
    1.2.4 Beta density function
    1.2.5 Normal distribution
    1.2.6 Cauchy distribution
  1.3 Transformations of a single RV
  1.4 CDF transformation
  1.5 Non-monotonic transformations
  1.6 Moments of transformed RVs
  1.7 Multivariate distributions
    1.7.1 Trinomial distribution
    1.7.2 Multinomial distribution
    1.7.3 Marginal and conditional distributions
    1.7.4 Continuous multivariate distributions
  1.8 Transformations of several RVs
    1.8.1 Method of regular transformations
  1.9 Moments
    1.9.1 Moment generating functions
    1.9.2 Marginal distributions and the MGF
    1.9.3 Vector notation
    1.9.4 Properties of variance matrices
  1.10 The multivariate normal distribution
    1.10.1 The multivariate normal MGF
    1.10.2 Independence and normality
  1.11 Limit Theorems
    1.11.1 Convergence of random variables
2 Statistical Inference
  2.1 Basic definitions and terminology
  2.2 Minimum Variance Unbiased Estimation
  2.3 Methods Of Estimation
    2.3.1 Method Of Moments
    2.3.2 Maximum Likelihood Estimation
  2.4 Hypothesis Tests and Confidence Intervals


1 Distribution Theory

A discrete RV is described by its probability function

p(x) = P({X = x}).

A continuous RV is described by its probability density function f(x), for which

P({a ≤ X ≤ b}) = ∫_a^b f(x) dx, for all a ≤ b.

In particular, P(X = a) = ∫_a^a f(x) dx = 0 for any continuous RV.

Some RVs are neither discrete nor continuous: for example, daily precipitation, which puts positive probability on the single value 0 (dry days) but is otherwise continuous.

The cumulative distribution function (CDF) is defined by

F(x) = P({X ≤ x})    (the probability, or area, to the left of x),

computed by F(x) = Σ_{t: t ≤ x} p(t) (X discrete) or F(x) = ∫_{−∞}^x f(t) dt (X continuous).

The expected value E(X) of a discrete RV is given by

E(X) = Σ_x x p(x),

provided that Σ_x |x| p(x) < ∞ (i.e., the sum must converge absolutely). Otherwise the expectation is not defined.

The expected value of a continuous RV is given by

E(X) = ∫_{−∞}^{∞} x f(x) dx,

provided that ∫_{−∞}^{∞} |x| f(x) dx < ∞; otherwise it is not defined.

(For example, the Cauchy distribution is momentless.)


More generally, for a function h,

E{h(X)} = Σ_x h(x) p(x) if Σ_x |h(x)| p(x) < ∞ (X discrete),

E{h(X)} = ∫_{−∞}^{∞} h(x) f(x) dx if ∫_{−∞}^{∞} |h(x)| f(x) dx < ∞ (X continuous).

The moment generating function (MGF) of a RV X is defined to be

M_X(t) = E[e^{tX}] = Σ_x e^{tx} p(x) (X discrete) or ∫_{−∞}^{∞} e^{tx} f(x) dx (X continuous).

M_X(0) = 1 always; the MGF may or may not be defined for other values of t.

If M_X(t) is defined for all t in some open interval containing 0, then:

1. Moments of all orders exist;

2. E[X^r] = M_X^{(r)}(0) (the rth-order derivative at 0);

3. M_X(t) uniquely determines the distribution of X.

In particular, M_X′(0) = E(X), M_X″(0) = E(X²), and so on.
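As an illustrative numerical check (added here, not part of the notes), the identity E[X^r] = M_X^{(r)}(0) can be verified with finite differences for the Bernoulli MGF M(t) = 1 + p(e^t − 1), for which E(X) = E(X²) = p:

```python
import math

# Illustrative check: finite differences at t = 0 on the Bernoulli MGF
# M(t) = 1 + p(e^t - 1) recover E(X) = M'(0) = p and E(X^2) = M''(0) = p.
p = 0.3

def M(t):
    return 1 + p * (math.exp(t) - 1)

h = 1e-4
m1 = (M(h) - M(-h)) / (2 * h)           # central difference ~ M'(0)
m2 = (M(h) - 2 * M(0) + M(-h)) / h**2   # second difference ~ M''(0)
print(m1, m2)  # both approximately p = 0.3
```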

1.1 Discrete distributions

1.1.1 Bernoulli distribution

Parameter: 0 ≤ p ≤ 1
Possible values: {0, 1}
Prob. function:

p(x) = p^x (1 − p)^{1−x}, x ∈ {0, 1}, i.e. p(1) = p and p(0) = 1 − p.


E(X) = p
Var(X) = p(1 − p)
M_X(t) = 1 + p(e^t − 1).

1.1.2 Binomial distribution

Parameters: 0 ≤ p ≤ 1; integer n > 0.

M_X(t) = {1 + p(e^t − 1)}^n.

Consider a sequence of n independent Bern(p) trials. If X = total number of successes, then X ∼ B(n, p), with probability function

p(x) = C(n, x) p^x (1 − p)^{n−x}, x = 0, 1, …, n.

Note: If Y₁, …, Yₙ are the underlying Bernoulli RVs then X = Y₁ + Y₂ + ⋯ + Yₙ (same as counting # successes).

This is a valid probability function since p(x) ≥ 0 for x = 0, 1, …, n, and Σ_{x=0}^n p(x) = 1.

1.1.3 Geometric distribution

Suppose a sequence of Bernoulli trials is performed and let X be the number of failures preceding the first success. Then X ∼ Geom(p).

1.1.4 Negative Binomial distribution

Suppose a sequence of independent Bernoulli trials is conducted. If X is the number of failures preceding the nth success, then X has a negative binomial distribution.

Prob. function:

p(x) = C(n + x − 1, n − 1) p^n (1 − p)^x, x = 0, 1, 2, …,

where the coefficient C(n + x − 1, n − 1) counts the ways of allocating the failures and successes.


E(X) = n(1 − p)/p,
Var(X) = n(1 − p)/p²,
M_X(t) = [p / (1 − e^t(1 − p))]^n.

1. If n = 1, we obtain the geometric distribution.

2. It can also be seen to arise as the sum of n independent geometric variables.


1.1.5 Poisson distribution

Parameter: λ > 0
Prob. function: p(x) = e^{−λ} λ^x / x!, x = 0, 1, 2, …
M_X(t) = e^{λ(e^t − 1)}

The Poisson distribution arises as the distribution for the number of "point events" observed from a Poisson process.

Figure 1: Poisson Example

Example: the number of incoming calls to a certain exchange in a given hour.

The Poisson distribution is the limiting form of the binomial distribution as

n → ∞, p → 0, np → λ.

The derivation of the Poisson distribution (via the binomial) is underpinned by a Poisson process, i.e., a point process on [0, ∞); see Figure 1.
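This limit can be checked numerically. The following sketch (an illustration added here, not part of the notes; the values λ = 2 and x = 3 are arbitrary choices) holds np = λ fixed and lets n grow:

```python
import math

# Illustrative sketch: with np = lam held fixed, B(n, p) probabilities
# approach the Poisson(lam) probability as n grows and p = lam/n shrinks.
lam, x = 2.0, 3

def binom_pmf(n, p, x):
    return math.comb(n, x) * p**x * (1 - p)**(n - x)

poisson = math.exp(-lam) * lam**x / math.factorial(x)
for n in (10, 100, 10_000):
    print(n, binom_pmf(n, lam / n, x), poisson)  # columns converge
```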

AXIOMS for a Poisson process of rate λ > 0 are:

(A) The numbers of occurrences in disjoint intervals are independent.

(B) The probability of 1 or more occurrences in any interval [t, t + h) is λh + o(h) as h → 0 (approximately, the probability equals the length of the interval × λ).

(C) The probability of more than one occurrence in [t, t + h) is o(h) as h → 0 (i.e. the probability is small, negligible).

Note: o(h), pronounced "small order h", is standard notation for any function r(h) with the property

lim_{h→0} r(h)/h = 0.

Figure 2: Small order h: functions h⁴ (yes) and h (no)

1.1.6 Hypergeometric distribution

Consider an urn containing M black and N white balls. Suppose n balls are sampled randomly without replacement and let X be the number of black balls chosen. Then X has a hypergeometric distribution.

Parameters: M, N > 0; 0 < n ≤ M + N
Possible values: max(0, n − N) ≤ x ≤ min(n, M)
Prob. function:


p(x) = C(M, x) C(N, n − x) / C(M + N, n),

E(X) = nM/(M + N),

Var(X) = [(M + N − n)/(M + N − 1)] · nMN/(M + N)².

The MGF exists, but there is no useful expression available.

1. The hypergeometric probability is simply

(# selections with x black balls) / (# possible selections) = C(M, x) C(N, n − x) / C(M + N, n).

2. To see how the limits arise, observe that we must have x ≤ n (no more black balls than the sample size) and x ≤ M (no more black balls than are in the urn), i.e., x ≤ min(n, M). Similarly, we must have x ≥ 0 (cannot have < 0 black balls in the sample) and n − x ≤ N (cannot have more white balls than are in the urn), i.e., x ≥ n − N; hence x ≥ max(0, n − N).

3. If we sample with replacement, we would get X ∼ B(n, p) with p = M/(M + N). It is interesting to compare moments:

hypergeometric: E(X) = np, Var(X) = [(M + N − n)/(M + N − 1)] · np(1 − p)
binomial: E(X) = np, Var(X) = np(1 − p)

The factor (M + N − n)/(M + N − 1) is the finite population correction: when we sample all the balls in the urn (n = M + N), Var(X) = 0.

4. When M, N ≫ n, the difference between sampling with and without replacement should be small.

Figure 3: p = 1/3
Figure 4: p = 1/2 (without replacement, after a white ball is drawn out)
Figure 5: p = 1/3 (with replacement)

Intuitively, this implies that for M, N ≫ n, the hypergeometric and binomial probabilities should be very similar, and this can be verified for fixed n, x:

lim_{M,N→∞, M/(M+N)→p} C(M, x) C(N, n − x) / C(M + N, n) = C(n, x) p^x (1 − p)^{n−x}.

1.2 Continuous Distributions

1.2.1 Uniform Distribution

X ∼ U(a, b) has PDF f(x) = 1/(b − a) for a < x < b (and 0 otherwise). CDF, for a < x < b:

F(x) = ∫_{−∞}^x f(t) dt = ∫_a^x 1/(b − a) dt = (x − a)/(b − a),


that is,

F(x) = 0 for x ≤ a; F(x) = (x − a)/(b − a) for a < x < b; F(x) = 1 for b ≤ x.

Figure 6: Uniform distribution CDF

M_X(t) = (e^{tb} − e^{ta}) / (t(b − a))

A special case is the U(0, 1) distribution:

f(x) = 1 for 0 < x < 1 (0 otherwise), F(x) = x for 0 < x < 1,
E(X) = 1/2, Var(X) = 1/12, M(t) = (e^t − 1)/t.

1.2.2 Exponential Distribution

Parameter: λ > 0. PDF: f(x) = λe^{−λx}, x > 0. CDF:

F(x) = 1 − e^{−λx}, x ≥ 0,

M_X(t) = λ/(λ − t), t < λ.


This is the distribution for the waiting time until the first occurrence in a Poisson process with rate parameter λ.

1. If X ∼ Exp(λ) then P(X ≥ t + x | X ≥ t) = P(X ≥ x) (memoryless property).

2. It can be obtained as a limiting form of the geometric distribution.
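The memoryless property can be checked by simulation. The sketch below (an illustration added here, not part of the notes; λ, t and x are arbitrary choices) estimates the conditional probability by Monte Carlo:

```python
import math
import random

# Monte Carlo check of the memoryless property:
# P(X >= t + x | X >= t) = P(X >= x) = e^{-lambda x} for X ~ Exp(lambda).
random.seed(0)
lam, t, x = 2.0, 0.5, 0.4
draws = [random.expovariate(lam) for _ in range(200_000)]

survived_t = [d for d in draws if d >= t]
lhs = sum(d >= t + x for d in survived_t) / len(survived_t)
rhs = math.exp(-lam * x)
print(lhs, rhs)  # the two should agree to roughly 2 decimal places
```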

1.2.3 Gamma distribution

Gamma function: Γ(α) = ∫_0^∞ t^{α−1} e^{−t} dt

MGF: M_X(t) = (λ/(λ − t))^α, t < λ.

Suppose Y₁, …, Y_K are independent Exp(λ) RVs and let X = Y₁ + ⋯ + Y_K. Then X ∼ Gamma(K, λ), for K integer. In general, X ∼ Gamma(α, λ), α > 0.

1. α is the shape parameter, λ is the scale parameter.

Figure 7: Gamma Distribution

Note: if Y ∼ Gamma(α, 1) and X = Y/λ, then X ∼ Gamma(α, λ). That is, λ is a scale parameter.

2. The Gamma(p/2, 1/2) distribution is also called the χ²_p (chi-square with p df) distribution, for p integer; χ²₂ is the exponential distribution (2 df).


3. The Gamma(K, λ) distribution can be interpreted as the waiting time until the Kth occurrence in a Poisson process.

1.2.4 Beta density function

Suppose Y₁ ∼ Gamma(α, λ) and Y₂ ∼ Gamma(β, λ) independently. Then

X = Y₁/(Y₁ + Y₂) ∼ Beta(α, β), 0 ≤ x ≤ 1.

Figure 8: Beta Distribution

1.2.5 Normal distribution

X ∼ N(µ, σ²); M_X(t) = e^{tµ + t²σ²/2}.

1.2.6 Cauchy distribution

Possible values: x ∈ ℝ

PDF: f(x) = (1/π) · 1/(1 + x²)    (location parameter θ = 0)


CDF:

F(x) = 1/2 + (1/π) arctan x

E(X), Var(X), M_X(t) do not exist: the Cauchy is a bell-shaped distribution symmetric about zero for which no moments are defined.

Figure 9: Cauchy Distribution (pointier than the normal distribution, with tails that go to zero much more slowly)

If Z₁ ∼ N(0, 1) and Z₂ ∼ N(0, 1) independently, then X = Z₁/Z₂ has a Cauchy distribution.

1.3 Transformations of a single RV

If X is a RV and Y = h(X), then Y is also a RV. If we know the distribution of X, then for a given function h(x) we should be able to find the distribution of Y = h(X).

Theorem 1.3.1 (Discrete case). Suppose X is a discrete RV with prob. function p_X(x) and let Y = h(X), where h is any function. Then

p_Y(y) = Σ_{x: h(x)=y} p_X(x)    (sum over all values x for which h(x) = y).

Proof.

p_Y(y) = P(Y = y) = P{h(X) = y} = Σ_{x: h(x)=y} P(X = x) = Σ_{x: h(x)=y} p_X(x).


Theorem 1.3.2. Suppose X is a continuous RV with PDF f_X(x) and let Y = h(X), where h(x) is differentiable and monotonic, i.e., either strictly increasing or strictly decreasing. Then the PDF of Y is given by

f_Y(y) = f_X{h⁻¹(y)} |(h⁻¹)′(y)|.

Proof. Assume h is increasing. Then

F_Y(y) = P(Y ≤ y) = P{h(X) ≤ y} = P{X ≤ h⁻¹(y)} = F_X{h⁻¹(y)}.

Figure 10: h increasing.

⇒ f_Y(y) = (d/dy) F_Y(y) = (d/dy) F_X{h⁻¹(y)} = f_X{h⁻¹(y)} (h⁻¹)′(y)    (1)

(using the Chain Rule). Now consider the case of h(x) decreasing:

F_Y(y) = P(Y ≤ y) = P{h(X) ≤ y} = P{X ≥ h⁻¹(y)} = 1 − F_X{h⁻¹(y)}

⇒ f_Y(y) = −f_X{h⁻¹(y)} (h⁻¹)′(y)    (2)

Figure 11: h decreasing.

Finally, observe that if h is increasing then h′(x), and hence (h⁻¹)′(y), must be positive. Similarly, for h decreasing, (h⁻¹)′(y) < 0. Hence (1) and (2) can be combined to give

f_Y(y) = f_X{h⁻¹(y)} |(h⁻¹)′(y)|.

Examples:

1. Discrete transformation of a single RV

X ∼ Po(λ) and Y is X rounded to the nearest multiple of 10. Possible values of Y are 0, 10, 20, …

P(Y = 0) = P(X = 0) + P(X = 1) + P(X = 2) + P(X = 3) + P(X = 4)
         = e^{−λ}(1 + λ + λ²/2! + λ³/3! + λ⁴/4!);

P(Y = 10) = P(X = 5) + P(X = 6) + ⋯ + P(X = 14), and so on.
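The sums above are easy to evaluate directly. This sketch (added here for illustration; λ = 3 is an arbitrary choice) computes the first two:

```python
import math

# Direct computation of the rounded-Poisson example: Y is X ~ Po(lam)
# rounded to the nearest multiple of 10, so P(Y=0) = P(X <= 4) and
# P(Y=10) = P(5 <= X <= 14).
lam = 3.0

def pois(k):
    return math.exp(-lam) * lam**k / math.factorial(k)

p0 = sum(pois(k) for k in range(0, 5))      # P(Y = 0)
p10 = sum(pois(k) for k in range(5, 15))    # P(Y = 10)
print(p0, p10)  # nearly all mass is on Y = 0 and Y = 10 when lam = 3
```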

2. Continuous transformation of a single RV

Let h(X) = aX + b, a ≠ 0. To find h⁻¹(y) we solve for x in the equation y = h(x), i.e.,

y = ax + b ⇒ x = (y − b)/a ⇒ h⁻¹(y) = (y − b)/a ⇒ (h⁻¹)′(y) = 1/a.

Hence f_Y(y) = (1/|a|) f_X((y − b)/a).

Specifically:

1. Suppose Z ∼ N(0, 1) and let X = µ + σZ, σ > 0. Recall that Z has PDF

φ(z) = (1/√(2π)) e^{−z²/2}.

Hence f_X(x) = (1/|σ|) φ((x − µ)/σ) = (1/(√(2π)σ)) e^{−(x−µ)²/(2σ²)}, which is the N(µ, σ²) PDF.

2. Suppose X ∼ N(µ, σ²) and let Z = (X − µ)/σ. Find the PDF of Z.


Solution. Observe that Z = h(X) = (1/σ)X − µ/σ, so

f_Z(z) = σ f_X(σ(z + µ/σ)) = σ f_X(µ + σz)
       = σ · (1/(√(2π)σ)) e^{−(µ+σz−µ)²/(2σ²)}
       = (1/√(2π)) e^{−σ²z²/(2σ²)}
       = (1/√(2π)) e^{−z²/2} = φ(z),

i.e., Z ∼ N(0, 1).

3. Suppose X ∼ Gamma(α, 1) and let Y = X/λ, λ > 0. Find the PDF of Y.

Solution. Since Y = (1/λ)X is a linear function, we have

f_Y(y) = λ f_X(λy) = λ · (1/Γ(α)) (λy)^{α−1} e^{−λy} = (λ^α/Γ(α)) y^{α−1} e^{−λy},

which is the Gamma(α, λ) PDF.

1.4 CDF transformation

Suppose X is a continuous RV with CDF F_X(x), which is increasing over the range of X. If U = F_X(X), then U ∼ U(0, 1).


Proof.

F_U(u) = P(U ≤ u) = P{F_X(X) ≤ u} = P{X ≤ F_X⁻¹(u)} = F_X{F_X⁻¹(u)} = u, for 0 < u < 1.

This is simply the CDF of U(0, 1), so the result is proved.

The converse also applies. If U ∼ U(0, 1) and F is any CDF that is strictly increasing on its range, then X = F⁻¹(U) has CDF F(x), i.e.,

F_X(x) = P(X ≤ x) = P{F⁻¹(U) ≤ x} = P(U ≤ F(x)) = F(x), as required.

1.5 Non-monotonic transformations

Theorem 1.3.2 applies to monotonic (either strictly increasing or strictly decreasing) transformations of a continuous RV. In general, if h(x) is not monotonic, then Y = h(X) may not even be continuous. For example, if h(x) = [x] (the integer part of x), then the possible values for Y = h(X) are the integers, so Y is discrete.

However, it can be shown that h(X) is continuous if X is continuous and h(x) is piecewise monotonic.


1.6 Moments of transformed RVs

Suppose X is a RV and let Y = h(X). If we want to find E(Y), we can proceed as follows:

1. Find the distribution of Y = h(X) using the preceding methods.

2. Find E(Y) = Σ_y y p(y) (Y discrete) or E(Y) = ∫_{−∞}^{∞} y f(y) dy (Y continuous). (Forget X ever existed!)

Theorem 1.6.1. If X is a RV of either discrete or continuous type and h(x) is any transformation (not necessarily monotonic), then E{h(X)} (provided it exists) is given by

E{h(X)} = Σ_x h(x) p(x) (X discrete) or ∫_{−∞}^{∞} h(x) f(x) dx (X continuous).

Proof. Not examinable.

Examples:

1. CDF transformation

Suppose U ∼ U(0, 1). How can we transform U to get an Exp(λ) RV?

Solution. Take X = F⁻¹(U), where F is the Exp(λ) CDF. Recall F(x) = 1 − e^{−λx} for the Exp(λ) distribution. To find F⁻¹(u), we solve for x in F(x) = u, i.e.,

u = 1 − e^{−λx} ⇒ 1 − u = e^{−λx} ⇒ ln(1 − u) = −λx ⇒ x = −ln(1 − u)/λ.

Hence if U ∼ U(0, 1), it follows that X = −ln(1 − U)/λ ∼ Exp(λ).

Note: Y = −(ln U)/λ ∼ Exp(λ) as well [both 1 − U and U have the U(0, 1) distribution].

This type of result is used to generate random numbers. That is, there are good methods for producing U(0, 1) (pseudo-random) numbers; to obtain Exp(λ) random numbers, we can just get U(0, 1) numbers and calculate X = −(log U)/λ.
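The inverse-CDF recipe above can be sketched in a few lines (an illustration added here, not part of the notes; λ = 2 and the sample size are arbitrary choices):

```python
import math
import random

# Inverse-CDF method: X = -ln(1 - U)/lambda with U ~ U(0,1) is Exp(lambda).
random.seed(1)
lam = 2.0
xs = [-math.log(1 - random.random()) / lam for _ in range(100_000)]

mean = sum(xs) / len(xs)
print(mean, 1 / lam)  # sample mean should be close to 1/lambda
```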

2. Non-monotonic transformations

Suppose Z ∼ N(0, 1) and let X = Z²; h(Z) = Z² is not monotonic, so Theorem 1.3.2 does not apply. However we can proceed as follows:

F_X(x) = P(X ≤ x) = P(Z² ≤ x) = P(−√x ≤ Z ≤ √x) = Φ(√x) − Φ(−√x),

where Φ(a) = ∫_{−∞}^a (1/√(2π)) e^{−t²/2} dt is the N(0, 1) CDF. That is,

f_X(x) = (d/dx) F_X(x) = φ(√x)(½ x^{−1/2}) − φ(−√x)(−½ x^{−1/2}),

where φ(a) = Φ′(a) = (1/√(2π)) e^{−a²/2} is the N(0, 1) PDF,

= ½ x^{−1/2} [(1/√(2π)) e^{−x/2} + (1/√(2π)) e^{−x/2}]

= ((1/2)^{1/2}/√π) x^{1/2−1} e^{−x/2},

which is the Gamma(1/2, 1/2) PDF.

On the other hand, the distribution of Z² is also called the χ²₁ distribution. We have proved that it is the same as the Gamma(1/2, 1/2) distribution.
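A simulation check (added here for illustration, not part of the notes): the Gamma(1/2, 1/2) = χ²₁ distribution has mean α/λ = 1 and variance α/λ² = 2, and simulated values of Z² should match:

```python
import random

# Simulation check: Z^2 with Z ~ N(0,1) has chi-square(1) = Gamma(1/2, 1/2)
# moments, namely mean 1 and variance 2.
random.seed(2)
zs2 = [random.gauss(0, 1) ** 2 for _ in range(200_000)]

mean = sum(zs2) / len(zs2)
var = sum((v - mean) ** 2 for v in zs2) / len(zs2)
print(mean, var)  # close to 1 and 2
```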

3. Moments of transformed RVs:

If U ∼ U(0, 1) and Y = −(log U)/λ, then Y ∼ Exp(λ) ⇒ f(y) = λe^{−λy}, y > 0. We can check:

E(Y) = ∫_0^∞ λy e^{−λy} dy = 1/λ.
4. Based on Theorem 1.6.1:

If U ∼ U(0, 1) and Y = −(log U)/λ, then according to Theorem 1.6.1,

E(Y) = ∫_0^1 (−log u/λ)(1) du
     = −(1/λ) ∫_0^1 log u du
     = −(1/λ) (u log u − u) |_0^1
     = −(1/λ) [u log u |_0^1 − u |_0^1]
     = −(1/λ)[0 − 1]
     = 1/λ, as required.

There are some important consequences of Theorem 1.6.1:


1. If E(X) = µ and Var(X) = σ², and Y = aX + b for constants a, b, then E(Y) = aµ + b and Var(Y) = a²σ².

Proof. (Continuous case)

E(Y) = E(aX + b) = ∫_{−∞}^{∞} (ax + b) f(x) dx = a ∫_{−∞}^{∞} x f(x) dx + b ∫_{−∞}^{∞} f(x) dx = aµ + b,

since ∫ x f(x) dx = E(X) = µ and ∫ f(x) dx = 1.

Var(Y) = E[{Y − E(Y)}²]
       = E[{aX + b − (aµ + b)}²]
       = E{a²(X − µ)²}
       = a² E{(X − µ)²}
       = a² Var(X) = a²σ².
2. If X is a RV and h(X) is any function, then the MGF of Y = h(X), provided it exists, is

M_Y(t) = Σ_x e^{th(x)} p(x) (X discrete) or ∫_{−∞}^{∞} e^{th(x)} f(x) dx (X continuous).

This gives us another way to find the distribution of Y = h(X): find M_Y(t), and if we recognise M_Y(t), then by uniqueness we can conclude that Y has that distribution.
Examples

1. Suppose X is continuous with CDF F(x), and F(a) = 0, F(b) = 1 (a, b can be ±∞ respectively). Let U = F(X). Observe that

M_U(t) = ∫_a^b e^{tF(x)} f(x) dx = (1/t) e^{tF(x)} |_a^b = (e^{tF(b)} − e^{tF(a)})/t = (e^t − 1)/t,

which is the U(0, 1) MGF.

2. Suppose X ∼ U(0, 1), and let Y = −(log X)/λ. Then

M_Y(t) = ∫_0^1 e^{t(−log x/λ)} (1) dx = ∫_0^1 x^{−t/λ} dx = (1/(1 − t/λ)) x^{1−t/λ} |_0^1 = 1/(1 − t/λ) = λ/(λ − t),

which is the MGF for the Exp(λ) distribution. Hence we can conclude that Y = −(log X)/λ ∼ Exp(λ).


1.7 Multivariate distributions

Definition 1.7.1. If X₁, X₂, …, X_r are discrete RVs then X = (X₁, X₂, …, X_r)^T is called a discrete random vector. The probability function P(x) is

P(x) = P(X = x) = P({X₁ = x₁} ∩ {X₂ = x₂} ∩ ⋯ ∩ {X_r = x_r}); P(x) = P(x₁, x₂, …, x_r).

1.7.1 Trinomial distribution

(r = 2.) Consider a sequence of n independent trials where each trial produces:

Outcome 1: with prob π₁
Outcome 2: with prob π₂
Outcome 3: with prob 1 − π₁ − π₂

If X₁, X₂ are the numbers of occurrences of outcomes 1 and 2 respectively, then (X₁, X₂) has a trinomial distribution.

Parameters: π₁ > 0, π₂ > 0 and π₁ + π₂ < 1; n > 0 fixed
Possible values: integers (x₁, x₂) s.t. x₁ ≥ 0, x₂ ≥ 0, x₁ + x₂ ≤ n
Probability function:

P(x₁, x₂) = [n!/(x₁! x₂! (n − x₁ − x₂)!)] π₁^{x₁} π₂^{x₂} (1 − π₁ − π₂)^{n−x₁−x₂}

for x₁, x₂ ≥ 0, x₁ + x₂ ≤ n.

1.7.2 Multinomial distribution

Parameters: n > 0, π = (π₁, π₂, …, π_r)^T with π_i > 0 and Σ_{i=1}^r π_i = 1
Possible values: integer-valued (x₁, x₂, …, x_r) s.t. x_i ≥ 0 and Σ_{i=1}^r x_i = n
Probability function:

P(x) = (n choose x₁, x₂, …, x_r) π₁^{x₁} π₂^{x₂} ⋯ π_r^{x_r}

for x_i ≥ 0, Σ_{i=1}^r x_i = n.

Remarks

1. Note (n choose x₁, x₂, …, x_r) := n!/(x₁! x₂! ⋯ x_r!) is the multinomial coefficient.

2. The multinomial distribution is the generalisation of the binomial distribution to r types of outcome.

3. The formulation differs from the binomial and trinomial cases in that the redundant count x_r = n − (x₁ + x₂ + ⋯ + x_{r−1}) is included as an argument of P(x).
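The multinomial probability function is straightforward to compute from its definition. A minimal sketch (added here for illustration; the probabilities and counts are arbitrary choices):

```python
import math

# Multinomial probability function:
# P(x) = n!/(x_1! ... x_r!) * prod(pi_i ** x_i), with n = sum(x_i).
def multinomial_pmf(xs, pis):
    n = sum(xs)
    coef = math.factorial(n)
    for x in xs:
        coef //= math.factorial(x)   # multinomial coefficient
    prob = float(coef)
    for x, pi in zip(xs, pis):
        prob *= pi ** x
    return prob

# Trinomial special case (r = 3), n = 4 trials:
print(multinomial_pmf([2, 1, 1], [0.5, 0.3, 0.2]))  # approximately 0.18
```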

1.7.3 Marginal and conditional distributions

Consider a discrete random vector X = (X₁, X₂, …, X_r)^T and let

X₁ = (X₁, X₂, …, X_{r₁})^T and X₂ = (X_{r₁+1}, X_{r₁+2}, …, X_r)^T,

so that X is partitioned as X = (X₁, X₂).

Definition 1.7.2. If X has joint probability function P_X(x) = P_X(x₁, x₂), then the marginal probability function for X₁ is

P_{X₁}(x₁) = Σ_{x₂} P_X(x₁, x₂).


Observe that

P_{X₁}(x₁) = Σ_{x₂} P_X(x₁, x₂) = Σ_{x₂} P({X₁ = x₁} ∩ {X₂ = x₂}) = P(X₁ = x₁)

by the law of total probability. Hence the marginal probability function for X₁ is just the probability function we would have if X₂ was not observed. The marginal probability function for X₂ is

P_{X₂}(x₂) = Σ_{x₁} P_X(x₁, x₂).

Definition 1.7.3. Suppose X = (X₁, X₂). If P_{X₁}(x₁) > 0, we define the conditional probability function of X₂|X₁ by

P_{X₂|X₁}(x₂|x₁) = P_X(x₁, x₂)/P_{X₁}(x₁).

Remarks

1. P_{X₂|X₁}(x₂|x₁) = P({X₁ = x₁} ∩ {X₂ = x₂})/P(X₁ = x₁) = P(X₂ = x₂ | X₁ = x₁).

2. It is easy to check that P_{X₂|X₁}(x₂|x₁) is a proper probability function with respect to x₂ for each fixed x₁ such that P_{X₁}(x₁) > 0.

3. P_{X₁|X₂}(x₁|x₂) is defined by P_{X₁|X₂}(x₁|x₂) = P_X(x₁, x₂)/P_{X₂}(x₂).

Definition 1.7.4 (Independence). Discrete RVs X₁, X₂, …, X_r are said to be independent if their joint probability function satisfies

P(x₁, x₂, …, x_r) = p₁(x₁) p₂(x₂) ⋯ p_r(x_r)

for some functions p₁, p₂, …, p_r and all (x₁, x₂, …, x_r).


Remarks

1. Observe that:

P_{X₁}(x₁) = Σ_{x₂} Σ_{x₃} ⋯ Σ_{x_r} P(x₁, …, x_r) = p₁(x₁) Σ_{x₂} Σ_{x₃} ⋯ Σ_{x_r} p₂(x₂) ⋯ p_r(x_r) = c₁ p₁(x₁).

Hence p₁(x₁) ∝ P_{X₁}(x₁) (a sum of probabilities is proportional to the marginal probability); likewise p_i(x_i) ∝ P_{X_i}(x_i). Then

P_{X₁|X₂…X_r}(x₁|x₂, …, x_r) = P(x₁, …, x_r)/P_{X₂…X_r}(x₂, …, x_r)
  = p₁(x₁) p₂(x₂) ⋯ p_r(x_r) / [p₂(x₂) p₃(x₃) ⋯ p_r(x_r) Σ_{x₁} p₁(x₁)]
  = p₁(x₁) / Σ_{x₁} p₁(x₁)
  = (1/c) P_{X₁}(x₁) / [(1/c) Σ_{x₁} P_{X₁}(x₁)]
  = P_{X₁}(x₁).

That is, P_{X₁|X₂…X_r}(x₁|x₂, …, x_r) = P_{X₁}(x₁).

2. Clearly independence ⟹

P_{X_i|X₁…X_{i−1},X_{i+1}…X_r}(x_i|x₁, …, x_{i−1}, x_{i+1}, …, x_r) = P_{X_i}(x_i).

Moreover, we have P_{X₁|X₂}(x₁|x₂) = P_{X₁}(x₁) for any partitioning of X = (X₁, X₂) if X₁, …, X_r are independent.

1.7.4 Continuous multivariate distributions

Definition. 1.7.5

The random vector $(X_1, \ldots, X_r)^T$ is said to have a continuous multivariate distribution with PDF $f(x)$ if
$$P(X \in A) = \int \cdots \int_A f(x_1, \ldots, x_r)\, dx_1 \ldots dx_r$$
for any measurable set $A$.

Examples

1. Suppose $(X_1, X_2)$ has the trinomial distribution with parameters $n, \pi_1, \pi_2$:

Outcome 1: $\pi_1$
Outcome 2: $\pi_2$
Outcome 3: $1 - \pi_1 - \pi_2$

Then the marginal distribution of $X_1$ can be seen to be $B(n, \pi_1)$, and the conditional distribution of $X_2|X_1 = x_1$ is $B\left(n - x_1, \dfrac{\pi_2}{1 - \pi_1}\right)$.

Examples of Definition 1.7.5

1. If $X_1, X_2$ have PDF
$$f(x_1, x_2) = \begin{cases} 1 & 0 < x_1 < 1,\ 0 < x_2 < 1 \\ 0 & \text{otherwise}, \end{cases}$$
the distribution is called uniform on $(0,1) \times (0,1)$. It follows that $P(X \in A) = \text{Area}(A)$ for any $A \subseteq (0,1) \times (0,1)$:

2. Uniform distribution on the unit disk:
$$f(x_1, x_2) = \begin{cases} \dfrac{1}{\pi} & x_1^2 + x_2^2 < 1 \\ 0 & \text{otherwise}. \end{cases}$$

Figure 12: Area A

3. The Dirichlet distribution is defined by the PDF
$$f(x_1, x_2, \ldots, x_r) = \frac{\Gamma(\alpha_1 + \alpha_2 + \cdots + \alpha_r + \alpha_{r+1})}{\Gamma(\alpha_1)\Gamma(\alpha_2)\ldots\Gamma(\alpha_r)\Gamma(\alpha_{r+1})}\, x_1^{\alpha_1 - 1} x_2^{\alpha_2 - 1}\ldots x_r^{\alpha_r - 1}(1 - x_1 - \cdots - x_r)^{\alpha_{r+1} - 1}$$
for $x_1, x_2, \ldots, x_r > 0$, $\sum x_i < 1$, and parameters $\alpha_1, \alpha_2, \ldots, \alpha_{r+1} > 0$.

Recall the joint PDF:
$$P(X \in A) = \int \cdots \int_A f(x_1, \ldots, x_r)\, dx_1 \ldots dx_r.$$
Note: the joint PDF must satisfy:

1. $f(x) \geq 0$ for all $x$;
2. $\displaystyle\int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} f(x_1, \ldots, x_r)\, dx_1 \ldots dx_r = 1$.

Definition. 1.7.6

If $X = \binom{X_1}{X_2}$ has joint PDF $f(x) = f(x_1, x_2)$, where $X_1 = (X_1, \ldots, X_{r_1})^T$ and $X_2 = (X_{r_1+1}, \ldots, X_r)^T$ are subvectors, then the marginal PDF of $X_1$ is given by:
$$f_{X_1}(x_1) = \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} f(x_1, \ldots, x_{r_1}, x_{r_1+1}, \ldots, x_r)\, dx_{r_1+1} dx_{r_1+2} \ldots dx_r,$$
and for $f_{X_1}(x_1) > 0$, the conditional PDF is given by
$$f_{X_2|X_1}(x_2|x_1) = \frac{f(x_1, x_2)}{f_{X_1}(x_1)}.$$

Remarks:

1. $f_{X_2|X_1}(x_2|x_1)$ cannot be interpreted directly as the conditional PDF of $X_2|\{X_1 = x_1\}$, because $P(X_1 = x_1) = 0$ for any continuous distribution. The proper interpretation is as the limit, as $\delta \to 0$, of the conditional distribution of $X_2$ given $X_1 \in B(x_1, \delta)$.

2. $f_{X_2}(x_2)$ and $f_{X_1|X_2}(x_1|x_2)$ are defined analogously.

Definition. 1.7.7 (Independence)

Continuous RV's $X_1, X_2, \ldots, X_r$ are said to be independent if their joint PDF satisfies:
$$f(x_1, x_2, \ldots, x_r) = f_1(x_1)f_2(x_2)\ldots f_r(x_r),$$
for some functions $f_1, f_2, \ldots, f_r$ and all $x_1, \ldots, x_r$.

Remarks

1. It is easy to check that if $X_1, \ldots, X_r$ are independent, then each $f_i(x_i) = c_i f_{X_i}(x_i)$, where the constants satisfy $c_1 c_2 \ldots c_r = 1$.

2. If $X_1, X_2, \ldots, X_r$ are independent, then it can be checked that
$$f_{X_1|X_2}(x_1|x_2) = f_{X_1}(x_1)$$
for any partitioning of $X = \binom{X_1}{X_2}$.

Examples

1. If $(X_1, X_2)$ has the uniform distribution on the unit disk, find:

(a) the marginal PDF $f_{X_1}(x_1)$;
(b) the conditional PDF $f_{X_2|X_1}(x_2|x_1)$.


Solution. Recall
$$f(x_1, x_2) = \begin{cases} \dfrac{1}{\pi} & x_1^2 + x_2^2 < 1 \\ 0 & \text{otherwise}. \end{cases}$$
(a)
$$\begin{aligned}
f_{X_1}(x_1) &= \int_{-\infty}^{\infty} f(x_1, x_2)\, dx_2 \\
&= \int_{-\infty}^{-\sqrt{1-x_1^2}} 0\, dx_2 + \int_{-\sqrt{1-x_1^2}}^{\sqrt{1-x_1^2}} \frac{1}{\pi}\, dx_2 + \int_{\sqrt{1-x_1^2}}^{\infty} 0\, dx_2 \\
&= 0 + \left.\frac{x_2}{\pi}\right|_{-\sqrt{1-x_1^2}}^{\sqrt{1-x_1^2}} + 0 \\
&= \begin{cases} \dfrac{2\sqrt{1-x_1^2}}{\pi} & -1 < x_1 < 1 \\ 0 & \text{otherwise}; \end{cases}
\end{aligned}$$
i.e., a semi-circular distribution.

Figure 13: A graphic showing the conditional distribution and a semi-circular distribution.

(b) The conditional density of $X_2|X_1$ is:
$$f_{X_2|X_1}(x_2|x_1) = \frac{f(x_1, x_2)}{f_{X_1}(x_1)} = \begin{cases} \dfrac{1}{2\sqrt{1-x_1^2}} & -\sqrt{1-x_1^2} < x_2 < \sqrt{1-x_1^2} \\ 0 & \text{otherwise}, \end{cases}$$
which is uniform: $U(-\sqrt{1-x_1^2}, \sqrt{1-x_1^2})$.
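As a quick numerical sanity check, one can integrate the joint density of the unit disk over $x_2$ with a simple midpoint rule and compare with the semi-circular formula just derived (a sketch; the helper names are illustrative):

```python
import math

def f_joint(x1, x2):
    # Uniform density on the unit disk: 1/pi inside, 0 outside.
    return 1.0 / math.pi if x1 * x1 + x2 * x2 < 1 else 0.0

def marginal_x1(x1, n=100_000):
    # Midpoint-rule approximation of the integral of f over x2 in (-1, 1).
    h = 2.0 / n
    return sum(f_joint(x1, -1 + (k + 0.5) * h) for k in range(n)) * h

x1 = 0.3
exact = 2.0 * math.sqrt(1 - x1 * x1) / math.pi  # semi-circular marginal
assert abs(marginal_x1(x1) - exact) < 1e-4
```

The only approximation error comes from the two boundary cells of the grid, so the agreement is very close.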

2. If $X_1, \ldots, X_r$ are any independent continuous RV's, then their joint PDF is:
$$f(x_1, \ldots, x_r) = f_{X_1}(x_1)f_{X_2}(x_2)\ldots f_{X_r}(x_r).$$
E.g. if $X_1, X_2, \ldots, X_r$ are independent $\text{Exp}(\lambda)$ RVs, then
$$f(x_1, \ldots, x_r) = \prod_{i=1}^{r} \lambda e^{-\lambda x_i} = \lambda^r e^{-\lambda \sum_{i=1}^r x_i},$$
for $x_i > 0$, $i = 1, \ldots, r$.

3. Suppose $(X_1, X_2)$ is uniformly distributed on $(0,1) \times (0,1)$. That is,
$$f(x_1, x_2) = \begin{cases} 1 & 0 < x_1 < 1,\ 0 < x_2 < 1 \\ 0 & \text{otherwise}. \end{cases}$$
Claim: $X_1, X_2$ are independent.

Proof. Let
$$f_1(x_1) = \begin{cases} 1 & 0 < x_1 < 1 \\ 0 & \text{otherwise}, \end{cases} \qquad f_2(x_2) = \begin{cases} 1 & 0 < x_2 < 1 \\ 0 & \text{otherwise}; \end{cases}$$
then $f(x_1, x_2) = f_1(x_1)f_2(x_2)$, and the two variables are independent.

4. $(X_1, X_2)$ is uniform on the unit disk. Are $X_1, X_2$ independent? NO.

Proof. We know:
$$f_{X_2|X_1}(x_2|x_1) = \begin{cases} \dfrac{1}{2\sqrt{1-x_1^2}} & -\sqrt{1-x_1^2} < x_2 < \sqrt{1-x_1^2} \\ 0 & \text{otherwise}, \end{cases}$$
i.e., $U(-\sqrt{1-x_1^2}, \sqrt{1-x_1^2})$. On the other hand,
$$f_{X_2}(x_2) = \begin{cases} \dfrac{2\sqrt{1-x_2^2}}{\pi} & -1 < x_2 < 1 \\ 0 & \text{otherwise}, \end{cases}$$
i.e., a semicircular distribution. Hence $f_{X_2}(x_2) \neq f_{X_2|X_1}(x_2|x_1)$, so the variables cannot be independent.

Definition. 1.7.8

The joint CDF of RV's $X_1, X_2, \ldots, X_r$ is defined by:
$$F(x_1, x_2, \ldots, x_r) = P(\{X_1 \leq x_1\} \cap \{X_2 \leq x_2\} \cap \ldots \cap \{X_r \leq x_r\}).$$

Remarks:

1. The definition applies to RV's of any type, i.e., discrete, continuous, "hybrid".

2. The marginal CDF of the subvector $X_1 = (X_1, \ldots, X_{r_1})^T$ is $F_{X_1}(x_1, \ldots, x_{r_1}) = F_X(x_1, \ldots, x_{r_1}, \infty, \infty, \ldots, \infty)$.

3. RV's $X_1, \ldots, X_r$ are defined to be independent if
$$F(x_1, \ldots, x_r) = F_{X_1}(x_1)F_{X_2}(x_2)\ldots F_{X_r}(x_r).$$

4. The definitions above are completely general. However, in practice it is usually easier to work with the PDF/probability function for any given example.


1.8 Transformations of several RVs

Theorem. 1.8.1

If $X_1, X_2$ are discrete with joint probability function $P(x_1, x_2)$, and $Y = X_1 + X_2$, then:

1. $Y$ has probability function
$$P_Y(y) = \sum_x P(x, y - x).$$
2. If $X_1, X_2$ are independent,
$$P_Y(y) = \sum_x P_{X_1}(x)P_{X_2}(y - x).$$

Proof. (1)
$$\begin{aligned}
P(\{Y = y\}) &= \sum_x P(\{Y = y\} \cap \{X_1 = x\}) \quad \text{(law of total probability)} \\
&= \sum_x P(\{X_1 + X_2 = y\} \cap \{X_1 = x\}) \\
&= \sum_x P(\{X_2 = y - x\} \cap \{X_1 = x\}) \\
&= \sum_x P(x, y - x).
\end{aligned}$$
(Law of total probability: if $A$ is an event and $B_1, B_2, \ldots$ are events such that $\bigcup_i B_i = S$ and $B_i \cap B_j = \emptyset$ for $j \neq i$, then $P(A) = \sum_i P(A \cap B_i) = \sum_i P(A|B_i)P(B_i)$.)

Proof. (2) Just substitute
$$P(x, y - x) = P_{X_1}(x)P_{X_2}(y - x).$$

Theorem. 1.8.2

Suppose $X_1, X_2$ are continuous with PDF $f(x_1, x_2)$, and let $Y = X_1 + X_2$. Then

1. $\displaystyle f_Y(y) = \int_{-\infty}^{\infty} f(x, y - x)\, dx$.
2. If $X_1, X_2$ are independent, then
$$f_Y(y) = \int_{-\infty}^{\infty} f_{X_1}(x)f_{X_2}(y - x)\, dx.$$

Proof. (1)
$$F_Y(y) = P(Y \leq y) = P(X_1 + X_2 \leq y) = \int_{x_1=-\infty}^{\infty} \int_{x_2=-\infty}^{y - x_1} f(x_1, x_2)\, dx_2\, dx_1.$$
Let $x_2 = t - x_1$, so that $\dfrac{dx_2}{dt} = 1$ and $dx_2 = dt$:
$$= \int_{-\infty}^{\infty} \int_{-\infty}^{y} f(x_1, t - x_1)\, dt\, dx_1 = \int_{-\infty}^{y} \left\{ \int_{-\infty}^{\infty} f(x_1, t - x_1)\, dx_1 \right\} dt$$
$$\Rightarrow f_Y(y) = F_Y'(y) = \int_{-\infty}^{\infty} f(x, y - x)\, dx.$$

Proof. (2) Take $f(x, y - x) = f_{X_1}(x)f_{X_2}(y - x)$ if $X_1, X_2$ are independent.

Figure 14: Theorem 1.8.2

Examples

1. Suppose $X_1 \sim B(n_1, p)$ and $X_2 \sim B(n_2, p)$ independently. Find the probability function for $Y = X_1 + X_2$.

Solution.
$$\begin{aligned}
P_Y(y) &= \sum_x P_{X_1}(x)P_{X_2}(y - x) \\
&= \sum_{x=\max(0,\, y-n_2)}^{\min(n_1,\, y)} \binom{n_1}{x} p^x (1-p)^{n_1 - x} \binom{n_2}{y - x} p^{y-x} (1-p)^{n_2 + x - y} \\
&= p^y (1-p)^{n_1 + n_2 - y} \sum_{x=\max(0,\, y-n_2)}^{\min(n_1,\, y)} \binom{n_1}{x}\binom{n_2}{y - x} \\
&= \binom{n_1 + n_2}{y} p^y (1-p)^{n_1 + n_2 - y} \underbrace{\sum_{x=\max(0,\, y-n_2)}^{\min(n_1,\, y)} \frac{\binom{n_1}{x}\binom{n_2}{y-x}}{\binom{n_1+n_2}{y}}}_{\text{sum of hypergeometric probability function}\ =\ 1} \\
&= \binom{n_1 + n_2}{y} p^y (1-p)^{n_1 + n_2 - y},
\end{aligned}$$
i.e., $Y \sim B(n_1 + n_2, p)$.
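The discrete convolution of Theorem 1.8.1(2) and this closure property can be checked numerically. A minimal sketch (the function names are illustrative): convolve two binomial probability functions and compare term by term with $B(n_1 + n_2, p)$.

```python
import math

def binom_pmf(n, p, k):
    # B(n, p) probability function
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

def convolve(p1, p2):
    # Probability function of Y = X1 + X2 for independent discrete RVs,
    # each given as a dict {value: probability}: P_Y(y) = sum_x P1(x) P2(y - x).
    out = {}
    for x, px in p1.items():
        for z, pz in p2.items():
            out[x + z] = out.get(x + z, 0.0) + px * pz
    return out

n1, n2, p = 3, 4, 0.3
pX1 = {k: binom_pmf(n1, p, k) for k in range(n1 + 1)}
pX2 = {k: binom_pmf(n2, p, k) for k in range(n2 + 1)}
pY = convolve(pX1, pX2)
# Agrees with B(n1 + n2, p) term by term
for y in range(n1 + n2 + 1):
    assert abs(pY[y] - binom_pmf(n1 + n2, p, y)) < 1e-12
```

The same `convolve` helper works for any pair of independent discrete distributions, not just binomials.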


2. Suppose $Z_1 \sim N(0,1)$ and $Z_2 \sim N(0,1)$ independently. Let $X = Z_1 + Z_2$. Find the PDF of $X$.

Solution.
$$\begin{aligned}
f_X(x) &= \int_{-\infty}^{\infty} \phi(z)\phi(x - z)\, dz \\
&= \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}} e^{-z^2/2}\, \frac{1}{\sqrt{2\pi}} e^{-(x-z)^2/2}\, dz \\
&= \int_{-\infty}^{\infty} \frac{1}{2\pi} e^{-\frac{1}{2}(z^2 + (x-z)^2)}\, dz \\
&= \int_{-\infty}^{\infty} \frac{1}{2\pi} e^{-(z - \frac{x}{2})^2} e^{-x^2/4}\, dz \quad \left(\text{completing the square: } z^2 + (x-z)^2 = 2\left(z - \tfrac{x}{2}\right)^2 + \tfrac{x^2}{2}\right) \\
&= \frac{1}{2\sqrt{\pi}} e^{-x^2/4} \underbrace{\int_{-\infty}^{\infty} \frac{1}{\sqrt{\pi}} e^{-\frac{(z - x/2)^2}{2 \times \frac{1}{2}}}\, dz}_{=\,1,\ \text{the } N(\frac{x}{2}, \frac{1}{2})\ \text{PDF}} \\
&= \frac{1}{2\sqrt{\pi}} e^{-x^2/4},
\end{aligned}$$
i.e., $X \sim N(0, 2)$.
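The convolution integral above can also be evaluated numerically and compared with the $N(0,2)$ density; the sketch below uses a midpoint rule over a truncated range (the helper names are illustrative, and the Gaussian tails beyond $\pm 10$ are negligible):

```python
import math

def phi(z):
    # Standard normal PDF
    return math.exp(-z * z / 2) / math.sqrt(2 * math.pi)

def conv_density(x, lo=-10.0, hi=10.0, n=4000):
    # Midpoint-rule approximation of  f_X(x) = integral of phi(z) phi(x - z) dz
    h = (hi - lo) / n
    return sum(
        phi(lo + (k + 0.5) * h) * phi(x - (lo + (k + 0.5) * h)) for k in range(n)
    ) * h

for x in (0.0, 1.0, -2.5):
    exact = math.exp(-x * x / 4) / (2 * math.sqrt(math.pi))  # N(0, 2) density
    assert abs(conv_density(x) - exact) < 1e-6
```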

Theorem. 1.8.3 (Ratio of continuous RVs)

Suppose $X_1, X_2$ are continuous with joint PDF $f(x_1, x_2)$, and let $Y = \dfrac{X_2}{X_1}$. Then $Y$ has PDF
$$f_Y(y) = \int_{-\infty}^{\infty} |x| f(x, yx)\, dx.$$
If $X_1, X_2$ are independent, we obtain
$$f_Y(y) = \int_{-\infty}^{\infty} |x| f_{X_1}(x)f_{X_2}(yx)\, dx.$$


Proof.
$$\begin{aligned}
F_Y(y) &= P(\{Y \leq y\}) \\
&= P(\{Y \leq y\} \cap \{X_1 < 0\}) + P(\{Y \leq y\} \cap \{X_1 > 0\}) \\
&= P(\{X_2 \geq yX_1\} \cap \{X_1 < 0\}) + P(\{X_2 \leq yX_1\} \cap \{X_1 > 0\}) \\
&= \int_{-\infty}^{0} \int_{x_1 y}^{\infty} f(x_1, x_2)\, dx_2\, dx_1 + \int_{0}^{\infty} \int_{-\infty}^{x_1 y} f(x_1, x_2)\, dx_2\, dx_1.
\end{aligned}$$
Substitute $x_2 = t x_1$ in both inner integrals, so $dx_2 = x_1\, dt$ (note that for $x_1 < 0$ the limits reverse):
$$\begin{aligned}
&= \int_{-\infty}^{0} \int_{y}^{-\infty} x_1 f(x_1, tx_1)\, dt\, dx_1 + \int_{0}^{\infty} \int_{-\infty}^{y} x_1 f(x_1, tx_1)\, dt\, dx_1 \\
&= \int_{-\infty}^{0} \int_{-\infty}^{y} (-x_1) f(x_1, tx_1)\, dt\, dx_1 + \int_{0}^{\infty} \int_{-\infty}^{y} x_1 f(x_1, tx_1)\, dt\, dx_1 \\
&= \int_{-\infty}^{0} \int_{-\infty}^{y} |x_1| f(x_1, tx_1)\, dt\, dx_1 + \int_{0}^{\infty} \int_{-\infty}^{y} |x_1| f(x_1, tx_1)\, dt\, dx_1
\end{aligned}$$
$$\Rightarrow f_Y(y) = F_Y'(y) = \int_{-\infty}^{\infty} |x_1| f(x_1, yx_1)\, dx_1.$$

Example

1. If $Z \sim N(0,1)$ and $X \sim \chi^2_k$ independently, then $T = \dfrac{Z}{\sqrt{X/k}}$ is said to have the $t$-distribution with $k$ degrees of freedom. Derive the PDF of $T$.


Solution. Step 1:
Let $V = \sqrt{X/k}$. We need the PDF of $V$. Recall that $\chi^2_k$ is $\text{Gamma}\left(\frac{k}{2}, \frac{1}{2}\right)$, so
$$f_X(x) = \frac{(1/2)^{k/2}}{\Gamma(k/2)} x^{k/2 - 1} e^{-x/2}, \quad x > 0.$$
Now $v = h(x) = \sqrt{x/k} \Rightarrow h^{-1}(v) = kv^2$ and $(h^{-1})'(v) = 2kv$, so
$$f_V(v) = f_X(h^{-1}(v))\,|(h^{-1})'(v)| = \frac{(1/2)^{k/2}}{\Gamma(k/2)} (kv^2)^{k/2 - 1} e^{-kv^2/2} (2kv) = \frac{2(k/2)^{k/2}}{\Gamma(k/2)} v^{k-1} e^{-kv^2/2}, \quad v > 0.$$

Step 2: Apply Theorem 1.8.3 to find the PDF of $T = \dfrac{Z}{V}$:
$$\begin{aligned}
f_T(t) &= \int_{-\infty}^{\infty} |v| f_V(v) f_Z(vt)\, dv \\
&= \int_{0}^{\infty} v\, \frac{2(k/2)^{k/2}}{\Gamma(k/2)} v^{k-1} e^{-kv^2/2}\, \frac{1}{\sqrt{2\pi}} e^{-t^2 v^2/2}\, dv \\
&= \frac{(k/2)^{k/2}}{\Gamma(k/2)\sqrt{2\pi}} \int_{0}^{\infty} v^{k-1} e^{-\frac{1}{2}(k + t^2)v^2}\, 2v\, dv;
\end{aligned}$$
substitute $u = v^2 \Rightarrow du = 2v\, dv$:
$$= \frac{(k/2)^{k/2}}{\Gamma(k/2)\sqrt{2\pi}} \int_{0}^{\infty} u^{\frac{k-1}{2}} e^{-\frac{1}{2}(k + t^2)u}\, du$$
$$\left[\text{using } \int_0^\infty u^{\alpha - 1} e^{-\lambda u}\, du = \frac{\Gamma(\alpha)}{\lambda^\alpha} \text{ with } \alpha = \frac{k+1}{2},\ \lambda = \frac{1}{2}(k + t^2)\right]$$
$$= \frac{(k/2)^{k/2}}{\Gamma(k/2)\sqrt{2\pi}} \cdot \frac{\Gamma\left(\frac{k+1}{2}\right)}{\left(\frac{1}{2}(k + t^2)\right)^{(k+1)/2}} = \frac{\Gamma\left(\frac{k+1}{2}\right)}{\Gamma\left(\frac{k}{2}\right)\sqrt{k\pi}} \left(1 + \frac{t^2}{k}\right)^{-\frac{k+1}{2}}.$$

Remarks:

1. If we take $k = 1$, we obtain
$$f(t) \propto \frac{1}{1 + t^2};$$
that is, $t_1$ is the Cauchy distribution. One can also check that $\dfrac{\Gamma(1)}{\Gamma(\frac{1}{2})\sqrt{\pi}} = \dfrac{1}{\pi}$ as required, so that $f(t) = \dfrac{1}{\pi}\cdot\dfrac{1}{1 + t^2}$.

2. As $k \to \infty$, $t_k \to N(0,1)$. To see this directly, consider the limit of the $t$-density as $k \to \infty$ for fixed $t$:
$$\lim_{k\to\infty} \left(1 + \frac{t^2}{k}\right)^{-\frac{k+1}{2}} = \lim_{k\to\infty} \left(1 + \frac{t^2}{k}\right)^{-\frac{k}{2}}; \quad \text{let } l = \frac{k}{2}:$$
$$= \lim_{l\to\infty} \left(1 + \frac{t^2/2}{l}\right)^{-l} = \frac{1}{e^{t^2/2}} = e^{-t^2/2}.$$
Recall the standard limit:
$$\lim_{n\to\infty} \left(1 + \frac{a}{n}\right)^n = e^a$$
$$\Rightarrow f_T(t) \to \frac{1}{\sqrt{2\pi}} e^{-t^2/2} \quad \text{as } k \to \infty.$$
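Both remarks can be checked numerically from the density formula just derived; the sketch below evaluates it via `math.lgamma` (to avoid overflow of $\Gamma$ for large $k$) and compares with the Cauchy and $N(0,1)$ densities:

```python
import math

def t_pdf(t, k):
    # Student t density with k degrees of freedom, as derived above;
    # lgamma keeps the normalizing constant stable for large k.
    logc = math.lgamma((k + 1) / 2) - math.lgamma(k / 2) - 0.5 * math.log(k * math.pi)
    return math.exp(logc) * (1 + t * t / k) ** (-(k + 1) / 2)

# k = 1 reduces to the Cauchy density 1 / (pi (1 + t^2))
for t in (0.0, 0.7, -2.0):
    assert abs(t_pdf(t, 1) - 1 / (math.pi * (1 + t * t))) < 1e-12

# large k approaches the N(0, 1) density at t = 0
assert abs(t_pdf(0.0, 10_000) - 1 / math.sqrt(2 * math.pi)) < 1e-4
```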

Suppose $h : \mathbb{R}^r \to \mathbb{R}^r$ is continuously differentiable. That is,
$$h(x_1, x_2, \ldots, x_r) = (h_1(x_1, \ldots, x_r), h_2(x_1, \ldots, x_r), \ldots, h_r(x_1, \ldots, x_r)),$$
where each $h_i(x)$ is continuously differentiable. Let
$$H = \begin{bmatrix} \dfrac{\partial h_1}{\partial x_1} & \dfrac{\partial h_1}{\partial x_2} & \cdots & \dfrac{\partial h_1}{\partial x_r} \\ \dfrac{\partial h_2}{\partial x_1} & \dfrac{\partial h_2}{\partial x_2} & \cdots & \dfrac{\partial h_2}{\partial x_r} \\ \vdots & \vdots & & \vdots \\ \dfrac{\partial h_r}{\partial x_1} & \dfrac{\partial h_r}{\partial x_2} & \cdots & \dfrac{\partial h_r}{\partial x_r} \end{bmatrix}.$$
If $H$ is invertible for all $x$, then there exists an inverse mapping $g : \mathbb{R}^r \to \mathbb{R}^r$ with the property that $g(h(x)) = x$. It can be proved that the matrix of partial derivatives
$$G = \left(\frac{\partial g_i}{\partial y_j}\right) \quad \text{satisfies} \quad G = H^{-1}.$$

Theorem. 1.8.4

Suppose $X_1, X_2, \ldots, X_r$ have joint PDF $f_X(x_1, \ldots, x_r)$, and let $h, g, G$ be as above. If $Y = h(X)$, then $Y$ has joint PDF
$$f_Y(y_1, y_2, \ldots, y_r) = f_X(g(y))\,|\det G(y)|.$$

Remark:
One can sometimes use $H^{-1}$ instead of $G$, but care is needed to evaluate $H^{-1}(x)$ at $x = h^{-1}(y) = g(y)$.


1.8.1 Method of regular transformations

Suppose $X = (X_1, \ldots, X_r)^T$ and $h : \mathbb{R}^r \to \mathbb{R}^s$, where $s < r$, and let
$$Y = (Y_1, \ldots, Y_s)^T = h(X).$$
One approach is as follows:

1. Choose a function $d : \mathbb{R}^r \to \mathbb{R}^{r-s}$ and let $Z = d(X) = (Z_1, Z_2, \ldots, Z_{r-s})^T$.
2. Apply Theorem 1.8.4 to $(Y_1, Y_2, \ldots, Y_s, Z_1, Z_2, \ldots, Z_{r-s})^T$.
3. Integrate over $Z_1, Z_2, \ldots, Z_{r-s}$ to get the marginal PDF for $Y$.

Example

Suppose $Z_1 \sim N(0,1)$ and $Z_2 \sim N(0,1)$ independently. Consider $h(z_1, z_2) = (r, \theta)^T$, where $r = h_1(z_1, z_2) = \sqrt{z_1^2 + z_2^2}$ and
$$\theta = h_2(z_1, z_2) = \begin{cases} \arctan\left(\dfrac{z_2}{z_1}\right) & \text{for } z_1 > 0 \\ \arctan\left(\dfrac{z_2}{z_1}\right) + \pi & \text{for } z_1 < 0 \\ \dfrac{\pi}{2}\,\operatorname{sgn} z_2 & \text{for } z_1 = 0,\ z_2 \neq 0 \\ 0 & \text{for } z_1 = z_2 = 0. \end{cases}$$
Thus $h$ maps $\mathbb{R}^2 \to [0, \infty) \times \left[-\frac{\pi}{2}, \frac{3\pi}{2}\right)$. The inverse mapping $g$ can be seen to be:
$$g(r, \theta) = \begin{pmatrix} g_1(r, \theta) \\ g_2(r, \theta) \end{pmatrix} = \begin{pmatrix} r\cos\theta \\ r\sin\theta \end{pmatrix}.$$
To find the distribution of $(R, \Theta)$:

Step 1
$$G(r, \theta) = \begin{bmatrix} \dfrac{\partial}{\partial r} g_1(r, \theta) & \dfrac{\partial}{\partial \theta} g_1(r, \theta) \\ \dfrac{\partial}{\partial r} g_2(r, \theta) & \dfrac{\partial}{\partial \theta} g_2(r, \theta) \end{bmatrix}$$
$$\frac{\partial}{\partial r}(r\cos\theta) = \cos\theta, \quad \frac{\partial}{\partial \theta}(r\cos\theta) = -r\sin\theta, \quad \frac{\partial}{\partial r}(r\sin\theta) = \sin\theta, \quad \frac{\partial}{\partial \theta}(r\sin\theta) = r\cos\theta$$
$$\Rightarrow G = \begin{bmatrix} \cos\theta & -r\sin\theta \\ \sin\theta & r\cos\theta \end{bmatrix}$$
$$\Rightarrow \det(G) = r\cos^2\theta - (-r\sin^2\theta) = r\cos^2\theta + r\sin^2\theta = r.$$

Step 2

Now apply Theorem 1.8.4. Recall,
$$f_Z(z_1, z_2) = \frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2}z_1^2} \times \frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2}z_2^2} = \frac{1}{2\pi} e^{-\frac{1}{2}(z_1^2 + z_2^2)}.$$
$$\begin{aligned}
f_{R,\Theta}(r, \theta) &= f_Z(g(r, \theta))\,|\det G(r, \theta)| \\
&= f_Z(r\cos\theta, r\sin\theta)\,|r| \\
&= \begin{cases} \dfrac{1}{2\pi}\, r\, e^{-\frac{1}{2}r^2} & \text{for } r \geq 0,\ -\frac{\pi}{2} \leq \theta < \frac{3\pi}{2} \\ 0 & \text{otherwise} \end{cases} \\
&= f_\Theta(\theta) f_R(r),
\end{aligned}$$
where
$$f_\Theta(\theta) = \begin{cases} \dfrac{1}{2\pi} & -\frac{\pi}{2} \leq \theta < \frac{3\pi}{2} \\ 0 & \text{otherwise}, \end{cases} \qquad f_R(r) = r e^{-r^2/2} \quad \text{for } r \geq 0.$$
Hence, if $Z_1, Z_2$ are i.i.d. $N(0,1)$ and we translate to polar coordinates $(R, \Theta)$, we find:

1. $R, \Theta$ are independent.
2. $\Theta \sim U(-\frac{\pi}{2}, \frac{3\pi}{2})$.
3. $R$ has the distribution called the Rayleigh distribution.

It is easy to check that $R^2 \sim \text{Exp}(1/2) \equiv \text{Gamma}(1, 1/2) \equiv \chi^2_2$.
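The last claim can be checked by simulation: $\text{Exp}(1/2)$ has mean $2$ and $P(R^2 > 2) = e^{-1}$. A seeded Monte Carlo sketch (tolerances are generous relative to the standard errors):

```python
import math
import random

random.seed(1)
# Simulate i.i.d. N(0,1) pairs and check that R^2 = Z1^2 + Z2^2 behaves like
# Exp(1/2), i.e. chi-squared with 2 degrees of freedom.
n = 100_000
r2 = [random.gauss(0, 1) ** 2 + random.gauss(0, 1) ** 2 for _ in range(n)]

mean_r2 = sum(r2) / n
assert abs(mean_r2 - 2.0) < 0.05  # Exp(1/2) has mean 2

# P(R^2 > 2) for Exp(1/2) is e^{-1} ~ 0.368
tail = sum(v > 2.0 for v in r2) / n
assert abs(tail - math.exp(-1)) < 0.01
```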

Example

Suppose $X_1 \sim \text{Gamma}(\alpha, \lambda)$ and $X_2 \sim \text{Gamma}(\beta, \lambda)$, independently. Find the distribution of $Y = X_1/(X_1 + X_2)$.

Solution. Step 1: try using $Z = X_1 + X_2$.

Step 2: If
$$\begin{cases} y = \dfrac{x_1}{x_1 + x_2} \\ z = x_1 + x_2, \end{cases} \quad \text{then} \quad g(y, z) = \begin{pmatrix} yz \\ (1 - y)z \end{pmatrix}.$$
$$G = \begin{pmatrix} z & y \\ -z & 1 - y \end{pmatrix} \Rightarrow \det(G) = z(1 - y) - (-zy) = z(1 - y) + zy = z,$$
where $z > 0$ (why?).
$$\begin{aligned}
f_{Y,Z}(y, z) &= f_{X_1}(yz)\, f_{X_2}((1 - y)z)\, z \\
&= \frac{\lambda^\alpha}{\Gamma(\alpha)} (yz)^{\alpha-1} e^{-\lambda yz}\, \frac{\lambda^\beta}{\Gamma(\beta)} ((1 - y)z)^{\beta-1} e^{-\lambda(1-y)z}\, z \\
&= \frac{1}{\Gamma(\alpha)\Gamma(\beta)} \lambda^{\alpha+\beta} y^{\alpha-1}(1-y)^{\beta-1} z^{\alpha+\beta-1} e^{-\lambda z}.
\end{aligned}$$
Step 3:
$$\begin{aligned}
f_Y(y) &= \int_0^\infty f(y, z)\, dz \\
&= \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)} y^{\alpha-1}(1-y)^{\beta-1} \int_0^\infty \frac{\lambda^{\alpha+\beta} z^{\alpha+\beta-1}}{\Gamma(\alpha+\beta)} e^{-\lambda z}\, dz \\
&= \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)} y^{\alpha-1}(1-y)^{\beta-1}, \quad 0 < y < 1,
\end{aligned}$$
i.e., $Y \sim \text{Beta}(\alpha, \beta)$.

Exercise: Justify the range of values for $y$.
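The Gamma-to-Beta result can be checked by a seeded simulation; recall $\text{Beta}(\alpha, \beta)$ has mean $\alpha/(\alpha+\beta)$, and note that `random.gammavariate` takes a scale parameter, so rate $\lambda$ corresponds to scale $1/\lambda$ (a sketch; parameter values are illustrative):

```python
import random

random.seed(7)
alpha, beta, lam = 2.0, 3.0, 1.5
n = 100_000
ys = []
for _ in range(n):
    # gammavariate(shape, scale); rate lam means scale 1/lam
    x1 = random.gammavariate(alpha, 1 / lam)
    x2 = random.gammavariate(beta, 1 / lam)
    ys.append(x1 / (x1 + x2))

mean_y = sum(ys) / n
# Beta(alpha, beta) has mean alpha / (alpha + beta) = 0.4
assert abs(mean_y - alpha / (alpha + beta)) < 0.01
assert all(0 < y < 1 for y in ys)  # the range asked about in the exercise
```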

1.9 Moments

Suppose $X_1, X_2, \ldots, X_r$ are RVs. If $Y = h(X)$ is defined by a real-valued function $h$, then to find $E(Y)$ we can:

1. Find the distribution of $Y$.
2. Calculate
$$E(Y) = \begin{cases} \displaystyle\int_{-\infty}^{\infty} y f_Y(y)\, dy & \text{continuous} \\ \displaystyle\sum_y y\, p(y) & \text{discrete.} \end{cases}$$


Theorem. 1.9.1

If $h$ and $X_1, \ldots, X_r$ are as above, then provided it exists,
$$E\{h(X)\} = \begin{cases} \displaystyle\sum_{x_1}\sum_{x_2}\cdots\sum_{x_r} h(x_1, \ldots, x_r)\, p(x_1, \ldots, x_r) & \text{if } X_1, \ldots, X_r \text{ discrete} \\ \displaystyle\int_{-\infty}^{\infty}\cdots\int_{-\infty}^{\infty} h(x_1, \ldots, x_r) f(x_1, \ldots, x_r)\, dx_1 \ldots dx_r & \text{if } X_1, \ldots, X_r \text{ continuous.} \end{cases}$$

Theorem. 1.9.2

Suppose $X_1, \ldots, X_r$ are RVs, $h_1, \ldots, h_k$ are real-valued functions and $a_1, \ldots, a_k$ are constants. Then, provided it exists,
$$E\{a_1 h_1(X) + a_2 h_2(X) + \cdots + a_k h_k(X)\} = a_1 E[h_1(X)] + a_2 E[h_2(X)] + \cdots + a_k E[h_k(X)].$$

Proof. (Continuous case)
$$\begin{aligned}
E\{a_1 h_1(X) + \cdots + a_k h_k(X)\} &= \int_{-\infty}^{\infty}\cdots\int_{-\infty}^{\infty} \{a_1 h_1(x) + \cdots + a_k h_k(x)\} f_X(x)\, dx_1 dx_2 \ldots dx_r \\
&= a_1 \int_{-\infty}^{\infty}\cdots\int_{-\infty}^{\infty} h_1(x) f_X(x)\, dx_1 \ldots dx_r + a_2 \int_{-\infty}^{\infty}\cdots\int_{-\infty}^{\infty} h_2(x) f_X(x)\, dx_1 \ldots dx_r \\
&\quad + \cdots + a_k \int_{-\infty}^{\infty}\cdots\int_{-\infty}^{\infty} h_k(x) f_X(x)\, dx_1 \ldots dx_r \\
&= a_1 E[h_1(X)] + a_2 E[h_2(X)] + \cdots + a_k E[h_k(X)],
\end{aligned}$$
as required.

Corollary.

Provided it exists,
$$E(a_1 X_1 + a_2 X_2 + \cdots + a_r X_r) = a_1 E[X_1] + a_2 E[X_2] + \cdots + a_r E[X_r].$$


Definition. 1.9.1

If $X_1, X_2$ are RVs with $E[X_i] = \mu_i$ and $\text{Var}(X_i) = \sigma_i^2$, $i = 1, 2$, we define
$$\sigma_{12} = \text{Cov}(X_1, X_2) = E[(X_1 - \mu_1)(X_2 - \mu_2)] = E[X_1 X_2] - \mu_1\mu_2,$$
$$\rho_{12} = \text{Corr}(X_1, X_2) = \frac{\sigma_{12}}{\sigma_1\sigma_2}.$$

Remark:
$$\text{Cov}(X, X) = \text{Var}(X) = E[X^2] - (E[X])^2.$$
In some contexts, it is convenient to use the notation $\sigma_{ii} = \text{Var}(X_i)$ instead of $\sigma_i^2$.

Theorem. 1.9.3

Suppose $X_1, X_2, \ldots, X_r$ are RVs with $E[X_i] = \mu_i$, $\text{Cov}(X_i, X_j) = \sigma_{ij}$, and let $a_1, a_2, \ldots, a_r$, $b_1, b_2, \ldots, b_r$ be constants. Then
$$\text{Cov}\left(\sum_{i=1}^r a_i X_i, \sum_{j=1}^r b_j X_j\right) = \sum_{i=1}^r \sum_{j=1}^r a_i b_j \sigma_{ij}.$$


Proof.
$$\begin{aligned}
\text{Cov}\left(\sum_{i=1}^r a_i X_i, \sum_{j=1}^r b_j X_j\right) &= E\left[\left\{\sum_{i=1}^r a_i X_i - E\left(\sum_{i=1}^r a_i X_i\right)\right\}\left\{\sum_{j=1}^r b_j X_j - E\left(\sum_{j=1}^r b_j X_j\right)\right\}\right] \\
&= E\left\{\left(\sum_{i=1}^r a_i X_i - \sum_{i=1}^r a_i \mu_i\right)\left(\sum_{j=1}^r b_j X_j - \sum_{j=1}^r b_j \mu_j\right)\right\} \\
&= E\left[\left\{\sum_{i=1}^r a_i (X_i - \mu_i)\right\}\left\{\sum_{j=1}^r b_j (X_j - \mu_j)\right\}\right] \\
&= E\left\{\sum_{i=1}^r \sum_{j=1}^r a_i b_j (X_i - \mu_i)(X_j - \mu_j)\right\} \\
&= \sum_{i=1}^r \sum_{j=1}^r a_i b_j E\{(X_i - \mu_i)(X_j - \mu_j)\} \quad \text{(Theorem 1.9.2)} \\
&= \sum_{i=1}^r \sum_{j=1}^r a_i b_j \sigma_{ij},
\end{aligned}$$
as required.

Corollary. 1

Under the above assumptions,
$$\text{Var}\left(\sum_{i=1}^r a_i X_i\right) = \sum_{i=1}^r \sum_{j=1}^r a_i a_j \sigma_{ij} = \sum_{i=1}^r a_i^2 \sigma_i^2 + 2\sum_{i<j} a_i a_j \sigma_{ij}$$
(the variance is the covariance of the sum with itself).

Corollary. 2

$-1 \leq \rho_{12} \leq 1$, with $|\rho_{12}| = 1$ only if $X_1$ is a linear function of $X_2$ with probability 1.
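The bilinearity of covariance and Corollary 1 can be verified exactly on a small discrete joint distribution (a sketch; the pmf values are illustrative):

```python
# A small joint pmf for (X1, X2) with dependence, given as {(x1, x2): prob}
pmf = {(0, 0): 0.2, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4, (2, 1): 0.2}

def E(h):
    # Expectation of h(X1, X2) under the joint pmf (Theorem 1.9.1)
    return sum(h(x1, x2) * p for (x1, x2), p in pmf.items())

mu1, mu2 = E(lambda a, b: a), E(lambda a, b: b)
var1 = E(lambda a, b: (a - mu1) ** 2)
var2 = E(lambda a, b: (b - mu2) ** 2)
cov12 = E(lambda a, b: (a - mu1) * (b - mu2))

a, b = 2.0, -3.0
# Var(a X1 + b X2) computed directly from the definition ...
lin_mu = E(lambda x1, x2: a * x1 + b * x2)
lin_var = E(lambda x1, x2: (a * x1 + b * x2 - lin_mu) ** 2)
# ... equals a^2 var1 + b^2 var2 + 2 a b cov12 (Corollary 1 with r = 2)
assert abs(lin_var - (a * a * var1 + b * b * var2 + 2 * a * b * cov12)) < 1e-12
```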


Proof.

Consider $0 \leq \text{Var}(X_1 - tX_2)$ for any constant $t$. By Corollary 1,
$$\text{Var}(X_1 - tX_2) = \sigma_1^2 + t^2\sigma_2^2 - 2t\sigma_{12}.$$
Hence the quadratic
$$q(t) = \sigma_2^2 t^2 - 2\sigma_{12}t + \sigma_1^2 \geq 0 \quad \text{for all } t,$$
so $q(t)$ has either no real roots ($\Delta = b^2 - 4ac < 0$) or one real root ($\Delta = 0$). Hence $\Delta \leq 0$:
$$4\sigma_{12}^2 - 4\sigma_2^2\sigma_1^2 \leq 0 \Rightarrow \sigma_{12}^2 \leq \sigma_1^2\sigma_2^2 \Rightarrow \frac{\sigma_{12}^2}{\sigma_1^2\sigma_2^2} \leq 1 \Rightarrow \rho^2 \leq 1 \Rightarrow |\rho| \leq 1.$$
Suppose $|\rho| = 1$. Then $\Delta = 0$, so there is a single $t_0$ such that
$$0 = q(t_0) = \text{Var}(X_1 - t_0 X_2),$$
i.e., $X_1 = t_0 X_2 + c$ with probability 1.

Theorem. 1.9.4

If $X_1, X_2$ are independent, and $\text{Var}(X_1)$, $\text{Var}(X_2)$ exist, then $\text{Cov}(X_1, X_2) = 0$.

Proof. (Continuous case)
$$\begin{aligned}
\text{Cov}(X_1, X_2) &= E\{(X_1 - \mu_1)(X_2 - \mu_2)\} \\
&= \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} (x_1 - \mu_1)(x_2 - \mu_2) f(x_1, x_2)\, dx_1 dx_2 \\
&= \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} (x_1 - \mu_1)(x_2 - \mu_2) f_{X_1}(x_1) f_{X_2}(x_2)\, dx_1 dx_2 \quad \text{(since $X_1, X_2$ independent)} \\
&= \left(\int_{-\infty}^{\infty} (x_1 - \mu_1) f_{X_1}(x_1)\, dx_1\right)\left(\int_{-\infty}^{\infty} (x_2 - \mu_2) f_{X_2}(x_2)\, dx_2\right) \\
&= 0 \times 0 = 0.
\end{aligned}$$

Remark:
But the converse does NOT apply, in general! That is, $\text{Cov}(X_1, X_2) = 0 \not\Rightarrow X_1, X_2$ independent.

Definition. 1.9.2

If $X_1, X_2$ are RVs, we define the symbol $E[X_1|X_2]$ to be the expectation of $X_1$ calculated with respect to the conditional distribution of $X_1|X_2$, i.e.,
$$E[X_1|X_2] = \begin{cases} \displaystyle\sum_{x_1} x_1 P_{X_1|X_2}(x_1|x_2) & X_1|X_2 \text{ discrete} \\ \displaystyle\int_{-\infty}^{\infty} x_1 f_{X_1|X_2}(x_1|x_2)\, dx_1 & X_1|X_2 \text{ continuous.} \end{cases}$$

Theorem. 1.9.5

Provided the relevant moments exist,
$$E(X_1) = E_{X_2}\{E(X_1|X_2)\},$$
$$\text{Var}(X_1) = E_{X_2}\{\text{Var}(X_1|X_2)\} + \text{Var}_{X_2}\{E(X_1|X_2)\}.$$


Proof.

1. (Continuous case)
$$\begin{aligned}
E_{X_2}\{E(X_1|X_2)\} &= \int_{-\infty}^{\infty} E(X_1|x_2) f_{X_2}(x_2)\, dx_2 \\
&= \int_{-\infty}^{\infty} \left(\int_{-\infty}^{\infty} x_1 f(x_1|x_2)\, dx_1\right) f_{X_2}(x_2)\, dx_2 \\
&= \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} x_1 f(x_1|x_2) f_{X_2}(x_2)\, dx_1 dx_2 \\
&= \int_{-\infty}^{\infty} x_1 \underbrace{\int_{-\infty}^{\infty} f(x_1, x_2)\, dx_2}_{=\, f_{X_1}(x_1)}\, dx_1 \\
&= \int_{-\infty}^{\infty} x_1 f_{X_1}(x_1)\, dx_1 = E(X_1).
\end{aligned}$$
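Both parts of Theorem 1.9.5 can be verified exactly on a small discrete joint distribution, where every expectation is a finite sum (a sketch; the pmf values are illustrative):

```python
# Exact check of E(X1) = E{E(X1|X2)} and the variance decomposition
# on a small discrete joint distribution {(x1, x2): prob}.
pmf = {(0, 0): 0.15, (1, 0): 0.25, (0, 1): 0.30, (2, 1): 0.30}

def E(h):
    return sum(h(x1, x2) * p for (x1, x2), p in pmf.items())

# Marginal of X2 and conditional moments of X1 given X2 = x2
pX2 = {}
for (x1, x2), p in pmf.items():
    pX2[x2] = pX2.get(x2, 0.0) + p

def cond_moment(x2, k):
    return sum(x1 ** k * p for (x1, v), p in pmf.items() if v == x2) / pX2[x2]

cond_mean = {v: cond_moment(v, 1) for v in pX2}
cond_var = {v: cond_moment(v, 2) - cond_moment(v, 1) ** 2 for v in pX2}

# E(X1) via the tower rule
tower = sum(cond_mean[v] * p for v, p in pX2.items())
mu1 = E(lambda a, b: a)
assert abs(tower - mu1) < 1e-12

# Var(X1) = E{Var(X1|X2)} + Var{E(X1|X2)}
var1 = E(lambda a, b: (a - mu1) ** 2)
e_cv = sum(cond_var[v] * p for v, p in pX2.items())
v_cm = sum((cond_mean[v] - tower) ** 2 * p for v, p in pX2.items())
assert abs(var1 - (e_cv + v_cm)) < 1e-12
```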

1.9.1 Moment generating functions

Definition. 1.9.3

If $X_1, X_2, \ldots, X_r$ are RVs, then the joint MGF (provided it exists) is given by
$$M_X(t) = M_X(t_1, t_2, \ldots, t_r) = E\left[e^{t_1 X_1 + t_2 X_2 + \cdots + t_r X_r}\right].$$

Theorem. 1.9.6

If $X_1, X_2, \ldots, X_r$ are mutually independent, then the joint MGF satisfies
$$M_X(t_1, \ldots, t_r) = M_{X_1}(t_1)M_{X_2}(t_2)\ldots M_{X_r}(t_r),$$
provided it exists.


Proof. (Theorem 1.9.6, continuous case)
$$\begin{aligned}
M_X(t_1, \ldots, t_r) &= \int_{-\infty}^{\infty}\cdots\int_{-\infty}^{\infty} e^{t_1 x_1 + t_2 x_2 + \cdots + t_r x_r} f_X(x_1, \ldots, x_r)\, dx_1 \ldots dx_r \\
&= \int_{-\infty}^{\infty}\cdots\int_{-\infty}^{\infty} e^{t_1 x_1} e^{t_2 x_2} \ldots e^{t_r x_r} f_{X_1}(x_1)f_{X_2}(x_2)\ldots f_{X_r}(x_r)\, dx_1 \ldots dx_r \\
&= \left(\int_{-\infty}^{\infty} e^{t_1 x_1} f_{X_1}(x_1)\, dx_1\right)\left(\int_{-\infty}^{\infty} e^{t_2 x_2} f_{X_2}(x_2)\, dx_2\right)\cdots\left(\int_{-\infty}^{\infty} e^{t_r x_r} f_{X_r}(x_r)\, dx_r\right) \\
&= M_{X_1}(t_1)M_{X_2}(t_2)\ldots M_{X_r}(t_r), \quad \text{as required.}
\end{aligned}$$

We saw previously that if $Y = h(X)$, then we can find $M_Y(t) = E[e^{th(X)}]$ from the joint distribution of $X_1, \ldots, X_r$ without calculating $f_Y(y)$ explicitly. A simple, but important, case is $Y = X_1 + X_2 + \cdots + X_r$.

Theorem. 1.9.7

Suppose $X_1, \ldots, X_r$ are RVs and let $Y = X_1 + X_2 + \cdots + X_r$. Then (assuming the MGFs exist):

1. $M_Y(t) = M_X(t, t, \ldots, t)$.
2. If $X_1, \ldots, X_r$ are independent, then $M_Y(t) = M_{X_1}(t)M_{X_2}(t)\ldots M_{X_r}(t)$.
3. If $X_1, \ldots, X_r$ are independent and identically distributed with common MGF $M_X(t)$, then
$$M_Y(t) = [M_X(t)]^r.$$


Proof.
$$M_Y(t) = E[e^{tY}] = E[e^{t(X_1 + X_2 + \cdots + X_r)}] = E[e^{tX_1 + tX_2 + \cdots + tX_r}] = M_X(t, t, \ldots, t) \quad \text{(using Definition 1.9.3).}$$
For parts (2) and (3), substitute into (1) and use Theorem 1.9.6.

Examples

Consider RVs $X, V$, defined by $V \sim \text{Gamma}(\alpha, \lambda)$ and the conditional distribution $X|V \sim \text{Po}(V)$. Find $E(X)$ and $\text{Var}(X)$.

Solution. Use
$$E(X) = E\{E(X|V)\}, \qquad \text{Var}(X) = E_V\{\text{Var}(X|V)\} + \text{Var}_V\{E(X|V)\}.$$
Here $E(X|V) = V$ and $\text{Var}(X|V) = V$, so
$$E(X) = E\{E(X|V)\} = E(V) = \frac{\alpha}{\lambda},$$
$$\text{Var}(X) = E_V(V) + \text{Var}_V(V) = \frac{\alpha}{\lambda} + \frac{\alpha}{\lambda^2} = \frac{\alpha}{\lambda}\left(1 + \frac{1}{\lambda}\right) = \alpha\left(\frac{1 + \lambda}{\lambda^2}\right).$$


Remark

The marginal distribution of $X$ is sometimes called the negative binomial distribution. In particular, when $\alpha$ is an integer, it corresponds to the definition given previously with
$$p = \frac{\lambda}{1 + \lambda}.$$

Examples

1. Suppose $X \sim$ Bernoulli with parameter $p$. Then $M_X(t) = 1 + p(e^t - 1)$. Now suppose $X_1, X_2, \ldots, X_n$ are i.i.d. Bernoulli with parameter $p$ and
$$Y = X_1 + X_2 + \cdots + X_n;$$
then $M_Y(t) = \left(1 + p(e^t - 1)\right)^n$, which agrees with the formula previously given for the binomial distribution.

2. Suppose $X_1 \sim N(\mu_1, \sigma_1^2)$ and $X_2 \sim N(\mu_2, \sigma_2^2)$ independently. Find the MGF of $Y = X_1 + X_2$.

Solution. Recall that $M_{X_1}(t) = e^{t\mu_1} e^{t^2\sigma_1^2/2}$ and $M_{X_2}(t) = e^{t\mu_2} e^{t^2\sigma_2^2/2}$:
$$M_Y(t) = M_{X_1}(t)M_{X_2}(t) = e^{t(\mu_1 + \mu_2)} e^{t^2(\sigma_1^2 + \sigma_2^2)/2} \Rightarrow Y \sim N(\mu_1 + \mu_2,\ \sigma_1^2 + \sigma_2^2).$$
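Example 1 can be checked numerically: the closed form $(1 + p(e^t - 1))^n$ should agree with $E[e^{tY}]$ computed directly by summing over the binomial pmf (a sketch; the function names are illustrative):

```python
import math

def mgf_binomial_formula(t, n, p):
    # (1 + p(e^t - 1))^n: the n-fold product of Bernoulli MGFs
    return (1 + p * (math.exp(t) - 1)) ** n

def mgf_direct(t, n, p):
    # E[e^{tY}] computed by summing over the B(n, p) probability function
    return sum(
        math.exp(t * y) * math.comb(n, y) * p**y * (1 - p) ** (n - y)
        for y in range(n + 1)
    )

for t in (-1.0, 0.0, 0.5):
    assert abs(mgf_binomial_formula(t, 8, 0.3) - mgf_direct(t, 8, 0.3)) < 1e-10
```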

1.9.2 Marginal distributions and the MGF

To find the moment generating function of the marginal distribution of any set of components of $X$, set to 0 the complementary elements of $t$ in $M_X(t)$.

Let $X = (X_1, X_2)^T$ and $t = (t_1, t_2)^T$. Then
$$M_{X_1}(t_1) = M_X\begin{pmatrix} t_1 \\ 0 \end{pmatrix}.$$
To see this result, note that if $A$ is a constant matrix and $b$ is a constant vector, then
$$M_{AX+b}(t) = e^{t^T b} M_X(A^T t).$$


Proof.
$$M_{AX+b}(t) = E[e^{t^T(AX+b)}] = e^{t^T b} E[e^{t^T AX}] = e^{t^T b} E[e^{(A^T t)^T X}] = e^{t^T b} M_X(A^T t).$$
Now partition
$$t_{r\times 1} = \begin{pmatrix} t_1 \\ t_2 \end{pmatrix} \quad \text{and} \quad A_{l\times r} = (I_{l\times l}\ \ 0_{l\times m}).$$
Note that
$$AX = (I_{l\times l}\ \ 0_{l\times m})\begin{pmatrix} X_1 \\ X_2 \end{pmatrix} = X_1 \quad \text{and} \quad A^T t_1 = \begin{pmatrix} I_{l\times l} \\ 0_{m\times l} \end{pmatrix} t_1 = \begin{pmatrix} t_1 \\ 0 \end{pmatrix}.$$
Hence,
$$M_{X_1}(t_1) = M_{AX}(t_1) = M_X(A^T t_1) = M_X\begin{pmatrix} t_1 \\ 0 \end{pmatrix},$$
as required.

Note that similar results hold for more than two random subvectors.

The major limitation of the MGF is that it may not exist. The characteristic function, on the other hand, is defined for all distributions. Its definition is similar to that of the MGF, with $it$ replacing $t$, where $i = \sqrt{-1}$; the properties of the characteristic function are similar to those of the MGF, but using it requires some familiarity with complex analysis.

1.9.3 Vector notation

Consider the random vector
\[ X = (X_1, X_2, \ldots, X_r)^T, \]
with $E[X_i] = \mu_i$, $\mathrm{Var}(X_i) = \sigma_i^2 = \sigma_{ii}$, and $\mathrm{Cov}(X_i, X_j) = \sigma_{ij}$.


Define the mean vector by
\[ \mu = E(X) = \begin{pmatrix} \mu_1 \\ \mu_2 \\ \vdots \\ \mu_r \end{pmatrix} \]
and the variance matrix (covariance matrix) by
\[ \Sigma = \mathrm{Var}(X) = [\sigma_{ij}]_{i=1,\ldots,r;\; j=1,\ldots,r}. \]
Finally, the correlation matrix is defined by
\[ R = \begin{pmatrix} 1 & \rho_{12} & \ldots & \rho_{1r} \\ & \ddots & & \vdots \\ & & \ddots & \rho_{r-1,r} \\ & & & 1 \end{pmatrix}, \quad \text{where } \rho_{ij} = \frac{\sigma_{ij}}{\sqrt{\sigma_{ii}}\sqrt{\sigma_{jj}}}. \]
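These definitions translate directly into a few lines of linear algebra. The sketch below (with an example covariance matrix invented for illustration) computes the correlation matrix as $R = D^{-1/2}\Sigma D^{-1/2}$, where $D = \mathrm{diag}(\sigma_{11}, \ldots, \sigma_{rr})$.

```python
import numpy as np

# Example variance matrix (illustrative values, not from the notes).
Sigma = np.array([[4.0,  1.2, -0.8],
                  [1.2,  9.0,  0.6],
                  [-0.8, 0.6,  1.0]])

d = np.sqrt(np.diag(Sigma))   # standard deviations sqrt(sigma_ii)
R = Sigma / np.outer(d, d)    # rho_ij = sigma_ij / (sqrt(sigma_ii) sqrt(sigma_jj))

assert np.allclose(np.diag(R), 1.0)            # unit diagonal
assert np.allclose(R, R.T)                     # symmetric, like Sigma
assert np.isclose(R[0, 1], 1.2 / (2.0 * 3.0))  # rho_12 = sigma_12/(sigma_1 sigma_2)
```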

Now suppose that $a = (a_1, \ldots, a_r)^T$ is a vector of constants and observe that
\[ a^T x = a_1 x_1 + a_2 x_2 + \cdots + a_r x_r = \sum_{i=1}^r a_i x_i. \]

Theorem. 1.9.8
Suppose $X$ has $E(X) = \mu$ and $\mathrm{Var}(X) = \Sigma$, and let $a, b \in \mathbb{R}^r$ be fixed vectors. Then,
1. $E(a^T X) = a^T \mu$
2. $\mathrm{Var}(a^T X) = a^T \Sigma a$
3. $\mathrm{Cov}(a^T X, b^T X) = a^T \Sigma b$

Remark
It is easy to check that this is just a re-statement of Theorem 1.9.3 using matrix notation.

Theorem. 1.9.9
Suppose $X$ is a random vector with $E(X) = \mu$, $\mathrm{Var}(X) = \Sigma$, and let $A_{p\times r}$ and $b \in \mathbb{R}^p$ be fixed. If $Y = AX + b$, then
\[ E(Y) = A\mu + b \quad \text{and} \quad \mathrm{Var}(Y) = A\Sigma A^T. \]


Remark
This is also a re-statement of previously established results. To see this, observe that if $a_i^T$ is the $i$th row of $A$, then $Y_i = a_i^T X$ and, moreover, the $(i,j)$th element of $A\Sigma A^T$ is
\[ a_i^T \Sigma a_j = \mathrm{Cov}\left( a_i^T X,\; a_j^T X \right) = \mathrm{Cov}(Y_i, Y_j). \]

1.9.4 Properties of variance matrices

If $\Sigma = \mathrm{Var}(X)$ for some random vector $X = (X_1, X_2, \ldots, X_r)^T$, then it must satisfy certain properties.

Since $\mathrm{Cov}(X_i, X_j) = \mathrm{Cov}(X_j, X_i)$, it follows that $\Sigma$ is a square ($r \times r$) symmetric matrix.

Definition. 1.9.4
The square, symmetric matrix $M$ is said to be positive definite [non-negative definite] if $a^T M a > 0$ [$\geq 0$] for every vector $a \in \mathbb{R}^r$ s.t. $a \neq 0$.

It is necessary and sufficient that $\Sigma$ be non-negative definite in order that it can be a variance matrix. To see the necessity of this condition, consider the linear combination $a^T X$. By Theorem 1.9.8, we have
\[ 0 \leq \mathrm{Var}(a^T X) = a^T \Sigma a \quad \text{for every } a; \]
hence $\Sigma$ must be non-negative definite.

Suppose $\lambda$ is an eigenvalue of $\Sigma$, and let $a$ be the corresponding eigenvector. Then
\[ \Sigma a = \lambda a \implies a^T \Sigma a = \lambda a^T a = \lambda \|a\|^2. \]
Hence $\Sigma$ is non-negative definite (positive definite) iff its eigenvalues are all non-negative (positive).

If $\Sigma$ is non-negative definite but not positive definite, then there must be at least one zero eigenvalue. Let $a$ be the corresponding eigenvector. Then $a \neq 0$ but $a^T \Sigma a = 0$; that is, $\mathrm{Var}(a^T X) = 0$ for that $a$. It follows that the distribution of $X$ is degenerate, in the sense that either one of the $X_i$'s is constant or else is a linear combination of the other components.


Finally, recall that if $\lambda_1, \ldots, \lambda_r$ are the eigenvalues of $\Sigma$, then $\det(\Sigma) = \prod_{i=1}^r \lambda_i$, and hence
\[ \det(\Sigma) > 0 \quad \text{for } \Sigma \text{ positive definite}, \]
\[ \det(\Sigma) = 0 \quad \text{for } \Sigma \text{ non-negative definite but not positive definite}. \]
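The eigenvalue characterisation can be illustrated numerically. The sketch below (example matrices chosen for illustration) checks that a positive definite matrix has all eigenvalues positive with $\det(\Sigma) = \prod_i \lambda_i$, and that a degenerate variance matrix has a zero eigenvalue and zero determinant.

```python
import numpy as np

# Example positive definite variance matrix (illustrative values).
Sigma = np.array([[2.0, 0.5, 0.0],
                  [0.5, 1.0, 0.3],
                  [0.0, 0.3, 1.5]])

eigenvalues = np.linalg.eigvalsh(Sigma)  # eigvalsh: for symmetric matrices
assert np.all(eigenvalues > 0)           # all positive => positive definite
assert np.isclose(np.linalg.det(Sigma), np.prod(eigenvalues))

# Degenerate case: Var of (X, -X) has Var(X1 + X2) = 0, so a zero
# eigenvalue and det = 0 (non-negative definite, not positive definite).
Degenerate = np.array([[1.0, -1.0],
                       [-1.0, 1.0]])
assert np.isclose(np.linalg.det(Degenerate), 0.0)
assert np.isclose(min(np.linalg.eigvalsh(Degenerate)), 0.0)
```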

1.10 The multivariate normal distribution

Definition. 1.10.1
The random vector $X = (X_1, \ldots, X_r)^T$ is said to have the $r$-dimensional multivariate normal distribution with parameters $\mu \in \mathbb{R}^r$ and $\Sigma_{r\times r}$ positive definite, if it has PDF
\[ f_X(x) = \frac{1}{(2\pi)^{r/2}|\Sigma|^{1/2}} e^{-\frac{1}{2}(x-\mu)^T \Sigma^{-1}(x-\mu)}. \]
We write $X \sim N_r(\mu, \Sigma)$.

Examples

1. $r = 2$: the bivariate normal distribution.

Let
\[ \mu = \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}, \quad \Sigma = \begin{bmatrix} \sigma_1^2 & \rho\sigma_1\sigma_2 \\ \rho\sigma_1\sigma_2 & \sigma_2^2 \end{bmatrix} \]
\[ \implies |\Sigma| = \sigma_1^2\sigma_2^2(1-\rho^2) \quad \text{and} \quad \Sigma^{-1} = \frac{1}{\sigma_1^2\sigma_2^2(1-\rho^2)} \begin{pmatrix} \sigma_2^2 & -\rho\sigma_1\sigma_2 \\ -\rho\sigma_1\sigma_2 & \sigma_1^2 \end{pmatrix} \]
\[ \implies f(x_1, x_2) = \frac{1}{2\pi\sigma_1\sigma_2\sqrt{1-\rho^2}} \exp\left\{ \frac{-1}{2(1-\rho^2)} \left[ \left(\frac{x_1-\mu_1}{\sigma_1}\right)^2 + \left(\frac{x_2-\mu_2}{\sigma_2}\right)^2 - 2\rho\left(\frac{x_1-\mu_1}{\sigma_1}\right)\left(\frac{x_2-\mu_2}{\sigma_2}\right) \right] \right\}. \]
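The matrix form of the PDF and the expanded scalar form must agree; the sketch below checks this at one point, with all parameter values chosen arbitrarily for illustration.

```python
import numpy as np

# Arbitrary example parameters and evaluation point (not from the notes).
mu1, mu2, s1, s2, rho = 1.0, -2.0, 1.5, 0.8, 0.6
x1, x2 = 0.5, -1.0

mu = np.array([mu1, mu2])
Sigma = np.array([[s1**2,         rho * s1 * s2],
                  [rho * s1 * s2, s2**2        ]])
d = np.array([x1, x2]) - mu

# Matrix form: (2 pi)^{-r/2} |Sigma|^{-1/2} exp(-(x-mu)^T Sigma^{-1} (x-mu)/2)
f_matrix = np.exp(-0.5 * d @ np.linalg.inv(Sigma) @ d) / (
    2 * np.pi * np.sqrt(np.linalg.det(Sigma)))

# Expanded scalar form derived above.
z1, z2 = (x1 - mu1) / s1, (x2 - mu2) / s2
q = (z1**2 + z2**2 - 2 * rho * z1 * z2) / (2 * (1 - rho**2))
f_scalar = np.exp(-q) / (2 * np.pi * s1 * s2 * np.sqrt(1 - rho**2))

assert np.isclose(f_matrix, f_scalar)
```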


2. If $Z_1, Z_2, \ldots, Z_r$ are i.i.d. $N(0,1)$ and $Z = (Z_1, \ldots, Z_r)^T$, then
\[ f_Z(z) = \prod_{i=1}^r \frac{1}{\sqrt{2\pi}} e^{-z_i^2/2} = \frac{1}{(2\pi)^{r/2}} e^{-\frac{1}{2}\sum_{i=1}^r z_i^2} = \frac{1}{(2\pi)^{r/2}} e^{-\frac{1}{2}z^T z}, \]
which is the $N_r(0_r, I_{r\times r})$ PDF.

Theorem. 1.10.1
Suppose $X \sim N_r(\mu, \Sigma)$ and let $Y = AX + b$, where $b \in \mathbb{R}^r$ and $A_{r\times r}$ invertible are fixed. Then $Y \sim N_r(A\mu + b, A\Sigma A^T)$.

Proof.
We use the transformation rule for PDFs: if $X$ has joint PDF $f_X(x)$ and $Y = g(X)$ for $g : \mathbb{R}^r \to \mathbb{R}^r$ invertible and continuously differentiable, then $f_Y(y) = f_X(h(y))|H|$, where $h = g^{-1}$ and $H = \left( \frac{\partial h_i}{\partial y_j} \right)$.

In this case, we have $g(x) = Ax + b$
\[ \Rightarrow h(y) = A^{-1}(y - b) \quad [\text{solving for } x \text{ in } y = Ax + b] \quad \Rightarrow H = A^{-1}. \]
Hence,
\[ f_Y(y) = \frac{1}{(2\pi)^{r/2}|\Sigma|^{1/2}} e^{-\frac{1}{2}(A^{-1}(y-b)-\mu)^T \Sigma^{-1}(A^{-1}(y-b)-\mu)} |A^{-1}| \]
\[ = \frac{1}{(2\pi)^{r/2}|\Sigma|^{1/2}|AA^T|^{1/2}} e^{-\frac{1}{2}(y-(A\mu+b))^T (A^{-1})^T \Sigma^{-1} A^{-1} (y-(A\mu+b))} \]
\[ = \frac{1}{(2\pi)^{r/2}|A\Sigma A^T|^{1/2}} e^{-\frac{1}{2}(y-(A\mu+b))^T (A\Sigma A^T)^{-1}(y-(A\mu+b))}, \]
which is exactly the PDF for $N_r(A\mu + b, A\Sigma A^T)$.


Now suppose $\Sigma_{r\times r}$ is any symmetric positive definite matrix and recall that we can write $\Sigma = E\Lambda E^T$, where $\Lambda = \mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_r)$ and $E_{r\times r}$ is such that $EE^T = E^TE = I$. Since $\Sigma$ is positive definite, we must have $\lambda_i > 0$, $i = 1, \ldots, r$, and we can define the symmetric square-root matrix by
\[ \Sigma^{1/2} = E\Lambda^{1/2}E^T, \quad \text{where } \Lambda^{1/2} = \mathrm{diag}(\sqrt{\lambda_1}, \sqrt{\lambda_2}, \ldots, \sqrt{\lambda_r}). \]
Note that $\Sigma^{1/2}$ is symmetric and satisfies
\[ \left( \Sigma^{1/2} \right)^2 = \Sigma^{1/2}\left( \Sigma^{1/2} \right)^T = \Sigma. \]
Now recall that if $Z_1, Z_2, \ldots, Z_r$ are i.i.d. $N(0,1)$ then $Z = (Z_1, \ldots, Z_r)^T \sim N_r(0, I)$. Because of the i.i.d. $N(0,1)$ assumption, we know in this case that $E(Z) = 0$ and $\mathrm{Var}(Z) = I$.

Now let $X = \Sigma^{1/2}Z + \mu$:

1. From Theorem 1.9.9 we have
\[ E(X) = \Sigma^{1/2}E(Z) + \mu = \mu \quad \text{and} \quad \mathrm{Var}(X) = \Sigma^{1/2}\mathrm{Var}(Z)\left( \Sigma^{1/2} \right)^T = \Sigma. \]

2. From Theorem 1.10.1, $X \sim N_r(\mu, \Sigma)$.

Since this construction is valid for any symmetric positive definite $\Sigma$ and any $\mu \in \mathbb{R}^r$, we have proved:

Theorem. 1.10.2
If $X \sim N_r(\mu, \Sigma)$ then
\[ E(X) = \mu \quad \text{and} \quad \mathrm{Var}(X) = \Sigma. \]

Theorem. 1.10.3
If $X \sim N_r(\mu, \Sigma)$ then $Z = \Sigma^{-1/2}(X - \mu) \sim N_r(0, I)$.

Proof.
Use Theorem 1.10.1.
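The square-root construction above is easy to carry out numerically via the eigendecomposition; the sketch below does so for an example $2 \times 2$ matrix and verifies the stated properties.

```python
import numpy as np

# Example symmetric positive definite Sigma (illustrative values).
Sigma = np.array([[4.0, 1.0],
                  [1.0, 3.0]])

lam, E = np.linalg.eigh(Sigma)          # Lambda (eigenvalues) and orthogonal E
root = E @ np.diag(np.sqrt(lam)) @ E.T  # Sigma^{1/2} = E Lambda^{1/2} E^T

assert np.allclose(E @ E.T, np.eye(2))  # E E^T = I
assert np.allclose(root, root.T)        # Sigma^{1/2} is symmetric
assert np.allclose(root @ root, Sigma)  # (Sigma^{1/2})^2 = Sigma
```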

Suppose
\[ \begin{pmatrix} X_1 \\ X_2 \end{pmatrix} \sim N_{r_1+r_2}\left( \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}, \begin{bmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{bmatrix} \right) \]


where $X_i$, $\mu_i$ are of dimension $r_i$ and $\Sigma_{ij}$ is $r_i \times r_j$. In order that $\Sigma$ be symmetric, we must have $\Sigma_{11}$, $\Sigma_{22}$ symmetric and $\Sigma_{12} = \Sigma_{21}^T$.

We will derive the marginal distribution of $X_2$ and the conditional distribution of $X_1 | X_2$.

Lemma. 1.10.1
If $M = \begin{bmatrix} B & A \\ O & I \end{bmatrix}$ is a square, partitioned matrix, then $|M| = |B|$ (det).

Lemma. 1.10.2
If $M = \begin{bmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{bmatrix}$ then $|M| = |\Sigma_{22}|\,|\Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}|$.

Proof. (of Lemma 1.10.2)
Let
\[ C = \begin{bmatrix} I & -\Sigma_{12}\Sigma_{22}^{-1} \\ 0 & \Sigma_{22}^{-1} \end{bmatrix} \]
and observe
\[ CM = \begin{bmatrix} \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21} & 0 \\ \Sigma_{22}^{-1}\Sigma_{21} & I \end{bmatrix}, \]
and
\[ |C| = \frac{1}{|\Sigma_{22}|}, \quad |CM| = |\Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}|. \]
Finally, observe $|CM| = |C||M|$
\[ \Rightarrow |M| = \frac{|CM|}{|C|} = |\Sigma_{22}|\,|\Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}|. \]
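Lemma 1.10.2 can be spot-checked numerically. The sketch below builds an example $4 \times 4$ partitioned matrix (block values invented for illustration) and compares $|M|$ with $|\Sigma_{22}|$ times the determinant of the Schur complement.

```python
import numpy as np

# Example 2x2 blocks (illustrative values, not from the notes).
S11 = np.array([[3.0, 0.5], [0.5, 2.0]])
S12 = np.array([[0.4, 0.1], [0.2, 0.3]])
S21 = S12.T                                    # Sigma_21 = Sigma_12^T
S22 = np.array([[1.5, 0.2], [0.2, 1.0]])

M = np.block([[S11, S12], [S21, S22]])
# Schur complement Sigma_11 - Sigma_12 Sigma_22^{-1} Sigma_21.
schur = S11 - S12 @ np.linalg.inv(S22) @ S21

assert np.isclose(np.linalg.det(M),
                  np.linalg.det(S22) * np.linalg.det(schur))
```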

Theorem. 1.10.4
Suppose that
\[ \begin{pmatrix} X_1 \\ X_2 \end{pmatrix} \sim N_{r_1+r_2}\left( \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}, \begin{bmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{bmatrix} \right). \]
Then the marginal distribution of $X_2$ is $X_2 \sim N_{r_2}(\mu_2, \Sigma_{22})$ and the conditional distribution of $X_1 | X_2 = x_2$ is
\[ N_{r_1}\left( \mu_1 + \Sigma_{12}\Sigma_{22}^{-1}(x_2 - \mu_2),\;\; \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21} \right). \]


Proof.
We will exhibit $f(x_1, x_2)$ in the form
\[ f(x_1, x_2) = h(x_1, x_2)\, g(x_2), \]
where $h(x_1, x_2)$ is a PDF with respect to $x_1$ for each (given) $x_2$. It follows then that $f_{X_1|X_2}(x_1|x_2) = h(x_1, x_2)$ and $g(x_2) = f_{X_2}(x_2)$.

Now observe that:
\[ f(x_1, x_2) = \frac{1}{(2\pi)^{(r_1+r_2)/2} \left| \begin{matrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{matrix} \right|^{1/2}} \times \exp\left[ -\frac{1}{2} \begin{pmatrix} x_1 - \mu_1 \\ x_2 - \mu_2 \end{pmatrix}^T \begin{bmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{bmatrix}^{-1} \begin{pmatrix} x_1 - \mu_1 \\ x_2 - \mu_2 \end{pmatrix} \right] \]
and:

1. By Lemma 1.10.2,
\[ (2\pi)^{(r_1+r_2)/2} \left| \begin{matrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{matrix} \right|^{1/2} = (2\pi)^{r_1/2}(2\pi)^{r_2/2}\, |\Sigma_{22}|^{1/2}\, |\Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}|^{1/2} = \left[ (2\pi)^{r_1/2}\, |\Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}|^{1/2} \right] \left[ (2\pi)^{r_2/2}\, |\Sigma_{22}|^{1/2} \right]. \]

2. Writing $V = \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}$, the inverse of the partitioned variance matrix is
\[ \begin{bmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{bmatrix}^{-1} = \begin{bmatrix} V^{-1} & -V^{-1}\Sigma_{12}\Sigma_{22}^{-1} \\ -\Sigma_{22}^{-1}\Sigma_{21}V^{-1} & \Sigma_{22}^{-1} + \Sigma_{22}^{-1}\Sigma_{21}V^{-1}\Sigma_{12}\Sigma_{22}^{-1} \end{bmatrix}, \]
so that
\[ \begin{pmatrix} x_1-\mu_1 \\ x_2-\mu_2 \end{pmatrix}^T \begin{bmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{bmatrix}^{-1} \begin{pmatrix} x_1-\mu_1 \\ x_2-\mu_2 \end{pmatrix} \]
\[ = (x_1-\mu_1)^T V^{-1}(x_1-\mu_1) - (x_1-\mu_1)^T V^{-1}\Sigma_{12}\Sigma_{22}^{-1}(x_2-\mu_2) - (x_2-\mu_2)^T \Sigma_{22}^{-1}\Sigma_{21}V^{-1}(x_1-\mu_1) \]
\[ \quad + (x_2-\mu_2)^T \Sigma_{22}^{-1}\Sigma_{21}V^{-1}\Sigma_{12}\Sigma_{22}^{-1}(x_2-\mu_2) + (x_2-\mu_2)^T \Sigma_{22}^{-1}(x_2-\mu_2). \]
\{ How did this come about: for a partitioned quadratic form,
\[ (x, y)\begin{bmatrix} a & b \\ c & d \end{bmatrix}\begin{pmatrix} x \\ y \end{pmatrix} = x(ax+by) + y(cx+dy) = ax^2 + bxy + cyx + dy^2. \] \}
Completing the square, this equals
\[ \left( (x_1-\mu_1) - \Sigma_{12}\Sigma_{22}^{-1}(x_2-\mu_2) \right)^T V^{-1} \left( (x_1-\mu_1) - \Sigma_{12}\Sigma_{22}^{-1}(x_2-\mu_2) \right) + (x_2-\mu_2)^T \Sigma_{22}^{-1}(x_2-\mu_2) \]
\{ Note: $(x-y)^T A(x-y) = (x-y)^T(Ax - Ay) = x^TAx - x^TAy - y^TAx + y^TAy$, and $\Sigma_{12}^T = \Sigma_{21}$. \}
\[ = \left( x_1 - (\mu_1 + \Sigma_{12}\Sigma_{22}^{-1}(x_2-\mu_2)) \right)^T \left( \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21} \right)^{-1} \left( x_1 - (\mu_1 + \Sigma_{12}\Sigma_{22}^{-1}(x_2-\mu_2)) \right) + (x_2-\mu_2)^T \Sigma_{22}^{-1}(x_2-\mu_2). \]

Combining (1) and (2), we obtain:
\[ f(x_1, x_2) = \frac{1}{(2\pi)^{(r_1+r_2)/2} \left| \begin{matrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{matrix} \right|^{1/2}} \exp\left\{ -\frac{1}{2} \begin{pmatrix} x_1-\mu_1 \\ x_2-\mu_2 \end{pmatrix}^T \begin{bmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{bmatrix}^{-1} \begin{pmatrix} x_1-\mu_1 \\ x_2-\mu_2 \end{pmatrix} \right\} \]
\[ = \frac{1}{(2\pi)^{r_1/2} |\Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}|^{1/2}} \exp\left\{ -\frac{1}{2} \left( x_1 - (\mu_1 + \Sigma_{12}\Sigma_{22}^{-1}(x_2-\mu_2)) \right)^T \left( \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21} \right)^{-1} \left( x_1 - (\mu_1 + \Sigma_{12}\Sigma_{22}^{-1}(x_2-\mu_2)) \right) \right\} \]
\[ \quad \times \frac{1}{(2\pi)^{r_2/2} |\Sigma_{22}|^{1/2}} \exp\left\{ -\frac{1}{2} (x_2-\mu_2)^T \Sigma_{22}^{-1}(x_2-\mu_2) \right\} \]
\[ = h(x_1, x_2)\, g(x_2), \]
where for fixed $x_2$, $h(x_1, x_2)$ is the
\[ N_{r_1}\left( \mu_1 + \Sigma_{12}\Sigma_{22}^{-1}(x_2-\mu_2),\;\; \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21} \right) \]
PDF, and $g(x_2)$ is the $N_{r_2}(\mu_2, \Sigma_{22})$ PDF.

Hence, the result is proved.
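Theorem 1.10.4 can be checked numerically in the simplest case $r_1 = r_2 = 1$. The sketch below (all parameter values invented for illustration) evaluates the joint bivariate normal density at a point and verifies that it equals the conditional density times the marginal density, with the conditional mean and variance as stated.

```python
import numpy as np

def mvn_pdf(x, mu, Sigma):
    """Multivariate normal density, straight from Definition 1.10.1."""
    r = len(mu)
    d = x - mu
    return np.exp(-0.5 * d @ np.linalg.inv(Sigma) @ d) / (
        (2 * np.pi) ** (r / 2) * np.sqrt(np.linalg.det(Sigma)))

# Example parameters and evaluation point (illustrative values).
mu = np.array([1.0, -1.0])
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])
x1, x2 = 0.3, -0.5

joint = mvn_pdf(np.array([x1, x2]), mu, Sigma)

# Conditional parameters from Theorem 1.10.4 (scalar blocks here).
cond_mean = mu[0] + Sigma[0, 1] / Sigma[1, 1] * (x2 - mu[1])
cond_var = Sigma[0, 0] - Sigma[0, 1] ** 2 / Sigma[1, 1]

conditional = mvn_pdf(np.array([x1]), np.array([cond_mean]),
                      np.array([[cond_var]]))
marginal = mvn_pdf(np.array([x2]), np.array([mu[1]]),
                   np.array([[Sigma[1, 1]]]))

assert np.isclose(joint, conditional * marginal)  # f = f(x1|x2) f(x2)
```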

Theorem. 1.10.5
Suppose that $X \sim N_r(\mu, \Sigma)$ and $Y = AX + b$, where $A_{p\times r}$ with linearly independent rows and $b \in \mathbb{R}^p$ are fixed. Then $Y \sim N_p(A\mu + b, A\Sigma A^T)$. [Note: $p \leq r$.]

Proof. Use the method of regular transformations.
Since the rows of $A$ are linearly independent, we can find $B_{(r-p)\times r}$ such that the $r \times r$ matrix $\begin{pmatrix} B \\ A \end{pmatrix}$ is invertible.


If we take $Z = BX$, we have
\[ \begin{pmatrix} Z \\ Y \end{pmatrix} = \begin{pmatrix} B \\ A \end{pmatrix} X + \begin{pmatrix} 0 \\ b \end{pmatrix} \sim N\left( \begin{bmatrix} B\mu \\ A\mu + b \end{bmatrix}, \begin{bmatrix} B\Sigma B^T & B\Sigma A^T \\ A\Sigma B^T & A\Sigma A^T \end{bmatrix} \right) \]
by Theorem 1.10.1. Hence, from Theorem 1.10.4, the marginal distribution for $Y$ is
\[ N_p(A\mu + b,\; A\Sigma A^T). \]

1.10.1 The multivariate normal MGF

The moment generating function of a random vector $X \sim N(\mu, \Sigma)$ is given by
\[ M_X(t) = e^{t^T\mu + \frac{1}{2}t^T\Sigma t}. \]
Prove this result as an exercise!

The characteristic function of $X$ is
\[ E[\exp(it^T X)] = \exp\left( it^T\mu - \frac{1}{2}t^T\Sigma t \right). \]

The marginal distribution of $X_1$ (or $X_2$) is easy to derive using the multivariate normal MGF. Let
\[ t = \begin{pmatrix} t_1 \\ t_2 \end{pmatrix}, \quad \mu = \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}. \]
Then the marginal distribution of $X_1$ is obtained by setting $t_2 = 0$ in the expression for the MGF of $X$.

Proof:
\[ M_X(t) = \exp\left( t^T\mu + \frac{1}{2}t^T\Sigma t \right) = \exp\left( t_1^T\mu_1 + t_2^T\mu_2 + \frac{1}{2}t_1^T\Sigma_{11}t_1 + t_1^T\Sigma_{12}t_2 + \frac{1}{2}t_2^T\Sigma_{22}t_2 \right). \]
Now,
\[ M_{X_1}(t_1) = M_X\begin{pmatrix} t_1 \\ 0 \end{pmatrix} = \exp\left( t_1^T\mu_1 + \frac{1}{2}t_1^T\Sigma_{11}t_1 \right), \]


which is the MGF of $X_1 \sim N_{r_1}(\mu_1, \Sigma_{11})$. Similarly for $X_2$. This means that all marginal distributions of a multivariate normal distribution are themselves multivariate normal. Note, though, that in general the opposite implication is not true: there are examples of non-normal multivariate distributions whose marginal distributions are normal.

1.10.2 Independence and normality

We have seen previously that $X$, $Y$ independent $\Rightarrow \mathrm{Cov}(X, Y) = 0$, but not vice versa. An exception is when the data are (jointly) normally distributed. In particular, if $(X, Y)$ have the bivariate normal distribution, then $\mathrm{Cov}(X, Y) = 0 \iff X, Y$ are independent.

Theorem. 1.10.6
Suppose $X_1, X_2, \ldots, X_r$ have a multivariate normal distribution. Then $X_1, X_2, \ldots, X_r$ are independent if and only if $\mathrm{Cov}(X_i, X_j) = 0$ for all $i \neq j$.

Proof.
($\Longrightarrow$) That $X_1, \ldots, X_r$ independent $\Longrightarrow \mathrm{Cov}(X_i, X_j) = 0$ for $i \neq j$ has already been proved.

($\Longleftarrow$) Suppose $\mathrm{Cov}(X_i, X_j) = 0$ for $i \neq j$.



\[ \Longrightarrow \mathrm{Var}(X) = \Sigma = \begin{bmatrix} \sigma_{11} & 0 & \ldots & 0 \\ 0 & \sigma_{22} & & \vdots \\ \vdots & & \ddots & \vdots \\ 0 & \ldots & 0 & \sigma_{rr} \end{bmatrix} \quad \text{and} \quad |\Sigma| = \sigma_{11}\sigma_{22}\cdots\sigma_{rr} \]
\[ \Longrightarrow \Sigma^{-1} = \begin{bmatrix} \sigma_{11}^{-1} & 0 & \ldots & 0 \\ 0 & \sigma_{22}^{-1} & & \vdots \\ \vdots & & \ddots & \vdots \\ 0 & \ldots & 0 & \sigma_{rr}^{-1} \end{bmatrix} \]
\[ \Longrightarrow f_X(x) = \frac{1}{(2\pi)^{r/2}|\Sigma|^{1/2}} e^{-\frac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu)} = \frac{1}{(2\pi)^{r/2}(\sigma_{11}\sigma_{22}\cdots\sigma_{rr})^{1/2}} e^{-\frac{1}{2}\sum_{i=1}^r \frac{(x_i-\mu_i)^2}{\sigma_{ii}}} \]
\[ = \left( \frac{1}{\sqrt{2\pi}\sqrt{\sigma_{11}}} e^{-\frac{1}{2}\frac{(x_1-\mu_1)^2}{\sigma_{11}}} \right) \left( \frac{1}{\sqrt{2\pi}\sqrt{\sigma_{22}}} e^{-\frac{1}{2}\frac{(x_2-\mu_2)^2}{\sigma_{22}}} \right) \cdots \left( \frac{1}{\sqrt{2\pi}\sqrt{\sigma_{rr}}} e^{-\frac{1}{2}\frac{(x_r-\mu_r)^2}{\sigma_{rr}}} \right) \]
\[ = f_1(x_1)f_2(x_2)\cdots f_r(x_r) \]
\[ \Longrightarrow X_1, \ldots, X_r \text{ are independent.} \]

Note:
\[ (x-\mu)^T\Sigma^{-1}(x-\mu) = (x_1-\mu_1,\; x_2-\mu_2,\; \ldots,\; x_r-\mu_r) \begin{bmatrix} \sigma_{11}^{-1} & 0 & \ldots & 0 \\ 0 & \sigma_{22}^{-1} & & \vdots \\ \vdots & & \ddots & \vdots \\ 0 & \ldots & 0 & \sigma_{rr}^{-1} \end{bmatrix} \begin{pmatrix} x_1-\mu_1 \\ x_2-\mu_2 \\ \vdots \\ x_r-\mu_r \end{pmatrix} = \sum_{i=1}^r \frac{(x_i-\mu_i)^2}{\sigma_{ii}}. \]


Remark
The same methods can be used to establish a similar result for block diagonal matrices. The simplest case is the following, which is most easily proved using moment generating functions: $X_1$, $X_2$ are independent if and only if $\Sigma_{12} = 0$, i.e.,
\[ \begin{pmatrix} X_1 \\ X_2 \end{pmatrix} \sim N_{r_1+r_2}\left( \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}, \begin{bmatrix} \Sigma_{11} & 0 \\ 0 & \Sigma_{22} \end{bmatrix} \right). \]

Theorem. 1.10.7
Suppose $X_1, X_2, \ldots, X_n$ are i.i.d. $N(\mu, \sigma^2)$ RVs and let
\[ \bar{X} = \frac{1}{n}\sum_{i=1}^n X_i, \quad S^2 = \frac{1}{n-1}\sum_{i=1}^n (X_i - \bar{X})^2. \]
Then $\bar{X} \sim N(\mu, \sigma^2/n)$ and
\[ \frac{(n-1)S^2}{\sigma^2} \sim \chi^2_{n-1}, \]
independently.

(Note: $S^2$ here is a random variable, and is not to be confused with the sample covariance matrix, which is also often denoted by $S$. Hopefully, the meaning of the notation will be clear in the context in which it is used.)

Proof. Observe first that if $X = (X_1, \ldots, X_n)^T$ then the i.i.d. assumption may be written as:
\[ X \sim N_n(\mu\mathbf{1}_n,\; \sigma^2 I_{n\times n}). \]

1. $\bar{X} \sim N(\mu, \sigma^2/n)$:
Observe that $\bar{X} = BX$, where
\[ B = \left[ \tfrac{1}{n} \;\; \tfrac{1}{n} \;\; \ldots \;\; \tfrac{1}{n} \right]. \]
Hence, by Theorem 1.10.5,
\[ \bar{X} \sim N(B\mu\mathbf{1},\; \sigma^2 BB^T) \equiv N(\mu, \sigma^2/n), \]
as required.


2. Independence of $\bar{X}$ and $S^2$: consider
\[ A = \begin{bmatrix} \frac{1}{n} & \frac{1}{n} & \ldots & \frac{1}{n} & \frac{1}{n} \\ 1-\frac{1}{n} & -\frac{1}{n} & \ldots & -\frac{1}{n} & -\frac{1}{n} \\ -\frac{1}{n} & 1-\frac{1}{n} & \ldots & -\frac{1}{n} & -\frac{1}{n} \\ \vdots & & \ddots & & \vdots \\ -\frac{1}{n} & -\frac{1}{n} & \ldots & 1-\frac{1}{n} & -\frac{1}{n} \end{bmatrix} \]
and observe
\[ AX = \begin{pmatrix} \bar{X} \\ X_1 - \bar{X} \\ X_2 - \bar{X} \\ \vdots \\ X_{n-1} - \bar{X} \end{pmatrix}. \]
We can check that $\mathrm{Var}(AX) = \sigma^2 AA^T$ has the form
\[ \sigma^2 \begin{bmatrix} \frac{1}{n} & 0 & \ldots & 0 \\ 0 & & & \\ \vdots & & \Sigma_{22} & \\ 0 & & & \end{bmatrix} \]
$\Rightarrow \bar{X}$ and $(X_1 - \bar{X}, X_2 - \bar{X}, \ldots, X_{n-1} - \bar{X})$ are independent (by multivariate normality).

Finally, since
\[ X_n - \bar{X} = -\left( (X_1 - \bar{X}) + (X_2 - \bar{X}) + \cdots + (X_{n-1} - \bar{X}) \right), \]
it follows that $S^2$ is a function of $(X_1 - \bar{X}, X_2 - \bar{X}, \ldots, X_{n-1} - \bar{X})$ and hence is independent of $\bar{X}$.

3. Prove:
\[ \frac{(n-1)S^2}{\sigma^2} \sim \chi^2_{n-1}. \]


Consider the identity:
\[ \sum_{i=1}^n (X_i - \mu)^2 = \sum_{i=1}^n (X_i - \bar{X})^2 + n(\bar{X} - \mu)^2 \]
(subtract and add $\bar{X}$ in the first term)
\[ \Rightarrow \sum_{i=1}^n \frac{(X_i - \mu)^2}{\sigma^2} = \sum_{i=1}^n \frac{(X_i - \bar{X})^2}{\sigma^2} + \left( \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \right)^2 \]
and let
\[ R_1 = \sum_{i=1}^n \frac{(X_i - \mu)^2}{\sigma^2}, \quad R_2 = \sum_{i=1}^n \frac{(X_i - \bar{X})^2}{\sigma^2} = \frac{(n-1)S^2}{\sigma^2}, \quad R_3 = \left( \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \right)^2. \]
If $M_1$, $M_2$, $M_3$ are the MGFs of $R_1$, $R_2$, $R_3$ respectively, then
\[ M_1(t) = M_2(t)M_3(t), \]
since $R_1 = R_2 + R_3$, with $R_2$ and $R_3$ independent ($R_2$ depends only on $S^2$, and $R_3$ depends only on $\bar{X}$)
\[ \Rightarrow M_2(t) = \frac{M_1(t)}{M_3(t)}. \]
Next, observe that
\[ R_3 = \left( \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \right)^2 \sim \chi^2_1 \;\Rightarrow\; M_3(t) = \frac{1}{(1-2t)^{1/2}}, \]
and
\[ R_1 = \sum_{i=1}^n \frac{(X_i - \mu)^2}{\sigma^2} \sim \chi^2_n \;\Rightarrow\; M_1(t) = \frac{1}{(1-2t)^{n/2}}. \]
\[ \therefore M_2(t) = \frac{1}{(1-2t)^{(n-1)/2}}, \]
which is the MGF for $\chi^2_{n-1}$. Hence,
\[ \frac{(n-1)S^2}{\sigma^2} \sim \chi^2_{n-1}. \]
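The algebraic identity that drives this proof holds for any data, not only normal samples; the sketch below (data values invented for illustration) verifies the sum-of-squares decomposition and the resulting relation $R_1 = R_2 + R_3$.

```python
import numpy as np

# Arbitrary example data, mean parameter and variance (not from the notes).
x = np.array([2.1, -0.3, 1.7, 0.9, 3.2])
mu, sigma2 = 1.0, 4.0
n = len(x)
xbar = x.mean()

# sum (x_i - mu)^2 = sum (x_i - xbar)^2 + n (xbar - mu)^2
lhs = np.sum((x - mu) ** 2)
rhs = np.sum((x - xbar) ** 2) + n * (xbar - mu) ** 2
assert np.isclose(lhs, rhs)

# Dividing through by sigma^2 gives R1 = R2 + R3 as in the proof.
R1 = lhs / sigma2
R2 = np.sum((x - xbar) ** 2) / sigma2
R3 = n * (xbar - mu) ** 2 / sigma2
assert np.isclose(R1, R2 + R3)
```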

Corollary.
If $X_1, \ldots, X_n$ are i.i.d. $N(\mu, \sigma^2)$, then:
\[ T = \frac{\bar{X} - \mu}{S/\sqrt{n}} \sim t_{n-1}. \]

Proof.
Recall that $t_k$ is the distribution of $\frac{Z}{\sqrt{V/k}}$, where $Z \sim N(0,1)$ and $V \sim \chi^2_k$ independently.

Now observe that:
\[ T = \frac{\bar{X} - \mu}{S/\sqrt{n}} = \frac{(\bar{X} - \mu)\big/(\sigma/\sqrt{n})}{\sqrt{\left. \frac{(n-1)S^2}{\sigma^2} \right/ (n-1)}} \]
and note that
\[ \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \sim N(0, 1) \quad \text{and} \quad \frac{(n-1)S^2}{\sigma^2} \sim \chi^2_{n-1}, \]
independently.


1.11 Limit Theorems

1.11.1 Convergence of random variables

Let $X_1, X_2, X_3, \ldots$ be an infinite sequence of RVs. We will consider 4 different types of convergence.

1. Convergence in probability (weak convergence)
The sequence $\{X_n\}$ is said to converge to the constant $\alpha \in \mathbb{R}$ in probability if for every $\varepsilon > 0$,
\[ \lim_{n\to\infty} P(|X_n - \alpha| > \varepsilon) = 0. \]

2. Convergence in quadratic mean
The sequence $\{X_n\}$ converges to $\alpha$ in quadratic mean if
\[ \lim_{n\to\infty} E\left( (X_n - \alpha)^2 \right) = 0. \]

3. Almost sure convergence (strong convergence)
The sequence $\{X_n\}$ is said to converge almost surely to $\alpha$ if, with probability 1, for each $\varepsilon > 0$, $|X_n - \alpha| > \varepsilon$ for only a finite number of $n \geq 1$.

Remarks
(1), (2), (3) are related by:

Figure 15: Relationships of convergence

4. Convergence in distribution
The sequence of RVs $\{X_n\}$ with CDFs $\{F_n\}$ is said to converge in distribution to the RV $X$ with CDF $F(x)$ if:
\[ \lim_{n\to\infty} F_n(x) = F(x) \quad \text{for all continuity points of } F. \]


Figure 16: Discrete case

Figure 17: Normal approximation to binomial (an application of the Central Limit Theorem), continuous case

Remarks

1. If we take
\[ F(x) = \begin{cases} 0 & x < \alpha \\ 1 & x \geq \alpha \end{cases} \]
then we have $X = \alpha$ with probability 1, and convergence in distribution to this $F$ is the same thing as convergence in probability to $\alpha$.

2. Commonly used notation for convergence in distribution is either:
(i) $\mathcal{L}[X_n] \to \mathcal{L}[X]$
(ii) $X_n \xrightarrow{D} \mathcal{L}[X]$, or e.g., $X_n \xrightarrow{D} N(0, 1)$.

3. An important result that we will use without proof is as follows: let $M_n(t)$ be the MGF of $X_n$ and $M(t)$ be the MGF of $X$. If $M_n(t) \to M(t)$ as $n \to \infty$, for each $t$ in some open interval containing 0, then $\mathcal{L}[X_n] \xrightarrow{D} \mathcal{L}[X]$. (Sometimes called the Continuity Theorem.)


Theorem. 1.11.1 (Weak law of large numbers)
Suppose $X_1, X_2, X_3, \ldots$ is a sequence of i.i.d. RVs with $E[X_i] = \mu$, $\mathrm{Var}(X_i) = \sigma^2$, and let
\[ \bar{X}_n = \frac{1}{n}\sum_{i=1}^n X_i. \]
Then $\bar{X}_n$ converges to $\mu$ in probability.

Proof. We need to show for each $\epsilon > 0$ that
\[ \lim_{n\to\infty} P(|\bar{X}_n - \mu| > \epsilon) = 0. \]
Now observe that $E[\bar{X}_n] = \mu$ and $\mathrm{Var}(\bar{X}_n) = \frac{\sigma^2}{n}$. So by Chebyshev's inequality,
\[ P(|\bar{X}_n - \mu| > \epsilon) \leq \frac{\sigma^2}{n\epsilon^2} \to 0 \text{ as } n \to \infty, \text{ for any fixed } \epsilon > 0. \]

Remarks
1. The proof given for Theorem 1.11.1 is really a corollary to the fact that $\bar{X}_n$ also converges to $\mu$ in quadratic mean.
2. There is also a version of this theorem involving almost sure convergence (the strong law of large numbers). We will not discuss this.
3. The law of large numbers is one of the fundamental principles of statistical inference. That is, it is the formal justification for the claim that the "sample mean approaches the population mean for large $n$".
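The weak law can be illustrated exactly, without simulation, in the Bernoulli case: for $\bar{X}_n$ the mean of $n$ Bernoulli($p$) variables, $P(|\bar{X}_n - p| > \epsilon)$ can be summed directly from the Binomial PMF. The sketch below (example values $p = 0.5$, $\epsilon = 0.1$) shows this tail probability shrinking as $n$ grows.

```python
import math

def tail_prob(n, p, eps):
    """Exact P(|Xbar_n - p| > eps) for the mean of n Bernoulli(p) RVs,
    obtained by summing the Binomial(n, p) PMF over the tail outcomes."""
    return sum(math.comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(n + 1) if abs(k / n - p) > eps)

p, eps = 0.5, 0.1
probs = [tail_prob(n, p, eps) for n in (10, 100, 1000)]

assert probs[0] > probs[1] > probs[2]  # tail probability decreases with n
assert probs[2] < 1e-3                 # and is already tiny by n = 1000
```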

Lemma. 1.11.1
Suppose $a_n$ is a sequence of real numbers s.t. $\lim_{n\to\infty} na_n = a$ with $|a| < \infty$. Then,
\[ \lim_{n\to\infty} (1 + a_n)^n = e^a. \]

Proof.
Omitted (but not difficult).
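A quick numeric illustration of the lemma (with the example choices $a = 2$ and $a_n = a/n + 1/n^2$, so that $na_n \to a$): the error $|(1 + a_n)^n - e^a|$ shrinks as $n$ grows.

```python
import math

a = 2.0

def approx(n):
    # a_n = a/n + 1/n^2 satisfies n * a_n = a + 1/n -> a.
    a_n = a / n + 1.0 / n**2
    return (1 + a_n) ** n

err_small = abs(approx(100) - math.e**a)
err_large = abs(approx(100000) - math.e**a)

assert err_large < err_small           # error decreases as n grows
assert err_large / math.e**a < 1e-3    # relative error already small
```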


Remarks
This is a simple generalisation of the standard limit,
\[ \lim_{n\to\infty} \left( 1 + \frac{x}{n} \right)^n = e^x. \]

Consider a sequence of i.i.d. RVs $X_1, X_2, \ldots$ with $E(X_i) = \mu$, $\mathrm{Var}(X_i) = \sigma^2$, and such that the MGF, $M_X(t)$, is defined for all $t$ in some open interval containing 0. Let $S_n = \sum_{i=1}^n X_i$ and note that $E[S_n] = n\mu$ and $\mathrm{Var}(S_n) = n\sigma^2$.

Theorem. 1.11.2 (Central Limit Theorem)
Let $X_1, X_2, \ldots$ and $S_n$ be as above and let $Z_n = \frac{S_n - n\mu}{\sigma\sqrt{n}}$. Then
\[ \mathcal{L}[Z_n] \to N(0, 1) \quad \text{as } n \to \infty. \]

Proof.

We will use the fact that it is sufficient to prove that
\[ M_{Z_n}(t) \to e^{t^2/2} \quad \text{for each fixed } t. \]
[Note: if $Z \sim N(0,1)$ then $M_Z(t) = e^{t^2/2}$.]

Now let $U_i = \frac{X_i - \mu}{\sigma}$ and observe that $Z_n = \frac{1}{\sqrt{n}}\sum_{i=1}^n U_i$. Since the $U_i$ are independent we have:
\[ M_{Z_n}(t) = \left\{ M_U(t/\sqrt{n}) \right\}^n \quad \left[ M_{aX}(t) = E[e^{taX}] = M_X(at) \right] \]
Now,
\[ E(U_i) = 0 \Rightarrow M'_U(0) = 0, \quad \mathrm{Var}(U_i) = 1 \Rightarrow M''_U(0) = 1. \]


Consider the second-order Taylor expansion,
\[ M_U(t) = M_U(0) + M'_U(0)t + M''_U(0)\frac{t^2}{2} + r(t) = 1 + 0 + t^2/2 + r(t), \]
where $\lim_{s\to 0} \frac{r(s)}{s^2} = 0$ (using $M_U(0) = 1$ and $M'_U(0) = 0$).
\[ \Rightarrow M_{Z_n}(t) = \left\{ M_U(t/\sqrt{n}) \right\}^n = \left\{ 1 + t^2/2n + r(t/\sqrt{n}) \right\}^n = (1 + a_n)^n, \]
where
\[ a_n = \frac{t^2}{2n} + r(t/\sqrt{n}). \]
Next observe that $\lim_{n\to\infty} na_n = \frac{t^2}{2}$ for fixed $t$.
[To check this, observe that
\[ \lim_{n\to\infty} \frac{nt^2}{2n} = \frac{t^2}{2} \quad \text{and} \quad \lim_{n\to\infty} n\,r(t/\sqrt{n}) = \lim_{n\to\infty} \frac{t^2\, r(t/\sqrt{n})}{(t/\sqrt{n})^2} = t^2 \lim_{s\to 0} \frac{r(s)}{s^2} = 0 \quad \text{for fixed } t. \]
Note: $s = t/\sqrt{n} \to 0$ as $n \to \infty$.]

Hence we have shown $M_{Z_n}(t) = (1 + a_n)^n$, where $\lim_{n\to\infty} na_n = t^2/2$, so by Lemma 1.11.1,
\[ \lim_{n\to\infty} M_{Z_n}(t) = e^{t^2/2} \quad \text{for each fixed } t. \]

Remarks
1. The Central Limit Theorem can be stated equivalently for $\bar{X}_n$, i.e.,
\[ \mathcal{L}\left( \frac{\bar{X}_n - \mu}{\sigma/\sqrt{n}} \right) \to N(0, 1) \]
\[ \left( \text{just note that } \frac{\bar{X}_n - \mu}{\sigma/\sqrt{n}} = \frac{S_n - n\mu}{\sigma\sqrt{n}} \right). \]
2. The Central Limit Theorem holds under conditions more general than those given above. In particular, with suitable assumptions,
(i) $M_X(t)$ need not exist.
(ii) $X_1, X_2, \ldots$ need not be i.i.d..
3. Theorems 1.11.1 and 1.11.2 are concerned with the asymptotic behaviour of $\bar{X}_n$. Theorem 1.11.1 states $\bar{X}_n \to \mu$ in probability as $n \to \infty$. Theorem 1.11.2 states $\frac{\bar{X}_n - \mu}{\sigma/\sqrt{n}} \xrightarrow{D} N(0, 1)$ as $n \to \infty$. These results are not contradictory, because $\mathrm{Var}(\bar{X}_n) \to 0$, but the Central Limit Theorem is concerned with $\frac{\bar{X}_n - E(\bar{X}_n)}{\sqrt{\mathrm{Var}(\bar{X}_n)}}$.



2 Statistical Inference

2.1 Basic definitions and terminology

Probability is concerned partly with the problem of predicting the behaviour of the RV $X$, assuming we know its distribution. Statistical inference is concerned with the inverse problem: given data $x_1, x_2, \ldots, x_n$ with unknown CDF $F(x)$, what can we conclude about $F(x)$?

In this course, we are concerned with parametric inference. That is, we assume $F$ belongs to a given family of distributions, indexed by the parameter $\theta$:
\[ \mathcal{I} = \{ F(x; \theta) : \theta \in \Theta \}, \]
where $\Theta$ is the parameter space.

Examples

(1) $\mathcal{I}$ is the family of normal distributions:
\[ \theta = (\mu, \sigma^2), \quad \Theta = \{ (\mu, \sigma^2) : \mu \in \mathbb{R},\; \sigma^2 \in \mathbb{R}^+ \}, \]
then
\[ \mathcal{I} = \left\{ F(x) : F(x) = \Phi\left( \frac{x - \mu}{\sigma} \right) \right\}. \]

(2) $\mathcal{I}$ is the family of Bernoulli distributions with success probability $\theta$:
\[ \Theta = \{ \theta \in [0, 1] \subset \mathbb{R} \}. \]

In this framework, the problem is then to use the data $x_1, \ldots, x_n$ to draw conclusions about $\theta$.

Definition. 2.1.1
A collection of i.i.d. RVs, $X_1, \ldots, X_n$, with common CDF $F(x; \theta)$, is said to be a random sample (from $F(x; \theta)$).

Definition. 2.1.2
Any function $T = T(x_1, x_2, \ldots, x_n)$ that can be calculated from the data (without knowledge of $\theta$) is called a statistic.

Example
The sample mean $\bar{x} = \frac{1}{n}\sum_{i=1}^n x_i$ is a statistic.


Definition. 2.1.3<br />

A statistic T with property T (x) ∈ Θ ∀x is called an estimator for θ.<br />

Example<br />

For x 1 , x 2 , . . . , x n i.i.d. N(µ, σ 2 ), we have θ = (µ, σ 2 ). The quantity (¯x, s 2 ) is an estimator<br />

for θ, where ¯x = 1 n∑<br />

x i , S 2 = 1 n∑<br />

(x i − ¯x) 2 .<br />

n<br />

n − 1<br />

i=1<br />

There are two important concepts here: the first is that estimators are random<br />

variables; the second is that you need to be able to distinguish between random variables<br />

and their realisations. In particular, an estimate is a realisation <strong>of</strong> a random<br />

variable.<br />

For example, strictly speaking, x 1 , x 2 , . . . , x n are realisations <strong>of</strong> random variables<br />

X 1 , X 2 , . . . , X n , and ¯x = 1 n∑<br />

x i is a realisation <strong>of</strong> ¯X; ¯X is an estimator, and ¯x is an<br />

n<br />

estimate.<br />

i=1<br />

We will find that from now on however, that it is <strong>of</strong>ten more convenient, if less rigorous,<br />

to use the same symbol for estimator and estimate. This arises especially in the use <strong>of</strong><br />

ˆθ as both estimator and estimate, as we shall see.<br />

An unsatisfactory aspect of Definition 2.1.3 is that it gives no guidance on how to recognize (or construct) good estimators. Unless stated otherwise, we will assume that θ is a scalar parameter in the following. In broad terms, we would like an estimator to be "as close as possible to" θ "with high probability".

Definition 2.1.4
The mean squared error of the estimator T of θ is defined by

MSE_T(θ) = E{(T − θ)²}.

Example
Suppose X_1, ..., X_n are i.i.d. Bernoulli(θ) RVs, and let T = X̄ = 'proportion of successes'. Since nT ∼ B(n, θ), we have

E(nT) = nθ,   Var(nT) = nθ(1 − θ)
⟹ E(T) = θ,   Var(T) = θ(1 − θ)/n
⟹ MSE_T(θ) = Var(T) = θ(1 − θ)/n.
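The formula MSE_T(θ) = θ(1 − θ)/n can be checked by simulation. A minimal sketch in Python (the values θ = 0.3, n = 50 and the repetition count are arbitrary illustrative choices):

```python
import random

def mse_of_proportion(theta, n, reps=50_000, seed=1):
    """Monte Carlo estimate of MSE_T(theta) for T = sample proportion."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(reps):
        t = sum(rng.random() < theta for _ in range(n)) / n
        total += (t - theta) ** 2
    return total / reps

theta, n = 0.3, 50
simulated = mse_of_proportion(theta, n)
exact = theta * (1 - theta) / n   # the MSE derived above
print(simulated, exact)
```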

Remark
This example shows that MSE_T(θ) must be thought of as a function of θ rather than just a number; see, e.g., Figure 18.

[Figure 18: MSE_T(θ) as a function of θ]

Intuitively, a good estimator is one for which the MSE is as small as possible. However, quantifying this idea is complicated, because the MSE is a function of θ, not just a number. See Figure 19.

For this reason, it turns out that it is not possible to construct a minimum MSE estimator in general. To see why, suppose T* is a minimum MSE estimator for θ, and consider the constant estimator T_a = a, where a ∈ R is arbitrary. Observe that MSE_{T_a}(a) = 0; hence, if T* exists, we must have MSE_{T*}(a) = 0. As a is arbitrary, we must have MSE_{T*}(θ) = 0 for all θ ∈ Θ

⟹ T* = θ with probability 1.

Therefore we conclude that (excluding trivial situations) no minimum MSE estimator can exist.

Definition 2.1.5
The bias of the estimator T is defined by

b_T(θ) = E(T) − θ.


Figure 19: MSE T (θ) as a function <strong>of</strong> θ<br />

An estimator T with b_T(θ) = 0, i.e., E(T) = θ, is said to be unbiased.

Remarks
(1) Although unbiasedness is an appealing property, not all commonly used estimators are unbiased, and in some situations it is impossible to construct unbiased estimators for the parameter of interest.



Example (Unbiasedness isn't everything)
We know E(s²) = σ². If E(s) = σ, then

Var(s) = E(s²) − {E(s)}² = 0,

which is not the case

⟹ E(s) < σ.

So s² is unbiased for σ², but s is a biased estimator of σ.

(2) Intuitively, unbiasedness would seem to be a pre-requisite for a good estimator. This is to some extent formalized as follows:

Theorem 2.1.1
MSE_T(θ) = Var(T) + b_T(θ)².
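Theorem 2.1.1 can be verified exactly in a small discrete model. A sketch in Python: the shrinkage estimator T = (S + 1)/(n + 2) of a Bernoulli success probability θ, with S ∼ B(n, θ), is biased (this particular T is an arbitrary illustrative choice), and its exact moments satisfy MSE = Var + bias²:

```python
from math import comb

def moments(theta, n):
    """Exact E(T) and E(T^2) for T = (S + 1)/(n + 2), S ~ Binomial(n, theta)."""
    e_t = e_t2 = 0.0
    for s in range(n + 1):
        p = comb(n, s) * theta**s * (1 - theta)**(n - s)
        t = (s + 1) / (n + 2)
        e_t += p * t
        e_t2 += p * t * t
    return e_t, e_t2

theta, n = 0.3, 20
e_t, e_t2 = moments(theta, n)
mse = e_t2 - 2 * theta * e_t + theta**2   # E{(T - theta)^2}, expanded
var = e_t2 - e_t**2
bias = e_t - theta
print(mse, var + bias**2)
```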

Remark
Restricting attention to unbiased estimators excludes estimators of the form T_a = a. We will see that this permits the construction of Minimum Variance Unbiased Estimators (MVUEs) in some cases.

2.2 Minimum Variance Unbiased Estimation

Consider data x_1, x_2, ..., x_n assumed to be observations of RVs X_1, X_2, ..., X_n with joint PDF f_X(x; θ) (or probability function P_X(x; θ)).

Definition 2.2.1
The likelihood function is defined by

L(θ; x) = f_X(x; θ) if X is continuous, and L(θ; x) = P_X(x; θ) if X is discrete.

The log-likelihood function is

l(θ; x) = log L(θ; x)

(log is the natural log, i.e., ln).

Remark
If x_1, x_2, ..., x_n are independent, the log-likelihood function can be written as

l(θ; x) = ∑_{i=1}^n log f_i(x_i; θ).

If x_1, x_2, ..., x_n are i.i.d., we have

l(θ; x) = ∑_{i=1}^n log f(x_i; θ).

Definition 2.2.2
Consider a statistical problem with log-likelihood l(θ; x). The score is defined by

U(θ; x) = ∂l/∂θ,

and the Fisher information is

I(θ) = E(−∂²l/∂θ²).

Remark
For a single observation with PDF f(x; θ), the information is

i(θ) = E( −(∂²/∂θ²) log f(X; θ) ).

In the case of x_1, x_2, ..., x_n i.i.d., we have I(θ) = ni(θ).

Theorem 2.2.1
Under suitable regularity conditions,

E{U(θ; X)} = 0   and   Var{U(θ; X)} = I(θ).


Proof.
Observe

E{U(θ; X)} = ∫_{−∞}^{∞} ··· ∫_{−∞}^{∞} U(θ; x) f(x; θ) dx_1 ··· dx_n
= ∫ ··· ∫ ( (∂/∂θ) log f(x; θ) ) f(x; θ) dx_1 ··· dx_n
= ∫ ··· ∫ ( (∂f(x; θ)/∂θ) / f(x; θ) ) f(x; θ) dx_1 ··· dx_n
= ∫ ··· ∫ (∂/∂θ) f(x; θ) dx_1 ··· dx_n
= (∂/∂θ) ∫ ··· ∫ f(x; θ) dx_1 ··· dx_n   (by regularity)
= (∂/∂θ) 1
= 0, as required.

To calculate Var{U(θ; X)}, observe

∂²l(θ; x)/∂θ² = (∂/∂θ){ (∂f(x; θ)/∂θ) / f(x; θ) }
= { (∂²f(x; θ)/∂θ²) f(x; θ) − (∂f(x; θ)/∂θ)² } / f(x; θ)²
= (∂²f(x; θ)/∂θ²) / f(x; θ) − U²(θ; x)

⟹ U²(θ; x) = (∂²f(x; θ)/∂θ²) / f(x; θ) − ∂²l/∂θ²

⟹ Var{U(θ; X)} = E{U²(θ; X)}   (as E(U) = 0)
= E{ (∂²f(X; θ)/∂θ²) / f(X; θ) } + I(θ)   (by definition of I(θ)).

Finally observe that

E{ (∂²f(X; θ)/∂θ²) / f(X; θ) } = ∫_{−∞}^{∞} ··· ∫_{−∞}^{∞} ( (∂²f(x; θ)/∂θ²) / f(x; θ) ) f(x; θ) dx_1 ··· dx_n
= ∫ ··· ∫ (∂²/∂θ²) f(x; θ) dx_1 ··· dx_n
= (∂²/∂θ²) ∫ ··· ∫ f(x; θ) dx_1 ··· dx_n
= (∂²/∂θ²) 1
= 0.

Hence, we have proved Var{U(θ; X)} = I(θ).
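Both identities in Theorem 2.2.1 can be checked exactly in the simplest discrete case: a single Bernoulli(θ) observation, for which the score is U = x/θ − (1 − x)/(1 − θ) and i(θ) = 1/(θ(1 − θ)) (a standard result, not derived in the text above). A sketch in Python, with θ = 0.3 an arbitrary choice:

```python
theta = 0.3
# Score of a single Bernoulli(theta) observation: U = x/theta - (1 - x)/(1 - theta)
u = {1: 1 / theta, 0: -1 / (1 - theta)}
p = {1: theta, 0: 1 - theta}
e_u = sum(p[x] * u[x] for x in (0, 1))                 # should be 0
var_u = sum(p[x] * u[x] ** 2 for x in (0, 1)) - e_u**2  # should equal i(theta)
info = 1 / (theta * (1 - theta))                        # i(theta) for this model
print(e_u, var_u, info)
```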

Theorem 2.2.2 (Cramér-Rao Lower Bound)
If T is an unbiased estimator for θ, then Var(T) ≥ 1/I(θ).

Proof.
Observe

Cov{T(X), U(θ; X)} = E{T(X)U(θ; X)}   (since E(U) = 0)
= ∫_{−∞}^{∞} ··· ∫_{−∞}^{∞} T(x) U(θ; x) f(x; θ) dx_1 ··· dx_n
= ∫ ··· ∫ T(x) ( (∂f(x; θ)/∂θ) / f(x; θ) ) f(x; θ) dx_1 ··· dx_n
= (∂/∂θ) ∫ ··· ∫ T(x) f(x; θ) dx_1 ··· dx_n
= (∂/∂θ) E{T(X)}
= (∂/∂θ) θ
= 1.

To summarize, Cov(T, U) = 1. Recall that Cov²(T, U) ≤ Var(T) Var(U) [i.e., |ρ| ≤ 1; divide both sides by the RHS]

⟹ Var(T) ≥ Cov²(T, U)/Var(U) = 1/I(θ), as required.

Example
Suppose X_1, X_2, ..., X_n are i.i.d. Po(λ) RVs, and let X̄ = (1/n)∑_{i=1}^n X_i. We will prove that X̄ is a MVUE for λ.

Proof.
(1) Recall if X ∼ Po(λ), then P(x) = e^{−λ}λ^x/x!, E(X) = λ and Var(X) = λ. Hence, E(X̄) = λ and Var(X̄) = λ/n, so X̄ is unbiased for λ.

(2) To show that X̄ is MVUE, we will show that Var(X̄) = 1/I(λ).

Step 1: the log-likelihood is

l(λ; x) = log P(x; λ) = log ∏_{i=1}^n e^{−λ}λ^{x_i}/x_i!
= log( e^{−nλ} λ^{∑_{i=1}^n x_i} ∏_{i=1}^n (1/x_i!) )
= −nλ + ∑_{i=1}^n x_i log λ − log ∏_{i=1}^n x_i!
= −nλ + nx̄ log λ − log ∏_{i=1}^n x_i!.

Step 2: find ∂²l/∂λ²:

∂l/∂λ = −n + nx̄/λ ⟹ ∂²l/∂λ² = −nx̄/λ².

Step 3:

I(λ) = −E(∂²l/∂λ²) = E(nX̄/λ²) = nE(X̄)/λ² = nλ/λ² = n/λ.

(3) Finally, observe that Var(X̄) = λ/n = 1/I(λ). By Theorem 2.2.2, any unbiased estimator T for λ must have

Var(T) ≥ 1/I(λ) = λ/n

⟹ X̄ is a MVUE.
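The conclusion can be illustrated by simulation: X̄ attains variance ≈ λ/n = 1/I(λ), while another unbiased estimator of λ, the sample variance S² (unbiased because Var(X) = λ for the Poisson), is markedly worse. A sketch in Python (λ = 4, n = 30 and the sampler are illustrative choices; the sampler is Knuth's product-of-uniforms method, adequate for small λ):

```python
import math
import random
import statistics

def rpois(rng, lam):
    """Knuth's product-of-uniforms Poisson sampler (fine for small lam)."""
    L, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= L:
            return k
        k += 1

rng = random.Random(2)
lam, n, reps = 4.0, 30, 3000
xbars, s2s = [], []
for _ in range(reps):
    sample = [rpois(rng, lam) for _ in range(n)]
    xbars.append(statistics.mean(sample))
    s2s.append(statistics.variance(sample))   # S^2: also unbiased for lam
var_xbar = statistics.variance(xbars)
var_s2 = statistics.variance(s2s)
crlb = lam / n
print(var_xbar, var_s2, crlb)
```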

Theorem 2.2.3
The unbiased estimator T(x) can achieve the Cramér-Rao Lower Bound only if the joint PDF/probability function has the form

f(x; θ) = exp{A(θ)T(x) + B(θ) + h(x)},

where A, B are functions such that θ = −B′(θ)/A′(θ), and h is some function of x.

Proof. Recall from the proof of Theorem 2.2.2 that the bound arises from the inequality

Cor²(T(X), U(θ; X)) ≤ 1,

where U(θ; x) = ∂l/∂θ. Moreover, it is easy to see that the Cramér-Rao Lower Bound (CRLB) can be achieved only when

Cor²{T(X), U(θ; X)} = 1.

Hence the CRLB is achieved only if U is a linear function of T with probability 1 (correlation equals 1):

U = aT + b, where a, b are constants (they can depend on θ but not on x),

i.e.

∂l/∂θ = U(θ; x) = a(θ)T(x) + b(θ) for all x.

Integrating wrt θ, we obtain

log f(x; θ) = l(θ; x) = A(θ)T(x) + B(θ) + h(x),

where A′(θ) = a(θ), B′(θ) = b(θ). Hence,

f(x; θ) = exp{A(θ)T(x) + B(θ) + h(x)}.

Finally, to ensure that E(T) = θ, recall that E{U(θ; X)} = 0 and observe that in this case,

E{U(θ; X)} = E[A′(θ)T(X) + B′(θ)] = A′(θ)E[T(X)] + B′(θ) = 0

⟹ E[T(X)] = −B′(θ)/A′(θ).

Hence, in order that T(X) be unbiased for θ, we must have −B′(θ)/A′(θ) = θ.

Definition 2.2.3
A probability density function/probability function is said to be a single-parameter exponential family if it has the form

f(x; θ) = exp{A(θ)t(x) + B(θ) + h(x)}

for all x ∈ D ⊆ R, where D does not depend on θ.

If x_1, ..., x_n is a random sample from an exponential family, the joint PDF/probability function becomes

f(x; θ) = ∏_{i=1}^n exp{A(θ)t(x_i) + B(θ) + h(x_i)}
= exp{ A(θ)∑_{i=1}^n t(x_i) + nB(θ) + ∑_{i=1}^n h(x_i) }.

In this case, a minimum variance unbiased estimator that achieves the CRLB can be seen to be

T = (1/n)∑_{i=1}^n t(x_i),

which is the MVUE for E(T) = −B′(θ)/A′(θ).

Example (Poisson example revisited)
Observe that the Poisson probability function,

p(x; λ) = e^{−λ}λ^x/x!
= exp{(log λ)x − λ − log x!}
= exp{A(λ)t(x) + B(λ) + h(x)},

where

A(λ) = log λ,   B(λ) = −λ,   t(x) = x,   h(x) = −log x!.

Hence X̄ = (1/n)∑_{i=1}^n t(x_i) is the MVUE for

−B′(λ)/A′(λ) = −(−1)/(1/λ) = λ.

Example (2)
Consider the Exp(λ) distribution. The PDF is

f(x) = λe^{−λx}, x > 0
= exp{−λx + log λ}

⟹ f is an exponential family with

A(λ) = −λ,   B(λ) = log λ,   t(x) = x,   h(x) = 0.

We can also check that E(X) = −B′(λ)/A′(λ). In particular, we have seen previously that E(X) = 1/λ for X ∼ Exp(λ). Now observe that A′(λ) = −1, B′(λ) = 1/λ

⟹ −B′(λ)/A′(λ) = 1/λ, as required.

It also follows that if x_1, x_2, ..., x_n are i.i.d. Exp(λ) observations, then X̄ = (1/n)∑_{i=1}^n t(x_i) is the MVUE for 1/λ = E(X).

Definition 2.2.4

Consider data with PDF/prob. function f(x; θ). A statistic S(x) is called a sufficient statistic for θ if f(x|s; θ) does not depend on θ for all s.

Remarks
(1) We will see that sufficient statistics capture all of the information in the data x that is relevant to θ.

(2) If we consider vector-valued statistics, then this definition admits trivial examples, such as s = x, since

P(X = x | S = s) = 1 if x = s, and 0 otherwise,

which does not depend on θ.

Example
Suppose x_1, x_2, ..., x_n are i.i.d. Bernoulli(θ) and let s = ∑_{i=1}^n x_i. Then S is sufficient for θ.

Proof.

P(x) = ∏_{i=1}^n θ^{x_i}(1 − θ)^{1−x_i} = θ^{∑x_i}(1 − θ)^{∑(1−x_i)} = θ^s(1 − θ)^{n−s}.

Next observe that S ∼ B(n, θ)

⟹ P(s) = C(n, s) θ^s (1 − θ)^{n−s}

⟹ P(x|s) = P({X = x} ∩ {S = s}) / P(S = s) = P(X = x) / P(S = s)
= 1/C(n, s) if ∑_{i=1}^n x_i = s, and 0 otherwise.
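The defining property can be checked exactly for small n by enumerating all binary vectors: the conditional distribution of X given S = s comes out the same for every θ, namely uniform over the C(n, s) arrangements. A sketch in Python (n = 6 and the two θ values are arbitrary):

```python
from itertools import product
from math import comb

def cond_probs(theta, n):
    """P(X = x | S = s) for every binary x, from the joint Bernoulli pmf."""
    joint = {x: theta ** sum(x) * (1 - theta) ** (n - sum(x))
             for x in product((0, 1), repeat=n)}
    marg = {s: sum(p for x, p in joint.items() if sum(x) == s)
            for s in range(n + 1)}
    return {x: joint[x] / marg[sum(x)] for x in joint}

n = 6
a = cond_probs(0.2, n)   # conditional probabilities under theta = 0.2
b = cond_probs(0.7, n)   # ... and under theta = 0.7: identical
print(a[(1, 1, 0, 0, 0, 0)], 1 / comb(n, 2))
```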

Theorem 2.2.4 (The Factorization Theorem)
Suppose x_1, ..., x_n have joint PDF/prob function f(x; θ). Then S is a sufficient statistic for θ if and only if

f(x; θ) = g(s; θ)h(x)

for some functions g, h.

Proof.
Omitted.


Examples
(1) x_1, x_2, ..., x_n i.i.d. N(µ, σ²), σ² known. Then

f(x; µ) = ∏_{i=1}^n (1/√(2πσ²)) e^{−(x_i − µ)²/(2σ²)}
= (2πσ²)^{−n/2} exp{ −(1/(2σ²)) ∑_{i=1}^n (x_i − µ)² }
= (2πσ²)^{−n/2} exp{ −(1/(2σ²)) ( ∑_{i=1}^n x_i² − 2µ∑_{i=1}^n x_i + nµ² ) }
= (2πσ²)^{−n/2} exp{ (1/(2σ²)) ( 2µ∑_{i=1}^n x_i − nµ² ) } exp{ −(1/(2σ²)) ∑_{i=1}^n x_i² }
= exp{ (1/(2σ²)) ( 2µ∑_{i=1}^n x_i − nµ² ) } exp{ −(1/(2σ²)) ∑_{i=1}^n x_i² − (n/2) log(2πσ²) }

⟹ S = ∑_{i=1}^n x_i is sufficient for µ.

(2) If x_1, x_2, ..., x_n are i.i.d. with

f(x) = exp{A(θ)t(x) + B(θ) + h(x)},

then

f(x; θ) = exp{ A(θ)∑_{i=1}^n t(x_i) + nB(θ) } exp{ ∑_{i=1}^n h(x_i) }

⟹ S = ∑_{i=1}^n t(x_i) is sufficient for θ in the exponential family by the Factorization Theorem.

Theorem 2.2.5 (Rao-Blackwell Theorem)
If T is an unbiased estimator for θ and S is a sufficient statistic for θ, then the quantity T* = E(T|S) is also an unbiased estimator for θ with Var(T*) ≤ Var(T). Moreover, Var(T*) = Var(T) iff T* = T with probability 1.

Proof.
(I) Unbiasedness:

θ = E(T) = E_S{E(T|S)} = E(T*) ⟹ E(T*) = θ.

(II) Variance inequality:

Var(T) = E{Var(T|S)} + Var{E(T|S)}
= E{Var(T|S)} + Var(T*)
≥ Var(T*),

since E{Var(T|S)} ≥ 0. Observe also that Var(T) = Var(T*)

⟹ E{Var(T|S)} = 0
⟹ Var(T|S) = 0 with prob. 1
⟹ T = E(T|S) with prob. 1.

(III) T* is an estimator:
Since S is sufficient for θ,

T* = ∫_{−∞}^{∞} T(x) f(x|s) dx,

which does not depend on θ.

Remarks
(1) The Rao-Blackwell Theorem can occasionally be used to construct estimators.

(2) The theoretical importance of this result is the observation that T* always depends on x only through S. If T is already a MVUE, then T* = T ⟹ the MVUE depends on x only through S.

Example
Suppose x_1, ..., x_n are i.i.d. N(µ, σ²), σ² known. We want to estimate µ. Take T = x_1 as an unbiased estimator for µ; S = ∑_{i=1}^n x_i is a sufficient statistic for µ. According to Rao-Blackwell, T* = E(T|S) will be an improved (unbiased) estimator for µ.

If x = (x_1, ..., x_n)^T for X ∼ N_n(µ1, σ²I), then (T, S)^T = Ax, where A is the 2 × n matrix with rows (1, 0, ..., 0) and (1, 1, ..., 1). Hence,

(T, S)^T ∼ N_2( A(µ1), σ²AA^T ) = N_2( (µ, nµ)^T, [ σ², σ² ; σ², nσ² ] )

⟹ T | S = s ∼ N( µ + (σ²/(nσ²))(s − nµ), σ² − σ⁴/(nσ²) ) = N( s/n, (1 − 1/n)σ² )

∴ E(T|S) = (1/n)∑ x_i = x̄ = T*

is a better (or equal) estimator for µ. Finally, observe that Var(X̄) = σ²/n ≤ σ² = Var(X_1), with strict inequality for n > 1, σ² > 0.
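The improvement promised by Rao-Blackwell is easy to see by simulation: both T = x_1 and T* = x̄ are unbiased, but the conditioned estimator has far smaller variance. A sketch in Python (µ, σ, n and the rep count are arbitrary choices):

```python
import random
import statistics

rng = random.Random(3)
mu, sigma, n, reps = 5.0, 2.0, 10, 20_000
t_vals, tstar_vals = [], []
for _ in range(reps):
    x = [rng.gauss(mu, sigma) for _ in range(n)]
    t_vals.append(x[0])             # T = x_1: unbiased but noisy
    tstar_vals.append(sum(x) / n)   # T* = E(T | S) = x̄
print(statistics.variance(t_vals), statistics.variance(tstar_vals))
```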

Remarks
(1) We have given only an incomplete outline of the theory.

(2) There are also the concepts of minimal sufficient and complete statistics to be considered.

2.3 Methods Of Estimation

2.3.1 Method Of Moments

Consider a random sample X_1, X_2, ..., X_n from F(θ) and let µ = µ(θ) = E(X). The method of moments estimator θ̃ is defined as the solution to the equation

X̄ = µ(θ̃).

Example
X_1, ..., X_n ∼ i.i.d. Exp(λ) ⟹ E(X_i) = 1/λ; the method of moments estimator is defined as the solution to the equation

X̄ = 1/λ̃ ⟹ λ̃ = 1/X̄.

Remark
The method of moments is appealing: (1) it is simple; (2) its rationale is that X̄ is BLUE for µ.

If θ = (θ_1, θ_2, ..., θ_p), the method of moments can be adapted as follows:

(1) let µ_k = µ_k(θ) = E(X^k), k = 1, ..., p;
(2) let m_k = (1/n)∑_{i=1}^n x_i^k.

The MoM estimator θ̃ is defined to be the solution to the system of equations

m_1 = µ_1(θ̃), m_2 = µ_2(θ̃), ..., m_p = µ_p(θ̃).

Example
Suppose X_1, X_2, ..., X_n are i.i.d. N(µ, σ²) and let θ = (µ, σ²). Then p = 2 ⟹ two equations in two unknowns:

µ_1(θ) = E(X) = µ,   µ_2(θ) = E(X²) = σ² + µ²;
m_1 = (1/n)∑_{i=1}^n x_i,   m_2 = (1/n)∑_{i=1}^n x_i².

∴ we need to solve for µ, σ²:

µ̃ = x̄
σ̃² + µ̃² = (1/n)∑ x_i² ⟹ σ̃² = (1/n)∑ x_i² − x̄² = (1/n)∑ (x_i − x̄)² = ((n − 1)/n) s².
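The algebra above can be confirmed numerically: the MoM equations give σ̃² = m_2 − m_1², which coincides with (1/n)∑(x_i − x̄)² = ((n − 1)/n)s². A sketch in Python on arbitrary simulated data:

```python
import random
import statistics

rng = random.Random(4)
x = [rng.gauss(10.0, 3.0) for _ in range(200)]
n = len(x)
m1 = sum(x) / n
m2 = sum(v * v for v in x) / n
mu_mom = m1                  # solves m1 = mu
sigma2_mom = m2 - m1 ** 2    # solves m2 = sigma^2 + mu^2
print(sigma2_mom, (n - 1) / n * statistics.variance(x))
```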

Remark
Method of moments estimators can be seen to have good statistical properties. Under fairly mild regularity conditions, the MoM estimator is

(1) consistent, i.e., θ̃ → θ in probability as n → ∞;
(2) asymptotically normal, i.e.,

(θ̃ − θ)/√Var(θ̃) →_D N(0, 1) as n → ∞.

However, the MoM estimator has one serious defect. Suppose X_1, ..., X_n is a random sample from F_θ and let θ̃_X be the MoM estimator. Let Y = h(X) for some invertible function h; then Y_1, ..., Y_n should contain the same information as X_1, ..., X_n, so we would like θ̃_Y ≡ θ̃_X. Unfortunately this does not hold.

Example
X_1, ..., X_n i.i.d. Exp(λ). We saw previously that λ̃_X = 1/X̄. Suppose Y_i = X_i² (which is invertible for X_i > 0). To obtain λ̃_Y, observe E(Y) = E(X²) = 2/λ²

⟹ λ̃_Y = √( 2n / ∑X_i² ) ≠ 1/X̄.

2.3.2 Maximum Likelihood Estimation

Consider a statistical problem with log-likelihood function l(θ; x).

Definition. The maximum likelihood estimate θ̂ is the solution to the problem max_{θ∈Θ} l(θ; x), i.e., θ̂ = arg max_{θ∈Θ} l(θ; x).

Remark
In practice, maximum likelihood estimates are obtained by solving the score equation

U(θ; x) = ∂l/∂θ = 0.

Example
If X_1, X_2, ..., X_n are i.i.d. geometric(θ) RVs, find θ̂.

Solution:

l(θ; x) = log ∏_{i=1}^n θ(1 − θ)^{x_i − 1}
= n log θ + ∑_{i=1}^n (x_i − 1) log(1 − θ)
= n{ log θ + (x̄ − 1) log(1 − θ) }

∴ U(θ; x) = ∂l/∂θ = n{ 1/θ − (x̄ − 1)/(1 − θ) } = n (1 − θx̄)/(θ(1 − θ)).

To find θ̂, we solve for θ in 1 − θx̄ = 0

⟹ θ̂ = 1/x̄ [ = n / ∑_{i=1}^n x_i = # of successes / # of trials ].
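The closed form θ̂ = 1/x̄ can be cross-checked by maximizing the log-likelihood numerically. A sketch in Python (pure grid search; the true θ = 0.25, the sample size and the grid resolution are arbitrary choices):

```python
import random
from math import log

rng = random.Random(5)
theta_true = 0.25
# geometric on {1, 2, ...}: number of trials up to and including the first success
x = []
for _ in range(500):
    k = 1
    while rng.random() >= theta_true:
        k += 1
    x.append(k)
n, xbar = len(x), sum(x) / len(x)

def loglik(theta):
    # l(theta; x) = n{ log(theta) + (xbar - 1) log(1 - theta) }
    return n * (log(theta) + (xbar - 1) * log(1 - theta))

grid = [i / 100_000 for i in range(1, 100_000)]
theta_hat = max(grid, key=loglik)   # numerical maximizer
print(theta_hat, 1 / xbar)
```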

Elementary Properties

(1) MLEs are invariant under invertible transformations of the data.

Proof. (Continuous i.i.d. RVs, strictly increasing/decreasing transformation)
Suppose X has PDF f(x; θ) and Y = h(X), where h is strictly monotonic;

⟹ f_Y(y; θ) = f_X(h^{−1}(y); θ)|h^{−1}(y)′| when y = h(x).

Consider data x_1, x_2, ..., x_n and the transformed version y_1, y_2, ..., y_n:

l_Y(θ; y) = log ∏_{i=1}^n f_Y(y_i; θ)
= log ∏_{i=1}^n f_X(h^{−1}(y_i); θ)|h^{−1}(y_i)′|
= log ∏_{i=1}^n f_X(x_i; θ) + log ∏_{i=1}^n |h^{−1}(y_i)′|
= l_X(θ; x) + log( ∏_{i=1}^n |h^{−1}(y_i)′| );

since log( ∏_{i=1}^n |h^{−1}(y_i)′| ) does not depend on θ, it follows that θ̂ maximizes l_X(θ; x) iff it maximizes l_Y(θ; y).

(2) If φ = φ(θ) is a 1-1 transformation of θ, then MLEs obey the transformation rule φ̂ = φ(θ̂).

Proof. It can be checked that if l_φ(φ; x) is the log-likelihood with respect to φ, then

l_θ(θ; x) = l_φ(φ(θ); x).

It follows that θ̂ maximizes l_θ(θ; x) iff φ̂ = φ(θ̂) maximizes l_φ(φ; x).


(3) If t(x) is a sufficient statistic for θ, then θ̂ depends on the data only as a function of t(x).

Proof. By the Factorization Theorem, t(x) is sufficient for θ iff

f(x; θ) = g(t(x); θ)h(x)
⟹ l(θ; x) = log g(t(x); θ) + log h(x)
⟹ θ̂ maximizes l(θ; x) ⇔ θ̂ maximizes log g(t(x); θ)
⟹ θ̂ is a function of t(x).

Example
Suppose X_1, X_2, ..., X_n are i.i.d. Exp(λ); then

f_X(x; λ) = ∏_{i=1}^n λe^{−λx_i} = λ^n e^{−λ∑_{i=1}^n x_i} = λ^n e^{−λnx̄}.

By the Factorization Theorem, x̄ is sufficient for λ. To get the MLE,

U(λ; x) = ∂l/∂λ = (∂/∂λ)(n log λ − nλx̄) = n/λ − nx̄;

∂l/∂λ = 0 ⟹ 1/λ = x̄ ⟹ λ̂ = 1/x̄.

Note: as proved, λ̂ is a function of the sufficient statistic x̄.

Let Y_1, Y_2, ..., Y_n be defined by Y_i = log X_i.

If X ∼ Exp(λ) and Y = log X, then we can find f_Y(y) by taking h(x) = log x and using

f_Y(y) = f_X(h^{−1}(y))|h^{−1}(y)′|   [h^{−1}(y) = e^y, h^{−1}(y)′ = e^y]
= λe^{−λe^y} e^y

⟹ l_Y(λ; y) = log ∏_{i=1}^n λe^{−λe^{y_i}} e^{y_i}
= log{ λ^n e^{−λ∑_{i=1}^n e^{y_i}} e^{∑_{i=1}^n y_i} }
= n log λ − λ∑_{i=1}^n e^{y_i} + ∑_{i=1}^n y_i

⟹ (∂/∂λ) l_Y(λ; y) = n/λ − ∑_{i=1}^n e^{y_i}

⟹ λ̂ = n / ∑_{i=1}^n e^{y_i} = n / ∑_{i=1}^n e^{log x_i} = n / ∑_{i=1}^n x_i = 1/x̄.

Finally, suppose we take θ = log λ ⟹ λ = e^θ

⟹ f(x; θ) = e^θ e^{−e^θ x}   (= λe^{−λx})

⟹ l_θ(θ; x) = log ∏_{i=1}^n e^θ e^{−e^θ x_i} = log{ e^{nθ} e^{−e^θ nx̄} } = nθ − e^θ nx̄.

∂l/∂θ = n − nx̄e^θ

∴ ∂l/∂θ = 0 ⟹ 1 = x̄e^θ ⟹ e^θ = 1/x̄ ⟹ θ̂ = −log x̄.

But λ̂ = 1/x̄ ⟹ log λ̂ = log(1/x̄) = −log x̄ = θ̂, as required.
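The invariance under the data transformation y = log x can also be confirmed numerically: maximizing l_Y over a grid recovers 1/x̄. A sketch in Python (λ = 2, n = 400 and the grid are arbitrary choices; Exp(λ) draws use inversion):

```python
import random
from math import exp, log

rng = random.Random(6)
lam_true = 2.0
x = [-log(rng.random()) / lam_true for _ in range(400)]   # Exp(lam) by inversion
y = [log(v) for v in x]
n = len(x)
sum_ey = sum(exp(v) for v in y)   # equals sum(x)
sum_y = sum(y)

def l_y(lam):
    # l_Y(lam; y) = n log(lam) - lam * sum(e^{y_i}) + sum(y_i)
    return n * log(lam) - lam * sum_ey + sum_y

grid = [i / 10_000 for i in range(1, 80_000)]   # lam in (0, 8)
lam_hat_y = max(grid, key=l_y)
lam_hat_x = 1 / (sum(x) / n)   # MLE from the original data
print(lam_hat_y, lam_hat_x)
```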

Remark (Not examinable)
Maximum likelihood estimation and the method of moments can both be generated by the use of estimating functions. An estimating function is a function, H(x; θ), with the property E{H(x; θ)} = 0. H can be used to define an estimator θ̃ which is a solution to the equation

H(x; θ) = 0.

For method of moments estimates, we can take

H(x; θ) = x̄ − E(X)   [x̄ − E(X̄) if not the i.i.d. case].

For maximum likelihood, we can use

H(x; θ) = U(θ; x) = ∂l/∂θ.

(To calculate the MLE: (1) find the log-likelihood, then (2) calculate the score and set it equal to 0.) Recall we showed previously that E{U(θ; x)} = 0.

Asymptotic Properties of MLEs:
Suppose X_1, X_2, X_3, ... is a sequence of i.i.d. RVs with common PDF (or probability function) f(x; θ). Assume:

(1) θ_0 is the true value of the parameter θ;
(2) f is such that f(x; θ_1) = f(x; θ_2) for all x ⟹ θ_1 = θ_2.

We will show (in outline) that if θ̂_n is the MLE (maximum likelihood estimator) based on X_1, X_2, ..., X_n, then:

(1) θ̂_n → θ_0 in probability, i.e., for each ε > 0,

lim_{n→∞} P(|θ̂_n − θ_0| > ε) = 0;

(2) √(ni(θ_0)) (θ̂_n − θ_0) →_D N(0, 1) as n → ∞ (Asymptotic Normality).

Remark
The practical use of asymptotic normality is that, for large n, θ̂ is approximately N(θ_0, 1/(ni(θ_0))) distributed.
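Asymptotic normality can be visualized by simulation. For the Exp(λ) model, i(λ) = 1/λ², so √(ni(λ_0))(λ̂_n − λ_0) = √n (λ̂_n − λ_0)/λ_0 should be approximately standard normal. A sketch in Python (λ_0 = 1.5, n = 500 and the rep count are arbitrary choices):

```python
import random
import statistics
from math import log, sqrt

rng = random.Random(7)
lam0, n, reps = 1.5, 500, 4000
z = []
for _ in range(reps):
    x = [-log(rng.random()) / lam0 for _ in range(n)]   # Exp(lam0) sample
    lam_hat = n / sum(x)                                # MLE, 1/x̄
    z.append(sqrt(n) * (lam_hat - lam0) / lam0)         # sqrt(n i(lam0)) (lam_hat - lam0)
cover = sum(abs(v) < 1.96 for v in z) / reps
print(statistics.mean(z), statistics.stdev(z), cover)
```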

Outline of consistency and asymptotic normality for MLEs:
Consider i.i.d. data X_1, X_2, ... with common PDF/prob. function f(x; θ). If θ̂_n is the MLE based on X_1, X_2, ..., X_n, we will show (in outline) that θ̂_n → θ_0 in probability as n → ∞, where θ_0 is the true value of θ.

Lemma 2.3.1
Suppose f is such that f(x; θ_1) = f(x; θ_2) for all x ⟹ θ_1 = θ_2. Then l*(θ) = E{log f(X; θ)} is maximized uniquely by θ = θ_0.

Proof.

l*(θ) − l*(θ_0) = ∫_{−∞}^{∞} (log f(x; θ)) f(x; θ_0) dx − ∫_{−∞}^{∞} (log f(x; θ_0)) f(x; θ_0) dx
= ∫_{−∞}^{∞} log( f(x; θ)/f(x; θ_0) ) f(x; θ_0) dx
≤ ∫_{−∞}^{∞} ( f(x; θ)/f(x; θ_0) − 1 ) f(x; θ_0) dx   [since log u ≤ u − 1]
= ∫_{−∞}^{∞} ( f(x; θ) − f(x; θ_0) ) dx
= ∫_{−∞}^{∞} f(x; θ) dx − ∫_{−∞}^{∞} f(x; θ_0) dx = 1 − 1 = 0.

Moreover, equality is achieved iff

f(x; θ)/f(x; θ_0) = 1 for all x
⟹ f(x; θ) = f(x; θ_0) for all x
⟹ θ = θ_0.

Lemma 2.3.2
Let l̄_n(θ; x) = (1/n) l(θ; x_1, ..., x_n), i.e., (1/n) × the log-likelihood based on x_1, x_2, ..., x_n. Then for each θ, l̄_n(θ; x) → l*(θ) in probability as n → ∞.

Proof. Observe that

l̄_n(θ; x) = (1/n) log ∏_{i=1}^n f(x_i; θ) = (1/n)∑_{i=1}^n log f(x_i; θ) = (1/n)∑_{i=1}^n L_i(θ),

where L_i(θ) = log f(x_i; θ). Since the L_i(θ) are i.i.d. with E(L_i(θ)) = l*(θ), it follows by the weak law of large numbers that l̄_n(θ; x) → l*(θ) in probability as n → ∞.

To summarise, we have proved that:

(1) l̄_n(θ; x) → l*(θ);
(2) l*(θ) is maximized when θ = θ_0.

Since θ̂_n maximizes l̄_n(θ; x), it follows that θ̂_n → θ_0 in probability.

Theorem 2.3.1
Under the above assumptions, let U_n(θ; x) denote the score based on X_1, ..., X_n. Then

U_n(θ_0; x)/√(ni(θ_0)) →_D N(0, 1).

Proof.

U_n(θ_0; x) = (∂/∂θ) log ∏_{i=1}^n f(x_i; θ)|_{θ=θ_0}
= ∑_{i=1}^n (∂/∂θ) log f(x_i; θ)|_{θ=θ_0}
= ∑_{i=1}^n U_i, where U_i = (∂/∂θ) log f(x_i; θ)|_{θ=θ_0}.

Since U_1, U_2, ... are i.i.d. with E(U_i) = 0 and Var(U_i) = i(θ_0), by the Central Limit Theorem,

( ∑_{i=1}^n U_i − nE(U) ) / √(n Var(U)) = U_n(θ_0; x)/√(ni(θ_0)) →_D N(0, 1).

Theorem 2.3.2
Under the above assumptions,

√(ni(θ_0)) (θ̂_n − θ_0) →_D N(0, 1).

Proof. Consider the first order Taylor expansion of U_n(θ̂_n; x) about θ_0:

U_n(θ̂_n; x) ≈ U_n(θ_0; x) + U_n′(θ_0; x)(θ̂_n − θ_0).

Since U_n(θ̂_n; x) = 0, for large n

U_n(θ_0; x) ≈ −U_n′(θ_0; x)(θ̂_n − θ_0),

and since U_n(θ_0; x)/√(ni(θ_0)) →_D N(0, 1), we get

−U_n′(θ_0; x)(θ̂_n − θ_0)/√(ni(θ_0)) →_D N(0, 1).

Now observe that

−U_n′(θ_0; x)/n → i(θ_0) as n → ∞

by the weak law of large numbers, since

−U_n′(θ_0; x)/n = (1/n)∑_{i=1}^n ( −(∂²/∂θ²) log f(x_i; θ)|_{θ=θ_0} )

⟹ −U_n′(θ_0; x)/(ni(θ_0)) → 1 in probability.

Hence,

( −U_n′(θ_0; x)/√(ni(θ_0)) ) (θ̂_n − θ_0) ≈ √(ni(θ_0)) (θ̂_n − θ_0) for large n

⟹ √(ni(θ_0)) (θ̂_n − θ_0) →_D N(0, 1).

Remark
The preceding theory can be generalized to include vector-valued parameters. We will not discuss the details.

2.4 Hypothesis Tests and Confidence Intervals

Motivating Example:
Suppose X_1, X_2, ..., X_n are i.i.d. N(µ, σ²), and consider H_0: µ = µ_0 vs. H_a: µ ≠ µ_0. If σ² is known, then the test of H_0 with significance level α is defined by the test statistic

Z = (X̄ − µ_0)/(σ/√n),

and the rule: reject H_0 if |z| ≥ z(α/2). A 100(1 − α)% CI for µ is given by

( X̄ − z(α/2)σ/√n, X̄ + z(α/2)σ/√n ).

It is easy to check that the confidence interval contains exactly those values of µ_0 that are acceptable null hypotheses.
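The duality between the z-test and the confidence interval can be demonstrated directly. A sketch in Python (the data, µ, σ, n and α = 0.05 are arbitrary choices; 1.95996... is z(0.025)):

```python
import random
from math import sqrt

rng = random.Random(8)
mu, sigma, n = 10.0, 2.0, 25
x = [rng.gauss(mu, sigma) for _ in range(n)]
xbar = sum(x) / n
z_crit = 1.959963984540054            # z(alpha/2) for alpha = 0.05
half = z_crit * sigma / sqrt(n)
lo, hi = xbar - half, xbar + half     # the 95% CI for mu

def reject(mu0):
    """Two-sided z-test of H0: mu = mu0 at level alpha = 0.05."""
    z = (xbar - mu0) / (sigma / sqrt(n))
    return abs(z) >= z_crit

# mu0 lies inside the CI exactly when H0: mu = mu0 is accepted
print([(round(m, 2), reject(m)) for m in (lo - 0.01, xbar, hi + 0.01)])
```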

In general, consider a statistical problem with parameter

    θ ∈ Θ_0 ∪ Θ_A, where Θ_0 ∩ Θ_A = φ.

We consider the null hypothesis

    H_0: θ ∈ Θ_0




and the alternative hypothesis

    H_A: θ ∈ Θ_A.

The hypothesis-testing setup can be represented as:

                         Actual Status
    Test Result      H_0 true             H_A true
    Accept H_0       correct (√)          type II error (β)
    Reject H_0       type I error (α)     correct (√)

We would like both the type I and type II error rates to be as small as possible. However, these goals conflict with each other: to reduce the type I error rate we need to "make it harder to reject H_0", while to reduce the type II error rate we need to "make it easier to reject H_0".

The standard (Neyman-Pearson) approach to hypothesis testing is to control the type I error rate at a "small" value α and then use a test that makes the type II error rate as small as possible.
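The tradeoff is visible in the one-sided z-test: raising the rejection cutoff c lowers α but raises β. A sketch with made-up values (µ_0 = 0, µ_a = 1, σ = 1, n = 9):

```python
import math

def Phi(x):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

n, mu0, mua, sigma = 9, 0.0, 1.0, 1.0
se = sigma / math.sqrt(n)

alphas, betas = [], []
for c in [0.3, 0.5, 0.7]:                     # rule: reject H0 when xbar >= c
    alphas.append(1.0 - Phi((c - mu0) / se))  # type I error rate
    betas.append(Phi((c - mua) / se))         # type II error rate

print([round(a, 3) for a in alphas])  # decreasing in c
print([round(b, 3) for b in betas])   # increasing in c
```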

The equivalence between confidence intervals and hypothesis tests can be formulated as follows. Recall that a 100(1 − α)% CI for θ is a random interval (L, U) with the property

    P((L, U) ∋ θ) = 1 − α.

Here

    α = P(reject H_0 | H_0 true),
    β = P(retain H_0 | H_A true),
    1 − β = power = P(reject H_0 | H_A true), which we want to be large.

It is easy to check that the test defined by the rule

    "Accept H_0: θ = θ_0 iff θ_0 ∈ (L, U)"

has significance level α. Conversely, given a hypothesis test of H_0: θ = θ_0 with significance level α, it can be proved that the set {θ_0 : H_0: θ = θ_0 is accepted} is a 100(1 − α)% confidence region for θ.

Large Sample Tests and Confidence Intervals

Consider a statistical problem with data x_1, . . . , x_n, log-likelihood l(θ; x), score U(θ; x) and information i(θ). Consider also a hypothesis H_0: θ = θ_0. The following three tests are often considered:




(1) The Wald Test:

    Test statistic:  W = √(n i(θ̂)) (θ̂ − θ_0)
    Critical region: reject for |W| ≥ z(α/2)

(2) The Score Test:

    Test statistic:  V = U(θ_0; x)/√(n i(θ_0))
    Critical region: reject for |V| ≥ z(α/2)

(3) The Likelihood-Ratio Test:

    Test statistic:  G² = 2(l(θ̂) − l(θ_0))
    Critical region: reject for G² ≥ χ²_{1,α}

Example:

Suppose X_1, X_2, . . . , X_n are i.i.d. Po(λ), and consider H_0: λ = λ_0.

    l(λ; x) = log ∏_{i=1}^n e^{−λ} λ^{x_i}/x_i!
            = n(x̄ log λ − λ) − log ∏_{i=1}^n x_i!

    U(λ; x) = ∂l/∂λ = n(x̄/λ − 1) = n(x̄ − λ)/λ   ⟹ λ̂ = x̄

    I(λ) = n i(λ) = E(−∂²l/∂λ²) = E(n X̄/λ²) = n/λ




    ⟹ W = √(n i(λ̂)) (λ̂ − λ_0) = (λ̂ − λ_0)/√(λ̂/n)

    V = U(λ_0; x)/√(n i(λ_0)) = (n(x̄ − λ_0)/λ_0)/√(n/λ_0) = (x̄ − λ_0)/√(λ_0/n)

    G² = 2(l(λ̂) − l(λ_0))
       = 2n(x̄ log λ̂ − λ̂) − 2n(x̄ log λ_0 − λ_0)
       = 2n( x̄ log(λ̂/λ_0) − (λ̂ − λ_0) ).
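For a concrete sample the three statistics are easy to compute side by side. The counts below are made up; with λ_0 = 4 and a sample mean close to λ_0, all three statistics are small, and G² is comparable to W² and V²:

```python
import math

# made-up Poisson counts; test H0: lambda = lambda0
xs = [3, 5, 4, 2, 6, 3, 4, 5, 2, 4, 3, 5]
lam0 = 4.0
n = len(xs)
xbar = sum(xs) / n
lamhat = xbar  # MLE

W = (lamhat - lam0) / math.sqrt(lamhat / n)                      # Wald
V = (xbar - lam0) / math.sqrt(lam0 / n)                          # score
G2 = 2 * n * (xbar * math.log(lamhat / lam0) - (lamhat - lam0))  # LR

print(round(W, 3), round(V, 3), round(G2, 3))  # all small: no evidence against H0
```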

Remarks

(1) It can be proved that the tests based on W, V, G² are asymptotically equivalent under H_0.

(2) As a by-product, it follows that the null distribution of G² is asymptotically χ²_1. (Recall from Theorems 2.3.1 and 2.3.2 that the null distributions of W and V are both asymptotically N(0, 1).)

(3) To understand the motivation for the three tests, it is useful to consider their relation to the log-likelihood function. See Figure 20.

We have introduced the Wald test, score test and likelihood-ratio test for H_0: θ = θ_0 vs. H_A: θ = θ_a. These are large-sample tests in that asymptotic distributions of the test statistics under H_0 are available. It can also be proved that the LR statistic and the score statistic are invariant under transformations of the parameter, but the Wald statistic is not.
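The invariance claim can be illustrated in the Poisson example by reparameterizing as φ = log λ. The likelihood, and hence G², is unchanged, but the Wald statistic changes because the information transforms: by the change-of-variable rule, i(φ) = i(λ)(dλ/dφ)² = (1/λ)λ² = λ. A sketch with made-up numbers:

```python
import math

# made-up Poisson summary: n = 12 observations with mean 46/12
n, xbar, lam0 = 12, 46 / 12, 4.0
lamhat = xbar

# Wald statistic on the lambda scale: i(lam) = 1/lam
W_lam = (lamhat - lam0) / math.sqrt(lamhat / n)

# Wald statistic on the phi = log(lambda) scale:
# i(phi) = i(lam) * (dlam/dphi)^2 = (1/lam) * lam^2 = lam
W_phi = math.sqrt(n * lamhat) * (math.log(lamhat) - math.log(lam0))

# the LR statistic is identical on either scale, since l depends only on the model
G2 = 2 * n * (xbar * math.log(lamhat / lam0) - (lamhat - lam0))

print(round(W_lam, 4), round(W_phi, 4), round(G2, 4))
```

The two Wald statistics differ (slightly, for a sample this close to H_0), while G² is the same number whichever parameterization is used.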

Each of these three tests can be inverted to give a confidence interval (region) for θ:

Wald Test




Figure 20: Relationships of the tests to the log-likelihood function

Solve for θ_0 in W² ≤ z(α/2)². Recall W = √(n i(θ̂)) (θ̂ − θ_0), so

    θ̂ − z(α/2)/√(n i(θ̂)) ≤ θ_0 ≤ θ̂ + z(α/2)/√(n i(θ̂)),

i.e., θ̂ ± z(α/2)/√(n i(θ̂)).

Score Test

Need to solve for θ_0 in

    V² = ( U(θ_0; x)/√(n i(θ_0)) )² ≤ z(α/2)².

LR Test

Solve for θ_0 in

    2(l(θ̂; x) − l(θ_0; x)) ≤ χ²_{1,α} = z(α/2)².

Example:

X_1, . . . , X_n i.i.d. Po(λ).




Recall that the Wald statistic is W = (λ̂ − λ_0)/√(λ̂/n), which gives the interval

    λ̂ ± z(α/2)√(λ̂/n)   ⟺   x̄ ± z(α/2)√(x̄/n).

Score test:

Recall that the test statistic is V = (x̄ − λ_0)/√(λ_0/n). Hence, to find a confidence interval, we need to solve for λ_0 in

    V² ≤ z(α/2)²

    ⟹ (x̄ − λ_0)²/(λ_0/n) ≤ z(α/2)²

    ⟹ (x̄ − λ_0)² ≤ λ_0 z(α/2)²/n

    ⟹ λ_0² − (2x̄ + z(α/2)²/n) λ_0 + x̄² ≤ 0,

which is a quadratic in λ_0; its roots, and hence the interval endpoints, are

    (−b ± √(b² − 4ac))/(2a)

with a = 1, b = −(2x̄ + z(α/2)²/n) and c = x̄².

For the LR test, we have

    G² = 2n{ x̄ log(x̄/λ_0) − (x̄ − λ_0) },

and we can solve numerically for λ_0 in G² ≤ z(α/2)². See Figure 21.
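All three intervals can be computed for the same made-up Poisson summary (x̄ = 46/12, n = 12, α = 0.05): the Wald interval in closed form, the score interval from the quadratic above, and the LR interval by bisection on each side of λ̂.

```python
import math

n, xbar = 12, 46 / 12          # made-up summary statistics
z = 1.959964                   # z(alpha/2) for alpha = 0.05
lamhat = xbar

# Wald interval: xbar +- z(alpha/2) * sqrt(xbar/n)
h = z * math.sqrt(xbar / n)
wald = (xbar - h, xbar + h)

# score interval: roots of lam^2 - (2*xbar + z^2/n)*lam + xbar^2 = 0
b = -(2 * xbar + z * z / n)
disc = math.sqrt(b * b - 4 * xbar * xbar)
score = ((-b - disc) / 2, (-b + disc) / 2)

# LR interval: solve G2(lam0) = z^2 numerically on each side of lamhat
def G2(lam0):
    return 2 * n * (xbar * math.log(lamhat / lam0) - (lamhat - lam0))

def solve(lo, hi):
    # bisection for G2(lam0) = z^2; G2 is monotone on each side of lamhat
    for _ in range(80):
        mid = (lo + hi) / 2
        if (G2(mid) - z * z) * (G2(lo) - z * z) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

lr = (solve(1e-9, lamhat), solve(lamhat, 100.0))

for name, (a, c) in [("Wald", wald), ("score", score), ("LR", lr)]:
    print(name, round(a, 3), round(c, 3))
```

Note that the three intervals are close but not identical for this sample size, as the asymptotic equivalence of the tests would suggest.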

Optimal Tests

Consider simple null and alternative hypotheses:

    H_0: θ = θ_0 and H_a: θ = θ_a.

Recall that the type I error rate is defined by

    α = P(reject H_0 | H_0 true)

and the type II error rate by

    β = P(accept H_0 | H_0 false).

The power is defined by 1 − β = P(reject H_0 | H_0 false), so we want the power to be high.




Figure 21: The 3 tests

Theorem 2.4.1 (Neyman-Pearson Lemma)

Consider the test of H_0: θ = θ_0 vs. H_a: θ = θ_a defined by the rule

    reject H_0 when f(x; θ_0)/f(x; θ_a) ≤ k,

for some constant k. Let α* be the type I error rate and 1 − β* the power of this test. Then any other test with α ≤ α* will have (1 − β) ≤ (1 − β*).

Proof. Let C be the critical region (rejection region, RR) for the LR test and let D be the critical region for any other test.




Let C_1 = C ∩ D, and let C_2, D_2 be such that

    C = C_1 ∪ C_2, C_1 ∩ C_2 = φ,
    D = C_1 ∪ D_2, C_1 ∩ D_2 = φ.

Figure 22: Neyman-Pearson Lemma




See Figure 22 and observe that α ≤ α*

    ⟹ ∫_D f(x; θ_0) dx_1 … dx_n ≤ ∫_C f(x; θ_0) dx_1 … dx_n

    ⟹ ∫_{D_2} f(x; θ_0) dx_1 … dx_n ≤ ∫_{C_2} f(x; θ_0) dx_1 … dx_n,

after cancelling the common contribution of C_1 = C ∩ D from both sides. Then

    (1 − β*) − (1 − β) = ∫_C f(x; θ_a) dx_1 … dx_n − ∫_D f(x; θ_a) dx_1 … dx_n

        = ∫_{C_1} f(x; θ_a) + ∫_{C_2} f(x; θ_a) − ∫_{C_1} f(x; θ_a) − ∫_{D_2} f(x; θ_a)

        = ∫_{C_2} f(x; θ_a) dx_1 … dx_n − ∫_{D_2} f(x; θ_a) dx_1 … dx_n

        ≥ (1/k) ∫_{C_2} f(x; θ_0) dx_1 … dx_n − (1/k) ∫_{D_2} f(x; θ_0) dx_1 … dx_n

        ≥ 0, as required.

The second-to-last step uses f(x; θ_a) ≥ f(x; θ_0)/k on C_2 ⊆ C and f(x; θ_a) ≤ f(x; θ_0)/k on D_2 (which lies outside C); the last step uses the inequality obtained above. See Figure 23.

Moreover, equality is achieved only if D_2 is empty (= φ).
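The content of the lemma can be seen in a small simulation (all settings made up): for N(µ, 1) data with n = 10 and α = 0.05, the Neyman-Pearson test based on x̄ is compared with a valid but wasteful test that looks only at X_1. Both have the same type I error rate, but the NP test has much higher power at µ_a = 1.

```python
import math
import random

rng = random.Random(42)
n, mu0, mua, z_alpha = 10, 0.0, 1.0, 1.644854  # z(0.05)

reps = 20000
reject_np = reject_crude = 0
for _ in range(reps):
    xs = [rng.gauss(mua, 1.0) for _ in range(n)]   # data under H_a
    xbar = sum(xs) / n
    if xbar >= mu0 + z_alpha / math.sqrt(n):       # NP (LR) test
        reject_np += 1
    if xs[0] >= mu0 + z_alpha:                     # same alpha, but ignores x_2..x_n
        reject_crude += 1

power_np = reject_np / reps
power_crude = reject_crude / reps
print(round(power_np, 3), round(power_crude, 3))   # NP power is much higher
```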

Example

Suppose X_1, X_2, . . . , X_n are i.i.d. N(µ, σ²), σ² given, and consider

    H_0: µ = µ_0 vs. H_a: µ = µ_a, with µ_a > µ_0.




Figure 23: Neyman-Pearson Lemma

Then

    f(x; µ_0)/f(x; µ_a)

      = [ ∏_{i=1}^n (1/(σ√(2π))) exp{ −(x_i − µ_0)²/(2σ²) } ] / [ ∏_{i=1}^n (1/(σ√(2π))) exp{ −(x_i − µ_a)²/(2σ²) } ]

      = exp{ −(Σ x_i² − 2n x̄ µ_0 + n µ_0²)/(2σ²) } / exp{ −(Σ x_i² − 2n x̄ µ_a + n µ_a²)/(2σ²) }

      = exp{ (2n x̄(µ_0 − µ_a) − n µ_0² + n µ_a²)/(2σ²) }.

For a constant k,

    f(x; µ_0)/f(x; µ_a) ≤ k
      ⟺ (µ_0 − µ_a) x̄ ≤ k*
      ⟺ x̄ ≥ c,

for a suitably chosen c (we reject when x̄ is too big, since µ_0 − µ_a < 0).




To choose c, we use the fact that

    Z = (X̄ − µ_0)/(σ/√n) ∼ N(0, 1) under H_0.

Hence the usual one-sided z-test,

    reject H_0 if z ≥ z(α),

is the Neyman-Pearson LR test in this case.

Remarks

(1) This result shows that the one-sided z-test is also uniformly most powerful for

    H_0: µ = µ_0 vs. H_A: µ > µ_0,

since the same test is obtained for every alternative µ_a > µ_0.

(2) This can be extended to the case of

    H_0: µ ≤ µ_0 vs. H_A: µ > µ_0.

In this case we take

    α = max over H_0 of P(reject H_0 | µ) = P(reject H_0 | µ = µ_0).

(3) This construction fails when we consider two-sided alternatives, i.e., H_0: µ = µ_0 vs. H_A: µ ≠ µ_0; no uniformly most powerful test exists in that case.
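For the one-sided test in Remark (1), the power has a closed form: 1 − β = Φ(√n(µ_a − µ_0)/σ − z(α)), since under µ = µ_a the statistic Z is N(√n(µ_a − µ_0)/σ, 1). A sketch with made-up values:

```python
import math

def Phi(x):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

n, mu0, mua, sigma, z_alpha = 10, 0.0, 1.0, 1.0, 1.644854

# power of the test "reject H0 when xbar >= mu0 + z(alpha)*sigma/sqrt(n)"
power = Phi(math.sqrt(n) * (mua - mu0) / sigma - z_alpha)
print(round(power, 3))
```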
