PDF of Lecture Notes - School of Mathematical Sciences
MATHEMATICAL STATISTICS III

Lecturer: Associate Professor Patty Solomon

SCHOOL OF MATHEMATICAL SCIENCES
Contents

1 Distribution Theory
  1.1 Discrete distributions
    1.1.1 Bernoulli distribution
    1.1.2 Binomial distribution
    1.1.3 Geometric distribution
    1.1.4 Negative Binomial distribution
    1.1.5 Poisson distribution
    1.1.6 Hypergeometric distribution
  1.2 Continuous Distributions
    1.2.1 Uniform Distribution
    1.2.2 Exponential Distribution
    1.2.3 Gamma distribution
    1.2.4 Beta density function
    1.2.5 Normal distribution
    1.2.6 Cauchy distribution
  1.3 Transformations of a single RV
  1.4 CDF transformation
  1.5 Non-monotonic transformations
  1.6 Moments of transformed RVs
  1.7 Multivariate distributions
    1.7.1 Trinomial distribution
    1.7.2 Multinomial distribution
    1.7.3 Marginal and conditional distributions
    1.7.4 Continuous multivariate distributions
  1.8 Transformations of several RVs
    1.8.1 Method of regular transformations
  1.9 Moments
    1.9.1 Moment generating functions
    1.9.2 Marginal distributions and the MGF
    1.9.3 Vector notation
    1.9.4 Properties of variance matrices
  1.10 The multivariate normal distribution
    1.10.1 The multivariate normal MGF
    1.10.2 Independence and normality
  1.11 Limit Theorems
    1.11.1 Convergence of random variables

2 Statistical Inference
  2.1 Basic definitions and terminology
  2.2 Minimum Variance Unbiased Estimation
  2.3 Methods of Estimation
    2.3.1 Method of Moments
    2.3.2 Maximum Likelihood Estimation
  2.4 Hypothesis Tests and Confidence Intervals
1 Distribution Theory
A discrete RV is described by its probability function

    p(x) = P({X = x}).

A continuous RV is described by its probability density function, f(x), for which

    P({a ≤ X ≤ b}) = ∫_{a}^{b} f(x) dx,   for all a ≤ b.

In particular,

    P(X = a) = ∫_{a}^{a} f(x) dx = 0

for any continuous RV.
Some RVs are neither discrete nor continuous. For example, daily precipitation has a point mass at zero (dry days) but is continuously distributed over positive amounts.
The cumulative distribution function (CDF) is defined by:

    F(x) = P({X ≤ x})   (the probability "to the left" of x),

computed as

    F(x) = Σ_{t: t ≤ x} p(t)      (X discrete)
    F(x) = ∫_{−∞}^{x} f(t) dt     (X continuous).

The expected value E(X) of a discrete RV is given by:

    E(X) = Σ_x x p(x),

provided that Σ_x |x| p(x) < ∞ (i.e., the sum must converge absolutely). Otherwise the expectation is not defined.
The expected value of a continuous RV is given by:

    E(X) = ∫_{−∞}^{∞} x f(x) dx,

provided that ∫_{−∞}^{∞} |x| f(x) dx < ∞; otherwise it is not defined. (For example, the Cauchy distribution has no moments.)
More generally,

    E{h(X)} = Σ_x h(x) p(x),            if Σ_x |h(x)| p(x) < ∞             (X discrete)
    E{h(X)} = ∫_{−∞}^{∞} h(x) f(x) dx,  if ∫_{−∞}^{∞} |h(x)| f(x) dx < ∞   (X continuous).
The moment generating function of a RV X is defined to be:

    M_X(t) = E[e^{tX}] = Σ_x e^{tx} p(x)            (X discrete)
                       = ∫_{−∞}^{∞} e^{tx} f(x) dx  (X continuous).

M_X(0) = 1 always; the MGF may or may not be defined for other values of t.

If M_X(t) is defined for all t in some open interval containing 0, then:

1. moments of all orders exist;
2. E[X^r] = M_X^{(r)}(0) (the r-th order derivative at 0);
3. M_X(t) uniquely determines the distribution of X.

In particular,

    M′(0) = E(X),   M′′(0) = E(X²),

and so on.
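The link between derivatives of M_X at 0 and the moments can be checked numerically. A minimal sketch (not part of the notes; the step size h and the choice of the Bernoulli MGF are ours), using M_X(t) = 1 + p(e^t − 1):

```python
import math

def mgf_bernoulli(t, p):
    """Bernoulli(p) MGF: M(t) = 1 + p(e^t - 1)."""
    return 1 + p * (math.exp(t) - 1)

def first_two_moments(mgf, h=1e-5):
    """Central-difference estimates of M'(0) and M''(0)."""
    m1 = (mgf(h) - mgf(-h)) / (2 * h)
    m2 = (mgf(h) - 2 * mgf(0.0) + mgf(-h)) / h ** 2
    return m1, m2

p = 0.3
m1, m2 = first_two_moments(lambda t: mgf_bernoulli(t, p))
# For Bernoulli(p): E(X) = p and E(X^2) = p, since X^2 = X.
```

Both estimates should be close to p = 0.3, matching M′(0) = E(X) and M′′(0) = E(X²).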
1.1 Discrete distributions

1.1.1 Bernoulli distribution

Parameter: 0 ≤ p ≤ 1
Possible values: {0, 1}
Prob. function:

    p(x) = p^x (1 − p)^{1−x} = p for x = 1, and 1 − p for x = 0.

    E(X) = p
    Var(X) = p(1 − p)
    M_X(t) = 1 + p(e^t − 1).
1.1.2 Binomial distribution

Parameters: 0 ≤ p ≤ 1; n > 0 an integer.

Consider a sequence of n independent Bern(p) trials. If X = total number of successes, then X ∼ B(n, p).

Prob. function:

    p(x) = C(n, x) p^x (1 − p)^{n−x},   x = 0, 1, . . . , n,

where C(n, x) = n!/(x!(n − x)!). This is a valid probability function: p(x) ≥ 0 for x = 0, 1, . . . , n, and Σ_{x=0}^{n} p(x) = 1 by the binomial theorem.

    M_X(t) = {1 + p(e^t − 1)}^n.

Note: if Y_1, . . . , Y_n are the underlying Bernoulli RVs, then X = Y_1 + Y_2 + · · · + Y_n (the same as counting the number of successes).
1.1.3 Geometric distribution

Suppose a sequence of independent Bernoulli trials is performed and let X be the number of failures preceding the first success. Then X ∼ Geom(p), with

    p(x) = p(1 − p)^x,   x = 0, 1, 2, . . .
1.1.4 Negative Binomial distribution

Suppose a sequence of independent Bernoulli trials is conducted. If X is the number of failures preceding the n-th success, then X has a negative binomial distribution.

Prob. function:

    p(x) = C(n + x − 1, n − 1) p^n (1 − p)^x,   x = 0, 1, 2, . . . ,

where the binomial coefficient counts the ways of allocating the failures and successes among the first n + x − 1 trials.

    E(X) = n(1 − p)/p,
    Var(X) = n(1 − p)/p²,
    M_X(t) = [p / (1 − e^t(1 − p))]^n,   for e^t(1 − p) < 1.

1. If n = 1, we obtain the geometric distribution.
2. The negative binomial is also seen to arise as the sum of n independent geometric variables.
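The negative binomial mean and variance can be verified by direct (truncated) summation of the probability function. A quick sketch, not part of the notes (the truncation point 500 and the parameter values are arbitrary choices):

```python
import math

def nb_pmf(x, n, p):
    """P(X = x) for X = number of failures before the n-th success."""
    return math.comb(n + x - 1, n - 1) * p ** n * (1 - p) ** x

n, p = 3, 0.4
xs = range(500)                      # truncation; the tail mass is negligible here
total = sum(nb_pmf(x, n, p) for x in xs)
mean = sum(x * nb_pmf(x, n, p) for x in xs)
var = sum(x ** 2 * nb_pmf(x, n, p) for x in xs) - mean ** 2
# Theory: E(X) = n(1-p)/p = 4.5 and Var(X) = n(1-p)/p^2 = 11.25.
```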
1.1.5 Poisson distribution

Parameter: λ > 0
Possible values: x = 0, 1, 2, . . .
Prob. function:

    p(x) = e^{−λ} λ^x / x!

    M_X(t) = e^{λ(e^t − 1)}.

The Poisson distribution arises as the distribution for the number of "point events" observed from a Poisson process. For example: the number of incoming calls to a certain exchange in a given hour.

Figure 1: Poisson example

The Poisson distribution is the limiting form of the binomial distribution as

    n → ∞,   p → 0,   np → λ.

The derivation of the Poisson distribution (via the binomial) is underpinned by a Poisson process, i.e., a point process on [0, ∞); see Figure 1.

AXIOMS for a Poisson process of rate λ > 0:

(A) The numbers of occurrences in disjoint intervals are independent.
(B) The probability of 1 or more occurrences in any interval [t, t + h) is λh + o(h) as h → 0 (i.e., approximately λ times the length of the interval).

(C) The probability of more than one occurrence in [t, t + h) is o(h) as h → 0 (i.e., the probability is small and negligible).

Note: o(h), pronounced "small order h", is standard notation for any function r(h) with the property

    lim_{h→0} r(h)/h = 0.
Figure 2: Small order h: h⁴ is o(h); h itself is not.
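The binomial-to-Poisson limit (n → ∞, p → 0 with np → λ) can be illustrated numerically. A sketch, not part of the notes (the grid of n values and λ = 2 are arbitrary choices):

```python
import math

def binom_pmf(x, n, p):
    return math.comb(n, x) * p ** x * (1 - p) ** (n - x)

def poisson_pmf(x, lam):
    return math.exp(-lam) * lam ** x / math.factorial(x)

lam = 2.0
# Largest pointwise gap over x = 0..10, for increasing n with np = lam held fixed.
errs = {n: max(abs(binom_pmf(x, n, lam / n) - poisson_pmf(x, lam))
               for x in range(11))
        for n in (10, 100, 1000)}
```

The gap shrinks as n grows, consistent with the limiting-form statement above.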
1.1.6 Hypergeometric distribution

Consider an urn containing M black and N white balls. Suppose n balls are sampled randomly without replacement, and let X be the number of black balls chosen. Then X has a hypergeometric distribution.

Parameters: M, N > 0; 0 < n ≤ M + N
Possible values: max(0, n − N) ≤ x ≤ min(n, M)
Prob. function:

    p(x) = C(M, x) C(N, n − x) / C(M + N, n),

    E(X) = nM/(M + N),

    Var(X) = [(M + N − n)/(M + N − 1)] · nMN/(M + N)².

The MGF exists, but there is no useful expression available.
1. The hypergeometric probability is simply

       (# selections with x black balls) / (# possible selections)
           = C(M, x) C(N, n − x) / C(M + N, n).

2. To see how the limits arise, observe we must have x ≤ n (no more black balls than the sample size) and x ≤ M; i.e., x ≤ min(n, M). Similarly, we must have x ≥ 0 (cannot have < 0 black balls in the sample) and n − x ≤ N (cannot have more white balls than the number in the urn), i.e., x ≥ n − N. Together, x ≥ max(0, n − N).

3. If we sampled with replacement, we would get X ∼ B(n, p) with p = M/(M + N). It is interesting to compare moments:

       hypergeometric:  E(X) = np,  Var(X) = [(M + N − n)/(M + N − 1)] np(1 − p)
       binomial:        E(X) = np,  Var(X) = np(1 − p)

   The factor (M + N − n)/(M + N − 1) is the finite population correction: when we sample all the balls in the urn (n = M + N), Var(X) = 0.

4. When M, N ≫ n, the difference between sampling with and without replacement should be small.
Figure 3: p = 1/3. If a white ball is taken out:
Figure 4: p = 1/2 (without replacement); Figure 5: p = 1/3 (with replacement).
Intuitively, this implies that for M, N ≫ n the hypergeometric and binomial probabilities should be very similar, and this can be verified for fixed n, x:

    lim_{M,N→∞, M/(M+N)→p} C(M, x) C(N, n − x) / C(M + N, n) = C(n, x) p^x (1 − p)^{n−x}.
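This limit can be checked numerically by scaling the urn up while holding M/(M + N) fixed. A sketch, not part of the notes (the urn sizes and p = 1/3 are arbitrary choices):

```python
import math

def hypergeom_pmf(x, M, N, n):
    return math.comb(M, x) * math.comb(N, n - x) / math.comb(M + N, n)

def binom_pmf(x, n, p):
    return math.comb(n, x) * p ** x * (1 - p) ** (n - x)

n, p = 5, 1 / 3
gaps = []
for M in (10, 100, 10_000):
    N = 2 * M                        # keeps M/(M+N) = 1/3 fixed as the urn grows
    gaps.append(max(abs(hypergeom_pmf(x, M, N, n) - binom_pmf(x, n, p))
                    for x in range(n + 1)))
```

The largest pointwise gap shrinks steadily as the urn grows.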
1.2 Continuous Distributions

1.2.1 Uniform Distribution

Parameters: a < b
PDF: f(x) = 1/(b − a) for a < x < b, and 0 otherwise.

CDF, for a < x < b:

    F(x) = ∫_{−∞}^{x} f(t) dt = ∫_{a}^{x} 1/(b − a) dt = (x − a)/(b − a),
that is,

    F(x) = 0 for x ≤ a;   F(x) = (x − a)/(b − a) for a < x < b;   F(x) = 1 for x ≥ b.

Figure 6: Uniform distribution CDF

    M_X(t) = (e^{tb} − e^{ta}) / (t(b − a)),   t ≠ 0.

A special case is the U(0, 1) distribution:

    f(x) = 1 for 0 < x < 1, and 0 otherwise;
    F(x) = x for 0 < x < 1;
    E(X) = 1/2,   Var(X) = 1/12,   M(t) = (e^t − 1)/t.
1.2.2 Exponential Distribution

Parameter: λ > 0
PDF: f(x) = λe^{−λx}, x > 0
CDF: F(x) = 1 − e^{−λx}, x > 0

    M_X(t) = λ/(λ − t),   t < λ.

This is the distribution of the waiting time until the first occurrence in a Poisson process with rate parameter λ.

1. If X ∼ Exp(λ), then

       P(X ≥ t + x | X ≥ t) = P(X ≥ x)   (the memoryless property).

2. It can be obtained as a limiting form of the geometric distribution.
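The memoryless property follows from the survival function S(u) = P(X ≥ u) = e^{−λu}, since S(t + x)/S(t) = S(x). A quick numerical confirmation (a sketch, not part of the notes; the grid of t and x values is an arbitrary choice):

```python
import math

def survival(u, lam):
    """P(X >= u) = e^{-lam u} for X ~ Exp(lam)."""
    return math.exp(-lam * u)

lam = 1.5
# Pairs (P(X >= t+x | X >= t), P(X >= x)) over a small grid of t and x.
checks = [(survival(t + x, lam) / survival(t, lam), survival(x, lam))
          for t in (0.5, 1.0, 3.0) for x in (0.2, 1.0, 2.5)]
```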
1.2.3 Gamma distribution

gamma function:  Γ(α) = ∫_{0}^{∞} t^{α−1} e^{−t} dt

PDF:  f(x) = (λ^α / Γ(α)) x^{α−1} e^{−λx},   x > 0

MGF:  M_X(t) = (λ/(λ − t))^α,   t < λ.

Suppose Y_1, . . . , Y_K are independent Exp(λ) RVs and let X = Y_1 + · · · + Y_K. Then X ∼ Gamma(K, λ) for integer K. In general, X ∼ Gamma(α, λ) with α > 0.

1. α is the shape parameter and λ is the scale parameter.

   Figure 7: Gamma distribution

   Note: if Y ∼ Gamma(α, 1) and X = Y/λ, then X ∼ Gamma(α, λ); that is, λ acts as a scale parameter.

2. The Gamma(p/2, 1/2) distribution is also called the χ²_p (chi-square with p df) distribution when p is an integer; in particular, χ²_2 is the Exp(1/2) (exponential) distribution.
3. The Gamma(K, λ) distribution can be interpreted as the waiting time until the K-th occurrence in a Poisson process of rate λ.
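The sum-of-exponentials construction can be illustrated by simulation: summing K independent Exp(λ) waiting times should give a sample with mean K/λ and variance K/λ². A seeded simulation sketch, not part of the notes (the parameter values and sample size are arbitrary choices):

```python
import random

random.seed(0)
K, lam, reps = 4, 2.0, 200_000
# Each sample is the waiting time until the K-th occurrence: a sum of K
# independent Exp(lam) inter-arrival times.
samples = [sum(random.expovariate(lam) for _ in range(K)) for _ in range(reps)]
mean = sum(samples) / reps
var = sum((s - mean) ** 2 for s in samples) / reps
# Theory for Gamma(K, lam): E(X) = K/lam = 2.0 and Var(X) = K/lam^2 = 1.0.
```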
1.2.4 Beta density function

Suppose Y_1 ∼ Gamma(α, λ) and Y_2 ∼ Gamma(β, λ) independently. Then

    X = Y_1/(Y_1 + Y_2) ∼ Beta(α, β),   0 ≤ x ≤ 1.

Figure 8: Beta distribution
1.2.5 Normal distribution

X ∼ N(µ, σ²):  f(x) = (1/(σ√(2π))) e^{−(x−µ)²/(2σ²)};   M_X(t) = e^{tµ + t²σ²/2}.
1.2.6 Cauchy distribution

Possible values: x ∈ ℝ
PDF:  f(x) = (1/π) · 1/(1 + x²)   (location parameter θ = 0)
CDF:  F(x) = 1/2 + (1/π) arctan x

E(X), Var(X) and M_X(t) do not exist: the Cauchy is a bell-shaped distribution, symmetric about zero, for which no moments are defined.
Figure 9: Cauchy distribution (more sharply peaked than the normal distribution, with tails that go to zero much more slowly).

If Z_1 ∼ N(0, 1) and Z_2 ∼ N(0, 1) independently, then X = Z_1/Z_2 has the Cauchy distribution.
1.3 Transformations of a single RV

If X is a RV and Y = h(X), then Y is also a RV. If we know the distribution of X, then for a given function h(x) we should be able to find the distribution of Y = h(X).

Theorem 1.3.1 (Discrete case)
Suppose X is a discrete RV with prob. function p_X(x) and let Y = h(X), where h is any function. Then:

    p_Y(y) = Σ_{x: h(x)=y} p_X(x)

(the sum over all values x for which h(x) = y).

Proof.

    p_Y(y) = P(Y = y) = P{h(X) = y}
           = Σ_{x: h(x)=y} P(X = x)
           = Σ_{x: h(x)=y} p_X(x).
Theorem 1.3.2
Suppose X is a continuous RV with PDF f_X(x) and let Y = h(X), where h(x) is differentiable and monotonic, i.e., either strictly increasing or strictly decreasing. Then the PDF of Y is given by:

    f_Y(y) = f_X{h⁻¹(y)} |(h⁻¹)′(y)|.

Proof. Assume h is increasing. Then

    F_Y(y) = P(Y ≤ y) = P{h(X) ≤ y}
           = P{X ≤ h⁻¹(y)}
           = F_X{h⁻¹(y)}.

Figure 10: h increasing.

Differentiating (use the chain rule),

    f_Y(y) = (d/dy) F_Y(y) = (d/dy) F_X{h⁻¹(y)} = f_X{h⁻¹(y)} (h⁻¹)′(y).   (1)

Now consider the case of h(x) decreasing:

    F_Y(y) = P(Y ≤ y) = P{h(X) ≤ y}
           = P{X ≥ h⁻¹(y)}
           = 1 − F_X{h⁻¹(y)}

    ⇒ f_Y(y) = −f_X{h⁻¹(y)} (h⁻¹)′(y).   (2)

Figure 11: h decreasing.

Finally, observe that if h is increasing then h′(x), and hence (h⁻¹)′(y), must be positive. Similarly, for h decreasing, (h⁻¹)′(y) < 0. Hence (1) and (2) can be combined to give:

    f_Y(y) = f_X{h⁻¹(y)} |(h⁻¹)′(y)|.
Examples:

1. Discrete transformation of a single RV

   X ∼ Po(λ) and Y is X rounded to the nearest multiple of 10. Possible values of Y are: 0, 10, 20, . . .

       P(Y = 0) = P(X = 0) + P(X = 1) + P(X = 2) + P(X = 3) + P(X = 4)
                = e^{−λ} + e^{−λ}λ + e^{−λ}λ²/2! + e^{−λ}λ³/3! + e^{−λ}λ⁴/4!;

       P(Y = 10) = P(X = 5) + P(X = 6) + · · · + P(X = 14),

   and so on.
2. Continuous transformation of a single RV

   Let h(X) = aX + b, a ≠ 0. To find h⁻¹(y) we solve for x in the equation y = h(x), i.e.,

       y = ax + b  ⇒  x = (y − b)/a
                   ⇒  h⁻¹(y) = (y − b)/a
                   ⇒  (h⁻¹)′(y) = 1/a.

   Hence

       f_Y(y) = (1/|a|) f_X((y − b)/a).

   Specifically:

   1. Suppose Z ∼ N(0, 1) and let X = µ + σZ, σ > 0. Recall that Z has PDF

          φ(z) = (1/√(2π)) e^{−z²/2}.

      Hence

          f_X(x) = (1/σ) φ((x − µ)/σ) = (1/(σ√(2π))) e^{−(x−µ)²/(2σ²)},

      which is the N(µ, σ²) PDF.

   2. Suppose X ∼ N(µ, σ²) and let Z = (X − µ)/σ. Find the PDF of Z.
   Solution. Observe that Z = h(X) = (1/σ)X − µ/σ, so

       f_Z(z) = σ f_X(σz + µ)
              = σ · (1/(σ√(2π))) e^{−(µ+σz−µ)²/(2σ²)}
              = (1/√(2π)) e^{−σ²z²/(2σ²)}
              = (1/√(2π)) e^{−z²/2} = φ(z),

   i.e., Z ∼ N(0, 1).
3. Suppose X ∼ Gamma(α, 1) and let Y = X/λ, λ > 0. Find the PDF of Y.

   Solution. Since Y = (1/λ)X is a linear function, we have

       f_Y(y) = λ f_X(λy) = λ · (1/Γ(α)) (λy)^{α−1} e^{−λy}
              = (λ^α/Γ(α)) y^{α−1} e^{−λy},

   which is the Gamma(α, λ) PDF.
1.4 CDF transformation

Suppose X is a continuous RV with CDF F_X(x), which is increasing over the range of X. If U = F_X(X), then U ∼ U(0, 1).

Proof.

    F_U(u) = P(U ≤ u)
           = P{F_X(X) ≤ u}
           = P{X ≤ F_X⁻¹(u)}
           = F_X{F_X⁻¹(u)}
           = u,   for 0 < u < 1.

This is simply the CDF of U(0, 1), so the result is proved.

The converse also applies. If U ∼ U(0, 1) and F is any CDF that is strictly increasing on its range, then X = F⁻¹(U) has CDF F(x), i.e.,

    F_X(x) = P(X ≤ x) = P{F⁻¹(U) ≤ x}
           = P(U ≤ F(x))
           = F(x),   as required.
1.5 Non-monotonic transformations

Theorem 1.3.2 applies to monotonic (either strictly increasing or strictly decreasing) transformations of a continuous RV.

In general, if h(x) is not monotonic, then Y = h(X) may not even be a continuous RV. For example, if h(x) = [x] (the integer part of x), then the possible values for Y = h(X) are the integers, so Y is discrete.

However, it can be shown that h(X) is continuous if X is continuous and h(x) is piecewise monotonic.
1.6 Moments of transformed RVs

Suppose X is a RV and let Y = h(X). If we want to find E(Y), we can proceed as follows:

1. Find the distribution of Y = h(X) using the preceding methods.

2. Find

       E(Y) = Σ_y y p(y)             (Y discrete)
       E(Y) = ∫_{−∞}^{∞} y f(y) dy   (Y continuous)

   (forget X ever existed!).

Theorem 1.6.1
If X is a RV of either discrete or continuous type and h(x) is any transformation (not necessarily monotonic), then E{h(X)} (provided it exists) is given by:

    E{h(X)} = Σ_x h(x) p(x)             (X discrete)
    E{h(X)} = ∫_{−∞}^{∞} h(x) f(x) dx   (X continuous)

Proof. Not examinable.
Examples:

1. CDF transformation

   Suppose U ∼ U(0, 1). How can we transform U to get an Exp(λ) RV?

   Solution. Take X = F⁻¹(U), where F is the Exp(λ) CDF. Recall F(x) = 1 − e^{−λx} for the Exp(λ) distribution.

   To find F⁻¹(u), we solve for x in F(x) = u, i.e.,

       u = 1 − e^{−λx}
       ⇒ 1 − u = e^{−λx}
       ⇒ ln(1 − u) = −λx
       ⇒ x = −ln(1 − u)/λ.

   Hence if U ∼ U(0, 1), it follows that X = −ln(1 − U)/λ ∼ Exp(λ).

   Note: Y = −(ln U)/λ ∼ Exp(λ) as well, since both 1 − U and U have the U(0, 1) distribution.

   This type of result is used to generate random numbers. That is, there are good methods for producing U(0, 1) (pseudo-random) numbers; to obtain Exp(λ) random numbers, we can just generate U(0, 1) numbers and calculate X = −(log U)/λ.
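A seeded sketch of this inverse-CDF recipe (not part of the notes; the helper names, sample size and check points are our choices). It draws Exp(λ) values from uniforms and compares the empirical CDF with F(x) = 1 − e^{−λx}:

```python
import bisect
import math
import random

def exp_sample(lam, rng):
    """One Exp(lam) draw from one U(0,1) draw via X = -ln(1-U)/lam.

    Using 1 - U (also uniform, as noted above) avoids log(0) since
    rng.random() lies in [0, 1).
    """
    return -math.log(1.0 - rng.random()) / lam

rng = random.Random(42)
lam = 2.0
xs = sorted(exp_sample(lam, rng) for _ in range(100_000))

def ecdf(t):
    """Empirical CDF of the sample at t."""
    return bisect.bisect_right(xs, t) / len(xs)

# Largest gap between the empirical CDF and the Exp(lam) CDF at a few points.
gap = max(abs(ecdf(t) - (1 - math.exp(-lam * t))) for t in (0.1, 0.5, 1.0, 2.0))
```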
2. Non-monotonic transformations

   Suppose Z ∼ N(0, 1) and let X = Z²; h(z) = z² is not monotonic, so Theorem 1.3.2 does not apply. However, we can proceed as follows:

       F_X(x) = P(X ≤ x)
              = P(Z² ≤ x)
              = P(−√x ≤ Z ≤ √x)
              = Φ(√x) − Φ(−√x),

   where Φ(a) = ∫_{−∞}^{a} (1/√(2π)) e^{−t²/2} dt is the N(0, 1) CDF. That is,

       f_X(x) = (d/dx) F_X(x) = φ(√x)(½x^{−1/2}) − φ(−√x)(−½x^{−1/2}),

   where φ(a) = Φ′(a) = (1/√(2π)) e^{−a²/2} is the N(0, 1) PDF. Hence

       f_X(x) = ½x^{−1/2} [(1/√(2π)) e^{−x/2} + (1/√(2π)) e^{−x/2}]
              = ((1/2)^{1/2}/√π) x^{1/2−1} e^{−x/2},

   which is the Gamma(1/2, 1/2) PDF.

   On the other hand, the distribution of Z² is also called the χ²₁ distribution. We have proved that it is the same as the Gamma(1/2, 1/2) distribution.
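The derivation can be checked numerically: differentiating F_X(x) = Φ(√x) − Φ(−√x) should reproduce the Gamma(1/2, 1/2) PDF. A sketch, not part of the notes (the check points and step size h are arbitrary choices; Φ is written via the error function, Φ(a) = (1 + erf(a/√2))/2):

```python
import math

def Phi(a):
    """N(0,1) CDF via the error function."""
    return 0.5 * (1 + math.erf(a / math.sqrt(2)))

def F_X(x):
    """CDF of X = Z^2 for Z ~ N(0,1)."""
    return Phi(math.sqrt(x)) - Phi(-math.sqrt(x))

def gamma_half_half_pdf(x):
    # (1/2)^{1/2} / Gamma(1/2) * x^{-1/2} * e^{-x/2}, with Gamma(1/2) = sqrt(pi)
    return math.sqrt(0.5) / math.sqrt(math.pi) * x ** -0.5 * math.exp(-x / 2)

h = 1e-6
# Central-difference derivative of F_X versus the Gamma(1/2, 1/2) density.
diffs = [abs((F_X(x + h) - F_X(x - h)) / (2 * h) - gamma_half_half_pdf(x))
         for x in (0.25, 1.0, 2.0, 5.0)]
```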
3. Moments of transformed RVs:

   If U ∼ U(0, 1) and Y = −(log U)/λ, then Y ∼ Exp(λ), so f(y) = λe^{−λy}, y > 0. Can check:

       E(Y) = ∫_{0}^{∞} λy e^{−λy} dy = 1/λ.

4. Based on Theorem 1.6.1:

   If U ∼ U(0, 1) and Y = −(log U)/λ, then according to Theorem 1.6.1,

       E(Y) = ∫_{0}^{1} (−log u/λ)(1) du
            = −(1/λ) ∫_{0}^{1} log u du
            = −(1/λ) (u log u − u) |_{0}^{1}
            = −(1/λ) [u log u |_{0}^{1} − u |_{0}^{1}]
            = −(1/λ)[0 − 1]
            = 1/λ,   as required.
There are some important consequences of Theorem 1.6.1:
1. If E(X) = µ and Var(X) = σ², and Y = aX + b for constants a, b, then E(Y) = aµ + b and Var(Y) = a²σ².

   Proof. (Continuous case)

       E(Y) = E(aX + b) = ∫_{−∞}^{∞} (ax + b) f(x) dx
            = a ∫_{−∞}^{∞} x f(x) dx + b ∫_{−∞}^{∞} f(x) dx
            = aµ + b,

   since the first integral is E(X) = µ and the second is 1.

       Var(Y) = E[{Y − E(Y)}²]
              = E[{aX + b − (aµ + b)}²]
              = E{a²(X − µ)²}
              = a² E{(X − µ)²}
              = a² Var(X)
              = a²σ².
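A quick check of E(aX + b) = aµ + b and Var(aX + b) = a²σ² on a small discrete distribution (a sketch, not part of the notes; the toy pmf and the constants a, b are arbitrary choices):

```python
pmf = {0: 0.2, 1: 0.5, 2: 0.3}       # an arbitrary toy distribution
mu = sum(x * p for x, p in pmf.items())
sigma2 = sum((x - mu) ** 2 * p for x, p in pmf.items())

a, b = 3.0, -1.0
pmf_Y = {a * x + b: p for x, p in pmf.items()}   # distribution of Y = aX + b
mu_Y = sum(y * p for y, p in pmf_Y.items())
var_Y = sum((y - mu_Y) ** 2 * p for y, p in pmf_Y.items())
# mu_Y should equal a*mu + b, and var_Y should equal a^2 * sigma2.
```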
2. If X is a RV and h(X) is any function, then the MGF of Y = h(X), provided it exists, is

       M_Y(t) = Σ_x e^{t h(x)} p(x)             (X discrete)
       M_Y(t) = ∫_{−∞}^{∞} e^{t h(x)} f(x) dx   (X continuous)

   This gives us another way to find the distribution of Y = h(X): find M_Y(t), and if we recognise M_Y(t), then by uniqueness we can conclude that Y has that distribution.
Examples

1. Suppose X is continuous with CDF F(x), and F(a) = 0, F(b) = 1 (a, b can be ±∞ respectively). Let U = F(X). Observe that

       M_U(t) = ∫_{a}^{b} e^{tF(x)} f(x) dx
              = (1/t) e^{tF(x)} |_{a}^{b}
              = (e^{tF(b)} − e^{tF(a)})/t
              = (e^t − 1)/t,

   which is the U(0, 1) MGF.

2. Suppose X ∼ U(0, 1), and let Y = −(log X)/λ. Then

       M_Y(t) = ∫_{0}^{1} e^{t(−log x/λ)} (1) dx
              = ∫_{0}^{1} x^{−t/λ} dx
              = (1/(1 − t/λ)) x^{1−t/λ} |_{0}^{1}
              = 1/(1 − t/λ)
              = λ/(λ − t),

   which is the MGF for the Exp(λ) distribution. Hence we can conclude that Y = −(log X)/λ ∼ Exp(λ).
1.7 Multivariate distributions

Definition 1.7.1
If X_1, X_2, . . . , X_r are discrete RVs, then X = (X_1, X_2, . . . , X_r)ᵀ is called a discrete random vector. The probability function P(x) is:

    P(x) = P(X = x) = P({X_1 = x_1} ∩ {X_2 = x_2} ∩ · · · ∩ {X_r = x_r});

we write P(x) = P(x_1, x_2, . . . , x_r).
1.7.1 Trinomial distribution

(The case r = 2.) Consider a sequence of n independent trials where each trial produces:

    Outcome 1: with prob π_1
    Outcome 2: with prob π_2
    Outcome 3: with prob 1 − π_1 − π_2

If X_1, X_2 are the numbers of occurrences of outcomes 1 and 2 respectively, then (X_1, X_2) have the trinomial distribution.

Parameters: π_1 > 0, π_2 > 0 and π_1 + π_2 < 1; n > 0 fixed
Possible values: integers (x_1, x_2) s.t. x_1 ≥ 0, x_2 ≥ 0, x_1 + x_2 ≤ n
Probability function:

    P(x_1, x_2) = [n! / (x_1! x_2! (n − x_1 − x_2)!)] π_1^{x_1} π_2^{x_2} (1 − π_1 − π_2)^{n−x_1−x_2}

for x_1, x_2 ≥ 0, x_1 + x_2 ≤ n.
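As a sanity check, the trinomial probabilities should sum to 1 over the support {(x_1, x_2) : x_1, x_2 ≥ 0, x_1 + x_2 ≤ n}. A sketch, not part of the notes (the parameter values are arbitrary choices):

```python
import math

def trinomial_pmf(x1, x2, n, pi1, pi2):
    c = (math.factorial(n) //
         (math.factorial(x1) * math.factorial(x2) * math.factorial(n - x1 - x2)))
    return c * pi1 ** x1 * pi2 ** x2 * (1 - pi1 - pi2) ** (n - x1 - x2)

n, pi1, pi2 = 6, 0.2, 0.5
# Sum over the whole support: x1 >= 0, x2 >= 0, x1 + x2 <= n.
total = sum(trinomial_pmf(x1, x2, n, pi1, pi2)
            for x1 in range(n + 1) for x2 in range(n + 1 - x1))
```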
1.7.2 Multinomial distribution

Parameters: n > 0, π = (π_1, π_2, . . . , π_r)ᵀ with π_i > 0 and Σ_{i=1}^{r} π_i = 1
Possible values: integer-valued (x_1, x_2, . . . , x_r) s.t. x_i ≥ 0 and Σ_{i=1}^{r} x_i = n
Probability function:

    P(x) = (n choose x_1, x_2, . . . , x_r) π_1^{x_1} π_2^{x_2} · · · π_r^{x_r}

for x_i ≥ 0, Σ_{i=1}^{r} x_i = n.

Remarks

1. Note (n choose x_1, x_2, . . . , x_r) := n! / (x_1! x_2! · · · x_r!) is the multinomial coefficient.

2. The multinomial distribution is the generalisation of the binomial distribution to r types of outcome.

3. The formulation differs from the binomial and trinomial cases in that the redundant count x_r = n − (x_1 + x_2 + · · · + x_{r−1}) is included as an argument of P(x).
1.7.3 Marginal and conditional distributions

Consider a discrete random vector X = (X_1, X_2, . . . , X_r)ᵀ and let

    X_1 = (X_1, X_2, . . . , X_{r_1})ᵀ   and   X_2 = (X_{r_1+1}, X_{r_1+2}, . . . , X_r)ᵀ,

so that X = (X_1, X_2)ᵀ, a partition into two sub-vectors.

Definition 1.7.2
If X has joint probability function P_X(x) = P_X(x_1, x_2), then the marginal probability function for X_1 is:

    P_{X_1}(x_1) = Σ_{x_2} P_X(x_1, x_2).

Observe that:

    P_{X_1}(x_1) = Σ_{x_2} P_X(x_1, x_2) = Σ_{x_2} P({X_1 = x_1} ∩ {X_2 = x_2})
                 = P(X_1 = x_1)

by the law of total probability. Hence the marginal probability function for X_1 is just the probability function we would have if X_2 were not observed.

The marginal probability function for X_2 is:

    P_{X_2}(x_2) = Σ_{x_1} P_X(x_1, x_2).
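Definition 1.7.2 can be illustrated concretely with the trinomial distribution: summing the joint probability function over x_2 should recover the B(n, π_1) marginal. A sketch, not part of the notes (the parameter values are arbitrary choices):

```python
import math

def trinomial_pmf(x1, x2, n, pi1, pi2):
    c = (math.factorial(n) //
         (math.factorial(x1) * math.factorial(x2) * math.factorial(n - x1 - x2)))
    return c * pi1 ** x1 * pi2 ** x2 * (1 - pi1 - pi2) ** (n - x1 - x2)

def binom_pmf(x, n, p):
    return math.comb(n, x) * p ** x * (1 - p) ** (n - x)

n, pi1, pi2 = 5, 0.3, 0.4
# Marginalise the joint over x2 for each value of x1.
marginal = [sum(trinomial_pmf(x1, x2, n, pi1, pi2) for x2 in range(n - x1 + 1))
            for x1 in range(n + 1)]
# Each entry should match the B(n, pi1) probability function.
```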
Definition 1.7.3
Suppose X = (X_1, X_2)ᵀ. If P_{X_1}(x_1) > 0, we define the conditional probability function of X_2 | X_1 by:

    P_{X_2|X_1}(x_2|x_1) = P_X(x_1, x_2) / P_{X_1}(x_1).

Remarks

1.  P_{X_2|X_1}(x_2|x_1) = P({X_1 = x_1} ∩ {X_2 = x_2}) / P(X_1 = x_1)
                         = P(X_2 = x_2 | X_1 = x_1).

2. It is easy to check that P_{X_2|X_1}(x_2|x_1) is a proper probability function with respect to x_2 for each fixed x_1 such that P_{X_1}(x_1) > 0.

3. P_{X_1|X_2}(x_1|x_2) is defined by:

       P_{X_1|X_2}(x_1|x_2) = P_X(x_1, x_2) / P_{X_2}(x_2).

Definition 1.7.4 (Independence)
Discrete RVs X_1, X_2, . . . , X_r are said to be independent if their joint probability function satisfies

    P(x_1, x_2, . . . , x_r) = p_1(x_1) p_2(x_2) · · · p_r(x_r)

for some functions p_1, p_2, . . . , p_r and all (x_1, x_2, . . . , x_r).
Remarks

1. Observe that:

       P_{X_1}(x_1) = Σ_{x_2} Σ_{x_3} · · · Σ_{x_r} P(x_1, . . . , x_r)
                    = p_1(x_1) Σ_{x_2} Σ_{x_3} · · · Σ_{x_r} p_2(x_2) · · · p_r(x_r)
                    = c_1 p_1(x_1).

   Hence p_1(x_1) ∝ P_{X_1}(x_1), and likewise p_i(x_i) ∝ P_{X_i}(x_i): each factor is proportional to the corresponding marginal probability function. Also,

       P_{X_1|X_2...X_r}(x_1 | x_2, . . . , x_r)
           = P(x_1, . . . , x_r) / P_{X_2...X_r}(x_2, . . . , x_r)
           = p_1(x_1) p_2(x_2) · · · p_r(x_r) / [Σ_{x_1} p_1(x_1) p_2(x_2) · · · p_r(x_r)]
           = p_1(x_1) p_2(x_2) · · · p_r(x_r) / [p_2(x_2) p_3(x_3) · · · p_r(x_r) Σ_{x_1} p_1(x_1)]
           = p_1(x_1) / Σ_{x_1} p_1(x_1)
           = (1/c_1) P_{X_1}(x_1) / [(1/c_1) Σ_{x_1} P_{X_1}(x_1)]
           = P_{X_1}(x_1),

   using Σ_{x_1} P_{X_1}(x_1) = 1 in the last step. That is,

       P_{X_1|X_2...X_r}(x_1 | x_2, . . . , x_r) = P_{X_1}(x_1).

2. Clearly independence implies

       P_{X_i|X_1...X_{i−1},X_{i+1}...X_r}(x_i | x_1, . . . , x_{i−1}, x_{i+1}, . . . , x_r) = P_{X_i}(x_i).

   Moreover, we have P_{X_1|X_2}(x_1|x_2) = P_{X_1}(x_1) for any partitioning of X = (X_1, X_2)ᵀ if X_1, . . . , X_r are independent.
1.7.4 Continuous multivariate distributions

Definition 1.7.5
The random vector (X_1, . . . , X_r)ᵀ is said to have a continuous multivariate distribution with PDF f(x) if

    P(X ∈ A) = ∫ · · · ∫_A f(x_1, . . . , x_r) dx_1 . . . dx_r

for any measurable set A.
Examples

1. Suppose (X_1, X_2) have the trinomial distribution with parameters n, π_1, π_2 (Outcome 1: π_1; Outcome 2: π_2; Outcome 3: 1 − π_1 − π_2). Then the marginal distribution of X_1 can be seen to be B(n, π_1), and the conditional distribution of X_2 | X_1 = x_1 is B(n − x_1, π_2/(1 − π_1)).
Examples <strong>of</strong> Definition 1.7.5<br />
1. If X 1 , X 2 have <strong>PDF</strong><br />
⎧<br />
1 0 < x 1 < 1<br />
⎪⎨<br />
0 < x 2 < 1<br />
f(x 1 , x 2 ) =<br />
⎪⎩<br />
0 otherwise<br />
The distribution is called uniform on (0, 1) × (0, 1).<br />
It follows that P (X ∈ A) = Area (A) for any A ∈ (0, 1) × (0, 1):<br />
2. Uniform distribution on unit disk:<br />
f(x_1, x_2) = { 1/π   x_1² + x_2² < 1
             { 0     otherwise.
Figure 12: Area A<br />
3. The Dirichlet distribution is defined by the PDF
f(x_1, x_2, ..., x_r) = [ Γ(α_1 + α_2 + ··· + α_r + α_{r+1}) / (Γ(α_1)Γ(α_2)...Γ(α_r)Γ(α_{r+1})) ] × x_1^{α_1−1} x_2^{α_2−1} ... x_r^{α_r−1} (1 − x_1 − ··· − x_r)^{α_{r+1}−1}
for x_1, x_2, ..., x_r > 0, ∑ x_i < 1, and parameters α_1, α_2, ..., α_{r+1} > 0.
Recall the joint PDF:
P(X ∈ A) = ∫ ... ∫_A f(x_1, ..., x_r) dx_1 ... dx_r.
Note: the joint PDF must satisfy:
1. f(x) ≥ 0 for all x;
2. ∫_{−∞}^{∞} ... ∫_{−∞}^{∞} f(x_1, ..., x_r) dx_1 ... dx_r = 1.
Definition. 1.7.6<br />
If X = (X_1, X_2)ᵀ has joint PDF f(x) = f(x_1, x_2), then the marginal PDF of X_1 is given by:
f_{X_1}(x_1) = ∫_{−∞}^{∞} ... ∫_{−∞}^{∞} f(x_1, ..., x_{r_1}, x_{r_1+1}, ..., x_r) dx_{r_1+1} dx_{r_1+2} ... dx_r,
and for f_{X_1}(x_1) > 0, the conditional PDF is given by
f_{X_2|X_1}(x_2|x_1) = f(x_1, x_2) / f_{X_1}(x_1).
Remarks:<br />
1. f_{X_2|X_1}(x_2|x_1) cannot be interpreted directly as the conditional PDF of X_2 given the event {X_1 = x_1}, because P(X_1 = x_1) = 0 for any continuous distribution. The proper interpretation is as the limit, as δ → 0, of the distribution of X_2 | X_1 ∈ B(x_1, δ).
2. f_{X_2}(x_2) and f_{X_1|X_2}(x_1|x_2) are defined analogously.
Definition. 1.7.7 (Independence)<br />
Continuous RVs X_1, X_2, ..., X_r are said to be independent if their joint PDF satisfies:
f(x_1, x_2, ..., x_r) = f_1(x_1) f_2(x_2) ... f_r(x_r),
for some functions f_1, f_2, ..., f_r and all x_1, ..., x_r.
Remarks<br />
1. It is easy to check that if X_1, ..., X_r are independent, then each f_i(x_i) = c_i f_{X_i}(x_i), with c_1 c_2 ... c_r = 1.
2. If X 1 , X 2 , . . . , X r are independent, then it can be checked that:<br />
f_{X_1|X_2}(x_1|x_2) = f_{X_1}(x_1)
for any partitioning of X = (X_1, X_2)ᵀ.
Examples<br />
1. If (X_1, X_2) has the uniform distribution on the unit disk, find:
(a) the marginal PDF f_{X_1}(x_1);
(b) the conditional PDF f_{X_2|X_1}(x_2|x_1).
Solution. Recall
f(x_1, x_2) = { 1/π   x_1² + x_2² < 1
             { 0     otherwise.

(a) f_{X_1}(x_1) = ∫_{−∞}^{∞} f(x_1, x_2) dx_2
  = ∫_{−∞}^{−√(1−x_1²)} 0 dx_2 + ∫_{−√(1−x_1²)}^{√(1−x_1²)} (1/π) dx_2 + ∫_{√(1−x_1²)}^{∞} 0 dx_2
  = 0 + [ x_2/π ] evaluated from −√(1−x_1²) to √(1−x_1²) + 0
  = { 2√(1−x_1²)/π   −1 < x_1 < 1
    { 0              otherwise,
i.e., a semicircular distribution.
Figure 13: A graphic showing the conditional distribution and a semi-circular distribution.<br />
(b) The conditional density of X_2|X_1 is:
f_{X_2|X_1}(x_2|x_1) = f(x_1, x_2) / f_{X_1}(x_1)
  = { 1/(2√(1−x_1²))   for −√(1−x_1²) < x_2 < √(1−x_1²)
    { 0                otherwise,
which is uniform: U(−√(1−x_1²), √(1−x_1²)).
2. If X_1, ..., X_r are any independent continuous RVs, then their joint PDF is:
f(x_1, ..., x_r) = f_{X_1}(x_1) f_{X_2}(x_2) ... f_{X_r}(x_r).
E.g. if X_1, X_2, ..., X_r are independent Exp(λ) RVs, then
f(x_1, ..., x_r) = ∏_{i=1}^{r} λe^{−λx_i} = λ^r e^{−λ ∑_{i=1}^{r} x_i},
for x_i > 0, i = 1, ..., r.
3. Suppose (X_1, X_2) is uniformly distributed on (0, 1) × (0, 1). That is,
f(x_1, x_2) = { 1   0 < x_1 < 1, 0 < x_2 < 1
             { 0   otherwise.
Claim: X_1, X_2 are independent.
Proof.
Let
f_1(x_1) = { 1   0 < x_1 < 1
          { 0   otherwise,
f_2(x_2) = { 1   0 < x_2 < 1
          { 0   otherwise;
then f(x_1, x_2) = f_1(x_1) f_2(x_2), and the two variables are independent.
4. (X_1, X_2) is uniform on the unit disk. Are X_1, X_2 independent? No.
Proof.
We know:
f_{X_2|X_1}(x_2|x_1) = { 1/(2√(1−x_1²))   −√(1−x_1²) < x_2 < √(1−x_1²)
                      { 0                otherwise,
i.e., U(−√(1−x_1²), √(1−x_1²)).
On the other hand,
f_{X_2}(x_2) = { 2√(1−x_2²)/π   −1 < x_2 < 1
             { 0              otherwise,
i.e., a semicircular distribution.
Hence f_{X_2}(x_2) ≠ f_{X_2|X_1}(x_2|x_1), so the variables cannot be independent.
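Both facts about the unit-disk example (the semicircular marginal and the failure of independence) are easy to probe by simulation. The following is a minimal sketch assuming NumPy is available; the sample size, band widths, and tolerances are arbitrary choices, not part of the notes:

```python
import numpy as np

rng = np.random.default_rng(0)

# Rejection-sample points uniformly on the unit disk.
pts = rng.uniform(-1, 1, size=(200_000, 2))
pts = pts[pts[:, 0]**2 + pts[:, 1]**2 < 1]
x1, x2 = pts[:, 0], pts[:, 1]

# The marginal density of X1 at 0 should be 2*sqrt(1-0)/pi = 2/pi.
est = (np.abs(x1) < 0.05).mean() / 0.1   # P(|X1| < 0.05) / band width
print(abs(est - 2 / np.pi) < 0.05)        # True

# Dependence: the spread of X2 given X1 near 0.9 is much smaller
# than the overall spread of X2 (conditional support shrinks).
cond_sd = x2[np.abs(x1 - 0.9) < 0.05].std()
print(cond_sd < x2.std())                 # True
```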
Definition. 1.7.8
The joint CDF of RVs X_1, X_2, ..., X_r is defined by:
F(x_1, x_2, ..., x_r) = P({X_1 ≤ x_1} ∩ {X_2 ≤ x_2} ∩ ... ∩ {X_r ≤ x_r}).
Remarks:<br />
1. The definition applies to RVs of any type, i.e., discrete, continuous, or "hybrid".
2. The marginal CDF of the subvector X_1 = (X_1, ..., X_{r_1})ᵀ is F_{X_1}(x_1, ..., x_{r_1}) = F_X(x_1, ..., x_{r_1}, ∞, ∞, ..., ∞).
3. RVs X_1, ..., X_r are defined to be independent if
F(x_1, ..., x_r) = F_{X_1}(x_1) F_{X_2}(x_2) ... F_{X_r}(x_r).
4. The definitions above are completely general. However, in practice it is usually<br />
easier to work with the <strong>PDF</strong>/probability function for any given example.<br />
1.8 Transformations <strong>of</strong> several RVs<br />
Theorem. 1.8.1
If X_1, X_2 are discrete with joint probability function P(x_1, x_2), and Y = X_1 + X_2, then:
1. Y has probability function
P_Y(y) = ∑_x P(x, y − x).
2. If X_1, X_2 are independent,
P_Y(y) = ∑_x P_{X_1}(x) P_{X_2}(y − x).
Proof. (1)
P({Y = y}) = ∑_x P({Y = y} ∩ {X_1 = x})   (law of total probability: if A is an event and B_1, B_2, ... are events s.t. ∪_i B_i = S and B_i ∩ B_j = ∅ for j ≠ i, then P(A) = ∑_i P(A ∩ B_i) = ∑_i P(A|B_i)P(B_i))
  = ∑_x P({X_1 + X_2 = y} ∩ {X_1 = x})
  = ∑_x P({X_2 = y − x} ∩ {X_1 = x})
  = ∑_x P(x, y − x).

Proof. (2)
Just substitute
P(x, y − x) = P_{X_1}(x) P_{X_2}(y − x).
Theorem. 1.8.2
Suppose X_1, X_2 are continuous with PDF f(x_1, x_2), and let Y = X_1 + X_2. Then
1. f_Y(y) = ∫_{−∞}^{∞} f(x, y − x) dx.
2. If X_1, X_2 are independent, then
f_Y(y) = ∫_{−∞}^{∞} f_{X_1}(x) f_{X_2}(y − x) dx.

Proof. (1)
F_Y(y) = P(Y ≤ y) = P(X_1 + X_2 ≤ y)
  = ∫_{x_1=−∞}^{∞} ∫_{x_2=−∞}^{y−x_1} f(x_1, x_2) dx_2 dx_1.
Let x_2 = t − x_1, so that dx_2/dt = 1, i.e., dx_2 = dt:
  = ∫_{−∞}^{∞} ∫_{−∞}^{y} f(x_1, t − x_1) dt dx_1
  = ∫_{−∞}^{y} { ∫_{−∞}^{∞} f(x_1, t − x_1) dx_1 } dt
⇒ f_Y(y) = F′_Y(y) = ∫_{−∞}^{∞} f(x, y − x) dx.

Proof. (2)
Take f(x, y − x) = f_{X_1}(x) f_{X_2}(y − x) if X_1, X_2 independent.
Figure 14: Theorem 1.8.2<br />
Examples<br />
1. Suppose X_1 ∼ B(n_1, p) and X_2 ∼ B(n_2, p) independently. Find the probability function for Y = X_1 + X_2.
Solution.
P_Y(y) = ∑_x P_{X_1}(x) P_{X_2}(y − x)
  = ∑_{x=max(0, y−n_2)}^{min(n_1, y)} (n_1 choose x) p^x (1−p)^{n_1−x} (n_2 choose y−x) p^{y−x} (1−p)^{n_2+x−y}
  = p^y (1−p)^{n_1+n_2−y} ∑_{x=max(0, y−n_2)}^{min(n_1, y)} (n_1 choose x)(n_2 choose y−x)
  = (n_1+n_2 choose y) p^y (1−p)^{n_1+n_2−y} ∑_{x=max(0, y−n_2)}^{min(n_1, y)} [ (n_1 choose x)(n_2 choose y−x) / (n_1+n_2 choose y) ]
    (the sum is a hypergeometric probability function summed over its support, so it equals 1)
  = (n_1+n_2 choose y) p^y (1−p)^{n_1+n_2−y},
i.e., Y ∼ B(n_1 + n_2, p).
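Part 2 of Theorem 1.8.1 can be verified exactly for this example. The sketch below, using only the standard library, convolves the two binomial probability functions and compares with B(n_1 + n_2, p); the helper names and parameter values are ours, chosen for illustration:

```python
from math import comb

def binom_pmf(n, p, x):
    # B(n, p) probability function
    return comb(n, x) * p**x * (1 - p)**(n - x)

def convolve_pmf(p1, p2, support1, support2):
    # P_Y(y) = sum_x P_X1(x) P_X2(y - x)  (Theorem 1.8.1, part 2)
    out = {}
    for x in support1:
        for z in support2:
            out[x + z] = out.get(x + z, 0.0) + p1(x) * p2(z)
    return out

n1, n2, p = 4, 6, 0.3
py = convolve_pmf(lambda x: binom_pmf(n1, p, x),
                  lambda x: binom_pmf(n2, p, x),
                  range(n1 + 1), range(n2 + 1))

# Compare with B(n1 + n2, p) at every point of the support.
ok = all(abs(py[y] - binom_pmf(n1 + n2, p, y)) < 1e-12
         for y in range(n1 + n2 + 1))
print(ok)  # True
```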
2. Suppose Z_1 ∼ N(0, 1) and Z_2 ∼ N(0, 1) independently. Let X = Z_1 + Z_2. Find the PDF of X.
Solution.
f_X(x) = ∫_{−∞}^{∞} φ(z) φ(x − z) dz
  = ∫_{−∞}^{∞} (1/√(2π)) e^{−z²/2} (1/√(2π)) e^{−(x−z)²/2} dz
  = ∫_{−∞}^{∞} (1/2π) e^{−(z² + (x−z)²)/2} dz
  = ∫_{−∞}^{∞} (1/2π) e^{−(z − x/2)²} e^{−x²/4} dz
  = (1/(2√π)) e^{−x²/4} ∫_{−∞}^{∞} (1/√π) e^{−(z − x/2)²/(2 × (1/2))} dz
    (the remaining integrand is the N(x/2, 1/2) PDF, so the integral is 1)
  = (1/(2√π)) e^{−x²/4},
i.e., X ∼ N(0, 2).
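The convolution integral above can also be checked numerically. A small sketch assuming NumPy, evaluating ∫ φ(z)φ(x − z) dz by a Riemann sum on a truncated grid (the grid and test points are arbitrary):

```python
import numpy as np

# Standard normal density on a fine grid; tails beyond |z| = 12 are negligible.
phi = lambda z: np.exp(-z**2 / 2) / np.sqrt(2 * np.pi)
z = np.linspace(-12, 12, 4801)
dz = z[1] - z[0]

results = []
for x in (0.0, 1.0, 2.5):
    fx = np.sum(phi(z) * phi(x - z)) * dz          # numerical convolution
    target = np.exp(-x**2 / 4) / (2 * np.sqrt(np.pi))  # claimed N(0, 2) PDF
    results.append(abs(fx - target))
print(all(r < 1e-8 for r in results))  # True
```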
Theorem. 1.8.3 (Ratio of continuous RVs)
Suppose X_1, X_2 are continuous with joint PDF f(x_1, x_2), and let Y = X_2/X_1. Then Y has PDF
f_Y(y) = ∫_{−∞}^{∞} |x| f(x, yx) dx.
If X_1, X_2 are independent, we obtain
f_Y(y) = ∫_{−∞}^{∞} |x| f_{X_1}(x) f_{X_2}(yx) dx.
Proof.
F_Y(y) = P({Y ≤ y})
  = P({Y ≤ y} ∩ {X_1 < 0}) + P({Y ≤ y} ∩ {X_1 > 0})
  = P({X_2 ≥ yX_1} ∩ {X_1 < 0}) + P({X_2 ≤ yX_1} ∩ {X_1 > 0})
  = ∫_{−∞}^{0} ∫_{x_1 y}^{∞} f(x_1, x_2) dx_2 dx_1 + ∫_{0}^{∞} ∫_{−∞}^{x_1 y} f(x_1, x_2) dx_2 dx_1.
Substitute x_2 = t x_1 in both inner integrals; dx_2 = x_1 dt:
  = ∫_{−∞}^{0} ∫_{y}^{−∞} x_1 f(x_1, t x_1) dt dx_1 + ∫_{0}^{∞} ∫_{−∞}^{y} x_1 f(x_1, t x_1) dt dx_1
  = ∫_{−∞}^{0} ∫_{−∞}^{y} (−x_1) f(x_1, t x_1) dt dx_1 + ∫_{0}^{∞} ∫_{−∞}^{y} x_1 f(x_1, t x_1) dt dx_1
  = ∫_{−∞}^{0} ∫_{−∞}^{y} |x_1| f(x_1, t x_1) dt dx_1 + ∫_{0}^{∞} ∫_{−∞}^{y} |x_1| f(x_1, t x_1) dt dx_1
⇒ f_Y(y) = F′_Y(y) = ∫_{−∞}^{∞} |x_1| f(x_1, y x_1) dx_1.
Example<br />
1. If Z ∼ N(0, 1) and X ∼ χ²_k independently, then
T = Z / √(X/k)
is said to have the t-distribution with k degrees of freedom. Derive the PDF of T.
Solution. Step 1:
Let V = √(X/k). We need the PDF of V. Recall χ²_k is Gamma(k/2, 1/2), so
f_X(x) = ((1/2)^{k/2} / Γ(k/2)) x^{k/2−1} e^{−x/2}, for x > 0.
Now,
v = h(x) = √(x/k) ⇒ h^{−1}(v) = kv² and (h^{−1})′(v) = 2kv
⇒ f_V(v) = f_X(h^{−1}(v)) |(h^{−1})′(v)|
  = ((1/2)^{k/2} / Γ(k/2)) (kv²)^{k/2−1} e^{−kv²/2} (2kv)
  = (2(k/2)^{k/2} / Γ(k/2)) v^{k−1} e^{−kv²/2}, v > 0.

Step 2:
Apply Theorem 1.8.3 to find the PDF of T = Z/V:
f_T(t) = ∫_{−∞}^{∞} |v| f_V(v) f_Z(vt) dv
  = ∫_{0}^{∞} v (2(k/2)^{k/2} / Γ(k/2)) v^{k−1} e^{−kv²/2} (1/√(2π)) e^{−t²v²/2} dv
  = ((k/2)^{k/2} / (Γ(k/2)√(2π))) ∫_{0}^{∞} v^{k−1} e^{−(k+t²)v²/2} 2v dv;
substitute u = v² ⇒ du = 2v dv:
  = ((k/2)^{k/2} / (Γ(k/2)√(2π))) ∫_{0}^{∞} u^{(k−1)/2} e^{−(k+t²)u/2} du
  [ using ∫_{0}^{∞} u^{α−1} e^{−λu} du = Γ(α)/λ^α, with α = (k+1)/2 and λ = (k+t²)/2 ]
  = ((k/2)^{k/2} / (Γ(k/2)√(2π))) Γ((k+1)/2) / ((k+t²)/2)^{(k+1)/2}
  = (Γ((k+1)/2) / (Γ(k/2)√(2π))) (2/k)^{1/2} (1 + t²/k)^{−(k+1)/2}
  = (Γ((k+1)/2) / (Γ(k/2)√(kπ))) (1 + t²/k)^{−(k+1)/2}.
Remarks:<br />
1. If we take k = 1, we obtain
f(t) ∝ 1/(1 + t²);
that is, t_1 is the Cauchy distribution. One can also check that Γ(1)/(Γ(1/2)√π) = 1/π as required, so that f(t) = (1/π) × 1/(1 + t²).
2. As k → ∞, t_k → N(0, 1). To see this directly, consider the limit of the t-density as k → ∞, for t fixed:
lim_{k→∞} (1 + t²/k)^{−(k+1)/2} = lim_{k→∞} (1 + t²/k)^{−k/2};   let l = k/2
  = lim_{l→∞} (1 + (t²/2)/l)^{−l}
  = 1/e^{t²/2} = e^{−t²/2}.
Recall the standard limit:
lim_{n→∞} (1 + a/n)^n = e^a
⇒ f_T(t) → (1/√(2π)) e^{−t²/2} as k → ∞.
Suppose h : ℝ^r → ℝ^r is continuously differentiable. That is,
h(x_1, x_2, ..., x_r) = (h_1(x_1, ..., x_r), h_2(x_1, ..., x_r), ..., h_r(x_1, ..., x_r)),
where each h_i(x) is continuously differentiable. Let H be the r × r matrix of partial derivatives, with (i, j)th entry ∂h_i/∂x_j:
H = [ ∂h_1/∂x_1  ∂h_1/∂x_2  ...  ∂h_1/∂x_r
      ∂h_2/∂x_1  ∂h_2/∂x_2  ...  ∂h_2/∂x_r
      ...
      ∂h_r/∂x_1  ∂h_r/∂x_2  ...  ∂h_r/∂x_r ].
If H is invertible for all x, then there exists an inverse mapping g : ℝ^r → ℝ^r with the property that:
g(h(x)) = x.
It can be proved that the matrix of partial derivatives
G = (∂g_i/∂y_j) satisfies G = H^{−1}.
Theorem. 1.8.4
Suppose X_1, X_2, ..., X_r have joint PDF f_X(x_1, ..., x_r), and let h, g, G be as above. If Y = h(X), then Y has joint PDF
f_Y(y_1, y_2, ..., y_r) = f_X(g(y)) |det G(y)|.
Remark:
One can sometimes use H^{−1} instead of G, but care is needed to evaluate H^{−1}(x) at x = h^{−1}(y) = g(y).
1.8.1 Method of regular transformations
Suppose X = (X_1, ..., X_r)ᵀ and h : ℝ^r → ℝ^s, where s < r, and let
Y = (Y_1, ..., Y_s)ᵀ = h(X).
One approach is as follows:
1. Choose a function d : ℝ^r → ℝ^{r−s} and let Z = d(X) = (Z_1, Z_2, ..., Z_{r−s})ᵀ.
2. Apply Theorem 1.8.4 to (Y_1, Y_2, ..., Y_s, Z_1, Z_2, ..., Z_{r−s})ᵀ.
3. Integrate over Z_1, Z_2, ..., Z_{r−s} to get the marginal PDF for Y.
Examples<br />
Suppose Z_1 ∼ N(0, 1), Z_2 ∼ N(0, 1) independently. Consider h(z_1, z_2) = (r, θ)ᵀ, where
r = h_1(z_1, z_2) = √(z_1² + z_2²), and
θ = h_2(z_1, z_2) = { arctan(z_2/z_1)       for z_1 > 0
                    { arctan(z_2/z_1) + π   for z_1 < 0
                    { (π/2) sgn z_2         for z_1 = 0, z_2 ≠ 0
                    { 0                     for z_1 = z_2 = 0.
h maps ℝ² → [0, ∞) × [−π/2, 3π/2).
The inverse mapping, g, can be seen to be:
g(r, θ) = (g_1(r, θ), g_2(r, θ))ᵀ = (r cos θ, r sin θ)ᵀ.
To find the distribution of (R, Θ):
Step 1:
G(r, θ) = [ ∂g_1/∂r  ∂g_1/∂θ
            ∂g_2/∂r  ∂g_2/∂θ ]
with
∂g_1/∂r = ∂(r cos θ)/∂r = cos θ,   ∂g_1/∂θ = ∂(r cos θ)/∂θ = −r sin θ,
∂g_2/∂r = ∂(r sin θ)/∂r = sin θ,   ∂g_2/∂θ = ∂(r sin θ)/∂θ = r cos θ
⇒ G = [ cos θ   −r sin θ
        sin θ    r cos θ ]
⇒ det(G) = r cos²θ − (−r sin²θ) = r cos²θ + r sin²θ = r.
Step 2:
Now apply Theorem 1.8.4. Recall,
f_Z(z_1, z_2) = (1/√(2π)) e^{−z_1²/2} × (1/√(2π)) e^{−z_2²/2} = (1/2π) e^{−(z_1² + z_2²)/2}.
f_{R,Θ}(r, θ) = f_Z(g(r, θ)) |det G(r, θ)|
  = f_Z(r cos θ, r sin θ) |r|
  = { (1/2π) r e^{−r²/2}   for r ≥ 0, −π/2 ≤ θ < 3π/2
    { 0                    otherwise
  = f_Θ(θ) f_R(r),
where
f_Θ(θ) = { 1/2π   −π/2 ≤ θ < 3π/2
         { 0      otherwise,
f_R(r) = r e^{−r²/2} for r ≥ 0.
Hence, if Z_1, Z_2 are i.i.d. N(0, 1) and we transform to polar coordinates (R, Θ), we find:
1. R, Θ are independent.
2. Θ ∼ U(−π/2, 3π/2).
3. R has the distribution called the Rayleigh distribution.
It is easy to check that R² ∼ Exp(1/2) ≡ Gamma(1, 1/2) ≡ χ²_2.
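These conclusions are easy to probe by simulation; a minimal sketch assuming NumPy (sample size and tolerances are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
z1 = rng.standard_normal(200_000)
z2 = rng.standard_normal(200_000)
r2 = z1**2 + z2**2

# Exp(1/2) has mean 2 and survival function P(R^2 > c) = exp(-c/2).
print(abs(r2.mean() - 2) < 0.05)                   # True
print(abs((r2 > 3).mean() - np.exp(-1.5)) < 0.01)  # True

# The angle is uniform: e.g. the upper half-plane gets probability 1/2.
theta = np.arctan2(z2, z1)
print(abs((theta > 0).mean() - 0.5) < 0.01)        # True
```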
Example<br />
Suppose X_1 ∼ Gamma(α, λ) and X_2 ∼ Gamma(β, λ), independently. Find the distribution of Y = X_1/(X_1 + X_2).
Solution. Step 1: try using Z = X_1 + X_2.
Step 2: If h(x_1, x_2) = (y, z)ᵀ with y = x_1/(x_1 + x_2) and z = x_1 + x_2, then
g(y, z) = (yz, (1 − y)z)ᵀ,
G = [ z    y
      −z   1 − y ]
⇒ det(G) = z(1 − y) − (−zy) = z(1 − y) + zy = z,
where z > 0. (Why?)
f_{Y,Z}(y, z) = f_{X_1}(yz) f_{X_2}((1 − y)z) z
  = (λ^α/Γ(α)) (yz)^{α−1} e^{−λyz} (λ^β/Γ(β)) ((1 − y)z)^{β−1} e^{−λ(1−y)z} z
  = (1/(Γ(α)Γ(β))) λ^{α+β} y^{α−1} (1 − y)^{β−1} z^{α+β−1} e^{−λz}.

Step 3:
f_Y(y) = ∫_0^∞ f(y, z) dz
  = (Γ(α+β)/(Γ(α)Γ(β))) y^{α−1} (1 − y)^{β−1} ∫_0^∞ (λ^{α+β} z^{α+β−1}/Γ(α+β)) e^{−λz} dz
  = (Γ(α+β)/(Γ(α)Γ(β))) y^{α−1} (1 − y)^{β−1}, for 0 < y < 1,
since the remaining integrand is the Gamma(α+β, λ) PDF, which integrates to 1; i.e., Y ∼ Beta(α, β).
Exercise: Justify the range of values for y.
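A Monte Carlo check of this conclusion, assuming NumPy; note that NumPy's gamma sampler is parameterised by shape and scale, so scale = 1/λ here, and the parameter values are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(2)
alpha, beta, lam = 2.0, 5.0, 3.0
x1 = rng.gamma(shape=alpha, scale=1 / lam, size=300_000)
x2 = rng.gamma(shape=beta, scale=1 / lam, size=300_000)
y = x1 / (x1 + x2)

# Beta(a, b) has mean a/(a+b) and variance ab/((a+b)^2 (a+b+1)).
m = alpha / (alpha + beta)
v = alpha * beta / ((alpha + beta)**2 * (alpha + beta + 1))
print(abs(y.mean() - m) < 0.005)  # True
print(abs(y.var() - v) < 0.005)   # True
```

Note that the answer does not involve λ, which matches the derivation: λ cancels between the Gamma factors.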
1.9 Moments
Suppose X_1, X_2, ..., X_r are RVs. If Y = h(X) is defined by a real-valued function h, then to find E(Y) we can:
1. Find the distribution of Y.
2. Calculate
E(Y) = { ∫_{−∞}^{∞} y f_Y(y) dy   continuous
       { ∑_y y p(y)              discrete.
Theorem. 1.9.1
If h and X_1, ..., X_r are as above, then provided it exists,
E{h(X)} = { ∑_{x_1} ∑_{x_2} ... ∑_{x_r} h(x_1, ..., x_r) p(x_1, ..., x_r)   if X_1, ..., X_r discrete
          { ∫_{−∞}^{∞} ... ∫_{−∞}^{∞} h(x_1, ..., x_r) f(x_1, ..., x_r) dx_1 ... dx_r   if X_1, ..., X_r continuous.

Theorem. 1.9.2
Suppose X_1, ..., X_r are RVs, h_1, ..., h_k are real-valued functions and a_1, ..., a_k are constants. Then, provided it exists,
E{a_1 h_1(X) + a_2 h_2(X) + ··· + a_k h_k(X)} = a_1 E[h_1(X)] + a_2 E[h_2(X)] + ··· + a_k E[h_k(X)].
Proof. (Continuous case)
E{a_1 h_1(X) + ··· + a_k h_k(X)} = ∫_{−∞}^{∞} ... ∫_{−∞}^{∞} {a_1 h_1(x) + ··· + a_k h_k(x)} f_X(x) dx_1 dx_2 ... dx_r
  = a_1 ∫_{−∞}^{∞} ... ∫_{−∞}^{∞} h_1(x) f_X(x) dx_1 ... dx_r + a_2 ∫_{−∞}^{∞} ... ∫_{−∞}^{∞} h_2(x) f_X(x) dx_1 ... dx_r + ··· + a_k ∫_{−∞}^{∞} ... ∫_{−∞}^{∞} h_k(x) f_X(x) dx_1 ... dx_r
  = a_1 E[h_1(X)] + a_2 E[h_2(X)] + ··· + a_k E[h_k(X)],
as required.
Corollary.
Provided it exists,
E(a_1 X_1 + a_2 X_2 + ··· + a_r X_r) = a_1 E[X_1] + a_2 E[X_2] + ··· + a_r E[X_r].
Definition. 1.9.1
If X_1, X_2 are RVs with E[X_i] = µ_i and Var(X_i) = σ_i², i = 1, 2, we define
σ_12 = Cov(X_1, X_2) = E[(X_1 − µ_1)(X_2 − µ_2)] = E[X_1 X_2] − µ_1 µ_2,
ρ_12 = Corr(X_1, X_2) = σ_12/(σ_1 σ_2).
Remark:
Cov(X, X) = Var(X) = E[X²] − (E[X])².
In some contexts, it is convenient to use the notation σ_ii = Var(X_i) instead of σ_i².
Theorem. 1.9.3
Suppose X_1, X_2, ..., X_r are RVs with E[X_i] = µ_i and Cov(X_i, X_j) = σ_ij, and let a_1, a_2, ..., a_r, b_1, b_2, ..., b_r be constants. Then
Cov( ∑_{i=1}^{r} a_i X_i, ∑_{j=1}^{r} b_j X_j ) = ∑_{i=1}^{r} ∑_{j=1}^{r} a_i b_j σ_ij.
Proof.
Cov( ∑_i a_i X_i, ∑_j b_j X_j ) = E[ {∑_i a_i X_i − E(∑_i a_i X_i)} {∑_j b_j X_j − E(∑_j b_j X_j)} ]
  = E{ (∑_i a_i X_i − ∑_i a_i µ_i)(∑_j b_j X_j − ∑_j b_j µ_j) }
  = E[ {∑_i a_i (X_i − µ_i)} {∑_j b_j (X_j − µ_j)} ]
  = E{ ∑_i ∑_j a_i b_j (X_i − µ_i)(X_j − µ_j) }
  = ∑_i ∑_j a_i b_j E{(X_i − µ_i)(X_j − µ_j)}   (Theorem 1.9.2)
  = ∑_{i=1}^{r} ∑_{j=1}^{r} a_i b_j σ_ij,
as required.
Corollary. 1
Under the above assumptions,
Var( ∑_{i=1}^{r} a_i X_i ) = ∑_{i=1}^{r} ∑_{j=1}^{r} a_i a_j σ_ij = ∑_{i=1}^{r} a_i² σ_i² + 2 ∑_{i<j} a_i a_j σ_ij,
since the variance is the covariance of the sum with itself.

Corollary. 2
For RVs X_1, X_2 with correlation ρ = ρ_12, we have |ρ| ≤ 1, with |ρ| = 1 only if X_1 is a linear function of X_2 with probability 1.
Proof.
Consider 0 ≤ Var(X_1 − tX_2) (for any constant t)
  = σ_1² + t² σ_2² − 2t σ_12 (by Corollary 1).
Hence the quadratic
q(t) = σ_2² t² − 2σ_12 t + σ_1² ≥ 0, for all t,
so q(t) has either no real roots (∆ = b² − 4ac < 0) or one real root (∆ = b² − 4ac = 0). Hence ∆ ≤ 0:
4σ_12² − 4σ_2² σ_1² ≤ 0
⇒ 4σ_12² ≤ 4σ_1² σ_2²
⇒ σ_12²/(σ_1² σ_2²) ≤ 1   (σ's positive)
⇒ ρ² ≤ 1 ⇒ |ρ| ≤ 1.
Suppose |ρ| = 1. Then ∆ = 0, so there is a single t_0 such that
0 = q(t_0) = Var(X_1 − t_0 X_2),
i.e., X_1 = t_0 X_2 + c with probability 1.
Theorem. 1.9.4
If X_1, X_2 are independent, and Var(X_1), Var(X_2) exist, then Cov(X_1, X_2) = 0.
Proof. (Continuous case)
Cov(X_1, X_2) = E{(X_1 − µ_1)(X_2 − µ_2)}
  = ∫_{−∞}^{∞} ∫_{−∞}^{∞} (x_1 − µ_1)(x_2 − µ_2) f(x_1, x_2) dx_1 dx_2
  = ∫_{−∞}^{∞} ∫_{−∞}^{∞} (x_1 − µ_1)(x_2 − µ_2) f_{X_1}(x_1) f_{X_2}(x_2) dx_1 dx_2   (since X_1, X_2 independent)
  = ( ∫_{−∞}^{∞} (x_1 − µ_1) f_{X_1}(x_1) dx_1 )( ∫_{−∞}^{∞} (x_2 − µ_2) f_{X_2}(x_2) dx_2 )
  = 0 × 0
  = 0.
Remark:<br />
But the converse does NOT hold, in general!
That is, Cov(X_1, X_2) = 0 ⇏ X_1, X_2 independent.
Definition. 1.9.2
If X_1, X_2 are RVs, we define the symbol E[X_1|X_2] to be the expectation of X_1 calculated with respect to the conditional distribution of X_1|X_2, i.e.,
E[X_1|X_2] = { ∑_{x_1} x_1 P_{X_1|X_2}(x_1|x_2)            X_1|X_2 discrete
            { ∫_{−∞}^{∞} x_1 f_{X_1|X_2}(x_1|x_2) dx_1     X_1|X_2 continuous.

Theorem. 1.9.5
Provided the relevant moments exist,
E(X_1) = E_{X_2}{E(X_1|X_2)},
Var(X_1) = E_{X_2}{Var(X_1|X_2)} + Var_{X_2}{E(X_1|X_2)}.
Proof.
1. (Continuous case)
E_{X_2}{E(X_1|X_2)} = ∫_{−∞}^{∞} E(X_1|X_2 = x_2) f_{X_2}(x_2) dx_2
  = ∫_{−∞}^{∞} ( ∫_{−∞}^{∞} x_1 f(x_1|x_2) dx_1 ) f_{X_2}(x_2) dx_2
  = ∫_{−∞}^{∞} ∫_{−∞}^{∞} x_1 f(x_1|x_2) f_{X_2}(x_2) dx_1 dx_2
  = ∫_{−∞}^{∞} x_1 ( ∫_{−∞}^{∞} f(x_1, x_2) dx_2 ) dx_1   (the inner integral is f_{X_1}(x_1))
  = ∫_{−∞}^{∞} x_1 f_{X_1}(x_1) dx_1
  = E(X_1).
1.9.1 Moment generating functions<br />
Definition. 1.9.3
If X_1, X_2, ..., X_r are RVs, then the joint MGF (provided it exists) is given by
M_X(t) = M_X(t_1, t_2, ..., t_r) = E[ e^{t_1 X_1 + t_2 X_2 + ··· + t_r X_r} ].

Theorem. 1.9.6
If X_1, X_2, ..., X_r are mutually independent, then the joint MGF satisfies
M_X(t_1, ..., t_r) = M_{X_1}(t_1) M_{X_2}(t_2) ... M_{X_r}(t_r),
provided it exists.
Proof. (Theorem 1.9.6, continuous case)
M_X(t_1, ..., t_r) = ∫_{−∞}^{∞} ... ∫_{−∞}^{∞} e^{t_1 x_1 + t_2 x_2 + ··· + t_r x_r} f_X(x_1, ..., x_r) dx_1 ... dx_r
  = ∫_{−∞}^{∞} ... ∫_{−∞}^{∞} e^{t_1 x_1} e^{t_2 x_2} ... e^{t_r x_r} f_{X_1}(x_1) f_{X_2}(x_2) ... f_{X_r}(x_r) dx_1 ... dx_r
  = ( ∫_{−∞}^{∞} e^{t_1 x_1} f_{X_1}(x_1) dx_1 )( ∫_{−∞}^{∞} e^{t_2 x_2} f_{X_2}(x_2) dx_2 ) ... ( ∫_{−∞}^{∞} e^{t_r x_r} f_{X_r}(x_r) dx_r )
  = M_{X_1}(t_1) M_{X_2}(t_2) ... M_{X_r}(t_r), as required.
We saw previously that if Y = h(X), then we can find M_Y(t) = E[e^{t h(X)}] from the joint distribution of X_1, ..., X_r without calculating f_Y(y) explicitly. A simple, but important, case is
Y = X_1 + X_2 + ··· + X_r.

Theorem. 1.9.7
Suppose X_1, ..., X_r are RVs and let Y = X_1 + X_2 + ··· + X_r. Then (assuming the MGFs exist):
1. M_Y(t) = M_X(t, t, t, ..., t).
2. If X_1, ..., X_r are independent, then M_Y(t) = M_{X_1}(t) M_{X_2}(t) ... M_{X_r}(t).
3. If X_1, ..., X_r are independent and identically distributed with common MGF M_X(t), then:
M_Y(t) = [M_X(t)]^r.
Proof.
M_Y(t) = E[e^{tY}]
  = E[e^{t(X_1 + X_2 + ··· + X_r)}]
  = E[e^{tX_1 + tX_2 + ··· + tX_r}]
  = M_X(t, t, t, ..., t)   (using Definition 1.9.3).
For parts (2) and (3), substitute into (1) and use Theorem 1.9.6.
Examples
Consider RVs X, V defined by V ∼ Gamma(α, λ) and the conditional distribution X|V ∼ Poisson(V).
Find E(X) and Var(X).
Solution. Use
E(X) = E{E(X|V)},
Var(X) = E_V{Var(X|V)} + Var_V{E(X|V)}.
Since X|V is Poisson with mean V,
E(X|V) = V and Var(X|V) = V,
so E(X) = E{E(X|V)} = E(V) = α/λ, and
Var(X) = E_V{Var(X|V)} + Var_V{E(X|V)}
  = E_V(V) + Var_V(V)
  = α/λ + α/λ²
  = (α/λ)(1 + 1/λ) = α(1 + λ)/λ².
Remark<br />
The marginal distribution of X is sometimes called the negative binomial distribution. In particular, when α is an integer, it corresponds to the definition given previously, with
p = λ/(1 + λ).
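The two moment formulas derived above can be checked by simulating the Gamma mixture of Poissons; a sketch assuming NumPy (note rng.gamma is parameterised by shape and scale, so scale = 1/λ, and the parameter values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
alpha, lam = 3.0, 2.0

# Draw V ~ Gamma(alpha, lam), then X | V ~ Poisson(V).
v = rng.gamma(shape=alpha, scale=1 / lam, size=400_000)
x = rng.poisson(v)

# Theorem 1.9.5 gives E(X) = alpha/lam and Var(X) = alpha(1 + lam)/lam^2.
print(abs(x.mean() - alpha / lam) < 0.02)                # True
print(abs(x.var() - alpha * (1 + lam) / lam**2) < 0.05)  # True
```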
Examples
1. Suppose X ∼ Bernoulli with parameter p. Then M_X(t) = 1 + p(e^t − 1). Now suppose X_1, X_2, ..., X_n are i.i.d. Bernoulli with parameter p and
Y = X_1 + X_2 + ··· + X_n;
then M_Y(t) = (1 + p(e^t − 1))^n, which agrees with the formula previously given for the binomial distribution.
2. Suppose X_1 ∼ N(µ_1, σ_1²) and X_2 ∼ N(µ_2, σ_2²) independently. Find the MGF of Y = X_1 + X_2.
Solution. Recall that
M_{X_1}(t) = e^{tµ_1} e^{t²σ_1²/2},   M_{X_2}(t) = e^{tµ_2} e^{t²σ_2²/2}
⇒ M_Y(t) = M_{X_1}(t) M_{X_2}(t) = e^{t(µ_1+µ_2)} e^{t²(σ_1²+σ_2²)/2}
⇒ Y ∼ N(µ_1 + µ_2, σ_1² + σ_2²).
1.9.2 Marginal distributions and the MGF
To find the moment generating function of the marginal distribution of any set of components of X, set to 0 the complementary elements of t in M_X(t).
Let X = (X_1, X_2)ᵀ and t = (t_1, t_2)ᵀ. Then
M_{X_1}(t_1) = M_X((t_1, 0)ᵀ).
To see this result: note that if A is a constant matrix, and b is a constant vector, then
M_{AX+b}(t) = e^{tᵀb} M_X(Aᵀt).
Proof.
M_{AX+b}(t) = E[e^{tᵀ(AX+b)}] = e^{tᵀb} E[e^{tᵀAX}]
  = e^{tᵀb} E[e^{(Aᵀt)ᵀX}] = e^{tᵀb} M_X(Aᵀt).
Now partition
t_{r×1} = (t_1, t_2)ᵀ
and
A_{l×r} = (I_{l×l}  0_{l×m}).
Note that
AX = (I_{l×l}  0_{l×m}) (X_1, X_2)ᵀ = X_1,
and
Aᵀt_1 = (I_{l×l}; 0_{m×l}) t_1 = (t_1, 0)ᵀ.
Hence,
M_{X_1}(t_1) = M_{AX}(t_1) = M_X(Aᵀt_1) = M_X((t_1, 0)ᵀ),
as required.
Note that similar results hold for more than two random subvectors.
Note that similar results hold for more than two random subvectors.<br />
The major limitation of the MGF is that it may not exist. The characteristic function, on the other hand, is defined for all distributions. Its definition is similar to that of the MGF, with it replacing t, where i = √−1; the properties of the characteristic function are similar to those of the MGF, but using it requires some familiarity with complex analysis.
1.9.3 Vector notation
Consider the random vector
X = (X_1, X_2, ..., X_r)ᵀ,
with E[X_i] = µ_i, Var(X_i) = σ_i² = σ_ii, and Cov(X_i, X_j) = σ_ij.
Define the mean vector by:
µ = E(X) = (µ_1, µ_2, ..., µ_r)ᵀ
and the variance matrix (covariance matrix) by:
Σ = Var(X) = [σ_ij], i = 1, ..., r, j = 1, ..., r.
Finally, the correlation matrix is defined by:
R = [ 1   ρ_12   ...   ρ_1r
          1      ...   ...
                 ...   ρ_{r−1,r}
                       1 ]   (symmetric),
where ρ_ij = σ_ij/(√σ_ii √σ_jj).
Now suppose that a = (a_1, ..., a_r)ᵀ is a vector of constants and observe that:
aᵀx = a_1 x_1 + a_2 x_2 + ··· + a_r x_r = ∑_{i=1}^{r} a_i x_i.
Theorem. 1.9.8
Suppose X has E(X) = µ and Var(X) = Σ, and let a, b ∈ ℝ^r be fixed vectors. Then,
1. E(aᵀX) = aᵀµ
2. Var(aᵀX) = aᵀΣa
3. Cov(aᵀX, bᵀX) = aᵀΣb
Remark
It is easy to check that this is just a re-statement of Theorem 1.9.3 using matrix notation.
Theorem. 1.9.9<br />
Suppose X is a random vector with E(X) = µ and Var(X) = Σ, and let A_{p×r} and b ∈ ℝ^p be fixed.
If Y = AX + b, then
E(Y) = Aµ + b and Var(Y) = AΣAᵀ.
Remark<br />
This is also a re-statement <strong>of</strong> previously established results. To see this, observe that<br />
if a T i is the i th row <strong>of</strong> A, we see that Y i = a T i X and, moreover, the (i, j) th element <strong>of</strong><br />
AΣA T = a T i Σa j = Cov ( a T i X, a T j X )<br />
= Cov(Y i , Y j ).<br />
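Theorem 1.9.9 can be checked empirically for one concrete choice of A, b, µ, Σ (all invented here for illustration; normality is used only as a convenient way to sample with a prescribed mean vector and variance matrix):

```python
import numpy as np

rng = np.random.default_rng(4)

mu = np.array([1.0, -2.0, 0.5])
Sigma = np.array([[2.0, 0.3, 0.0],
                  [0.3, 1.0, -0.4],
                  [0.0, -0.4, 1.5]])   # positive definite
A = np.array([[1.0, 2.0, 0.0],
              [0.0, 1.0, -1.0]])
b = np.array([0.5, -1.0])

X = rng.multivariate_normal(mu, Sigma, size=500_000)
Y = X @ A.T + b                         # Y = AX + b, row by row

# Theorem 1.9.9: E(Y) = A mu + b and Var(Y) = A Sigma A^T.
print(np.allclose(Y.mean(axis=0), A @ mu + b, atol=0.02))    # True
print(np.allclose(np.cov(Y.T), A @ Sigma @ A.T, atol=0.05))  # True
```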
1.9.4 Properties of variance matrices
If Σ = Var(X) for some random vector X = (X_1, X_2, ..., X_r)ᵀ, then it must satisfy certain properties.
Since Cov(X_i, X_j) = Cov(X_j, X_i), it follows that Σ is a square (r × r) symmetric matrix.
Definition. 1.9.4<br />
The square, symmetric matrix M is said to be positive definite [non-negative definite]<br />
if a T Ma > 0 [≥ 0] for every vector a ∈ R r s.t. a ≠ 0.<br />
=⇒ It is necessary and sufficient that Σ be non-negative definite in order that it can<br />
be a variance matrix.<br />
=⇒ To see the necessity <strong>of</strong> this condition, consider the linear combination a T X.<br />
By Theorem 1.9.8, we have<br />
=⇒ Σ must be non-negative definite.<br />
0 ≤ Var(a T X) = a T Σa for every a;<br />
Suppose $\lambda$ is an eigenvalue of $\Sigma$, and let $a$ be the corresponding eigenvector. Then
\[ \Sigma a = \lambda a \implies a^T \Sigma a = \lambda a^T a = \lambda \|a\|^2. \]
Hence $\Sigma$ is non-negative definite (positive definite) iff its eigenvalues are all non-negative (positive).
If $\Sigma$ is non-negative definite but not positive definite, then there must be at least one zero eigenvalue. Let $a$ be the corresponding eigenvector.
$\implies$ $a \neq 0$ but $a^T \Sigma a = 0$. That is, $\operatorname{Var}(a^T X) = 0$ for that $a$.
$\implies$ the distribution of $X$ is degenerate, in the sense that one of the $X_i$'s is either constant or a linear combination of the other components.
Finally, recall that if $\lambda_1, \ldots, \lambda_r$ are the eigenvalues of $\Sigma$, then $\det(\Sigma) = \prod_{i=1}^r \lambda_i$, and so
\[ \det(\Sigma) > 0 \text{ for } \Sigma \text{ positive definite}, \qquad \det(\Sigma) = 0 \text{ for } \Sigma \text{ non-negative definite but not positive definite}. \]
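The eigenvalue criterion gives a practical test of whether a candidate matrix can be a variance matrix. A minimal sketch (the matrices below are illustrative), including a degenerate case of the kind just described:

```python
import numpy as np

def is_valid_variance_matrix(M, tol=1e-10):
    """A symmetric matrix is a valid variance matrix iff it is non-negative
    definite, i.e. all its eigenvalues are >= 0."""
    if not np.allclose(M, M.T):
        return False
    return bool(np.all(np.linalg.eigvalsh(M) >= -tol))

Sigma_pd = np.array([[2.0, 0.5], [0.5, 1.0]])     # positive definite
# Degenerate case: X2 = -X1, so a = (1, 1)^T gives Var(a^T X) = 0.
Sigma_deg = np.array([[1.0, -1.0], [-1.0, 1.0]])  # eigenvalues 2 and 0
Sigma_bad = np.array([[1.0, 2.0], [2.0, 1.0]])    # eigenvalues 3 and -1

print(is_valid_variance_matrix(Sigma_pd))   # True
print(is_valid_variance_matrix(Sigma_deg))  # True (non-negative definite, det = 0)
print(is_valid_variance_matrix(Sigma_bad))  # False
```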
1.10 The multivariate normal distribution
Definition. 1.10.1
The random vector $X = (X_1, \ldots, X_r)^T$ is said to have the $r$-dimensional multivariate normal distribution with parameters $\mu \in \mathbb{R}^r$ and $\Sigma_{r \times r}$ positive definite, if it has PDF
\[ f_X(x) = \frac{1}{(2\pi)^{r/2} |\Sigma|^{1/2}}\, e^{-\frac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu)}. \]
We write $X \sim N_r(\mu, \Sigma)$.
Examples
1. $r = 2$: the bivariate normal distribution.
Let
\[ \mu = \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}, \qquad \Sigma = \begin{bmatrix} \sigma_1^2 & \rho\sigma_1\sigma_2 \\ \rho\sigma_1\sigma_2 & \sigma_2^2 \end{bmatrix} \]
\[ \implies |\Sigma| = \sigma_1^2 \sigma_2^2 (1 - \rho^2) \quad \text{and} \quad \Sigma^{-1} = \frac{1}{\sigma_1^2 \sigma_2^2 (1 - \rho^2)} \begin{pmatrix} \sigma_2^2 & -\rho\sigma_1\sigma_2 \\ -\rho\sigma_1\sigma_2 & \sigma_1^2 \end{pmatrix} \]
\[ \implies f(x_1, x_2) = \frac{1}{2\pi\sigma_1\sigma_2\sqrt{1-\rho^2}} \exp\left\{ \frac{-1}{2(1-\rho^2)} \left[ \left(\frac{x_1-\mu_1}{\sigma_1}\right)^2 + \left(\frac{x_2-\mu_2}{\sigma_2}\right)^2 - 2\rho\left(\frac{x_1-\mu_1}{\sigma_1}\right)\left(\frac{x_2-\mu_2}{\sigma_2}\right) \right] \right\}. \]
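As a sanity check, the expanded bivariate form can be compared with the general matrix form of the density from Definition 1.10.1 at a point; the parameter values below are arbitrary.

```python
import numpy as np

def mvn_pdf(x, mu, Sigma):
    """General N_r(mu, Sigma) density from Definition 1.10.1."""
    r = len(mu)
    d = x - mu
    return np.exp(-0.5 * d @ np.linalg.solve(Sigma, d)) / \
           np.sqrt((2 * np.pi) ** r * np.linalg.det(Sigma))

def bvn_pdf(x1, x2, mu1, mu2, s1, s2, rho):
    """Bivariate normal density in the expanded (mu1, mu2, sigma1, sigma2, rho) form."""
    z1, z2 = (x1 - mu1) / s1, (x2 - mu2) / s2
    q = (z1**2 + z2**2 - 2 * rho * z1 * z2) / (1 - rho**2)
    return np.exp(-0.5 * q) / (2 * np.pi * s1 * s2 * np.sqrt(1 - rho**2))

mu1, mu2, s1, s2, rho = 0.0, 1.0, 1.5, 0.8, 0.6
Sigma = np.array([[s1**2, rho * s1 * s2], [rho * s1 * s2, s2**2]])
x = np.array([0.7, 1.3])

print(np.isclose(bvn_pdf(x[0], x[1], mu1, mu2, s1, s2, rho),
                 mvn_pdf(x, np.array([mu1, mu2]), Sigma)))  # True
```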
2. If $Z_1, Z_2, \ldots, Z_r$ are i.i.d. $N(0,1)$ and $Z = (Z_1, \ldots, Z_r)^T$, then
\[ f_Z(z) = \prod_{i=1}^r \frac{1}{\sqrt{2\pi}} e^{-z_i^2/2} = \frac{1}{(2\pi)^{r/2}} e^{-\frac{1}{2}\sum_{i=1}^r z_i^2} = \frac{1}{(2\pi)^{r/2}} e^{-\frac{1}{2} z^T z}, \]
which is the $N_r(0_r, I_{r \times r})$ PDF.
Theorem. 1.10.1
Suppose $X \sim N_r(\mu, \Sigma)$ and let $Y = AX + b$, where $b \in \mathbb{R}^r$ and $A_{r \times r}$ invertible are fixed. Then $Y \sim N_r(A\mu + b, A\Sigma A^T)$.
Proof.
We use the transformation rule for PDFs: if $X$ has joint PDF $f_X(x)$ and $Y = g(X)$ for $g : \mathbb{R}^r \to \mathbb{R}^r$ invertible and continuously differentiable, then $f_Y(y) = f_X(h(y))\,|H|$, where $h = g^{-1}$ and $H = \left( \frac{\partial h_i}{\partial y_j} \right)$.
In this case, we have $g(x) = Ax + b$
\[ \implies h(y) = A^{-1}(y - b) \quad [\text{solving for } x \text{ in } y = Ax + b] \quad \implies H = A^{-1}. \]
Hence,
\begin{align*}
f_Y(y) &= \frac{1}{(2\pi)^{r/2} |\Sigma|^{1/2}}\, e^{-\frac{1}{2}(A^{-1}(y-b)-\mu)^T \Sigma^{-1} (A^{-1}(y-b)-\mu)}\, |A^{-1}| \\
&= \frac{1}{(2\pi)^{r/2} |\Sigma|^{1/2} |AA^T|^{1/2}}\, e^{-\frac{1}{2}(y-(A\mu+b))^T (A^{-1})^T \Sigma^{-1} A^{-1} (y-(A\mu+b))} \\
&= \frac{1}{(2\pi)^{r/2} |A\Sigma A^T|^{1/2}}\, e^{-\frac{1}{2}(y-(A\mu+b))^T (A\Sigma A^T)^{-1} (y-(A\mu+b))},
\end{align*}
which is exactly the PDF for $N_r(A\mu + b, A\Sigma A^T)$.
Now suppose $\Sigma_{r \times r}$ is any symmetric positive definite matrix and recall that we can write $\Sigma = E \Lambda E^T$, where $\Lambda = \operatorname{diag}(\lambda_1, \lambda_2, \ldots, \lambda_r)$ and $E_{r \times r}$ is such that $EE^T = E^T E = I$.
Since $\Sigma$ is positive definite, we must have $\lambda_i > 0$, $i = 1, \ldots, r$, and we can define the symmetric square-root matrix by
\[ \Sigma^{1/2} = E \Lambda^{1/2} E^T, \quad \text{where } \Lambda^{1/2} = \operatorname{diag}(\sqrt{\lambda_1}, \sqrt{\lambda_2}, \ldots, \sqrt{\lambda_r}). \]
Note that $\Sigma^{1/2}$ is symmetric and satisfies
\[ \left(\Sigma^{1/2}\right)^2 = \Sigma^{1/2} \left(\Sigma^{1/2}\right)^T = \Sigma. \]
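The symmetric square root can be computed exactly as above from the eigendecomposition; a minimal NumPy sketch with an illustrative $\Sigma$:

```python
import numpy as np

def sym_sqrt(Sigma):
    """Symmetric square root: Sigma^{1/2} = E Lambda^{1/2} E^T."""
    lam, E = np.linalg.eigh(Sigma)        # Sigma = E diag(lam) E^T, E orthogonal
    return E @ np.diag(np.sqrt(lam)) @ E.T

Sigma = np.array([[4.0, 1.0], [1.0, 2.0]])
S_half = sym_sqrt(Sigma)

print(np.allclose(S_half, S_half.T))        # True: Sigma^{1/2} is symmetric
print(np.allclose(S_half @ S_half, Sigma))  # True: (Sigma^{1/2})^2 = Sigma
```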
Now recall that if $Z_1, Z_2, \ldots, Z_r$ are i.i.d. $N(0,1)$ then $Z = (Z_1, \ldots, Z_r)^T \sim N_r(0, I)$. Because of the i.i.d. $N(0,1)$ assumption, we know in this case that $E(Z) = 0$ and $\operatorname{Var}(Z) = I$.
Now, let $X = \Sigma^{1/2} Z + \mu$:
1. From Theorem 1.9.9 we have
\[ E(X) = \Sigma^{1/2} E(Z) + \mu = \mu \quad \text{and} \quad \operatorname{Var}(X) = \Sigma^{1/2} \operatorname{Var}(Z) \left(\Sigma^{1/2}\right)^T = \Sigma. \]
2. From Theorem 1.10.1,
\[ X \sim N_r(\mu, \Sigma). \]
Since this construction is valid for any symmetric positive definite $\Sigma$ and any $\mu \in \mathbb{R}^r$, we have proved:
Theorem. 1.10.2
If $X \sim N_r(\mu, \Sigma)$ then $E(X) = \mu$ and $\operatorname{Var}(X) = \Sigma$.
Theorem. 1.10.3
If $X \sim N_r(\mu, \Sigma)$ then $Z = \Sigma^{-1/2}(X - \mu) \sim N_r(0, I)$.
Proof.
Use Theorem 1.10.1.
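The construction $X = \Sigma^{1/2} Z + \mu$ is also how multivariate normal samples can be generated in practice, and Theorem 1.10.3 says that standardising recovers $N_r(0, I)$. A simulation sketch with illustrative $\mu$ and $\Sigma$:

```python
import numpy as np

rng = np.random.default_rng(1)

mu = np.array([2.0, -1.0])
Sigma = np.array([[4.0, 1.0], [1.0, 2.0]])

# Sigma^{1/2} from the eigendecomposition, then X = Sigma^{1/2} Z + mu
lam, E = np.linalg.eigh(Sigma)
S_half = E @ np.diag(np.sqrt(lam)) @ E.T

Z = rng.standard_normal((100_000, 2))   # rows are i.i.d. N_2(0, I) draws
X = Z @ S_half + mu                     # S_half is symmetric, so S_half.T = S_half

print(np.allclose(X.mean(axis=0), mu, atol=0.03))            # True
print(np.allclose(np.cov(X, rowvar=False), Sigma, atol=0.05))  # True

# Standardising recovers N_2(0, I) (Theorem 1.10.3)
Zback = (X - mu) @ np.linalg.inv(S_half)
print(np.allclose(np.cov(Zback, rowvar=False), np.eye(2), atol=0.02))  # True
```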
Suppose
\[ \begin{pmatrix} X_1 \\ X_2 \end{pmatrix} \sim N_{r_1 + r_2} \left( \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}, \begin{bmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{bmatrix} \right), \]
where $X_i, \mu_i$ are of dimension $r_i$ and $\Sigma_{ij}$ is $r_i \times r_j$. In order that $\Sigma$ be symmetric, we must have $\Sigma_{11}, \Sigma_{22}$ symmetric and $\Sigma_{12} = \Sigma_{21}^T$.
We will derive the marginal distribution of $X_2$ and the conditional distribution of $X_1 | X_2$.
Lemma. 1.10.1
If $M = \begin{bmatrix} B & A \\ O & I \end{bmatrix}$ is a square, partitioned matrix, then $|M| = |B|$ (where $|\cdot|$ denotes the determinant).
Lemma. 1.10.2
If $M = \begin{bmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{bmatrix}$ then $|M| = |\Sigma_{22}|\, |\Sigma_{11} - \Sigma_{12} \Sigma_{22}^{-1} \Sigma_{21}|$.
Proof. (of Lemma 1.10.2)
Let
\[ C = \begin{bmatrix} I & -\Sigma_{12}\Sigma_{22}^{-1} \\ 0 & \Sigma_{22}^{-1} \end{bmatrix} \]
and observe
\[ CM = \begin{bmatrix} \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21} & 0 \\ \Sigma_{22}^{-1}\Sigma_{21} & I \end{bmatrix}, \]
and
\[ |C| = \frac{1}{|\Sigma_{22}|}, \qquad |CM| = |\Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}|. \]
Finally, observe $|CM| = |C||M|$
\[ \implies |M| = \frac{|CM|}{|C|} = |\Sigma_{22}|\, |\Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}|. \]
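Lemma 1.10.2 can be verified numerically; the blocks below are illustrative values forming a symmetric partitioned matrix.

```python
import numpy as np

# Blocks of a partitioned symmetric matrix (illustrative values)
S11 = np.array([[2.0, 0.3], [0.3, 1.5]])
S12 = np.array([[0.4], [0.1]])
S22 = np.array([[1.0]])

M = np.block([[S11, S12], [S12.T, S22]])

lhs = np.linalg.det(M)
rhs = np.linalg.det(S22) * np.linalg.det(S11 - S12 @ np.linalg.inv(S22) @ S12.T)

print(np.isclose(lhs, rhs))  # True: |M| = |S22| |S11 - S12 S22^{-1} S21|
```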
Theorem. 1.10.4
Suppose that
\[ \begin{pmatrix} X_1 \\ X_2 \end{pmatrix} \sim N_{r_1 + r_2} \left( \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}, \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix} \right). \]
Then the marginal distribution of $X_2$ is $X_2 \sim N_{r_2}(\mu_2, \Sigma_{22})$ and the conditional distribution of $X_1 | X_2$ is
\[ N_{r_1}\left( \mu_1 + \Sigma_{12}\Sigma_{22}^{-1}(x_2 - \mu_2),\; \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21} \right). \]
Proof.
We will exhibit $f(x_1, x_2)$ in the form
\[ f(x_1, x_2) = h(x_1, x_2)\, g(x_2), \]
where $h(x_1, x_2)$ is a PDF with respect to $x_1$ for each (given) $x_2$. It follows then that $f_{X_1|X_2}(x_1|x_2) = h(x_1, x_2)$ and $g(x_2) = f_{X_2}(x_2)$.
Now observe that
\[ f(x_1, x_2) = \frac{1}{(2\pi)^{(r_1+r_2)/2} \left| \begin{smallmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{smallmatrix} \right|^{1/2}} \exp\left[ -\frac{1}{2} \begin{pmatrix} x_1 - \mu_1 \\ x_2 - \mu_2 \end{pmatrix}^T \begin{bmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{bmatrix}^{-1} \begin{pmatrix} x_1 - \mu_1 \\ x_2 - \mu_2 \end{pmatrix} \right], \]
and, writing $V = \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}$:
1. By Lemma 1.10.2,
\[ (2\pi)^{(r_1+r_2)/2} \left| \begin{smallmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{smallmatrix} \right|^{1/2} = \left( (2\pi)^{r_1/2}\, |V|^{1/2} \right) \left( (2\pi)^{r_2/2}\, |\Sigma_{22}|^{1/2} \right). \]
2. Using the partitioned-inverse formula,
\[ \begin{bmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{bmatrix}^{-1} = \begin{bmatrix} V^{-1} & -V^{-1}\Sigma_{12}\Sigma_{22}^{-1} \\ -\Sigma_{22}^{-1}\Sigma_{21}V^{-1} & \Sigma_{22}^{-1} + \Sigma_{22}^{-1}\Sigma_{21}V^{-1}\Sigma_{12}\Sigma_{22}^{-1} \end{bmatrix}, \]
so the quadratic form expands as
\begin{align*}
& \begin{pmatrix} x_1 - \mu_1 \\ x_2 - \mu_2 \end{pmatrix}^T \begin{bmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{bmatrix}^{-1} \begin{pmatrix} x_1 - \mu_1 \\ x_2 - \mu_2 \end{pmatrix} \\
&= (x_1-\mu_1)^T V^{-1} (x_1-\mu_1) - (x_1-\mu_1)^T V^{-1}\Sigma_{12}\Sigma_{22}^{-1}(x_2-\mu_2) \\
&\quad - (x_2-\mu_2)^T \Sigma_{22}^{-1}\Sigma_{21} V^{-1} (x_1-\mu_1) + (x_2-\mu_2)^T \Sigma_{22}^{-1}\Sigma_{21} V^{-1} \Sigma_{12}\Sigma_{22}^{-1} (x_2-\mu_2) \\
&\quad + (x_2-\mu_2)^T \Sigma_{22}^{-1} (x_2-\mu_2).
\end{align*}
[How did this come about:
\[ (x, y) \begin{bmatrix} a & b \\ c & d \end{bmatrix} \begin{pmatrix} x \\ y \end{pmatrix} = (x, y) \begin{bmatrix} ax + by \\ cx + dy \end{bmatrix} = x(ax+by) + y(cx+dy) = ax^2 + bxy + cyx + dy^2. \,] \]
Completing the square, the quadratic form equals
\[ \left( (x_1-\mu_1) - \Sigma_{12}\Sigma_{22}^{-1}(x_2-\mu_2) \right)^T V^{-1} \left( (x_1-\mu_1) - \Sigma_{12}\Sigma_{22}^{-1}(x_2-\mu_2) \right) + (x_2-\mu_2)^T \Sigma_{22}^{-1}(x_2-\mu_2) \]
[Note: $(x-y)^T A (x-y) = (x-y)^T(Ax - Ay) = x^T A x - x^T A y - y^T A x + y^T A y$; also $\Sigma_{12}^T = \Sigma_{21}$ and $V = \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}$.]
\[ = \left( x_1 - (\mu_1 + \Sigma_{12}\Sigma_{22}^{-1}(x_2-\mu_2)) \right)^T V^{-1} \left( x_1 - (\mu_1 + \Sigma_{12}\Sigma_{22}^{-1}(x_2-\mu_2)) \right) + (x_2-\mu_2)^T \Sigma_{22}^{-1}(x_2-\mu_2). \]
Combining (1) and (2) we obtain:
\begin{align*}
f(x_1, x_2) &= \frac{1}{(2\pi)^{(r_1+r_2)/2} \left| \begin{smallmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{smallmatrix} \right|^{1/2}} \exp\left\{ -\frac{1}{2} \begin{pmatrix} x_1-\mu_1 \\ x_2-\mu_2 \end{pmatrix}^T \begin{bmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{bmatrix}^{-1} \begin{pmatrix} x_1-\mu_1 \\ x_2-\mu_2 \end{pmatrix} \right\} \\
&= \frac{1}{(2\pi)^{r_1/2} |\Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}|^{1/2}} \exp\left\{ -\frac{1}{2} \left( x_1 - (\mu_1 + \Sigma_{12}\Sigma_{22}^{-1}(x_2-\mu_2)) \right)^T \left( \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21} \right)^{-1} \left( x_1 - (\mu_1 + \Sigma_{12}\Sigma_{22}^{-1}(x_2-\mu_2)) \right) \right\} \\
&\quad \times \frac{1}{(2\pi)^{r_2/2} |\Sigma_{22}|^{1/2}} \exp\left\{ -\frac{1}{2} (x_2-\mu_2)^T \Sigma_{22}^{-1} (x_2-\mu_2) \right\} \\
&= h(x_1, x_2)\, g(x_2),
\end{align*}
where, for fixed $x_2$, $h(x_1, x_2)$ is the $N_{r_1}\left( \mu_1 + \Sigma_{12}\Sigma_{22}^{-1}(x_2-\mu_2),\; \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21} \right)$ PDF, and $g(x_2)$ is the $N_{r_2}(\mu_2, \Sigma_{22})$ PDF.
Hence, the result is proved.
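The factorisation in Theorem 1.10.4 can be checked numerically at a single point; scalar blocks ($r_1 = r_2 = 1$) with assumed parameter values are used here for simplicity.

```python
import numpy as np

def norm_pdf(x, m, v):
    """Univariate N(m, v) density."""
    return np.exp(-0.5 * (x - m)**2 / v) / np.sqrt(2 * np.pi * v)

mu1, mu2 = 1.0, -1.0
S11, S12, S22 = 2.0, 0.8, 1.0   # scalar blocks; S21 = S12
Sigma = np.array([[S11, S12], [S12, S22]])

x1, x2 = 0.3, 0.5
d = np.array([x1 - mu1, x2 - mu2])
joint = np.exp(-0.5 * d @ np.linalg.solve(Sigma, d)) / \
        (2 * np.pi * np.sqrt(np.linalg.det(Sigma)))

# Theorem 1.10.4: joint = (conditional of x1 given x2) * (marginal of x2)
cond = norm_pdf(x1, mu1 + S12 / S22 * (x2 - mu2), S11 - S12**2 / S22)
marg = norm_pdf(x2, mu2, S22)

print(np.isclose(joint, cond * marg))  # True
```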
Theorem. 1.10.5
Suppose that $X \sim N_r(\mu, \Sigma)$ and $Y = AX + b$, where $A_{p \times r}$ with linearly independent rows and $b \in \mathbb{R}^p$ are fixed. Then $Y \sim N_p(A\mu + b, A\Sigma A^T)$. [Note: $p \leq r$.]
Proof. Use the method of regular transformations.
Since the rows of $A$ are linearly independent, we can find $B_{(r-p) \times r}$ such that the $r \times r$ matrix $\begin{pmatrix} B \\ A \end{pmatrix}$ is invertible.
If we take $Z = BX$, we have
\[ \begin{pmatrix} Z \\ Y \end{pmatrix} = \begin{pmatrix} B \\ A \end{pmatrix} X + \begin{pmatrix} 0 \\ b \end{pmatrix} \sim N\left( \begin{bmatrix} B\mu \\ A\mu + b \end{bmatrix}, \begin{bmatrix} B\Sigma B^T & B\Sigma A^T \\ A\Sigma B^T & A\Sigma A^T \end{bmatrix} \right) \]
by Theorem 1.10.1. Hence, from Theorem 1.10.4, the marginal distribution for $Y$ is $N_p(A\mu + b, A\Sigma A^T)$.
1.10.1 The multivariate normal MGF
The moment generating function of a random vector $X \sim N(\mu, \Sigma)$ is given by
\[ M_X(t) = e^{t^T \mu + \frac{1}{2} t^T \Sigma t}. \]
Prove this result as an exercise!
The characteristic function of $X$ is
\[ E[\exp(i t^T X)] = \exp\left( i t^T \mu - \frac{1}{2} t^T \Sigma t \right). \]
The marginal distribution of $X_1$ (or $X_2$) is easy to derive using the multivariate normal MGF. Let
\[ t = \begin{pmatrix} t_1 \\ t_2 \end{pmatrix}, \qquad \mu = \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}. \]
Then the marginal distribution of $X_1$ is obtained by setting $t_2 = 0$ in the expression for the MGF of $X$.
Proof:
\begin{align*}
M_X(t) &= \exp\left( t^T \mu + \tfrac{1}{2} t^T \Sigma t \right) \\
&= \exp\left( t_1^T \mu_1 + t_2^T \mu_2 + \tfrac{1}{2} t_1^T \Sigma_{11} t_1 + t_1^T \Sigma_{12} t_2 + \tfrac{1}{2} t_2^T \Sigma_{22} t_2 \right).
\end{align*}
Now,
\[ M_{X_1}(t_1) = M_X \begin{pmatrix} t_1 \\ 0 \end{pmatrix} = \exp\left( t_1^T \mu_1 + \tfrac{1}{2} t_1^T \Sigma_{11} t_1 \right), \]
which is the MGF of $X_1 \sim N_{r_1}(\mu_1, \Sigma_{11})$. Similarly for $X_2$.
This means that all marginal distributions of a multivariate normal distribution are themselves multivariate normal. Note, though, that in general the opposite implication is not true: there are examples of non-normal multivariate distributions whose marginal distributions are normal.
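The "set $t_2 = 0$" device is easy to verify numerically: the joint MGF evaluated at $(t_1, 0)$ agrees with the MGF of $N(\mu_1, \Sigma_{11})$. The parameter values below are illustrative.

```python
import numpy as np

def mvn_mgf(t, mu, Sigma):
    """M_X(t) = exp(t^T mu + t^T Sigma t / 2)."""
    return np.exp(t @ mu + 0.5 * t @ Sigma @ t)

mu = np.array([1.0, -2.0, 0.5])
Sigma = np.array([[2.0, 0.3, 0.1],
                  [0.3, 1.0, -0.2],
                  [0.1, -0.2, 0.5]])

# Setting t2 = 0 picks out the marginal MGF of X1 (the first two coordinates)
t1 = np.array([0.2, -0.1])
t_full = np.array([0.2, -0.1, 0.0])

print(np.isclose(mvn_mgf(t_full, mu, Sigma),
                 mvn_mgf(t1, mu[:2], Sigma[:2, :2])))  # True
```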
1.10.2 Independence and normality
We have seen previously that $X, Y$ independent $\implies \operatorname{Cov}(X, Y) = 0$, but not vice-versa. An exception is when the data are (jointly) normally distributed. In particular, if $(X, Y)$ has the bivariate normal distribution, then $\operatorname{Cov}(X, Y) = 0 \iff X, Y$ are independent.
Theorem. 1.10.6
Suppose $X_1, X_2, \ldots, X_r$ have a multivariate normal distribution. Then $X_1, X_2, \ldots, X_r$ are independent if and only if $\operatorname{Cov}(X_i, X_j) = 0$ for all $i \neq j$.
Proof.
($\implies$) That $X_1, \ldots, X_r$ independent $\implies \operatorname{Cov}(X_i, X_j) = 0$ for $i \neq j$ has already been proved.
($\impliedby$) Suppose $\operatorname{Cov}(X_i, X_j) = 0$ for $i \neq j$.
\[ \implies \operatorname{Var}(X) = \Sigma = \begin{bmatrix} \sigma_{11} & 0 & \ldots & 0 \\ 0 & \sigma_{22} & & \vdots \\ \vdots & & \ddots & \vdots \\ 0 & \ldots & 0 & \sigma_{rr} \end{bmatrix} \implies \Sigma^{-1} = \begin{bmatrix} \sigma_{11}^{-1} & 0 & \ldots & 0 \\ 0 & \sigma_{22}^{-1} & & \vdots \\ \vdots & & \ddots & \vdots \\ 0 & \ldots & 0 & \sigma_{rr}^{-1} \end{bmatrix} \quad \text{and} \quad |\Sigma| = \sigma_{11}\sigma_{22}\cdots\sigma_{rr}. \]
Hence
\begin{align*}
f_X(x) &= \frac{1}{(2\pi)^{r/2} |\Sigma|^{1/2}}\, e^{-\frac{1}{2}(x-\mu)^T \Sigma^{-1}(x-\mu)} \\
&= \frac{1}{(2\pi)^{r/2} (\sigma_{11}\sigma_{22}\cdots\sigma_{rr})^{1/2}}\, e^{-\frac{1}{2}\sum_{i=1}^r \frac{(x_i-\mu_i)^2}{\sigma_{ii}}} \\
&= \left( \frac{1}{\sqrt{2\pi}\sqrt{\sigma_{11}}}\, e^{-\frac{1}{2}\frac{(x_1-\mu_1)^2}{\sigma_{11}}} \right) \left( \frac{1}{\sqrt{2\pi}\sqrt{\sigma_{22}}}\, e^{-\frac{1}{2}\frac{(x_2-\mu_2)^2}{\sigma_{22}}} \right) \cdots \left( \frac{1}{\sqrt{2\pi}\sqrt{\sigma_{rr}}}\, e^{-\frac{1}{2}\frac{(x_r-\mu_r)^2}{\sigma_{rr}}} \right) \\
&= f_1(x_1) f_2(x_2) \cdots f_r(x_r)
\end{align*}
\[ \implies X_1, \ldots, X_r \text{ are independent.} \]
Note:
\[ (x-\mu)^T \Sigma^{-1} (x-\mu) = (x_1-\mu_1, \ldots, x_r-\mu_r) \begin{bmatrix} \sigma_{11}^{-1} & 0 & \ldots & 0 \\ 0 & \sigma_{22}^{-1} & & \vdots \\ \vdots & & \ddots & \vdots \\ 0 & \ldots & 0 & \sigma_{rr}^{-1} \end{bmatrix} \begin{pmatrix} x_1-\mu_1 \\ x_2-\mu_2 \\ \vdots \\ x_r-\mu_r \end{pmatrix} = \sum_{i=1}^r \frac{(x_i-\mu_i)^2}{\sigma_{ii}}. \]
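The diagonal-$\Sigma$ factorisation used in the proof can be confirmed at a point; the values below are illustrative.

```python
import numpy as np

def mvn_pdf(x, mu, Sigma):
    """N_r(mu, Sigma) density."""
    r = len(mu)
    d = x - mu
    return np.exp(-0.5 * d @ np.linalg.solve(Sigma, d)) / \
           np.sqrt((2 * np.pi) ** r * np.linalg.det(Sigma))

def norm_pdf(x, m, v):
    return np.exp(-0.5 * (x - m)**2 / v) / np.sqrt(2 * np.pi * v)

mu = np.array([1.0, -1.0, 0.0])
Sigma = np.diag([2.0, 0.5, 1.5])      # all covariances zero

x = np.array([0.3, -0.7, 0.2])
product = np.prod([norm_pdf(x[i], mu[i], Sigma[i, i]) for i in range(3)])

# The joint density factorises into the product of the marginals => independence
print(np.isclose(mvn_pdf(x, mu, Sigma), product))  # True
```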
Remark
The same methods can be used to establish a similar result for block diagonal matrices. The simplest case, which is most easily proved using moment generating functions, is the following: $X_1, X_2$ are independent if and only if $\Sigma_{12} = 0$, i.e.,
\[ \begin{pmatrix} X_1 \\ X_2 \end{pmatrix} \sim N_{r_1+r_2}\left( \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}, \begin{pmatrix} \Sigma_{11} & 0 \\ 0 & \Sigma_{22} \end{pmatrix} \right). \]
Theorem. 1.10.7
Suppose $X_1, X_2, \ldots, X_n$ are i.i.d. $N(\mu, \sigma^2)$ RVs and let
\[ \bar{X} = \frac{1}{n} \sum_{i=1}^n X_i, \qquad S^2 = \frac{1}{n-1} \sum_{i=1}^n (X_i - \bar{X})^2. \]
Then $\bar{X} \sim N(\mu, \sigma^2/n)$ and $\dfrac{(n-1)S^2}{\sigma^2} \sim \chi^2_{n-1}$, independently.
(Note: $S^2$ here is a random variable, and is not to be confused with the sample covariance matrix, which is also often denoted by $S$. Hopefully, the meaning of the notation will be clear from the context in which it is used.)
Proof. Observe first that if $X = (X_1, \ldots, X_n)^T$ then the i.i.d. assumption may be written as
\[ X \sim N_n(\mu \mathbf{1}_n, \sigma^2 I_{n \times n}). \]
1. $\bar{X} \sim N(\mu, \sigma^2/n)$:
Observe that $\bar{X} = BX$, where
\[ B = \begin{bmatrix} \frac{1}{n} & \frac{1}{n} & \ldots & \frac{1}{n} \end{bmatrix}. \]
Hence, by Theorem 1.10.5,
\[ \bar{X} \sim N(B \mu \mathbf{1}, \sigma^2 B B^T) \equiv N(\mu, \sigma^2/n), \]
as required.
2. Independence of $\bar{X}$ and $S^2$: consider
\[ A = \begin{bmatrix} \frac{1}{n} & \frac{1}{n} & \ldots & \frac{1}{n} & \frac{1}{n} \\ 1 - \frac{1}{n} & -\frac{1}{n} & \ldots & -\frac{1}{n} & -\frac{1}{n} \\ -\frac{1}{n} & 1 - \frac{1}{n} & \ldots & -\frac{1}{n} & -\frac{1}{n} \\ \vdots & & \ddots & & \vdots \\ -\frac{1}{n} & -\frac{1}{n} & \ldots & 1 - \frac{1}{n} & -\frac{1}{n} \end{bmatrix} \]
and observe
\[ AX = \begin{pmatrix} \bar{X} \\ X_1 - \bar{X} \\ X_2 - \bar{X} \\ \vdots \\ X_{n-1} - \bar{X} \end{pmatrix}. \]
We can check that $\operatorname{Var}(AX) = \sigma^2 A A^T$ has the form
\[ \sigma^2 \begin{bmatrix} \frac{1}{n} & 0 & \ldots & 0 \\ 0 & & & \\ \vdots & & \Sigma_{22} & \\ 0 & & & \end{bmatrix} \]
\[ \implies \bar{X} \text{ and } (X_1 - \bar{X}, X_2 - \bar{X}, \ldots, X_{n-1} - \bar{X}) \text{ are independent} \]
(by multivariate normality).
Finally, since
\[ X_n - \bar{X} = -\left( (X_1 - \bar{X}) + (X_2 - \bar{X}) + \cdots + (X_{n-1} - \bar{X}) \right), \]
it follows that $S^2$ is a function of $(X_1 - \bar{X}, X_2 - \bar{X}, \ldots, X_{n-1} - \bar{X})$ and hence is independent of $\bar{X}$.
3. Proof that $\dfrac{(n-1)S^2}{\sigma^2} \sim \chi^2_{n-1}$:
Consider the identity
\[ \sum_{i=1}^n (X_i - \mu)^2 = \sum_{i=1}^n (X_i - \bar{X})^2 + n(\bar{X} - \mu)^2 \]
(subtract and add $\bar{X}$ inside the first term)
\[ \implies \sum_{i=1}^n \frac{(X_i - \mu)^2}{\sigma^2} = \sum_{i=1}^n \frac{(X_i - \bar{X})^2}{\sigma^2} + \left( \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \right)^2, \]
and let
\[ R_1 = \sum_{i=1}^n \frac{(X_i - \mu)^2}{\sigma^2}, \qquad R_2 = \sum_{i=1}^n \frac{(X_i - \bar{X})^2}{\sigma^2} = \frac{(n-1)S^2}{\sigma^2}, \qquad R_3 = \left( \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \right)^2. \]
If $M_1, M_2, M_3$ are the MGFs of $R_1, R_2, R_3$ respectively, then
\[ M_1(t) = M_2(t)\, M_3(t), \quad \text{since } R_1 = R_2 + R_3 \text{ with } R_2 \text{ and } R_3 \text{ independent} \]
($R_2$ depends only on $S^2$, and $R_3$ depends only on $\bar{X}$)
\[ \implies M_2(t) = \frac{M_1(t)}{M_3(t)}. \]
Next, observe that
\[ R_3 = \left( \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \right)^2 \sim \chi^2_1 \implies M_3(t) = \frac{1}{(1-2t)^{1/2}}, \]
and
\[ R_1 = \sum_{i=1}^n \frac{(X_i - \mu)^2}{\sigma^2} \sim \chi^2_n \implies M_1(t) = \frac{1}{(1-2t)^{n/2}}. \]
\[ \therefore M_2(t) = \frac{1}{(1-2t)^{(n-1)/2}}, \]
which is the MGF for $\chi^2_{n-1}$. Hence,
\[ \frac{(n-1)S^2}{\sigma^2} \sim \chi^2_{n-1}. \]
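Theorem 1.10.7 can be illustrated by simulation; $n$, $\mu$ and $\sigma$ below are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(3)

n, mu, sigma = 10, 5.0, 2.0
reps = 200_000
X = rng.normal(mu, sigma, size=(reps, n))

xbar = X.mean(axis=1)
S2 = X.var(axis=1, ddof=1)            # sample variance with divisor n - 1
Q = (n - 1) * S2 / sigma**2           # should be chi^2 with n - 1 = 9 df

print(np.isclose(xbar.var(), sigma**2 / n, atol=0.05))   # Var(Xbar) = sigma^2/n
print(np.isclose(Q.mean(), n - 1, atol=0.1))             # E[chi^2_9] = 9
print(np.isclose(Q.var(), 2 * (n - 1), atol=0.5))        # Var[chi^2_9] = 18
print(abs(np.corrcoef(xbar, S2)[0, 1]) < 0.01)           # Xbar and S^2 uncorrelated
```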
Corollary.
If $X_1, \ldots, X_n$ are i.i.d. $N(\mu, \sigma^2)$, then
\[ T = \frac{\bar{X} - \mu}{S/\sqrt{n}} \sim t_{n-1}. \]
Proof.
Recall that $t_k$ is the distribution of $\dfrac{Z}{\sqrt{V/k}}$, where $Z \sim N(0,1)$ and $V \sim \chi^2_k$ independently.
Now observe that
\[ T = \frac{\bar{X} - \mu}{S/\sqrt{n}} = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \bigg/ \sqrt{\frac{(n-1)S^2/\sigma^2}{n-1}}, \]
and note that
\[ \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \sim N(0,1) \quad \text{and} \quad \frac{(n-1)S^2}{\sigma^2} \sim \chi^2_{n-1}, \]
independently.
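The corollary can likewise be checked by simulation: for $n = 6$, $T$ should follow $t_5$, which has mean 0 and variance $(n-1)/(n-3) = 5/3$. The values of $\mu$ and $\sigma$ are arbitrary, since $T$ is a pivot.

```python
import numpy as np

rng = np.random.default_rng(4)

n, mu, sigma = 6, 0.0, 3.0
reps = 400_000
X = rng.normal(mu, sigma, size=(reps, n))

T = (X.mean(axis=1) - mu) / np.sqrt(X.var(axis=1, ddof=1) / n)

# T should follow t with n - 1 = 5 df: mean 0, variance (n-1)/(n-3) = 5/3
print(np.isclose(T.mean(), 0.0, atol=0.01))
print(np.isclose(T.var(), 5 / 3, atol=0.05))
```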
1.11 Limit Theorems
1.11.1 Convergence of random variables
Let $X_1, X_2, X_3, \ldots$ be an infinite sequence of RVs. We will consider 4 different types of convergence.
1. Convergence in probability (weak convergence)
The sequence $\{X_n\}$ is said to converge to the constant $\alpha \in \mathbb{R}$ in probability if for every $\varepsilon > 0$,
\[ \lim_{n \to \infty} P(|X_n - \alpha| > \varepsilon) = 0. \]
2. Convergence in quadratic mean
The sequence $\{X_n\}$ is said to converge to $\alpha$ in quadratic mean if
\[ \lim_{n \to \infty} E\left( (X_n - \alpha)^2 \right) = 0. \]
3. Almost sure convergence (strong convergence)
The sequence $\{X_n\}$ is said to converge almost surely to $\alpha$ if for each $\varepsilon > 0$,
\[ |X_n - \alpha| > \varepsilon \quad \text{for only a finite number of } n \geq 1. \]
Remarks
(1), (2), (3) are related by:
[Figure 15: Relationships of convergence]
4. Convergence in distribution
The sequence of RVs $\{X_n\}$ with CDFs $\{F_n\}$ is said to converge in distribution to the RV $X$ with CDF $F(x)$ if
\[ \lim_{n \to \infty} F_n(x) = F(x) \quad \text{for all continuity points of } F. \]
[Figure 16: Discrete case]
[Figure 17: Normal approximation to the binomial (an application of the Central Limit Theorem): continuous case]
Remarks
1. If we take
\[ F(x) = \begin{cases} 0 & x < \alpha \\ 1 & x \geq \alpha \end{cases} \]
then we have $X = \alpha$ with probability 1, and convergence in distribution to this $F$ is the same thing as convergence in probability to $\alpha$.
2. Commonly used notation for convergence in distribution is either:
(i) $\mathcal{L}[X_n] \to \mathcal{L}[X]$
(ii) $X_n \xrightarrow{D} \mathcal{L}[X]$, or e.g., $X_n \xrightarrow{D} N(0,1)$.
3. An important result that we will use without proof is as follows:
Let $M_n(t)$ be the MGF of $X_n$ and $M(t)$ be the MGF of $X$. If $M_n(t) \to M(t)$ for each $t$ in some open interval containing 0, as $n \to \infty$, then $\mathcal{L}[X_n] \xrightarrow{D} \mathcal{L}[X]$.
(Sometimes called the Continuity Theorem.)
Theorem. 1.11.1 (Weak law of large numbers)
Suppose $X_1, X_2, X_3, \ldots$ is a sequence of i.i.d. RVs with $E[X_i] = \mu$, $\operatorname{Var}(X_i) = \sigma^2$, and let
\[ \bar{X}_n = \frac{1}{n} \sum_{i=1}^n X_i. \]
Then $\bar{X}_n$ converges to $\mu$ in probability.
Proof. We need to show for each $\epsilon > 0$ that
\[ \lim_{n \to \infty} P(|\bar{X}_n - \mu| > \epsilon) = 0. \]
Now observe that $E[\bar{X}_n] = \mu$ and $\operatorname{Var}(\bar{X}_n) = \dfrac{\sigma^2}{n}$. So by Chebyshev's inequality,
\[ P(|\bar{X}_n - \mu| > \epsilon) \leq \frac{\sigma^2}{n\epsilon^2} \to 0 \quad \text{as } n \to \infty, \text{ for any fixed } \epsilon > 0. \]
Remarks
1. The proof given for Theorem 1.11.1 is really a corollary to the fact that $\bar{X}_n$ also converges to $\mu$ in quadratic mean.
2. There is also a version of this theorem involving almost sure convergence (the strong law of large numbers). We will not discuss this.
3. The law of large numbers is one of the fundamental principles of statistical inference. That is, it is the formal justification for the claim that the "sample mean approaches the population mean for large n".
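The weak law can be illustrated with running means of uniform draws, together with the Chebyshev bound from the proof; the choices of distribution, $n$ and $\epsilon$ are illustrative.

```python
import numpy as np

rng = np.random.default_rng(5)

# Running means of iid Uniform(0, 1) draws (mu = 1/2, sigma^2 = 1/12)
x = rng.uniform(0, 1, size=100_000)
running_mean = np.cumsum(x) / np.arange(1, len(x) + 1)

eps = 0.01
print(abs(running_mean[-1] - 0.5) < eps)   # the final running mean is within eps of mu

# Chebyshev bound from the proof: P(|Xbar_n - mu| > eps) <= sigma^2 / (n eps^2)
print((1 / 12) / (len(x) * eps**2))        # bound ~ 0.0083 at n = 100000
```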
Lemma. 1.11.1
Suppose $a_n$ is a sequence of real numbers s.t. $\lim_{n \to \infty} n a_n = a$ with $|a| < \infty$. Then
\[ \lim_{n \to \infty} (1 + a_n)^n = e^a. \]
Proof.
Omitted (but not difficult).
Remark
This is a simple generalisation of the standard limit
\[ \lim_{n \to \infty} \left( 1 + \frac{x}{n} \right)^n = e^x. \]
Consider a sequence of i.i.d. RVs $X_1, X_2, \ldots$ with $E(X_i) = \mu$, $\operatorname{Var}(X_i) = \sigma^2$ and such that the MGF, $M_X(t)$, is defined for all $t$ in some open interval containing 0.
Let $S_n = \sum_{i=1}^n X_i$ and note that $E[S_n] = n\mu$ and $\operatorname{Var}(S_n) = n\sigma^2$.
Theorem. 1.11.2 (Central Limit Theorem)
Let $X_1, X_2, \ldots, S_n$ be as above and let $Z_n = \dfrac{S_n - n\mu}{\sigma\sqrt{n}}$. Then
\[ \mathcal{L}[Z_n] \to N(0, 1) \quad \text{as } n \to \infty. \]
Proof.
We will use the fact that it is sufficient to prove that
\[ M_{Z_n}(t) \to e^{t^2/2} \quad \text{for each fixed } t. \]
[Note: if $Z \sim N(0,1)$ then $M_Z(t) = e^{t^2/2}$.]
Now let $U_i = \dfrac{X_i - \mu}{\sigma}$ and observe that $Z_n = \dfrac{1}{\sqrt{n}} \sum_{i=1}^n U_i$.
Since the $U_i$ are independent, we have:
\[ M_{Z_n}(t) = \left\{ M_U(t/\sqrt{n}) \right\}^n \qquad \left[\, M_{aX}(t) = E[e^{taX}] = M_X(at) \,\right] \]
Now,
\[ E(U_i) = 0 \implies M_U'(0) = 0, \qquad \operatorname{Var}(U_i) = 1 \implies M_U''(0) = 1. \]
Consider the second-order Taylor expansion
\[ M_U(t) = M_U(0) + M_U'(0)\, t + M_U''(0)\, \frac{t^2}{2} + r(t) = 1 + 0 + \frac{t^2}{2} + r(t) \]
(using $M_U(0) = 1$ and $M_U'(0) = 0$), where $\lim_{s \to 0} \dfrac{r(s)}{s^2} = 0$. Hence
\[ M_{Z_n}(t) = \left\{ M_U(t/\sqrt{n}) \right\}^n = \left\{ 1 + \frac{t^2}{2n} + r(t/\sqrt{n}) \right\}^n = (1 + a_n)^n, \quad \text{where } a_n = \frac{t^2}{2n} + r(t/\sqrt{n}). \]
Next observe that $\lim_{n \to \infty} n a_n = \dfrac{t^2}{2}$ for fixed $t$.
[To check this, observe that
\[ \lim_{n \to \infty} \frac{n t^2}{2n} = \frac{t^2}{2} \quad \text{and} \quad \lim_{n \to \infty} n\, r(t/\sqrt{n}) = \lim_{n \to \infty} t^2\, \frac{r(t/\sqrt{n})}{(t/\sqrt{n})^2} = t^2 \lim_{s \to 0} \frac{r(s)}{s^2} = 0 \quad \text{for fixed } t. \]
Note: $s = t/\sqrt{n} \to 0$ as $n \to \infty$.]
Hence we have shown $M_{Z_n}(t) = (1 + a_n)^n$, where $\lim_{n \to \infty} n a_n = t^2/2$, so by Lemma 1.11.1,
\[ \lim_{n \to \infty} M_{Z_n}(t) = e^{t^2/2} \quad \text{for each fixed } t. \]
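The Central Limit Theorem can be illustrated with standardised sums of exponential RVs, which are far from normal individually; the choices of $n$ and the number of replications are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(6)

# Standardised sums of iid Exponential(1) RVs (mu = 1, sigma = 1)
n, reps = 400, 20_000
S = rng.exponential(1.0, size=(reps, n)).sum(axis=1)
Z = (S - n * 1.0) / (1.0 * np.sqrt(n))

# Z_n should be approximately N(0, 1)
print(np.isclose(Z.mean(), 0.0, atol=0.03))
print(np.isclose(Z.var(), 1.0, atol=0.05))
print(np.isclose(np.mean(Z < 1.645), 0.95, atol=0.015))  # P(Z < 1.645) close to 0.95
```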
Remarks
1. The Central Limit Theorem can be stated equivalently for $\bar{X}_n$, i.e.,
\[ \mathcal{L}\left( \frac{\bar{X}_n - \mu}{\sigma/\sqrt{n}} \right) \to N(0, 1) \]
(just note that $\dfrac{\bar{X}_n - \mu}{\sigma/\sqrt{n}} = \dfrac{S_n - n\mu}{\sigma\sqrt{n}}$).
2. The Central Limit Theorem holds under conditions more general than those given above. In particular, with suitable assumptions,
(i) $M_X(t)$ need not exist;
(ii) $X_1, X_2, \ldots$ need not be i.i.d.
3. Theorems 1.11.1 and 1.11.2 are concerned with the asymptotic behaviour of $\bar{X}_n$.
Theorem 1.11.1 states $\bar{X}_n \to \mu$ in probability as $n \to \infty$.
Theorem 1.11.2 states $\dfrac{\bar{X}_n - \mu}{\sigma/\sqrt{n}} \xrightarrow{D} N(0, 1)$ as $n \to \infty$.
These results are not contradictory, because $\operatorname{Var}(\bar{X}_n) \to 0$, while the Central Limit Theorem is concerned with the standardised quantity $\dfrac{\bar{X}_n - E(\bar{X}_n)}{\sqrt{\operatorname{Var}(\bar{X}_n)}}$.
2 Statistical Inference
2.1 Basic definitions and terminology
Probability is concerned partly with the problem of predicting the behaviour of the RV $X$, assuming we know its distribution. Statistical inference is concerned with the inverse problem: given data $x_1, x_2, \ldots, x_n$ with unknown CDF $F(x)$, what can we conclude about $F(x)$?
In this course, we are concerned with parametric inference. That is, we assume $F$ belongs to a given family of distributions, indexed by the parameter $\theta$:
\[ \mathcal{I} = \{ F(x; \theta) : \theta \in \Theta \}, \]
where $\Theta$ is the parameter space.
Examples
(1) $\mathcal{I}$ is the family of normal distributions:
\[ \theta = (\mu, \sigma^2), \qquad \Theta = \{ (\mu, \sigma^2) : \mu \in \mathbb{R}, \sigma^2 \in \mathbb{R}^+ \}, \]
so that
\[ \mathcal{I} = \left\{ F(x) : F(x) = \Phi\left( \frac{x - \mu}{\sigma} \right) \right\}. \]
(2) $\mathcal{I}$ is the family of Bernoulli distributions with success probability $\theta$:
\[ \Theta = \{ \theta : \theta \in [0, 1] \subset \mathbb{R} \}. \]
In this framework, the problem is then to use the data $x_1, \ldots, x_n$ to draw conclusions about $\theta$.
Definition. 2.1.1
A collection of i.i.d. RVs, $X_1, \ldots, X_n$, with common CDF $F(x; \theta)$, is said to be a random sample (from $F(x; \theta)$).
Definition. 2.1.2
Any function $T = T(x_1, x_2, \ldots, x_n)$ that can be calculated from the data (without knowledge of $\theta$) is called a statistic.
Example
The sample mean $\bar{x} = \frac{1}{n} \sum_{i=1}^n x_i$ is a statistic.
Definition. 2.1.3
A statistic $T$ with the property $T(x) \in \Theta$ for all $x$ is called an estimator for $\theta$.
Example
For $x_1, x_2, \ldots, x_n$ i.i.d. $N(\mu, \sigma^2)$, we have $\theta = (\mu, \sigma^2)$. The quantity $(\bar{x}, s^2)$ is an estimator for $\theta$, where
\[ \bar{x} = \frac{1}{n} \sum_{i=1}^n x_i, \qquad s^2 = \frac{1}{n-1} \sum_{i=1}^n (x_i - \bar{x})^2. \]
There are two important concepts here: the first is that estimators are random variables; the second is that you need to be able to distinguish between random variables and their realisations. In particular, an estimate is a realisation of a random variable.
For example, strictly speaking, $x_1, x_2, \ldots, x_n$ are realisations of random variables $X_1, X_2, \ldots, X_n$, and $\bar{x} = \frac{1}{n} \sum_{i=1}^n x_i$ is a realisation of $\bar{X}$; $\bar{X}$ is an estimator, and $\bar{x}$ is an estimate.
We will find from now on, however, that it is often more convenient, if less rigorous, to use the same symbol for estimator and estimate. This arises especially in the use of $\hat{\theta}$ as both estimator and estimate, as we shall see.
An unsatisfactory aspect of Definition 2.1.3 is that it gives no guidance on how to recognise (or construct) good estimators. Unless stated otherwise, we will assume that $\theta$ is a scalar parameter in what follows. In broad terms, we would like an estimator to be "as close as possible to" $\theta$ "with high probability".
Definition. 2.1.4
The mean squared error of the estimator $T$ of $\theta$ is defined by
\[ \operatorname{MSE}_T(\theta) = E\{ (T - \theta)^2 \}. \]
Example
Suppose $X_1, \ldots, X_n$ are i.i.d. Bernoulli($\theta$) RVs, and $T = \bar{X} =$ 'proportion of successes'. Since $nT \sim B(n, \theta)$ we have
\[ E(nT) = n\theta, \qquad \operatorname{Var}(nT) = n\theta(1-\theta) \]
\[ \implies E(T) = \theta, \qquad \operatorname{Var}(T) = \frac{\theta(1-\theta)}{n} \]
\[ \implies \operatorname{MSE}_T(\theta) = \operatorname{Var}(T) = \frac{\theta(1-\theta)}{n}. \]
Remark
This example shows that $\operatorname{MSE}_T(\theta)$ must be thought of as a function of $\theta$ rather than just a number; see Figure 18.
[Figure 18: $\operatorname{MSE}_T(\theta)$ as a function of $\theta$]
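The MSE curve from this example is a simple function of $\theta$; a minimal sketch (the sample size $n = 25$ is an arbitrary choice):

```python
import numpy as np

def mse_xbar_bernoulli(theta, n):
    """MSE of T = Xbar for iid Bernoulli(theta): T is unbiased, so
    MSE_T(theta) = Var(T) = theta(1 - theta)/n."""
    return theta * (1 - theta) / n

theta = np.linspace(0, 1, 101)
mse = mse_xbar_bernoulli(theta, n=25)

print(np.isclose(mse.max(), 0.25 / 25))   # worst case at theta = 1/2
print(mse[0] == 0.0 and mse[-1] == 0.0)   # zero at theta = 0 and theta = 1
```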
Intuitively, a good estimator is one for which the MSE is as small as possible. However, quantifying this idea is complicated, because the MSE is a function of $\theta$, not just a number. See Figure 19.
For this reason, it turns out that it is not possible to construct a minimum MSE estimator in general. To see why, suppose $T^*$ is a minimum MSE estimator for $\theta$, and consider the estimator $T_a = a$, where $a \in \mathbb{R}$ is arbitrary. Observe that $\operatorname{MSE}_{T_a}(a) = 0$; hence if $T^*$ exists, then we must have $\operatorname{MSE}_{T^*}(a) = 0$. As $a$ is arbitrary, we must have $\operatorname{MSE}_{T^*}(\theta) = 0$ for all $\theta \in \Theta$
\[ \implies T^* = \theta \text{ with probability 1.} \]
Therefore we conclude that (excluding trivial situations) no minimum MSE estimator can exist.
Definition. 2.1.5
The bias of the estimator $T$ is defined by
\[ b_T(\theta) = E(T) - \theta. \]
[Figure 19: $\operatorname{MSE}_T(\theta)$ as a function of $\theta$]
An estimator $T$ with $b_T(\theta) = 0$, i.e., $E(T) = \theta$, is said to be unbiased.
Remarks
(1) Although unbiasedness is an appealing property, not all commonly used estimators are unbiased, and in some situations it is impossible to construct unbiased estimators for the parameter of interest.
Example (Unbiasedness isn't everything)
We know $E(s^2) = \sigma^2$. If $E(s) = \sigma$, then
\[ \operatorname{Var}(s) = E(s^2) - \{E(s)\}^2 = 0, \]
which is not the case
\[ \implies E(s) < \sigma. \]
(2) Intuitively, unbiasedness would seem to be a pre-requisite for a good estimator. This is to some extent formalised as follows:
Theorem. 2.1.1
\[ \operatorname{MSE}_T(\theta) = \operatorname{Var}(T) + b_T(\theta)^2. \]
Remark
Restricting attention to unbiased estimators excludes estimators of the form $T_a = a$. We will see that this permits the construction of Minimum Variance Unbiased Estimators (MVUEs) in some cases.
2.2 Minimum Variance Unbiased Estimation
Consider data $x_1, x_2, \ldots, x_n$ assumed to be observations of RVs $X_1, X_2, \ldots, X_n$ with joint PDF $f_X(x; \theta)$ (or probability function $P_X(x; \theta)$).
Definition. 2.2.1
The likelihood function is defined by
\[ L(\theta; x) = \begin{cases} f_X(x; \theta) & X \text{ continuous} \\ P_X(x; \theta) & X \text{ discrete.} \end{cases} \]
The log-likelihood function is
\[ l(\theta; x) = \log L(\theta; x) \]
(log is the natural log, i.e., ln).
Remark
If x₁, x₂, …, xₙ are independent, the log-likelihood function can be written as
l(θ; x) = Σᵢ₌₁ⁿ log fᵢ(xᵢ; θ).
If x₁, x₂, …, xₙ are i.i.d., we have
l(θ; x) = Σᵢ₌₁ⁿ log f(xᵢ; θ).
Definition 2.2.2.
Consider a statistical problem with log-likelihood l(θ; x). The score is defined by
U(θ; x) = ∂l/∂θ
and the Fisher information is
I(θ) = E(−∂²l/∂θ²).
Remark
For a single observation with PDF f(x; θ), the information is
i(θ) = E(−∂²/∂θ² log f(X; θ)).
In the case of x₁, x₂, …, xₙ i.i.d., we have I(θ) = n i(θ).

Theorem 2.2.1.
Under suitable regularity conditions,
E{U(θ; X)} = 0 and Var{U(θ; X)} = I(θ).
Proof.
Observe that (all integrals run over (−∞, ∞))
E{U(θ; X)} = ∫···∫ U(θ; x) f(x; θ) dx₁···dxₙ
= ∫···∫ (∂/∂θ log f(x; θ)) f(x; θ) dx₁···dxₙ
= ∫···∫ ( (∂f(x; θ)/∂θ) / f(x; θ) ) f(x; θ) dx₁···dxₙ
= ∫···∫ ∂f(x; θ)/∂θ dx₁···dxₙ
= ∂/∂θ ∫···∫ f(x; θ) dx₁···dxₙ    (by regularity)
= ∂/∂θ (1)
= 0, as required.
To calculate Var{U(θ; X)}, observe
∂²l(θ; x)/∂θ² = ∂/∂θ { (∂f(x; θ)/∂θ) / f(x; θ) }
= { (∂²f(x; θ)/∂θ²) f(x; θ) − (∂f(x; θ)/∂θ)² } / f(x; θ)²
= (∂²f(x; θ)/∂θ²) / f(x; θ) − U²(θ; x)
=⇒ U²(θ; x) = (∂²f(x; θ)/∂θ²) / f(x; θ) − ∂²l/∂θ²
=⇒ Var{U(θ; X)} = E{U²(θ; X)}    (as E(U) = 0)
= E( (∂²f(X; θ)/∂θ²) / f(X; θ) ) + I(θ)    (by definition of I(θ)).
Finally observe that
E( (∂²f(X; θ)/∂θ²) / f(X; θ) ) = ∫···∫ ( (∂²f(x; θ)/∂θ²) / f(x; θ) ) f(x; θ) dx₁···dxₙ
= ∫···∫ ∂²f(x; θ)/∂θ² dx₁···dxₙ
= ∂²/∂θ² ∫···∫ f(x; θ) dx₁···dxₙ
= ∂²/∂θ² (1)
= 0.
Hence, we have proved Var{U(θ; X)} = I(θ).
Theorem 2.2.2 (Cramér-Rao Lower Bound).
If T is an unbiased estimator for θ, then Var(T) ≥ 1/I(θ).

Proof.
Observe (using E(U) = 0)
Cov{T(X), U(θ; X)} = E{T(X)U(θ; X)}
= ∫···∫ T(x) U(θ; x) f(x; θ) dx₁···dxₙ
= ∫···∫ T(x) ( (∂f(x; θ)/∂θ) / f(x; θ) ) f(x; θ) dx₁···dxₙ
= ∂/∂θ ∫···∫ T(x) f(x; θ) dx₁···dxₙ
= ∂/∂θ E{T(X)}
= ∂θ/∂θ = 1.
To summarize, Cov(T, U) = 1.
Recall that Cov²(T, U) ≤ Var(T) Var(U) [i.e. |ρ| ≤ 1]
=⇒ Var(T) ≥ Cov²(T, U)/Var(U) = 1/I(θ), as required.
Example
Suppose X₁, X₂, …, Xₙ are i.i.d. Po(λ) RVs, and let X̄ = (1/n) Σᵢ₌₁ⁿ Xᵢ. We will prove that X̄ is a MVUE for λ.
Proof.
(1) Recall if X ∼ Po(λ), then P(x) = e^{−λ}λˣ/x!, E(X) = λ and Var(X) = λ. Hence E(X̄) = λ and Var(X̄) = λ/n, so X̄ is unbiased for λ.
(2) To show that X̄ is MVUE, we will show that Var(X̄) = 1/I(λ).
Step 1: The log-likelihood is
l(λ; x) = log P(x; λ) = log Πᵢ₌₁ⁿ e^{−λ}λ^{xᵢ}/xᵢ!
= log( e^{−nλ} λ^{Σᵢxᵢ} Πᵢ₌₁ⁿ 1/xᵢ! )
= −nλ + Σᵢ₌₁ⁿ xᵢ log λ − log Πᵢ₌₁ⁿ xᵢ!
= −nλ + nx̄ log λ − log Πᵢ₌₁ⁿ xᵢ!.
Step 2: Find ∂²l/∂λ²:
∂l/∂λ = −n + nx̄/λ =⇒ ∂²l/∂λ² = −nx̄/λ².
Step 3:
I(λ) = −E(∂²l/∂λ²) = E(nX̄/λ²) = nE(X̄)/λ² = nλ/λ² = n/λ.
(3) Finally, observe that Var(X̄) = λ/n = 1/I(λ). By Theorem 2.2.2, any unbiased estimator T for λ must have
Var(T) ≥ 1/I(λ) = λ/n
=⇒ X̄ is a MVUE.
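The steps above can be checked by simulation; a minimal sketch (the values λ = 3, n = 50 are arbitrary): the observed information −∂²l/∂λ² = nx̄/λ² averages to n/λ, and Var(X̄) matches the bound 1/I(λ) = λ/n.

```python
import numpy as np

# Simulation sketch: for i.i.d. Po(lambda) samples, E(-d^2 l/d lambda^2)
# = n/lambda, and Var(Xbar) attains the CRLB lambda/n.
rng = np.random.default_rng(1)
lam, n, reps = 3.0, 50, 100_000
x = rng.poisson(lam, size=(reps, n))
xbar = x.mean(axis=1)

obs_info = n * xbar / lam**2                   # -d^2 l / d lambda^2 per sample
assert abs(obs_info.mean() - n / lam) < 0.05   # Fisher information n/lambda
assert abs(np.var(xbar) - lam / n) < 0.005     # Var(Xbar) = 1/I(lambda)
```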
Theorem 2.2.3.
An unbiased estimator T(x) can achieve the Cramér-Rao Lower Bound only if the joint PDF/probability function has the form
f(x; θ) = exp{A(θ)T(x) + B(θ) + h(x)},
where A, B are functions such that θ = −B′(θ)/A′(θ), and h is some function of x.
Proof. Recall from the proof of Theorem 2.2.2 that the bound arises from the inequality
Cor²(T(X), U(θ; X)) ≤ 1,
where U(θ; x) = ∂l/∂θ. Moreover, it is easy to see that the Cramér-Rao Lower Bound (CRLB) can be achieved only when
Cor²{T(X), U(θ; X)} = 1.
Hence the CRLB is achieved only if U is a linear function of T with probability 1 (correlation equals ±1):
U = aT + b, where a, b are constants that can depend on θ but not on x,
i.e.
∂l/∂θ = U(θ; x) = a(θ)T(x) + b(θ) for all x.
Integrating wrt θ, we obtain
log f(x; θ) = l(θ; x) = A(θ)T(x) + B(θ) + h(x),
where A′(θ) = a(θ), B′(θ) = b(θ). Hence,
f(x; θ) = exp{A(θ)T(x) + B(θ) + h(x)}.
Finally, to ensure that E(T) = θ, recall that E{U(θ; X)} = 0 and observe that in this case
E{U(θ; X)} = E[A′(θ)T(X) + B′(θ)] = A′(θ)E[T(X)] + B′(θ) = 0
=⇒ E[T(X)] = −B′(θ)/A′(θ).
Hence, in order that T(X) be unbiased for θ, we must have −B′(θ)/A′(θ) = θ.
Definition 2.2.3.
A probability density function/probability function is said to be a single-parameter exponential family if it has the form
f(x; θ) = exp{A(θ)t(x) + B(θ) + h(x)}
for all x ∈ D ⊆ R, where D does not depend on θ.
If x₁, …, xₙ is a random sample from an exponential family, the joint PDF/probability function becomes
f(x; θ) = Πᵢ₌₁ⁿ exp{A(θ)t(xᵢ) + B(θ) + h(xᵢ)}
= exp{ A(θ) Σᵢ₌₁ⁿ t(xᵢ) + nB(θ) + Σᵢ₌₁ⁿ h(xᵢ) }.
In this case, a minimum variance unbiased estimator that achieves the CRLB can be seen to be
T = (1/n) Σᵢ₌₁ⁿ t(xᵢ),
which is the MVUE for E(T) = −B′(θ)/A′(θ).
Example (Poisson example revisited)
Observe that the Poisson probability function is
p(x; λ) = e^{−λ}λˣ/x!
= exp{(log λ)x − λ − log x!}
= exp{A(λ)t(x) + B(λ) + h(x)},
where
A(λ) = log λ, B(λ) = −λ, t(x) = x, h(x) = −log x!.
Hence X̄ = (1/n) Σᵢ₌₁ⁿ t(xᵢ) is the MVUE for
−B′(λ)/A′(λ) = −(−1)/(1/λ) = λ.
Example (2)
Consider the Exp(λ) distribution. The PDF is
f(x) = λe^{−λx}, x > 0
= exp{−λx + log λ}
=⇒ f is an exponential family with
A(λ) = −λ, B(λ) = log λ, t(x) = x, h(x) = 0.
We can also check that E(X) = −B′(λ)/A′(λ). In particular, we have seen previously that E(X) = 1/λ for X ∼ Exp(λ). Now observe that A′(λ) = −1 and B′(λ) = 1/λ
=⇒ −B′(λ)/A′(λ) = 1/λ, as required.
It also follows that if x₁, x₂, …, xₙ are i.i.d. Exp(λ) observations, then X̄ = (1/n) Σᵢ₌₁ⁿ t(xᵢ) is the MVUE for 1/λ = E(X).

Definition 2.2.4.
Consider data with PDF/prob. function f(x; θ). A statistic S(x) is called a sufficient statistic for θ if f(x|s; θ) does not depend on θ for all s.
Remarks
(1) We will see that sufficient statistics capture all of the information in the data x that is relevant to θ.
(2) If we consider vector-valued statistics, then this definition admits trivial examples, such as s = x, since
P(X = x | S = s) = 1 if x = s, and 0 otherwise,
which does not depend on θ.
Example
Suppose x₁, x₂, …, xₙ are i.i.d. Bernoulli-θ and let s = Σᵢ₌₁ⁿ xᵢ. Then S is sufficient for θ.

Proof.
P(x) = Πᵢ₌₁ⁿ θ^{xᵢ}(1 − θ)^{1−xᵢ}
= θ^{Σxᵢ}(1 − θ)^{Σ(1−xᵢ)}
= θˢ(1 − θ)^{n−s}.
Next observe that S ∼ B(n, θ)
=⇒ P(s) = C(n, s) θˢ(1 − θ)^{n−s}
=⇒ P(x|s) = P({X = x} ∩ {S = s}) / P(S = s)
= P(X = x) / P(S = s)
= 1/C(n, s) if Σᵢ₌₁ⁿ xᵢ = s, and 0 otherwise.
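Sufficiency can also be seen in simulation: conditional on S = s, the distribution of the data is the same whatever θ generated it. A sketch (n = 10 and s = 4 are arbitrary choices): by exchangeability, P(X₁ = 1 | S = s) = s/n for every θ.

```python
import numpy as np

# Sketch: conditional on S = 4, P(X1 = 1 | S) = 4/10 no matter which
# theta generated the Bernoulli data -- the conditional law is free of theta.
rng = np.random.default_rng(2)
n, s, reps = 10, 4, 400_000

for theta in (0.3, 0.7):
    x = rng.random((reps, n)) < theta      # Bernoulli(theta) samples
    cond = x[x.sum(axis=1) == s]           # keep only samples with S = 4
    p1 = cond[:, 0].mean()                 # estimate of P(X1 = 1 | S = 4)
    assert abs(p1 - s / n) < 0.02
```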
Theorem 2.2.4 (The Factorization Theorem).
Suppose x₁, …, xₙ have joint PDF/prob function f(x; θ). Then S is a sufficient statistic for θ if and only if
f(x; θ) = g(s; θ)h(x)
for some functions g, h.
Proof.
Omitted.
Examples
(1) x₁, x₂, …, xₙ i.i.d. N(µ, σ²), σ² known. Then
f(x; µ) = Πᵢ₌₁ⁿ (1/√(2π)σ) e^{−(xᵢ−µ)²/(2σ²)}
= (2πσ²)^{−n/2} exp{ −(1/(2σ²)) Σᵢ₌₁ⁿ (xᵢ − µ)² }
= (2πσ²)^{−n/2} exp{ −(1/(2σ²)) ( Σᵢ₌₁ⁿ xᵢ² − 2µ Σᵢ₌₁ⁿ xᵢ + nµ² ) }
= (2πσ²)^{−n/2} exp{ (1/(2σ²)) ( 2µ Σᵢ₌₁ⁿ xᵢ − nµ² ) } exp{ −(1/(2σ²)) Σᵢ₌₁ⁿ xᵢ² }
= exp{ (1/(2σ²)) ( 2µ Σᵢ₌₁ⁿ xᵢ − nµ² ) } exp{ −(1/(2σ²)) Σᵢ₌₁ⁿ xᵢ² − (n/2) log(2πσ²) }
=⇒ S = Σᵢ₌₁ⁿ xᵢ is sufficient for µ.

(2) If x₁, x₂, …, xₙ are i.i.d. with
f(x) = exp{A(θ)t(x) + B(θ) + h(x)},
then
f(x; θ) = exp{ A(θ) Σᵢ₌₁ⁿ t(xᵢ) + nB(θ) } exp{ Σᵢ₌₁ⁿ h(xᵢ) }
=⇒ S = Σᵢ₌₁ⁿ t(xᵢ) is sufficient for θ in the exponential family by the Factorization Theorem.
Theorem 2.2.5 (Rao-Blackwell Theorem).
If T is an unbiased estimator for θ and S is a sufficient statistic for θ, then the quantity T* = E(T|S) is also an unbiased estimator for θ with Var(T*) ≤ Var(T). Moreover, Var(T*) = Var(T) iff T* = T with probability 1.

Proof.
(I) Unbiasedness:
θ = E(T) = E_S{E(T|S)} = E(T*)
=⇒ E(T*) = θ.
(II) Variance inequality:
Var(T) = E{Var(T|S)} + Var{E(T|S)}
= E{Var(T|S)} + Var(T*)
≥ Var(T*),
since E{Var(T|S)} ≥ 0.
Observe also that Var(T) = Var(T*)
=⇒ E{Var(T|S)} = 0
=⇒ Var(T|S) = 0 with prob. 1
=⇒ T = E(T|S) with prob. 1.
(III) T* is an estimator:
Since S is sufficient for θ,
T* = ∫ T(x) f(x|s) dx,
which does not depend on θ.
Remarks
(1) The Rao-Blackwell Theorem can occasionally be used to construct estimators.
(2) The theoretical importance of this result is the observation that T* always depends on x only through S. If T is already a MVUE then T* = T, so a MVUE depends on x only through S.
Example<br />
Suppose x 1 , . . . , x n ∼i.i.d. N(µ, σ 2 ), σ 2 known. We want to estimate µ. Take T = x 1 ,<br />
as an unbiased estimator for µ.<br />
n∑<br />
S = x i is a sufficient statistic for µ.<br />
i=1<br />
93
2. STATISTICAL INFERENCE<br />
According to Rao-Blackwell, T ∗ = E(T |S) will be an improved (unbiased) estimator<br />
for µ.<br />
( T<br />
If x = (x 1 , . . . , x n ) T for X ∼ N n (µ1, σ 2 I), then = Ax, where A is a matrix<br />
S)<br />
⎛<br />
A = ⎝<br />
1 0 . . . 0<br />
1 1 . . . 1<br />
⎞<br />
⎠<br />
Hence,<br />
⎛⎛<br />
( T<br />
∼ N 2 (A(µ1), σ<br />
S)<br />
2 AA T ) = N 2<br />
⎝⎝<br />
µ<br />
nµ<br />
⎞ ⎛<br />
⎠ , ⎝<br />
⎞⎞<br />
σ 2 σ 2<br />
⎠⎠<br />
nσ 2<br />
σ 2<br />
=⇒ T |S = s ∼<br />
)<br />
N<br />
(µ + σ2<br />
nσ (s − nµ), 2 σ2 − σ4<br />
nσ 2<br />
=<br />
( ( 1<br />
N<br />
n s, 1 − 1 ) )<br />
σ 2<br />
n<br />
∴ E(T |S) = 1 n<br />
is a better (or equal) estimator for µ.<br />
∑<br />
xi = ¯x = T ∗<br />
Finally, observe Var( ¯X) = σ2<br />
n ≤ σ2 = Var(X 1 ) with < for n > 1, σ 2 > 0.<br />
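The improvement from T = X₁ to T* = X̄ is visible directly in simulation; a minimal sketch (µ = 1, σ = 2, n = 10 are arbitrary values):

```python
import numpy as np

# Sketch of the Rao-Blackwell improvement: start from T = X1 (unbiased
# for mu) and condition on the sufficient statistic S = sum(X); the
# result T* = Xbar has variance sigma^2/n instead of sigma^2.
rng = np.random.default_rng(3)
mu, sigma, n, reps = 1.0, 2.0, 10, 200_000
x = rng.normal(mu, sigma, size=(reps, n))

T = x[:, 0]              # crude unbiased estimator
T_star = x.mean(axis=1)  # E(T | S) = Xbar

assert abs(T.mean() - mu) < 0.02 and abs(T_star.mean() - mu) < 0.02
assert np.var(T_star) < np.var(T)               # strict improvement
assert abs(np.var(T_star) - sigma**2 / n) < 0.01
```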
Remarks
(1) We have given only an incomplete outline of the theory.
(2) There are also the concepts of minimal sufficiency and complete statistics to be considered.
2.3 Methods of Estimation
2.3.1 Method of Moments
Consider a random sample X₁, X₂, …, Xₙ from F(θ) and let µ = µ(θ) = E(X). The method of moments estimator θ̃ is defined as the solution to the equation
X̄ = µ(θ̃).
Example
X₁, …, Xₙ ∼ i.i.d. Exp(λ) =⇒ E(Xᵢ) = 1/λ; the method of moments estimator is defined as the solution to the equation
X̄ = 1/λ̃ =⇒ λ̃ = 1/X̄.
Remark
The method of moments is appealing:
(1) it is simple;
(2) its rationale is that X̄ is BLUE for µ.
If θ = (θ₁, θ₂, …, θ_p), the method of moments can be adapted as follows:
(1) let µₖ = µₖ(θ) = E(Xᵏ), k = 1, …, p;
(2) let mₖ = (1/n) Σᵢ₌₁ⁿ xᵢᵏ.
The MoM estimator θ̃ is defined to be the solution to the system of equations
m₁ = µ₁(θ̃), m₂ = µ₂(θ̃), …, m_p = µ_p(θ̃).
Example
Suppose X₁, X₂, …, Xₙ are i.i.d. N(µ, σ²) and let θ = (µ, σ²). Here p = 2, so we have two equations in two unknowns:
µ₁(θ) = E(X) = µ, µ₂(θ) = E(X²) = σ² + µ²,
m₁ = (1/n) Σᵢ₌₁ⁿ xᵢ, m₂ = (1/n) Σᵢ₌₁ⁿ xᵢ².
∴ we need to solve for µ, σ²:
µ̃ = x̄,
σ̃² + µ̃² = (1/n) Σ xᵢ² =⇒ σ̃² = (1/n) Σ xᵢ² − x̄²
= (1/n) Σ (xᵢ − x̄)²
= ((n − 1)/n) s².
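The two-equation system above can be solved mechanically from sample moments; a sketch (the values µ = 5, σ = 3 are arbitrary):

```python
import numpy as np

# Method-of-moments estimates for N(mu, sigma^2): solve m1 = mu and
# m2 = sigma^2 + mu^2, giving mu~ = xbar and sigma^2~ = m2 - xbar^2.
rng = np.random.default_rng(4)
x = rng.normal(5.0, 3.0, size=200_000)

mu_mom = x.mean()
sigma2_mom = np.mean(x**2) - mu_mom**2   # the n-divisor sample variance

assert abs(mu_mom - 5.0) < 0.05
assert abs(sigma2_mom - 9.0) < 0.2
assert np.isclose(sigma2_mom, np.var(x))  # same as ((n-1)/n) s^2
```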
Remark
The method of moments estimators can be seen to have good statistical properties. Under fairly mild regularity conditions, the MoM estimator is
(1) consistent, i.e., θ̃ → θ in probability as n → ∞;
(2) asymptotically normal, i.e.,
(θ̃ − θ)/√Var(θ̃) →_D N(0, 1) as n → ∞.
However, the MoM estimator has one serious defect. Suppose X₁, …, Xₙ is a random sample from F_θ and let θ̃_X be the MoM estimator. Let Y = h(X) for some invertible function h; then Y₁, …, Yₙ should contain the same information as X₁, …, Xₙ, so we would like θ̃_Y ≡ θ̃_X. Unfortunately this does not hold.
Example
X₁, …, Xₙ i.i.d. Exp(λ). We saw previously that λ̃_X = 1/X̄. Suppose Yᵢ = Xᵢ² (an invertible transformation for Xᵢ > 0). To obtain λ̃_Y, observe E(Y) = E(X²) = 2/λ²
=⇒ λ̃_Y = √( 2n / Σ Xᵢ² ) ≠ 1/X̄.
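This non-invariance shows up immediately in a simulation sketch (λ = 2 and n = 1000 are arbitrary): matching E(X) and matching E(X²) give numerically different estimates from the same data.

```python
import numpy as np

# The MoM estimator is not invariant under transforming the data:
# for Exp(lambda), matching E(X) = 1/lambda gives lambda~_X = 1/xbar,
# while matching E(X^2) = 2/lambda^2 gives lambda~_Y = sqrt(2n/sum(x^2)).
rng = np.random.default_rng(5)
lam = 2.0
x = rng.exponential(1 / lam, size=1_000)

lam_x = 1 / x.mean()
lam_y = np.sqrt(2 * len(x) / np.sum(x**2))

assert not np.isclose(lam_x, lam_y)                        # estimates differ
assert abs(lam_x - lam) < 0.3 and abs(lam_y - lam) < 0.3   # both near 2
```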
2.3.2 Maximum Likelihood Estimation
Consider a statistical problem with log-likelihood function l(θ; x).
Definition. The maximum likelihood estimate θ̂ is the solution to the problem max_{θ∈Θ} l(θ; x), i.e. θ̂ = arg max_{θ∈Θ} l(θ; x).
Remark
In practice, maximum likelihood estimates are obtained by solving the score equation
∂l/∂θ = U(θ; x) = 0.

Example
If X₁, X₂, …, Xₙ are i.i.d. geometric-θ RVs, find θ̂.
Solution:
l(θ; x) = log Πᵢ₌₁ⁿ θ(1 − θ)^{xᵢ−1}
= n log θ + Σᵢ₌₁ⁿ (xᵢ − 1) log(1 − θ)
= n{ log θ + (x̄ − 1) log(1 − θ) }
∴ U(θ; x) = ∂l/∂θ = n{ 1/θ − (x̄ − 1)/(1 − θ) }
= n (1 − θx̄) / (θ(1 − θ)).
To find θ̂, we solve for θ in 1 − θx̄ = 0
=⇒ θ̂ = 1/x̄ [ = n / Σᵢ₌₁ⁿ xᵢ = (# of successes)/(# of trials) ].

Elementary Properties
(1) MLEs are invariant under invertible transformations of the data.
Proof. (Continuous i.i.d. RVs, strictly monotonic transformation)
Suppose X has PDF f(x; θ) and Y = h(X), where h is strictly monotonic;
=⇒ f_Y(y; θ) = f_X(h⁻¹(y); θ)|h⁻¹(y)′| when y = h(x).
Consider data x₁, x₂, …, xₙ and the transformed version y₁, y₂, …, yₙ:
l_Y(θ; y) = log Πᵢ₌₁ⁿ f_Y(yᵢ; θ)
= log Πᵢ₌₁ⁿ f_X(h⁻¹(yᵢ); θ)|h⁻¹(yᵢ)′|
= log Πᵢ₌₁ⁿ f_X(xᵢ; θ) + log Πᵢ₌₁ⁿ |h⁻¹(yᵢ)′|
= l_X(θ; x) + log( Πᵢ₌₁ⁿ |h⁻¹(yᵢ)′| ).
Since log( Πᵢ₌₁ⁿ |h⁻¹(yᵢ)′| ) does not depend on θ, it follows that θ̂ maximizes l_X(θ; x) iff it maximizes l_Y(θ; y).

(2) If φ = φ(θ) is a 1-1 transformation of θ, then the MLEs obey the transformation rule φ̂ = φ(θ̂).
Proof. It can be checked that if l_φ(φ; x) is the log-likelihood with respect to φ, then
l_θ(θ; x) = l_φ(φ(θ); x).
It follows that θ̂ maximizes l_θ(θ; x) iff φ̂ = φ(θ̂) maximizes l_φ(φ; x).
(3) If t(x) is a sufficient statistic for θ, then θ̂ depends on the data only as a function of t(x).
Proof. By the Factorization Theorem, t(x) is sufficient for θ iff
f(x; θ) = g(t(x); θ)h(x)
=⇒ l(θ; x) = log g(t(x); θ) + log h(x)
=⇒ θ̂ maximizes l(θ; x) ⇔ θ̂ maximizes log g(t(x); θ)
=⇒ θ̂ is a function of t(x).

Example
Suppose X₁, X₂, …, Xₙ are i.i.d. Exp(λ); then
f_X(x; λ) = Πᵢ₌₁ⁿ λe^{−λxᵢ} = λⁿ e^{−λΣᵢxᵢ} = λⁿ e^{−λnx̄}.
By the Factorization Theorem, x̄ is sufficient for λ. To get the MLE,
U(λ; x) = ∂l/∂λ = ∂/∂λ (n log λ − nλx̄) = n/λ − nx̄;
∂l/∂λ = 0 =⇒ 1/λ = x̄ =⇒ λ̂ = 1/x̄.
Note: as proved, λ̂ is a function of the sufficient statistic x̄.
Let Y₁, Y₂, …, Yₙ be defined by Yᵢ = log Xᵢ.
If X ∼ Exp(λ) and Y = log X, then we can find f_Y(y) by taking h(x) = log x and using
f_Y(y) = f_X(h⁻¹(y))|h⁻¹(y)′|    [h⁻¹(y) = e^y, h⁻¹(y)′ = e^y]
= λe^{−λe^y} e^y
=⇒ l_Y(λ; y) = log Πᵢ₌₁ⁿ λe^{−λe^{yᵢ}} e^{yᵢ}
= log( λⁿ e^{−λΣᵢe^{yᵢ}} e^{Σᵢyᵢ} )
= n log λ − λ Σᵢ₌₁ⁿ e^{yᵢ} + Σᵢ₌₁ⁿ yᵢ
=⇒ ∂/∂λ l_Y(λ; y) = n/λ − Σᵢ₌₁ⁿ e^{yᵢ}
=⇒ λ̂ = n / Σᵢ₌₁ⁿ e^{yᵢ} = n / Σᵢ₌₁ⁿ e^{log xᵢ} = n / Σᵢ₌₁ⁿ xᵢ = 1/x̄.
Finally, suppose we take θ = log λ =⇒ λ = e^θ
=⇒ f(x; θ) = e^θ e^{−e^θ x}    (= λe^{−λx})
=⇒ l_θ(θ; x) = log Πᵢ₌₁ⁿ e^θ e^{−e^θ xᵢ}
= log( e^{nθ} e^{−e^θ nx̄} )
= nθ − e^θ nx̄.
∂l/∂θ = n − nx̄e^θ
∴ ∂l/∂θ = 0 =⇒ 1 = x̄e^θ =⇒ e^θ = 1/x̄ =⇒ θ̂ = −log x̄.
But λ̂ = 1/x̄ =⇒ log λ̂ = log(1/x̄) = −log x̄ = θ̂, as required.
Remark (Not examinable)
Maximum likelihood estimation and the method of moments can both be generated by the use of estimating functions. An estimating function is a function H(x; θ) with the property E{H(x; θ)} = 0. H can be used to define an estimator θ̃ as a solution to the equation
H(x; θ) = 0.
For method of moments estimates, we can take
H(x; θ) = x̄ − E(X)    [x̄ − E(X̄) in the non-i.i.d. case].
For maximum likelihood, we can use
H(x; θ) = U(θ; x) = ∂l/∂θ;
recall we showed previously that E{U(θ; x)} = 0. Thus to calculate the MLE:
(1) find the log-likelihood, then
(2) calculate the score and set it equal to 0.
Asymptotic Properties of MLEs:
Suppose X₁, X₂, X₃, … is a sequence of i.i.d. RVs with common PDF (or probability function) f(x; θ).
Assume:
(1) θ₀ is the true value of the parameter θ;
(2) f is such that f(x; θ₁) = f(x; θ₂) for all x =⇒ θ₁ = θ₂.
We will show (in outline) that if θ̂ₙ is the MLE (maximum likelihood estimator) based on X₁, X₂, …, Xₙ, then:
(1) θ̂ₙ → θ₀ in probability, i.e., for each ε > 0,
lim_{n→∞} P(|θ̂ₙ − θ₀| > ε) = 0    (consistency);
(2) √(n i(θ₀)) (θ̂ₙ − θ₀) →_D N(0, 1) as n → ∞    (asymptotic normality).
Remark
The practical use of asymptotic normality is that for large n, approximately
θ̂ ∼ N( θ₀, 1/(n i(θ₀)) ).
Outline of consistency and asymptotic normality for MLEs:
Consider i.i.d. data X₁, X₂, … with common PDF/prob. function f(x; θ). If θ̂ₙ is the MLE based on X₁, X₂, …, Xₙ, we will show (in outline) that θ̂ₙ → θ₀ in probability as n → ∞, where θ₀ is the true value of θ.
Lemma 2.3.1.
Suppose f is such that f(x; θ₁) = f(x; θ₂) for all x =⇒ θ₁ = θ₂. Then l*(θ) = E{log f(X; θ)} is maximized uniquely by θ = θ₀.
Proof.
l*(θ) − l*(θ₀) = ∫ (log f(x; θ)) f(x; θ₀) dx − ∫ (log f(x; θ₀)) f(x; θ₀) dx
= ∫ log( f(x; θ)/f(x; θ₀) ) f(x; θ₀) dx
≤ ∫ ( f(x; θ)/f(x; θ₀) − 1 ) f(x; θ₀) dx    [since log u ≤ u − 1]
= ∫ ( f(x; θ) − f(x; θ₀) ) dx
= ∫ f(x; θ) dx − ∫ f(x; θ₀) dx
= 1 − 1 = 0.
Moreover, equality is achieved iff
f(x; θ)/f(x; θ₀) = 1 for all x
=⇒ f(x; θ) = f(x; θ₀) for all x
=⇒ θ = θ₀.
Lemma 2.3.2.
Let l̄ₙ(θ; x) = (1/n) l(θ; x₁, …, xₙ), i.e., 1/n × the log-likelihood based on x₁, x₂, …, xₙ. Then for each θ, l̄ₙ(θ; x) → l*(θ) in probability as n → ∞.
Proof. Observe that
l̄ₙ(θ; x) = (1/n) log Πᵢ₌₁ⁿ f(xᵢ; θ)
= (1/n) Σᵢ₌₁ⁿ log f(xᵢ; θ)
= (1/n) Σᵢ₌₁ⁿ Lᵢ(θ),
where Lᵢ(θ) = log f(xᵢ; θ). Since the Lᵢ(θ) are i.i.d. with E(Lᵢ(θ)) = l*(θ), it follows by the weak law of large numbers that l̄ₙ(θ; x) → l*(θ) in probability as n → ∞.
To summarise, we have proved that:
(1) l̄ₙ(θ; x) → l*(θ);
(2) l*(θ) is maximized when θ = θ₀.
Since θ̂ₙ maximizes l̄ₙ(θ; x), it follows that θ̂ₙ → θ₀ in probability.
Theorem 2.3.1.
Under the above assumptions, let Uₙ(θ; x) denote the score based on X₁, …, Xₙ. Then
Uₙ(θ₀; x) / √(n i(θ₀)) →_D N(0, 1).
Proof.
Uₙ(θ₀; x) = ∂/∂θ log Πᵢ₌₁ⁿ f(xᵢ; θ) |_{θ=θ₀}
= Σᵢ₌₁ⁿ ∂/∂θ log f(xᵢ; θ) |_{θ=θ₀}
= Σᵢ₌₁ⁿ Uᵢ, where Uᵢ = ∂/∂θ log f(xᵢ; θ) |_{θ=θ₀}.
Since U₁, U₂, … are i.i.d. with E(Uᵢ) = 0 and Var(Uᵢ) = i(θ₀), the Central Limit Theorem gives
( Σᵢ₌₁ⁿ Uᵢ − nE(U) ) / √(n Var(U)) = Uₙ(θ₀; x) / √(n i(θ₀)) →_D N(0, 1).
Theorem 2.3.2.
Under the above assumptions,
√(n i(θ₀)) (θ̂ₙ − θ₀) →_D N(0, 1).
Proof. Consider the first-order Taylor expansion of Uₙ(θ̂ₙ; x) about θ₀:
Uₙ(θ̂ₙ; x) ≈ Uₙ(θ₀; x) + Uₙ′(θ₀; x)(θ̂ₙ − θ₀).
Since θ̂ₙ solves the score equation, Uₙ(θ̂ₙ; x) = 0, so for large n
Uₙ(θ₀; x) ≈ −Uₙ′(θ₀; x)(θ̂ₙ − θ₀),
and since Uₙ(θ₀; x)/√(n i(θ₀)) →_D N(0, 1),
−Uₙ′(θ₀; x)(θ̂ₙ − θ₀) / √(n i(θ₀)) →_D N(0, 1).
Now observe that
−Uₙ′(θ₀; x)/n → i(θ₀) as n → ∞
by the weak law of large numbers, since
−Uₙ′(θ₀; x)/n = (1/n) Σᵢ₌₁ⁿ ( −∂²/∂θ² log f(xᵢ; θ) |_{θ=θ₀} )
=⇒ −Uₙ′(θ₀; x)/(n i(θ₀)) → 1 in probability.
Hence,
−Uₙ′(θ₀; x)(θ̂ₙ − θ₀) / √(n i(θ₀)) ≈ √(n i(θ₀)) (θ̂ₙ − θ₀) for large n
=⇒ √(n i(θ₀)) (θ̂ₙ − θ₀) →_D N(0, 1).
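Asymptotic normality can be seen in a simulation sketch for the Poisson model, where θ̂ = X̄ and i(λ) = 1/λ (the values λ₀ = 4 and n = 400 are arbitrary):

```python
import numpy as np

# Simulation sketch of Theorem 2.3.2 for Po(lambda): the MLE is
# theta_hat = Xbar and i(lambda) = 1/lambda, so
# Z = sqrt(n * i(lambda0)) * (Xbar - lambda0) is approximately N(0, 1).
rng = np.random.default_rng(8)
lam0, n, reps = 4.0, 400, 50_000
xbar = rng.poisson(lam0, size=(reps, n)).mean(axis=1)
z = np.sqrt(n / lam0) * (xbar - lam0)

assert abs(z.mean()) < 0.03 and abs(z.std() - 1) < 0.02
assert abs(np.mean(np.abs(z) > 1.96) - 0.05) < 0.006   # ~5% exceed 1.96
```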
Remark
The preceding theory can be generalized to vector-valued parameters. We will not discuss the details.
2.4 Hypothesis Tests and Confidence Intervals
Motivating Example:
Suppose X₁, X₂, …, Xₙ are i.i.d. N(µ, σ²), and consider H₀: µ = µ₀ vs. Hₐ: µ ≠ µ₀. If σ² is known, then the test of H₀ with significance level α is defined by the test statistic
Z = (X̄ − µ₀) / (σ/√n),
and the rule: reject H₀ if |z| ≥ z(α/2).
A 100(1 − α)% CI for µ is given by
( X̄ − z(α/2)σ/√n, X̄ + z(α/2)σ/√n ).
It is easy to check that the confidence interval contains all values of µ₀ that are acceptable null hypotheses.
In general, consider a statistical problem with parameter
θ ∈ Θ₀ ∪ Θ_A, where Θ₀ ∩ Θ_A = ∅.
We consider the null hypothesis
H₀: θ ∈ Θ₀
and the alternative hypothesis
H_A: θ ∈ Θ_A.
The hypothesis-testing setup can be represented as:

                        Actual status
Test result         H₀ true              H_A true
Accept H₀           correct              type II error (β)
Reject H₀           type I error (α)     correct

We would like both the type I and type II error rates to be as small as possible. However, these goals conflict with each other. To reduce the type I error rate we need to "make it harder to reject H₀"; to reduce the type II error rate we need to "make it easier to reject H₀".
The standard (Neyman-Pearson) approach to hypothesis testing is to control the type I error rate at a "small" value α and then use a test that makes the type II error as small as possible.
The equivalence between confidence intervals and hypothesis tests can be formulated as follows. Recall that a 100(1 − α)% CI for θ is a random interval (L, U) with the property
P((L, U) ∋ θ) = 1 − α.
It is easy to check that the test defined by the rule
"Accept H₀: θ = θ₀ iff θ₀ ∈ (L, U)"
has significance level α. Here
α = P(reject H₀ | H₀ true),
β = P(retain H₀ | H_A true),
1 − β = power = P(reject H₀ | H_A true), which is what we want to be high.
Conversely, given a hypothesis test of H₀: θ = θ₀ with significance level α, it can be proved that the set {θ₀ : H₀: θ = θ₀ is accepted} is a 100(1 − α)% confidence region for θ.
Large-Sample Tests and Confidence Intervals
Consider a statistical problem with data x₁, …, xₙ, log-likelihood l(θ; x), score U(θ; x) and information i(θ). Consider also a hypothesis H₀: θ = θ₀. The following three tests are often considered:
(1) The Wald Test:
Test statistic: W = √(n i(θ̂)) (θ̂ − θ₀)
Critical region: reject for |W| ≥ z(α/2)
(2) The Score Test:
Test statistic: V = U(θ₀; x) / √(n i(θ₀))
Critical region: reject for |V| ≥ z(α/2)
(3) The Likelihood-Ratio Test:
Test statistic: G² = 2(l(θ̂) − l(θ₀))
Critical region: reject for G² ≥ χ²₁,α

Example:
Suppose X₁, X₂, …, Xₙ are i.i.d. Po(λ), and consider H₀: λ = λ₀.
l(λ; x) = log Πᵢ₌₁ⁿ e^{−λ}λ^{xᵢ}/xᵢ!
= n(x̄ log λ − λ) − log Πᵢ₌₁ⁿ xᵢ!
U(λ; x) = ∂l/∂λ = n(x̄/λ − 1) = n(x̄ − λ)/λ =⇒ λ̂ = x̄
I(λ) = n i(λ) = E(−∂²l/∂λ²) = E(nX̄/λ²) = n/λ
=⇒ W = √(n i(λ̂)) (λ̂ − λ₀) = (λ̂ − λ₀) / √(λ̂/n)
V = U(λ₀; x) / √(n i(λ₀)) = n(x̄ − λ₀) / (λ₀√(n/λ₀)) = (x̄ − λ₀) / √(λ₀/n)
G² = 2(l(λ̂) − l(λ₀))
= 2n(x̄ log λ̂ − λ̂) − 2n(x̄ log λ₀ − λ₀)
= 2n( x̄ log(λ̂/λ₀) − (λ̂ − λ₀) ).
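As a worked illustration of the three statistics, take the hypothetical summary n = 200, x̄ = 3.1 and test H₀: λ₀ = 3 (numbers chosen for illustration, not from the notes):

```python
import numpy as np

# The three large-sample test statistics for H0: lambda = 3 with
# i.i.d. Poisson data, using the hypothetical summary n = 200, xbar = 3.1.
n, lam0, xbar = 200, 3.0, 3.1

W = (xbar - lam0) / np.sqrt(xbar / n)                      # Wald
V = (xbar - lam0) / np.sqrt(lam0 / n)                      # score
G2 = 2 * n * (xbar * np.log(xbar / lam0) - (xbar - lam0))  # LR

# None of the tests rejects at the 5% level here ...
assert max(abs(W), abs(V)) < 1.96 and G2 < 1.96**2
# ... and the three statistics nearly coincide, as the theory predicts.
assert abs(G2 - V**2) < 0.05 and abs(G2 - W**2) < 0.05
```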
Remarks
(1) It can be proved that the tests based on W, V, G² are asymptotically equivalent when H₀ is true.
(2) As a by-product, it follows that the null distribution of G² is χ²₁. (Recall from Theorems 2.3.1 and 2.3.2 that the null distributions of W and V are both N(0, 1).)
(3) To understand the motivation for the three tests, it is useful to consider their relation to the log-likelihood function. See Figure 20.
We have introduced the Wald test, score test and likelihood-ratio test for H₀: θ = θ₀ vs. H_A: θ = θₐ. These are large-sample tests in that asymptotic distributions for the test statistics under H₀ are available. It can also be proved that the LR statistic and the score test statistic are invariant under transformation of the parameter, but the Wald test is not.
Each of these three tests can be inverted to give a confidence interval (region) for θ:
Wald Test
Figure 20: Relationships of the tests to the log-likelihood function
Solve for θ 0 in W 2 ≤ z(α/2) 2 .<br />
Recall, W =<br />
√<br />
ni(ˆθ)(ˆθ − θ 0 )<br />
=⇒ ˆθ − √<br />
z(α/2)<br />
ni(ˆθ)<br />
≤<br />
θ 0 ≤ ˆθ + √<br />
z(α/2)<br />
ni(ˆθ)<br />
i.e., ˆθ ±<br />
z(α/2)<br />
√ .<br />
ni(ˆθ)<br />
Score Test<br />
Need to solve for $\theta_0$ in
$$V^2 = \left(\frac{U(\theta_0; \mathbf{x})}{\sqrt{n\,i(\theta_0)}}\right)^2 \le z(\alpha/2)^2.$$
LR Test<br />
Solve for $\theta_0$ in $2(l(\hat{\theta}; \mathbf{x}) - l(\theta_0; \mathbf{x})) \le \chi^2_{1,\alpha} = z(\alpha/2)^2$.
Example:<br />
$X_1, \ldots, X_n$ i.i.d. $\mathrm{Po}(\lambda)$.
Recall that the Wald statistic is $W = \dfrac{\hat{\lambda} - \lambda_0}{\sqrt{\hat{\lambda}/n}}$, which gives the interval
$$\hat{\lambda} \pm z(\alpha/2)\sqrt{\hat{\lambda}/n} \iff \bar{x} \pm z(\alpha/2)\sqrt{\bar{x}/n}.$$
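For a concrete sample, the Wald interval is a one-line computation; the sample mean, $n$ and the 95% level below are illustrative values, not from the notes:

```python
import math

# Sketch of the Wald interval xbar ± z(alpha/2) * sqrt(xbar/n) for
# Poisson data; xbar, n and the level are illustrative choices.
xbar, n = 4.1, 10
z = 1.959963985              # z(0.025), the upper 2.5% normal quantile
half = z * math.sqrt(xbar / n)
wald_ci = (xbar - half, xbar + half)
print(wald_ci)
```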
Score test:<br />
Recall that the test statistic is $V = \dfrac{\bar{x} - \lambda_0}{\sqrt{\lambda_0/n}}$. Hence to find a confidence interval, we need to solve for $\lambda_0$ in
$$V^2 \le z(\alpha/2)^2 \implies \frac{(\bar{x} - \lambda_0)^2}{\lambda_0/n} \le z(\alpha/2)^2 \implies (\bar{x} - \lambda_0)^2 \le \frac{\lambda_0\, z(\alpha/2)^2}{n}$$
$$\implies \lambda_0^2 - \left(2\bar{x} + \frac{z(\alpha/2)^2}{n}\right)\lambda_0 + \bar{x}^2 \le 0,$$
which is a quadratic in $\lambda_0$; its roots $\dfrac{-b \pm \sqrt{b^2 - 4ac}}{2a}$ are the endpoints of the interval.
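With the same illustrative values as before, the quadratic can be solved directly; the coefficients $a$, $b$, $c$ follow the inequality just derived:

```python
import math

# Sketch: invert the score test by finding the roots of
# lambda0^2 - (2*xbar + z^2/n)*lambda0 + xbar^2 = 0.
# xbar, n and the 95% level are illustrative choices.
xbar, n = 4.1, 10
z = 1.959963985              # z(0.025)
a, b, c = 1.0, -(2 * xbar + z * z / n), xbar * xbar
disc = math.sqrt(b * b - 4 * a * c)
score_ci = ((-b - disc) / (2 * a), (-b + disc) / (2 * a))
print(score_ci)
```

Unlike the Wald interval, this interval is not symmetric about $\bar{x}$.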
For the LR test, we have
$$G^2 = 2n\left\{\bar{x}\log\frac{\bar{x}}{\lambda_0} - (\bar{x} - \lambda_0)\right\},$$
and we can solve numerically for $\lambda_0$ in $G^2 \le z(\alpha/2)^2$. See Figure 21.
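The LR interval has no closed form, but a simple bisection (a sketch, again with illustrative data) finds the two points where $G^2(\lambda_0) = z(\alpha/2)^2$, one on each side of $\hat{\lambda} = \bar{x}$:

```python
import math

# Sketch: solve G2(lambda0) = z(alpha/2)^2 numerically by bisection,
# once on each side of the MLE; xbar and n are illustrative values.
xbar, n = 4.1, 10
zsq = 1.959963985 ** 2       # z(0.025)^2, the chi-squared(1) 95% point

def g2(lam0):
    return 2 * n * (xbar * math.log(xbar / lam0) - (xbar - lam0))

def bisect(lo, hi, steps=100):
    # find the root of g2(lam) - zsq in [lo, hi] by interval halving
    for _ in range(steps):
        mid = (lo + hi) / 2
        if (g2(lo) - zsq) * (g2(mid) - zsq) <= 0:
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2

lr_ci = (bisect(0.01, xbar), bisect(xbar, 20.0))
print(lr_ci)
```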
Optimal Tests<br />
Consider simple null and alternative hypotheses:<br />
$$H_0: \theta = \theta_0 \quad \text{and} \quad H_a: \theta = \theta_a.$$
Recall that the type I error rate is defined by
$$\alpha = P(\text{reject } H_0 \mid H_0 \text{ true}),$$
and the type II error rate by
$$\beta = P(\text{accept } H_0 \mid H_0 \text{ false}).$$
The power is defined by $1 - \beta = P(\text{reject } H_0 \mid H_0 \text{ false})$, so we want tests with high power.
Figure 21: The 3 tests<br />
Theorem 2.4.1 (Neyman–Pearson Lemma)
Consider the test of $H_0: \theta = \theta_0$ vs. $H_a: \theta = \theta_a$ defined by the rule
$$\text{reject } H_0 \text{ when } \frac{f(\mathbf{x}; \theta_0)}{f(\mathbf{x}; \theta_a)} \le k,$$
for some constant $k$.
Let $\alpha^*$ be the type I error rate and $1 - \beta^*$ the power of this test. Then any other test with $\alpha \le \alpha^*$ will have $(1 - \beta) \le (1 - \beta^*)$.
Proof. Let $C$ be the critical region for the LR test and let $D$ be the critical (rejection) region for any other test. Let $C_1 = C \cap D$ and let $C_2$, $D_2$ be such that
$$C = C_1 \cup C_2, \quad C_1 \cap C_2 = \emptyset,$$
$$D = C_1 \cup D_2, \quad C_1 \cap D_2 = \emptyset.$$
Figure 22: Neyman-Pearson Lemma<br />
See Figure 22 and observe that $\alpha \le \alpha^*$
$$\implies \int_D f(\mathbf{x}; \theta_0)\,dx_1 \cdots dx_n \le \int_C f(\mathbf{x}; \theta_0)\,dx_1 \cdots dx_n$$
$$\implies \int_{D_2} f(\mathbf{x}; \theta_0)\,dx_1 \cdots dx_n \le \int_{C_2} f(\mathbf{x}; \theta_0)\,dx_1 \cdots dx_n,$$
since the contribution of the common region $C_1$ cancels. Then, writing $C = C_1 \cup C_2$ and $D = C_1 \cup D_2$,
$$(1 - \beta^*) - (1 - \beta) = \int_C f(\mathbf{x}; \theta_a)\,dx_1 \cdots dx_n - \int_D f(\mathbf{x}; \theta_a)\,dx_1 \cdots dx_n$$
$$= \int_{C_1} f(\mathbf{x}; \theta_a)\,dx_1 \cdots dx_n + \int_{C_2} f(\mathbf{x}; \theta_a)\,dx_1 \cdots dx_n - \int_{C_1} f(\mathbf{x}; \theta_a)\,dx_1 \cdots dx_n - \int_{D_2} f(\mathbf{x}; \theta_a)\,dx_1 \cdots dx_n$$
$$= \int_{C_2} f(\mathbf{x}; \theta_a)\,dx_1 \cdots dx_n - \int_{D_2} f(\mathbf{x}; \theta_a)\,dx_1 \cdots dx_n$$
$$\ge \frac{1}{k}\int_{C_2} f(\mathbf{x}; \theta_0)\,dx_1 \cdots dx_n - \frac{1}{k}\int_{D_2} f(\mathbf{x}; \theta_0)\,dx_1 \cdots dx_n$$
$$\ge 0, \text{ as required,}$$
where the first inequality holds because $f(\mathbf{x}; \theta_a) \ge f(\mathbf{x}; \theta_0)/k$ on $C_2 \subseteq C$ and $f(\mathbf{x}; \theta_a) < f(\mathbf{x}; \theta_0)/k$ on $D_2$, which lies outside $C$.
See Figure 23.<br />
Moreover, equality is achieved only if $D_2$ is empty ($D_2 = \emptyset$).
Example<br />
Suppose $X_1, X_2, \ldots, X_n$ are i.i.d. $N(\mu, \sigma^2)$ with $\sigma^2$ given, and consider
$$H_0: \mu = \mu_0 \text{ vs. } H_a: \mu = \mu_a, \quad \mu_a > \mu_0.$$
Figure 23: Neyman-Pearson Lemma<br />
Then
$$\frac{f(\mathbf{x}; \mu_0)}{f(\mathbf{x}; \mu_a)} = \frac{\prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left\{-\frac{1}{2\sigma^2}(x_i - \mu_0)^2\right\}}{\prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left\{-\frac{1}{2\sigma^2}(x_i - \mu_a)^2\right\}}$$
$$= \frac{\exp\left\{-\frac{1}{2\sigma^2}\left(\sum_{i=1}^{n} x_i^2 - 2n\bar{x}\mu_0 + n\mu_0^2\right)\right\}}{\exp\left\{-\frac{1}{2\sigma^2}\left(\sum_{i=1}^{n} x_i^2 - 2n\bar{x}\mu_a + n\mu_a^2\right)\right\}}$$
$$= \exp\left\{\frac{1}{2\sigma^2}\left(2n\bar{x}(\mu_0 - \mu_a) - n\mu_0^2 + n\mu_a^2\right)\right\}.$$
For a constant $k$,
$$\frac{f(\mathbf{x}; \mu_0)}{f(\mathbf{x}; \mu_a)} \le k \iff (\mu_0 - \mu_a)\bar{x} \le k^* \iff \bar{x} \ge c,$$
for a suitably chosen $c$; the final inequality reverses because $\mu_0 - \mu_a < 0$, so we reject when $\bar{x}$ is too big.
To choose $c$, we use the fact that
$$Z = \frac{\bar{X} - \mu_0}{\sigma/\sqrt{n}} \sim N(0, 1) \text{ under } H_0.$$
Hence the usual $z$-test,
$$\text{reject } H_0 \text{ if } \hat{z} \ge z(\alpha),$$
is the Neyman–Pearson LR test in this case.
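As a sanity check (a sketch, with illustrative $n$, $\mu_0$ and $\sigma$, not values from the notes), simulation confirms that this test's type I error rate is close to the nominal $\alpha$:

```python
import math
import random

# Sketch: the one-sided z-test from the example. We estimate its type I
# error rate by simulating samples from H0; n, mu0, sigma, alpha are
# illustrative choices.
random.seed(0)
mu0, sigma, n, alpha = 0.0, 1.0, 25, 0.05
z_alpha = 1.6448536          # z(0.05), the upper 5% normal quantile

def z_stat(sample):
    xbar = sum(sample) / len(sample)
    return (xbar - mu0) / (sigma / math.sqrt(n))

trials, rejections = 20000, 0
for _ in range(trials):
    sample = [random.gauss(mu0, sigma) for _ in range(n)]
    if z_stat(sample) >= z_alpha:   # reject H0 when z-hat is too large
        rejections += 1
print(rejections / trials)          # should be close to alpha = 0.05
```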
Remarks<br />
(1) This result shows that the one-sided $z$ test is also uniformly most powerful for
$$H_0: \mu = \mu_0 \text{ vs. } H_A: \mu > \mu_0.$$
(2) This can be extended to the case of
$$H_0: \mu \le \mu_0 \text{ vs. } H_A: \mu > \mu_0.$$
In this case we take
$$\alpha = \max_{H_0} P(\text{rejecting} \mid \mu),$$
which here is attained at $\mu = \mu_0$.
(3) This construction fails when we consider two-sided alternatives, i.e.,
$$H_0: \mu = \mu_0 \text{ vs. } H_A: \mu \ne \mu_0$$
$\implies$ no uniformly most powerful test exists in that case.