Student Notes To Accompany
MS4214: STATISTICAL INFERENCE

Dr. Kevin Hayes

September 1, 2007
Contents

1 Introduction
  1.1 Motivating Examples
  1.2 General Course Overview

2 The Theory of Estimation
  2.1 The Frequentist Philosophy
  2.2 The Frequentist Approach to Estimation
  2.3 Minimum-Variance Unbiased Estimation
  2.4 Maximum Likelihood Estimation
  2.5 Multi-parameter Estimation
  2.6 Newton-Raphson Optimization
  2.7 The Invariance Principle
  2.8 Optimality Properties of the MLE
  2.9 Data Reduction
  2.10 Worked Problems

3 The Theory of Confidence Intervals
  3.1 Exact Confidence Intervals
  3.2 Pivotal Quantities for Use with Normal Data
  3.3 Approximate Confidence Intervals
  3.4 Worked Problems

4 The Theory of Hypothesis Testing
  4.1 Introduction
  4.2 The General Testing Problem
  4.3 Hypothesis Testing for Normal Data
  4.4 Generally Applicable Test Procedures
  4.5 The Neyman-Pearson Lemma
  4.6 Goodness of Fit Tests
  4.7 The χ² Test for Contingency Tables
  4.8 Worked Problems

A Review of Probability
  A.1 Expectation and Variance
  A.2 Discrete Random Variables
    A.2.1 Bernoulli Distribution
    A.2.2 Binomial Distribution
    A.2.3 Geometric Distribution
    A.2.4 Negative Binomial Distribution
    A.2.5 Hypergeometric Distribution
    A.2.6 Poisson Distribution
    A.2.7 Discrete Uniform Distribution
    A.2.8 The Multinomial Distribution
  A.3 Continuous Random Variables
    A.3.1 Uniform Distribution
    A.3.2 Exponential Distribution
    A.3.3 Gamma Distribution
    A.3.4 Gaussian Distribution
    A.3.5 Weibull Distribution
    A.3.6 Beta Distribution
    A.3.7 Chi-square Distribution
    A.3.8 Distribution of a Function of a Random Variable
  A.4 Random Vectors
    A.4.1 Sums of Independent Random Variables
    A.4.2 Covariance and Correlation
    A.4.3 The Bivariate Change of Variables Formula
    A.4.4 The Bivariate Normal Distribution
    A.4.5 Bivariate Normal Conditional Distributions
    A.4.6 The Multivariate Normal Distribution
  A.5 Generating Functions
  A.6 Table of Common Distributions
Chapter 1

Introduction

1.1 Motivating Examples

Example 1.1 (Radioactive decay). Let X denote the number of particles that will be
emitted from a radioactive source in the next one minute period. We know that X
will turn out to be equal to one of the non-negative integers but, apart from that, we
know nothing about which of the possible values are more or less likely to occur. The
quantity X is said to be a random variable.

Suppose we are told that the random variable X has a Poisson distribution with
parameter θ = 2. Then, if x is some non-negative integer, we know that the probability
that the random variable X takes the value x is given by the formula

    P(X = x) = θ^x exp(−θ) / x!

where θ = 2. So, for instance, the probability that X takes the value x = 4 is

    P(X = 4) = 2^4 exp(−2) / 4! = 0.0902 .

We have here a probability model for the random variable X. Note that we are using
upper case letters for random variables and lower case letters for the values taken by
random variables. We shall persist with this convention throughout the course.

Suppose we are told that the random variable X has a Poisson distribution with
parameter θ where θ is some unspecified positive number. Then, if x is some
non-negative integer, we know that the probability that the random variable X takes the
value x is given by the formula

    P(X = x | θ) = θ^x exp(−θ) / x! ,                                   (1.1)

for θ ∈ R⁺. However, we cannot calculate probabilities such as the probability that X
takes the value x = 4 without knowing the value of θ.
Suppose that, in order to learn something about the value of θ, we decide to measure
the value of X for each of the next 5 one minute time periods. Let us use the notation X1
to denote the number of particles emitted in the first period, X2 to denote the number
emitted in the second period and so forth. We shall end up with data consisting of a
random vector X = (X1, X2, . . . , X5). Consider x = (x1, x2, x3, x4, x5) = (2, 1, 0, 3, 4).
Then x is a possible value for the random vector X. We know that the probability that
X1 takes the value x1 = 2 is given by the formula

    P(X1 = 2 | θ) = θ^2 exp(−θ) / 2!

and similarly that the probability that X2 takes the value x2 = 1 is given by

    P(X2 = 1 | θ) = θ^1 exp(−θ) / 1!

and so on. However, what about the probability that X takes the value x? In order for
this probability to be specified we need to know something about the joint distribution
of the random variables X1, X2, . . . , X5. A simple assumption to make is that the
random variables X1, X2, . . . , X5 are mutually independent. (Note that this assumption
may not be correct since X2 may tend to be more similar to X1 than it would be to
X5.) However, with this assumption we can say that the probability that X takes the
value x is given by

    P(X = x | θ) = ∏_{i=1}^{5} θ^{xi} exp(−θ) / xi!
                 = [θ^2 exp(−θ)/2!] × [θ^1 exp(−θ)/1!] × [θ^0 exp(−θ)/0!] × [θ^3 exp(−θ)/3!] × [θ^4 exp(−θ)/4!]
                 = θ^{10} exp(−5θ) / 288 .

In general, if x = (x1, x2, x3, x4, x5) is any vector of 5 non-negative integers, then the
probability that X takes the value x is given by

    P(X = x | θ) = ∏_{i=1}^{5} θ^{xi} exp(−θ) / xi!
                 = θ^{Σ xi} exp(−5θ) / ∏_{i=1}^{5} xi! .

We have here a probability model for the random vector X.

Our plan is to use the value x of X that we actually observe to learn something
about the value of θ. The ways and means to accomplish this task make up the subject
matter of this course. □
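As a quick numerical illustration (not part of the original notes; the function name
joint_poisson_prob is ours), the following Python sketch evaluates the joint probability
P(X = x | θ) for the observed vector x = (2, 1, 0, 3, 4) at a few trial values of θ; its
output agrees with the closed form θ^10 exp(−5θ)/288 derived above.

    import math

    def joint_poisson_prob(counts, theta):
        """Joint probability of independent Poisson counts for a given rate theta."""
        prob = 1.0
        for x in counts:
            prob *= theta ** x * math.exp(-theta) / math.factorial(x)
        return prob

    counts = [2, 1, 0, 3, 4]           # the observed vector x from Example 1.1
    for theta in [1.0, 2.0, 3.0]:
        # equals theta**10 * exp(-5*theta) / 288 for these data
        print(theta, joint_poisson_prob(counts, theta))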
Example 1.2 (Tuberculosis). Suppose we are going to examine n people and record
a value 1 for people who have been exposed to the tuberculosis virus and a value 0
for people who have not been so exposed. The data will consist of a random vector
X = (X1, X2, . . . , Xn) where Xi = 1 if the ith person has been exposed to the TB virus
and Xi = 0 otherwise.

A Bernoulli random variable X has probability mass function

    P(X = x | θ) = θ^x (1 − θ)^{1−x} ,                                  (1.2)

for x = 0, 1 and θ ∈ (0, 1). A possible model would be to assume that X1, X2, . . . , Xn
behave like n independent Bernoulli random variables each of which has the same
(unknown) probability θ of taking the value 1.

Let x = (x1, x2, . . . , xn) be a particular vector of zeros and ones. Then the model
implies that the probability that the random vector X takes the value x is given by

    P(X = x | θ) = ∏_{i=1}^{n} θ^{xi} (1 − θ)^{1−xi}
                 = θ^{Σ xi} (1 − θ)^{n − Σ xi} .

Once again our plan is to use the value x of X that we actually observe to learn
something about the value of θ. □
Example 1.3 (Viagra). A chemical compound Y is used in the manufacture of Viagra.
Suppose that we are going to measure the micrograms of Y in a sample of n pills. The
data will consist of a random vector X = (X1, X2, . . . , Xn) where Xi is the chemical
content of Y for the ith pill.

A possible model would be to assume that X1, X2, . . . , Xn behave like n independent
random variables each having a N(µ, σ^2) density with unknown mean parameter µ ∈ R
(really, here µ ∈ R⁺) and known variance parameter σ^2 < ∞. Each Xi has density

    f_{Xi}(xi | µ) = (1/√(2πσ^2)) exp{ −(xi − µ)^2 / (2σ^2) } .

Let x = (x1, x2, . . . , xn) be a particular vector of real numbers. Then the model implies
the joint density

    f_X(x | µ) = ∏_{i=1}^{n} (1/√(2πσ^2)) exp{ −(xi − µ)^2 / (2σ^2) }
               = (1/(√(2πσ^2))^n) exp{ −Σ_{i=1}^{n} (xi − µ)^2 / (2σ^2) } .

Once again our plan is to use the value x of X that we actually observe to learn
something about the value of µ. □
Example 1.4 (Blood pressure). We wish to test a new device for measuring blood pres-<br />
sure. We are going to try it out on n people and record the difference between the value<br />
returned by the device and the true value as recorded by standard techniques. The<br />
data will consist of a random vector X = (X1, X2, . . . , Xn) where Xi is the difference<br />
for the ith person. A possible model would be to assume that X1, X2, . . . , Xn behave<br />
like n independent random variables each having a N (0, σ 2 ) density where σ 2 is some<br />
unknown positive real number. Let x = (x1, x2, . . . , xn) be a particular vector of real<br />
numbers. Then the model implies that the probability that the random vector X takes<br />
the value x is given by<br />
fX(x|σ 2 ) =<br />
=<br />
n�<br />
i=1<br />
1<br />
√<br />
2πσ2 exp<br />
�<br />
− x2 i<br />
1<br />
( √ 2πσ2 exp<br />
) n<br />
�<br />
−<br />
2σ 2<br />
�<br />
� n<br />
i=1 x2 i<br />
2σ 2<br />
Once again our plan is to use the value x of X that we actually observe to learn<br />
something about the value of σ 2 . Knowledge of σ is useful since it allows us to make<br />
statements such as that 95% of errors will be less than 1.96 × σ in magnitude. �<br />
1.2 General Course Overview

Definition 1.1 (Inference). Inference studies the way in which data we observe
should influence our beliefs about and practices in the real world. □

Definition 1.2 (Statistical inference). Statistical inference considers how inference
should proceed when the data is subject to random fluctuation. □

The concept of probability is used to describe the random mechanism which gave
rise to the data. This involves the use of probability models.

The incentive for contemplating a probability model is that through it we may
achieve an economy of thought in the description of events, enabling us to enunciate
laws and relations of more than immediate validity and relevance. A probability model
is usually completely specified apart from the values of a few unknown quantities called
parameters. We then try to discover to what extent the data can inform us about the
values of the parameters.

Statistical inference assumes that the data is given and that the probability model
is a correct description of the random mechanism which generated the data.
Three main topics will be covered:

Estimation. Unbiasedness, mean square error, consistency, relative efficiency, sufficiency,
minimum variance. Fisher information for a function of a parameter,
Cramér-Rao lower bound, efficiency. Fitting standard distributions to discrete
and continuous data. Method of moments. Maximum likelihood estimation:
finding estimators analytically and numerically, invariance, censored data.

Hypothesis testing. Simple and composite hypotheses, types of error, power, operating
characteristic curves, p-value. Neyman-Pearson method. Generalised
likelihood ratio test. Use of asymptotic results to construct tests. Central limit
theorem, asymptotic distributions of maximum likelihood estimator and generalised
likelihood ratio test statistic.

Confidence intervals and sets. Random intervals and sets. Use of pivotal quantities.
Relationship between tests and confidence intervals. Use of asymptotic results.
Chapter 2

The Theory of Estimation

2.1 The Frequentist Philosophy

The dominant philosophy of inference is based on the frequentist theory of probability.
According to the frequentist theory, probability statements can only be made regarding
events associated with a random experiment. A random experiment is an experiment
which has a well defined set of possible outcomes S. In addition, we must be able to
envisage an infinite sequence of independent repetitions of the experiment with the actual
outcome of each repetition being some unpredictable element of S. A random variable
is a numerical quantity associated with each possible outcome in S. A random vector
is a collection of numerical quantities associated with each possible outcome in S. In
performing the experiment we determine which element of S has occurred and thereby
the observed values of all random variables or random vectors of interest. Since the
outcome of the experiment is unpredictable, so too is the value of any random variable
or random vector. Since we can envisage an infinite sequence of independent repetitions
of the experiment, we can envisage an infinite sequence of independent determinations
of the value of a random variable (or vector). The purpose of a statistical model is to
describe the unpredictability of such a sequence of determinations.

Consider the random experiment which consists of picking someone at random from
the 2007 electoral register for Limerick. The outcome of such an experiment will be
a human being and the set S consists of all human beings whose names are on the
register. We can clearly envisage an infinite sequence of independent repetitions of
such an experiment. Consider the random variable X where X = 0 if the outcome
of the experiment is a male and X = 1 if the outcome of the experiment is a female.
When we say that P(X = 1) = 0.54 we are taken to mean that in an infinite sequence
of independent repetitions of the experiment exactly 54% of the outcomes will produce
a value of X = 1.
Now consider the random experiment which consists of picking 3 people at random
from the 2007 electoral register for Limerick. The outcome of such an experiment will
be a collection of 3 human beings and the set S consists of all subsets of 3 human
beings which may be formed from the set of all human beings whose names are on the
register. We can clearly envisage an infinite sequence of independent repetitions of such
an experiment. Consider the random vector X = (X1, X2, X3) where, for i = 1, 2, 3,
Xi = 0 if the ith person chosen is a male and Xi = 1 if the ith person chosen is a
female. When we say that X1, X2, X3 are independent and identically distributed or IID
with P(Xi = 1) = θ we are taken to mean that in an infinite sequence of independent
repetitions of the experiment the proportion of outcomes which produce, for instance,
a value of X = (1, 1, 0) is given by θ^2(1 − θ).

Suppose that the value of θ is unknown and we propose to estimate it by the
estimator θ̂ whose value is given by the proportion of females in the sample of size 3.
Since θ̂ depends on the value of X we sometimes write θ̂(X) to emphasise this fact.
We can work out the probability distribution of θ̂ as follows:

    x            P(X = x)       θ̂(x)
    (0, 0, 0)    (1 − θ)^3      0
    (0, 0, 1)    θ(1 − θ)^2     1/3
    (0, 1, 0)    θ(1 − θ)^2     1/3
    (1, 0, 0)    θ(1 − θ)^2     1/3
    (0, 1, 1)    θ^2(1 − θ)     2/3
    (1, 0, 1)    θ^2(1 − θ)     2/3
    (1, 1, 0)    θ^2(1 − θ)     2/3
    (1, 1, 1)    θ^3            1

Thus P(θ̂ = 0) = (1 − θ)^3, P(θ̂ = 1/3) = 3θ(1 − θ)^2, P(θ̂ = 2/3) = 3θ^2(1 − θ) and
P(θ̂ = 1) = θ^3. We now ask whether θ̂ is a good estimator of θ. Clearly if θ = 0 we
have that P(θ̂ = θ) = P(θ̂ = 0) = 1, which is good. Likewise if θ = 1 we also have that
P(θ̂ = θ) = P(θ̂ = 1) = 1. If θ = 1/3 then P(θ̂ = θ) = P(θ̂ = 1/3) = 3(1/3)(1 − 1/3)^2 = 4/9.
Likewise if θ = 2/3 we have that P(θ̂ = θ) = P(θ̂ = 2/3) = 3(2/3)^2(1 − 2/3) = 4/9.
However if the value of θ lies outside the set {0, 1/3, 2/3, 1} we have that P(θ̂ = θ) = 0.

Since θ̂ is a random variable we might try to calculate its expected value E(θ̂), i.e.
the average value we would get if we carried out an infinite number of independent
repetitions of the experiment. We have that

    E(θ̂) = 0·P(θ̂ = 0) + (1/3)·P(θ̂ = 1/3) + (2/3)·P(θ̂ = 2/3) + 1·P(θ̂ = 1)
         = 0·(1 − θ)^3 + (1/3)·3θ(1 − θ)^2 + (2/3)·3θ^2(1 − θ) + 1·θ^3
         = θ.
Thus if we carried out an infinite number of independent repetitions of the experiment
and calculated the value of θ̂ for each repetition, the average of the θ̂ values would be
exactly θ, the true value of the parameter! This is true no matter what the actual value
of θ is. Such an estimator is said to be unbiased.

Consider the quantity L = (θ̂ − θ)^2 which might be regarded as a measure of
the error or loss involved in using θ̂ to estimate θ. The possible values for L are
(0 − θ)^2, (1/3 − θ)^2, (2/3 − θ)^2 and (1 − θ)^2. We can calculate the expected value of
L as follows:

    E(L) = (0 − θ)^2 P(θ̂ = 0) + (1/3 − θ)^2 P(θ̂ = 1/3)
           + (2/3 − θ)^2 P(θ̂ = 2/3) + (1 − θ)^2 P(θ̂ = 1)
         = θ^2(1 − θ)^3 + (1/3 − θ)^2 3θ(1 − θ)^2 + (2/3 − θ)^2 3θ^2(1 − θ) + (1 − θ)^2 θ^3
         = θ(1 − θ)/3 .

The quantity E(L) is called the mean squared error (MSE) of the estimator θ̂. Since
the quantity θ(1 − θ) attains its maximum value of 1/4 at θ = 1/2, the largest value
E(L) can attain is 1/12, which occurs if the true value of the parameter θ happens to be
equal to 1/2; for all other values of θ the quantity E(L) is less than 1/12. If somebody
could invent a different estimator θ̃ of θ whose MSE was less than that of θ̂ for all
values of θ then we would prefer θ̃ to θ̂.

This trivial example gives some idea of the kinds of calculations that we will be
performing. The basic frequentist principle is that statistical procedures should be
judged in terms of their average performance in an infinite series of independent repetitions
of the experiment which produced the data. An important point to note is that
the parameter values are treated as fixed (although unknown) throughout this infinite
series of repetitions. We should be happy to use a procedure which performs well on
the average and should not be concerned with how it performs on any one particular
occasion.
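The frequentist calculations above can be checked by brute force. The following Python
sketch (ours, for illustration only) repeats the three-person experiment many times and
compares the Monte Carlo average of θ̂ with θ, and the Monte Carlo mean squared error
with θ(1 − θ)/3.

    import random

    def simulate_theta_hat(theta, n_reps=200_000, seed=1):
        """Approximate E(theta_hat) and MSE by repeating the 3-person experiment."""
        rng = random.Random(seed)
        total, total_sq_err = 0.0, 0.0
        for _ in range(n_reps):
            sample = [1 if rng.random() < theta else 0 for _ in range(3)]
            theta_hat = sum(sample) / 3        # proportion of females in the sample
            total += theta_hat
            total_sq_err += (theta_hat - theta) ** 2
        return total / n_reps, total_sq_err / n_reps

    mean_hat, mse_hat = simulate_theta_hat(0.54)
    print(mean_hat)                    # close to 0.54 (unbiasedness)
    print(mse_hat, 0.54 * 0.46 / 3)    # close to theta*(1-theta)/3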
2.2 The Frequentist Approach to Estimation

Suppose that we are going to observe a value of a random vector X. Let 𝒳 denote the
set of possible values X can take and, for x ∈ 𝒳, let f(x|θ) denote the probability that
X takes the value x, where the parameter θ is some unknown element of the set Θ.

The problem we face is that of estimating θ. An estimator θ̂ is a procedure which,
for each possible value x ∈ 𝒳, specifies which element of Θ we should quote as an
estimate of θ. When we observe X = x we quote θ̂(x) as our estimate of θ. Thus θ̂ is a
function of the random vector X. Sometimes we write θ̂(X) to emphasise this point.
Given any estimator θ̂ we can calculate its expected value for each possible value of
θ ∈ Θ. An estimator is said to be unbiased if this expected value is identically equal to
θ. If an estimator is unbiased then we can conclude that if we repeat the experiment
an infinite number of times with θ fixed and calculate the value of the estimator each
time, then the average of the estimator values will be exactly equal to θ. From the
frequentist viewpoint this is a desirable property and so, where possible, frequentists
use unbiased estimators.

Definition 2.1 (The Frequentist philosophy). To evaluate the usefulness of an
estimator θ̂ = θ̂(x) of θ, examine the properties of the random variable θ̂ = θ̂(X). □

Definition 2.2 (Unbiased estimators). An estimator θ̂ = θ̂(X) is said to be unbiased
for a parameter θ if it equals θ in expectation:

    E[θ̂(X)] = E(θ̂) = θ.

Intuitively, an unbiased estimator is ‘right on target’. □

Definition 2.3 (Bias of an estimator). The bias of an estimator θ̂ = θ̂(X) of θ is
defined as bias(θ̂) = E[θ̂(X) − θ]. □

Definition 2.4 (Bias corrected estimators). If bias(θ̂) is of the form cθ, then
(obviously) θ̃ = θ̂/(1 + c) is unbiased for θ. Likewise, if E(θ̂) = θ + c, then θ̃ = θ̂ − c
is unbiased for θ. In such situations we say that θ̃ is a bias corrected version of θ̂. □

Definition 2.5 (Unbiased functions). More generally, ĝ(X) is said to be unbiased
for a function g(θ) if E[ĝ(X)] = g(θ). □

Note that even if θ̂ is an unbiased estimator of θ, g(θ̂) will generally not be an
unbiased estimator of g(θ) unless g is linear or affine. This limits the importance of the
notion of unbiasedness. It might be at least as important that an estimator is accurate
in the sense that its distribution is highly concentrated around θ.

Is unbiasedness a good thing? Unbiasedness is important when combining estimates,
as averages of unbiased estimators are unbiased (see the review exercises at the
end of this chapter). For example, when combining standard deviations s1, s2, . . . , sk
with degrees of freedom df1, . . . , dfk we always average their squares,

    s̄ = √[ (df1 s1^2 + · · · + dfk sk^2) / (df1 + · · · + dfk) ] ,

as the si^2 are unbiased estimators of the variance σ^2, whereas the si are not unbiased
estimators of σ (see the review exercises). Be careful when averaging biased estimators!
It may well be appropriate to make a bias-correction before averaging.
Problem 2.1. Let X have a binomial distribution with parameters n and θ. Show
that the sample proportion θ̂ = X/n is an unbiased estimator of θ.

Solution. X ∼ Bin(n, θ) ⇒ E(X) = nθ. Then E(θ̂) = E(X/n) = E(X)/n = nθ/n = θ.
As E(θ̂) = θ, the estimator θ̂ is unbiased.

Problem 2.2. Let X1, . . . , Xn be independent and identically distributed with density

    f(x|θ) = e^{−(x−θ)}   for x > θ;   0 otherwise.

Show that θ̂ = X̄ = (X1 + · · · + Xn)/n is a biased estimator of θ. Propose an unbiased
estimator θ̃ of the form θ̃ = θ̂ + c.

Solution. E(X) = ∫_θ^∞ x e^{−(x−θ)} dx = [−(x + 1)e^{−(x−θ)}]_θ^∞ = θ + 1. Next,
E(θ̂) = E(X̄) = (1/n) E(X1 + X2 + · · · + Xn) = θ + 1 ≠ θ ⇒ θ̂ is biased. Propose
θ̃ = X̄ − 1. Then E(θ̃) = E(X̄) − 1 = θ + 1 − 1 = θ and θ̃ is unbiased.
Definition 2.6 (Mean squared error). The mean squared error of the estimator θ̂
is defined as MSE(θ̂) = E(θ̂ − θ)^2. Given the same set of data, θ̂1 is “better” than θ̂2 if
MSE(θ̂1) ≤ MSE(θ̂2) (uniformly better if true ∀ θ). □

Lemma 2.3 (The MSE variance-bias tradeoff). The MSE decomposes as

    MSE(θ̂) = Var(θ̂) + Bias(θ̂)^2 .

Proof. The problem of finding minimum MSE estimators cannot be solved uniquely:

    MSE(θ̂) = E(θ̂ − θ)^2
            = E{ [θ̂ − E(θ̂)] + [E(θ̂) − θ] }^2
            = E[θ̂ − E(θ̂)]^2 + E[E(θ̂) − θ]^2 + 2 E{ [θ̂ − E(θ̂)][E(θ̂) − θ] }   (last term = 0)
            = E[θ̂ − E(θ̂)]^2 + [E(θ̂) − θ]^2
            = Var(θ̂) + Bias(θ̂)^2 .

NOTE: This lemma implies that the mean squared error of an unbiased estimator
is equal to the variance of the estimator.
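The decomposition is easy to verify numerically. The sketch below (ours, not from the
notes) deliberately uses a biased estimator, µ̂ = X̄/2 for the mean of a N(µ, 1) sample,
and checks by simulation that the Monte Carlo MSE equals the variance plus the
squared bias.

    import random, statistics

    def mse_decomposition(mu=2.0, n=10, n_reps=100_000, seed=2):
        """Check MSE = Var + Bias^2 for the (deliberately biased) estimator 0.5 * Xbar."""
        rng = random.Random(seed)
        estimates = []
        for _ in range(n_reps):
            xbar = statistics.fmean(rng.gauss(mu, 1.0) for _ in range(n))
            estimates.append(0.5 * xbar)
        mse  = statistics.fmean((e - mu) ** 2 for e in estimates)
        var  = statistics.pvariance(estimates)
        bias = statistics.fmean(estimates) - mu
        return mse, var + bias ** 2

    print(mse_decomposition())   # the two numbers agree up to Monte Carlo error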
Problem 2.4. Consider X1, . . . , Xn where Xi ∼ N(θ, σ^2) and σ is known. Three
estimators of θ are θ̂1 = X̄ = (1/n) Σ_{i=1}^{n} Xi, θ̂2 = X1, and θ̂3 = (X1 + X̄)/2. Pick one.

Solution. E(θ̂1) = (1/n)[E(X1) + · · · + E(Xn)] = (1/n)[θ + · · · + θ] = (1/n)[nθ] = θ (unbiased).
Next, E(θ̂2) = E(X1) = θ (unbiased). Finally

    E(θ̂3) = (1/2) E{ [(n+1)/n] X1 + (1/n)[X2 + · · · + Xn] }
          = (1/2){ [(n+1)/n] θ + [(n−1)/n] θ } = θ   (unbiased).

All three estimators are unbiased. Although desirable from a frequentist standpoint, unbiasedness
is not a property that helps us choose between estimators. To do this we must
examine some measure of loss like the mean squared error. For a class of estimators
that are unbiased, the mean squared error will be equal to the estimation variance.
Calculate Var(θ̂1) = (1/n^2)[Var(X1) + · · · + Var(Xn)] = (1/n^2)[nσ^2] = σ^2/n.
Trivially Var(θ̂2) = Var(X1) = σ^2. Finally, using Cov(X1, X̄) = σ^2/n,

    Var(θ̂3) = [Var(X1) + Var(X̄) + 2 Cov(X1, X̄)]/4 = [σ^2 + σ^2/n + 2σ^2/n]/4 = (n + 3)σ^2/(4n).

So X̄ appears “best” in the sense that its variance is smallest among these three unbiased
estimators.
Problem 2.5. Consider X1, . . . , Xn to be independent random variables with means
E(Xi) = µ + βi and variances Var(Xi) = σi^2. Such a situation could arise when the Xi are
estimators of µ obtained from independent sources and βi is the bias of the estimator
Xi. Consider pooling the estimators of µ into a common estimator using the linear
combination µ̂ = w1 X1 + w2 X2 + · · · + wn Xn.

(i) If the estimators are unbiased, show that µ̂ is unbiased if and only if Σ wi = 1.

(ii) In the case when the estimators are unbiased, show that µ̂ has minimum variance
when the weights are inversely proportional to the variances σi^2.

(iii) Show that the variance of µ̂ for optimal weights wi is Var(µ̂) = 1 / Σ_i σi^{−2}.

(iv) Consider the case when estimators may be biased. Find the mean square error
of the optimal linear combination obtained above, and compare its behaviour as
n → ∞ in the biased and unbiased case, when σi^2 = σ^2, i = 1, . . . , n.

Solution. E(µ̂) = E(w1 X1 + · · · + wn Xn) = Σ_i wi E(Xi) = Σ_i wi µ = µ Σ_i wi, so µ̂ is
unbiased if and only if Σ_i wi = 1. The variance of our estimator is Var(µ̂) = Σ_i wi^2 σi^2,
which should be minimized subject to the constraint Σ_i wi = 1. Differentiating the
Lagrangian L = Σ_i wi^2 σi^2 − λ (Σ_i wi − 1) with respect to wi and setting equal to zero
yields 2 wi σi^2 = λ ⇒ wi ∝ σi^{−2}, so that wi = σi^{−2} / (Σ_j σj^{−2}). Then, for optimal weights we
get Var(µ̂) = Σ_i wi^2 σi^2 = (Σ_i σi^{−4} σi^2) / (Σ_i σi^{−2})^2 = 1 / (Σ_i σi^{−2}). When σi^2 = σ^2 we have
that Var(µ̂) = σ^2/n, which tends to zero as n → ∞, whereas bias(µ̂) = Σ βi / n = β̄ is
equal to the average bias, and MSE(µ̂) = σ^2/n + β̄^2. Therefore the bias tends to dominate
the variance as n gets larger, which is very unfortunate.
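Part (ii) is what is usually called inverse-variance weighting. A minimal Python sketch
(ours; the estimate and variance values below are invented purely for illustration)
computes the optimal weights and the pooled variance 1/Σ σi^{−2}.

    def pooled_estimate(estimates, variances):
        """Combine independent unbiased estimates with inverse-variance weights.
        Returns the pooled estimate and its variance 1 / sum(1/sigma_i^2)."""
        inv_vars = [1.0 / v for v in variances]
        total = sum(inv_vars)
        weights = [iv / total for iv in inv_vars]
        pooled = sum(w * e for w, e in zip(weights, estimates))
        return pooled, 1.0 / total

    # three independent estimates of the same mu, with differing precisions
    print(pooled_estimate([10.2, 9.7, 10.9], [0.5, 1.0, 2.0]))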
Problem 2.6. Let X1, . . . , Xn be an independent sample of size n from the uniform
distribution on the interval (0, θ), with density for a single observation being f(x|θ) =
θ^{−1} for 0 < x < θ and 0 otherwise, and consider θ > 0 unknown.

(i) Find the expected value and variance of the estimator θ̂ = 2X̄.

(ii) Find the expected value of the estimator θ̃ = X(n), i.e. the largest observation.

(iii) Find an unbiased estimator of the form θ̌ = c X(n) and calculate its variance.

(iv) Compare the mean square errors of θ̂ and θ̌.

Solution. θ̂ has E(θ̂) = E(2X̄) = (2/n)[E(X1) + · · · + E(Xn)] = (2/n)[n(θ/2)] = θ (unbiased),
and Var(θ̂) = Var(2X̄) = (4/n^2)[Var(X1) + · · · + Var(Xn)] = (4/n^2)[n θ^2/12] = θ^2/(3n).
Let U = X(n); we then have P(U ≤ u) = ∏_i P(Xi ≤ u) = (u/θ)^n for 0 < u < θ, so
differentiation yields that U has density f(u|θ) = n u^{n−1} θ^{−n} for 0 < u < θ. Direct
integration now yields E(θ̃) = E(U) = nθ/(n + 1) (a biased estimator). The estimator
θ̌ = [(n + 1)/n] U is unbiased. Direct integration gives E(U^2) = nθ^2/(n + 2), so
Var(θ̃) = Var(U) = nθ^2/[(n + 2)(n + 1)^2] and Var(θ̌) = θ^2/[n(n + 2)]. As θ̂ and
θ̌ are both unbiased estimators, MSE(θ̂) = Var(θ̂) and MSE(θ̌) = Var(θ̌). Clearly the
mean square error of θ̂ is very large compared to the mean square error of θ̌.
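The gap between the two mean square errors shows up clearly in simulation. The
following Python sketch (ours, with an arbitrary choice of θ = 5 and n = 20) estimates
both MSEs by Monte Carlo and can be compared with the theoretical values
θ^2/(3n) and θ^2/[n(n + 2)].

    import random

    def compare_uniform_estimators(theta=5.0, n=20, n_reps=50_000, seed=3):
        """Monte Carlo MSE of 2*Xbar versus (n+1)/n * max(X) for Uniform(0, theta)."""
        rng = random.Random(seed)
        mse_moment, mse_max = 0.0, 0.0
        for _ in range(n_reps):
            x = [rng.uniform(0.0, theta) for _ in range(n)]
            mse_moment += (2.0 * sum(x) / n - theta) ** 2
            mse_max += ((n + 1) / n * max(x) - theta) ** 2
        return mse_moment / n_reps, mse_max / n_reps

    print(compare_uniform_estimators())
    # theory: theta^2/(3n) = 0.417  versus  theta^2/(n(n+2)) = 0.057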
2.3 Minimum-Variance Unbiased Estimation

Getting a small MSE often involves a tradeoff between variance and bias. By not
insisting on θ̂ being unbiased, the variance can sometimes be drastically reduced. For
unbiased estimators the MSE obviously equals the variance, MSE(θ̂) = Var(θ̂), so no
tradeoff can be made. One approach is to restrict ourselves to the subclass of estimators
that are unbiased and of minimum variance.

Definition 2.7 (Minimum-variance unbiased estimator). If an unbiased estimator
of g(θ) has minimum variance among all unbiased estimators of g(θ) it is called a
minimum variance unbiased estimator (MVUE). □

We will develop a method of finding the MVUE when it exists. When such an
estimator does not exist we will be able to find a lower bound for the variance of an
unbiased estimator in the class of unbiased estimators, and compare the variance of
our unbiased estimator with this lower bound.
Definition 2.8 (Score function). For the (possibly vector valued) observation X = x
to be informative about θ, the density must vary with θ. If f(x|θ) is smooth and
differentiable, this change is quantified to first order by the score function

    S(θ) = (∂/∂θ) ln f(x|θ) ≡ f′(x|θ) / f(x|θ) .

Under suitable regularity conditions (differentiation wrt θ and integration wrt x can
be interchanged), we have

    E{S(θ)} = ∫ [f′(x|θ)/f(x|θ)] f(x|θ) dx = ∫ f′(x|θ) dx = (∂/∂θ) ∫ f(x|θ) dx = (∂/∂θ) 1 = 0.

Thus the score function has expectation zero. □

True frequentism evaluates the properties of estimators based on their “long-run”
behaviour. The value of x will vary from sample to sample, so we have treated the score
function as a random variable and looked at its average across all possible samples.
Lemma 2.7 (Fisher information). The variance of S(θ) is the expected Fisher
information about θ:

    I(θ) = E{S(θ)^2} ≡ E{ [ (∂/∂θ) ln f(x|θ) ]^2 } .

Proof. Using the chain rule,

    (∂^2/∂θ^2) ln f = (∂/∂θ)[ (1/f)(∂f/∂θ) ]
                    = −(1/f^2)(∂f/∂θ)^2 + (1/f)(∂^2 f/∂θ^2)
                    = −[ (∂/∂θ) ln f ]^2 + (1/f)(∂^2 f/∂θ^2) .

If integration and differentiation can be interchanged,

    E{ (1/f)(∂^2 f/∂θ^2) } = ∫_𝒳 (∂^2 f/∂θ^2) dx = (∂^2/∂θ^2) ∫_𝒳 f dx = (∂^2/∂θ^2) 1 = 0,

thus

    −E{ (∂^2/∂θ^2) ln f(x|θ) } = E{ [ (∂/∂θ) ln f(x|θ) ]^2 } = I(θ).        (2.1)

Variance measures lack of knowledge. It is therefore reasonable that the reciprocal of
the variance should be defined as the amount of information carried by the (possibly
vector valued) observation x about θ.
Theorem 2.8 (Cramér-Rao lower bound). Let θ̂ be an unbiased estimator of θ.
Then

    Var(θ̂) ≥ { I(θ) }^{−1} .

Proof. Unbiasedness, E(θ̂) = θ, implies

    ∫ θ̂(x) f(x|θ) dx = θ.

Assume we can differentiate wrt θ under the integral; then

    (∂/∂θ) ∫ θ̂(x) f(x|θ) dx = 1.

The estimator θ̂(x) cannot depend on θ, so

    ∫ θ̂(x) (∂/∂θ) f(x|θ) dx = 1.

For any pdf f,

    ∂f/∂θ = f (∂/∂θ)(ln f),

so that now

    ∫ θ̂(x) f (∂/∂θ)(ln f) dx = 1.

Thus

    E{ θ̂(x) (∂/∂θ)(ln f) } = 1.

Define the random variables U = θ̂(x) and S = (∂/∂θ)(ln f). Then E(US) = 1. We
already know that the score function has expectation zero, E(S) = 0. Consequently
Cov(U, S) = E(US) − E(U)E(S) = E(US) = 1.

Setting Cov(U, S) = 1 we get

    {Corr(U, S)}^2 = {Cov(U, S)}^2 / [Var(U) Var(S)] ≤ 1.

This implies

    Var(U) Var(S) ≥ 1   ⇒   Var(θ̂) ≥ 1 / I(θ) ,

which is our main result. We call { I(θ) }^{−1} the Cramér-Rao lower bound (CRLB).

Sufficient conditions for the proof of the CRLB are that all the integrands are finite
within the range of x. We also require that the limits of the integrals do not depend
on θ; that is, the range of x for which f(x|θ) > 0 cannot depend on θ. This second
condition is violated for many density functions, e.g. the CRLB is not valid for the
uniform distribution. We can make an absolute assessment of an unbiased estimator by
comparing its variance to the CRLB. We can also assess biased estimators: if a biased
estimator has variance lower than the CRLB then, although biased, it may still be a
very good estimator.
Example 2.1. Consider IID random variables Xi, i = 1, . . . , n, with

    f_{Xi}(xi|µ) = (1/µ) exp(−xi/µ) .

Denote the joint distribution of X1, . . . , Xn by

    f = ∏_{i=1}^{n} f_{Xi}(xi|µ) = (1/µ)^n exp( −(1/µ) Σ_{i=1}^{n} xi ) ,

so that

    ln f = −n ln(µ) − (1/µ) Σ_{i=1}^{n} xi .

The score function is the partial derivative of ln f wrt the unknown parameter µ,

    S(µ) = (∂/∂µ) ln f = −n/µ + (1/µ^2) Σ_{i=1}^{n} xi ,

and

    E{S(µ)} = E{ −n/µ + (1/µ^2) Σ_{i=1}^{n} Xi } = −n/µ + (1/µ^2) E( Σ_{i=1}^{n} Xi ) .

For X ∼ Exp(1/µ), we have E(X) = µ, implying E(X1 + · · · + Xn) = E(X1) + · · · +
E(Xn) = nµ and E{S(µ)} = 0 as required. Next,

    I(µ) = −E{ (∂/∂µ)[ −n/µ + (1/µ^2) Σ_{i=1}^{n} Xi ] }
         = −E{ n/µ^2 − (2/µ^3) Σ_{i=1}^{n} Xi }
         = −n/µ^2 + (2/µ^3) E( Σ_{i=1}^{n} Xi )
         = −n/µ^2 + 2nµ/µ^3 = n/µ^2 .

Hence

    CRLB = µ^2/n .

Let us propose µ̂ = X̄ as an estimator of µ. Then

    E(µ̂) = E( (1/n) Σ_{i=1}^{n} Xi ) = (1/n) E( Σ_{i=1}^{n} Xi ) = µ,

verifying that µ̂ = X̄ is indeed an unbiased estimator of µ. For X ∼ Exp(1/µ), we have
E(X) = µ and Var(X) = µ^2, implying

    Var(µ̂) = (1/n^2) Σ_{i=1}^{n} Var(Xi) = nµ^2/n^2 = µ^2/n .

We have shown that Var(µ̂) = { I(µ) }^{−1}, and therefore conclude that the
unbiased estimator µ̂ = X̄ achieves its CRLB. □
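The attainment of the bound can be checked by simulation. The Python sketch below
(ours, with arbitrary values µ = 3 and n = 25) estimates Var(X̄) over repeated samples
and compares it with the CRLB µ^2/n.

    import random, statistics

    def exponential_mean_variance(mu=3.0, n=25, n_reps=40_000, seed=4):
        """Simulated variance of the sample mean of Exp(mean=mu) data versus the CRLB mu^2/n."""
        rng = random.Random(seed)
        means = [statistics.fmean(rng.expovariate(1.0 / mu) for _ in range(n))
                 for _ in range(n_reps)]
        return statistics.pvariance(means), mu ** 2 / n

    print(exponential_mean_variance())   # the two values agree up to simulation error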
Definition 2.9 (Efficiency). Define the efficiency of the unbiased estimator θ̂ as

    eff(θ̂) = CRLB / Var(θ̂) ,

where CRLB = { I(θ) }^{−1}. Clearly 0 < eff(θ̂) ≤ 1. An unbiased estimator θ̂ is said to
be efficient if eff(θ̂) = 1. □

Definition 2.10 (Asymptotic efficiency). The asymptotic efficiency of an unbiased
estimator θ̂ is the limit of its efficiency as n → ∞. An unbiased estimator θ̂ is said to
be asymptotically efficient if its asymptotic efficiency is equal to 1. □
2.4 Maximum Likelihood Estimation

Let x be a realization of the random variable X with probability density f_X(x|θ), where
θ = (θ1, θ2, . . . , θm)^T is a vector of m unknown parameters to be estimated. The set
of allowable values for θ, denoted by Ω, or sometimes by Ωθ, is called the parameter
space. Define the likelihood function

    L(θ|x) = f_X(x|θ).                                                  (2.2)

It is crucial to stress that the argument of f_X(x|θ) is x, but the argument of L(θ|x)
is θ. It is therefore convenient to view the likelihood function L(θ) as the probability
of the observed data x considered as a function of θ. Usually it is convenient to work
with the natural logarithm of the likelihood, called the log-likelihood, denoted by

    ℓ(θ|x) = ln L(θ|x).

When θ ∈ R^1 we can define the score function as the first derivative of the log-likelihood,

    S(θ) = (∂/∂θ) ln L(θ).

The maximum likelihood estimate (MLE) θ̂ of θ is the solution to the score equation

    S(θ) = 0.

At the maximum, the second partial derivative of the log-likelihood is negative, so we
define the curvature at θ̂ as I(θ̂) where

    I(θ) = −(∂^2/∂θ^2) ln L(θ).

We can check that a solution θ̂ of the equation S(θ) = 0 is actually a maximum by
checking that I(θ̂) > 0. A large curvature I(θ̂) is associated with a tight or strong
peak, intuitively indicating less uncertainty about θ. In likelihood theory I(θ) is a
key quantity called the observed Fisher information, and I(θ̂) is the observed Fisher
information evaluated at the MLE θ̂. Although I(θ) is a function, I(θ̂) is a scalar.

The likelihood function L(θ|x) supplies an order of preference or plausibility among
possible values of θ based on the observed x. It ranks the plausibility of possible values
of θ by how probable they make the observed x. If P(x|θ = θ1) > P(x|θ = θ2) then
the observed x makes θ = θ1 more plausible than θ = θ2, and consequently, from
(2.2), L(θ1|x) > L(θ2|x). The likelihood ratio L(θ1|x)/L(θ2|x) = f(x|θ1)/f(x|θ2) is
a measure of the plausibility of θ1 relative to θ2 based on the observed x. The
relative likelihood L(θ1|x)/L(θ2|x) = k means that the observed value x will occur k
times more frequently in repeated samples from the population defined by the value θ1
than from the population defined by θ2. Since only ratios of likelihoods are meaningful,
it is convenient to standardize the likelihood with respect to its maximum. Define the
relative likelihood as R(θ|x) = L(θ|x)/L(θ̂|x). The relative likelihood varies between 0
and 1. The MLE θ̂ is the most plausible value of θ in that it makes the observed sample
most probable. The relative likelihood measures the plausibility of any particular value
of θ relative to that of θ̂.

When the random variables X1, . . . , Xn are mutually independent we can write the
joint density as

    f_X(x) = ∏_{j=1}^{n} f_{Xj}(xj),

where x = (x1, . . . , xn)′ is a realization of the random vector X = (X1, . . . , Xn)′, and
the likelihood function becomes

    L_X(θ|x) = ∏_{j=1}^{n} f_{Xj}(xj|θ).

When the densities f_{Xj}(xj) are identical, we unambiguously write f(xj).
Example 2.2 (Bernoulli Trials). Consider n independent Bernoulli trials. The jth
observation is either a “success” or “failure”, coded xj = 1 and xj = 0 respectively, and

    P(Xj = xj) = θ^{xj} (1 − θ)^{1−xj}

for j = 1, . . . , n. The vector of observations y = (x1, x2, . . . , xn)^T is a sequence of ones
and zeros, and is a realization of the random vector Y = (X1, X2, . . . , Xn)^T. As the
Bernoulli outcomes are assumed to be independent we can write the joint probability
mass function of Y as the product of the marginal probabilities, that is

    L(θ) = ∏_{j=1}^{n} P(Xj = xj)
         = ∏_{j=1}^{n} θ^{xj} (1 − θ)^{1−xj}
         = θ^{Σ xj} (1 − θ)^{n − Σ xj}
         = θ^r (1 − θ)^{n−r} ,

where r = Σ_{j=1}^{n} xj is the number of observed successes (1’s) in the vector y. The
log-likelihood function is then

    ℓ(θ) = r ln θ + (n − r) ln(1 − θ),

and the score function is

    S(θ) = (∂/∂θ) ℓ(θ) = r/θ − (n − r)/(1 − θ) .

Solving S(θ̂) = 0 we get θ̂ = r/n. The information function is

    I(θ) = r/θ^2 + (n − r)/(1 − θ)^2 > 0   ∀ θ,

guaranteeing that θ̂ is the MLE. Each Xi is a Bernoulli random variable and has
expected value E(Xi) = θ and variance Var(Xi) = θ(1 − θ). The MLE θ̂(y) is itself a
random variable and has expected value

    E(θ̂) = E(r/n) = E( Σ_{i=1}^{n} Xi / n ) = (1/n) Σ_{i=1}^{n} E(Xi) = (1/n) Σ_{i=1}^{n} θ = θ,

implying that θ̂(y) is an unbiased estimator of θ. The variance of θ̂(y) is

    Var(θ̂) = Var( Σ_{i=1}^{n} Xi / n ) = (1/n^2) Σ_{i=1}^{n} Var(Xi) = (1/n^2) Σ_{i=1}^{n} (1 − θ)θ = (1 − θ)θ/n .

Finally, note that

    I(θ) = E[I(θ)] = E(r)/θ^2 + (n − E(r))/(1 − θ)^2 = n/θ + n/(1 − θ) = n/[θ(1 − θ)] = { Var[θ̂] }^{−1} ,

and θ̂ attains the Cramér-Rao lower bound (CRLB). □
Example 2.3 (Binomial sampling). The number of successes in n Bernoulli trials is a
random variable R taking on values r = 0, 1, . . . , n with probability mass function

    P(R = r) = C(n, r) θ^r (1 − θ)^{n−r} .

This is exactly the same sampling scheme as in the previous example except that,
instead of observing the sequence y, we only observe the total number of successes r.
Hence the likelihood function has the form

    L_R(θ|r) = C(n, r) θ^r (1 − θ)^{n−r} .

The relevant mathematical calculations are as follows:

    ℓ_R(θ|r) = ln C(n, r) + r ln(θ) + (n − r) ln(1 − θ)
    S(θ) = r/θ − (n − r)/(1 − θ)   ⇒   θ̂ = r/n
    I(θ) = r/θ^2 + (n − r)/(1 − θ)^2 > 0   ∀ θ
    E(θ̂) = E(r)/n = nθ/n = θ   ⇒   θ̂ unbiased
    Var(θ̂) = Var(r)/n^2 = nθ(1 − θ)/n^2 = θ(1 − θ)/n
    E[I(θ)] = E(r)/θ^2 + (n − E(r))/(1 − θ)^2 = nθ/θ^2 + (n − nθ)/(1 − θ)^2 = n/[θ(1 − θ)] = { Var[θ̂] }^{−1}

and θ̂ attains the Cramér-Rao lower bound (CRLB). □
Example 2.4 (Germinating seeds). Suppose 25 seeds were planted and r = 5 seeds
germinated. Then θ̂ = r/n = 0.2 and Var(θ̂) = 0.2 × 0.8/25 = 0.0064. The relative
likelihood is

    R1(θ) = (θ/0.2)^5 ((1 − θ)/0.8)^{20} .

Suppose 100 seeds were planted and r = 20 seeds germinated. Then θ̂ = r/n = 0.2
but Var(θ̂) = 0.2 × 0.8/100 = 0.0016. The relative likelihood is

    R2(θ) = (θ/0.2)^{20} ((1 − θ)/0.8)^{80} .

Suppose 25 seeds were planted and it is known only that r ≤ 5 seeds germinated. In
this case the exact number of germinating seeds is unknown. The information about θ
is given by the likelihood function

    L(θ) = P(R ≤ 5) = Σ_{r=0}^{5} C(25, r) θ^r (1 − θ)^{25−r} .

[Figure 2.4.1: Relative likelihood functions for seed germinating probabilities, plotted
against θ; the three curves are r = 5, n = 25 (dashed), r = 20, n = 100 (dotted), and
r ≤ 5, n = 25 (solid).]

Here, the most plausible value for θ is θ̂ = 0, implying L(θ̂) = 1. The relative likelihood
is R3(θ) = L(θ)/L(θ̂) = L(θ). R1(θ) is plotted as the dashed curve in Figure 2.4.1,
R2(θ) as the dotted curve, and R3(θ) as the solid curve. □
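The three relative likelihood curves in Figure 2.4.1 are easy to reproduce. The Python
sketch below (ours; the function names are illustrative) evaluates R1, R2 and R3 at a
few values of θ, and could equally be used to generate the full curves.

    from math import comb

    def rel_lik_binomial(theta, r, n):
        """Relative likelihood R(theta) = L(theta)/L(theta_hat) for r successes in n trials."""
        theta_hat = r / n
        return (theta / theta_hat) ** r * ((1 - theta) / (1 - theta_hat)) ** (n - r)

    def rel_lik_censored(theta, n=25, r_max=5):
        """Relative likelihood when only r <= r_max is known; maximized at theta = 0 where L = 1."""
        return sum(comb(n, r) * theta ** r * (1 - theta) ** (n - r) for r in range(r_max + 1))

    for theta in [0.1, 0.2, 0.3]:
        print(theta,
              round(rel_lik_binomial(theta, 5, 25), 3),
              round(rel_lik_binomial(theta, 20, 100), 3),
              round(rel_lik_censored(theta), 3))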
Example 2.5 (Prevalence of a Genotype). Geneticists interested in the prevalence of a
certain genotype observe that the genotype makes its first appearance in the 22nd
subject analysed. If we assume that the subjects are independent, the likelihood function
can be computed based on the geometric distribution as L(θ) = (1 − θ)^{n−1} θ. The score
function is then S(θ) = θ^{−1} − (n − 1)(1 − θ)^{−1}. Setting S(θ̂) = 0 we get θ̂ = n^{−1} = 22^{−1}.
The observed Fisher information equals I(θ) = θ^{−2} + (n − 1)(1 − θ)^{−2} and is greater
than zero for all θ, implying that θ̂ is the MLE.

Suppose that the geneticists had planned to stop sampling once they observed
r = 10 subjects with the specified genotype, and the tenth subject with the genotype
was the 100th subject analysed overall. The likelihood of θ can be computed based on
the negative binomial distribution as

    L(θ) = C(n − 1, r − 1) θ^r (1 − θ)^{n−r}

for n = 100, r = 10. The usual calculation will confirm that θ̂ = r/n is the MLE. □
Example 2.6 (Radioactive Decay). In this classic set of data Rutherford and Geiger
counted the number of scintillations in 7.5 second intervals caused by radioactive decay
of a quantity of the element polonium. Altogether there were 10097 scintillations during
2608 such intervals:

    Count     0    1    2    3    4    5    6    7
    Observed  57   203  383  525  532  408  273  139

    Count     8    9    10   11   12   13   14
    Observed  45   27   10   4    0    1    1

The Poisson probability mass function with mean parameter θ is

    f_X(x|θ) = θ^x exp(−θ) / x! .

The likelihood function equals

    L(θ) = ∏ θ^{xi} exp(−θ) / xi! = θ^{Σ xi} exp(−nθ) / ∏ xi! .

The relevant mathematical calculations are

    ℓ(θ) = (Σ xi) ln(θ) − nθ − ln[∏ (xi!)]
    S(θ) = (Σ xi)/θ − n   ⇒   θ̂ = (Σ xi)/n = x̄
    I(θ) = (Σ xi)/θ^2 > 0   ∀ θ,

implying θ̂ is the MLE. Also E(θ̂) = (1/n) Σ E(Xi) = (1/n) Σ θ = θ, so θ̂ is an unbiased
estimator. Next Var(θ̂) = (1/n^2) Σ Var(Xi) = (1/n^2) nθ = θ/n, and
I(θ) = E[I(θ)] = n/θ = { Var[θ̂] }^{−1}, implying that θ̂ attains the theoretical CRLB.
It is always useful to compare the fitted values from a model against the observed values.

    i        0    1    2    3    4    5    6    7    8    9   10   11   12   13   14
    Oi       57   203  383  525  532  408  273  139  45   27  10   4    0    1    1
    Ei       54   211  407  525  508  393  254  140  68   29  11   4    1    0    0
    Oi − Ei  +3   −8   −24  0    +24  +15  +19  −1   −23  −2  −1   0    −1   +1   +1

The Poisson law agrees with the observed variation within about one-twentieth of its
range. □
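The fitted row Ei can be recomputed directly from the data. The Python sketch below
(ours) calculates θ̂ = x̄ from the frequency table and then the expected counts
Ei = n · P(X = i | θ̂) under the fitted Poisson model.

    from math import exp, factorial

    observed = [57, 203, 383, 525, 532, 408, 273, 139, 45, 27, 10, 4, 0, 1, 1]
    n = sum(observed)                                             # 2608 intervals
    theta_hat = sum(i * o for i, o in enumerate(observed)) / n    # sample mean, about 3.87

    # expected counts under the fitted Poisson model, E_i = n * P(X = i | theta_hat)
    expected = [n * theta_hat ** i * exp(-theta_hat) / factorial(i) for i in range(len(observed))]
    for i, (o, e) in enumerate(zip(observed, expected)):
        print(i, o, round(e))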
Example 2.7 (Exponential distribution). Suppose random variables X1, . . . , Xn are i.i.d.
as Exp(θ). Then

    L(θ) = ∏_{i=1}^{n} θ exp(−θ xi) = θ^n exp( −θ Σ xi )
    ℓ(θ) = n ln θ − θ Σ xi
    S(θ) = n/θ − Σ_{i=1}^{n} xi   ⇒   θ̂ = n / Σ xi
    I(θ) = n/θ^2 > 0   ∀ θ.

In order to work out the expectation and variance of θ̂ we need to work out the
probability distribution of Z = Σ_{i=1}^{n} Xi, where Xi ∼ Exp(θ). From the appendix on
probability theory we have Z ∼ Ga(θ, n). Then

    E(1/Z) = ∫_0^∞ (1/z) θ^n z^{n−1} exp(−θz) / Γ(n) dz
           = [θ^2/Γ(n)] ∫_0^∞ (θz)^{n−2} exp(−θz) dz
           = [θ/Γ(n)] ∫_0^∞ u^{n−2} exp(−u) du
           = θ Γ(n − 1)/Γ(n) = θ/(n − 1) .

We now return to our estimator

    θ̂ = n / Σ_{i=1}^{n} xi = n/Z ,

which implies

    E[θ̂] = E(n/Z) = n E(1/Z) = nθ/(n − 1) ,

which turns out to be biased. Propose the alternative estimator θ̃ = [(n − 1)/n] θ̂. Then

    E[θ̃] = [(n − 1)/n] E[θ̂] = [(n − 1)/n] [n/(n − 1)] θ = θ

shows θ̃ is an unbiased estimator.

As this example demonstrates, maximum likelihood estimation does not automatically
produce unbiased estimates. If it is thought that this property is (in some sense)
desirable, then some adjustments to the MLEs, usually in the form of scaling, may be
required. We conclude this example with the following tedious (but straightforward)
calculations.

    E(1/Z^2) = [θ^n/Γ(n)] ∫_0^∞ z^{n−3} exp(−θz) dz
             = [θ^2/Γ(n)] ∫_0^∞ u^{n−3} exp(−u) du
             = θ^2 Γ(n − 2)/Γ(n)
             = θ^2 / [(n − 1)(n − 2)] ,

so that

    Var[θ̃] = E[θ̃^2] − { E[θ̃] }^2 = (n − 1)^2 E(1/Z^2) − θ^2
            = (n − 1)θ^2/(n − 2) − θ^2 = θ^2/(n − 2) .

We have already calculated that

    I(θ) = n/θ^2   ⇒   E[I(θ)] = n/θ^2 ≠ { Var[θ̃] }^{−1} .

However,

    eff(θ̃) = { E[I(θ)] }^{−1} / Var[θ̃] = (θ^2/n) ÷ (θ^2/(n − 2)) = (n − 2)/n ,

which, although not equal to 1, converges to 1 as n → ∞, and θ̃ is asymptotically
efficient. □
Example 2.8 (Lifetime of a component). The time to failure T of components is
exponentially distributed with mean µ. Suppose n components are tested for 100 hours
and that m components failed at times t1, . . . , tm, with n − m components surviving
the 100 hour test. The likelihood function can be written

    L(µ) = [ ∏_{i=1}^{m} (1/µ) e^{−ti/µ} ] × [ ∏_{j=m+1}^{n} P(Tj > 100) ] ,

where the first product is over the components that failed and the second over the
components that survived. Clearly P(T ≤ t) = 1 − e^{−t/µ} implies that P(T > 100) = e^{−100/µ}
is the probability of a component surviving the 100 hour test. Then

    L(µ) = [ ∏_{i=1}^{m} (1/µ) e^{−ti/µ} ] [ e^{−100/µ} ]^{n−m} ,
    ℓ(µ) = −m ln µ − (1/µ) Σ_{i=1}^{m} ti − (1/µ) 100(n − m) ,
    S(µ) = −m/µ + (1/µ^2) Σ_{i=1}^{m} ti + (1/µ^2) 100(n − m) .

Setting S(µ̂) = 0 suggests the estimator µ̂ = [ Σ_{i=1}^{m} ti + 100(n − m) ] / m. Also,
I(µ̂) = m/µ̂^2 > 0, and µ̂ is indeed the MLE. Although failure times were recorded for just m
components, this example usefully demonstrates that all n components contribute to
the estimation of the mean failure parameter µ. The n − m surviving components are
often referred to as right censored. □
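The censored MLE is a one-line calculation. The Python sketch below (ours; the failure
times and sample size are invented purely for illustration) implements the estimator
µ̂ = [Σ ti + 100(n − m)]/m derived above.

    def censored_exponential_mle(failure_times, n_total, censor_time=100.0):
        """MLE of the mean lifetime with right censoring at censor_time:
        mu_hat = (sum of observed failure times + censor_time * number censored) / number failed."""
        m = len(failure_times)
        n_censored = n_total - m
        return (sum(failure_times) + censor_time * n_censored) / m

    # 10 components tested for 100 hours; 4 failed, 6 survived (illustrative numbers)
    print(censored_exponential_mle([23.0, 58.5, 71.2, 94.0], n_total=10))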
Example 2.9 (Gaussian Distribution). Consider data X1, X2, . . . , Xn distributed as N(µ, υ).
Then the likelihood function is

    L(µ, υ) = (1/√(2πυ))^n exp{ −Σ_{i=1}^{n} (xi − µ)^2 / (2υ) }

and the log-likelihood function is

    ℓ(µ, υ) = −(n/2) ln(2π) − (n/2) ln(υ) − (1/2υ) Σ_{i=1}^{n} (xi − µ)^2 .        (2.3)

Unknown mean and known variance: As υ is known we treat this parameter as a constant
when differentiating wrt µ. Then

    S(µ) = (1/υ) Σ_{i=1}^{n} (xi − µ),   µ̂ = (1/n) Σ_{i=1}^{n} xi,   and   I(µ) = n/υ > 0   ∀ µ.

Also, E[µ̂] = nµ/n = µ, and so the MLE of µ is unbiased. Finally

    Var[µ̂] = (1/n^2) Var( Σ_{i=1}^{n} xi ) = υ/n = ( E[I(µ)] )^{−1} .

Known mean and unknown variance: Differentiating (2.3) wrt υ returns

    S(υ) = −n/(2υ) + (1/(2υ^2)) Σ_{i=1}^{n} (xi − µ)^2 ,

and setting S(υ) = 0 implies

    υ̂ = (1/n) Σ_{i=1}^{n} (xi − µ)^2 .

Differentiating again, and multiplying by −1, yields the information function

    I(υ) = −n/(2υ^2) + (1/υ^3) Σ_{i=1}^{n} (xi − µ)^2 .

Clearly υ̂ is the MLE since

    I(υ̂) = n/(2υ̂^2) > 0.

Define

    Zi = (Xi − µ)/√υ ,

so that Zi ∼ N(0, 1). From the appendix on probability,

    Σ_{i=1}^{n} Zi^2 ∼ χ²_n ,

implying E[Σ Zi^2] = n and Var[Σ Zi^2] = 2n. The MLE is

    υ̂ = (υ/n) Σ_{i=1}^{n} Zi^2 .

Then

    E[υ̂] = E[ (υ/n) Σ_{i=1}^{n} Zi^2 ] = υ,

and

    Var[υ̂] = (υ/n)^2 Var( Σ_{i=1}^{n} Zi^2 ) = 2υ^2/n .

Finally,

    E[I(υ)] = −n/(2υ^2) + (1/υ^3) Σ_{i=1}^{n} E[(xi − µ)^2] = −n/(2υ^2) + nυ/υ^3 = n/(2υ^2) .

Hence the CRLB = 2υ^2/n, and so υ̂ has efficiency 1. □

Our treatment of the two parameters of the Gaussian distribution in the last example
was to (i) fix the variance and estimate the mean using maximum likelihood;
and then (ii) fix the mean and estimate the variance using maximum likelihood. In
practice we would like to consider the simultaneous estimation of these parameters. In
the next section of these notes we extend MLE to multiple parameter estimation.
2.5 Multi-parameter Estimation

Suppose that a statistical model specifies that the data y has a probability distribution
f(y; α, β) depending on two unknown parameters α and β. In this case the likelihood
function is a function of the two variables α and β and, having observed the value y, is
defined as L(α, β) = f(y; α, β), with ℓ(α, β) = ln L(α, β). The MLE of (α, β) is a value
(α̂, β̂) for which L(α, β), or equivalently ℓ(α, β), attains its maximum value.

Define S1(α, β) = ∂ℓ/∂α and S2(α, β) = ∂ℓ/∂β. The MLEs (α̂, β̂) can be obtained
by solving the pair of simultaneous equations:

    S1(α, β) = 0
    S2(α, β) = 0

The information matrix I(α, β) is defined to be the matrix

    I(α, β) = [ I11(α, β)   I12(α, β) ]  =  −[ ∂²ℓ/∂α²    ∂²ℓ/∂α∂β ]
              [ I21(α, β)   I22(α, β) ]      [ ∂²ℓ/∂β∂α   ∂²ℓ/∂β²  ]

The conditions for a value (α0, β0) satisfying S1(α0, β0) = 0 and S2(α0, β0) = 0 to
be a MLE are that

    I11(α0, β0) > 0,   I22(α0, β0) > 0,

and

    det I(α0, β0) = I11(α0, β0) I22(α0, β0) − I12(α0, β0)^2 > 0.

This is equivalent to requiring that both eigenvalues of the matrix I(α0, β0) be positive.
Example 2.10 (Gaussian distribution). Let X1, X2 . . . , Xn be iid observations from a<br />
N (µ, v) density in which both µ and v are unknown. The log likelihood is<br />
ℓ(µ, v) = Σ_{i=1}^n ln{ (1/√(2πv)) exp[ −(x_i − µ)²/(2v) ] }
        = Σ_{i=1}^n { −(1/2) ln[2π] − (1/2) ln[v] − (x_i − µ)²/(2v) }
        = −(n/2) ln[2π] − (n/2) ln[v] − (1/(2v)) Σ_{i=1}^n (x_i − µ)².

Hence

S1(µ, v) = ∂ℓ/∂µ = (1/v) Σ_{i=1}^n (x_i − µ) = 0

implies that

ˆµ = (1/n) Σ_{i=1}^n x_i = ¯x.   (2.4)
Also

S2(µ, v) = ∂ℓ/∂v = −n/(2v) + (1/(2v²)) Σ_{i=1}^n (x_i − µ)² = 0

implies that

ˆv = (1/n) Σ_{i=1}^n (x_i − ˆµ)² = (1/n) Σ_{i=1}^n (x_i − ¯x)².   (2.5)
Calculating second derivatives and multiplying by −1 gives that the information matrix<br />
I(µ, v) equals<br />
I(µ, v) = (  n/v                                (1/v²) Σ_{i=1}^n (x_i − µ)
             (1/v²) Σ_{i=1}^n (x_i − µ)         −n/(2v²) + (1/v³) Σ_{i=1}^n (x_i − µ)²  )

Hence I(ˆµ, ˆv) is given by:

I(ˆµ, ˆv) = (  n/ˆv      0
               0         n/(2ˆv²)  )
Clearly both diagonal terms are positive and the determinant is positive and so (ˆµ, ˆv)<br />
are, indeed, the MLEs of (µ, v).<br />
Go back to equation (2.4): ¯X ∼ N (µ, v/n). Clearly E(¯X) = µ (unbiased) and Var(¯X) = v/n, so ¯X achieves the CRLB. Now go back to equation (2.5). From lemma 2.9 we have

nˆv/v ∼ χ²_{n−1}

so that

E[nˆv/v] = n − 1   ⇒   E(ˆv) = ((n − 1)/n) v.

Instead, propose the (unbiased) estimator

˜v = (n/(n − 1)) ˆv = (1/(n − 1)) Σ_{i=1}^n (x_i − ¯x)².

Observe that

E(˜v) = (n/(n − 1)) E(ˆv) = (n/(n − 1)) ((n − 1)/n) v = v   (2.6)

and ˜v is unbiased as suggested. We can easily show that

Var(˜v) = 2v²/(n − 1).

Hence

eff(˜v) = (2v²/n) ÷ (2v²/(n − 1)) = 1 − 1/n.
Clearly ˜v is not efficient, but is asymptotically efficient. �<br />
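A small R illustration (not in the original notes) of the two variance estimators; the sample below is simulated with arbitrary parameter values, and var() implements the unbiased estimator ˜v directly.

# MLE vhat versus the unbiased estimator vtilde for a simulated normal sample
set.seed(2)
x <- rnorm(20, mean = 5, sd = 3)      # true v = 9
muhat  <- mean(x)                     # equation (2.4)
vhat   <- mean((x - muhat)^2)         # equation (2.5); E(vhat) = (n-1)v/n
vtilde <- var(x)                      # n/(n-1) * vhat, unbiased as in (2.6)
c(muhat, vhat, vtilde)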
Lemma 2.9 (Joint distribution of the sample mean and sample variance). If<br />
X1, . . . , Xn are iid N (µ, v) then the sample mean ¯ X and sample variance S 2 /(n − 1)<br />
are independent. Also ¯ X is distributed N (µ, v/n) and S 2 /v is a chi-squared random<br />
variable with n − 1 degrees of freedom.<br />
Proof. Define

W = Σ_{i=1}^n (X_i − ¯X)² = Σ_{i=1}^n (X_i − µ)² − n( ¯X − µ)²

⇒   W/v + ( ¯X − µ)²/(v/n) = Σ_{i=1}^n (X_i − µ)²/v.
The RHS is the sum of n independent standard normal random variables squared, and so is distributed χ²_n. Also, ¯X ∼ N (µ, v/n), therefore ( ¯X − µ)²/(v/n) is the square of a standard normal and so is distributed χ²_1. These Chi-Squared random variables have moment generating functions (1 − 2t)^{−n/2} and (1 − 2t)^{−1/2} respectively. Next, W/v and ( ¯X − µ)²/(v/n) are independent:

Cov(X_i − ¯X, ¯X) = Cov(X_i, ¯X) − Cov( ¯X, ¯X)
                  = Cov( X_i, (1/n) Σ_j X_j ) − Var( ¯X)
                  = (1/n) Σ_j Cov(X_i, X_j) − v/n
                  = v/n − v/n
                  = 0.

But, Cov(X_i − ¯X, ¯X − µ) = Cov(X_i − ¯X, ¯X) = 0, hence

Cov( Σ_i (X_i − ¯X), ¯X − µ ) = 0.

Since the variables involved are jointly normal, zero covariance implies that W, which is a function of the X_i − ¯X, is independent of ¯X − µ.
As the moment generating function of the sum of independent random variables is equal to the product of their individual moment generating functions, we see

E[ e^{t(W/v)} ] (1 − 2t)^{−1/2} = (1 − 2t)^{−n/2}   ⇒   E[ e^{t(W/v)} ] = (1 − 2t)^{−(n−1)/2}.

But (1 − 2t)^{−(n−1)/2} is the moment generating function of a χ² random variable with (n − 1) degrees of freedom, and the moment generating function uniquely characterizes the distribution; hence S²/v = W/v ∼ χ²_{n−1}.
Suppose that a statistical model specifies that the data x has a probability distribu-<br />
tion f(x; θ) depending on a vector of m unknown parameters θ = (θ1, . . . , θm). In this<br />
case the likelihood function is a function of the m parameters θ1, . . . , θm and having<br />
observed the value of x is defined as L(θ) = f(x; θ) with ℓ(θ) = ln L(θ).<br />
The MLE of θ is a value ˆ θ for which L(θ), or equivalently ℓ(θ), attains its maximum<br />
value. For r = 1, . . . , m define Sr(θ) = ∂ℓ/∂θr. Then we can (usually) find the MLE<br />
ˆθ by solving the set of m simultaneous equations Sr(θ) = 0 for r = 1, . . . , m. The<br />
information matrix I(θ) is defined to be the m×m matrix whose (r, s) element is given<br />
by Irs where Irs = −∂ 2 ℓ/∂θr∂θs. The conditions for a value ˆ θ satisfying Sr( ˆ θ) = 0 for<br />
r = 1, . . . , m to be a MLE are that all the eigenvalues of the matrix I( ˆ θ) are positive.<br />
2.6 Newton-Raphson Optimization
Example 2.11 (Radioactive Scatter). A radioactive source emits particles intermittently<br />
and at various angles. Let X denote the cosine of the angle of emission. The angle<br />
of emission can range from 0 degrees to 180 degrees and so X takes values in [−1, 1].<br />
Assume that X has density<br />
f(x|θ) = (1 + θx)/2
for −1 ≤ x ≤ 1 where θ ∈ [−1, 1] is unknown. Suppose the data consist of n indepen-<br />
dently identically distributed measures of X yielding values x1, x2, ..., xn. Here<br />
L(θ) = (1/2ⁿ) Π_{i=1}^n (1 + θx_i),

ℓ(θ) = −n ln[2] + Σ_{i=1}^n ln[1 + θx_i],

S(θ) = Σ_{i=1}^n x_i/(1 + θx_i),

I(θ) = Σ_{i=1}^n x_i²/(1 + θx_i)².
Since I(θ) > 0 for all θ, the MLE may be found by solving the equation S(θ) = 0. It<br />
is not immediately obvious how to solve this equation.<br />
By Taylor’s Theorem we have<br />
0 = S( ˆ θ)<br />
= S(θ0) + ( ˆ θ − θ0)S ′ (θ0) + ( ˆ θ − θ0) 2 S ′′ (θ0)/2 + ....<br />
So, if |ˆθ − θ0| is small, we have that

0 ≈ S(θ0) + (ˆθ − θ0)S′(θ0),

and hence

ˆθ ≈ θ0 − S(θ0)/S′(θ0) = θ0 + S(θ0)/I(θ0).
We now replace θ0 by this improved approximation θ0 +S(θ0)/I(θ0) and keep repeating<br />
the process until we find a value ˆ θ for which |S( ˆ θ)| < ɛ where ɛ is some prechosen small<br />
number such as 0.000001. This method is called Newton’s method for solving a nonlinear<br />
equation.<br />
If a unique solution to S(θ) = 0 exists, Newton’s method works well regardless of<br />
the choice of θ0. When there are multiple solutions, the solution to which the algorithm<br />
converges depends crucially on the choice of θ0. In many instances it is a good idea<br />
to try various starting values just to be sure that the method is not sensitive to the<br />
choice of θ0.<br />
One approach to finding an initial estimate θ0 would be to use the Method of<br />
Moments. This involves solving the equation E(X) = ¯x for θ. For the previous example<br />
E(X) = ∫_{−1}^{1} x (1 + θx)/2 dx = θ/3
and so θ0 = 3¯x might yield a good choice for a starting value.<br />
Suppose the following 10 values were recorded:<br />
0.00, 0.23, −0.05, 0.01, −0.89, 0.19, 0.28, 0.51, −0.25 and 0.27.<br />
Then ¯x = 0.03, and we substitute θ0 = .09 into the updating formula<br />
ˆθ_new = θ_old + [ Σ_{i=1}^n x_i/(1 + θ_old x_i) ] [ Σ_{i=1}^n x_i²/(1 + θ_old x_i)² ]⁻¹

⇒ θ1 = 0.2160061
  θ2 = 0.2005475
  θ3 = 0.2003788
  θ4 = 0.2003788
The relative likelihood function is plotted in Figure 2.6.2. �<br />
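The following R sketch (not part of the original notes) carries out the Newton iteration just described for the radioactive-scatter data; it should reproduce the estimate ˆθ ≈ 0.2003788 reported above.

# Newton's method for the radioactive scatter example
x <- c(0.00, 0.23, -0.05, 0.01, -0.89, 0.19, 0.28, 0.51, -0.25, 0.27)
score <- function(theta) sum(x / (1 + theta * x))
info  <- function(theta) sum(x^2 / (1 + theta * x)^2)
theta <- 3 * mean(x)                  # method-of-moments starting value 0.09
while (abs(score(theta)) > 1e-6) {
  theta <- theta + score(theta) / info(theta)
}
theta                                 # approximately 0.2003788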
[Figure 2.6.2: Relative likelihood for the radioactive scatter (θ on the horizontal axis, relative likelihood on the vertical axis), solved by Newton-Raphson; the maximum occurs at ˆθ = 0.2003788.]
A Weibull random variable with ‘shape’ parameter a > 0 and ‘scale’ parameter<br />
b > 0 has density<br />
fT (t) = (a/b)(t/b) a−1 exp{−(t/b) a }<br />
for t ≥ 0. The (cumulative) distribution function is<br />
FT (t) = 1 − exp{−(t/b) a }<br />
on t ≥ 0. Suppose that the time to failure T of components has a Weibull distribution<br />
and after testing n components for 100 hours, m components fail at times t1, . . . , tm,<br />
with n − m components surviving the 100 hour test. The likelihood function can be<br />
written<br />
L(a, b) = [ Π_{i=1}^m (a/b)(t_i/b)^{a−1} exp{−(t_i/b)^a} ] × [ Π_{j=m+1}^n exp{−(100/b)^a} ],

where the first product is the contribution of the components that failed and the second that of the components that survived.
Then the log-likelihood function is

ℓ(a, b) = m ln(a) − ma ln(b) + (a − 1) Σ_{i=1}^m ln(t_i) − Σ_{i=1}^n (t_i/b)^a,

where for convenience we have written t_{m+1} = · · · = t_n = 100. This yields score functions

Sa(a, b) = m/a − m ln(b) + Σ_{i=1}^m ln(t_i) − Σ_{i=1}^n (t_i/b)^a ln(t_i/b),

and

Sb(a, b) = −ma/b + (a/b) Σ_{i=1}^n (t_i/b)^a.   (2.7)

It is not obvious how to solve Sa(a, b) = Sb(a, b) = 0 for a and b.
When the m equations Sr(θ) = 0, r = 1, . . . , m cannot be solved directly numerical<br />
optimization is required. Let S(θ) be the m × 1 vector whose rth element is Sr(θ). Let<br />
ˆθ be the solution to the set of equations S(θ) = 0 and let θ0 be an initial guess at ˆ θ.<br />
Then a first order Taylor’s series approximation to the function S about the point θ0<br />
is given by<br />
Sr(ˆθ) ≈ Sr(θ0) + Σ_{j=1}^m (ˆθ_j − θ_{0j}) ∂Sr/∂θ_j (θ0)

for r = 1, 2, . . . , m, which may be written in matrix notation as

S(ˆθ) ≈ S(θ0) − I(θ0)(ˆθ − θ0).
Requiring S( ˆ θ) = 0, this last equation can be reorganized to give<br />
ˆθ ≈ θ0 + I(θ0) −1 S(θ0). (2.8)<br />
Thus given θ0 this is a method for finding an improved guess at ˆ θ. We then replace θ0<br />
by this improved guess and repeat the process. We keep repeating the process until we<br />
obtain a value θ ∗ for which |Sr(θ ∗ )| is less than ɛ for r = 1, 2, . . . , m where ɛ is some<br />
small number like 0.0001. θ ∗ will be an approximate solution to the set of equations<br />
S(θ) = 0. We then evaluate the matrix I(θ ∗ ) and if all m of its eigenvalues are positive<br />
we set ˆ θ = θ ∗ .<br />
For the Weibull distribution

Iaa = m/a² + Σ_{i=1}^n (t_i/b)^a [ln(t_i/b)]²,

Ibb = −ma/b² + (a(a + 1)/b²) Σ_{i=1}^n (t_i/b)^a,

Iab = Iba = m/b − (1/b) Σ_{i=1}^n (t_i/b)^a [ a ln(t_i/b) + 1 ].
a0 b0 a ∗ b ∗ Steps Eigenvalues<br />
1.8 74.5 1.924941 78.12213 4 all positive<br />
2.36 72 1.924941 78.12213 5 all positive<br />
2 70 1.924941 78.12213 5 all positive<br />
1 2 1.924941 78.12213 15 all positive<br />
80 1 1.924941 78.12213 387 all positive<br />
2 100 4.292059 −34.591322 2 crashed<br />
Table 2.6.1: Newton-Raphson estimates of the Weibull parameters.<br />
Given initial values a0 and b0 equation (2.8) can be used to obtain updated estimates via

( a_new )   ( a0 )   ( Iaa(a0, b0)   Iab(a0, b0) )⁻¹ ( Sa(a0, b0) )
( b_new ) = ( b0 ) + ( Iba(a0, b0)   Ibb(a0, b0) )   ( Sb(a0, b0) ) .   (2.9)
Having solved for a and b we check that a maximum has been achieved by calculating<br />
I(a ∗ , b ∗ ) and verifying that both of its eigenvalues are positive.<br />
Suppose n = 10 and m = 8 items fail at times 17, 29, 32, 55, 61, 74, 77, 93 (in hours)<br />
with the remaining two items surviving the 100 hour test. Table 2.6.1 shows estimates<br />
(a ∗ , b ∗ ) obtained from equation (2.9) using various starting values (a0, b0) along with<br />
the number of iteration steps taken until convergence using ɛ = 0.000001. Four of<br />
the starting pairs considered converged to estimates a ∗ = 1.924941 and b ∗ = 78.12213<br />
for a and b. The starting pairs (1, 2) and (80, 1) might misleadingly suggest that the<br />
algorithm is robust to the choice of starting values. However, the starting pair (2, 100)<br />
produced a negative estimate of b, and as ln(b) is required in the computation of the<br />
score function, caused the algorithm to crash. Other starting pairs not reported failed<br />
to yield estimates due to a non-invertible I(θ0) matrix being produced during the<br />
running of the algorithm. The remainder of this section describes commonly applied<br />
modifications of the Newton-Raphson method.<br />
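As an illustration (not from the original notes), here is a minimal R version of the update (2.9) for the censored Weibull data above. With a sensible starting pair it should reproduce estimates close to a∗ = 1.924941 and b∗ = 78.12213; no safeguards against the failures described above are included.

# Two-parameter Newton-Raphson for the censored Weibull example (a sketch)
t <- c(17, 29, 32, 55, 61, 74, 77, 93, 100, 100)   # last two censored at 100 hours
m <- 8
score <- function(a, b) c(m/a - m*log(b) + sum(log(t[1:m])) - sum((t/b)^a * log(t/b)),
                          -m*a/b + (a/b) * sum((t/b)^a))
info <- function(a, b) {
  Iaa <- m/a^2 + sum((t/b)^a * log(t/b)^2)
  Ibb <- -m*a/b^2 + (a*(a + 1)/b^2) * sum((t/b)^a)
  Iab <- m/b - (1/b) * sum((t/b)^a * (a*log(t/b) + 1))
  matrix(c(Iaa, Iab, Iab, Ibb), 2, 2)
}
est <- c(2, 70)                                    # starting pair (a0, b0)
while (max(abs(score(est[1], est[2]))) > 1e-6) {
  est <- est + solve(info(est[1], est[2]), score(est[1], est[2]))
}
est                                                # roughly (1.924941, 78.12213)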
Initial values<br />
The Newton-Raphson method depends on the initial guess being close to the true value.<br />
If this requirement is not satisfied the procedure might converge to a minimum
instead of a maximum, or just simply diverge and fail to produce any estimates at all.<br />
Methods of finding good initial estimates depend very much on the problem at hand<br />
and may require some ingenuity.<br />
The distribution function of a Weibull random variable T can be linearized as<br />
ln(− ln[1 − F (t)]) = −a ln(b) + a ln(t).<br />
A regression of ln(− ln[1 − ˆF(t)]) on ln(t), where ˆF is the empirical distribution function, can be used to recover
initial estimates for a and b. The starting pair (1.8, 74.5) in Table 2.6.1 was obtained<br />
in this way.<br />
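A possible R sketch of this regression (not in the original notes). The median-rank plotting position (i − 0.5)/n used here for ˆF is only one of several common choices, so the resulting starting values will only roughly agree with the pair (1.8, 74.5) quoted in Table 2.6.1.

# Hypothetical starting values from the linearised Weibull distribution function
tfail <- c(17, 29, 32, 55, 61, 74, 77, 93)         # observed failure times, n = 10 on test
Fhat  <- (seq_along(tfail) - 0.5) / 10              # one possible plotting position
fit   <- lm(log(-log(1 - Fhat)) ~ log(tfail))
a0    <- unname(coef(fit)[2])                       # slope estimates a
b0    <- exp(-unname(coef(fit)[1]) / a0)            # intercept is -a*ln(b)
c(a0, b0)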
A more general method for finding initial values is the method of moments. This method estimates the kth moment about the origin E(T^k) by the sample moment ˜m_k = n⁻¹ Σ t_i^k. For a Weibull distribution the kth moment about the origin equals b^k Γ(1 + k/a), where Γ(c) = ∫_0^∞ u^{c−1} e^{−u} du is the usual gamma function satisfying Γ(c) = (c − 1)Γ(c − 1). It is easy to show that E(T²)/[E(T)]² = 2aΓ(2/a)/[Γ(1/a)]². An estimate a0 of a can be obtained by solving ˜m_2/(˜m_1)² = 2aΓ(2/a)/[Γ(1/a)]² for a. The corresponding estimate of b is then b0 = ˜m_1/Γ(1 + 1/a0). The starting pair (2.36, 72) in Table 2.6.1 was obtained in this way.
Fisher’s method of scoring<br />
One modification to Newton’s method is to replace the matrix I(θ) in (2.8) with Ī(θ) =<br />
E [I(θ)] . The matrix Ī(θ) is positive definite, thus overcoming many of the problems<br />
regarding matrix inversion. Like Newton-Raphson, there is no guarantee that Fisher’s<br />
method of scoring will avoid producing negative parameter estimates or converging to<br />
local minima. Unfortunately, calculating Ī(θ) often can be mathematically difficult.<br />
The profile likelihood<br />
Setting Sb(a, b) = 0 from equation (2.7) we can solve for b in terms of a as

b = [ (1/m) Σ_{i=1}^n t_i^a ]^{1/a}.   (2.10)
This reduces our two parameter problem to a search over the “new” one-parameter profile log-likelihood

ℓ_a(a) = ℓ(a, b(a)) = m ln(a) − m ln[ (1/m) Σ_{i=1}^n t_i^a ] + (a − 1) Σ_{i=1}^m ln(t_i) − m.   (2.11)
Given an initial guess a0 for a, an improved estimate a1 can be obtained using a<br />
single parameter Newton-Raphson updating step a1 = a0 + S(a0)/I(a0), where S(a)<br />
and I(a) are now obtained from ℓa(a). The estimates â = 1.924941 and ˆ b = 78.12213<br />
were obtained by applying this method to the Weibull data using starting values a0 =<br />
0.001 and a0 = 5.8 in 16 and 13 iterations respectively. However, the starting value<br />
a0 = 5.9 produced the sequence of estimates a1 = −5.544163, a2 = 8.013465, a3 =<br />
−16.02908, a4 = 230.0001 and subsequently crashed.<br />
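Rather than coding the one-dimensional Newton step by hand, an alternative sketch (not in the notes) is to maximise the profile log-likelihood (2.11) directly with R's optimize and then recover b from (2.10); this should give essentially the same estimates.

# Maximise the profile log-likelihood (2.11) over a, then recover b via (2.10)
t <- c(17, 29, 32, 55, 61, 74, 77, 93, 100, 100); m <- 8
la <- function(a) m*log(a) - m*log(sum(t^a)/m) + (a - 1)*sum(log(t[1:m])) - m
ahat <- optimize(la, interval = c(0.01, 20), maximum = TRUE)$maximum
bhat <- (sum(t^ahat)/m)^(1/ahat)
c(ahat, bhat)                                      # close to (1.924941, 78.12213)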
Reparameterization<br />
Negative parameter estimates can be avoided by reparameterizing the profile log-<br />
likelihood in (2.11) using α = ln(a). Since a = e α we are guaranteed to obtain a > 0.<br />
The reparameterized profile log-likelihood becomes

ℓ_α(α) = mα − m ln[ (1/m) Σ_{i=1}^n t_i^{e^α} ] + (e^α − 1) Σ_{i=1}^m ln(t_i) − m,

implying score function

S(α) = m − m e^α [ Σ_{i=1}^n t_i^{e^α} ln(t_i) ] / [ Σ_{i=1}^n t_i^{e^α} ] + e^α Σ_{i=1}^m ln(t_i),

and information function

I(α) = ( m e^α / Σ_{i=1}^n t_i^{e^α} ) [ Σ_{i=1}^n t_i^{e^α} ln(t_i) + e^α Σ_{i=1}^n t_i^{e^α} [ln(t_i)]² − e^α ( Σ_{i=1}^n t_i^{e^α} ln(t_i) )² / Σ_{i=1}^n t_i^{e^α} ] − e^α Σ_{i=1}^m ln(t_i).
The estimates â = 1.924941 and ˆ b = 78.12213 were obtained by applying this method<br />
to the Weibull data using starting values a0 = 0.07 and a0 = 76 in 103 and 105<br />
iterations respectively. However, the starting values a0 = 0.06 and a0 = 77 failed due<br />
to division by computationally tiny (1.0e-300) values.<br />
The step-halving scheme<br />
The Newton-Raphson method uses the (first and second) derivatives of ℓ(θ) to max-<br />
imize the function ℓ(θ), but the function itself is not used in the algorithm. The<br />
log-likelihood can be incorporated into the Newton-Raphson method by modifying the<br />
updating step to<br />
θi+1 = θi + λiI(θi) −1 S(θi), (2.12)<br />
where the search direction has been multiplied by some λi ∈ (0, 1] chosen so that the<br />
inequality<br />
ℓ � θi + λiI(θi) −1 S(θi) � > ℓ (θi) (2.13)<br />
holds. This requirement protects the algorithm from converging towards minima or<br />
saddle points. At each iteration the algorithm sets λi = 1, and if (2.13) does not<br />
hold λi is replaced with λi/2. The process is repeated until the inequality in (2.13) is<br />
satisfied. At this point the parameter estimates are updated using (2.12) with the value<br />
of λi for which (2.13) holds. If the function ℓ(θ) is concave and unimodal convergence<br />
is guaranteed. Finally, when Ī(θ) is used in place of I(θ), convergence to a (local) maximum is guaranteed, even if ℓ(θ) is not concave.
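A generic sketch of a single step-halving update (not part of the notes), written for a scalar parameter; loglik, score and info stand for whatever log-likelihood, score and information functions are in use, and no iteration limit is included.

# One step-halving update as in (2.12)-(2.13); loglik, score, info are assumed supplied
step_halve <- function(theta, loglik, score, info) {
  direction <- score(theta) / info(theta)
  lambda <- 1
  while (loglik(theta + lambda * direction) <= loglik(theta)) {
    lambda <- lambda / 2                           # halve until (2.13) holds
  }
  theta + lambda * direction
}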
2.7 The Invariance Principle<br />
How do we deal with parameter transformation? We will assume a one-to-one transformation, but the idea applies generally. Consider a binomial sample with n = 10
independent trials resulting in data x = 8 successes. The likelihood ratio of θ1 = 0.8<br />
versus θ2 = 0.3 is

L(θ1 = 0.8)/L(θ2 = 0.3) = θ1⁸(1 − θ1)² / [ θ2⁸(1 − θ2)² ] = 208.7 ,
that is, given the data θ = 0.8 is about 200 times more likely than θ = 0.3.<br />
Suppose we are interested in expressing θ on the logit scale as<br />
ψ ≡ ln{θ/(1 − θ)} ,<br />
then ‘intuitively’ our relative information about ψ1 = ln(0.8/0.2) = 1.39 versus ψ2 =
ln(0.3/0.7) = −0.85 should be<br />
L∗(ψ1)/L∗(ψ2) = L(θ1)/L(θ2) = 208.7 .
That is, our information should be invariant to the choice of parameterization. ( For<br />
the purposes of this example we are not too concerned about how to calculate L ∗ (ψ). )<br />
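A one-line R check (not in the notes) of the likelihood ratio quoted above:

# Likelihood ratio of theta1 = 0.8 versus theta2 = 0.3 with x = 8 successes in n = 10 trials
(0.8^8 * 0.2^2) / (0.3^8 * 0.7^2)                  # approximately 208.7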
Theorem 2.10 (Invariance of the MLE). If g is a one-to-one function, and ˆ θ is<br />
the MLE of θ then g( ˆ θ) is the MLE of g(θ).<br />
Proof. This is trivially true: if we let θ = g⁻¹(µ) then f{y|g⁻¹(µ)} is maximized in µ exactly when µ = g(ˆθ). When g is not one-to-one the discussion becomes more subtle, but we simply choose to define the MLE of g(θ) to be g(ˆθ).
It seems intuitive that if ˆ θ is most likely for θ and our knowledge (data) remains<br />
unchanged then g( ˆ θ) is most likely for g(θ). In fact, we would find it strange if ˆ θ is an<br />
estimate of θ, but ˆ θ 2 is not an estimate of θ 2 . In the binomial example with n = 10<br />
and x = 8 we get ˆ θ = 0.8, so the MLE of g(θ) = θ/(1 − θ) is<br />
g( ˆ θ) = ˆ θ/(1 − ˆ θ) = 0.8/0.2 = 4.<br />
This convenient property is not necessarily true of other estimators. For example, if ˆ θ<br />
is the MVUE of θ, then g( ˆ θ) is generally not MVUE for g(θ).<br />
Frequentists generally accept the invariance principle without question. This is<br />
not the case for intelligent lifeforms such as Bayesians. The invariance property of<br />
the likelihood ratio is incompatible with the Bayesian habit of assigning a probability
distribution to a parameter.<br />
2.8 Optimality Properties of the MLE<br />
Suppose that an experiment consists of measuring random variables x1, x2, . . . , xn which<br />
are iid with probability distribution depending on a parameter θ. Let ˆ θ be the MLE<br />
of θ. Define<br />
W1 = √(E[I(θ)]) (ˆθ − θ),
W2 = √(I(θ)) (ˆθ − θ),
W3 = √(E[I(ˆθ)]) (ˆθ − θ),
W4 = √(I(ˆθ)) (ˆθ − θ).
Then, W1, W2, W3, and W4 are all random variables and, as n → ∞, the probabilistic<br />
behaviour of each of W1, W2, W3, and W4 is well approximated by that of a N(0, 1)<br />
random variable. Then, since E[W1] ≈ 0, we have that E[ ˆ θ] ≈ θ and so ˆ θ is approx-<br />
imately unbiased. Also Var[W1] ≈ 1 implies that Var[ ˆ θ] ≈ (E[I(θ)]) −1 and so ˆ θ is<br />
approximately efficient.<br />
Let the data X have probability distribution g(X; θ) where θ = (θ1, θ2, . . . , θm) is a<br />
vector of m unknown parameters. Let I(θ) be the m×m information matrix as defined<br />
above and let E[I(θ)] be the m × m matrix obtained by replacing the elements of I(θ)<br />
by their expected values. Let ˆ θ be the MLE of θ. Let CRLBr be the rth diagonal<br />
element of [E[I(θ)]] −1 . For r = 1, 2, . . . , m, define W1r = ( ˆ θr − θr)/ √ CRLBr. Then, as<br />
n → ∞, W1r behaves like a standard normal random variable.<br />
Suppose we define W2r by replacing CRLBr by the rth diagonal element of the<br />
matrix [I(θ)] −1 , W3r by replacing CRLBr by the rth diagonal element of the matrix<br />
[EI( ˆ θ)] −1 and W4r by replacing CRLBr by the rth diagonal element of the matrix<br />
[I( ˆ θ)] −1 . Then it can be shown that as n → ∞, W2r, W3r, and W4r all behave like<br />
standard normal random variables.<br />
2.9 Data Reduction<br />
Definition 2.11 (Sufficiency). Consider a statistic T = t(X) that summarises the<br />
data so that no information about θ is lost. Then we call t(X) a sufficient statistic. �<br />
Example 2.12. T = t(X) = ¯ X is sufficient for µ when Xi ∼ iid N(µ, σ 2 ). �<br />
To better understand the motivation behind the concept of sufficiency consider
three independent Binomial trials where θ = P (X = 1).<br />
Event                         Probability     Set
0 0 0                         (1 − θ)³        A0
1 0 0,  0 1 0,  0 0 1         θ(1 − θ)²       A1
0 1 1,  1 0 1,  1 1 0         θ²(1 − θ)       A2
1 1 1                         θ³              A3
Knowing which Ai the sample is in carries all the information about θ. Which<br />
particular sample within Ai gives us no extra information about θ. Extra information<br />
about other aspects of the model maybe, but not about θ. Here T = t(X) = Σ Xi
equals the number of “successes”, and identifies Ai. Mathematically we can express<br />
the above concept by saying that the probability P (X = x|Ai) does not depend on θ.<br />
i.e. P (010|A1; θ) = 1/3. More generally, a statistic T = t(X) is said to be sufficient<br />
for the parameter θ if Pr (X = x|T = t) does not depend on θ. Sufficient statistics are<br />
most easily recognized through the following fundamental result:<br />
Theorem 2.11 (Neyman’s Factorization Criterion). A statistic T = t(X) is<br />
sufficient for θ if and only if the family of densities can be factorized as<br />
f(x; θ) = h(x)k {t(x); θ} , x ∈ X , θ ∈ Θ. (2.14)<br />
i.e. into a function which does not depend on θ and one which only depends on x<br />
through t(x). This is true in general. We will prove it in the case where X is discrete.<br />
Proof. Assume T is sufficient and let h(x) = Pθ {X = x|T = t(x)} be independent of<br />
θ. Let k {t; θ} = Pθ(T = t). Then<br />
f(x; θ) = Pθ {X = x|T = t(x)} Pθ {T = t(x)} = h(x)k {θ, t(x)} .<br />
Conversely assume the result in (2.14) to be true. Then

Pθ (X = x|T = t) = [ h(x)k{t(x); θ} / Σ_{y:t(y)=t} h(y)k{t(y); θ} ] 1_{{x:t(x)=t}}(x)
                 = [ h(x)k{t; θ} / ( k{t; θ} Σ_{y:t(y)=t} h(y) ) ] 1_{{x:t(x)=t}}(x)
                 = [ h(x) / Σ_{y:t(y)=t} h(y) ] 1_{{x:t(x)=t}}(x),

which is independent of θ.
Example 2.13 (Poisson). Let X = (X1, . . . , Xn) be independent and Poisson dis-<br />
tributed with mean λ so that<br />
f(x; λ) = Π_{i=1}^n (λ^{x_i}/x_i!) e^{−λ} = λ^{Σx_i} e^{−nλ} / Π_i x_i! .

Take k{Σx_i; λ} = λ^{Σx_i} e^{−nλ} and h(x) = (Π_i x_i!)⁻¹; then t(x) = Σ_i x_i is sufficient. □
Example 2.14 (Binomial). Let X = (X1, . . . , Xn) be independent and Bernoulli dis-<br />
tributed with parameter θ so that<br />
f(x; θ) = Π_{i=1}^n θ^{x_i} (1 − θ)^{1−x_i} = θ^{Σx_i} (1 − θ)^{n−Σx_i} .

Take k{Σx_i; θ} = θ^{Σx_i} (1 − θ)^{n−Σx_i} and h(x) = 1; then t(x) = Σ_i x_i is sufficient. □
Example 2.15 (Uniform). The factorization criterion works in general but care is needed if the support of the pdf depends on θ. Let X = (X1, . . . , Xn) be independent and uniformly distributed with parameter θ, so that X1, X2, . . . , Xn ∼ Unif(0, θ). Then
f(x; θ) = 1/θⁿ,   0 ≤ x_i ≤ θ ∀ i.
It is not at all obvious but t(x) = max(x_i) is a sufficient statistic. We have to show that f(x|t) is independent of θ. Well

f(x|t) = f(x, t)/fT(t).

Then

P(T ≤ t) = P(X1 ≤ t, . . . , Xn ≤ t) = Π_{i=1}^n P(X_i ≤ t) = (t/θ)ⁿ.

So

FT(t) = tⁿ/θⁿ   ⇒   fT(t) = n t^{n−1}/θⁿ.

Also

f(x, t) = 1/θⁿ ≡ f(x; θ).

Hence

f(x|t) = 1/(n t^{n−1}),

and is independent of θ. □
2.10 Worked Problems<br />
The Problems<br />
1. The continuous random variable T has probability density function<br />
f(t) = λe −λt , t > 0; λ > 0.<br />
(a) Show that the cumulative distribution function is given by<br />
(b) Deduce that<br />
F (t) = 1 − e −λt , t > 0; λ > 0.<br />
P (a ≤ T ≤ b) = e^{−λa} − e^{−λb},   0 < a < b.
(c) The accounts manager for a building society firm assumes that the time T<br />
taken to settle invoices is a random variable with the pdf f(t) given above<br />
for some unknown value of λ. For a random sample of 100 invoices, he finds<br />
that 50 are settled within one week, 35 are settled during the second week<br />
and 15 are settled after 2 weeks. Explain clearly why the likelihood function
of these data may be written as

L(λ) = k (1 − e^{−λ})^{50} (e^{−λ} − e^{−2λ})^{35} (e^{−2λ})^{15},

where k is a constant.
(d) Show that the maximum likelihood estimate ˆ λ of λ is approximately 0.836.<br />
(e) Using ˆ λ = 0.836, calculate the expected number of invoices settled within<br />
the first week, settled during the second week, and settled after 2 weeks.<br />
Hence comment briefly on how well the model fits the data.<br />
2. A finite population has nA individuals of type A and nB individuals of type<br />
B. The overall number is known to be 100, but the size of the two groups is<br />
unknown. If a sample of 5 individuals is taken without replacement, write down<br />
the likelihood for nA in the following three situations:<br />
(a) the observed sequence is A, B, A, B, A ;<br />
(b) the observed sequence is A, A, B, B, A ;<br />
(c) the precise sequence of the outcomes is not given, but it is known that three<br />
elements are As, two are Bs.<br />
Compare the likelihood functions and comment on your findings.<br />
3. A certain type of plant cell may appear in any one of four versions. According to<br />
a genetic theory, the four versions have the following probabilities of appearance:<br />
1/2 + θ/4, (1 − θ)/4, (1 − θ)/4, θ/4,<br />
where θ is a parameter not specified by the genetic theory (0 < θ < 1). If a<br />
sample of n cells had observed frequencies (a, b, c, d) where n = a + b + c + d, find<br />
the maximum likelihood estimate of θ, and an estimate of Var( ˆ θ).<br />
4. Let X1, X2, . . . , Xn be a random sample from a population with probability den-<br />
sity

f(x) = √(2/(πθ)) exp{ −x²/(2θ) },   x > 0.

(a) Show that E(X²) = θ.
(b) Find the maximum likelihood estimator (MLE), ˆ θ, of θ.<br />
(c) Show that ˆ θ is an unbiased estimator of θ and that the Cramér-Rao lower<br />
bound is attained. [You may assume Var(X 2 ) = 2θ 2 .]<br />
(d) Suppose now that φ = √ θ is the parameter of interest. Without undertaking<br />
further calculations, write down the MLE of φ and explain why it is a biased<br />
estimator of φ.<br />
5. A random sample X1, X2, . . . , Xn is available from a Poisson distribution with<br />
mean θ. Define the parameter λ = e −θ .<br />
(a) Find the maximum likelihood estimator (MLE), ˆ θ, of θ. Hence deduce the<br />
MLE of ˆ λ, of λ.<br />
(b) Find the variance of ˆ θ, and deduce the approximate variance of ˆ λ using the<br />
delta method.<br />
(c) An alternative estimator of λ is ˜λ, defined as the observed proportion of zero
observations. Find the bias of ˜ λ and show that<br />
Var(˜λ) = e^{−θ}(1 − e^{−θ})/n.
(d) Draw a rough sketch of the efficiency of ˜ λ relative to ˆ λ, and discuss its<br />
properties.<br />
6. Let X1, . . . , Xn be independent and identically distributed as N (µ, σ 2 ), where µ<br />
and σ 2 are both unknown, and assume n = 2p + 1 is odd. Define the estimators<br />
ˆµ = ¯ X = (X1 + · · · + Xn)/n, ˜µ = median(X1, . . . , Xn) = X(p+1), and<br />
ˆσ = S = √( Σ_{i=1}^n (X_i − ¯X)²/(n − 1) ).
(a) Show that ˆµ is an unbiased estimator of µ.<br />
(b) Show that ˜µ is an unbiased estimator of µ. You should exploit the identity<br />
X(p+1) − µ = (X − µ)(p+1) = −(µ − X)(p+1), and the symmetry of the
distribution of Xi − µ.<br />
(c) Show that E (ˆσ) < σ, and consequently ˆσ is a biased estimator of σ.<br />
(d) Find an unbiased estimator of σ of the form ˇσ = cS in the case p = 1, i.e.<br />
when n = 3.<br />
(e) Repeat (d), but for general n.<br />
7. Let X have density

f(x|θ) = c(θ) x² e^{−θx},   x > 0,

where θ > 0 is assumed unknown.

(a) Show that c(θ) = θ³/2.
(b) Show that ˜ θ = 2/X is an unbiased estimator of θ and find its variance.<br />
(c) Find the Fisher information i(θ) for the parameter θ and compare the vari-<br />
ance of ˜ θ to the Cramer-Rao lower bound.<br />
(d) Let µ = θ −1 and show that ˆµ = X/3 is an unbiased estimator of µ.<br />
(e) Find the variance of ˆµ and show that it attains the Cramer-Rao lower bound.<br />
8. Let X1, . . . , Xn be a random sample from a population density function f(x|θ),<br />
where θ is a parameter. Let S = S(X1, . . . , Xn) be a sufficient statistic for θ.<br />
(a) What can be said about the conditional distribution of X1, . . . , Xn given<br />
S = s?<br />
(b) State the factorisation theorem for sufficient statistics.<br />
(c) Suppose now that

f(x|θ) = x^{θ−1} e^{−x} / Γ(θ),   x > 0,

where Γ(·) is the gamma function and θ > 0 is a positive parameter. Show that

S = Σ_{i=1}^n ln X_i

is a sufficient statistic for θ.
Outline Solutions<br />
1. The cdf is F(t) = P(T ≤ t) = ∫_0^t λe^{−λυ} dυ = λ[ −(1/λ)e^{−λυ} ]_0^t = 1 − e^{−λt}. Next
P (a ≤ T ≤ b) = F (b) − F (a) = e −λa − e −λb . Assume all settlements are inde-<br />
pendent. Then P (50 in first week) = {F (1)} 50 = (1 − e −λ ) 50 , because T ≤ 1<br />
for these 50 settlements. Likewise, 1 < T ≤ 2, for the 35 in the second week,<br />
so we have P (35 in second week) = {F (2) − F (1)} 35 = (e −λ − e −2λ ) 35 . The re-<br />
maining 15 have T > 2, which has probability 1 − P (T ≤ 2) = e −2λ , and thus<br />
P (15 after week two) = (e −2λ ) 15 . The likelihood function is therefore the product<br />
L(λ) = (1 − e^{−λ})^{50}(e^{−λ} − e^{−2λ})^{35}(e^{−2λ})^{15}. Taking logarithms (always base e),

ln L(λ) = 50 ln(1 − e^{−λ}) + 35 ln[ e^{−λ}(1 − e^{−λ}) ] + 15 ln(e^{−2λ})
        = 85 ln(1 − e^{−λ}) − (35 + 30)λ = 85 ln(1 − e^{−λ}) − 65λ.

∴ (d/dλ) ln L(λ) = 85e^{−λ}/(1 − e^{−λ}) − 65 = 85/(e^{λ} − 1) − 65.

Equating to zero, 85 = 65(e^{λ} − 1) or e^{λ} = 150/65, so that ˆλ = ln(150/65) = 0.836. This is indeed a maximum; e.g. (d²/dλ²) ln L(λ) = −85e^{λ}/(e^{λ} − 1)² < 0 ∀ λ. Next
1 − e −0.836 = 0.5666; e −0.836 − e −1.672 = 0.43344 − 0.18787 = 0.2456. Hence out of<br />
100 invoices, 56.66, 24.56 and 18.78 would be expected to be paid, on this model,<br />
in weeks 1, 2 and later. The actual numbers were 50, 35 and 15. The prediction<br />
for the second week is a long way from what happened, balanced by smaller<br />
discrepancies in the other two periods. This does not seem very satisfactory.<br />
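A quick R verification (not part of the original notes) of the estimate and the fitted counts in this solution:

# Maximise the log-likelihood of problem 1 numerically and compute expected counts
negll <- function(l) -(50*log(1 - exp(-l)) + 35*log(exp(-l) - exp(-2*l)) - 30*l)
lamhat <- optimize(negll, interval = c(0.01, 5))$minimum    # about 0.836
lamhat
100 * c(1 - exp(-lamhat),                                   # week 1: about 56.7
        exp(-lamhat) - exp(-2*lamhat),                      # week 2: about 24.6
        exp(-2*lamhat))                                     # after week 2: about 18.8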
2. nA + nB = n ⇒ nB = n − nA. Suppose we observe the sequence A, B, A, B, A; then

L1(nA) = (nA/n) × ((n − nA)/(n − 1)) × ((nA − 1)/(n − 2)) × ((n − nA − 1)/(n − 3)) × ((nA − 2)/(n − 4)).

Next, suppose we observe the sequence A, A, B, B, A; then

L2(nA) = (nA/n) × ((nA − 1)/(n − 1)) × ((n − nA)/(n − 2)) × ((n − nA − 1)/(n − 3)) × ((nA − 2)/(n − 4)).

If it is known that 3 As and 2 Bs are drawn but the exact sequence is unknown then

L3(nA) = P(Y = y|nA) = (nA choose y)(n − nA choose 5 − y) / (n choose 5),   where y = 3.

This third likelihood function expands to give

L3(nA) = 10 × nA(nA − 1)(nA − 2)(n − nA)(n − nA − 1) / [ n(n − 1)(n − 2)(n − 3)(n − 4) ].

Clearly L1(nA) = L2(nA) = L3(nA) ÷ 10. The first two likelihood functions are identical. The third likelihood function is a constant times the other two, and as only ratios of likelihood functions are meaningful, L3(nA) carries the same information about our preferences for the parameter nA as the other functions.
3. L(θ) = 4^{−n} (2 + θ)^a (1 − θ)^{b+c} θ^d, so ℓ(θ) = −n ln(4) + a ln(2 + θ) + (b + c) ln(1 − θ) + d ln(θ). Differentiating we get S(θ) = a/(2 + θ) − (b + c)/(1 − θ) + d/θ, and setting S(θ) = 0 leads to the quadratic equation nθ² − {a − 2b − 2c − d}θ − 2d = 0, of which the positive root, ˆθ, satisfies the condition of maximum likelihood. If S(θ) is differentiated again with respect to θ, and expected values substituted for a, b, c, and d, we obtain Var(ˆθ) ≈ {E[I(θ)]}⁻¹ = 2θ(1 − θ)(2 + θ)/[(1 + 2θ)n].
4. E(X²) = ∫_0^∞ x² f(x) dx = √(2/(πθ)) ∫_0^∞ x² e^{−x²/(2θ)} dx = √(2/(πθ)) ∫_0^∞ x ( x e^{−x²/(2θ)} ) dx
         = √(2/(πθ)) { x[ −θ e^{−x²/(2θ)} ]_0^∞ + ∫_0^∞ θ e^{−x²/(2θ)} dx } = √(2/(πθ)) [0 − 0] + θ ∫_0^∞ f(x) dx = θ.

ℓ(θ) = ln{ (2/(πθ))^{n/2} exp( −Σ X_i²/(2θ) ) } = (n/2) ln(2/(πθ)) − (1/(2θ)) Σ X_i²,   so   S(θ) = −n/(2θ) + (1/(2θ²)) Σ X_i².

Setting this equal to zero gives ˆθ = (1/n) Σ X_i². It may be verified (by considering the second derivative) that this is indeed a maximum, and so is the MLE of θ.
Since E(X²) = θ, we immediately have E(ˆθ) = θ, i.e. ˆθ is unbiased for θ. Next, I(θ) = −n/(2θ²) + nθ/θ³ = n/(2θ²), so the Cramér-Rao lower bound is 2θ²/n. Now, Var(ˆθ) = 2θ²/n (using the hint), and so the variance of ˆθ attains the bound.
φ = √θ; so the MLE of φ is ˆφ = √(MLE of θ) = √( Σ X_i²/n ). Because φ is a non-linear transformation of θ, and ˆθ is unbiased for θ, ˆφ cannot be unbiased for φ.
φ = √ θ; so MLE of φ is ˆ φ = √ MLE of θ = �� X 2 i /n. Because φ is a non-linear<br />
transformation of θ, and ˆ θ is unbiased for θ, ˆ φ cannot be unbiased for φ.<br />
5. L(θ) = θ� X ie −nθ<br />
� Xi! , giving ℓ(θ) = ( � Xi) ln θ − nθ − ln( � Xi!) and S(θ) = � Xi<br />
so the MLE of θ is ˆ θ = 1<br />
n<br />
� Xi = ¯ X. Also I(θ) = � Xi<br />
θ 2<br />
θ<br />
− n,<br />
> 0 ∀ θ, so ˆ θ is indeed a<br />
maximum. By the “invariance property”, the MLE of λ = e −θ is ˆ λ = e −ˆ θ = e − ¯ X .<br />
The delta method gives that the variance of g( ˆ θ) is approximated by � � dg 2<br />
Var( θ) ˆ<br />
dθ<br />
evaluated at the mean of the distribution, which here is simply θ. So we need<br />
to obtain θ<br />
n<br />
� dg<br />
dθ<br />
� 2 with g(θ) = e −θ . This immediately gives dg<br />
dθ = −e−θ , so the<br />
approximate variance is θ<br />
n (−e−θ ) 2 = 1<br />
n θe−2θ .<br />
The number of zero observations is binomially distributed with p = e −θ = λ, i.e.<br />
Bin(n, λ). Thus ˜ λ, the proportion of zeros, has expected value λ, i.e. it is unbiased.<br />
Also we have Var( ˜ λ) = 1<br />
1<br />
λ(1 − λ) = n ne−θ (1 − e−θ ). Using the approximate<br />
variance from part (ii), the efficiency of ˜ λ relative to ˆ λ is given approximately by<br />
θe −2θ<br />
n<br />
n<br />
e θ (1−e θ )<br />
θ = eθ . If θ is small, the efficiency is near (but less than) unity; as<br />
−1<br />
θ increases, the efficiency decreases; as θ becomes large, the efficiency tends to 0.<br />
6. E( ¯ X) = E[(X1+X2+· · ·+Xn)/n] = [E(X1)+E(X2)+· · ·+E(Xn)]/n = nµ/n = µ.<br />
The distribution of Xi − µ is symmetric, implying E[(X − µ)(p+1)] = E[(µ − X)(p+1)]. Combining this with the identity X(p+1) − µ = (X − µ)(p+1) = −(µ − X)(p+1) yields E[X(p+1) − µ] = E[(X − µ)(p+1)] = −E[(µ − X)(p+1)] = −E[(X − µ)(p+1)], so this expectation equals its own negative and must be 0. Hence E(˜µ) = E[X(p+1)] = µ. Note that the argument applies to any
distribution which is symmetric around µ.<br />
For any positive random variable Y the identity Var(√Y) = E(Y) − [E(√Y)]² holds, so for Y = S² = Σ_{i=1}^n (X_i − ¯X)²/(n − 1) = SSD/(n − 1), clearly Var(√Y) > 0, and it follows that E(ˆσ) = E(√(SSD/(n − 1))) < √(E(SSD)/(n − 1)) = σ, hence ˆσ is not unbiased.
For n = 3, Z = SSD/σ 2 follows a χ 2 -distribution with 2 degrees of freedom,<br />
hence

E(√Z) = ∫_0^∞ √z · (1/(2Γ(1))) e^{−z/2} dz = 2^{3/2}Γ(3/2)/(2Γ(1)) = √(π/2) ,

hence ˜σ = 2ˆσ/√π is unbiased for σ. If we let d = (n − 1) we similarly find for general n that

E(√Z) = ∫_0^∞ √z · z^{d/2−1}/(2^{d/2}Γ(d/2)) e^{−z/2} dz = 2^{(d+1)/2}Γ((d + 1)/2)/(2^{d/2}Γ(d/2)) ,

so

c = √d Γ(d/2) / ( √2 Γ([d + 1]/2) )

makes cS an unbiased estimator of σ.
7. First of all c(θ)⁻¹ = ∫_0^∞ x² e^{−θx} dx = Γ(3)/θ³ = 2/θ³. Next, we get E(˜θ) = E(2/X) = θ³ ∫_0^∞ x e^{−θx} dx = θ³ θ⁻² Γ(2) = θ and so ˜θ is an unbiased estimator of θ. To get the variance we use E(˜θ²) = 2θ³ ∫_0^∞ e^{−θx} dx = 2θ², so Var(˜θ) = 2θ² − θ² = θ². To find the Fisher information for θ, we first calculate (∂/∂θ) ln f(x|θ) = −x + 3/θ, and then calculate −(∂²/∂θ²) ln f(x|θ) = 3/θ². Thus eff(˜θ) = (θ²/3)/θ² = 1/3.
E(ˆµ) = (θ³/6) ∫_0^∞ x³ e^{−θx} dx = θ³Γ(4)/(6θ⁴) = 1/θ = µ, so ˆµ is an unbiased estimator of µ. Similarly E(ˆµ²) = (θ³/18) ∫_0^∞ x⁴ e^{−θx} dx = θ³Γ(5)/(18θ⁵) = 4/(3θ²), so Var(ˆµ) = 1/(3θ²). To calculate the CRLB we find for g(θ) = 1/θ that g′(θ) = −θ⁻², so the lower bound is {g′(θ)}² I(θ)⁻¹ = (θ²/3) θ⁻⁴ = 1/(3θ²) = Var(ˆµ).
8. The formal definition of a sufficient statistic (S) is that the conditional distribu-
tion of the sample X = (X1, X2, . . . , Xn) given the value of S, that is knowing<br />
that S = s, does not depend on θ.<br />
If f(x|θ) is the joint distribution of the sample X, the statistic S is sufficient for<br />
θ if and only if there exist functions g(s|θ) and h(x) such that, for all sample
points {xi} and all θ, the density factorises f(x|θ) = g(s|θ)h(x).<br />
f(x|θ) = Π_{i=1}^n x_i^{θ−1} e^{−x_i} / {Γ(θ)}^n = [ exp{(θ − 1) Σ ln(x_i)} / {Γ(θ)}^n ] × e^{−Σ x_i} ,

where the first factor plays the role of g(Σ ln(x_i)|θ) and the second factor is h(x). Thus, by the factorisation theorem, S = Σ_{i=1}^n ln(X_i) is sufficient for θ.
Student Questions
1. Let X1, X2, . . . , Xn be iid with density f(x|θ) = θx θ−1 for 0 ≤ x ≤ 1.<br />
(a) Write down the likelihood function L(θ) and the log likelihood function ℓ(θ).<br />
(b) Derive ˆ θ the MLE of θ. Calculate the information function I(θ).<br />
(c) Show that ˆ θ is biased and calculate its bias function b(θ). HINT : Let<br />
Zi = − log[Xi] for i = 1, . . . , n and show that Z1, . . . , Zn are iid with density<br />
θ exp(−θz) for z ≥ 0. Show that as n → ∞ the bias converges to 0.<br />
(d) Based on the calculations in (a) suggest an unbiased estimate ˜ θ for θ. Cal-<br />
culate the variance of this unbiased estimate and its efficiency. Show that<br />
as n → ∞ the efficiency tends to 1.<br />
(e) Calculate the expected squared error for both ˆ θ and ˜ θ and compare them.<br />
(f) Suppose n = 10 and the data values are .21, .32, .45, .52, .58, .63, .65, .68,<br />
.70, and .72. Calculate the values of ˆ θ and ˜ θ based on these data.<br />
2. Let X1, . . . , Xn be iid with density f(x|θ) = θ 2 x exp (−θx) for x ≥ 0.<br />
(a) Write down the log likelihood function ℓ(θ).<br />
(b) Derive ˆ θ, the MLE of θ. Calculate the information function I(θ).<br />
(c) Suppose n = 10 and the data values are: .17, .28, .41, .48, .53, .58, .60, .63,<br />
.65 and .67. Calculate the MLE of θ and the information function.<br />
3. Let X1, . . . , Xn be iid with density f(x|β) = βx β−1 exp(−x β ) for x ≥ 0.<br />
(a) Write down the likelihood and log likelihood functions L(β) and ℓ(β).<br />
(b) Explain how Newton’s method may be used to find the MLE of β.<br />
(c) Suppose n = 10 and the data values are: .32, .35, .65, .65, .66, .74, .74, .82,<br />
1.00 and 1.80 . Write a program (preferably in R) to carry out Newton’s<br />
method. Try various starting values and report on what happens.<br />
4. Let X1, X2, . . . , Xn be iid with density<br />
f(x|θ) = 2θ²/(x + θ)³
for x ≥ 0 where θ > 0 is an unknown parameter. Write down the log likelihood<br />
function ℓ(θ). Calculate the information function I(θ). Explain in detail how you<br />
would calculate the MLE of θ.<br />
5. Let X1, . . . , Xn be iid with density<br />
fX(x|θ) = (1/θ) exp[−x/θ]

for 0 ≤ x < ∞. Let Y1, Y2, . . . , Ym be iid with density

fY (y|θ, λ) = (λ/θ) exp[−λy/θ]

for 0 ≤ y < ∞.
(a) Write down the likelihood and log-likelihood functions L(θ, λ) and ℓ(θ, λ).<br />
(b) Derive ( ˆ θ, ˆ λ) the MLE of (θ, λ). Calculate the information matrix I = I(θ, λ).<br />
(c) Show that ˆ θ is unbiased but that ˆ λ is biased. Suggest an alternative ˜ λ to ˆ λ<br />
which is unbiased. Show that ˆ θ has efficiency 1. Calculate the efficiency of<br />
˜λ and show that as m → ∞ the efficiency converges to 1.<br />
(d) Suppose n = 11 and the average of the data values x1, x2, . . . , x11 is 2.0.<br />
Suppose m = 5 and the average of the data values y1, y2, . . . , y5 is 4.0.<br />
Calculate the maximum likelihood estimate ( ˆ θ, ˆ λ). Evaluate the information<br />
matrix at the point ( ˆ θ, ˆ λ) and show that both of its eigenvalues are positive.<br />
Calculate ˜ λ – the unbiased alternative to ˆ λ derived in (c).<br />
6. Let X1, . . . , Xn be iid observations from a Poisson distribution with mean θ. Let<br />
Y1, . . . , Ym be iid observations from a Poisson distribution with mean λθ.<br />
(a) Write down the likelihood and log-likelihood functions L(θ, λ) and ℓ(θ, λ).
(b) Derive ( ˆ θ, ˆ λ) the MLE of (θ, λ). Calculate the information matrix I = I(θ, λ).<br />
(c) Suppose n = 10 and the average of the data values x1, x2, . . . , x10 is 2.0.<br />
Suppose m = 5 and the average of the data values y1, y2, . . . , y5 is 4.0.<br />
Calculate the maximum likelihood estimate ( ˆ θ, ˆ λ). Evaluate the information<br />
matrix at the point ( ˆ θ, ˆ λ) and show that both of its eigenvalues are positive.<br />
7. Let Y be the number of particles emitted by a radioactive source in 1 minute. Y<br />
is thought to have a Poisson distribution whose mean θ is given by exp[α + βt]<br />
where t is the temperature of the source. The numbers of particles y1, y2, . . . , yn<br />
emitted in n 1 minute periods are observed; the temperature of the source for<br />
the ith period was ti. Assume that Y1, . . . , Yn are independent with Yi having a<br />
Poisson distribution with mean θi = exp[α + βti]. Derive an expression for the<br />
log likelihood ℓ(α, β). Suppose that you were required to find the MLE (ˆα, ˆ β).<br />
Clearly describe how you would perform this task. Your account should include<br />
the derivation of the likelihood equations and a detailed account of how these<br />
equations would be solved including how initial values for the iterative procedure<br />
involved would be obtained.<br />
Chapter 3<br />
The Theory of Confidence Intervals<br />
3.1 Exact Confidence Intervals<br />
Suppose that we are going to observe the value of a random vector X. Let X denote the<br />
set of possible values that X can take and, for x ∈ X , let g(x|θ) denote the probability<br />
that X takes the value x where the parameter θ is some unknown element of the set Θ.<br />
Consider the problem of quoting a subset of θ values which are in some sense plausible<br />
in the light of the data x. We need a procedure which for each possible value x ∈ X<br />
specifies a subset C(x) of Θ which we should quote as a set of plausible values for θ.<br />
Example 3.1. Suppose we are going to observe data x where x = (x1, x2, . . . , xn), and<br />
x1, x2, . . . , xn are the observed values of random variables X1, X2, . . . , Xn which are<br />
thought to be iid N(θ, 1) for some unknown parameter θ ∈ (−∞, ∞) = Θ. Consider<br />
the subset C(x) = [¯x − 1.96/ √ n, ¯x + 1.96/ √ n]. If we carry out an infinite sequence of<br />
independent repetitions of the experiment then we will get an infinite sequence of x<br />
values and thereby an infinite sequence of subsets C(x). We might ask what proportion<br />
of this infinite sequence of subsets actually contain the fixed but unknown value of θ?<br />
Since C(x) depends on x only through the value of ¯x we need to know how ¯x<br />
behaves in the infinite sequence of repetitions. This follows from the fact that ¯X has a N(θ, 1/n) density and so Z = ( ¯X − θ)/√(1/n) = √n( ¯X − θ) has a N(0, 1) density. Thus even though
θ is unknown we can calculate the probability that the value of Z will exceed 2.78,<br />
for example, using the standard normal tables. Remember that (from a frequentist<br />
viewpoint) the probability is the proportion of experiments in the infinite sequence of<br />
repetitions which produce a value of Z greater than 2.78.<br />
In particular we have that P [|Z| ≤ 1.96] = 0.95. Thus 95% of the time Z will lie<br />
between −1.96 and +1.96. But<br />
−1.96 ≤ Z ≤ +1.96 ⇒ −1.96 ≤ √ n( ¯ X − θ) ≤ +1.96<br />
⇒ −1.96/ √ n ≤ ¯ X − θ ≤ +1.96/ √ n<br />
⇒ ¯ X − 1.96/ √ n ≤ θ ≤ ¯ X + 1.96/ √ n<br />
⇒ θ ∈ C(X)<br />
Thus we have answered the question we started with. The proportion of the infinite<br />
sequence of subsets given by the formula C(X) which will actually include the fixed<br />
but unknown value of θ is 0.95. For this reason the set C(X) is called a 95% confidence<br />
set or confidence interval for the parameter θ. �<br />
It is well to bear in mind that once we have actually carried out the experiment<br />
and observed our value of x, the resulting interval C(x) either does or does not contain<br />
the unknown parameter θ. We do not know which is the case. All we know is that<br />
the procedure we used in constructing C(x) is one which 95% of the time produces an<br />
interval which contains the unknown parameter.<br />
The crucial step in the last example was finding the quantity Z = √ n( ¯ X −θ) whose<br />
value depended on the parameter of interest θ but whose distribution was known to be<br />
that of a standard normal variable. This leads to the following definition.<br />
Definition 3.1 (Pivotal Quantity). A pivotal quantity for a parameter θ is a random<br />
variable Q(X|θ) whose value depends both on (the data) X and on the value of the<br />
unknown parameter θ but whose distribution is known. �<br />
The quantity Z in the example above is a pivotal quantity for θ. The following<br />
lemma provides a method of finding pivotal quantities in general.<br />
Lemma 3.1. Let X be a random variable and define F (a) = P [X ≤ a]. Consider the<br />
random variable U = −2 log [F (X)]. Then U has a χ 2 2 density. Consider the random<br />
variable V = −2 log [1 − F (X)]. Then V has a χ 2 2 density.<br />
Proof. Observe that, for a ≥ 0,<br />
P [U ≤ a] = P [F (X) ≥ exp (−a/2)]<br />
= 1 − P [F (X) ≤ exp (−a/2)]<br />
= 1 − P [X ≤ F −1 (exp (−a/2))]<br />
= 1 − F [F −1 (exp (−a/2))]<br />
= 1 − exp (−a/2).<br />
Hence, U has density (1/2) exp(−a/2), which is the density of a χ²₂ variable as required.
The corresponding proof for V is left as an exercise.<br />
This lemma has an immediate, and very important, application.<br />
Suppose that we have data X1, X2, . . . , Xn which are iid with density f(x|θ). Define<br />
F (a|θ) = � a<br />
−∞ f(x|θ)dx and, for i = 1, 2, . . . , n, define Ui = −2 log[F (Xi|θ)]. Then<br />
U1, U2, . . . , Un are iid each having a χ 2 2 density. Hence Q1(X, θ) = � n<br />
i=1 Ui has a χ 2 2n<br />
density and so is a pivotal quantity for θ. Another pivotal quantity ( also having a χ 2 2n<br />
density ) is given by Q2(X, θ) = � n<br />
i=1 Vi where Vi = −2 log[1 − F (Xi|θ)].<br />
Example 3.2. Suppose that we have data X1, X2, . . . , Xn which are iid with density<br />
f(x|θ) = θ exp (−θx)<br />
for x ≥ 0 and suppose that we want to construct a 95% confidence interval for θ. We<br />
need to find a pivotal quantity for θ. Observe that<br />
F(a|θ) = ∫_{−∞}^a f(x|θ) dx = ∫_0^a θ exp(−θx) dx = 1 − exp(−θa).

Hence

Q1(X, θ) = −2 Σ_{i=1}^n log[1 − exp(−θX_i)]

is a pivotal quantity for θ having a χ²_{2n} density. Also

Q2(X, θ) = −2 Σ_{i=1}^n log[exp(−θX_i)] = 2θ Σ_{i=1}^n X_i

is another pivotal quantity for θ having a χ²_{2n} density.
Using the tables, find A < B such that P [χ 2 2n < A] = P [χ 2 2n > B] = 0.025. Then<br />
0.95 = P[A ≤ Q2(X, θ) ≤ B] = P[ A ≤ 2θ Σ_{i=1}^n X_i ≤ B ] = P[ A/(2 Σ_{i=1}^n X_i) ≤ θ ≤ B/(2 Σ_{i=1}^n X_i) ]

and so the interval

[ A/(2 Σ_{i=1}^n X_i) ,  B/(2 Σ_{i=1}^n X_i) ]

is a 95% confidence interval for θ.
The other pivotal quantity Q1(X, θ) is more awkward in this example since it is not<br />
straightforward to determine the set of θ values which satisfy A ≤ Q1(X, θ) ≤ B. �<br />
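As a numerical illustration (not in the original notes), these R lines compute the 95% interval above for a simulated exponential sample; the sample size and the true θ used in the simulation are arbitrary choices.

# 95% confidence interval for an exponential rate theta via the chi-squared pivot Q2
set.seed(3)
n <- 15
x <- rexp(n, rate = 0.5)                            # simulated data, true theta = 0.5
A <- qchisq(0.025, df = 2*n)
B <- qchisq(0.975, df = 2*n)
c(A, B) / (2 * sum(x))                              # the interval [A/(2*sum(x)), B/(2*sum(x))]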
3.2 Pivotal Quantities for Use with Normal Data<br />
Many exact pivotal quantities have been developed for use with Gaussian data.<br />
Example 3.3. Suppose that we have data X1, X2, . . . , Xn which are iid observations<br />
from a N (θ, σ²) density where σ is known. Define

Q = √n( ¯X − θ)/σ .

Then Q has a N (0, 1) density and so is a pivotal quantity for θ. In particular we can be 95% sure that

−1.96 ≤ √n( ¯X − θ)/σ ≤ +1.96

which is equivalent to

¯X − 1.96 σ/√n ≤ θ ≤ ¯X + 1.96 σ/√n .

The R command qnorm(p=0.975,mean=0,sd=1) returns the value 1.959964 as the 97½% quantile from the standard normal distribution. □
Example 3.4. Suppose that we have data X1, X2, . . . , Xn which are iid observations<br />
from a N(θ, σ 2 ) density where θ is known. Define<br />
Q = Σ_{i=1}^n (X_i − θ)² / σ².

We can write Q = Σ_{i=1}^n Z_i² where Z_i = (X_i − θ)/σ. If Z_i has a N (0, 1) density then Z_i² has a χ²_1 density. Hence, Q has a χ²_n density and so is a pivotal quantity for σ. If n = 20 then we can be 95% sure that

9.591 ≤ Σ_{i=1}^n (X_i − θ)² / σ² ≤ 34.170

which is equivalent to

√( (1/34.170) Σ_{i=1}^n (X_i − θ)² ) ≤ σ ≤ √( (1/9.591) Σ_{i=1}^n (X_i − θ)² ).

The R command qchisq(p=c(.025,.975),df=20) returns the values 9.590777 and 34.169607 as the 2½% and 97½% quantiles from a Chi-squared distribution on 20 degrees of freedom. □
Lemma 3.2 (The Student t-distribution). Suppose the random variables X and Y are independent, and X ∼ N(0, 1) and Y ∼ χ²_n. Then the ratio

T = X / √(Y/n)

has pdf

fT(t|n) = (1/√(πn)) [ Γ([n + 1]/2) / Γ(n/2) ] ( 1 + t²/n )^{−(n+1)/2} ,

and is known as Student's t-distribution on n degrees of freedom.

Proof. The random variables X and Y are independent and have joint density

fX,Y (x, y) = (1/√(2π)) e^{−x²/2} · ( 2^{−n/2}/Γ(n/2) ) y^{n/2−1} e^{−y/2}   for y > 0.

The Jacobian ∂(t, u)/∂(x, y) of the change of variables

t = x/√(y/n)   and   u = y

equals

∂(t, u)/∂(x, y) = det( ∂t/∂x  ∂t/∂y ; ∂u/∂x  ∂u/∂y ) = det( √(n/y)   −x√n/(2y^{3/2}) ; 0   1 ) = (n/y)^{1/2},

and the inverse Jacobian ∂(x, y)/∂(t, u) = (u/n)^{1/2}. Then

fT(t) = ∫_0^∞ fX,Y( t(u/n)^{1/2}, u ) (u/n)^{1/2} du
      = (1/√(2π)) ( 2^{−n/2}/Γ(n/2) ) ∫_0^∞ e^{−t²u/(2n)} u^{n/2−1} e^{−u/2} (u/n)^{1/2} du
      = (1/√(2π)) ( 2^{−n/2}/(Γ(n/2) n^{1/2}) ) ∫_0^∞ e^{−(1 + t²/n)u/2} u^{(n+1)/2−1} du .

The last integrand comes from the pdf of a Gam((n + 1)/2, 1/2 + t²/(2n)) random variable. Hence

fT(t) = (1/√(πn)) [ Γ([n + 1]/2) / Γ(n/2) ] ( 1 + t²/n )^{−(n+1)/2} ,

which gives the above formula.
Example 3.5. Suppose that we have data X1, X2, . . . , Xn which are iid observations from a N (θ, σ²) density where both θ and σ are unknown. Define

Q = √n( ¯X − θ)/s

where

s² = Σ_{i=1}^n (X_i − ¯X)² / (n − 1).

We can write

Q = Z / √( W/(n − 1) )

where

Z = √n( ¯X − θ)/σ

has a N (0, 1) density and

W = Σ_{i=1}^n (X_i − ¯X)² / σ²

has a χ²_{n−1} density ( see lemma 2.9 ). It follows immediately that W is a pivotal quantity for σ. If n = 31 then we can be 95% sure that

16.79077 ≤ Σ_{i=1}^n (X_i − ¯X)² / σ² ≤ 46.97924

which is equivalent to

√( (1/46.97924) Σ_{i=1}^n (X_i − ¯X)² ) ≤ σ ≤ √( (1/16.79077) Σ_{i=1}^n (X_i − ¯X)² ).   (3.1)

The R command qchisq(p=c(.025,.975),df=30) returns the values 16.79077 and 46.97924 as the 2½% and 97½% quantiles from a Chi-squared distribution on 30 degrees of freedom. In lemma 3.2 we show that Q has a t_{n−1} density, and so is a pivotal quantity for θ. If n = 31 then we can be 95% sure that

−2.042 ≤ √n( ¯X − θ)/s ≤ +2.042

which is equivalent to

¯X − 2.042 s/√n ≤ θ ≤ ¯X + 2.042 s/√n.   (3.2)

The R command qt(p=.975,df=30) returns the value 2.042272 as the 97½% quantile from a Student t-distribution on 30 degrees of freedom. ( It is important to point out that although a probability statement involving 95% confidence has been attached to the two intervals (3.2) and (3.1) separately, this does not imply that both intervals simultaneously hold with 95% confidence. ) □
Example 3.6. Suppose that we have data X1, X2, . . . , Xn which are iid observations<br />
from a N (θ1, σ 2 ) density and data Y1, Y2, . . . , Ym which are iid observations from a<br />
N (θ2, σ 2 ) density where θ1, θ2, and σ are unknown. Let δ = θ1 − θ2 and define<br />
where

Q = [ ( ¯X − ¯Y ) − δ ] / √( s²( 1/n + 1/m ) )

and

s² = [ Σ_{i=1}^n (X_i − ¯X)² + Σ_{j=1}^m (Y_j − ¯Y)² ] / (n + m − 2).

We know that ¯X has a N (θ1, σ²/n) density and that ¯Y has a N (θ2, σ²/m) density. Then the difference ¯X − ¯Y has a N (δ, σ²[1/n + 1/m]) density. Hence

Z = ( ¯X − ¯Y − δ ) / √( σ²[1/n + 1/m] )

has a N (0, 1) density. Let W1 = Σ_{i=1}^n (X_i − ¯X)²/σ² and let W2 = Σ_{j=1}^m (Y_j − ¯Y)²/σ². Then, W1 has a χ²_{n−1} density and W2 has a χ²_{m−1} density, and W = W1 + W2 has a χ²_{n+m−2} density. We can write

Q1 = Z / √( W/(n + m − 2) )

and so, Q1 has a t_{n+m−2} density and so is a pivotal quantity for δ. Define

Q2 = [ Σ_{i=1}^n (X_i − ¯X)² + Σ_{j=1}^m (Y_j − ¯Y)² ] / σ².
Then Q2 has a χ 2 n+m−2 density and so is a pivotal quantity for σ. �<br />
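A minimal R sketch of the 95% interval for δ based on Q1 follows; the two samples are simulated and purely illustrative, and the final line simply notes that R's built-in t.test with var.equal = TRUE reproduces the same interval.

x <- rnorm(12, mean = 10, sd = 2)    # assumed samples for illustration
y <- rnorm(15, mean =  8, sd = 2)
n <- length(x); m <- length(y)
s2 <- (sum((x - mean(x))^2) + sum((y - mean(y))^2)) / (n + m - 2)  # pooled variance
tq <- qt(.975, df = n + m - 2)
se <- sqrt(s2 * (1/n + 1/m))
mean(x) - mean(y) + c(-1, 1) * tq * se      # 95% interval for delta
t.test(x, y, var.equal = TRUE)              # reports the same confidence interval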
Lemma 3.3 (The Fisher F-distribution). Let X1, X2, . . . , Xn and Y1, Y2, . . . , Ym be iid N(0, 1) random variables. The ratio

    Z = [ Σ_{i=1}^n Xi²/n ] / [ Σ_{i=1}^m Yi²/m ]

has the distribution called the Fisher, or F, distribution with parameters (degrees of freedom) n, m, or the F_{n,m} distribution for short. The corresponding pdf f_{F_{n,m}} is concentrated on the positive half axis:

    f_{F_{n,m}}(z) = [ Γ((n + m)/2) / (Γ(n/2)Γ(m/2)) ] (n/m)^{n/2} z^{n/2−1} ( 1 + (n/m)z )^{−(n+m)/2}    for z > 0.

Observe that if T ∼ t_m, then T² = Z ∼ F_{1,m}, and if Z ∼ F_{n,m}, then Z^{−1} ∼ F_{m,n}. If W1 ∼ χ²_n and W2 ∼ χ²_m, then Z = (mW1)/(nW2) ∼ F_{n,m}. □
Example 3.7. Suppose that we have data X1, X2, . . . , Xn which are iid observations from a N(θX, σX²) density and data Y1, Y2, . . . , Ym which are iid observations from a N(θY, σY²) density where θX, θY, σX, and σY are all unknown. Let λ = σX/σY and define

    F* = ŝX²/ŝY² = [ Σ_{i=1}^n (Xi − X̄)² (m − 1) ] / [ (n − 1) Σ_{j=1}^m (Yj − Ȳ)² ].

Let

    WX = Σ_{i=1}^n (Xi − X̄)²/σX²    and    WY = Σ_{j=1}^m (Yj − Ȳ)²/σY².

Then WX has a χ²_{n−1} density and WY has a χ²_{m−1} density. Hence, by Lemma 3.3,

    Q = [ WX/(n − 1) ] / [ WY/(m − 1) ] ≡ F*/λ²

has an F density with n − 1 and m − 1 degrees of freedom and so is a pivotal quantity for λ. Suppose that n = 25 and m = 13. Then we can be 95% sure that 0.39 ≤ Q ≤ 3.02, which is equivalent to

    √(F*/3.02) ≤ λ ≤ √(F*/0.39).

To see how this might work in practice try the following R commands one at a time:

x = rnorm(25, mean = 0, sd = 2)
y = rnorm(13, mean = 1, sd = 1)
Fstar = var(x)/var(y); Fstar
CutOffs = qf(p=c(.025,.975), df1=24, df2=12)
CutOffs; rev(CutOffs)
Fstar / rev(CutOffs)
var.test(x, y)

The search for a nice pivotal quantity for the difference in means δ = θX − θY when the two variances are unequal continues, and is one of the great unsolved problems in Statistics, referred to as the Behrens-Fisher problem. □
3.3 Approximate Confidence Intervals<br />
Let X1, X2, . . . , Xn be iid with density f(x|θ). Let θ̂ be the MLE of θ. We saw before that the quantities W1 = √(EI(θ))(θ̂ − θ), W2 = √(I(θ))(θ̂ − θ), W3 = √(EI(θ̂))(θ̂ − θ), and W4 = √(I(θ̂))(θ̂ − θ) all had densities which were approximately N(0, 1). Hence they are all approximate pivotal quantities for θ. W3 and W4 are the simplest to use in general. For W3 the approximate 95% confidence interval is given by [θ̂ − 1.96/√(EI(θ̂)), θ̂ + 1.96/√(EI(θ̂))]. For W4 the approximate 95% confidence interval is given by [θ̂ − 1.96/√(I(θ̂)), θ̂ + 1.96/√(I(θ̂))]. The quantity 1/√(EI(θ̂)) (or 1/√(I(θ̂))) is often referred to as the approximate standard error of the MLE θ̂.

Let X1, X2, . . . , Xn be iid with density f(x|θ) where θ = (θ1, θ2, . . . , θm) consists of m unknown parameters. Let θ̂ = (θ̂1, θ̂2, . . . , θ̂m) be the MLE of θ. We saw before that for r = 1, 2, . . . , m the quantities W1r = (θ̂r − θr)/√CRLBr, where CRLBr is the lower bound for Var(θ̂r) given in the generalisation of the Cramer-Rao theorem, each have a density which is approximately N(0, 1). Recall that CRLBr is the rth diagonal element of the matrix [EI(θ)]⁻¹. In certain cases CRLBr may depend on the values of unknown parameters other than θr and in those cases W1r will not be an approximate pivotal quantity for θr.

We also saw that if we define W2r by replacing CRLBr by the rth diagonal element of the matrix [I(θ)]⁻¹, W3r by replacing CRLBr by the rth diagonal element of the matrix [EI(θ̂)]⁻¹, and W4r by replacing CRLBr by the rth diagonal element of the matrix [I(θ̂)]⁻¹, we get three more quantities all of which have a density which is approximately N(0, 1). W3r and W4r only depend on the unknown parameter θr and so are approximate pivotal quantities for θr. However, in certain cases the rth diagonal element of the matrix [I(θ)]⁻¹ may depend on the values of unknown parameters other than θr and in those cases W2r will not be an approximate pivotal quantity for θr. Generally W3r and W4r are most commonly used.

We now examine the use of approximate pivotal quantities based on the MLE in a series of examples.
Example 3.8 (Poisson sampling continued). Recall that θ̂ = x̄ and I(θ) = Σ_{i=1}^n xi/θ² = nθ̂/θ² with E[I(θ)] = n/θ. Hence E[I(θ̂)] = I(θ̂) = n/θ̂ and the usual approximate 95% confidence interval is given by

    [ θ̂ − 1.96√(θ̂/n), θ̂ + 1.96√(θ̂/n) ].  □
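A short R sketch of this interval follows; the counts are hypothetical and purely illustrative.

x <- c(2, 0, 3, 1, 4, 2, 2, 5, 1, 3)   # assumed Poisson counts
n <- length(x)
theta.hat <- mean(x)                   # MLE
se <- sqrt(theta.hat / n)              # 1 / sqrt(I(theta.hat))
theta.hat + c(-1, 1) * 1.96 * se       # approximate 95% confidence interval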
Example 3.9 (Bernoulli trials continued). Recall that θ̂ = x̄ and

    I(θ) = Σ_{i=1}^n xi / θ² + ( n − Σ_{i=1}^n xi ) / (1 − θ)²

with

    E[I(θ)] = n / ( θ(1 − θ) ).

Hence

    E[I(θ̂)] = I(θ̂) = n / ( θ̂(1 − θ̂) )

and the usual approximate 95% confidence interval is given by

    [ θ̂ − 1.96√( θ̂(1 − θ̂)/n ), θ̂ + 1.96√( θ̂(1 − θ̂)/n ) ].  □
Example 3.10. Let X1, X2, . . . , Xn be iid observations from the density

    f(x|α, β) = αβ x^{β−1} exp[−α x^β]

for x ≥ 0 where both α and β are unknown. We saw how to calculate the MLEs of α and β using Newton's Method and that the information matrix I(α, β) is given by

    [ n/α²                        Σ_{i=1}^n xi^β log[xi]
      Σ_{i=1}^n xi^β log[xi]      n/β² + α Σ_{i=1}^n xi^β (log[xi])² ].

Let V11 and V22 be the diagonal elements of the matrix [I(α̂, β̂)]⁻¹. Then the approximate 95% confidence interval for α is

    [ α̂ − 1.96√V11, α̂ + 1.96√V11 ]

and the approximate 95% confidence interval for β is

    [ β̂ − 1.96√V22, β̂ + 1.96√V22 ].  □
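As a rough numerical illustration (not from the notes), the sketch below fits this density to simulated data by minimising the negative log-likelihood with optim and inverts the numerically computed Hessian in place of [I(α̂, β̂)]⁻¹; the data, seed and starting values are all assumptions.

set.seed(2)
x <- rweibull(100, shape = 1.5, scale = 1)   # hypothetical sample (alpha = 1, beta = 1.5)
negloglik <- function(p) {
  a <- p[1]; b <- p[2]
  -sum(log(a) + log(b) + (b - 1) * log(x) - a * x^b)
}
fit <- optim(c(1, 1), negloglik, method = "L-BFGS-B",
             lower = c(1e-6, 1e-6), hessian = TRUE)
V <- solve(fit$hessian)          # inverse of the observed information
est <- fit$par                   # (alpha.hat, beta.hat)
cbind(lower = est - 1.96 * sqrt(diag(V)),
      upper = est + 1.96 * sqrt(diag(V)))    # rows: alpha, beta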
3.4 Worked Problems<br />
The Problems<br />
1. Components are produced in an industrial process and the numbers of flaws in different components are independent and identically distributed with probability mass function p(x) = θ(1 − θ)^x, x = 0, 1, 2, . . ., where 0 < θ < 1. A random sample of n components is inspected; n0 components are found to have no flaws, n1 components are found to have exactly one flaw, and the remaining components are found to have two or more flaws.

(a) Show that the likelihood function is L(θ) = θ^{n0+n1}(1 − θ)^{2n−2n0−n1}.

(b) Find the MLE of θ and the sample information in terms of n, n0 and n1.

(c) Hence calculate an approximate 90% confidence interval for θ where 90 out of 100 components have no flaws, and seven have exactly one flaw.

2. Suppose that X1, X2, . . . , Xn is a random sample from the shifted exponential distribution with probability density function

    f(x|θ, µ) = (1/θ) e^{−(x−µ)/θ},    µ < x < ∞,

where θ > 0 and −∞ < µ < ∞. Both θ and µ are unknown, and n > 1.

(a) The sample range W is defined as W = X(n) − X(1), where X(n) = max_i Xi and X(1) = min_i Xi. It can be shown that the joint probability density function of X(1) and W is given by

    f_{X(1),W}(x(1), w) = n(n − 1)θ⁻² e^{−n(x(1)−µ)/θ} e^{−w/θ} (1 − e^{−w/θ})^{n−2},

for x(1) > µ and w > 0. Hence obtain the marginal density function of W and show that W has distribution function P(W ≤ w) = (1 − e^{−w/θ})^{n−1}, w > 0.

(b) Show that W/θ is a pivotal quantity for θ. Without carrying out any calculations, explain how this result may be used to construct a 100(1 − α)% confidence interval for θ for 0 < α < 1.

3. Let X have the logistic distribution with probability density function

    f(x) = e^{x−θ} / (1 + e^{x−θ})²,    −∞ < x < ∞,

where −∞ < θ < ∞ is an unknown parameter.

(a) Show that X − θ is a pivotal quantity and hence, given a single observation X, construct an exact 100(1 − α)% confidence interval for θ. Evaluate the interval when α = 0.05 and X = 10.

(b) Given a random sample X1, X2, . . . , Xn from the above distribution, briefly explain how you would use the central limit theorem to construct an approximate 95% confidence interval for θ. Hint: E(X) = θ and Var(X) = π²/3.
Outline Solutions<br />
1. P(0) = θ, P(1) = θ(1 − θ) and P(≥ 2) = 1 − θ − θ(1 − θ) = (1 − θ)². Thus the likelihood of n0 zeros, n1 ones and n2 components with two or more flaws is L(θ) = θ^{n0}[θ(1 − θ)]^{n1}(1 − θ)^{2(n−n0−n1)} = θ^{n0+n1}(1 − θ)^{2n−2n0−n1}. The log-likelihood is ℓ(θ) = (n0 + n1) ln θ + (2n − 2n0 − n1) ln(1 − θ) and the score function is S(θ) = dℓ(θ)/dθ = (n0 + n1)θ⁻¹ − (2n − 2n0 − n1)(1 − θ)⁻¹. Setting this equal to zero gives that θ̂ satisfies (n0 + n1)(1 − θ̂) = (2n − 2n0 − n1)θ̂, so that θ̂ = (n0 + n1)/(2n − n0). Differentiating again, I(θ) = (n0 + n1)θ⁻² + (2n − 2n0 − n1)(1 − θ)⁻² > 0, which confirms that θ̂ is a maximum. Using 1 − θ̂ = (2n − 2n0 − n1)/(2n − n0) to simplify the calculations we get

    I(θ̂) = (2n − n0)²/(n0 + n1) + (2n − n0)²/(2n − 2n0 − n1).

An approximate 90% CI for θ is θ̂ ± 1.6449[I(θ̂)]^{−1/2}. In the case when n = 100, n0 = 90 and n1 = 7, we have 2n − n0 = 110, n0 + n1 = 97 and 2n − 2n0 − n1 = 13. Thus θ̂ = 97/110 = 0.882 and the sample information is 110²/97 + 110²/13 = 1055.5115. Thus the 90% CI is 0.882 ± 1.6449(32.489)⁻¹, i.e. 0.882 ± 0.051 or (0.831, 0.933).
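The arithmetic above can be checked with a few lines of R:

n <- 100; n0 <- 90; n1 <- 7
theta.hat <- (n0 + n1) / (2 * n - n0)                            # 97/110 = 0.882
info <- (2*n - n0)^2 / (n0 + n1) + (2*n - n0)^2 / (2*n - 2*n0 - n1)   # 1055.5
theta.hat + c(-1, 1) * qnorm(0.95) / sqrt(info)                  # approx 90% CI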
2. Given f(x, w) = n(n − 1)θ⁻² e^{−n(x−µ)/θ} e^{−w/θ}(1 − e^{−w/θ})^{n−2}, where x ≡ x(1), we have fW(w) = n(n − 1)θ⁻² e^{−w/θ}(1 − e^{−w/θ})^{n−2} ∫_µ^∞ e^{−n(x−µ)/θ} dx. Putting v = x − µ, so that dv = dx, we have

    fW(w) = n(n − 1)θ⁻² e^{−w/θ}(1 − e^{−w/θ})^{n−2} ∫_0^∞ e^{−nv/θ} dv
          = n(n − 1)θ⁻² e^{−w/θ}(1 − e^{−w/θ})^{n−2} [ −(θ/n) e^{−nv/θ} ]_{v=0}^∞
          = ((n − 1)/θ) e^{−w/θ}(1 − e^{−w/θ})^{n−2}.

Next,

    P(W ≤ w) = ∫_0^w ((n − 1)/θ) e^{−y/θ}(1 − e^{−y/θ})^{n−2} dy = [ (1 − e^{−y/θ})^{n−1} ]_0^w = (1 − e^{−w/θ})^{n−1},

for 0 < w < ∞. Let Z = W/θ. Then FZ(z) = P(Z ≤ z) = P(W ≤ zθ) = (1 − e^{−z})^{n−1}, 0 < z < ∞. Z is a function of θ whose distribution does not depend on θ. Hence Z is a pivotal quantity. Choose any interval [z1, z2], where z1 ≥ 0, such that ∫_{z1}^{z2} fZ(z) dz = 1 − α for 0 < α < 1. Then, given the range W = w, we have z1 ≤ w/θ ≤ z2, and a 100(1 − α)% CI for θ is [ w/z2, w/z1 ].
3. Let W = X − θ. Then fW(w) = e^w(1 + e^w)⁻². Here X − θ is a function of θ whose distribution does not depend on θ, and so is a pivotal quantity. Now, fW(w) is symmetric about zero, and so a 100(1 − α)% confidence interval for θ (where 0 < α < 1) is {θ : c < X − θ < −c}, where c satisfies P(W ≤ c) = α/2. Hence

    ∫_{−∞}^c e^w(1 + e^w)⁻² dw = α/2  ⇔  [ −(1 + e^w)⁻¹ ]_{−∞}^c = 1 − (1 + e^c)⁻¹ = α/2  ∴  c = ln( (α/2)/(1 − α/2) ).

When α = 0.05, c = −3.664. So when X = 10, the interval is (6.336, 13.664).

Let X̄ denote the sample mean. The CLT gives that X̄ is approximately distributed N(θ, π²/(3n)). An approximate 95% CI for θ is therefore [ X̄ − 1.96π/√(3n), X̄ + 1.96π/√(3n) ].
Student Questions

1. Let X1, X2, . . . , Xn be iid with density fX(x|θ) = θ exp(−θx) for x ≥ 0.

(a) Show that ∫_0^x f(u|θ) du = 1 − exp(−θx).

(b) Use the result in (a) to establish that Q = 2θ Σ_{i=1}^n Xi is a pivotal quantity for θ and explain how to use Q to find a 95% confidence interval for θ.

(c) Derive the information I(θ). Suggest an approximate pivotal quantity for θ involving I(θ) and another approximate pivotal quantity involving I(θ̂), where θ̂ = 1/x̄ is the maximum likelihood estimate of θ. Show how both approximate pivotal quantities may be used to find approximate 95% confidence intervals for θ. Prove that the approximate confidence interval calculated using the approximate pivotal quantity involving I(θ̂) is always shorter than the approximate confidence interval calculated using the approximate pivotal quantity involving I(θ), but that the ratio of the lengths converges to 1 as n → ∞.

(d) Suppose n = 25 and Σ_{i=1}^{25} xi = 250. Use the method explained in (b) to calculate a 95% confidence interval for θ and the two methods explained in (c) to calculate approximate 95% confidence intervals for θ. Compare the three intervals obtained.
2. Let X1, X2, . . . , Xn be iid with density

    f(x|θ) = θ / (x + 1)^{θ+1}

for x ≥ 0.

(a) Derive an exact pivotal quantity for θ and explain how it may be used to find a 95% confidence interval for θ.

(b) Derive the information I(θ). Suggest an approximate pivotal quantity for θ involving I(θ) and another approximate pivotal quantity involving I(θ̂) where θ̂ is the maximum likelihood estimate of θ. Show how both approximate pivotal quantities may be used to find approximate 95% confidence intervals for θ.

(c) Suppose n = 25 and Σ_{i=1}^{25} log[xi + 1] = 250. Use the method explained in (a) to calculate a 95% confidence interval for θ and the two methods explained in (b) to calculate approximate 95% confidence intervals for θ. Compare the three intervals obtained.
3. Let X1, X2, . . . , Xn be iid with density

    f(x|θ) = θ² x exp(−θx)

for x ≥ 0.

(a) Show that ∫_0^x f(u|θ) du = 1 − exp(−θx)[1 + θx].

(b) Describe how the result from (a) can be used to construct two exact pivotal quantities for θ.

(c) Construct FOUR approximate pivotal quantities for θ based on the MLE θ̂.

(d) Suppose that n = 10 and the data values are 1.6, 2.5, 2.7, 3.5, 4.6, 5.2, 5.6, 6.4, 7.7, 9.2. Evaluate the 95% confidence interval corresponding to ONE of the exact pivotal quantities (you may need to use a computer to do this). Compare your answer to the 95% confidence intervals corresponding to each of the FOUR approximate pivotal quantities derived in (c).

4. Let X1, X2, . . . , Xn be iid each having a Poisson density

    f(x|θ) = θ^x exp(−θ) / x!

for x = 0, 1, 2, . . .. Construct FOUR approximate pivotal quantities for θ based on the MLE θ̂. Show how each may be used to construct an approximate 95% confidence interval for θ. Evaluate the four confidence intervals in the case where the data consist of n = 64 observations with an average value of x̄ = 4.5.

5. Let X1, X2, . . . , Xn be iid with density

    f1(x|θ) = (1/θ) exp[−x/θ]

for 0 ≤ x < ∞. Let Y1, Y2, . . . , Ym be iid with density

    f2(y|θ, λ) = (λ/θ) exp[−λy/θ]

for 0 ≤ y < ∞.

(a) Derive approximate pivotal quantities for each of the parameters θ and λ.

(b) Suppose that n = 10 and the average of the 10 x values is 24.0. Suppose that m = 40 and the average of the 40 y values is 12.0. Calculate approximate 95% confidence intervals for both θ and λ.
Chapter 4<br />
The Theory of Hypothesis Testing<br />
4.1 Introduction<br />
Suppose that we are going to observe the value of a random vector X. Let X denote the<br />
set of possible values that X can take and, for x ∈ X , let g(x|θ) denote the probability<br />
that X = x where the parameter θ is some unknown element of the set Θ.<br />
Our assumptions specify g, Θ, and X . A hypothesis specifies that θ belongs to some<br />
subset Θ0 of Θ. The question arises as to whether the observed data x is consistent<br />
with the hypothesis that θ ∈ Θ0, often written as H0 : θ ∈ Θ0. The hypothesis H0 is<br />
usually referred to as the null hypothesis.<br />
In a hypothesis testing situation, two types of error are possible.<br />
• The first type of error is to reject the null hypothesis H0 : θ ∈ Θ0 as being<br />
inconsistent with the observed data x when, in fact, θ ∈ Θ0 i.e. when, in fact,<br />
the null hypothesis happens to be true. This is referred to as type 1 error.<br />
• The second type of error is to fail to reject the null hypothesis H0 : θ ∈ Θ0 as<br />
being inconsistent with the observed data x when, in fact, θ /∈ Θ0 i.e. when, in<br />
fact, the null hypothesis happens to be false. This is referred to as type 2 error.<br />
Example 4.1. Suppose the data consist of a random sample X1, X2, . . . , Xn from a N(θ, 1) density. Let Θ = (−∞, ∞) and Θ0 = (−∞, 0] and consider testing H0 : θ ∈ Θ0, or in other words H0 : θ ≤ 0.

The standard estimate of θ for this example is X̄. It would seem rational to consider that the bigger the value of X̄ that we observe, the stronger is the evidence against the null hypothesis that θ ≤ 0. How big does X̄ have to be in order for us to reject H0?

Suppose that n = 25 and we observe x̄ = 0.32. What are the chances of getting such a large value for x̄ if, in fact, θ ≤ 0? We know that X̄ has a N(θ, 1/n) density, i.e. a N(θ, 0.04) density. So the probability of getting a value for x̄ as large as 0.32 is the area under a N(θ, 0.04) curve between 0.32 and ∞, which is, in turn, equal to the area under a N(0, 1) curve between (0.32 − θ)/0.20 and ∞. To evaluate the probability of getting a value for x̄ as large as 0.32 if, in fact, θ ≤ 0, we need to find the value of θ ≤ 0 for which the area under a N(0, 1) curve between (0.32 − θ)/0.20 and ∞ is maximised. Clearly this happens for θ = 0 and the resulting maximum is the area under a N(0, 1) curve between 0.32/0.20 = 1.60 and ∞, or 0.0548. This quantity is called the p-value. The p-value is used to measure the strength of the evidence against H0 : θ ≤ 0 and H0 is rejected if the p-value is less than some small number such as 0.05. You might like to try the R commands 1-pnorm(q=0.32,mean=0,sd=sqrt(.04)) and 1-pnorm(1.6). □
4.2 The General Testing Problem<br />
Suppose that we are going to observe the value of a random vector X. Let X denote the set of possible values that X can take and, for x ∈ X, let g(x; θ) denote the probability that X takes the value x, where the parameter θ is some unknown element of the set Θ. Let Θ0 be some subset of Θ and consider testing the null hypothesis that θ lies in the subset Θ0, i.e. H0 : θ ∈ Θ0.

Standard procedures involve calculating some function of the data which is called a test statistic and which we will denote by T(X), and then rejecting H0 : θ ∈ Θ0 if the observed value of T(X) turns out to be surprisingly large. We must be clear, however, about what is meant by surprisingly large. The quantity T(X) is a random variable, since if we repeated the experiment we would get a different value of X and hence a different value of T(X). The probability distribution of T(X) depends on the value of θ. Suppose we observe T(X) = t(x) = t. For any value of θ ∈ Θ, we can calculate the probability that T(X) exceeds t if that value of θ was in fact the true value. Let us denote this probability by p(t; θ). The quantity max{p(t; θ) : θ ∈ Θ0} is called the p-value. The p-value is used to measure the strength of the evidence against H0 : θ ∈ Θ0 and H0 is rejected if the p-value is less than some small number such as 0.05.
Example 4.2 (Example 4.1 continued). Consider the test statistic T(X) = X̄ and suppose we observe T(x) = t. We need to calculate p(t; θ), which is the probability that the random variable T(X) exceeds t when θ is the true value of the parameter. If θ is the true value of the parameter, T(X) has a N(θ, 1/n) density and so

    p(t; θ) = P{ N[θ, 1/n] ≥ t } = P{ N[0, 1] ≥ √n(t − θ) }.

In order to calculate the p-value we need to find θ ≤ 0 for which p(t; θ) is a maximum. Since p(t; θ) is maximised by making √n(t − θ) as small as possible, the maximum over (−∞, 0] always occurs at 0. Hence we have that p-value = P{ N[0, 1] ≥ √n t }. □
Example 4.3 (The power function). Suppose our rule is to reject H0 : θ ≤ 0 if the p-value is less than 0.05. In order for the p-value to be less than 0.05 we require √n t > 1.65 and so we reject H0 if x̄ > 1.65/√n. What are the chances of rejecting H0 if θ = 0.2? If θ = 0.2 then X̄ has a N[0.2, 1/n] density and so the probability of rejecting H0 is

    P{ N(0.2, 1/n) ≥ 1.65/√n } = P{ N(0, 1) ≥ 1.65 − 0.2√n }.

For n = 25 this is given by P{N(0, 1) ≥ 0.65} = 0.2578. This calculation can be verified using the R command 1-pnorm(1.65-0.2*sqrt(25)). The following table gives the results of this calculation for n = 25 and various values of θ.

    θ:    -1.0  -0.9  -0.8  -0.7  -0.6  -0.5  -0.4  -0.3  -0.2  -0.1
    Prob: .000  .000  .000  .000  .000  .000  .000  .001  .004  .016

    θ:     0.0   0.1   0.2   0.3   0.4   0.5   0.6   0.7   0.8   0.9   1.0
    Prob: .050  .125  .258  .440  .637  .802  .912  .968  .991  .998  .999

This is called the power function of the test. The R command Ns=seq(from=(-1),to=1,by=0.1) generates and stores the sequence −1.0, −0.9, . . . , +1.0 and the probabilities in the table were calculated using 1-pnorm(1.65-Ns*sqrt(25)). □
Example 4.4 (Sample size). How large would n have to be so that the probability of<br />
rejecting H0 when θ = 0.2 is 0.90 ? We would require 1.65 − 0.2 √ n = −1.28 which<br />
implies that √ n = (1.65 + 1.28)/0.2 or n = 215. �<br />
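The same sample-size calculation can be done in R; this two-line sketch simply reproduces the arithmetic of Example 4.4 using exact normal quantiles in place of 1.65 and 1.28.

n <- ((qnorm(0.95) + qnorm(0.90)) / 0.2)^2   # (1.645 + 1.282)^2 / 0.2^2
ceiling(n)                                   # about 215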
So the general plan for testing a hypothesis is clear: choose a test statistic T ,<br />
observe the data, calculate the observed value t of the test statistic T , calculate the<br />
p-value as the maximum over all values of θ in Θ0 of the probability of getting a value<br />
for T as large as t, and reject H0 : θ ∈ Θ0 if the p-value so obtained is too small.<br />
4.3 Hypothesis Testing for Normal Data<br />
Many standard test statistics have been developed for use with normally distributed<br />
data.<br />
Example 4.5 (One Gaussian sample). Suppose that we have data X1, X2, . . . , Xn which<br />
are iid observations from a N (µ, σ 2 ) density where both µ and σ are unknown. Here<br />
θ = (µ, σ) and Θ = {(µ, σ) : −∞ < µ < ∞, 0 < σ < ∞}. Define<br />
    X̄ = Σ_{i=1}^n Xi / n    and    s² = Σ_{i=1}^n (Xi − X̄)² / (n − 1).
(a) Suppose Θ0 = {(µ, σ) : −∞ < µ ≤ A, 0 < σ < ∞}. Define T = X̄. Let t denote the observed value of T. Then

    p(t; θ) = P[ X̄ ≥ t ]
            = P[ √n(X̄ − µ)/s ≥ √n(t − µ)/s ]
            = P[ t_{n−1} ≥ √n(t − µ)/s ].

To maximize this we choose µ in (−∞, A] as large as possible, which clearly means choosing µ = A. Hence the p-value is

    P[ t_{n−1} ≥ √n(x̄ − A)/s ].

(b) Suppose Θ0 = {(µ, σ) : A ≤ µ < ∞, 0 < σ < ∞}. Define T = −X̄. Let t denote the observed value of T. Then

    p(t; θ) = P[ −X̄ ≥ t ]
            = P[ X̄ ≤ −t ]
            = P[ √n(X̄ − µ)/s ≤ √n(−t − µ)/s ]
            = P[ t_{n−1} ≤ √n(−t − µ)/s ].

To maximize this we choose µ in [A, ∞) as small as possible, which clearly means choosing µ = A. Hence the p-value is

    P[ t_{n−1} ≤ √n(−t − A)/s ] = P[ t_{n−1} ≤ √n(x̄ − A)/s ].

(c) Suppose Θ0 = {(A, σ) : 0 < σ < ∞}. Define T = |X̄ − A|. Let t denote the observed value of T. Then

    p(t; θ) = P[ |X̄ − A| ≥ t ] = P[ X̄ ≥ A + t ] + P[ X̄ ≤ A − t ]
            = P[ √n(X̄ − µ)/s ≥ √n(A + t − µ)/s ] + P[ √n(X̄ − µ)/s ≤ √n(A − t − µ)/s ]
            = P[ t_{n−1} ≥ √n(A + t − µ)/s ] + P[ t_{n−1} ≤ √n(A − t − µ)/s ].

The maximization is trivially found by setting µ = A. Hence the p-value is

    P[ t_{n−1} ≥ √n t/s ] + P[ t_{n−1} ≤ −√n t/s ] = 2P[ t_{n−1} ≥ √n t/s ] = 2P[ t_{n−1} ≥ √n|x̄ − A|/s ].

(d) Suppose Θ0 = {(µ, σ) : −∞ < µ < ∞, 0 < σ ≤ A}. Define T = Σ_{i=1}^n (Xi − X̄)². Let t denote the observed value of T. Then

    p(t; σ) = P[ Σ_{i=1}^n (Xi − X̄)² ≥ t ]
            = P[ Σ_{i=1}^n (Xi − X̄)²/σ² ≥ t/σ² ]
            = P[ χ²_{n−1} ≥ t/σ² ].

To maximize this we choose σ in (0, A] as large as possible, which clearly means choosing σ = A. Hence the p-value is

    P[ χ²_{n−1} ≥ t/A² ] = P[ χ²_{n−1} ≥ Σ_{i=1}^n (xi − x̄)²/A² ].

(e) Suppose Θ0 = {(µ, σ) : −∞ < µ < ∞, A ≤ σ < ∞}. Define T = [ Σ_{i=1}^n (xi − x̄)² ]⁻¹, and let t denote the observed value of T. Then

    p(t; σ) = P[ 1/Σ_{i=1}^n (Xi − X̄)² ≥ t ]
            = P[ Σ_{i=1}^n (Xi − X̄)² ≤ 1/t ]
            = P[ Σ_{i=1}^n (Xi − X̄)²/σ² ≤ 1/(tσ²) ]
            = P[ χ²_{n−1} ≤ 1/(tσ²) ].

To maximize this we choose σ in [A, ∞) as small as possible, which clearly means choosing σ = A. Hence the p-value is

    P[ χ²_{n−1} ≤ 1/(tA²) ] = P[ χ²_{n−1} ≤ Σ_{i=1}^n (xi − x̄)²/A² ].

(f) Suppose Θ0 = {(µ, A) : −∞ < µ < ∞}. Define

    T = max{ Σ_{i=1}^n (Xi − X̄)²/A², A²/Σ_{i=1}^n (Xi − X̄)² }.

Let t denote the observed value of T and note that t must be greater than or equal to 1. Then

    p(t; σ) = P[ Σ_{i=1}^n (Xi − X̄)² ≥ A²t ] + P[ Σ_{i=1}^n (Xi − X̄)² ≤ A²/t ]
            = P[ Σ_{i=1}^n (Xi − X̄)²/σ² ≥ A²t/σ² ] + P[ Σ_{i=1}^n (Xi − X̄)²/σ² ≤ A²/(tσ²) ]
            = P[ χ²_{n−1} ≥ A²t/σ² ] + P[ χ²_{n−1} ≤ A²/(tσ²) ].

Since A is the only value of σ in Θ0, the maximization is trivially found by setting σ = A. Hence the p-value is

    P[ χ²_{n−1} ≥ t ] + P[ χ²_{n−1} ≤ 1/t ],

where t = max{ Σ_{i=1}^n (xi − x̄)²/A², A²/Σ_{i=1}^n (xi − x̄)² }. □
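As an illustration of part (c), the sketch below computes the two-sided p-value directly and checks it against R's built-in t.test; the data and the value of A are hypothetical.

x <- c(4.2, 5.1, 3.8, 4.9, 5.4, 4.4, 5.0, 4.7)   # assumed sample
A <- 4.5
n <- length(x); s <- sd(x)
2 * pt(sqrt(n) * abs(mean(x) - A) / s, df = n - 1, lower.tail = FALSE)
t.test(x, mu = A)$p.value                         # same value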
Example 4.6 (Two Gaussian samples). Suppose that we have data X1, X2, . . . , Xn which are iid observations from a N(µ1, σ²) density and data Y1, Y2, . . . , Ym which are iid observations from a N(µ2, σ²) density where µ1, µ2, and σ are unknown. Here

    θ = (µ1, µ2, σ)   and   Θ = {(µ1, µ2, σ) : −∞ < µ1 < ∞, −∞ < µ2 < ∞, 0 < σ < ∞}.

Define

    s² = [ Σ_{i=1}^n (xi − x̄)² + Σ_{j=1}^m (yj − ȳ)² ] / (n + m − 2).

(a) Suppose Θ0 = {(µ1, µ2, σ) : −∞ < µ1 < ∞, µ1 < µ2 < ∞, 0 < σ < ∞}. Define T = X̄ − Ȳ. Let t denote the observed value of T. Then

    p(t; θ) = P[ X̄ − Ȳ ≥ t ]
            = P[ [(X̄ − Ȳ) − (µ1 − µ2)] / √(s²(1/n + 1/m)) ≥ [t − (µ1 − µ2)] / √(s²(1/n + 1/m)) ]
            = P[ t_{n+m−2} ≥ [t − (µ1 − µ2)] / √(s²(1/n + 1/m)) ].

To maximize this we choose µ2 > µ1 in such a way as to maximize the probability, which clearly implies choosing µ2 = µ1. Hence the p-value is

    P[ t_{n+m−2} ≥ t / √(s²(1/n + 1/m)) ] = P[ t_{n+m−2} ≥ (x̄ − ȳ) / √(s²(1/n + 1/m)) ].

(b) Suppose Θ0 = {(µ1, µ2, σ) : −∞ < µ1 < ∞, −∞ < µ2 < µ1, 0 < σ < ∞}. Define T = Ȳ − X̄. Let t denote the observed value of T. Then

    p(t; θ) = P[ Ȳ − X̄ ≥ t ]
            = P[ [(Ȳ − X̄) − (µ2 − µ1)] / √(s²(1/n + 1/m)) ≥ [t − (µ2 − µ1)] / √(s²(1/n + 1/m)) ]
            = P[ t_{n+m−2} ≥ [t − (µ2 − µ1)] / √(s²(1/n + 1/m)) ].

To maximize this we choose µ2 < µ1 in such a way as to maximize the probability, which clearly implies choosing µ2 = µ1. Hence the p-value is

    P[ t_{n+m−2} ≥ t / √(s²(1/n + 1/m)) ] = P[ t_{n+m−2} ≥ (ȳ − x̄) / √(s²(1/n + 1/m)) ].

(c) Suppose Θ0 = {(µ, µ, σ) : −∞ < µ < ∞, 0 < σ < ∞}. Define T = |Ȳ − X̄|. Let t denote the observed value of T. Then

    p(t; θ) = P[ |Ȳ − X̄| ≥ t ]
            = P[ Ȳ − X̄ ≥ t ] + P[ Ȳ − X̄ ≤ −t ]
            = P[ t_{n+m−2} ≥ [t − (µ2 − µ1)] / √(s²(1/n + 1/m)) ] + P[ t_{n+m−2} ≤ [−t − (µ2 − µ1)] / √(s²(1/n + 1/m)) ].

Since, for all sets of parameter values in Θ0, we have µ1 = µ2, the maximization is trivial and so the p-value is

    P[ t_{n+m−2} ≥ t / √(s²(1/n + 1/m)) ] + P[ t_{n+m−2} ≤ −t / √(s²(1/n + 1/m)) ]
        = 2P[ t_{n+m−2} ≥ t / √(s²(1/n + 1/m)) ]
        = 2P[ t_{n+m−2} ≥ |ȳ − x̄| / √(s²(1/n + 1/m)) ].
(d) Suppose that we have data X1, X2, . . . , Xn which are iid observations from a N(µ1, σ1²) density and data Y1, Y2, . . . , Ym which are iid observations from a N(µ2, σ2²) density where µ1, µ2, σ1, and σ2 are all unknown. Here θ = (µ1, µ2, σ1, σ2) and Θ = {(µ1, µ2, σ1, σ2) : −∞ < µ1 < ∞, −∞ < µ2 < ∞, 0 < σ1 < ∞, 0 < σ2 < ∞}. Define

    s1² = Σ_{i=1}^n (xi − x̄)²/(n − 1)   and   s2² = Σ_{j=1}^m (yj − ȳ)²/(m − 1).

Suppose Θ0 = {(µ1, µ2, σ, σ) : −∞ < µ1 < ∞, −∞ < µ2 < ∞, 0 < σ < ∞}. Define

    T = max{ s1²/s2², s2²/s1² }.

Let t denote the observed value of T and observe that t must be greater than or equal to 1. Then

    p(t; θ) = P[ s1²/s2² ≥ t ] + P[ s2²/s1² ≥ t ]
            = P[ s1²/s2² ≥ t ] + P[ s1²/s2² ≤ 1/t ]
            = P[ σ2²s1²/(σ1²s2²) ≥ σ2²t/σ1² ] + P[ σ2²s1²/(σ1²s2²) ≤ σ2²/(σ1²t) ]
            = P[ F_{n−1,m−1} ≥ σ2²t/σ1² ] + P[ F_{n−1,m−1} ≤ σ2²/(σ1²t) ].

Since, for all sets of parameter values in Θ0, we have σ1 = σ2, the maximization is trivial and so the p-value is

    P[ F_{n−1,m−1} ≥ t ] + P[ F_{n−1,m−1} ≤ 1/t ]. □
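The following sketch evaluates the p-values of parts (c) and (d) for simulated data; the samples are assumed purely for illustration, and the F-based p-value implements the formula of part (d) directly (it need not coincide exactly with the p-value reported by var.test).

x <- rnorm(10, mean = 5, sd = 1.5)    # assumed samples
y <- rnorm(14, mean = 5, sd = 1.5)
n <- length(x); m <- length(y)
s2 <- (sum((x - mean(x))^2) + sum((y - mean(y))^2)) / (n + m - 2)
# part (c): two-sided test of mu1 = mu2 with a common variance
2 * pt(abs(mean(y) - mean(x)) / sqrt(s2 * (1/n + 1/m)),
       df = n + m - 2, lower.tail = FALSE)
# part (d): two-sided test of sigma1 = sigma2
tt <- max(var(x) / var(y), var(y) / var(x))
pf(tt, n - 1, m - 1, lower.tail = FALSE) + pf(1 / tt, n - 1, m - 1)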
4.4 Generally Applicable Test Procedures<br />
Suppose that we observe the value of a random vector X whose probability density<br />
function is g(x|θ) for x ∈ X, where the parameter θ = (θ1, θ2, . . . , θp) is some unknown<br />
element of the set Θ ⊆ R p . Let Θ0 be a specified subset of Θ. Consider the hypothesis<br />
H0 : θ ∈ Θ0. In this section we consider three ways in which good test statistics may<br />
be found for this general problem.<br />
The Likelihood Ratio Test: This test statistic is based on the idea that the<br />
maximum of the log likelihood over the subset Θ0 should not be too much less<br />
than the maximum over the whole set Θ if, in fact, the parameter θ actually<br />
does lie in the subset Θ0. Let ℓ(θ) denote the log likelihood function. The test<br />
statistic is<br />
T1(x) = 2[ℓ( ˆ θ) − ℓ( ˆ θ0)]<br />
where ˆ θ is the value of θ in the set Θ for which ℓ(θ) is a maximum and ˆ θ0 is the<br />
value of θ in the set Θ0 for which ℓ(θ) is a maximum.<br />
The Maximum Likelihood Test Statistic: This test statistic is based on<br />
the idea that ˆ θ and ˆ θ0 should be close to one another. Let I(θ) be the p × p<br />
information matrix. Let B = I( ˆ θ). The test statistic is<br />
T2(x) = ( ˆ θ − ˆ θ0) T B( ˆ θ − ˆ θ0)<br />
Other forms of this test statistic follow by choosing B to be I( ˆ θ0) or EI( ˆ θ) or<br />
EI( ˆ θ0).<br />
The Score Test Statistic: This test statistic is based on the idea that ˆ<br />
θ0 should<br />
almost solve the likelihood equations. Let S(θ) be the p × 1 vector whose rth<br />
element is given by ∂ℓ/∂θr. Let C be the inverse of I( ˆ θ0) i.e. C = I( ˆ θ0) −1 . The<br />
test statistic is<br />
T3(x) = S( ˆ θ0) T CS( ˆ θ0)<br />
In order to calculate p-values we need to know the probability distribution of the<br />
test statistic under the null hypothesis. Deriving the exact probability distribution<br />
may be difficult but approximations suitable for situations in which the sample size<br />
is large are available in the special case where Θ is a p dimensional set and Θ0 is a<br />
q dimensional subset of Θ for q < p, whence it can be shown that, when H0 is true,<br />
the probability distributions of T1(x), T2(x) and T3(x) are all approximated by a χ 2 p−q<br />
density.<br />
Example 4.7. Let X1, X2, . . . , Xn be iid each having a Poisson distribution with parameter θ. Consider testing H0 : θ = θ0 where θ0 is some specified constant. Recall that

    ℓ(θ) = [ Σ_{i=1}^n xi ] log[θ] − nθ − Σ_{i=1}^n log[xi!].

Here Θ = [0, ∞) and the value of θ ∈ Θ for which ℓ(θ) is a maximum is θ̂ = x̄. Also Θ0 = {θ0} and so trivially θ̂0 = θ0. We saw also that

    S(θ) = Σ_{i=1}^n xi / θ − n   and that   I(θ) = Σ_{i=1}^n xi / θ².

Suppose that θ0 = 2, n = 40 and that when we observe the data we get x̄ = 2.50. Hence Σ_{i=1}^n xi = 100. Then

    T1 = 2[ℓ(2.5) − ℓ(2.0)] = 200 log(2.5) − 200 − 200 log(2.0) + 160 = 4.62.

The information is B = I(θ̂) = 100/2.5² = 16. Hence

    T2 = (θ̂ − θ̂0)² B = 0.25 × 16 = 4.

We have S(θ0) = S(2.0) = 10 and I(θ0) = 25 and so

    T3 = 10²/25 = 4.

Here p = 1, q = 0 implying p − q = 1. Since P[χ²1 ≥ 3.84] = 0.05, all three test statistics produce a p-value less than 0.05 and lead to the rejection of H0 : θ = 2. □
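The three statistics and their approximate p-values can be reproduced in a few lines of R; the Σ log(xi!) term is omitted since it cancels in the likelihood ratio.

n <- 40; xbar <- 2.5; theta0 <- 2
sum.x <- n * xbar
loglik <- function(theta) sum.x * log(theta) - n * theta   # constant term omitted
T1 <- 2 * (loglik(xbar) - loglik(theta0))                  # likelihood ratio: about 4.63
T2 <- (xbar - theta0)^2 * (sum.x / xbar^2)                 # Wald-type: 4
T3 <- (sum.x / theta0 - n)^2 / (sum.x / theta0^2)          # score: 4
c(T1, T2, T3)
pchisq(c(T1, T2, T3), df = 1, lower.tail = FALSE)          # all below 0.05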
Example 4.8. Let X1, X2, . . . , Xn be iid with density f(x|α, β) = αβx^{β−1} exp(−αx^β) for x ≥ 0. Consider testing H0 : β = 1. Here Θ = {(α, β) : 0 < α < ∞, 0 < β < ∞} and Θ0 = {(α, 1) : 0 < α < ∞} is a one-dimensional subset of the two-dimensional set Θ. Recall that ℓ(α, β) = n log[α] + n log[β] + (β − 1) Σ_{i=1}^n log[xi] − α Σ_{i=1}^n xi^β. Hence the vector S(α, β) is given by

    [ n/α − Σ_{i=1}^n xi^β
      n/β + Σ_{i=1}^n log[xi] − α Σ_{i=1}^n xi^β log[xi] ]

and the matrix I(α, β) is given by

    [ n/α²                        Σ_{i=1}^n xi^β log[xi]
      Σ_{i=1}^n xi^β log[xi]      n/β² + α Σ_{i=1}^n xi^β (log[xi])² ].

We have that θ̂ = (α̂, β̂), which requires Newton's method for its calculation. Also θ̂0 = (α̂0, 1) where α̂0 = 1/x̄. Suppose that the observed value of T1(x) is 3.20. Then the p-value is P[T1(x) ≥ 3.20] ≈ P[χ²1 ≥ 3.20] = 0.0736. In order to get the maximum likelihood test statistic, plug in the values α̂, β̂ for α, β in the formula for I(α, β) to get the matrix B. Then calculate T2(x) = (θ̂ − θ̂0)ᵀ B(θ̂ − θ̂0) and use the χ²1 tables to calculate the p-value. Finally, to calculate the score test statistic note that the vector S(θ̂0) is given by

    [ 0
      n + Σ_{i=1}^n log[xi] − Σ_{i=1}^n xi log[xi]/x̄ ]

and the matrix I(θ̂0) is given by

    [ nx̄²                      Σ_{i=1}^n xi log[xi]
      Σ_{i=1}^n xi log[xi]      n + Σ_{i=1}^n xi (log[xi])²/x̄ ].

Since T3(x) = S(θ̂0)ᵀ C S(θ̂0), where C = I(θ̂0)⁻¹, we have that T3(x) is

    [ n + Σ_{i=1}^n log[xi] − Σ_{i=1}^n xi log[xi]/x̄ ]²

multiplied by the lower diagonal element of C, which is given by

    nx̄² / { [nx̄²][ n + Σ_{i=1}^n xi (log[xi])²/x̄ ] − [ Σ_{i=1}^n xi log[xi] ]² }.

Hence we get that

    T3(x) = [ n + Σ_{i=1}^n log[xi] − Σ_{i=1}^n xi log[xi]/x̄ ]² nx̄² / { [nx̄²][ n + Σ_{i=1}^n xi (log[xi])²/x̄ ] − [ Σ_{i=1}^n xi log[xi] ]² }.

No iterative techniques are needed to calculate the value of T3(x), and for this reason the score test is often preferred to the other two. However, there is some evidence that the likelihood ratio test is more powerful in the sense that it has a better chance of detecting departures from the null hypothesis. □
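A small numerical illustration of the score statistic follows; the data are simulated under H0 (β = 1 makes the density exponential with rate α), and the seed and rate are assumptions made only for the example.

set.seed(3)
x <- rexp(50, rate = 0.8)                 # under H0 the data are exponential
n <- length(x); xbar <- mean(x)
s2 <- n + sum(log(x)) - sum(x * log(x)) / xbar        # second element of S(theta0.hat)
I0 <- matrix(c(n * xbar^2,        sum(x * log(x)),
               sum(x * log(x)),   n + sum(x * log(x)^2) / xbar), 2, 2)
T3 <- s2^2 * solve(I0)[2, 2]              # S^T I0^{-1} S, since the first element is 0
pchisq(T3, df = 1, lower.tail = FALSE)    # approximate p-value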
4.5 The Neyman-Pearson Lemma<br />
Suppose we are testing a simple null hypothesis H0 : θ = θ ′ against a simple alternative<br />
H1 : θ = θ ′′ , where θ is the parameter of interest, and θ ′ , θ ′′ are particular values of θ.<br />
Observed values of the i.i.d. random variables X1, X2, . . . , Xn, each with p.d.f. fX(x|θ),<br />
are available. We are going to reject H0 if (x1, x2, . . . , xn) ∈ C, where C is a region of<br />
the n-dimensional space called the critical region. This specifies a test of hypothesis.<br />
We say that the critical region C has size α if the probability of a Type I error is α:<br />
P [ (X1, X2, . . . , Xn) ∈ C|H0 ] = α.<br />
We call C a best critical region of size α if it has size α, and<br />
P [ (X1, X2, . . . , Xn) ∈ C|H1 ] ≥ P [ (X1, X2, . . . , Xn) ∈ A|H1 ]
for every subset A of the sample space for which P [ (X1, X2, . . . , Xn) ∈ A|H0 ] = α.
Thus, the power of the test associated with the best critical region C is at least as great<br />
as the power of the test associated with any other critical region A of size α. The<br />
Neyman-Pearson lemma provides us with a way of finding a best critical region.<br />
Lemma 4.1 (The Neyman-Pearson lemma). If k > 0 and C is a subset of the<br />
sample space such that<br />
(a) L(θ ′ )/L(θ ′′ ) ≤ k for all (x1, x2, . . . , xn) ∈ C<br />
(b) L(θ ′ )/L(θ ′′ ) ≥ k for all (x1, x2, . . . , xn) ∈ C ∗ ,<br />
(c) α = P [(X1, X2, . . . , Xn) ∈ C|H0]<br />
where C ∗ is the complement of C, then C is a best critical region of size α for testing the<br />
simple hypothesis H0 : θ = θ ′ against the alternative simple hypothesis H1 : θ = θ ′′ .<br />
Proof. Suppose for simplicity that the random variables X1, X2, . . . , Xn are continuous. (If they were discrete, the proof would be the same, except that integrals would be replaced by sums.) For any region R of n-dimensional space, we will denote the probability that X ∈ R by ∫_R L(θ), where θ is the true value of the parameter. The full notation, omitted to save space, would be

    P(X ∈ R|θ) = ∫···∫_R L(θ|x1, . . . , xn) dx1 · · · dxn .

We need to prove that if A is another critical region of size α, then the power of the test associated with C is at least as great as the power of the test associated with A, or in the present notation, that

    ∫_A L(θ′′) ≤ ∫_C L(θ′′).    (4.1)
Suppose X ∈ A* ∩ C. Then X ∈ C, so by (a),

    ∫_{A*∩C} L(θ′′) ≥ (1/k) ∫_{A*∩C} L(θ′).    (4.2)

Next, suppose X ∈ A ∩ C*. Then X ∈ C*, so by (b),

    ∫_{A∩C*} L(θ′′) ≤ (1/k) ∫_{A∩C*} L(θ′).    (4.3)

We now establish (4.1), thereby completing the proof.

    ∫_A L(θ′′) = ∫_{A∩C} L(θ′′) + ∫_{A∩C*} L(θ′′)
               = ∫_C L(θ′′) − ∫_{A*∩C} L(θ′′) + ∫_{A∩C*} L(θ′′)
               ≤ ∫_C L(θ′′) − (1/k) ∫_{A*∩C} L(θ′) + (1/k) ∫_{A∩C*} L(θ′)    (see (4.2), (4.3))
               = ∫_C L(θ′′) − (1/k) ∫_{A*∩C} L(θ′) + (1/k) ∫_{A∩C*} L(θ′)
                   + [ (1/k) ∫_{A∩C} L(θ′) − (1/k) ∫_{A∩C} L(θ′) ]    (add zero)
               = ∫_C L(θ′′) − (1/k) ∫_C L(θ′) + (1/k) ∫_A L(θ′)    (collect terms)
               = ∫_C L(θ′′) − α/k + α/k
               = ∫_C L(θ′′)

since both C and A have size α.
Example 4.9. Suppose X1, . . . , Xn are iid N(θ, 1), and we want to test H0 : θ = θ′ versus H1 : θ = θ′′, where θ′′ > θ′. According to the Z-test, we should reject H0 if Z = √n(X̄ − θ′) is large, or equivalently if X̄ is large. We can now use the Neyman-Pearson lemma to show that the Z-test is “best”. The likelihood function is

    L(θ) = (2π)^{−n/2} exp{ −Σ_{i=1}^n (xi − θ)²/2 }.

According to the Neyman-Pearson lemma, a best critical region is given by the set of (x1, . . . , xn) such that L(θ′)/L(θ′′) ≤ k1, or equivalently, such that

    (1/n) ln[L(θ′′)/L(θ′)] ≥ k2.

But

    (1/n) ln[L(θ′′)/L(θ′)] = (1/n) Σ_{i=1}^n [ (xi − θ′)²/2 − (xi − θ′′)²/2 ]
                           = (1/2n) Σ_{i=1}^n [ (xi² − 2θ′xi + θ′²) − (xi² − 2θ′′xi + θ′′²) ]
                           = (1/2n) Σ_{i=1}^n [ 2(θ′′ − θ′)xi + θ′² − θ′′² ]
                           = (θ′′ − θ′)x̄ + (1/2)[θ′² − θ′′²].

So the best test rejects H0 when x̄ ≥ k, where k is a constant. But this is exactly the form of the rejection region for the Z-test. Therefore, the Z-test is “best”. □
4.6 Goodness of Fit Tests<br />
Suppose that we have a random experiment with a random variable Y of interest.<br />
Assume additionally that Y is discrete with density function f on a finite set S. We<br />
repeat the experiment n times to generate a random sample Y1, Y2, . . . , Yn from the<br />
distribution of Y . These are independent variables, each with the distribution of Y .<br />
In this section, we assume that the distribution of Y is unknown. For a given<br />
density function f0, we will test the hypotheses H0 : f = f0 versus H1 : f ≠ f0. The<br />
test that we will construct is known as the goodness of fit test for the conjectured<br />
density f0. As usual, our challenge in developing the test is to find an appropriate test<br />
statistic – one that gives us information about the hypotheses and whose distribution,<br />
under the null hypothesis, is known, at least approximately.<br />
Suppose that S = {y1, y2, . . . , yk}. To simplify the notation, let pj = f0(yj) for j = 1, 2, . . . , k. Now let Nj = #{i ∈ {1, 2, . . . , n} : Yi = yj} for j = 1, 2, . . . , k. Under the null hypothesis, (N1, N2, . . . , Nk) has the multinomial distribution with parameters n and p1, p2, . . . , pk with E(Nj) = npj and Var(Nj) = npj(1 − pj). This result indicates how we might begin to construct our test: for each j we can compare the observed frequency of yj (namely Nj) with the expected frequency of value yj (namely npj), under the null hypothesis. Specifically, our test statistic will be

    X² = (N1 − np1)²/(np1) + (N2 − np2)²/(np2) + · · · + (Nk − npk)²/(npk).
Note that the test statistic is based on the squared errors (the differences between the<br />
expected frequencies and the observed frequencies). The reason that the squared errors<br />
are scaled as they are is the following crucial fact, which we will accept without proof:<br />
under the null hypothesis, as n increases to infinity, the distribution of X 2 converges<br />
to the chi-square distribution with k − 1 degrees of freedom.<br />
For m > 0 and r in (0, 1), we will let χ 2 m,r denote the quantile of order r for<br />
the chi-square distribution with m degrees of freedom. Then, the following test has<br />
approximate significance level α: reject H0 : f = f0 versus H1 : f ≠ f0, if and only if X² > χ²_{k−1,1−α}. The test is an approximate one and works best when n is large. Just<br />
how large n needs to be depends on the pj. One popular rule of thumb proposes that<br />
the test will work well if all the expected frequencies satisfy npj ≥ 1 and at least 80%<br />
of the expected frequencies satisfy npj ≥ 5.<br />
Example 4.10 (Genetical inheritance). In crosses between two types of maize four distinct types of plants were found in the second generation. In a sample of 1301 plants there were 773 green, 231 golden, 238 green-striped, and 59 golden-green-striped. According to a simple theory of genetical inheritance the probabilities of obtaining these four plants are 9/16, 3/16, 3/16 and 1/16 respectively. Is the theory acceptable as a model for this experiment?

Formally we will consider the hypotheses:

    H0 : p1 = 9/16, p2 = 3/16, p3 = 3/16 and p4 = 1/16;
    H1 : not all the above probabilities are correct.

The expected frequency for each type of plant under H0 is npi = 1301pi. We therefore calculate the following table:

    Observed Counts    Expected Counts    Contributions to X²
    Oi                 Ei                 (Oi − Ei)²/Ei
    773                731.8125           2.318
    231                243.9375           0.686
    238                243.9375           0.145
     59                 81.3125           6.123
                                          X² = 9.272

Since X² embodies the differences between the observed and expected values, we can say that if X² is large there is a big difference between what we observe and what we expect, so the theory does not seem to be supported by the observations. If X² is small the observations apparently conform to the theory and act as support for it. The test statistic X² is distributed approximately as χ² on 3 degrees of freedom. In order to define what we would consider to be an unusually large value of X² we will choose a significance level of α = 0.05. The R command qchisq(p=0.05,df=3,lower.tail=FALSE) calculates the 5% critical value for the test as 7.815. Since our value of X² is greater than the critical value 7.815 we reject H0 and conclude that the theory is not a good model for these data. The R command pchisq(q=9.272,df=3,lower.tail=FALSE) calculates the p-value for the test equal to 0.026. (These data are examined further in chapter 9 of Snedecor and Cochran.) □
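The same test is available directly through R's chisq.test; this short sketch reproduces the numbers above.

obs <- c(773, 231, 238, 59)
p0  <- c(9, 3, 3, 1) / 16
chisq.test(obs, p = p0)          # X-squared = 9.27, df = 3, p-value = 0.026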
Very often we do not have a list of probabilities to specify our hypothesis as we had<br />
in the above example. Rather our hypothesis relates to the probability distribution<br />
of the counts without necessarily specifying the parameters of the distribution. For<br />
instance, we might want to test that the number of male babies born on successive<br />
days in a maternity hospital followed a binomial distribution, without specifying the<br />
probability that any given baby will be male. Or, we might want to test that the<br />
number of defective items in large consignments of spare parts for cars, follows a Poisson<br />
distribution, again without specifying the parameter of the distribution.<br />
The X 2 test is applicable when all the probabilities depend on unknown parame-<br />
ters, provided that the unknown parameters are replaced by their maximum likelihood<br />
estimates and provided that one degree of freedom is deducted for each parameter<br />
estimated.<br />
Example 4.11. Feller reports an analysis of flying-bomb hits in the south of London during World War II. Investigators partitioned the area into 576 sectors, each being ¼ km². The following table gives the resulting data:

    No. of hits (x)              0    1    2    3    4    5
    No. of sectors with x hits  229  211   93   35    7    1

If the hit pattern is random, in the sense that the probability that a bomb will land in any particular sector is constant, irrespective of the landing place of previous bombs, a Poisson distribution might be expected to model the data.

    x    P(x) = θ̂^x e^{−θ̂}/x!    Expected 576 × P(x)    Observed    Contributions to X², (Oi − Ei)²/Ei
    0    0.395                    227.53                 229         0.0095
    1    0.367                    211.34                 211         0.0005
    2    0.170                     98.15                  93         0.2702
    3    0.053                     30.39                  35         0.6993
    4    0.012                      7.06                   7         0.0005
    5    0.002                      1.31                   1         0.0734
                                                                     X² = 1.0534

The MLE of θ was calculated as θ̂ = 535/576 = 0.9288, that is, the total number of observed hits divided by the number of sectors. We carry out the chi-squared test as before except that we now subtract one additional degree of freedom because we had to estimate θ. The test statistic X² is distributed approximately as χ² on 4 degrees of freedom. The R command qchisq(p=0.05,df=4,lower.tail=FALSE) calculates the 5% critical value for the test as 9.488. Alternatively, the R command pchisq(q=1.0534,df=4,lower.tail=FALSE) calculates the p-value for the test equal to 0.90. The result of the chi-squared test is not statistically significant, indicating that the divergence between the observed and expected counts can be regarded as random fluctuations about the expected values. Feller comments, “It is interesting to note that most people believed in a tendency of the points of impact to cluster. If this were true, there would be a higher frequency of sectors with either many hits or no hits and a deficiency in the intermediate classes. The above table indicates perfect randomness and homogeneity of the area; we have here an instructive illustration of the established fact that to the untrained eye randomness appears as regularity or tendency to cluster.” □
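The calculation can be reproduced in R as follows; as in the notes, the final cell uses the raw Poisson probability at x = 5 rather than the tail probability.

obs   <- c(229, 211, 93, 35, 7, 1)
hits  <- 0:5
theta <- sum(hits * obs) / sum(obs)            # MLE 535/576 = 0.9288
expd  <- sum(obs) * dpois(hits, theta)         # expected counts
X2    <- sum((obs - expd)^2 / expd)            # about 1.05
pchisq(X2, df = length(obs) - 2, lower.tail = FALSE)   # df = 6 - 1 - 1 = 4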
4.7 The χ 2 Test for Contingency Tables<br />
Let X and Y be a pair of categorical variables and suppose there are r possible values<br />
for X and c possible values for Y . Examples of categorical variables are Religion, Race,<br />
Social Class, Blood Group, Wind Direction, Fertiliser Type etc. The random variables<br />
X and Y are said to be independent if P [X = a, Y = b] = P [X = a]P [Y = b] for all<br />
possible values a of X and b of Y . In this section we consider how to test the null<br />
hypothesis of independence using data consisting of a random sample of N observations<br />
from the joint distribution of X and Y .<br />
Example 4.12. A study was carried out to investigate whether hair colour (columns) and eye colour (rows) were genetically linked. A genetic link would be supported if the proportions of people having various eye colourings varied from one hair colour grouping to another. 1000 people were chosen at random and their hair colour and eye colour recorded. The data are summarised in the following table:

    Oij      Black   Brown   Fair   Red   Total
    Brown       60     110     42    30     242
    Green       67     142     28    35     272
    Blue       123     248     90    25     486
    Total      250     500    160    90    1000

The proportion of people with red hair is 90/1000 = 0.09 and the proportion having blue eyes is 486/1000 = 0.486. So if eye colour and hair colour were truly independent we would expect the proportion of people having both red hair and blue eyes to be approximately equal to (0.090)(0.486) = 0.04374, or equivalently we would expect the number of people having both red hair and blue eyes to be close to (1000)(0.04374) = 43.74; the observed number is 25. We can do similar calculations for all other combinations of hair colour and eye colour to derive the following table of expected counts:

    Eij      Black   Brown   Fair     Red    Total
    Brown     60.5     121   38.72   21.78     242
    Green     68.0     136   43.52   24.48     272
    Blue     121.5     243   77.76   43.74     486
    Total    250.0     500  160.00   90.00    1000

In order to test the null hypothesis of independence we need a test statistic which measures the magnitude of the discrepancy between the observed table and the table that would be expected if independence were in fact true. In the early part of the twentieth century, long before the invention of maximum likelihood or the formal theory of hypothesis testing, Karl Pearson (one of the founding fathers of Statistics) proposed the following method of constructing such a measure of discrepancy: for each cell in the table calculate (Oij − Eij)²/Eij, where Oij is the observed count and Eij is the expected count, and add the resulting values across all cells of the table. The resulting total is called the χ² test statistic, which we will denote by W. The null hypothesis of independence is rejected if the observed value of W is surprisingly large. In the hair and eye colour example the discrepancies are as follows:

    (Oij − Eij)²/Eij    Black   Brown   Fair    Red
    Brown               0.004   1.000   0.278   3.102
    Green               0.015   0.265   5.535   4.521
    Blue                0.019   0.103   1.927   8.029

    W = Σ_{i=1}^r Σ_{j=1}^c (Oij − Eij)²/Eij = 24.796

What we would now like to calculate is the p-value, which is the probability of getting a value for W as large as 24.796 if the hypothesis of independence were in fact true. Fisher showed that, when the hypothesis of independence is true, W behaves somewhat like a χ² random variable with degrees of freedom given by (r − 1)(c − 1), where r is the number of rows in the table and c is the number of columns. In our example r = 3, c = 4 and so (r − 1)(c − 1) = 6 and so the p-value is P[W ≥ 24.796] ≈ P[χ²6 ≥ 24.796] = 0.0004. Hence we reject the independence hypothesis. □
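The whole analysis is a one-liner once the observed table is entered; the sketch below uses R's built-in chisq.test to reproduce W, the degrees of freedom and the p-value.

counts <- matrix(c( 60, 110, 42, 30,
                    67, 142, 28, 35,
                   123, 248, 90, 25), nrow = 3, byrow = TRUE,
                 dimnames = list(Eyes = c("Brown", "Green", "Blue"),
                                 Hair = c("Black", "Brown", "Fair", "Red")))
chisq.test(counts)     # X-squared = 24.8, df = 6, p-value = 0.0004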
4.8 Worked Problems<br />
The Problems<br />
1. A random sample of n flowers is taken from a colony and the numbers X, Y and<br />
Z of the three genotypes AA, Aa and aa are observed, where X + Y + Z = n.<br />
Under the hypothesis of random cross-fertilisation, each flower has probabilities<br />
θ 2 , 2θ(1−θ) and (1−θ) 2 of belonging to the respective genotypes, where 0 < θ < 1<br />
is an unknown parameter.<br />
(a) Show that the MLE of θ is ˆ θ = (2X + Y )/(2n).<br />
(b) Consider the test statistic T = 2X + Y. Given that T has a binomial dis-<br />
tribution with parameters 2n and θ, obtain a critical region of approximate<br />
size α based on T for testing the null hypothesis that θ = θ0 against the<br />
alternative that θ = θ1, where θ1 < θ0 and 0 < α < 1.<br />
(c) Show that the test in part (b) is the most powerful of size α.<br />
(d) Deduce approximately how large n must be to ensure that the power is at<br />
least 0.9 when α = 0.05, θ0 = 0.4 and θ1 = 0.3.<br />
2. Suppose that the household incomes in a certain country have the Pareto distribution with probability density function

    f(x) = θv^θ / x^{θ+1},    v ≤ x < ∞,

where θ > 0 is unknown and v > 0 is known. Let x1, x2, . . . , xn denote the incomes for a random sample of n such households. It is required to test the null hypothesis θ = 1 against the alternative that θ ≠ 1.

(a) Derive an expression for θ̂, the MLE of θ.

(b) Show that the generalised likelihood ratio test statistic, λ(x), satisfies

    ln{λ(x)} = n − n ln(θ̂) − n/θ̂.

(c) Show that the test accepts the null hypothesis if

    k1 < Σ_{i=1}^n ln(xi) < k2,

and state how the values of k1 and k2 may be determined.
3. Explain what is meant by the power of a test and describe how the power may<br />
be used to determine the most appropriate size of a sample.<br />
Let X1, X2, . . . , Xn be a random sample from the Weibull distribution with prob-<br />
ability density function f(x) = θλx λ−1 exp(−θx λ ), for x > 0 where θ > 0 is<br />
unknown and λ > 0 is known.<br />
(a) Find the form of the most powerful test of the null hypothesis that θ = θ0<br />
against the alternative hypothesis that θ = θ1, where θ0 > θ1.<br />
(b) Find the distribution function of X λ and deduce that this random variable<br />
has an exponential distribution.<br />
(c) Find the critical region of the most powerful test at the 1% level when<br />
n = 50, θ0 = 0.05 and θ1 = 0.025. Evaluate the power of this test.<br />
4. In a particular set of Bernoulli trials, it is widely believed that the probability of
a success is θ = 3/4. However, an alternative view is that θ = 2/3. In order to test
H0 : θ = 3/4 against H1 : θ = 2/3, n independent trials are to be observed. Let θ̂
denote the proportion of successes in these trials.
(a) Show that the likelihood ratio approach leads to a size α test in which H0
is rejected in favour of H1 when θ̂ < k for some suitable k.
(b) By applying the central limit theorem, write down the large sample distri-<br />
butions of ˆ θ when H0 is true and when H1 is true.<br />
(c) Hence find an expression for k in terms of n when α = 0.05.<br />
(d) Find n so that this test has power 0.95.<br />
5. It is believed that the number of breakages in a damaged chromosome, X, follows<br />
a truncated Poisson distribution with probability mass function<br />
P(X = k) = [e^{−λ}/(1 − e^{−λ})] λ^k / k!,   k = 1, 2, . . . ,
where λ > 0 is an unknown parameter. The frequency distribution of the number<br />
of breakages in a random sample of 33 damaged chromosomes was as follows:<br />
Breakages     1   2   3   4   5   6   7   8   9  10  11  12  13   Total
Chromosomes  11   6   4   5   0   1   0   2   1   0   1   1   1      33
(a) Assuming that the appropriate regularity conditions hold, find an equation<br />
satisfied by ˆ λ, the MLE of λ.<br />
(b) Assuming that an initial guess ˆ λ0 is available, use the Newton-Raphson<br />
method to find an iterative algorithm for computing the value of ˆ λ. There<br />
is no need to carry out any numerical calculations.<br />
(c) It is found that the observed data give the estimate λ̂ = 3.6. Using this value
for λ̂, test the null hypothesis that the number of breakages in a damaged
chromosome follows a truncated Poisson distribution. The categories 6 to
13 should be combined into a single category in the goodness-of-fit test.
The Solutions
1. The likelihood function is proportional to θ^{2x}{2θ(1 − θ)}^y(1 − θ)^{2z}, and so the
log-likelihood is ℓ(θ) = ln L(θ) = (2x + y) ln θ + (y + 2z) ln(1 − θ) for 0 < θ < 1.
Then S(θ) = dℓ/dθ = (2x + y)/θ − (y + 2z)/(1 − θ) and I(θ) = −d²ℓ/dθ² =
(2x + y)/θ² + (y + 2z)/(1 − θ)² > 0, so the solution of dℓ/dθ = 0 will be a maximum.
∴ (2x + y)/θ̂ = (y + 2z)/(1 − θ̂), i.e. θ̂ = (2x + y)/(2n).
Testing the simple hypothesis H0 : θ = θ0 against the alternative simple hypothesis
H1 : θ = θ1 < θ0, the critical region of size α rejects H0 if t ≤ k, where k satisfies
P(T ≤ k | θ = θ0) ≈ α;   i.e.   Σ_{t=0}^{k} \binom{2n}{t} θ0^t (1 − θ0)^{2n−t} ≈ α.   (*)
The likelihood ratio for testing H0 against H1 is
λ = L(θ0)/L(θ1) = (θ0/θ1)^{2x+y} ((1 − θ0)/(1 − θ1))^{y+2z}
  = ((1 − θ0)/(1 − θ1))^{2n} (θ0(1 − θ1)/(θ1(1 − θ0)))^{t}.
Since θ0(1 − θ1) > θ1(1 − θ0), λ increases with t. The Neyman-Pearson lemma
then shows that the most powerful test of size α rejects H0 if t ≤ k with k defined
in (*) above.
Using the central limit theorem, T ∼ N(2nθ, 2nθ[1 − θ]). Therefore
Φ([k + 1/2 − 2nθ0]/√(2nθ0(1 − θ0))) = 0.05, i.e. [k + 1/2 − 2nθ0]/√(2nθ0(1 − θ0)) = −1.645,
or [k + 0.5 − 0.8n]/√(0.48n) = −1.645. Likewise, for power 0.9,
Φ([k + 1/2 − 2nθ1]/√(2nθ1(1 − θ1))) = 0.90, so [k + 0.5 − 0.6n]/√(0.42n) = 1.282.
From these results, k + 0.5 = 0.8n − 1.645√(0.48n) = 0.6n + 1.282√(0.42n), which gives
0.2√n = 1.645√0.48 + 1.282√0.42, or n = (1.645√0.48 + 1.282√0.42)²/0.04 = 97.1,
so n = 98 since n must be an integer.
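The sample-size calculation above can be checked in a few lines of R; a minimal sketch, assuming the same normal approximation used in the solution:

    # n for power 0.9 at theta1 = 0.3 when alpha = 0.05, theta0 = 0.4 (problem 1(d))
    z1 <- qnorm(0.95)        # 1.645
    z2 <- qnorm(0.90)        # 1.282
    n  <- ((z1 * sqrt(0.48) + z2 * sqrt(0.42)) / 0.2)^2
    ceiling(n)               # 98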
2. The likelihood function is L(θ) = ∏_{i=1}^{n} θ v^θ / x_i^{θ+1} = θ^n v^{nθ} / (∏_{i=1}^{n} x_i)^{θ+1},
for v ≤ x_i < ∞ and θ > 0. Therefore ln L(θ) = ℓ(θ) = n ln θ + nθ ln v − (θ + 1) Σ ln(x_i).
Differentiating, we get the score function S(θ) = (n/θ) + n ln v − Σ ln x_i and I(θ) = n/θ² > 0 ∀ θ.
The MLE θ̂ is found by solving S(θ̂) = 0, implying (n/θ̂) = Σ ln x_i − n ln v = Σ ln(x_i/v),
so that θ̂ = n / [Σ_{i=1}^{n} ln(x_i/v)].
For the null hypothesis θ = 1, the generalised likelihood ratio is λ = L(1)/L(θ̂),
so that ln(λ(x)) = ℓ(1) − ℓ(θ̂). Thus
ln(λ(x)) = n ln v − 2 Σ_{i=1}^{n} ln(x_i) − n ln(θ̂) − nθ̂ ln v + (θ̂ + 1) Σ_{i=1}^{n} ln(x_i)
         = n ln v + (θ̂ − 1) Σ_{i=1}^{n} ln(x_i) − n ln(θ̂) − nθ̂ ln v
         = −n/θ̂ + θ̂ Σ_{i=1}^{n} ln(x_i) − n ln(θ̂) − nθ̂ ln v,   using n/θ̂ = Σ ln x_i − n ln v,
         = −n/θ̂ + n − n ln(θ̂),   again using n/θ̂ = Σ ln x_i − n ln v,
         = n(1 − ln θ̂ − 1/θ̂).
Let u = 1/θ̂. Then ln(λ(x)) = −n(u − 1 − ln u) and d(ln λ)/du = −n(1 − 1/u). Clearly
ln λ has a maximum at u = 1. The null hypothesis H0 : θ = 1 will be rejected
if λ(x) ≤ c, for some c; i.e. if u ≤ k′_1 or u ≥ k′_2. Equivalently, reject H0 if
Σ_{i=1}^{n} ln(x_i) ≤ k_1 or Σ_{i=1}^{n} ln(x_i) ≥ k_2, where k_1 = n(k′_1 + ln v) and k_2 = n(k′_2 + ln v).
For a test of size α, choose k_1, k_2 to satisfy
P(Σ_{i=1}^{n} ln(x_i) ≤ k_1 or Σ_{i=1}^{n} ln(x_i) ≥ k_2 | θ = 1) = α.
3. The power of a test is the probability of rejecting the null hypothesis expressed<br />
as a function of the parameter under investigation. If both the significance level<br />
of the test and the power required at a particular value of the parameter are<br />
specified, then a lower bound for the necessary sample size can be determined.<br />
The likelihood function is L(θ) = ∏_{i=1}^{n} θλx_i^{λ−1} e^{−θx_i^λ} = θ^n λ^n e^{−θ Σ_{i=1}^{n} x_i^λ} ∏_{i=1}^{n} x_i^{λ−1},
for θ > 0, and so the likelihood ratio for testing H0 : θ = θ0 against H1 : θ = θ1
(where θ0 > θ1) is λ = L(θ0)/L(θ1) = (θ0/θ1)^n exp{−(θ0 − θ1) Σ_{i=1}^{n} x_i^λ}. By the Neyman-
Pearson lemma, the most powerful test has critical region C = {x : Σ_{i=1}^{n} x_i^λ ≥ k},
where k is chosen to give significance level α for the test.
P(X > x) = ∫_{x}^{∞} θλt^{λ−1} e^{−θt^λ} dt = [−e^{−θt^λ}]_{x}^{∞} = e^{−θx^λ}, for x > 0. Hence
P(X^λ > x) = P(X > x^{1/λ}) = e^{−θx}. So P(X^λ ≤ x) = 1 − e^{−θx}, hence X^λ ∼ Exp(θ).
If Y1, Y2, . . . , Ym is a random sample from an exponential distribution with mean
v^{−1}, then 2v Σ_{i=1}^{m} Y_i has a χ²_{2m} distribution. Applying this result, 2θ Σ_{i=1}^{n} X_i^λ ∼ χ²_{2n}.
Under H0 : θ = 0.05, we have 0.1 Σ_{i=1}^{50} X_i^λ ∼ χ²_{100} with 1% point 135.81. You
can check this using statistical tables or the R command qchisq(p=0.99,df=100).
Thus P(0.1 Σ_{i=1}^{50} X_i^λ ≥ 135.81 | θ = 0.05) = 0.01, and the test therefore rejects H0 if
Σ_{i=1}^{50} X_i^λ ≥ 1358.1. Under H1 : θ = 0.025, we have 0.05 Σ_{i=1}^{50} X_i^λ ∼ χ²_{100}, and so the
power of the test is P(Σ_{i=1}^{50} X_i^λ ≥ 1358.1 | θ = 0.025) = P(0.05 Σ_{i=1}^{50} X_i^λ ≥ 67.905) =
P(χ²_{100} ≥ 67.905) ≈ 0.994, using pchisq(q=67.905,df=100,lower.tail=FALSE).
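Both numbers can be reproduced directly; a minimal R sketch of part (c), using only the chi-squared result derived above:

    # critical region and power for problem 3(c): n = 50, theta0 = 0.05, theta1 = 0.025
    crit  <- qchisq(0.99, df = 100) / (2 * 0.05)                  # sum of X_i^lambda must exceed 1358.1
    power <- pchisq(2 * 0.025 * crit, df = 100, lower.tail = FALSE)
    c(crit, power)                                                # approximately 1358.1 and 0.994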
4. The likelihood for a sample (x1, x2, . . . , xn) is L(θ) = const × θ^{Σ x_i} (1 − θ)^{n − Σ x_i},
and so the likelihood ratio is
λ = L(2/3)/L(3/4) = [(2/3)^{Σ x_i} (1/3)^{n − Σ x_i}] / [(3/4)^{Σ x_i} (1/4)^{n − Σ x_i}] = (8/9)^{Σ x_i} (4/3)^{n − Σ x_i}.
Using the Neyman-Pearson lemma, we reject H0 when λ > c, where c is chosen
to give the required level of test, α. Now, λ is a decreasing function of Σ x_i,
hence of θ̂, and an equivalent rule is therefore to reject H0 when θ̂ < k, where k
is chosen to give significance level α for the test.
nθ̂ is binomial with parameters (n, θ). Hence the large-sample distribution of θ̂ is
N(θ, θ(1 − θ)/n). When θ = 3/4 this is N(3/4, 3/(16n)). When θ = 2/3 it is N(2/3, 2/(9n)).
For α = 0.05, choose k such that P(θ̂ < k | θ = 3/4) = 0.05. That is, we want
Φ((k − 3/4)/√(3/(16n))) = 0.05, or (k − 3/4)/√(3/(16n)) = −1.6449, giving
k = 3/4 − (1.6449/4)√(3/n).
For power 0.95, P(θ̂ < k | θ = 2/3) = 0.95, i.e. Φ((k − 2/3)/√(2/(9n))) = 0.95, or
(k − 2/3)/√(2/(9n)) = 1.6449, giving k = 2/3 + (1.6449/3)√(2/n). Using this expression
for k together with the expression for k derived in (c) means that we require
3/4 − (1.6449/4)√(3/n) = 2/3 + (1.6449/3)√(2/n), or 1/12 = (1.6449/√n)(√3/4 + √2/3).
Thus we get √n = 12 × 1.6449 × 0.9044 = 17.8521 and
n = 318.7, so we take n = 319.
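A quick numerical check of this final step, assuming the same normal approximations:

    # sample size for problem 4(d): alpha = 0.05 and power 0.95
    z      <- qnorm(0.95)                          # 1.6449
    sqrt_n <- 12 * z * (sqrt(3)/4 + sqrt(2)/3)     # 17.85
    ceiling(sqrt_n^2)                              # 319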
5. Let X_i denote the number of breakages in the ith chromosome. Then the likelihood
function is
L(λ) = ∏_{i=1}^{33} [e^{−λ}/(1 − e^{−λ})] λ^{x_i}/x_i! = [e^{−33λ}/(1 − e^{−λ})^{33}] λ^{Σ x_i} / ∏_{i=1}^{33} x_i!,   for λ > 0.
Given that Σ x_i = 11 × 1 + 6 × 2 + · · · + 1 × 13 = 122, the log-likelihood function
equals ln[L(λ)] = ℓ(λ) = −33λ − 33 ln(1 − e^{−λ}) + 122 ln λ − Σ_{i=1}^{33} ln(x_i!). So
S(λ) = dℓ/dλ = −33 − 33e^{−λ}/(1 − e^{−λ}) + 122/λ = −33 − 33/(e^λ − 1) + 122/λ.
With the usual regularity condition, the MLE λ̂ satisfies S(λ̂) = 0, i.e.
−33 − 33/(e^{λ̂} − 1) + 122/λ̂ = 0. The information function is
I(λ) = −d²ℓ/dλ² = 122/λ² − 33e^λ/(e^λ − 1)². An iterative algorithm for finding λ̂
numerically is given by λ_{j+1} = λ_j + S(λ_j)/I(λ_j). An initial estimate λ_0 could be
found by plotting ℓ(λ) against λ. Alternatively, it is often satisfactory to use the
estimator for a non-truncated Poisson, which here would be λ_0 = 122/33 = 3.70.
Using the value λ̂ = 3.6, P(X = k) = [e^{−3.6}/(1 − e^{−3.6})] 3.6^k/k!. Calculating
P(X = 1) = 3.6e^{−3.6}/(1 − e^{−3.6}) = 0.0984/0.9727 = 0.1011, it follows (recursively) that
P(X = 2) = (3.6/2) P(X = 1) = 0.1820. Similarly P(X = 3) = 0.2184, P(X = 4) = 0.1966,
P(X = 5) = 0.1415. Hence P(X ≥ 6) = 0.1603. [These probabilities are accurate to four
decimal places, but there is slight rounding error in the expected frequencies below.]
k 1 2 3 4 5 ≥ 6 TOTAL<br />
observed 11 6 4 5 0 7 33<br />
expected 3.34 6.01 7.21 6.49 4.67 5.28 33.00<br />
Comparing the observed and expected frequencies, the χ 2 test will have 4 degrees<br />
of freedom since λ had to be estimated. The test statistic is<br />
X² = (11 − 3.34)²/3.34 + (6 − 6.01)²/6.01 + · · · + (7 − 5.28)²/5.28 = 24.57.
This is very highly statistically significant as an observation from χ²_4, i.e. there is
very strong evidence against the null hypothesis of a truncated Poisson distribution
(the R command qchisq(p=0.95,df=4) returned the critical value 9.488).
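The whole of parts (b) and (c) can be reproduced in R; a minimal sketch, assuming the frequency data tabulated in the problem:

    # problem 5: Newton-Raphson for the truncated-Poisson MLE, then the goodness-of-fit test
    obs   <- c(11, 6, 4, 5, 0, 1, 0, 2, 1, 0, 1, 1, 1)    # counts for k = 1, ..., 13
    k     <- 1:13
    n     <- sum(obs)                                     # 33
    sumx  <- sum(k * obs)                                 # 122
    score <- function(l) -n - n / (exp(l) - 1) + sumx / l
    info  <- function(l) sumx / l^2 - n * exp(l) / (exp(l) - 1)^2
    lambda <- sumx / n                                    # start at the untruncated estimate 3.70
    for (j in 1:20) lambda <- lambda + score(lambda) / info(lambda)
    lambda                                                # approximately 3.6
    # goodness of fit with categories 1, ..., 5 and ">= 6" (4 degrees of freedom)
    p  <- dpois(1:5, 3.6) / (1 - exp(-3.6))
    p  <- c(p, 1 - sum(p))
    e  <- n * p
    o  <- c(obs[1:5], sum(obs[6:13]))
    X2 <- sum((o - e)^2 / e)                              # about 24.57
    X2 > qchisq(0.95, df = 4)                             # TRUE: reject the truncated Poisson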
Student Problems
1. Suppose that we have data x1, x2, . . . , xn which are iid observations from a N(µ, 1)<br />
density where µ is unknown. Consider testing H0 : µ = 0 using the test statistic<br />
T = |¯x|. Suppose we reject H0 if the p-value turns out to be less than α = 0.05.<br />
Show that the test based on T rejects H0 if |x̄| > 1.96/√n. Calculate the power of
the test for n = 25 and µ = −3, −2, −1, +1, +2, +3. How large would n have to<br />
be in order that the power of the test was equal to 0.95 for µ = +1 ?<br />
2. (a) Let X be a random variable with density given by f(x; θ) = θx θ−1 for 0 ≤<br />
x ≤ 1. Prove that the random variable Y = −2θ log[X] has an exponential<br />
density with mean 2.<br />
(b) Let x1, x2, . . . , xn be iid with density f(x; θ). Consider testing H0 : θ = 1.<br />
Derive (i) the likelihood ratio test statistic, (ii) the maximum likelihood<br />
test statistic, and (iii) the score test statistic. Using the result in (a) explain<br />
how you would calculate exact p-values associated with each of the 3 test<br />
statistics. HINT : The sum of n independent random variables each having<br />
an exponential density with mean 2 has a χ 2 2n density.<br />
(c) Suppose n = 10 and the 10 observed values are 0.17,0.21,0.28,0.30,0.35,0.40,<br />
0.45, 0.58, 0.65, 0.70. Calculate the p-value associated with each of the test<br />
statistics derived in (b).<br />
3. Let x1, x2, . . . , xn be iid with density f_X(x; θ) = (1/θ) exp[−x/θ] for 0 ≤ x < ∞. Let
y1, y2, . . . , ym be iid with density f_Y(y; θ, λ) = (λ/θ) exp[−λy/θ] for 0 ≤ y < ∞.
(a) Consider testing H0 : λ = 1. Derive (i) the likelihood ratio test statistic,<br />
(ii) the maximum likelihood test statistic, and (iii) the score test statistic.<br />
Explain how you would calculate exact p-values associated with each of the<br />
3 test statistics. Hint : Recall the relationship between an F density and<br />
the ratio of two chi-squareds.<br />
(b) Suppose that n = 10 and the average of the 10 x values is 24.0. Suppose<br />
that m = 40 and the average of the 40 y values is 12.0. Calculate the p-value<br />
associated with each of the test statistics derived in (a).<br />
4. Let x1, x2, . . . , xn be a random sample from a N[µ1, v] density and let y1, y2, . . . , ym<br />
be a random sample from a N[µ2, v] density where µ1, µ2, and v are unknown.<br />
Consider testing H0 : µ1 = µ2. Derive the likelihood ratio test and show that it is
equivalent to rejecting H0 for large values of |W|, where W = (x̄ − ȳ)/√(s²(1/n + 1/m))
and s² = (Σ_{i=1}^{n}(x_i − x̄)² + Σ_{i=1}^{m}(y_i − ȳ)²)/(n + m − 2). What is the distribution
of W ? Suppose n = 21 and m = 11 and you observe W = −2.32. What is the<br />
p-value ?<br />
5. Let x1, x2, . . . , xn be iid observations from a normal density with mean θ and<br />
variance σ 2 . Let y1, y2, . . . , ym be iid observations from a normal density with<br />
mean θ and variance kσ 2 . Assume that both k and σ 2 are known but that θ is<br />
unknown.<br />
(a) Consider testing the null hypothesis H0 : θ = 0. Derive the test statistics<br />
used in the likelihood ratio test, the maximum likelihood test, and the score<br />
test. Show that all three test statistics are increasing functions of U where<br />
U = [k Σ_{i=1}^{n} x_i + Σ_{j=1}^{m} y_j]² / [k(nk + m)σ²]
and explain how this fact may be used to enable exact calculation of p-values.<br />
(b) Consider a situation in which k = 2 and σ 2 = 10. Suppose that n = 20 and<br />
the sum of the numbers x1, x2, . . . , x20 is 28.0. Suppose that m = 40 and<br />
the sum of the numbers y1, y2, . . . , y40 is 48.0. Calculate the p-value for each<br />
of the tests derived in (a) and comment on the results.<br />
Appendix A<br />
Review of Probability<br />
A.1 Expectation and Variance<br />
The expected value E[Y] of a random variable Y is defined as
E[Y] = Σ_{i=0}^{∞} y_i P(y_i),
if Y is discrete, and
E[Y] = ∫_{−∞}^{∞} y f(y) dy,
if Y is continuous, where f(y) is the probability density function. The variance Var[Y]
of a random variable Y is defined as
Var[Y] = E[(Y − E[Y])²],
or
Var[Y] = Σ_{i=0}^{∞} (y_i − E[Y])² P(y_i),
if Y is discrete, and
Var[Y] = ∫_{−∞}^{∞} (y − E[Y])² f(y) dy,
if Y is continuous. When there is no ambiguity we often write EY for E[Y], and VarY
for Var[Y]. A function of a random variable is itself a random variable. If h(Y) is a
function of the random variable Y, then the expected value of h(Y) is given by
E[h(Y)] = Σ_{i=0}^{∞} h(y_i) P(y_i),
if Y is discrete, and if Y is continuous
E[h(Y)] = ∫_{−∞}^{∞} h(y) f(y) dy.
It is relatively straightforward to derive the following results for the expectation and
variance of a linear function of Y:
E[aY + b] = aE[Y] + b,
Var[aY + b] = a² Var[Y],
where a and b are constants. Also
Var[Y] = E[Y²] − (E[Y])².   (A.1)
For expectations, it can be shown more generally that
E[Σ_{i=1}^{k} a_i h_i(Y)] = Σ_{i=1}^{k} a_i E[h_i(Y)],
where a_i, i = 1, 2, . . . , k are constants and h_i(Y), i = 1, 2, . . . , k are functions of the
random variable Y.
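These identities are easy to confirm numerically; a minimal R sketch, using a simulated sample and the illustrative constants a = 2, b = 5 (both chosen purely for the example):

    # check E[aY + b] = aE[Y] + b and Var[aY + b] = a^2 Var[Y] by simulation
    set.seed(1)
    y <- rpois(1e5, lambda = 4)
    a <- 2; b <- 5
    c(mean(a * y + b), a * mean(y) + b)   # both close to 13
    c(var(a * y + b), a^2 * var(y))       # both close to 16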
A.2 Discrete Random Variables<br />
A.2.1 Bernoulli Distribution<br />
A Bernoulli trial is a probabilistic experiment which can have one of two outcomes,<br />
success (Y = 1) or failure (Y = 0) and in which the probability of success is θ. We refer<br />
to θ as the Bernoulli probability parameter. The value of the random variable Y is<br />
used as an indicator of the outcome, which may also be interpreted as the presence or<br />
absence of a particular characteristic. A Bernoulli random variable Y has probability<br />
mass function<br />
P (Y = y|θ) = θ y (1 − θ) 1−y , (A.2)<br />
for y = 0, 1 and some θ ∈ (0, 1). The notation Y ∼ Ber(θ) should be read as “the<br />
random variable Y follows a Bernoulli distribution with parameter θ.” A Bernoulli<br />
random variable Y has expected value E[Y ] = 0 × P (Y = 0) + 1 × P (Y = 1) =<br />
0×(1−θ)+1×θ = θ, and variance Var[Y ] = (0−θ) 2 ×(1−θ)+(1−θ) 2 ×θ = θ(1−θ).<br />
A.2.2 Binomial Distribution<br />
Consider independent repetitions of Bernoulli experiments, each with a probability of<br />
success θ. Next consider the random variable Y, defined as the number of successes in<br />
a fixed number of independent Bernoulli trials, n. That is,<br />
Y = Σ_{i=1}^{n} X_i,
where Xi ∼ Bernoulli(θ) for i = 1, . . . , n. Each sequence of length n containing y “ones”<br />
and (n−y) “zeros” occurs with probability θ y (1−θ) n−y . The number of sequences with<br />
y successes, and consequently (n − y) failures, is
n!/(y!(n − y)!) = \binom{n}{y}.
The random variable Y can take on values y = 0, 1, 2, . . . , n with probabilities
P(Y = y|n, θ) = \binom{n}{y} θ^y (1 − θ)^{n−y}.   (A.3)
The notation Y ∼ Bin(n, θ) should be read as “the random variable Y follows a
binomial distribution with parameters n and θ.” Finally, using the fact that Y is the
sum of n independent Bernoulli random variables, we can calculate the expected value
as E[Y] = E[Σ X_i] = Σ E[X_i] = Σ θ = nθ and the variance as
Var[Y] = Var[Σ X_i] = Σ Var[X_i] = Σ θ(1 − θ) = nθ(1 − θ).
A.2.3 Geometric Distribution<br />
Instead of fixing the number of trials, suppose now that the number of successes, r,<br />
is fixed, and that the sample size required in order to reach this fixed number is the<br />
random variable N. This is sometimes called inverse sampling. In the case of r = 1,
using the independence argument again, this leads to
P(N = n|θ) = θ(1 − θ)^{n−1},   (A.4)
for n = 1, 2, . . . , which is the geometric probability function with parameter θ. The
distribution is so named as successive probabilities form a geometric series. The notation
N ∼ Geo(θ) should be read as “the random variable N follows a geometric distribution
with parameter θ.” Write (1 − θ) = q. Then
E[N] = Σ_{n=1}^{∞} n q^{n−1} θ = θ Σ_{n=0}^{∞} d(q^n)/dq = θ d/dq (Σ_{n=0}^{∞} q^n)
     = θ d/dq (1/(1 − q)) = θ/(1 − q)² = 1/θ.
Also,
E[N²] = Σ_{n=1}^{∞} n² q^{n−1} θ = θ Σ_{n=1}^{∞} d(n q^n)/dq = θ d/dq (Σ_{n=1}^{∞} n q^n)
      = θ d/dq (q(1 − q)^{−2}) = θ (1/θ² + 2(1 − θ)/θ³) = 2/θ² − 1/θ.
Using Var[N] = E[N 2 ] − (E[N]) 2 , we get Var[N] = (1 − θ)/θ 2 .<br />
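A minimal numerical check of these two formulas, summing the geometric series directly in R (θ = 0.3 is an arbitrary illustrative value):

    # E[N] = 1/theta and Var[N] = (1 - theta)/theta^2 for the geometric distribution
    theta <- 0.3
    n <- 1:10000
    p <- theta * (1 - theta)^(n - 1)
    c(sum(n * p), 1 / theta)                                   # both 3.333...
    c(sum(n^2 * p) - sum(n * p)^2, (1 - theta) / theta^2)      # both 7.777...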
A.2.4 Negative Binomial Distribution<br />
Consider the sampling scheme described in the previous section, except now sampling<br />
continues until a total of r successes are observed. Again, let the random variable<br />
N denote the number of trials required. If the rth success occurs on the nth trial, then
this implies that a total of (r − 1) successes are observed by the (n − 1)th trial. The<br />
probability of this happening can be calculated using the binomial distribution as<br />
\binom{n − 1}{r − 1} θ^{r−1} (1 − θ)^{n−r}.
The probability that the nth trial is a success is θ. As these two events are independent<br />
we have that<br />
P(N = n|r, θ) = \binom{n − 1}{r − 1} θ^r (1 − θ)^{n−r}   (A.5)
for n = r, r + 1, . . . . The notation N ∼ NegBin(r, θ) should be read as “the random<br />
variable N follows a negative binomial distribution with parameters r and θ.” This is<br />
also known as the Pascal distribution.<br />
E[N^k] = Σ_{n=r}^{∞} n^k \binom{n − 1}{r − 1} θ^r (1 − θ)^{n−r}
       = (r/θ) Σ_{n=r}^{∞} n^{k−1} \binom{n}{r} θ^{r+1} (1 − θ)^{n−r},   since n \binom{n − 1}{r − 1} = r \binom{n}{r},
       = (r/θ) Σ_{m=r+1}^{∞} (m − 1)^{k−1} \binom{m − 1}{r} θ^{r+1} (1 − θ)^{m−(r+1)}
       = (r/θ) E[(X − 1)^{k−1}],
where X ∼ Negative binomial(r + 1, θ). Setting k = 1 we get E(N) = r/θ. Setting
k = 2 gives
E[N²] = (r/θ) E(X − 1) = (r/θ)((r + 1)/θ − 1).
Therefore Var[N] = r(1 − θ)/θ².
A.2.5 Hypergeometric Distribution<br />
The hypergeometric distribution is used to describe sampling without replacement.<br />
Consider an urn containing b balls, of which w are white and b − w are red. We intend<br />
to draw a sample of size n from the urn. Let Y denote the number of white balls<br />
selected. Then, for y = 0, 1, 2, . . . , n we have<br />
P(Y = y|b, w, n) = \binom{w}{y} \binom{b − w}{n − y} / \binom{b}{n}.   (A.6)
The jth moment of a hypergeometric random variable is
E[Y^j] = Σ_{y=0}^{n} y^j P(Y = y) = Σ_{y=1}^{n} y^j \binom{w}{y} \binom{b − w}{n − y} / \binom{b}{n}.
The identities y \binom{w}{y} = w \binom{w − 1}{y − 1} and n \binom{b}{n} = b \binom{b − 1}{n − 1} can be used to obtain
E[Y^j] = (nw/b) Σ_{y=1}^{n} y^{j−1} \binom{w − 1}{y − 1} \binom{b − w}{n − y} / \binom{b − 1}{n − 1}
       = (nw/b) Σ_{x=0}^{n−1} (x + 1)^{j−1} \binom{w − 1}{x} \binom{b − w}{n − 1 − x} / \binom{b − 1}{n − 1}
       = (nw/b) E[(X + 1)^{j−1}],
where X is a hypergeometric random variable with parameters n − 1, b − 1, w − 1. It is
easy to establish that E[Y] = nθ and Var[Y] = ((b − n)/(b − 1)) nθ(1 − θ), where θ = w/b denotes
the fraction of white balls in the population.
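These moment formulas can be checked against R's built-in hypergeometric pmf; a minimal sketch with arbitrary illustrative values b = 20, w = 8, n = 6 (dhyper takes the numbers of white and red balls and the sample size):

    # check E[Y] = n*theta and Var[Y] = ((b - n)/(b - 1)) * n * theta * (1 - theta)
    b <- 20; w <- 8; n <- 6
    y <- 0:n
    p <- dhyper(y, w, b - w, n)
    theta <- w / b
    c(sum(y * p), n * theta)                                                       # mean
    c(sum(y^2 * p) - sum(y * p)^2, (b - n) / (b - 1) * n * theta * (1 - theta))    # variance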
A.2.6 Poisson Distribution<br />
Certain problems involve counting the number of events that have occurred in a fixed<br />
time period. A random variable Y, taking on one of the values 0, 1, 2, . . . , is said to be<br />
a Poisson random variable with parameter θ if for some θ > 0,<br />
P(Y = y|θ) = (θ^y / y!) e^{−θ},   y = 0, 1, 2, . . .   (A.7)
The notation Y ∼ Pois(θ) should be read as “the random variable Y follows a Poisson<br />
distribution with parameter θ.” Equation (A.7) defines a probability mass function,<br />
since
Σ_{y=0}^{∞} (θ^y/y!) e^{−θ} = e^{−θ} Σ_{y=0}^{∞} θ^y/y! = e^{−θ} e^{θ} = 1.
The expected value of a Poisson random variable is
E[Y] = Σ_{y=0}^{∞} y e^{−θ} θ^y/y! = θ e^{−θ} Σ_{y=1}^{∞} θ^{y−1}/(y − 1)! = θ e^{−θ} Σ_{j=0}^{∞} θ^j/j! = θ.
To get the variance we first compute E[Y²]:
E[Y²] = Σ_{y=0}^{∞} y² e^{−θ} θ^y/y! = θ Σ_{y=1}^{∞} y e^{−θ} θ^{y−1}/(y − 1)! = θ Σ_{j=0}^{∞} (j + 1) e^{−θ} θ^j/j! = θ(θ + 1).
Since we already have E[Y] = θ, we obtain Var[Y] = E[Y²] − (E[Y])² = θ.
Suppose that Y ∼ Binomial(n, π), and let θ = nπ. Then<br />
P(Y = y|n, π) = \binom{n}{y} π^y (1 − π)^{n−y}
             = \binom{n}{y} (θ/n)^y (1 − θ/n)^{n−y}
             = [n(n − 1) · · · (n − y + 1)/n^y] (θ^y/y!) (1 − θ/n)^n / (1 − θ/n)^y.
For n large and θ “moderate”, we have that
(1 − θ/n)^n ≈ e^{−θ},   n(n − 1) · · · (n − y + 1)/n^y ≈ 1,   (1 − θ/n)^y ≈ 1.
Our result is that a binomial random variable Bin(n, π) is well approximated by a
Poisson random variable Pois(θ = n × π) when n is large and π is small. That is,
P(Y = y|n, π) ≈ e^{−nπ} (nπ)^y / y!.
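The quality of this approximation is easy to see in R; a minimal sketch with the illustrative values n = 200 and π = 0.02 (so θ = 4):

    # compare binomial probabilities with the Poisson approximation
    n    <- 200; prob <- 0.02
    y    <- 0:15
    round(cbind(binomial = dbinom(y, n, prob), poisson = dpois(y, n * prob)), 4)
    max(abs(dbinom(y, n, prob) - dpois(y, n * prob)))   # small: the approximation is close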
A.2.7 Discrete Uniform Distribution<br />
The discrete uniform distribution with integer parameter N has a random variable Y
that can take the values y = 1, 2, . . . , N with equal probability 1/N. It is easy to show
that the mean and variance of Y are E[Y] = (N + 1)/2 and Var[Y] = (N² − 1)/12.
A.2.8 The Multinomial Distribution<br />
Suppose that we perform n independent and identical experiments, where each experiment
can result in any one of r possible outcomes, with respective probabilities
p_1, p_2, . . . , p_r, Σ_{i=1}^{r} p_i = 1. If we denote by Y_i the number of the n experiments that
result in outcome number i, then
P(Y_1 = n_1, Y_2 = n_2, . . . , Y_r = n_r) = [n!/(n_1! n_2! · · · n_r!)] p_1^{n_1} p_2^{n_2} · · · p_r^{n_r},   (A.8)
where Σ_{i=1}^{r} n_i = n. Equation (A.8) is justified by noting that any sequence of outcomes
that leads to outcome i occurring n_i times for i = 1, 2, . . . , r will, by the assumption
of independence of experiments, have probability p_1^{n_1} p_2^{n_2} · · · p_r^{n_r} of occurring. As there
are n!/(n_1! n_2! · · · n_r!) such sequences of outcomes, equation (A.8) is established.
A.3 Continuous Random Variables<br />
A.3.1 Uniform Distribution<br />
A random variable Y is said to be uniformly distributed over the interval (a, b) if its<br />
probability density function is given by<br />
f(y|a, b) = 1/(b − a),   if a < y < b,
and equals 0 for all other values of y. Since F(u) = ∫_{−∞}^{u} f(y) dy, the distribution
function of a uniform random variable on the interval (a, b) is
F(u) = 0 for u ≤ a,   (u − a)/(b − a) for a < u < b,   1 for u ≥ b.
The expected value of a uniform random variable turns out to be the mid-point of the
interval, that is
E[Y] = ∫_{−∞}^{∞} y f(y) dy = ∫_{a}^{b} y/(b − a) dy = (b² − a²)/(2(b − a)) = (b + a)/2.
The second moment is calculated as
E[Y²] = ∫_{a}^{b} y²/(b − a) dy = (b³ − a³)/(3(b − a)) = (1/3)(b² + ab + a²),
hence the variance is
Var[Y] = E[Y²] − (E[Y])² = (1/12)(b − a)².
The notation Y ∼ U(a, b) should be read as “the random variable Y follows a uniform<br />
distribution on the interval (a, b)”.<br />
A.3.2 Exponential Distribution<br />
A random variable Y is said to be an exponential random variable if its probability<br />
density function is given by<br />
f(y|θ) = θe −θy , 0 ≤ y < ∞, θ > 0.<br />
The cumulative distribution of an exponential random variable is given by<br />
F(a) = ∫_{0}^{a} θ e^{−θy} dy = [−e^{−θy}]_{0}^{a} = 1 − e^{−θa},   a ≥ 0.
The expected value
E[Y] = ∫_{0}^{∞} y θ e^{−θy} dy
requires integration by parts, yielding
E[Y] = [−y e^{−θy}]_{0}^{∞} + ∫_{0}^{∞} e^{−θy} dy = 0 − [e^{−θy}/θ]_{0}^{∞} = 1/θ.
Integration by parts can be used to verify that E[Y²] = 2θ^{−2}. Hence
Var[Y] = 1/θ².
The notation Y ∼ Exp(θ) should be read as “the random variable Y follows an exponential
distribution with parameter θ”.
A.3.3 Gamma Distribution<br />
A random variable Y is said to have a gamma distribution if its density function is<br />
given by<br />
f(y|ω, θ) = θ e^{−θy} (θy)^{ω−1} / Γ(ω),   0 ≤ y < ∞,   ω, θ > 0,
where Γ(ω) is called the gamma function and is defined by
Γ(ω) = ∫_{0}^{∞} e^{−u} u^{ω−1} du.
Integration by parts of Γ(ω) yields the recursive relationship
Γ(ω) = [−e^{−u} u^{ω−1}]_{0}^{∞} + ∫_{0}^{∞} e^{−u} (ω − 1) u^{ω−2} du = (ω − 1) ∫_{0}^{∞} e^{−u} u^{ω−2} du = (ω − 1)Γ(ω − 1).
For integer values of ω, say ω = n, this recursive relationship gives Γ(n) = (n − 1)!.
Note, by setting ω = 1 the gamma distribution reduces to an exponential distribution.
The expected value of a gamma random variable is given by
E[Y] = (θ^ω/Γ(ω)) ∫_{0}^{∞} y^ω e^{−θy} dy = (θ^ω/(Γ(ω)θ^{ω+1})) ∫_{0}^{∞} u^ω e^{−u} du = Γ(ω + 1)/(θΓ(ω)) = ω/θ,
after the change of variable u = θy. Using the same substitution,
E[Y²] = (θ^ω/Γ(ω)) ∫_{0}^{∞} y^{ω+1} e^{−θy} dy = (ω + 1)ω/θ²,
so that
Var[Y] = ω/θ².
The notation Y ∼ Gamma(ω, θ) should be read as “the random variable Y follows a
gamma distribution with parameters ω and θ”.<br />
A.3.4 Gaussian Distribution<br />
A random variable Y is a normal (or Gaussian) random variable, or simply normally<br />
distributed, with parameters µ and σ 2 if the density of Y is specified by<br />
f(y|µ, σ²) = (1/(√(2π)σ)) e^{−(y−µ)²/(2σ²)},   (A.9)
for −∞ < y < ∞ with −∞ < µ < ∞ and σ > 0. It is not immediately obvious that<br />
(A.9) specifies a probability density. To show that this is the case we need to prove
∫_{−∞}^{∞} (1/(√(2π)σ)) e^{−(y−µ)²/(2σ²)} dy = 1.
Substituting z = (y − µ)/σ, we need to show that I = ∫_{−∞}^{∞} e^{−z²/2} dz = √(2π). This is a
“classic” result and so is well worth confirming. Form
I² = ∫_{−∞}^{∞} e^{−z²/2} dz ∫_{−∞}^{∞} e^{−w²/2} dw = ∫_{−∞}^{∞} ∫_{−∞}^{∞} e^{−(z² + w²)/2} dz dw.
The double integral can be evaluated by a change of variables to polar coordinates.
Substituting z = r cos θ, w = r sin θ, and dz dw = r dθ dr, we get
I² = ∫_{0}^{∞} ∫_{0}^{2π} e^{−r²/2} r dθ dr = 2π ∫_{0}^{∞} r e^{−r²/2} dr = [−2π e^{−r²/2}]_{0}^{∞} = 2π.
Taking the square root we get I = √(2π). This result can also be used to
establish that Γ(1/2) = π^{1/2}. To prove that this is the case, note that
Γ(1/2) = ∫_{0}^{∞} e^{−u} u^{−1/2} du = 2 ∫_{0}^{∞} e^{−z²} dz = √π.
The expected value of Y equals
E[Y] = (1/(√(2π)σ)) ∫_{−∞}^{∞} y e^{−(y−µ)²/(2σ²)} dy.
Writing y as (y − µ) + µ and substituting z = (y − µ)/σ yields
E[Y] = (σ/√(2π)) ∫_{−∞}^{∞} z e^{−z²/2} dz + µ ∫_{−∞}^{∞} f(y) dy,
where f(y) represents the normal density from equation (A.9). Using the argument of
symmetry, the first integral must be 0. Then
E[Y] = µ ∫_{−∞}^{∞} f(y) dy = µ.
Since E[Y] = µ, we have that
Var[Y] = (1/(√(2π)σ)) ∫_{−∞}^{∞} (y − µ)² e^{−(y−µ)²/(2σ²)} dy.
Using the substitution z = (y − µ)/σ yields
Var[Y] = (σ²/√(2π)) ∫_{−∞}^{∞} z² e^{−z²/2} dz
       = σ² (1/√(2π)) ([−z e^{−z²/2}]_{−∞}^{∞} + ∫_{−∞}^{∞} e^{−z²/2} dz)
       = σ² (1/√(2π)) ∫_{−∞}^{∞} e^{−z²/2} dz
       = σ².
The notation Y ∼ N(µ, σ 2 ) should be read as “the random variable Y follows a<br />
normal distribution with mean parameter µ and variance parameter σ 2 .”<br />
Suppose Y ∼ N(µ, σ²) and Z = a + bY, where a and b are known constants. Then
P(a + bY ≤ c) = P(Y ≤ (c − a)/b) = F_Y((c − a)/b),
where F_Y(c) = ∫_{−∞}^{c} f_Y(y) dy is the cumulative distribution function of the random
variable Y. Then
F_Y((c − a)/b) = ∫_{−∞}^{(c−a)/b} (1/(√(2π)σ)) e^{−(y−µ)²/(2σ²)} dy
             = ∫_{−∞}^{c} (1/(√(2π) b σ)) exp{−(z − (bµ + a))²/(2b²σ²)} dz.
Since F_Z(c) = ∫_{−∞}^{c} f_Z(z) dz, it follows that the probability density function of Z is
given by
(1/(√(2π) b σ)) exp{−(z − (bµ + a))²/(2b²σ²)}.
Hence Z ∼ N(bµ + a, (bσ)²).
An important consequence of the preceding result is that if Y ∼ N(µ, σ 2 ), then<br />
Z = (Y − µ)/σ is normally distributed with mean 0 and variance 1. Such a random<br />
variable is said to have the standard normal distribution.<br />
It is tradition to denote the cumulative distribution function of a standard normal<br />
random variable by Φ(z). That is,<br />
Φ(z) = (1/√(2π)) ∫_{−∞}^{z} e^{−u²/2} du.
Values of (1 − Φ(z)) are tabulated in Table 3 of Murdoch and Barnes for z > 0.<br />
A.3.5 Weibull Distribution<br />
The Weibull distribution function has the form<br />
F(y) = 1 − exp{−(y/b)^a}   if y > 0,
and equals 0 for y ≤ 0. The Weibull density can be obtained by differentiation as
f(y) = (a/b)(y/b)^{a−1} exp{−(y/b)^a}   for y > 0,
and equals 0 for y ≤ 0. To calculate the expected value
E[Y] = ∫_{0}^{∞} y (a/b)(y/b)^{a−1} exp{−(y/b)^a} dy
we use the substitutions u = (y/b)^a and du = a b^{−a} y^{a−1} dy. These yield
E[Y] = b ∫_{0}^{∞} u^{1/a} e^{−u} du = b Γ((a + 1)/a).
It is straightforward to verify that
E[Y²] = b² Γ((a + 2)/a),   and   Var[Y] = b² (Γ((a + 2)/a) − [Γ((a + 1)/a)]²).
A.3.6 Beta Distribution<br />
A random variable is said to have a beta distribution if its density is given by
f(y) = (1/B(a, b)) y^{a−1} (1 − y)^{b−1},   0 < y < 1,
and is 0 everywhere else. Here the function
B(a, b) = ∫_{0}^{1} u^{a−1} (1 − u)^{b−1} du
is the “beta” function, and is related to the gamma function through
B(a, b) = Γ(a)Γ(b)/Γ(a + b).
Proceeding in the usual manner, we can show that
E[Y] = a/(a + b),   Var[Y] = ab/((a + b)²(a + b + 1)).
A.3.7 Chi-square Distribution<br />
Let Z ∼ N(0, 1), and let Y = Z². Then
F_Y(y) = P(Y ≤ y) = P(Z² ≤ y) = P(−√y ≤ Z ≤ √y) = F_Z(√y) − F_Z(−√y),
so that
f_Y(y) = (1/(2√y)) [f_Z(√y) + f_Z(−√y)] = (1/2) e^{−y/2} (y/2)^{−1/2} (1/√π) ≡ Gamma(1/2, 1/2).
Suppose that Y = Σ_{i=1}^{n} Z_i², where the Z_i ∼ N(0, 1) for i = 1, . . . , n and are
independent. From results on the sum of independent Gamma random variables (see
section A.4.1), Y ∼ Gamma(n/2, 1/2). This density has the form
f_Y(y) = e^{−y/2} y^{n/2−1} / (2^{n/2} Γ(n/2)),   y > 0   (A.10)
and is referred to as a chi-squared distribution on n degrees of freedom. The notation<br />
Y ∼ Chi(n) should be read as “the random variable Y follows a chi-squared distribution<br />
with n degrees of freedom”. Later we will show that if X ∼ Chi(u) and Y ∼ Chi(v), it<br />
follows that X + Y ∼ Chi(u + v).<br />
A.3.8 Distribution of a Function of a Random Variable<br />
Let Y be a continuous random variable with probability density function fY . Suppose<br />
that g(y) is a strictly monotone (increasing or decreasing) differentiable (and hence<br />
continuous) function of y. The random variable Z defined by Z = g(Y ) has probability<br />
density function given by<br />
f_Z(z) = f_Y(g^{−1}(z)) |d g^{−1}(z)/dz|   if z = g(y),   (A.11)
where g −1 (z) is defined to be the inverse function of g(y).<br />
Proof. Let g(y) be a monotone increasing function and let FY (y) and FZ(z) denote the<br />
probability distribution functions of the random variables Y and Z. Then<br />
F_Z(z) = P(Z ≤ z) = P(g(Y) ≤ z) = P(Y ≤ g^{−1}(z)) = F_Y(g^{−1}(z)).
Next, let g(y) be a monotone decreasing function. Then<br />
F_Z(z) = P(Z ≤ z) = P(g(Y) ≤ z) = P(Y ≥ g^{−1}(z)) = 1 − P(Y ≤ g^{−1}(z)) = 1 − F_Y(g^{−1}(z)).
For a monotone increasing function we have, by the chain rule,
f_Z(z) = d F_Z(z)/dz = d F_Y(g^{−1}(z))/dz = f_Y(g^{−1}(z)) dg^{−1}(z)/dz.
For a monotone decreasing function we have, by the chain rule,
f_Z(z) = d F_Z(z)/dz = −d F_Y(g^{−1}(z))/dz = f_Y(g^{−1}(z)) (−dg^{−1}(z)/dz).
Equation (A.11) covers the cases of both monotonic increasing and monotonic decreas-<br />
ing functions.<br />
Example A.1 (The Chi-Squared Distribution). Suppose Y ∼ N(0, 1) and g(y) = y 2 .<br />
Then g −1 (z) = z 1/2 and by equation (A.11) we get<br />
f_Z(z) = f_Y(z^{1/2}) |d z^{1/2}/dz| = (1/√(2π)) e^{−z/2} (1/2) z^{−1/2} ≡ Gamma(1/2, 1/2),
which was our main result from the previous section.
Example A.2 (The Log-Normal Distribution). Suppose Y ∼ N(µ, σ 2 ) and g(y) = e y .<br />
Then g−1 (z) = ln z and by equation (A.11) we get<br />
f_Z(z) = f_Y(ln z) |d(ln z)/dz| = (1/(zσ√(2π))) exp{−(ln{z/m})²/(2σ²)},
where µ = ln m.
A.4 Random Vectors<br />
A.4.1 Sums of Independent Random Variables<br />
When X and Y are discrete random variables, the condition of independence is equiva-<br />
lent to pX,Y (x, y) = pX(x)pY (y) for all x, y. In the jointly continuous case the condition<br />
of independence is equivalent to fX,Y (x, y) = fX(x)fY (y) for all x, y.<br />
Consider random variables X and Y with probability densities fX(x) and fY (y)<br />
respectively. We seek the probability density of the random variable X + Y. Our<br />
general result follows from<br />
F_{X+Y}(a) = P(X + Y ≤ a) = ∫∫_{x+y≤a} f_X(x) f_Y(y) dx dy
           = ∫_{−∞}^{∞} ∫_{−∞}^{a−y} f_X(x) f_Y(y) dx dy
           = ∫_{−∞}^{∞} (∫_{−∞}^{a−y} f_X(x) dx) f_Y(y) dy
           = ∫_{−∞}^{∞} F_X(a − y) f_Y(y) dy,
⇒ f_{X+Y}(a) = d/da ∫_{−∞}^{∞} F_X(a − y) f_Y(y) dy = ∫_{−∞}^{∞} (d/da) F_X(a − y) f_Y(y) dy
            = ∫_{−∞}^{∞} f_X(a − y) f_Y(y) dy.   (A.12)
The density function f_{X+Y} is called the convolution of the densities f_X and f_Y. If
the random variables X and Y are discrete the equivalent result is
f_{X+Y}(a) = Σ_y f_X(a − y) f_Y(y).
Result: Sum of Independent Poisson Random Variables
Suppose X ∼ Pois(θ) and Y ∼ Pois(λ). Assume that X and Y are independent. Then<br />
P(X + Y = n) = Σ_{k=0}^{n} P(X = k, Y = n − k)
             = Σ_{k=0}^{n} P(X = k) P(Y = n − k)
             = Σ_{k=0}^{n} e^{−θ} (θ^k/k!) e^{−λ} (λ^{n−k}/(n − k)!)
             = (e^{−(θ+λ)}/n!) Σ_{k=0}^{n} (n!/(k!(n − k)!)) θ^k λ^{n−k}
             = (e^{−(θ+λ)}/n!) (θ + λ)^n.
That is, X + Y ∼ Pois(θ + λ).
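A minimal numerical check of this convolution result in R, with the arbitrary illustrative parameters θ = 2 and λ = 3:

    # convolving Pois(2) and Pois(3) pmfs reproduces the Pois(5) pmf
    n <- 0:15
    conv <- sapply(n, function(k) sum(dpois(0:k, 2) * dpois(k:0, 3)))
    max(abs(conv - dpois(n, 5)))   # essentially zero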
Result: Sum of Independent Binomial Random Variables<br />
We seek the distribution of Y + X, where Y ∼ Bin(n, θ) and X ∼ Bin(m, θ). Then<br />
X + Y is modelling the situation where the total number of trials is fixed at n + m,<br />
and the probability of a success in a single trial equals θ. Without performing any<br />
calculations, we expect to find that X + Y ∼ Bin(n + m, θ). To verify that this result
is true,
P(X + Y = k) = Σ_{i=0}^{n} P(X = i, Y = k − i)
             = Σ_{i=0}^{n} P(X = i) P(Y = k − i)
             = Σ_{i=0}^{n} \binom{n}{i} θ^i (1 − θ)^{n−i} \binom{m}{k − i} θ^{k−i} (1 − θ)^{m−k+i}
             = θ^k (1 − θ)^{n+m−k} Σ_{i=0}^{n} \binom{n}{i} \binom{m}{k − i},
and the result follows by applying the combinatorial identity
\binom{n + m}{k} = Σ_{i=0}^{n} \binom{n}{i} \binom{m}{k − i}.
Result: Sum of Independent Gamma Random Variables<br />
Let X ∼ Gamma(γ, θ) and Y ∼ Gamma(ω, θ). Then<br />
f_{X+Y}(a) = (Γ(γ)Γ(ω))^{−1} ∫_{0}^{a} θ e^{−θ(a−y)} (θ(a − y))^{γ−1} θ e^{−θy} (θy)^{ω−1} dy
           = K e^{−θa} ∫_{0}^{a} (a − y)^{γ−1} y^{ω−1} dy,
where K is a constant. Let u = y/a so that dy = a du. Then
f_{X+Y}(a) = K e^{−θa} a^{γ+ω−1} ∫_{0}^{1} (1 − u)^{γ−1} u^{ω−1} du = C e^{−θa} a^{γ+ω−1},
where C is a constant not depending on a. f_{X+Y}(a) is a density function and so must
integrate to 1.
⇒ f_{X+Y}(a) = θ e^{−θa} (θa)^{γ+ω−1} / Γ(γ + ω).
But this is the pdf of a Gamma random variable distributed as Gamma(γ + ω, θ). The
result X + Y ∼ Chi(u + v) when X ∼ Chi(u) and Y ∼ Chi(v) follows as a corollary.
Result: Sum of Independent Exponential Random Variables<br />
Let Y1, . . . , Yn be n independent exponential random variables each with parameter<br />
θ. Then Z = Y1 + Y2 + · · · + Yn is a Gamma(n, θ) random variable. To see that
this is indeed the case, write Yi ∼ Exp(θ), or alternatively, Yi ∼ Gamma(1, θ). Then
Yi + Yj ∼ Gamma(2, θ), implying that
Σ_{i=1}^{n} Y_i = Z ∼ Gamma(n, θ).
Result: Sum of Independent Gaussian Random Variables<br />
Let X ∼ N(µ_X, σ²_X) and Y ∼ N(µ_Y, σ²_Y). Then
f_{X+Y}(a) = (2πσ_Xσ_Y)^{−1} ∫_{−∞}^{∞} exp{−(a − y − µ_X)²/(2σ²_X)} exp{−(y − µ_Y)²/(2σ²_Y)} dy
setting z = y − µ_Y and letting m = a − µ_X − µ_Y,
           = (2πσ_Xσ_Y)^{−1} ∫_{−∞}^{∞} exp{−(m − z)²/(2σ²_X)} exp{−z²/(2σ²_Y)} dz
           = (2πσ_Xσ_Y)^{−1} e^{−m²/(2σ²_X)} ∫_{−∞}^{∞} exp{2mz/(2σ²_X) − z²/(2σ²_X) − z²/(2σ²_Y)} dz
           = (2πσ_Xσ_Y)^{−1} e^{βm} ∫_{−∞}^{∞} e^{−(αz² + 2βz)} dz,
where α = (1/2)(1/σ²_X + 1/σ²_Y) and β = −m/(2σ²_X),
           = (2πσ_Xσ_Y)^{−1} e^{βm} e^{β²/α} ∫_{−∞}^{∞} e^{−α(z + β/α)²} dz
setting u = z + β/α,
           = (2πσ_Xσ_Y)^{−1} e^{βm} e^{β²/α} ∫_{−∞}^{∞} e^{−αu²} du = (2πσ_Xσ_Y)^{−1} e^{βm + β²/α} √(π/α),
since ∫_{−∞}^{∞} e^{−αu²} du = √(π/α). Some algebra will confirm that
βm + β²/α = −m²/(2(σ²_X + σ²_Y)),
so that
√(π/α)/(2πσ_Xσ_Y) = 1/√(2π(σ²_X + σ²_Y)),
and
f_{X+Y}(a) = (2π(σ²_X + σ²_Y))^{−1/2} exp{−(a − µ_X − µ_Y)²/(2(σ²_X + σ²_Y))},
or equivalently X + Y ∼ N(µ_X + µ_Y, σ²_X + σ²_Y).
A.4.2 Covariance and Correlation
Suppose that X and Y are real-valued random variables for some random experiment.<br />
The covariance of X and Y is defined by<br />
Cov (X, Y ) = E [(X − E(X))(Y − E[Y ])]<br />
and (assuming the variances are positive) the correlation of X and Y is defined by<br />
ρ(X, Y) ≡ Corr(X, Y) = Cov(X, Y) / √(Var[X] Var[Y]).
Note that the covariance and correlation always have the same sign (positive, nega-<br />
tive, or 0). When the sign is positive, the variables are said to be positively correlated;<br />
when the sign is negative, the variables are said to be negatively correlated; and when<br />
the sign is 0, the variables are said to be uncorrelated. For an intuitive understanding<br />
of correlation, suppose that we run the experiment a large number of times and that<br />
for each run, we plot the values (X, Y ) in a scatterplot. The scatterplot for positively<br />
correlated variables shows a linear trend with positive slope, while the scatterplot for<br />
negatively correlated variables shows a linear trend with negative slope. For uncorre-<br />
lated variables, the scatterplot should look like an amorphous blob of points with no<br />
discernible linear trend. You should satisfy yourself that the following are true<br />
1. Cov (X, Y ) = E (XY ) − E (X) E (Y )<br />
2. Cov (X, Y ) = Cov (Y, X)<br />
3. Cov (Y, Y ) = Var (Y )<br />
4. Cov (aX + bY, Z) = aCov (X, Z) + bCov (Y, Z)<br />
5. Var(Σ_{j=1}^{n} Y_j) = Σ_{i,j=1}^{n} Cov(Y_i, Y_j)
6. If X and Y are independent, then they are uncorrelated. The converse is not<br />
true however.<br />
A.4.3 The Bivariate Change of Variables Formula<br />
Suppose that (X, Y ) is a continuous random variable taking values in a subset S of R 2<br />
with probability density function f. Suppose that U and V are new random variables<br />
that are functions of X and Y :<br />
U ≡ U(X, Y ), V ≡ V (X, Y ).<br />
If these functions are “well behaved”, there is a simple way to get the joint probability<br />
density function g of (U, V ). First, we will assume that the transformation (x, y) →<br />
(u, v) is one-to-one and maps S onto a subset T of R 2 . Thus, the inverse transformation<br />
(u, v) → (x, y) is well defined and maps T onto S. We will assume that the inverse<br />
transformation is “smooth”, in the sense that the partial derivatives
∂x/∂u,   ∂x/∂v,   ∂y/∂u,   ∂y/∂v
exist on T, and the Jacobian
∂(x, y)/∂(u, v) ≡ det[ ∂x/∂u  ∂x/∂v ; ∂y/∂u  ∂y/∂v ] = (∂x/∂u)(∂y/∂v) − (∂x/∂v)(∂y/∂u)
is nonzero on T.
Now, let B be an arbitrary subset of T . The inverse transformation maps B onto a<br />
subset A of S. Therefore,<br />
P((U, V) ∈ B) = P((X, Y) ∈ A) = ∫∫_A f(x, y) dx dy.
But, by the change of variables formula for double integrals, this can be written as
P((U, V) ∈ B) = ∫∫_B f(x(u, v), y(u, v)) |∂(x, y)/∂(u, v)| du dv.
By the very meaning of density, it follows that the probability density function of (U, V)
is
g(u, v) = f(x(u, v), y(u, v)) |∂(x, y)/∂(u, v)|,   (u, v) ∈ T.
By a symmetric argument,
f(x, y) = g(u(x, y), v(x, y)) |∂(u, v)/∂(x, y)|,   (x, y) ∈ S.
The change of variables formula generalizes to R n .<br />
A.4.4 The Bivariate Normal Distribution<br />
Suppose that U and V are independent random variables each, with the standard<br />
normal distribution. We will need the following parameters:<br />
µX, µY ∈ (−∞, ∞); σX, σY ∈ (0, ∞); ρ ∈ [−1, +1].<br />
Now let X and Y be new random variables defined by<br />
X = µ_X + σ_X U,
Y = µ_Y + ρσ_Y U + σ_Y √(1 − ρ²) V.
Using basic properties of mean, variance, covariance, and the normal distribution,
satisfy yourself of the following:
1. X is normally distributed with mean µX and standard deviation σX<br />
2. Y is normally distributed with mean µY and standard deviation σY<br />
3. Corr (X, Y ) = ρ<br />
4. X and Y are independent if and only if ρ = 0.<br />
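A short simulation makes these four properties easy to verify; a minimal R sketch with arbitrary illustrative parameter values:

    # simulate (X, Y) from independent standard normals U, V via the construction above
    set.seed(42)
    u <- rnorm(1e5); v <- rnorm(1e5)
    muX <- 1; muY <- -2; sigX <- 2; sigY <- 3; rho <- 0.6
    x <- muX + sigX * u
    y <- muY + rho * sigY * u + sigY * sqrt(1 - rho^2) * v
    c(mean(x), sd(x))    # approximately 1 and 2
    c(mean(y), sd(y))    # approximately -2 and 3
    cor(x, y)            # approximately 0.6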
The inverse transformation is
u = (x − µ_X)/σ_X,
v = (y − µ_Y)/(σ_Y √(1 − ρ²)) − ρ(x − µ_X)/(σ_X √(1 − ρ²)),
so that the Jacobian of the transformation is
∂(x, y)/∂(u, v) = σ_X σ_Y √(1 − ρ²).
Since U and V are independent standard normal variables, their joint probability density
function is
g(u, v) = (1/(2π)) e^{−(u² + v²)/2},   (u, v) ∈ R².
Using the bivariate change of variables formula, the joint density of (X, Y) is
f(x, y) = (1/(2πσ_Xσ_Y√(1 − ρ²))) exp{ −(1/2) [ (x − µ_X)²/(σ²_X(1 − ρ²))
         − 2ρ(x − µ_X)(y − µ_Y)/(σ_Xσ_Y(1 − ρ²)) + (y − µ_Y)²/(σ²_Y(1 − ρ²)) ] }.
If c is a constant, the set of points (x, y) ∈ R 2 : f(x, y) = c is called a level curve of f<br />
(these are points of constant probability density).<br />
A.4.5 Bivariate Normal Conditional Distributions<br />
In the last section we derived the joint probability density function f of the bivariate<br />
normal random variables X and Y. The marginal densities are known. Then,<br />
f_{Y|X}(y|x) = f_{Y,X}(y, x)/f_X(x)
            = (1/(√(2π) σ_Y √(1 − ρ²))) exp{ −(1/2)(y − (µ_Y + ρσ_Y(x − µ_X)/σ_X))²/(σ²_Y(1 − ρ²)) }.
Then the conditional distribution of Y given X = x is also Gaussian, with
E(Y|X = x) = µ_Y + ρσ_Y (x − µ_X)/σ_X,
Var(Y|X = x) = σ²_Y (1 − ρ²).
A.4.6 The Multivariate Normal Distribution<br />
Let Σ denote the 2 × 2 symmetric matrix
Σ = [ σ²_X   σ_Xσ_Yρ ;  σ_Yσ_Xρ   σ²_Y ].
Then
det|Σ| = σ²_Xσ²_Y − (σ_Xσ_Yρ)² = σ²_Xσ²_Y(1 − ρ²)
and
Σ^{−1} = (1/(1 − ρ²)) [ 1/σ²_X   −ρ/(σ_Xσ_Y) ;  −ρ/(σ_Xσ_Y)   1/σ²_Y ].
Hence the bivariate normal distribution can be written in matrix notation as
f_{(X,Y)}(x, y) = (1/(2π√(det|Σ|))) exp{ −(1/2) (x − µ_X, y − µ_Y)^T Σ^{−1} (x − µ_X, y − µ_Y) }.
Let Y = (Y1, . . . , Yp) ′ be a random vector of length p. Let E(Yi) = µi, i = 1, . . . , p,<br />
and define the p-length vector µ = (µ1, . . . , µp) ′ . Define the p×p matrix Σ with element<br />
Cov(Yi, Yj) for i, j = 1, . . . p. Finally, denote a realization of the random vector Y by<br />
y = (y1, . . . , yp) ′ . Then, the random vector Y has a p-dimensional multivariate Gaussian<br />
distribution if its density function is specified by<br />
f_Y(y) = (1/((2π)^{p/2} |Σ|^{1/2})) exp{ −(1/2)(y − µ)′ Σ^{−1} (y − µ) }.   (A.13)
The notation Y ∼ MVNp(µ, Σ) should be read as “the random variable Y follows a<br />
multivariate Gaussian (normal) distribution with p-vector mean µ and p × p variance-<br />
covariance matrix Σ.”<br />
A.5 Generating Functions<br />
Denote the sample space of the discrete random variable Y as {0, 1, 2, . . .}. Let f denote<br />
the probability mass function of Y and suppose that the probabilities are given by<br />
f(j) = P(Y = j) = p_j,   j = 0, 1, 2, . . . ,   where Σ_{j=0}^{∞} p_j = 1.
The mean and variance of Y satisfy
µ_Y = E(Y) = Σ_{j=0}^{∞} j p_j
and
σ²_Y = E[(Y − µ_Y)²] = E[Y²] − µ²_Y = Σ_{j=0}^{∞} j² p_j − µ²_Y.
The probability generating function (p.g.f) of the discrete random variable Y<br />
is a function defined on a subset of the reals, denoted by GY (t) and defined by<br />
G_Y(t) = E[t^Y] = Σ_{j=0}^{∞} p_j t^j,   for some t ∈ R.
Because Σ_{j=0}^{∞} p_j = 1, the sum defined by G_Y(t) converges absolutely for |t| ≤ 1.
That is, G_Y(t) is well defined for |t| ≤ 1. As the name implies, the p.g.f generates the
probabilities associated with the distribution:
G_Y(0) = p_0,   G′_Y(0) = p_1,   G″_Y(0) = 2! p_2.
In general the kth derivative of the p.g.f of Y satisfies<br />
G_Y^{(k)}(0) = k! p_k.
The p.g.f can be used to calculate the mean and variance of a random variable Y.<br />
Note that G′_Y(t) = Σ_{j=1}^{∞} j p_j t^{j−1} for −1 < t < 1. Let t approach one from the left,
t → 1⁻, to obtain
G′_Y(1) = Σ_{j=1}^{∞} j p_j = E(Y) = µ_Y.
The second derivative of GY (t) satisfies<br />
G″_Y(t) = Σ_{j=1}^{∞} j(j − 1) p_j t^{j−2},
⇒ G″_Y(1) = Σ_{j=1}^{∞} j(j − 1) p_j = E[Y²] − E(Y).
If the mean is finite then the variance of Y satisfies
σ²_Y = E[Y²] − E(Y) + E(Y) − [E(Y)]² = G″_Y(1) + G′_Y(1) − [G′_Y(1)]².
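As a quick illustration, the mean and variance of a Poisson variable can be recovered numerically from its p.g.f. G_Y(t) = e^{−θ(1−t)} (listed in Table A.5.1 below); a minimal R sketch, approximating the derivatives at t = 1 by finite differences with the arbitrary value θ = 2.5:

    # mean and variance of Pois(2.5) recovered from its probability generating function
    theta <- 2.5
    G  <- function(t) exp(-theta * (1 - t))
    h  <- 1e-5
    G1 <- (G(1 + h) - G(1 - h)) / (2 * h)          # G'(1)  = E[Y]
    G2 <- (G(1 + h) - 2 * G(1) + G(1 - h)) / h^2   # G''(1) = E[Y(Y - 1)]
    c(G1, G2 + G1 - G1^2)                          # both approximately 2.5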
The moment generating function (m.g.f) of the discrete random variable Y<br />
with state space {0, 1, 2, . . .} and probability function f(j) = pj, j = 0, 1, 2, . . . , is<br />
denoted by M_Y(t) and defined as
M_Y(t) = E[e^{tY}] = Σ_{j=0}^{∞} p_j e^{jt},   for some t ∈ R.
The moment generating function generates the moments E[Y^k] of the distribution
of the random variable Y:
M_Y(0) = 1,   M′_Y(0) = µ_Y = E(Y),   M″_Y(0) = E[Y²],
and, in general,
M_Y^{(k)}(0) = E[Y^k].
The characteristic function (ch.f) of the discrete random variable Y is<br />
C_Y(t) = E[e^{itY}] = Σ_{j=0}^{∞} p_j e^{ijt},   where i = √(−1).
The cumulative generating function (c.g.f) of the discrete random variable Y<br />
is the natural logarithm of the moment generating function and is denoted as KY (t),<br />
so that<br />
KY (t) = ln [MY (t)] .<br />
Assume Y is a continuous random variable with probability density function fY (y).<br />
The probability generating function (p.g.f.) of Y is defined as<br />
G_Y(t) = E[t^Y] = ∫_R f_Y(y) t^y dy.
The moment generating function (m.g.f.) of Y is
M_Y(t) = E[e^{tY}] = ∫_R f_Y(y) e^{ty} dy.
The characteristic function (ch.f.) of Y is
C_Y(t) = E[e^{itY}] = ∫_R f_Y(y) e^{ity} dy.
Density         p.g.f.               m.g.f.                     ch.f.                       c.g.f.
Bi(n, θ)        (θt + q)^n           (θe^t + q)^n               (θe^{it} + q)^n
Geo(θ)          θt/(1 − qt)          θ/(e^{−t} − q)             θ/(e^{−it} − q)
Neg-Bi(r, θ)    θ^r(1 − qt)^{−r}     θ^r(1 − qe^t)^{−r}         θ^r(1 − qe^{it})^{−r}       r ln θ − r ln(1 − qe^t)
Poi(θ)          e^{−θ(1−t)}          e^{θ(e^t − 1)}             e^{θ(e^{it} − 1)}           θ(e^t − 1)
Unif(α, β)                           e^{αt}(e^{βt} − 1)/(βt)    e^{iαt}(e^{iβt} − 1)/(iβt)
Exp(θ)                               (1 − t/θ)^{−1}             (1 − it/θ)^{−1}             −ln(1 − it/θ)
Ga(c, λ)                             (1 − t/λ)^{−c}             (1 − it/λ)^{−c}             −c ln(1 − it/λ)
N(µ, σ²)                             exp(µt + σ²t²/2)           exp(iµt − σ²t²/2)           iµt − σ²t²/2
Table A.5.1: Generating functions. For discrete random variables define q = 1 − θ.
Finally, the cumulative generating function (c.g.f.) is KY (t) = ln [MY (t)] .<br />
The generating functions are related through<br />
GY (e t ) = MY (t) and MY (it) = CY (t).<br />
We can use these generating functions to establish the formulas<br />
µ_Y = G′_Y(1) = M′_Y(0) = K′_Y(0)
and
σ²_Y = G″_Y(1) + G′_Y(1) − [G′_Y(1)]² = M″_Y(0) − [M′_Y(0)]² = K″_Y(0).
A very important result concerning generating functions states that the moment gen-<br />
erating function uniquely defines the probability distribution (provided it exists in an<br />
open interval around zero). The characteristic function also uniquely defines the prob-<br />
ability distribution. The generating functions of the discrete and continuous random<br />
variables discussed thus far are given in Table (A.5.1). Suppose that Y1, Y2, . . . , Yn<br />
are independent random variables. Then the moment generating function of the linear<br />
combination Z = Σ_{i=1}^{n} a_i Y_i is the product of the individual moment generating
functions:
M_Z(t) = E[e^{t Σ a_i Y_i}] = E[e^{a_1 t Y_1}] E[e^{a_2 t Y_2}] · · · E[e^{a_n t Y_n}] = ∏_{i=1}^{n} M_{Y_i}(a_i t).
It also follows that C_Z(t) = ∏_{i=1}^{n} C_{Y_i}(a_i t) and K_Z(t) = Σ_{i=1}^{n} K_{Y_i}(a_i t).
A.6 Table of Common Distributions<br />
Discrete Distributions
Bernoulli(θ)
pmf:                P(Y = y|θ) = θ^y (1 − θ)^{1−y};  y = 0, 1;  0 ≤ θ ≤ 1
mean and variance:  E[Y] = θ,  Var[Y] = θ(1 − θ)
mgf:                M_Y(t) = θe^t + (1 − θ)
Binomial Y ∼ Bin(n, θ)
pmf:                P(Y = y|n, θ) = \binom{n}{y} θ^y (1 − θ)^{n−y};  y = 0, 1, . . . , n;  0 ≤ θ ≤ 1
mean and variance:  E[Y] = nθ,  Var[Y] = nθ(1 − θ)
mgf:                M_Y(t) = [θe^t + (1 − θ)]^n
Discrete uniform(N)
pmf:                P(Y = y|N) = 1/N;  y = 1, 2, . . . , N;  N ∈ Z⁺
mean and variance:  E[Y] = (N + 1)/2,  Var[Y] = (N + 1)(N − 1)/12
mgf:                M_Y(t) = (1/N) Σ_{i=1}^{N} e^{it}
Geometric(θ)
pmf:                P(Y = y|θ) = θ(1 − θ)^{y−1};  y = 1, 2, . . .;  0 ≤ θ ≤ 1
mean and variance:  E[Y] = 1/θ,  Var[Y] = (1 − θ)/θ²
mgf:                M_Y(t) = θe^t/[1 − (1 − θ)e^t],  t < −log(1 − θ)
notes:              The random variable X = Y − 1 is negative binomial(1, θ).
Hypergeometric(b, w, n)
pmf:                P(Y = y|b, w, n) = \binom{w}{y}\binom{b − w}{n − y}/\binom{b}{n};  y = 0, 1, . . . , n;
                    n − (b − w) ≤ y ≤ w;  b, w, n ≥ 0
mean and variance:  E[Y] = nw/b,  Var[Y] = (nw/b)(b − w)(b − n)/(b(b − 1))
Negative binomial(r, θ)
pmf:                P(Y = y|r, θ) = \binom{r + y − 1}{y} θ^r (1 − θ)^y;  y = 0, 1, 2, . . .;  0 ≤ θ ≤ 1
mean and variance:  E[Y] = r(1 − θ)/θ,  Var[Y] = r(1 − θ)/θ²
mgf:                M_Y(t) = (θ/[1 − (1 − θ)e^t])^r,  t < −log(1 − θ)
notes:              An alternative form of the pmf, used in the derivation in our notes,
                    is given by P(N = n|r, θ) = \binom{n − 1}{r − 1} θ^r (1 − θ)^{n−r},  n = r, r + 1, . . . ,
                    where the random variable N = Y + r. The negative binomial can also be
                    derived as a mixture of Poisson random variables.
Poisson(θ)
pmf:                P(Y = y|θ) = θ^y e^{−θ}/y!;  y = 0, 1, 2, . . .;  0 ≤ θ < ∞
mean and variance:  E[Y] = θ,  Var[Y] = θ
mgf:                M_Y(t) = e^{θ(e^t − 1)}
Continuous Distributions
Uniform U(a, b)
pdf:                f(y|a, b) = (b − a)^{−1};  a ≤ y ≤ b
mean and variance:  E[Y] = (b + a)/2,  Var[Y] = (b − a)²/12
mgf:                M_Y(t) = (e^{bt} − e^{at})/((b − a)t)
notes:              A uniform distribution with a = 0 and b = 1 is a special case of the
                    beta distribution (α = β = 1).
Exponential E(θ)
pdf:                f(y|θ) = θe^{−θy};  0 ≤ y < ∞;  θ > 0
mean and variance:  E[Y] = 1/θ,  Var[Y] = 1/θ²
mgf:                M_Y(t) = (1 − t/θ)^{−1}
notes:              Special case of the gamma distribution. X = Y^{1/γ} is Weibull,
                    X = √(2θY) is Rayleigh, X = α − γ log(Y/β) is Gumbel.
Gamma G(ω, θ)
pdf:                f(y|ω, θ) = θe^{−θy}(θy)^{ω−1}/Γ(ω);  0 ≤ y < ∞;  ω, θ > 0
mean and variance:  E[Y] = ω/θ,  Var[Y] = ω/θ²
mgf:                M_Y(t) = (1 − t/θ)^{−ω}
notes:              Includes the exponential (ω = 1) and chi-squared (ω = n/2, θ = 1/2).
Normal N(µ, σ²)
pdf:                f(y|µ, σ²) = (1/(√(2π)σ)) e^{−(y−µ)²/(2σ²)};  −∞ < y < ∞,  −∞ < µ < ∞,  σ > 0
mean and variance:  E[Y] = µ,  Var[Y] = σ²
mgf:                M_Y(t) = e^{µt + σ²t²/2}
notes:              Sometimes called the Gaussian distribution.