
Student Notes To Accompany
MS4214: STATISTICAL INFERENCE

Dr. Kevin Hayes

September 1, 2007


Contents

1 Introduction
  1.1 Motivating Examples
  1.2 General Course Overview

2 The Theory of Estimation
  2.1 The Frequentist Philosophy
  2.2 The Frequentist Approach to Estimation
  2.3 Minimum-Variance Unbiased Estimation
  2.4 Maximum Likelihood Estimation
  2.5 Multi-parameter Estimation
  2.6 Newton-Raphson Optimization
  2.7 The Invariance Principle
  2.8 Optimality Properties of the MLE
  2.9 Data Reduction
  2.10 Worked Problems

3 The Theory of Confidence Intervals
  3.1 Exact Confidence Intervals
  3.2 Pivotal Quantities for Use with Normal Data
  3.3 Approximate Confidence Intervals
  3.4 Worked Problems

4 The Theory of Hypothesis Testing
  4.1 Introduction
  4.2 The General Testing Problem
  4.3 Hypothesis Testing for Normal Data
  4.4 Generally Applicable Test Procedures
  4.5 The Neyman-Pearson Lemma
  4.6 Goodness of Fit Tests
  4.7 The χ² Test for Contingency Tables
  4.8 Worked Problems

A Review of Probability
  A.1 Expectation and Variance
  A.2 Discrete Random Variables
    A.2.1 Bernoulli Distribution
    A.2.2 Binomial Distribution
    A.2.3 Geometric Distribution
    A.2.4 Negative Binomial Distribution
    A.2.5 Hypergeometric Distribution
    A.2.6 Poisson Distribution
    A.2.7 Discrete Uniform Distribution
    A.2.8 The Multinomial Distribution
  A.3 Continuous Random Variables
    A.3.1 Uniform Distribution
    A.3.2 Exponential Distribution
    A.3.3 Gamma Distribution
    A.3.4 Gaussian Distribution
    A.3.5 Weibull Distribution
    A.3.6 Beta Distribution
    A.3.7 Chi-square Distribution
    A.3.8 Distribution of a Function of a Random Variable
  A.4 Random Vectors
    A.4.1 Sums of Independent Random Variables
    A.4.2 Covariance and Correlation
    A.4.3 The Bivariate Change of Variables Formula
    A.4.4 The Bivariate Normal Distribution
    A.4.5 Bivariate Normal Conditional Distributions
    A.4.6 The Multivariate Normal Distribution
  A.5 Generating Functions
  A.6 Table of Common Distributions


Chapter 1

Introduction

1.1 Motivating Examples

Example 1.1 (Radioactive decay). Let X denote the number of particles that will be emitted from a radioactive source in the next one minute period. We know that X will turn out to be equal to one of the non-negative integers but, apart from that, we know nothing about which of the possible values are more or less likely to occur. The quantity X is said to be a random variable.

Suppose we are told that the random variable X has a Poisson distribution with parameter θ = 2. Then, if x is some non-negative integer, we know that the probability that the random variable X takes the value x is given by the formula

    P(X = x) = θ^x exp(−θ) / x!

where θ = 2. So, for instance, the probability that X takes the value x = 4 is

    P(X = 4) = 2^4 exp(−2) / 4! = 0.0902.

We have here a probability model for the random variable X. Note that we are using upper case letters for random variables and lower case letters for the values taken by random variables. We shall persist with this convention throughout the course.

Suppose we are told that the random variable X has a Poisson distribution with parameter θ where θ is some unspecified positive number. Then, if x is some non-negative integer, we know that the probability that the random variable X takes the value x is given by the formula

    P(X = x|θ) = θ^x exp(−θ) / x!,    (1.1)

for θ ∈ R⁺. However, we cannot calculate probabilities such as the probability that X takes the value x = 4 without knowing the value of θ.


Suppose that, in order to learn something about the value of θ, we decide to measure the value of X for each of the next 5 one minute time periods. Let us use the notation X1 to denote the number of particles emitted in the first period, X2 to denote the number emitted in the second period and so forth. We shall end up with data consisting of a random vector X = (X1, X2, . . . , X5). Consider x = (x1, x2, x3, x4, x5) = (2, 1, 0, 3, 4). Then x is a possible value for the random vector X. We know that the probability that X1 takes the value x1 = 2 is given by the formula

    P(X1 = 2|θ) = θ² exp(−θ) / 2!

and similarly that the probability that X2 takes the value x2 = 1 is given by

    P(X2 = 1|θ) = θ exp(−θ) / 1!

and so on. However, what about the probability that X takes the value x? In order for this probability to be specified we need to know something about the joint distribution of the random variables X1, X2, . . . , X5. A simple assumption to make is that the random variables X1, X2, . . . , X5 are mutually independent. (Note that this assumption may not be correct since X2 may tend to be more similar to X1 than it would be to X5.) However, with this assumption we can say that the probability that X takes the value x is given by

    P(X = x|θ) = Π_{i=1}^5 θ^{x_i} exp(−θ) / x_i!
               = [θ² exp(−θ)/2!] × [θ¹ exp(−θ)/1!] × [θ⁰ exp(−θ)/0!] × [θ³ exp(−θ)/3!] × [θ⁴ exp(−θ)/4!]
               = θ^10 exp(−5θ) / 288.

In general, if x = (x1, x2, x3, x4, x5) is any vector of 5 non-negative integers, then the probability that X takes the value x is given by

    P(X = x|θ) = Π_{i=1}^5 θ^{x_i} exp(−θ) / x_i!
               = θ^{Σ_{i=1}^5 x_i} exp(−5θ) / Π_{i=1}^5 x_i!.

We have here a probability model for the random vector X.

Our plan is to use the value x of X that we actually observe to learn something about the value of θ. The ways and means to accomplish this task make up the subject matter of this course. □
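As a quick numerical check of this model, the probability in (1.1) and the joint probability above can be evaluated directly. The short Python sketch below is an illustration only (the function names are our own, not part of the notes); it reproduces P(X = 4 | θ = 2) ≈ 0.0902 and the joint probability θ^10 exp(−5θ)/288 for the observed vector (2, 1, 0, 3, 4).

    from math import exp, factorial

    def poisson_pmf(x, theta):
        # P(X = x | theta) = theta^x exp(-theta) / x!
        return theta**x * exp(-theta) / factorial(x)

    def joint_pmf(xs, theta):
        # Independent Poisson counts: product of the marginal pmfs
        p = 1.0
        for x in xs:
            p *= poisson_pmf(x, theta)
        return p

    print(poisson_pmf(4, 2.0))              # 0.0902 to four decimal places
    print(joint_pmf([2, 1, 0, 3, 4], 2.0))  # equals 2**10 * exp(-10) / 288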


Example 1.2 (Tuberculosis). Suppose we are going to examine n people and record a value 1 for people who have been exposed to the tuberculosis virus and a value 0 for people who have not been so exposed. The data will consist of a random vector X = (X1, X2, . . . , Xn) where Xi = 1 if the ith person has been exposed to the TB virus and Xi = 0 otherwise.

A Bernoulli random variable X has probability mass function

    P(X = x|θ) = θ^x (1 − θ)^(1−x),    (1.2)

for x = 0, 1 and θ ∈ (0, 1). A possible model would be to assume that X1, X2, . . . , Xn behave like n independent Bernoulli random variables each of which has the same (unknown) probability θ of taking the value 1.

Let x = (x1, x2, . . . , xn) be a particular vector of zeros and ones. Then the model implies that the probability that the random vector X takes the value x is given by

    P(X = x|θ) = Π_{i=1}^n θ^{x_i} (1 − θ)^(1−x_i) = θ^{Σ_{i=1}^n x_i} (1 − θ)^{n − Σ_{i=1}^n x_i}.

Once again our plan is to use the value x of X that we actually observe to learn something about the value of θ. □

Example 1.3 (Viagra). A chemical compound Y is used in the manufacture of Viagra. Suppose that we are going to measure the micrograms of Y in a sample of n pills. The data will consist of a random vector X = (X1, X2, . . . , Xn) where Xi is the chemical content of Y for the ith pill.

A possible model would be to assume that X1, X2, . . . , Xn behave like n independent random variables each having a N(µ, σ²) density with unknown mean parameter µ ∈ R (really, here µ ∈ R⁺) and known variance parameter σ² < ∞. Each Xi has density

    f_{Xi}(x_i|µ) = (1/√(2πσ²)) exp{ −(x_i − µ)² / (2σ²) }.

Let x = (x1, x2, . . . , xn) be a particular vector of real numbers. Then the model implies the joint density

    f_X(x|µ) = Π_{i=1}^n (1/√(2πσ²)) exp{ −(x_i − µ)² / (2σ²) }
             = (1/(√(2πσ²))^n) exp{ −Σ_{i=1}^n (x_i − µ)² / (2σ²) }.

Once again our plan is to use the value x of X that we actually observe to learn something about the value of µ. □


Example 1.4 (Blood pressure). We wish to test a new device for measuring blood pressure. We are going to try it out on n people and record the difference between the value returned by the device and the true value as recorded by standard techniques. The data will consist of a random vector X = (X1, X2, . . . , Xn) where Xi is the difference for the ith person. A possible model would be to assume that X1, X2, . . . , Xn behave like n independent random variables each having a N(0, σ²) density where σ² is some unknown positive real number. Let x = (x1, x2, . . . , xn) be a particular vector of real numbers. Then the model implies that the joint density of the random vector X at the value x is given by

    f_X(x|σ²) = Π_{i=1}^n (1/√(2πσ²)) exp{ −x_i² / (2σ²) }
              = (1/(√(2πσ²))^n) exp{ −Σ_{i=1}^n x_i² / (2σ²) }.

Once again our plan is to use the value x of X that we actually observe to learn something about the value of σ². Knowledge of σ is useful since it allows us to make statements such as that 95% of errors will be less than 1.96 × σ in magnitude. □

1.2 General Course Overview

Definition 1.1 (Inference). Inference studies the way in which data we observe should influence our beliefs about and practices in the real world. □

Definition 1.2 (Statistical inference). Statistical inference considers how inference should proceed when the data is subject to random fluctuation. □

The concept of probability is used to describe the random mechanism which gave rise to the data. This involves the use of probability models.

The incentive for contemplating a probability model is that through it we may achieve an economy of thought in the description of events, enabling us to enunciate laws and relations of more than immediate validity and relevance. A probability model is usually completely specified apart from the values of a few unknown quantities called parameters. We then try to discover to what extent the data can inform us about the values of the parameters.

Statistical inference assumes that the data is given and that the probability model is a correct description of the random mechanism which generated the data.


Three main topics will be covered:

Estimation: Unbiasedness, mean square error, consistency, relative efficiency, sufficiency, minimum variance. Fisher information for a function of a parameter, Cramér-Rao lower bound, efficiency. Fitting standard distributions to discrete and continuous data. Method of moments. Maximum likelihood estimation: finding estimators analytically and numerically, invariance, censored data.

Hypothesis testing: Simple and composite hypotheses, types of error, power, operating characteristic curves, p-value. Neyman-Pearson method. Generalised likelihood ratio test. Use of asymptotic results to construct tests. Central limit theorem, asymptotic distributions of the maximum likelihood estimator and the generalised likelihood ratio test statistic.

Confidence intervals and sets: Random intervals and sets. Use of pivotal quantities. Relationship between tests and confidence intervals. Use of asymptotic results.


Chapter 2

The Theory of Estimation

2.1 The Frequentist Philosophy

The dominant philosophy of inference is based on the frequentist theory of probability. According to the frequentist theory, probability statements can only be made regarding events associated with a random experiment. A random experiment is an experiment which has a well defined set of possible outcomes S. In addition, we must be able to envisage an infinite sequence of independent repetitions of the experiment with the actual outcome of each repetition being some unpredictable element of S. A random variable is a numerical quantity associated with each possible outcome in S. A random vector is a collection of numerical quantities associated with each possible outcome in S. In performing the experiment we determine which element of S has occurred and thereby the observed values of all random variables or random vectors of interest. Since the outcome of the experiment is unpredictable, so too is the value of any random variable or random vector. Since we can envisage an infinite sequence of independent repetitions of the experiment, we can envisage an infinite sequence of independent determinations of the value of a random variable (or vector). The purpose of a statistical model is to describe the unpredictability of such a sequence of determinations.

Consider the random experiment which consists of picking someone at random from the 2007 electoral register for Limerick. The outcome of such an experiment will be a human being and the set S consists of all human beings whose names are on the register. We can clearly envisage an infinite sequence of independent repetitions of such an experiment. Consider the random variable X where X = 0 if the outcome of the experiment is a male and X = 1 if the outcome of the experiment is a female. When we say that P(X = 1) = 0.54 we are taken to mean that in an infinite sequence of independent repetitions of the experiment exactly 54% of the outcomes will produce a value of X = 1.


Now consider the random experiment which consists of picking 3 people at random from the 1997 electoral register for Limerick. The outcome of such an experiment will be a collection of 3 human beings and the set S consists of all subsets of 3 human beings which may be formed from the set of all human beings whose names are on the register. We can clearly envisage an infinite sequence of independent repetitions of such an experiment. Consider the random vector X = (X1, X2, X3) where, for i = 1, 2, 3, Xi = 0 if the ith person chosen is a male and Xi = 1 if the ith person chosen is a female. When we say that X1, X2, X3 are independent and identically distributed or IID with P(Xi = 1) = θ we are taken to mean that in an infinite sequence of independent repetitions of the experiment the proportion of outcomes which produce, for instance, a value of X = (1, 1, 0) is given by θ²(1 − θ).

Suppose that the value of θ is unknown and we propose to estimate it by the estimator θ̂ whose value is given by the proportion of females in the sample of size 3. Since θ̂ depends on the value of X we sometimes write θ̂(X) to emphasise this fact. We can work out the probability distribution of θ̂ as follows:

    x          P(X = x)     θ̂(x)
    (0, 0, 0)  (1 − θ)³     0
    (0, 0, 1)  θ(1 − θ)²    1/3
    (0, 1, 0)  θ(1 − θ)²    1/3
    (1, 0, 0)  θ(1 − θ)²    1/3
    (0, 1, 1)  θ²(1 − θ)    2/3
    (1, 0, 1)  θ²(1 − θ)    2/3
    (1, 1, 0)  θ²(1 − θ)    2/3
    (1, 1, 1)  θ³           1

Thus P(θ̂ = 0) = (1 − θ)³, P(θ̂ = 1/3) = 3θ(1 − θ)², P(θ̂ = 2/3) = 3θ²(1 − θ) and P(θ̂ = 1) = θ³. We now ask whether θ̂ is a good estimator of θ. Clearly if θ = 0 we have that P(θ̂ = θ) = P(θ̂ = 0) = 1, which is good. Likewise if θ = 1 we also have that P(θ̂ = θ) = P(θ̂ = 1) = 1. If θ = 1/3 then P(θ̂ = θ) = P(θ̂ = 1/3) = 3(1/3)(1 − 1/3)² = 4/9. Likewise if θ = 2/3 we have that P(θ̂ = θ) = P(θ̂ = 2/3) = 3(2/3)²(1 − 2/3) = 4/9. However if the value of θ lies outside the set {0, 1/3, 2/3, 1} we have that P(θ̂ = θ) = 0.

Since θ̂ is a random variable we might try to calculate its expected value E(θ̂), i.e. the average value we would get if we carried out an infinite number of independent repetitions of the experiment. We have that

    E(θ̂) = 0·P(θ̂ = 0) + (1/3)·P(θ̂ = 1/3) + (2/3)·P(θ̂ = 2/3) + 1·P(θ̂ = 1)
          = 0(1 − θ)³ + (1/3)3θ(1 − θ)² + (2/3)3θ²(1 − θ) + 1·θ³
          = θ.


Thus if we carried out an infinite number of independent repetitions of the experiment and calculated the value of θ̂ for each repetition, the average of the θ̂ values would be exactly θ, the true value of the parameter! This is true no matter what the actual value of θ is. Such an estimator is said to be unbiased.

Consider the quantity L = (θ̂ − θ)², which might be regarded as a measure of the error or loss involved in using θ̂ to estimate θ. The possible values for L are (0 − θ)², (1/3 − θ)², (2/3 − θ)² and (1 − θ)². We can calculate the expected value of L as follows:

    E(L) = (0 − θ)² P(θ̂ = 0) + (1/3 − θ)² P(θ̂ = 1/3) + (2/3 − θ)² P(θ̂ = 2/3) + (1 − θ)² P(θ̂ = 1)
         = θ²(1 − θ)³ + (1/3 − θ)² 3θ(1 − θ)² + (2/3 − θ)² 3θ²(1 − θ) + (1 − θ)² θ³
         = θ(1 − θ)/3.

The quantity E(L) is called the mean squared error (MSE) of the estimator θ̂. Since the quantity θ(1 − θ) attains its maximum value of 1/4 for θ = 1/2, the largest value E(L) can attain is 1/12, which occurs if the true value of the parameter θ happens to be equal to 1/2; for all other values of θ the quantity E(L) is less than 1/12. If somebody could invent a different estimator θ̃ of θ whose MSE was less than that of θ̂ for all values of θ then we would prefer θ̃ to θ̂.

This trivial example gives some idea of the kinds of calculations that we will be performing. The basic frequentist principle is that statistical procedures should be judged in terms of their average performance in an infinite series of independent repetitions of the experiment which produced the data. An important point to note is that the parameter values are treated as fixed (although unknown) throughout this infinite series of repetitions. We should be happy to use a procedure which performs well on the average and should not be concerned with how it performs on any one particular occasion.
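The long-run averages above can be checked by brute-force simulation. The following Python sketch is an illustration only (the choice of true θ is ours); it draws many samples of size 3, computes θ̂ for each, and compares the empirical mean and mean squared error with θ and θ(1 − θ)/3.

    import random

    def simulate(theta, n_reps=200_000, seed=1):
        # Repeatedly draw a sample of size 3 and compute theta_hat = proportion of 1's.
        rng = random.Random(seed)
        estimates = []
        for _ in range(n_reps):
            sample = [1 if rng.random() < theta else 0 for _ in range(3)]
            estimates.append(sum(sample) / 3)
        mean_est = sum(estimates) / n_reps
        mse = sum((t - theta) ** 2 for t in estimates) / n_reps
        return mean_est, mse

    theta = 0.54
    mean_est, mse = simulate(theta)
    print(mean_est, theta)                # empirical mean of theta_hat vs true theta
    print(mse, theta * (1 - theta) / 3)   # empirical MSE vs theta(1 - theta)/3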

2.2 The Frequentist Approach to Estimation

Suppose that we are going to observe a value of a random vector X. Let 𝒳 denote the set of possible values X can take and, for x ∈ 𝒳, let f(x|θ) denote the probability that X takes the value x, where the parameter θ is some unknown element of the set Θ.

The problem we face is that of estimating θ. An estimator θ̂ is a procedure which for each possible value x ∈ 𝒳 specifies which element of Θ we should quote as an estimate of θ. When we observe X = x we quote θ̂(x) as our estimate of θ. Thus θ̂ is a function of the random vector X. Sometimes we write θ̂(X) to emphasise this point.


Given any estimator θ̂ we can calculate its expected value for each possible value of θ ∈ Θ. An estimator is said to be unbiased if this expected value is identically equal to θ. If an estimator is unbiased then we can conclude that if we repeat the experiment an infinite number of times with θ fixed and calculate the value of the estimator each time, then the average of the estimator values will be exactly equal to θ. From the frequentist viewpoint this is a desirable property and so, where possible, frequentists use unbiased estimators.

Definition 2.1 (The Frequentist philosophy). To evaluate the usefulness of an estimator θ̂ = θ̂(x) of θ, examine the properties of the random variable θ̂ = θ̂(X). □

Definition 2.2 (Unbiased estimators). An estimator θ̂ = θ̂(X) is said to be unbiased for a parameter θ if it equals θ in expectation:

    E[θ̂(X)] = E(θ̂) = θ.

Intuitively, an unbiased estimator is ‘right on target’. □

Definition 2.3 (Bias of an estimator). The bias of an estimator θ̂ = θ̂(X) of θ is defined as bias(θ̂) = E[θ̂(X) − θ]. □

Definition 2.4 (Bias corrected estimators). If bias(θ̂) is of the form cθ, then (obviously) θ̃ = θ̂/(1 + c) is unbiased for θ. Likewise, if E(θ̂) = θ + c, i.e. bias(θ̂) = c for a constant c, then θ̃ = θ̂ − c is unbiased for θ. In such situations we say that θ̃ is a bias-corrected version of θ̂. □

Definition 2.5 (Unbiased functions). More generally, ĝ(X) is said to be unbiased for a function g(θ) if E[ĝ(X)] = g(θ). □

Note that even if θ̂ is an unbiased estimator of θ, g(θ̂) will generally not be an unbiased estimator of g(θ) unless g is linear or affine. This limits the importance of the notion of unbiasedness. It might be at least as important that an estimator is accurate in the sense that its distribution is highly concentrated around θ.

Is unbiasedness a good thing? Unbiasedness is important when combining estimates, as averages of unbiased estimators are unbiased (see the review exercises at the end of this chapter). For example, when combining standard deviations s1, s2, . . . , sk with degrees of freedom df1, . . . , dfk we always average their squares,

    s̄ = √[ (df1 s1² + · · · + dfk sk²) / (df1 + · · · + dfk) ],

as the si² are unbiased estimators of the variance σ², whereas the si are not unbiased estimators of σ (see the review exercises). Be careful when averaging biased estimators! It may well be appropriate to make a bias-correction before averaging.


Problem 2.1. Let X have a binomial distribution with parameters n and θ. Show that the sample proportion θ̂ = X/n is an unbiased estimate of θ.

Solution. X ∼ Bin(n, θ) ⇒ E(X) = nθ. Then E(θ̂) = E(X/n) = E(X)/n = nθ/n = θ. As E(θ̂) = θ, the estimator θ̂ is unbiased.

Problem 2.2. Let X1, . . . , Xn be independent and identically distributed with density

    f(x|θ) = e^(−(x−θ))  for x > θ;    0 otherwise.

Show that θ̂ = X̄ = (X1 + · · · + Xn)/n is a biased estimator of θ. Propose an unbiased estimator θ̃ of the form θ̃ = θ̂ + c.

Solution. E(X) = ∫_θ^∞ x e^(−(x−θ)) dx = [ −x e^(−(x−θ)) + ∫ e^(−(x−θ)) dx ]_θ^∞ = [ −(x + 1) e^(−(x−θ)) ]_θ^∞ = θ + 1. Next, E(θ̂) = E(X̄) = (1/n) E(X1 + X2 + · · · + Xn) = θ + 1 ≠ θ, so θ̂ is biased. Propose θ̃ = X̄ − 1. Then E(θ̃) = E(X̄) − 1 = θ + 1 − 1 = θ and θ̃ is unbiased.
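A quick simulation makes the bias in Problem 2.2 concrete. The sketch below is illustrative only (θ = 3 and n = 10 are arbitrary choices of ours); it draws shifted-exponential samples and shows that X̄ centres on θ + 1 while X̄ − 1 centres on θ.

    import random

    def shifted_exp_sample(theta, n, rng):
        # X = theta + E with E ~ Exp(1), so the density is exp(-(x - theta)) for x > theta.
        return [theta + rng.expovariate(1.0) for _ in range(n)]

    rng = random.Random(2)
    theta, n, reps = 3.0, 10, 100_000
    xbar_mean = 0.0
    for _ in range(reps):
        xs = shifted_exp_sample(theta, n, rng)
        xbar_mean += sum(xs) / n
    xbar_mean /= reps

    print(xbar_mean)        # close to theta + 1 = 4.0 (biased)
    print(xbar_mean - 1.0)  # close to theta = 3.0 (bias-corrected estimator)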

Definition 2.6 (Mean squared error). The mean squared error of the estimator θ̂ is defined as MSE(θ̂) = E(θ̂ − θ)². Given the same set of data, θ̂1 is “better” than θ̂2 if MSE(θ̂1) ≤ MSE(θ̂2) (uniformly better if true ∀ θ). □

Lemma 2.3 (The MSE variance-bias tradeoff). The MSE decomposes as

    MSE(θ̂) = Var(θ̂) + Bias(θ̂)².

Proof. The problem of finding minimum-MSE estimators cannot in general be solved uniquely; instead we decompose the MSE:

    MSE(θ̂) = E(θ̂ − θ)²
            = E{ [θ̂ − E(θ̂)] + [E(θ̂) − θ] }²
            = E[θ̂ − E(θ̂)]² + E[E(θ̂) − θ]² + 2 E{ [θ̂ − E(θ̂)][E(θ̂) − θ] }    (the cross term equals 0)
            = E[θ̂ − E(θ̂)]² + [E(θ̂) − θ]²
            = Var(θ̂) + Bias(θ̂)².

NOTE: This lemma implies that the mean squared error of an unbiased estimator is equal to the variance of the estimator.


Problem 2.4. Consider X1, . . . , Xn where Xi ∼ N(θ, σ²) and σ is known. Three estimators of θ are θ̂1 = X̄ = (1/n) Σ_{i=1}^n Xi, θ̂2 = X1, and θ̂3 = (X1 + X̄)/2. Pick one.

Solution. E(θ̂1) = (1/n)[E(X1) + · · · + E(Xn)] = (1/n)[θ + · · · + θ] = (1/n)[nθ] = θ (unbiased). Next, E(θ̂2) = E(X1) = θ (unbiased). Finally,

    E(θ̂3) = (1/2) E[ ((n+1)/n) X1 + (1/n)(X2 + · · · + Xn) ]
           = (1/2)[ ((n+1)/n) E(X1) + (1/n)(E(X2) + · · · + E(Xn)) ]
           = (1/2)[ ((n+1)/n) θ + ((n−1)/n) θ ] = θ  (unbiased).

All three estimators are unbiased. Although desirable from a frequentist standpoint, unbiasedness is not a property that helps us choose between estimators. To do this we must examine some measure of loss like the mean squared error. For the class of estimators that are unbiased, the mean squared error will be equal to the estimation variance. Calculate

    Var(θ̂1) = (1/n²)[Var(X1) + · · · + Var(Xn)] = (1/n²)[σ² + · · · + σ²] = (1/n²)[nσ²] = σ²/n.

Trivially Var(θ̂2) = Var(X1) = σ². Finally,

    Var(θ̂3) = (1/4)[ Var(X1) + Var(X̄) + 2 Cov(X1, X̄) ] = (1/4)[ σ² + σ²/n + 2σ²/n ] = σ²/4 + 3σ²/(4n).

So X̄ appears “best” in the sense that Var(θ̂) is smallest among these three unbiased estimators.

Problem 2.5. Consider X1, . . . , Xn to be independent random variables with means E(Xi) = µ + βi and variances Var(Xi) = σi². Such a situation could arise when the Xi are estimators of µ obtained from independent sources and βi is the bias of the estimator Xi. Consider pooling the estimators of µ into a common estimator using the linear combination µ̂ = w1X1 + w2X2 + · · · + wnXn.

(i) If the estimators are unbiased, show that µ̂ is unbiased if and only if Σ wi = 1.

(ii) In the case when the estimators are unbiased, show that µ̂ has minimum variance when the weights are inversely proportional to the variances σi².

(iii) Show that the variance of µ̂ for optimal weights wi is Var(µ̂) = 1/Σ_i σi^(−2).

(iv) Consider the case when estimators may be biased. Find the mean square error of the optimal linear combination obtained above, and compare its behaviour as n → ∞ in the biased and unbiased case, when σi² = σ², i = 1, . . . , n.

Solution. E(µ̂) = E(w1X1 + · · · + wnXn) = Σ_i wi E(Xi) = Σ_i wi µ = µ Σ_i wi, so µ̂ is unbiased if and only if Σ_i wi = 1. The variance of our estimator is Var(µ̂) = Σ_i wi² σi², which should be minimized subject to the constraint Σ_i wi = 1. Differentiating the Lagrangian L = Σ_i wi² σi² − λ(Σ_i wi − 1) with respect to wi and setting equal to zero yields 2wi σi² = λ ⇒ wi ∝ σi^(−2), so that wi = σi^(−2) / Σ_j σj^(−2). Then, for optimal weights, we get Var(µ̂) = Σ_i wi² σi² = (Σ_i σi^(−4) σi²)/(Σ_i σi^(−2))² = 1/Σ_i σi^(−2). When σi² = σ² we have that Var(µ̂) = σ²/n, which tends to zero as n → ∞, whereas bias(µ̂) = Σ βi/n = β̄ is equal to the average bias, so MSE(µ̂) = σ²/n + β̄². Therefore the bias tends to dominate the variance as n gets larger, which is very unfortunate.


Problem 2.6. Let X1, . . . , Xn be an independent sample of size n from the uniform distribution on the interval (0, θ), with density for a single observation being f(x|θ) = θ^(−1) for 0 < x < θ and 0 otherwise, and consider θ > 0 unknown.

(i) Find the expected value and variance of the estimator θ̂ = 2X̄.

(ii) Find the expected value of the estimator θ̃ = X(n), i.e. the largest observation.

(iii) Find an unbiased estimator of the form θ̌ = cX(n) and calculate its variance.

(iv) Compare the mean square error of θ̂ and θ̌.

Solution. θ̂ has E(θ̂) = E(2X̄) = (2/n)[E(X1) + · · · + E(Xn)] = (2/n)[(θ/2) + · · · + (θ/2)] = (2/n)[n(θ/2)] = θ (unbiased), and Var(θ̂) = Var(2X̄) = (4/n²)[Var(X1) + · · · + Var(Xn)] = (4/n²)[(θ²/12) + · · · + (θ²/12)] = (4/n²)(nθ²/12) = θ²/(3n). Let U = X(n); we then have P(U ≤ u) = Π_i P(Xi ≤ u) = (u/θ)^n for 0 < u < θ, so differentiation yields that U has density f(u|θ) = n u^(n−1) θ^(−n) for 0 < u < θ. Direct integration now yields E(θ̃) = E(U) = nθ/(n + 1) (a biased estimator). The estimator θ̌ = ((n + 1)/n) U is unbiased. Direct integration gives E(U²) = nθ²/(n + 2), so Var(θ̃) = Var(U) = nθ²/[(n + 2)(n + 1)²] and Var(θ̌) = θ²/[n(n + 2)]. As θ̂ and θ̌ are both unbiased estimators, MSE(θ̂) = Var(θ̂) and MSE(θ̌) = Var(θ̌). Clearly the mean square error of θ̂ is very large compared to the mean square error of θ̌.
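A simulation comparing the two unbiased estimators in Problem 2.6 may help. The Python sketch below is illustrative (θ = 5 and n = 20 are arbitrary choices of ours); it estimates both mean squared errors and compares them with θ²/(3n) and θ²/[n(n + 2)].

    import random

    theta, n, reps = 5.0, 20, 100_000
    rng = random.Random(3)
    mse_2xbar, mse_cmax = 0.0, 0.0
    for _ in range(reps):
        xs = [rng.uniform(0.0, theta) for _ in range(n)]
        est1 = 2 * sum(xs) / n         # 2 * sample mean
        est2 = (n + 1) / n * max(xs)   # bias-corrected maximum
        mse_2xbar += (est1 - theta) ** 2
        mse_cmax += (est2 - theta) ** 2

    print(mse_2xbar / reps, theta**2 / (3 * n))       # MSE of 2*Xbar vs theta^2/(3n)
    print(mse_cmax / reps, theta**2 / (n * (n + 2)))  # MSE of ((n+1)/n)X(n) vs theta^2/(n(n+2))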

2.3 Minimum-Variance Unbiased Estimation

Getting a small MSE often involves a tradeoff between variance and bias. By not insisting on θ̂ being unbiased, the variance can sometimes be drastically reduced. For unbiased estimators, the MSE obviously equals the variance, MSE(θ̂) = Var(θ̂), so no tradeoff can be made. One approach is to restrict ourselves to the subclass of estimators that are unbiased and minimum variance.

Definition 2.7 (Minimum-variance unbiased estimator). If an unbiased estimator of g(θ) has minimum variance among all unbiased estimators of g(θ) it is called a minimum variance unbiased estimator (MVUE). □

We will develop a method of finding the MVUE when it exists. When such an estimator does not exist we will be able to find a lower bound for the variance of an unbiased estimator in the class of unbiased estimators, and compare the variance of our unbiased estimator with this lower bound.


Definition 2.8 (Score function). For the (possibly vector valued) observation X = x to be informative about θ, the density must vary with θ. If f(x|θ) is smooth and differentiable, this change is quantified to first order by the score function

    S(θ) = (∂/∂θ) ln f(x|θ) ≡ f′(x|θ) / f(x|θ).

Under suitable regularity conditions (differentiation wrt θ and integration wrt x can be interchanged), we have

    E{S(θ)} = ∫ [f′(x|θ)/f(x|θ)] f(x|θ) dx = ∫ f′(x|θ) dx = (∂/∂θ) ∫ f(x|θ) dx = (∂/∂θ) 1 = 0.

Thus the score function has expectation zero. □

True frequentism evaluates the properties of estimators based on their “long-run” behaviour. The value of x will vary from sample to sample, so we have treated the score function as a random variable and looked at its average across all possible samples.

Lemma 2.7 (Fisher information). The variance of S(θ) is the expected Fisher information about θ,

    I(θ) = E{S(θ)²} ≡ E{ [ (∂/∂θ) ln f(x|θ) ]² }.

Proof. Using the chain rule,

    (∂²/∂θ²) ln f = (∂/∂θ)[ (1/f)(∂f/∂θ) ] = −(1/f²)(∂f/∂θ)² + (1/f)(∂²f/∂θ²) = −[ (∂/∂θ) ln f ]² + (1/f)(∂²f/∂θ²).

If integration and differentiation can be interchanged,

    E{ (1/f)(∂²f/∂θ²) } = ∫_𝒳 (∂²f/∂θ²) dx = (∂²/∂θ²) ∫_𝒳 f dx = (∂²/∂θ²) 1 = 0,

thus

    −E{ (∂²/∂θ²) ln f(x|θ) } = E{ [ (∂/∂θ) ln f(x|θ) ]² } = I(θ).    (2.1)

Variance measures lack of knowledge, so it is reasonable that the reciprocal of the variance should be defined as the amount of information carried by the (possibly vector valued) observation x about θ.


Theorem 2.8 (Cramér-Rao lower bound). Let θ̂ be an unbiased estimator of θ. Then

    Var(θ̂) ≥ { I(θ) }^(−1).

Proof. Unbiasedness, E(θ̂) = θ, implies

    ∫ θ̂(x) f(x|θ) dx = θ.

Assume we can differentiate wrt θ under the integral; then

    (∂/∂θ) ∫ θ̂(x) f(x|θ) dx = 1.

The estimator θ̂(x) can't depend on θ, so

    ∫ θ̂(x) (∂/∂θ) f(x|θ) dx = 1.

For any pdf f,

    ∂f/∂θ = f (∂/∂θ)(ln f),

so that now

    ∫ θ̂(x) f (∂/∂θ)(ln f) dx = 1.

Thus

    E{ θ̂(X) (∂/∂θ)(ln f) } = 1.

Define the random variables

    U = θ̂(X)    and    S = (∂/∂θ)(ln f).

Then E(US) = 1. We already know that the score function has expectation zero, E(S) = 0. Consequently Cov(U, S) = E(US) − E(U)E(S) = E(US) = 1.

Setting Cov(U, S) = 1 we get

    {Corr(U, S)}² = {Cov(U, S)}² / [Var(U) Var(S)] ≤ 1.

This implies

    Var(U) Var(S) ≥ 1,

and hence

    Var(θ̂) ≥ 1/I(θ),

which is our main result. We call { I(θ) }^(−1) the Cramér-Rao lower bound (CRLB).


Sufficient conditions for the proof of the CRLB are that all the integrands are finite within the range of x, and that the limits of the integrals do not depend on θ. That is, the range of x, i.e. the support of f(x|θ), cannot depend on θ. This second condition is violated for many density functions; for example, the CRLB is not valid for the uniform distribution on (0, θ). We can make an absolute assessment of an unbiased estimator by comparing its variance to the CRLB. We can also assess biased estimators in this way: if a biased estimator has variance lower than the CRLB it may still be a very good estimator, despite its bias.

Example 2.1. Consider IID random variables Xi, i = 1, . . . , n, with

    f_{Xi}(x_i|µ) = (1/µ) exp(−x_i/µ).

Denote the joint distribution of X1, . . . , Xn by

    f = Π_{i=1}^n f_{Xi}(x_i|µ) = (1/µ)^n exp( −(1/µ) Σ_{i=1}^n x_i ),

so that

    ln f = −n ln(µ) − (1/µ) Σ_{i=1}^n x_i.

The score function is the partial derivative of ln f wrt the unknown parameter µ,

    S(µ) = (∂/∂µ) ln f = −n/µ + (1/µ²) Σ_{i=1}^n x_i,

and

    E{S(µ)} = E{ −n/µ + (1/µ²) Σ_{i=1}^n Xi } = −n/µ + (1/µ²) E( Σ_{i=1}^n Xi ).

For X ∼ Exp(1/µ), we have E(X) = µ, implying E(X1 + · · · + Xn) = E(X1) + · · · + E(Xn) = nµ and E{S(µ)} = 0 as required. The expected information is

    I(µ) = −E{ (∂/∂µ)[ −n/µ + (1/µ²) Σ_{i=1}^n Xi ] }
         = −E{ n/µ² − (2/µ³) Σ_{i=1}^n Xi }
         = −n/µ² + (2/µ³) E( Σ_{i=1}^n Xi )
         = −n/µ² + 2nµ/µ³ = n/µ².

Hence

    CRLB = µ²/n.

Let us propose µ̂ = X̄ as an estimator of µ. Then

    E(µ̂) = E( (1/n) Σ_{i=1}^n Xi ) = (1/n) E( Σ_{i=1}^n Xi ) = µ,

verifying that µ̂ = X̄ is indeed an unbiased estimator of µ. For X ∼ Exp(1/µ), we have E(X) = µ and Var(X) = µ², implying

    Var(µ̂) = (1/n²) Σ_{i=1}^n Var(Xi) = nµ²/n² = µ²/n.

We have just shown that Var(µ̂) = { I(µ) }^(−1), and therefore conclude that the unbiased estimator µ̂ = X̄ achieves its CRLB. □

Definition 2.9 (Efficiency). Define the efficiency of the unbiased estimator θ̂ as

    eff(θ̂) = CRLB / Var(θ̂),

where CRLB = { I(θ) }^(−1). Clearly 0 < eff(θ̂) ≤ 1. An unbiased estimator θ̂ is said to be efficient if eff(θ̂) = 1. □

Definition 2.10 (Asymptotic efficiency). The asymptotic efficiency of an unbiased estimator θ̂ is the limit of the efficiency as n → ∞. An unbiased estimator θ̂ is said to be asymptotically efficient if its asymptotic efficiency is equal to 1. □

2.4 Maximum Likelihood Estimation

Let x be a realization of the random variable X with probability density f_X(x|θ), where θ = (θ1, θ2, . . . , θm)^T is a vector of m unknown parameters to be estimated. The set of allowable values for θ, denoted by Ω, or sometimes by Ωθ, is called the parameter space. Define the likelihood function

    L(θ|x) = f_X(x|θ).    (2.2)

It is crucial to stress that the argument of f_X(x|θ) is x, but the argument of L(θ|x) is θ. It is therefore convenient to view the likelihood function L(θ) as the probability of the observed data x considered as a function of θ. Usually it is convenient to work with the natural logarithm of the likelihood, called the log-likelihood, denoted by

    ℓ(θ|x) = ln L(θ|x).


When θ ∈ R we can define the score function as the first derivative of the log-likelihood,

    S(θ) = (∂/∂θ) ln L(θ).

The maximum likelihood estimate (MLE) θ̂ of θ is the solution to the score equation

    S(θ) = 0.

At the maximum, the second partial derivative of the log-likelihood is negative, so we define the curvature at θ̂ as I(θ̂), where

    I(θ) = −(∂²/∂θ²) ln L(θ).

We can check that a solution θ̂ of the equation S(θ) = 0 is actually a maximum by checking that I(θ̂) > 0. A large curvature I(θ̂) is associated with a tight or strong peak, intuitively indicating less uncertainty about θ. In likelihood theory I(θ) is a key quantity called the observed Fisher information, and I(θ̂) is the observed Fisher information evaluated at the MLE θ̂. Although I(θ) is a function, I(θ̂) is a scalar.

The likelihood function L(θ|x) supplies an order of preference or plausibility among possible values of θ based on the observed x. It ranks the plausibility of possible values of θ by how probable they make the observed x. If P(x|θ = θ1) > P(x|θ = θ2) then the observed x makes θ = θ1 more plausible than θ = θ2, and consequently, from (2.2), L(θ1|x) > L(θ2|x). The likelihood ratio L(θ1|x)/L(θ2|x) = f(x|θ1)/f(x|θ2) is a measure of the plausibility of θ1 relative to θ2 based on the observed x. The relative likelihood L(θ1|x)/L(θ2|x) = k means that the observed value x will occur k times more frequently in repeated samples from the population defined by the value θ1 than from the population defined by θ2. Since only ratios of likelihoods are meaningful, it is convenient to standardize the likelihood with respect to its maximum. Define the relative likelihood as R(θ|x) = L(θ|x)/L(θ̂|x). The relative likelihood varies between 0 and 1. The MLE θ̂ is the most plausible value of θ in that it makes the observed sample most probable. The relative likelihood measures the plausibility of any particular value of θ relative to that of θ̂.

When the random variables X1, . . . , Xn are mutually independent we can write the joint density as

    f_X(x) = Π_{j=1}^n f_{Xj}(x_j),

where x = (x1, . . . , xn)′ is a realization of the random vector X = (X1, . . . , Xn)′, and the likelihood function becomes

    L_X(θ|x) = Π_{j=1}^n f_{Xj}(x_j|θ).

When the densities f_{Xj}(x_j) are identical, we unambiguously write f(x_j).


Example 2.2 (Bernoulli Trials). Consider n independent Bernoulli trials. The jth observation is either a “success” or “failure” coded xj = 1 and xj = 0 respectively, and

    P(Xj = xj) = θ^{x_j} (1 − θ)^(1−x_j)

for j = 1, . . . , n. The vector of observations y = (x1, x2, . . . , xn)^T is a sequence of ones and zeros, and is a realization of the random vector Y = (X1, X2, . . . , Xn)^T. As the Bernoulli outcomes are assumed to be independent we can write the joint probability mass function of Y as the product of the marginal probabilities, that is

    L(θ) = Π_{j=1}^n P(Xj = xj) = Π_{j=1}^n θ^{x_j} (1 − θ)^(1−x_j) = θ^{Σ x_j} (1 − θ)^{n − Σ x_j} = θ^r (1 − θ)^(n−r),

where r = Σ_{j=1}^n x_j is the number of observed successes (1's) in the vector y. The log-likelihood function is then

    ℓ(θ) = r ln θ + (n − r) ln(1 − θ),

and the score function is

    S(θ) = (∂/∂θ) ℓ(θ) = r/θ − (n − r)/(1 − θ).

Solving S(θ̂) = 0 we get θ̂ = r/n. The information function is

    I(θ) = r/θ² + (n − r)/(1 − θ)² > 0  ∀ θ,

guaranteeing that θ̂ is the MLE. Each Xi is a Bernoulli random variable and has expected value E(Xi) = θ and variance Var(Xi) = θ(1 − θ). The MLE θ̂(y) is itself a random variable and has expected value

    E(θ̂) = E(r/n) = E( Σ_{i=1}^n Xi / n ) = (1/n) Σ_{i=1}^n E(Xi) = (1/n) Σ_{i=1}^n θ = θ,

implying that θ̂(y) is an unbiased estimator of θ. The variance of θ̂(y) is

    Var(θ̂) = Var( Σ_{i=1}^n Xi / n ) = (1/n²) Σ_{i=1}^n Var(Xi) = (1/n²) Σ_{i=1}^n (1 − θ)θ = (1 − θ)θ/n.

Finally, note that

    E[I(θ)] = E(r)/θ² + (n − E(r))/(1 − θ)² = n/θ + n/(1 − θ) = n/[θ(1 − θ)] = { Var[θ̂] }^(−1),

and θ̂ attains the Cramér-Rao lower bound (CRLB). □
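The closed form θ̂ = r/n can also be recovered numerically, which is how MLEs are found when no closed form exists. The Python sketch below is our own illustration (a simple grid search stands in for the Newton-Raphson method treated later in these notes); it maximizes ℓ(θ) = r ln θ + (n − r) ln(1 − θ) for r = 7, n = 20.

    from math import log

    def bernoulli_loglik(theta, r, n):
        # log-likelihood of n Bernoulli trials with r successes
        return r * log(theta) + (n - r) * log(1 - theta)

    r, n = 7, 20
    grid = [i / 10000 for i in range(1, 10000)]   # theta values in (0, 1)
    theta_hat = max(grid, key=lambda t: bernoulli_loglik(t, r, n))
    print(theta_hat, r / n)   # both approximately 0.35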


Example 2.3 (Binomial sampling). The number of successes in n Bernoulli trials is a random variable R taking on values r = 0, 1, . . . , n with probability mass function

    P(R = r) = (n choose r) θ^r (1 − θ)^(n−r).

This is the exact same sampling scheme as in the previous example except that instead of observing the sequence y we only observe the total number of successes r. Hence the likelihood function has the form

    L_R(θ|r) = (n choose r) θ^r (1 − θ)^(n−r).

The relevant mathematical calculations are as follows:

    ℓ_R(θ|r) = ln(n choose r) + r ln(θ) + (n − r) ln(1 − θ),
    S(θ) = r/θ − (n − r)/(1 − θ)  ⇒  θ̂ = r/n,
    I(θ) = r/θ² + (n − r)/(1 − θ)² > 0  ∀ θ,
    E(θ̂) = E(r)/n = nθ/n = θ  ⇒  θ̂ unbiased,
    Var(θ̂) = Var(r)/n² = nθ(1 − θ)/n² = θ(1 − θ)/n,
    E[I(θ)] = E(r)/θ² + (n − E(r))/(1 − θ)² = nθ/θ² + (n − nθ)/(1 − θ)² = n/[θ(1 − θ)] = { Var[θ̂] }^(−1),

and θ̂ attains the Cramér-Rao lower bound (CRLB). □

Example 2.4 (Germinating seeds). Suppose 25 seeds were planted and r = 5 seeds germinated. Then θ̂ = r/n = 0.2 and Var(θ̂) = 0.2 × 0.8/25 = 0.0064. The relative likelihood is

    R1(θ) = (θ/0.2)^5 ((1 − θ)/0.8)^20.

Suppose 100 seeds were planted and r = 20 seeds germinated. Then θ̂ = r/n = 0.2 but Var(θ̂) = 0.2 × 0.8/100 = 0.0016. The relative likelihood is

    R2(θ) = (θ/0.2)^20 ((1 − θ)/0.8)^80.

Suppose 25 seeds were planted and it is known only that r ≤ 5 seeds germinated. In this case the exact number of germinating seeds is unknown. The information about θ is given by the likelihood function

    L(θ) = P(R ≤ 5) = Σ_{r=0}^5 (25 choose r) θ^r (1 − θ)^(25−r).

Here, the most plausible value for θ is θ̂ = 0, implying L(θ̂) = 1. The relative likelihood is R3(θ) = L(θ)/L(θ̂) = L(θ). R1(θ) is plotted as the dashed curve in Figure 2.4.1, R2(θ) as the dotted curve, and R3(θ) as the solid curve.

[Figure 2.4.1: Relative likelihood functions for seed germinating probabilities, plotted against θ (germinating probability, 0 to 0.7): r = 5, n = 25 (dashed); r = 20, n = 100 (dotted); r ≤ 5, n = 25 (solid).] □
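The three relative-likelihood curves in Figure 2.4.1 are easy to reproduce. The sketch below is our own illustration (matplotlib is assumed to be available, and the grid of θ values is arbitrary); it evaluates R1, R2 and R3 and plots them.

    from math import comb
    import matplotlib.pyplot as plt

    def R1(t):  # r = 5, n = 25
        return (t / 0.2) ** 5 * ((1 - t) / 0.8) ** 20

    def R2(t):  # r = 20, n = 100
        return (t / 0.2) ** 20 * ((1 - t) / 0.8) ** 80

    def R3(t):  # r <= 5, n = 25: L(theta) = P(R <= 5), maximised at theta = 0 where L = 1
        return sum(comb(25, r) * t ** r * (1 - t) ** (25 - r) for r in range(6))

    thetas = [i / 1000 for i in range(0, 701)]
    plt.plot(thetas, [R1(t) for t in thetas], "--", label="r = 5, n = 25")
    plt.plot(thetas, [R2(t) for t in thetas], ":", label="r = 20, n = 100")
    plt.plot(thetas, [R3(t) for t in thetas], "-", label="r <= 5, n = 25")
    plt.xlabel("θ (germinating probability)")
    plt.ylabel("Relative likelihood")
    plt.legend()
    plt.show()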

Example 2.5 (Prevalence of a Genotype). Geneticists interested in the prevalence of a certain genotype observe that the genotype makes its first appearance in the 22nd subject analysed. If we assume that the subjects are independent, the likelihood function can be computed based on the geometric distribution, as L(θ) = (1 − θ)^(n−1) θ. The score function is then S(θ) = θ^(−1) − (n − 1)(1 − θ)^(−1). Setting S(θ̂) = 0 we get θ̂ = n^(−1) = 22^(−1). The observed Fisher information equals I(θ) = θ^(−2) + (n − 1)(1 − θ)^(−2) and is greater than zero for all θ, implying that θ̂ is the MLE.

Suppose that the geneticists had planned to stop sampling once they observed r = 10 subjects with the specified genotype, and the tenth subject with the genotype was the 100th subject analysed overall. The likelihood of θ can be computed based on the negative binomial distribution, as

    L(θ) = (n−1 choose r−1) θ^r (1 − θ)^(n−r)

for n = 100, r = 10. The usual calculation will confirm that θ̂ = r/n is the MLE. □


Example 2.6 (Radioactive Decay). In this classic set of data Rutherford and Geiger counted the number of scintillations in 7.5-second intervals caused by radioactive decay of a quantity of the element polonium. Altogether there were 10097 scintillations during 2608 such intervals:

    Count     0    1    2    3    4    5    6    7
    Observed  57   203  383  525  532  408  273  139

    Count     8    9    10   11   12   13   14
    Observed  45   27   10   4    0    1    1

The Poisson probability mass function with mean parameter θ is

    f_X(x|θ) = θ^x exp(−θ) / x!.

The likelihood function equals

    L(θ) = Π θ^{x_i} exp(−θ) / x_i! = θ^{Σ x_i} exp(−nθ) / Π x_i!.

The relevant mathematical calculations are

    ℓ(θ) = (Σ x_i) ln(θ) − nθ − ln[Π(x_i!)],
    S(θ) = (Σ x_i)/θ − n  ⇒  θ̂ = (Σ x_i)/n = x̄,
    I(θ) = (Σ x_i)/θ² > 0  ∀ θ,

implying θ̂ is the MLE. Also E(θ̂) = (1/n) Σ E(x_i) = (1/n) Σ θ = θ, so θ̂ is an unbiased estimator. Next Var(θ̂) = (1/n²) Σ Var(x_i) = (1/n²) nθ = θ/n and E[I(θ)] = n/θ = (Var[θ̂])^(−1), implying that θ̂ attains the theoretical CRLB. It is always useful to compare the fitted values from a model against the observed values.

    i          0    1    2    3    4    5    6    7    8    9    10   11   12   13   14
    O_i        57   203  383  525  532  408  273  139  45   27   10   4    0    1    1
    E_i        54   211  407  525  508  393  254  140  68   29   11   4    1    0    0
    O_i − E_i  +3   -8   -24  0    +24  +15  +19  -1   -23  -2   -1   0    -1   +1   +1

The Poisson law agrees with the observed variation within about one-twentieth of its range. □
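The fitted values E_i above come from plugging θ̂ = x̄ into the Poisson pmf and scaling by the number of intervals. The short Python sketch below is our own check, not part of the original notes; it reproduces θ̂ and the expected counts.

    from math import exp, factorial

    observed = [57, 203, 383, 525, 532, 408, 273, 139, 45, 27, 10, 4, 0, 1, 1]
    n_intervals = sum(observed)                                 # 2608
    total_counts = sum(i * o for i, o in enumerate(observed))   # 10097
    theta_hat = total_counts / n_intervals                      # MLE = sample mean, about 3.87

    expected = [n_intervals * theta_hat**i * exp(-theta_hat) / factorial(i)
                for i in range(len(observed))]
    for i, (o, e) in enumerate(zip(observed, expected)):
        print(i, o, round(e))   # compare observed and fitted Poisson counts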


Example 2.7 (Exponential distribution). Suppose random variables X1, . . . , Xn are i.i.d. as Exp(θ). Then

    L(θ) = Π_{i=1}^n θ exp(−θ x_i) = θ^n exp(−θ Σ x_i),
    ℓ(θ) = n ln θ − θ Σ x_i,
    S(θ) = n/θ − Σ_{i=1}^n x_i  ⇒  θ̂ = n / Σ x_i,
    I(θ) = n/θ² > 0  ∀ θ.

In order to work out the expectation and variance of θ̂ we need to work out the probability distribution of Z = Σ_{i=1}^n Xi, where Xi ∼ Exp(θ). From the appendix on probability theory we have Z ∼ Ga(θ, n). Then

    E(1/Z) = ∫_0^∞ (1/z) θ^n z^(n−1) exp(−θz) / Γ(n) dz
           = [θ²/Γ(n)] ∫_0^∞ (θz)^(n−2) exp(−θz) dz
           = [θ/Γ(n)] ∫_0^∞ u^(n−2) exp(−u) du
           = θ Γ(n − 1)/Γ(n) = θ/(n − 1).

We now return to our estimator

    θ̂ = n / Σ_{i=1}^n x_i = n/Z,

which implies

    E[θ̂] = E(n/Z) = n E(1/Z) = nθ/(n − 1),

which turns out to be biased. Propose the alternative estimator θ̃ = ((n − 1)/n) θ̂. Then

    E[θ̃] = E[ ((n − 1)/n) θ̂ ] = ((n − 1)/n) E[θ̂] = ((n − 1)/n)(n/(n − 1)) θ = θ

shows θ̃ is an unbiased estimator.

As this example demonstrates, maximum likelihood estimation does not automatically produce unbiased estimates. If it is thought that this property is (in some sense) desirable, then some adjustments to the MLEs, usually in the form of scaling, may be required. We conclude this example with the following tedious (but straightforward) calculations.

    E(1/Z²) = ∫_0^∞ (1/z²) θ^n z^(n−1) exp(−θz) / Γ(n) dz
            = [θ^n/Γ(n)] ∫_0^∞ z^(n−3) exp(−θz) dz
            = [θ²/Γ(n)] ∫_0^∞ u^(n−3) exp(−u) du
            = θ² Γ(n − 2)/Γ(n) = θ²/[(n − 1)(n − 2)].

We have already calculated that E[θ̃] = θ. However,

    Var[θ̃] = E[θ̃²] − E[θ̃]² = (n − 1)² E(1/Z²) − θ²
            = (n − 1)² θ²/[(n − 1)(n − 2)] − θ² = θ²/(n − 2).

Also

    I(θ) = n/θ²  ⇒  E[I(θ)] = n/θ²,

which is not equal to (Var[θ̃])^(−1). The efficiency is

    eff(θ̃) = ( E[I(θ)] )^(−1) / Var[θ̃] = (θ²/n) ÷ (θ²/(n − 2)) = (n − 2)/n,

which, although not equal to 1, converges to 1 as n → ∞, and θ̃ is asymptotically efficient. □
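The bias factor n/(n − 1) and the variance θ²/(n − 2) can be confirmed by simulation. The sketch below is our own illustration (θ = 2 and n = 5 are arbitrary); it compares the MLE θ̂ = n/Σxᵢ with the corrected estimator θ̃ = (n − 1)/Σxᵢ.

    import random

    theta, n, reps = 2.0, 5, 200_000
    rng = random.Random(4)
    sum_mle, sum_tilde, sum_tilde_sq = 0.0, 0.0, 0.0
    for _ in range(reps):
        z = sum(rng.expovariate(theta) for _ in range(n))
        sum_mle += n / z
        t = (n - 1) / z
        sum_tilde += t
        sum_tilde_sq += t * t

    print(sum_mle / reps, n / (n - 1) * theta)    # E[theta_hat] vs n*theta/(n-1)
    print(sum_tilde / reps, theta)                # E[theta_tilde] vs theta
    print(sum_tilde_sq / reps - (sum_tilde / reps) ** 2,
          theta**2 / (n - 2))                     # Var[theta_tilde] vs theta^2/(n-2)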

Example 2.8 (Lifetime of a component). The time to failure T of components has an exponential distribution with mean µ. Suppose n components are tested for 100 hours and that m components failed at times t1, . . . , tm, with n − m components surviving the 100 hour test. The likelihood function can be written

    L(µ) = [ Π_{i=1}^m (1/µ) e^(−t_i/µ) ] × [ Π_{j=m+1}^n P(Tj > 100) ],

where the first product is over the components that failed and the second over the components that survived. Clearly P(T ≤ t) = 1 − e^(−t/µ) implies that P(T > 100) = e^(−100/µ) is the probability of a component surviving the 100 hour test. Then

    L(µ) = [ Π_{i=1}^m (1/µ) e^(−t_i/µ) ] [ e^(−100/µ) ]^(n−m),
    ℓ(µ) = −m ln µ − (1/µ) Σ_{i=1}^m t_i − (1/µ) 100(n − m),
    S(µ) = −m/µ + (1/µ²) Σ_{i=1}^m t_i + (1/µ²) 100(n − m).

Setting S(µ̂) = 0 suggests the estimator µ̂ = [ Σ_{i=1}^m t_i + 100(n − m) ] / m. Also, I(µ̂) = m/µ̂² > 0, and µ̂ is indeed the MLE. Although failure times were recorded for just m components, this example usefully demonstrates that all n components contribute to the estimation of the mean failure parameter µ. The n − m surviving components are often referred to as right censored. □
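The censored-data MLE above is easy to compute. The sketch below is our own illustration with made-up failure times (the cut-off of 100 hours follows the example); it evaluates µ̂ = [Σ tᵢ + 100(n − m)]/m.

    def censored_exp_mle(failure_times, n_tested, censor_time=100.0):
        # MLE of the mean lifetime when n_tested - m components survive past censor_time
        m = len(failure_times)
        return (sum(failure_times) + censor_time * (n_tested - m)) / m

    # Hypothetical data: 6 of 10 components failed before 100 hours.
    times = [12.0, 35.5, 48.0, 60.2, 77.9, 91.0]
    print(censored_exp_mle(times, n_tested=10))   # (sum of times + 100*4)/6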

Example 2.9 (Gaussian Distribution). Consider data X1, X2, . . . , Xn distributed as N(µ, υ). Then the likelihood function is

    L(µ, υ) = (1/√(2πυ))^n exp{ −Σ_{i=1}^n (x_i − µ)² / (2υ) },

and the log-likelihood function is

    ℓ(µ, υ) = −(n/2) ln(2π) − (n/2) ln(υ) − (1/(2υ)) Σ_{i=1}^n (x_i − µ)².    (2.3)

Unknown mean and known variance: As υ is known we treat this parameter as a constant when differentiating wrt µ. Then

    S(µ) = (1/υ) Σ_{i=1}^n (x_i − µ),    µ̂ = (1/n) Σ_{i=1}^n x_i,    and    I(µ) = n/υ > 0  ∀ µ.

Also, E[µ̂] = nµ/n = µ, and so the MLE of µ is unbiased. Finally

    Var[µ̂] = (1/n²) Var( Σ_{i=1}^n x_i ) = υ/n = ( E[I(µ)] )^(−1).

Known mean and unknown variance: Differentiating (2.3) wrt υ returns

    S(υ) = −n/(2υ) + (1/(2υ²)) Σ_{i=1}^n (x_i − µ)²,

and setting S(υ) = 0 implies

    υ̂ = (1/n) Σ_{i=1}^n (x_i − µ)².

Differentiating again, and multiplying by −1, yields the information function

    I(υ) = −n/(2υ²) + (1/υ³) Σ_{i=1}^n (x_i − µ)².

Clearly υ̂ is the MLE since

    I(υ̂) = n/(2υ̂²) > 0.

Define

    Z_i = (X_i − µ)/√υ,

so that Z_i ∼ N(0, 1). From the appendix on probability,

    Σ_{i=1}^n Z_i² ∼ χ²_n,

implying E[Σ Z_i²] = n and Var[Σ Z_i²] = 2n. The MLE is

    υ̂ = (υ/n) Σ_{i=1}^n Z_i².

Then

    E[υ̂] = E[ (υ/n) Σ_{i=1}^n Z_i² ] = υ,

and

    Var[υ̂] = (υ/n)² Var( Σ_{i=1}^n Z_i² ) = 2υ²/n.

Finally,

    E[I(υ)] = −n/(2υ²) + (1/υ³) Σ_{i=1}^n E[(x_i − µ)²] = −n/(2υ²) + nυ/υ³ = n/(2υ²).

Hence the CRLB = 2υ²/n, and so υ̂ has efficiency 1. □

Our treatment of the two parameters of the Gaussian distribution in the last example was to (i) fix the variance and estimate the mean using maximum likelihood; and then (ii) fix the mean and estimate the variance using maximum likelihood. In practice we would like to consider the simultaneous estimation of these parameters. In the next section of these notes we extend MLE to multiple parameter estimation.


2.5 Multi-parameter Estimation<br />

Suppose that a statistical model specifies that the data y has a probability distribution<br />

f(y; α, β) depending on two unknown parameters α and β. In this case the likelihood<br />

function is a function of the two variables α and β and having observed the value y is<br />

defined as L(α, β) = f(y; α, β) with ℓ(α, β) = ln L(α, β). The MLE of (α, β) is a value<br />

(ˆα, ˆ β) for which L(α, β) , or equivalently ℓ(α, β) , attains its maximum value.<br />

Define S1(α, β) = ∂ℓ/∂α and S2(α, β) = ∂ℓ/∂β. The MLEs (ˆα, ˆ β) can be obtained<br />

by solving the pair of simultaneous equations :<br />

S1(α, β) = 0<br />

S2(α, β) = 0<br />

The information matrix I(α, β) is defined to be the matrix :<br />

�<br />

I11(α, β)<br />

I(α, β) =<br />

I21(α, β)<br />

� �<br />

I12(α, β)<br />

= −<br />

I22(α, β)<br />

∂2 ∂α2 ℓ ∂2 ∂α∂β ℓ<br />

∂2 ∂β∂α ℓ ∂2 ∂β2 ℓ<br />

The conditions for a value (α0, β0) satisfying S1(α0, β0) = 0 and S2(α0, β0) = 0 to<br />

be a MLE are that<br />

and<br />

I11(α0, β0) > 0, I22(α0, β0) > 0,<br />

det(I(α0, β0) = I11(α0, β0)I22(α0, β0) − I12(α0, β0) 2 > 0.<br />

This is equivalent to requiring that both eigenvalues of the matrix I(α0, β0) be positive.<br />

Example 2.10 (Gaussian distribution). Let X1, X2 . . . , Xn be iid observations from a<br />

N (µ, v) density in which both µ and v are unknown. The log likelihood is<br />

ℓ(µ, v) =<br />

n�<br />

�<br />

1<br />

ln √ exp [−<br />

2πv i=1<br />

1<br />

2v (xi − µ) 2 =<br />

�<br />

]<br />

n�<br />

�<br />

−<br />

i=1<br />

1 1 1<br />

ln [2π] − ln [v] −<br />

2 2 2v (xi − µ) 2<br />

�<br />

= − n<br />

n�<br />

n 1<br />

ln [2π] − ln [v] − (xi − µ)<br />

2 2 2v<br />

2 .<br />

Hence<br />

implies that<br />

S1(µ, v) = ∂ℓ<br />

∂µ<br />

ˆµ = 1<br />

n<br />

= 1<br />

v<br />

i=1<br />

n�<br />

(xi − µ) = 0<br />

i=1<br />

n�<br />

xi = ¯x. (2.4)<br />

i=1<br />

28<br />


Also<br />

implies that<br />

S2(µ, v) = ∂ℓ<br />

∂v<br />

ˆv = 1<br />

n<br />

n�<br />

i=1<br />

n 1<br />

= − +<br />

2v 2v2 (xi − ˆµ) 2 = 1<br />

n<br />

n�<br />

(xi − µ) 2 = 0<br />

i=1<br />

n�<br />

(xi − ¯x) 2 . (2.5)<br />

Calculating second derivatives and multiplying by −1 gives that the information matrix<br />

I(µ, v) equals<br />

I(µ, v) =<br />

⎛<br />

⎜<br />

⎝<br />

1<br />

v 2<br />

i=1<br />

i=1<br />

n<br />

1<br />

v<br />

v2 n�<br />

(xi − µ)<br />

i=1<br />

n�<br />

(xi − µ) − n<br />

2v2 + 1<br />

v3 n�<br />

(xi − µ) 2<br />

⎞<br />

⎟<br />

⎠<br />

Hence I(ˆµ, ˆv) is given by : �<br />

n<br />

ˆv 0<br />

0 n<br />

2v2 �<br />

Clearly both diagonal terms are positive and the determinant is positive and so (ˆµ, ˆv)<br />

are, indeed, the MLEs of (µ, v).<br />

Go back to equation (2.4), and ¯ X ∼ N (µ, v/n). Clearly E( ¯ X) = µ (unbiased) and<br />

Var( ¯ X) = v/n, so ¯ X achieved the CRLB. Go back to equation (2.5). Then from lemma<br />

2.9 we have<br />

so that<br />

nˆv<br />

v ∼ χ2 n−1<br />

i=1<br />

� �<br />

nˆv<br />

E = n − 1<br />

v<br />

� �<br />

n − 1<br />

⇒ E(ˆv) = v<br />

n<br />

Instead, propose the (unbiased) estimator<br />

Observe that<br />

E(˜v) =<br />

˜v =<br />

n<br />

ˆv =<br />

n − 1<br />

� �<br />

n<br />

E(ˆv) =<br />

n − 1<br />

1<br />

n − 1<br />

n�<br />

(xi − ¯x) 2<br />

i=1<br />

� n<br />

n − 1<br />

� � n − 1<br />

and ˜v is unbiased as suggested. We can easily show that<br />

Hence<br />

Var(˜v) =<br />

2v 2<br />

(n − 1)<br />

n<br />

�<br />

v = v<br />

(2.6)<br />

eff(˜v) = 2v2 2v2 1<br />

÷ = 1 −<br />

n (n − 1) n<br />

Clearly ˜v is not efficient, but is asymptotically efficient. �<br />

29


Lemma 2.9 (Joint distribution of the sample mean and sample variance). If<br />

X1, . . . , Xn are iid N (µ, v) then the sample mean ¯ X and sample variance S 2 /(n − 1)<br />

are independent. Also ¯ X is distributed N (µ, v/n) and S 2 /v is a chi-squared random<br />

variable with n − 1 degrees of freedom.<br />

Proof. Define<br />

⇒<br />

W =<br />

n�<br />

(Xi − ¯ X) 2 =<br />

i=1<br />

W<br />

v + ( ¯ X − µ) 2<br />

v/n<br />

=<br />

n�<br />

(Xi − µ) 2 − n( ¯ X − µ) 2<br />

i=1<br />

n� (Xi − µ) 2<br />

The RHS is the sum of n independent standard normal random variables squared, and<br />

so is distributed χ 2 ndf Also, ¯ X ∼ N (µ, v/n), therefore ( ¯ X − µ) 2 /(v/n) is the square of<br />

a standard normal and so is distributed χ2 1df These Chi-Squared random variables have<br />

moment generating functions (1 − 2t) −n/2 and (1 − 2t) −1/2 respectively. Next, W/v and<br />

( ¯ X − µ) 2 /(v/n) are independent:<br />

Cov(Xi − ¯ X, ¯ X) = Cov(Xi, ¯ X) − Cov( ¯ X, ¯ X)<br />

�<br />

= Cov Xi, 1 �<br />

�<br />

Xj − Var(<br />

n<br />

¯ X)<br />

= 1 �<br />

Cov(Xi, Xj) −<br />

n<br />

v<br />

n<br />

j<br />

= v v<br />

−<br />

n n<br />

i=1<br />

= 0<br />

But, Cov(Xi − ¯ X, ¯ X − µ) = Cov(Xi − ¯ X, ¯ X) = 0 , hence<br />

�<br />

Cov(Xi − ¯ X, ¯ �<br />

�<br />

X − µ) = Cov (Xi − ¯ X), ¯ �<br />

X − µ = 0<br />

i<br />

As the moment generating function of the sum of independent random variables is<br />

equal to the product of their individual moment generating functions, we see<br />

E � e t(W/v)� (1 − 2t) −1/2 = (1 − 2t) −n/2<br />

⇒ E � e t(W/v)� = (1 − 2t) −(n−1)/2<br />

But (1 − 2t) −(n−1)/2 is the moment generating function of a χ 2 random variables with<br />

(n−1) degrees of freedom, and the moment generating function uniquely characterizes<br />

the random variable S = (W/v).<br />

30<br />

i<br />

v


Suppose that a statistical model specifies that the data x has a probability distribu-<br />

tion f(x; θ) depending on a vector of m unknown parameters θ = (θ1, . . . , θm). In this<br />

case the likelihood function is a function of the m parameters θ1, . . . , θm and having<br />

observed the value of x is defined as L(θ) = f(x; θ) with ℓ(θ) = ln L(θ).<br />

The MLE of θ is a value ˆ θ for which L(θ), or equivalently ℓ(θ), attains its maximum<br />

value. For r = 1, . . . , m define Sr(θ) = ∂ℓ/∂θr. Then we can (usually) find the MLE<br />

ˆθ by solving the set of m simultaneous equations Sr(θ) = 0 for r = 1, . . . , m. The<br />

information matrix I(θ) is defined to be the m×m matrix whose (r, s) element is given<br />

by Irs where Irs = −∂ 2 ℓ/∂θr∂θs. The conditions for a value ˆ θ satisfying Sr( ˆ θ) = 0 for<br />

r = 1, . . . , m to be a MLE are that all the eigenvalues of the matrix I( ˆ θ) are positive.<br />

2.6 Newton-Raphsom Optimization<br />

Example 2.11 (Radioactive Scatter). A radioactive source emits particles intermittently<br />

and at various angles. Let X denote the cosine of the angle of emission. The angle<br />

of emission can range from 0 degrees to 180 degrees and so X takes values in [−1, 1].<br />

Assume that X has density<br />

1 + θx<br />

f(x|θ) =<br />

2<br />

for −1 ≤ x ≤ 1 where θ ∈ [−1, 1] is unknown. Suppose the data consist of n indepen-<br />

dently identically distributed measures of X yielding values x1, x2, ..., xn. Here<br />

L(θ) = 1<br />

2 n<br />

n�<br />

(1 + θxi)<br />

i=1<br />

l(θ) = −n ln [2] +<br />

S(θ) =<br />

I(θ) =<br />

n�<br />

i=1<br />

n�<br />

i=1<br />

xi<br />

1 + θxi<br />

n�<br />

ln [1 + θxi]<br />

i=1<br />

x 2 i<br />

(1 + θxi) 2<br />

Since I(θ) > 0 for all θ, the MLE may be found by solving the equation S(θ) = 0. It<br />

is not immediately obvious how to solve this equation.<br />

By Taylor’s Theorem we have<br />

0 = S( ˆ θ)<br />

= S(θ0) + ( ˆ θ − θ0)S ′ (θ0) + ( ˆ θ − θ0) 2 S ′′ (θ0)/2 + ....<br />

31


So, if | ˆ θ − θ0| is small, we have that<br />

and hence<br />

0 ≈ S(θ0) + ( ˆ θ − θ0)S ′ (θ0),<br />

ˆθ ≈ θ0 − S(θ0)/S ′ (θ0)<br />

= θ0 + S(θ0)/I(θ0)<br />

We now replace θ0 by this improved approximation θ0 +S(θ0)/I(θ0) and keep repeating<br />

the process until we find a value ˆ θ for which |S( ˆ θ)| < ɛ where ɛ is some prechosen small<br />

number such as 0.000001. This method is called Newton’s method for solving a nonlinear<br />

equation.<br />

If a unique solution to S(θ) = 0 exists, Newton’s method works well regardless of<br />

the choice of θ0. When there are multiple solutions, the solution to which the algorithm<br />

converges depends crucially on the choice of θ0. In many instances it is a good idea<br />

to try various starting values just to be sure that the method is not sensitive to the<br />

choice of θ0.<br />

One approach to finding an initial estimate θ0 would be to use the Method of<br />

Moments. This involves solving the equation E(X) = ¯x for θ. For the previous example<br />

E(X) =<br />

� 1<br />

−1<br />

x(1 + θx)<br />

dx =<br />

2<br />

θ<br />

3<br />

and so θ0 = 3¯x might yield a good choice for a starting value.<br />

Suppose the following 10 values were recorded:<br />

0.00, 0.23, −0.05, 0.01, −0.89, 0.19, 0.28, 0.51, −0.25 and 0.27.<br />

Then ¯x = 0.03, and we substitute θ0 = .09 into the updating formula<br />

ˆθnew = θ old +<br />

� n�<br />

⇒ θ1 = 0.2160061<br />

θ2 = 0.2005475<br />

θ3 = 0.2003788<br />

θ4 = 0.2003788<br />

i=1<br />

xi<br />

1 + θ old xi<br />

� � n�<br />

i=1<br />

x 2 i<br />

(1 + θ old xi) 2<br />

The relative likelihood function is plotted in Figure 2.6.2. �<br />

32<br />

� −1


Relative Likelihood<br />

0.4 0.6 0.8 1.0<br />

Radioactive Scatter<br />

−1.0 −0.5 0.0 0.5 1.0<br />

θ<br />

θ ^ = 0.2003788<br />

Figure 2.6.2: Relative likelihood for the radioactive scatter, solved by Newton Raphson.<br />

A Weibull random variable with ‘shape’ parameter a > 0 and ‘scale’ parameter<br />

b > 0 has density<br />

fT (t) = (a/b)(t/b) a−1 exp{−(t/b) a }<br />

for t ≥ 0. The (cumulative) distribution function is<br />

FT (t) = 1 − exp{−(t/b) a }<br />

on t ≥ 0. Suppose that the time to failure T of components has a Weibull distribution<br />

and after testing n components for 100 hours, m components fail at times t1, . . . , tm,<br />

with n − m components surviving the 100 hour test. The likelihood function can be<br />

written<br />

L(θ) =<br />

m�<br />

� �a−1 � � �a� a ti<br />

ti<br />

exp −<br />

b b<br />

b<br />

i=1<br />

� �� �<br />

components failed<br />

33<br />

n�<br />

� � �a� 100<br />

exp − .<br />

b<br />

j=m+1<br />

� �� �<br />

components survived


Then the log-likelihood function is<br />

ℓ(a, b) = m ln (a) − ma ln (b) + (a − 1)<br />

m�<br />

ln (ti) −<br />

i=1<br />

n�<br />

(ti/b) a ,<br />

where for convenience we have written tm+1 = · · · = tn = 100. This yields score<br />

functions<br />

and<br />

Sa(a, b) = m<br />

a<br />

− m ln (b) +<br />

m�<br />

ln (ti) −<br />

i=1<br />

Sb(a, b) = − ma<br />

b<br />

+ a<br />

b<br />

i=1<br />

n�<br />

(ti/b) a ln (ti/b) ,<br />

It is not obvious how to solve Sa(a, b) = Sb(a, b) = 0 for a and b.<br />

i=1<br />

n�<br />

(ti/b) a . (2.7)<br />

When the m equations Sr(θ) = 0, r = 1, . . . , m cannot be solved directly numerical<br />

optimization is required. Let S(θ) be the m × 1 vector whose rth element is Sr(θ). Let<br />

ˆθ be the solution to the set of equations S(θ) = 0 and let θ0 be an initial guess at ˆ θ.<br />

Then a first order Taylor’s series approximation to the function S about the point θ0<br />

is given by<br />

Sr( ˆ m�<br />

θ) ≈ Sr(θ0) + ( ˆ θj − θ0j) ∂Sr<br />

(θ0)<br />

∂θj<br />

for r = 1, 2, . . . , m which may be written in matrix notation as<br />

j=1<br />

i=1<br />

S( ˆ θ) ≈ S(θ0) − I(θ0)( ˆ θ − θ0).<br />

Requiring S( ˆ θ) = 0, this last equation can be reorganized to give<br />

ˆθ ≈ θ0 + I(θ0) −1 S(θ0). (2.8)<br />

Thus given θ0 this is a method for finding an improved guess at ˆ θ. We then replace θ0<br />

by this improved guess and repeat the process. We keep repeating the process until we<br />

obtain a value θ ∗ for which |Sr(θ ∗ )| is less than ɛ for r = 1, 2, . . . , m where ɛ is some<br />

small number like 0.0001. θ ∗ will be an approximate solution to the set of equations<br />

S(θ) = 0. We then evaluate the matrix I(θ ∗ ) and if all m of its eigenvalues are positive<br />

we set ˆ θ = θ ∗ .<br />

For the Weibull distribution<br />

Iaa = m<br />

+<br />

a2 n�<br />

i=1<br />

Ibb = − ma a(a + 1)<br />

+<br />

b2 b2 Iab = Iba = m<br />

b<br />

(ti/b) a [ln (ti/b)] 2 ,<br />

− 1<br />

b<br />

n�<br />

(ti/b) a ,<br />

i=1<br />

n�<br />

(ti/b) a [a ln (ti/b) + 1] .<br />

i=1<br />

34


a0 b0 a ∗ b ∗ Steps Eigenvalues<br />

1.8 74.5 1.924941 78.12213 4 all positive<br />

2.36 72 1.924941 78.12213 5 all positive<br />

2 70 1.924941 78.12213 5 all positive<br />

1 2 1.924941 78.12213 15 all positive<br />

80 1 1.924941 78.12213 387 all positive<br />

2 100 4.292059 −34.591322 2 crashed<br />

Table 2.6.1: Newton-Raphson estimates of the Weibull parameters.<br />

Given initial values a0 and b0 equation (2.8) can be used to obtained updated estimates<br />

via<br />

�<br />

anew<br />

bnew<br />

�<br />

=<br />

�<br />

a0<br />

b0<br />

�<br />

+<br />

�<br />

Iaa(a0, b0) Iab(a0, b0)<br />

Iba(a0, b0) Ibb(a0, b0)<br />

� −1 �<br />

Sa(a0, b0)<br />

Sb(a0, b0)<br />

�<br />

. (2.9)<br />

Having solved for a and b we check that a maximum has been achieved by calculating<br />

I(a ∗ , b ∗ ) and verifying that both of its eigenvalues are positive.<br />

Suppose n = 10 and m = 8 items fail at times 17, 29, 32, 55, 61, 74, 77, 93 (in hours)<br />

with the remaining two items surviving the 100 hour test. Table 2.6.1 shows estimates<br />

(a ∗ , b ∗ ) obtained from equation (2.9) using various starting values (a0, b0) along with<br />

the number of iteration steps taken until convergence using ɛ = 0.000001. Four of<br />

the starting pairs considered converged to estimates a ∗ = 1.924941 and b ∗ = 78.12213<br />

for a and b. The starting pairs (1, 2) and (80, 1) might misleadingly suggest that the<br />

algorithm is robust to the choice of starting values. However, the starting pair (2, 100)<br />

produced a negative estimate of b, and as ln(b) is required in the computation of the<br />

score function, caused the algorithm to crash. Other starting pairs not reported failed<br />

to yield estimates due to a non-invertible I(θ0) matrix being produced during the<br />

running of the algorithm. The remainder of this section describes commonly applied<br />

modifications of the Newton-Raphson method.<br />

Initial values<br />

The Newton-Raphson method depends on the initial guess being close to the true value.<br />

If this requirement is not satisfied the procedure might convergence to a minimum<br />

instead of a maximum, or just simply diverge and fail to produce any estimates at all.<br />

Methods of finding good initial estimates depend very much on the problem at hand<br />

and may require some ingenuity.<br />

The distribution function of a Weibull random variable T can be linearized as<br />

ln(− ln[1 − F (t)]) = −a ln(b) + a ln(t).<br />

35


A regression of the empirical distribution function ˆ F (t) on ln(t) can be used the recover<br />

initial estimates for a and b. The starting pair (1.8, 74.5) in Table 2.6.1 was obtained<br />

in this way.<br />

A more general method for finding initial values is the method of moments. This<br />

method estimates the kth moment about the origin E(T k ) by the sample moment<br />

˜mk = n −1 � t k i . For a Weibull distribution the kth moment about the origin equals<br />

b k Γ(1 + k/a) where Γ(c) = � ∞<br />

0 uc−1 e −u du is the usual gamma function satisfying<br />

Γ(c) = (c − 1)Γ(c − 1). It is easy to show that E(T 2 )/[E(T )] 2 = 2aΓ(2/a)/[Γ(1/a)] 2 . An<br />

estimate a0 of a can be obtained by solving ˜m 2 2/( ˜m1) 2 = 2aΓ(2/a)/[Γ(1/a)] 2 for a. The<br />

corresponding estimate of b is then b0 = ˜m1/Γ(1 + 1/a0). The starting pair (2.36, 72)<br />

in Table 2.6.1 was obtained in this way.<br />

Fisher’s method of scoring<br />

One modification to Newton’s method is to replace the matrix I(θ) in (2.8) with Ī(θ) =<br />

E [I(θ)] . The matrix Ī(θ) is positive definite, thus overcoming many of the problems<br />

regarding matrix inversion. Like Newton-Raphson, there is no guarantee that Fisher’s<br />

method of scoring will avoid producing negative parameter estimates or converging to<br />

local minima. Unfortunately, calculating Ī(θ) often can be mathematically difficult.<br />

The profile likelihood<br />

Setting Sb(a, b) = 0 from equation (2.7) we can solve for b in terms of a as<br />

�<br />

1<br />

b =<br />

m<br />

n�<br />

i=1<br />

t a i<br />

� 1/a<br />

. (2.10)<br />

This reduces our two parameter problem to a search over the “new” one-parameter<br />

profile log-likelihood<br />

ℓa(a) = ℓ(a, b(a)),<br />

�<br />

= m ln (a) − m ln<br />

1<br />

m<br />

n�<br />

i=1<br />

t a i<br />

�<br />

+ (a − 1)<br />

m�<br />

ln (ti) − m. (2.11)<br />

Given an initial guess a0 for a, an improved estimate a1 can be obtained using a<br />

single parameter Newton-Raphson updating step a1 = a0 + S(a0)/I(a0), where S(a)<br />

and I(a) are now obtained from ℓa(a). The estimates â = 1.924941 and ˆ b = 78.12213<br />

were obtained by applying this method to the Weibull data using starting values a0 =<br />

0.001 and a0 = 5.8 in 16 and 13 iterations respectively. However, the starting value<br />

a0 = 5.9 produced the sequence of estimates a1 = −5.544163, a2 = 8.013465, a3 =<br />

−16.02908, a4 = 230.0001 and subsequently crashed.<br />

36<br />

i=1


Reparameterization<br />

Negative parameter estimates can be avoided by reparameterizing the profile log-<br />

likelihood in (2.11) using α = ln(a). Since a = e α we are guaranteed to obtain a > 0.<br />

The reparameterized profile log-likelihood becomes<br />

�<br />

n� 1<br />

ℓα(α) = mα − m ln t<br />

m<br />

eα<br />

�<br />

i + (e α − 1)<br />

implying score function<br />

and information function<br />

I(α) =<br />

i=1<br />

S(α) = m − meα � n<br />

meα �n i=1 teα i<br />

i=1 teα<br />

�n i=1 teα i<br />

�<br />

n�<br />

t eα<br />

i ln(ti) − e α [� n<br />

i=1<br />

i ln(ti)<br />

+ e α<br />

+e α<br />

m�<br />

ln (ti) − m,<br />

i=1<br />

m�<br />

ln(ti)<br />

i=1<br />

i=1 teα i ln(ti)] 2<br />

�n i=1 teα i<br />

n�<br />

i=1<br />

t eα<br />

i [ln(ti)] 2<br />

�<br />

− e α<br />

m�<br />

ln(ti).<br />

The estimates â = 1.924941 and ˆ b = 78.12213 were obtained by applying this method<br />

to the Weibull data using starting values a0 = 0.07 and a0 = 76 in 103 and 105<br />

iterations respectively. However, the starting values a0 = 0.06 and a0 = 77 failed due<br />

to division by computationally tiny (1.0e-300) values.<br />

The step-halving scheme<br />

The Newton-Raphson method uses the (first and second) derivatives of ℓ(θ) to max-<br />

imize the function ℓ(θ), but the function itself is not used in the algorithm. The<br />

log-likelihood can be incorporated into the Newton-Raphson method by modifying the<br />

updating step to<br />

θi+1 = θi + λiI(θi) −1 S(θi), (2.12)<br />

where the search direction has been multiplied by some λi ∈ (0, 1] chosen so that the<br />

inequality<br />

ℓ � θi + λiI(θi) −1 S(θi) � > ℓ (θi) (2.13)<br />

holds. This requirement protects the algorithm from converging towards minima or<br />

saddle points. At each iteration the algorithm sets λi = 1, and if (2.13) does not<br />

hold λi is replaced with λi/2. The process is repeated until the inequality in (2.13) is<br />

satisfied. At this point the parameter estimates are updated using (2.12) with the value<br />

of λi for which (2.13) holds. If the function ℓ(θ) is concave and unimodal convergence<br />

is guaranteed. Finally, when<br />

maxima is guaranteed, even if ℓ(θ) is not concave.<br />

Ī(θ) is used in place of I(θ) convergence to a (local)<br />

37<br />

i=1


2.7 The Invariance Principle<br />

How do we deal with parameter transformation? We will assume a one-to-one trans-<br />

formation, but the idea applied generally. Consider a binomial sample with n = 10<br />

independent trials resulting in data x = 8 successes. The likelihood ratio of θ1 = 0.8<br />

versus θ2 = 0.3 is<br />

L(θ1 = 0.8)<br />

L(θ2 = 0.3) = θ8 1(1 − θ1) 2<br />

θ8 = 208.7 ,<br />

2(1 − θ2) 2<br />

that is, given the data θ = 0.8 is about 200 times more likely than θ = 0.3.<br />

Suppose we are interested in expressing θ on the logit scale as<br />

ψ ≡ ln{θ/(1 − θ)} ,<br />

then ‘intuitively’ our relative information about ψ1 = ln(0.8/0.2) = 1.29 versus ψ2 =<br />

ln(0.3/0.7) = −0.85 should be<br />

L ∗ (ψ1)<br />

L ∗ (ψ2)<br />

= L(θ1)<br />

L(θ2)<br />

= 208.7 .<br />

That is, our information should be invariant to the choice of parameterization. ( For<br />

the purposes of this example we are not too concerned about how to calculate L ∗ (ψ). )<br />

Theorem 2.10 (Invariance of the MLE). If g is a one-to-one function, and ˆ θ is<br />

the MLE of θ then g( ˆ θ) is the MLE of g(θ).<br />

Proof. This is trivially true as we let θ = g −1 (µ) then f{y|g −1 (µ)} is maximized in µ<br />

exactly when µ = g( ˆ θ). When g is not one-to-one the discussion becomes more subtle,<br />

but we simply choose to define ˆgMLE(θ) = g( ˆ θ)<br />

It seems intuitive that if ˆ θ is most likely for θ and our knowledge (data) remains<br />

unchanged then g( ˆ θ) is most likely for g(θ). In fact, we would find it strange if ˆ θ is an<br />

estimate of θ, but ˆ θ 2 is not an estimate of θ 2 . In the binomial example with n = 10<br />

and x = 8 we get ˆ θ = 0.8, so the MLE of g(θ) = θ/(1 − θ) is<br />

g( ˆ θ) = ˆ θ/(1 − ˆ θ) = 0.8/0.2 = 4.<br />

This convenient property is not necessarily true of other estimators. For example, if ˆ θ<br />

is the MVUE of θ, then g( ˆ θ) is generally not MVUE for g(θ).<br />

Frequentists generally accept the invariance principle without question. This is<br />

not the case for intelligent lifeforms such as Bayesians. The invariance property of<br />

the likeihood ratio is incompatible with the Bayesian habit of assigning a probability<br />

distribution to a parameter.<br />

38


2.8 Optimality Properties of the MLE<br />

Suppose that an experiment consists of measuring random variables x1, x2, . . . , xn which<br />

are iid with probability distribution depending on a parameter θ. Let ˆ θ be the MLE<br />

of θ. Define<br />

W1 = � E[I(θ)]( ˆ θ − θ)<br />

W2 = � I(θ)( ˆ θ − θ)<br />

�<br />

W3 = E[I( ˆ θ)]( ˆ θ − θ)<br />

�<br />

W4 = I( ˆ θ)( ˆ θ − θ).<br />

Then, W1, W2, W3, and W4 are all random variables and, as n → ∞, the probabilistic<br />

behaviour of each of W1, W2, W3, and W4 is well approximated by that of a N(0, 1)<br />

random variable. Then, since E[W1] ≈ 0, we have that E[ ˆ θ] ≈ θ and so ˆ θ is approx-<br />

imately unbiased. Also Var[W1] ≈ 1 implies that Var[ ˆ θ] ≈ (E[I(θ)]) −1 and so ˆ θ is<br />

approximately efficient.<br />

Let the data X have probability distribution g(X; θ) where θ = (θ1, θ2, . . . , θm) is a<br />

vector of m unknown parameters. Let I(θ) be the m×m information matrix as defined<br />

above and let E[I(θ)] be the m × m matrix obtained by replacing the elements of I(θ)<br />

by their expected values. Let ˆ θ be the MLE of θ. Let CRLBr be the rth diagonal<br />

element of [E[I(θ)]] −1 . For r = 1, 2, . . . , m, define W1r = ( ˆ θr − θr)/ √ CRLBr. Then, as<br />

n → ∞, W1r behaves like a standard normal random variable.<br />

Suppose we define W2r by replacing CRLBr by the rth diagonal element of the<br />

matrix [I(θ)] −1 , W3r by replacing CRLBr by the rth diagonal element of the matrix<br />

[EI( ˆ θ)] −1 and W4r by replacing CRLBr by the rth diagonal element of the matrix<br />

[I( ˆ θ)] −1 . Then it can be shown that as n → ∞, W2r, W3r, and W4r all behave like<br />

standard normal random variables.<br />

2.9 Data Reduction<br />

Definition 2.11 (Sufficiency). Consider a statistic T = t(X) that summarises the<br />

data so that no information about θ is lost. Then we call t(X) a sufficient statistic. �<br />

Example 2.12. T = t(X) = ¯ X is sufficient for µ when Xi ∼ iid N(µ, σ 2 ). �<br />

<strong>To</strong> better understand the motivation behind the concept of sufficiency consider<br />

three independent Binomial trials where θ = P (X = 1).<br />

39


Event Probability Set<br />

0 0 0 (1 − θ) 3 A0<br />

1 0 0<br />

0 1 0<br />

0 0 1<br />

0 1 1<br />

1 0 1<br />

1 1 0<br />

θ(1 − θ) 2 A1<br />

θ 2 (1 − θ) A2<br />

1 1 1 θ 3 A4<br />

Knowing which Ai the sample is in carries all the information about θ. Which<br />

particular sample within Ai gives us no extra information about θ. Extra information<br />

about other aspect of the model maybe, but not about θ. Here T = t(X) = � Xi<br />

equals the number of “successes”, and identifies Ai. Mathematically we can express<br />

the above concept by saying that the probability P (X = x|Ai) does not depend on θ.<br />

i.e. P (010|A1; θ) = 1/3. More generally, a statistic T = t(X) is said to be sufficient<br />

for the parameter θ if Pr (X = x|T = t) does not depend on θ. Sufficient statistics are<br />

most easily recognized through the following fundamental result:<br />

Theorem 2.11 (Neyman’s Factorization Criterion). A statistic T = t(X) is<br />

sufficient for θ if and only if the family of densities can be factorized as<br />

f(x; θ) = h(x)k {t(x); θ} , x ∈ X , θ ∈ Θ. (2.14)<br />

i.e. into a function which does not depend on θ and one which only depends on x<br />

through t(x). This is true in general. We will prove it in the case where X is discrete.<br />

Proof. Assume T is sufficient and let h(x) = Pθ {X = x|T = t(x)} be independent of<br />

θ. Let k {t; θ} = Pθ(T = t). Then<br />

f(x; θ) = Pθ {X = x|T = t(x)} Pθ {T = t(x)} = h(x)k {θ, t(x)} .<br />

Conversely assume the result in (2.14) to be true. Then<br />

Pθ (X = x|T = t) =<br />

which is independent of θ.<br />

=<br />

=<br />

h(x)k {t(x); θ}<br />

1{x:t(x)=t}(x),<br />

h(y)k {t(y); θ}<br />

�<br />

y:t(y)=t<br />

h(x)k {t; θ}<br />

k {t; θ} �<br />

1{x:t(x)=t}(x),<br />

y:t(y)=t h(y)<br />

h(x)<br />

�<br />

y:t(y)=t<br />

40<br />

h(y) 1{x:t(x)=t}(x),


Example 2.13 (Poisson). Let X = (X1, . . . , Xn) be independent and Poisson dis-<br />

tributed with mean λ so that<br />

f(x; λ) =<br />

n�<br />

i=1<br />

λxi xi! e−λ = λΣxi<br />

�<br />

i xi! e−nλ .<br />

Take k { � xi; λ} = λ Σxi e −nλ and h(x) = ( � xi!) −1 , then t(x) = �<br />

i xi is sufficient. �<br />

Example 2.14 (Binomial). Let X = (X1, . . . , Xn) be independent and Bernoulli dis-<br />

tributed with parameter θ so that<br />

f(x; θ) =<br />

n�<br />

θ xi 1−xi Σxi n−Σxi<br />

(1 − θ) = θ (1 − θ)<br />

i=1<br />

Take k { � xi; θ} = θ Σxi (1 − θ) n−Σxi and h(x) = 1, then t(x) = �<br />

i xi is sufficient. �<br />

Example 2.15 (Uniform). Factorization criterion works in general but can mess up if<br />

the pdf depends on θ. Let X = (X1, . . . , Xn) be independent and Uniform distributed<br />

with parameter θ so that X1, X2, . . . , Xn ∼ Unif(0, θ). Then<br />

f(x; θ) = 1<br />

θ n<br />

0 ≤ xi ≤ θ ∀ i.<br />

It is not at all obvious but t(x) = max(xi) is a sufficient statistic. We have to show<br />

that f(x|t) is independent of θ. Well<br />

Then<br />

So<br />

f(x|t) =<br />

f(x, t)<br />

fT (t) .<br />

P (T ≤ t) = P (X1 ≤ t, . . . , Xn ≤ t)<br />

n�<br />

�<br />

t<br />

= P (Xi ≤ t) =<br />

θ<br />

i=1<br />

FT (t) = tn<br />

θn ⇒ fT (t) = ntn−1<br />

θn Also<br />

f(x, t) = 1<br />

θn ≡ f(x; θ).<br />

Hence<br />

f(x|t) =<br />

1<br />

,<br />

ntn−1 and is independent of θ. �<br />

41<br />

� n


2.10 Worked Problems<br />

The Problems<br />

1. The continuous random variable T has probability density function<br />

f(t) = λe −λt , t > 0; λ > 0.<br />

(a) Show that the cumulative distribution function is given by<br />

(b) Deduce that<br />

F (t) = 1 − e −λt , t > 0; λ > 0.<br />

P (a ≤ b) = e −λa − e −λb , 0 < a < b.<br />

(c) The accounts manager for a building society firm assumes that the time T<br />

taken to settle invoices is a random variable with the pdf f(t) given above<br />

for some unknown value of λ. For a random sample of 100 invoices, he finds<br />

that 50 are settled within one week, 35 are settled during the second week<br />

and 15 are settles after 2 weeks. Explain clearly why the likelihood function<br />

of these data may be written as<br />

where k is a constant.<br />

L(λ) = k(1 − e −λ ) 50 � e −λ e −2λ� 35 (1 − e −2λ ) 15 ,<br />

(d) Show that the maximum likelihood estimate ˆ λ of λ is approximately 0.836.<br />

(e) Using ˆ λ = 0.836, calculate the expected number of invoices settled within<br />

the first week, settled during the second week, and settled after 2 weeks.<br />

Hence comment briefly on how well the model fits the data.<br />

2. A finite population has nA individuals of type A and nB individuals of type<br />

B. The overall number is known to be 100, but the size of the two groups is<br />

unknown. If a sample of 5 individuals is taken without replacement, write down<br />

the likelihood for nA in the following three situations:<br />

(a) the observed sequence is A, B, A, B, A ;<br />

(b) the observed sequence is A, A, B, B, A ;<br />

(c) the precise sequence of the outcomes is not given, but it is known that three<br />

elements are As, two are Bs.<br />

Compare the likelihood functions and comment on your findings.<br />

42


3. A certain type of plant cell may appear in any one of four versions. According to<br />

a genetic theory, the four versions have the following probabilities of appearance:<br />

1/2 + θ/4, (1 − θ)/4, (1 − θ)/4, θ/4,<br />

where θ is a parameter not specified by the genetic theory (0 < θ < 1). If a<br />

sample of n cells had observed frequencies (a, b, c, d) where n = a + b + c + d, find<br />

the maximum likelihood estimate of θ, and an estimate of Var( ˆ θ).<br />

4. Let X1, X2, . . . , Xn be a random sample from a population with probability den-<br />

sity<br />

f(x) =<br />

(a) Show that E(X 2 ) = θ.<br />

�<br />

2<br />

πθ exp<br />

�<br />

− x2<br />

�<br />

, x > 0.<br />

2θ<br />

(b) Find the maximum likelihood estimator (MLE), ˆ θ, of θ.<br />

(c) Show that ˆ θ is an unbiased estimator of θ and that the Cramér-Rao lower<br />

bound is attained. [You may assume Var(X 2 ) = 2θ 2 .]<br />

(d) Suppose now that φ = √ θ is the parameter of interest. Without undertaking<br />

further calculations, write down the MLE of φ and explain why it is a biased<br />

estimator of φ.<br />

5. A random sample X1, X2, . . . , Xn is available from a Poisson distribution with<br />

mean θ. Define the parameter λ = e −θ .<br />

(a) Find the maximum likelihood estimator (MLE), ˆ θ, of θ. Hence deduce the<br />

MLE of ˆ λ, of λ.<br />

(b) Find the variance of ˆ θ, and deduce the approximate variance of ˆ λ using the<br />

delta method.<br />

(c) An alternative estimator of λ is ˜ λ, defines as the observed proportion of zero<br />

observations. Find the bias of ˜ λ and show that<br />

Var( ˜ λ) = e−θ � 1 − e−θ� .<br />

n<br />

(d) Draw a rough sketch of the efficiency of ˜ λ relative to ˆ λ, and discuss its<br />

properties.<br />

6. Let X1, . . . , Xn be independent and identically distributed as N (µ, σ 2 ), where µ<br />

and σ 2 are both unknown, and assume n = 2p + 1 is odd. Define the estimators<br />

ˆµ = ¯ X = (X1 + · · · + Xn)/n, ˜µ = median(X1, . . . , Xn) = X(p+1), and<br />

�<br />

�<br />

�<br />

ˆσ = S = � n �<br />

(Xi − ¯ X) 2 /(n − 1).<br />

i=1<br />

43


(a) Show that ˆµ is an unbiased estimator of µ.<br />

(b) Show that ˜µ is an unbiased estimator of µ. You should exploit the identity<br />

X(p+1) − µ = (X − µ)(p+1) = − � �<br />

(µ − X)(p+1) and the symmetry of the<br />

distribution of Xi − µ.<br />

(c) Show that E (ˆσ) < σ, and consequently ˆσ is a biased estimator of σ.<br />

(d) Find an unbiased estimator of σ of the form ˇσ = cS in the case p = 1, i.e.<br />

when n = 3.<br />

(e) Repeat (d), but for general n.<br />

7. Let X have density<br />

where θ > 0 is assumed unknown.<br />

(a) Show that c(θ) = θ 3 /2.<br />

f(x|θ) = c(θ)x 2 e −θx , x > 0,<br />

(b) Show that ˜ θ = 2/X is an unbiased estimator of θ and find its variance.<br />

(c) Find the Fisher information i(θ) for the parameter θ and compare the vari-<br />

ance of ˜ θ to the Cramer-Rao lower bound.<br />

(d) Let µ = θ −1 and show that ˆµ = X/3 is an unbiased estimator of µ.<br />

(e) Find the variance of ˆµ and show that it attains the Cramer-Rao lower bound.<br />

8. Let X1, . . . , Xn be a random sample from a population density function f(x|θ),<br />

where θ is a parameter. Let S = S(X1, . . . , Xn) be a sufficient statistic for θ.<br />

(a) What can be said about the conditional distribution of X1, . . . , Xn given<br />

S = s?<br />

(b) State the factorisation theorem for sufficient statistics.<br />

(c) Suppose now that<br />

f(x|θ) = xθ−1e−x , x > 0,<br />

Γ(θ)<br />

where Γ(·) is the gamma function and θ > 0 is a positive parameter. Show<br />

that<br />

is a sufficient statistic for θ.<br />

S =<br />

44<br />

n�<br />

ln Xi<br />

i=1


Outline Solutions<br />

1. The cdf is F (t) = P (T ≤ t) = � t<br />

0 λe−λυ dυ = λ � − 1<br />

λ e−λυ� t<br />

0 = 1 − e−λυ . Next<br />

P (a ≤ T ≤ b) = F (b) − F (a) = e −λa − e −λb . Assume all settlements are inde-<br />

pendent. Then P (50 in first week) = {F (1)} 50 = (1 − e −λ ) 50 , because T ≤ 1<br />

for these 50 settlements. Likewise, 1 < T ≤ 2, for the 35 in the second week,<br />

so we have P (35 in second week) = {F (2) − F (1)} 35 = (e −λ − e −2λ ) 35 . The re-<br />

maining 15 have T > 2, which has probability 1 − P (T ≤ 2) = e −2λ , and thus<br />

P (15 after week two) = (e −2λ ) 15 . The likelihood function is therefore the product<br />

L(λ) = (1 − e −λ ) 50 (e −λ − e −2λ ) 35 (e −2λ ) 15 . Taking logarithms (always based e),<br />

∴ d<br />

dλ<br />

ln L(λ) = 50 ln(1 − e −λ ) + 35 ln � e −λ (1 − e −λ ) � + 15 ln(e −2λ )<br />

= 85 ln(1 − e −λ ) − (35 + 30)λ = 85 ln(1 − e −λ ) − 65λ.<br />

85e−λ<br />

85<br />

ln L(λ) = − 65 =<br />

1 − e−λ e−λ − 65.<br />

− 1<br />

Equating to zero, 85 = 65(e −λ − 1) or e λ = 150/65, so that ˆ λ = ln(150/65) =<br />

0.836. This is indeed a maximum; e.g. d2<br />

dλ 2 ln L(λ) = −85/(e λ − 1) 2 < 0 ∀ λ. Next<br />

1 − e −0.836 = 0.5666; e −0.836 − e −1.672 = 0.43344 − 0.18787 = 0.2456. Hence out of<br />

100 invoices, 56.66, 24.56 and 18.78 would be expected to be paid, on this model,<br />

in weeks 1, 2 and later. The actual numbers were 50, 35 and 15. The prediction<br />

for the second week is a long way from what happened, balanced by smaller<br />

discrepancies in the other two periods. This does not seem very satisfactory.<br />

2. nA + nB = n ⇒ nB = n − nA. Suppose we observe the sequence A, B, A, B, A,<br />

then L1(nA) = nA<br />

n<br />

× n−nA<br />

n−1<br />

× nA−1<br />

n−2<br />

× n−nA−1<br />

n−3<br />

sequence A, A, B, B, A, then L2(nA) = nA<br />

n<br />

nA−2<br />

× . Next, suppose we observe the<br />

n−4<br />

× nA−1<br />

n−1<br />

× n−nA<br />

n−2<br />

× n−nA−1<br />

n−3<br />

nA−2<br />

× n−4<br />

. If it<br />

is known that 3As and 2Bs are drawn but the exact sequence is unknown then<br />

L3(nA) = P (Y = y|nA) = � �� � � � nA n−nA n<br />

/ , where y = 3. This third likelihood<br />

y 5−y 5<br />

function expands to give L3(nA) = 10 × nA(nA−1)(nA−2)(n−nA)(n−nA−1)<br />

. Clearly<br />

n(n−1)(n−2)(n−3)(n−4)<br />

L1(nA) = L2(nA) = L3(nA) ÷ 10. The first two likelihood functions are identical.<br />

The third likelihood function is a constant times the other two, and as only the<br />

ratio of likelihood functions are meaningful, L3(nA) carries the same information<br />

about our preferences for the parameter nA as the other functions.<br />

3. L(θ) = 4 −n (2 + θ) a (1 − θ) b+c (θ) d so, ℓ(θ) = −n ln(4)+a ln(2+θ)+(b+c) ln(1−<br />

θ) + d ln(θ). Differentiating we get S(θ) = a b+c d − + and setting S(θ) = 0<br />

2+θ 1−θ θ<br />

leads to the quadratic equation nθ2 − {a − 2b − 2c − d}θ − 2d = 0, of which<br />

the positive root, ˆ θ, satisfies the condition of maximum likelihood. If S(θ) is<br />

differentiated again with respect to θ, and expected values substituted for a, b, c,<br />

and d, we obtain Var( ˆ θ) ≈ {E[I(θ)]} −1 = {I(θ)} −1 = 2θ(1−θ)(2+θ)<br />

(1+2θ)n<br />

45<br />

.


4. E(X2 ) = � ∞<br />

0 x2f(x)dx =<br />

� �<br />

2 = πθ<br />

�� � �<br />

2 n/2<br />

ℓ(θ) = ln exp −<br />

πθ<br />

� �<br />

2 ∞<br />

πθ 0 x2e−x2 /2θdx =<br />

x[−θe−x2 /2θ ∞ ] 0 + � ∞<br />

0 θe−x2 /2θdx � X 2 i<br />

2θ<br />

Setting this equal to zero gives ˆ θ = 1<br />

n<br />

� 2<br />

� �<br />

2 ∞<br />

πθ 0 x<br />

�<br />

xe−x2 �<br />

/2θ dx<br />

�<br />

= πθ [0 − 0] + θ � ∞<br />

f(x)dx = θ.<br />

0<br />

��<br />

= n<br />

2 ln � � �<br />

2 X2 i − so S(θ) = − πθ 2θ<br />

n 1 + 2θ 2θ2 � 2 Xi .<br />

� 2 Xi . It may be verified (by considering<br />

the second derivative) that this is indeed a maximum, and so is the MLE of θ.<br />

Since E(X 2 ) = θ, we immediately have E( ˆ θ) = θ, i.e. ˆ θ is unbiased for θ. Next,<br />

I(θ) = − n<br />

2θ 2 + nθ<br />

θ 3 = n<br />

2θ 2 , so the Cramér-Rao lower bound is 2θ 2 /n. Now, Var( ˆ θ) =<br />

2θ 2 /n (using the hint), and so the variance of ˆ θ attains the bound.<br />

φ = √ θ; so MLE of φ is ˆ φ = √ MLE of θ = �� X 2 i /n. Because φ is a non-linear<br />

transformation of θ, and ˆ θ is unbiased for θ, ˆ φ cannot be unbiased for φ.<br />

5. L(θ) = θ� X ie −nθ<br />

� Xi! , giving ℓ(θ) = ( � Xi) ln θ − nθ − ln( � Xi!) and S(θ) = � Xi<br />

so the MLE of θ is ˆ θ = 1<br />

n<br />

� Xi = ¯ X. Also I(θ) = � Xi<br />

θ 2<br />

θ<br />

− n,<br />

> 0 ∀ θ, so ˆ θ is indeed a<br />

maximum. By the “invariance property”, the MLE of λ = e −θ is ˆ λ = e −ˆ θ = e − ¯ X .<br />

The delta method gives that the variance of g( ˆ θ) is approximated by � � dg 2<br />

Var( θ) ˆ<br />

dθ<br />

evaluated at the mean of the distribution, which here is simply θ. So we need<br />

to obtain θ<br />

n<br />

� dg<br />

dθ<br />

� 2 with g(θ) = e −θ . This immediately gives dg<br />

dθ = −e−θ , so the<br />

approximate variance is θ<br />

n (−e−θ ) 2 = 1<br />

n θe−2θ .<br />

The number of zero observations is binomially distributed with p = e −θ = λ, i.e.<br />

Bin(n, λ). Thus ˜ λ, the proportion of zeros, has expected value λ, i.e. it is unbiased.<br />

Also we have Var( ˜ λ) = 1<br />

1<br />

λ(1 − λ) = n ne−θ (1 − e−θ ). Using the approximate<br />

variance from part (ii), the efficiency of ˜ λ relative to ˆ λ is given approximately by<br />

θe −2θ<br />

n<br />

n<br />

e θ (1−e θ )<br />

θ = eθ . If θ is small, the efficiency is near (but less than) unity; as<br />

−1<br />

θ increases, the efficiency decreases; as θ becomes large, the efficiency tends to 0.<br />

6. E( ¯ X) = E[(X1+X2+· · ·+Xn)/n] = [E(X1)+E(X2)+· · ·+E(Xn)]/n = nµ/n = µ.<br />

The distribution of Xi−µ is symmetric implying E[(X−µ)(p+1)] = E[(µ−X)(p+1)].<br />

Combining this with the identity X(p+1) − µ = (µ − X)(p+1) = −{(µ − X)(p+1)}<br />

yields E[X(p+1) − µ] = E[(X − µ)(p+1)] = E[(µ − X)(p+1)] = −E[(µ − X)(p+1)] = 0<br />

and hence that E(˜µ) = E[X(p+1)] = µ. Note that the argument applies to any<br />

distribution which is symmetric around µ.<br />

For any positive random variable Y the identity Var( √ Y ) = E(Y ) − [E( √ Y )] 2<br />

holds, so for Y = 1<br />

n<br />

� n<br />

i (Xi − ¯ X) 2 = SSD/n, clearly, Var( √ Y ) > 0, and it follows<br />

that E(ˆσ) = E( � (SSD)/n) < � E(SSD)/n) = σ, hence ˆσ is not unbiased.<br />

For n = 3, Z = SSD/σ 2 follows a χ 2 -distribution with 2 degrees of freedom,<br />

46


hence<br />

E( √ � ∞ √ 1<br />

Z) = z<br />

2Γ(1) e−z/2dz = 23/2Γ(3/2) 2Γ(1) = � π/2 ,<br />

0<br />

hence ˜σ = 2ˆσ/ √ π is unbiased for σ. If we let d = (n − 1) we similarly find for<br />

general n that<br />

E( √ � ∞ √<br />

Z) = z<br />

so<br />

0<br />

zd/2−1 2d/2Γ(d/2) e−z/2dz = 2(d+1)/2Γ((d + 1)/2)<br />

2d/2Γ(d/2) c =<br />

makes cS an unbiased estimator of σ.<br />

√ dΓ(d/2)<br />

√ 2Γ([d + 1]/2)<br />

7. First of all c(θ) −1 = � ∞<br />

0 x2 e −θx dx = Γ(3)/θ 3 = 2/θ 3 . Next, we get E( ˜ θ) =<br />

E(2/X) = θ 3 � ∞<br />

0 xe−θx dx = θ 3 θ −2 Γ(2) = θ and so ˜ θ is an unbiased estimator of θ.<br />

<strong>To</strong> get the variance we use E( ˜ θ 2 ) = 2θ 3 � ∞<br />

0 e−θx dx = 2θ 2 so Var( ˜ θ) = 2θ 2 −θ 2 = θ 2 .<br />

<strong>To</strong> find the Fisher information for θ, we first calculate ∂ ln f(x|θ) = −x + 3/θ,<br />

∂θ<br />

and then calculate − ∂2<br />

∂θ2 ln f(x|θ) = 3/θ2 . Thus eff( ˜ θ) = θ2 /(3θ2 ) = 1<br />

3 .<br />

E(ˆµ) = θ3<br />

6<br />

� ∞<br />

0 x3e−θxdx = θ3Γ(4) 6θ4 = 1/θ = µ so ˆµ is an unbiased estimator of µ.<br />

Similarly E(ˆµ 2 ) = θ3<br />

18<br />

� ∞<br />

0 x4 e −θx dx = θ3 Γ(5)<br />

θ 5 = 4<br />

3θ 2 so Var(ˆµ) = 1<br />

3θ 2 . <strong>To</strong> calculate<br />

the CRLB we find for g(θ) = 1/θ that g ′ (θ) = −θ −2 so the lower bound is<br />

I(θ) −1 {g ′ (θ) 2 } = θ2<br />

3 θ−4 = 1<br />

= Var(ˆµ).<br />

3θ2 8. The formal definition of a sufficient statistic (S) is that the conditional distribu-<br />

tion of the sample X = (X1, X2, . . . , Xn) given the value of S, that is knowing<br />

that S = s, does not depend on θ.<br />

If f(x|θ) is the joint distribution of the sample X, the statistic S is sufficient for<br />

θ if and only if there exists function g(s|θ) and h(x) such that, for all sample<br />

points {xi} and all θ, the density factorises f(x|θ) = g(s|θ)h(x).<br />

f(x|θ) =<br />

n�<br />

x<br />

i=1<br />

θ−1<br />

i e−xi {Γ(θ)} n<br />

= exp[(θ − 1) � ln(xi)]<br />

{Γ(θ)} n<br />

× e − � xi<br />

↑ ↑<br />

�� �<br />

g ln(xi)|θ h(x)<br />

Thus, by the factorisation theorem, S = n�<br />

ln(Xi) is sufficient for θ.<br />

47<br />

i=1<br />

,


<strong>Student</strong> Questions<br />

1. Let X1, X2, . . . , Xn be iid with density f(x|θ) = θx θ−1 for 0 ≤ x ≤ 1.<br />

(a) Write down the likelihood function L(θ) and the log likelihood function ℓ(θ).<br />

(b) Derive ˆ θ the MLE of θ. Calculate the information function I(θ).<br />

(c) Show that ˆ θ is biased and calculate its bias function b(θ). HINT : Let<br />

Zi = − log[Xi] for i = 1, . . . , n and show that Z1, . . . , Zn are iid with density<br />

θ exp(−θz) for z ≥ 0. Show that as n → ∞ the bias converges to 0.<br />

(d) Based on the calculations in (a) suggest an unbiased estimate ˜ θ for θ. Cal-<br />

culate the variance of this unbiased estimate and its efficiency. Show that<br />

as n → ∞ the efficiency tends to 1.<br />

(e) Calculate the expected squared error for both ˆ θ and ˜ θ and compare them.<br />

(f) Suppose n = 10 and the data values are .21, .32, .45, .52, .58, .63, .65, .68,<br />

.70, and .72. Calculate the values of ˆ θ and ˜ θ based on these data.<br />

2. Let X1, . . . , Xn be iid with density f(x|θ) = θ 2 x exp (−θx) for x ≥ 0.<br />

(a) Write down the log likelihood function ℓ(θ).<br />

(b) Derive ˆ θ, the MLE of θ. Calculate the information function I(θ).<br />

(c) Suppose n = 10 and the data values are: .17, .28, .41, .48, .53, .58, .60, .63,<br />

.65 and .67. Calculate the MLE of θ and the information function.<br />

3. Let X1, . . . , Xn be iid with density f(x|β) = βx β−1 exp(−x β ) for x ≥ 0.<br />

(a) Write down the likelihood and log likelihood functions L(β) and ℓ(β).<br />

(b) Explain how Newton’s method may be used to find the MLE of β.<br />

(c) Suppose n = 10 and the data values are: .32, .35, .65, .65, .66, .74, .74, .82,<br />

1.00 and 1.80 . Write a program (preferably in R) to carry out Newton’s<br />

method. Try various starting values and report on what happens.<br />

4. Let X1, X2, . . . , Xn be iid with density<br />

f(x|θ) = 2θ2<br />

(x + θ) 3<br />

for x ≥ 0 where θ > 0 is an unknown parameter. Write down the log likelihood<br />

function ℓ(θ). Calculate the information function I(θ). Explain in detail how you<br />

would calculate the MLE of θ.<br />

48


5. Let X1, . . . , Xn be iid with density<br />

fX(x|θ) = 1<br />

exp [−x/θ]<br />

θ<br />

for 0 ≤ x < ∞. Let Y1, y . . . , Ym be iid with density<br />

for 0 ≤ y < ∞.<br />

fY (y|θ, λ) = λ<br />

exp [−λy/θ]<br />

θ<br />

(a) Write down the likelihood and log-likelihood functions L(θ, λ) and ℓ(θ, λ).<br />

(b) Derive ( ˆ θ, ˆ λ) the MLE of (θ, λ). Calculate the information matrix I = I(θ, λ).<br />

(c) Show that ˆ θ is unbiased but that ˆ λ is biased. Suggest an alternative ˜ λ to ˆ λ<br />

which is unbiased. Show that ˆ θ has efficiency 1. Calculate the efficiency of<br />

˜λ and show that as m → ∞ the efficiency converges to 1.<br />

(d) Suppose n = 11 and the average of the data values x1, x2, . . . , x11 is 2.0.<br />

Suppose m = 5 and the average of the data values y1, y2, . . . , y5 is 4.0.<br />

Calculate the maximum likelihood estimate ( ˆ θ, ˆ λ). Evaluate the information<br />

matrix at the point ( ˆ θ, ˆ λ) and show that both of its eigenvalues are positive.<br />

Calculate ˜ λ – the unbiased alternative to ˆ λ derived in (c).<br />

6. Let X1, . . . , Xn be iid observations from a Poisson distribution with mean θ. Let<br />

Y1, . . . , Ym be iid observations from a Poisson distribution with mean λθ.<br />

(a) Write down the likelihood log-likelihood functions L(θ, λ) and ℓ(θ, λ).<br />

(b) Derive ( ˆ θ, ˆ λ) the MLE of (θ, λ). Calculate the information matrix I = I(θ, λ).<br />

(c) Suppose n = 10 and the average of the data values x1, x2, . . . , x10 is 2.0.<br />

Suppose m = 5 and the average of the data values y1, y2, . . . , y5 is 4.0.<br />

Calculate the maximum likelihood estimate ( ˆ θ, ˆ λ). Evaluate the information<br />

matrix at the point ( ˆ θ, ˆ λ) and show that both of its eigenvalues are positive.<br />

7. Let Y be the number of particles emitted by a radioactive source in 1 minute. Y<br />

is thought to have a Poisson distribution whose mean θ is given by exp[α + βt]<br />

where t is the temperature of the source. The numbers of particles y1, y2, . . . , yn<br />

emitted in n 1 minute periods are observed; the temperature of the source for<br />

the ith period was ti. Assume that Y1, . . . , Yn are independent with Yi having a<br />

Poisson distribution with mean θi = exp[α + βti]. Derive an expression for the<br />

log likelihood ℓ(α, β). Suppose that you were required to find the MLE (ˆα, ˆ β).<br />

Clearly describe how you would perform this task. Your account should include<br />

the derivation of the likelihood equations and a detailed account of how these<br />

equations would be solved including how initial values for the iterative procedure<br />

involved would be obtained.<br />

49


Chapter 3<br />

The Theory of Confidence Intervals<br />

3.1 Exact Confidence Intervals<br />

Suppose that we are going to observe the value of a random vector X. Let X denote the<br />

set of possible values that X can take and, for x ∈ X , let g(x|θ) denote the probability<br />

that X takes the value x where the parameter θ is some unknown element of the set Θ.<br />

Consider the problem of quoting a subset of θ values which are in some sense plausible<br />

in the light of the data x. We need a procedure which for each possible value x ∈ X<br />

specifies a subset C(x) of Θ which we should quote as a set of plausible values for θ.<br />

Example 3.1. Suppose we are going to observe data x where x = (x1, x2, . . . , xn), and<br />

x1, x2, . . . , xn are the observed values of random variables X1, X2, . . . , Xn which are<br />

thought to be iid N(θ, 1) for some unknown parameter θ ∈ (−∞, ∞) = Θ. Consider<br />

the subset C(x) = [¯x − 1.96/ √ n, ¯x + 1.96/ √ n]. If we carry out an infinite sequence of<br />

independent repetitions of the experiment then we will get an infinite sequence of x<br />

values and thereby an infinite sequence of subsets C(x). We might ask what proportion<br />

of this infinite sequence of subsets actually contain the fixed but unknown value of θ?<br />

Since C(x) depends on x only through the value of ¯x we need to know how ¯x<br />

behaves in the infinite sequence of repetitions. This follows from the fact that ¯ X has a<br />

N(θ, 1<br />

n ) density and so Z = ¯ X−θ<br />

√<br />

1<br />

n<br />

= √ n( ¯ X − θ) has a N(0, 1) density. Thus eventhough<br />

θ is unknown we can calculate the probability that the value of Z will exceed 2.78,<br />

for example, using the standard normal tables. Remember that (from a frequentist<br />

viewpoint) the probability is the proportion of experiments in the infinite sequence of<br />

repetitions which produce a value of Z greater than 2.78.<br />

In particular we have that P [|Z| ≤ 1.96] = 0.95. Thus 95% of the time Z will lie<br />

between −1.96 and +1.96. But<br />

−1.96 ≤ Z ≤ +1.96 ⇒ −1.96 ≤ √ n( ¯ X − θ) ≤ +1.96<br />

50


⇒ −1.96/ √ n ≤ ¯ X − θ ≤ +1.96/ √ n<br />

⇒ ¯ X − 1.96/ √ n ≤ θ ≤ ¯ X + 1.96/ √ n<br />

⇒ θ ∈ C(X)<br />

Thus we have answered the question we started with. The proportion of the infinite<br />

sequence of subsets given by the formula C(X) which will actually include the fixed<br />

but unknown value of θ is 0.95. For this reason the set C(X) is called a 95% confidence<br />

set or confidence interval for the parameter θ. �<br />

It is well to bear in mind that once we have actually carried out the experiment<br />

and observed our value of x, the resulting interval C(x) either does or does not contain<br />

the unknown parameter θ. We do not know which is the case. All we know is that<br />

the procedure we used in constructing C(x) is one which 95% of the time produces an<br />

interval which contains the unknown parameter.<br />

The crucial step in the last example was finding the quantity Z = √ n( ¯ X −θ) whose<br />

value depended on the parameter of interest θ but whose distribution was known to be<br />

that of a standard normal variable. This leads to the following definition.<br />

Definition 3.1 (Pivotal Quantity). A pivotal quantity for a parameter θ is a random<br />

variable Q(X|θ) whose value depends both on (the data) X and on the value of the<br />

unknown parameter θ but whose distribution is known. �<br />

The quantity Z in the example above is a pivotal quantity for θ. The following<br />

lemma provides a method of finding pivotal quantities in general.<br />

Lemma 3.1. Let X be a random variable and define F (a) = P [X ≤ a]. Consider the<br />

random variable U = −2 log [F (X)]. Then U has a χ 2 2 density. Consider the random<br />

variable V = −2 log [1 − F (X)]. Then V has a χ 2 2 density.<br />

Proof. Observe that, for a ≥ 0,<br />

P [U ≤ a] = P [F (X) ≥ exp (−a/2)]<br />

= 1 − P [F (X) ≤ exp (−a/2)]<br />

= 1 − P [X ≤ F −1 (exp (−a/2))]<br />

= 1 − F [F −1 (exp (−a/2))]<br />

= 1 − exp (−a/2).<br />

Hence, U has density 1<br />

2 exp (−a/2) which is the density of a χ2 2 variable as required.<br />

The corresponding proof for V is left as an exercise.<br />

This lemma has an immediate, and very important, application.<br />

51


Suppose that we have data X1, X2, . . . , Xn which are iid with density f(x|θ). Define<br />

F (a|θ) = � a<br />

−∞ f(x|θ)dx and, for i = 1, 2, . . . , n, define Ui = −2 log[F (Xi|θ)]. Then<br />

U1, U2, . . . , Un are iid each having a χ 2 2 density. Hence Q1(X, θ) = � n<br />

i=1 Ui has a χ 2 2n<br />

density and so is a pivotal quantity for θ. Another pivotal quantity ( also having a χ 2 2n<br />

density ) is given by Q2(X, θ) = � n<br />

i=1 Vi where Vi = −2 log[1 − F (Xi|θ)].<br />

Example 3.2. Suppose that we have data X1, X2, . . . , Xn which are iid with density<br />

f(x|θ) = θ exp (−θx)<br />

for x ≥ 0 and suppose that we want to construct a 95% confidence interval for θ. We<br />

need to find a pivotal quantity for θ. Observe that<br />

Hence<br />

F (a|θ) =<br />

=<br />

Q1(X, θ) = −2<br />

� a<br />

−∞<br />

� a<br />

0<br />

f(x|θ)dx<br />

θ exp (−θx)dx<br />

= 1 − exp (−θa).<br />

n�<br />

log [1 − exp (−θXi)]<br />

i=1<br />

is a pivotal quantity for θ having a χ 2 2n density. Also<br />

Q2(X, θ) = −2<br />

n�<br />

log [exp (−θXi)] = 2θ<br />

is another pivotal quantity for θ having a χ 2 2n density.<br />

i=1<br />

Using the tables, find A < B such that P [χ 2 2n < A] = P [χ 2 2n > B] = 0.025. Then<br />

0.95 = P [A ≤ Q2(X, θ) ≤ B]<br />

n�<br />

= P [A ≤ 2θ Xi ≤ B]<br />

i=1<br />

A<br />

= P [<br />

2 �n i=1 Xi<br />

≤ θ ≤<br />

and so the interval<br />

A<br />

[<br />

2 �n i=1 Xi<br />

B<br />

,<br />

2 �n i=1 Xi<br />

]<br />

is a 95% confidence interval for θ.<br />

n�<br />

i=1<br />

B<br />

2 �n i=1 Xi<br />

]<br />

The other pivotal quantity Q1(X, θ) is more awkward in this example since it is not<br />

straightforward to determine the set of θ values which satisfy A ≤ Q1(X, θ) ≤ B. �<br />

52<br />

Xi


3.2 Pivotal Quantities for Use with Normal Data<br />

Many exact pivotal quantities have been developed for use with Gaussian data.<br />

Example 3.3. Suppose that we have data X1, X2, . . . , Xn which are iid observations<br />

from a N (θ, σ2 ) density where σ is known. Define<br />

√<br />

n( X¯ − θ)<br />

Q =<br />

.<br />

σ<br />

Then Q has a N (0, 1) density and so is a pivotal quantity for θ. In particular we can<br />

be 95% sure that<br />

which is equivalent to<br />

−1.96 ≤<br />

√ n( ¯ X − θ)<br />

σ<br />

≤ +1.96<br />

¯X − 1.96 σ √ n ≤ θ ≤ ¯ X + 1.96 σ √ n .<br />

The R command qnorm(p=0.975,mean=0,sd=1) returns the value 1.959964 as the<br />

97 1%<br />

quantile from the standard normal distribution. �<br />

2<br />

Example 3.4. Suppose that we have data X1, X2, . . . , Xn which are iid observations<br />

from a N(θ, σ 2 ) density where θ is known. Define<br />

Q =<br />

n�<br />

(Xi − θ) 2<br />

i=1<br />

We can write Q = n�<br />

Z2 i where Zi = (Xi − θ)/σ. If Zi has a N (0, 1) density then<br />

i=1<br />

Z2 i has a χ2 1 density. Hence, Q has a χ2 n density and so is a pivotal quantity for σ. If<br />

n = 20 then we can be 95% sure that<br />

n�<br />

(Xi − θ) 2<br />

9.591 ≤<br />

i=1<br />

σ 2<br />

σ 2<br />

≤ 34.170<br />

which is equivalent to<br />

�<br />

�<br />

�<br />

� 1<br />

n�<br />

(Xi − θ)<br />

34.170<br />

2 �<br />

�<br />

�<br />

≤ σ ≤ � 1<br />

9.591<br />

i=1<br />

n�<br />

(Xi − θ) 2 .<br />

The R command qchisq(p=c(.025,.975),df=20) returns the values 9.590777 and<br />

34.169607 as the 2 1<br />

1 % and 97 % quantiles from a Chi-squared distribution on 20 degrees<br />

2 2<br />

of freedom. �<br />

53<br />

i=1


Lemma 3.2 (The Student t-distribution). Suppose the random variables X and Y are independent, and X ∼ N(0, 1) and Y ∼ χ²_n. Then the ratio

\[ T = \frac{X}{\sqrt{Y/n}} \]

has pdf

\[ f_T(t \mid n) = \frac{1}{\sqrt{\pi n}}\,\frac{\Gamma([n+1]/2)}{\Gamma(n/2)}\left(1 + \frac{t^2}{n}\right)^{-(n+1)/2} , \]

and is known as Student's t-distribution on n degrees of freedom.

Proof. The random variables X and Y are independent and have joint density

\[ f_{X,Y}(x, y) = \frac{1}{\sqrt{2\pi}}\,\frac{2^{-n/2}}{\Gamma(n/2)}\, e^{-x^2/2}\, y^{n/2-1} e^{-y/2} \qquad \text{for } y > 0 . \]

The Jacobian ∂(t, u)/∂(x, y) of the change of variables

\[ t = \frac{x}{\sqrt{y/n}} \qquad \text{and} \qquad u = y \]

equals

\[ \frac{\partial(t, u)}{\partial(x, y)} \equiv
\begin{vmatrix} \partial t/\partial x & \partial t/\partial y \\ \partial u/\partial x & \partial u/\partial y \end{vmatrix}
= \begin{vmatrix} \sqrt{n/y} & -\tfrac{1}{2}\,x\sqrt{n}/y^{3/2} \\ 0 & 1 \end{vmatrix}
= (n/y)^{1/2} ,
\]

and the inverse Jacobian is ∂(x, y)/∂(t, u) = (u/n)^{1/2}. Then

\begin{align*}
f_T(t) &= \int_0^\infty f_{X,Y}\big(t(u/n)^{1/2}, u\big)\left(\frac{u}{n}\right)^{1/2} du \\
&= \frac{1}{\sqrt{2\pi}}\,\frac{2^{-n/2}}{\Gamma(n/2)} \int_0^\infty e^{-t^2 u/(2n)}\, u^{n/2-1} e^{-u/2}\left(\frac{u}{n}\right)^{1/2} du \\
&= \frac{1}{\sqrt{2\pi}}\,\frac{2^{-n/2}}{\Gamma(n/2)\, n^{1/2}} \int_0^\infty e^{-(1 + t^2/n)u/2}\, u^{(n+1)/2 - 1}\, du .
\end{align*}

The last integrand comes from the pdf of a Gamma((n + 1)/2, 1/2 + t²/(2n)) random variable. Hence

\[ f_T(t) = \frac{1}{\sqrt{\pi n}}\,\frac{\Gamma([n+1]/2)}{\Gamma(n/2)}\left(1 + \frac{t^2}{n}\right)^{-(n+1)/2} , \]

which gives the above formula.
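The density formula in Lemma 3.2 can be checked numerically against R's built-in dt function; the degrees of freedom and the grid of t values below are arbitrary choices.

n  <- 5
t  <- seq(-4, 4, by = 0.5)
fT <- gamma((n + 1) / 2) / (sqrt(pi * n) * gamma(n / 2)) * (1 + t^2 / n)^(-(n + 1) / 2)
max(abs(fT - dt(t, df = n)))   # essentially zero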


Example 3.5. Suppose that we have data X1, X2, . . . , Xn which are iid observations from a N(θ, σ²) density where both θ and σ are unknown. Define

\[ Q = \frac{\sqrt{n}(\bar{X} - \theta)}{s} \qquad \text{where} \qquad s^2 = \frac{\sum_{i=1}^n (X_i - \bar{X})^2}{n-1} . \]

We can write

\[ Q = \frac{Z}{\sqrt{W/(n-1)}} \]

where

\[ Z = \frac{\sqrt{n}(\bar{X} - \theta)}{\sigma} \]

has a N(0, 1) density and

\[ W = \frac{\sum_{i=1}^n (X_i - \bar{X})^2}{\sigma^2} \]

has a χ²_{n−1} density (see Lemma 2.9). It follows immediately that W is a pivotal quantity for σ. If n = 31 then we can be 95% sure that

\[ 16.79077 \le \frac{\sum_{i=1}^n (X_i - \bar{X})^2}{\sigma^2} \le 46.97924 \]

which is equivalent to

\[ \sqrt{\frac{1}{46.97924}\sum_{i=1}^n (X_i - \bar{X})^2} \;\le\; \sigma \;\le\; \sqrt{\frac{1}{16.79077}\sum_{i=1}^n (X_i - \bar{X})^2} . \tag{3.1} \]

The R command qchisq(p=c(.025,.975),df=30) returns the values 16.79077 and 46.97924 as the 2½% and 97½% quantiles from a Chi-squared distribution on 30 degrees of freedom. In Lemma 3.2 we show that Q has a t_{n−1} density, and so is a pivotal quantity for θ. If n = 31 then we can be 95% sure that

\[ -2.042 \le \frac{\sqrt{n}(\bar{X} - \theta)}{s} \le +2.042 \]

which is equivalent to

\[ \bar{X} - 2.042\,\frac{s}{\sqrt{n}} \le \theta \le \bar{X} + 2.042\,\frac{s}{\sqrt{n}} . \tag{3.2} \]

The R command qt(p=.975,df=30) returns the value 2.042272 as the 97½% quantile from a Student t-distribution on 30 degrees of freedom. (It is important to point out that although a probability statement involving 95% confidence has been attached to each of the two intervals (3.1) and (3.2) separately, this does not imply that both intervals simultaneously hold with 95% confidence.) □
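The two intervals (3.1) and (3.2) can be computed in R as follows; the simulated data are an assumption used only for illustration.

set.seed(3)
n  <- 31
x  <- rnorm(n, mean = 10, sd = 3)
SS <- sum((x - mean(x))^2)
sqrt(SS / qchisq(c(0.975, 0.025), df = n - 1))               # (3.1): interval for sigma
mean(x) + c(-1, 1) * qt(0.975, df = n - 1) * sd(x) / sqrt(n) # (3.2): interval for theta
# the second interval agrees with t.test(x)$conf.int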


Example 3.6. Suppose that we have data X1, X2, . . . , Xn which are iid observations from a N(θ1, σ²) density and data Y1, Y2, . . . , Ym which are iid observations from a N(θ2, σ²) density where θ1, θ2, and σ are unknown. Let δ = θ1 − θ2 and define

\[ Q = \frac{(\bar{X} - \bar{Y}) - \delta}{\sqrt{s^2\left(\frac{1}{n} + \frac{1}{m}\right)}}
\qquad \text{where} \qquad
s^2 = \frac{\sum_{i=1}^n (X_i - \bar{X})^2 + \sum_{j=1}^m (Y_j - \bar{Y})^2}{n + m - 2} . \]

We know that X̄ has a N(θ1, σ²/n) density and that Ȳ has a N(θ2, σ²/m) density. Then the difference X̄ − Ȳ has a N(δ, σ²[1/n + 1/m]) density. Hence

\[ Z = \frac{\bar{X} - \bar{Y} - \delta}{\sqrt{\sigma^2\left[\frac{1}{n} + \frac{1}{m}\right]}} \]

has a N(0, 1) density. Let W1 = Σ_{i=1}^n (Xi − X̄)²/σ² and let W2 = Σ_{j=1}^m (Yj − Ȳ)²/σ². Then W1 has a χ²_{n−1} density, W2 has a χ²_{m−1} density, and W = W1 + W2 has a χ²_{n+m−2} density. We can write

\[ Q_1 = Z\big/\sqrt{W/(n + m - 2)} \]

and so Q1 has a t_{n+m−2} density and so is a pivotal quantity for δ. Define

\[ Q_2 = \frac{\sum_{i=1}^n (X_i - \bar{X})^2 + \sum_{j=1}^m (Y_j - \bar{Y})^2}{\sigma^2} . \]

Then Q2 has a χ²_{n+m−2} density and so is a pivotal quantity for σ. □
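A sketch of the resulting 95% interval for δ based on Q1, using simulated data (the means, common standard deviation and sample sizes are illustrative assumptions); it agrees with R's pooled two-sample t interval.

set.seed(4)
n <- 12; m <- 15
x <- rnorm(n, mean = 3, sd = 2)
y <- rnorm(m, mean = 1, sd = 2)
s2 <- (sum((x - mean(x))^2) + sum((y - mean(y))^2)) / (n + m - 2)
(mean(x) - mean(y)) + c(-1, 1) * qt(0.975, df = n + m - 2) * sqrt(s2 * (1/n + 1/m))
# compare with t.test(x, y, var.equal = TRUE)$conf.int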

Lemma 3.3 (The Fisher F-distribution). Let X1, X2, . . . , Xn and Y1, Y2, . . . , Ym be iid N(0, 1) random variables. The ratio

\[ Z = \frac{\sum_{i=1}^n X_i^2 / n}{\sum_{i=1}^m Y_i^2 / m} \]

has the distribution called the Fisher, or F, distribution with parameters (degrees of freedom) n, m, or the F_{n,m} distribution for short. The corresponding pdf f_{F_{n,m}} is concentrated on the positive half axis:

\[ f_{F_{n,m}}(z) = \frac{\Gamma((n+m)/2)}{\Gamma(n/2)\Gamma(m/2)}\left(\frac{n}{m}\right)^{n/2} z^{n/2-1}\left(1 + \frac{n}{m}z\right)^{-(n+m)/2} \qquad \text{for } z > 0 . \]

Observe that if T ∼ t_m, then T² = Z ∼ F_{1,m}, and if Z ∼ F_{n,m}, then Z⁻¹ ∼ F_{m,n}. If W1 ∼ χ²_n and W2 ∼ χ²_m, then Z = (mW1)/(nW2) ∼ F_{n,m}. □


Example 3.7. Suppose that we have data X1, X2, . . . , Xn which are iid observations from a N(θ_X, σ²_X) density and data Y1, Y2, . . . , Ym which are iid observations from a N(θ_Y, σ²_Y) density where θ_X, θ_Y, σ_X, and σ_Y are all unknown. Let

\[ \lambda = \sigma_X/\sigma_Y \]

and define

\[ F^* = \frac{\hat{s}_X^2}{\hat{s}_Y^2} = \frac{\sum_{i=1}^n (X_i - \bar{X})^2}{(n-1)}\,\frac{(m-1)}{\sum_{j=1}^m (Y_j - \bar{Y})^2} . \]

Let

\[ W_X = \sum_{i=1}^n (X_i - \bar{X})^2/\sigma_X^2
\qquad \text{and} \qquad
W_Y = \sum_{j=1}^m (Y_j - \bar{Y})^2/\sigma_Y^2 . \]

Then W_X has a χ²_{n−1} density and W_Y has a χ²_{m−1} density. Hence, by Lemma 3.3,

\[ Q = \frac{W_X/(n-1)}{W_Y/(m-1)} \equiv \frac{F^*}{\lambda^2} \]

has an F density with n − 1 and m − 1 degrees of freedom and so is a pivotal quantity for λ. Suppose that n = 25 and m = 13. Then we can be 95% sure that 0.39 ≤ Q ≤ 3.02, which is equivalent to

\[ \sqrt{\frac{F^*}{3.02}} \;\le\; \lambda \;\le\; \sqrt{\frac{F^*}{0.39}} . \]

To see how this might work in practice try the following R commands one at a time:

x = rnorm(25, mean = 0, sd = 2)
y = rnorm(13, mean = 1, sd = 1)
Fstar = var(x)/var(y); Fstar
CutOffs = qf(p=c(.025,.975), df1=24, df2=12)
CutOffs; rev(CutOffs)
Fstar / rev(CutOffs)
var.test(x, y)

The search for a nice pivotal quantity for the difference in means δ = θ_X − θ_Y when the two variances are unequal continues and is one of the great unsolved problems in Statistics, referred to as the Behrens-Fisher Problem.


3.3 Approximate Confidence Intervals

Let X1, X2, . . . , Xn be iid with density f(x|θ). Let θ̂ be the MLE of θ. We saw before that the quantities

\[ W_1 = \sqrt{EI(\theta)}\,(\hat{\theta} - \theta), \quad
W_2 = \sqrt{I(\theta)}\,(\hat{\theta} - \theta), \quad
W_3 = \sqrt{EI(\hat{\theta})}\,(\hat{\theta} - \theta), \quad
W_4 = \sqrt{I(\hat{\theta})}\,(\hat{\theta} - \theta) \]

all had densities which were approximately N(0, 1). Hence they are all approximate pivotal quantities for θ. W3 and W4 are the simplest to use in general. For W3 the approximate 95% confidence interval is given by

\[ \Big[\hat{\theta} - 1.96\big/\sqrt{EI(\hat{\theta})},\; \hat{\theta} + 1.96\big/\sqrt{EI(\hat{\theta})}\Big] . \]

For W4 the approximate 95% confidence interval is given by

\[ \Big[\hat{\theta} - 1.96\big/\sqrt{I(\hat{\theta})},\; \hat{\theta} + 1.96\big/\sqrt{I(\hat{\theta})}\Big] . \]

The quantity 1/√EI(θ̂) (or 1/√I(θ̂)) is often referred to as the approximate standard error of the MLE θ̂.

Let X1, X2, . . . , Xn be iid with density f(x|θ) where θ = (θ1, θ2, . . . , θm) consists of m unknown parameters. Let θ̂ = (θ̂1, θ̂2, . . . , θ̂m) be the MLE of θ. We saw before that for r = 1, 2, . . . , m the quantities W_{1r} = (θ̂r − θr)/√CRLBr, where CRLBr is the lower bound for Var(θ̂r) given in the generalisation of the Cramer-Rao theorem, had a density which was approximately N(0, 1). Recall that CRLBr is the rth diagonal element of the matrix [EI(θ)]⁻¹. In certain cases CRLBr may depend on the values of unknown parameters other than θr, and in those cases W_{1r} will not be an approximate pivotal quantity for θr.

We also saw that if we define W_{2r} by replacing CRLBr by the rth diagonal element of the matrix [I(θ)]⁻¹, W_{3r} by replacing CRLBr by the rth diagonal element of the matrix [EI(θ̂)]⁻¹, and W_{4r} by replacing CRLBr by the rth diagonal element of the matrix [I(θ̂)]⁻¹, we get three more quantities all of which have a density which is approximately N(0, 1). W_{3r} and W_{4r} only depend on the unknown parameter θr and so are approximate pivotal quantities for θr. However, in certain cases the rth diagonal element of the matrix [I(θ)]⁻¹ may depend on the values of unknown parameters other than θr, and in those cases W_{2r} will not be an approximate pivotal quantity for θr. Generally W_{3r} and W_{4r} are most commonly used.

We now examine the use of approximate pivotal quantities based on the MLE in a series of examples.

Example 3.8 (Poisson sampling continued). Recall that θ̂ = x̄ and I(θ) = Σ_{i=1}^n x_i/θ² = nθ̂/θ², with E[I(θ)] = n/θ. Hence E[I(θ̂)] = I(θ̂) = n/θ̂ and the usual approximate 95% confidence interval is given by

\[ \left[\hat{\theta} - 1.96\sqrt{\frac{\hat{\theta}}{n}},\; \hat{\theta} + 1.96\sqrt{\frac{\hat{\theta}}{n}}\right] . \]
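In R the interval of Example 3.8 can be computed as follows (the counts are simulated purely for illustration):

set.seed(5)
n <- 50
x <- rpois(n, lambda = 4)
thetahat <- mean(x)
thetahat + c(-1, 1) * 1.96 * sqrt(thetahat / n)   # approximate 95% CI for theta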


Example 3.9 (Bernoulli trials continued). Recall that θ̂ = x̄ and

\[ I(\theta) = \frac{\sum_{i=1}^n x_i}{\theta^2} + \frac{n - \sum_{i=1}^n x_i}{(1-\theta)^2}
\qquad \text{with} \qquad
E[I(\theta)] = \frac{n}{\theta(1-\theta)} . \]

Hence

\[ E[I(\hat{\theta})] = I(\hat{\theta}) = \frac{n}{\hat{\theta}(1-\hat{\theta})} \]

and the usual approximate 95% confidence interval is given by

\[ \left[\hat{\theta} - 1.96\sqrt{\frac{\hat{\theta}(1-\hat{\theta})}{n}},\; \hat{\theta} + 1.96\sqrt{\frac{\hat{\theta}(1-\hat{\theta})}{n}}\right] . \]

Example 3.10. Let X1, X2, . . . , Xn be iid observations from the density

\[ f(x|\alpha, \beta) = \alpha\beta x^{\beta-1}\exp[-\alpha x^{\beta}] \]

for x ≥ 0 where both α and β are unknown. We saw how to calculate the MLEs of α and β using Newton's method and that the information matrix I(α, β) is given by

\[ I(\alpha, \beta) = \begin{pmatrix}
n/\alpha^2 & \sum_{i=1}^n x_i^{\beta}\log[x_i] \\
\sum_{i=1}^n x_i^{\beta}\log[x_i] & n/\beta^2 + \alpha\sum_{i=1}^n x_i^{\beta}\log[x_i]^2
\end{pmatrix} . \]

Let V11 and V22 be the diagonal elements of the matrix [I(α̂, β̂)]⁻¹. Then the approximate 95% confidence interval for α is

\[ [\hat{\alpha} - 1.96\sqrt{V_{11}},\; \hat{\alpha} + 1.96\sqrt{V_{11}}] \]

and the approximate 95% confidence interval for β is

\[ [\hat{\beta} - 1.96\sqrt{V_{22}},\; \hat{\beta} + 1.96\sqrt{V_{22}}] . \quad \Box \]
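A sketch of how these intervals might be computed numerically in R, maximising the log-likelihood with optim and using the returned Hessian of the negative log-likelihood as the observed information I(α̂, β̂). The simulated data, the starting values and the use of a generic optimiser (rather than the Newton iteration discussed earlier) are assumptions made purely for illustration.

set.seed(6)
x <- rweibull(100, shape = 1.5, scale = 1)    # corresponds to beta = 1.5, alpha = 1
negll <- function(p) {                        # p = (alpha, beta)
  if (any(p <= 0)) return(Inf)
  a <- p[1]; b <- p[2]
  -sum(log(a) + log(b) + (b - 1) * log(x) - a * x^b)
}
fit <- optim(c(1, 1), negll, hessian = TRUE)
V   <- solve(fit$hessian)                     # [I(alpha-hat, beta-hat)]^(-1)
cbind(estimate = fit$par,
      lower = fit$par - 1.96 * sqrt(diag(V)),
      upper = fit$par + 1.96 * sqrt(diag(V)))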


3.4 Worked Problems

The Problems

1. Components are produced in an industrial process and the numbers of flaws in different components are independent and identically distributed with probability mass function p(x) = θ(1 − θ)^x, x = 0, 1, 2, . . ., where 0 < θ < 1. A random sample of n components is inspected; n0 components are found to have no flaws, n1 components are found to have exactly one flaw, and the remaining components are found to have two or more flaws.

(a) Show that the likelihood function is L(θ) = θ^{n0+n1}(1 − θ)^{2n−2n0−n1}.

(b) Find the MLE of θ and the sample information in terms of n, n0 and n1.

(c) Hence calculate an approximate 90% confidence interval for θ where 90 out of 100 components have no flaws, and seven have exactly one flaw.

2. Suppose that X1, X2, . . . , Xn is a random sample from the shifted exponential distribution with probability density function

\[ f(x|\theta, \mu) = \frac{1}{\theta} e^{-(x-\mu)/\theta}, \qquad \mu < x < \infty, \]

where θ > 0 and −∞ < µ < ∞. Both θ and µ are unknown, and n > 1.

(a) The sample range W is defined as W = X(n) − X(1), where X(n) = max_i Xi and X(1) = min_i Xi. It can be shown that the joint probability density function of X(1) and W is given by

\[ f_{X_{(1)},W}(x_{(1)}, w) = n(n-1)\theta^{-2} e^{-n(x_{(1)}-\mu)/\theta} e^{-w/\theta}(1 - e^{-w/\theta})^{n-2}, \]

for x(1) > µ and w > 0. Hence obtain the marginal density function of W and show that W has distribution function P(W ≤ w) = (1 − e^{−w/θ})^{n−1}, w > 0.

(b) Show that W/θ is a pivotal quantity for θ. Without carrying out any calculations, explain how this result may be used to construct a 100(1 − α)% confidence interval for θ for 0 < α < 1.

3. Let X have the logistic distribution with probability density function

\[ f(x) = \frac{e^{x-\theta}}{(1 + e^{x-\theta})^2}, \qquad -\infty < x < \infty, \]

where −∞ < θ < ∞ is an unknown parameter.

(a) Show that X − θ is a pivotal quantity and hence, given a single observation X, construct an exact 100(1 − α)% confidence interval for θ. Evaluate the interval when α = 0.05 and X = 10.

(b) Given a random sample X1, X2, . . . , Xn from the above distribution, briefly explain how you would use the central limit theorem to construct an approximate 95% confidence interval for θ. Hint: E(X) = θ and Var(X) = π²/3.

Outline Solutions

1. P(0) = θ, P(1) = θ(1 − θ) and P(≥ 2) = 1 − θ − θ(1 − θ) = (1 − θ)². Thus the likelihood of n0 zeros, n1 ones and the remaining n − n0 − n1 components with two or more flaws is L(θ) = θ^{n0}[θ(1 − θ)]^{n1}(1 − θ)^{2(n−n0−n1)} = θ^{n0+n1}(1 − θ)^{2n−2n0−n1}. The log-likelihood is ℓ(θ) = (n0 + n1) ln θ + (2n − 2n0 − n1) ln(1 − θ) and the score function is S(θ) = dℓ(θ)/dθ = (n0 + n1)θ⁻¹ − (2n − 2n0 − n1)(1 − θ)⁻¹. Setting this equal to zero gives that θ̂ satisfies (n0 + n1)(1 − θ̂) = (2n − 2n0 − n1)θ̂, so that θ̂ = (n0 + n1)/(2n − n0). Differentiating again, I(θ) = (n0 + n1)θ⁻² + (2n − 2n0 − n1)(1 − θ)⁻² > 0, which confirms that θ̂ is a maximum. Using 1 − θ̂ = (2n − 2n0 − n1)/(2n − n0) to simplify the calculations we get

\[ I(\hat{\theta}) = \frac{(2n - n_0)^2}{n_0 + n_1} + \frac{(2n - n_0)^2}{2n - 2n_0 - n_1} . \]

An approximate 90% CI for θ is θ̂ ± 1.6449[I(θ̂)]^{−1/2}. In the case when n = 100, n0 = 90 and n1 = 7, we have 2n − n0 = 110, n0 + n1 = 97 and 2n − 2n0 − n1 = 13. Thus θ̂ = 97/110 = 0.882 and the sample information is 110²/97 + 110²/13 = 1055.5115. Thus the 90% CI is 0.882 ± 1.6449(32.489)⁻¹, i.e. 0.882 ± 0.051 or (0.831, 0.933).
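A quick R check of the numbers in this solution (a direct transcription of the formulas above):

n <- 100; n0 <- 90; n1 <- 7
thetahat <- (n0 + n1) / (2 * n - n0)                                        # 0.882
info <- (2 * n - n0)^2 / (n0 + n1) + (2 * n - n0)^2 / (2 * n - 2 * n0 - n1) # 1055.51
thetahat + c(-1, 1) * qnorm(0.95) / sqrt(info)                              # 0.882 +/- 0.051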

2. Given f(x, w) = n(n − 1)θ⁻² e^{−n(x−µ)/θ} e^{−w/θ}(1 − e^{−w/θ})^{n−2}, where x ≡ x(1), we have

\[ f_W(w) = n(n-1)\theta^{-2} e^{-w/\theta}(1 - e^{-w/\theta})^{n-2} \int_\mu^\infty e^{-n(x-\mu)/\theta}\, dx . \]

Putting v = x − µ, so that dv = dx, we have

\[ f_W(w) = n(n-1)\theta^{-2} e^{-w/\theta}(1 - e^{-w/\theta})^{n-2} \int_0^\infty e^{-nv/\theta}\, dv
= n(n-1)\theta^{-2} e^{-w/\theta}(1 - e^{-w/\theta})^{n-2}\Big[-\tfrac{\theta}{n} e^{-nv/\theta}\Big]_{v=0}^{\infty}
= \frac{(n-1)}{\theta}\, e^{-w/\theta}(1 - e^{-w/\theta})^{n-2} . \]

Next,

\[ P(W \le w) = \int_0^w \frac{(n-1)}{\theta}\, e^{-y/\theta}(1 - e^{-y/\theta})^{n-2}\, dy
= \Big[(1 - e^{-y/\theta})^{n-1}\Big]_0^w = (1 - e^{-w/\theta})^{n-1}, \qquad 0 < w < \infty . \]

Let Z = W/θ. Then F_Z(z) = P(Z ≤ z) = P(W ≤ zθ) = (1 − e^{−z})^{n−1}, 0 < z < ∞. Z is a function of W and θ whose distribution does not depend on θ. Hence Z is a pivotal quantity. Choose any interval [z1, z2], where z1 ≥ 0, such that ∫_{z1}^{z2} f_Z(z) dz = 1 − α for 0 < α < 1. Then, given the observed range W = w, we have z1 ≤ w/θ ≤ z2, and a 100(1 − α)% CI for θ is [w/z2, w/z1].

3. Let W = X − θ. Then f_W(w) = e^w(1 + e^w)⁻². Here X − θ is a function of X and θ whose distribution does not depend on θ, and so is a pivotal quantity. Now f_W(w) is symmetric about zero, and so a 100(1 − α)% confidence interval for θ (where 0 < α < 1) is {θ : −c < X − θ < c} = (X − c, X + c), where c > 0 satisfies P(W ≤ −c) = α/2. Hence

\[ \int_{-\infty}^{-c} e^w(1 + e^w)^{-2}\, dw = \frac{\alpha}{2}
\;\Longleftrightarrow\; \Big[-(1 + e^w)^{-1}\Big]_{-\infty}^{-c} = 1 - (1 + e^{-c})^{-1} = \frac{\alpha}{2}
\;\therefore\; c = \ln\!\left(\frac{1 - \alpha/2}{\alpha/2}\right) . \]

When α = 0.05, c = 3.664. So when X = 10, the interval is (6.336, 13.664).

Let X̄ denote the sample mean. The CLT gives that X̄ is approximately distributed N(θ, π²/(3n)). An approximate 95% CI for θ is therefore [X̄ − 1.96π/√(3n), X̄ + 1.96π/√(3n)].
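The constant c in solution 3 is just the 97.5% point of the standard logistic distribution, which is easily confirmed in R:

log(0.975 / 0.025)              # 3.663562
qlogis(0.975)                   # the same value from the logistic quantile function
10 + c(-1, 1) * qlogis(0.975)   # the interval (6.336, 13.664) when X = 10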


Student Questions

1. Let X1, X2, . . . , Xn be iid with density f_X(x|θ) = θ exp(−θx) for x ≥ 0.

(a) Show that ∫₀^x f(u|θ) du = 1 − exp(−θx).

(b) Use the result in (a) to establish that Q = 2θ Σ_{i=1}^n Xi is a pivotal quantity for θ and explain how to use Q to find a 95% confidence interval for θ.

(c) Derive the information I(θ). Suggest an approximate pivotal quantity for θ involving I(θ) and another approximate pivotal quantity involving I(θ̂), where θ̂ = 1/x̄ is the maximum likelihood estimate of θ. Show how both approximate pivotal quantities may be used to find approximate 95% confidence intervals for θ. Prove that the approximate confidence interval calculated using the approximate pivotal quantity involving I(θ̂) is always shorter than the approximate confidence interval calculated using the approximate pivotal quantity involving I(θ), but that the ratio of the lengths converges to 1 as n → ∞.

(d) Suppose n = 25 and Σ_{i=1}^{25} xi = 250. Use the method explained in (b) to calculate a 95% confidence interval for θ and the two methods explained in (c) to calculate approximate 95% confidence intervals for θ. Compare the three intervals obtained.

2. Let X1, X2, . . . , Xn be iid with density

\[ f(x|\theta) = \frac{\theta}{(x + 1)^{\theta+1}} \]

for x ≥ 0.

(a) Derive an exact pivotal quantity for θ and explain how it may be used to find a 95% confidence interval for θ.

(b) Derive the information I(θ). Suggest an approximate pivotal quantity for θ involving I(θ) and another approximate pivotal quantity involving I(θ̂), where θ̂ is the maximum likelihood estimate of θ. Show how both approximate pivotal quantities may be used to find approximate 95% confidence intervals for θ.

(c) Suppose n = 25 and Σ_{i=1}^{25} log[xi + 1] = 250. Use the method explained in (a) to calculate a 95% confidence interval for θ and the two methods explained in (b) to calculate approximate 95% confidence intervals for θ. Compare the three intervals obtained.


3. Let X1, X2, . . . , Xn be iid with density

\[ f(x|\theta) = \theta^2 x \exp(-\theta x) \]

for x ≥ 0.

(a) Show that ∫₀^x f(u|θ) du = 1 − exp(−θx)[1 + θx].

(b) Describe how the result from (a) can be used to construct two exact pivotal quantities for θ.

(c) Construct FOUR approximate pivotal quantities for θ based on the MLE θ̂.

(d) Suppose that n = 10 and the data values are 1.6, 2.5, 2.7, 3.5, 4.6, 5.2, 5.6, 6.4, 7.7, 9.2. Evaluate the 95% confidence interval corresponding to ONE of the exact pivotal quantities (you may need to use a computer to do this). Compare your answer to the 95% confidence intervals corresponding to each of the FOUR approximate pivotal quantities derived in (c).

4. Let X1, X2, . . . , Xn be iid each having a Poisson density

\[ f(x|\theta) = \frac{\theta^x \exp(-\theta)}{x!} \]

for x = 0, 1, 2, . . . . Construct FOUR approximate pivotal quantities for θ based on the MLE θ̂. Show how each may be used to construct an approximate 95% confidence interval for θ. Evaluate the four confidence intervals in the case where the data consist of n = 64 observations with an average value of x̄ = 4.5.

5. Let X1, X2, . . . , Xn be iid with density

\[ f_1(x|\theta) = \frac{1}{\theta}\exp[-x/\theta] \]

for 0 ≤ x < ∞. Let Y1, Y2, . . . , Ym be iid with density

\[ f_2(y|\theta, \lambda) = \frac{\lambda}{\theta}\exp[-\lambda y/\theta] \]

for 0 ≤ y < ∞.

(a) Derive approximate pivotal quantities for each of the parameters θ and λ.

(b) Suppose that n = 10 and the average of the 10 x values is 24.0. Suppose that m = 40 and the average of the 40 y values is 12.0. Calculate approximate 95% confidence intervals for both θ and λ.


Chapter 4

The Theory of Hypothesis Testing

4.1 Introduction

Suppose that we are going to observe the value of a random vector X. Let 𝒳 denote the set of possible values that X can take and, for x ∈ 𝒳, let g(x|θ) denote the probability that X = x, where the parameter θ is some unknown element of the set Θ.

Our assumptions specify g, Θ, and 𝒳. A hypothesis specifies that θ belongs to some subset Θ0 of Θ. The question arises as to whether the observed data x are consistent with the hypothesis that θ ∈ Θ0, often written as H0 : θ ∈ Θ0. The hypothesis H0 is usually referred to as the null hypothesis.

In a hypothesis testing situation, two types of error are possible.

• The first type of error is to reject the null hypothesis H0 : θ ∈ Θ0 as being inconsistent with the observed data x when, in fact, θ ∈ Θ0, i.e. when, in fact, the null hypothesis happens to be true. This is referred to as a type 1 error.

• The second type of error is to fail to reject the null hypothesis H0 : θ ∈ Θ0 as being inconsistent with the observed data x when, in fact, θ ∉ Θ0, i.e. when, in fact, the null hypothesis happens to be false. This is referred to as a type 2 error.

Example 4.1. Suppose the data consist of a random sample X1, X2, . . . , Xn from a N(θ, 1) density. Let Θ = (−∞, ∞) and Θ0 = (−∞, 0] and consider testing H0 : θ ∈ Θ0, or in other words H0 : θ ≤ 0.

The standard estimate of θ for this example is X̄. It would seem rational to consider that the bigger the value of X̄ that we observe, the stronger is the evidence against the null hypothesis that θ ≤ 0. How big does X̄ have to be in order for us to reject H0?

Suppose that n = 25 and we observe x̄ = 0.32. What are the chances of getting such a large value for x̄ if, in fact, θ ≤ 0? We know that X̄ has a N(θ, 1/n) density, i.e. a N(θ, 0.04) density. So the probability of getting a value for x̄ as large as 0.32 is the area under a N(θ, 0.04) curve between 0.32 and ∞ which is, in turn, equal to the area under a N(0, 1) curve between (0.32 − θ)/0.20 and ∞. To evaluate the probability of getting a value for x̄ as large as 0.32 if, in fact, θ ≤ 0, we need to find the value of θ ≤ 0 for which the area under a N(0, 1) curve between (0.32 − θ)/0.20 and ∞ is maximised. Clearly this happens for θ = 0 and the resulting maximum is the area under a N(0, 1) curve between 0.32/0.20 = 1.60 and ∞, or 0.0548. This quantity is called the p-value. The p-value is used to measure the strength of the evidence against H0 : θ ≤ 0 and H0 is rejected if the p-value is less than some small number such as 0.05. You might like to try the R commands 1-pnorm(q=0.32,mean=0,sd=sqrt(.04)) and 1-pnorm(1.6). □

4.2 The General Testing Problem

Suppose that we are going to observe the value of a random vector X. Let 𝒳 denote the set of possible values that X can take and, for x ∈ 𝒳, let g(x; θ) denote the probability that X takes the value x, where the parameter θ is some unknown element of the set Θ. Let Θ0 be some subset of Θ and consider testing the null hypothesis that θ lies in the subset Θ0, i.e. H0 : θ ∈ Θ0.

Standard procedures involve calculating some function of the data, called a test statistic and denoted by T(X), and then rejecting H0 : θ ∈ Θ0 if the observed value of T(X) turns out to be surprisingly large. We must be clear about what is meant by surprisingly large, however. The quantity T(X) is a random variable, since if we repeated the experiment we would get a different value of X and hence a different value of T(X). The probability distribution of T(X) depends on the value of θ. Suppose we observe T(X) = t(x) = t. For any value of θ ∈ Θ, we can calculate the probability that T(X) exceeds t if that value of θ was in fact the true value. Let us denote this probability by p(t; θ). The quantity max{p(t; θ) : θ ∈ Θ0} is called the p-value. The p-value is used to measure the strength of the evidence against H0 : θ ∈ Θ0, and H0 is rejected if the p-value is less than some small number such as 0.05.

Example 4.2 (Example 4.1 continued). Consider the test statistic T(X) = X̄ and suppose we observe T(x) = t. We need to calculate p(t; θ), which is the probability that the random variable T(X) exceeds t when θ is the true value of the parameter. If θ is the true value of the parameter, T(X) has a N(θ, 1/n) density and so

\[ p(t; \theta) = P\{N[\theta, 1/n] \ge t\} = P\{N[0, 1] \ge \sqrt{n}(t - \theta)\} . \]

In order to calculate the p-value we need to find θ ≤ 0 for which p(t; θ) is a maximum. Since p(t; θ) is maximised by making √n(t − θ) as small as possible, the maximum over (−∞, 0] always occurs at θ = 0. Hence we have that p-value = P{N[0, 1] ≥ √n t}. □


Example 4.3 (The power function). Suppose our rule is to reject H0 : θ ≤ 0 if the p-value is less than 0.05. In order for the p-value to be less than 0.05 we require √n t > 1.65, and so we reject H0 if x̄ > 1.65/√n. What are the chances of rejecting H0 if θ = 0.2? If θ = 0.2 then x̄ has a N[0.2, 1/n] density and so the probability of rejecting H0 is

\[ P\left\{N\Big(0.2, \frac{1}{n}\Big) \ge \frac{1.65}{\sqrt{n}}\right\} = P\left\{N(0, 1) \ge 1.65 - 0.2\sqrt{n}\right\} . \]

For n = 25 this is given by P{N(0, 1) ≥ 0.65} = 0.2578. This calculation can be verified using the R command 1-pnorm(1.65-0.2*sqrt(25)). The following table gives the results of this calculation for n = 25 and various values of θ.

θ:    -1.0 -0.9 -0.8 -0.7 -0.6 -0.5 -0.4 -0.3 -0.2 -0.1
Prob: .000 .000 .000 .000 .000 .000 .000 .001 .004 .016

θ:     0.0  0.1  0.2  0.3  0.4  0.5  0.6  0.7  0.8  0.9  1.0
Prob: .050 .125 .258 .440 .637 .802 .912 .968 .991 .998 .999

This is called the power function of the test. The R command Ns=seq(from=(-1),to=1,by=0.1) generates and stores the sequence −1.0, −0.9, . . . , +1.0 and the probabilities in the table were calculated using 1-pnorm(1.65-Ns*sqrt(25)). □

Example 4.4 (Sample size). How large would n have to be so that the probability of rejecting H0 when θ = 0.2 is 0.90? We would require 1.65 − 0.2√n = −1.28, which implies that √n = (1.65 + 1.28)/0.2, or n ≈ 215. □
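The power and sample-size calculations in Examples 4.3 and 4.4 can be reproduced in R as follows.

power <- function(n, theta = 0.2) 1 - pnorm(1.65 - theta * sqrt(n))
power(25)                                   # 0.2578, as in Example 4.3
min(which(sapply(1:500, power) >= 0.90))    # approximately 215, as in Example 4.4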

So the general plan for testing a hypothesis is clear: choose a test statistic T, observe the data, calculate the observed value t of the test statistic T, calculate the p-value as the maximum over all values of θ in Θ0 of the probability of getting a value for T as large as t, and reject H0 : θ ∈ Θ0 if the p-value so obtained is too small.

4.3 Hypothesis Testing for Normal Data

Many standard test statistics have been developed for use with normally distributed data.

Example 4.5 (One Gaussian sample). Suppose that we have data X1, X2, . . . , Xn which are iid observations from a N(µ, σ²) density where both µ and σ are unknown. Here θ = (µ, σ) and Θ = {(µ, σ) : −∞ < µ < ∞, 0 < σ < ∞}. Define

\[ \bar{X} = \frac{\sum_{i=1}^n X_i}{n}
\qquad \text{and} \qquad
s^2 = \frac{\sum_{i=1}^n (X_i - \bar{X})^2}{n - 1} . \]


(a) Suppose Θ0 = {(µ, σ) : −∞ < µ ≤ A, 0 < σ < ∞}. Define T = X̄. Let t denote the observed value of T. Then

\begin{align*}
p(t; \theta) &= P[\bar{X} \ge t]
= P\left[\frac{\sqrt{n}(\bar{X} - \mu)}{s} \ge \frac{\sqrt{n}(t - \mu)}{s}\right]
= P\left[t_{n-1} \ge \frac{\sqrt{n}(t - \mu)}{s}\right] .
\end{align*}

To maximize this we choose µ in (−∞, A] as large as possible, which clearly means choosing µ = A. Hence the p-value is

\[ P\left[t_{n-1} \ge \frac{\sqrt{n}(\bar{x} - A)}{s}\right] . \]

(b) Suppose Θ0 = {(µ, σ) : A ≤ µ < ∞, 0 < σ < ∞}. Define T = −X̄. Let t denote the observed value of T. Then

\begin{align*}
p(t; \theta) &= P[-\bar{X} \ge t] = P[\bar{X} \le -t]
= P\left[\frac{\sqrt{n}(\bar{X} - \mu)}{s} \le \frac{\sqrt{n}(-t - \mu)}{s}\right]
= P\left[t_{n-1} \le \frac{\sqrt{n}(-t - \mu)}{s}\right] .
\end{align*}

To maximize this we choose µ in [A, ∞) as small as possible, which clearly means choosing µ = A. Hence the p-value is

\[ P\left[t_{n-1} \le \frac{\sqrt{n}(-t - A)}{s}\right] = P\left[t_{n-1} \le \frac{\sqrt{n}(\bar{x} - A)}{s}\right] . \]

(c) Suppose Θ0 = {(A, σ) : 0 < σ < ∞}. Define T = |X̄ − A|. Let t denote the observed value of T. Then

\begin{align*}
p(t; \theta) &= P[|\bar{X} - A| \ge t] = P[\bar{X} \ge A + t] + P[\bar{X} \le A - t] \\
&= P\left[\frac{\sqrt{n}(\bar{X} - \mu)}{s} \ge \frac{\sqrt{n}(A + t - \mu)}{s}\right]
 + P\left[\frac{\sqrt{n}(\bar{X} - \mu)}{s} \le \frac{\sqrt{n}(A - t - \mu)}{s}\right] \\
&= P\left[t_{n-1} \ge \frac{\sqrt{n}(A + t - \mu)}{s}\right]
 + P\left[t_{n-1} \le \frac{\sqrt{n}(A - t - \mu)}{s}\right] .
\end{align*}

The maximization is trivially found by setting µ = A. Hence the p-value is

\[ P\left[t_{n-1} \ge \frac{\sqrt{n}\,t}{s}\right] + P\left[t_{n-1} \le -\frac{\sqrt{n}\,t}{s}\right]
= 2P\left[t_{n-1} \ge \frac{\sqrt{n}\,t}{s}\right]
= 2P\left[t_{n-1} \ge \frac{\sqrt{n}\,|\bar{x} - A|}{s}\right] . \]


(d) Suppose Θ0 = {(µ, σ) : −∞ < µ < ∞, 0 < σ ≤ A}. Define T = Σ_{i=1}^n (Xi − X̄)². Let t denote the observed value of T. Then

\begin{align*}
p(t; \sigma) &= P\Big[\sum_{i=1}^n (X_i - \bar{X})^2 \ge t\Big]
= P\left[\frac{\sum_{i=1}^n (X_i - \bar{X})^2}{\sigma^2} \ge \frac{t}{\sigma^2}\right]
= P\left[\chi^2_{n-1} \ge \frac{t}{\sigma^2}\right] .
\end{align*}

To maximize this we choose σ in (0, A] as large as possible, which clearly means choosing σ = A. Hence the p-value is

\[ P\left[\chi^2_{n-1} \ge \frac{t}{A^2}\right] = P\left[\chi^2_{n-1} \ge \frac{\sum_{i=1}^n (x_i - \bar{x})^2}{A^2}\right] . \]

(e) Suppose Θ0 = {(µ, σ) : −∞ < µ < ∞, A ≤ σ < ∞}. Define T = [Σ_{i=1}^n (Xi − X̄)²]⁻¹, and let t denote the observed value of T. Then

\begin{align*}
p(t; \sigma) &= P\left[\frac{1}{\sum_{i=1}^n (X_i - \bar{X})^2} \ge t\right]
= P\Big[\sum_{i=1}^n (X_i - \bar{X})^2 \le \frac{1}{t}\Big]
= P\left[\frac{\sum_{i=1}^n (X_i - \bar{X})^2}{\sigma^2} \le \frac{1}{t\sigma^2}\right]
= P\left[\chi^2_{n-1} \le \frac{1}{t\sigma^2}\right] .
\end{align*}

To maximize this we choose σ in [A, ∞) as small as possible, which clearly means choosing σ = A. Hence the p-value is

\[ P\left[\chi^2_{n-1} \le \frac{1}{tA^2}\right] = P\left[\chi^2_{n-1} \le \frac{\sum_{i=1}^n (x_i - \bar{x})^2}{A^2}\right] . \]

(f) Suppose Θ0 = {(µ, A) : −∞ < µ < ∞}. Define

\[ T = \max\left\{\frac{\sum_{i=1}^n (X_i - \bar{X})^2}{A^2},\; \frac{A^2}{\sum_{i=1}^n (X_i - \bar{X})^2}\right\} . \]

Let t denote the observed value of T and note that t must be greater than or equal to 1. Then

\begin{align*}
p(t; \sigma) &= P\Big[\sum_{i=1}^n (X_i - \bar{X})^2 \ge A^2 t\Big] + P\Big[\sum_{i=1}^n (X_i - \bar{X})^2 \le \frac{A^2}{t}\Big] \\
&= P\left[\frac{\sum_{i=1}^n (X_i - \bar{X})^2}{\sigma^2} \ge \frac{A^2 t}{\sigma^2}\right]
 + P\left[\frac{\sum_{i=1}^n (X_i - \bar{X})^2}{\sigma^2} \le \frac{A^2}{t\sigma^2}\right] \\
&= P\left[\chi^2_{n-1} \ge \frac{A^2 t}{\sigma^2}\right] + P\left[\chi^2_{n-1} \le \frac{A^2}{t\sigma^2}\right] .
\end{align*}

Since σ = A for every element of Θ0, the maximization is trivially found by setting σ = A. Hence the p-value is

\[ P\big[\chi^2_{n-1} \ge t\big] + P\left[\chi^2_{n-1} \le \frac{1}{t}\right]
\qquad \text{where} \qquad
t = \max\left\{\frac{\sum_{i=1}^n (x_i - \bar{x})^2}{A^2},\; \frac{A^2}{\sum_{i=1}^n (x_i - \bar{x})^2}\right\} . \quad \Box \]
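In R the p-values for cases (a) and (c) are routine one-sample t calculations; the simulated data, the constant A and the sample size below are illustrative assumptions.

set.seed(7)
n <- 20; A <- 10
x <- rnorm(n, mean = 10.8, sd = 2)
tstat <- sqrt(n) * (mean(x) - A) / sd(x)
pt(tstat, df = n - 1, lower.tail = FALSE)            # case (a): H0: mu <= A
2 * pt(abs(tstat), df = n - 1, lower.tail = FALSE)   # case (c): H0: mu = A
t.test(x, mu = A, alternative = "greater")$p.value   # agrees with case (a)
t.test(x, mu = A)$p.value                            # agrees with case (c)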


Example 4.6 (Two Gaussian samples). Suppose that we have data X1, X2, . . . , Xn which are iid observations from a N(µ1, σ²) density and data y1, y2, . . . , ym which are iid observations from a N(µ2, σ²) density where µ1, µ2, and σ are unknown. Here

θ = (µ1, µ2, σ)  and  Θ = {(µ1, µ2, σ) : −∞ < µ1 < ∞, −∞ < µ2 < ∞, 0 < σ < ∞}.

Define

\[ s^2 = \frac{\sum_{i=1}^n (x_i - \bar{x})^2 + \sum_{j=1}^m (y_j - \bar{y})^2}{n + m - 2} . \]

(a) Suppose Θ0 = {(µ1, µ2, σ) : −∞ < µ1 < ∞, µ1 < µ2 < ∞, 0 < σ < ∞}. Define T = x̄ − ȳ. Let t denote the observed value of T. Then

\begin{align*}
p(t; \theta) &= P[\bar{x} - \bar{y} \ge t]
= P\left[\frac{(\bar{x} - \bar{y}) - (\mu_1 - \mu_2)}{\sqrt{s^2(\frac{1}{n} + \frac{1}{m})}} \ge \frac{t - (\mu_1 - \mu_2)}{\sqrt{s^2(\frac{1}{n} + \frac{1}{m})}}\right]
= P\left[t_{n+m-2} \ge \frac{t - (\mu_1 - \mu_2)}{\sqrt{s^2(\frac{1}{n} + \frac{1}{m})}}\right] .
\end{align*}

To maximize this we choose µ2 > µ1 in such a way as to maximize the probability, which clearly implies choosing µ2 = µ1. Hence the p-value is

\[ P\left[t_{n+m-2} \ge \frac{t}{\sqrt{s^2(\frac{1}{n} + \frac{1}{m})}}\right]
= P\left[t_{n+m-2} \ge \frac{\bar{x} - \bar{y}}{\sqrt{s^2(\frac{1}{n} + \frac{1}{m})}}\right] . \]

(b) Suppose Θ0 = {(µ1, µ2, σ) : −∞ < µ1 < ∞, −∞ < µ2 < µ1, 0 < σ < ∞}. Define T = ȳ − x̄. Let t denote the observed value of T. Then

\begin{align*}
p(t; \theta) &= P[\bar{y} - \bar{x} \ge t]
= P\left[\frac{(\bar{y} - \bar{x}) - (\mu_2 - \mu_1)}{\sqrt{s^2(\frac{1}{n} + \frac{1}{m})}} \ge \frac{t - (\mu_2 - \mu_1)}{\sqrt{s^2(\frac{1}{n} + \frac{1}{m})}}\right]
= P\left[t_{n+m-2} \ge \frac{t - (\mu_2 - \mu_1)}{\sqrt{s^2(\frac{1}{n} + \frac{1}{m})}}\right] .
\end{align*}

To maximize this we choose µ2 < µ1 in such a way as to maximize the probability, which clearly implies choosing µ2 = µ1. Hence the p-value is

\[ P\left[t_{n+m-2} \ge \frac{t}{\sqrt{s^2(\frac{1}{n} + \frac{1}{m})}}\right]
= P\left[t_{n+m-2} \ge \frac{\bar{y} - \bar{x}}{\sqrt{s^2(\frac{1}{n} + \frac{1}{m})}}\right] . \]


(c) Suppose Θ0 = {(µ, µ, σ) : −∞ < µ < ∞, 0 < σ < ∞}. Define T = |ȳ − x̄|. Let t denote the observed value of T. Then

\begin{align*}
p(t; \theta) &= P[|\bar{y} - \bar{x}| \ge t] = P[\bar{y} - \bar{x} \ge t] + P[\bar{y} - \bar{x} \le -t] \\
&= P\left[\frac{(\bar{y} - \bar{x}) - (\mu_2 - \mu_1)}{\sqrt{s^2(\frac{1}{n} + \frac{1}{m})}} \ge \frac{t - (\mu_2 - \mu_1)}{\sqrt{s^2(\frac{1}{n} + \frac{1}{m})}}\right]
 + P\left[\frac{(\bar{y} - \bar{x}) - (\mu_2 - \mu_1)}{\sqrt{s^2(\frac{1}{n} + \frac{1}{m})}} \le \frac{-t - (\mu_2 - \mu_1)}{\sqrt{s^2(\frac{1}{n} + \frac{1}{m})}}\right] \\
&= P\left[t_{n+m-2} \ge \frac{t - (\mu_2 - \mu_1)}{\sqrt{s^2(\frac{1}{n} + \frac{1}{m})}}\right]
 + P\left[t_{n+m-2} \le \frac{-t - (\mu_2 - \mu_1)}{\sqrt{s^2(\frac{1}{n} + \frac{1}{m})}}\right] .
\end{align*}

Since, for all sets of parameter values in Θ0, we have µ1 = µ2, the maximization is trivial and so the p-value is

\[ P\left[t_{n+m-2} \ge \frac{t}{\sqrt{s^2(\frac{1}{n} + \frac{1}{m})}}\right]
 + P\left[t_{n+m-2} \le \frac{-t}{\sqrt{s^2(\frac{1}{n} + \frac{1}{m})}}\right]
= 2P\left[t_{n+m-2} \ge \frac{t}{\sqrt{s^2(\frac{1}{n} + \frac{1}{m})}}\right]
= 2P\left[t_{n+m-2} \ge \frac{|\bar{y} - \bar{x}|}{\sqrt{s^2(\frac{1}{n} + \frac{1}{m})}}\right] . \]

(d) Suppose that we have data X1, X2, . . . , Xn which are iid observations from a N(µ1, σ²₁) density and data y1, y2, . . . , ym which are iid observations from a N(µ2, σ²₂) density where µ1, µ2, σ1, and σ2 are all unknown. Here θ = (µ1, µ2, σ1, σ2) and Θ = {(µ1, µ2, σ1, σ2) : −∞ < µ1 < ∞, −∞ < µ2 < ∞, 0 < σ1 < ∞, 0 < σ2 < ∞}. Define

\[ s_1^2 = \frac{\sum_{i=1}^n (x_i - \bar{x})^2}{n - 1}
\qquad \text{and} \qquad
s_2^2 = \frac{\sum_{j=1}^m (y_j - \bar{y})^2}{m - 1} . \]

Suppose Θ0 = {(µ1, µ2, σ, σ) : −∞ < µ1 < ∞, −∞ < µ2 < ∞, 0 < σ < ∞}. Define

\[ T = \max\left\{\frac{s_1^2}{s_2^2},\; \frac{s_2^2}{s_1^2}\right\} . \]


Let t denote the observed value of T and observe that t must be greater than or equal to 1. Then

\begin{align*}
p(t; \theta) &= P\left[\frac{s_1^2}{s_2^2} \ge t\right] + P\left[\frac{s_2^2}{s_1^2} \ge t\right]
= P\left[\frac{s_1^2}{s_2^2} \ge t\right] + P\left[\frac{s_1^2}{s_2^2} \le \frac{1}{t}\right] \\
&= P\left[\frac{\sigma_2^2 s_1^2}{\sigma_1^2 s_2^2} \ge \frac{\sigma_2^2 t}{\sigma_1^2}\right]
 + P\left[\frac{\sigma_2^2 s_1^2}{\sigma_1^2 s_2^2} \le \frac{\sigma_2^2}{\sigma_1^2 t}\right]
= P\left[F_{n-1,m-1} \ge \frac{\sigma_2^2 t}{\sigma_1^2}\right]
 + P\left[F_{n-1,m-1} \le \frac{\sigma_2^2}{\sigma_1^2 t}\right] .
\end{align*}

Since, for all sets of parameter values in Θ0, we have σ1 = σ2, the maximization is trivial and so the p-value is

\[ P\big[F_{n-1,m-1} \ge t\big] + P\left[F_{n-1,m-1} \le \frac{1}{t}\right] . \quad \Box \]
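A sketch of the corresponding calculations in R; the simulated data are illustrative assumptions.

set.seed(8)
n <- 14; m <- 11
x <- rnorm(n, mean = 5, sd = 2)
y <- rnorm(m, mean = 4, sd = 2)
t.test(x, y, var.equal = TRUE)$p.value           # case (c): H0: mu1 = mu2, common sigma
tstat <- max(var(x) / var(y), var(y) / var(x))   # case (d) test statistic
pf(tstat, n - 1, m - 1, lower.tail = FALSE) + pf(1 / tstat, n - 1, m - 1)
# var.test(x, y) gives a closely related two-sided p-value (it doubles a single tail)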

4.4 Generally Applicable Test Procedures

Suppose that we observe the value of a random vector X whose probability density function is g(x|θ) for x ∈ 𝒳, where the parameter θ = (θ1, θ2, . . . , θp) is some unknown element of the set Θ ⊆ R^p. Let Θ0 be a specified subset of Θ. Consider the hypothesis H0 : θ ∈ Θ0. In this section we consider three ways in which good test statistics may be found for this general problem.

The Likelihood Ratio Test: This test statistic is based on the idea that the maximum of the log likelihood over the subset Θ0 should not be too much less than the maximum over the whole set Θ if, in fact, the parameter θ actually does lie in the subset Θ0. Let ℓ(θ) denote the log likelihood function. The test statistic is

\[ T_1(x) = 2[\ell(\hat{\theta}) - \ell(\hat{\theta}_0)] \]

where θ̂ is the value of θ in the set Θ for which ℓ(θ) is a maximum and θ̂0 is the value of θ in the set Θ0 for which ℓ(θ) is a maximum.

The Maximum Likelihood Test Statistic: This test statistic is based on the idea that θ̂ and θ̂0 should be close to one another. Let I(θ) be the p × p information matrix. Let B = I(θ̂). The test statistic is

\[ T_2(x) = (\hat{\theta} - \hat{\theta}_0)^T B (\hat{\theta} - \hat{\theta}_0) . \]

Other forms of this test statistic follow by choosing B to be I(θ̂0) or EI(θ̂) or EI(θ̂0).


The Score Test Statistic: This test statistic is based on the idea that θ̂0 should almost solve the likelihood equations. Let S(θ) be the p × 1 vector whose rth element is given by ∂ℓ/∂θr. Let C be the inverse of I(θ̂0), i.e. C = I(θ̂0)⁻¹. The test statistic is

\[ T_3(x) = S(\hat{\theta}_0)^T C\, S(\hat{\theta}_0) . \]

In order to calculate p-values we need to know the probability distribution of the test statistic under the null hypothesis. Deriving the exact probability distribution may be difficult, but approximations suitable for situations in which the sample size is large are available in the special case where Θ is a p-dimensional set and Θ0 is a q-dimensional subset of Θ for q < p, whence it can be shown that, when H0 is true, the probability distributions of T1(x), T2(x) and T3(x) are all approximated by a χ²_{p−q} density.

Example 4.7. Let X1, X2, . . . , Xn be iid each having a Poisson distribution with parameter θ. Consider testing H0 : θ = θ0 where θ0 is some specified constant. Recall that

\[ \ell(\theta) = \Big[\sum_{i=1}^n x_i\Big]\log[\theta] - n\theta - \sum_{i=1}^n \log[x_i!] . \]

Here Θ = [0, ∞) and the value of θ ∈ Θ for which ℓ(θ) is a maximum is θ̂ = x̄. Also Θ0 = {θ0} and so trivially θ̂0 = θ0. We saw also that

\[ S(\theta) = \frac{\sum_{i=1}^n x_i}{\theta} - n
\qquad \text{and} \qquad
I(\theta) = \frac{\sum_{i=1}^n x_i}{\theta^2} . \]

Suppose that θ0 = 2, n = 40 and that when we observe the data we get x̄ = 2.50. Hence Σ_{i=1}^n xi = 100. Then

\[ T_1 = 2[\ell(2.5) - \ell(2.0)] = 200\log(2.5) - 200 - 200\log(2.0) + 160 = 4.62 . \]

The information is B = I(θ̂) = 100/2.5² = 16. Hence

\[ T_2 = (\hat{\theta} - \hat{\theta}_0)^2 B = 0.25 \times 16 = 4 . \]

We have S(θ0) = S(2.0) = 10 and I(θ0) = 25 and so

\[ T_3 = 10^2/25 = 4 . \]

Here p = 1, q = 0, implying p − q = 1. Since P[χ²₁ ≥ 3.84] = 0.05, all three test statistics produce a p-value less than 0.05 and lead to the rejection of H0 : θ = 2. □
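The three statistics of Example 4.7 are easy to reproduce in R directly from the formulas above (the Σ log[xi!] term cancels in T1):

n <- 40; xbar <- 2.5; theta0 <- 2
sumx <- n * xbar                                   # 100
T1 <- 2 * ((sumx * log(xbar) - n * xbar) - (sumx * log(theta0) - n * theta0))
T2 <- (xbar - theta0)^2 * (sumx / xbar^2)          # B = I(theta-hat) = 16
T3 <- (sumx / theta0 - n)^2 / (sumx / theta0^2)    # S(theta0)^2 / I(theta0)
c(T1, T2, T3)                                      # 4.63, 4, 4
pchisq(c(T1, T2, T3), df = 1, lower.tail = FALSE)  # all three p-values are below 0.05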


Example 4.8. Let X1, X2, . . . , Xn be iid with density f(x|α, β) = αβx^{β−1} exp(−αx^β) for x ≥ 0. Consider testing H0 : β = 1. Here Θ = {(α, β) : 0 < α < ∞, 0 < β < ∞} and Θ0 = {(α, 1) : 0 < α < ∞} is a one-dimensional subset of the two-dimensional set Θ. Recall that ℓ(α, β) = n log[α] + n log[β] + (β − 1)Σ_{i=1}^n log[xi] − α Σ_{i=1}^n xi^β. Hence the vector S(α, β) is given by

\[ S(\alpha, \beta) = \begin{pmatrix}
n/\alpha - \sum_{i=1}^n x_i^{\beta} \\
n/\beta + \sum_{i=1}^n \log[x_i] - \alpha\sum_{i=1}^n x_i^{\beta}\log[x_i]
\end{pmatrix} \]

and the matrix I(α, β) is given by

\[ I(\alpha, \beta) = \begin{pmatrix}
n/\alpha^2 & \sum_{i=1}^n x_i^{\beta}\log[x_i] \\
\sum_{i=1}^n x_i^{\beta}\log[x_i] & n/\beta^2 + \alpha\sum_{i=1}^n x_i^{\beta}\log[x_i]^2
\end{pmatrix} . \]

We have that θ̂ = (α̂, β̂), whose components require Newton's method for their calculation. Also θ̂0 = (α̂0, 1) where α̂0 = 1/x̄. Suppose that the observed value of T1(x) is 3.20. Then the p-value is P[T1(x) ≥ 3.20] ≈ P[χ²₁ ≥ 3.20] = 0.0736. In order to get the maximum likelihood test statistic, plug in the values α̂, β̂ for α, β in the formula for I(α, β) to get the matrix B. Then calculate T2(x) = (θ̂ − θ̂0)^T B(θ̂ − θ̂0) and use the χ²₁ tables to calculate the p-value. Finally, to calculate the score test statistic, note that the vector S(θ̂0) is given by

\[ S(\hat{\theta}_0) = \begin{pmatrix}
0 \\
n + \sum_{i=1}^n \log[x_i] - \sum_{i=1}^n x_i\log[x_i]/\bar{x}
\end{pmatrix} \]

and the matrix I(θ̂0) is given by

\[ I(\hat{\theta}_0) = \begin{pmatrix}
n\bar{x}^2 & \sum_{i=1}^n x_i\log[x_i] \\
\sum_{i=1}^n x_i\log[x_i] & n + \sum_{i=1}^n x_i\log[x_i]^2/\bar{x}
\end{pmatrix} . \]

Since T3(x) = S(θ̂0)^T C S(θ̂0) where C = I(θ̂0)⁻¹, we have that T3(x) equals

\[ \Big[n + \sum_{i=1}^n \log[x_i] - \sum_{i=1}^n x_i\log[x_i]/\bar{x}\Big]^2 \]

multiplied by the lower diagonal element of C, which is given by

\[ \frac{n\bar{x}^2}{[n\bar{x}^2]\big[n + \sum_{i=1}^n x_i\log[x_i]^2/\bar{x}\big] - \big[\sum_{i=1}^n x_i\log[x_i]\big]^2} . \]

Hence we get that

\[ T_3(x) = \frac{\big[n + \sum_{i=1}^n \log[x_i] - \sum_{i=1}^n x_i\log[x_i]/\bar{x}\big]^2\, n\bar{x}^2}
{[n\bar{x}^2]\big[n + \sum_{i=1}^n x_i\log[x_i]^2/\bar{x}\big] - \big[\sum_{i=1}^n x_i\log[x_i]\big]^2} . \]

No iterative techniques are needed to calculate the value of T3(x), and for this reason the score test is often preferred to the other two. However, there is some evidence that the likelihood ratio test is more powerful in the sense that it has a better chance of detecting departures from the null hypothesis. □


4.5 The Neyman-Pearson Lemma

Suppose we are testing a simple null hypothesis H0 : θ = θ′ against a simple alternative H1 : θ = θ′′, where θ is the parameter of interest, and θ′, θ′′ are particular values of θ. Observed values of the i.i.d. random variables X1, X2, . . . , Xn, each with p.d.f. fX(x|θ), are available. We are going to reject H0 if (x1, x2, . . . , xn) ∈ C, where C is a region of the n-dimensional space called the critical region. This specifies a test of hypothesis. We say that the critical region C has size α if the probability of a Type I error is α:

\[ P[(X_1, X_2, \ldots, X_n) \in C \mid H_0] = \alpha . \]

We call C a best critical region of size α if it has size α, and

\[ P[(X_1, X_2, \ldots, X_n) \in C \mid H_1] \ge P[(X_1, X_2, \ldots, X_n) \in A \mid H_1] \]

for every subset A of the sample space for which P[(X1, X2, . . . , Xn) ∈ A | H0] = α. Thus, the power of the test associated with the best critical region C is at least as great as the power of the test associated with any other critical region A of size α. The Neyman-Pearson lemma provides us with a way of finding a best critical region.

Lemma 4.1 (The Neyman-Pearson lemma). If k > 0 and C is a subset of the sample space such that

(a) L(θ′)/L(θ′′) ≤ k for all (x1, x2, . . . , xn) ∈ C,

(b) L(θ′)/L(θ′′) ≥ k for all (x1, x2, . . . , xn) ∈ C∗,

(c) α = P[(X1, X2, . . . , Xn) ∈ C | H0],

where C∗ is the complement of C, then C is a best critical region of size α for testing the simple hypothesis H0 : θ = θ′ against the alternative simple hypothesis H1 : θ = θ′′.

Proof. Suppose for simplicity that the random variables X1, X2, . . . , Xn are continuous. (If they were discrete, the proof would be the same, except that integrals would be replaced by sums.) For any region R of n-dimensional space, we will denote the probability that X ∈ R by ∫_R L(θ), where θ is the true value of the parameter. The full notation, omitted to save space, would be

\[ P(X \in R \mid \theta) = \int \cdots \int_R L(\theta \mid x_1, \ldots, x_n)\, dx_1 \cdots dx_n . \]

We need to prove that if A is another critical region of size α, then the power of the test associated with C is at least as great as the power of the test associated with A, or in the present notation, that

\[ \int_A L(\theta'') \le \int_C L(\theta'') . \tag{4.1} \]


Suppose X ∈ A∗ ∩ C. Then X ∈ C, so by (a),

\[ \int_{A^* \cap C} L(\theta'') \ge \frac{1}{k}\int_{A^* \cap C} L(\theta') . \tag{4.2} \]

Next, suppose X ∈ A ∩ C∗. Then X ∈ C∗, so by (b),

\[ \int_{A \cap C^*} L(\theta'') \le \frac{1}{k}\int_{A \cap C^*} L(\theta') . \tag{4.3} \]

We now establish (4.1), thereby completing the proof.

\begin{align*}
\int_A L(\theta'') &= \int_{A \cap C} L(\theta'') + \int_{A \cap C^*} L(\theta'') \\
&= \int_C L(\theta'') - \int_{A^* \cap C} L(\theta'') + \int_{A \cap C^*} L(\theta'') \\
&\le \int_C L(\theta'') - \frac{1}{k}\int_{A^* \cap C} L(\theta') + \frac{1}{k}\int_{A \cap C^*} L(\theta')
\qquad (\text{see (4.2), (4.3)}) \\
&= \int_C L(\theta'') - \frac{1}{k}\int_{A^* \cap C} L(\theta') + \frac{1}{k}\int_{A \cap C^*} L(\theta')
 + \left[\frac{1}{k}\int_{A \cap C} L(\theta') - \frac{1}{k}\int_{A \cap C} L(\theta')\right]
\qquad (\text{add zero}) \\
&= \int_C L(\theta'') - \frac{1}{k}\int_{C} L(\theta') + \frac{1}{k}\int_{A} L(\theta')
\qquad (\text{collect terms}) \\
&= \int_C L(\theta'') - \frac{\alpha}{k} + \frac{\alpha}{k}
= \int_C L(\theta'')
\end{align*}

since both C and A have size α.

Example 4.9. Suppose X1, . . . , Xn are iid N(θ, 1), and we want to test H0 : θ = θ′ versus H1 : θ = θ′′, where θ′′ > θ′. According to the Z-test, we should reject H0 if Z = √n(X̄ − θ′) is large, or equivalently if X̄ is large. We can now use the Neyman-Pearson lemma to show that the Z-test is "best". The likelihood function is

\[ L(\theta) = (2\pi)^{-n/2}\exp\Big\{-\sum_{i=1}^n (x_i - \theta)^2/2\Big\} . \]


According to the Neyman-Pearson lemma, a best critical region is given by the set of (x1, . . . , xn) such that L(θ′)/L(θ′′) ≤ k1, or equivalently, such that

\[ \frac{1}{n}\ln[L(\theta'')/L(\theta')] \ge k_2 . \]

But

\begin{align*}
\frac{1}{n}\ln[L(\theta'')/L(\theta')] &= \frac{1}{n}\sum_{i=1}^n \big[(x_i - \theta')^2/2 - (x_i - \theta'')^2/2\big] \\
&= \frac{1}{2n}\sum_{i=1}^n \big[(x_i^2 - 2\theta' x_i + \theta'^2) - (x_i^2 - 2\theta'' x_i + \theta''^2)\big] \\
&= \frac{1}{2n}\sum_{i=1}^n \big[2(\theta'' - \theta')x_i + \theta'^2 - \theta''^2\big] \\
&= (\theta'' - \theta')\bar{x} + \frac{1}{2}\big[\theta'^2 - \theta''^2\big] .
\end{align*}

So the best test rejects H0 when x̄ ≥ k, where k is a constant. But this is exactly the form of the rejection region for the Z-test. Therefore, the Z-test is "best". □

4.6 Goodness of Fit Tests

Suppose that we have a random experiment with a random variable Y of interest. Assume additionally that Y is discrete with density function f on a finite set S. We repeat the experiment n times to generate a random sample Y1, Y2, . . . , Yn from the distribution of Y. These are independent variables, each with the distribution of Y.

In this section, we assume that the distribution of Y is unknown. For a given density function f0, we will test the hypotheses H0 : f = f0 versus H1 : f ≠ f0. The test that we will construct is known as the goodness of fit test for the conjectured density f0. As usual, our challenge in developing the test is to find an appropriate test statistic: one that gives us information about the hypotheses and whose distribution, under the null hypothesis, is known, at least approximately.

Suppose that S = {y1, y2, . . . , yk}. To simplify the notation, let pj = f0(yj) for j = 1, 2, . . . , k. Now let Nj = #{i ∈ {1, 2, . . . , n} : Yi = yj} for j = 1, 2, . . . , k. Under the null hypothesis, (N1, N2, . . . , Nk) has the multinomial distribution with parameters n and p1, p2, . . . , pk, with E(Nj) = npj and Var(Nj) = npj(1 − pj). This result indicates how we might begin to construct our test: for each j we can compare the observed frequency of yj (namely Nj) with the expected frequency of the value yj (namely npj) under the null hypothesis. Specifically, our test statistic will be

\[ X^2 = \frac{(N_1 - np_1)^2}{np_1} + \frac{(N_2 - np_2)^2}{np_2} + \cdots + \frac{(N_k - np_k)^2}{np_k} . \]


Note that the test statistic is based on the squared errors (the differences between the expected frequencies and the observed frequencies). The reason that the squared errors are scaled as they are is the following crucial fact, which we will accept without proof: under the null hypothesis, as n increases to infinity, the distribution of X² converges to the chi-square distribution with k − 1 degrees of freedom.

For m > 0 and r in (0, 1), we will let χ²_{m,r} denote the quantile of order r for the chi-square distribution with m degrees of freedom. Then the following test has approximate significance level α: reject H0 : f = f0 versus H1 : f ≠ f0 if and only if X² > χ²_{k−1,1−α}. The test is an approximate one and works best when n is large. Just how large n needs to be depends on the pj. One popular rule of thumb proposes that the test will work well if all the expected frequencies satisfy npj ≥ 1 and at least 80% of the expected frequencies satisfy npj ≥ 5.

Example 4.10 (Genetical inheritance). In crosses between two types of maize four distinct types of plants were found in the second generation. In a sample of 1301 plants there were 773 green, 231 golden, 238 green-striped, and 59 golden-green-striped. According to a simple theory of genetical inheritance the probabilities of obtaining these four plants are 9/16, 3/16, 3/16 and 1/16 respectively. Is the theory acceptable as a model for this experiment?

Formally we will consider the hypotheses:

H0 : p1 = 9/16, p2 = 3/16, p3 = 3/16 and p4 = 1/16;
H1 : not all the above probabilities are correct.

The expected frequency for each type of plant under H0 is npi = 1301pi. We therefore calculate the following table:

Observed Counts Oi   Expected Counts Ei   Contributions to X²: (Oi − Ei)²/Ei
773                  731.8125             2.318
231                  243.9375             0.686
238                  243.9375             0.145
 59                   81.3125             6.123
                                          X² = 9.272

Since X² embodies the differences between the observed and expected values, we can say that if X² is large there is a big difference between what we observe and what we expect, so the theory does not seem to be supported by the observations. If X² is small the observations apparently conform to the theory and act as support for the theory. The test statistic X² is distributed approximately as χ² on 3 degrees of freedom. In order to define what we would consider to be an unusually large value of X² we will choose a significance level of α = 0.05. The R command qchisq(p=0.05,df=3,lower.tail=FALSE) calculates the 5% critical value for the test as 7.815. Since our value of X² is greater than the critical value 7.815 we reject H0 and conclude that the theory is not a good model for these data. The R command pchisq(q=9.272,df=3,lower.tail=FALSE) calculates the p-value for the test equal to 0.026. (These data are examined further in chapter 9 of Snedecor and Cochran.) □
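R performs this test directly with chisq.test; the observed counts and hypothesised probabilities below are those of Example 4.10.

obs <- c(773, 231, 238, 59)
chisq.test(x = obs, p = c(9, 3, 3, 1) / 16)   # X-squared = 9.27, df = 3, p-value = 0.026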

Very often we do not have a list of probabilities to specify our hypothesis as we had in the above example. Rather, our hypothesis relates to the probability distribution of the counts without necessarily specifying the parameters of the distribution. For instance, we might want to test that the number of male babies born on successive days in a maternity hospital followed a binomial distribution, without specifying the probability that any given baby will be male. Or we might want to test that the number of defective items in large consignments of spare parts for cars follows a Poisson distribution, again without specifying the parameter of the distribution.

The X² test is applicable when all the probabilities depend on unknown parameters, provided that the unknown parameters are replaced by their maximum likelihood estimates and provided that one degree of freedom is deducted for each parameter estimated.

Example 4.11. Feller reports an analysis of flying-bomb hits in the south of London during World War II. Investigators partitioned the area into 576 sectors, each being ¼ km². The following table gives the resulting data:

No. of hits (x)              0    1    2    3    4    5
No. of sectors with x hits  229  211   93   35    7    1

If the hit pattern is random, in the sense that the probability that a bomb will land in any particular sector is constant, irrespective of the landing place of previous bombs, a Poisson distribution might be expected to model the data.

x   P(x) = θ̂^x e^(−θ̂)/x!   Expected 576 × P(x)   Observed   Contributions to X²: (Oi − Ei)²/Ei
0   0.395                   227.53                229        0.0095
1   0.367                   211.34                211        0.0005
2   0.170                    98.15                 93        0.2702
3   0.053                    30.39                 35        0.6993
4   0.012                     7.06                  7        0.0005
5   0.002                     1.31                  1        0.0734
                                                             X² = 1.0534

The MLE of θ was calculated as θ̂ = 535/576 = 0.9288, that is, the total number of observed hits divided by the number of sectors. We carry out the chi-squared test as before, except that we now subtract one additional degree of freedom because we had to estimate θ. The test statistic X² is distributed approximately as χ² on 4 degrees of freedom. The R command qchisq(p=0.05,df=4,lower.tail=FALSE) calculates the 5% critical value for the test as 9.488. Alternatively, the R command pchisq(q=1.0534,df=4,lower.tail=FALSE) calculates the p-value for the test equal to 0.90. The result of the chi-squared test is not statistically significant, indicating that the divergence between the observed and expected counts can be regarded as random fluctuation about the expected values. Feller comments, "It is interesting to note that most people believed in a tendency of the points of impact to cluster. If this were true, there would be a higher frequency of sectors with either many hits or no hits and a deficiency in the intermediate classes. The above table indicates perfect randomness and homogeneity of the area; we have here an instructive illustration of the established fact that to the untrained eye randomness appears as regularity or tendency to cluster." □
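A sketch of the whole calculation in R. Because a parameter was estimated, the tail probability is computed by hand with 4 degrees of freedom rather than taken directly from chisq.test.

hits     <- 0:5
sectors  <- c(229, 211, 93, 35, 7, 1)
thetahat <- sum(hits * sectors) / sum(sectors)        # 535/576 = 0.9288
expected <- sum(sectors) * dpois(hits, thetahat)
X2 <- sum((sectors - expected)^2 / expected)          # about 1.05
pchisq(X2, df = length(hits) - 1 - 1, lower.tail = FALSE)   # about 0.90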

4.7 The χ² Test for Contingency Tables

Let X and Y be a pair of categorical variables and suppose there are r possible values for X and c possible values for Y. Examples of categorical variables are Religion, Race, Social Class, Blood Group, Wind Direction, Fertiliser Type etc. The random variables X and Y are said to be independent if P[X = a, Y = b] = P[X = a]P[Y = b] for all possible values a of X and b of Y. In this section we consider how to test the null hypothesis of independence using data consisting of a random sample of N observations from the joint distribution of X and Y.

Example 4.12. A study was carried out to investigate whether hair colour (columns) and eye colour (rows) were genetically linked. A genetic link would be supported if the proportions of people having various eye colourings varied from one hair colour grouping to another. 1000 people were chosen at random and their hair colour and eye colour recorded. The data are summarised in the following table:

Oij      Black  Brown  Fair  Red  Total
Brown       60    110    42   30    242
Green       67    142    28   35    272
Blue       123    248    90   25    486
Total      250    500   160   90   1000

The proportion of people with red hair is 90/1000 = 0.09 and the proportion having blue eyes is 486/1000 = 0.486.


So if eye colour and hair colour were truly independent we would expect the proportion of people having both red hair and blue eyes to be approximately equal to (0.090)(0.486) = 0.04374, or equivalently we would expect the number of people having both red hair and blue eyes to be close to (1000)(0.04374) = 43.74. The observed number of people having both red hair and blue eyes is 25. We can do similar calculations for all other combinations of hair colour and eye colour to derive the following table of expected counts:

Eij      Black  Brown   Fair    Red  Total
Brown     60.5    121  38.72  21.78    242
Green     68.0    136  43.52  24.48    272
Blue     121.5    243  77.76  43.74    486
Total    250.0    500 160.00  90.00   1000

In order to test the null hypothesis of independence we need a test statistic which measures the magnitude of the discrepancy between the observed table and the table that would be expected if independence were in fact true. In the early part of the twentieth century, long before the invention of maximum likelihood or the formal theory of hypothesis testing, Karl Pearson (one of the founding fathers of Statistics) proposed the following method of constructing such a measure of discrepancy: for each cell in the table calculate (Oij − Eij)²/Eij, where Oij is the observed count and Eij is the expected count, and add the resulting values across all cells of the table. The resulting total is called the χ² test statistic, which we will denote by W. The null hypothesis of independence is rejected if the observed value of W is surprisingly large. In the hair and eye colour example the cell discrepancies (Oij − Eij)²/Eij are as follows:

         Black  Brown   Fair    Red
Brown    0.004  1.000  0.278  3.102
Green    0.015  0.265  5.535  4.521
Blue     0.019  0.103  1.927  8.029

so that

\[ W = \sum_{i=1}^{r}\sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}} = 24.796 . \]

What we would now like to calculate is the p-value, which is the probability of getting a value for W as large as 24.796 if the hypothesis of independence were in fact true. Fisher showed that, when the hypothesis of independence is true, W behaves somewhat like a χ² random variable with degrees of freedom given by (r − 1)(c − 1), where r is the number of rows in the table and c is the number of columns. In our example r = 3, c = 4 and so (r − 1)(c − 1) = 6, and so the p-value is P[W ≥ 24.796] ≈ P[χ²₆ ≥ 24.796] = 0.0004. Hence we reject the independence hypothesis. □


4.8 Worked Problems<br />

The Problems<br />

1. A random sample of n flowers is taken from a colony and the numbers X, Y and<br />

Z of the three genotypes AA, Aa and aa are observed, where X + Y + Z = n.<br />

Under the hypothesis of random cross-fertilisation, each flower has probabilities<br />

θ 2 , 2θ(1−θ) and (1−θ) 2 of belonging to the respective genotypes, where 0 < θ < 1<br />

is an unknown parameter.<br />

(a) Show that the MLE of θ is ˆ θ = (2X + Y )/(2n).<br />

(b) Consider the test statistic T = 2X + Y. Given that T has a binomial dis-<br />

tribution with parameters 2n and θ, obtain a critical region of approximate<br />

size α based on T for testing the null hypothesis that θ = θ0 against the<br />

alternative that θ = θ1, where θ1 < θ0 and 0 < α < 1.<br />

(c) Show that the test in part (b) is the most powerful of size α.<br />

(d) Deduce approximately how large n must be to ensure that the power is at<br />

least 0.9 when α = 0.05, θ0 = 0.4 and θ1 = 0.3.<br />

2. Suppose that the household incomes in a certain country have the Pareto distri-<br />

bution with probability density function<br />

f(x) = θvθ<br />

, v ≤ x < ∞ ,<br />

xθ+1 where θ > 0 is unknown and v > 0 is known. Let x1, x2, . . . , xn denote the<br />

incomes for a random sample of n such households. It is required to test the null<br />

hypothesis θ = 1 against the alternative that θ �= 1.<br />

(a) Derive an expression for ˆ θ, the MLE of θ.<br />

(b) Show that the generalised likelihood ratio test statistic, λ(x), satisfies<br />

ln{λ(x)} = n − n ln( ˆ θ) − n<br />

ˆθ .<br />

(c) Show that the test accepts the null hypothesis if<br />

k1 <<br />

n�<br />

ln(xi) < k2 ,<br />

and state how the values of k1 and k2 may be determined.<br />

81<br />

i=1


3. Explain what is meant by the power of a test and describe how the power may<br />

be used to determine the most appropriate size of a sample.<br />

Let X1, X2, . . . , Xn be a random sample from the Weibull distribution with prob-<br />

ability density function f(x) = θλx λ−1 exp(−θx λ ), for x > 0 where θ > 0 is<br />

unknown and λ > 0 is known.<br />

(a) Find the form of the most powerful test of the null hypothesis that θ = θ0<br />

against the alternative hypothesis that θ = θ1, where θ0 > θ1.<br />

(b) Find the distribution function of X λ and deduce that this random variable<br />

has an exponential distribution.<br />

(c) Find the critical region of the most powerful test at the 1% level when<br />

n = 50, θ0 = 0.05 and θ1 = 0.025. Evaluate the power of this test.<br />

4. In a particular set of Bernoulli trials, it is widely believed that the probability of<br />

a success is θ = 3<br />

4<br />

2<br />

. However, an alternative view is that θ = . In order to test<br />

3<br />

H0 : θ = 3<br />

4 against H1 : θ = 2<br />

3 , n independent trials are to be observed. Let ˆ θ<br />

denote the proportion of successes in these trials.<br />

(a) Show that the likelihood ratio aapproach leads to a size α test in which H0<br />

is rejected in favour of H1 when ˆ θ < k for some suitable k.<br />

(b) By applying the central limit theorem, write down the large sample distri-<br />

butions of ˆ θ when H0 is true and when H1 is true.<br />

(c) Hence find an expression for k in terms of n when α = 0.05.<br />

(d) Find n so that this test has power 0.95.<br />

5. It is believed that the number of breakages in a damaged chromosome, X, follows<br />

a truncated Poisson distribution with probability mass function<br />

P (X = k) = e−λ<br />

1 − e −λ<br />

λ k<br />

k!<br />

, k = 1, 2, . . . ,<br />

where λ > 0 is an unknown parameter. The frequency distribution of the number<br />

of breakages in a random sample of 33 damaged chromosomes was as follows:<br />

Breakages 1 2 3 4 5 6 7 8 9 10 11 12 13 <strong>To</strong>tal<br />

Chromosomes 11 6 4 5 0 1 0 2 1 0 1 1 1 33<br />

(a) Assuming that the appropriate regularity conditions hold, find an equation<br />

satisfied by ˆ λ, the MLE of λ.<br />

82


(b) Assuming that an initial guess ˆ λ0 is available, use the Newton-Raphson<br />

method to find an iterative algorithm for computing the value of ˆ λ. There<br />

is no need to carry out any numerical calculations.<br />

(c) It is found that the observed data give the estimate ˆ λ = 3.6. Using this value<br />

The Solutions<br />

for ˆ λ, test the null hypothesis that the number of breakages in a damaged<br />

chromosome follows a truncated Poisson distribution. The categories 6 to<br />

13 should be combined into a single category inthe goodness-of-fit test.<br />

1. The likelihood function is proportional to θ 2x {2θ(1 − θ)} y (1 − θ) 2z ; and so the<br />

log-likelihood ℓ(θ) = ln L(θ) = (2x + y) ln θ + (y + 2z) ln(1 − θ) for 0 < θ < 1).<br />

Then S(θ) = dℓ<br />

dθ<br />

solution dℓ<br />

dθ<br />

= 2x+y<br />

θ<br />

y+2z<br />

− 1−θ and I(θ) = − d2ℓ dθ2 = 2x+y<br />

θ2 = 0 will be a maximum. ∴ 2x+y<br />

ˆθ<br />

y+2z<br />

=<br />

1− ˆ θ i.e. ˆ θ = 2x+y<br />

+ y+2z<br />

(1−θ )2 > 0 so the<br />

Testing the simple hypothesis H0 : θ = θ0 against the alternative simple hypoth-<br />

esis H1 : θ = θ1 < θ0. Critical region size α. Reject H0 if t ≤ k where k satisfies<br />

P (T ≤ k|θ = θ0) ≈ α; i.e. k� �<br />

t θ0(1 − θ0) 2n−t ≈ α. (*)<br />

�2n t<br />

t=0<br />

The likelihood ratio for testing H0 against H1 is<br />

λ = L(θ0)<br />

L(θ1) =<br />

� θ0<br />

θ1<br />

� 2x+y � 1 − θ0<br />

1 − θ1<br />

� y+2z<br />

2n .<br />

� �2n � �t 1 − θ0 θ0(1 − θ1)<br />

=<br />

1 − θ1 θ1(1 − θ0)<br />

Since θ0(1 − θ1) > θ1(1 − θ0), λ increases with t. The Neymann-Pearsom lemma<br />

then shows that the most powerful test of size α rejects H0 if t ≤ k with k defined<br />

in (*) above.<br />

Using the central limit theorem, T ∼ N(2nθ, 2nθ[1 − θ]). Therefore Φ([k +<br />

1<br />

2 − 2nθ0]/ � 2nθ0(1 − θ0)) = 0.05; ∴ [k + 1<br />

2 − 2nθ0]/ � 2nθ0(1 − θ0) = −1.645 or<br />

[k + 0.5 − 0.8n]/ √ 0.48n = −1.645. Likewise Φ([k + 1<br />

2 − 2nθ1]/ � 2nθ1(1 − θ1)) =<br />

0.95 so [k + 0.5 − 0.6n]/ √ 0.42n = 1.282. From these results, k + 0.5 = 0.8n −<br />

1.645 √ 0.48n = 0.6n+1.282 √ 0.42n which gives 0.2 √ n = 1.645 √ 0.48+1.282 √ 0.42<br />

or n = (1.645 √ 0.48+1.282 √ 0.42) 2 /0.04 = 97.1 or n = 98 since n must be integer.<br />

2. The likelihood function is L(θ) = n� �<br />

θvθ x<br />

i=1<br />

θ+1<br />

�<br />

θ =<br />

i<br />

nvnθ ( �n i=1 xi) θ+1 for v ≤ x < ∞ and<br />

θ > 0. Therefore ln L(θ) = ℓ(θ) = n ln θ+nθ ln v−(θ+1) � ln(xi). Differentiating<br />

we get the score function S(θ) = (n/θ)+n ln v − � ln xi and I(θ) = n/θ2 > 0 ∀ θ.<br />

The MLE ˆ θ is found by S( ˆ θ) = 0, implying (n/ ˆ θ) = � ln xi − n ln v = � ln � � xi<br />

v<br />

so that ˆ θ = n/ [ �n i=1 (xi/v)] .<br />

83<br />

.


For the null hypothesis θ = 1, the generalised likelihood ratio is λ = L(1)/L( ˆ θ)<br />

so that ln(λ(x)) = ℓ(1) − ℓ( ˆ θ). Thus<br />

ln(λ(x)) = n ln v − 2<br />

n�<br />

ln(xi) − n ln( ˆ θ) − nˆ θ ln v + ( ˆ θ + 1)<br />

i=1<br />

= n ln v + ( ˆ θ − 1)<br />

= − n<br />

ˆθ + ˆ θ<br />

n�<br />

i=1<br />

n�<br />

ln(xi) − n ln( ˆ θ) − nˆ θ ln v<br />

i=1<br />

n�<br />

ln(xi)<br />

ln(xi) − n ln( ˆ θ) − n ˆ θ ln v using n<br />

ˆθ = � ln xi − n ln v<br />

= − n<br />

ˆθ + n − n ln(ˆ θ), again using n<br />

ˆθ = � ln xi − n ln v<br />

�<br />

= n 1 − ln ˆ θ − 1<br />

�<br />

.<br />

ˆθ<br />

Let u = 1/ ˆ θ. Then ln(λ(x)) = −n(u−1−ln u) and d<br />

1<br />

(ln λ) = −n(1− ). Clearly<br />

du u<br />

ln λ has a maximum at u = 1. The null hypothesis H0 : θ = 1 will be rejected<br />

if λ(x) ≤ c, for some c; i.e. if u ≤ k ′ 1 or u ≥ k ′ 2. reject H0 if n�<br />

n�<br />

i=1<br />

i=1<br />

ln(xi) ≤ k1 or<br />

ln(xi) ≥ k2, where k1 = n{k<br />

i=1<br />

′ 1 = ln v} and k2 = n{k ′ 2 = ln v}. For a test size of<br />

�<br />

n�<br />

α, choose k1, k2 to satisfy P ln(xi) ≤ k1 or<br />

i=1<br />

n�<br />

�<br />

ln(xi) ≥ k2|θ = 1 = α.<br />

i=1<br />

3. The power of a test is the probability of rejecting the null hypothesis expressed<br />

as a function of the parameter under investigation. If both the significance level<br />

of the test and the power required at a particular value of the parameter are<br />

specified, then a lower bound for the necessary sample size can be determined.<br />

The likelihood function is L(θ) = n�<br />

n�<br />

i=1<br />

θλx λ−1<br />

i e −θxλ i = θ n λ n e −θ � n<br />

i=1 xλ i<br />

i=1<br />

x λ−1<br />

i , for<br />

θ > 0, and so the likelihood ratio for testing H0 : θ = θ0 against H1 : θ = θ1<br />

(where θ0 > θ1) is λ = L(θ0)<br />

L(θ1) =<br />

� θ0<br />

θ1<br />

� n<br />

exp{−(θ0 − θ1) � n<br />

i=1 xλ i }. By the Neyman-<br />

Pearson lemma, the most powerful test has critical region c = {x : n�<br />

xλ i ≥ k},<br />

where k is chosen to give significance level α for the test.<br />

�<br />

−eθtλ �∞ P (X > x) = � ∞<br />

x θλtλ−1 e −θtλ<br />

dt =<br />

x<br />

i=1<br />

= eθxλ, for x > 0. Hence P (Xλ ><br />

x) = P (X > x 1<br />

λ ) = e −θx . So P (X λ ≤ x) = 1 − e −θx , hence X λ ∼ Exp(θ).<br />

If Y1, Y2, . . . , Ym is a random sample from an exponential distribution with mean<br />

v−1 , then 2v m�<br />

Yi has a χ2 2m distribution. Applying this result 2θ n�<br />

Xλ i ∼ χ2 2n.<br />

i=1<br />

Under H0 : θ = 0.05, we have 0.1 50�<br />

Xλ i ∼ χ2 100 with 1% point 135.81. You<br />

i=1<br />

can check this using statistical tables or the R command qchisq(p=0.99,df=100).<br />

84<br />

i=1


Thus P (0.1 50 �<br />

X<br />

i=1<br />

λ i ≥ 135.81|θ = 0.05) = 0.01, and the test therefore reject H0 if<br />

�50<br />

X<br />

i=1<br />

λ i ≥ 1358.1. Under H1 : θ = 0.025, we have 0.05 50 �<br />

X<br />

i=1<br />

λ i ∼ χ2 100 and so the<br />

power of the test is P ( 50 �<br />

Xλ i ≥ 1358.1|θ = 0.025) = P (0.05 50 �<br />

Xλ i ≥ 67.905) =<br />

i=1<br />

P (χ 2 100 > 67.905) ≈ 0.994 using pchisq(q=67.905,df=100,lower.tail=FALSE).<br />

4. The likelihood for a sample (x1, x2, . . . , xn) is L(θ) = const × θ � xi (1 − θ) n− � xi ,<br />

2<br />

L( 3 and so the likelihood ratio is λ = )<br />

L( 3<br />

2<br />

( 3 =<br />

) 4 )� xi( 1<br />

3 )n−� xi ( 3<br />

4 )� xi( 1<br />

4 )n−� x = i � �<br />

8<br />

9<br />

� xi � � �<br />

4 n− xi<br />

.<br />

3<br />

Using the Neyman-pearsom lemma, we reject H0 when λ > c, where c is chosen<br />

to give the required level of test, α. Now, λ is an increasing function of � xi,<br />

hence of ˆ θ, and an equivalent rule is therefore to reject H0 when ˆ θ < k, where k<br />

is chosen to give significance level α for the test.<br />

n ˆ θ is binomial with parameters (n, θ). Hence the large-sample distribution of ˆ θ is<br />

N (θ, θ(1 − θ)/n). When θ = 3/4 this is N ( 3 3<br />

2 2<br />

, n). When θ = 2/3 it is N ( , 4 16 3 9n). For α = 0.05, choose k such that P ( ˆ θ < k|θ = 3<br />

�<br />

) = 0.05. That is, we want<br />

4<br />

k−<br />

Φ = 0.05, or<br />

3<br />

√ 4 = −1.6449, giving k =<br />

3/(16n) 3<br />

�<br />

1.6449 3 − 4 4 n .<br />

� k− 3<br />

4<br />

√ 3<br />

16n<br />

For power 0.95, P ( ˆ θ < k|θ = 2)<br />

= 0.95, i.e. Φ<br />

giving k = 2<br />

3<br />

+ 1.6449<br />

3<br />

� 2<br />

n<br />

3<br />

� k− 2<br />

3<br />

√ 2<br />

9n<br />

i=1<br />

�<br />

= 0.95 or<br />

k− 2<br />

3<br />

√ 2<br />

9n<br />

= 1.6449,<br />

. Using this expression for k together with the expression<br />

for k derived in (c) means that we require 3 1.6449<br />

2 1.6449<br />

− = + 4 4 n 3 3 n or<br />

� √ √ �<br />

1 1.6449 = √ 1 1<br />

√<br />

3 + 2 . Thus we get n = 12 × 1.6449 × 0.9044 = 17.8521 and<br />

12 n 4 3<br />

n = 318.7, so we take n = 319.<br />

5. Let Xi denote the number of breakages in the ith chromosome. Then the like-<br />

lihood function is L(λ) = 33 �<br />

i=1<br />

e−λ (1−e−λ λ<br />

)<br />

xi xi!<br />

= e−33λ<br />

(1−e −λ ) 33<br />

� 3<br />

33�<br />

xi λi=3<br />

�33<br />

xi!<br />

i=1<br />

� 2<br />

for λ > 0. Given<br />

that � xi = (11 × 1 + 6 × 2 + · · · 1 × 13 = 122, the log-likelihood function<br />

equals ln[L(λ)] = ℓ(λ) = −33λ − 33 ln(1 − e −λ ) + 122 ln λ − � 33<br />

i=1 ln(xi!). So<br />

S(λ) = dℓ<br />

dλ<br />

33e−λ<br />

= −33 − 1−e−λ + 122<br />

λ<br />

= −33 − 33<br />

e λ −1<br />

+ 122<br />

λ<br />

condition, the MLE ˆ λ satisfies S(λ) = 0, i.e. −33 − 33<br />

e ˆ + λ−1 122<br />

ˆλ<br />

. With the usual regularity<br />

= 0. The informa-<br />

tion function I(λ) = − d2 ℓ<br />

dλ 2 = 122<br />

λ 2 − 33eλ<br />

(e λ −1) 2 . An iterative algorithm for finding ˆ λ<br />

numerically is given by λj+1 = λj + S(λj)/I(λj). An initial estimate λ0 could be<br />

found by plotting ℓ(λ) against λ. Alternatively, it is often satisfactory to use the<br />

estimator for a non-truncated Poisson, which here would be λ0 = 122<br />

33<br />

Using the value ˆ λ = 3.6, P (X = k) = e−3.6<br />

3.6e−3.6 1−e−3.6 = 0.0984<br />

0.9727<br />

1−e −3.6<br />

= 3.70.<br />

3.6k . Calculating P (X = 1) =<br />

k!<br />

= 0.1011 if follows (recursively) that P (X = 2) = 3.6<br />

2<br />

85<br />

P (X = 1) =


0.1820. Similarly P (X = 3) = 0.2184, P (X = 4) = 0.1966, P (X = 5) = 0.1415.<br />

Hence P (X ≥ 6) = 0.1603. [ These probabilities are accurate to four decimal<br />

places, but there is slight rounding error inthe expected frequencies below.]<br />

k 1 2 3 4 5 ≥ 6 TOTAL<br />

observed 11 6 4 5 0 7 33<br />

expected 3.34 6.01 7.21 6.49 4.67 5.28 33.00<br />

Comparing the observed and expected frequencies, the χ 2 test will have 4 degrees<br />

of freedom since λ had to be estimated. The test statistic is<br />

X 2 =<br />

(11 − 3.34)2<br />

3.34<br />

+ (6 − 6.01)2<br />

6.01<br />

+ · · · +<br />

(7 − 5.28)2<br />

5.28<br />

= 24.57 .<br />

This is very highly statistically significant as an observation from χ 2 4, i.e. there is<br />

very strong evidence against the null hypothesis of a truncated Poisson distribu-<br />

tion (the R command qchisq(p=0.95,df=4) returned the critical value 9.488).<br />

<strong>Student</strong> Problems<br />

1. Suppose that we have data x1, x2, . . . , xn which are iid observations from a N(µ, 1)<br />

density where µ is unknown. Consider testing H0 : µ = 0 using the test statistic<br />

T = |¯x|. Suppose we reject H0 if the p-value turns out to be less than α = 0.05.<br />

Show that the test based on T rejects H0 if |¯x| > 1.96<br />

√ n . Calculate the power of<br />

the test for n = 25 and µ = −3, −2, −1, +1, +2, +3. How large would n have to<br />

be in order that the power of the test was equal to 0.95 for µ = +1 ?<br />

2. (a) Let X be a random variable with density given by f(x; θ) = θx θ−1 for 0 ≤<br />

x ≤ 1. Prove that the random variable Y = −2θ log[X] has an exponential<br />

density with mean 2.<br />

(b) Let x1, x2, . . . , xn be iid with density f(x; θ). Consider testing H0 : θ = 1.<br />

Derive (i) the likelihood ratio test statistic, (ii) the maximum likelihood<br />

test statistic, and (iii) the score test statistic. Using the result in (a) explain<br />

how you would calculate exact p-values associated with each of the 3 test<br />

statistics. HINT : The sum of n independent random variables each having<br />

an exponential density with mean 2 has a χ 2 2n density.<br />

(c) Suppose n = 10 and the 10 observed values are 0.17,0.21,0.28,0.30,0.35,0.40,<br />

0.45, 0.58, 0.65, 0.70. Calculate the p-value associated with each of the test<br />

statistics derived in (b).<br />

86


3. Let x1, x2, . . . , xn be iid with density fX(x; θ) = 1 exp [−x/θ] for 0 ≤ x < ∞. Let<br />

θ<br />

y1, y2, . . . , ym be iid with density fY (y; θ, λ) = λ<br />

θ<br />

exp [−λy/θ] for 0 ≤ y < ∞.<br />

(a) Consider testing H0 : λ = 1. Derive (i) the likelihood ratio test statistic,<br />

(ii) the maximum likelihood test statistic, and (iii) the score test statistic.<br />

Explain how you would calculate exact p-values associated with each of the<br />

3 test statistics. Hint : Recall the relationship between an F density and<br />

the ratio of two chi-squareds.<br />

(b) Suppose that n = 10 and the average of the 10 x values is 24.0. Suppose<br />

that m = 40 and the average of the 40 y values is 12.0. Calculate the p-value<br />

associated with each of the test statistics derived in (a).<br />

4. Let x1, x2, . . . , xn be a random sample from a N[µ1, v] density and let y1, y2, . . . , ym<br />

be a random sample from a N[µ2, v] density where µ1, µ2, and v are unknown.<br />

Consider testing H0 : µ1 = µ2. Derive the likelihood ratio test and show � that it is<br />

equivalent to rejecting H0 for large values of |W | where W = ¯x − ¯y/ s2 ( 1 1 + n m )<br />

and s2 = ( �n i=1 (xi − ¯x) 2 + Σm i=1(yi − ¯y) 2 ) /(n + m − 2). What is the distribution<br />

of W ? Suppose n = 21 and m = 11 and you observe W = −2.32. What is the<br />

p-value ?<br />

5. Let x1, x2, . . . , xn be iid observations from a normal density with mean θ and<br />

variance σ 2 . Let y1, y2, . . . , ym be iid observations from a normal density with<br />

mean θ and variance kσ 2 . Assume that both k and σ 2 are known but that θ is<br />

unknown.<br />

(a) Consider testing the null hypothesis H0 : θ = 0. Derive the test statistics<br />

used in the likelihood ratio test, the maximum likelihood test, and the score<br />

test. Show that all three test statistics are increasing functions of U where<br />

U = [k<br />

n�<br />

i=1<br />

xi + Σ m j=1yj] 2 /[k(nk + m)σ 2 ]<br />

and explain how this fact may be used to enable exact calculation of p-values.<br />

(b) Consider a situation in which k = 2 and σ 2 = 10. Suppose that n = 20 and<br />

the sum of the numbers x1, x2, . . . , x20 is 28.0. Suppose that m = 40 and<br />

the sum of the numbers y1, y2, . . . , y40 is 48.0. Calculate the p-value for each<br />

of the tests derived in (a) and comment on the results.<br />

87


Appendix A<br />

Review of Probability<br />

A.1 Expectation and Variance<br />

The expected value E[Y ], of a random variable Y is defined as<br />

if Y is discrete, and<br />

E[Y ] =<br />

∞�<br />

yiP (yi),<br />

i=0<br />

� ∞<br />

E[Y ] = yf(y)dy,<br />

−∞<br />

if Y is continuous, where f(y) is the probability density function. The variance Var[Y ],<br />

of a random variable Y is defined as<br />

or<br />

if Y is discrete, and<br />

Var[Y ] = E � (Y − E[Y ]) 2� ,<br />

Var[Y ] =<br />

∞�<br />

(yi − E[Y ]) 2 P (yi),<br />

i=0<br />

� ∞<br />

Var[Y ] = (y − E[Y ]) 2 f(y)dy,<br />

−∞<br />

if Y is continuous. When there is no ambiguity we often write EY for E[Y ], and VarY<br />

for Var[Y ]. A function of a random variable is itself a random variable. If h(Y ) is a<br />

function of the random variable Y, then the expected value of h(Y ) is given by<br />

E [h(Y )] =<br />

∞�<br />

h(yi)P (yi),<br />

if Y is discrete, and if Y is continuous<br />

� ∞<br />

E [h(Y )] = h(y)f(y)dy.<br />

i=0<br />

−∞<br />

88


It is relatively straightforward to derive the following results for the expectation and<br />

variance of a linear function of Y :<br />

where a and b are constants. Also<br />

E[aY + b] = aE[Y ] + b,<br />

Var[aY + b] = a 2 Var[Y ],<br />

Var[Y ] = E[Y 2 ] − (E[Y ]) 2 . (A.1)<br />

For expectations, it can be shown more generally that<br />

�<br />

k�<br />

�<br />

k�<br />

E aihi(Y ) = aiE [hi(Y )] ,<br />

i=1<br />

where ai, i = 1, 2, . . . , k are constants and hi(Y ), i = 1, 2, . . . , k are functions of the<br />

random variable Y.<br />

A.2 Discrete Random Variables<br />

A.2.1 Bernoulli Distribution<br />

A Bernoulli trial is a probabilistic experiment which can have one of two outcomes,<br />

success (Y = 1) or failure (Y = 0) and in which the probability of success is θ. We refer<br />

to θ as the Bernoulli probability parameter. The value of the random variable Y is<br />

used as an indicator of the outcome, which may also be interpreted as the presence or<br />

absence of a particular characteristic. A Bernoulli random variable Y has probability<br />

mass function<br />

i=1<br />

P (Y = y|θ) = θ y (1 − θ) 1−y , (A.2)<br />

for y = 0, 1 and some θ ∈ (0, 1). The notation Y ∼ Ber(θ) should be read as “the<br />

random variable Y follows a Bernoulli distribution with parameter θ.” A Bernoulli<br />

random variable Y has expected value E[Y ] = 0 × P (Y = 0) + 1 × P (Y = 1) =<br />

0×(1−θ)+1×θ = θ, and variance Var[Y ] = (0−θ) 2 ×(1−θ)+(1−θ) 2 ×θ = θ(1−θ).<br />

A.2.2 Binomial Distribution<br />

Consider independent repetitions of Bernoulli experiments, each with a probability of<br />

success θ. Next consider the random variable Y, defined as the number of successes in<br />

a fixed number of independent Bernoulli trials, n. That is,<br />

Y =<br />

n�<br />

Xi,<br />

i=1<br />

89


where Xi ∼ Bernoulli(θ) for i = 1, . . . , n. Each sequence of length n containing y “ones”<br />

and (n−y) “zeros” occurs with probability θ y (1−θ) n−y . The number of sequences with<br />

y successes, and consequently (n − y) fails, is<br />

n!<br />

y!(n − y)! =<br />

The random variable Y can take on values y = 0, 1, 2, . . . , n with probabilities<br />

� �<br />

n<br />

P (Y = y|n, θ) = θ<br />

y<br />

y (1 − θ) n−y . (A.3)<br />

The notation Y ∼ Bin(n, θ) should be read as “the random variable Y follows a bi-<br />

nomial distribution with parameters n and θ.” Finally using the fact that Y is the<br />

sum of n independent Bernoulli random variables we can calculate the expected value<br />

as E[Y ] = E [ � Xi] = � E[Xi] = � θ = nθ and variance as Var[Y ] = Var [ � Xi] =<br />

� Var[Xi] = � θ(1 − θ) = nθ(1 − θ).<br />

A.2.3 Geometric Distribution<br />

Instead of fixing the number of trials, suppose now that the number of successes, r,<br />

is fixed, and that the sample size required in order to reach this fixed number is the<br />

random variable N. This is sometimes called inverse sampling. In the case of r = 1,<br />

� n<br />

using the independence argument again, leads to<br />

y<br />

�<br />

.<br />

P (N = n|θ) = θ(1 − θ) n−1 , (A.4)<br />

for n = 1, 2, . . . which is the geometric probability function with parameter θ. The dis-<br />

tribution is so named as successive probabilities form a geometric series. The notation<br />

N ∼ Geo(θ) should be read as “the random variable N follows a geometric distribution<br />

with parameter θ.” Write (1 − θ) = q. Then<br />

E[N] =<br />

∞�<br />

nq<br />

n=1<br />

n−1 ∞� d<br />

θ = θ<br />

dq<br />

n=0<br />

(qn ) = θ d<br />

�<br />

∞�<br />

q<br />

dq<br />

n=0<br />

n<br />

=<br />

�<br />

θ d<br />

� �<br />

1 θ 1<br />

= =<br />

dq 1 − q (1 − q) 2 θ .<br />

Also,<br />

E � N 2� =<br />

∞�<br />

n=1<br />

n 2 q n−1 θ = θ<br />

∞� d<br />

dq (nqn ) = θ d<br />

�<br />

∞�<br />

nq<br />

dq<br />

n=1<br />

n<br />

�<br />

�<br />

= θ d � −2<br />

q(1 − q)<br />

dq<br />

�<br />

n=1<br />

= θ d<br />

�<br />

q<br />

dq 1 − q E(N)<br />

�<br />

1 2(1 − θ)<br />

= θ +<br />

θ2 θ3 �<br />

= 2 1<br />

−<br />

θ2 θ .<br />

Using Var[N] = E[N 2 ] − (E[N]) 2 , we get Var[N] = (1 − θ)/θ 2 .<br />

90


A.2.4 Negative Binomial Distribution<br />

Consider the sampling scheme described in the previous section, except now sampling<br />

continues until a total of r successes are observed. Again, let the random variable<br />

N denote number of trial required. If the rth success occurs on the nth trial, then<br />

this implies that a total of (r − 1) successes are observed by the (n − 1)th trial. The<br />

probability of this happening can be calculated using the binomial distribution as<br />

� �<br />

n − 1<br />

θ<br />

r − 1<br />

r−1 (1 − θ) n−r .<br />

The probability that the nth trial is a success is θ. As these two events are independent<br />

we have that<br />

P (N = n|r, θ) =<br />

� �<br />

n − 1<br />

θ<br />

r − 1<br />

r (1 − θ) n−r<br />

(A.5)<br />

for n = r, r + 1, . . . . The notation N ∼ NegBin(r, θ) should be read as “the random<br />

variable N follows a negative binomial distribution with parameters r and θ.” This is<br />

also known as the Pascal distribution.<br />

E � N k� =<br />

∞�<br />

n<br />

n=r<br />

k<br />

� �<br />

n − 1<br />

θ<br />

r − 1<br />

r (1 − θ) n−r<br />

= r<br />

∞�<br />

n<br />

θ<br />

n=r<br />

k−1<br />

� �<br />

n<br />

θ<br />

r<br />

r+1 (1 − θ) n−r<br />

=<br />

� � � �<br />

n − 1 n<br />

since n = r<br />

r − 1 r<br />

r<br />

∞�<br />

(m − 1)<br />

θ<br />

m=r+1<br />

k−1<br />

� �<br />

m − 1<br />

θ<br />

r<br />

r+1 (1 − θ) m−(r+1)<br />

= r<br />

θ E � (X − 1) k−1�<br />

where X ∼ Negative binomial(r + 1, θ). Setting k = 1 we get E(N) = r/θ. Setting<br />

k = 2 gives<br />

Therefore Var[N] = r(1 − θ)/θ 2 .<br />

E � N 2� = r<br />

� �<br />

r r + 1<br />

E(X − 1) = − 1<br />

θ θ θ<br />

A.2.5 Hypergeometric Distribution<br />

The hypergeometric distribution is used to describe sampling without replacement.<br />

Consider an urn containing b balls, of which w are white and b − w are red. We intend<br />

to draw a sample of size n from the urn. Let Y denote the number of white balls<br />

selected. Then, for y = 0, 1, 2, . . . , n we have<br />

P (Y = y|b, w, n) =<br />

� w<br />

y<br />

91<br />

�� b − w<br />

n − y<br />

�<br />

/<br />

.<br />

� �<br />

b<br />

. (A.6)<br />

n


The expected value of the jth moment of a hypergeometric random variable is<br />

The identities y � w<br />

y<br />

E[Y ] =<br />

n�<br />

y j P (Y = y) =<br />

y=0<br />

� = w � w−1<br />

y−1<br />

n�<br />

y j<br />

� �� � � �<br />

w b − w b<br />

/ .<br />

y n − y n<br />

y=1<br />

� � � � � b b−1<br />

and n = b can be used to obtain<br />

n n−1<br />

E � Y j� = nw<br />

n�<br />

y<br />

b<br />

y=1<br />

j−1<br />

� �� � � �<br />

w − 1 b − w b − 1<br />

/ ,<br />

y − 1 n − 1 n − 1<br />

= nw �n−1<br />

(x + 1)<br />

b<br />

x=0<br />

j−1<br />

� �� � � �<br />

w − 1 b − w b − 1<br />

/ ,<br />

x n − 1 − x n − 1<br />

= nw<br />

b E � (X + 1) j−1�<br />

where X is a hypergeometric random variable with parameters n − 1, b − 1, w − 1. It is<br />

easy to establish that E[Y ] = nθ and Var[Y ] = b−nnθ(1<br />

− θ), where θ = w/b denotes<br />

b−1<br />

the fraction of white balls in the population.<br />

A.2.6 Poisson Distribution<br />

Certain problems involve counting the number of events that have occurred in a fixed<br />

time period. A random variable Y, taking on one of the values 0, 1, 2, . . . , is said to be<br />

a Poisson random variable with parameter θ if for some θ > 0,<br />

P (Y = y|θ) = θy<br />

y! e−θ , y = 0, 1, 2, . . . (A.7)<br />

The notation Y ∼ Pois(θ) should be read as “the random variable Y follows a Poisson<br />

distribution with parameter θ.” Equation (A.7) defines a probability mass function,<br />

since<br />

∞�<br />

y=0<br />

θ y<br />

y! e−θ = e −θ<br />

∞�<br />

y=0<br />

The expected value of a Poisson random variable is<br />

E[Y ] =<br />

∞�<br />

y=0<br />

ye −θ θ y<br />

y!<br />

= θe −θ<br />

∞�<br />

y=1<br />

<strong>To</strong> get the variance we first computer E [Y 2 ] .<br />

E � Y 2� =<br />

∞�<br />

y=0<br />

y 2 e −θ θ y<br />

y!<br />

= θ<br />

∞�<br />

y=1<br />

ye −θ θ y−1<br />

(y − 1)!<br />

θ y<br />

y! = e−θ e θ = 1.<br />

θ y−1<br />

(y − 1)!<br />

= θ<br />

∞�<br />

j=0<br />

= θe−θ<br />

∞�<br />

j=0<br />

(j + 1)e −θ θ j<br />

j!<br />

θ j<br />

j!<br />

= θ .<br />

= θ(θ + 1) .<br />

Since we already have E[Y ] = θ, we obtain Var[Y ] = E [Y 2 ] − (E[Y ]) 2 = θ.<br />

92


Suppose that Y ∼ Binomial(n, π), and let θ = nπ. Then<br />

P (Y = y|n, π) =<br />

� �<br />

n<br />

π<br />

y<br />

y (1 − π) n−y<br />

=<br />

� � � �y �<br />

n θ<br />

1 −<br />

y n<br />

θ<br />

�n−y n<br />

= n(n − 1) · · · (n − y + 1)<br />

ny θy (1 − θ/n)<br />

y!<br />

n<br />

.<br />

(1 − θ/n) y<br />

For n large and θ “moderate”, we have that<br />

�<br />

1 − θ<br />

n<br />

� n<br />

≈ e −θ ,<br />

n(n − 1) · · · (n − y + 1)<br />

n y<br />

≈ 1 ,<br />

�<br />

1 − θ<br />

�y ≈ 1.<br />

n<br />

Our result is that a binomial random variable Bin(n, π) is well approximated by a<br />

poisson random variable Pois(θ = n × π) when n is large and π is small. That is<br />

P (Y = y|n, π) ≈ e−nπ (nπ) y<br />

A.2.7 Discrete Uniform Distribution<br />

The discrete uniform distribution with integer parameter N has a random variable Y<br />

that can take the vales y = 1, 2, . . . , N with equal probability 1/N. It is easy to show<br />

that the mean and variance of Y are E[Y ] = (N + 1)/2, and Var[Y ] = (N 2 − 1)/12.<br />

A.2.8 The Multinomial Distribution<br />

Suppose that we perform n independent and identical experiments, where each ex-<br />

periment can result in any one of r possible outcomes, with respective probabilities<br />

p1, p2, . . . , pr, � r<br />

i=1 pi = 1. If we denote by Yi, the number of the n experiments that<br />

result in outcome number i, then<br />

P (Y1 = n1, Y2 = n2, . . . , Yr = nr) =<br />

y!<br />

.<br />

n!<br />

n1!n2! · · · nr! pn1 1 p n2<br />

2 · · · p nr<br />

r<br />

(A.8)<br />

where � r<br />

i=1 ni = n. Equation (A.8) is justified by noting that any sequence of outcomes<br />

that leads to outcome i occurring ni times for i = 1, 2, . . . , r, will, by the assumption<br />

of independence of experiments, have probability p n1<br />

1 p n2<br />

2 · · · p nr<br />

r of occurring. As there<br />

are n!/(n1!n2! · · · nr!) such sequence of outcomes equation (A.8) is established.<br />

93


A.3 Continuous Random Variables<br />

A.3.1 Uniform Distribution<br />

A random variable Y is said to be uniformly distributed over the interval (a, b) if its<br />

probability density function is given by<br />

f(y|a, b) = 1<br />

, if a < y < b<br />

b − a<br />

and equals 0 for all other values of y. Since F (u) = � u<br />

f(y)dy, the distribution<br />

−∞<br />

function of a uniform random variable on the interval (a, b) is<br />

⎧<br />

⎪⎨ 0 u ≤ a<br />

F (u) =<br />

⎪⎩<br />

u−a<br />

b−a<br />

1<br />

a < u < b<br />

u ≥ b<br />

The expected value of a uniform random turns out to be the mid-point of the interval,<br />

that is<br />

� ∞<br />

E[Y ] = yf(y)dy =<br />

−∞<br />

The second moment is calculated as<br />

hence the variance is<br />

E � Y 2� =<br />

� b<br />

a<br />

� b<br />

a<br />

y 2<br />

b − a dy = b3 − a 3<br />

3(b − a)<br />

y<br />

b − a dy = b2 − a 2<br />

2(b − a)<br />

= b + a<br />

2 .<br />

= 1<br />

3 (b2 + ab + a 2 ),<br />

Var[Y ] = E � Y 2� − (E[Y ]) 2 = 1<br />

12 (b − a)2 .<br />

The notation Y ∼ U(a, b) should be read as “the random variable Y follows a uniform<br />

distribution on the interval (a, b)”.<br />

A.3.2 Exponential Distribution<br />

A random variable Y is said to be an exponential random variable if its probability<br />

density function is given by<br />

f(y|θ) = θe −θy , 0 ≤ y < ∞, θ > 0.<br />

The cumulative distribution of an exponential random variable is given by<br />

F (a) =<br />

The expected value<br />

� a<br />

0<br />

θe −θy dy = −e −θy dy = −e −θy� � a<br />

0 = 1 − e−θa . a ≥ 0.<br />

� ∞<br />

E[Y ] = yθe −θy dy<br />

0<br />

94


equires integration by parts, yielding<br />

� ∞<br />

E[Y ] = −ye −θy� � ∞<br />

0 +<br />

0<br />

e −θy dy = 0 − e−θy<br />

θ<br />

�<br />

�<br />

�<br />

�<br />

∞<br />

0<br />

= 1<br />

θ .<br />

Integration by parts can be used to verify that E [Y 2 ] = 2θ −2 . Hence<br />

Var[Y ] = 1<br />

.<br />

θ2 The notation Y ∼ Exp(θ) should be read as “the random variable Y follows an expo-<br />

nential distribution with parameter θ”.<br />

A.3.3 Gamma Distribution<br />

A random variable Y is said to have a gamma distribution if its density function is<br />

given by<br />

f(y|ω, θ) = θe−θy (θy) ω−1<br />

,<br />

Γ(ω)<br />

0 ≤ y < ∞<br />

and ω, θ > 0<br />

where Γ(ω), is called the gamma function and is defined by<br />

� ∞<br />

Γ(ω) = e −u u ω−1 du.<br />

The integration by parts of Γ(ω) yields the recursive relationship<br />

Γ(ω) = −e −u u ω−1� �∞ 0 +<br />

� ∞<br />

e<br />

0<br />

−u (ω − 1)u ω−2 du<br />

� ∞<br />

= (ω − 1) e −u u ω−2 du = (ω − 1)Γ(ω − 1).<br />

0<br />

0<br />

For integer values of ω, say ω = n, this recursive relationship reduces to Γ(n) = n!.<br />

Note, by setting ω = 1 the gamma distribution reduces to an exponential distribution.<br />

The expected value of a gamma random variable is given by<br />

E[Y ] = θω<br />

� ∞<br />

y<br />

Γ(ω)<br />

ω e −θy θ<br />

dy =<br />

ω<br />

Γ(ω)θω+1 � ∞<br />

after the change of variable u = θy. Hence<br />

Using the same substitution<br />

0<br />

E[Y ] =<br />

E � Y 2� = θω<br />

Γ(ω)<br />

� ∞<br />

0<br />

Γ(ω + 1)<br />

θΓ(ω)<br />

= ω<br />

θ .<br />

y ω+1 e −θy dy =<br />

0<br />

u ω e −θu du,<br />

(ω + 1)ω<br />

θ2 ,<br />

so that<br />

Var[Y ] = ω<br />

.<br />

θ2 The notation Y ∼ Gamma(ω, θ) should be read as “the random variable Y follows a<br />

gamma distribution with parameters ω and θ”.<br />

95


A.3.4 Gaussian Distribution<br />

A random variable Y is a normal (or Gaussian) random variable, or simply normally<br />

distributed, with parameters µ and σ 2 if the density of Y is specified by<br />

f(y|µ, σ 2 ) = 1<br />

√ 2πσ e −(y−µ)2 /2σ 2<br />

, (A.9)<br />

for −∞ < y < ∞ with −∞ < µ < ∞ and σ > 0. It is not immediately obvious that<br />

(A.9) specifies a probability density. <strong>To</strong> show that this is the case we need to prove<br />

� ∞<br />

1<br />

√ e<br />

2πσ<br />

−(y−µ)2 /2σ2 dy = 1.<br />

−∞<br />

Substituting z = (y − µ)/σ, we need to show that I = � ∞<br />

−∞ e−z2 /2 dz = √ 2π. This is a<br />

“classic” results and so is well worth confirming. Form<br />

I 2 � ∞<br />

= e −z2 � ∞<br />

/2<br />

dz e −w2 � ∞ � ∞<br />

/2<br />

dw =<br />

−∞<br />

−∞<br />

−∞<br />

−∞<br />

e − z2 +w 2<br />

2 dzdw.<br />

The double integral can be evaluated by a change of variables to polar coordinates.<br />

Substituting z = r cos θ, w = r sin θ, and dzdw = rdθdr, we get<br />

I 2 =<br />

� ∞ � π<br />

e<br />

0 0<br />

−r2 /2<br />

rdθdr<br />

=<br />

� ∞<br />

2π re<br />

0<br />

−r2 /2<br />

dr<br />

= −2πe −r2 �<br />

/2�<br />

� ∞<br />

= 2π<br />

Taking the square root we get I = √ 2π. The result I = √ 2π can also be used to<br />

establish the result Γ(1/2) = π1/2 . <strong>To</strong> prove that this is the case note that<br />

Γ( 1<br />

� ∞<br />

) = e<br />

2 −u u −1/2 � ∞<br />

du = 2 e −z2<br />

dz = √ π.<br />

The expected value of Y equals<br />

0<br />

E[Y ] = 1<br />

√ 2πσ<br />

Writing y as (y − µ) + µ = z + µ yields<br />

E[Y ] = 1<br />

� ∞<br />

√<br />

2πσ<br />

� ∞<br />

−∞<br />

0<br />

0<br />

ye −(y−µ)/2σ2<br />

dy<br />

ze<br />

−∞<br />

−z2 /2<br />

dz + µ<br />

� ∞<br />

f(y)dy,<br />

where f(y) represents the normal density from equation (A.9). Using the argument of<br />

symmetry, the first integral must be 0. Then<br />

� ∞<br />

E[Y ] = µ f(y)dy = µ.<br />

−∞<br />

96<br />

−∞


Since E[Y ] = µ, we have that<br />

Var[Y ] = 1<br />

√ 2πσ<br />

� ∞<br />

(y − µ) 2 e −(y−µ)2 /2σ2 dy.<br />

−∞<br />

Using the substitution z = (y − µ)/σ yields<br />

Var[Y ] = σ2<br />

=<br />

=<br />

� ∞<br />

√<br />

2π −∞<br />

�<br />

2 1<br />

σ √<br />

2π<br />

� ∞<br />

2 1<br />

σ √<br />

2π<br />

= σ 2<br />

z 2 e −z2 /2 dz<br />

−ze −z2 /2<br />

�<br />

�<br />

� ∞<br />

−∞<br />

e<br />

−∞<br />

−z2 /2<br />

dz<br />

� ∞<br />

+<br />

−∞<br />

e −z2 �<br />

/2<br />

dz<br />

The notation Y ∼ N(µ, σ 2 ) should be read as “the random variable Y follows a<br />

normal distribution with mean parameter µ and variance parameter σ 2 .”<br />

Suppose Y ∼ N(µ, σ2 ) and Z = a + bY where a and b are known constants. Then<br />

� � � �<br />

c − a c − a<br />

P (a + bY ≤ c) = P Y ≤ = FY ,<br />

b<br />

b<br />

where FY (c) = � c<br />

−∞ fY (y)dy is the cumulative distribution function of the random<br />

variable Y. Then<br />

FY<br />

� �<br />

c − a<br />

b<br />

=<br />

=<br />

� (c−a)/b<br />

−∞<br />

� b<br />

−∞<br />

1<br />

√ 2πσ e −(y−µ)2 /2σ 2<br />

dy<br />

1<br />

√ 2πbσ exp<br />

� −(z − (bµ + a)) 2<br />

2b 2 σ 2<br />

�<br />

dz<br />

Since FZ(c) = � c<br />

−∞ fZ(y)dz, it follows that the probability density function of Z, is<br />

given by<br />

Hence Z ∼ N(bµ + a, (bσ) 2 ).<br />

1<br />

√ 2πbσ exp<br />

� −(z − (bµ + a)) 2<br />

2b 2 σ 2<br />

An important consequence of the preceding result is that if Y ∼ N(µ, σ 2 ), then<br />

Z = (Y − µ)/σ is normally distributed with mean 0 and variance 1. Such a random<br />

variable is said to have the standard normal distribution.<br />

It is tradition to denote the cumulative distribution function of a standard normal<br />

random variable by Φ(z). That is,<br />

Φ(z) = 1<br />

√ 2π<br />

� z<br />

e<br />

−∞<br />

−u2 /2<br />

du.<br />

Values of (1 − Φ(z)) are tabulated in Table 3 of Murdoch and Barnes for z > 0.<br />

97<br />

�<br />

.


A.3.5 Weibull Distribution<br />

The Weibull distribution function has the form<br />

� �<br />

y<br />

�a� F (y) = 1 − exp −<br />

b<br />

if y > 0<br />

and equals 0 for y ≤ 0. The Weibull density can be obtained by differentiation as<br />

�<br />

a<br />

� �<br />

y<br />

�a−1 � �<br />

y<br />

�a� f(y) =<br />

exp − for y > 0<br />

b b<br />

b<br />

and equals 0 for y ≤ 0. <strong>To</strong> calculate the expected value<br />

� ∞ � �a 1<br />

E[Y ] = ya y<br />

b<br />

a−1 � �<br />

y<br />

�a� exp −<br />

b<br />

0<br />

we use the substitutions u = (y/b) a , and du = ab −a y a−1 dy. These yield<br />

� ∞<br />

E[Y ] = b<br />

0<br />

u 1/a e −u du = bΓ<br />

� a + 1<br />

It is straightforward to verify that<br />

E[Y 2 ] = b 2 Var[Y ] =<br />

� �<br />

a + 2<br />

Γ , and<br />

a<br />

b 2<br />

� � � � � �� �<br />

2<br />

a + 2 a + 1<br />

Γ − Γ<br />

.<br />

a<br />

a<br />

A.3.6 Beta Distribution<br />

A random variable is said to have a beta distribution if its density is given by<br />

f(y) =<br />

1<br />

B(a, b) ya−1 (1 − y) b−1<br />

and is 0 everywhere else. Here the function<br />

B(a, b) =<br />

� 1<br />

0<br />

a<br />

u a−1 (1 − u) b−1 du<br />

�<br />

.<br />

0 < y < 1,<br />

is the “beta” function, and is related to the gamma function through<br />

B(a, b) = Γ(a)Γ(b)<br />

Γ(a + b) .<br />

Proceeding in the usual manner, we can show that<br />

E[Y ] =<br />

Var[Y ] =<br />

a<br />

a + b<br />

ab<br />

(a + b) 2 (a + b + 1) .<br />

98


A.3.7 Chi-square Distribution<br />

Let Z ∼ N(0, 1), and let Y = Z 2 . Then<br />

so that<br />

fY (y) =<br />

FY (y) = P (Y ≤ y)<br />

= 1 1<br />

e− 2<br />

2 y<br />

= P (Z 2 ≤ y)<br />

= P (− √ y ≤ Z ≤ √ y)<br />

= FZ ( √ y) − FZ (− √ y)<br />

1<br />

2 √ y [fz( √ y) + fz(− √ y)]<br />

�<br />

1<br />

2 y<br />

� 1<br />

− � �<br />

2 1<br />

1 1<br />

√π ≡ Gamma ,<br />

2 2<br />

Suppose that Y = �n i=1 Z2 i , where the Zi ∼ N(0, 1) for i = 1, . . . , n and are<br />

independent. From results on the sum of independent Gamma random variables (see<br />

�<br />

. This density has the form<br />

section A.4.1), Y ∼ Gamma � n<br />

2<br />

, 1<br />

2<br />

fY (y) = e−y/2yn/2−1 2n/2Γ � � , y > 0 (A.10)<br />

n<br />

2<br />

and is referred to as a chi-squared distribution on n degrees of freedom. The notation<br />

Y ∼ Chi(n) should be read as “the random variable Y follows a chi-squared distribution<br />

with n degrees of freedom”. Later we will show that if X ∼ Chi(u) and Y ∼ Chi(v), it<br />

follows that X + Y ∼ Chi(u + v).<br />

A.3.8 Distribution of a Function of a Random Variable<br />

Let Y be a continuous random variable with probability density function fY . Suppose<br />

that g(y) is a strictly monotone (increasing or decreasing) differentiable (and hence<br />

continuous) function of y. The random variable Z defined by Z = g(Y ) has probability<br />

density function given by<br />

� � −1<br />

fZ(z) = fY g (z) � �<br />

��<br />

d<br />

dz g−1 �<br />

�<br />

(z) �<br />

� if z = g(y) (A.11)<br />

where g −1 (z) is defined to be the inverse function of g(y).<br />

Proof. Let g(y) be a monotone increasing function and let FY (y) and FZ(z) denote the<br />

probability distribution functions of the random variables Y and Z. Then<br />

FZ(z) = P (Z ≤ z) = P (g(Y ) ≤ z) = P � Y ≤ g −1 (z) � � � −1<br />

= FY g (z)<br />

99


Next, let g(y) be a monotone decreasing function. Then<br />

FZ(z) = P (Z ≤ z) = P (g(Y ) ≤ z) = P � Y ≥ g −1 (z) �<br />

= 1 − P � Y ≤ g −1 (z) � � � −1<br />

= 1 − FY g (z)<br />

For a monotone increasing function we have, by the chain rule,<br />

fZ(z) = d<br />

dz FZ(z)<br />

= d<br />

dz FY<br />

=<br />

= fY<br />

� g −1 (z) �<br />

d<br />

d(g−1 (z)) FY<br />

� � −1 dg<br />

g (z) −1 (z)<br />

dz<br />

� �dg −1<br />

g (z) −1 (z)<br />

dz<br />

For a monotone decreasing function we have, by the chain rule,<br />

fZ(z) = d<br />

dz FZ(z)<br />

= − d<br />

dz FY<br />

= −<br />

= fY<br />

� g −1 (z) �<br />

d<br />

d(g−1 (z)) FY<br />

� � −1 dg<br />

g (z) −1 (z)<br />

dz<br />

� � −1<br />

g (z) �<br />

− dg−1 �<br />

(z)<br />

dz<br />

Equation (A.11) covers the cases of both monotonic increasing and monotonic decreas-<br />

ing functions.<br />

Example A.1 (The Chi-Squared Distribution). Suppose Y ∼ N(0, 1) and g(y) = y 2 .<br />

Then g −1 (z) = z 1/2 and by equation (A.11) we get<br />

� 1/2<br />

fZ(z) = fY z � � �<br />

��<br />

d<br />

dz z1/2<br />

�<br />

�<br />

�<br />

�<br />

=<br />

1<br />

√ 2π e<br />

1<br />

− z 1<br />

2<br />

2 z−1/2<br />

� �<br />

1 1<br />

≡ Gamma ,<br />

2 2<br />

which was our main result from the previous section. �<br />

Example A.2 (The Log-Normal Distribution). Suppose Y ∼ N(µ, σ 2 ) and g(y) = e y .<br />

Then g−1 (z) = ln z and by equation (A.11) we get<br />

fZ(z) =<br />

=<br />

� �<br />

�<br />

fY (ln z) �<br />

d �<br />

� ln z�<br />

dz �<br />

1<br />

zσ √ 2π exp<br />

�<br />

− (ln{z/m})2<br />

2σ2 �<br />

where µ = ln m. �<br />

100


A.4 Random Vectors<br />

A.4.1 Sums of Independent Random Variables<br />

When X and Y are discrete random variables, the condition of independence is equiva-<br />

lent to pX,Y (x, y) = pX(x)pY (y) for all x, y. In the jointly continuous case the condition<br />

of independence is equivalent to fX,Y (x, y) = fX(x)fY (y) for all x, y.<br />

Consider random variables X and Y with probability densities fX(x) and fY (y)<br />

respectively. We seek the probability density of the random variable X + Y. Our<br />

general result follows from<br />

FX+Y (a) =<br />

=<br />

P (X + Y ≤ a)<br />

� �<br />

=<br />

=<br />

� ∞<br />

fX(x)fY (y)dxdy<br />

X+Y ≤a<br />

� a−y<br />

−∞ −∞<br />

� a−y<br />

� ∞<br />

−∞<br />

−∞<br />

fX(x)fY (y)dxdy<br />

fX(x)dxfY (y)dy<br />

� ∞<br />

= FX(a − y)fY (y)dy<br />

⇒ fX+Y (a) =<br />

−∞<br />

d<br />

=<br />

� ∞<br />

FX(a − y)fY (y)dy<br />

dx −∞<br />

� ∞<br />

d<br />

−∞ dx FX(a − y)fY (y)dy<br />

� ∞<br />

= fX(a − y)fY (y)dy (A.12)<br />

−∞<br />

The density function fX+Y is called the convolution of the densities fX and fY . If<br />

the random variables X and Y are discrete the equivalent result is<br />

fX+Y (a) = �<br />

fX(a − y)fY (y).<br />

Result: Sum of Independent Poisson Random Variables<br />

a∈R<br />

Suppose X ∼ Pois(θ) and Y ∼ Pois(λ). Assume that X and Y are independent. Then<br />

P (X + Y = n) =<br />

=<br />

=<br />

n�<br />

P (X = k, Y = n − k)<br />

k=0<br />

n�<br />

P (X = k) P (Y = n − k)<br />

k=0<br />

n�<br />

e<br />

k=0<br />

101<br />

−θ θk<br />

k!<br />

e−λ λn−k<br />

(n − k)!


That is, X + Y ∼ Pois(θ + λ).<br />

= e−(θ+λ)<br />

n!<br />

= e−(θ+λ)<br />

n!<br />

n�<br />

k=0<br />

n!<br />

k!(n − k)! θk λ n−k<br />

(θ + λ) n<br />

Result: Sum of Independent Binomial Random Variables<br />

We seek the distribution of Y + X, where Y ∼ Bin(n, θ) and X ∼ Bin(m, θ). Then<br />

X + Y is modelling the situation where the total number of trials is fixed at n + m,<br />

and the probability of a success in a single trial equals θ. Without performing any<br />

calculations, we expect to find that X + Y ∼ Bin(n + m, θ). <strong>To</strong> verify that this result<br />

is true,<br />

P (X + Y = k) =<br />

=<br />

n�<br />

P (X = i, Y = k − i)<br />

i=0<br />

n�<br />

P (X = i) P (Y = k − i)<br />

i=0<br />

n�<br />

� �<br />

n<br />

=<br />

θ<br />

i<br />

i=0<br />

i (1 − θ) n−i<br />

� �<br />

m<br />

θ<br />

k − i<br />

k−i (1 − θ) m−k+i<br />

= θ k (1 − θ) n+m−k<br />

n�<br />

� � � �<br />

n m<br />

i k − i<br />

and the result follows by applying the combinatorial identity<br />

� �<br />

n + m<br />

=<br />

k<br />

i=0<br />

n�<br />

� � �<br />

n m<br />

i=0<br />

i<br />

k − i<br />

Result: Sum of Independent Gamma Random Variables<br />

Let X ∼ Gamma(γ, θ) and Y ∼ Gamma(ω, θ). Then<br />

fX+Y (a) = (Γ(γ)Γ(ω)) −1<br />

� a<br />

θe<br />

0<br />

−θ(a−y) (θ(a − y)) γ−1 θγ −θy (θy) ω−1 dy<br />

� a<br />

= Ke −θa<br />

0<br />

(a − y) γ−1 y ω−1 dy<br />

�<br />

.<br />

where K is a constant. Let u = y/a so that dy = adu. Then<br />

fX+Y (a) = Ke −θa a γ+ω−1<br />

� 1<br />

= Ce −θa a γ+ω−1 ,<br />

102<br />

0<br />

(1 − u) γ−1 u ω−1 du


where C is a constant not depending on a. fX+Y (a) is a density function and so must<br />

integrate to 1.<br />

⇒ fX+Y (a) = θe−θa (θa) γ+ω−1<br />

.<br />

Γ(γ + ω)<br />

But this is the pdf of a Gamma random variable distributed as Gamma(γ + ω, θ). The<br />

result X + Y ∼ Chi(u + v) when X ∼ Chi(u) and Y ∼ Chi(v), follows as a corollary.<br />

Result: Sum of Independent Exponential Random Variables<br />

Let Y1, . . . , Yn be n independent exponential random variables each with parameter<br />

θ. Then Z = Y1 + Y2 + · · · + Yn is a Gamma(n, θ) random variable. <strong>To</strong> see that<br />

this is indeed the case, write Yi ∼ Exp(θ), or alternatively, Yi ∼ Gamma(1, θ). Then<br />

Yi + Yj ∼ Gamma(2, θ), implying that<br />

n�<br />

Yi = Z ∼ Gamma (n, θ) .<br />

i=1<br />

Result: Sum of Independent Gaussian Random Variables<br />

Let X ∼ N(µX, σ2 X ) and Y ∼ N(µY , σ2 Y ). Then<br />

fX+Y (a) = (2πσXσY ) −1<br />

� ∞ �<br />

−(a − y − µX)<br />

exp<br />

2 � �<br />

−(y − µY )<br />

exp<br />

2 �<br />

dy<br />

−∞<br />

−∞<br />

2σ 2 X<br />

2σ 2 X<br />

2σ 2 Y<br />

setting z = y − µY<br />

= (2πσXσY ) −1<br />

� ∞ �<br />

−(a − z − µX − µY )<br />

exp<br />

2 � � � 2 −z<br />

exp dz<br />

2σ 2 Y<br />

=<br />

andlet m = a − µX − µY<br />

(2πσXσY ) −1<br />

� ∞ � 2 −(m − z)<br />

exp<br />

−∞ 2σ2 � � 2 −z<br />

exp<br />

X 2σ2 =<br />

�<br />

dz<br />

Y<br />

(2πσXσY ) −1 e −m2 /2σ2 � ∞ �<br />

2mz<br />

X exp<br />

−∞ 2σ2 −<br />

X<br />

z2<br />

2σ2 −<br />

X<br />

z2<br />

2σ2 =<br />

�<br />

dz<br />

Y<br />

(2πσXσY ) −1 e βm<br />

� ∞<br />

e<br />

−∞<br />

−(αz2 +2βz) dz<br />

where α = 1<br />

�<br />

1<br />

2 σ2 +<br />

X<br />

1<br />

σ2 �<br />

and<br />

Y<br />

β = − m<br />

2σ2 = (2πσXσY )<br />

X<br />

−1 e βm e β2 � ∞<br />

/α<br />

e −α(z+β/α)2<br />

dz<br />

−∞<br />

setting u = z + β/α<br />

� ∞<br />

e −αu2<br />

du<br />

= (2πσXσY ) −1 e βm e β2 /α<br />

−∞<br />

= (2πσXσY ) −1 e βm+β2 /α � π/α<br />

103


since � ∞<br />

−∞ e−αu2<br />

du = � π/a. Some algebra will confirm that<br />

so that<br />

and<br />

βm + β2<br />

α =<br />

�<br />

π/α<br />

=<br />

2πσXσY<br />

−m 2<br />

2(σ 2 X + σ2 Y ),<br />

1<br />

� 2π (σ 2 X + σ 2 Y )<br />

fX+Y (a) = � 2π(σ 2 X + σ 2 Y ) �− 1<br />

�<br />

− (a − µX − µY ) 2 exp<br />

2<br />

�<br />

or equivalently X + Y ∼ N (µX + µY , σ2 X + σ2 Y ) .<br />

A.4.2 Covariance and Correlation<br />

2 (σ 2 X + σ2 Y )<br />

Suppose that X and Y are real-valued random variables for some random experiment.<br />

The covariance of X and Y is defined by<br />

Cov (X, Y ) = E [(X − E(X))(Y − E[Y ])]<br />

and (assuming the variances are positive) the correlation of X and Y is defined by<br />

ρ (X, Y ) ≡ Corr (X, Y ) =<br />

Cov (X, Y )<br />

� Var(X)Var[Y ]<br />

Note that the covariance and correlation always have the same sign (positive, nega-<br />

tive, or 0). When the sign is positive, the variables are said to be positively correlated;<br />

when the sign is negative, the variables are said to be negatively correlated; and when<br />

the sign is 0, the variables are said to be uncorrelated. For an intuitive understanding<br />

of correlation, suppose that we run the experiment a large number of times and that<br />

for each run, we plot the values (X, Y ) in a scatterplot. The scatterplot for positively<br />

correlated variables shows a linear trend with positive slope, while the scatterplot for<br />

negatively correlated variables shows a linear trend with negative slope. For uncorre-<br />

lated variables, the scatterplot should look like an amorphous blob of points with no<br />

discernible linear trend. You should satisfy yourself that the following are true<br />

1. Cov (X, Y ) = E (XY ) − E (X) E (Y )<br />

2. Cov (X, Y ) = Cov (Y, X)<br />

3. Cov (Y, Y ) = Var (Y )<br />

4. Cov (aX + bY, Z) = aCov (X, Z) + bCov (Y, Z)<br />

� � �n<br />

5. Var = �n i,j=1 Cov (Yi, Yj)<br />

j=1 Yi<br />

6. If X and Y are independent, then they are uncorrelated. The converse is not<br />

true however.<br />

104


A.4.3 The Bivariate Change of Variables Formula<br />

Suppose that (X, Y ) is a continuous random variable taking values in a subset S of R 2<br />

with probability density function f. Suppose that U and V are new random variables<br />

that are functions of X and Y :<br />

U ≡ U(X, Y ), V ≡ V (X, Y ).<br />

If these functions are “well behaved”, there is a simple way to get the joint probability<br />

density function g of (U, V ). First, we will assume that the transformation (x, y) →<br />

(u, v) is one-to-one and maps S onto a subset T of R 2 . Thus, the inverse transformation<br />

(u, v) → (x, y) is well defined and maps T onto S. We will assume that the inverse<br />

transformation is “smooth”, in the sense that the partial derivatives<br />

exist on T , and the Jacobian<br />

is nonzero on T .<br />

∂(x, y)<br />

∂(u, v) ≡<br />

�<br />

�<br />

�<br />

�<br />

∂x<br />

∂u<br />

∂x<br />

∂u<br />

∂y<br />

∂u<br />

, ∂x<br />

∂v<br />

∂x<br />

∂v<br />

∂y<br />

∂v<br />

∂y ∂y<br />

, ,<br />

∂u ∂v ,<br />

�<br />

�<br />

�<br />

∂x ∂y<br />

� =<br />

∂u ∂v<br />

∂x ∂y<br />

−<br />

∂v ∂u<br />

Now, let B be an arbitrary subset of T . The inverse transformation maps B onto a<br />

subset A of S. Therefore,<br />

� �<br />

P ((U, V ) ∈ B) = P ((X, Y ) ∈ A) =<br />

A<br />

f(x, y)dxdy<br />

But, by the change of variables formula for double integrals, this can be written as<br />

� �<br />

� �<br />

�<br />

P ((U, V ) ∈ B) = f (x(u, v), y(u, y)) �<br />

∂(x, y) �<br />

�<br />

�∂(u,<br />

v) � dudv<br />

B<br />

By the very meaning of density, it follows that the probability density function of (U, V )<br />

is<br />

� �<br />

�<br />

g(u, v) = f (x(u, v), y(u, v)) �<br />

∂(x, y) �<br />

�<br />

�∂(u,<br />

v) � , (u, v) ∈ T .<br />

By a symmetric argument,<br />

� �<br />

�<br />

f(x, y) = g (u(x, y), v(x, y)) �<br />

∂(u, v) �<br />

�<br />

�∂(x,<br />

y) � , (x, y) ∈ S.<br />

The change of variables formula generalizes to R n .<br />

105


A.4.4 The Bivariate Normal Distribution<br />

Suppose that U and V are independent random variables each, with the standard<br />

normal distribution. We will need the following parameters:<br />

µX, µY ∈ (−∞, ∞); σX, σY ∈ (0, ∞); ρ ∈ [−1, +1].<br />

Now let X and Y be new random variables defined by<br />

X = µX + σXU<br />

�<br />

V = µY + ρσY U + σY 1 − ρ2V Using basic properties of mean, variance, covariance, and the normal distribution,<br />

satisfy yourself of the following:<br />

1. X is normally distributed with mean µX and standard deviation σX<br />

2. Y is normally distributed with mean µY and standard deviation σY<br />

3. Corr (X, Y ) = ρ<br />

4. X and Y are independent if and only if ρ = 0.<br />

The inverse transformation is<br />

u =<br />

v =<br />

x − µX<br />

σY<br />

σX<br />

y − µY<br />

�<br />

1 − ρ2 so that the Jacobian of the transformation is<br />

∂(x, y)<br />

∂(u, v) =<br />

σXσY<br />

ρ(x − µX)<br />

− �<br />

σX 1 − ρ2 1<br />

�<br />

1 − ρ2 .<br />

Since U and V are independent standard normal variables, their joint probability den-<br />

sity function is<br />

g(u, v) = 1 1<br />

e− 2(u<br />

2π 2 +v2 ) 2<br />

, (u, v) ∈ R<br />

Using the bivariate change of variables formula , the joint density of (X, Y ) is<br />

1<br />

f(x, y) = �<br />

2πσXσY 1 − ρ2 exp<br />

�<br />

− 1<br />

�<br />

(x − µX)<br />

2<br />

2<br />

σ2 X (1 − ρ2 ) − 2ρ(x − µX)(y − µY )<br />

σXσY (1 − ρ2 +<br />

)<br />

(y − µY ) 2<br />

σ2 Y (1 − ρ2 ��<br />

)<br />

If c is a constant, the set of points (x, y) ∈ R 2 : f(x, y) = c is called a level curve of f<br />

(these are points of constant probability density).<br />

106


A.4.5 Bivariate Normal Conditional Distributions<br />

In the last section we derived the joint probability density function f of the bivariate<br />

normal random variables X and Y. The marginal densities are known. Then,<br />

fY |X(y|x) = fY,X(y, x)<br />

fX(x)<br />

�<br />

1<br />

= √ � exp −<br />

2πσY (1 − ρ2 ) 1 (y − (µY + ρσY (x − µX)/σX))<br />

2<br />

2<br />

σ2 Y (1 − ρ2 �<br />

)<br />

Then the conditional distribution of Y given X = x is also Gaussian, with<br />

(x − µX)<br />

E (Y |X = x) = µY + ρσY<br />

Var (Y |X = x) = σ 2 Y (1 − ρ 2 )<br />

A.4.6 The Multivariate Normal Distribution

Let Σ denote the 2 × 2 symmetric matrix
\[
\Sigma = \begin{pmatrix} \sigma_X^2 & \sigma_X\sigma_Y\rho \\ \sigma_Y\sigma_X\rho & \sigma_Y^2 \end{pmatrix}.
\]
Then
\[
\det\Sigma = \sigma_X^2\sigma_Y^2 - (\sigma_X\sigma_Y\rho)^2 = \sigma_X^2\sigma_Y^2(1-\rho^2)
\]
and
\[
\Sigma^{-1} = \frac{1}{1-\rho^2}
\begin{pmatrix} \dfrac{1}{\sigma_X^2} & -\dfrac{\rho}{\sigma_X\sigma_Y} \\[2ex] -\dfrac{\rho}{\sigma_X\sigma_Y} & \dfrac{1}{\sigma_Y^2} \end{pmatrix}.
\]
Hence the bivariate normal density can be written in matrix notation as
\[
f_{(X,Y)}(x, y) = \frac{1}{2\pi\sqrt{\det\Sigma}}
\exp\left\{-\frac{1}{2}
\begin{pmatrix} x-\mu_X \\ y-\mu_Y \end{pmatrix}^{\!T}
\Sigma^{-1}
\begin{pmatrix} x-\mu_X \\ y-\mu_Y \end{pmatrix}
\right\}.
\]
Let Y = (Y1, . . . , Yp)′ be a random vector of length p. Let E(Yi) = µi, i = 1, . . . , p, and define the p-vector µ = (µ1, . . . , µp)′. Define the p × p matrix Σ with (i, j)th element Cov(Yi, Yj) for i, j = 1, . . . , p. Finally, denote a realization of the random vector Y by y = (y1, . . . , yp)′. Then the random vector Y has a p-dimensional multivariate Gaussian distribution if its density function is specified by
\[
f_Y(y) = \frac{1}{(2\pi)^{p/2}\,|\Sigma|^{1/2}}
\exp\left\{-\frac{1}{2}(y-\mu)'\Sigma^{-1}(y-\mu)\right\}. \qquad \text{(A.13)}
\]
The notation Y ∼ MVNp(µ, Σ) should be read as “the random variable Y follows a multivariate Gaussian (normal) distribution with p-vector mean µ and p × p variance-covariance matrix Σ.”
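A short numpy sketch (assuming numpy and scipy; the mean vector and covariance matrix below are arbitrary but positive definite) that evaluates (A.13) directly and compares it with scipy's built-in multivariate normal density:

```python
import numpy as np
from scipy.stats import multivariate_normal

def mvn_pdf(y, mu, sigma):
    """Evaluate equation (A.13) for a p-dimensional Gaussian."""
    p = len(mu)
    diff = np.asarray(y) - np.asarray(mu)
    quad = diff @ np.linalg.inv(sigma) @ diff
    norm_const = (2 * np.pi) ** (p / 2) * np.sqrt(np.linalg.det(sigma))
    return np.exp(-0.5 * quad) / norm_const

mu = np.array([1.0, -1.0, 0.5])
sigma = np.array([[2.0, 0.3, 0.0],
                  [0.3, 1.0, 0.4],
                  [0.0, 0.4, 1.5]])
y = np.array([0.8, -0.2, 1.0])

print(mvn_pdf(y, mu, sigma))
print(multivariate_normal(mean=mu, cov=sigma).pdf(y))   # the two agree
```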



A.5 Generating Functions

Denote the sample space of the discrete random variable Y as {0, 1, 2, . . .}. Let f denote the probability mass function of Y and suppose that the probabilities are given by
\[
f(j) = P(Y = j) = p_j, \quad j = 0, 1, 2, \ldots, \qquad \text{where } \sum_{j=0}^{\infty} p_j = 1.
\]
The mean and variance of Y satisfy
\[
\mu_Y = E(Y) = \sum_{j=0}^{\infty} j\,p_j
\]
and
\[
\sigma_Y^2 = E\bigl[(Y-\mu_Y)^2\bigr] = E\bigl(Y^2\bigr) - \mu_Y^2 = \sum_{j=0}^{\infty} j^2 p_j - \mu_Y^2.
\]
The probability generating function (p.g.f.) of the discrete random variable Y is a function defined on a subset of the reals, denoted by GY(t) and defined by
\[
G_Y(t) = E\bigl(t^Y\bigr) = \sum_{j=0}^{\infty} p_j t^j,
\]
for those t ∈ R at which the sum converges. Because \(\sum_{j=0}^{\infty} p_j = 1\), the sum defining GY(t) converges absolutely for |t| ≤ 1. That is, GY(t) is well defined for |t| ≤ 1. As the name implies, the p.g.f. generates the probabilities associated with the distribution:
\[
G_Y(0) = p_0, \qquad G_Y'(0) = p_1, \qquad G_Y''(0) = 2!\,p_2.
\]
In general the kth derivative of the p.g.f. of Y satisfies
\[
G_Y^{(k)}(0) = k!\,p_k.
\]
The p.g.f. can be used to calculate the mean and variance of a random variable Y. Note that \(G_Y'(t) = \sum_{j=1}^{\infty} j p_j t^{j-1}\) for −1 < t < 1. Let t approach one from the left, t → 1⁻, to obtain
\[
G_Y'(1) = \sum_{j=1}^{\infty} j\,p_j = E(Y) = \mu_Y.
\]
The second derivative of GY(t) satisfies
\[
G_Y''(t) = \sum_{j=1}^{\infty} j(j-1)p_j t^{j-2}
\quad\Rightarrow\quad
G_Y''(1) = \sum_{j=1}^{\infty} j(j-1)p_j = E\bigl(Y^2\bigr) - E(Y).
\]
If the mean is finite, the variance of Y satisfies
\[
\sigma_Y^2 = E\bigl(Y^2\bigr) - E(Y) + E(Y) - [E(Y)]^2
= G_Y''(1) + G_Y'(1) - [G_Y'(1)]^2.
\]
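A symbolic sketch of these identities (assuming sympy is available), using the Poisson p.g.f. G(t) = e^{−θ(1−t)} from Table A.5.1 and recovering the familiar mean θ and variance θ:

```python
import sympy as sp

t, theta = sp.symbols("t theta", positive=True)
G = sp.exp(-theta * (1 - t))          # Poisson p.g.f.

G1 = sp.diff(G, t).subs(t, 1)         # G'(1)  = E(Y)
G2 = sp.diff(G, t, 2).subs(t, 1)      # G''(1) = E(Y^2) - E(Y)

mean = sp.simplify(G1)
var = sp.simplify(G2 + G1 - G1**2)
print(mean, var)                      # both equal theta
```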

The moment generating function (m.g.f.) of the discrete random variable Y with state space {0, 1, 2, . . .} and probability function f(j) = pj, j = 0, 1, 2, . . . , is denoted by MY(t) and defined as
\[
M_Y(t) = E\bigl(e^{tY}\bigr) = \sum_{j=0}^{\infty} p_j e^{jt},
\]
for those t ∈ R at which the sum converges. The moment generating function generates the moments E(Y^k) of the distribution of the random variable Y:
\[
M_Y(0) = 1, \qquad M_Y'(0) = \mu_Y = E(Y), \qquad M_Y''(0) = E\bigl(Y^2\bigr),
\]
and, in general,
\[
M_Y^{(k)}(0) = E\bigl(Y^k\bigr).
\]
The characteristic function (ch.f.) of the discrete random variable Y is
\[
C_Y(t) = E\bigl(e^{itY}\bigr) = \sum_{j=0}^{\infty} p_j e^{ijt}, \qquad \text{where } i = \sqrt{-1}.
\]
The cumulant generating function (c.g.f.) of the discrete random variable Y is the natural logarithm of the moment generating function and is denoted by KY(t), so that
\[
K_Y(t) = \ln[M_Y(t)].
\]

Assume Y is a continuous random variable with probability density function fY(y). The probability generating function (p.g.f.) of Y is defined as
\[
G_Y(t) = E\bigl(t^Y\bigr) = \int_{\mathbb{R}} f_Y(y)\,t^y\,dy.
\]
The moment generating function (m.g.f.) of Y is
\[
M_Y(t) = E\bigl(e^{tY}\bigr) = \int_{\mathbb{R}} f_Y(y)\,e^{ty}\,dy.
\]
The characteristic function (ch.f.) of Y is
\[
C_Y(t) = E\bigl(e^{itY}\bigr) = \int_{\mathbb{R}} f_Y(y)\,e^{ity}\,dy.
\]


Density       | p.g.f.             | m.g.f.                    | ch.f.                        | c.g.f.
Bi(n, θ)      | (θt + q)^n         | (θe^t + q)^n              | (θe^{it} + q)^n              |
Geo(θ)        | θt/(1 − qt)        | θ/(e^{−t} − q)            | θ/(e^{−it} − q)              |
Neg-Bi(r, θ)  | θ^r (1 − qt)^{−r}  | θ^r (1 − qe^t)^{−r}       | θ^r (1 − qe^{it})^{−r}       | r ln θ − r ln(1 − qe^t)
Poi(θ)        | e^{−θ(1−t)}        | e^{θ(e^t − 1)}            | e^{θ(e^{it} − 1)}            | θ(e^t − 1)
Unif(α, β)    |                    | e^{αt}(e^{βt} − 1)/(βt)   | e^{iαt}(e^{iβt} − 1)/(iβt)   |
Exp(θ)        |                    | (1 − t/θ)^{−1}            | (1 − it/θ)^{−1}              | −ln(1 − t/θ)
Ga(c, λ)      |                    | (1 − t/λ)^{−c}            | (1 − it/λ)^{−c}              | −c ln(1 − t/λ)
N(µ, σ²)      |                    | exp(µt + σ²t²/2)          | exp(iµt − σ²t²/2)            | µt + σ²t²/2

Table A.5.1: Generating functions. For discrete random variables define q = 1 − θ.

Finally, the cumulant generating function (c.g.f.) is KY(t) = ln[MY(t)]. The generating functions are related through
\[
G_Y(e^t) = M_Y(t) \qquad \text{and} \qquad M_Y(it) = C_Y(t).
\]
We can use these generating functions to establish the formulas
\[
\mu_Y = G_Y'(1) = M_Y'(0) = K_Y'(0)
\]
and
\[
\sigma_Y^2 =
\begin{cases}
G_Y''(1) + G_Y'(1) - [G_Y'(1)]^2 \\
M_Y''(0) - [M_Y'(0)]^2 \\
K_Y''(0).
\end{cases}
\]
A very important result concerning generating functions states that the moment generating function uniquely determines the probability distribution (provided it exists in an open interval around zero). The characteristic function also uniquely determines the probability distribution. The generating functions of the discrete and continuous random variables discussed thus far are given in Table A.5.1.

Suppose that Y1, Y2, . . . , Yn are independent random variables. Then the moment generating function of the linear combination \(Z = \sum_{i=1}^{n} a_i Y_i\) is the product of the individual moment generating functions evaluated at a_i t:
\[
M_Z(t) = E\bigl(e^{t\sum_i a_i Y_i}\bigr)
= E\bigl(e^{a_1 t Y_1}\bigr)\,E\bigl(e^{a_2 t Y_2}\bigr)\cdots E\bigl(e^{a_n t Y_n}\bigr)
= \prod_{i=1}^{n} M_{Y_i}(a_i t).
\]
It also follows that \(C_Z(t) = \prod_{i=1}^{n} C_{Y_i}(a_i t)\) and \(K_Z(t) = \sum_{i=1}^{n} K_{Y_i}(a_i t)\).
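A quick Monte Carlo sketch of the product rule (assuming numpy; the rates, weights and value of t are arbitrary, chosen so that a_i t < θ_i and the m.g.f.s exist), using two independent exponentials:

```python
import numpy as np

rng = np.random.default_rng(2)
theta1, theta2 = 2.0, 5.0
a1, a2, t = 1.0, 3.0, 0.4          # need a_i * t < theta_i

y1 = rng.exponential(scale=1 / theta1, size=2_000_000)
y2 = rng.exponential(scale=1 / theta2, size=2_000_000)
z = a1 * y1 + a2 * y2

empirical = np.mean(np.exp(t * z))                              # estimate of E[exp(t*Z)]
product = (1 - a1 * t / theta1) ** -1 * (1 - a2 * t / theta2) ** -1
print(empirical, product)                                       # the two should be close
```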



A.6 Table of Common Distributions

Discrete Distributions

Bernoulli(θ)
pmf: P(Y = y | θ) = θ^y (1 − θ)^{1−y}; y = 0, 1; 0 ≤ θ ≤ 1
mean and variance: E[Y] = θ, Var[Y] = θ(1 − θ)
mgf: MY(t) = θe^t + (1 − θ)

Binomial Y ∼ Bin(n, θ)
pmf: P(Y = y | n, θ) = C(n, y) θ^y (1 − θ)^{n−y}; y = 0, 1, . . . , n; 0 ≤ θ ≤ 1
mean and variance: E[Y] = nθ, Var[Y] = nθ(1 − θ)
mgf: MY(t) = [θe^t + (1 − θ)]^n

Discrete uniform(N)
pmf: P(Y = y | N) = 1/N; y = 1, 2, . . . , N; N ∈ Z+
mean and variance: E[Y] = (N + 1)/2, Var[Y] = (N + 1)(N − 1)/12
mgf: MY(t) = (1/N) Σ_{j=1}^N e^{jt}

Geometric(θ)
pmf: P(Y = y | θ) = θ(1 − θ)^{y−1}; y = 1, 2, . . .; 0 ≤ θ ≤ 1
mean and variance: E[Y] = 1/θ, Var[Y] = (1 − θ)/θ^2
mgf: MY(t) = θe^t/[1 − (1 − θ)e^t], t < −log(1 − θ)
notes: The random variable X = Y − 1 is negative binomial(1, θ).

Hypergeometric(b, w, n)
pmf: P(Y = y | b, w, n) = C(w, y) C(b − w, n − y)/C(b, n); y = 0, 1, . . . , n; w − (b − n) ≤ y ≤ w; b, w, n ≥ 0
mean and variance: E[Y] = nw/b, Var[Y] = nw(b − w)(b − n)/[b^2(b − 1)]

Negative binomial(r, θ)
pmf: P(Y = y | r, θ) = C(r + y − 1, y) θ^r (1 − θ)^y; y = 0, 1, 2, . . .; 0 ≤ θ ≤ 1
mean and variance: E[Y] = r(1 − θ)/θ, Var[Y] = r(1 − θ)/θ^2
mgf: MY(t) = [θ/(1 − (1 − θ)e^t)]^r, t < −log(1 − θ)
notes: An alternative form of the pmf, used in the derivation in our notes, is given by P(N = n | r, θ) = C(n − 1, r − 1) θ^r (1 − θ)^{n−r}, n = r, r + 1, . . . , where the random variable N = Y + r. The negative binomial can also be derived as a mixture of Poisson random variables.

Poisson(θ)
pmf: P(Y = y | θ) = θ^y e^{−θ}/y!; y = 0, 1, 2, . . .; 0 ≤ θ < ∞
mean and variance: E[Y] = θ, Var[Y] = θ
mgf: MY(t) = e^{θ(e^t − 1)}
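A few of the tabulated means and variances can be cross-checked against scipy.stats (assuming scipy is available). Note that scipy's parameterisations differ from the table's; in particular hypergeom(M, n, N) takes M = b (population size), n = w (number of successes) and N = n (sample size).

```python
from scipy import stats

n, theta = 10, 0.3
print(stats.binom(n, theta).stats())         # (n*theta, n*theta*(1 - theta))

r = 4
print(stats.nbinom(r, theta).stats())        # (r*(1-theta)/theta, r*(1-theta)/theta**2)

b, w, ndraw = 20, 7, 5
print(stats.hypergeom(b, w, ndraw).stats())  # (ndraw*w/b, ndraw*w*(b-w)*(b-ndraw)/(b**2*(b-1)))
```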



Continuous Distributions

Uniform U(a, b)
pdf: f(y | a, b) = (b − a)^{−1}; a ≤ y ≤ b
mean and variance: E[Y] = (b + a)/2, Var[Y] = (b − a)^2/12
mgf: MY(t) = (e^{bt} − e^{at})/[(b − a)t]
notes: A uniform distribution with a = 0 and b = 1 is a special case of the beta distribution where α = β = 1.

Exponential E(θ)
pdf: f(y | θ) = θe^{−θy}; 0 ≤ y < ∞; θ > 0
mean and variance: E[Y] = 1/θ, Var[Y] = 1/θ^2
mgf: MY(t) = (1 − t/θ)^{−1}
notes: Special case of the gamma distribution. X = Y^{1/γ} is Weibull, X = √(2θY) is Rayleigh, X = α − γ log(Y/β) is Gumbel.

Gamma G(ω, θ)
pdf: f(y | ω, θ) = θe^{−θy}(θy)^{ω−1}/Γ(ω); 0 ≤ y < ∞; ω, θ > 0
mean and variance: E[Y] = ω/θ, Var[Y] = ω/θ^2
mgf: MY(t) = (1 − t/θ)^{−ω}
notes: Includes the exponential (ω = 1) and the chi-squared with p degrees of freedom (ω = p/2, θ = 1/2).

Normal N(µ, σ^2)
pdf: f(y | µ, σ^2) = (1/(√(2π)σ)) e^{−(y−µ)^2/(2σ^2)}; −∞ < y < ∞; −∞ < µ < ∞; σ > 0
mean and variance: E[Y] = µ, Var[Y] = σ^2
mgf: MY(t) = e^{µt + σ^2 t^2/2}
notes: Sometimes called the Gaussian distribution.
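The mgf entries can also be checked by simulation. A small Monte Carlo sketch (assuming numpy; the shape, rate and value of t are arbitrary, with t < θ so the m.g.f. exists) for the gamma entry:

```python
import numpy as np

rng = np.random.default_rng(3)
omega, theta, t = 3.0, 2.0, 0.5        # need t < theta

# numpy's gamma sampler uses shape and scale = 1/theta for this rate parameterisation
y = rng.gamma(shape=omega, scale=1 / theta, size=2_000_000)
print(np.mean(np.exp(t * y)))          # empirical estimate of E[exp(t*Y)]
print((1 - t / theta) ** -omega)       # tabulated m.g.f. value
```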
