
MSB

ST217: Mathematical Statistics B

Aim

To review, expand & apply the ideas from MSA.

In particular, MSA mainly studied one unknown quantity at a time. In MSB we'll study interrelationships.

Lectures & Classes

Monday 12–1 R0.21
Wednesday 10–11 R0.21
Friday 3–4 ACCR

Examples classes will begin in week 3.

Contents

1. Overview of MSA.

2. Bivariate & Multivariate Probability Distributions.
   Joint distributions, conditional distributions, marginal distributions; conditional expectation. The χ², t, F and multivariate Normal distributions and their interrelationships.

3. Inference for Multiparameter Models.
   Likelihood, frequentist and Bayesian inference, prediction and decision-making. Comparison between various approaches. Point and interval estimation. Classical simple and composite hypothesis testing, likelihood ratio tests, asymptotic results.

4. Linear Statistical Models.
   Linear regression, multiple regression & analysis of variance models. Model choice, model checking and residuals.

5. Further Topics (time permitting).
   Nonlinear models, problems & paradoxes, etc.

Books<br />

The books recommended for MSA are also useful for MSB. Excellent books on ma<strong>the</strong>matical statistics are:<br />

1. ‘Statistical Inference’ by George Casella & Roger L. Berger [C&B], Duxbury Press (2001),<br />

2. ‘Probability and <strong>Statistics</strong>’ by Morris DeGroot, Addison-Wesley (2 nd edition 1989; 3 rd edition by<br />

DeGroot & Schervish, 2002).<br />

A good book discussing <strong>the</strong> application and interpretation of statistical methods is ‘Introduction to <strong>the</strong><br />

Practice of <strong>Statistics</strong>’ by Moore & McCabe [M&M], Freeman (4 th edition 2002). Many of <strong>the</strong> data sets<br />

considered below come from <strong>the</strong> ‘Handbook of Small Data Sets’ [HSDS] by Hand et al., Chapman & Hall,<br />

London (1994).<br />

There are many o<strong>the</strong>r useful references on ma<strong>the</strong>matical statistics available in <strong>the</strong> library, including books<br />

by Hogg & Craig [H&C], Lindgren, Mood, Graybill & Boes [MG&B], and Rice.<br />



Chapter 1

Overview of MSA

1.1 Basic Ideas

1.1.1 What is 'Statistics'?

Statistics may be defined as:

'The study of how information should be employed to reflect on, and give guidance for action in, a practical situation involving uncertainty.' [italics by JEHS]

Vic Barnett, Comparative Statistical Inference

Figure 1.1: A practical situation involving uncertainty



1.1.2 Statistical Modelling

The emphasis of modern statistics is on modelling the patterns and interrelationships in the existing data, and then applying the chosen model(s) to predict future data.

Typically there is a measurable response (for example, reduction Y in a patient's blood pressure) that is thought to be related to explanatory variables x_j (for example, treatment applied, dose, patient's age, weight, etc.). We seek a formula that relates the observed responses to the corresponding explanatory variables, and that can be used to predict future responses in terms of their corresponding explanatory variables:

Observed Response = Fitted Value + Residual,
Future Response = Predicted Value + Error.

Here the fitted values should take account of all the consistent patterns in the data, and the residuals represent the remaining random variation.

1.1.3 Prediction and Decision-Making

Always remember that the main aim in modelling as above is to predict (for example) the effects of different medical treatments, and hence to decide which treatment to use, and in what circumstances.

The fundamental assumption is that the future data will be in some sense similar to existing data. The ideas of exchangeability and conditional independence are crucial.

The following notation is useful:

X ⊥ Y: 'X is independent of Y', i.e. Y gives you no information about X;
X ⊥ Y | Z: 'X is conditionally independent of Y given Z', i.e. if you know the value taken by the RV Z, then Y gives you no further information about X.

Most methods of statistical inference proceed indirectly from what we know (the observed data and any other relevant information) to what we really want to know (future, as yet unobserved, data), by assuming that the random variation in the observed data can be thought of as a sample from an underlying population, and learning about the properties of this population.

1.1.4 Known and Unknown Features of a Statistical Problem

A statistic is a property of a sample, whereas a parameter is a property of a population. Often it's natural to estimate a parameter θ (such as the population mean µ) by the corresponding property of the sample (here the sample mean X̄). Note that θ may be a vector or more complicated object.

Unobserved quantities are treated mathematically as random variables. Potentially observable quantities are usually denoted by capital letters (X_i, X̄, Y, etc.). Once the data have been observed, the values taken by these random variables are known (X_i = x_i, X̄ = x̄, etc.). Unobservable or hypothetical quantities are usually denoted by Greek letters (θ, µ, σ², etc.), and estimators are often denoted by putting a hat on the corresponding symbol (θ̂, µ̂, σ̂², etc.).

Nearly all statistics books use the above style of notation, so it will be adopted in these notes. However, sometimes I shall wish to distinguish carefully between knowns and unknowns, and shall denote all unknowns by capitals. Thus Θ represents an unknown parameter vector, and θ represents a particular assumed value of Θ. This is especially useful when considering probability distributions for parameters; one can then write f_Θ(θ) and P(Θ = θ) (the probability that Θ = θ) by exact analogy with f_X(x) and P(X = x).

The set of possible values for a RV X is called its sample space Ω_X. Similarly the parameter space Ω_Θ is the set of possible values for the parameter Θ.



1.1.5 Likelihood

In general, we can infer properties θ of the population by comparing how compatible are the various possible values of θ with the observed data. This motivates the idea of likelihood (equivalently, log-likelihood or support). We need a probability model for the data, in which the probability distribution of the random variation is a member of a (realistic but mathematically tractable) family of probability distributions, indexed by a parameter θ.

Likelihood-based approaches have both advantages and disadvantages—

Advantages
• Unified theory (many practical problems can be tackled in essentially the same way).
• Often get simple sufficient statistics (hence we can summarise a huge data set by a few simple properties).
• The CLT suggests likelihood methods work well when there's loads of data.

Disadvantages
• Is the theory directly relevant (is likelihood alone enough, and how do we balance realism and tractability)?
• If the probability model is wrong, then results can be misleading (e.g. if one assumes a Normal distribution when the true distribution is Cauchy).
• One seldom has loads of data!

1.1.6 Where Will We Go from Here?

• MSA provided the mathematical toolbox (e.g. probability theory and the idea of random variables) for studying random variation.

• MSB will add to this toolbox and study interrelationships between (random) variables.

• We shall also consider some important general forms for the fitted/predicted values, in particular linear models and their generalizations.

1.2 Sampling Distributions

Statistical analysis involves calculating various statistics from the data, for example the maximum likelihood estimator (MLE) θ̂ for θ. We want to understand the properties of these statistics; hence the importance of the central limit theorem (CLT) & its generalizations, and of studying the probability distributions of transformed random variables.

If we have a formula for a summary statistic S, e.g. S = (Σ X_i)/n = X̄, and are prepared to make certain assumptions about the original random variables X_i, then we can say things about the probability distribution of S.

The probability distribution of a statistic S, i.e. the pattern of values S would take if it were calculated in successive samples similar to the one we actually have, is called its sampling distribution.



1.2.1 Typical Assumptions

1. Standard Assumption (IID RVs):
   X_i are IID (independent and identically distributed) with (unknown) mean µ and variance σ².
   This implies
   (a) E[X̄] = E[X_i] = µ, and
   (b) Var[X̄] = (1/n) Var[X_i] = σ²/n.
   (c) If we define the standardised random variables

       Z_n = (X̄ − µ) / (σ/√n),

   then as n → ∞, the distribution of Z_n tends to the standard Normal N(0, 1) distribution.

2. Additional Assumption (Normality):
   The X_i are IID Normal: X_i ∼ N(µ, σ²).
   This implies that X̄ ∼ N(µ, σ²/n).
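Aside (not part of the original notes): a minimal Python simulation sketch of the CLT statement above. The standardised mean of decidedly non-Normal data (here Exp(1), chosen only for illustration, with mean and SD both 1) is close to N(0, 1) for moderate n; the sample size, replication count and seed are arbitrary.

import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 1.0, 1.0           # mean and SD of an Exp(1) random variable
n, reps = 50, 100_000          # sample size and number of replications

samples = rng.exponential(scale=1.0, size=(reps, n))
xbar = samples.mean(axis=1)
z = (xbar - mu) / (sigma / np.sqrt(n))

print("mean of Z_n:", z.mean())   # should be close to 0
print("SD of Z_n:  ", z.std())    # should be close to 1
# A histogram of z (not drawn here) lies close to the standard Normal curve.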

1.2.2 Further Uses of Sampling Distributions

We can also

• compare various plausible estimators (e.g. to estimate the centre of symmetry of a supposedly symmetric distribution we might use the sample mean, median, or something more exotic),

• obtain interval estimates for unknown quantities (e.g. 95% confidence intervals, HPD intervals, support intervals),

• test hypotheses about unknown quantities.

Comments

1. Note the importance of expectations of (possibly transformed) random variables:

   E[X] = µ (measure of location)
   E[(X − µ)²] = σ² (measure of scale)
   E[e^{sX}] = moment generating function
   E[e^{itX}] = characteristic function

2. We must always consider whether the assumptions made are reasonable, both from general considerations (e.g. is independence reasonable? is the assumption of identical distributions reasonable? is it reasonable to assume that the data follow a Poisson distribution? etc.) and with reference to the observed set of data (e.g. are there any 'outliers'—unreasonably extreme values—or unexpected patterns?).

3. Likelihood and other methods suggest estimators for unknown quantities of interest (parameters etc.) under certain specified assumptions.

   Even if these assumptions are invalid (and in practice they always will be to some extent!) we may still want to use summary statistics as estimators of properties of the underlying population. Therefore

   (a) We'll want to investigate the properties of estimators under various relaxed assumptions, for example partially specified models that use only the first and second moments of the unknown quantities.

   (b) It's useful if the calculated statistics (e.g. MLEs) have an intuitive interpretation (like 'sample mean' or 'sample variance').



1.3 (Revision) Problems

1. First-year students attending a statistics course were asked to carry out the following procedure:

   Toss two coins, without showing anyone else the results.
   If the first coin showed 'Heads' then answer the following question:
   "Did the second coin show 'Heads'? (Yes or No)"
   If the first coin showed 'Tails' then answer the following question:
   "Have you ever watched a complete episode of 'Teletubbies'? (Yes or No)"

   The following results were recorded:

             Yes   No
   Males      84   48
   Females    23   24

   For each sex, and for both sexes combined, estimate the proportion who have watched a complete episode of 'Teletubbies'.

   Using a chi-squared test, or otherwise, test whether the proportions differ between the sexes.

   Discuss the assumptions you have made in carrying out your analysis.
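Aside (not a model answer): one possible numerical sketch for this problem, assuming fair coins so that P(Yes) = 1/4 + p/2, where p is the proportion who have watched; scipy is assumed available.

import numpy as np
from scipy.stats import chi2_contingency

table = np.array([[84, 48],    # males:   Yes, No
                  [23, 24]])   # females: Yes, No

for label, (yes, no) in zip(["males", "females", "combined"],
                            [table[0], table[1], table.sum(axis=0)]):
    p_yes = yes / (yes + no)
    # invert P(Yes) = 1/4 + p/2  =>  p = 2 * (P(Yes) - 1/4)
    print(f"{label}: estimated proportion = {2 * (p_yes - 0.25):.3f}")

# Chi-squared test of equal Yes-probabilities for the two sexes
chi2, p_value, dof, expected = chi2_contingency(table, correction=False)
print(f"chi-squared = {chi2:.3f}, p-value = {p_value:.3f}")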

2. Let X and Y be IID RVs with a standard Normal N(0, 1) distribution, and define Z = X/Y.

   (a) Write down the lower quartile, median and upper quartile of Z, i.e. the points z_25, z_50 & z_75 such that P(Z < z_k) = k/100.

   (b) Show that Z has a Cauchy distribution, with PDF 1/(π(z² + 1)).
       HINT: consider the transformation Z = X/Y and W = |Y|.

3. Let X_1, ..., X_n be mutually independent RVs, with respective MGFs (moment generating functions) M_{X_1}(t), ..., M_{X_n}(t), and let a_1, ..., a_n and b_1, ..., b_n be fixed constants.

   Show that the MGF of Z = (a_1 X_1 + b_1) + (a_2 X_2 + b_2) + · · · + (a_n X_n + b_n) is

   M_Z(t) = exp(t Σ b_i) M_{X_1}(a_1 t) × · · · × M_{X_n}(a_n t).

   Hence or otherwise show that any linear combination of independent Normal RVs is itself Normally distributed.

4. A workman has to move a rectangular stone block a short distance, but doesn't want to strain himself. He rapidly estimates:

   • height of block = 10 cm, with standard deviation 1 cm,
   • width of block = 20 cm, with standard deviation 3 cm,
   • length of block = 25 cm, with standard deviation 4 cm,
   • density of block = 4.0 g/cc, with standard deviation 0.5 g/cc.

   Assuming these estimates are mutually independent, calculate his estimates of the volume V (cc) and total weight W (kg) of the block, and their standard deviations.

   The workman fears that he might hurt his back if W ≥ 30.

   Using Chebyshev's inequality, give an upper bound for his probability P(W ≥ 30).

   [Chebyshev's inequality states that if X has mean µ & variance σ², then P(|X − µ| ≥ c) ≤ σ²/c²—see MSA.]

   What is the workman's value for P(W > 30) under the additional assumption that W is Normally distributed? Compare this value with the bound found earlier.

   How reasonable are the independence and Normality assumptions used in the above analysis?
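Aside (not a model answer): for mutually independent factors, E[∏X_i] = ∏E[X_i] and E[(∏X_i)²] = ∏E[X_i²] = ∏(µ_i² + σ_i²). The sketch below uses this to compute the implied means, SDs and the Chebyshev bound; the unit conversion (grams to kilograms) is an assumption about how W is measured.

from math import prod, sqrt

dims = [(10, 1), (20, 3), (25, 4)]    # (mean, SD) of height, width, length in cm
density = (4.0, 0.5)                  # g/cc

def product_moments(factors):
    # mean and SD of a product of independent factors
    mean = prod(m for m, s in factors)
    second = prod(m**2 + s**2 for m, s in factors)
    return mean, sqrt(second - mean**2)

v_mean, v_sd = product_moments(dims)
print(f"V: estimate {v_mean} cc, SD {v_sd:.0f} cc")

w_mean, w_sd = product_moments(dims + [density])
w_mean, w_sd = w_mean / 1000, w_sd / 1000      # grams -> kilograms
print(f"W: estimate {w_mean} kg, SD {w_sd:.2f} kg")

# Chebyshev: P(W >= 30) <= P(|W - 20| >= 10) <= Var(W) / 10^2
print("Chebyshev bound:", w_sd**2 / 10**2)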



5. Calculate the MLE of the centre of symmetry θ, given IID RVs X_1, X_2, ..., X_n, where the common PDF f_X(x) of the X_i is

   (a) Normal (or Gaussian):

       f_X(x|θ, σ) = (1/(√(2π) σ)) exp( −½((x − θ)/σ)² ),

   (b) Laplacian (or Double Exponential):

       f_X(x|θ, σ) = (1/(2σ)) exp( −|x − θ|/σ ),

   (c) Uniform (or Rectangular):

       f_X(x|θ) = 1 if θ − ½ < x < θ + ½, and 0 otherwise.

   Do you consider these MLEs to be intuitively reasonable?

6. Calculate E[X], E[X²], E[X³] and E[X⁴] under each of the following assumptions:

   (a) X ∼ Poi(λ), i.e. X has PMF (probability mass function)

       P(X = x|λ) = λ^x exp(−λ)/x!   (x = 0, 1, 2, ...),

   (b) X ∼ Exp(β), i.e. X has PDF (probability density function)

       f_X(x|β) = βe^{−βx} if x > 0, and 0 otherwise,

   (c) X ∼ N(µ, σ²), i.e. X has PDF

       f_X(x|µ, σ) = (1/(√(2π) σ)) exp( −½((x − µ)/σ)² ).

7. Describe briefly how, and under what circumstances, you might approximate

   (a) a binomial distribution by a Normal distribution,
   (b) a binomial distribution by a Poisson distribution,
   (c) a Poisson distribution by a Normal distribution.

   Suppose X ∼ Bin(100, 0.1), Y ∼ Poi(10), and Z ∼ N(10, 3²). Calculate, or look up in tables,

   (i) P(X ≥ 6), (ii) P(Y ≥ 6), (iii) P(Z > 5.5),
   (iv) P(X > 16), (v) P(Y > 16), (vi) P(Z > 16.5),

   and comment on the accuracy of the approximations here.
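Aside (not part of the original notes): the six probabilities can be computed with scipy.stats instead of tables; a sketch, assuming scipy is available:

from scipy.stats import binom, poisson, norm

X = binom(n=100, p=0.1)      # X ~ Bin(100, 0.1)
Y = poisson(mu=10)           # Y ~ Poi(10)
Z = norm(loc=10, scale=3)    # Z ~ N(10, 3^2)

print("P(X >= 6)  =", 1 - X.cdf(5))   # (i)
print("P(Y >= 6)  =", 1 - Y.cdf(5))   # (ii)
print("P(Z > 5.5) =", Z.sf(5.5))      # (iii)
print("P(X > 16)  =", X.sf(16))       # (iv)
print("P(Y > 16)  =", Y.sf(16))       # (v)
print("P(Z > 16.5)=", Z.sf(16.5))     # (vi)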

8. The t distribution with n degrees of freedom, denoted t_n or t(n), has the PDF

   f(t) = [Γ((n+1)/2) / (Γ(n/2) √(nπ))] (1 + t²/n)^{−(n+1)/2},   −∞ < t < ∞,

   and the F distribution with m and n degrees of freedom, denoted F_{m,n} or F(m, n), has PDF

   f(x) = [Γ((m+n)/2) / (Γ(m/2) Γ(n/2))] m^{m/2} n^{n/2} x^{(m/2)−1} / (mx + n)^{(m+n)/2},   0 < x < ∞,

   with f(x) = 0 for x ≤ 0.

   Show that if T ∼ t_n and X ∼ F_{m,n}, then T² and X^{−1} both have F distributions.



9. Table 1.1 shows the estimated total resident population (thousands) of England and Wales at 30 June 1993:

   Age      Persons     Males      Females
   < 1         669.6      343.1      326.5
   1–14      9,268.0    4,756.9    4,511.1
   15–44    21,875.0   11,115.6   10,759.4
   45–64    11,435.8    5,676.6    5,759.2
   65–74     4,595.9    2,081.7    2,514.2
   ≥ 75      3,594.9    1,224.5    2,370.4
   Total    51,439.2   25,198.4   26,240.8

   Table 1.1: Estimated resident population of England & Wales, mid 1993, by sex and age-group (simplified from Table 1 of the 1993 mortality tables)

   Table 1.2, also extracted from the published 1993 Mortality Statistics, shows the number of deaths in 1993 among the resident population of England and Wales, categorised by sex, age-group and underlying cause of death.

   Assume that the rates observed in Tables 1.1 and 1.2 hold exactly, and suppose that an individual I is chosen at random from the population. Define the random variables S (sex), A (age group), D (death) and C (cause) as follows:

   S = 0 if I is male, 1 if I is female;
   A = 1 if I is under 1 year old, 2 if I is aged 1–14, 3 if I is aged 15–44, 4 if I is aged 45–64, 5 if I is aged 65–74, 6 if I is 75 years old or over;
   D = 0 if I survives the year, 1 if I dies;
   C = cause of death (0–17).

   For example,

   P(S=0) = 25198.4/51439.2,
   P(S=0 & A=6) = 1224.5/51439.2,
   P(D=0|S=0 & A=6) = 1 − 138.239/1224.5,
   P(C=8|S=0 & A=6) = 28.645/1224.5, etc.

   (a) Calculate P(D=1|S=0), and P(D=1|S=0 & A=a) for a = 1, 2, 3, 4, 5, 6.

       Also calculate P(S=0|D=1), and P(S=0|D=1 & A=a) for a = 1, 2, 3, 4, 5, 6.

       If you were an actuary, and were asked by a non-expert "is the death rate for males higher or lower than that for females?", how would you respond based on the above calculations? Justify your answer.

   (b) Similarly, explain how you would respond to the questions

       i. "is the death rate from neoplasms higher for males or for females?"
       ii. "is the death rate from mental disorders higher for males or for females?"
       iii. "is the death rate from diseases of the circulatory system higher for males or for females?"
       iv. "is the death rate from diseases of the respiratory system higher for males or for females?"



                                              Age at death (years)
   Cause of death                         Sex  All ages     < 1   1–14  15–44   45–64   65–74    ≥ 75
   0  Deaths below 28 days                 M     1,603    1,603      −      −       −       −       −
      (no cause specified)                 F     1,192    1,192      −      −       −       −       −
   1  Infectious & parasitic diseases      M     1,954       60     79    565     390     346     514
                                           F     1,452       46     44    169     193     283     717
   2  Neoplasms                            M    74,480       16    195  2,000  16,372  25,644  30,253
                                           F    67,966        8    138  2,551  15,026  19,141  31,102
   3  Endocrine, nutritional & metabolic   M     3,515       28     43    208     639     959   1,638
      diseases and immunity disorders      F     4,403       17     37    153     474     901   2,821
   4  Diseases of blood and                M       897        5     12     62     106     204     508
      blood-forming organs                 F     1,084        3     14     28      73     163     803
   5  Mental disorders                     M     2,530        −      8    281     169     334   1,738
                                           F     5,189        −      1     83      99     297   4,709
   6  Diseases of the nervous system       M     4,403       59    136    530     675     890   2,113
      and sense organs                     F     4,717       42    118    313     546     809   2,889
   7  Diseases of the circulatory system   M   123,717       41     66  1,997  20,682  37,195  63,736
                                           F   134,439       44     45    834   7,783  23,185 102,548
   8  Diseases of the respiratory system   M    41,802       86     79    608   3,157   9,227  28,645
                                           F    49,068       59     74    322   2,145   6,602  39,866
   9  Diseases of the digestive system     M     7,848       10     27    511   1,706   2,058   3,536
                                           F    10,574       20     14    298   1,193   1,921   7,128
   10 Diseases of the genitourinary        M     3,008        4      6     57     215     676   2,050
      system                               F     3,710        4      7     55     219     535   2,890
   11 Complications of pregnancy,          M         −        −      −      −       −       −       −
      childbirth and the puerperium        F        27        −      −     27       −       −       −
   12 Diseases of the skin and             M       269        1      1      7      22      62     176
      subcutaneous tissue                  F       748        −      −     15      30      80     623
   13 Diseases of the musculoskeletal      M       785        1      5     28     106     151     494
      system and connective tissue         F     2,639        −      5     43     173     385   2,033
   14 Congenital anomalies                 M       660      131    114    158     118      58      81
                                           F       675      136    116    133     101      87     102
   15 Certain conditions originating       M       186       93      8     13      18      16      38
      in the perinatal period              F       114       60      5      3       4      10      32
   16 Signs, symptoms and                  M     1,642      238     17    126     111      72   1,078
      ill-defined conditions               F     5,146      171     17     50      53      75   4,780
   17 External causes of injury            M     9,859       34    311  4,749   2,183     941   1,641
      and poisoning                        F     5,869       30    162  1,240     882     731   2,824
      Total                                M   279,158    2,410  1,107 11,900  46,669  78,833 138,239
                                           F   299,012    1,832    797  6,317  28,994  55,205 205,867

   Table 1.2: Deaths in England & Wales, 1993, by underlying cause, sex and age-group (extracted from Table 2 of the 1993 mortality tables)



   (c) Now treat the data in Tables 1.1 & 1.2 as subject to statistical fluctuations. One can still estimate

       p_{sac} = P(S=s & A=a & C=c),   p_{·ac} = P(A=a & C=c),   p_{s··} = P(S=s), etc.

       from the data, for example p̂_{0,·,14} = 660/25198400 = 2.62×10⁻⁵. Similarly estimate p_{1,·,14} and p_{·,a,14} for a = 1, ..., 6. Using a chi-squared test or otherwise, investigate whether the relative risk of death from a congenital anomaly between males and females is the same at all ages, i.e. whether it is reasonable to assume that

       p_{s,a,14} = p_{s,·,14} × p_{·,a,14}.

10. Data were collected on litter size and sex ratios for a large number of litters of piglets. The following table gives the data for all litters of size between four and twelve:

    Number               Litter size
    of males     4    5    6    7    8    9   10   11   12
        0        1    2    3    0    1    0    0    0    0
        1       14   20   16   21    8    2    7    1    0
        2       23   41   53   63   37   23    8    3    1
        3       14   35   78  117   81   72   19   15    8
        4        1   14   53  104  162  101   79   15    4
        5             4   18   46   77   83   82   33    9
        6                  0   21   30   46   48   13   18
        7                       2    5   12   24   12   11
        8                            1    7   10    8   15
        9                                 0    0    1    4
       10                                      0    1    0
       11                                           0    0
       12                                                0
    Total       53  116  221  374  402  346  277  102   70

    (a) Discuss briefly what sort of probability distributions it might be reasonable to assume for the total size N of a litter, and for the number M of males in a litter of size N = n.

    (b) Suppose now that the litter size N follows a Poisson distribution with mean λ. Write down an expression for P(N = n | 4 ≤ N ≤ 12). Hence or otherwise give an expression for the log-likelihood l(λ; ...) given the above table of data.

    (c) Evaluate l(λ; ...) at λ = 7.5, 8 and 8.5. By fitting a quadratic to these values, provide point and interval estimates of λ. (A numerical sketch follows this problem.)

    (d) Using a chi-squared test or otherwise, check how well your model fits the data.

    (e) Comment on the following argument: 'Provided λ isn't too small, we could approximate the Poisson distribution Poi(λ) by the Normal distribution N(λ, λ). This is symmetric, so we may simply estimate the mean λ by the mode of the data (8 in our case). The standard deviation is therefore nearly 3, and so we would expect the counts at litter size 8 ± 3 to be nearly 60% of the count at 8 (note that for a standard Normal, φ(1)/φ(0) = exp(−0.5) ≈ 0.6). Since there are far fewer litters of size 5 & 11 than this, the Poisson distribution must be a poor fit.'

    Data from HSDS, set 176
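Aside (not a model answer): a numerical sketch for part (c), evaluating the log-likelihood of λ under the truncated-Poisson model; scipy is assumed available.

import numpy as np
from scipy.stats import poisson

sizes = np.arange(4, 13)
counts = np.array([53, 116, 221, 374, 402, 346, 277, 102, 70])

def loglik(lam):
    # log P(N = n | 4 <= N <= 12) summed over the observed litters
    log_pmf = poisson.logpmf(sizes, lam)
    log_trunc = log_pmf - np.log(poisson.pmf(sizes, lam).sum())
    return float(counts @ log_trunc)

for lam in (7.5, 8.0, 8.5):
    print(f"l({lam}) = {loglik(lam):.3f}")
# Fitting a quadratic through these three points gives the approximate MLE
# and a curvature-based interval estimate, as the problem suggests.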

Education is what survives when what has been learnt has been forgotten.

Burrhus Frederic Skinner



Chapter 2

Bivariate & Multivariate Distributions

MSA largely concerned IID (independent & identically distributed) random variables.

However, in practice we are usually most interested in several random variables simultaneously, and their interrelationships. Therefore we need to consider the probability distributions of random vectors, i.e. the joint distribution of the individual random variables.

Bivariate Examples

A. (X_1, X_2), the number of male & female pigs in a litter.

B. (X, Y), the systolic and diastolic blood pressure of an individual.

C. (X, Y), the age and height of an individual.

D. (X, Y), the height and weight of an individual.

E. (µ̂, σ̂²), the estimated common mean and variance of n IID random variables X_1, ..., X_n.

F. (Θ, X) where Θ ∼ U(0, 1) and X|Θ ∼ Bin(n, Θ), i.e.

   f_Θ(θ) = 1 if 0 < θ < 1, and 0 otherwise;
   f_X(x|Θ = θ) = (n choose x) θ^x (1 − θ)^{n−x},   x = 0, 1, ..., n.

Definition 2.1 (Bivariate CDF)
The joint cumulative distribution function of 2 RVs X & Y is the function

F_{X,Y}(x, y) = P(X ≤ x & Y ≤ y),   (x, y) ∈ R².   (2.1)

Comments

1. The joint cumulative distribution function (or joint CDF) may also be called the 'joint distribution function' or 'joint DF'.

2. If there's no ambiguity, then we may simply write F(x, y) for F_{X,Y}(x, y).



2.1 Discrete Bivariate Distributions

If RVs X & Y are discrete, then they have a discrete joint distribution and a probability mass function (PMF) that, similarly to the univariate case, is usually written f_{X,Y}(x, y) or more simply f(x, y):

Definition 2.2 (Bivariate PMF)
The joint probability mass function of discrete RVs X and Y is

f(x, y) = P(X = x & Y = y).

Exercise 2.1
Suppose that the numbers X_1 and X_2 of male and female piglets follow independent Poisson distributions with means λ_1 & λ_2 respectively. Find the joint PMF.

‖<br />

Exercise 2.2
Now assume the model N ∼ Poi(λ), (X_1|N) ∼ Bin(N, θ), i.e. the total number N of piglets follows a Poisson distribution, and, conditional on N = n, X_1 has a Bin(n, θ) distribution (in particular θ = 0.5 if the sexes are equally likely). Again find the joint PMF.

‖<br />

Exercise 2.3
Verify that the two models given in Exercises 2.1 & 2.2 give identical fitted values, and are therefore in practice indistinguishable.

‖<br />
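Aside (not part of the original notes): a quick numerical check of Exercise 2.3, using the reparametrisation λ = λ_1 + λ_2, θ = λ_1/(λ_1 + λ_2); the particular values of λ_1, λ_2 are arbitrary.

from math import comb, exp, factorial

l1, l2 = 1.3, 2.1
lam, theta = l1 + l2, l1 / (l1 + l2)

def pmf_model1(x1, x2):                  # independent Poissons (Exercise 2.1)
    return (exp(-l1) * l1**x1 / factorial(x1)
            * exp(-l2) * l2**x2 / factorial(x2))

def pmf_model2(x1, x2):                  # Poisson total, binomial split (Exercise 2.2)
    n = x1 + x2
    return (exp(-lam) * lam**n / factorial(n)
            * comb(n, x1) * theta**x1 * (1 - theta)**x2)

assert all(abs(pmf_model1(a, b) - pmf_model2(a, b)) < 1e-12
           for a in range(10) for b in range(10))
print("joint PMFs agree")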

2.1.1 Manipulation

A discrete RV has a countable sample space, which without loss of generality can be represented as N = {0, 1, 2, ...}. Values of a discrete joint distribution f(x, y) can therefore be tabulated:

             Y
          0     1     2    ...
      0  f_00  f_01  f_02  ...
  X   1  f_10  f_11  f_12  ...
      .   .     .     .
      .   .     .     .

and the probability of any event E obtained by simple summation:

P((X, Y) ∈ E) = Σ_{(x_i, y_i) ∈ E} f(x_i, y_i).

Exercise 2.4
Continuing Exercise 2.2, find the PMF of X_1, and hence identify the distribution of X_1.

‖<br />

Exercise 2.5
The RV Q is defined on the rational numbers in [0, 1] by Q = X/Y, where f(x, y) = (1 − α)α^{y−1}/(y + 1), 0 < α < 1, y ∈ {1, 2, ...}, x ∈ {0, 1, ..., y}.
Show that P(Q = 0) = (α − 1)(α + log(1 − α))/α².

‖<br />



2.2 Continuous Bivariate Distributions

Definition 2.3 (Continuous bivariate distribution)
Random variables X & Y have a continuous joint distribution if there exists a function f from R² to [0, ∞) such that

P((X, Y) ∈ A) = ∬_A f(x, y) dx dy   ∀ A ⊆ R².   (2.2)

Definition 2.4 (Bivariate PDF)
The function f(x, y) defined by Equation 2.2 is called the joint probability density function of X & Y.

Comments

1. f(x, y) may be written more explicitly as f_{X,Y}(x, y).

2. ∫_{−∞}^{∞} ∫_{−∞}^{∞} f(x, y) dx dy = 1.

3. f(x, y) is not unique—it could be arbitrarily defined at a countable set of points (x_i, y_i) (more generally, any 'set with measure zero') without changing the value of ∬_A f(x, y) dx dy for any set A.

4. f(x, y) ≥ 0 at all continuity points (x, y) ∈ R².

Examples

1. As in Example E of the bivariate examples above, we will want to know properties of the joint distribution of (µ̂, σ̂²), the MLEs of µ and σ² respectively given X_1, ..., X_n ∼ IID N(µ, σ²).

2. In the situation of Example B above, where X is the systolic blood pressure and Y the diastolic blood pressure of an individual, it might be reasonable to assume that

   X ∼ N(µ_S, σ²_S),
   Y|X ∼ N(α + βX, σ²_D),

   and hence obtain

   f_{X,Y}(x, y) = f_X(x) f_{Y|X}(y|x).

Comment

As in Exercise 2.2, a family of multivariate distributions is most easily built up hierarchically using simple univariate distributions and conditional distributions like that of Y|X. Conditional distributions are considered formally in Section 2.4.

2.2.1 Visualising and Displaying a Continuous Joint Distribution

A continuous bivariate distribution can be represented by a contour or other plot of its joint PDF (Fig. 2.1).

Comments

1. The joint distribution of X and Y may be neither discrete nor continuous, for example:

   • Either X or Y may have both continuous and discrete components,
   • One of X and Y may have a continuous distribution, the other discrete (like Example F above).

2. Higher-dimensional joint distributions are obviously much more difficult to interpret and to represent graphically, with or without computer help.



Figure 2.1: Contour and perspective plots of a bivariate distribution

2.3 Marginal Distributions

Given a joint CDF F_{X,Y}(x, y), the distributions defined by the CDFs F_X(x) = lim_{y→∞} F_{X,Y}(x, y) and F_Y(y) = lim_{x→∞} F_{X,Y}(x, y) are called the marginal distributions of X and Y respectively:

Definition 2.5 (Marginal CDF, PMF and PDF—bivariate case)
F_X(x) = lim_{y→∞} F_{X,Y}(x, y) is the marginal CDF of X.
If X has a discrete distribution, then f_X(x) = P(X = x) is the marginal PMF of X.
If X has a continuous distribution, then f_X(x) = (d/dx) F_X(x) is the marginal PDF of X.

Marginal CDFs and PDFs of Y, and of other RVs for higher-dimensional joint distributions, are defined similarly.

Exercise 2.6
Suppose that you are given a bag containing five coins:

1 double-tailed, 1 with P(head) = 1/4, 2 fair, 1 double-headed.

You pick one coin at random (each with probability 1/5), then toss it twice.
By finding the joint distribution of Θ = P(head) and X = number of heads, or otherwise, calculate the distribution of the number of heads obtained.

‖<br />
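Aside (not a model answer): a sketch tabulating the joint PMF of Θ and X for this exercise and summing out Θ to get the marginal of X (exactly the row/column-sum idea of the comment below).

from math import comb

coins = {0.0: 1, 0.25: 1, 0.5: 2, 1.0: 1}   # P(head) -> number of such coins

# joint PMF: P(Theta = th, X = x) = P(pick such a coin) * Bin(2, th) PMF at x
joint = {(th, x): (k / 5) * comb(2, x) * th**x * (1 - th)**(2 - x)
         for th, k in coins.items() for x in range(3)}

for x in range(3):
    print(f"P(X = {x}) = {sum(joint[th, x] for th in coins):.5f}")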

Comments

1. If you've tabulated P(Θ = θ & X = x), then it's simple to find F_Θ(θ) and F_X(x) by writing the row sums and column sums in the margins of the table of P(Θ = θ & X = x)—hence the name 'marginal distribution'.

2. Although the most satisfactory general definition of marginal distributions is in terms of their CDFs, in practice it's usually easiest to work with PMFs or PDFs.

2.4 Conditional Distributions

2.4.1 Discrete Case

If X and Y are discrete RVs then, by definition,

P(Y=y|X=x) = P(X=x & Y=y)/P(X=x).   (2.3)



In other words (or, more accurately, in other symbols):

Definition 2.6 (Conditional PMF—bivariate case)
If X and Y have a discrete joint distribution with PMF f_{X,Y}(x, y), then the conditional PMF f_{Y|X} of Y given X = x is

f_{Y|X}(y|x) = f_{X,Y}(x, y) / f_X(x),   (2.4)

where f_X(x) = Σ_y f_{X,Y}(x, y) is the marginal PMF of X.

Exercise 2.7
Continuing Exercise 2.6, what are the conditional distributions of [X|Θ = 1/4] and [Θ|X = 0]?

‖<br />

2.4.2 Continuous Case

Now suppose that X and Y have a continuous joint distribution. If we observe X = x, then we will want to know the conditional CDF F_{Y|X}(y|X = x). But we CAN'T use Equation 2.3 directly, which would entail dividing by zero. Therefore, by analogy with Equation 2.4, we adopt the following definition:

Definition 2.7 (Conditional PDF—bivariate case)
If X and Y have a continuous joint distribution with PDF f_{X,Y}(x, y), then the conditional PDF f_{Y|X} of Y given that X = x is

f_{Y|X}(y|x) = f_{X,Y}(x, y) / f_X(x),   (2.5)

defined for all x ∈ R such that f_X(x) > 0.

2.4.3 Independence

Recall that two RVs X and Y are independent (X ⊥ Y) if, for any two sets A, B ⊆ R,

P(X ∈ A & Y ∈ B) = P(X ∈ A) P(Y ∈ B).   (2.6)

Exercise 2.8
Show that X and Y are independent according to Formula 2.6 if and only if

F_{X,Y}(x, y) = F_X(x) F_Y(y),   −∞ < x, y < ∞,   (2.7)

or equivalently if and only if

f_{X,Y}(x, y) = f_X(x) f_Y(y),   −∞ < x, y < ∞,   (2.8)

(where the functions f are interpreted as PMFs or PDFs in the discrete or continuous case respectively).

‖<br />



2.5 Problems

1. Let the function f(x, y) be defined by

   f(x, y) = 6xy² if 0 < x < 1 and 0 < y < 1, and 0 otherwise.

   (a) Show that f(x, y) is a probability density function.
   (b) If X and Y have the joint PDF f(x, y) above, show that P(X + Y ≥ 1) = 9/10.
   (c) Find the marginal PDF f_X(x) of X.
   (d) Show that P(0.5 < X < 0.75) = 5/16.

2. Suppose that the random vector (X, Y) takes values in the region A = {(x, y) | 0 ≤ x ≤ 2, 0 ≤ y ≤ 2}, and that its CDF within A is given by F_{X,Y}(x, y) = xy(x + y)/16.

   (a) Find F_{X,Y}(x, y) for values of (x, y) outside A.
   (b) Find the marginal CDF F_X(x) of X.
   (c) Find the joint PDF f_{X,Y}(x, y).

3. Suppose that X and Y are RVs with joint PDF

   f(x, y) = cx²y if x² ≤ y ≤ 1, and 0 otherwise.

   (a) Find the value of c.
   (b) Find P(X ≥ Y).
   (c) Find the marginal PDFs f_X(x) & f_Y(y).

4. For each of the following joint PDFs f of X and Y, determine the constant c, find the marginal PDFs of X and Y, and determine whether or not X and Y are independent.

   (a) f(x, y) = c e^{−(x+2y)} for x, y ≥ 0, and 0 otherwise.

   (b) f(x, y) = c y²/2 for 0 ≤ x ≤ 2 and 0 ≤ y ≤ 1, and 0 otherwise.

   (c) f(x, y) = c x e^{−y} for 0 ≤ x ≤ 1 and 0 ≤ y < ∞, and 0 otherwise.

   (d) f(x, y) = c xy for x, y ≥ 0 and x + y ≤ 1, and 0 otherwise.

5. Suppose that X and Y are continuous RVs with joint PDF f(x, y) = e^{−y} on 0 < x < y < ∞.

   (a) Find P(X + Y ≥ 1) [HINT: write this as 1 − P(X + Y < 1)].
   (b) Find the marginal distribution of X.
   (c) Find the conditional distribution of Y given that X = x.



6. Assume that X and Y are random variables each taking values in [0, 1]. For each of the following CDFs, show that the marginal distributions of X and Y are both uniform U(0, 1), and determine the conditional CDF F_{X|Y}(x|Y = 0.5) in each case:

   (a) F(x, y) = xy,
   (b) F(x, y) = min(x, y),
   (c) F(x, y) = 0 if x + y < 1, and x + y − 1 if x + y ≥ 1.

7. Suppose that Θ is a random variable uniformly distributed on (0, 1), i.e. Θ ∼ U(0, 1), and that, once Θ = θ has been observed, the random variable X is drawn from a binomial distribution [X|θ] ∼ Bin(2, θ).

   (a) Find the joint CDF F(θ, x).
   (b) How might you display the joint distribution of Θ and X graphically?
   (c) What (as simply as you can express them) are the marginal CDFs F_1(θ) of Θ and F_2(x) of X?

8. Suppose that X and Y are two RVs having a continuous joint distribution. Show that X and Y are independent if and only if f_{X|Y}(x|y) = f_X(x) for each value of y such that f_Y(y) > 0, and for all x.

9. Suppose that X ∼ U(0, 1) and [Y|X = x] ∼ U(0, x). Find the marginal PDFs of X and Y.

2.6 Multivariate Distributions

2.6.1 Introduction

Given a random vector X = (X_1, X_2, ..., X_n)^T, the joint distribution of the random variables X_1, X_2, ..., X_n is called a multivariate distribution.

Definition 2.8 (Joint CDF)
The joint cumulative distribution function of RVs X_1, X_2, ..., X_n is the function

F_X(x_1, x_2, ..., x_n) = P(X_k ≤ x_k ∀ k = 1, 2, ..., n).   (2.9)

Comments

1. Formula 2.9 can be written succinctly as F_X(x) = P(X ≤ x), in an 'obvious' vector notation.

2. F_X(x) can be called simply the CDF of the random vector X.

3. Properties of F_X are similar to the bivariate case. Unfortunately the notation is messier, particularly for the things we're generally most interested in for statistical inference, such as

   (a) marginal distributions of unknown quantities and vectors,
   (b) conditional distributions of unknown quantities and vectors, given what we know.

4. It's often simpler to blur the distinction between row and column vectors, i.e. to let X denote either (X_1, X_2, ..., X_n) or (X_1, X_2, ..., X_n)^T, depending on context.



Definition 2.9 (Discrete multivariate distribution)
The RV X ∈ R^n has a discrete distribution if it can take only a countable number of possible values.

Definition 2.10 (Multivariate PMF)
If X has a discrete distribution, then its probability mass function (PMF) is

f(x) = P(X = x),   x ∈ R^n   (2.10)

[i.e. the RVs X_1, ..., X_n have joint PMF f(x_1, ..., x_n) = P(X_1 = x_1 & · · · & X_n = x_n)].

Definition 2.11 (Continuous multivariate distribution)
The RV X = (X_1, X_2, ..., X_n) has a continuous distribution if there is a nonnegative function f(x), where x = (x_1, x_2, ..., x_n), such that for any subset A ⊂ R^n,

P((X_1, X_2, ..., X_n) ∈ A) = ∫ ... ∫_A f(x_1, x_2, ..., x_n) dx_1 dx_2 ... dx_n.   (2.11)

Definition 2.12 (Multivariate PDF)
The function f in 2.11 is the (joint) probability density function of X.

Comments

1. Without loss of generality, if X is discrete, then we can take its possible values to be N^n (i.e. each coordinate X_i of X is a nonnegative integer).

2. Equation 2.11 could be simply written

   P(X ∈ A) = ∫_A f(x) dx.   (2.12)

3. As usual, f(·) may be written more explicitly f_X(·), etc.

4. By the fundamental theorem of calculus,

   f_X(x_1, ..., x_n) = ∂^n F_X(x_1, ..., x_n) / (∂x_1 · · · ∂x_n)   (2.13)

   at all points (x_1, ..., x_n) where this derivative exists (i.e. f_X(x) = ∂^n F_X(x)/∂x).

5. Mixed distributions (neither continuous nor discrete) can be handled using appropriate combinations of summation and integration.

2.6.2 Useful Notation for Marginal & Conditional Distributions

We'll sometimes adopt the following notation from DeGroot, particularly when the components X_i of X are in some way similar, as in the multivariate Normal distribution (see later).

F(x) denotes the CDF of X = (X_1, X_2, ..., X_n) at x = (x_1, x_2, ..., x_n);
f(x) denotes the corresponding joint PMF (discrete case) or PDF (continuous case);
f_j(x_j) denotes the marginal PMF (PDF) of X_j (integrating over x_1, ..., x_{j−1}, x_{j+1}, ..., x_n);
f_{jk}(x_j, x_k) denotes the marginal joint PDF of X_j & X_k (integrating over the remaining x_i's);
g_j(x_j | x_1, ..., x_{j−1}, x_{j+1}, ..., x_n) denotes the conditional PMF (PDF) of X_j given X_i = x_i, i ≠ j;
F_j(x_j) denotes the marginal CDF of X_j;
G_{jk} denotes the conditional CDF of (X_j, X_k) given the values x_i of all X_i, i ≠ j, k; etc.



2.7 Expectation

2.7.1 Introduction

The following are important definitions and properties involving expectations, variances and covariances:

Var(X) = E[(X − µ)²] = E[X²] − µ², where µ = EX,
E[aX + b] = a EX + b, where a and b are constants,
E[(aX + b)²] = a² E[X²] + 2ab EX + b²,
Var(aX + b) = a² Var(X),
E[X_1 X_2] = (EX_1)(EX_2) if X_1 ⊥ X_2,
Cov(X_1, X_2) = E[(X_1 − µ_1)(X_2 − µ_2)] = E[X_1 X_2] − µ_1 µ_2,
SD(X) = √Var(X),
corr(X_1, X_2) = ρ(X_1, X_2) = Cov(X_1, X_2) / (SD(X_1) SD(X_2)).

Note that the definition of expectation applies directly in the multivariate case:

Definition 2.13 (Multivariate expectation)

E[h(X)] = Σ_x h(x) f(x)          if X is discrete,
        = ∫_{R^n} h(x) f(x) dx   if X is continuous.

For example, if X = (X_1, X_2, X_3) has a continuous distribution, then

E[X_1] = ∫_{−∞}^{∞} ∫_{−∞}^{∞} ∫_{−∞}^{∞} x_1 f(x_1, x_2, x_3) dx_1 dx_2 dx_3.

Exercise 2.9
Let X and Y be independent continuous RVs. Prove that, for arbitrary functions g(·) and h(·),

E[g(X) h(Y)] = (E g(X))(E h(Y)).

‖

Exercise 2.10
Let X, Y and Z have independent Poisson distributions with means λ, µ, ν respectively. Find E[X²YZ].

‖<br />

Exercise 2.11
[Cauchy–Schwarz] By considering E((tX − Y)²), or otherwise, prove the Cauchy–Schwarz inequality for expectations, i.e. for any two RVs X and Y with finite second moments,

(E(XY))² ≤ E(X²) E(Y²),

with equality if and only if P(Y = cX) = 1 for some constant c.
Hence or otherwise prove that the correlation ρ_{X,Y} between X and Y satisfies |ρ_{X,Y}| ≤ 1.
Under what circumstances does ρ_{X,Y} = 1?

‖<br />



2.8 Approximate Moments of Transformed Distributions

The moments of a transformed RV g(X) can often be well approximated via a Taylor series:

Exercise 2.12
[delta method] Let X_1, X_2, ..., X_n be independent, each with mean µ and variance σ², and let g(·) be a function with a continuous derivative g′(·).
By considering a Taylor series expansion involving

Z_n = (X̄ − µ) / √(σ²/n),

show that

E[g(X̄)] = g(µ) + O(n⁻¹),   (2.14)
Var[g(X̄)] = n⁻¹ σ² g′(µ)² + O(n^{−3/2}).   (2.15)

‖<br />

Comments

1. There is similarly a multivariate delta method, outside the scope of this course.

2. Important uses of expansions like the delta method include identifying useful transformations g(·), for example to remove skewness or, when Var(X) is a function of µ, to make Var[g(X̄)] (approximately) independent of µ.

3. A useful transformation g(X) is sometimes in practice applied to the original RVs on the (often reasonable) assumption that the properties of (Σ g(X_i))/n will be similar to those of g((Σ X_i)/n).

Exercise 2.13
[Variance stabilising transformations]
Suppose that X_1, X_2, ..., X_n are IID and that the (common) variance of each X_i is a function of the (common) mean µ = EX_i.
Show that the variance of g(X̄) is approximately constant if

g′(µ) = 1/√(Var(µ)).

If X ∼ Poi(µ), show that Y = √X̄ has approximately constant variance.

‖<br />
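Aside (not part of the original notes): a simulation sketch supporting this exercise. For simplicity it looks at a single Poisson observation X rather than X̄: Var(√X) stays near 1/4 across a wide range of means, while Var(X) = µ grows linearly. Sample sizes and seed are arbitrary.

import numpy as np

rng = np.random.default_rng(1)
for mu in (5, 20, 80, 320):
    x = rng.poisson(mu, size=200_000)
    print(f"mu = {mu:4d}: Var(X) = {x.var():8.1f}, "
          f"Var(sqrt(X)) = {np.sqrt(x).var():.3f}")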



2.9 Problems

1. The discrete random vector (X_1, X_2, X_3) has the following PMF:

   (X_1 = 1)        X_3              (X_1 = 2)        X_3
                  1     2     3                     1     2     3
   X_2 = 1      .02   .03   .05      X_2 = 1      .08   .04   .03
   X_2 = 2      .04   .06   .10      X_2 = 2      .12   .11   .07
   X_2 = 3      .02   .03   .05      X_2 = 3      .05   .05   .05

   (a) Calculate the marginal PMFs: f_1(x_1), f_2(x_2), f_3(x_3) and f_{12}(x_1, x_2).
   (b) Are X_1 and X_2 independent?
   (c) What are the conditional PMFs: g_1(x_1|X_2 = 1, X_3 = 3), g_2(x_2|X_1 = 1, X_3 = 3), g_3(x_3|X_1 = 1, X_2 = 3), and g_{12}(x_1, x_2|X_3 = 3)?

2. The RVs A, B, C, etc. count the number of times the corresponding letter appears when a word is chosen at random from the following list (each being chosen with probability 1/16):

   MASCARA, MASK, MERCY, MONSTER,
   MOVIE, PREY, REPLICA, REPTILES,
   RITE, SEAT, SNAKE, SOMBRE,
   SQUID, TENDER, TIME, TROUT.

   (a) Complete the following table of the joint distribution of E, M and R:

               E = 0                 E = 1                 E = 2
           M=0      M=1          M=0      M=1          M=0      M=1
   R=0     1/16     1/16   R=0   2/16            R=0
   R=1                     R=1                   R=1

   (b) Calculate all three bivariate marginal distributions, and hence find which of the following statements are true:

       (i) E ⊥ M, (ii) E ⊥ R, (iii) M ⊥ R.

   (c) Similarly discover which of the following statements are true:

       (iv) M ⊥ R | E=0, (v) M ⊥ R | E=1, (vi) M ⊥ R | E=2,
       (vii) M ⊥ R | E, (viii) E ⊥ R | M, (ix) E ⊥ M | R.

3. Find variance stabilizing transformations for

   (a) the exponential distribution,
   (b) the binomial distribution.

4. Let Z ∼ N(0, 1) and define the RV X by

   P(X = −√3) = 1/6, P(X = 0) = 4/6, P(X = +√3) = 1/6.

   (a) Show that X has the same mean and variance as Z, and that X² has the same mean and variance as Z².

   (b) Suppose the RV Y has mean µ and variance σ². Compare the delta method for estimating the mean and variance of the RV T = g(Y) with the alternative estimates µ̂(T) ≈ E(g(µ + σX)), V̂ar(T) ≈ Var(g(µ + σX)). [Try a few simple distributions for Y and transformations g(·).]



2.10 Conditional Expectation

2.10.1 Introduction

A common practical problem arises when X_1 and X_2 aren't independent, we observe X_2 = x_2, and we want to know the mean of the resulting conditional distribution of X_1.

Definition 2.14 (Conditional expectation)
The conditional expectation of X_1 given X_2 is denoted E[X_1|X_2]. If X_2 = x_2 then

E[X_1|x_2] = ∫_{−∞}^{∞} x_1 g_1(x_1|x_2) dx_1 (continuous case)   (2.16)
E[X_1|x_2] = Σ_{x_1} x_1 g_1(x_1|x_2) (discrete case)   (2.17)

where g_1(x_1|x_2) is the conditional PDF or PMF respectively.

Comment

Note that before X_2 is known to take the value x_2, E[X_1|X_2] is itself a random variable, being a function of the RV X_2. We'll be interested in the distribution of the RV E[X_1|X_2], and (for example) comparing it with the unconditional expectation EX_1. The following is an important result:

Theorem 2.1 (Marginal expectation)
For any two RVs X_1 & X_2,

E[E[X_1|X_2]] = EX_1.   (2.18)

Exercise 2.14
Prove Equation 2.18 (i) for continuous RVs X_1 and X_2, (ii) for discrete RVs X_1 and X_2.

Exercise 2.15
Suppose that the RV X has a uniform distribution, X ∼ U(0, 1), and that, once X = x has been observed, the conditional distribution of Y is [Y|X = x] ∼ U(x, 1).
Find E[Y|x] and hence, or otherwise, show that EY = 3/4.

‖<br />

Exercise 2.16
Suppose that Θ ∼ U(0, 1) and (X|Θ) ∼ Bin(2, Θ).
Find E[X|Θ] and hence or otherwise show that EX = 1.

‖<br />
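Aside (not part of the original notes): a Monte Carlo sanity check of the answers asserted in Exercises 2.15 and 2.16 (no substitute for the proofs); the seed and sample size are arbitrary.

import numpy as np

rng = np.random.default_rng(2)
n = 1_000_000

# Exercise 2.15: X ~ U(0,1), Y|X=x ~ U(x,1); E[Y|x] = (1+x)/2, so EY = 3/4
x = rng.uniform(0, 1, n)
y = rng.uniform(x, 1)          # elementwise U(x_i, 1)
print("EY =", y.mean())        # approx 0.75

# Exercise 2.16: Theta ~ U(0,1), X|Theta ~ Bin(2, Theta); E[X|Theta] = 2*Theta
theta = rng.uniform(0, 1, n)
xb = rng.binomial(2, theta)    # elementwise Bin(2, theta_i)
print("EX =", xb.mean())       # approx 1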

2.10.2 Conditional Expectations of Functions of RVs

By extending Theorem 2.1, we can relate the conditional and marginal expectations of functions of RVs (in particular, their variances).

Theorem 2.2 (Marginal expectation of a transformed RV)
For any RVs X_1 & X_2, and for any function h(·),

E[E[h(X_1)|X_2]] = E[h(X_1)].   (2.19)


Exercise 2.17
Prove Equation 2.19 (i) for discrete RVs X_1 and X_2, (ii) for continuous RVs X_1 and X_2.

‖

An important consequence of Equation 2.19 is the following theorem relating marginal variance to conditional variance and conditional expectation:

Theorem 2.3 (Marginal variance)
For any RVs X_1 & X_2,

Var(X_1) = E[Var(X_1|X_2)] + Var(E[X_1|X_2]).   (2.20)

Comments

1. Equation 2.20 is easiest to remember in English:

   'marginal variance = expectation of conditional variance + variance of conditional expectation'.

2. A useful interpretation of Equation 2.20 is:

   Var(X_1) = average random variation inherent in X_1 even if X_2 were known
            + random variation due to not knowing X_2 and hence not knowing EX_1.

   That is, the uncertainty involved in predicting the value x_1 taken by a random variable X_1 splits into two components. One component is the unavoidable uncertainty due to random variation in X_1, but the other can be reduced by observing quantities (here the value x_2 of X_2) related to X_1.

Exercise 2.18
[Proof of Theorem 2.3] Expand E[Var(X_1|X_2)] and Var(E[X_1|X_2]).
Hence show that Var(X_1) = E[Var(X_1|X_2)] + Var(E[X_1|X_2]).

‖<br />

Exercise 2.19<br />

Continuing Exercise 2.16, in which Θ ∼ U (0, 1), (X|Θ) ∼ Bin(2, Θ), and E[X |Θ] = 2Θ, find Var ( E[X |Θ] )<br />

and E [ Var(X |Θ) ] . Hence or o<strong>the</strong>rwise show that VarX = 2/3, and comment on <strong>the</strong> effect on <strong>the</strong> uncertainty<br />

in X of observing Θ.<br />

‖<br />
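Theorem 2.3 can be checked the same way. A minimal sketch (assuming NumPy) for the model of Exercise 2.19, where E[X | Θ] = 2Θ and Var(X | Θ) = 2Θ(1 − Θ):

```python
import numpy as np

rng = np.random.default_rng(1)
theta = rng.uniform(0.0, 1.0, size=10**6)
x = rng.binomial(2, theta)

print(x.var())                              # marginal variance Var(X), ~ 2/3
cond_mean = 2 * theta                       # E[X | Theta]
cond_var = 2 * theta * (1 - theta)          # Var(X | Theta)
print(cond_var.mean() + cond_mean.var())    # E[Var] + Var[E]; should match
```

The split into the two printed terms shows how much of the uncertainty in X would be removed by observing Θ.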



2.11 Problems

1. Two fair coins are tossed independently. Let A_1, A_2 and A_3 be the following events:
   A_1 = 'coin 1 comes down heads',
   A_2 = 'coin 2 comes down heads',
   A_3 = 'results of both tosses are the same'.
   (a) Show that A_1, A_2 and A_3 are pairwise independent (i.e. A_1 ⊥ A_2, A_1 ⊥ A_3 and A_2 ⊥ A_3) but not mutually independent.
   (b) Hence or otherwise construct three random variables X_1, X_2, X_3 such that E[X_3 | X_1 = x_1] and E[X_3 | X_2 = x_2] are constant, but E[X_3 | X_1 = x_1 & X_2 = x_2] isn't.

2. Construct three random variables X_1, X_2, X_3 with continuous distributions such that X_1 ⊥ X_2, X_1 ⊥ X_3 and X_2 ⊥ X_3, but any two X_i's determine the remaining one.

3. (a) Show that for any random variables X and Y,
      i. E[Y] = E[ E[Y | X] ],
      ii. Var[Y] = E[ Var[Y | X] ] + Var[ E[Y | X] ].
   (b) Suppose that the random variables X_i and P_i, i = 1, ..., n, have the following distributions:

          X_i = { 1 with probability P_i,
                { 0 with probability 1 − P_i,
          P_i ~ IID Beta(α, β),

      i.e. P_i has density

          f(p) = Γ(α + β) / ( Γ(α) Γ(β) ) · p^{α−1} (1 − p)^{β−1}

      with mean µ and variance σ² given by

          µ = E[P_i] = α/(α + β),   σ² = Var[P_i] = αβ / ( (α + β)² (α + β + 1) ),

      and X_i has a Bernoulli(P_i) distribution.
      Find
      i. E[X_1 | P_1],
      ii. Var[X_1 | P_1],
      iii. Var[ E[X_1 | P_1] ], and
      iv. E[ Var[X_1 | P_1] ].
      Hence find E[Y] where Y = Σ_{i=1}^n X_i, and show that Var[Y] = nαβ/(α + β)².
   (c) Express E[Y] and Var[Y] in terms of µ and σ², and comment on the result.
   From Warwick ST217 exam 1998

4. Suppose that the number N of bye-elections occurring in Government-held seats over a 12-month period follows a Poisson distribution with mean 10.
   Suppose also that, independently for each such bye-election, the probability that the Government hold onto the seat is 1/4. The number X of seats retained in the N bye-elections therefore follows a binomial distribution:

       [X | N] ~ Bin(N, 0.25).

   (a) What are E[N], Var[N], E[X | N] and Var[X | N]?
   (b) What are E[X] and Var[X]?
   (c) What is the distribution of X?
   [HINT: try using generating functions; see MSA]


5. (a) For continuous random variables X_1, X_2, X_3 with joint density f_X(x), define the following in terms of f_X(x):
      i. the marginal densities f_{X_1}(x_1) and f_{X_1,X_2}(x_1, x_2),
      ii. the conditional density f_{X_2|X_1}(x_2 | x_1) of X_2 given X_1 = x_1, and
      iii. the conditional expectation E[h(X_2) | X_1] of a function h(X_2) of X_2 given X_1.
   (b) Show that
      i. E[h(X_2)] = E[ E[h(X_2) | X_1] ],
      ii. Var[X_2] = E[ Var[X_2 | X_1] ] + Var[ E[X_2 | X_1] ],
      iii. Cov[X_2, X_3] = E[ Cov[X_2, X_3 | X_1] ] + Cov[ E[X_2 | X_1], E[X_3 | X_1] ], and
      iv. Cov[X_1, X_2 | X_1] = 0.
   (c) Suppose that the random variables X and Y have a continuous joint distribution, with PDF f(x, y), means µ_X & µ_Y respectively, variances σ_X² & σ_Y² respectively, and correlation ρ. Suppose also that

          E[Y | x] = β_0 + β_1 x

      where β_0 and β_1 are constants.
      Show that
      i. ∫_{-∞}^{∞} y f(x, y) dy = (β_0 + β_1 x) f_X(x),
      ii. µ_Y = β_0 + β_1 µ_X, and
      iii. ρ σ_X σ_Y + µ_X µ_Y = β_0 µ_X + β_1 (σ_X² + µ_X²).
      (Hint: use the fact that E[XY] = E[E[XY | X]].)
   (d) Hence or otherwise express β_0 and β_1 in terms of µ_X, µ_Y, σ_X, σ_Y & ρ.

6. For discrete random variables X and Y, define:
   (i) the conditional expectation of Y given X, E[Y | X], and
   (ii) the conditional variance of Y given X, Var[Y | X].
   Show that
   (iii) E[Y] = E[ E[Y | X] ], and
   (iv) Var[Y] = E[ Var[Y | X] ] + Var[ E[Y | X] ].
   (v) Show also that if E[Y | X] = β_0 + β_1 X for some constants β_0 and β_1, then

       E[XY] = β_0 E[X] + β_1 E[X²].

   From Warwick ST217 exam 2002

   The random variable X denotes the number of leaves on a certain plant at noon on Monday, Y denotes the number of greenfly on the plant at noon on Tuesday, and Z denotes the number of ladybirds on the plant at noon on Wednesday.
   Suppose that, given X = x, Y has a Poisson distribution with mean µx. If X has a Poisson distribution with mean λ, show that

       E[Y] = λµ   and   Var[Y] = λµ(1 + µ)

   (you may assume that for a Poisson distribution the mean and variance are equal).
   Suppose further that, given Y = y, Z has a Poisson distribution with mean νy. Find E[Z], Var[Z], and the correlation between X and Z.
   From Warwick ST217 exam 1996



7. Using the relationship

       E[ E[h(X_1) | X_2] ] = E[h(X_1)],

   where

       h(x_1) = ( x_1 − E[X_1 | x_2] + E[X_1 | x_2] − EX_1 )²,

   prove that

       Var(X_1) = E[ Var(X_1 | X_2) ] + Var( E[X_1 | X_2] )

   for any two random variables X_1 & X_2.

8. Prove that, for any three RVs X, Y and Z for which the various expectations exist,
   (a) X and Y − E(Y | X) are uncorrelated,
   (b) Var( Y − E(Y | X) ) = E( Var(Y | X) ),
   (c) if X and Y are uncorrelated then E( Cov(X, Y | Z) ) = −Cov( E(X | Z), E(Y | Z) ),
   (d) Cov( Z, E(Y | Z) ) = Cov(Z, Y).

In scientific thought we adopt the simplest theory which will explain all the facts under consideration and enable us to predict new facts of the same kind. The catch in this criterion lies in the word 'simplest'. It is really an aesthetic canon such as we find implicit in our criticisms of poetry or painting.
J. B. S. Haldane

All models are wrong, some models are useful.
G. E. P. Box

A child of five would understand this. Send somebody to fetch a child of five.
Groucho Marx



Chapter 3

The Multivariate Normal Distribution

3.1 Motivation

A Normally distributed RV X ~ N(µ, σ²) has PDF

    f(x; µ, σ²) = constant × exp[ −(1/2) (x − µ)²/σ² ]   (3.1)

where
    µ is the mean of X,
    σ² is the variance of X, and
    'constant' is there to make f integrate to 1.

The Normal distribution is important because, by the CLT, as n → ∞, the CDF of an MLE such as θ̂ = ΣX_i/n or θ̂ = Σ(X_i − ΣX_j/n)²/n tends uniformly (under reasonable conditions) to the CDF of a Normal RV with the appropriate mean and variance; i.e. the log-likelihood tends to a quadratic in θ.

Similarly it can be shown that, for a model with parameter vector θ = (θ_1, ..., θ_p)^T, under reasonable conditions the log-likelihood will tend to a quadratic in (θ_1, ..., θ_p).

Therefore, by analogy with Equation 3.1, we will want to define a distribution with PDF

    f(x; µ, V) = constant × exp[ −(1/2) (x − µ)^T V^{-1} (x − µ) ]   (3.2)

where
    µ is a (p × 1) matrix or column vector,
    V is a (p × p) matrix, and
    'constant' is again there to make f integrate to 1.



As an example of a PDF of this form, if X_1, X_2, ..., X_p ~IID N(0, 1), then

    f(x) = f_1(x_1) × f_2(x_2) × ··· × f_p(x_p)   by independence
         = (2π)^{−p/2} exp( −(1/2) Σx_i² ) = (2π)^{−p/2} exp( −(1/2) x^T x ).   (3.3)

Definition 3.1 (Multivariate standard Normal)
The distribution with PDF

    f(z) = f(z_1, z_2, ..., z_p) = (2π)^{−p/2} exp( −(1/2) z^T z )

is called the multivariate standard Normal distribution.

The statement 'Z has a multivariate standard Normal distribution' is often written

    Z ~ N(0, I),   Z ~ MVN(0, I),   Z ~ N_p(0, I),   or   Z ~ MVN_p(0, I),

and the CDF and PDF of Z are often written Φ(z) and φ(z), or Φ_p(z) and φ_p(z), respectively.

In the more general case, where the component RVs X_1, X_2, ..., X_p in Equation 3.2 aren't independent, we need an expression for the constant term.

3.2 Digression: Transforming a Random Vector

Exercise 3.1
Suppose that the RVs Z_1, Z_2, ..., Z_n have a continuous joint distribution, with joint PDF f_Z(z). Consider a 1-1 transformation (i.e. a bijection between the corresponding sample spaces) to new RVs X_1, X_2, ..., X_n. What is the PDF f_X(x) of the transformed RVs?

Solution: Because the transformation is 1-1 we can invert it and write Z = u(X), i.e. a given point (z_1, ..., z_n) transforms to (x_1, ..., x_n), where

    z_1 = u_1(x_1, ..., x_n),
    z_2 = u_2(x_1, ..., x_n),
      .
      .
    z_n = u_n(x_1, ..., x_n).   (3.4)

Now assume that each function u_i(·) is continuous and differentiable. Then we can form the following matrix:

    ∂u/∂x = [ ∂u_1/∂x_1   ∂u_1/∂x_2   ...   ∂u_1/∂x_n ]
            [ ∂u_2/∂x_1   ∂u_2/∂x_2   ...   ∂u_2/∂x_n ]
            [     .            .       ..        .     ]
            [ ∂u_n/∂x_1   ∂u_n/∂x_2   ...   ∂u_n/∂x_n ]   (3.5)

and its determinant J, which is called the Jacobian of the transformation u [i.e. of the joint transformation (u_1, ..., u_n)].

Then it can be shown that

    f_X(x) = |J| × f_Z(z)

at all points in the 'sample space' (i.e. set of possible values) of X.
‖



[Figure 3.1: Bivariate Parameter Transformation. An infinitesimal δ_1 × δ_2 rectangle at z, with density f_Z(z), area δ_1 δ_2 and probability content δ_1 δ_2 f_Z(z), is mapped by u^{-1} to an infinitesimal parallelogram at x = u^{-1}(z) with area δ_1 δ_2 / |J|. Since the probability content δ_1 δ_2 f_Z(z) is preserved, the density at x is |J| × f_Z(z).]
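The result of Exercise 3.1 is easy to check numerically. Below is a minimal sketch (assuming NumPy and SciPy) for the 1-1 transformation X_1 = exp(Z_1), X_2 = Z_1 + Z_2, whose inverse is z_1 = log x_1, z_2 = x_2 − log x_1 with Jacobian J = 1/x_1; a crude Monte Carlo estimate of the density of X at a point is compared with |J| f_Z(z):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
n = 10**6
z1, z2 = rng.standard_normal(n), rng.standard_normal(n)
x1, x2 = np.exp(z1), z1 + z2              # samples of X = (e^{Z_1}, Z_1 + Z_2)

# Density of X at (a, b) via f_X(x) = |J| f_Z(u(x)), with |J| = 1/x_1:
a, b = 1.0, 0.5
f_formula = (1 / a) * norm.pdf(np.log(a)) * norm.pdf(b - np.log(a))

# Monte Carlo estimate: P(X in a small box around (a, b)) / (area of the box).
h = 0.1
in_box = (np.abs(x1 - a) < h) & (np.abs(x2 - b) < h)
f_mc = in_box.mean() / (2 * h) ** 2

print(f_formula, f_mc)                    # should agree to about 2 decimal places
```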

3.3 The Bivariate Normal Distribution

Suppose that Z_1 and Z_2 are IID with N(0, 1) distributions, i.e. (as in Equation 3.3):

    f_Z(z_1, z_2) = (1/2π) exp( −(1/2)(z_1² + z_2²) ).

Now let µ_1, µ_2 ∈ (−∞, ∞), σ_1, σ_2 ∈ (0, ∞) & ρ ∈ (−1, 1), and define (as in DeGroot §5.12):

    X_1 = σ_1 Z_1 + µ_1,
    X_2 = σ_2 ( ρZ_1 + √(1 − ρ²) Z_2 ) + µ_2.

Then the Jacobian of the transformation from Z to X is given by

    J = det [ σ_1           0         ] = √(1 − ρ²) σ_1 σ_2.   (3.6)
            [ ρσ_2   √(1 − ρ²) σ_2    ]

Therefore the Jacobian of the inverse transformation from X to Z is 1/( √(1 − ρ²) σ_1 σ_2 ), and the PDF of X is given by Equations 3.7 & 3.8 below.

Definition 3.2 (Bivariate Normal Distribution)
The continuous bivariate distribution with PDF

    f_X(x) = (1/|J|) f_Z(z) = 1/( 2π √(1 − ρ²) σ_1 σ_2 ) × exp( −Q / ( 2(1 − ρ²) ) ),   (3.7)

where

    Q = ((x_1 − µ_1)/σ_1)² − 2ρ ((x_1 − µ_1)/σ_1)((x_2 − µ_2)/σ_2) + ((x_2 − µ_2)/σ_2)²,   (3.8)

is called the bivariate Normal distribution.



Exercise 3.2
If the RV X = (X_1, X_2) has PDF given by Equations 3.7 & 3.8, then show, by substituting

    v = (x_2 − µ_2)/σ_2   followed by   w = ( v − ρ(x_1 − µ_1)/σ_1 ) / √(1 − ρ²),

or otherwise, that X_1 ~ N(µ_1, σ_1²).
Hence or otherwise show that the conditional distribution of X_1 given X_2 = x_2 is Normal with mean µ_1 + (ρσ_1/σ_2)(x_2 − µ_2) and variance σ_1²(1 − ρ²).
‖

Comments
1. It's easy to show (Problem 2 of Section 3.4 below) that EX_i = µ_i, VarX_i = σ_i² and corr(X_1, X_2) = ρ. This suggests that we will be able to write X = (X_1, X_2)^T ~ MVN(µ, V), where

       µ = (µ_1, µ_2)^T is the 'mean vector' of X, and

       V = [ σ_1²      ρσ_1σ_2 ]   is the 'variance-covariance matrix' of X.
           [ ρσ_1σ_2   σ_2²    ]

2. The 'level curves' (i.e. contours in 2-d) of the bivariate Normal PDF are given by Q = constant in Formula 3.8; i.e. ellipses, provided the discriminant is negative:

       ( ρ/(σ_1σ_2) )² − 1/(σ_1²σ_2²) = (ρ² − 1)/(σ_1²σ_2²) < 0.

   This holds as we are only considering 'nonsingular' bivariate Normal distributions with ρ ≠ ±1.

3. PLEASE MAKE NO ATTEMPT TO MEMORISE FORMULAE 3.7 & 3.8!!

Exercise 3.3
Show that the inverse of the variance-covariance matrix

    V = [ σ_1²      ρσ_1σ_2 ]
        [ ρσ_1σ_2   σ_2²    ]

is

    V^{-1} = 1/(1 − ρ²) [ 1/σ_1²         −ρ/(σ_1σ_2) ]
                        [ −ρ/(σ_1σ_2)    1/σ_2²      ].
‖
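The construction at the start of this Section is also the standard way to simulate bivariate Normal data. A minimal sketch (assuming NumPy), checking the moments claimed in Comment 1 above:

```python
import numpy as np

rng = np.random.default_rng(3)
mu1, mu2, s1, s2, rho = 1.0, -2.0, 2.0, 0.5, 0.8

z1, z2 = rng.standard_normal(10**6), rng.standard_normal(10**6)
x1 = s1 * z1 + mu1
x2 = s2 * (rho * z1 + np.sqrt(1 - rho**2) * z2) + mu2

print(x1.mean(), x2.mean())            # ~ mu1, mu2
print(x1.std(), x2.std())              # ~ s1, s2
print(np.corrcoef(x1, x2)[0, 1])       # ~ rho
```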



3.4 Problems

1. Suppose that the RVs X_1, X_2, ..., X_n have a continuous joint distribution with PDF f_X(x), and that the RVs Y_1, Y_2, ..., Y_n are defined by Y = AX, where the (n × n) matrix A is nonsingular. Show that the joint density of the Y_i s is given by

       f_Y(y) = (1/|det A|) f_X(A^{-1} y)   for y ∈ R^n.

   Hence or otherwise show carefully that if X_1 and X_2 are independent RVs with PDFs f_1 and f_2 respectively, then the PDF of Y = X_1 + X_2 is given by

       f_Y(y) = ∫_{-∞}^{∞} f_1(y − z) f_2(z) dz   for −∞ < y < ∞,

   or equivalently by

       f_Y(y) = ∫_{-∞}^{∞} f_1(z) f_2(y − z) dz   for −∞ < y < ∞.

   If X_i ~IID Exp(1), i = 1, 2, then what is the distribution of X_1 + X_2?

2. Suppose that Z_1 and Z_2 are i.i.d. random variables with standard Normal N(0, 1) distributions. Define the random vector (X_1, X_2) by:

       X_1 = µ_1 + σ_1 Z_1,   X_2 = µ_2 + σ_2 [ ρZ_1 + √(1 − ρ²) Z_2 ],

   where σ_1, σ_2 > 0 and −1 ≤ ρ ≤ 1.
   (a) Show that E[X_1] = µ_1, E[X_2] = µ_2, Var[X_1] = σ_1², Var[X_2] = σ_2², and corr[X_1, X_2] = ρ.
   (b) Find E[X_2 | X_1] and Var[X_2 | X_1].
   (c) Derive the joint PDF f(x_1, x_2).
   (d) Find the distribution of [X_2 | X_1]. Hence or otherwise show that two r.v.s with a joint bivariate Normal distribution are independent if and only if they are uncorrelated.
   (e) Now suppose that σ_1 = σ_2. Show that the RVs Y_1 = X_1 + X_2 and Y_2 = X_1 − X_2 are independent.

3. Suppose that X and Y have the joint density

       f_{X,Y}(x, y) = 1/( 2π σ_X σ_Y √(1 − ρ²) )
                       × exp( −1/( 2(1 − ρ²) ) [ ((x − µ_X)/σ_X)² − 2ρ ((x − µ_X)/σ_X)((y − µ_Y)/σ_Y) + ((y − µ_Y)/σ_Y)² ] ).

   (a) Show by substituting u = (x − µ_X)/σ_X and v = (y − µ_Y)/σ_Y followed by w = (u − ρv)/√(1 − ρ²), or otherwise, that f_{X,Y} does indeed integrate to 1.
   (b) Show that the 'joint MGF' M_{X,Y}(s, t) = E[ exp(sX + tY) ] is given by

       M_{X,Y}(s, t) = exp[ µ_X s + µ_Y t + (1/2)(σ_X² s² + 2ρσ_Xσ_Y st + σ_Y² t²) ].

   (c) Show that

       ∂M_{X,Y}/∂s |_{s,t=0} = µ_X,   ∂²M_{X,Y}/∂s² |_{s,t=0} = µ_X² + σ_X²,   &   ∂²M_{X,Y}/∂s∂t |_{s,t=0} = µ_X µ_Y + ρσ_X σ_Y.

   (d) Guess the formula for the MGF M_X(s) of X, where X ~ MVN(µ, V).

4. Suppose that (X_1, X_2) have a bivariate Normal distribution. Show that any linear combination Y = a_0 + a_1 X_1 + a_2 X_2 has a univariate Normal distribution.



3.5 The Multivariate Normal Distribution

Definition 3.3 (Multivariate Normal distribution)
Let µ = (µ_1, µ_2, ..., µ_p) be a p-vector, and let V be a symmetric positive-definite (p × p) matrix. Then the multivariate probability density defined by

    f_X(x; µ, V) = 1/√( (2π)^p |V| ) exp[ −(1/2) (x − µ)^T V^{-1} (x − µ) ]   (3.9)

is called a multivariate Normal PDF with mean vector µ and variance-covariance matrix V.

Comments
1. Expression 3.9 is a natural generalisation of the univariate Normal density, with V taking the rôle of σ² in the exponent, and its determinant |V| taking the rôle of σ² in the 'normalising constant' that makes the whole thing integrate to 1. Many of the properties of the MVN distribution are guessable from properties of the univariate Normal distribution; in particular, it's helpful to think of 3.9 as 'exponential of a quadratic'.

2. The statement 'X = (X_1, X_2, ..., X_p) has a multivariate Normal distribution with mean vector µ and variance-covariance matrix V' may be written

       X ~ N(µ, V),   X ~ MVN(µ, V),   X ~ N_p(µ, V),   or   X ~ MVN_p(µ, V).

3. The mean vector µ is sometimes called just the mean, and the variance-covariance matrix V is sometimes called the dispersion matrix, or simply the variance matrix or covariance matrix.

4. µ = EX (or equivalently, componentwise, EX_i = µ_i, i = 1, 2, ..., p). This fact should be obvious from the name 'mean vector', and can be proved in various ways, e.g. by differentiating a multivariate generalization of the MGF, or simply by symmetry.

5. V = E( (X − µ)(X − µ)^T ) = E(XX^T) − µµ^T, i.e.

       E(XX^T) − µµ^T
         = [ E(X_1²)      E(X_1X_2)   ...  E(X_1X_p) ]   [ µ_1²     µ_1µ_2   ...  µ_1µ_p ]
           [ E(X_2X_1)    E(X_2²)     ...  E(X_2X_p) ] − [ µ_2µ_1   µ_2²     ...  µ_2µ_p ]
           [     .            .       ..       .     ]   [    .        .     ..      .   ]
           [ E(X_pX_1)    E(X_pX_2)   ...  E(X_p²)   ]   [ µ_pµ_1   µ_pµ_2   ...  µ_p²   ]

         = [ E(X_1²) − µ_1²       E(X_1X_2) − µ_1µ_2   ...  E(X_1X_p) − µ_1µ_p ]
           [ E(X_2X_1) − µ_2µ_1   E(X_2²) − µ_2²       ...  E(X_2X_p) − µ_2µ_p ]
           [         .                    .            ..           .          ]
           [ E(X_pX_1) − µ_pµ_1   E(X_pX_2) − µ_pµ_2   ...  E(X_p²) − µ_p²     ]

         = V = [ v_11   v_12   ...  v_1p ]
               [ v_21   v_22   ...  v_2p ]
               [   .      .    ..     .  ]
               [ v_p1   v_p2   ...  v_pp ],   say,

   from which it follows that

       V = [ σ_1²         ρ_12 σ_1σ_2   ...  ρ_1p σ_1σ_p ]
           [ ρ_12 σ_1σ_2  σ_2²          ...  ρ_2p σ_2σ_p ]
           [      .            .        ..        .      ]
           [ ρ_1p σ_1σ_p  ρ_2p σ_2σ_p   ...  σ_p²        ],   (3.10)

   where σ_i is the standard deviation of X_i and ρ_ij is the correlation between X_i and X_j. Again these results can be proved using a multivariate generalization of the MGF.

6. The p-dimensional MVN_p(µ, V) distribution can therefore be parametrised by
       p means µ_i,
       p variances σ_i², and
       (1/2)p(p − 1) correlations ρ_ij;
   NB: a total of (1/2)p(p + 3) parameters.

7. Given n random vectors X_i = (X_i1, X_i2, ..., X_ip) ~IID MVN(µ, V), i = 1, 2, ..., n, a set of minimal sufficient statistics for the unknown parameters is given by:

       Σ_{i=1}^n X_ij,        j = 1, ..., p,
       Σ_{i=1}^n X_ij²,       j = 1, ..., p,   and
       Σ_{i=1}^n X_ij X_ik,   j = 2, ..., p,  k = 1, ..., (j − 1),   (3.11)

and MLEs for µ and V are given by:

       µ̂_j = (1/n) Σ_i X_ij,   (3.12)
       σ̂_j² = (1/n) Σ_i (X_ij − µ̂_j)²,   (3.13)
       ρ̂_jk = (1/n) Σ_i (X_ij − µ̂_j)(X_ik − µ̂_k) / ( σ̂_j σ̂_k ),   (3.14)

or, in matrix notation,

       µ̂ = (1/n) Σ_{i=1}^n X_i,   (3.15)
       V̂ = (1/n) Σ_{i=1}^n (X_i − µ̂)(X_i − µ̂)^T   (3.16)
          = (1/n) Σ_{i=1}^n X_i X_i^T − µ̂µ̂^T.   (3.17)
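In matrix notation the MLEs 3.15 and 3.16 are one line each to compute. A minimal sketch, assuming NumPy (note that np.cov divides by n − 1 by default, so bias=True is needed to reproduce the 1/n MLE):

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 500, 3
mu = np.array([1.0, 2.0, 3.0])
V = np.array([[2.0, 0.5, 0.0],
              [0.5, 1.0, 0.3],
              [0.0, 0.3, 1.5]])
X = rng.multivariate_normal(mu, V, size=n)    # n x p data matrix

mu_hat = X.mean(axis=0)                       # Equation 3.15
V_hat = (X - mu_hat).T @ (X - mu_hat) / n     # Equation 3.16
# Equivalently: np.cov(X, rowvar=False, bias=True)

print(mu_hat)
print(V_hat)
```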

8. The fact that V is positive-definite implies various (messy!) constraints on the correlations ρ_ij.

9. Surfaces of constant density form concentric (hyper-)ellipsoids (concentric hyper-spheres in the case of the standard MVN distribution). In particular, the contours of a bivariate Normal density form concentric ellipses (or concentric circles for the standard bivariate Normal).

10. It can be proved that all conditional and marginal distributions of a MVN are themselves MVN. The proof of this important fact is quite straightforward, quite tedious, and mercifully omitted from this course.



3.6 Distributions Related to the MVN

Because of the CLT, the MVN distribution is important throughout statistics. For example, the joint distribution of the MLEs θ̂_1, θ̂_2, ..., θ̂_p of unknown parameters θ_1, θ_2, ..., θ_p will under reasonable conditions tend to a MVN as the size of the sample from which θ̂ = (θ̂_1, θ̂_2, ..., θ̂_p)^T was calculated increases. Therefore various distributions arising from the MVN by transformation are also important.

Throughout this Section we shall usually denote independent standard Normal RVs by Z_i, i.e.:

    Z_i ~IID N(0, 1), i = 1, 2, ...,   i.e.   Z = (Z_1, Z_2, ..., Z_n)^T ~ MVN(0, I).

Exercise 3.4
Show that if a is a constant (n × 1) column vector, B is a constant nonsingular (n × n) matrix, and Z = (Z_1, Z_2, ..., Z_n)^T is a random n-vector with a MVN(0, I) distribution, then Y = a + BZ ~ MVN(a, BB^T).
‖
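Read in reverse, Exercise 3.4 gives the standard recipe for simulating from MVN(µ, V): factor V = BB^T (e.g. by Cholesky decomposition) and set Y = µ + BZ. A minimal sketch, assuming NumPy:

```python
import numpy as np

rng = np.random.default_rng(5)
mu = np.array([1.0, -1.0])
V = np.array([[4.0, 1.2],
              [1.2, 1.0]])

B = np.linalg.cholesky(V)               # lower-triangular B with B @ B.T == V
Z = rng.standard_normal((10**6, 2))     # rows are draws from MVN(0, I)
Y = mu + Z @ B.T                        # rows are draws from MVN(mu, V)

print(Y.mean(axis=0))                   # ~ mu
print(np.cov(Y, rowvar=False))          # ~ V
```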

3.6.1 The Chi-squared Distribution

Definition 3.4 (Chi-squared Distribution)
If Z_i ~IID N(0, 1) for i = 1, 2, ..., n, then the distribution of

    X = Z_1² + Z_2² + ··· + Z_n²

is called a Chi-squared distribution on n degrees of freedom, and we write X ~ χ²_n.

Comments
1. In particular, if Z ~ N(0, 1), then Z² ~ χ²_1.

2. The above construction of the χ²_n distribution shows that if X ~ χ²_m, Y ~ χ²_n, and X ⊥ Y, then (X + Y) ~ χ²_{m+n}. This summation property accounts for the importance and usefulness of the χ² distribution: essentially a squared length is split into two orthogonal components, as in Pythagoras' theorem.

3. If X ~ χ²_n, then the (unmemorable) density of X can be shown to be

       f_X(x) = 1/( 2^{n/2} Γ(n/2) ) x^{(n/2)−1} e^{−x/2}   for x > 0,   (3.18)

   with f_X(x) = 0 for x ≤ 0. Comparing this with the definition of a Gamma distribution (MSA) shows that a Chi-squared distribution on n degrees of freedom is just a Gamma distribution with α = n/2 and β = 1/2 (in the usual parametrisation).

4. It can be shown that if X ~ χ²_n then EX = n and VarX = 2n. Note that this implies that E[X/n] = 1 and Var[X/n] = 2/n.

5. The χ² distributions are positively skewed; for example, χ²_2 is just an exponential distribution with mean 2. However, because of the CLT, the χ²_n distribution tends (slowly!) to Normality as n → ∞.

6. The PDF 3.18 cannot be integrated analytically except for the special case n = 2. Therefore the CDFs of χ²_n distributions for various n are given in standard Statistical Tables.
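In practice a computer now does the tables' job. A short sketch (assuming SciPy) illustrating Comments 3 and 4: the χ²_n CDF agrees with that of a Gamma distribution with α = n/2 and β = 1/2 (i.e. scale 2), and the mean and variance are n and 2n:

```python
from scipy.stats import chi2, gamma

n = 5
print(chi2.cdf(11.07, df=n))               # P(X <= 11.07) for X ~ chi2_5, ~ 0.95
print(gamma.cdf(11.07, a=n / 2, scale=2))  # same value: Gamma(n/2, rate 1/2)
print(chi2.mean(df=n), chi2.var(df=n))     # n and 2n
```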



Figure 3.2: Chi-squared distributions for 1, 2, 5 & 20 d.f. Vertical lines show the 2.5%, 16%, 50%, 84% and 97.5% points (which for N(0, 1) are at −2, −1, 0, 1, 2).

3.6.2 Student’s t Distribution<br />

Definition 3.5 (t Distribution)<br />

If Z ∼ N(0, 1), Y ∼ χ 2 n and Y ⊥ Z, <strong>the</strong>n <strong>the</strong> distribution of<br />

X =<br />

Z √<br />

Y/n<br />

is called a (Student’s) t distribution on n degrees of freedom, and we write X ∼ t n .<br />

Comments<br />

1. The shape of <strong>the</strong> t distribution is like that of a Normal, but with heavier tails (since <strong>the</strong>re is variability<br />

in <strong>the</strong> denominator of t as well as in <strong>the</strong> Normally-distributed numerator Z).<br />

However, as n → ∞, <strong>the</strong> denominator becomes more and more concentrated around 1, so (loosely<br />

speaking!) ‘t n → N(0, 1) as n → ∞’.<br />

2. The (highly unmemorable) PDF of X ∼ t n can be shown to be<br />

f X (x) = Γ( (n + 1)/2 ) (<br />

√ 1 + x 2 /n ) −(n+1)/2<br />

nπ Γ(n/2)<br />

for −∞ < x < ∞. (3.19)<br />



Figure 3.3: t distributions for 1, 2, 5 & 20 d.f. Vertical lines show the 2.5%, 16%, 50%, 84% and 97.5% points.

3. The t distribution on 1 degree of freedom is also called the Cauchy distribution; note that it arises as the distribution of Z_1/Z_2 where Z_i ~IID N(0, 1). The Cauchy distribution is infamous for not having a mean. More generally, only the first n − 1 moments of the t_n distribution exist.

4. Note that if X_i ~IID N(0, σ²), then the RV

       T = X_1 / √( Σ_{i=2}^n X_i² / (n − 1) )

   has a t_{n−1} distribution, and is a measure of the length of X_1 compared to the root mean square length of the other X_i s; i.e. if X has a spherical MVN(0, σ²I) distribution, then we would expect T not to be too large. This is, in effect, how the t distribution usually arises in practice.

5. The PDF 3.19 cannot be integrated analytically in general (exception: n = 1 d.f.). The CDF must be looked up in Statistical Tables or approximated using a computer.
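For example, with SciPy (a sketch; the quantiles below are the 97.5% points, illustrating both the heavy tails and the convergence to N(0, 1)):

```python
from scipy.stats import t, norm

for n in (1, 2, 5, 20, 100):
    print(n, t.ppf(0.975, df=n))      # 12.71, 4.30, 2.57, 2.09, 1.98
print(norm.ppf(0.975))                # 1.96
```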



3.6.3 Snedecor’s F Distribution<br />

Definition 3.6 (F Distribution)<br />

If Y ∼ χ 2 m, Z ∼ χ 2 n and Y ⊥ Z, <strong>the</strong>n <strong>the</strong> distribution of<br />

X = Y/m<br />

Z/n<br />

is called an F distribution on m & n degrees of freedom, and we write X ∼ F m,n .<br />

Figure 3.4: F distributions for selected d.f. Vertical lines show <strong>the</strong> 2.5%, 16%, 50%, 84% and 97.5% points.<br />

Comments<br />

1. Note that <strong>the</strong> numerator Y/m and denominator Z/n of X both have mean 1. Therefore, provided<br />

both m and n are large, X will usually take values around 1.<br />

2. If X ∼ F m,n , <strong>the</strong>n <strong>the</strong> (extraordinarily unmemorable) density of X can be shown to be<br />

with f X (x) = 0 for x ≤ 0.<br />

f X (x) = Γ( (m + n)/2 ) m m/2 n n/2<br />

Γ(m/2) Γ(n/2)<br />

×<br />

x (m/2)−1<br />

(mx + n) (m+n)/2 for x > 0, (3.20)<br />
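A quick simulation check of Definition 3.6 against SciPy's implementation of the F distribution (a sketch assuming NumPy and SciPy):

```python
import numpy as np
from scipy.stats import chi2, f

rng = np.random.default_rng(6)
m, n = 4, 10
y = chi2.rvs(m, size=10**6, random_state=rng)    # Y ~ chi2_m
z = chi2.rvs(n, size=10**6, random_state=rng)    # Z ~ chi2_n, independent of Y
x = (y / m) / (z / n)                            # X ~ F_{m,n} by Definition 3.6

print(np.mean(x <= 1.0), f.cdf(1.0, m, n))       # empirical vs exact CDF at 1
print(np.quantile(x, 0.95), f.ppf(0.95, m, n))   # empirical vs exact 95% point
```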



3.7 Problems

1. Let Z ~ N(0, 1) & Y = Z², and let φ(·) & Φ(·) denote the PDF & CDF respectively of the standard Normal N(0, 1) distribution.
   (a) Show that F_Y(y) = Φ(√y) − Φ(−√y).
   (b) Express f_Y(y) in terms of φ(√y).
   (c) Hence show that

       f_Y(y) = (1/√(2π)) y^{−1/2} e^{−y/2}   for y > 0.

   (d) Find the MGF of Y.

2. Using Formula 3.18 for the PDF of the χ² distribution, show that if X ~ χ²_n then the MGF of X is

       M_X(t) = (1 − 2t)^{−n/2}.   (3.21)

   Deduce that if X ~ χ²_m & Y ~ χ²_n with X ⊥ Y, then (X + Y) ~ χ²_{m+n}.

3. Given Z_1, Z_2 ~IID N(0, 1), what is the probability that the point (Z_1, Z_2) lies
   (a) in the square {(z_1, z_2) | −1 < z_1 < 1 & −1 < z_2 < 1},
   (b) in the circle {(z_1, z_2) | z_1² + z_2² < 1}?

4. Let Z_1, Z_2, ... be independent random variables, each with mean 0 and variance 1, and let µ_i, σ_i and ρ_ij be constants with −1 ≤ ρ_ij ≤ 1. Let

       Y_1 = Z_1,
       Y_2 = ρ_12 Z_1 + √(1 − ρ_12²) Z_2,

   and define X_i = µ_i + σ_i Y_i, i = 1, 2.
   (a) Show that E[X_i] = µ_i, Var[X_i] = σ_i² (i = 1, 2), and that ρ_12 is the correlation between X_1 and X_2.
   (b) Find constants c_0, c_1, c_2 and c_3 such that

       Y_3 = c_0 + c_1 Z_1 + c_2 Z_2 + c_3 Z_3

   has mean 0, variance 1, and correlations ρ_13 & ρ_23 with Y_1 and Y_2 respectively.
   (c) Hence show that the random vector Z = (Z_1, Z_2, Z_3)^T with zero mean vector and identity variance-covariance matrix can be transformed to give a random vector X = (X_1, X_2, X_3)^T with specified first and second moments, subject to constraints on the correlations corr[X_i, X_j] = ρ_ij including

       ρ_12² + ρ_13² + ρ_23² ∈ [0, 1 + 2ρ_12 ρ_13 ρ_23].

   (d) What can you say about the distribution of X when Z has a standard trivariate Normal distribution and ρ_12² + ρ_13² + ρ_23² is at one of the extremes of its allowable range (i.e. 0 or 1 + 2ρ_12 ρ_13 ρ_23)?
   From Warwick ST217 exam 2001

5. Suppose that the RVs O_i have independent Poisson distributions: O_i ~ Poi(np_i), i = 1, ..., k, where Σ_{i=1}^k p_i = 1.
   (a) Find EO_i and Var O_i. Hence or otherwise show that E[O_i − np_i] = 0 and Var[O_i − np_i] = np_i.
   (b) Define the RV N by N = Σ_{i=1}^k O_i. What is the distribution of N?
   (c) Define the RVs E_i = Np_i, i = 1, ..., k. Show that EE_i = np_i and VarE_i = np_i².
   (d) By writing E[O_1 E_1] = p_1 ( E[O_1²] + E[O_1 Σ_{i=2}^k O_i] ), or otherwise, show that Cov(O_1, E_1) = np_1².
   (e) Deduce that the RV (O_i − E_i) has mean 0 and variance np_i(1 − p_i) for i = 1, ..., k.



6. Suppose that Y_i are independent RVs with Poisson distributions: Y_i ~ Poi(λ_i), i = 1, ..., k.
   (a) Assuming that λ_i is large, what is the approximate distribution of Z_i = (Y_i − λ_i)/√λ_i?
   (b) Hence or otherwise show that if all the λ_i s are large, then the RV X = Σ_{i=1}^k (Y_i − λ_i)²/λ_i has approximately a χ²_k distribution.

7. Let Z = (Z_1, Z_2, ..., Z_{m+n})^T ~ MVN_{m+n}(0, I).
   (a) Describe the distribution of Y = Z / √( Σ_1^{m+n} Z_i² ).
   (b) Show that the RV X = ( n Σ_1^m Y_i² ) / ( m Σ_{m+1}^{m+n} Y_i² ) has an F_{m,n} distribution.
   (c) Hence show that if Y = (Y_1, Y_2, ..., Y_{m+n})^T has any continuous spherically symmetric distribution centred at the origin, then X = ( n Σ_1^m Y_i² ) / ( m Σ_{m+1}^{m+n} Y_i² ) has an F_{m,n} distribution.

8. (a) Define a multivariate standard Normal distribution N(0, I), where I denotes the identity matrix. Given Z = (Z_1, Z_2, ..., Z_n)^T ~ N(0, I), write down functions of Z (i.e. transformed random variables) having
      i. a chi-squared distribution on (n − 1) degrees of freedom, and
      ii. a t distribution on (n − 1) degrees of freedom.
   (b) Let Z = (Z_1, Z_2, ..., Z_n)^T have a multivariate standard Normal distribution, and let Z̄ = Σ_{i=1}^n Z_i / n. Also let A = (a_ij) be an n × n orthogonal matrix, i.e. AA^T = A^T A = I, and define the random vector Y = (Y_1, Y_2, ..., Y_n)^T by Y = AZ.
      Quoting any properties of probability distributions that you require, show the following:
      i. Σ_{i=1}^n Y_i² = Σ_{i=1}^n Z_i².
      ii. Y ~ N(0, I).
      iii. For suitable choices of k_i, i = 1, ..., n (where k_i > 0 for all i), the following matrix A is orthogonal; find the k_i:

          A = [ k_1       −k_1       0         ...   0              0             ]
              [ k_2       k_2        −2k_2     ...   0              0             ]
              [  .         .           .       ..    .              .             ]
              [ k_{n−2}   k_{n−2}    k_{n−2}   ...   −(n−2)k_{n−2}  0             ]
              [ k_{n−1}   k_{n−1}    k_{n−1}   ...   k_{n−1}        −(n−1)k_{n−1} ]
              [ k_n       k_n        k_n       ...   k_n            k_n           ]

      iv. With the above definition of A, show that Σ_{i=1}^{n−1} Y_i² = Σ_{i=1}^n (Z_i − Z̄)² and that Y_n = √n Z̄.
      v. Hence show that the RVs Z̄ and Σ_{i=1}^n (Z_i − Z̄)² are independent and have N(0, 1/n) and χ²_{n−1} distributions respectively.
      vi. Suppose that X_1, X_2, ..., X_n ~IID N(µ, σ²), and define the random variables

          X̄ = Σ_{i=1}^n X_i / n,
          S² = (1/(n − 1)) Σ_{i=1}^n (X_i − X̄)²,
          T = (X̄ − µ) / √(S²/n).

          Show that T has a t distribution on n − 1 degrees of freedom.
   From Warwick ST217 exam 2003



9. Suppose that X has a χ²_n distribution with PDF given by Formula 3.18. Find the mean, mode & variance of X, and an approximate variance-stabilising transformation.

10. Let z(m, n, P) denote the P% point of the F_{m,n} distribution. Without looking in statistical tables, what can you say about the relationships between the following values?
    (a) z(2, 2, 50) and z(20, 20, 50),   (b) z(2, 20, 50) and z(20, 2, 50),
    (c) z(2, 20, 16) and z(20, 2, 84),   (d) z(20, 20, 2.5) and z(20, 20, 97.5).

11. Let the random variables X and Y have finite first and second moments.
    (a) By considering the quadratic (tX − Y)², or otherwise, prove the Cauchy-Schwarz inequality:

        ( E(XY) )² ≤ E(X²) E(Y²),

        with equality if and only if P(Y = cX) = 1 for some constant c.
    (b) Hence or otherwise prove the following:
        i. Var(X + Y) ≤ ( √Var(X) + √Var(Y) )².
           Under what conditions does equality hold?
        ii. If the random variable Z has a finite variance, then E(|Z|) is also finite.
            Show that the converse is untrue, i.e. give an example of a probability distribution D such that if Z ~ D then E(|Z|) is finite but the variance of Z does not exist.
    From Warwick ST217 exam 2002

12. Suppose that Z_i ~IID N(0, 1), i = 1, 2, .... What is the distribution of each of the following RVs?

    (a) X_1 = Z_1 + Z_2 − Z_3
    (b) X_2 = (Z_1 + Z_2) / (Z_1 − Z_2)
    (c) X_3 = (Z_1 − Z_2)² / (Z_1 + Z_2)²
    (d) X_4 = ( (Z_1 + Z_2)² + (Z_1 − Z_2)² ) / 2
    (e) X_5 = 2Z_1 / √( Z_2² + Z_3² + Z_4² + Z_5² )
    (f) X_6 = (Z_1 + Z_2 + Z_3) / √( Z_4² + Z_5² + Z_6² )
    (g) X_7 = 3(Z_1 + Z_2 + Z_3 + Z_4)² / ( (Z_1 + Z_2 − Z_3 − Z_4)² + (Z_1 − Z_2 + Z_3 − Z_4)² + (Z_1 − Z_2 − Z_3 + Z_4)² )
    (h) X_8 = 2Z_1² + (Z_2 + Z_3)²

13. For each of the RVs X_i defined in the previous question, use Statistical Tables to find c_i (i = 1, ..., 8) such that P(X_i > c_i) = 0.95.

14. Show that the PDFs of the t and F distributions (Definitions 3.5 & 3.6) are indeed given by Formulae 3.19 & 3.20.

15. (a) Define the Standard Multivariate Normal distribution MVN(0, I).
    (b) Given Z = (Z_1, Z_2, ..., Z_{m+n})^T ~ MVN(0, I), write down transformed random variables X(Z), T(Z) and Y(Z) with the following distributions:
        i. X ~ χ²_n,
        ii. T ~ t_n,
        iii. Y ~ F_{m,n}.
    (c) Given that the PDF of X ~ χ²_n is

        f_X(x) = 1/( 2^{n/2} Γ(n/2) ) x^{(n/2)−1} e^{−x/2}   for x > 0,

        and f_X(x) = 0 elsewhere, show that
        i. E[X] = n,
        ii. E[X²] = n² + 2n, and
        iii. E[1/X] = 1/(n − 2) (provided n > 2).
    (d) Hence or otherwise find
        i. the variance σ_X² of X ~ χ²_n,
        ii. the mean µ_Y of Y ~ F_{m,n}, and
        iii. the mean µ_T and variance σ_T² of T ~ t_n,
        stating under what conditions σ_X², µ_Y, µ_T and σ_T² exist.
    From Warwick ST217 exam 1998

Theory is often just practice with the hard bits left out.
J. M. Robson

Get a bunch of those 3-D glasses and wear them at the same time. Use enough to get it up to a good, say, 10- or 12-D.
Rod Schmidt

The Normal ... is the Ordinary made beautiful; it is also the Average made lethal.
Peter Shaffer

Symmetry, as wide or as narrow as you define its meaning, is one idea by which man through the ages has tried to comprehend and create order, beauty and perfection.
Hermann Weyl



Chapter 4

Inference for Multiparameter Models

4.1 Introduction: General Concepts

4.1.1 Modelling

Given a random vector X = (X_1, X_2, ..., X_p), we can describe the joint distribution of the X_i s by the CDF F_X(x) or, usually more conveniently, by the PMF or PDF f_X(x).

Interrelationships between the X_i s can be described using
1. marginals F_i(x_i), f_i(x_i), F_ij(x_i, x_j), etc.,
2. conditionals G_i(x_i | x_j, j ≠ i), g_i(x_i | x_j, j ≠ i), G_ij(x_i, x_j | x_k, k ≠ i, j), etc.,
3. conditional expectations E[X_i | X_j], Var[X_i | X_j], etc.

Often F_X(x) is assumed to lie in a family of probability distributions:

    F = { F(x | θ) | θ ∈ Ω_Θ }   (4.1)

where Ω_Θ is the 'parameter space'.

The process of formulating, choosing within, & checking the reasonableness of, such families F, is called statistical modelling (or probability modelling, or just modelling).

Exercise 4.1
The data-set in Table 4.1, plotted in Figure 1.1, shows patients' blood pressures before and after treatment. Suggest some reasonable models for the data.
‖

4.1.2 Data

In practice, we typically have a set of data in which d variables are measured on each of n 'cases' (or 'individuals' or 'units'):

    D =            var.1   var.2   ···   var.d
         case.1  [ x_11    x_12    ···   x_1d ]
         case.2  [ x_21    x_22    ···   x_2d ]
           .     [   .       .      ..     .  ]
         case.n  [ x_n1    x_n2    ···   x_nd ]   (4.2)



    Patient          Systolic                    Diastolic
    Number    before   after   change     before   after   change
       1       210      201      -9        130      125      -5
       2       169      165      -4        122      121      -1
       3       187      166     -21        124      121      -3
       4       160      157      -3        104      106       2
       5       167      147     -20        112      101     -11
       6       176      145     -31        101       85     -16
       7       185      168     -17        121       98     -23
       8       206      180     -26        124      105     -19
       9       173      147     -26        115      103     -12
      10       146      136     -10        102       98      -4
      11       174      151     -23         98       90      -8
      12       201      168     -33        119       98     -21
      13       198      179     -19        106      110       4
      14       148      129     -19        107      103      -4
      15       154      131     -23        100       82     -18

Table 4.1: Supine systolic and diastolic blood pressures of 15 patients with moderate hypertension (high blood pressure), immediately before and 2 hours after taking 25mg of the drug captopril.
Data from HSDS, set 72

Definition 4.1 (Data Matrix)
A set of data D arranged in the form of 4.2 is called a data matrix or a cases-by-variables array.

The data-set D is assumed to be a representative sample (of size n) from an underlying population of potential cases. This population may be actual, e.g. the resident population of England & Wales at noon on June 30th 1993, or purely theoretical/hypothetical, e.g. MVN(µ, V).

Exercise 4.2
Table 4.2 presents data on ten asthmatic subjects, each tested with 4 drugs. Describe various ways that the data might be set out as a data matrix for analysis by a statistical computing package.
‖

                           Patient number
    Drug   Time        1     2     3     4     5     6     7     8     9    10
    P      −5 mins    0.0   2.3   2.4   1.9   1.6   4.8   0.6   2.7   0.9   1.3
           +15 mins   3.8   9.2   5.4   3.3   4.2  15.1   1.3   6.7   4.2   3.1
    C      −5 mins    0.5   1.0   2.0   1.1   2.1   6.8   0.6   3.1   1.5   3.0
           +15 mins   2.0   5.3   7.5   6.4   4.1   9.1   0.6  14.8   2.4   2.3
    D      −5 mins    0.8   2.3   0.8   0.8   1.2   9.6   1.1   9.7   0.8   4.9
           +15 mins   2.4   4.8   2.4   1.9   1.2  12.5   1.7  12.5   4.3   8.1
    K      −5 mins    0.2   1.7   2.2   0.1   1.7   9.2   0.6  12.7   1.1   2.8
           +15 mins   0.4   3.4   2.0   1.3   3.4   6.7   1.1  12.5   2.7   5.7

Table 4.2: NCF (Neutrophil Chemotactic Factor) of ten individuals, each tested with 4 drugs: P (Placebo), C (Clemastine), D (DSCG), K (Ketotifen). On a given day, an individual was administered the chosen drug, and his NCF measured 5 minutes before, and 15 minutes after, being given a 'challenge' of allergen.
Data from Dr. R. Morgan of Bart's Hospital

4.1.3 Statistical Inference

Statistical inference is the art/science of using the sample to learn about the population (and hence, implicitly, about future samples). Typically we use statistics (properties of the sample) to learn about parameters (properties of the population).

This activity might be:
1. part of analysing a formal probability model, e.g. calculating the MLEs θ̂ of θ, after making an assumption as in Expression 4.1, or
2. purely to summarise the data as a part of 'data analysis' (Section 4.2).

For example, given X_1, X_2, ..., X_n ~IID F_X (unknown), the statistics

    S_1 = (1/n) Σ X_i = X̄,
    S_2 = (1/n) Σ (X_i − X̄)²,
    S_3 = (1/n) Σ (X_i − X̄)³




provide measures of location, scale and skewness. Note that here we're implicitly estimating the corresponding population quantities

    µ_X = EX,   E[ (X − µ_X)² ],   E[ (X − µ_X)³ ],

and using these as measures of population location, scale and skewness. Without a formal probability model, it can be hard to judge whether these or some other measures may be most appropriate.

In both cases, the CLT & its generalisations (to higher dimensions and to 'near-independence') show that, under reasonable conditions, the joint distribution of the statistics of interest, such as θ̂ or (S_1, S_2, S_3), is approximately MVN. This approximation improves if
1. the sample size n → ∞, and/or
2. the joint distribution of the random variables being summed (e.g. the original random vectors X_1, X_2, ..., X_n) is itself close to MVN.

QUESTIONS: How should we interpret this? How should we try to link probability models to reality?

4.2 Data Analysis

Data analysis is the art of summarising data while attempting to avoid probability theory.

For example, you can calculate summary statistics such as means, medians, modes, ranges, standard deviations etc., thus summarising in a few numbers the main features of a possibly huge data-set. For example, the (0%, 25%, 50%, 75%, 100%) points of the data distribution (i.e. minimum, lower quartile, median, upper quartile and maximum) form the five-number summary, and the inter-quartile range (IQR = upper quartile − lower quartile) is a measure of spread, containing the 'middle 50%' of the data.

These summaries can be formalised as follows.

Definition 4.2 (Order statistics)
Given RVs X_1, X_2, ..., X_n, one can order them and denote the smallest of the X_i s by X_(1), the second smallest by X_(2), etc. Then X_(k) is called the kth order statistic.



Thus X_(1), X_(2), ..., X_(n) are a permutation of X_1, X_2, ..., X_n, and x_(n), the observed value of X_(n), denotes the largest observed value in a sample of size n.

Given ordered data x_(1) ≤ x_(2) ≤ ··· ≤ x_(n), one can define:

Definition 4.3 (Sample median)

    x_M = x_((n+1)/2)                         if n is odd,
        = (1/2)[ x_(n/2) + x_(n/2 + 1) ]      if n is even.

We can always write x_M = x_(n/2 + 1/2), provided we adopt the following convention:
1. If the number in brackets is exactly half-way between two integers, then take the average of the two corresponding order statistics.
2. Otherwise round the bracketed subscript to the nearest integer, and take the corresponding order statistic.

Similarly the quartiles etc. can be formally defined as follows:

Definition 4.4 (Sample lower quartile)

    x_L = x_(n/4 + 1/2).

Definition 4.5 (Sample upper quartile)

    x_U = x_(3n/4 + 1/2).

Definition 4.6 (100p-th sample percentile)

    x_100p% = x_(np + 1/2).

Definition 4.7 (Five number summary)

    ( x_(1), x_L, x_M, x_U, x_(n) ).
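A minimal sketch (assuming NumPy) of the five-number summary for the systolic 'change' column of Table 4.1. Note that NumPy's default percentile rule interpolates between order statistics, so its quartiles can differ slightly from the rounding convention adopted above:

```python
import numpy as np

# Systolic change (after - before) for the 15 patients in Table 4.1.
change = np.array([-9, -4, -21, -3, -20, -31, -17, -26, -26,
                   -10, -23, -33, -19, -19, -23])

x = np.sort(change)                 # order statistics x_(1) <= ... <= x_(n)
five = (x[0], np.percentile(x, 25), np.median(x), np.percentile(x, 75), x[-1])
print(five)
iqr = five[3] - five[1]             # inter-quartile range
print(iqr)
```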

4.3 Classical Inference

4.3.1 Introduction

In 'classical statistical inference', the typical procedure is:
1. Choose a family F of models indexed by θ (Formula 4.1).
2. Assume temporarily that the true distribution lies in F, i.e. data D ~ F(d | θ) for some true but unknown parameter vector θ ∈ Ω_Θ.
3. Compare possible models according to some criterion of compatibility between the model & the data (equivalently, between the population & the sample).
4. Assess the chosen model(s), and go back to step (1) or (2) if the model proves inadequate.



Comments
1. Step 1 is a compromise between
   (a) what we believe is the true underlying mechanism that produced the data, and
   (b) what we can do mathematically.
   If in doubt, keep it simple.
2. Step 2, by assuming a true θ exists, implicitly interprets probability as a property of Nature, e.g. a 'fair' coin is assumed to have an intrinsic property: if you toss it n times, then the proportion of 'heads' tends to 1/2 as n → ∞. Thus probability represents a 'long-run relative frequency'.
3. Most statistical computer packages currently use the classical approach, and we'll mainly be using classical inference in MSB.
4. There are many possible criteria at step 3. For example, hypothesis-testing and likelihood approaches are both discussed briefly below.

4.3.2 Point Estimation (Univariate)

Given RVs X = (X_1, X_2, ..., X_n), a point estimator for an unknown parameter Θ ∈ Ω_Θ is simply a function Θ̂(X) taking values in the parameter space Ω_Θ. Once data X = x are obtained, one can calculate the corresponding point estimate θ̂ = Θ̂(x).

There are many plausible criteria for Θ̂ to be considered a 'good' estimator of Θ. For example:

1. Mean Squared Error
   One would like the mean squared error (MSE) of Θ̂ to be small whatever the true value θ of Θ, where

       MSE(Θ̂) = E[ (Θ̂ − θ)² ].   (4.3)

   In particular, an estimator Θ̂ has minimum mean squared error if

       MSE(Θ̂) = min_{Θ̂'} MSE(Θ̂').

2. Unbiasedness

   Definition 4.8 (Bias)
   The bias of an estimator Θ̂ is

       Bias(Θ̂) = E[Θ̂ − θ | Θ = θ].   (4.4)

   Exercise 4.3
   Show that MSE(Θ̂) = Var(Θ̂) + (Bias Θ̂)².
   ‖

   Definition 4.9 (Unbiasedness)
   An estimator Θ̂ for a parameter Θ is called unbiased if E[Θ̂ | Θ = θ] = θ for all possible true values θ of Θ.



Example
Given a random sample X_1, X_2, ..., X_n, i.e. X_i ~IID F_X(x), where F_X is a member of some family F of probability distributions,
(a) X̄ = Σ_{i=1}^n X_i / n is an unbiased estimate of the mean µ_X = EX of F_X.
(b) More generally, any statistic of the form Σ_{i=1}^n w_i X_i, where Σ_{i=1}^n w_i = 1, is an unbiased estimate of µ_X.
(c) σ̂_1² = Σ_{i=1}^n (X_i − X̄)² / (n − 1) is an unbiased estimate of the variance σ_X² of F_X, but
(d) σ̂_2² = Σ_{i=1}^n (X_i − X̄)² / n is NOT an unbiased estimate of the variance σ_X² of F_X.
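Claims (c) and (d) are easy to see by simulation: average each estimator over many repeated samples and compare with the true variance. A minimal sketch, assuming NumPy:

```python
import numpy as np

rng = np.random.default_rng(7)
n, reps, sigma2 = 5, 10**5, 4.0           # small samples make the bias visible

X = rng.normal(0.0, np.sqrt(sigma2), size=(reps, n))
s2_unbiased = X.var(axis=1, ddof=1)       # divides by n - 1
s2_biased = X.var(axis=1, ddof=0)         # divides by n

print(s2_unbiased.mean())                 # ~ 4.0
print(s2_biased.mean())                   # ~ 4.0 * (n-1)/n = 3.2
```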

3. Efficiency & Minimum Variance Unbiased Estimation
   Given two unbiased estimators Θ̂_1 & Θ̂_2 for a parameter Θ, the efficiency of Θ̂_1 relative to Θ̂_2 is defined by

       Eff(Θ̂_1, Θ̂_2) = Var(Θ̂_1) / Var(Θ̂_2).   (4.5)

   Definition 4.10 (MVUE)
   The Minimum Variance Unbiased Estimator of a parameter Θ is the unbiased estimator Θ̂, out of all possible unbiased estimators, that has minimum variance.

   Example
   Given X_i ~IID F_X(x) ∈ F, the family of all probability distributions with finite mean & variance, it can be shown that
   (a) X̄ is the MVUE of the mean µ_X = EX of F_X, and
   (b) Σ_{i=1}^n (X_i − X̄)² / (n − 1) is the MVUE of the variance σ_X² of F_X.

   Note that there are major problems with using MVUE as a criterion for estimation:
   (a) The MVUE may not exist (e.g. in general there is no unbiased estimator for the underlying standard deviation σ_X of X).
   (b) The MVUE may exist but be nonsensical (see Problems).
   (c) Even if the MVUE exists and appears reasonable, other (biased) estimators may be better by other criteria, for example by having smaller mean squared error, which is much more important in practice than being unbiased.

4. Consistency

   Definition 4.11 (Consistency)
   A sequence of estimators Θ̂_1, Θ̂_2, ... is consistent for Θ ∈ Ω_Θ if, for all ε > 0 and for all θ ∈ Ω_Θ,

       lim_{n→∞} P( |Θ̂_n − θ| > ε | Θ = θ ) = 0.

5. Sufficiency
   Θ̂(X_1, ..., X_n) is sufficient for Θ if the conditional distribution of (X_1, ..., X_n) given Θ̂ = θ̂ & Θ = θ does not depend on θ. See MSA.

6. Maximum likelihood
   See MSA.

7. Invariance
   See Casella & Berger, page 300.



8. The 'plug-in' property
   If θ is a specified property of the CDF F(x), then θ̂ is the corresponding property of the empirical CDF

       F̂(x) = (1/n) × (number of X_i ≤ x).   (4.6)

   For example (assuming the named quantities exist):
   (a) the sample mean θ̂ = x̄ = Σ x_i / n is the plug-in estimate of the population mean θ = EX,
   (b) the sample median is the plug-in estimate of the population median F^{-1}(0.5).
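The empirical CDF 4.6 and the plug-in estimates are a few lines of NumPy. A minimal sketch, using the systolic changes from Table 4.1 as data:

```python
import numpy as np

def ecdf(data, x):
    """Empirical CDF of Equation 4.6: proportion of observations <= x."""
    data = np.asarray(data)
    return np.mean(data[:, None] <= np.asarray(x), axis=0)

data = [-9, -4, -21, -3, -20, -31, -17, -26, -26, -10, -23, -33, -19, -19, -23]
print(ecdf(data, [-25.0, -20.0, -10.0]))    # F_hat evaluated at three points
print(np.mean(data), np.median(data))       # plug-in mean and plug-in median
```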

4.3.3 Hypothesis Testing (Introduction)

In this approach you

1. Choose a statistic T that has a known distribution F₀(t) if the true parameter value is θ = θ₀ (for some particular parameter value θ₀ of interest). The statistic T should provide a measure of the discrepancy of the data D from what would be reasonable if θ = θ₀.

2. Test the hypothesis ‘θ = θ₀’ using the tail probabilities of F₀.

An example is the ‘Chi-squared’ statistic used in MSA. Hypothesis testing will be covered in more detail in Chapter 5.

Some problems with the standard hypothesis testing approach are:

1. In practice, we don’t really believe that θ = θ₀ is ‘true’ and all other possible values of θ are ‘false’; instead we just wish to adopt ‘θ = θ₀’ as a convenient assumption, because it’s as good as, and simpler than, other models.

2. If we really do want to make a decision [e.g. to give drug ‘A’ or drug ‘B’ to a particular patient], then we should weigh up the possible consequences.

3. It’s hard to create appropriate hypothesis tests in complex situations, such as to test whether or not θ lies in a particular subset Ω₀ of the parameter space Ω.

Unfortunately, real life is a complex situation.

4.3.4 Likelihood Methods

Use the likelihood function

   L(θ; D) = { P(D|θ)               (discrete case)
             { (constant) × f(D|θ)  (continuous case),   (4.7)

or equivalently the log-likelihood or ‘support’ function

   l(θ; D) = log L(θ; D)                                 (4.8)

as a measure of the compatibility between data D and parameter θ.

In particular, the MLE corresponds to the particular F_θ̂ ∈ F that is most compatible with the data D.

Likelihood underlies the most useful general approaches to statistics:

1. It can handle several parameters simultaneously.

2. The CLT implies that in many cases the log-likelihood will be approximately quadratic in θ (at least near the MLE). This makes both theory and numerical computation easier; a small numerical sketch follows.
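As a minimal illustration of this quadratic behaviour (my own sketch, using a Poisson sample where the MLE is λ̂ = x̄ and l″(λ̂) = −n/λ̂), compare the exact log-likelihood near the MLE with its second-order Taylor expansion:

   import numpy as np
   from scipy.special import gammaln

   x = np.array([2, 4, 3, 1, 5, 3, 2, 4])   # a small Poisson-like sample
   n, lam_hat = len(x), x.mean()             # MLE of lambda is the sample mean

   def loglik(lam):
       # Poisson log-likelihood l(lambda; x), in the sense of Equation (4.8)
       return x.sum() * np.log(lam) - n * lam - gammaln(x + 1).sum()

   def quad_approx(lam):
       # second-order Taylor expansion about the MLE
       return loglik(lam_hat) - 0.5 * (n / lam_hat) * (lam - lam_hat) ** 2

   for lam in [2.5, 3.0, 3.5]:
       print(lam, loglik(lam), quad_approx(lam))   # the two agree closely near the MLE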



However, there are difficulties with basing inference solely on likelihood:

1. How should we handle ‘nuisance parameters’ (i.e. components θ_i that we’re not interested in)? Note that it makes no sense to integrate over values of θ_i to get a ‘marginal likelihood’ for the other θ_j s, since L(θ; d) is NOT a probability density or probability function—we would get a different marginal likelihood if we reparametrised, say by θ_i ↦ log θ_i.

2. A more fundamental problem is that likelihood takes no account of how far-fetched the model might be (‘high likelihood’ does NOT mean ‘likely’!)

This suggests that in practice we may wish to incorporate information not contained in the likelihood:

1. Prior information/Expert opinion: Are there external reasons for doubting some values of θ more than others?

2. For decision-making: How relatively important are the possible consequences of our inferences? [e.g. an innocent person is punished / a murderer walks free].

4.4 Problems

1. How might the mortality data in Tables 1.1 and 1.2 (pages 8 & 9) be set out as a data matrix?

2. Suppose that θ̂(X₁, …, Xₙ) is unbiased. Show that θ̂ is consistent iff lim_{n→∞} Var(θ̂(X₁, …, Xₙ)) = 0.

3. Given X_i ∼ F_X(x) (IID), where F_X is a member of some family F of probability distributions, show that

   (a) Any statistic of the form ∑_{i=1}^n w_i X_i, where ∑_{i=1}^n w_i = 1, is an unbiased estimate of µ_X = EX,

   (b) The mean X̄ = ∑_{i=1}^n X_i/n is the unique UMVUE of this form,

   (c) σ̂² = ∑_{i=1}^n (X_i − X̄)²/(n − 1) is an unbiased estimate of the variance σ²_X of F_X.

4. The numbers of mistakes made in each lecture by a certain lecturer follow independent Poisson distributions, each with mean λ > 0.

   You decide to attend the Monday lecture, note the number of mistakes X, and use X to estimate the probability p that there will be no mistakes in the remaining two lectures that week.

   (a) Show that p = exp(−2λ).

   (b) Show that the only unbiased estimator of p (and hence, trivially, the MVUE), is

          p̂ = { 1    if X is even,
              { −1   if X is odd.

   (c) What is the maximum likelihood estimator of p?

   (d) Discuss (briefly) the relative merits of the MLE and the MVUE in this case.

5. Let T be an unbiased estimator for g(θ), let S be a sufficient statistic for θ, and let φ(S) = E[T|S]. Prove the Rao–Blackwell theorem:

      φ(S) is also an unbiased estimator of g(θ), and Var[φ(S)|θ] ≤ Var[T|θ], for all θ,

   and interpret this result.



6. (a) Explain what is meant by an unbiased estimator for an unknown parameter θ.

   (b) Show, using moment generating functions or otherwise, that if X₁ & X₂ have independent Poisson distributions with means λ₁ & λ₂ respectively, then their sum (X₁ + X₂) follows a Poisson distribution with mean (λ₁ + λ₂).

   (c) A particular sports game comprises four ‘quarters’, each lasting 15 minutes, and a statistician attending the game wishes to predict the probability p that no further goals will be scored before full time.

       The statistician assumes that the numbers X_k of goals scored in the kth quarter follow independent Poisson distributions, each with (unknown) mean λ, so that

          P(X_k = x) = (λ^x / x!) e^{−λ}   (k = 1, 2, 3, 4;  x = 0, 1, 2, …).

       Suppose that the statistician makes his prediction halfway through the match (i.e. after observing X₁ = x₁ & X₂ = x₂). Show that an unbiased estimator of p is

          T = { 1   if (x₁ + x₂) = 0,
              { 0   otherwise.

   (d) Suppose the statistician instead made a prediction after 15 minutes. Show that in this case the ONLY unbiased estimator of p given X₁ = x₁ is

          T = { 2^{x₁}    if x₁ is even,
              { −2^{x₁}   if x₁ is odd.

   (e) What are the maximum likelihood estimators of p after 15 and after 30 minutes?

   (f) Briefly compare the advantages of maximum likelihood and unbiased estimation for this situation.

   From Warwick ST217 exam 1997

7. (a) Show that for any random variables X and Y,

       i. E[Y] = E[E[Y|X]],
       ii. Var[Y] = E[Var[Y|X]] + Var[E[Y|X]].

   (b) Explain what is meant by a minimum variance unbiased estimator (MVUE).

   (c) Let S be a sufficient statistic for a parameter θ, let T be an unbiased estimator for φ(θ), and define W = E[T|S]. Show that

       i. W is an unbiased estimator for φ(θ), and
       ii. Var[W] ≤ Var[T] for all θ.

       Deduce that a MVUE, if one exists, must be a function of a sufficient statistic.

   (d) Let X₁, X₂, …, Xₙ be IID random variables with a geometric distribution, i.e.

          P(X_i = x) = θ(1 − θ)^{x−1},   x = 1, 2, … .

       i. Show that S = ∑_{i=1}^n X_i is a sufficient statistic for θ, and find the maximum likelihood estimator θ̂ of θ.
       ii. Define T by

              T = { 1   if X₁ = 1,
                  { 0   otherwise.

          What is E[T]?
       iii. Find E[T|S], and briefly discuss its advantages & disadvantages compared to the maximum likelihood estimator θ̂.

   From Warwick ST217 exam 2001



8. (a) For random variables X (continuous) and Y (discrete), define

       i. the joint cumulative distribution function (CDF) F_{X,Y}(x, y),
       ii. the conditional probability mass function (PMF) f_{Y|X}(y|x) of Y given X = x, and
       iii. the conditional expectation E[Y|X] of Y given X.

   (b) Show that E[g(Y)] = E[E[g(Y)|X]], for any function g(·) such that the expectations exist.

   (c) People arrive randomly at a public phone box at a rate of one per minute, and the length Λ of a phone call has an exponential distribution with mean µ minutes. Thus the number Y of people arriving at the phone box during a call has PMF

          f_{Y|Λ}(y|λ) = (λ^y e^{−λ})/y!,   (y = 0, 1, 2, …),

       where Λ has PDF

          f_Λ(λ) = (1/µ) exp(−λ/µ),   (λ > 0).

       Show that the conditional distribution of Y given Λ = λ has moment generating function (MGF) M_{Y|Λ}(t) = E[e^{tY}|Λ = λ] given by

          M_{Y|Λ}(t) = exp(λ(e^t − 1)).

       Hence or otherwise find the MGF M_Y(t) = E[e^{tY}] of Y.

   (d) By differentiation of M_Y(·), or otherwise, show that Y is an unbiased estimator of µ, and find its variance.

   From Warwick ST217 exam 2000

9. (a) Explain what is meant by an unbiased estimator for an unknown parameter θ.

   (b) A statistician working for a CD manufacturer assumes that the numbers of faults on individual CDs follow independent Poisson distributions with mean θ, and hence the number of faults X on a faulty CD follows a truncated Poisson distribution, i.e.

          P(X = x) = e^{−θ} θ^x / ((1 − e^{−θ}) x!)   (x = 1, 2, …).

       Find the mean and variance of this distribution.

   (c) The statistician counts the number of faults x on a faulty CD, and is interested in φ = 1 − e^{−θ}, the probability that a random CD is faulty (so e^{−θ} is the probability that it is fault-free). Show that the only unbiased estimator T(X) of φ takes the form:

          T(x) = { 0   if x is odd,
                 { 2   if x is even.

   (d) Find an equation satisfied by the maximum likelihood estimator of θ given observations x₁, x₂, …, xₙ on n faulty CDs.

   (e) Comment briefly on the relative advantages and disadvantages of unbiased estimation and maximum likelihood estimation for φ.

   From Warwick ST217 exam 2002



10. Given X_i ∼ Poi(θ) (IID), compare the following possible estimators for θ in terms of unbiasedness, consistency, relative efficiency, etc.

       θ̂₁ = X̄ = (1/n) ∑_{k=1}^n X_k,             θ̂₆ = (θ̂₁ + θ̂₅)/2,

       θ̂₂ = (1/n) (100 + ∑_{k=1}^n X_k),          θ̂₇ = median(X₁, X₂, …, Xₙ),

       θ̂₃ = (1/2)(X₂ − X₁)²,                      θ̂₈ = mode(X₁, X₂, …, Xₙ),

       θ̂₄ = (1/n) ∑_{k=1}^n (X_k − X̄)²,           θ̂₉ = (2/(n(n+1))) ∑_{k=1}^n k X_k,

       θ̂₅ = (1/(n−1)) ∑_{k=1}^n (X_k − X̄)²,       θ̂₁₀ = (1/(n−1)) ∑_{k=2}^n X_k.

11. [Light relief] Discuss the following possible defence submission at a murder trial:

    ‘The supposed DNA match placing the defendant at the scene of the crime would have arisen with even higher probability if the defendant had a secret identical twin
    [the more people with that DNA, the more chances of getting a match at the crime scene].

    ‘Now assume that my client has been cloned θ times, θ ∈ {0, 1, …, n} for some n > 0. Clearly the larger the value of θ, the higher the probability of obtaining the observed DNA results
    [every increase in θ means another clone who might have been at the scene of the crime].

    ‘Therefore the MLE of θ is n.

    ‘But then, even assuming somebody with my client’s DNA committed this terrible crime, the probability that it was my client is only 1/(n + 1) (under reasonable assumptions).

    ‘Therefore you cannot say that my client is, beyond a reasonable doubt, guilty.

    ‘The defence rests.’

4.5 Bayesian Inference

4.5.1 Introduction

Classical inference regards probability as a property of physical objects (e.g. a ‘fair coin’).

An alternative interpretation uses probability to represent an individual’s (lack of) understanding of an uncertain situation.

Examples

1. ‘I have no reason to suspect that “heads” or “tails” are more likely. Therefore, by symmetry, my current probability for this particular coin’s coming down “heads” is 1/2.’

2. ‘I doubt the accused has any previously-unknown identical siblings. I’d bet 100,000 to 1 against’ (i.e. if θ is the number of identical siblings, then my probability for θ > 0 is 1/100001).

Different people, with different knowledge, can legitimately have different probabilities for real-world events (therefore it’s good discipline to say ‘my probability for…’ rather than ‘the probability of…’).

As you learn, your probabilities can be continually updated using Bayes’ theorem, i.e.

   P(A|B) = P(B|A) × P(A) / P(B)                    (4.9)

(assuming P(B) is positive, and using the fact that P(A & B) = P(A|B)P(B) = P(B|A)P(A)).



The Bayesian approach to statistical inference treats all uncertainty via probability, as follows:

1. You have a probability model for the data, with PMF p(D|Θ).

2. Your prior PMF for Θ (i.e. your PMF for Θ based on a combination of expert opinion, previous experience, and your own prejudice), is p(θ).

3. Then Bayes’ theorem says

      p(θ|D) = p(D|θ) p(θ) / p(D)

   or, since once the data have been obtained p(D) is a constant,

      p(θ|D) ∝ p(D|θ) p(θ)
             ∝ L(θ; D) p(θ)                          (4.10)

   i.e. ‘posterior probability’ ∝ ‘likelihood’ × ‘prior’.

Formula 4.10 also applies in the continuous case, in which case p(·) represents a PDF; a small sketch of a discrete update follows.
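As a minimal numerical sketch of Equation (4.10) for a discrete parameter (my own illustration): updating a prior over a coin’s ‘heads’ probability after observing 7 heads in 10 tosses.

   import numpy as np
   from scipy.stats import binom

   theta = np.array([0.25, 0.50, 0.75])   # possible values of the parameter
   prior = np.array([0.25, 0.50, 0.25])   # prior PMF p(theta)

   heads, tosses = 7, 10
   likelihood = binom.pmf(heads, tosses, theta)   # L(theta; D) = P(D | theta)

   posterior = likelihood * prior                 # Equation (4.10), unnormalised
   posterior /= posterior.sum()                   # dividing by p(D) normalises it

   print(dict(zip(theta, posterior.round(3))))    # mass shifts towards theta = 0.75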

Comments

1. Further applications to decision theory are given in the third year course ST301.

2. Note that if θ = (θ₁, θ₂, …, θ_p), then p(θ|D) is a p-dimensional function, and may prove difficult to manipulate, summarise or visualise.

3. Treating all uncertainty via probability has the advantage that one-off events (e.g. management decisions, or the results of horse races) can be handled. However, it’s not at all obvious that all uncertainty can be treated via probability!

4. As with Classical inference, a Bayesian analysis of a problem should involve checking whether the assumptions underlying p(D|θ) and p(θ) are reasonable, and rethinking & reanalysing the model if necessary.

Exercise 4.4
Describe the Bayesian approach to statistical inference, denoting the data by x, the prior by f_Θ(θ), and the likelihood by L(θ; x) = f_{X|Θ}(x|θ).
‖

4.6 Nonparametric Methods

Standard Classical and Bayesian methods make strong assumptions, e.g. X_i ∼ F(x|θ) (IID) for some θ ∈ Ω.

Assumptions of independence are critical (what aspects of the problem provide information about other aspects?).

Assumptions about the form of probability distributions are often less important, at least provided the sample size n is large. However, there are exceptions to this:

1. It might be that the probability distribution encountered in practice is fundamentally different from the form assumed in our model. For example, some probability distributions are so ‘heavy-tailed’ that their means don’t exist (e.g. the Cauchy distribution with f(x) = 1/(π(1 + x²)), x ∈ ℝ).

2. Some data may be recorded incorrectly, or there may be a few atypically large/small data values (‘outliers’), etc.

3. In any case, what if n is small and the CLT can’t be invoked?



‘Nonparametric’ methods don’t assume that the actual probability distribution F(·|θ) lies in a particular parametric family F; instead they make more general assumptions, for example

1. ‘F(x) is symmetric about some unknown value Θ’.

   Note that this may be a reasonable assumption even if EX doesn’t exist.

   Θ is the (unknown) median of the population, i.e. P(X < Θ) = P(X > Θ).

   Therefore one could estimate Θ by the median of the data (though better methods may exist).

2. ‘F(x, y) is such that if (X₁, Y₁), (X₂, Y₂) ∼ F (IID), then P(Y₁ < Y₂ | X₁ < X₂) = 1/2’.

   This is a nonparametric version of the statement ‘X & Y are uncorrelated’.

Many statistical methods involve estimating means, as we’ll see in the rest of the course (t-tests, linear regression, many MLEs etc.)

Corresponding nonparametric methods typically involve medians—or equivalently, various probabilities.

Exercise 4.5
Suppose that X has a continuous distribution. Show that a test of the statement ‘the median of X is θ₀’ is equivalent to a test of the statement ‘P(X < θ₀) = 1/2’.
If the X_i are IID, what is the distribution of R = (number of X_i < θ_T), where θ_T is the true value of θ?
‖

Other nonparametric methods involve ranking the data X_i: replacing the smallest X_i by 1, the next smallest by 2, etc. Classical statistical methods can then be applied to the ranks. Note that the effect of outliers will be reduced.

Example

Given data (X_i, Y_i), i = 1, …, n from a continuous bivariate distribution, ‘Spearman’s rank correlation’ (often written ρ_S) can be calculated as follows (see the sketch after this list):

1. replace the X_i values by their ranks R_i,

2. similarly replace the Y_i values by their ranks S_i,

3. calculate the usual (‘product-moment’ or ‘Pearson’s’) correlation between the R_i s and S_i s.
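A minimal Python sketch of this recipe (my own illustration, assuming no tied values, so simple ranks suffice):

   import numpy as np

   def ranks(v):
       """Rank the data: smallest value gets rank 1, next smallest 2, etc."""
       order = np.argsort(v)
       r = np.empty(len(v))
       r[order] = np.arange(1, len(v) + 1)
       return r

   def spearman(x, y):
       """Pearson product-moment correlation applied to the ranks R_i and S_i."""
       r, s = ranks(x), ranks(y)
       return np.corrcoef(r, s)[0, 1]

   x = np.array([1.2, 3.4, 2.2, 5.0, 4.1])
   y = np.array([2.0, 2.9, 2.5, 9.9, 3.3])   # same ordering as x
   print(spearman(x, y))                      # rho_S = 1 here: the ranks agree exactly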

Comments

1. If the distribution of the original RVs is not continuous, then some data values may be repeated (‘tied ranks’). Repeated X_i s are given averaged ranks (for example, if there are two X_i with the smallest value, then they are each given rank 1.5 = (1 + 2)/2).

2. If X ⊥ Y, so the ‘true’ ρ_S is zero, then the distribution of the calculated ρ_S is easily approximated (using the standard formulae for ∑_{i=1}^n i^k).

3. ‘Easily approximated’ does not necessarily mean ‘well approximated’!

4. Most books give another formula for ρ_S, which is equivalent unless there are tied ranks, but which obscures the relationship with the standard product-moment correlation

      ρ = ∑(x_i − x̄)(y_i − ȳ) / √(∑(x_i − x̄)² ∑(y_i − ȳ)²).

5. Other, perhaps better, types of nonparametric correlation have been defined (‘Kendall’s τ’).



4.7 Graphical Methods

A vital part of data analysis is to plot the data using bar-charts, histograms, scatter diagrams etc. Plotting the data is important no matter what further formal statistical methods will be used:

1. It enables you to ‘get a feel for’ the data,

2. It helps you look for patterns and anomalies,

3. It helps in checking assumptions (such as independence, linearity or Normality).

Many useful plots can be easily churned out using a computer, though sometimes you have to devise original plots to display the data in the most appropriate way.

Exercise 4.6
The following table shows 66 measurements of the speed of light, made by S. Newcomb in 1882. Values are the times in nanoseconds (ns), less 24,800 ns, for light to travel from his laboratory to a mirror and back. Values are to be read row-by-row; thus the first two observations are 24,828 ns and 24,826 ns.

   28  26  33  24  34 −44  27  16  40  −2
   29  22  24  21  25  30  23  29  31  19
   24  20  36  32  36  28  25  21  28  29
   37  25  28  26  30  32  36  26  30  22
   36  23  27  27  28  27  31  27  26  33
   26  32  32  24  39  28  24  25  32  25
   29  27  28  29  16  23

Produce a histogram, a Normal probability plot and a time plot of Newcomb’s data. Decide which (if any) observations to ignore, and produce a Normal probability plot of the remaining reduced data set. Finally compare the mean of this reduced data set with (i) the mean and (ii) the 10% trimmed mean of the original data.

Solution: Plots are shown in Figure 4.1. There are clearly 2 large outliers, but the time plot also suggests that the 6th to 10th observations are unusually variable, and that the last two observations are atypically low (both being lower than the previous 20 observations).

The Normal probability plot is calculated by computing y_(i) (the sorted data) and z_i as follows, and plotting y_(i) against z_i.

    i   y_(i)   x_i = (i − 0.5)/n   z_i = Φ⁻¹(x_i)
    1   −44     0.0075              −2.434
    2   −2      0.0224              −2.007
    3   16      0.0373              −1.783
    4   16      0.0522              −1.624
    ⋮    ⋮       ⋮                    ⋮
   65   39      0.9776               2.007
   66   40      0.9925               2.434

Omitting the first 10 and the last 2 recorded observations leaves a data-set for which the Normality and independence assumptions are much more reasonable—see plot (d) of Figure 4.1.

Location estimates are (i) 26.2 (mean of all the data), (ii) 27.4 (10% trimmed mean), (iii) 27.9 (mean of the reduced data set, observations 11–64). The trimmed mean is reasonably close to the mean of observations 11–64.
‖
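For reference, a minimal sketch of the z_i calculation in the table above (my own illustration; scipy’s norm.ppf plays the role of Φ⁻¹, and the sample shown is just the first row of Newcomb’s table):

   import numpy as np
   from scipy.stats import norm

   data = np.array([28, 26, 33, 24, 34, -44, 27, 16, 40, -2])   # first row of the table

   y = np.sort(data)               # y_(i): the sorted observations
   n = len(y)
   i = np.arange(1, n + 1)
   z = norm.ppf((i - 0.5) / n)     # z_i = Phi^{-1}((i - 0.5)/n)

   for zi, yi in zip(z, y):
       print(f"{zi:6.3f}  {yi:4d}")  # plotting y_(i) against z_i gives the probability plot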



Figure 4.1: Plots of Newcomb’s data: (a) histogram, (b) Normal probability plot, (c) time plot, (d) Normal probability plot of data after excluding the first 10 and last 2 observations.

4.8 Bootstrapping

‘Bootstrap’ methods have become increasingly used over the past few years. They address the general question:

   ‘What are the properties of the calculated statistics (e.g. MLEs θ̂) given that the underlying distributional assumptions may be false (and, in reality, will be false)?’

Bootstrapping uses the observed data directly as an estimate of the underlying population, then uses ‘plug-in’ estimation, and typically involves computer simulation, as in the sketch below.
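A minimal bootstrap sketch (my own illustration): estimating the standard error of the sample median by resampling the observed data with replacement.

   import numpy as np

   rng = np.random.default_rng(2)
   x = np.array([28, 26, 33, 24, 34, 27, 16, 40, 29, 22])   # the observed sample
   B = 5_000

   # resample the data with replacement: the empirical distribution
   # stands in for the unknown population
   boot_medians = np.array([
       np.median(rng.choice(x, size=len(x), replace=True)) for _ in range(B)
   ])

   print(boot_medians.std())   # bootstrap estimate of the median's standard error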

Several other computer-intensive approaches to statistical inference have also become very popular recently.



4.9 Problems

1. [Light relief] Discuss the following quote:

   ‘As a statistician, I want to use mathematics to help deal with practical uncertainty. The natural mathematical way to handle uncertainty is via probability.

   ‘About the simplest practical probability statement I can think of is “The probability that a fair coin, tossed at random, will come down ‘heads’ is 1/2”.

   ‘Now try to define “fair coin”, “at random” and “probability 1/2” without using subjective probability or circular definitions.

   ‘Summary: if a practical probability statement is not subjective, then it must be tautologous, ill-defined, or useless.

   ‘Of course, for balance, some of the time I teach subjective methods, and some of the time I teach useless methods :-).’

   Ewart Shaw (Internet posting 13–Aug–1993).

2. (a) Plot the captopril data (Table 4.1), and suggest what sort of models seem reasonable.

   (b) Roughly estimate from your graph(s) the effect of captopril (C) on systolic and diastolic blood pressure (SBP & DBP).

   (c) Suggest a single summary measure (SBP, DBP or a combination of the two) to quantify the effect of treatment.

   (d) Do you think a transformation of the data would be appropriate?

   (e) Comment on the number of parameters in your model(s).

   (f) Calculate ρ_S and ρ between Δ_S, the change (after−before) in SBP, and Δ_D, the change (after−before) in DBP.

       Suggest some advantages and disadvantages in using ρ_S and ρ here.

   (g) Calculate some further summary statistics such as means, variances, correlations and five-number summaries, and comment on how useful they are as summaries of the data.

   (h) Are there any problems in using the data to estimate the effect of captopril? What further information would be useful?

   (i) What advantages/disadvantages would there be in using bootstrapping here, i.e. using the discrete distribution that assigns probability 1/15 to each of the 15 points x₁ = (210, 201, 130, 125), x₂ = (169, 165, 122, 121), …, x₁₅ = (154, 131, 100, 82) as an estimate of the underlying population, and working out the properties of ρ_S, ρ, etc. based on that assumption?



Chapter 5

Hypothesis Testing

5.1 Introduction

A hypothesis is a claim about the real world; statisticians will be interested in hypotheses like:

1. ‘The probabilities of a male panda or a female panda being born are equal’,

2. ‘The number of flying bombs falling on a given area of London during World War II follows a Poisson distribution’,

3. ‘The mean systolic blood pressure of 35-year-old men is no higher than that of 40-year-old women’,

4. ‘The mean value of Y = log(systolic blood pressure) is independent of X = age’ (i.e. E[Y|X = x] = constant).

These hypotheses can be translated into statements about parameters within a probability model:

1. ‘p₁ = p₂’,

2. ‘N ∼ Poi(λ) for some λ > 0’, i.e.: pₙ = P(N = n) = λⁿ exp(−λ)/n! (within the general probability model pₙ ≥ 0 ∀n = 0, 1, …; ∑ pₙ = 1),

3. ‘θ₁ ≤ θ₂’, and

4. ‘β₁ = 0’ (assuming the linear model E[Y|x] = β₀ + β₁x).

Definition 5.1 (Hypothesis test)
A hypothesis test is a procedure for deciding whether to accept a particular hypothesis as a reasonable simplifying assumption, or to reject it as unreasonable in the light of the data.

Definition 5.2 (Null hypothesis)
The null hypothesis H₀ is the simplifying assumption we are considering making.

Definition 5.3 (Alternative hypothesis)
The alternative hypothesis H₁ is the alternative explanation(s) we are considering for the data.

Definition 5.4 (Type I error)
A type I error is made if H₀ is rejected when H₀ is true.

Definition 5.5 (Type II error)
A type II error is made if H₀ is accepted when H₀ is false.



Comments

1. In the first example above (pandas) the null hypothesis is H₀: p₁ = p₂.

2. The alternative hypothesis in the first example would usually be H₁: p₁ ≠ p₂, though it could also be (for example)

   (a) H₁: p₁ < p₂,
   (b) H₁: p₁ > p₂, or
   (c) H₁: p₁ − p₂ = δ for some specified δ ≠ 0.

5.2 Simple Hypothesis Tests

The simplest type of hypothesis testing occurs when the probability distribution giving rise to the data is specified completely under the null and alternative hypotheses.

Definition 5.6 (Simple hypotheses)
A simple hypothesis is of the form H_k: θ = θ_k, i.e. the probability distribution of the data is specified completely.

Definition 5.7 (Composite hypotheses)
A composite hypothesis is of the form H_k: θ ∈ Ω_k, i.e. the parameter θ lies in a specified subset Ω_k of the parameter space Ω_Θ.

Definition 5.8 (Simple hypothesis test)
A simple hypothesis test tests a simple null hypothesis H₀: θ = θ₀ against a simple alternative H₁: θ = θ₁, where θ parametrises the distribution of our experimental random variables X = (X₁, X₂, …, Xₙ).

There may be many seemingly sensible approaches to testing a given hypothesis. A reasonable criterion for choosing between them is to attempt to minimise the chance of making a mistake: incorrectly rejecting a true null hypothesis, or incorrectly accepting a false null hypothesis.

Definition 5.9 (Size)
A test of size α is one which rejects the null hypothesis H₀: θ = θ₀ in favour of the alternative H₁: θ = θ₁ iff

   X ∈ C_α,  where P(X ∈ C_α | θ = θ₀) = α,

for some subset C_α of the sample space S of X.

Definition 5.10 (Critical region)
The set C_α in Definition 5.9 is called the critical region or rejection region of the test.

Definition 5.11 (Power & power function)
The power function of a test with critical region C_α is the function

   β(θ) = P(X ∈ C_α | θ),

and the power is β = β(θ₁), i.e. the probability that we reject H₀ in favour of H₁ when H₁ is true.

A hypothesis test typically uses a test statistic T(X), whose distribution is known under H₀, and such that extreme values of T(X) are more compatible with H₁ than with H₀.

Many useful hypothesis tests have the following form:



Definition 5.12 (Simple likelihood ratio test)
A simple likelihood ratio test (SLRT) of H₀: θ = θ₀ against H₁: θ = θ₁ rejects H₀ iff

   X ∈ C*_α = { x : L(θ₀; x)/L(θ₁; x) ≤ A_α },

where L(θ; x) is the likelihood of θ given the data x, and the number A_α is chosen so that the size of the test is α.

Exercise 5.1
Suppose that X₁, X₂, …, Xₙ ∼ N(θ, 1) (IID). Show that the likelihood ratio for testing H₀: θ = 0 against H₁: θ = 1 can be written

   λ(x) = exp[n(x̄ − 1/2)].

Hence show that the corresponding SLRT of size α rejects H₀ when the test statistic T(X) = X̄ satisfies T > Φ⁻¹(1 − α)/√n.
‖
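A minimal simulation sketch (my own check, not from the notes) confirming that the rejection rule in Exercise 5.1 has the intended size under H₀, and estimating its power under H₁:

   import numpy as np
   from scipy.stats import norm

   rng = np.random.default_rng(3)
   n, alpha, reps = 25, 0.05, 20_000
   crit = norm.ppf(1 - alpha) / np.sqrt(n)   # reject H0 when xbar > Phi^{-1}(1-alpha)/sqrt(n)

   xbar0 = rng.normal(0.0, 1.0, size=(reps, n)).mean(axis=1)   # data generated under H0
   print((xbar0 > crit).mean())              # empirical size, close to alpha = 0.05

   xbar1 = rng.normal(1.0, 1.0, size=(reps, n)).mean(axis=1)   # data under H1: theta = 1
   print((xbar1 > crit).mean())              # empirical power of the test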

Comments

1. For a simple hypothesis test, both H₀ and H₁ are ‘point hypotheses’, each specifying a particular value for the parameter θ rather than a region of the parameter space.

2. The size α is the probability of rejecting H₀ when H₀ is in fact true; clearly we want α to be small (α = 0.05, say).

3. Clearly for a fixed size α of test, the larger the power β of a test the better. However, there is an inevitable trade-off between small size and high power (as in a jury trial: the more careful one is not to convict an innocent defendant, the more likely one is to free a guilty one by mistake).

4. In practice, no hypothesis will be precisely true, so the whole foundation of classical hypothesis testing seems suspect!

5. Regarding likelihood as a measure of compatibility between data and model, an SLRT compares the compatibility of θ₀ and θ₁ with the observed data x, and accepts H₀ iff the ratio is sufficiently large.

6. One reason for the importance of likelihood ratio tests is the following theorem, which shows that out of all tests of a given size, an SLRT (if one exists) is ‘best’ in a certain sense.

Theorem 5.1 (The Neyman–Pearson lemma)
Given random variables X₁, X₂, …, Xₙ, with joint density f(x|θ), the simple likelihood ratio test of a fixed size α for testing H₀: θ = θ₀ against H₁: θ = θ₁ is at least as powerful as any other test of the same size.

Exercise 5.2
[Proof of Theorem 5.1] Prove the Neyman–Pearson lemma.

Solution: Fix the size of the test to be α. Let A be a positive constant and C₀ a subset of the sample space satisfying

(a) P(X ∈ C₀ | θ = θ₀) = α,

(b) X ∈ C₀ ⟺ L(θ₀; x)/L(θ₁; x) = f(x|θ₀)/f(x|θ₁) ≤ A.

Suppose that there exists another test of size α, defined by the critical region C₁, i.e.

   Reject H₀ iff x ∈ C₁, where P(x ∈ C₁ | θ = θ₀) = α.

Let B₁ = C₀ ∩ C₁, B₂ = C₀ ∩ C₁ᶜ, B₃ = C₀ᶜ ∩ C₁.

[Figure 5.1: Proof of the Neyman–Pearson lemma—a sketch of the sample space Ω_X with overlapping regions C₀ and C₁, showing B₁, B₂ and B₃.]

Note that B₁ ∪ B₂ = C₀, B₁ ∪ B₃ = C₁, and B₁, B₂ & B₃ are disjoint.

Let the power of the likelihood ratio test be I₀ = P(X ∈ C₀ | θ = θ₁), and the power of the other test be I₁ = P(X ∈ C₁ | θ = θ₁). We want to show that I₀ − I₁ ≥ 0. But

   I₀ − I₁ = ∫_{C₀} f(x|θ₁) dx − ∫_{C₁} f(x|θ₁) dx
           = ∫_{B₁∪B₂} f(x|θ₁) dx − ∫_{B₁∪B₃} f(x|θ₁) dx
           = ∫_{B₂} f(x|θ₁) dx − ∫_{B₃} f(x|θ₁) dx.

Also B₂ ⊆ C₀, so f(x|θ₁) ≥ A⁻¹ f(x|θ₀) for x ∈ B₂; similarly B₃ ⊆ C₀ᶜ, so f(x|θ₁) ≤ A⁻¹ f(x|θ₀) for x ∈ B₃. Therefore

   I₀ − I₁ ≥ A⁻¹ [ ∫_{B₂} f(x|θ₀) dx − ∫_{B₃} f(x|θ₀) dx ]
           = A⁻¹ [ ∫_{C₀} f(x|θ₀) dx − ∫_{C₁} f(x|θ₀) dx ]
           = A⁻¹ [α − α] = 0,

as required.
‖

5.3 Simple Null, Composite Alternative

Suppose that we wish to test the simple null hypothesis H₀: θ = θ₀ against the composite alternative hypothesis H₁: θ ∈ Ω₁.

The easiest way to investigate this is to imagine the collection of simple hypothesis tests with null hypothesis H₀: θ = θ₀ and alternative H₁: θ = θ₁, where θ₁ ∈ Ω₁. Then, for any given θ₁, an SLRT is the most powerful test for a given size α. The only problem would be if different values of θ₁ resulted in different SLRTs.



Definition 5.13 (UMP Tests)
A hypothesis test is called a uniformly most powerful test of H₀: θ = θ₀ against H₁: θ = θ₁, θ₁ ∈ Ω₁, if

1. there exists a critical region C_α corresponding to a test of size α not depending on θ₁, and

2. for all values of θ₁ ∈ Ω₁, the critical region C_α defines a most powerful test of H₀: θ = θ₀ against H₁: θ = θ₁.

Exercise 5.3
Suppose that X₁, X₂, …, Xₙ ∼ N(0, σ²) (IID).

(a) Find the UMP test of H₀: σ² = 1 against H₁: σ² > 1.

(b) Find the UMP test of H₀: σ² = 1 against H₁: σ² < 1.

(c) Show that no UMP test of H₀: σ² = 1 against H₁: σ² ≠ 1 exists.
‖

Comments

1. If a UMP test exists, then it is clearly the appropriate test to use.

2. Often UMP tests don’t exist!

3. A UMP test involves the data only via a likelihood ratio, so is a function of the sufficient statistics.

4. The critical region C_α therefore often has a simple form, and is usually easily found once the distribution of the sufficient statistics has been determined (hence the importance of the χ², t and F distributions).

5. The above three examples illustrate how important the form of the alternative hypothesis is. The first two are one-sided alternatives whereas H₁: σ² ≠ 1 is a two-sided alternative hypothesis, since σ² could lie on either side of 1.

5.4 Composite Hypothesis Tests

The most general situation we’ll consider is where the parameter space Ω is divided into two subsets: Ω = Ω₀ ∪ Ω₁, where Ω₀ ∩ Ω₁ = ∅, and the hypotheses are H₀: θ ∈ Ω₀, H₁: θ ∈ Ω₁.

For example, one may want to test the null hypothesis that the data come from an exponential distribution against the alternative that the data come from a more general gamma distribution. Note that here, as in many other cases, dim(Ω₀) < dim(Ω₁) = dim(Ω).

One possible approach to this situation is to regard the maximum possible likelihood over θ ∈ Ω_i as a measure of compatibility between the data and the hypothesis H_i (i = 0, 1). It’s therefore convenient to define the following:

   θ̂ is the MLE of θ over the whole parameter space Ω,
   θ̂₀ is the MLE of θ over Ω₀, i.e. under the null hypothesis H₀, and
   θ̂₁ is the MLE of θ over Ω₁, i.e. under the alternative hypothesis H₁.



Note that θ̂ must therefore be the same as either θ̂₀ or θ̂₁, since Ω = Ω₀ ∪ Ω₁.

One might consider using the likelihood ratio criterion L(θ̂₁; x)/L(θ̂₀; x), by direct analogy with the SLRT. However, it’s generally easier to use the equivalent ratio L(θ̂; x)/L(θ̂₀; x):

Definition 5.14 (Likelihood Ratio Test (LRT))
A likelihood ratio test rejects H₀: θ ∈ Ω₀ in favour of the alternative H₁: θ ∈ Ω₁ = Ω \ Ω₀ iff

   λ(x) = L(θ̂; x)/L(θ̂₀; x) ≥ λ,                      (5.1)

where θ̂ is the MLE of θ over the whole parameter space Ω, θ̂₀ is the MLE of θ over Ω₀, and the value λ is fixed so that

   sup_{θ∈Ω₀} P(λ(X) ≥ λ | θ) = α,

where α, the size of the test, is some chosen value.

Equivalently, the test criterion uses the log LRT statistic:

   r(x) = l(θ̂; x) − l(θ̂₀; x) ≥ λ′,                    (5.2)

where l(θ; x) = log L(θ; x), and λ′ is chosen to give chosen size α = sup_{θ∈Ω₀} P(r(X) ≥ λ′ | θ).

Comments

1. The size α is typically chosen by convention to be 0.05 or 0.01.

2. Note that high values of the test statistic λ(x), or equivalently of r(x), are taken as evidence against the null hypothesis H₀.

3. The test given in Definition 5.14 is sometimes referred to as a generalized likelihood ratio test, and Equation 5.1 a generalized likelihood ratio test statistic.

4. Equation 5.2 is often easier to work with than Equation 5.1—see the exercises and problems.

Exercise 5.4
[Paired t-test] Suppose that X₁, X₂, …, Xₙ ∼ N(µ, σ²) (IID), and let X̄ = ∑ X_i/n, S² = ∑(X_i − X̄)²/(n − 1).

What is the distribution of T = X̄/(S/√n)?

Is the test based on rejecting H₀: µ = 0 for large T a likelihood ratio test?

Assuming that the observed differences in diastolic blood pressure (after−before) are IID and Normally distributed with mean δ_D, use the captopril data (Table 4.1) to test the null hypothesis H₀: δ_D = 0 against the alternative hypothesis H₁: δ_D ≠ 0.

Comment: this procedure is called the paired t test.
‖

Exercise 5.5
[Two sample t-test] Suppose X₁, X₂, …, X_m ∼ N(µ_X, σ²) and Y₁, Y₂, …, Yₙ ∼ N(µ_Y, σ²) (both IID).

(a) Derive the LRT for testing H₀: µ_X = µ_Y versus H₁: µ_X ≠ µ_Y.

(b) Show that the LRT can be based on the test statistic

       T = (X̄ − Ȳ) / (S_p √(1/m + 1/n)),                                   (5.3)

    where

       S_p² = [ ∑_{i=1}^m (X_i − X̄)² + ∑_{i=1}^n (Y_i − Ȳ)² ] / (m + n − 2).   (5.4)

(c) Show that, under H₀, T ∼ t_{m+n−2}.

(d) Two groups of female rats were placed on diets with high and low protein content, and the gain in weight (grammes) between the 28th and 84th days of age was measured for each rat, with the following results:

       High protein diet: 134 146 104 119 124 161 107 83 113 129 97 123
       Low protein diet:   70 118 101  85 107 132  94

    Using the test statistic T above, test the null hypothesis that the mean weight gain is the same under both diets (a computational sketch follows).

Comment: this is called the two sample t-test, and S_p² is the pooled estimate of variance.
‖
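A minimal computational sketch of Exercise 5.5(d) (my own, applying Equations (5.3)–(5.4) to the rat data):

   import numpy as np
   from scipy.stats import t as t_dist

   high = np.array([134, 146, 104, 119, 124, 161, 107, 83, 113, 129, 97, 123])
   low  = np.array([70, 118, 101, 85, 107, 132, 94])
   m, n = len(high), len(low)

   # pooled variance, Equation (5.4)
   sp2 = (((high - high.mean()) ** 2).sum()
          + ((low - low.mean()) ** 2).sum()) / (m + n - 2)

   # test statistic, Equation (5.3); for these data T comes out around 1.9
   T = (high.mean() - low.mean()) / np.sqrt(sp2 * (1 / m + 1 / n))

   p_value = 2 * (1 - t_dist.cdf(abs(T), df=m + n - 2))   # two-sided p-value
   print(T, p_value)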

Exercise 5.6
[F-test] Suppose X₁, X₂, …, X_m ∼ N(µ_X, σ²_X) and Y₁, Y₂, …, Yₙ ∼ N(µ_Y, σ²_Y) (both IID), where µ_X, µ_Y, σ_X and σ_Y are all unknown.

Suppose we wish to test the hypothesis H₀: σ²_X = σ²_Y against the alternative H₁: σ²_X ≠ σ²_Y.

(a) Let S²_X = ∑_{i=1}^m (X_i − X̄)² and S²_Y = ∑_{i=1}^n (Y_i − Ȳ)². What are the distributions of S²_X/σ²_X and S²_Y/σ²_Y?

(b) Under H₀, what is the distribution of the statistic

       V = (S²_X/(m − 1)) / (S²_Y/(n − 1))?

(c) Taking values of V much larger or smaller than 1 as evidence against H₀, and given data with m = 16, n = 16, ∑xᵢ = 84, ∑yᵢ = 18, ∑x²ᵢ = 563, ∑y²ᵢ = 72, test the null hypothesis H₀ (a computational sketch follows).

Comment: with the alternative hypothesis H₁: σ²_X > σ²_Y, the above procedure is called an F test.
‖
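A minimal sketch of the arithmetic in Exercise 5.6(c) (my own; the data enter only through the given sums):

   from scipy.stats import f as f_dist

   m = n = 16
   sum_x, sum_y   = 84.0, 18.0
   sum_x2, sum_y2 = 563.0, 72.0

   # corrected sums of squares: S^2 = sum(x_i^2) - (sum x_i)^2 / m
   ssx = sum_x2 - sum_x ** 2 / m      # 563 - 441   = 122
   ssy = sum_y2 - sum_y ** 2 / n      # 72  - 20.25 = 51.75

   V = (ssx / (m - 1)) / (ssy / (n - 1))   # about 2.36; F(15, 15) under H0
   p_value = 2 * min(f_dist.cdf(V, m - 1, n - 1),
                     1 - f_dist.cdf(V, m - 1, n - 1))   # two-sided
   print(V, p_value)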

Even in simple cases like these, the null distribution of the log likelihood ratio test statistic r(x) (5.2) can be difficult or impossible to find analytically. Fortunately, there is a very powerful and very general theorem that gives the approximate distribution of r(x):

Theorem 5.2 (Wald’s Theorem)
Let X₁, X₂, …, Xₙ ∼ f(x|θ) (IID) where θ ∈ Ω, and let r(x) denote the log likelihood ratio test statistic

   r(x) = l(θ̂; x) − l(θ̂₀; x),

where θ̂ is the MLE of θ over Ω and θ̂₀ is the MLE of θ over Ω₀ ⊂ Ω.

Then under reasonable conditions on the PDF (or PMF) f(·|·), the distribution of 2r(x) converges to a χ² distribution on dim(Ω) − dim(Ω₀) degrees of freedom as n → ∞.



Comments

1. A proof is beyond the scope of this course, but may be found in e.g. Kendall & Stuart, ‘The Advanced Theory of Statistics’, Vol. II.

2. Wald’s theorem implies that, provided the sample size is large, you only need tables of the χ² distribution to find the critical regions for a wide range of hypothesis tests; the sketch below illustrates the convergence.
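To see Theorem 5.2 in action, here is a minimal simulation sketch (my own illustration) for the setting of Exercise 5.7 below, where 2r(x) = n x̄² and dim(Ω) − dim(Ω₀) = 1, so 2r(x) should behave like χ²₁:

   import numpy as np
   from scipy.stats import chi2

   rng = np.random.default_rng(4)
   n, reps = 50, 20_000

   # simulate 2r(x) = n * xbar^2 under H0: theta = 0 (see Exercise 5.7)
   xbar = rng.normal(0.0, 1.0, size=(reps, n)).mean(axis=1)
   two_r = n * xbar ** 2

   # compare upper-tail frequencies with the chi-squared(1) distribution
   for q in [0.90, 0.95, 0.99]:
       crit = chi2.ppf(q, df=1)
       print(q, (two_r > crit).mean())   # empirical tail areas close to 1 - q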

Another important theorem (see Problem 3.7.8, page 39) is the following:

Theorem 5.3 (Sample Mean and Variance of X_i ∼ N(µ, σ²), IID)
Let X₁, X₂, …, Xₙ ∼ N(µ, σ²) (IID). Then

1. X̄ = ∑ X_i/n and Y = ∑(X_i − X̄)² are independent RVs,

2. X̄ has a N(µ, σ²/n) distribution,

3. Y/σ² has a χ²_{n−1} distribution.

Exercise 5.7
Suppose X₁, X₂, …, Xₙ ∼ N(θ, 1) (IID), with hypotheses H₀: θ = 0 and H₁: θ arbitrary.
Show that 2r(x) = n x̄², and hence that Wald’s theorem holds exactly in this case.
‖

Exercise 5.8
Suppose now that X_i ∼ N(θ_i, 1), i = 1, …, n are independent, with null hypothesis H₀: θ_i = θ ∀i and alternative hypothesis H₁: θ_i arbitrary.
Show that 2r(x) = ∑_{i=1}^n (x_i − x̄)², and hence (quoting any other theorems you need) that Wald’s theorem again holds exactly.
‖



5.5 Problems

1. Suppose that X ∼ Bin(n, p). Under the null hypothesis H₀: p = p₀, what are EX and Var X?

   Show that if n is large and p₀ is not too close to 0 or 1, then

      (X/n − p₀) / √(p₀(1 − p₀)/n) ∼ N(0, 1)   approximately.

   Out of 1000 tosses of a given coin, 560 were heads and 440 were tails. Is it reasonable to assume that the coin is fair? Justify your answer.

2. Out of 370 new-born babies at a hospital, 197 were male and 173 female.

   Test the null hypothesis H₀: p < 1/2 versus H₁: p ≥ 1/2, where p denotes the probability that a baby born at the hospital will be male.

   Discuss any assumptions you make.

3. X is a single observation whose density is given by

      f(x) = { (1 + θ)x^θ   if 0 < x < 1,
             { 0            otherwise.

   Find the most powerful size α test of H₀: θ = 0 against H₁: θ = 1.

   Is there a UMP test of H₀: θ ≤ 0 against H₁: θ > 0? If so, what is it?

4. Suppose X₁, X₂, …, Xₙ ∼ N(µ, σ²) (IID) with null hypothesis H₀: σ² = 1 and alternative H₁: σ² arbitrary. Show that the LRT will reject H₀ for large values of the test statistic r(x) = n(v̂ − 1 − log v̂), where v̂ = ∑_{i=1}^n (x_i − x̄)²/n.

5. Let X₁, …, Xₙ be independent, each with density

      f(x) = { λx⁻² e^{−λ/x}   if x > 0,
             { 0               otherwise,

   where λ is an unknown parameter.

   (a) Show that the UMP test of H₀: λ = 1/2 against H₁: λ > 1/2 is of the form: ‘reject H₀ if ∑_{i=1}^n X_i⁻¹ ≤ A*’, where A* is chosen to fix the size of the test.

   (b) Find the distribution of ∑_{i=1}^n X_i⁻¹ under the null & alternative hypotheses.

   (c) You observe values 0.59, 0.36, 0.71, 0.86, 0.13, 0.01, 3.17, 1.18, 3.28, 0.49 for X₁, …, X₁₀. Test H₀ against H₁, & comment on the test in the light of any assumptions made.

6. (a) Define the size and power of a hypothesis test of a simple null hypothesis H₀: θ = θ₀ against a simple alternative hypothesis H₁: θ = θ₁.

   (b) State and prove the Neyman–Pearson Lemma for continuous random variables X₁, …, Xₙ when testing the null hypothesis H₀: θ = θ₀ against the alternative H₁: θ = θ₁.

   (c) Assume that a particular bus service runs at regular intervals of θ minutes, but that you do not know θ. Assume also that the times you find you have to wait for a bus on n occasions, X₁, …, Xₙ, are independent and identically distributed with density

          f(x|θ) = { θ⁻¹   if 0 ≤ x ≤ θ,
                   { 0     otherwise.

       i. Discuss briefly when the above assumptions would be reasonable in practice.
       ii. Find the likelihood L(θ; x) for θ given the data (X₁, …, Xₙ) = x = (x₁, …, xₙ).
       iii. Find the most powerful test of size α of the hypothesis H₀: θ = θ₀ = 20 against the alternative H₁: θ = θ₁ > 20.

   From Warwick ST217 exam 1997

7. The following problem is quoted verbatim from Osborn (1979), ‘Statistical Exercises in Medical Research’, 4.6.16:

   A study of immunoglobulin levels in mycetoma patients in the Sudan involved 22 patients to be compared to 22 normal individuals. The levels of IgG recorded for the 22 mycetoma patients are shown below. The mean level for the normal individuals was calculated to be 1,477 mg/100ml before the data for this group was lost overboard from a punt on the river Nile. Use the data below to estimate the within group variance and hence perform a ‘t’ test to investigate the significance of the difference between the mean levels of IgG in mycetoma patients and normals.

   IgG levels (mg/100ml) in 22 mycetoma patients

      1,047 1,135 1,350 1,122 1,345
      1,377 1,375   804 1,062 1,204
      1,210 1,067 1,032 1,002 1,053
      1,103   907   960   960   936
      1,270 1,230

8. Let X₁, X₂, …, Xₙ ∼ Exp(θ) (IID), i.e. f(x|θ) = θe^{−θx} for θ ∈ (0, ∞).

   Show that a likelihood ratio test for H₀: θ ≤ θ₀ versus H₁: θ > θ₀ has the form:

      ‘Reject H₀ iff θ₀x̄ < k, where k is given by α = ∫₀^{nk} (1/Γ(n)) z^{n−1} e^{−z} dz’.

   Show that a test of this form is UMP for testing H₀: θ = θ₀ versus H₁: θ > θ₀.

9. (a) Define the size and power function of a hypothesis test procedure.

   (b) State and prove the Neyman–Pearson lemma in the case of a test statistic that has a continuous distribution.

   (c) Let X₁, X₂, …, Xₙ ∼ N(µ, σ²) (IID), where σ² is known. Find the likelihood ratio

          f_X(x|µ₁) / f_X(x|µ₀)

       and hence show that the most powerful test of size α for testing the null hypothesis H₀: µ = µ₀ against the alternative H₁: µ = µ₁, for some µ₁ < µ₀, has the form:

          ‘Reject H₀ if X̄ < µ₀ + σΦ⁻¹(α)/√n’,

       where X̄ = ∑_{i=1}^n X_i/n is the sample mean, and Φ⁻¹(α) is the 100α% point of the standard Normal N(0, 1) distribution.

   (d) Define a uniformly most powerful (UMP) test, and show that the above test is UMP for testing H₀: µ = µ₀ against H₁: µ < µ₀.

   (e) What is the UMP test of H₀: µ = µ₀ against H₁: µ > µ₀?

   (f) Deduce that no UMP test of size α exists for testing H₀: µ = µ₀ against H₁: µ ≠ µ₀.

   (g) What test would you choose to test H₀: µ = µ₀ against H₁: µ ≠ µ₀, and why?

   From Warwick ST217 exam 1999

10. A group of clinicians wish to study survival after heart attack, by classifying new heart attack patients according to

(a) whether they survive at least 7 days after admission, and
(b) whether they currently smoke 10 or more cigarettes per day.

From previous experience, the clinicians predict that after N days the observed counts

             Survive   Die
Smoker         R₁       R₂
Non-smoker     R₃       R₄

will follow independent Poisson distributions with means

             Survive   Die
Smoker         Nr₁      Nr₂
Non-smoker     Nr₃      Nr₄

The clinicians intend to estimate the population log-odds ratio l = log(r₁r₄/r₂r₃) by the sample value L = log(R₁R₄/R₂R₃), and they wish to choose N to give a probability 1 − β of being able to reject the hypothesis H₀: l = 0 at the 100α% significance level, when the true value of l is l₀ > 0.

Using the formula Var(f(X)) ≈ (f′(EX))² Var(X), show that L has approximate variance

1/(Nr₁) + 1/(Nr₂) + 1/(Nr₃) + 1/(Nr₄),

and hence, assuming a Normal approximation to the distribution of L, that the required number of days is roughly

N = (1/l₀²) {1/r₁ + 1/r₂ + 1/r₃ + 1/r₄} {Φ⁻¹(α/2) + Φ⁻¹(β)}²,

where Φ is the standard Normal cumulative distribution function.

Comment critically on the clinicians’ method for choosing N.

From Warwick ST332 exam 1988

11. (a) Define the size and power of a hypothesis test, and explain what is meant by a simple likelihood ratio test and by a uniformly most powerful test.

(b) Let X₁, X₂, …, Xₙ be independent random variables, each having a Poisson distribution with mean λ. Find the likelihood ratio test for testing H₀: λ = λ₀ against H₁: λ = λ₁, where λ₁ > λ₀.

Show also that this test is uniformly most powerful.

(c) Twenty-five leaves were selected at random from each of six similar apple trees. The number of adult female European red mites on each was counted, with the following results:

No. of mites    0    1    2    3    4    5    6    7
Frequency      70   38   17   10    9    3    2    1

Assuming that the numbers of mites per leaf follow IID Poisson distributions, and using a Normal approximation to the Poisson distribution, carry out a test of size 0.05 of the null hypothesis H₀ that the mean number of mites per leaf is 1.0, against the alternative H₁ that it is greater than 1.0.

Discuss briefly whether the assumptions you have made in testing H₀ appear reasonable here.

From Warwick ST217 exam 2000

12. Hypothesis test procedures can be inverted to produce confidence intervals or, more generally, confidence regions. Thus, given a size α test of the null hypothesis H₀: θ = θ₀, the set of all values θ₀ that would NOT be rejected forms a ‘100(1 − α)% confidence interval for θ’.

An amateur statistician argues as follows:

Suppose something starts at time t₀ and ends at time t₁. Then at time t ∈ (t₀, t₁), the ratio r of its remaining lifetime (t₁ − t) to its current age (t − t₀), i.e.

r(t) = (t₁ − t)/(t − t₀),

is clearly a monotonic decreasing function of t. Also it is easy to check that r = 39 after (1/40)th of the total lifetime, and that r = 1/39 after (39/40)th of the total lifetime. Therefore, for 95% of something’s existence, its remaining lifetime lies in the interval

((t − t₀)/39, 39(t − t₀)),

where t is the time under consideration, and t₀ is the time the thing came into existence.

The statistician is also an amateur theologian, and firmly believes that the World came into existence 6006 years ago. Using his pet procedure outlined above, he says he is ‘95% confident that the World will end sometime between 154 years hence and 234234 years hence’.

His friend, also an amateur statistician, says she has an even more general procedure to produce confidence intervals:

In any situation I simply roll an icosahedral (fair 20-sided) die. If the die shows ‘13’ then I quote the empty set ∅ as a 95% confidence interval, otherwise I quote the whole real line ℝ.

She rolls the die, which comes up 13. She therefore says she is ‘95% confident that the World ended before it even began (although presumably no-one has noticed yet)’.

Discuss.

5.6 The Multinomial Distribution and χ² Tests

5.6.1 Multinomial Data

Definition 5.15 (Multinomial Distribution)
The multinomial distribution Mn(n, θ) is a probability distribution on points y = (y₁, y₂, …, yₖ), where yᵢ ∈ {0, 1, 2, …}, i = 1, 2, …, k, and ∑ᵢ₌₁ᵏ yᵢ = n, with PMF

f(y₁, y₂, …, yₖ) = (n! / (y₁! y₂! ⋯ yₖ!)) ∏ᵢ₌₁ᵏ θᵢ^{yᵢ},    (5.5)

where θᵢ > 0 for i = 1, …, k, and ∑ᵢ₌₁ᵏ θᵢ = 1.

Comments

1. The multinomial distribution arises when one has n independent observations, each classified in one of k ways (e.g. ‘eye colour’ classified as ‘Brown’, ‘Blue’ or ‘Other’; here k = 3). Let θᵢ denote the probability that any given observation lies in category number i, and let Yᵢ denote the number of observations falling in category i. Then the random vector Y = (Y₁, Y₂, …, Yₖ) has a Mn(n, θ) distribution.

2. A binomial distribution is the special case k = 2, and is usually parametrised by p = θ₁ (so θ₂ = 1 − p).
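As a quick numerical aside (an addition to these notes, assuming Python with scipy is available), the PMF (5.5) is easy to evaluate directly and to check against a library implementation. The counts below are hypothetical, chosen only to illustrate the eye-colour example with k = 3.

    from math import factorial
    from scipy.stats import multinomial

    # Hypothetical eye-colour counts for n = 10 people in k = 3 categories
    y = [5, 3, 2]               # (Brown, Blue, Other)
    theta = [0.5, 0.3, 0.2]     # category probabilities, summing to 1
    n = sum(y)

    # Direct evaluation of the PMF (5.5): multinomial coefficient
    # times the product of theta_i ** y_i
    coef = factorial(n)
    for yi in y:
        coef //= factorial(yi)
    pmf = float(coef)
    for yi, ti in zip(y, theta):
        pmf *= ti ** yi

    print(pmf)                           # 0.08505 by the direct formula
    print(multinomial.pmf(y, n, theta))  # scipy's implementation agrees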



Exercise 5.9
By partial differentiation of the likelihood function, show that the MLEs θ̂ᵢ of the parameters θᵢ of the Mn(n, θ) distribution satisfy the equations

yᵢ/θ̂ᵢ − yₖ/(1 − ∑ⱼ₌₁ᵏ⁻¹ θ̂ⱼ) = 0,    (i = 1, …, k − 1),

and hence that θ̂ᵢ = yᵢ/n for i = 1, …, k.
‖

5.6.2 Chi-Squared Tests

Suppose one wishes to test the null hypothesis H₀ that, in the multinomial distribution (5.5), θ is some function θ(φ) of another parameter φ. The alternative hypothesis H₁ is that θ is arbitrary.

Exercise 5.10
Suppose H₀ is that X₁, X₂, …, Xₙ are IID Bin(3, φ). Let Yᵢ (for i = 1, 2, 3, 4) denote the number of observations Xⱼ taking value i − 1. What is the null distribution of Y = (Y₁, Y₂, Y₃, Y₄)?
‖

The log likelihood ratio test statistic r(X) is given by

r(X) = ∑ᵢ₌₁ᵏ Yᵢ log θ̂ᵢ − ∑ᵢ₌₁ᵏ Yᵢ log θᵢ(φ̂),    (5.6)

where θ̂ᵢ = yᵢ/n for i = 1, …, k.

By Wald’s theorem, under H₀, 2r(X) has approximately a χ² distribution:

2 ∑ᵢ₌₁ᵏ Yᵢ [log θ̂ᵢ − log θᵢ(φ̂)] ∼ χ² on k₁ − k₀ degrees of freedom,    (5.7)

where θ̂ᵢ = Yᵢ/n, k₀ is the dimension of the parameter φ, and k₁ = k − 1 is the dimension of θ under the constraint ∑ᵢ₌₁ᵏ θᵢ = 1.

Comments

1. In Exercise 5.10, k = 4, k₀ = 1, k₁ = 3 and φ̂ = X̄/3, where X̄ = (1/n)∑ᵢ₌₁⁴ (i − 1)Yᵢ is the sample mean. We would reject H₀, that the sample comes from a Bin(3, φ) distribution for some φ, if 2r(x) is greater than the 95% point of the χ²₂ distribution, where r(x) is given in Formula (5.6).

2. It is straightforward to check, using a Taylor series expansion of the log function, that provided EYᵢ is large ∀i,

2 ∑ᵢ₌₁ᵏ Yᵢ [log θ̂ᵢ − log θᵢ(φ̂)] ≈ ∑ᵢ₌₁ᵏ (Yᵢ − µᵢ)²/µᵢ,    (5.8)

where µᵢ = nθᵢ(φ̂) is the expected number of individuals (under H₀) in the ith category.

Definition 5.16 (Chi-squared Goodness of Fit Statistic)

X² = ∑ᵢ₌₁ᵏ (oᵢ − eᵢ)²/eᵢ,    (5.9)

where oᵢ is the observed count in the ith category and eᵢ is the corresponding expected count under the null hypothesis, is called the χ² goodness-of-fit statistic.



Comments

1. Under H₀, X² has approximately a χ² distribution with number of degrees of freedom equal to (number of categories) − 1 − (number of parameters estimated under H₀). This approximation works well provided all the expected counts are reasonably large (say all are at least 5).

2. This χ² test was suggested by Karl Pearson before the theory of hypothesis testing was fully developed.
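For concreteness, here is a short Python sketch (an addition to these notes) computing both the Pearson statistic (5.9) and the likelihood ratio statistic 2r(x) from (5.6)–(5.7), using the pea counts that appear in Problem 1 below. Under that H₀ the category probabilities are fully specified, so no parameters are estimated and the degrees of freedom are k − 1 = 3.

    import numpy as np
    from scipy.stats import chi2

    # Pea data from Problem 1 below; H0 specifies probabilities 9:3:3:1
    o = np.array([315, 108, 101, 32])    # observed counts
    p0 = np.array([9, 3, 3, 1]) / 16     # hypothesised probabilities
    e = o.sum() * p0                     # expected counts under H0

    X2 = np.sum((o - e) ** 2 / e)        # Pearson statistic (5.9)
    G2 = 2 * np.sum(o * np.log(o / e))   # likelihood ratio statistic 2r(x)

    df = len(o) - 1                      # no parameters estimated under H0
    for name, stat in [("X2", X2), ("G2", G2)]:
        print(name, round(stat, 3), "p-value", round(chi2.sf(stat, df), 3))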

5.7 Problems

1. In a genetic experiment, peas were classified according to their shape (‘round’ or ‘angular’) and colour (‘yellow’ or ‘green’). Out of 556 peas, 315 were round+yellow, 108 were round+green, 101 were angular+yellow and 32 were angular+green.

Test the null hypothesis that the probabilities of these four types are 9/16, 3/16, 3/16 and 1/16 respectively.

2. A sample of 300 people was selected from a population, and classified into blood type (O/A/B/AB, and Rhesus positive/negative), as shown in the following table:

               O    A    B   AB
Rh positive   82   89   54   19
Rh negative   13   27    7    9

The null hypothesis H₀ is that being Rhesus negative is independent of whether an individual’s blood group is O, A, B or AB. Estimate the probabilities under H₀ of falling into each of the 8 categories, and hence test the hypothesis H₀.

3. The random variables X₁, X₂, …, Xₙ are IID with P(Xᵢ = j) = pⱼ for j = 1, 2, 3, 4, where ∑pⱼ = 1 and pⱼ > 0 for each j = 1, 2, 3, 4.

Interest centres on the hypothesis H₀ that p₁ = p₂ and simultaneously p₃ = p₄.

(a) Define the following terms:
 i. a hypothesis test,
 ii. simple and composite hypotheses, and
 iii. a likelihood ratio test.

(b) Letting θ = (p₁, p₂, p₃, p₄), X = (X₁, …, Xₙ)ᵀ with observed values x = (x₁, …, xₙ)ᵀ, and letting yⱼ denote the number of x₁, x₂, …, xₙ equal to j, what is the likelihood L(θ|x)?

(c) Assume the usual regularity conditions, i.e. that the distribution of −2 log λ(x), where λ(x) is the likelihood ratio, tends to χ²_ν as the sample size n → ∞. What are the dimension of the parameter space Ω_θ and the number of degrees of freedom ν of the asymptotic chi-squared distribution?

(d) By partial differentiation of the log-likelihood, or otherwise, show that the maximum likelihood estimator of pⱼ is yⱼ/n.

(e) Hence show that the asymptotic test statistic of H₀: p₁ = p₂ and p₃ = p₄ is

−2 log λ(x) = 2 ∑ⱼ₌₁⁴ yⱼ log(yⱼ/mⱼ),

where m₁ = m₂ = (y₁ + y₂)/2 and m₃ = m₄ = (y₃ + y₄)/2.

(f) In a hospital casualty unit, the numbers of limb fractures seen over a certain period of time are:

        Side
      Left  Right
Arm    46    49
Leg    22    32

Using the test developed above, test the hypothesis that limb fractures are equally likely to occur on the right side as on the left side.

Discuss briefly whether the assumptions underlying the test appear reasonable here.

From Warwick ST217 exam 1998

Prudens quaestio dimidium scientiae.
Half of science is asking the right questions.
Roger Bacon

We all learn by experience, and your lesson this time is that you should never lose sight of the alternative.
Sir Arthur Conan Doyle

One forms provisional theories and then waits for time or fuller knowledge to explode them.
Sir Arthur Conan Doyle

What used to be called prejudice is now called a null hypothesis.
A. W. F. Edwards

The conventional view serves to protect us from the painful job of thinking.
John Kenneth Galbraith

Science must begin with myths, and with the criticism of myths.
Sir Karl Raimund Popper


Chapter 6

Linear Statistical Models

6.1 Introduction

Definition 6.1 (Response Variable)
A response variable is a random variable Y whose value we wish to predict.

Definition 6.2 (Explanatory Variable)
An explanatory variable is a random variable X whose values can be used to predict Y.

Definition 6.3 (Linear Model)
A linear model is a prediction function for Y in terms of the values x₁, x₂, …, xₖ of X₁, X₂, …, Xₖ of the form

E[Y|x₁, x₂, …, xₖ] = β₀ + β₁x₁ + β₂x₂ + ⋯ + βₖxₖ.    (6.1)

Thus if Y₁, Y₂, …, Yₙ are the responses for cases 1, 2, …, n, and xᵢⱼ is the value of Xⱼ (j = 1, …, k) for case i, then

E[Y|X] = Xβ,    (6.2)

where

Y = (Y₁, Y₂, …, Yₙ)ᵀ is the vector of responses,
X = (xᵢⱼ), with xᵢ₀ = 1 for i = 1, …, n, is the matrix of explanatory variables, and
β = (β₀, β₁, …, βₖ)ᵀ is the (unknown) parameter vector.


Examples

Consider the captopril data (page 43), and let

X₁ = Diastolic BP before treatment,  X₂ = Systolic BP before treatment,
X₃ = Diastolic BP after treatment,   X₄ = Systolic BP after treatment,
Z₁ = 2X₁ + X₂,  Z₂ = 2X₃ + X₄.

Some possible linear models of interest are:

1. Response Y = X₄,
(a) explanatory variable X₂ (this is a ‘simple linear regression model’, with just one explanatory variable),
(b) explanatory variable X₃,
(c) explanatory variables X₁ and X₂ (a ‘multiple regression model’).

2. Response Y = Z₂,
(a) explanatory variable Z₁,
(b) explanatory variables Z₁ and Z₁² (a ‘quadratic regression model’).
Note how new explanatory variables may be obtained by transforming and/or combining old ones.

3. Looking just at the interrelationship between SBP and DBP at a given time:
(a) response Y = X₂, explanatory variable X₁,
(b) response Y = X₁, explanatory variable X₂,
(c) response Y = X₄, explanatory variable X₃, etc.

Comments

1. A linear relationship is the simplest possible relationship between response variables and explanatory variables, so linear models are easy to understand, interpret and also to check for plausibility.

2. One can (in theory) approximate an arbitrarily complicated relationship by a linear model; for example, quadratic regression can obviously be extended to ‘polynomial regression’:

E[Y|x] = β₀ + β₁x + β₂x² + ⋯ + βₘxᵐ.

3. Linear models have nice links with
• geometry,
• linear algebra,
• conditional expectations and variances,
• the Normal distribution.

4. Distributional assumptions (if any!) will typically be made ONLY about the response variable Y, NOT about the explanatory variables. Therefore the model makes sense even if the Xᵢs are chosen nonrandomly (‘designed experiments’).

5. The response variable Y is sometimes called the ‘dependent variable’, and the explanatory variables are sometimes called ‘predictor variables’, ‘regressor variables’, or (very misleadingly) ‘independent variables’.
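The note at the end of Example 2 about creating new explanatory variables deserves a concrete illustration. The Python sketch below (an addition to these notes, with made-up values of Z₁) builds the design matrix for the quadratic model 2(b) by column-stacking a constant, Z₁ and Z₁²; transformed columns enter a linear model exactly like ordinary explanatory variables.

    import numpy as np

    # Hypothetical values of Z1 = 2*X1 + X2 for five patients (illustrative only)
    z1 = np.array([370.0, 352.0, 361.0, 340.0, 355.0])

    # Design matrix for E[Y|z1] = b0 + b1*z1 + b2*z1^2
    X = np.column_stack([np.ones_like(z1), z1, z1 ** 2])
    print(X)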



6.2 Simple Linear Regression

Definition 6.4
A simple linear regression model is a linear model with one response variable Y and one explanatory variable X, i.e. a model of the form

E[Y|x₁] = β₀ + β₁x₁.    (6.3)

Typically in practice we have n data points (xᵢ, yᵢ) for i = 1, …, n, and we want to predict a future response Y from the corresponding observed value x of X.

Often there’s a natural candidate for which variable should be treated as the response:

1. X may precede Y in time, for example
(a) X is BP before treatment and Y is BP after treatment, or
(b) X is number of hours revision and Y is exam mark;

2. X may be in some way more fundamental, for example
(a) X is age and Y is height, or
(b) X is height and Y is weight;

3. X may be easier or cheaper to observe, so we hope in future to estimate Y without measuring it.

In simple linear regression we don’t know β₀ or β₁, but need to estimate them in order to predict Y by Ŷ = β̂₀ + β̂₁x. To make accurate predictions we require the prediction error

Y − Ŷ = Y − β̂₀ − β̂₁x

to be small.

This suggests that, given data (xᵢ, yᵢ) for i = 1, …, n, we should fit β̂₀ and β̂₁ by simultaneously making all the vertical deviations of the observed data points from the fitted line y = β̂₀ + β̂₁x small. The easiest way to do this is to minimise the sum of squared deviations ∑(yᵢ − ŷᵢ)², i.e. to use the ‘least squares’ criterion.

6.3 Method of Least Squares

For simple linear regression,

ŷᵢ = β₀ + β₁xᵢ    (i = 1, …, n).    (6.4)

Therefore to estimate β₀ and β₁ by least squares, we need to minimise

Q = ∑ᵢ₌₁ⁿ [yᵢ − (β₀ + β₁xᵢ)]².    (6.5)

Exercise 6.1
Show that Q in equation (6.5) is minimised at values β₀ and β₁ satisfying the simultaneous equations

β₀n + β₁∑xᵢ = ∑yᵢ,
β₀∑xᵢ + β₁∑xᵢ² = ∑xᵢyᵢ,    (6.6)

and hence that

β̂₁ = (∑xᵢyᵢ − n x̄ȳ) / (∑xᵢ² − n x̄²),    (6.7)
β̂₀ = ȳ − β̂₁x̄.    (6.8)
‖

Comments

1. Forming ∂²Q/∂β₀², ∂²Q/∂β₁² and ∂²Q/∂β₀∂β₁ verifies that Q is minimised at β = β̂.

2. Equations (6.6) are called the ‘normal equations’ for β₀ and β₁ (‘normal’ as in ‘perpendicular’ rather than as in ‘standard’ or as in ‘Normal distribution’).

3. y = β̂₀ + β̂₁x is called the ‘least squares fit’ to the data.

4. From equations (6.7) and (6.8), the least squares fitted line passes through (x̄, ȳ), the centroid of the data points.

5. Concentrate on understanding and remembering the method for finding β̂, rather than on memorising the formulae (6.7) and (6.8) for β̂₀ and β̂₁.

6. Geometrical interpretation
We have a vector y = (y₁, y₂, …, yₙ)ᵀ of observed responses, i.e. a point in n-dimensional space, together with a surface S representing possible joint predicted values under the model (for simple linear regression, it’s the 2-dimensional surface β₀ + β₁x for real values of β₀ and β₁). Minimising ∑(yᵢ − ŷᵢ)² is equivalent to dropping a perpendicular from the point y to the surface S; the perpendicular hits the surface at ŷ. Thus we are literally finding the model closest to the data.
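As a quick illustration (an addition to these notes), the sketch below fits a least squares line using the explicit formulae (6.7) and (6.8); the data are made up purely for demonstration.

    import numpy as np

    # Hypothetical data: hours of revision (x) and exam mark (y)
    x = np.array([2.0, 5.0, 8.0, 11.0, 14.0])
    y = np.array([38.0, 49.0, 58.0, 70.0, 75.0])

    n = len(x)
    xbar, ybar = x.mean(), y.mean()

    # Least squares estimates from (6.7) and (6.8)
    b1 = (np.sum(x * y) - n * xbar * ybar) / (np.sum(x ** 2) - n * xbar ** 2)
    b0 = ybar - b1 * xbar

    print(f"fitted line: y = {b0:.2f} + {b1:.2f} x")
    # Comment 4: the fitted line passes through the centroid (xbar, ybar)
    assert abs((b0 + b1 * xbar) - ybar) < 1e-9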

6.4 Problems

1. Show that the expression ∑xᵢyᵢ − n x̄ȳ occurring in the formula for β̂₁ could also be written as ∑(xᵢ − x̄)(yᵢ − ȳ), ∑(xᵢ − x̄)yᵢ, or ∑xᵢ(yᵢ − ȳ).

2. Show that the ‘residual sum of squares’, ∑ᵢ₌₁ⁿ (yᵢ − ŷᵢ)², satisfies the following identity:

∑ᵢ₌₁ⁿ (yᵢ − ŷᵢ)² = ∑ᵢ₌₁ⁿ (yᵢ − β̂₀ − β̂₁xᵢ)² = ∑ᵢ₌₁ⁿ (yᵢ − ȳ)² − β̂₁ ∑ᵢ₌₁ⁿ (xᵢ − x̄)(yᵢ − ȳ).

3. For the captopril data, find the least squares lines
(a) to predict SBP before captopril from DBP before captopril,
(b) to predict SBP after captopril from DBP after captopril,
(c) to predict DBP before captopril from SBP before captopril.
Compare these three lines.
Discuss whether it is sensible to combine the before and after measurements in order to obtain a better prediction of SBP at a given time from DBP measured at that time.

4. Illustrate the geometrical interpretation of least squares (see above comments) in the following two cases:
(a) model E[Y|x] = β₀ + β₁x with 3 data points (x₁, y₁), (x₂, y₂) and (x₃, y₃),
(b) model E[Y|x] = βx with 2 data points (x₁, y₁) and (x₂, y₂).
What does Pythagoras’ theorem tell us in the second case?



5. Suppose that the random variables X and Y have the joint density

f_{X,Y}(x, y) = 1/(2π σ_X σ_Y √(1 − ρ²)) × exp( −(1/(2(1 − ρ²))) [ ((x − µ_X)/σ_X)² − 2ρ((x − µ_X)/σ_X)((y − µ_Y)/σ_Y) + ((y − µ_Y)/σ_Y)² ] ).

(a) Show, by substituting

v = (x − µ_X)/σ_X, followed by w = (v − ρ(y − µ_Y)/σ_Y)/√(1 − ρ²),

or otherwise, that Y ∼ N(µ_Y, σ_Y²).

(b) Hence or otherwise show that the conditional distribution of Y given X = x is Normal with mean µ_Y + (ρσ_Y/σ_X)(x − µ_X) and variance σ_Y²(1 − ρ²).

(c) Find E[XY], and show that the symbols µ_X, µ_Y, σ_X, σ_Y and ρ have their usual interpretations of means, standard deviations, and correlation.

(d) Discuss briefly the similarities and differences between the above model for (X, Y), and the Simple Normal Linear Regression Model for Y given X.

From Warwick ST217 exam 2003

6.5 The Normal Linear Model (NLM)

6.5.1 Introduction

Definition 6.5 (NLM)
Given n response RVs Yᵢ (i = 1, 2, …, n), with corresponding values of explanatory variables xᵢᵀ, the NLM makes the following assumptions:

1. (Conditional) Independence
The Yᵢ are mutually independent given the xᵢᵀ.

2. Linearity
The expected value of the response variable is linearly related to the unknown parameters β:
EYᵢ = xᵢᵀβ.

3. Normality
The random variation of Yᵢ|xᵢ is Normally distributed.

4. Homoscedasticity (Equal Variances)

i.e. Yᵢ|xᵢ ∼ N(xᵢᵀβ, σ²).

6.5.2 Matrix Formulation of NLM

The NLM for responses y = (y₁, y₂, …, yₙ)ᵀ can be recast as follows:

1. E[Y] = Xβ for some parameter vector β = (β₁, β₂, …, βₚ)ᵀ,
2. ε = Y − E[Y] ∼ MVN(0, σ²I), where I is the (n × n) identity matrix.

It can be shown that the least squares estimates of β are given by solving the simultaneous linear equations (the normal equations)

Xᵀy = XᵀXβ,    (6.9)

with solution (assuming that XᵀX is nonsingular)

β̂ = (XᵀX)⁻¹Xᵀy.    (6.10)

Comments

1. Note that, by formula (6.10), each estimator β̂ⱼ is a linear combination of the Yᵢs. Therefore under the NLM, β̂ has a MVN distribution.

2. Even if the Normality assumption doesn’t hold, the CLT implies that, provided the number n of cases is large, the distribution of the estimator β̂ will still be approximately MVN.

3. The most important assumption is independence, since it’s relatively easy to modify the standard NLM to account for
• nonlinearity: transform the data, or include e.g. xᵢⱼ² as an explanatory variable,
• unequal variances (‘heteroscedasticity’): e.g. transform from yᵢ − ŷᵢ to zᵢ = (yᵢ − ŷᵢ)/σ̂ᵢ,
• non-Normality: transform, or simply get more data!

4. In the general formulation the constant term β₀ is omitted, though in practice the first column of the matrix X will often contain 1’s and the corresponding parameter β₁ will be the ‘constant term’.

5. The corresponding fitted values are ŷ = Xβ̂, and the vector of residuals is r = y − ŷ, i.e. rᵢ = yᵢ − ŷᵢ, where ŷᵢ = xᵢᵀβ̂ = ∑ⱼ₌₁ᵖ xᵢⱼβ̂ⱼ.

Definition 6.6 (RSS)
The residual sum of squares (RSS) in the fitted NLM is

s² = ∑ᵢ₌₁ⁿ (yᵢ − ŷᵢ)² = (y − Xβ̂)ᵀ(y − Xβ̂).    (6.11)

Important Fact about the RSS
Considering the RSS s² to be the observed value of a corresponding RV S², it can be shown that
• S²/σ² ∼ χ² on n − p degrees of freedom,
• S² is independent of β̂.
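This fact is easy to probe by simulation. The sketch below (an addition to these notes) repeatedly simulates data from an NLM, computes the RSS, and checks that the average of S²/σ² is close to the mean n − p of the stated chi-squared distribution.

    import numpy as np

    rng = np.random.default_rng(0)
    n, p, sigma = 30, 3, 2.0
    X = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
    beta = np.array([1.0, 0.5, -0.3])

    stats = []
    for _ in range(5000):
        y = X @ beta + sigma * rng.normal(size=n)
        beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)  # least squares fit
        stats.append(np.sum((y - X @ beta_hat) ** 2) / sigma ** 2)

    # Under the NLM, S^2/sigma^2 has a chi-squared distribution on n - p d.f.,
    # so the mean of the simulated values should be close to n - p = 27.
    print(np.mean(stats))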

Exercise 6.2
(a) Show that the log-likelihood function for the NLM is

(constant) − (n/2) log(σ²) − (1/(2σ²))(y − Xβ)ᵀ(y − Xβ).    (6.12)

(b) Show that the maximum likelihood estimate of β is identical to the least squares estimate. What is the distribution of β̂?

(c) Show that the MLE σ̂² of σ² is

σ̂² = s²/n.    (6.13)

What are the mean and variance of σ̂²?

(d) Show that an unbiased estimator of σ² is given by the formula

(Residual Sum of Squares) / (Residual Degrees of Freedom).
‖

6.5.3 Examples of the NLM

1. Simple Linear Regression (again)

Yᵢ = β₀ + β₁xᵢ + εᵢ,    (6.14)

where the εᵢ are IID N(0, σ²).

2. Two-sample t-test

y = (x₁, x₂, …, xₘ, y₁, …, yₙ)ᵀ,
X = the (m + n) × 2 matrix whose first m rows are (1, 0) and whose last n rows are (0, 1),
β = (β₀, β₁)ᵀ,    (6.15)

and we’re interested in the hypothesis H₀: (β₀ − β₁) = 0.

3. Paired t-test

Some quantity Y is measured on each of n individuals under 2 different conditions (e.g. drugs A and B), and we want to test whether the mean of Y can be assumed equal in both circumstances.

y = (y₁₁, y₂₁, …, yₙ₁, y₁₂, y₂₂, …, yₙ₂)ᵀ,
X = the 2n × (n + 1) matrix whose first n rows are the identity matrix Iₙ with a final column of 0s, and whose last n rows are Iₙ with a final column of 1s,
β = (α₁, α₂, …, αₙ, δ)ᵀ,    (6.16)

where δ is the difference between the expected responses under the two conditions, and the αᵢ are ‘nuisance parameters’ representing the overall level of response for the ith individual.

The null hypothesis is H₀: δ = 0.

4. Multiple Regression (example thereof)

Y = SBP after captopril, x₁ = SBP before captopril, x₂ = DBP before captopril,

y = (201, 165, 166, 157, 147, 145, 168, 180, 147, 136, 151, 168, 179, 129, 131)ᵀ,

X is the 15 × 3 matrix with rows

1  210  130
1  169  122
1  187  124
1  160  104
1  167  112
1  176  101
1  185  121
1  206  124
1  173  115
1  146  102
1  174   98
1  201  119
1  198  106
1  148  107
1  154  100

and β = (β₀, β₁, β₂)ᵀ,    (6.17)

where (roughly speaking) β₁ represents the increase in EY per unit increase in SBP before captopril (x₁), allowing for the fact that EY also depends partly on DBP before captopril (x₂), and β₂ has a similar interpretation in terms of the effect of x₂ allowing for x₁.

In all the above examples, it’s straightforward to calculate β̂ = (XᵀX)⁻¹Xᵀy, and also (for example) to calculate the sampling distribution of β̂ᵢ under the null hypothesis H₀: βᵢ = 0.

Exercise 6.3
Verify the following calculations from the data given in (6.17) above:

         15      2654     1685
XᵀX =   2654   475502   300137
         1685   300137   190817

Xᵀy = (2370, 424523, 268373)ᵀ,

              8.563      −0.009165   −0.06120
(XᵀX)⁻¹ =   −0.009165    0.0003026  −0.0003951
             −0.06120    −0.0003951   0.001167

β̂ = (−20.7, 0.724, 0.450)ᵀ.
‖
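These numbers are easy to check by machine. Here is a brief numpy sketch (an addition to these notes) that reproduces the calculations of Exercise 6.3 from the data in (6.17).

    import numpy as np

    # Captopril data from (6.17): y = SBP after treatment;
    # the columns of X are a constant, SBP before, and DBP before.
    y = np.array([201, 165, 166, 157, 147, 145, 168, 180,
                  147, 136, 151, 168, 179, 129, 131], dtype=float)
    sbp = np.array([210, 169, 187, 160, 167, 176, 185, 206,
                    173, 146, 174, 201, 198, 148, 154], dtype=float)
    dbp = np.array([130, 122, 124, 104, 112, 101, 121, 124,
                    115, 102, 98, 119, 106, 107, 100], dtype=float)
    X = np.column_stack([np.ones_like(sbp), sbp, dbp])

    XtX = X.T @ X
    Xty = X.T @ y
    beta_hat = np.linalg.solve(XtX, Xty)  # solves the normal equations (6.9)

    print(XtX)       # matches the matrix in Exercise 6.3
    print(Xty)       # (2370, 424523, 268373)
    print(beta_hat)  # approximately (-20.7, 0.724, 0.450)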

6.6 Checking Assumptions of the NLM

Clearly it’s very important in practice to check that your assumptions seem reasonable; there are various ways to do this.

6.6.1 Formal Hypothesis Testing

χ² tests are not very powerful, but are simple and general: count the number of data points satisfying various (exhaustive & mutually exclusive) conditions, and compare with the expected counts under your assumptions.

Other tests, for example to test for Normality, have been devised. However, a general problem with statistical tests is that they don’t usually suggest what to do if your null hypothesis is rejected.

Exercise 6.4
How might you use a χ² test to check whether SBP after captopril is independent of SBP before captopril?
‖



Exercise 6.5
A possible test for linearity in the simple Normal linear regression model (i.e. the NLM with just one explanatory variable x) is to fit the quadratic NLM

EY = β₀ + β₁x + β₂x²    (6.18)

and test the null hypothesis H₀: β₂ = 0.

Suppose that Y is SBP and x is dose of drug, and that you have rejected the above null hypothesis. Comment on the advisability of using Formula (6.18) for predicting Y given x.
‖

6.6.2 Graphical Methods and Residuals

If all the assumptions of the NLM are valid, then the residuals

rᵢ = yᵢ − ŷᵢ = yᵢ − xᵢᵀβ̂    (6.19)

should resemble observations on IID Normal random variables. Therefore plots of rᵢ against ANYTHING should be patternless.

SEE LECTURE

Comments

1. Before fitting a formal statistical model (including e.g. performing a t-test), you should plot the data, particularly the response variable against each explanatory variable.

2. After fitting a model, produce several residual plots. The computer is your friend!

3. Note that it’s the residual plots that are most informative. For example, the NLM DOESN’T assume that the Yᵢ are Normally distributed about µ_Y, but DOES assume that each Yᵢ is Normally distributed about E[Yᵢ|xᵢ]; i.e. it’s the conditional distributions, not the marginal distributions, that are important.
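In the spirit of comment 2, the following sketch (an addition to these notes, assuming matplotlib is available) fits the multiple regression of Example 4 and plots residuals against fitted values; under the NLM the plot should look patternless.

    import numpy as np
    import matplotlib.pyplot as plt

    # Captopril data from (6.17)
    y = np.array([201, 165, 166, 157, 147, 145, 168, 180,
                  147, 136, 151, 168, 179, 129, 131], dtype=float)
    sbp = np.array([210, 169, 187, 160, 167, 176, 185, 206,
                    173, 146, 174, 201, 198, 148, 154], dtype=float)
    dbp = np.array([130, 122, 124, 104, 112, 101, 121, 124,
                    115, 102, 98, 119, 106, 107, 100], dtype=float)
    X = np.column_stack([np.ones_like(sbp), sbp, dbp])

    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)  # least squares fit
    fitted = X @ beta_hat
    resid = y - fitted

    plt.scatter(fitted, resid)
    plt.axhline(0, linestyle="--")
    plt.xlabel("fitted values")
    plt.ylabel("residuals")
    plt.title("Residuals vs fitted values: look for patterns")
    plt.show()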



6.7 Problems

1. Show that the following is an equivalent formulation of the two-sample t-test to that given above in Formulae (6.15):

Y = (x₁, x₂, …, xₘ, y₁, …, yₙ)ᵀ,
X = the (m + n) × 2 matrix whose first m rows are (1, 0) and whose last n rows are (1, 1),
β = (β₀, β₁)ᵀ,    (6.20)

with null hypothesis H₀: β₁ = 0.

2. Independent samples of 10 U.S. men aged 25–34 years, and 15 U.S. men aged 45–54 years were taken. Their heights (in inches) were as follows:

(a) Age 25–34:
73.3 64.8 72.1 68.9 68.7 70.4 66.8 70.7 74.4 71.8

(b) Age 45–54:
73.2 68.5 62.4 65.5 71.3 69.5 74.5 70.6 69.3 67.1 64.7 73.0 66.7 68.1 64.3

Use a two-sample t-test to test the hypothesis that the population means of the two age-groups are equal (the 90%, 95%, 97.5%, and 99% points of the t₂₃ distribution are 1.319, 1.714, 2.069 and 2.500 respectively).

Comment on whether the underlying assumptions of the two-sample t-test appear reasonable for this set of data.

Comment also on whether the data can be used to suggest that the population of the U.S. has (or hasn’t) tended to get taller over the last 20 years.

3. Verify that the least squares estimates in simple linear regression,

β̂₁ = (∑xᵢyᵢ − n x̄ȳ) / (∑xᵢ² − n x̄²),    β̂₀ = ȳ − β̂₁x̄,

are a special case of the general formula β̂ = (XᵀX)⁻¹Xᵀy.

4. The following data-set shows average January minimum temperature in degrees Fahrenheit (y), together with Latitude (x₁) and Longitude (x₂) for 28 US cities. Plot y against x₁, and comment on what this plot suggests about the reasonableness of the various assumptions underlying the NLM for predicting y from x₁ and x₂.

 y   x₁    x₂       y   x₁    x₂       y   x₁    x₂
44  31.2   88.5    38  32.9   86.8    35  33.6  112.5
31  35.4   92.8    47  34.3  118.7    42  38.4  123.0
15  40.7  105.3    22  41.7   73.4    26  40.5   76.3
30  39.7   77.5    45  31.0   82.3    65  25.0   82.0
58  26.3   80.7    37  33.9   85.0    22  43.7  117.1
19  42.3   88.0    21  39.8   86.9    11  41.8   93.6
22  38.1   97.6    27  39.0   86.5    45  30.8   90.2
12  44.2   70.5    25  39.7   77.3    23  42.7   71.4
21  43.1   83.9     2  45.9   93.9    24  39.3   90.5
 8  47.1  112.4

Data from HSDS, set 262


5. (a) Assuming the model

E[Y|x] = β₀ + β₁x,    Var[Y|x] = σ² independently of x,

derive formulae for the least squares estimates β̂₀ and β̂₁ from data (xᵢ, yᵢ), i = 1, …, n.

What advantages are gained if the corresponding random variables Yᵢ|xᵢ can be assumed to be independently Normally distributed?

(b) The following table shows the tensile strength (y) of different batches of cement after being ‘cured’ (dried) for various lengths of time x: 3 batches were cured for 1 day, 3 for 2 days, 5 for 3 days, etc. The batch means and standard deviations (s.d.) are also given.

Curing time   Tensile strength
(days) x      (kg/cm²) y                     mean   s.d.
 1            13.0 13.3 11.8                 12.7   0.8
 2            21.9 24.5 24.7                 23.7   1.6
 3            29.8 28.0 24.1 24.1 26.2       26.5   2.5
 7            32.4 30.4 34.5 33.1 35.7       33.2   2.0
28            41.8 42.6 40.3 35.7 37.3       40.0   3.0

Plot y against x and discuss briefly how reasonable seem each of the following assumptions:
(i) linearity: E[Yᵢ|xᵢ] = β₀ + β₁xᵢ for some constants β₀ and β₁,
(ii) independence: the Yᵢ are mutually independent given the xᵢ.

If conditional independence (ii) is assumed true, then how reasonable here are the further assumptions:
(iii) homoscedasticity: Var[Yᵢ|xᵢ] = σ² for all i = 1, …, n,
(iv) Normality: the random variables Yᵢ are each Normally distributed?

Say briefly whether you consider any of the above assumptions (i)–(iv) would be more plausible following
(A) transforming from y to y′ = logₑ(y), and/or
(B) transforming x in an appropriate way.

NOTE: you do not need to carry out numerical calculations such as finding the least-squares fit explicitly.

From Warwick ST217 exam 2000


6. To monitor an industrial process for converting ammonia to nitric acid, the percentage of ammonia lost (y) was measured on each of 21 consecutive days, together with explanatory variables representing air flow (x₁), cooling water temperature (x₂) and acid concentration (x₃). The data, together with the residuals after fitting the model ŷ = 3.614 + 0.072x₁ + 0.130x₂ − 0.152x₃, are given in the following table:

Day    y    Air Flow   Water Temp.   Acid Conc.   Resid.
            (x₁)       (x₂)          (x₃)
 1    4.2     80          27            58.9       0.323
 2    3.7     80          27            58.8      −0.192
 3    3.7     75          25            59.0       0.456
 4    2.8     62          24            58.7       0.570
 5    1.8     62          22            58.7      −0.171
 6    1.8     62          23            58.7      −0.301
 7    1.9     62          24            59.3      −0.239
 8    2.0     62          24            59.3      −0.139
 9    1.5     58          23            58.7      −0.314
10    1.4     58          18            58.0       0.127
11    1.4     58          18            58.9       0.264
12    1.3     58          17            58.8       0.278
13    1.1     58          18            58.2      −0.143
14    1.2     58          19            59.3      −0.005
15    0.8     50          18            58.9       0.236
16    0.7     50          18            58.6       0.091
17    0.8     50          19            57.2      −0.152
18    0.8     50          19            57.9      −0.046
19    0.9     50          20            58.0      −0.060
20    1.5     56          20            58.2       0.141
21    1.5     70          20            59.1      −0.724

Some residual plots are shown in Fig. 6.1.

(a) Discuss whether the pattern of residuals casts doubt on any of the assumptions underlying the Normal Linear Model (NLM).

Describe any further plots or calculations that you think would help you assess whether the fitted NLM is appropriate here.


(b) Various suggestions could be made for improving the model, such as
i. transforming the response (e.g. to log y or to y/x₁),
ii. transforming some or all of the explanatory variables,
iii. deleting outliers,
iv. including quadratic or even higher-order terms (e.g. x₂²),
v. including interaction terms (e.g. x₁x₃),
vi. carrying out a nonparametric analysis of the data,
vii. applying a bootstrap procedure,
viii. fitting a nonlinear model.

Outline the merits and disadvantages of each of these suggestions here. What would be your next step in analysing this data-set?

Figure 6.1: Residual plots

From Warwick ST217 exam 1999


7. Table 6.1, originally from Narula & Wellington (1977), shows data on selling prices of 28 houses in Erie, Pennsylvania, together with explanatory variables that could be used to predict the selling price. The variables are:

X₁ = current taxes (local, school and county) ÷ 100,
X₂ = number of bathrooms,
X₃ = lot size ÷ 1000 (square feet),
X₄ = living space ÷ 1000 (square feet),
X₅ = number of garage spaces,
X₆ = number of rooms,
X₇ = number of bedrooms,
X₈ = age of house (years),
X₉ = number of fireplaces,
Y = actual sale price ÷ 1000 (dollars).

Find a function of X₁–X₉ that predicts Y reasonably accurately (such functions are used to fix property taxes, which should be based on the current market value of each property).

X₁        X₂   X₃       X₄      X₅   X₆   X₇   X₈   X₉    Y
 4.9176   1.0   3.4720  0.9980  1.0   7    4   42   0   25.9
 5.0208   1.0   3.5310  1.5000  2.0   7    4   62   0   29.5
 4.5429   1.0   2.2750  1.1750  1.0   6    3   40   0   27.9
 4.5573   1.0   4.0500  1.2320  1.0   6    3   54   0   25.9
 5.0597   1.0   4.4550  1.1210  1.0   6    3   42   0   29.9
 3.8910   1.0   4.4550  0.9880  1.0   6    3   56   0   29.9
 5.8980   1.0   5.8500  1.2400  1.0   7    3   51   1   30.9
 5.6039   1.0   9.5200  1.5010  0.0   6    3   32   0   28.9
15.4202   2.5   9.8000  3.4200  2.0  10    5   42   1   84.9
14.4598   2.5  12.8000  3.0000  2.0   9    5   14   1   82.9
 5.8282   1.0   6.4350  1.2250  2.0   6    3   32   0   35.9
 5.3003   1.0   4.9883  1.5520  1.0   6    3   30   0   31.5
 6.2712   1.0   5.5200  0.9750  1.0   5    2   30   0   31.0
 5.9592   1.0   6.6660  1.1210  2.0   6    3   32   0   30.9
 5.0500   1.0   5.0000  1.0200  0.0   5    2   46   1   30.0
 8.2464   1.5   5.1500  1.6640  2.0   8    4   50   0   36.9
 6.6969   1.5   6.9020  1.4880  1.5   7    3   22   1   41.9
 7.7841   1.5   7.1020  1.3760  1.0   6    3   17   0   40.5
 9.0384   1.0   7.8000  1.5000  1.5   7    3   23   0   43.9
 5.9894   1.0   5.5200  1.2560  2.0   6    3   40   1   37.5
 7.5422   1.5   4.0000  1.6900  1.0   6    3   22   0   37.9
 8.7951   1.5   9.8900  1.8200  2.0   8    4   50   1   44.5
 6.0931   1.5   6.7265  1.6520  1.0   6    3   44   0   37.9
 8.3607   1.5   9.1500  1.7770  2.0   8    4   48   1   38.9
 8.1400   1.0   8.0000  1.5040  2.0   7    3    3   0   36.9
 9.1416   1.5   7.3262  1.8310  1.5   8    4   31   0   45.8
12.0000   1.5   5.0000  1.2000  2.0   6    3   30   1   41.0

Table 6.1: House price data
Weisberg (1980)


8. The numbers of ‘hits’ recorded on J.E.H.Shaw’s WWW homepage in late 1999 are given below. ‘Local’ means the homepage was accessed from within Warwick University, ‘Remote’ means it was accessed from outside. Data for the week beginning 7–Nov–1999 were unavailable. Note that there was an exam on Wednesday 8–Dec–1999 for the course ST104, taught by J.E.H.Shaw.

Week         Number of Hits
Beginning    Local   Remote   Total
26 Sept          0      182     182
 3 Oct          35      253     288
10 Oct         901      315    1216
17 Oct         641      443    1084
24 Oct        1549      525    2074
31 Oct         823      344    1167
 7 Nov           —        —       —
14 Nov        1136      383    1519
21 Nov        2114      584    2698
28 Nov        2097      536    2633
 5 Dec        3732      461    4193
12 Dec           5      352     357
19 Dec           0      296     296

(a) Fit a linear least-squares regression line to predict the number of remote hits (Y) in a week from the observed number x of local hits.

(b) Calculate the residuals and plot them against date. Does the plot give any evidence that the interrelationship between X and Y changes over time?

(c) Using both general considerations and residual plots, comment on how reasonable here are the assumptions underlying the simple Normal linear regression model, and suggest possible ways to improve the prediction of Y.

9. The following table shows the assets x (billions of dollars) and net income y (millions of dollars) for the 20 largest US banks in 1973.

Bank    x      y     Bank    x      y     Bank    x     y     Bank   x     y
 1     49.0  218.8    6     14.2   63.6   11    11.6  42.9   16    6.7  42.7
 2     42.3  265.6    7     13.5   96.9   12     9.5  32.4   17    6.0  28.9
 3     36.3  170.9    8     13.4   60.9   13     9.4  68.3   18    4.6  40.7
 4     16.4   85.9    9     13.2  144.2   14     7.5  48.6   19    3.8  13.8
 5     14.9   88.1   10     11.8   53.6   15     7.2  32.2   20    3.4  22.2

(a) Plot income (y) against assets (x), and also log(income) against log(assets).

(b) Verify that the least squares fit regression lines are
fit 1: y = 4.987x + 7.57,
fit 2: log(y) = 0.963 log(x) + 1.782 (Note: logs to base e),
and show the fitted lines on your plots.

(c) Produce Normal probability plots of the residuals from each fit.

(d) Which (if either) of these models would you use to describe the relationship between total assets and net income? Why?

(e) Bank number 19 (the Franklin National Bank) failed in 1974, and was the largest ever US bank to fail. Identify the point representing this bank on each of your plots, and discuss briefly whether, from the data presented, one might have expected beforehand that the Franklin National Bank was in trouble.


10. The following data show the blood alcohol levels (mg/100ml) at post mortem for traffic accident victims. Blood samples in each case were taken from the leg (A) and from the heart (B). Do these results indicate that blood alcohol levels differ systematically between samples from the leg and the heart?

Case    A     B     Case    A     B
  1     44    44     11    265   277
  2    265   269     12     27    39
  3    250   256     13     68    84
  4    153   154     14    230   228
  5     88    83     15    180   187
  6    180   185     16    149   155
  7     35    36     17    286   290
  8    494   502     18     72    80
  9    249   249     19     39    50
 10    204   208     20    272   290

Osborn (1979) 4.6.5

11. (a) Assume the linear model

E[Y|X] = Xβ,    Var[Y|X] = σ²Iₙ,

where Iₙ denotes the n × n identity matrix, and XᵀX is nonsingular. By writing Y − Xβ = (Y − Xβ̂) + X(β̂ − β), or otherwise, show that for this model the residual sum of squares

(Y − Xβ)ᵀ(Y − Xβ)

is minimised at β = β̂ = (XᵀX)⁻¹XᵀY.

(b) Show that E[β̂] = β and that Var[β̂] = σ²(XᵀX)⁻¹.

(c) Let A = X(XᵀX)⁻¹Xᵀ. Show that A and Iₙ − A are both idempotent, i.e. AA = A and (Iₙ − A)(Iₙ − A) = Iₙ − A.

(d) For the particular case of a Normal linear model, find the joint distribution of the fitted values Ŷ = Xβ̂, and show that Y − Ŷ is independent of Ŷ. Quote carefully any properties of the Normal distribution you use.

(e) For the simple linear regression model (EYᵢ = β₀ + β₁xᵢ), write down the corresponding matrix X and vector Y, find (XᵀX)⁻¹, and hence find the least squares estimates β̂₀ and β̂₁ and their variances.

From Warwick ST217 exam 2001



6.8 The Analysis of Variance (ANOVA)

6.8.1 One-Way Analysis of Variance: Introduction

This is a generalization of the two-sample t-test to p > 2 groups.

Suppose there are observations yᵢⱼ (j = 1, 2, …, nᵢ) in the ith group (i = 1, 2, …, p), and let n = n₁ + n₂ + ⋯ + nₚ denote the total number of observations. Denote the corresponding RVs by Yᵢⱼ, and assume that Yᵢⱼ ∼ N(βᵢ, σ²) independently.

Traditionally the main aim has been to test the null hypothesis

H₀: β₁ = β₂ = ⋯ = βₚ,
i.e. β = β₀ = (β₀, β₀, …, β₀)ᵀ.

The idea is to fit MLEs β̂ and β̂₀ and apply a likelihood ratio test, i.e. test whether the ratio

(change in RSS)/RSS = (squared distance from ŷ to ŷ₀)/(squared distance from y to ŷ)

(where ŷ and ŷ₀ are the corresponding fitted values) is larger than would be expected by chance.

A useful notation for group means etc. uses overbars and ‘+’ suffices as follows:

ȳᵢ₊ = (1/nᵢ) ∑ⱼ₌₁^{nᵢ} yᵢⱼ,    ȳ₊₊ = (1/n) ∑ᵢ₌₁ᵖ ∑ⱼ₌₁^{nᵢ} yᵢⱼ = (1/n) ∑ᵢ₌₁ᵖ nᵢȳᵢ₊,    etc.

The underlying models fit naturally in the NLM framework:

Definition 6.7 (One-Way ANOVA)
The one-way ANOVA model is a NLM of the form

Y ∼ MVN(Xβ, σ²I),    (6.21)

where Y = (Y₁, Y₂, …, Yₙ)ᵀ, β = (β₁, β₂, …, βₚ)ᵀ, and X is the n × p indicator matrix whose first n₁ rows are (1, 0, 0, …, 0), next n₂ rows are (0, 1, 0, …, 0), …, and last nₚ rows are (0, 0, 0, …, 1), with n₁ + n₂ + ⋯ + nₚ = n.

Exercise 6.6
Show that for one-way ANOVA, XᵀX = diag(n₁, n₂, …, nₚ), and hence β̂ = (Ȳ₁₊, Ȳ₂₊, …, Ȳₚ₊)ᵀ.
‖



6.8.2 One-Way Analysis of Variance: ANOVA Table

Let

β₀ = E[Ȳ₊₊] = (1/n) ∑ᵢ₌₁ᵖ ∑ⱼ₌₁^{nᵢ} EYᵢⱼ = (1/n) ∑ᵢ₌₁ᵖ nᵢβᵢ,
αᵢ = βᵢ − β₀    (i = 1, 2, …, p).

Typically the p groups correspond to p different treatments, and αᵢ is then called the ith treatment effect. We’re interested in the hypotheses

H₀: αᵢ = 0 (i = 1, 2, …, p),
H₁: the αᵢ are arbitrary.

Note that
1. Ȳ₊₊ is the MLE of β₀ under H₀,
2. Ȳᵢ₊ is the MLE of β₀ + αᵢ, i.e. the mean response given the ith treatment.

Hence the fitted values under H₀ and H₁ are given by Ȳ₊₊ and Ȳᵢ₊ respectively.

If we also include the ‘null model’ that all the βᵢ are zero, then the possible models of interest are:

Model                                   # params   DF      RSS
βᵢ = 0 ∀i, i.e. ŷᵢⱼ = 0                    0        n       ∑ᵢ,ⱼ yᵢⱼ²              (1)
βᵢ = β₀ ∀i, i.e. ŷᵢⱼ = ȳ₊₊                 1        n − 1   ∑ᵢ,ⱼ (yᵢⱼ − ȳ₊₊)²      (2)
βᵢ arbitrary, i.e. ŷᵢⱼ = ȳᵢ₊               p        n − p   ∑ᵢ,ⱼ (yᵢⱼ − ȳᵢ₊)²      (3)

The calculations needed to test H₀, involving the RSS formulae given above, can be conveniently presented in an ‘ANOVA table’:

Source of      Degrees of     Sum of squares                       Mean square
variation      freedom (DF)   (SS)                                 (MS) = SS/DF
Overall mean   1              (1)−(2) = nȳ₊₊²                      nȳ₊₊²
Treatment      p − 1          (2)−(3) = ∑ᵢ nᵢ(ȳᵢ₊ − ȳ₊₊)²          ∑ᵢ nᵢ(ȳᵢ₊ − ȳ₊₊)² / (p − 1)
Residual       n − p          (3) = ∑ᵢ,ⱼ (yᵢⱼ − ȳᵢ₊)²              ∑ᵢ,ⱼ (yᵢⱼ − ȳᵢ₊)² / (n − p)
Total          n              (1) = ∑ᵢ,ⱼ yᵢⱼ²

Finally, calculate the ‘F ratio’

F = Treatment MS / Residual MS = [Treatment SS/(p − 1)] / [Residual SS/(n − p)],    (6.22)

which, under H₀, has an F distribution on (p − 1) and (n − p) d.f. Large values of F are evidence against H₀.

Note: DON’T try too hard to remember formulae for sums of squares in an ANOVA table. Instead THINK OF THE MODELS BEING FITTED. The ‘lack of fit’ of each model is given by the corresponding RSS, & the formulae for the differences in RSS simplify.
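As a computational companion (an addition to these notes), here is a minimal Python sketch that builds the one-way ANOVA F ratio (6.22) directly from the RSS differences, using small made-up groups.

    import numpy as np
    from scipy.stats import f as f_dist

    # Hypothetical data: p = 3 treatment groups of unequal sizes
    groups = [np.array([12.7, 13.1, 11.9]),
              np.array([15.2, 14.8, 16.0, 15.5]),
              np.array([13.9, 14.2, 13.5])]

    y = np.concatenate(groups)
    n, p = len(y), len(groups)

    rss_common = np.sum((y - y.mean()) ** 2)                       # model (2): common mean
    rss_groups = sum(np.sum((g - g.mean()) ** 2) for g in groups)  # model (3): separate means

    treatment_ms = (rss_common - rss_groups) / (p - 1)             # change in RSS over its d.f.
    residual_ms = rss_groups / (n - p)

    F = treatment_ms / residual_ms                                 # the F ratio (6.22)
    print("F =", round(F, 2), "p =", round(f_dist.sf(F, p - 1, n - p), 4))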



6.9 Problems

1. Show that the formulae for sums of squares in one-way ANOVA simplify:

∑ᵢ₌₁ᵖ nᵢ(Ȳᵢ₊ − Ȳ₊₊)² = ∑ᵢ₌₁ᵖ nᵢȲᵢ₊² − nȲ₊₊²,
∑ᵢ₌₁ᵖ ∑ⱼ₌₁^{nᵢ} (Yᵢⱼ − Ȳᵢ₊)² = ∑ᵢ₌₁ᵖ ∑ⱼ₌₁^{nᵢ} Yᵢⱼ² − ∑ᵢ₌₁ᵖ nᵢȲᵢ₊².

2. (a) Define the Normal Linear Model, and describe briefly how each of its assumptions may be informally checked by plotting residuals.

(b) The following data summarise the number of days survived by mice inoculated with three strains of typhoid (31 mice with ‘9D’, 60 mice with ‘11C’ and 133 mice with ‘DSCI’).

Days to   Numbers of Mice Inoculated with…
Death       9D    11C   DSCI   Total
  2          6     1      3      10
  3          4     3      5      12
  4          9     3      5      17
  5          8     6      8      22
  6          3     6     19      28
  7          1    14     23      38
  8               11     22      33
  9                4     14      18
 10                6     14      20
 11                2      7       9
 12                3      8      11
 13                1      4       5
 14                       1       1
Total       31    60    133     224
∑Xᵢ        125   442   1037    1604
∑Xᵢ²       561  3602   8961   13124

(Xᵢ is the survival time of the ith mouse in the given group.)

Without carrying out any calculations, discuss briefly how reasonable seem the assumptions underlying a one-way ANOVA on the data, and whether a transformation of the data may be appropriate.

(c) Carry out a one-way ANOVA on the untransformed data. What do you conclude about the responses to the three strains of typhoid?

From Warwick ST217 exam 1997

3. The amount of nitrogen-bound bovine serum albumin produced by three groups of mice was measured. The groups were: normal mice treated with a placebo (i.e. an inert substance), alloxan-diabetic mice treated with a placebo, and alloxan-diabetic mice treated with insulin. The resulting data are shown in the following table:

91


Normal      Alloxan-diabetic   Alloxan-diabetic
+ placebo   + placebo          + insulin
156         391                82
282         46                 100
197         469                98
297         86                 150
116         174                243
127         133                68
119         13                 228
29          499                131
253         168                73
122         62                 18
349         127                20
110         276                100
143         176                72
64          146                133
26          108                465
86          276                40
122         50                 46
455         73                 34
655                            44
14

(a) Produce appropriate graphical display(s) and numerical summaries of these data, and comment on what can be learnt from these.

(b) Carry out a one-way analysis of variance on the three groups. You may feel it necessary to transform the data first.

Data from HSDS, set 304

4. The following table shows measurements of the steady-state haemoglobin levels for patients with different types of sickle-cell anaemia ('HB SS', 'HB S/β-thalassaemia' and 'HB SC'). Construct an ANOVA table and hence test whether the steady-state haemoglobin levels differ between the three types.

HB SS   HB S/β-thalassaemia   HB SC
7.2     8.1                   10.7
7.7     9.2                   11.3
8.0     10.0                  11.5
8.1     10.4                  11.6
8.3     10.6                  11.7
8.4     10.9                  11.8
8.4     11.1                  12.0
8.5     11.9                  12.1
8.6     12.0                  12.3
8.7     12.1                  12.6
9.1                           12.6
9.1                           13.3
9.1                           13.3
9.8                           13.8
10.1                          13.9
10.3

Data from HSDS, set 310

92


5. The data in Table 6.2, collected by Brian Everitt, are described in HSDS as being the 'weights, in kg, of young girls receiving three different treatments for anorexia over a fixed period of time with the control group receiving the standard treatment'.

(a) Using a one-way ANOVA on the weight gains, compare the three methods of treatment.

(b) Plot the data so as to clarify the effects of the three treatments, and discuss whether the above formal analysis was appropriate.

Cognitive behavioural
treatment               Control                 Family therapy
Weight                  Weight                  Weight
before    after         before    after         before    after
80.5      82.2          80.7      80.2          83.8      95.2
84.9      85.6          89.4      80.1          83.3      94.3
81.5      81.4          91.8      86.4          86.0      91.5
82.6      81.9          74.0      86.3          82.5      91.9
79.9      76.4          78.1      76.1          86.7      100.3
88.7      103.6         88.3      78.1          79.6      76.7
94.9      98.4          87.3      75.1          76.9      76.8
76.3      93.4          75.1      86.7          94.2      101.6
81.0      73.4          80.6      73.5          73.4      94.9
80.5      82.1          78.4      84.6          80.5      75.2
85.0      96.7          77.6      77.4          81.6      77.8
89.2      95.3          88.7      79.5          82.1      95.5
81.3      82.4          81.3      89.6          77.6      90.7
76.5      72.5          78.1      81.4          83.5      92.5
70.0      90.9          70.5      81.8          89.9      93.8
80.4      71.3          77.3      77.3          86.0      91.7
83.3      85.4          85.2      84.2          87.3      98.0
83.0      81.6          86.0      75.4
87.7      89.1          84.1      79.5
84.2      83.9          79.7      73.0
86.4      82.7          85.5      88.3
76.5      75.7          84.4      84.7
80.2      82.6          79.6      81.4
87.8      100.4         77.5      81.2
83.3      85.2          72.3      88.2
79.7      83.6          89.0      78.8
84.5      84.6
80.8      96.2
87.4      86.7

Table 6.2: Anorexia data

Data from HSDS, set 285

93


6. The following data come from a study of pollution in inland waterways. In each of seven localities, five pike were caught and the log concentration of copper in their livers measured.

Locality                 Log concentration of copper (ppm)
1. Windermere            0.187    0.836    0.704    0.938    0.124
2. Grassmere             0.449    0.769    0.301    0.045    0.846
3. River Stour           0.628    0.193    0.810    0.000    0.855
4. Wimbourne St Giles    0.412    0.286    0.497    0.417    0.337
5. River Avon            0.243    0.258    -0.276   -0.538   0.041
6. River Leam            0.134    0.281    0.529    0.305    0.459
7. River Kennett         0.471    0.371    0.297    0.691    0.535

(a) The data are plotted in Figure 6.2. Discuss briefly what the plot suggests about the relative copper pollution in the various localities.

[Figure 6.2: Concentration of copper in pike livers]

(b) Carry out a one-way analysis of variance to test for differences between localities. Do the results of the formal analysis agree with your subjective impressions from Figure 6.2?

94


7. (a) Define the Normal Linear Model, and show how the 'One-Way Analysis of Variance' may be written in matrix form as a Normal Linear Model.

(b) The following table gives the concentrations (nanograms per litre) of two pollutants (aldrin and HCB) found in ten separate samples at each of three different depths of the Wolf River in Tennessee. (Data from P. R. Jaffe, F. L. Parker & D. J. Wilson (1982), 'Distribution of toxic substances in rivers', Journal of the Environmental Engineering Division 108:639–649.)

Depth:            Surface          Middepth         Bottom
Pollutant:        Aldrin   HCB     Aldrin   HCB     Aldrin   HCB
                  3.08     3.74    5.17     6.03    4.81     5.44
                  3.58     4.61    6.17     6.55    5.71     6.88
                  3.81     4.00    6.26     3.55    4.90     5.37
                  4.31     4.67    4.26     4.59    5.35     5.44
                  4.35     4.87    3.17     3.77    5.26     5.03
                  4.40     5.12    3.76     4.81    6.26     6.48
                  3.67     4.52    4.76     5.85    3.76     3.89
                  5.17     5.29    4.90     5.74    8.07     5.85
                  5.17     5.74    6.57     6.77    8.79     6.85
                  4.35     5.48    5.17     5.64    7.30     7.16
Column sum:       41.9     48.0    50.2     53.3    60.2     58.4
Sum of squares:   179.5    234.4   262.9    295.1   385.0    350.2

Carry out a one-way ANOVA on the aldrin data. What do you conclude about the relative concentration at different depths?

(c) Plot the tabulated data in what you consider an appropriate way. Discuss briefly whether your plot casts any doubt on the assumptions underlying the One-Way ANOVA, and how you might continue your analysis (but do NOT perform any further numerical calculations!)

From Warwick ST217 exam 2002

6.10 Two-Way Analysis of Variance

Here there are two factors (e.g. two treatments, or patient number and treatment given) that can be varied independently.

Factor A has $I$ 'levels' $1, 2, \dots, I$, and factor B has $J$ 'levels' $1, 2, \dots, J$. For example:

(a) A is patient number $1, 2, \dots, I$, every patient receiving each treatment $j = 1, 2, \dots, J$ in turn,

(b) A is treatment number $1, 2, \dots, I$, and B is one of $J$ possible supplementary treatments.

Data can be conveniently tabulated:

                        Factor B
                1       2       ...     J
            1   Y_11    Y_12    ...     Y_1J
            2   Y_21    Y_22    ...     Y_2J
Factor A    3   Y_31    Y_32    ...     Y_3J
            ⋮   ⋮       ⋮       ⋱       ⋮
            I   Y_I1    Y_I2    ...     Y_IJ

i.e. there is precisely one observation $Y_{ij}$ at each $(i, j)$ combination of factor levels.

95


Again assume the NLM with
$$E[Y_{ij}] = \theta_i + \phi_j \quad\text{for } i = 1, \dots, I \text{ and } j = 1, \dots, J,$$
i.e.
$$Y_{ij} \sim N(\theta_i + \phi_j,\ \sigma^2) \text{ independently.} \qquad (6.23)$$

A problem here is that one could transform $\theta_i \mapsto \theta_i + c$ and $\phi_j \mapsto \phi_j - c$ for each $i$ and $j$, where $c$ is arbitrary. Therefore for identifiability one needs to impose some (arbitrary) constraints.

The simplest and most symmetrical reformulation for the two-way ANOVA model is
$$Y_{ij} \sim N(\mu + \alpha_i + \beta_j,\ \sigma^2), \quad\text{where } \sum_{i=1}^I \alpha_i = 0 \text{ and } \sum_{j=1}^J \beta_j = 0. \qquad (6.24)$$

Exercise 6.7
What is the matrix formulation of the model (6.24)?
‖

Particular models of interest within the framework of (6.24) are:

(1) $Y_{ij} \sim N(0, \sigma^2)$: RSS $= \sum_{i,j} Y_{ij}^2$, DF $= n = IJ$.

(2) $Y_{ij} \sim N(\mu, \sigma^2)$: RSS $= \sum_{i,j} (Y_{ij} - Y_{++})^2$, DF $= n - 1 = IJ - 1$.

(3) $Y_{ij} \sim N(\mu + \alpha_i, \sigma^2)$: $\hat{Y}_{ij} = \hat{\mu} + \hat{\alpha}_i = Y_{i+}$, therefore RSS $= \sum_{i,j} (Y_{ij} - Y_{i+})^2$, DF $= n - I = I(J-1)$.

(4) $Y_{ij} \sim N(\mu + \beta_j, \sigma^2)$: $\hat{Y}_{ij} = \hat{\mu} + \hat{\beta}_j = Y_{+j}$, therefore RSS $= \sum_{i,j} (Y_{ij} - Y_{+j})^2$, DF $= n - J = (I-1)J$.

(5) $Y_{ij} \sim N(\mu + \alpha_i + \beta_j, \sigma^2)$: $\hat{Y}_{ij} = \hat{\mu} + \hat{\alpha}_i + \hat{\beta}_j = Y_{i+} + Y_{+j} - Y_{++}$, therefore RSS $= \sum_{i,j} (Y_{ij} - Y_{i+} - Y_{+j} + Y_{++})^2$, DF $= n - I - J + 1 = (I-1)(J-1)$.

Again, we can form an ANOVA table summarising the independent 'sources of variation'. The degrees of freedom are the differences between the DFs associated with the various models. The sums of squares are the differences between the SSs associated with the various models.

96


Source of            Degrees of       Sum of squares   Mean square
variation            freedom (DF)     (SS)             (MS)
Overall mean         1                (1)-(2)
Effect of Factor A   I-1              (2)-(3)          ((2)-(3)) / (I-1)
Effect of Factor B   J-1              (2)-(4)          ((2)-(4)) / (J-1)
Residuals            (I-1)(J-1)       (5)              (5) / ((I-1)(J-1))
Total                IJ = n           (1)

Table 6.3: Two-way ANOVA table

Comments

1. DeGroot gives a more general version.

2. As with one-way ANOVA, one can test $H_0: \alpha_i = 0$, $i = 1, \dots, I$, by comparing
$$\frac{(\text{SS due to A})/(I-1)}{(\text{Residual SS})/([I-1][J-1])}$$
with the 95% point of $F_{(I-1),([I-1][J-1])}$.

3. Similarly one can test $H_0: \beta_j = 0$, $j = 1, \dots, J$, by comparing
$$\frac{(\text{SS due to B})/(J-1)}{(\text{Residual SS})/([I-1][J-1])}$$
with the 95% point of $F_{(J-1),([I-1][J-1])}$.

4. The above two F tests use completely separate aspects of the data (row sums of the $Y_{ij}$ table, column sums of the $Y_{ij}$ table).

5. The case $J = 2$ is equivalent to the paired t-test (Exercise 5.4).

6. As for one-way ANOVA, the formulae for sums of squares simplify: 'sum over each observation the squared difference between the fitted values under the two models being considered'. The residual SS is then most easily obtained by subtraction. See Problem 6.11.1, and the sketch below.
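The same 'differences of RSS' recipe can be checked numerically. Here is a minimal sketch (Python with numpy and scipy; the I × J table is made up for illustration) computing both F tests of Comments 2 and 3 from the RSSs of models (2)-(5):

```python
import numpy as np
from scipy import stats

# Hypothetical I x J table, one observation Y_ij per cell.
Y = np.array([[2.6, 2.9, 2.9],
              [2.0, 2.5, 2.6],
              [3.0, 3.1, 3.4],
              [2.1, 3.0, 2.6]])
I, J = Y.shape

row_means = Y.mean(axis=1, keepdims=True)   # Y_{i+}
col_means = Y.mean(axis=0, keepdims=True)   # Y_{+j}
grand = Y.mean()                            # Y_{++}

# RSS of models (2)-(5); each source's SS is a difference of RSSs.
rss2 = np.sum((Y - grand) ** 2)
rss3 = np.sum((Y - row_means) ** 2)
rss4 = np.sum((Y - col_means) ** 2)
rss5 = np.sum((Y - row_means - col_means + grand) ** 2)

df_res = (I - 1) * (J - 1)
F_A = ((rss2 - rss3) / (I - 1)) / (rss5 / df_res)   # test alpha_i = 0
F_B = ((rss2 - rss4) / (J - 1)) / (rss5 / df_res)   # test beta_j = 0
print(F_A, stats.f.sf(F_A, I - 1, df_res))
print(F_B, stats.f.sf(F_B, J - 1, df_res))
```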

97


6.11 Problems

1. For the two-way analysis of variance (Table 6.3, page 97), find simplified formulae for the sums of squares analogous to those found for the one-way ANOVA (exercise 6.9.1).

2. Three pertussis vaccines were tested on each of ten days. The following table shows estimates of the log doses of vaccine (in millions of organisms) required to protect 50% of mice against a subsequent infection with pertussis organisms.

         Vaccine
Day      A       B       C       Total
1        2.64    2.93    2.93    8.50
2        2.00    2.52    2.56    7.08
3        3.04    3.05    3.35    9.44
4        2.07    2.97    2.55    7.59
5        2.54    2.44    2.45    7.43
6        2.76    3.18    3.25    9.19
7        2.03    2.30    2.17    6.50
8        2.20    2.56    2.18    6.94
9        2.38    2.99    2.74    8.11
10       2.42    3.20    3.14    8.76
Total    24.08   28.14   27.32   79.54

Test the statistical significance of the differences between days and between vaccines.

Osborn (1979) 8.1.2

3. (a) Explain what is meant by the Normal Linear Model (NLM), and show how the two-way analysis of variance may be formulated in this way.

(b) The following table gives the average UK cereal yield (tonnes per hectare) from 1994 to 1998, together with the row, column, and overall totals.

               1994    1995    1996    1997    1998    Total
Wheat          7.35    7.70    8.15    7.38    7.56    38.14
Barley         5.37    5.73    6.14    5.76    5.29    28.29
Oats           5.50    5.52    6.14    5.78    6.00    28.94
Other cereal   5.65    5.52    5.86    5.52    5.04    27.59
Total          23.87   24.47   26.29   24.44   23.89   122.96

Calculate the fitted yields and residuals for Wheat in each of the five years:

i. under the NLM assuming no column effect, and

ii. under the NLM assuming that row & column effects are additive.

(c) Describe briefly how to test the null hypothesis that there is no column effect (i.e. no consistent change in yield from year to year). You do not need to carry out the numerical calculations.

(d) A nonparametric test of the above hypothesis may be carried out as follows: rank the data for each row from lowest to highest (thus for Wheat the values 7.35, 7.70, 8.15, 7.38 and 7.56 are replaced by 1, 4, 5, 2 and 3 respectively), then sum the four ranks for each year, and finally carry out a one-way analysis of variance on the five sums of ranks.

Comment on the advantages and disadvantages of applying this procedure, rather than the standard two-way ANOVA, to the above data.

From Warwick ST217 exam 2001

98


4. The following table gives the estimated hospital waiting lists (000s) by month & region, throughout the years 2000 & 2001.

Month   Year   NY      T       E       L       SE      SW      WM     NW
1       2000   137.6   105.5   121.6   173.3   192.6   111.0   99.4   177.6
2       2000   132.0   103.2   118.7   167.9   190.4   107.7   95.8   172.2
3       2000   125.1   98.7    111.9   162.3   184.0   100.7   89.6   164.8
4       2000   129.5   99.5    114.3   163.5   186.9   101.5   92.1   166.5
5       2000   129.9   99.5    114.6   163.2   186.4   100.6   92.4   166.2
6       2000   128.9   99.9    114.4   163.3   183.7   100.2   91.9   165.5
7       2000   127.9   99.0    113.6   160.8   183.9   99.6    90.7   164.9
8       2000   126.7   98.9    113.2   159.6   183.4   100.1   90.2   165.6
9       2000   124.6   98.2    111.9   158.1   183.8   99.6    91.0   164.6
10      2000   123.4   97.1    112.0   156.0   183.4   98.8    91.1   163.1
11      2000   121.0   97.2    111.7   155.7   183.4   98.9    92.1   161.2
12      2000   121.3   99.0    112.7   158.0   188.2   99.4    92.8   162.9
1       2001   122.4   97.7    113.3   159.5   188.8   99.8    93.4   164.0
2       2001   121.6   96.9    113.0   159.2   186.7   100.1   92.1   163.3
3       2001   119.6   95.3    109.7   156.3   181.1   97.1    87.2   160.5
4       2001   122.6   96.5    110.7   158.9   184.1   99.3    88.8   162.7
5       2001   124.0   97.2    111.0   160.0   185.9   100.3   90.1   164.4
6       2001   124.2   98.0    111.8   160.6   187.2   99.7    90.8   165.6
7       2001   123.2   98.9    111.4   160.9   187.4   100.2   91.0   165.5
8       2001   123.2   99.6    111.9   161.4   185.7   100.0   91.6   166.2
9       2001   123.0   99.4    111.6   159.1   185.2   100.2   91.0   165.8
10      2001   124.1   99.0    111.8   156.7   184.6   101.1   90.7   165.5
11      2001   123.0   99.6    113.2   155.4   183.7   102.0   89.8   164.7
12      2001   124.3   100.7   115.8   159.1   186.7   103.6   92.1   168.0

Key: NY Northern & Yorkshire, T Trent, E Eastern, L London, SE South East, SW South West, WM West Midlands, NW North West.

[Data extracted from archived Press Releases at http://tap.ccta.gov.uk/doh/intpress.nsf]

Fit a two-way ANOVA model, possibly after transforming the data, and address (briefly) the following questions:

(a) Does the pattern of change in waiting lists differ across the regions?

(b) Is there a simple (but not misleading) description of the overall change in waiting lists over the two years?

(c) Predict the values for the eight regions in March 2002 (to the nearest 100, as in the table).

(d) The figures for March 2001 were the latest available at the time of the General Election in May 2001. A cynical acquaintance suggests to you that the March 2001 waiting lists were 'unusually good'. What do you think?

99


5. Table 4.2, page 44, presented data on the preventive effect of four different drugs on allergic response in ten patients.

A simple way to analyse the data is via a two-way ANOVA on a suitable measure of patient response, such as the increase in $\sqrt{\text{NCF}}$, which is tabulated below (for example, $1.95 = \sqrt{3.8} - \sqrt{0.0}$ and $1.52 = \sqrt{9.2} - \sqrt{2.3}$).

        Patient number
Drug    1      2      3       4      5      6       7      8       9      10
P       1.95   1.52   0.77    0.44   0.78   1.69    0.37   0.95    1.10   0.62
C       0.71   1.30   1.32    1.48   0.58   0.41    0.00   2.09    0.32   -0.22
D       0.65   0.67   0.65    0.48   0.00   0.44    0.26   0.42    1.18   0.63
K       0.19   0.54   -0.07   0.82   0.54   -0.44   0.27   -0.03   0.59   0.71

(a) Test the statistical significance of the differences between drugs and between patients.

(b) Plot the original data (Table 4.2) in a way that would help you assess whether the assumptions underlying the above two-way ANOVA are reasonable.

(c) Comment on the analysis you have made, suggesting possible improvements where appropriate. You do NOT need to carry out any further complicated calculations.

6. Table 6.4 shows purported IQ scores of identical twins, one raised in a foster home (Y), and the other raised by natural parents (X). The data are also categorised according to the social class of the natural parents (upper, middle, low). The data come from Burt (1966), and are also available in Weisberg (1980).

upper class           middle class          lower class
Case   Y     X        Case   Y     X        Case   Y     X
1      82    82       8      71    78       14     63    68
2      80    90       9      75    79       15     77    73
3      88    91       10     93    82       16     86    81
4      108   115      11     95    97       17     83    85
5      116   115      12     88    100      18     93    87
6      117   129      13     111   107      19     97    87
7      132   131                            20     87    93
                                            21     94    94
                                            22     96    95
                                            23     112   97
                                            24     113   97
                                            25     106   103
                                            26     107   106
                                            27     98    111

Table 6.4: Burt's twin IQ data

(a) Plot the data.

(b) Fit simple linear regression models to predict Y from X within each social class.

(c) Fit parallel lines predicting Y from X within each social class (i.e. fit regression models with the same slope in each of the three classes, but possibly different intercepts).

(d) Produce an ANOVA table and an F-test to test whether the parallelism assumption is reasonable. Comment on the calculated F ratio.

100


7. (a) Give a matrix formulation of the Normal Linear Model, and show how the two-sample t-test and the two-way analysis of variance may both be considered to be examples of the Normal Linear Model.

(b) The following table shows the effect of oestrogen on weight gain in sheep in four separate ranches. Treatments are a combination of sex of sheep (M or F) and level of oestrogen (S0 or S3). Thus the value 52 in the table denotes the weight gain of a female sheep in Ranch II, given level S0 of oestrogen.

             Ranch
Treatment    I     II    III   IV
F-S0         47    52    62    51
M-S0         50    54    67    57
F-S3         57    53    69    57
M-S3         54    65    74    59

Data from an experiment reported by Little & Hills (1978)

Produce a two-way analysis of variance table, and test whether there are statistically significant differences between the treatments and/or the ranches, in terms of weight gain.

(c) Discuss briefly how you would check the assumptions underlying your analysis (you do NOT need to produce any plots or further calculations).

From Warwick ST217 exam 2003

For we know in part, and we prophesy in part. But when that which is perfect is come, then that which is in part shall be done away.
1 Corinthians 13:9–10

Everything should be made as simple as possible, but not simpler.
Albert Einstein

A theory is a good theory if it satisfies two requirements: it must accurately describe a large class of observations on the basis of a model that contains only a few arbitrary elements, and it must make definite predictions about the results of future observations.
Stephen William Hawking

The purpose of models is not to fit the data but to sharpen the question.
Samuel Karlin

Science may be described as the art of systematic oversimplification.
Sir Karl Raimund Popper

101


Chapter 7

Further Topics

7.1 Generalisations of the Linear Model

You can generalise the systematic part of the linear model, i.e. the formula for $E[Y|x]$, and/or the random part, i.e. the distribution of $Y - E[Y|x]$.

7.1.1 Nonlinear Models

These are models of the form
$$E[Y|x] = g(x, \beta) \qquad (7.1)$$
where $Y$ is the response, $x$ is a vector of explanatory variables, $\beta = (\beta_1, \dots, \beta_p)^T$ is a parameter vector, and the function $g$ is nonlinear in the $\beta_i$s.

Examples

1. Asymptotic regression:
$$Y_i = \alpha - \beta\gamma^{x_i} + \epsilon_i \quad (i = 1, 2, \dots, n), \qquad \epsilon_i \overset{\text{IID}}{\sim} N(0, \sigma^2).$$
There are four parameters to be estimated: $\beta = (\alpha, \beta, \gamma, \sigma^2)^T$.

Assuming that $0 < \gamma < 1$, we have:

(a) $E[Y|x]$ is monotonic increasing in $x$,

(b) $E[Y|x = 0] = \alpha - \beta$,

(c) as $x \to \infty$, $E[Y|x] \to \alpha$.

This 'asymptotic regression' model might be appropriate, for example, if

(a) $x$ = age of an animal, $y$ = height or weight, or

(b) $x$ = time spent training, $y$ = height jumped (for $n$ people of similar build).

2. The 'Michaelis-Menten' equation in enzyme kinetics:
$$E[Y|x] = \frac{\beta_1 x}{\beta_2 + x},$$
with various possible distributional assumptions, the simplest of which is $[Y|x] \sim N(\beta_1 x/(\beta_2 + x),\ \sigma^2)$.

102


Comments

1. Nonlinear models can be fitted, in principle, by maximum likelihood.

2. In practice one needs computers and iteration.

3. Even if the random variation is assumed to be Normal, the likelihood may have a very non-Normal shape.
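By way of illustration of Comment 2, here is a minimal sketch (Python, assuming scipy is available; the data and starting values are made up) of iterative least-squares fitting of the Michaelis-Menten mean function, which is equivalent to maximum likelihood under the Normal error assumption above:

```python
import numpy as np
from scipy.optimize import curve_fit

# Michaelis-Menten mean function E[Y|x] = beta1 * x / (beta2 + x).
def michaelis_menten(x, beta1, beta2):
    return beta1 * x / (beta2 + x)

# Hypothetical substrate concentrations x and measured reaction rates y.
x = np.array([0.2, 0.5, 1.0, 2.0, 4.0, 8.0])
y = np.array([0.9, 1.7, 2.4, 3.1, 3.6, 3.9])

# curve_fit iterates from the starting values p0; the choice of p0 can
# matter, since the likelihood surface need not be well behaved.
(b1, b2), cov = curve_fit(michaelis_menten, x, y, p0=[4.0, 1.0])
print(b1, b2)   # estimates of beta1 (asymptote) and beta2 (half-saturation)
```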

7.1.2 Generalised Linear Models

Definition 7.1 (GLM)
A generalised linear model (GLM) has a random part and a systematic part:

Random Part

1. The $i$th response $Y_i$ has a probability distribution with mean $\mu_i$.

2. The distributions are all of the same form (e.g. all Normal with variance $\sigma^2$, or all Poisson, etc.).

3. The $Y_i$s are independent.

Systematic Part
$$g(\mu_i) = x_i^T\beta = \sum_{j=1}^p \beta_j x_{ij},$$
where

1. $x_i = (x_{i1}, \dots, x_{ip})^T$ is a vector of explanatory variables,

2. $\beta = (\beta_1, \dots, \beta_p)^T$ is a parameter vector, and

3. $g(\cdot)$ is a monotonic function called the link function.

Comments

1. If $Y_i \sim N(\mu_i, \sigma^2)$ and $g(\cdot)$ is the identity function, then we have the NLM.

2. Other GLMs typically must have their parameters estimated by maximising the likelihood numerically (iteratively in a computer).

3. The principles behind fitting GLMs are similar to those for fitting NLMs.

Example: 'logistic regression'

1. Random part: binary response, e.g.
$$Y_i|x_i = \begin{cases} 1 & \text{if individual } i \text{ survived} \\ 0 & \text{if individual } i \text{ died} \end{cases}$$
(and all $Y_i$s are conditionally independent given the corresponding $x_i$s).

Note that $\mu_i = E[Y_i|x_i]$ is here the probability of surviving given explanatory variables $x_i$, and is usually written $p_i$ or $\pi_i$.

2. Systematic part:
$$g(\pi_i) = \log\left(\frac{\pi_i}{1 - \pi_i}\right).$$

103


Exercise 7.1
Show that under the logistic regression model, if $n$ patients have identical explanatory variables $x$ say, then

(a) each of these $n$ patients has probability of survival given by
$$\pi = \frac{\exp(x^T\beta)}{1 + \exp(x^T\beta)},$$

(b) the number $R$ surviving out of $n$ has expected value $n\pi$ and variance $n\pi(1 - \pi)$.
‖
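Fitting follows the numerical-maximisation route of Comment 2 above. Here is a minimal sketch (Python with numpy; the design matrix and binary responses are made up for illustration) of Newton-Raphson iterations for the logistic regression log-likelihood:

```python
import numpy as np

# Hypothetical design matrix (intercept + one covariate) and binary responses.
X = np.array([[1.0, x] for x in [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]])
y = np.array([0, 0, 0, 1, 0, 1, 1, 1])

beta = np.zeros(X.shape[1])
for _ in range(25):                        # Newton-Raphson iterations
    pi = 1.0 / (1.0 + np.exp(-(X @ beta)))  # inverse of the logit link
    W = pi * (1.0 - pi)                     # Var[Y_i] under the Bernoulli model
    score = X.T @ (y - pi)                  # gradient of the log-likelihood
    info = X.T @ (X * W[:, None])           # Fisher information
    beta = beta + np.linalg.solve(info, score)

print(beta)   # MLE of (beta_0, beta_1)
```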

7.2 Simpson's Paradox

Simpson's paradox occurs when there are three RVs $X$, $Y$ and $Z$, such that the conditional distributions $[X, Y|Z]$ show a relationship between $[X|Z]$ and $[Y|Z]$, but the marginal distribution $[X, Y]$ apparently shows a very different relationship between $X$ and $Y$. For example,

1. $X$ ($Y$) = male (female) death rate, $Z$ = age,

2. $X$ ($Y$) = male (female) admission rate to University, $Z$ = admission rate for the student's chosen course.
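The reversal is easy to reproduce with made-up counts. The following sketch (Python; all numbers hypothetical) mirrors example 2: within each course women are admitted at the higher rate, yet the aggregate rates point the other way.

```python
# Made-up admission counts illustrating Simpson's paradox:
# course: ((men applied, men admitted), (women applied, women admitted))
courses = {
    "easy": ((800, 480), (100, 70)),   # admit rates 0.60 vs 0.70
    "hard": ((200, 20), (900, 180)),   # admit rates 0.10 vs 0.20
}

m_app = m_adm = w_app = w_adm = 0
for course, ((ma, mad), (wa, wad)) in courses.items():
    print(course, "men:", mad / ma, "women:", wad / wa)
    m_app += ma; m_adm += mad
    w_app += wa; w_adm += wad

# Marginally the relationship reverses (0.50 vs 0.25), because women
# mostly applied to the course with the lower admission rate.
print("overall men:", m_adm / m_app, "overall women:", w_adm / w_app)
```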

7.3 Problems

1. (a) Explain what is meant by

i. the Normal linear model,

ii. simple linear regression, and

iii. nonlinear regression.

(b) For simple linear regression applied to data $(x_i, y_i)$, $i = 1, \dots, n$, show that the maximum likelihood estimators $\hat\beta_0$ and $\hat\beta_1$ of the intercept $\beta_0$ and slope $\beta_1$ satisfy the simultaneous equations
$$\hat\beta_0\, n + \hat\beta_1 \sum_{i=1}^n x_i = \sum_{i=1}^n y_i$$
and
$$\hat\beta_0 \sum_{i=1}^n x_i + \hat\beta_1 \sum_{i=1}^n x_i^2 = \sum_{i=1}^n x_i y_i.$$
Hence find $\hat\beta_0$ and $\hat\beta_1$.

(c) The following table shows $Y$, the survival time (weeks) of leukaemia patients, and $x$, the corresponding log of initial white blood cell count.

x      Y      x      Y      x      Y
3.36   65     4.00   121    4.54   22
2.88   156    4.23   4      5.00   1
3.63   100    3.73   39     5.00   1
3.41   134    3.85   143    4.72   5
3.78   16     3.97   56     5.00   65
4.02   108    4.51   26

Plot the data and, without carrying out any calculations, discuss how reasonable the assumptions underlying simple linear regression are in this case.

From Warwick ST217 exam 1998

104


2. Because of concerns about sex discrimination, a study was carried out by the Graduate Division at the University of California, Berkeley. In fall 1973, there were 8,442 male applications and 4,321 female applications to graduate school. It was found that about 44% of the men and 35% of the women were admitted.

When the data were investigated further, it was found that just 6 of the more than 100 majors accounted for over one-third of the total number of applicants. The data for these six majors (which Berkeley forbids identifying by name) are summarized in the table below.

        Men                        Women
        Number of    Percent       Number of    Percent
Major   applicants   admitted      applicants   admitted
A       825          62            108          82
B       560          63            25           68
C       325          37            593          34
D       417          33            375          35
E       191          28            393          24
F       373          6             341          7

Discuss the possibility of sex discrimination in admission, with particular reference to explanatory variables, conditional probability, independence and Simpson's paradox.

Data from Freedman et al. (1991), page 17

3. (a) At a party, the POTAS¹ of your dreams approaches you, and says by way of introduction:

Hi—I'm working on a study of human pheromones, and need some statistical help. Can you explain to me what's meant by 'logistic regression', and why the idea's important?

Give a brief verbal explanation of logistic regression, without (i) using any formulae, (ii) saying anything that's technically incorrect, (iii) boring the other person senseless and ruining a potentially beautiful friendship, (iv) otherwise embarrassing yourself.

(b) Repeat the exercise, replacing logistic regression successively with:
Bayesian inference, conditional expectation, likelihood, a multinomial distribution, multiple regression, the Neyman-Pearson lemma, nuisance parameters, one-way ANOVA, order statistics, the Poisson distribution, a linear model, size & power, statistical independence, a t-test.

(c) Suddenly, a somewhat inebriated student (SIS) appears and interrupts your rather impressive explanation with the following exchange:

SIS: Think of a number from 1 to 10.
POTASOYD: Erm—seven
SIS: Wrong. Get your clothes off.

You then watch aghast while he starts introducing himself in the same way to everyone in the room. As a statistician, you of course note down the numbers $x_i$ he is given, namely
7, 2, 3, 1, 5, 2, 10, 10, 7, 3, 9, 1, 2, 2, 7, 10, 5, 8, 5, 7, 3, 10, 6, 1, 5, 3, 2, 7, 8, 5, 7.
His response $y_i$ is 'Wrong' in each case, and you formulate the hypotheses
$$H_0:\ y_i = \text{'Wrong' irrespective of } x_i,$$
$$H_1:\ y_i = \begin{cases} \text{'Right'} & \text{if } x_i = x_0 \\ \text{'Wrong'} & \text{if } x_i \neq x_0 \end{cases} \quad \text{for some } x_0 \in \{1, 2, \dots, 10\}.$$
How might you test the null hypothesis $H_0$ against the alternative $H_1$?

¹ Person Of The Appropriate Sex

105


4. (a) Explain what is meant by:

i. a generalised linear model,

ii. a nonlinear model.

(b) Discuss the models you would most likely consider for the following data sets:

i. Data on the age, sex, and weight of 100 people who suffered a heart attack (for the first time), and whether or not they were still alive two years later.

ii. Data on the age, sex and weight of 100 salmon in a fish farm.

From Warwick ST217 exam 1996

I have yet to see any problem, however complicated, which, when you looked at it the right way, did not become still more complicated.
Poul Anderson

The manipulation of statistical formulas is no substitute for knowing what one is doing.
Hubert M. Blalock, Jr.

A judicious man uses statistics, not to get knowledge, but to save himself from having ignorance foisted upon him.
Thomas Carlyle

The best material model of a cat is another, or preferably the same, cat.
A. Rosenblueth & Norbert Wiener

A little inaccuracy sometimes saves tons of explanation.
Saki (Hector Hugh Munro)

karma police arrest this man he talks in maths he buzzes like a fridge hes like a detuned radio.
Thom Yorke

Better is the end of a thing than the beginning thereof.
Ecclesiastes 7:8

106


Bibliography

Barnett, V. (1982). Comparative Statistical Inference (Second ed.). New York: John Wiley and Sons.

Burt, C. (1966). The genetic determination of differences in intelligence: A study of monozygotic twins reared together and apart. Brit. J. Psych. 57: 137–153.

Casella, G. and R. L. Berger (1990). Statistical Inference. Pacific Grove, CA: Wadsworth & Brooks/Cole.

Casella, G. and R. L. Berger (2001). Statistical Inference (Second ed.). Pacific Grove, CA: Duxbury Press.

DeGroot, M. H. (1989). Probability and Statistics (Second ed.). Reading, Mass.: Addison-Wesley.

DeGroot, M. H. and M. J. Schervish (2002). Probability and Statistics (Third ed.). Reading, Mass.: Addison-Wesley.

Efron, B. and R. J. Tibshirani (1993). An Introduction to the Bootstrap, Volume 57 of Monographs on Statistics and Applied Probability. New York: Chapman and Hall.

Freedman, D., R. Pisani, R. Purves, and A. Adhikari (1991). Statistics (Second ed.). New York: W. W. Norton.

Gilks, W. R., S. Richardson, and D. J. Spiegelhalter (eds.) (1996). Markov Chain Monte Carlo in Practice. London: Chapman and Hall.

Hand, D. J., F. Daly, A. D. Lunn, K. J. McConway, and E. Ostrowski (eds.) (1994). A Handbook of Small Data Sets. London: Chapman and Hall.

Hogg, R. V. and A. T. Craig (1970). Introduction to Mathematical Statistics. New York: MacMillan.

Lindgren, B. W. (1994). Statistical Theory (Fourth ed.). London: Chapman and Hall.

Mood, A. M., F. A. Graybill, and D. C. Boes (1974). Introduction to the Theory of Statistics (Third ed.). New York: McGraw-Hill.

Moore, D. S. and G. S. McCabe (1998). Introduction to the Practice of Statistics (Third ed.). Oxford, UK: W. H. Freeman & Company Limited.

Moore, D. S. and G. S. McCabe (2002). Introduction to the Practice of Statistics (Fourth ed.). Oxford, UK: W. H. Freeman & Company Limited.

Narula, S. C. and J. F. Wellington (1977). Prediction, linear regression and minimum sum of relative errors. Technometrics 19: 185–190.

O.P.C.S. (1995). 1993 Mortality Statistics, Volume 20 of DH2. London: Her Majesty's Stationery Office.

Osborn, J. F. (1979). Statistical Exercises in Medical Research. Oxford, UK: Blackwell Scientific Publications.

Rice, J. A. (1995). Mathematical Statistics and Data Analysis (Second ed.). Pacific Grove, CA: Wadsworth.

Sprent, P. (1998). Data Driven Statistical Methods. London: Chapman and Hall.

Weisberg, S. (1980). Applied Linear Regression. New York: John Wiley and Sons.

107
