
The normal distribution is extremely common and useful for one main reason: the normal distribution approximates a lot of other distributions. This is the result of one of the most fundamental theorems in mathematics:

0.1 Central Limit Theorem (CLT)

Theorem 0.1.1 (Central Limit Theorem)
If $X_1, X_2, \ldots, X_n$ are $n$ independent, identically distributed random variables with $E[X_i] = \mu$ and $Var[X_i] = \sigma^2$, then the sample mean $\bar{X} := \frac{1}{n} \sum_{i=1}^n X_i$ is approximately normally distributed with $E[\bar{X}] = \mu$ and $Var[\bar{X}] = \sigma^2/n$, i.e.
$$\bar{X} \sim N(\mu, \tfrac{\sigma^2}{n}) \quad \text{or} \quad \sum_i X_i \sim N(n\mu, n\sigma^2).$$

Corollary 0.1.2
(a) for large $n$ the binomial distribution $B_{n,p}$ is approximately normal $N_{np,\,np(1-p)}$.
(b) for large $\lambda$ the Poisson distribution $Po_\lambda$ is approximately normal $N_{\lambda,\lambda}$.
(c) for large $k$ the Erlang distribution $Erlang_{k,\lambda}$ is approximately normal $N_{k/\lambda,\,k/\lambda^2}$.

Why?

(a) Let $X$ be a variable with a $B_{n,p}$ distribution. We know that $X$ is the result of repeating the same Bernoulli experiment $n$ times and counting the overall number of successes. We can therefore write $X$ as the sum of $n$ $B_{1,p}$ variables $X_i$:
$$X := X_1 + X_2 + \ldots + X_n$$
$X$ is then the sum of $n$ independent, identically distributed random variables. The Central Limit Theorem then states that $X$ has an approximately normal distribution with $E[X] = nE[X_i] = np$ and $Var[X] = nVar[X_i] = np(1-p)$.

(b) It is enough to show the statement for the case that $\lambda$ is a large integer: let $Y$ be a Poisson variable with rate $\lambda$. Then we can think of $Y$ as the number of occurrences in an experiment that runs for time $\lambda$ - that is the same as observing $\lambda$ experiments that each run independently for time 1 and adding their results:
$$Y = Y_1 + Y_2 + \ldots + Y_\lambda, \quad \text{with } Y_i \sim Po_1.$$
Again, $Y$ is the sum of $\lambda$ independent, identically distributed random variables. The Central Limit Theorem then states that $Y$ has an approximately normal distribution with $E[Y] = \lambda \cdot 1 = \lambda$ and $Var[Y] = \lambda \, Var[Y_i] = \lambda$.

(c) This statement is the easiest to prove, since an $Erlang_{k,\lambda}$ distributed variable $Z$ is by definition the sum of $k$ independent exponentially distributed variables $Z_1, \ldots, Z_k$. For $Z$ the CLT holds, and we get that $Z$ is approximately normally distributed with $E[Z] = kE[Z_i] = k/\lambda$ and $Var[Z] = kVar[Z_i] = k/\lambda^2$.



Why do we need the Central Limit Theorem at all? First of all, the CLT gives us the distribution of the sample mean in a very general setting: the only thing we need to know is that all the observed values come from the same distribution, and that the variance of this distribution is not infinite.

A second reason is that most tables only contain probabilities up to a certain limit - the Poisson table, e.g., only has values for $\lambda \le 10$, and the binomial distribution is tabled only for $n \le 20$. Beyond that, we can use the normal approximation to get probabilities.

Example 0.1.1 (Hits on a webpage)
Hits occur at a rate of 2 per minute. What is the probability of waiting more than 20 min for the 50th hit?

Let $Y$ be the waiting time until the 50th hit. We know that $Y$ has an $Erlang_{50,2}$ distribution, therefore:
$$P(Y > 20) = 1 - Erlang_{50,2}(20) = 1 - (1 - Po_{2 \cdot 20}(50-1)) = Po_{40}(49) \overset{CLT}{\approx} N_{40,40}(49) = \Phi\left(\frac{49 - 40}{\sqrt{40}}\right) = \Phi(1.42) \overset{table}{=} 0.9222.$$
✷
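The same chain of identities can be evaluated in software. A minimal sketch (assuming scipy; the notes use the normal table instead):

```python
# P(Y > 20) for Y ~ Erlang(k=50, rate=2): exact, via the Poisson identity,
# and via the CLT normal approximation used in the text.
from scipy.stats import erlang, poisson, norm

k, lam = 50, 2.0
exact = erlang.sf(20, k, scale=1 / lam)      # P(Y > 20), exact
via_poisson = poisson.cdf(k - 1, lam * 20)   # = Po_40(49), the same quantity
clt = norm.cdf((49 - 40) / 40 ** 0.5)        # Phi(1.42) = 0.9222

print(exact, via_poisson, clt)   # the first two agree exactly; the CLT value is close
```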

Example 0.1.2 (Mean of Uniform Variables)
Let $U_1, U_2, U_3, U_4,$ and $U_5$ be standard uniform variables, i.e. $U_i \sim U_{(0,1)}$. Without the CLT we would have no idea what distribution the sample mean $\bar{U} = \frac{1}{5} \sum_{i=1}^5 U_i$ has! With it, we know:
$$\bar{U} \overset{approx}{\sim} N(0.5, \tfrac{1}{60}).$$
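A quick simulation makes this concrete (a sketch, assuming numpy). Since $Var[U_i] = 1/12$, the CLT predicts $Var[\bar{U}] = \frac{1/12}{5} = \frac{1}{60}$:

```python
# Simulate the mean of 5 standard uniforms many times and compare the
# empirical mean and variance with the CLT prediction N(0.5, 1/60).
import numpy as np

rng = np.random.default_rng(0)
ubar = rng.uniform(0, 1, size=(100_000, 5)).mean(axis=1)

print(ubar.mean(), ubar.var())   # should be close to 0.5 and 1/60 = 0.0167
```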

Issue: accuracy of the approximation

• increases with $n$
• increases with the amount of symmetry in the distribution of $X_i$

Rule of thumb for the binomial distribution: use the normal approximation for $B_{n,p}$ if $np > 5$ (when $p \le 0.5$) or $nq > 5$ (when $p \ge 0.5$), where $q = 1 - p$.

From now on, we will use probability theory only to find answers to the questions arising from specific problems we are working on.

In this chapter we want to draw inferences about some characteristic of an underlying population - e.g. the average height of a person. Instead of measuring this characteristic for each individual, we will draw a sample, i.e. choose a "suitable" subset of the population, and measure the characteristic only for those individuals. Using some probabilistic arguments we can then extend the information we got from that sample and make an estimate of the characteristic for the whole population. Probability theory will give us the means to find those estimates and to measure how "probable" our estimates are.

Of course, choosing the sample is crucial. We will demand two properties of a sample:

• the sample should be representative - taking only basketball players into the sample would change our estimate of a person's height drastically.
• if there is a large number in the sample, we should come close to the "true" value of the characteristic.

The three main areas of statistics are



• estimation of parameters: point or interval estimates - "my best guess for value x is ...", "my guess is that value x is in the interval (a, b)"
• evaluation of the plausibility of values: hypothesis testing
• prediction of future (individual) values

0.2 Parameter Estimation

Statistics are all around us - scores in sports, prices at the grocer's, weather reports (and how often they turn out to be close to the actual weather), taxes, evaluations ...

The most basic form of statistics is descriptive statistics. But what exactly is a statistic? Here is the formal definition:

Definition 0.2.1 (Statistic)
Any function $W(x_1, \ldots, x_k)$ of observed values $x_1, \ldots, x_k$ is called a statistic.

Some statistics you already know are (see the sketch after the table):

Mean (Average): $\bar{X} = \frac{1}{n} \sum_i X_i$
Minimum: $X_{(1)}$ - the parentheses indicate that the values are sorted
Maximum: $X_{(n)}$
Range: $X_{(n)} - X_{(1)}$
Mode: the value(s) that appear(s) most often
Median: the "middle value" - the value for which one half of the data is larger and the other half is smaller. If $n$ is odd, the median is $X_{((n+1)/2)}$; if $n$ is even, the median is the average of the two middle values: $0.5 \cdot X_{(n/2)} + 0.5 \cdot X_{(n/2+1)}$.
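A minimal sketch of these statistics on a small hypothetical sample (assuming numpy):

```python
import numpy as np

x = np.array([3, 7, 1, 9, 4, 4, 6])   # hypothetical observations
xs = np.sort(x)                       # X_(1) <= ... <= X_(n)

print("mean  :", x.mean())
print("min   :", xs[0])               # X_(1)
print("max   :", xs[-1])              # X_(n)
print("range :", xs[-1] - xs[0])
print("median:", np.median(x))        # middle value (n odd here)
```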

For this section it is important to distinguish between $x_i$ and $X_i$ properly. If not stated otherwise, any capital letter denotes a random variable, while a small letter denotes a realization of this random variable, i.e. what we have observed. $x_i$ therefore is a real number; $X_i$ is a function that assigns a real number to an event from the sample space.

Definition 0.2.2 (Estimator)
Let $X_1, \ldots, X_k$ be $k$ i.i.d. random variables with distribution $F_\theta$ with (unknown) parameter $\theta$. A statistic $\hat{\Theta} = \hat{\Theta}(X_1, \ldots, X_k)$ used to estimate the value of $\theta$ is called an estimator of $\theta$. $\hat{\theta} = \hat{\Theta}(x_1, \ldots, x_k)$ is called an estimate of $\theta$.

Desirable properties of estimates:



• Unbiasedness, i.e. the expected value of the estimator is the true parameter:
$$E[\hat{\Theta}] = \theta$$
• Efficiency: for two estimators $\hat{\Theta}_1$ and $\hat{\Theta}_2$ of the same parameter $\theta$, $\hat{\Theta}_1$ is said to be more efficient than $\hat{\Theta}_2$ if
$$Var[\hat{\Theta}_1] < Var[\hat{\Theta}_2]$$
• Consistency: if we have a larger sample size $n$, we want the estimate $\hat{\theta}$ to be closer to the true parameter $\theta$:
$$\lim_{n \to \infty} P(|\hat{\Theta} - \theta| > \epsilon) = 0$$

[Figure: dot plots illustrating the three properties. Unbiasedness: the estimates from repeated samples scatter around the true value $x$, not around some offset point. Efficiency: estimator 1, whose values cluster tightly around the true value, is better than estimator 2, whose values are more spread out. Consistency: the same estimator is shown for $n = 100$ and for $n = 10000$; the estimates concentrate around the true value as $n$ grows.]

Example 0.2.1
Let $X_1, \ldots, X_n$ be $n$ i.i.d. random variables with $E[X_i] = \mu$. Then $\bar{X} = \frac{1}{n} \sum_{i=1}^n X_i$ is an unbiased estimator of $\mu$, because
$$E[\bar{X}] = \frac{1}{n} \sum_{i=1}^n E[X_i] = \frac{1}{n} \cdot n \cdot \mu = \mu.$$

OK - so once we have an estimator, we can decide whether it has these properties. But how do we find estimators?

0.2.1 Maximum Likelihood Estimation

Situation: we have $n$ data values $x_1, \ldots, x_n$. The assumption is that these data values are realizations of $n$ i.i.d. random variables $X_1, \ldots, X_n$ with distribution $F_\theta$. Unfortunately the value of $\theta$ is unknown.

[Figure: observed values $x_1, x_2, x_3, \ldots$ on the x-axis, with three candidate densities $f_\theta$ drawn over them, for $\theta = 0$, $\theta = -1.8$, and $\theta = 1$.]

By changing the value of $\theta$ we can "move the density function $f_\theta$ around" - in the diagram, the third density function fits the data best.

Principle: since we do not know the true value $\theta$ of the distribution, we take the value $\hat{\theta}$ that most likely produced the observed values, i.e. we maximize something like
$$P(X_1 = x_1 \cap X_2 = x_2 \cap \ldots \cap X_n = x_n) \overset{X_i \text{ independent!}}{=} \prod_{i=1}^n P(X_i = x_i) \qquad (*)$$
$$= P(X_1 = x_1) \cdot P(X_2 = x_2) \cdot \ldots \cdot P(X_n = x_n)$$



This is not quite the right way to write the probability if $X_1, \ldots, X_n$ are continuous variables. (Remember: $P(X = x) = 0$ for a continuous variable $X$; this is still valid.) We use the above "probability" just as a plausibility argument. To get around the problem that $P(X = x) = 0$ for a continuous variable, we will write (*) as:
$$\underbrace{\prod_{i=1}^n p_\theta(x_i)}_{\text{for discrete } X_i} \qquad \text{and} \qquad \underbrace{\prod_{i=1}^n f_\theta(x_i)}_{\text{for continuous } X_i}$$
where $p_\theta$ is the probability mass function of the discrete $X_i$ (all $X_i$ have the same, since they are identically distributed) and $f_\theta$ is the density function of the continuous $X_i$.

Both these functions depend on $\theta$. In fact, we can write the above expressions as a function of $\theta$. This function, which we will denote by $L(\theta)$, is called the likelihood function of $X_1, \ldots, X_n$.

The goal is now to find a value $\hat{\theta}$ that maximizes the likelihood function. (This is what "moves" the density to the right spot, so that it fits the observed values well.)

How do we find a maximum of $L(\theta)$? The usual way we maximize a function - differentiate it and set the derivative to zero! (After that, we ought to check with the second derivative whether we have actually found a maximum, but we won't do that unless we've found more than one possible value for $\hat{\theta}$.)

Most of the time it is difficult to find the derivative of $L(\theta)$ directly - instead we use a trick and find a maximum of $\log L(\theta)$, the log-likelihood function. Note: though its name is "log", we use the natural logarithm $\ln$.

The plan to find an ML estimator is:

1. Find the likelihood function $L(\theta)$.
2. Take the natural log of the likelihood function, $\log L(\theta)$.
3. Differentiate the log-likelihood function with respect to $\theta$.
4. Set the derivative to zero.
5. Solve for $\theta$.
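When steps 3-5 are intractable by hand, the same plan can be carried out numerically by minimizing the negative log-likelihood. A minimal sketch for i.i.d. exponential data (hypothetical values; scipy.optimize is our assumption, not part of the notes):

```python
# Numerical ML estimation: minimize -log L(lambda) for exponential data,
# where log L(lambda) = n*log(lambda) - lambda*sum(x_i).
import numpy as np
from scipy.optimize import minimize_scalar

def neg_log_likelihood(lam, data):
    data = np.asarray(data)
    return -(len(data) * np.log(lam) - lam * data.sum())

data = [0.8, 2.1, 1.3, 0.4, 1.9]   # hypothetical observations
res = minimize_scalar(neg_log_likelihood, args=(data,),
                      bounds=(1e-9, 100.0), method="bounded")
print(res.x, 1 / np.mean(data))    # numerical MLE vs closed form 1/x-bar
```

For the exponential distribution the closed-form answer is $\hat{\lambda} = 1/\bar{x}$, so the numerical result should match it closely.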

Example 0.2.2 (Roll a Die)
A die is rolled until its face shows a 6. Repeating this experiment 100 times gave the following results:



[Figure: histogram of the number of rolls of a die until the first 6, over 100 runs.]

k        | 1  | 2  | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 11 | 14 | 15 | 16 | 17 | 20 | 21 | 27 | 29
# trials | 18 | 20 | 8 | 9 | 9 | 5 | 8 | 3 | 5 | 3  | 3  | 3  | 1  | 1  | 1  | 1  | 1  | 1

We know that $k$, the number of rolls until a 6 shows up, has a geometric distribution $Geo_p$. For a fair die, $p$ is 1/6. The geometric distribution has probability mass function $p(k) = (1-p)^{k-1} \cdot p$. What is the ML estimate $\hat{p}$ of $p$?

1. Likelihood function $L(p)$: since we have observed 100 outcomes $k_1, \ldots, k_{100}$, the likelihood function is $L(p) = \prod_{i=1}^{100} p(k_i)$:
$$L(p) = \prod_{i=1}^{100} (1-p)^{k_i - 1} p = p^{100} \cdot \prod_{i=1}^{100} (1-p)^{k_i - 1} = p^{100} \cdot (1-p)^{\sum_{i=1}^{100} (k_i - 1)} = p^{100} \cdot (1-p)^{\sum_{i=1}^{100} k_i - 100}.$$

2. Log of the likelihood function, $\log L(p)$:
$$\log L(p) = \log\left(p^{100} \cdot (1-p)^{\sum_{i=1}^{100} k_i - 100}\right) = \log\left(p^{100}\right) + \log\left((1-p)^{\sum_{i=1}^{100} k_i - 100}\right) = 100 \log p + \left(\sum_{i=1}^{100} k_i - 100\right) \log(1-p).$$



3. Differentiate the log-likelihood with respect to $p$:
$$\frac{d}{dp} \log L(p) = 100 \frac{1}{p} + \left(\sum_{i=1}^{100} k_i - 100\right) \frac{-1}{1-p} = \frac{1}{p(1-p)} \left(100(1-p) - \left(\sum_{i=1}^{100} k_i - 100\right) p\right) = \frac{1}{p(1-p)} \left(100 - p \sum_{i=1}^{100} k_i\right).$$

4. Set the derivative to zero. For the estimate $\hat{p}$ the derivative must be zero:
$$\frac{d}{dp} \log L(\hat{p}) = 0 \iff \frac{1}{\hat{p}(1-\hat{p})} \left(100 - \hat{p} \sum_{i=1}^{100} k_i\right) = 0$$

5. Solve for $\hat{p}$:
$$\frac{1}{\hat{p}(1-\hat{p})} \left(100 - \hat{p} \sum_{i=1}^{100} k_i\right) = 0 \iff 100 - \hat{p} \sum_{i=1}^{100} k_i = 0 \iff \hat{p} = \frac{100}{\sum_{i=1}^{100} k_i} = \frac{1}{\frac{1}{100} \sum_{i=1}^{100} k_i}.$$

In total, with $\sum_{i=1}^{100} k_i = 568$ from the table, we have an estimate $\hat{p} = \frac{100}{568} = 0.1761$.
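Reproducing this estimate from the tabled data (a sketch, assuming numpy):

```python
# Geometric MLE from the die-rolling table: p-hat = (number of runs) / (sum of all k_i).
import numpy as np

k      = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 11, 14, 15, 16, 17, 20, 21, 27, 29])
trials = np.array([18, 20, 8, 9, 9, 5, 8, 3, 5, 3, 3, 3, 1, 1, 1, 1, 1, 1])

n = trials.sum()             # 100 runs
total = (k * trials).sum()   # sum of all k_i = 568
print(n / total)             # 0.1761, compared with 1/6 = 0.1667 for a fair die
```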

Example 0.2.3 (Red Cars in the Parking Lot)
The values 3, 2, 3, 3, 4, 1, 4, 2, 4, 3 have been observed while counting the number of red cars pulling into parking lot #22 between 8:30 and 8:40 am, Monday to Friday, during two weeks. The assumption is that these values are realizations of ten independent Poisson variables with (the same) rate $\lambda$. What is the maximum likelihood estimate of $\lambda$?

The probability mass function of a Poisson distribution is $p_\lambda(x) = e^{-\lambda} \cdot \frac{\lambda^x}{x!}$.

We have ten values $x_i$; this gives the likelihood function:
$$L(\lambda) = \prod_{i=1}^{10} e^{-\lambda} \cdot \frac{\lambda^{x_i}}{x_i!} = e^{-10\lambda} \cdot \lambda^{\sum_{i=1}^{10} x_i} \cdot \prod_{i=1}^{10} \frac{1}{x_i!}$$

The log-likelihood then is
$$\log L(\lambda) = -10\lambda + \ln(\lambda) \cdot \sum_{i=1}^{10} x_i - \sum_i \ln(x_i!).$$



Differentiating the log-likelihood with respect to $\lambda$ gives:
$$\frac{d}{d\lambda} \log L(\lambda) = -10 + \frac{1}{\lambda} \cdot \sum_{i=1}^{10} x_i$$

Setting it to zero:
$$\frac{1}{\hat{\lambda}} \cdot \sum_{i=1}^{10} x_i = 10 \iff \hat{\lambda} = \frac{1}{10} \sum_{i=1}^{10} x_i = \frac{29}{10} = 2.9$$

This gives us an estimate for $\lambda$ - and since $\lambda$ is also the expected value of the Poisson distribution, we can say that on average the number of red cars pulling into the parking lot each morning between 8:30 and 8:40 am is 2.9.

ML-estimators for µ and σ² of a normal distribution

Let $X_1, \ldots, X_n$ be $n$ independent, identically distributed normal variables with $E[X_i] = \mu$ and $Var[X_i] = \sigma^2$, where $\mu$ and $\sigma^2$ are unknown. The normal density function $f_{\mu,\sigma^2}$ is
$$f_{\mu,\sigma^2}(x) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$

Since we have $n$ independent variables, the likelihood function is a product of $n$ densities:
$$L(\mu, \sigma^2) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x_i-\mu)^2}{2\sigma^2}} = (2\pi\sigma^2)^{-n/2} \cdot e^{-\sum_{i=1}^n \frac{(x_i-\mu)^2}{2\sigma^2}}$$

Log-likelihood:
$$\log L(\mu, \sigma^2) = -\frac{n}{2} \ln(2\pi\sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^n (x_i - \mu)^2$$

Since we now have two parameters, $\mu$ and $\sigma^2$, we need two partial derivatives of the log-likelihood:
$$\frac{\partial}{\partial\mu} \log L(\mu, \sigma^2) = 0 - 2 \cdot \frac{-1}{2\sigma^2} \sum_{i=1}^n (x_i - \mu) = \frac{1}{\sigma^2} \sum_{i=1}^n (x_i - \mu)$$
$$\frac{\partial}{\partial\sigma^2} \log L(\mu, \sigma^2) = -\frac{n}{2} \frac{1}{\sigma^2} + \frac{1}{2(\sigma^2)^2} \sum_{i=1}^n (x_i - \mu)^2$$

We now must find values for $\mu$ and $\sigma^2$ that yield zeros for both derivatives at the same time. Setting $\frac{\partial}{\partial\mu} \log L(\mu, \sigma^2) = 0$ gives
$$\hat{\mu} = \frac{1}{n} \sum_{i=1}^n x_i,$$
and plugging this value into the derivative for $\sigma^2$ and setting $\frac{\partial}{\partial\sigma^2} \log L(\hat{\mu}, \sigma^2) = 0$ gives
$$\hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^n (x_i - \hat{\mu})^2$$
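The closed forms are easy to check on simulated data. A minimal sketch (assuming numpy; note that the ML variance estimate divides by $n$, not $n-1$):

```python
# Normal MLEs on simulated data: mu-hat is the sample mean, sigma2-hat is
# the average squared deviation (divisor n, the biased variance estimate).
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(loc=10.0, scale=2.0, size=500)   # hypothetical data: mu=10, sigma=2

mu_hat = x.mean()                               # (1/n) * sum(x_i)
sigma2_hat = ((x - mu_hat) ** 2).mean()         # (1/n) * sum((x_i - mu_hat)^2)
print(mu_hat, sigma2_hat)                       # should be near 10 and 4
```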



0.3 Confidence intervals

The previous section provided a way to compute point estimates for parameters. Based on that, our next question is: how good is this point estimate? Or: how close is the estimate to the true value of the parameter?

Instead of just looking at the point estimate, we will now try to compute an interval around the estimated parameter value in which the true parameter is "likely" to fall. An interval like that is called a confidence interval.

Definition 0.3.1 (Confidence Interval)
Let $\hat{\theta}$ be an estimate of $\theta$. If $P(|\hat{\theta} - \theta| < e) > \alpha$, we say that the interval $(\hat{\theta} - e, \hat{\theta} + e)$ is an $\alpha \cdot 100\%$ confidence interval for $\theta$ (cf. fig. 1).

Usually, $\alpha$ is a value near 1, such as 0.9, 0.95, 0.99, 0.999, etc.

Note:

• for any given set of values $x_1, \ldots, x_n$ the value of $\hat{\theta}$ is fixed, as is the interval $(\hat{\theta} - e, \hat{\theta} + e)$.
• The true value $\theta$ is either within the confidence interval or not.

[Figure 1: The probability that $\bar{x}$ falls into an $e$-interval around $\mu$ is $\alpha$. Vice versa, we know that for all of those $\bar{x}$, $\mu$ is within an $e$-interval around $\bar{x}$. That's the idea of a confidence interval.]

A lot of people are tempted to reformulate the above probability to:

!!DON'T DO!! $\quad P(\hat{\theta} - e < \theta < \hat{\theta} + e) > \alpha$

Though it looks OK, it's not. Repeat: IT IS NOT OK. $\theta$ is a fixed value - therefore it does not have a probability of falling into some interval. The only probability that we have here is
$$P(\theta - e < \hat{\theta} < \theta + e) > \alpha,$$
so we can say that $\hat{\theta}$ has a probability of at least $\alpha$ of falling into an $e$-interval around $\theta$. Unfortunately, that by itself doesn't help at all, since we do not know $\theta$!

How do we compute confidence intervals, then? That is different for each estimator. First, we look at estimates of the mean of a distribution.



0.3.1 Large sample C.I. for µ

Situation: we have a large set of observed values ($n > 30$, usually). The assumption is that these values are realizations of $n$ i.i.d. random variables $X_1, \ldots, X_n$ with $E[X_i] = \mu$ and $Var[X_i] = \sigma^2$.

We already know from the previous section that $\bar{X}$ is an unbiased ML estimator for $\mu$. But we know more! The CLT tells us that in exactly this situation $\bar{X}$ is an approximately normally distributed random variable with $E[\bar{X}] = \mu$ and $Var[\bar{X}] = \frac{\sigma^2}{n}$.

We can therefore find the boundary $e$ by using the standard normal distribution. Remember: if $\bar{X} \sim N(\mu, \sigma^2/n)$ then $Z := \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \sim N(0,1) = \Phi$:
$$P(|\bar{X} - \mu| \le e) \ge \alpha \qquad \text{(use standardization)}$$
$$\iff P\left(\frac{|\bar{X} - \mu|}{\sigma/\sqrt{n}} \le \frac{e}{\sigma/\sqrt{n}}\right) \ge \alpha$$
$$\iff P\left(|Z| \le \frac{e}{\sigma/\sqrt{n}}\right) \ge \alpha$$
$$\iff P\left(-\frac{e}{\sigma/\sqrt{n}} \le Z \le \frac{e}{\sigma/\sqrt{n}}\right) \ge \alpha$$
$$\iff \Phi\left(\frac{e}{\sigma/\sqrt{n}}\right) - \Phi\left(-\frac{e}{\sigma/\sqrt{n}}\right) \ge \alpha$$
$$\iff \Phi\left(\frac{e}{\sigma/\sqrt{n}}\right) - \left(1 - \Phi\left(\frac{e}{\sigma/\sqrt{n}}\right)\right) \ge \alpha$$
$$\iff 2\Phi\left(\frac{e}{\sigma/\sqrt{n}}\right) - 1 \ge \alpha$$
$$\iff \Phi\left(\frac{e}{\sigma/\sqrt{n}}\right) \ge \frac{1+\alpha}{2}$$
$$\iff \frac{e}{\sigma/\sqrt{n}} \ge \underbrace{\Phi^{-1}\left(\frac{1+\alpha}{2}\right)}_{:=z}$$
$$\iff e \ge \Phi^{-1}\left(\frac{1+\alpha}{2}\right) \frac{\sigma}{\sqrt{n}}$$

This computation gives an $\alpha \cdot 100\%$ confidence interval around $\mu$ as:
$$\left(\bar{X} - z \cdot \frac{\sigma}{\sqrt{n}},\ \bar{X} + z \cdot \frac{\sigma}{\sqrt{n}}\right)$$
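This interval is a one-liner in software. A minimal sketch (assuming scipy; the notes read $z$ from the normal table instead):

```python
# alpha*100% CI for mu: xbar +/- z * sigma/sqrt(n), with z = Phi^-1((1+alpha)/2).
from math import sqrt
from scipy.stats import norm

def mean_ci(xbar, sigma, n, alpha=0.95):
    z = norm.ppf((1 + alpha) / 2)
    e = z * sigma / sqrt(n)
    return xbar - e, xbar + e

print(mean_ci(21543, 3000, 100))   # the salary example below: 21543 +/- 588
```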

Now we can do an example:

Example 0.3.1
Suppose we want to find a 95% confidence interval for the mean salary of an ISU employee. A random sample of 100 ISU employees gives us a sample mean salary of $\bar{x} = \$21543$. Suppose the standard deviation of salaries is known to be \$3000. Using the above expression, we get a 95% confidence interval as:
$$21543 \pm \Phi^{-1}\left(\frac{1 + 0.95}{2}\right) \cdot \frac{3000}{\sqrt{100}} = 21543 \pm \Phi^{-1}(0.975) \cdot 300$$

How do we read $\Phi^{-1}(0.975)$ from the standard normal table? We look for the $z$ for which the probability $N_{(0,1)}(z) \ge 0.975$! This gives us $z = 1.96$; the 95% confidence interval is then
$$21543 \pm 588,$$
i.e. if we repeat this study 100 times (with 100 different employees each time), we can say: in about 95 out of the 100 studies, the true parameter $\mu$ falls into the computed \$588 range around $\bar{x}$.

Critical values for $z$, depending on $\alpha$, are:

α    | z = Φ⁻¹((1+α)/2)
0.90 | 1.65
0.95 | 1.96
0.98 | 2.33
0.99 | 2.58
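These critical values come straight from the inverse normal CDF. A quick check (assuming scipy):

```python
# z = Phi^-1((1+alpha)/2) for the table above.
from scipy.stats import norm

for alpha in (0.90, 0.95, 0.98, 0.99):
    print(alpha, norm.ppf((1 + alpha) / 2))
# 1.645, 1.960, 2.326, 2.576 - the table rounds these to 1.65, 1.96, 2.33, 2.58
```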

Problem: usually we do not know $\sigma$. Slight generalization: use $s = \sqrt{\frac{1}{n-1} \sum_{i=1}^n (X_i - \bar{X})^2}$ instead of $\sigma$!

An $\alpha \cdot 100\%$ confidence interval for $\mu$ is then given as
$$\left(\bar{X} - z \cdot \frac{s}{\sqrt{n}},\ \bar{X} + z \cdot \frac{s}{\sqrt{n}}\right)$$
where $z = \Phi^{-1}\left(\frac{1+\alpha}{2}\right)$.

Example 0.3.2
Suppose we want to analyze some complicated queueing system for which we have no formulas and no theory. We are interested in the mean queue length of the system after it reaches steady state. The only thing we can do is run simulations of this system and look at the queue length at some large time $t$, e.g. $t = 1000$ hrs. After 50 simulations, we have the data:

$X_1$ = number in queue at time 1000 hrs in 1st simulation
$X_2$ = number in queue at time 1000 hrs in 2nd simulation
...
$X_{50}$ = number in queue at time 1000 hrs in 50th simulation

Our observations yield an average queue length of $\bar{x} = 21.5$ and $s = \sqrt{\frac{1}{n-1} \sum_{i=1}^n (x_i - \bar{x})^2} = 15$. A 90% confidence interval is given as
$$\left(\bar{x} - z \cdot \frac{s}{\sqrt{n}},\ \bar{x} + z \cdot \frac{s}{\sqrt{n}}\right) = \left(21.5 - 1.65 \cdot \frac{15}{\sqrt{50}},\ 21.5 + 1.65 \cdot \frac{15}{\sqrt{50}}\right) = (17.9998,\ 25.0002)$$

Example 0.3.3
The graphs show a set of 80 experiments. The values from each experiment are shown in one of the green-framed boxes. Each experiment consists of simulating 20 values from a standard normal distribution (these are drawn as the small blue lines). For each experiment, the average of the 20 values is computed (that's $\bar{x}$), as well as a confidence interval for $\mu$ - for parts a) and b) it's the 95% confidence interval, for part c) it is the 90% confidence interval, and for part d) it is the 99% confidence interval. The upper and lower confidence bounds, together with the sample mean, are drawn in red next to the sampled observations.


[Figure: four panels of simulated confidence intervals - a) 95% confidence intervals, b) 95% confidence intervals, c) 90% confidence intervals, d) 99% confidence intervals.]

There are several things to see from this diagram. First of all, in this example we know the "true" value of the parameter $\mu$ - since the observations are sampled from a standard normal distribution, $\mu = 0$. The true parameter is represented by the straight horizontal line through 0.



We see that each sample yields a different confidence interval, all of them centered around the sample mean. The different sizes of the intervals tell us another thing: in computing these confidence intervals, we had to use the estimate $s$ instead of the true standard deviation $\sigma = 1$, and each sample gave a slightly different standard deviation. Overall, though, the intervals do not differ very much in length between parts a) and b). The intervals in c) tend to be slightly smaller - these are 90% confidence intervals - whereas the intervals in part d) are on average larger than the first ones; they are 99% confidence intervals.

Almost all the confidence intervals contain 0 - but not all. And that is what we expect. For a 90% confidence interval we expect that in 10 out of 100 cases the confidence interval does not contain the true parameter. When we check this, we see that in part c) 4 out of the 20 confidence intervals don't contain the true parameter $\mu$ - that's 20%, while on average we would expect 10% of the confidence intervals not to contain $\mu$.

Official use of confidence intervals: on average, in 90 out of 100 times the 90% confidence interval of $\theta$ does contain the true value of $\theta$.

0.3.2 Large sample confidence intervals for a proportion p

Let $p$ be a proportion of a large population, or a probability. In order to get an estimate of this proportion, we can take a sample of $n$ individuals from the population and check for each one of them whether or not they fulfill the criterion to be in the proportion of interest. Mathematically, this corresponds to a Bernoulli-$n$-sequence in which we are only interested in the number of "successes", $X$, which in our case corresponds to the number of individuals that qualify for the subgroup of interest.

$X$ then has a binomial distribution with parameters $n$ and $p$. Now think: for a binomial variable $X$, the expected value is $E[X] = n \cdot p$. Therefore we get an estimate $\hat{p}$ for $p$ as $\hat{p} = \frac{1}{n} X$.

Furthermore, we even have a distribution for $\hat{p}$ for large $n$: since $X$ is, by the CLT, approximately a normal variable with $E[X] = np$ and $Var[X] = np(1-p)$, we get that for large $n$, $\hat{p}$ is approximately normally distributed with $E[\hat{p}] = p$ and $Var[\hat{p}] = \frac{p(1-p)}{n}$.

BTW: this tells us that $\hat{p}$ is an unbiased estimator of $p$.

Equipped with the distribution of $\hat{p}$, we can set up an $\alpha \cdot 100\%$ confidence interval as
$$(\hat{p} - e,\ \hat{p} + e)$$
where $e$ is some positive real number with
$$P(|\hat{p} - p| \le e) \ge \alpha.$$
We can derive the expression for $e$ in the same way as in the previous section, and we come up with:
$$e = z \cdot \sqrt{\frac{p(1-p)}{n}}$$
where $z = \Phi^{-1}\left(\frac{1+\alpha}{2}\right)$.

We again run into the problem that $e$ in this form is not ready for use, since we do not know the value of $p$. In this situation, we have two options: we can either replace $p(1-p)$ by the value that maximizes it, or we can substitute an appropriate estimate for $p$.

0.3.2.1 Conservative Method

Replace $p(1-p)$ by something that is guaranteed to be at least as large: the function $p(1-p)$ has its maximum at $p = 0.5$, where $p(1-p)$ is 0.25.



The conservative $\alpha \cdot 100\%$ confidence interval for $p$ is
$$\hat{p} \pm z \cdot \frac{1}{2\sqrt{n}}$$
where $z = \Phi^{-1}\left(\frac{1+\alpha}{2}\right)$.

0.3.2.2 Substitution Method

Substitute $\hat{p}$ for $p$. The $\alpha \cdot 100\%$ confidence interval for $p$ by substitution is
$$\hat{p} \pm z \cdot \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$$
where $z = \Phi^{-1}\left(\frac{1+\alpha}{2}\right)$.

What is the difference between the two methods?

• for large $n$ there is almost no difference at all
• if $\hat{p}$ is close to 0.5, there is also almost no difference

Beyond that, conservative confidence intervals (as the name says) are larger than confidence intervals found by substitution. However, they are at the same time easier to compute. A sketch comparing the two follows.
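A minimal sketch of both intervals side by side (assuming scipy):

```python
# Conservative vs. substitution half-widths for a proportion CI.
from math import sqrt
from scipy.stats import norm

def proportion_ci_halfwidths(p_hat, n, alpha=0.95):
    z = norm.ppf((1 + alpha) / 2)
    e_cons = z / (2 * sqrt(n))                  # uses sqrt(p(1-p)) <= 1/2
    e_subs = z * sqrt(p_hat * (1 - p_hat) / n)  # plugs in p_hat for p
    return e_cons, e_subs

print(proportion_ci_halfwidths(0.6, 100))      # the queueing example below: ~0.098, ~0.096
```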

Example 0.3.4 (Complicated queueing system, continued)
Suppose that now we are interested in the large-$t$ probability $p$ that a server is available. Doing 100 simulations has shown that in 60 of them a server was available at time $t = 1000$ hrs. What is a 95% confidence interval for this probability?

If 60 out of 100 simulations showed a free server, we can use $\hat{p} = \frac{60}{100} = 0.6$ as an estimate for $p$. For a 95% confidence interval, $z = \Phi^{-1}(0.975) = 1.96$.

The conservative confidence interval is:
$$\hat{p} \pm z \cdot \frac{1}{2\sqrt{n}} = 0.6 \pm 1.96 \cdot \frac{1}{2\sqrt{100}} = 0.6 \pm 0.098.$$

For the confidence interval using substitution we get:
$$\hat{p} \pm z \cdot \sqrt{\frac{\hat{p}(1-\hat{p})}{n}} = 0.6 \pm 1.96 \cdot \sqrt{\frac{0.6 \cdot 0.4}{100}} = 0.6 \pm 0.096.$$

Example 0.3.5 (Batting Average)
In the 2002 season the baseball player Sammy Sosa had a batting average of 0.288. (The batting average is the ratio of the number of hits to the number of times at bat.) Sammy Sosa was at bat 555 times in the 2002 season. Could the "true" batting average still be 0.300? Compute a 95% confidence interval for the true batting average.

The conservative method gives:
$$0.288 \pm 1.96 \cdot \frac{1}{2\sqrt{555}} = 0.288 \pm 0.042$$



The substitution method gives:
$$0.288 \pm 1.96 \cdot \sqrt{\frac{0.288(1 - 0.288)}{555}} = 0.288 \pm 0.038$$

The substitution method gives a slightly smaller confidence interval, but both intervals contain 0.3. There is not enough evidence to conclude that the true average is not 0.3.

Confidence intervals give us a way to measure the precision we get from simulations intended to evaluate probabilities. Besides that, they also give us a way to plan how large a sample has to be to reach a desired precision.

Example 0.3.6
Suppose we want to estimate the fraction of records in the 2000 IRS database that have a taxable income over $35K. We want a 98% confidence interval and wish to estimate the quantity to within 0.01. This means that our boundary $e$ needs to be at most 0.01 (we choose a conservative confidence interval for ease of computation):
$$e \le 0.01 \iff z \cdot \frac{1}{2\sqrt{n}} \le 0.01 \qquad (z \text{ is } 2.33)$$
$$\iff 2.33 \cdot \frac{1}{2\sqrt{n}} \le 0.01 \iff \sqrt{n} \ge \frac{2.33}{2 \cdot 0.01} = 116.5 \Rightarrow n \ge 13573$$
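The same planning computation as a sketch (assuming scipy):

```python
# Smallest n with z/(2*sqrt(n)) <= e for a conservative proportion CI.
from math import ceil
from scipy.stats import norm

alpha, e = 0.98, 0.01
z = norm.ppf((1 + alpha) / 2)     # 2.326...; the notes use the table value 2.33
n = ceil((z / (2 * e)) ** 2)
print(z, n)                       # exact z gives n = 13530; the table's 2.33 gives 13573
```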

0.3.3 Related C.I. Methods

Related to the previous confidence intervals are confidence intervals for the difference between two means, $\mu_1 - \mu_2$, or the difference between two proportions, $p_1 - p_2$. Confidence intervals for these differences are given as:

large-$n$ confidence interval for $\mu_1 - \mu_2$ (based on independent $\bar{X}_1$ and $\bar{X}_2$):
$$\bar{X}_1 - \bar{X}_2 \pm z \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$$

large-$n$ confidence interval for $p_1 - p_2$ (based on independent $\hat{p}_1$ and $\hat{p}_2$):
$$\hat{p}_1 - \hat{p}_2 \pm z \cdot \frac{1}{2} \sqrt{\frac{1}{n_1} + \frac{1}{n_2}} \quad \text{(conservative)}$$
or
$$\hat{p}_1 - \hat{p}_2 \pm z \sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \frac{\hat{p}_2(1-\hat{p}_2)}{n_2}} \quad \text{(substitution)}$$

Why? The argumentation in both cases is very similar - we will only discuss the confidence interval for the difference between means. $\bar{X}_1 - \bar{X}_2$ is approximately normal, since $\bar{X}_1$ and $\bar{X}_2$ are approximately normal and independent, with
$$E[\bar{X}_1 - \bar{X}_2] = E[\bar{X}_1] - E[\bar{X}_2] = \mu_1 - \mu_2$$
$$Var[\bar{X}_1 - \bar{X}_2] = Var[\bar{X}_1] + (-1)^2 Var[\bar{X}_2] = \frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}$$
Then we can use the same arguments as before and get a C.I. for $\mu_1 - \mu_2$ as shown above. ✷

Example 0.3.7
Assume we have two parts of the IRS database: East Coast and West Coast. We want to compare the mean taxable income reported from the two regions in 2000.

                       East Coast    West Coast
# of sampled records:  n₁ = 1000     n₂ = 2000
mean taxable income:   x̄₁ = $37000   x̄₂ = $42000
standard deviation:    s₁ = $10100   s₂ = $15600

We can, for example, compute a two-sided 95% confidence interval for $\mu_1 - \mu_2$, the difference in mean taxable income as reported in the 2000 tax returns between East and West Coast:
$$37000 - 42000 \pm 1.96 \sqrt{\frac{10100^2}{1000} + \frac{15600^2}{2000}} = -5000 \pm 927$$

Note: this shows pretty conclusively that the mean West Coast taxable income is higher than the mean East Coast taxable income (in the 2000 returns). The interval contains only negative numbers - if it contained 0, the message wouldn't be so clear.
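The same interval as a sketch (assuming scipy):

```python
# Two-sided large-n CI for mu1 - mu2.
from math import sqrt
from scipy.stats import norm

def diff_means_ci(x1, s1, n1, x2, s2, n2, alpha=0.95):
    z = norm.ppf((1 + alpha) / 2)
    e = z * sqrt(s1**2 / n1 + s2**2 / n2)
    return (x1 - x2) - e, (x1 - x2) + e

print(diff_means_ci(37000, 10100, 1000, 42000, 15600, 2000))  # about (-5927, -4073)
```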

One-sided intervals

Idea: use only one of the end points $\bar{x} \pm z \frac{s}{\sqrt{n}}$. This yields confidence intervals for $\mu$ of the form
$$\underbrace{(-\infty,\ \#)}_{\text{upper bound}} \qquad \underbrace{(\#\#,\ \infty)}_{\text{lower bound}}$$
However, now we need to adjust $z$ to the new situation. Instead of worrying about two tails of the normal distribution, for a one-sided confidence interval we use only one tail.

[Figure 2: One-sided (upper bounded) confidence interval for $\mu$ (in red).]

Example 0.3.8 (Complicated queueing system, continued)
What is a 95% upper confidence bound for $\mu$, the mean length of the queue? $\bar{x} + z \frac{s}{\sqrt{n}}$ is the upper confidence bound. Instead of $z = \Phi^{-1}(\frac{\alpha+1}{2})$ we use $z = \Phi^{-1}(\alpha)$ (see fig. 2). This gives
$$21.5 + 1.65 \cdot \frac{15}{\sqrt{50}} = 25.0$$
as the upper confidence bound. Therefore the one-sided, upper bounded confidence interval is $(-\infty, 25.0)$.



Critical values $z = \Phi^{-1}(\alpha)$ for the one-sided confidence interval are:

α    | z = Φ⁻¹(α)
0.90 | 1.29
0.95 | 1.65
0.98 | 2.06
0.99 | 2.33

Example 0.3.9
Two different digital communication systems each send 100 large messages, and we determine how many are corrupted in transmission: $\hat{p}_1 = 0.05$ and $\hat{p}_2 = 0.10$. What is the difference in the corruption rates? Find a 98% confidence interval. Using the substitution method:
$$0.05 - 0.10 \pm 2.33 \cdot \sqrt{\frac{0.05 \cdot 0.95}{100} + \frac{0.10 \cdot 0.90}{100}} = -0.05 \pm 0.086$$
This calculation tells us that, based on these sample sizes, we don't even have a solid idea about the sign of $p_1 - p_2$, i.e. we can't tell which of the $p_i$ is larger.

So far we have only considered large sample confidence intervals. The problem with smaller sample sizes is that the normal approximation from the CLT doesn't work when the variance $\sigma^2$ is unknown. What you need to know is that there exist different methods to compute C.I.s for smaller sample sizes.

0.4 Hypothesis Testing

Example 0.4.1 (Tea Tasting Lady)
It is claimed that a certain lady is able to tell, by tasting a cup of tea with milk, whether the milk was put in first or the tea was put in first. To put the claim to the test, the lady is given 10 cups of tea to taste and is asked to state in each case whether the milk went in first or the tea went in first. To guard against deliberate or accidental communication of information, before pouring each cup of tea a coin is tossed to decide whether the milk goes in first or the tea goes in first. The person who brings the cup of tea to the lady does not know the outcome of the coin toss.

Either the lady has some skill (she can tell the difference, at least to some extent) or she has not, in which case she is simply guessing. Suppose the lady tasted 10 cups of tea in this manner and got 9 of them right. This looks rather suspicious; the lady seems to have some skill. But how can we check this?

We start with the sceptical assumption that the lady does not have any skill. If the lady has no skill at all, the probability that she gives a correct answer for any single cup of tea is 1/2. The number of cups she gets right therefore has a binomial distribution with parameters n = 10 and p = 0.5. The diagram shows the probability mass function of this distribution:


[Figure: probability mass function p(x) of the B(10, 0.5) distribution, with the observed value x = 9 marked.]

Events that are as unlikely or less likely are that the lady got all 10 cups right or - very different, but nevertheless very rare - that she got only 1 cup or none right (note, this would be evidence of some "anti-skill", but it would certainly be evidence against her guessing).

The total probability of these events is (remember, the binomial probability mass function is $p(x) = \binom{n}{x} p^x (1-p)^{n-x}$):
$$p(0) + p(1) + p(9) + p(10) = 0.5^{10} + 10 \cdot 0.5^{10} + 10 \cdot 0.5^{10} + 0.5^{10} = 0.021,$$
i.e. what we have just observed is a fairly rare event under the assumption that the lady is only guessing. This suggests that the lady may have some skill in detecting which was poured first into the cup.

Jargon: 0.021 is called the p-value for testing the hypothesis p = 0.5. The fact that the p-value is small is evidence against the hypothesis.
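The p-value is a direct sum of binomial pmf values. A minimal sketch (assuming scipy):

```python
# p-value for the tea-tasting lady: P(outcome at least as extreme as 9 of 10)
# under p = 0.5, i.e. the events {0, 1, 9, 10}.
from scipy.stats import binom

p_value = sum(binom.pmf(x, 10, 0.5) for x in (0, 1, 9, 10))
print(p_value)   # 22/1024 = 0.0215, the 0.021 from the text
```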

Hypothesis testing is a formal procedure to check whether or not some previously made assumption can be rejected based on the data. We are going to abstract the main elements of the previous example and cook up a standard series of steps for hypothesis testing:

Example 0.4.2
University computing center administrators have historical records indicating that between August and October 2002 the mean time between hits on the ISU homepage was 2.0 min. They suspect that the mean time between hits has in fact decreased (i.e. traffic is up) - sampling 50 inter-arrival times from the records for November 2002 gives $\bar{x} = 1.7$ min and $s = 1.9$ min. Is this strong evidence for an increase in traffic?
Is this strong evidence for an increase in traffic?



Formal Procedure (with its application to the example):

1. State a "null hypothesis" of the form H₀: function of parameter(s) = #, meant to embody a status quo / pre-data view.
   Here: H₀: µ = 2.0 min between hits.

2. State an "alternative hypothesis" of the form Hₐ: function of parameter(s) ≠ # (or > #, or < #), meant to identify a departure from H₀.
   Here: Hₐ: µ < 2 (the mean time between hits is down, i.e. traffic is up).

3. State test criteria - consisting of a test statistic, a "reference distribution" giving the behavior of the test statistic if H₀ is true, and the kinds of values of the test statistic that count as evidence against H₀.
   Here: the test statistic will be $Z = \frac{\bar{X} - 2.0}{s/\sqrt{n}}$. The reference distribution will be standard normal; large negative values of $Z$ count as evidence against H₀ in favor of Hₐ.

4. Show computations.
   Here: the sample gives $z = \frac{1.7 - 2.0}{1.9/\sqrt{50}} = -1.12$.

5. Report and interpret a p-value = the "observed level of significance with which H₀ can be rejected". This is the probability of an observed value of the test statistic at least as extreme as the one at hand. The smaller this value is, the less likely it is that H₀ is true.
   Here: the p-value is $P(Z \le -1.12) = \Phi(-1.12) = 0.1314$. This value is not terribly small - the evidence of a decrease in the mean time between hits is somewhat weak.

Note aside: a 90% confidence interval for µ is
$$\bar{x} \pm 1.65 \frac{s}{\sqrt{n}} = 1.7 \pm 0.44.$$
This interval contains the hypothesized value of µ = 2.0.

There are four basic hypothesis tests of this form, testing a mean, a proportion, or the differences between two means or two proportions. Depending on the hypothesis, the test statistic is different. Here is an overview of the tests we are going to use; in every case the reference distribution of Z is standard normal (see the helper sketched below):

H₀: µ = #:
$$Z = \frac{\bar{X} - \#}{s/\sqrt{n}}$$

H₀: p = #:
$$Z = \frac{\hat{p} - \#}{\sqrt{\frac{\#(1-\#)}{n}}}$$

H₀: µ₁ − µ₂ = #:
$$Z = \frac{\bar{X}_1 - \bar{X}_2 - \#}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}$$

H₀: p₁ − p₂ = #:
$$Z = \frac{\hat{p}_1 - \hat{p}_2 - \#}{\sqrt{\hat{p}(1-\hat{p})} \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}, \quad \text{where } \hat{p} = \frac{n_1\hat{p}_1 + n_2\hat{p}_2}{n_1 + n_2}.$$
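A minimal helper for the one-sample mean test (assuming scipy; the other three statistics follow the same pattern with the standard errors above):

```python
# Z = (xbar - mu0) / (s/sqrt(n)) with a left-, right-, or two-tailed p-value.
from math import sqrt
from scipy.stats import norm

def z_test_mean(xbar, s, n, mu0, tail="two"):
    z = (xbar - mu0) / (s / sqrt(n))
    if tail == "left":
        p = norm.cdf(z)            # P(Z <= z)
    elif tail == "right":
        p = norm.sf(z)             # P(Z >= z)
    else:
        p = 2 * norm.sf(abs(z))    # P(|Z| >= |z|)
    return z, p

print(z_test_mean(1.7, 1.9, 50, 2.0, tail="left"))   # Example 0.4.2: z ~ -1.12, p ~ 0.13
```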

Example 0.4.3 (Tax Fraud)
Historically, IRS taxpayer compliance audits have revealed that about 5% of individuals do things on their tax returns that invite criminal prosecution. A sample of n = 1000 tax returns produces $\hat{p} = 0.061$ as an estimate of the fraction of fraudulent returns. Does this provide a clear signal of change in taxpayer behavior?

1. State the null hypothesis: H₀: p = 0.05
2. Alternative hypothesis: Hₐ: p ≠ 0.05
3. Test statistic:
$$Z = \frac{\hat{p} - 0.05}{\sqrt{0.05 \cdot 0.95/n}}$$
Under the null hypothesis, Z has a standard normal distribution; any large values of Z - positive or negative - count as evidence against H₀.
4. Computation: $z = (0.061 - 0.05)/\sqrt{0.05 \cdot 0.95/1000} = 1.59$
5. p-value: $P(|Z| \ge 1.59) = P(Z \le -1.59) + P(Z \ge 1.59) = 0.11$. This is not a very small value; we therefore have only very weak evidence against H₀.

Example 0.4.4 (Lifetime of Disk Drives)
n₁ = 30 and n₂ = 40 disk drives of two different designs were tested under conditions of "accelerated" stress, and the times to failure were recorded:

                Standard Design    New Design
                n₁ = 30            n₂ = 40
                x̄₁ = 1205 hr       x̄₂ = 1400 hr
                s₁ = 1000 hr       s₂ = 900 hr

Does this provide conclusive evidence that the new design has a larger mean time to failure under "accelerated" stress conditions?

1. State the null hypothesis: H₀: µ₁ = µ₂ (µ₁ − µ₂ = 0)
2. Alternative hypothesis: Hₐ: µ₁ < µ₂ (µ₁ − µ₂ < 0)
3. Test statistic:
$$Z = \frac{\bar{x}_1 - \bar{x}_2 - 0}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}$$
Under the null hypothesis, Z has a standard normal distribution; we will consider large negative values of Z as evidence against H₀.
4. Computation: $z = (1205 - 1400 - 0)/\sqrt{1000^2/30 + 900^2/40} = -0.84$
5. p-value: $P(Z < -0.84) = 0.2005$. This is not a very small value; we therefore have only very weak evidence against H₀.

Example 0.4.5 (Queueing Systems)
We have two very complicated queueing systems, and we would like to know whether there is a difference in the large-$t$ probabilities of there being an available server. We run simulations for each system (each run with a different random seed) and check whether at time t = 2000 a server is available:

System 1: n₁ = 1000 runs, $\hat{p}_1 = \frac{551}{1000}$
System 2: n₂ = 500 runs, $\hat{p}_2 = \frac{303}{500}$

How strong is the evidence of a difference between the t = 2000 availabilities of a server for the two systems?



1. State the null hypothesis: H₀: p₁ = p₂ (p₁ − p₂ = 0)
2. Alternative hypothesis: Hₐ: p₁ ≠ p₂ (p₁ − p₂ ≠ 0)
3. Preliminary: note that if there were no difference between the two systems, a plausible pooled estimate of the availability of a server would be
$$\hat{p} = \frac{n_1\hat{p}_1 + n_2\hat{p}_2}{n_1 + n_2} = \frac{551 + 303}{1000 + 500} = 0.569$$
A test statistic is:
$$Z = \frac{\hat{p}_1 - \hat{p}_2 - 0}{\sqrt{\hat{p}(1-\hat{p})} \cdot \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}$$
Under the null hypothesis, Z has a standard normal distribution; we will consider large values of |Z| as evidence against H₀.
4. Computation: $z = (0.551 - 0.606)/\left(\sqrt{0.569 \cdot (1 - 0.569)} \cdot \sqrt{1/1000 + 1/500}\right) = -2.03$
5. p-value: $P(|Z| > 2.03) = 0.04$. This is fairly strong evidence of a real difference in the t = 2000 availabilities of a server between the two systems.

0.5 Goodness of Fit Tests

The basic situation is still the same as in the previous section: we have n realizations $x_1, \ldots, x_n$ (observed data) of independent, identically distributed random variables $X_1, \ldots, X_n$.

A goodness of fit test is different from the previous ones. Here we don't test a single parameter; instead, we test the whole distribution underlying our observations. Basically, the null hypothesis will be H₀: the data follow a specified distribution F, vs. Hₐ: the data do not follow the specified distribution.

For this problem there are different approaches, depending on whether the specified distribution is continuous or discrete. For simplicity, we will only consider the case of a finite discrete distribution, i.e. we are dealing with a finite sample space $\Omega = \{1, \ldots, k\}$ on which we have a probability mass function $p$. The above null hypothesis then becomes
$$H_0: p_X(i) = p(i) \text{ for } i = 1, \ldots, k \quad \text{vs.} \quad H_a: p_X(i) \ne p(i) \text{ for at least one } i \in \{1, \ldots, k\}$$



Example 0.5.1 (M&Ms)
On the web page of the M&M/Mars company, www.m-ms.com, the percentages of each color in a bag of peanut M&Ms are given (brown, yellow, red, and blue 20% each; orange and green 10% each - see the null hypothesis below). These percentages form a probability mass function for the colors in an M&Ms bag. A count for two different bags gave the following numbers for each color:

bag       | brown | yellow | red | blue | orange | green | sum
129 GM 12 | 43    | 28     | 26  | 25   | 33     | 24    | 179
129 GM 22 | 40    | 38     | 36  | 20   | 24     | 16    | 174

How do we check whether these numbers come from the distribution given on the web site? To get an answer to that question, we need to think about what kind of results we expected to get.

Think: for each color, we have a certain probability $p_{color}$ of drawing an M&M of this color out of the bag, vs. $1 - p_{color}$ for a different color. We can therefore think of the number of M&Ms of each color as a random variable with a binomial distribution. $N_{br}$, the number of brown M&Ms, has parameters $n$ and $p_{br}$.

Under the null hypothesis
$$H_0: p_{br} = p_r = p_{ye} = p_{bl} = 0.2, \quad p_{or} = p_{gr} = 0.1$$
vs.
$$H_a: \text{one of the } p_i \text{ is different from the above specification},$$
$N_{br}$ has a $B_{n,0.2}$ distribution. For the first bag we therefore expect $N_{br}$ to be $0.2 \cdot 179 = 35.8$. In the same manner we can compute the expected values for all the other colors in each bag:

bag       | brown | yellow | red  | blue | orange | green
129 GM 12 | 35.8  | 35.8   | 35.8 | 35.8 | 17.9   | 17.9
129 GM 22 | 34.8  | 34.8   | 34.8 | 34.8 | 17.4   | 17.4

Now we need a test statistic that measures the difference between what we have observed and what we expected.

As a test statistic, we will use
$$Q = \sum_{j=1}^k \frac{(obs_j - exp_j)^2}{exp_j},$$
where
$obs_j$: the number of times $j$ is observed among the $x_i$, $i = 1, \ldots, n$, and
$exp_j$: the expected number of $j$s $= n \cdot p(j)$.

Theorem 0.5.1
The test statistic $Q$, defined as above, has a $\chi^2$ distribution with $k - 1$ degrees of freedom.

In order to be able to use that, we obviously need some more information about the $\chi^2$ distribution.



The χ² distribution

Given a set of independent standard normal random variables $Z_1, \ldots, Z_r$, the distribution of their sum of squares
$$X := \sum_{i=1}^r Z_i^2$$
is called the $\chi^2$ distribution with $r$ degrees of freedom. The density function itself is a bit complicated (it's a special case of a Gamma distribution); all we need to know about the distribution at this stage is
$$E[X] = r, \qquad Var[X] = 2r,$$
and that, roughly, $P(X \ge 2(r+1)) \le 0.05$. For large $r$ the probability is far smaller than 0.05.

Why does the above test statistic $Q$ have a $\chi^2$ distribution with $k - 1$ degrees of freedom? This is difficult to prove, but it is at least plausible: the parts from which $Q$ is put together look - almost - like squared standard normal variables. Since $obs_j$ has a binomial distribution, for large $n$ we may assume that we can approximate its distribution by $N(np(j), np(j)(1 - p(j)))$. A standardization of $obs_j$ would therefore look like:
$$\frac{obs_j - np(j)}{\sqrt{np(j)(1 - p(j))}} = \frac{1}{\sqrt{1 - p(j)}} \cdot \frac{obs_j - exp_j}{\sqrt{exp_j}}.$$
The degrees of freedom are reduced by one because the random variables are dependent: once we know the counts of five colors in the bag, we get the sixth by subtracting the other counts from the total number of M&Ms in the bag.

A more formal reason for the degrees of freedom is given by computing the expected value of $Q$, which we can do by using $E[X^2] = Var[X] + (E[X])^2$:
$$E[Q] = \sum_{j=1}^k \frac{1}{np(j)} E[(obs_j - np(j))^2] = \sum_{j=1}^k \frac{1}{np(j)} \Big(\underbrace{Var[obs_j - np(j)]}_{= Var[obs_j]} + \underbrace{(E[obs_j - np(j)])^2}_{= 0}\Big) = \sum_{j=1}^k \frac{1}{np(j)} \, np(j)(1 - p(j)) = \sum_{j=1}^k (1 - p(j)) = k - \sum_{j=1}^k p(j) = k - 1.$$

Now that we have a reference distribution for the test, we need to identify the values that we will count as evidence against H₀. Since we've squared the differences between expected and observed values, only large positive values of $Q$ count as evidence against H₀.

Now, let's get back to our example:

Example 0.5.2 (M&Ms, continued)
The value of $Q$ under the above null hypothesis about the color distribution is 23.91 for the first bag (129 GM 12) and 10.02 for the second bag (129 GM 22). The p-values for these results are 0.00023 and 0.075, respectively.

For the first bag it is highly unlikely that the M&Ms have the color distribution posted on the web site; for the second bag, however, we can't quite reject the null hypothesis with the same vigor. The p-value is still quite small, though.



Maybe there is something wrong with the filling routine at M&Ms' ...

We can also look at which of the colors contribute most to the $Q$ statistic (unsquared). These numbers, $(obs_j - exp_j)/\sqrt{exp_j}$, are called the residuals:

bag       | brown | yellow | red   | blue  | orange | green
129 GM 12 | 1.20  | -1.30  | -1.64 | -1.81 | 3.57   | 1.44
129 GM 22 | 0.88  | 0.54   | 0.20  | -2.51 | 1.58   | -0.34

The largest residuals are too many orange M&Ms in the first bag and too few blue ones in the second. If we had combined the results from the two bags, the result for $Q$ would have been even more extreme:

color     | brown | yellow | red   | blue  | orange | green | sum
# in bags | 83    | 66     | 62    | 45    | 57     | 40    | 353
expected  | 70.6  | 70.6   | 70.6  | 70.6  | 35.3   | 35.3  | 353
residuals | 1.48  | -0.55  | -1.02 | -3.05 | 3.65   | 0.79  | Q = 26.77

Here, the two largest residuals are for the blue and orange M&Ms. It seems as if the shortfall of blue M&Ms is made up by orange ones.
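The whole computation for the combined counts as a sketch (assuming scipy; scipy.stats.chisquare computes exactly the Q statistic above):

```python
# Chi-square goodness-of-fit for the combined M&M counts.
import numpy as np
from scipy.stats import chisquare

observed = np.array([83, 66, 62, 45, 57, 40])      # brown ... green, both bags
probs = np.array([0.2, 0.2, 0.2, 0.2, 0.1, 0.1])   # H0 color distribution
expected = probs * observed.sum()                  # 70.6, ..., 35.3

q, p_value = chisquare(observed, expected)         # reference: chi^2 with k-1 = 5 df
residuals = (observed - expected) / np.sqrt(expected)
print(q, p_value)            # Q = 26.77, p-value well below 0.001
print(residuals.round(2))    # matches the residual row of the table
```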

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!