The normal distribution is extremely common and useful for one main reason: the normal distribution approximates a lot of other distributions. This is the result of one of the most fundamental theorems in mathematics:

0.1 Central Limit Theorem (CLT)
Theorem 0.1.1 (Central Limit Theorem)
If $X_1, X_2, \ldots, X_n$ are $n$ independent, identically distributed random variables with $E[X_i] = \mu$ and $Var[X_i] = \sigma^2$, then the sample mean $\bar{X} := \frac{1}{n}\sum_{i=1}^n X_i$ is approximately normally distributed with $E[\bar{X}] = \mu$ and $Var[\bar{X}] = \sigma^2/n$, i.e.
$\bar{X} \sim N(\mu, \frac{\sigma^2}{n})$ or $\sum_i X_i \sim N(n\mu, n\sigma^2)$.
Corollary 0.1.2
(a) for large $n$ the binomial distribution $B_{n,p}$ is approximately normal $N_{np,\,np(1-p)}$.
(b) for large $\lambda$ the Poisson distribution $Po_\lambda$ is approximately normal $N_{\lambda,\lambda}$.
(c) for large $k$ the Erlang distribution $Erlang_{k,\lambda}$ is approximately normal $N_{k/\lambda,\,k/\lambda^2}$.
Why?
(a) Let $X$ be a variable with a $B_{n,p}$ distribution.
We know that $X$ is the result of repeating the same Bernoulli experiment $n$ times and counting the overall number of successes. We can therefore write $X$ as the sum of $n$ $B_{1,p}$ variables $X_i$:
$X := X_1 + X_2 + \ldots + X_n$
$X$ is then the sum of $n$ independent, identically distributed random variables. The Central Limit Theorem then states that $X$ has an approximate normal distribution with $E[X] = nE[X_i] = np$ and $Var[X] = nVar[X_i] = np(1-p)$.
(b) It is enough to show the statement for the case that $\lambda$ is a large integer:
Let $Y$ be a Poisson variable with rate $\lambda$. Then we can think of $Y$ as the number of occurrences in an experiment that runs for time $\lambda$ - that is the same as observing $\lambda$ experiments that each run independently for time 1 and adding their results:
$Y = Y_1 + Y_2 + \ldots + Y_\lambda$, with $Y_i \sim Po_1$.
Again, $Y$ is the sum of independent, identically distributed random variables. The Central Limit Theorem then states that $Y$ has an approximate normal distribution with $E[Y] = \lambda \cdot 1 = \lambda$ and $Var[Y] = \lambda\, Var[Y_i] = \lambda$.
(c) This statement is the easiest to prove, since an $Erlang_{k,\lambda}$ distributed variable $Z$ is by definition the sum of $k$ independent, identically distributed exponential variables $Z_1, \ldots, Z_k$ with rate $\lambda$.
For $Z$ the CLT holds, and we get that $Z$ is approximately normally distributed with $E[Z] = kE[Z_i] = k/\lambda$ and $Var[Z] = k\,Var[Z_i] = k/\lambda^2$.
Why do we need the central limit theorem at all? First of all, the CLT gives us the distribution of the sample mean in a very general setting: the only thing we need to know is that all observed values come from the same distribution and that the variance of this distribution is finite.
A second reason is that most tables only contain probabilities up to a certain limit - the Poisson table, e.g., only has values for $\lambda \le 10$, and the Binomial distribution is tabled only for $n \le 20$. Beyond that, we can use the normal approximation to get probabilities.
Example 0.1.1 Hits on a webpage
Hits occur with a rate of 2 per min.
What is the probability to wait more than 20 min for the 50th hit?
Let $Y$ be the waiting time until the 50th hit.
We know: $Y$ has an $Erlang_{50,2}$ distribution. Therefore:
$P(Y > 20) = 1 - Erlang_{50,2}(20) = 1 - (1 - Po_{2\cdot 20}(50-1)) = Po_{40}(49) \overset{CLT}{\approx} N_{40,40}(49) = \Phi\left(\frac{49-40}{\sqrt{40}}\right) = \Phi(1.42) \overset{table}{=} 0.9222.$
✷
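As a sanity check of the approximation step $Po_{40}(49) \approx N_{40,40}(49)$, here is a small Python sketch (the helper names are our own, not from the notes) comparing the exact Poisson CDF with its normal approximation:

```python
from math import exp, erf, sqrt

def poisson_cdf(k, lam):
    """P(X <= k) for X ~ Poisson(lam), summing the mass function directly."""
    term, total = exp(-lam), 0.0
    for i in range(k + 1):
        total += term          # term is P(X = i)
        term *= lam / (i + 1)  # recurrence P(X = i+1) = P(X = i) * lam/(i+1)
    return total

def normal_cdf(x, mu, var):
    """Phi((x - mu)/sqrt(var)) via the error function."""
    return 0.5 * (1 + erf((x - mu) / sqrt(2 * var)))

exact = poisson_cdf(49, 40)      # Po_40(49)
approx = normal_cdf(49, 40, 40)  # CLT approximation N(40, 40)
print(exact, approx)             # the two values should be close
```

The normal value reproduces the table lookup $\Phi(1.42) \approx 0.9222$; the exact Poisson CDF is slightly larger because the Poisson distribution is a little skewed.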
Example 0.1.2 Mean of Uniform Variables
Let $U_1, U_2, U_3, U_4$, and $U_5$ be standard uniform variables, i.e. $U_i \sim U_{(0,1)}$.
Without the CLT we would have no idea what distribution the sample mean $\bar{U} = \frac{1}{5}\sum_{i=1}^5 U_i$ has!
With it, we know: $\bar{U} \overset{approx}{\sim} N(0.5, \frac{1}{60})$.
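A quick simulation (a sketch of ours, not part of the original notes) makes the claim $\bar U \approx N(0.5, 1/60)$ plausible: the mean and variance of simulated sample means should land near $0.5$ and $1/60 \approx 0.0167$.

```python
import random
from statistics import fmean, pvariance

random.seed(1)  # reproducible run
# draw 200,000 sample means of 5 standard uniforms each
means = [fmean(random.random() for _ in range(5)) for _ in range(200_000)]

m = fmean(means)      # should be close to 0.5
v = pvariance(means)  # should be close to 1/60
print(m, v)
```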
Issue: the accuracy of the approximation
• increases with $n$
• increases with the amount of symmetry in the distribution of the $X_i$
Rule of thumb for the Binomial distribution:
Use the normal approximation for $B_{n,p}$ if $np > 5$ (if $p \le 0.5$) or $n(1-p) > 5$ (if $p \ge 0.5$)!
From now on, we will use probability theory only to find answers to the questions arising from specific problems we are working on.
In this chapter we want to draw inferences about some characteristic of an underlying population - e.g. the average height of a person. Instead of measuring this characteristic for each individual, we will draw a sample, i.e. choose a "suitable" subset of the population and measure the characteristic only for those individuals. Using some probabilistic arguments we can then extend the information we got from that sample and make an estimate of the characteristic for the whole population. Probability theory will give us the means to find those estimates and to measure how "probable" our estimates are.
Of course, choosing the sample is crucial. We will demand two properties from a sample:
• the sample should be representative - taking only basketball players into the sample would change our estimate of a person's height drastically.
• for a large sample, the estimate should come close to the "true" value of the characteristic.
The three main areas of statistics are
• estimation of parameters:
point or interval estimates: "my best guess for value x is . . . ", "my guess is that value x is in interval (a, b)"
• evaluation of the plausibility of values: hypothesis testing
• prediction of future (individual) values
0.2 Parameter Estimation
Statistics are all around us - scores in sports, prices at the grocer's, weather reports (and how often they turn out to be close to the actual weather), taxes, evaluations . . .
The most basic form of statistics is descriptive statistics.
But what exactly is a statistic? Here is the formal definition:
Definition 0.2.1 (Statistic)
Any function $W(x_1, \ldots, x_k)$ of observed values $x_1, \ldots, x_k$ is called a statistic.
Some statistics you already know are:
Mean (Average): $\bar{X} = \frac{1}{n}\sum_i X_i$
Minimum: $X_{(1)}$ - parentheses indicate that the values are sorted
Maximum: $X_{(n)}$
Range: $X_{(n)} - X_{(1)}$
Mode: value(s) that appear(s) most often
Median: "middle value" - the value for which one half of the data is larger and the other half is smaller. If $n$ is odd, the median is $X_{((n+1)/2)}$; if $n$ is even, the median is the average of the two middle values: $0.5 \cdot X_{(n/2)} + 0.5 \cdot X_{(n/2+1)}$.
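All of these statistics are easy to compute directly; a small Python sketch (the function name `describe` is our own invention):

```python
def describe(data):
    """Compute the basic statistics listed above for a list of numbers."""
    xs = sorted(data)  # X_(1), ..., X_(n)
    n = len(xs)
    mean = sum(xs) / n
    if n % 2 == 1:
        median = xs[n // 2]  # X_((n+1)/2) in 0-based indexing
    else:
        median = 0.5 * (xs[n // 2 - 1] + xs[n // 2])  # average of the middle two
    counts = {x: data.count(x) for x in set(data)}
    top = max(counts.values())
    mode = sorted(x for x, c in counts.items() if c == top)  # may be several values
    return {"mean": mean, "min": xs[0], "max": xs[-1],
            "range": xs[-1] - xs[0], "median": median, "mode": mode}

r = describe([3, 1, 4, 1, 5, 9, 2, 6])
print(r)
```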
For this section it is important to distinguish between $x_i$ and $X_i$ properly. If not stated otherwise, any capital letter denotes a random variable, while a small letter describes a realization of this random variable, i.e. what we have observed. $x_i$ is therefore a real number; $X_i$ is a function that assigns a real number to an event from the sample space.
Definition 0.2.2 (Estimator)
Let $X_1, \ldots, X_k$ be $k$ i.i.d. random variables with distribution $F_\theta$ with (unknown) parameter $\theta$.
A statistic $\hat{\Theta} = \hat{\Theta}(X_1, \ldots, X_k)$ used to estimate the value of $\theta$ is called an estimator of $\theta$.
$\hat{\theta} = \hat{\Theta}(x_1, \ldots, x_k)$ is called an estimate of $\theta$.
Desirable properties of estimators:
• Unbiasedness, i.e. the expected value of the estimator is the true parameter:
$E[\hat{\Theta}] = \theta$
• Efficiency: for two estimators $\hat{\Theta}_1$ and $\hat{\Theta}_2$ of the same parameter $\theta$, $\hat{\Theta}_1$ is said to be more efficient than $\hat{\Theta}_2$ if
$Var[\hat{\Theta}_1] < Var[\hat{\Theta}_2]$
• Consistency: for a larger sample size $n$, we want the estimate $\hat{\theta}$ to be closer to the true parameter $\theta$:
$\lim_{n \to \infty} P(|\hat{\Theta} - \theta| > \epsilon) = 0$
[Figure: dot plots illustrating the three properties. Unbiasedness: estimates from repeated samples scatter around the true value x rather than off to one side. Efficiency: estimator 1, whose estimates cluster tightly around x, is better than estimator 2 with a wider spread. Consistency: the same estimator produces a much tighter cluster around x for n = 10000 than for n = 100.]
Example 0.2.1
Let $X_1, \ldots, X_n$ be $n$ i.i.d. random variables with $E[X_i] = \mu$.
Then $\bar{X} = \frac{1}{n}\sum_{i=1}^n X_i$ is an unbiased estimator of $\mu$, because
$E[\bar{X}] = \frac{1}{n}\sum_{i=1}^n E[X_i] = \frac{1}{n}\cdot n\cdot\mu = \mu.$
OK - so once we have an estimator, we can decide whether it has these properties. But how do we find estimators in the first place?
0.2.1 Maximum Likelihood Estimation
Situation: We have $n$ data values $x_1, \ldots, x_n$. The assumption is that these data values are realizations of $n$ i.i.d. random variables $X_1, \ldots, X_n$ with distribution $F_\theta$. Unfortunately, the value of $\theta$ is unknown.
[Figure: observed values $x_1, x_2, x_3, \ldots$ on the x-axis, with the density $f_\theta$ drawn for $\theta = 0$, $\theta = -1.8$, and $\theta = 1$.]
By changing the value of $\theta$ we can "move the density function $f_\theta$ around" - in the diagram, the third density function fits the data best.
Principle: since we do not know the true value $\theta$ of the distribution, we take the value $\hat{\theta}$ that most likely produced the observed values, i.e. we maximize something like
$P(X_1 = x_1 \cap X_2 = x_2 \cap \ldots \cap X_n = x_n) = P(X_1 = x_1)\cdot P(X_2 = x_2)\cdot\ldots\cdot P(X_n = x_n) = \prod_{i=1}^n P(X_i = x_i)$  (*)
where the factorization uses that the $X_i$ are independent.
This is not quite the right way to write the probability if $X_1, \ldots, X_n$ are continuous variables. (Remember: $P(X = x) = 0$ for a continuous variable $X$; this is still valid.)
We use the above "probability" just as a plausibility argument. To get around the problem that $P(X = x) = 0$ for a continuous variable, we will write (*) as:
$\prod_{i=1}^n p_\theta(x_i)$ for discrete $X_i$, and $\prod_{i=1}^n f_\theta(x_i)$ for continuous $X_i$,
where $p_\theta$ is the probability mass function of the discrete $X_i$ (all $X_i$ have the same, since they are identically distributed) and $f_\theta$ is the density function of the continuous $X_i$.
Both these functions depend on $\theta$. In fact, we can read the above expressions as a function of $\theta$. This function, which we will denote by $L(\theta)$, is called the Likelihood function of $X_1, \ldots, X_n$.
The goal is now to find a value $\hat{\theta}$ that maximizes the Likelihood function. (This is what "moves" the density to the right spot, so that it fits the observed values well.)
How do we get a maximum of $L(\theta)$? By the usual way we maximize a function: differentiate it and set the derivative to zero! (After that, we ought to check with the second derivative whether we've actually found a maximum, but we won't do that unless we've found more than one candidate value for $\hat{\theta}$.)
Most of the time it is difficult to differentiate $L(\theta)$ directly - instead we use another trick and maximize $\log L(\theta)$, the Log-Likelihood function. Since the logarithm is strictly increasing, both functions have their maximum at the same $\hat{\theta}$.
Note: though its name is "log", we use the natural logarithm ln.
The plan to find an ML-estimator is:
1. Find the Likelihood function $L(\theta)$.
2. Take the natural log of the Likelihood function, $\log L(\theta)$.
3. Differentiate the log-Likelihood function with respect to $\theta$.
4. Set the derivative to zero.
5. Solve for $\theta$.
Example 0.2.2 Roll a Die
A die is rolled until its face shows a 6.
Repeating this experiment 100 times gave the following results:
[Figure: histogram of the number of rolls of a die until the first 6, over 100 runs.]
k:        1  2  3  4  5  6  7  8  9 11 14 15 16 17 20 21 27 29
# trials: 18 20  8  9  9  5  8  3  5  3  3  3  1  1  1  1  1  1
We know that $k$, the number of rolls until a 6 shows up, has a geometric distribution $Geo_p$. For a fair die, $p$ is 1/6.
The geometric distribution has probability mass function $p(k) = (1-p)^{k-1}\cdot p$.
What is the ML-estimate $\hat{p}$ for $p$?
1. Likelihood function $L(p)$:
Since we have observed 100 outcomes $k_1, \ldots, k_{100}$, the likelihood function is $L(p) = \prod_{i=1}^{100} p(k_i)$:
$L(p) = \prod_{i=1}^{100}(1-p)^{k_i-1}p = p^{100}\cdot\prod_{i=1}^{100}(1-p)^{k_i-1} = p^{100}\cdot(1-p)^{\sum_{i=1}^{100}(k_i-1)} = p^{100}\cdot(1-p)^{\sum_{i=1}^{100}k_i - 100}.$
2. Log of the Likelihood function, $\log L(p)$:
$\log L(p) = \log\left(p^{100}\cdot(1-p)^{\sum_{i=1}^{100}k_i-100}\right) = \log\left(p^{100}\right) + \log\left((1-p)^{\sum_{i=1}^{100}k_i-100}\right) = 100\log p + \left(\sum_{i=1}^{100}k_i - 100\right)\log(1-p).$
3. Differentiate the log-Likelihood with respect to $p$:
$\frac{d}{dp}\log L(p) = 100\frac{1}{p} + \left(\sum_{i=1}^{100}k_i - 100\right)\frac{-1}{1-p} = \frac{1}{p(1-p)}\left(100(1-p) - p\left(\sum_{i=1}^{100}k_i - 100\right)\right) = \frac{1}{p(1-p)}\left(100 - p\sum_{i=1}^{100}k_i\right).$
4. Set the derivative to zero. For the estimate $\hat{p}$ the derivative must be zero:
$\frac{d}{dp}\log L(\hat{p}) = 0 \iff \frac{1}{\hat{p}(1-\hat{p})}\left(100 - \hat{p}\sum_{i=1}^{100}k_i\right) = 0$
5. Solve for $\hat{p}$:
$\frac{1}{\hat{p}(1-\hat{p})}\left(100 - \hat{p}\sum_{i=1}^{100}k_i\right) = 0 \iff 100 - \hat{p}\sum_{i=1}^{100}k_i = 0 \iff \hat{p} = \frac{100}{\sum_{i=1}^{100}k_i}.$
In total, we have an estimate $\hat{p} = \frac{100}{568} = 0.1761.$
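We can sanity-check this estimate numerically: a grid search over the log-likelihood from step 2, using the counts from the table above, should peak at $\hat p = 100/568$. (This sketch is ours, not part of the original notes.)

```python
from math import log

# (k, number of runs) pairs from the table above
table = [(1, 18), (2, 20), (3, 8), (4, 9), (5, 9), (6, 5), (7, 8), (8, 3),
         (9, 5), (11, 3), (14, 3), (15, 3), (16, 1), (17, 1), (20, 1),
         (21, 1), (27, 1), (29, 1)]
n = sum(c for _, c in table)          # number of runs
total = sum(k * c for k, c in table)  # sum of the k_i

def log_likelihood(p):
    # log L(p) = n log p + (sum k_i - n) log(1 - p)
    return n * log(p) + (total - n) * log(1 - p)

# maximize over a fine grid of p values in (0, 1)
grid = [i / 10000 for i in range(1, 10000)]
p_hat = max(grid, key=log_likelihood)
print(n, total, p_hat)  # the maximizer should sit near 100/568
```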
Example 0.2.3 Red Cars in the Parking Lot
The values 3, 2, 3, 3, 4, 1, 4, 2, 4, 3 have been observed while counting the numbers of red cars pulling into parking lot #22 between 8:30 and 8:40 am, Mon to Fri, during two weeks.
The assumption is that these values are realizations of ten independent Poisson variables with (the same) rate $\lambda$.
What is the Maximum Likelihood estimate of $\lambda$?
The probability mass function of a Poisson distribution is $p_\lambda(x) = e^{-\lambda}\cdot\frac{\lambda^x}{x!}$.
We have ten values $x_i$; this gives the Likelihood function:
$L(\lambda) = \prod_{i=1}^{10} e^{-\lambda}\cdot\frac{\lambda^{x_i}}{x_i!} = e^{-10\lambda}\cdot\lambda^{\sum_{i=1}^{10}x_i}\cdot\prod_{i=1}^{10}\frac{1}{x_i!}$
The log-Likelihood then is
$\log L(\lambda) = -10\lambda + \ln(\lambda)\cdot\sum_{i=1}^{10}x_i - \sum_{i=1}^{10}\ln(x_i!).$
Differentiating the log-Likelihood with respect to $\lambda$ gives:
$\frac{d}{d\lambda}\log L(\lambda) = -10 + \frac{1}{\lambda}\cdot\sum_{i=1}^{10}x_i$
Setting it to zero:
$\frac{1}{\hat{\lambda}}\cdot\sum_{i=1}^{10}x_i = 10 \iff \hat{\lambda} = \frac{1}{10}\sum_{i=1}^{10}x_i \iff \hat{\lambda} = \frac{29}{10} = 2.9$
This gives us an estimate for $\lambda$ - and since $\lambda$ is also the expected value of the Poisson distribution, we can say that on average the number of red cars pulling into the parking lot each morning between 8:30 and 8:40 am is 2.9.
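The same pattern as in the die example applies here; the sketch below (ours, not from the notes) checks that the sample mean of the observed counts maximizes the Poisson log-likelihood.

```python
from math import log, lgamma

data = [3, 2, 3, 3, 4, 1, 4, 2, 4, 3]  # observed red-car counts

def log_likelihood(lam):
    # log L(lambda) = -n*lambda + ln(lambda) * sum(x_i) - sum(ln(x_i!))
    return (-len(data) * lam + log(lam) * sum(data)
            - sum(lgamma(x + 1) for x in data))  # lgamma(x+1) = ln(x!)

lam_hat = sum(data) / len(data)  # closed-form ML-estimate: the sample mean
print(lam_hat)                   # 29/10 = 2.9
```

The log-likelihood at `lam_hat` beats nearby values such as 2.8 or 3.0, as the closed-form derivation predicts.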
ML-estimators for $\mu$ and $\sigma^2$ of a Normal distribution
Let $X_1, \ldots, X_n$ be $n$ independent, identically distributed normal variables with $E[X_i] = \mu$ and $Var[X_i] = \sigma^2$. $\mu$ and $\sigma^2$ are unknown.
The normal density function $f_{\mu,\sigma^2}$ is
$f_{\mu,\sigma^2}(x) = \frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{(x-\mu)^2}{2\sigma^2}}$
Since we have $n$ independent variables, the Likelihood function is a product of $n$ densities:
$L(\mu,\sigma^2) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{(x_i-\mu)^2}{2\sigma^2}} = (2\pi\sigma^2)^{-n/2}\cdot e^{-\sum_{i=1}^n\frac{(x_i-\mu)^2}{2\sigma^2}}$
Log-Likelihood:
$\log L(\mu,\sigma^2) = -\frac{n}{2}\ln(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^n(x_i-\mu)^2$
Since we now have two parameters, $\mu$ and $\sigma^2$, we need two partial derivatives of the log-Likelihood:
$\frac{\partial}{\partial\mu}\log L(\mu,\sigma^2) = 0 - \frac{1}{2\sigma^2}\sum_{i=1}^n 2(x_i-\mu)\cdot(-1) = \frac{1}{\sigma^2}\sum_{i=1}^n(x_i-\mu)$
$\frac{\partial}{\partial\sigma^2}\log L(\mu,\sigma^2) = -\frac{n}{2}\frac{1}{\sigma^2} + \frac{1}{2(\sigma^2)^2}\sum_{i=1}^n(x_i-\mu)^2$
We now must find values for $\mu$ and $\sigma^2$ that make both derivatives zero at the same time.
Setting $\frac{\partial}{\partial\mu}\log L(\mu,\sigma^2) = 0$ gives
$\hat{\mu} = \frac{1}{n}\sum_{i=1}^n x_i,$
and plugging this value into the derivative for $\sigma^2$ and setting $\frac{\partial}{\partial\sigma^2}\log L(\hat{\mu},\sigma^2) = 0$ gives
$\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^n(x_i-\hat{\mu})^2$
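These closed forms are easy to confirm numerically. The sketch below (our own; the sample values are made up for illustration) checks that the formula values beat small perturbations in the normal log-likelihood.

```python
from math import log, pi

data = [4.2, 5.1, 3.9, 5.6, 4.8, 5.0, 4.4]  # made-up sample values

def log_likelihood(mu, var):
    """log L(mu, sigma^2) from the derivation above."""
    n = len(data)
    return (-0.5 * n * log(2 * pi * var)
            - sum((x - mu) ** 2 for x in data) / (2 * var))

n = len(data)
mu_hat = sum(data) / n
var_hat = sum((x - mu_hat) ** 2 for x in data) / n  # note: divides by n, not n-1

best = log_likelihood(mu_hat, var_hat)
print(mu_hat, var_hat, best)
```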
0.3 Confidence intervals
The previous section has provided a way to compute point estimates for parameters. Based on that, our next question is: how good is this point estimate? Or: how close is the estimate to the true value of the parameter?
Instead of just looking at the point estimate, we will now try to compute an interval around the estimated parameter value in which the true parameter is "likely" to fall. An interval like that is called a confidence interval.
Definition 0.3.1 (Confidence Interval)
Let $\hat{\theta}$ be an estimate of $\theta$.
If $P(|\hat{\theta} - \theta| < e) > \alpha$, we say that the interval $(\hat{\theta}-e, \hat{\theta}+e)$ is an $\alpha\cdot 100\%$ confidence interval of $\theta$ (cf. Figure 1).
Usually, $\alpha$ is a value near 1, such as 0.9, 0.95, 0.99, 0.999, etc.
Note:
• for any given set of values $x_1, \ldots, x_n$ the value of $\hat{\theta}$ is fixed, as is the interval $(\hat{\theta}-e, \hat{\theta}+e)$.
• The true value $\theta$ is either within the confidence interval or not.
[Figure: a density curve over $\mu$ with tails of probability $\le 1-\alpha$ shaded on each side; the interval $(\bar{x}-e, \bar{x}+e)$ is marked as the confidence interval for $\mu$, with $P(\mu - e < \bar{x} < \mu + e) > \alpha$.]
Figure 1: The probability that $\bar{x}$ falls into an $e$-interval around $\mu$ is $\alpha$. Vice versa, we know that for all of those $\bar{x}$, $\mu$ is within an $e$-interval around $\bar{x}$. That's the idea of a confidence interval.
A lot of people are tempted to reformulate the above probability to:
!!DON'T DO!!
$P(\hat{\theta} - e < \theta < \hat{\theta} + e) > \alpha$
Though it looks ok, it's not. Repeat: IT IS NOT OK.
$\theta$ is a fixed value - therefore, it does not have a probability of falling into some interval.
The only probability that we have here is
$P(\theta - e < \hat{\theta} < \theta + e) > \alpha,$
so we can say that $\hat{\theta}$ has a probability of at least $\alpha$ of falling into an $e$-interval around $\theta$. Unfortunately, that alone doesn't help, since we do not know $\theta$!
How do we compute confidence intervals, then? That is different for each estimator.
First, we look at estimates of the mean of a distribution:
0.3.1 Large sample C.I. for $\mu$
Situation: we have a large set of observed values ($n > 30$, usually).
The assumption is that these values are realizations of $n$ i.i.d. random variables $X_1, \ldots, X_n$ with $E[X_i] = \mu$ and $Var[X_i] = \sigma^2$.
We already know from the previous section that $\bar{X}$ is an unbiased estimator for $\mu$ (and, for normal data, the ML-estimator).
But we know more! The CLT tells us that in exactly this situation $\bar{X}$ is an approximately normally distributed random variable with $E[\bar{X}] = \mu$ and $Var[\bar{X}] = \frac{\sigma^2}{n}$.
We can therefore find the boundary $e$ by using the standard normal distribution. Remember: if $\bar{X} \sim N(\mu, \sigma^2/n)$, then $Z := \frac{\bar{X}-\mu}{\sigma/\sqrt{n}} \sim N(0,1) = \Phi$:
$P(|\bar{X}-\mu| \le e) \ge \alpha$  (use standardization)
$\iff P\left(\frac{|\bar{X}-\mu|}{\sigma/\sqrt{n}} \le \frac{e}{\sigma/\sqrt{n}}\right) \ge \alpha$
$\iff P\left(|Z| \le \frac{e}{\sigma/\sqrt{n}}\right) \ge \alpha$
$\iff P\left(-\frac{e}{\sigma/\sqrt{n}} < Z < \frac{e}{\sigma/\sqrt{n}}\right) \ge \alpha$
$\iff \Phi\left(\frac{e}{\sigma/\sqrt{n}}\right) - \Phi\left(-\frac{e}{\sigma/\sqrt{n}}\right) \ge \alpha$
$\iff \Phi\left(\frac{e}{\sigma/\sqrt{n}}\right) - \left(1 - \Phi\left(\frac{e}{\sigma/\sqrt{n}}\right)\right) \ge \alpha$
$\iff 2\Phi\left(\frac{e}{\sigma/\sqrt{n}}\right) - 1 \ge \alpha$
$\iff \Phi\left(\frac{e}{\sigma/\sqrt{n}}\right) \ge \frac{1+\alpha}{2}$
$\iff \frac{e}{\sigma/\sqrt{n}} \ge \Phi^{-1}\left(\frac{1+\alpha}{2}\right)$
$\iff e \ge \underbrace{\Phi^{-1}\left(\frac{1+\alpha}{2}\right)}_{:=z}\cdot\frac{\sigma}{\sqrt{n}}$
This computation gives an $\alpha\cdot 100\%$ confidence interval around $\mu$ as:
$\left(\bar{X} - z\cdot\frac{\sigma}{\sqrt{n}},\ \bar{X} + z\cdot\frac{\sigma}{\sqrt{n}}\right)$
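This interval is straightforward to compute in code. The sketch below (function name is ours) uses Python's `statistics.NormalDist` for $\Phi^{-1}$ and checks itself against the salary example that follows:

```python
from math import sqrt
from statistics import NormalDist

def mean_ci(xbar, sigma, n, alpha):
    """alpha*100% confidence interval for mu with known (or estimated) sigma."""
    z = NormalDist().inv_cdf((1 + alpha) / 2)  # z = Phi^{-1}((1 + alpha)/2)
    e = z * sigma / sqrt(n)
    return xbar - e, xbar + e

# salary example below: xbar = 21543, sigma = 3000, n = 100, alpha = 0.95
lo, hi = mean_ci(21543, 3000, 100, 0.95)
print(lo, hi)  # roughly 21543 -/+ 588
```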
Now we can do an example:
Example 0.3.1
Suppose we want to find a 95% confidence interval for the mean salary of an ISU employee.
A random sample of 100 ISU employees gives us a sample mean salary of $\bar{x} = \$21543$.
Suppose the standard deviation of salaries is known to be \$3000.
By using the above expression, we get a 95% confidence interval as:
$21543 \pm \Phi^{-1}\left(\frac{1+0.95}{2}\right)\cdot\frac{3000}{\sqrt{100}} = 21543 \pm \Phi^{-1}(0.975)\cdot 300$
How do we read $\Phi^{-1}(0.975)$ from the standard normal table? We look for the $z$ for which the probability $N_{(0,1)}(z) \ge 0.975$!
This gives us $z = 1.96$; the 95% confidence interval is then:
$21543 \pm 588,$
i.e. if we repeat this study 100 times (with 100 different employees each time), we can say: in 95 out of 100 studies, the true parameter $\mu$ falls into a \$588 range around $\bar{x}$.
Critical values for $z$, depending on $\alpha$, are:
$\alpha$    $z = \Phi^{-1}(\frac{1+\alpha}{2})$
0.90    1.65
0.95    1.96
0.98    2.33
0.99    2.58
Problem: Usually, we do not know $\sigma$.
Slight generalization: use $s = \sqrt{\frac{1}{n-1}\sum_{i=1}^n(X_i-\bar{X})^2}$ instead of $\sigma$!
An $\alpha\cdot 100\%$ confidence interval for $\mu$ is then given as
$\left(\bar{X} - z\cdot\frac{s}{\sqrt{n}},\ \bar{X} + z\cdot\frac{s}{\sqrt{n}}\right)$
where $z = \Phi^{-1}\left(\frac{1+\alpha}{2}\right)$.
Example 0.3.2
Suppose we want to analyze some complicated queueing system, for which we have no formulas and theory. We are interested in the mean queue length of the system after reaching steady state.
The only thing possible for us is to run simulations of this system and look at the queue length at some large time $t$, e.g. $t = 1000$ hrs.
After 50 simulations, we have got data:
$X_1$ = number in queue at time 1000 hrs in 1st simulation
$X_2$ = number in queue at time 1000 hrs in 2nd simulation
. . .
$X_{50}$ = number in queue at time 1000 hrs in 50th simulation
Our observations yield an average queue length of $\bar{x} = 21.5$ and $s = \sqrt{\frac{1}{n-1}\sum_{i=1}^n(x_i-\bar{x})^2} = 15$.
A 90% confidence interval is given as
$\left(\bar{x} - z\cdot\frac{s}{\sqrt{n}},\ \bar{x} + z\cdot\frac{s}{\sqrt{n}}\right) = \left(21.5 - 1.65\cdot\frac{15}{\sqrt{50}},\ 21.5 + 1.65\cdot\frac{15}{\sqrt{50}}\right) = (17.9998,\ 25.0002)$
Example 0.3.3
The graphs show a set of 80 experiments. The values from each experiment are shown in one of the green framed boxes. Each experiment consists of simulating 20 values from a standard normal distribution (these are drawn as the small blue lines). For each of the experiments, the average of the 20 values is computed (that's $\bar{x}$), as well as a confidence interval for $\mu$ - for parts a) and b) it's the 95% confidence interval, for part c) it is the 90% confidence interval, and for part d) it is the 99% confidence interval. The upper and lower confidence bounds, together with the sample mean, are drawn in red next to the sampled observations.
[Figure: four panels of simulated confidence intervals - a) 95% confidence intervals, b) 95% confidence intervals, c) 90% confidence intervals, d) 99% confidence intervals.]
There are several things to see from this diagram. First of all, in this example we know the "true" value of the parameter $\mu$ - since the observations are sampled from a standard normal distribution, $\mu = 0$. The true parameter is represented by the straight horizontal line through 0.
We see that each sample yields a different confidence interval, all of them centered around the sample mean. The different sizes of the intervals tell us another thing: in computing these confidence intervals, we had to use the estimate $s$ instead of the true standard deviation $\sigma = 1$, and each sample gave a slightly different standard deviation. Overall, though, the interval lengths are not very different between parts a) and b). The intervals in c) tend to be slightly smaller - these are 90% confidence intervals - whereas the intervals in part d) are on average larger than the first ones; they are 99% confidence intervals.
Almost all the confidence intervals contain 0 - but not all. And that is what we expect. For a 90% confidence interval we expect that in 10 out of 100 times the confidence interval does not contain the true parameter. When we check that, we see that in part c) 4 out of the 20 confidence intervals don't contain the true parameter $\mu$ - that's 20%, while on average we would expect 10% of the confidence intervals not to contain $\mu$.
Official use of Confidence Intervals:
On average, in 90 out of 100 times the 90% confidence interval of $\theta$ does contain the true value of $\theta$.
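This coverage statement can itself be checked by simulation, much like the 80-experiment diagram above (a sketch of ours, not from the notes): draw many samples from a known $N(0,1)$, build a 90% interval from each, and count how often the interval covers the true $\mu = 0$. With $s$ estimated and $n = 30$, the empirical coverage lands close to, and typically slightly below, the nominal 90%.

```python
import random
from math import sqrt
from statistics import NormalDist, mean, stdev

random.seed(2)
z = NormalDist().inv_cdf(0.95)  # z for a 90% two-sided interval
trials, covered = 2000, 0
for _ in range(trials):
    sample = [random.gauss(0, 1) for _ in range(30)]
    xbar, s = mean(sample), stdev(sample)
    e = z * s / sqrt(30)
    if xbar - e < 0 < xbar + e:  # does the interval cover the true mu = 0?
        covered += 1
coverage = covered / trials
print(coverage)  # should be close to 0.90
```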
0.3.2 Large sample confidence intervals for a proportion p
Let $p$ be a proportion of a large population, or a probability.
In order to get an estimate for this proportion, we can take a sample of $n$ individuals from the population and check each one of them as to whether or not they fulfill the criterion to be in the proportion of interest.
Mathematically, this corresponds to a Bernoulli-$n$-sequence, where we are only interested in the number of "successes", $X$, which in our case corresponds to the number of individuals that qualify for the interesting subgroup.
$X$ then has a Binomial distribution with parameters $n$ and $p$.
Now think: for a Binomial variable $X$, the expected value is $E[X] = n\cdot p$. Therefore we get an estimate $\hat{p}$ for $p$ as $\hat{p} = \frac{1}{n}X$.
Furthermore, we even have a distribution for $\hat{p}$ for large $n$: since $X$ is, by the CLT, approximately a normal variable with $E[X] = np$ and $Var[X] = np(1-p)$, we get that for large $n$, $\hat{p}$ is approximately normally distributed with $E[\hat{p}] = p$ and $Var[\hat{p}] = \frac{p(1-p)}{n}$.
By the way: this tells us that $\hat{p}$ is an unbiased estimator of $p$.
Equipped with the distribution of $\hat{p}$, we can set up an $\alpha\cdot 100\%$ confidence interval as:
$(\hat{p} - e,\ \hat{p} + e)$
where $e$ is some positive real number with:
$P(|\hat{p} - p| \le e) \ge \alpha$
We can derive the expression for $e$ in the same way as in the previous section and come up with:
$e = z\cdot\sqrt{\frac{p(1-p)}{n}}$
where $z = \Phi^{-1}\left(\frac{1+\alpha}{2}\right)$.
We again run into the problem that $e$ in this form is not ready for use, since we do not know the value of $p$. In this situation, we have different options: we can either replace $p(1-p)$ by the value that maximizes it, or we can substitute an appropriate estimate for $p$.
0.3.2.1 Conservative Method:
Replace $p(1-p)$ by something that is guaranteed to be at least as large: the function $p(1-p)$ has its maximum at $p = 0.5$, where $p(1-p) = 0.25$, so $\sqrt{p(1-p)/n} \le \sqrt{0.25/n} = \frac{1}{2\sqrt{n}}$.
The conservative $\alpha\cdot 100\%$ confidence interval for $p$ is
$\hat{p} \pm z\cdot\frac{1}{2\sqrt{n}}$
where $z = \Phi^{-1}\left(\frac{1+\alpha}{2}\right)$.
0.3.2.2 Substitution Method:
Substitute $\hat{p}$ for $p$. The $\alpha\cdot 100\%$ confidence interval for $p$ by substitution is
$\hat{p} \pm z\cdot\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$
where $z = \Phi^{-1}\left(\frac{1+\alpha}{2}\right)$.
What is the difference between the two methods?
• for large $n$ there is almost no difference at all
• if $\hat{p}$ is close to 0.5, there is also almost no difference
Besides that, conservative confidence intervals are (as the name says) larger than confidence intervals found by substitution. However, they are at the same time easier to compute.
Example 0.3.4 Complicated queueing system, continued
Suppose that now we are interested in the large-$t$ probability $p$ that a server is available.
Doing 100 simulations has shown that in 60 of them a server was available at time $t = 1000$ hrs.
What is a 95% confidence interval for this probability?
Since 60 out of 100 simulations showed a free server, we can use $\hat{p} = \frac{60}{100} = 0.6$ as an estimate for $p$.
For a 95% confidence interval, $z = \Phi^{-1}(0.975) = 1.96$.
The conservative confidence interval is:
$\hat{p} \pm z\cdot\frac{1}{2\sqrt{n}} = 0.6 \pm 1.96\cdot\frac{1}{2\sqrt{100}} = 0.6 \pm 0.098.$
For the confidence interval using substitution we get:
$\hat{p} \pm z\cdot\sqrt{\frac{\hat{p}(1-\hat{p})}{n}} = 0.6 \pm 1.96\cdot\sqrt{\frac{0.6\cdot 0.4}{100}} = 0.6 \pm 0.096.$
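Both margins are one-liners in code; the sketch below (the helper name is ours) reproduces the server-availability numbers:

```python
from math import sqrt

def prop_ci_margins(p_hat, n, z):
    """Return (conservative, substitution) margins e for a proportion CI."""
    conservative = z / (2 * sqrt(n))                 # z * 1/(2 sqrt(n))
    substitution = z * sqrt(p_hat * (1 - p_hat) / n) # z * sqrt(p_hat(1-p_hat)/n)
    return conservative, substitution

e_cons, e_subst = prop_ci_margins(0.6, 100, 1.96)  # server-availability example
print(e_cons, e_subst)  # roughly 0.098 and 0.096
```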
Example 0.3.5 Batting Average
In the 2002 season the baseball player Sammy Sosa had a batting average of 0.288. (The batting average is the ratio of the number of hits to the number of times at bat.) Sammy Sosa was at bat 555 times in the 2002 season.
Could the "true" batting average still be 0.300?
Compute a 95% confidence interval for the true batting average.
The conservative method gives:
$0.288 \pm 1.96\cdot\frac{1}{2\sqrt{555}} = 0.288 \pm 0.042$
The substitution method gives:
$0.288 \pm 1.96\cdot\sqrt{\frac{0.288(1-0.288)}{555}} = 0.288 \pm 0.038$
The substitution method gives a slightly smaller confidence interval, but both intervals contain 0.3. There is not enough evidence to allow the conclusion that the true average is not 0.3.
Confidence intervals give us a way to measure the precision we get from simulations intended to evaluate probabilities. But besides that, they also give us a way to plan how large a sample size has to be to reach a desired precision.
Example 0.3.6
Suppose we want to estimate the fraction of records in the 2000 IRS database that have a taxable income over \$35K.
We want to get a 98% confidence interval and wish to estimate the quantity to within 0.01.
This means that our boundary $e$ needs to be smaller than 0.01 (we choose a conservative confidence interval for ease of computation):
$e \le 0.01$
$\iff z\cdot\frac{1}{2\sqrt{n}} \le 0.01$   (for 98%, $z$ is 2.33)
$\iff 2.33\cdot\frac{1}{2\sqrt{n}} \le 0.01$
$\iff \sqrt{n} \ge \frac{2.33}{2\cdot 0.01} = 116.5$
$\Rightarrow n \ge 13573$
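The same sample-size calculation as a tiny helper (name is ours), which reproduces the IRS figure:

```python
from math import ceil

def conservative_n(e, z):
    """Smallest n with z * 1/(2*sqrt(n)) <= e (conservative method)."""
    return ceil((z / (2 * e)) ** 2)

n_needed = conservative_n(0.01, 2.33)  # the IRS example: 116.5^2 = 13572.25
print(n_needed)
```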
0.3.3 Related C.I. Methods
Related to the previous confidence intervals are confidence intervals for the difference between two means, $\mu_1 - \mu_2$, or the difference between two proportions, $p_1 - p_2$.
Confidence intervals for these differences are given as:
Large-$n$ confidence interval for $\mu_1 - \mu_2$ (based on independent $\bar{X}_1$ and $\bar{X}_2$):
$\bar{X}_1 - \bar{X}_2 \pm z\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$
Large-$n$ confidence interval for $p_1 - p_2$ (based on independent $\hat{p}_1$ and $\hat{p}_2$):
$\hat{p}_1 - \hat{p}_2 \pm z\,\frac{1}{2}\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}$ (conservative)
or $\hat{p}_1 - \hat{p}_2 \pm z\sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \frac{\hat{p}_2(1-\hat{p}_2)}{n_2}}$ (substitution)
Why? The argument in both cases is very similar - we will only discuss the confidence interval for the difference between means.
$\bar{X}_1 - \bar{X}_2$ is approximately normal, since $\bar{X}_1$ and $\bar{X}_2$ are approximately normal, with ($\bar{X}_1, \bar{X}_2$ independent)
$E[\bar{X}_1 - \bar{X}_2] = E[\bar{X}_1] - E[\bar{X}_2] = \mu_1 - \mu_2$
$Var[\bar{X}_1 - \bar{X}_2] = Var[\bar{X}_1] + (-1)^2 Var[\bar{X}_2] = \frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}$
16<br />
Then we can use the same arguments as before and get a C.I. for µ 1 − µ 2 as shown above.<br />
✷<br />
Example 0.3.7
Assume we have two parts of the IRS database: East Coast and West Coast. We want to compare the mean taxable income reported from the two regions in 2000.

                         East Coast     West Coast
# of sampled records:    n₁ = 1000      n₂ = 2000
mean taxable income:     ¯x₁ = $37000   ¯x₂ = $42000
standard deviation:      s₁ = $10100    s₂ = $15600

We can, for example, compute a two-sided 95% confidence interval for µ₁ − µ₂, the difference in mean taxable income as reported on 2000 tax returns between East and West Coast:

37000 − 42000 ± 1.96 · √(10100²/1000 + 15600²/2000) = −5000 ± 927

Note: this shows pretty conclusively that the mean West Coast taxable income is higher than the mean East Coast taxable income (in the 2000 reports). The interval contains only negative numbers; if it contained 0, the message wouldn't be so clear.
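The interval above can be reproduced with a small helper; a sketch under the large-n formula (the function name is my own):

```python
from math import sqrt

def two_mean_ci(xbar1, xbar2, s1, s2, n1, n2, z):
    """Large-n CI for mu1 - mu2: (xbar1 - xbar2) +/- z*sqrt(s1^2/n1 + s2^2/n2)."""
    center = xbar1 - xbar2
    half = z * sqrt(s1**2 / n1 + s2**2 / n2)
    return center - half, center + half

# The IRS example: z = 1.96 for a two-sided 95% interval.
lo, hi = two_mean_ci(37000, 42000, 10100, 15600, 1000, 2000, z=1.96)
print(round(lo), round(hi))
```

Both endpoints come out negative, matching the conclusion that the West Coast mean is higher.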
One-sided intervals
Idea: use only one of the two end points of ¯x ± z·s/√n.
This yields confidence intervals for µ of the form

(−∞, #)   (upper bound)     or     (#, ∞)   (lower bound)

However, we now need to adjust z to the new situation: instead of worrying about two tails of the normal distribution, a one-sided confidence interval uses only one tail.
[Figure 2: One-sided (upper-bounded) confidence interval for µ (in red).]
Example 0.3.8 complicated queueing system, continued
What is a 95% upper confidence bound for µ, the parameter for the length of the queue?
¯x + z·s/√n is the upper confidence bound. Instead of z = Φ⁻¹((α + 1)/2) we use z = Φ⁻¹(α) (see Fig. 1.2).
This gives: 21.5 + 1.65 · 15/√50 = 25.0 as the upper confidence bound. Therefore the one-sided, upper-bounded confidence interval is (−∞, 25.0).
Critical values z = Φ⁻¹(α) for the one-sided confidence interval are:

α      z = Φ⁻¹(α)
0.90   1.29
0.95   1.65
0.98   2.06
0.99   2.33
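These critical values can be reproduced from the standard normal quantile function in Python's standard library; note that a couple of the tabled entries are rounded up slightly (the exact values are Φ⁻¹(0.95) = 1.645 and Φ⁻¹(0.98) = 2.054):

```python
from statistics import NormalDist

# Exact one-sided critical values z = Phi^{-1}(alpha).
for alpha in (0.90, 0.95, 0.98, 0.99):
    print(alpha, round(NormalDist().inv_cdf(alpha), 3))
```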
Example 0.3.9
Two different digital communication systems each send 100 large messages, and we determine how many are corrupted in transmission: p̂₁ = 0.05 and p̂₂ = 0.10.
What's the difference in the corruption rates? Find a 98% confidence interval, using the substitution method:

0.05 − 0.10 ± 2.33 · √(0.05 · 0.95/100 + 0.10 · 0.90/100) = −0.05 ± 0.086

This calculation tells us that, based on these sample sizes, we don't even have a solid idea about the sign of p₁ − p₂, i.e. we can't tell which of the pᵢ is larger.
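A quick sketch of the substitution-method interval for two proportions (the function name is my own):

```python
from math import sqrt

def two_prop_ci(p1, p2, n1, n2, z):
    # substitution method: p1 - p2 +/- z*sqrt(p1(1-p1)/n1 + p2(1-p2)/n2)
    half = z * sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return p1 - p2 - half, p1 - p2 + half

# The two communication systems: 98% interval, z = 2.33.
lo, hi = two_prop_ci(0.05, 0.10, 100, 100, z=2.33)
print(round(lo, 3), round(hi, 3))
```

The interval straddles 0, which is exactly why the sign of p₁ − p₂ stays unresolved.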
So far we have only considered large-sample confidence intervals. The problem with smaller sample sizes is that the normal approximation from the CLT doesn't work well when the standard deviation σ is unknown. What you need to know is that there exist different methods to compute confidence intervals for smaller sample sizes.
0.4 Hypothesis Testing
Example 0.4.1 Tea-Tasting Lady
It is claimed that a certain lady is able to tell, by tasting a cup of tea with milk, whether the milk was put in first or the tea was put in first.
To put the claim to the test, the lady is given 10 cups of tea to taste and is asked to state in each case whether the milk went in first or the tea went in first. To guard against deliberate or accidental communication of information, before pouring each cup of tea a coin is tossed to decide whether the milk goes in first or the tea goes in first. The person who brings the cup of tea to the lady does not know the outcome of the coin toss.
Either the lady has some skill (she can tell the difference to some extent) or she has not, in which case she is simply guessing.
Suppose the lady tasted 10 cups of tea in this manner and got 9 of them right. This looks rather suspicious; the lady seems to have some skill. But how can we check it?
We start with the sceptical assumption that the lady does not have any skill. If the lady has no skill at all, the probability that she gives a correct answer for any single cup of tea is 1/2. The number of cups she gets right therefore has a Binomial distribution with parameters n = 10 and p = 0.5. The diagram shows the probability mass function of this distribution:
[Figure: probability mass function p(x) of the B(10, 0.5) distribution, with the observed value x marked.]
Events that are as unlikely or less likely are that the lady got all 10 cups right or, very different but nevertheless just as rare, that she got only 1 cup or none right (note, this would be evidence of some "anti-skill", but it would certainly be evidence against her guessing).
The total probability of these events is (remember, the Binomial probability mass function is p(x) = (n choose x) pˣ(1 − p)ⁿ⁻ˣ):

p(0) + p(1) + p(9) + p(10) = 0.5¹⁰ + 10 · 0.5¹⁰ + 10 · 0.5¹⁰ + 0.5¹⁰ = 0.021

i.e. what we have just observed is a fairly rare event under the assumption that the lady is only guessing. This suggests that the lady may have some skill in detecting which was poured first into the cup.
Jargon: 0.021 is called the p-value for testing the hypothesis p = 0.5. The fact that the p-value is small is evidence against the hypothesis.
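The p-value computation above is a one-liner with the Binomial probability mass function:

```python
from math import comb

# Probability, under the guessing hypothesis p = 0.5, of a result at least
# as extreme as 9 of 10 correct: x in {0, 1, 9, 10}.
p_value = sum(comb(10, x) * 0.5**10 for x in (0, 1, 9, 10))
print(round(p_value, 3))  # 0.021
```

The exact value is 22/1024 ≈ 0.0215, which rounds to the 0.021 quoted above.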
Hypothesis testing is a formal procedure to check whether or not some previously made assumption can be rejected based on the data. We are going to abstract the main elements of the previous example into a standard series of steps for hypothesis testing:
Example 0.4.2
University CC administrators have historical records indicating that between August and October 2002 the mean time between hits on the ISU homepage was 2 min. They suspect that the mean time between hits has since decreased (i.e. traffic is up). Sampling 50 inter-arrival times from the records for November 2002 gives ¯x = 1.7 min and s = 1.9 min.
Is this strong evidence for an increase in traffic?
Formal Procedure (with application to the example)
1. State a "null hypothesis" of the form H₀: function of parameter(s) = #, meant to embody a status quo / pre-data view.
   Here: H₀: µ = 2.0 min between hits.
2. State an "alternative hypothesis" of the form Hₐ: function of parameter(s) >, ≠, or < #, meant to identify departure from H₀.
   Here: Hₐ: µ < 2 (traffic is up).
3. State test criteria, consisting of a test statistic, a "reference distribution" giving the behavior of the test statistic if H₀ is true, and the kinds of values of the test statistic that count as evidence against H₀.
   Here: the test statistic will be Z = (¯X − 2.0)/(s/√n); the reference distribution will be standard normal; large negative values of Z count as evidence against H₀ in favor of Hₐ.
4. Show computations.
   Here: the sample gives z = (1.7 − 2.0)/(1.9/√50) = −1.12.
5. Report and interpret a p-value, the "observed level of significance with which H₀ can be rejected". This is the probability of an observed value of the test statistic at least as extreme as the one at hand; the smaller this value is, the less likely it is that H₀ is true.
   Here: the p-value is P(Z ≤ −1.12) = Φ(−1.12) = 0.1314. This value is not terribly small; the evidence of a decrease in mean time between hits is somewhat weak.
Note aside: a 90% confidence interval for µ is ¯x ± 1.65·s/√n = 1.7 ± 0.44. This interval contains the hypothesized value µ = 2.0.
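Steps 3 to 5 of the example can be carried out directly in Python, which avoids the small rounding error from using z = −1.12 (the exact z is about −1.117, giving a p-value of roughly 0.132 instead of 0.1314):

```python
from math import sqrt
from statistics import NormalDist

# Homepage-traffic example: one-sided z-test of H0: mu = 2.0 vs Ha: mu < 2.0.
xbar, s, n, mu0 = 1.7, 1.9, 50, 2.0
z = (xbar - mu0) / (s / sqrt(n))   # test statistic
p_value = NormalDist().cdf(z)      # left-tail probability P(Z <= z)
print(round(z, 2), round(p_value, 3))
```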
There are four basic hypothesis tests of this form, testing a mean, a proportion, or the difference between two means or two proportions. Depending on the hypothesis, the test statistic is different. Here's an overview of the tests we are going to use:

Hypothesis          Statistic                                           Reference Distribution
H₀: µ = #           Z = (¯X − #)/(s/√n)                                 Z is standard normal
H₀: p = #           Z = (p̂ − #)/√(#(1 − #)/n)                           Z is standard normal
H₀: µ₁ − µ₂ = #     Z = (¯X₁ − ¯X₂ − #)/√(s₁²/n₁ + s₂²/n₂)              Z is standard normal
H₀: p₁ − p₂ = #     Z = (p̂₁ − p̂₂ − #)/(√(p̂(1 − p̂)) · √(1/n₁ + 1/n₂)),   Z is standard normal
                    where p̂ = (n₁p̂₁ + n₂p̂₂)/(n₁ + n₂)
Example 0.4.3 tax fraud
Historically, IRS taxpayer compliance audits have revealed that about 5% of individuals do things on their tax returns that invite criminal prosecution.
A sample of n = 1000 tax returns produces p̂ = 0.061 as an estimate of the fraction of fraudulent returns. Does this provide a clear signal of change in taxpayer behavior?
1. State null hypothesis: H₀: p = 0.05
2. Alternative hypothesis: Hₐ: p ≠ 0.05
3. Test statistic:

   Z = (p̂ − 0.05)/√(0.05 · 0.95/n)

   Under the null hypothesis, Z has a standard normal distribution; large values of Z, positive or negative, will count as evidence against H₀.
4. Computation: z = (0.061 − 0.05)/√(0.05 · 0.95/1000) = 1.59
5. p-value: P(|Z| ≥ 1.59) = P(Z ≤ −1.59) + P(Z ≥ 1.59) = 0.11. This is not a very small value; we therefore have only very weak evidence against H₀.
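The same test in code (values match the hand computation up to rounding):

```python
from math import sqrt
from statistics import NormalDist

# Tax-fraud example: two-sided z-test for a single proportion.
phat, p0, n = 0.061, 0.05, 1000
z = (phat - p0) / sqrt(p0 * (1 - p0) / n)
p_value = 2 * (1 - NormalDist().cdf(abs(z)))   # both tails
print(round(z, 2), round(p_value, 2))
```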
Example 0.4.4 lifetime of disk drives
n₁ = 30 and n₂ = 40 disk drives of two different designs were tested under conditions of "accelerated" stress and the times to failure recorded:

Standard Design   New Design
n₁ = 30           n₂ = 40
¯x₁ = 1205 hr     ¯x₂ = 1400 hr
s₁ = 1000 hr      s₂ = 900 hr

Does this provide conclusive evidence that the new design has a larger mean time to failure under "accelerated" stress conditions?
1. State null hypothesis: H₀: µ₁ = µ₂ (µ₁ − µ₂ = 0)
2. Alternative hypothesis: Hₐ: µ₁ < µ₂ (µ₁ − µ₂ < 0)
3. Test statistic:

   Z = (¯x₁ − ¯x₂ − 0)/√(s₁²/n₁ + s₂²/n₂)

   Under the null hypothesis, Z has a standard normal distribution; we will consider large negative values of Z as evidence against H₀.
4. Computation: z = (1205 − 1400 − 0)/√(1000²/30 + 900²/40) = −0.84
5. p-value: P(Z < −0.84) = 0.2005. This is not a very small value; we therefore have only very weak evidence against H₀.
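The disk-drive computation, done directly:

```python
from math import sqrt
from statistics import NormalDist

# Disk-drive example: one-sided z-test for the difference of two means.
x1, s1, n1 = 1205, 1000, 30
x2, s2, n2 = 1400, 900, 40
z = (x1 - x2) / sqrt(s1**2 / n1 + s2**2 / n2)
p_value = NormalDist().cdf(z)   # Ha: mu1 < mu2, so use the left tail
print(round(z, 2), round(p_value, 2))
```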
Example 0.4.5 queueing systems
We have two very complicated queueing systems and would like to know whether there is a difference in the large-t probabilities of there being an available server. We run simulations for each system and check whether at time t = 2000 there is a server available:

System 1          System 2
n₁ = 1000 runs    n₂ = 500 runs (each with a different random seed)
server available at t = 2000:
p̂₁ = 551/1000     p̂₂ = 303/500

How strong is the evidence of a difference between the t = 2000 availability of a server for the two systems?
1. State null hypothesis: H₀: p₁ = p₂ (p₁ − p₂ = 0)
2. Alternative hypothesis: Hₐ: p₁ ≠ p₂ (p₁ − p₂ ≠ 0)
3. Preliminary: note that, if there were no difference between the two systems, a plausible estimate of the availability of a server would be

   p̂ = (n₁p̂₁ + n₂p̂₂)/(n₁ + n₂) = (551 + 303)/(1000 + 500) = 0.569

   A test statistic is:

   Z = (p̂₁ − p̂₂ − 0)/(√(p̂(1 − p̂)) · √(1/n₁ + 1/n₂))

   Under the null hypothesis, Z has a standard normal distribution; we will consider large absolute values of Z as evidence against H₀.
4. Computation: z = (0.551 − 0.606)/(√(0.569 · (1 − 0.569)) · √(1/1000 + 1/500)) = −2.03
5. p-value: P(|Z| > 2.03) = 0.04. This is fairly strong evidence of a real difference in the t = 2000 availability of a server between the two systems.
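The pooled two-proportion test in code, working from the raw counts:

```python
from math import sqrt
from statistics import NormalDist

# Queueing example: pooled two-sided z-test for p1 - p2 = 0.
x1, n1 = 551, 1000
x2, n2 = 303, 500
p1, p2 = x1 / n1, x2 / n2
pooled = (x1 + x2) / (n1 + n2)   # common estimate of p under H0
z = (p1 - p2) / (sqrt(pooled * (1 - pooled)) * sqrt(1 / n1 + 1 / n2))
p_value = 2 * (1 - NormalDist().cdf(abs(z)))
print(round(z, 2), round(p_value, 2))
```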
0.5 Goodness of Fit Tests
The basic situation is still the same as in the previous section: we have n realizations x₁, ..., xₙ (observed data) of independent, identically distributed random variables X₁, ..., Xₙ.
A goodness-of-fit test is different from the previous ones. Here we don't test a single parameter; we test the whole distribution underlying our observations. Basically, the null hypothesis will be H₀: the data follow a specified distribution F vs. Hₐ: the data do not follow the specified distribution.
For this problem there are different approaches, depending on whether the specified distribution is continuous or discrete. For simplicity, we will only consider the case of a finite discrete distribution, i.e. we are dealing with a finite sample space Ω = {1, ..., k} on which we have a probability mass function p.
The above null hypothesis then becomes

H₀: p_X(i) = p(i) for i = 1, ..., k
vs.
Hₐ: p_X(i) ≠ p(i) for at least one i ∈ {1, ..., k}
Example 0.5.1 M&Ms
On the web page of the M&M/Mars company, www.m-ms.com, the percentages of each color in a bag of peanut M&Ms are given. These percentages form a probability mass function for the colors in an M&Ms bag.
A count for two different bags gave the following numbers for each color:

bag          brown  yellow  red  blue  orange  green  sum
129 GM 12      43     28    26    25     33     24    179
129 GM 22      40     38    36    20     24     16    174

How do we check whether these numbers come from the distribution as given on the web site?
To get an answer to that question, we need to think about what kind of results we expected to get.
Think: for each color we have a certain probability p_color of drawing an M&M of this color out of the bag, vs. 1 − p_color for a different color. We can therefore think of the number of M&Ms of each color as a random variable with a Binomial distribution; N_br, the number of brown M&Ms, has parameters n and p_br.
Under the null hypothesis

H₀: p_br = p_r = p_ye = p_bl = 0.2, p_or = p_gr = 0.1
vs.
Hₐ: one of the p_i is different from the above specification,

N_br has a B(n, 0.2) distribution. For the first bag we therefore expect N_br to be 0.2 · 179 = 35.8. In the same manner we can compute the expected values for all the other colors in each bag:

bag          brown  yellow  red   blue  orange  green
129 GM 12     35.8   35.8   35.8  35.8   17.9   17.9
129 GM 22     34.8   34.8   34.8  34.8   17.4   17.4
Now we need a test statistic that measures the difference between what we have observed and what we have expected. As a test statistic, we will use

Q = Σⱼ₌₁ᵏ (obsⱼ − expⱼ)²/expⱼ,

where
obsⱼ: the number of times j is observed among the xᵢ, i = 1, ..., n, and
expⱼ: the expected number of js = n · p(j)
Theorem 0.5.1
The test statistic Q, defined as above, has (for large n, approximately) a χ² distribution with k − 1 degrees of freedom.
In order to be able to use this, we obviously need some more information about the χ² distribution.
The χ² distribution
Given a set of independent standard normal random variables Z₁, ..., Z_r, the distribution of their sum of squares

X := Σᵢ₌₁ʳ Zᵢ²

is called the χ² distribution with r degrees of freedom.
The density function itself is a bit complicated (it's a special case of a Gamma distribution); all we need to know about the distribution at this stage is

E[X] = r,   Var[X] = 2r,

and that, roughly, P(X ≥ 2(r + 1)) ≤ 0.05. For large r this probability is far smaller than 0.05.
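These facts are easy to check by simulation, using only the standard library: generate sums of r squared standard normals and compare the sample mean and tail frequency with the claims above.

```python
import random

# Monte-Carlo check of the chi-square facts: with r = 5 degrees of freedom,
# the mean of a sum of r squared standard normals should be close to r,
# and the frequency of values >= 2(r + 1) should stay below 0.05.
random.seed(1)
r, trials = 5, 100_000
samples = [sum(random.gauss(0, 1) ** 2 for _ in range(r)) for _ in range(trials)]
mean = sum(samples) / trials
tail = sum(x >= 2 * (r + 1) for x in samples) / trials
print(round(mean, 2), round(tail, 3))
```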
Why does the test statistic Q have a χ² distribution with k − 1 degrees of freedom?
This is difficult to prove, but it is at least plausible: the parts from which Q is put together look almost like squared standard normal variables. Since obsⱼ has a Binomial distribution, for large n we may approximate its distribution by N(np(j), np(j)(1 − p(j))). A standardization of obsⱼ would therefore look like:

(obsⱼ − np(j))/√(np(j)(1 − p(j))) = (1/√(1 − p(j))) · (obsⱼ − expⱼ)/√expⱼ.

The degrees of freedom are reduced by one because the random variables are dependent: once we know the counts of five colors in the bag, we get the sixth by subtracting the other counts from the total number of M&Ms in the bag.
A more formal reason for the degrees of freedom is given by computing the expected value of Q, using E[X²] = Var[X] + (E[X])²:

E[Q] = Σⱼ₌₁ᵏ (1/(np(j))) E[(obsⱼ − np(j))²]
     = Σⱼ₌₁ᵏ (1/(np(j))) (Var[obsⱼ − np(j)] + (E[obsⱼ − np(j)])²)     (where Var[obsⱼ − np(j)] = Var[obsⱼ] and E[obsⱼ − np(j)] = 0)
     = Σⱼ₌₁ᵏ (1/(np(j))) · np(j)(1 − p(j)) = Σⱼ₌₁ᵏ (1 − p(j))
     = k − Σⱼ₌₁ᵏ p(j) = k − 1.
Now that we've defined a reference distribution for the test, we need to identify which values count as evidence against H₀. Since we've squared the differences between expected and observed values, only large positive values of Q can count as evidence against H₀.
Now let's get back to our example:
Example 0.5.2 M&Ms, continued
The value of Q with the above null hypothesis about the color distribution is 23.91 for the first bag (129 GM 12) and 10.02 for the second bag (129 GM 22). The p-values for these results are 0.00023 and 0.075, respectively.
For the first bag it's highly unlikely that the M&Ms have the color distribution as posted on the web site; for the second bag, however, we can't quite reject the null hypothesis with the same vigor. The p-value is still quite small, though.
Maybe there is something wrong with the filling routine at M&M's ...
We can also look at which colors contribute most to the Q statistic. The signed, unsquared contributions (obsⱼ − expⱼ)/√expⱼ are called the residuals:

bag          brown  yellow  red    blue   orange  green
129 GM 12     1.20  −1.30   −1.64  −1.81   3.57    1.44
129 GM 22     0.88   0.54    0.20  −2.51   1.58   −0.34

The largest residuals in each bag are too many orange M&Ms in the first, too few blue in the second.
If we had combined the results from the two bags, the result for Q would have been even more extreme:

color        brown  yellow  red    blue   orange  green  sum
# in bags      83     66     62     45      57     40    353
expected     70.6   70.6    70.6   70.6    35.3   35.3   353
residuals    1.48  −0.55   −1.02  −3.05    3.65   0.79   Q = 26.77

Here the two largest residuals are again for the blue and orange M&Ms. It looks as though the missing blue M&Ms have been replaced by orange ones.
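The Q values and residuals quoted in this example can be reproduced with a short helper (the function name is my own):

```python
from math import sqrt

# Chi-square goodness-of-fit statistic Q and residuals for category counts.
def gof(observed, probs):
    n = sum(observed)
    expected = [n * p for p in probs]
    residuals = [(o - e) / sqrt(e) for o, e in zip(observed, expected)]
    return sum(r * r for r in residuals), residuals

# brown, yellow, red, blue, orange, green
probs = [0.2, 0.2, 0.2, 0.2, 0.1, 0.1]
q1, _ = gof([43, 28, 26, 25, 33, 24], probs)    # bag 129 GM 12
q2, _ = gof([40, 38, 36, 20, 24, 16], probs)    # bag 129 GM 22
qc, rc = gof([83, 66, 62, 45, 57, 40], probs)   # both bags combined
print(round(q1, 2), round(q2, 2), round(qc, 2))
```

Comparing each Q against a χ² distribution with k − 1 = 5 degrees of freedom then gives the p-values quoted above.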