The normal distribution is extremely common and useful for one main reason: the normal distribution approximates a lot of other distributions. This is the result of one of the most fundamental theorems in mathematics:

0.1 Central Limit Theorem (CLT)
Theorem 0.1.1 (Central Limit Theorem)
If $X_1, X_2, \ldots, X_n$ are $n$ independent, identically distributed random variables with $E[X_i] = \mu$ and $Var[X_i] = \sigma^2$, then the sample mean $\bar{X} := \frac{1}{n}\sum_{i=1}^n X_i$ is approximately normally distributed with $E[\bar{X}] = \mu$ and $Var[\bar{X}] = \sigma^2/n$, i.e.
$\bar{X} \sim N(\mu, \frac{\sigma^2}{n})$ or $\sum_i X_i \sim N(n\mu, n\sigma^2)$.
Corollary 0.1.2
(a) for large $n$ the binomial distribution $B_{n,p}$ is approximately normal $N_{np,\,np(1-p)}$.
(b) for large $\lambda$ the Poisson distribution $Po_\lambda$ is approximately normal $N_{\lambda,\lambda}$.
(c) for large $k$ the Erlang distribution $Erlang_{k,\lambda}$ is approximately normal $N_{k/\lambda,\,k/\lambda^2}$.
Why?
(a) Let $X$ be a variable with a $B_{n,p}$ distribution.
We know that $X$ is the result of repeating the same Bernoulli experiment $n$ times and counting the overall number of successes. We can therefore write $X$ as the sum of $n$ $B_{1,p}$ variables $X_i$:
$X := X_1 + X_2 + \ldots + X_n$
$X$ is then the sum of $n$ independent, identically distributed random variables. The Central Limit Theorem then states that $X$ has an approximate normal distribution with $E[X] = nE[X_i] = np$ and $Var[X] = nVar[X_i] = np(1-p)$.
(b) It is enough to show the statement for the case that $\lambda$ is a large integer:
Let $Y$ be a Poisson variable with rate $\lambda$. Then we can think of $Y$ as the number of occurrences in an experiment that runs for time $\lambda$ - that is the same as observing $\lambda$ experiments that each run independently for time 1 and adding their results:
$Y = Y_1 + Y_2 + \ldots + Y_\lambda$, with $Y_i \sim Po_1$.
Again, $Y$ is the sum of independent, identically distributed random variables. The Central Limit Theorem then states that $Y$ has an approximate normal distribution with $E[Y] = \lambda \cdot 1 = \lambda$ and $Var[Y] = \lambda\, Var[Y_i] = \lambda$.
(c) This statement is the easiest to prove, since an $Erlang_{k,\lambda}$ distributed variable $Z$ is by definition the sum of $k$ independent, identically distributed exponential variables $Z_1, \ldots, Z_k$ with rate $\lambda$.
For $Z$ the CLT holds, and we get that $Z$ is approximately normally distributed with $E[Z] = kE[Z_i] = k/\lambda$ and $Var[Z] = k\,Var[Z_i] = k/\lambda^2$.
Why do we need the central limit theorem at all? First of all, the CLT gives us the distribution of the sample mean in a very general setting: the only thing we need to know is that all observed values come from the same distribution and that the variance of this distribution is finite.
A second reason is that most tables only contain probabilities up to a certain limit - the Poisson table, e.g., only has values for $\lambda \le 10$, and the Binomial distribution is tabled only for $n \le 20$. Beyond that, we can use the normal approximation to get probabilities.
Example 0.1.1 Hits on a webpage
Hits occur with a rate of 2 per min.
What is the probability to wait more than 20 min for the 50th hit?
Let $Y$ be the waiting time until the 50th hit.
We know: $Y$ has an $Erlang_{50,2}$ distribution. Therefore:
$P(Y > 20) = 1 - Erlang_{50,2}(20) = 1 - (1 - Po_{2\cdot 20}(50-1)) = Po_{40}(49) \overset{CLT}{\approx} N_{40,40}(49) = \Phi\left(\frac{49-40}{\sqrt{40}}\right) = \Phi(1.42) \overset{table}{=} 0.9222.$
✷
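As a sanity check of the approximation step $Po_{40}(49) \approx N_{40,40}(49)$, here is a small Python sketch (the helper names are our own, not from the notes) comparing the exact Poisson CDF with its normal approximation:

```python
from math import exp, erf, sqrt

def poisson_cdf(k, lam):
    """P(X <= k) for X ~ Poisson(lam), summing the mass function directly."""
    term, total = exp(-lam), 0.0
    for i in range(k + 1):
        total += term          # term is P(X = i)
        term *= lam / (i + 1)  # recurrence P(X = i+1) = P(X = i) * lam/(i+1)
    return total

def normal_cdf(x, mu, var):
    """Phi((x - mu)/sqrt(var)) via the error function."""
    return 0.5 * (1 + erf((x - mu) / sqrt(2 * var)))

exact = poisson_cdf(49, 40)      # Po_40(49)
approx = normal_cdf(49, 40, 40)  # CLT approximation N(40, 40)
print(exact, approx)             # the two values should be close
```

The normal value reproduces the table lookup $\Phi(1.42) \approx 0.9222$; the exact Poisson CDF is slightly larger because the Poisson distribution is a little skewed.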
Example 0.1.2 Mean of Uniform Variables
Let $U_1, U_2, U_3, U_4$, and $U_5$ be standard uniform variables, i.e. $U_i \sim U_{(0,1)}$.
Without the CLT we would have no idea what distribution the sample mean $\bar{U} = \frac{1}{5}\sum_{i=1}^5 U_i$ has!
With it, we know: $\bar{U} \overset{approx}{\sim} N(0.5, \frac{1}{60})$.
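A quick simulation (a sketch of ours, not part of the original notes) makes the claim $\bar U \approx N(0.5, 1/60)$ plausible: the mean and variance of simulated sample means should land near $0.5$ and $1/60 \approx 0.0167$.

```python
import random
from statistics import fmean, pvariance

random.seed(1)  # reproducible run
# draw 200,000 sample means of 5 standard uniforms each
means = [fmean(random.random() for _ in range(5)) for _ in range(200_000)]

m = fmean(means)      # should be close to 0.5
v = pvariance(means)  # should be close to 1/60
print(m, v)
```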
Issue: the accuracy of the approximation
• increases with $n$
• increases with the amount of symmetry in the distribution of the $X_i$
Rule of thumb for the Binomial distribution:
Use the normal approximation for $B_{n,p}$ if $np > 5$ (if $p \le 0.5$) or $n(1-p) > 5$ (if $p \ge 0.5$)!
From now on, we will use probability theory only to find answers to the questions arising from specific problems we are working on.
In this chapter we want to draw inferences about some characteristic of an underlying population - e.g. the average height of a person. Instead of measuring this characteristic for each individual, we will draw a sample, i.e. choose a "suitable" subset of the population and measure the characteristic only for those individuals. Using some probabilistic arguments we can then extend the information we got from that sample and make an estimate of the characteristic for the whole population. Probability theory will give us the means to find those estimates and to measure how "probable" our estimates are.
Of course, choosing the sample is crucial. We will demand two properties from a sample:
• the sample should be representative - taking only basketball players into the sample would change our estimate of a person's height drastically.
• for a large sample, the estimate should come close to the "true" value of the characteristic.
The three main areas of statistics are
• estimation of parameters:
point or interval estimates: "my best guess for value x is . . . ", "my guess is that value x is in interval (a, b)"
• evaluation of the plausibility of values: hypothesis testing
• prediction of future (individual) values
0.2 Parameter Estimation
Statistics are all around us - scores in sports, prices at the grocer's, weather reports (and how often they turn out to be close to the actual weather), taxes, evaluations . . .
The most basic form of statistics is descriptive statistics.
But what exactly is a statistic? Here is the formal definition:
Definition 0.2.1 (Statistic)
Any function $W(x_1, \ldots, x_k)$ of observed values $x_1, \ldots, x_k$ is called a statistic.
Some statistics you already know are:
Mean (Average): $\bar{X} = \frac{1}{n}\sum_i X_i$
Minimum: $X_{(1)}$ - parentheses indicate that the values are sorted
Maximum: $X_{(n)}$
Range: $X_{(n)} - X_{(1)}$
Mode: value(s) that appear(s) most often
Median: "middle value" - the value for which one half of the data is larger and the other half is smaller. If $n$ is odd, the median is $X_{((n+1)/2)}$; if $n$ is even, the median is the average of the two middle values: $0.5 \cdot X_{(n/2)} + 0.5 \cdot X_{(n/2+1)}$.
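All of these statistics are easy to compute directly; a small Python sketch (the function name `describe` is our own invention):

```python
def describe(data):
    """Compute the basic statistics listed above for a list of numbers."""
    xs = sorted(data)  # X_(1), ..., X_(n)
    n = len(xs)
    mean = sum(xs) / n
    if n % 2 == 1:
        median = xs[n // 2]  # X_((n+1)/2) in 0-based indexing
    else:
        median = 0.5 * (xs[n // 2 - 1] + xs[n // 2])  # average of the middle two
    counts = {x: data.count(x) for x in set(data)}
    top = max(counts.values())
    mode = sorted(x for x, c in counts.items() if c == top)  # may be several values
    return {"mean": mean, "min": xs[0], "max": xs[-1],
            "range": xs[-1] - xs[0], "median": median, "mode": mode}

r = describe([3, 1, 4, 1, 5, 9, 2, 6])
print(r)
```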
For this section it is important to distinguish between $x_i$ and $X_i$ properly. If not stated otherwise, any capital letter denotes a random variable, while a small letter describes a realization of this random variable, i.e. what we have observed. $x_i$ is therefore a real number; $X_i$ is a function that assigns a real number to an event from the sample space.
Definition 0.2.2 (Estimator)
Let $X_1, \ldots, X_k$ be $k$ i.i.d. random variables with distribution $F_\theta$ with (unknown) parameter $\theta$.
A statistic $\hat{\Theta} = \hat{\Theta}(X_1, \ldots, X_k)$ used to estimate the value of $\theta$ is called an estimator of $\theta$.
$\hat{\theta} = \hat{\Theta}(x_1, \ldots, x_k)$ is called an estimate of $\theta$.
Desirable properties of estimators:
• Unbiasedness, i.e. the expected value of the estimator is the true parameter:
$E[\hat{\Theta}] = \theta$
• Efficiency: for two estimators $\hat{\Theta}_1$ and $\hat{\Theta}_2$ of the same parameter $\theta$, $\hat{\Theta}_1$ is said to be more efficient than $\hat{\Theta}_2$ if
$Var[\hat{\Theta}_1] < Var[\hat{\Theta}_2]$
• Consistency: for a larger sample size $n$, we want the estimate $\hat{\theta}$ to be closer to the true parameter $\theta$:
$\lim_{n \to \infty} P(|\hat{\Theta} - \theta| > \epsilon) = 0$
[Figure: dot plots illustrating the three properties. Unbiasedness: estimates from repeated samples scatter around the true value x rather than off to one side. Efficiency: estimator 1, whose estimates cluster tightly around x, is better than estimator 2 with a wider spread. Consistency: the same estimator produces a much tighter cluster around x for n = 10000 than for n = 100.]
Example 0.2.1
Let $X_1, \ldots, X_n$ be $n$ i.i.d. random variables with $E[X_i] = \mu$.
Then $\bar{X} = \frac{1}{n}\sum_{i=1}^n X_i$ is an unbiased estimator of $\mu$, because
$E[\bar{X}] = \frac{1}{n}\sum_{i=1}^n E[X_i] = \frac{1}{n}\cdot n\cdot\mu = \mu.$
OK - so once we have an estimator, we can decide whether it has these properties. But how do we find estimators in the first place?
0.2.1 Maximum Likelihood Estimation
Situation: We have $n$ data values $x_1, \ldots, x_n$. The assumption is that these data values are realizations of $n$ i.i.d. random variables $X_1, \ldots, X_n$ with distribution $F_\theta$. Unfortunately, the value of $\theta$ is unknown.
[Figure: observed values $x_1, x_2, x_3, \ldots$ on the x-axis, with the density $f_\theta$ drawn for $\theta = 0$, $\theta = -1.8$, and $\theta = 1$.]
By changing the value of $\theta$ we can "move the density function $f_\theta$ around" - in the diagram, the third density function fits the data best.
Principle: since we do not know the true value $\theta$ of the distribution, we take the value $\hat{\theta}$ that most likely produced the observed values, i.e. we maximize something like
$P(X_1 = x_1 \cap X_2 = x_2 \cap \ldots \cap X_n = x_n) = P(X_1 = x_1)\cdot P(X_2 = x_2)\cdot\ldots\cdot P(X_n = x_n) = \prod_{i=1}^n P(X_i = x_i)$  (*)
where the factorization uses that the $X_i$ are independent.
This is not quite the right way to write the probability if $X_1, \ldots, X_n$ are continuous variables. (Remember: $P(X = x) = 0$ for a continuous variable $X$; this is still valid.)
We use the above "probability" just as a plausibility argument. To get around the problem that $P(X = x) = 0$ for a continuous variable, we will write (*) as:
$\prod_{i=1}^n p_\theta(x_i)$ for discrete $X_i$, and $\prod_{i=1}^n f_\theta(x_i)$ for continuous $X_i$,
where $p_\theta$ is the probability mass function of the discrete $X_i$ (all $X_i$ have the same, since they are identically distributed) and $f_\theta$ is the density function of the continuous $X_i$.
Both these functions depend on $\theta$. In fact, we can read the above expressions as a function of $\theta$. This function, which we will denote by $L(\theta)$, is called the Likelihood function of $X_1, \ldots, X_n$.
The goal is now to find a value $\hat{\theta}$ that maximizes the Likelihood function. (This is what "moves" the density to the right spot, so that it fits the observed values well.)
How do we get a maximum of $L(\theta)$? By the usual way we maximize a function: differentiate it and set the derivative to zero! (After that, we ought to check with the second derivative whether we've actually found a maximum, but we won't do that unless we've found more than one candidate value for $\hat{\theta}$.)
Most of the time it is difficult to differentiate $L(\theta)$ directly - instead we use another trick and maximize $\log L(\theta)$, the Log-Likelihood function. Since the logarithm is strictly increasing, both functions have their maximum at the same $\hat{\theta}$.
Note: though its name is "log", we use the natural logarithm ln.
The plan to find an ML-estimator is:
1. Find the Likelihood function $L(\theta)$.
2. Take the natural log of the Likelihood function, $\log L(\theta)$.
3. Differentiate the log-Likelihood function with respect to $\theta$.
4. Set the derivative to zero.
5. Solve for $\theta$.
Example 0.2.2 Roll a Die
A die is rolled until its face shows a 6.
Repeating this experiment 100 times gave the following results:
[Figure: histogram of the number of rolls of a die until the first 6, over 100 runs.]
k:        1  2  3  4  5  6  7  8  9 11 14 15 16 17 20 21 27 29
# trials: 18 20  8  9  9  5  8  3  5  3  3  3  1  1  1  1  1  1
We know that $k$, the number of rolls until a 6 shows up, has a geometric distribution $Geo_p$. For a fair die, $p$ is 1/6.
The geometric distribution has probability mass function $p(k) = (1-p)^{k-1}\cdot p$.
What is the ML-estimate $\hat{p}$ for $p$?
1. Likelihood function $L(p)$:
Since we have observed 100 outcomes $k_1, \ldots, k_{100}$, the likelihood function is $L(p) = \prod_{i=1}^{100} p(k_i)$:
$L(p) = \prod_{i=1}^{100}(1-p)^{k_i-1}p = p^{100}\cdot\prod_{i=1}^{100}(1-p)^{k_i-1} = p^{100}\cdot(1-p)^{\sum_{i=1}^{100}(k_i-1)} = p^{100}\cdot(1-p)^{\sum_{i=1}^{100}k_i - 100}.$
2. Log of the Likelihood function, $\log L(p)$:
$\log L(p) = \log\left(p^{100}\cdot(1-p)^{\sum_{i=1}^{100}k_i-100}\right) = \log\left(p^{100}\right) + \log\left((1-p)^{\sum_{i=1}^{100}k_i-100}\right) = 100\log p + \left(\sum_{i=1}^{100}k_i - 100\right)\log(1-p).$
3. Differentiate the log-Likelihood with respect to $p$:
$\frac{d}{dp}\log L(p) = 100\frac{1}{p} + \left(\sum_{i=1}^{100}k_i - 100\right)\frac{-1}{1-p} = \frac{1}{p(1-p)}\left(100(1-p) - p\left(\sum_{i=1}^{100}k_i - 100\right)\right) = \frac{1}{p(1-p)}\left(100 - p\sum_{i=1}^{100}k_i\right).$
4. Set the derivative to zero. For the estimate $\hat{p}$ the derivative must be zero:
$\frac{d}{dp}\log L(\hat{p}) = 0 \iff \frac{1}{\hat{p}(1-\hat{p})}\left(100 - \hat{p}\sum_{i=1}^{100}k_i\right) = 0$
5. Solve for $\hat{p}$:
$\frac{1}{\hat{p}(1-\hat{p})}\left(100 - \hat{p}\sum_{i=1}^{100}k_i\right) = 0 \iff 100 - \hat{p}\sum_{i=1}^{100}k_i = 0 \iff \hat{p} = \frac{100}{\sum_{i=1}^{100}k_i}.$
In total, we have an estimate $\hat{p} = \frac{100}{568} = 0.1761.$
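We can sanity-check this estimate numerically: a grid search over the log-likelihood from step 2, using the counts from the table above, should peak at $\hat p = 100/568$. (This sketch is ours, not part of the original notes.)

```python
from math import log

# (k, number of runs) pairs from the table above
table = [(1, 18), (2, 20), (3, 8), (4, 9), (5, 9), (6, 5), (7, 8), (8, 3),
         (9, 5), (11, 3), (14, 3), (15, 3), (16, 1), (17, 1), (20, 1),
         (21, 1), (27, 1), (29, 1)]
n = sum(c for _, c in table)          # number of runs
total = sum(k * c for k, c in table)  # sum of the k_i

def log_likelihood(p):
    # log L(p) = n log p + (sum k_i - n) log(1 - p)
    return n * log(p) + (total - n) * log(1 - p)

# maximize over a fine grid of p values in (0, 1)
grid = [i / 10000 for i in range(1, 10000)]
p_hat = max(grid, key=log_likelihood)
print(n, total, p_hat)  # the maximizer should sit near 100/568
```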
Example 0.2.3 Red Cars in the Parking Lot
The values 3, 2, 3, 3, 4, 1, 4, 2, 4, 3 have been observed while counting the numbers of red cars pulling into parking lot #22 between 8:30 and 8:40 am, Mon to Fri, during two weeks.
The assumption is that these values are realizations of ten independent Poisson variables with (the same) rate $\lambda$.
What is the Maximum Likelihood estimate of $\lambda$?
The probability mass function of a Poisson distribution is $p_\lambda(x) = e^{-\lambda}\cdot\frac{\lambda^x}{x!}$.
We have ten values $x_i$; this gives the Likelihood function:
$L(\lambda) = \prod_{i=1}^{10} e^{-\lambda}\cdot\frac{\lambda^{x_i}}{x_i!} = e^{-10\lambda}\cdot\lambda^{\sum_{i=1}^{10}x_i}\cdot\prod_{i=1}^{10}\frac{1}{x_i!}$
The log-Likelihood then is
$\log L(\lambda) = -10\lambda + \ln(\lambda)\cdot\sum_{i=1}^{10}x_i - \sum_{i=1}^{10}\ln(x_i!).$
Differentiating the log-Likelihood with respect to $\lambda$ gives:
$\frac{d}{d\lambda}\log L(\lambda) = -10 + \frac{1}{\lambda}\cdot\sum_{i=1}^{10}x_i$
Setting it to zero:
$\frac{1}{\hat{\lambda}}\cdot\sum_{i=1}^{10}x_i = 10 \iff \hat{\lambda} = \frac{1}{10}\sum_{i=1}^{10}x_i \iff \hat{\lambda} = \frac{29}{10} = 2.9$
This gives us an estimate for $\lambda$ - and since $\lambda$ is also the expected value of the Poisson distribution, we can say that on average the number of red cars pulling into the parking lot each morning between 8:30 and 8:40 am is 2.9.
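The same pattern as in the die example applies here; the sketch below (ours, not from the notes) checks that the sample mean of the observed counts maximizes the Poisson log-likelihood.

```python
from math import log, lgamma

data = [3, 2, 3, 3, 4, 1, 4, 2, 4, 3]  # observed red-car counts

def log_likelihood(lam):
    # log L(lambda) = -n*lambda + ln(lambda) * sum(x_i) - sum(ln(x_i!))
    return (-len(data) * lam + log(lam) * sum(data)
            - sum(lgamma(x + 1) for x in data))  # lgamma(x+1) = ln(x!)

lam_hat = sum(data) / len(data)  # closed-form ML-estimate: the sample mean
print(lam_hat)                   # 29/10 = 2.9
```

The log-likelihood at `lam_hat` beats nearby values such as 2.8 or 3.0, as the closed-form derivation predicts.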
ML-estimators for $\mu$ and $\sigma^2$ of a Normal distribution
Let $X_1, \ldots, X_n$ be $n$ independent, identically distributed normal variables with $E[X_i] = \mu$ and $Var[X_i] = \sigma^2$. $\mu$ and $\sigma^2$ are unknown.
The normal density function $f_{\mu,\sigma^2}$ is
$f_{\mu,\sigma^2}(x) = \frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{(x-\mu)^2}{2\sigma^2}}$
Since we have $n$ independent variables, the Likelihood function is a product of $n$ densities:
$L(\mu,\sigma^2) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{(x_i-\mu)^2}{2\sigma^2}} = (2\pi\sigma^2)^{-n/2}\cdot e^{-\sum_{i=1}^n\frac{(x_i-\mu)^2}{2\sigma^2}}$
Log-Likelihood:
$\log L(\mu,\sigma^2) = -\frac{n}{2}\ln(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^n(x_i-\mu)^2$
Since we now have two parameters, $\mu$ and $\sigma^2$, we need two partial derivatives of the log-Likelihood:
$\frac{\partial}{\partial\mu}\log L(\mu,\sigma^2) = 0 - \frac{1}{2\sigma^2}\sum_{i=1}^n 2(x_i-\mu)\cdot(-1) = \frac{1}{\sigma^2}\sum_{i=1}^n(x_i-\mu)$
$\frac{\partial}{\partial\sigma^2}\log L(\mu,\sigma^2) = -\frac{n}{2}\frac{1}{\sigma^2} + \frac{1}{2(\sigma^2)^2}\sum_{i=1}^n(x_i-\mu)^2$
We now must find values for $\mu$ and $\sigma^2$ that make both derivatives zero at the same time.
Setting $\frac{\partial}{\partial\mu}\log L(\mu,\sigma^2) = 0$ gives
$\hat{\mu} = \frac{1}{n}\sum_{i=1}^n x_i,$
and plugging this value into the derivative for $\sigma^2$ and setting $\frac{\partial}{\partial\sigma^2}\log L(\hat{\mu},\sigma^2) = 0$ gives
$\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^n(x_i-\hat{\mu})^2$
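These closed forms are easy to confirm numerically. The sketch below (our own; the sample values are made up for illustration) checks that the formula values beat small perturbations in the normal log-likelihood.

```python
from math import log, pi

data = [4.2, 5.1, 3.9, 5.6, 4.8, 5.0, 4.4]  # made-up sample values

def log_likelihood(mu, var):
    """log L(mu, sigma^2) from the derivation above."""
    n = len(data)
    return (-0.5 * n * log(2 * pi * var)
            - sum((x - mu) ** 2 for x in data) / (2 * var))

n = len(data)
mu_hat = sum(data) / n
var_hat = sum((x - mu_hat) ** 2 for x in data) / n  # note: divides by n, not n-1

best = log_likelihood(mu_hat, var_hat)
print(mu_hat, var_hat, best)
```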
0.3 Confidence intervals
The previous section has provided a way to compute point estimates for parameters. Based on that, our next question is: how good is this point estimate? Or: how close is the estimate to the true value of the parameter?
Instead of just looking at the point estimate, we will now try to compute an interval around the estimated parameter value in which the true parameter is "likely" to fall. An interval like that is called a confidence interval.
Definition 0.3.1 (Confidence Interval)
Let $\hat{\theta}$ be an estimate of $\theta$.
If $P(|\hat{\theta} - \theta| < e) > \alpha$, we say that the interval $(\hat{\theta}-e, \hat{\theta}+e)$ is an $\alpha\cdot 100\%$ confidence interval of $\theta$ (cf. Figure 1).
Usually, $\alpha$ is a value near 1, such as 0.9, 0.95, 0.99, 0.999, etc.
Note:
• for any given set of values $x_1, \ldots, x_n$ the value of $\hat{\theta}$ is fixed, as is the interval $(\hat{\theta}-e, \hat{\theta}+e)$.
• The true value $\theta$ is either within the confidence interval or not.
[Figure: a density curve over $\mu$ with tails of probability $\le 1-\alpha$ shaded on each side; the interval $(\bar{x}-e, \bar{x}+e)$ is marked as the confidence interval for $\mu$, with $P(\mu - e < \bar{x} < \mu + e) > \alpha$.]
Figure 1: The probability that $\bar{x}$ falls into an $e$-interval around $\mu$ is $\alpha$. Vice versa, we know that for all of those $\bar{x}$, $\mu$ is within an $e$-interval around $\bar{x}$. That's the idea of a confidence interval.
A lot of people are tempted to reformulate the above probability to:
!!DON'T DO!!
$P(\hat{\theta} - e < \theta < \hat{\theta} + e) > \alpha$
Though it looks ok, it's not. Repeat: IT IS NOT OK.
$\theta$ is a fixed value - therefore, it does not have a probability of falling into some interval.
The only probability that we have here is
$P(\theta - e < \hat{\theta} < \theta + e) > \alpha,$
so we can say that $\hat{\theta}$ has a probability of at least $\alpha$ of falling into an $e$-interval around $\theta$. Unfortunately, that alone doesn't help, since we do not know $\theta$!
How do we compute confidence intervals, then? That is different for each estimator.
First, we look at estimates of the mean of a distribution:
0.3.1 Large sample C.I. for $\mu$
Situation: we have a large set of observed values ($n > 30$, usually).
The assumption is that these values are realizations of $n$ i.i.d. random variables $X_1, \ldots, X_n$ with $E[X_i] = \mu$ and $Var[X_i] = \sigma^2$.
We already know from the previous section that $\bar{X}$ is an unbiased estimator for $\mu$ (and, for normal data, the ML-estimator).
But we know more! The CLT tells us that in exactly this situation $\bar{X}$ is an approximately normally distributed random variable with $E[\bar{X}] = \mu$ and $Var[\bar{X}] = \frac{\sigma^2}{n}$.
We can therefore find the boundary $e$ by using the standard normal distribution. Remember: if $\bar{X} \sim N(\mu, \sigma^2/n)$, then $Z := \frac{\bar{X}-\mu}{\sigma/\sqrt{n}} \sim N(0,1) = \Phi$:
$P(|\bar{X}-\mu| \le e) \ge \alpha$  (use standardization)
$\iff P\left(\frac{|\bar{X}-\mu|}{\sigma/\sqrt{n}} \le \frac{e}{\sigma/\sqrt{n}}\right) \ge \alpha$
$\iff P\left(|Z| \le \frac{e}{\sigma/\sqrt{n}}\right) \ge \alpha$
$\iff P\left(-\frac{e}{\sigma/\sqrt{n}} < Z < \frac{e}{\sigma/\sqrt{n}}\right) \ge \alpha$
$\iff \Phi\left(\frac{e}{\sigma/\sqrt{n}}\right) - \Phi\left(-\frac{e}{\sigma/\sqrt{n}}\right) \ge \alpha$
$\iff \Phi\left(\frac{e}{\sigma/\sqrt{n}}\right) - \left(1 - \Phi\left(\frac{e}{\sigma/\sqrt{n}}\right)\right) \ge \alpha$
$\iff 2\Phi\left(\frac{e}{\sigma/\sqrt{n}}\right) - 1 \ge \alpha$
$\iff \Phi\left(\frac{e}{\sigma/\sqrt{n}}\right) \ge \frac{1+\alpha}{2}$
$\iff \frac{e}{\sigma/\sqrt{n}} \ge \Phi^{-1}\left(\frac{1+\alpha}{2}\right)$
$\iff e \ge \underbrace{\Phi^{-1}\left(\frac{1+\alpha}{2}\right)}_{:=z}\cdot\frac{\sigma}{\sqrt{n}}$
This computation gives an $\alpha\cdot 100\%$ confidence interval around $\mu$ as:
$\left(\bar{X} - z\cdot\frac{\sigma}{\sqrt{n}},\ \bar{X} + z\cdot\frac{\sigma}{\sqrt{n}}\right)$
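This interval is straightforward to compute in code. The sketch below (function name is ours) uses Python's `statistics.NormalDist` for $\Phi^{-1}$ and checks itself against the salary example that follows:

```python
from math import sqrt
from statistics import NormalDist

def mean_ci(xbar, sigma, n, alpha):
    """alpha*100% confidence interval for mu with known (or estimated) sigma."""
    z = NormalDist().inv_cdf((1 + alpha) / 2)  # z = Phi^{-1}((1 + alpha)/2)
    e = z * sigma / sqrt(n)
    return xbar - e, xbar + e

# salary example below: xbar = 21543, sigma = 3000, n = 100, alpha = 0.95
lo, hi = mean_ci(21543, 3000, 100, 0.95)
print(lo, hi)  # roughly 21543 -/+ 588
```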
Now we can do an example:
Example 0.3.1
Suppose we want to find a 95% confidence interval for the mean salary of an ISU employee.
A random sample of 100 ISU employees gives us a sample mean salary of $\bar{x} = \$21543$.
Suppose the standard deviation of salaries is known to be \$3000.
By using the above expression, we get a 95% confidence interval as:
$21543 \pm \Phi^{-1}\left(\frac{1+0.95}{2}\right)\cdot\frac{3000}{\sqrt{100}} = 21543 \pm \Phi^{-1}(0.975)\cdot 300$
How do we read $\Phi^{-1}(0.975)$ from the standard normal table? We look for the $z$ for which the probability $N_{(0,1)}(z) \ge 0.975$!
This gives us $z = 1.96$; the 95% confidence interval is then:
$21543 \pm 588,$
i.e. if we repeat this study 100 times (with 100 different employees each time), we can say: in 95 out of 100 studies, the true parameter $\mu$ falls into a \$588 range around $\bar{x}$.
Critical values for $z$, depending on $\alpha$, are:
$\alpha$    $z = \Phi^{-1}(\frac{1+\alpha}{2})$
0.90    1.65
0.95    1.96
0.98    2.33
0.99    2.58
Problem: Usually, we do not know $\sigma$.
Slight generalization: use $s = \sqrt{\frac{1}{n-1}\sum_{i=1}^n(X_i-\bar{X})^2}$ instead of $\sigma$!
An $\alpha\cdot 100\%$ confidence interval for $\mu$ is then given as
$\left(\bar{X} - z\cdot\frac{s}{\sqrt{n}},\ \bar{X} + z\cdot\frac{s}{\sqrt{n}}\right)$
where $z = \Phi^{-1}\left(\frac{1+\alpha}{2}\right)$.
Example 0.3.2
Suppose we want to analyze some complicated queueing system, for which we have no formulas and theory. We are interested in the mean queue length of the system after reaching steady state.
The only thing possible for us is to run simulations of this system and look at the queue length at some large time $t$, e.g. $t = 1000$ hrs.
After 50 simulations, we have got data:
$X_1$ = number in queue at time 1000 hrs in 1st simulation
$X_2$ = number in queue at time 1000 hrs in 2nd simulation
. . .
$X_{50}$ = number in queue at time 1000 hrs in 50th simulation
Our observations yield an average queue length of $\bar{x} = 21.5$ and $s = \sqrt{\frac{1}{n-1}\sum_{i=1}^n(x_i-\bar{x})^2} = 15$.
A 90% confidence interval is given as
$\left(\bar{x} - z\cdot\frac{s}{\sqrt{n}},\ \bar{x} + z\cdot\frac{s}{\sqrt{n}}\right) = \left(21.5 - 1.65\cdot\frac{15}{\sqrt{50}},\ 21.5 + 1.65\cdot\frac{15}{\sqrt{50}}\right) = (17.9998,\ 25.0002)$
Example 0.3.3
The graphs show a set of 80 experiments. The values from each experiment are shown in one of the green framed boxes. Each experiment consists of simulating 20 values from a standard normal distribution (these are drawn as the small blue lines). For each of the experiments, the average of the 20 values is computed (that's $\bar{x}$), as well as a confidence interval for $\mu$ - for parts a) and b) it's the 95% confidence interval, for part c) it is the 90% confidence interval, and for part d) it is the 99% confidence interval. The upper and lower confidence bounds, together with the sample mean, are drawn in red next to the sampled observations.
[Figure: four panels of simulated confidence intervals - a) 95% confidence intervals, b) 95% confidence intervals, c) 90% confidence intervals, d) 99% confidence intervals.]
There are several things to see from this diagram. First of all, in this example we know the "true" value of the parameter $\mu$ - since the observations are sampled from a standard normal distribution, $\mu = 0$. The true parameter is represented by the straight horizontal line through 0.
We see that each sample yields a different confidence interval, all of them centered around the sample mean. The different sizes of the intervals tell us another thing: in computing these confidence intervals, we had to use the estimate $s$ instead of the true standard deviation $\sigma = 1$, and each sample gave a slightly different standard deviation. Overall, though, the interval lengths are not very different between parts a) and b). The intervals in c) tend to be slightly smaller - these are 90% confidence intervals - whereas the intervals in part d) are on average larger than the first ones; they are 99% confidence intervals.
Almost all the confidence intervals contain 0 - but not all. And that is what we expect. For a 90% confidence interval we expect that in 10 out of 100 times the confidence interval does not contain the true parameter. When we check that, we see that in part c) 4 out of the 20 confidence intervals don't contain the true parameter $\mu$ - that's 20%, while on average we would expect 10% of the confidence intervals not to contain $\mu$.
Official use of Confidence Intervals:
On average, in 90 out of 100 times the 90% confidence interval of $\theta$ does contain the true value of $\theta$.
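This coverage statement can itself be checked by simulation, much like the 80-experiment diagram above (a sketch of ours, not from the notes): draw many samples from a known $N(0,1)$, build a 90% interval from each, and count how often the interval covers the true $\mu = 0$. With $s$ estimated and $n = 30$, the empirical coverage lands close to, and typically slightly below, the nominal 90%.

```python
import random
from math import sqrt
from statistics import NormalDist, mean, stdev

random.seed(2)
z = NormalDist().inv_cdf(0.95)  # z for a 90% two-sided interval
trials, covered = 2000, 0
for _ in range(trials):
    sample = [random.gauss(0, 1) for _ in range(30)]
    xbar, s = mean(sample), stdev(sample)
    e = z * s / sqrt(30)
    if xbar - e < 0 < xbar + e:  # does the interval cover the true mu = 0?
        covered += 1
coverage = covered / trials
print(coverage)  # should be close to 0.90
```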
0.3.2 Large sample confidence intervals for a proportion p
Let $p$ be a proportion of a large population, or a probability.
In order to get an estimate for this proportion, we can take a sample of $n$ individuals from the population and check each one of them as to whether or not they fulfill the criterion to be in the proportion of interest.
Mathematically, this corresponds to a Bernoulli-$n$-sequence, where we are only interested in the number of "successes", $X$, which in our case corresponds to the number of individuals that qualify for the interesting subgroup.
$X$ then has a Binomial distribution with parameters $n$ and $p$.
Now think: for a Binomial variable $X$, the expected value is $E[X] = n\cdot p$. Therefore we get an estimate $\hat{p}$ for $p$ as $\hat{p} = \frac{1}{n}X$.
Furthermore, we even have a distribution for $\hat{p}$ for large $n$: since $X$ is, by the CLT, approximately a normal variable with $E[X] = np$ and $Var[X] = np(1-p)$, we get that for large $n$, $\hat{p}$ is approximately normally distributed with $E[\hat{p}] = p$ and $Var[\hat{p}] = \frac{p(1-p)}{n}$.
By the way: this tells us that $\hat{p}$ is an unbiased estimator of $p$.
Equipped with the distribution of $\hat{p}$, we can set up an $\alpha\cdot 100\%$ confidence interval as:
$(\hat{p} - e,\ \hat{p} + e)$
where $e$ is some positive real number with:
$P(|\hat{p} - p| \le e) \ge \alpha$
We can derive the expression for $e$ in the same way as in the previous section and come up with:
$e = z\cdot\sqrt{\frac{p(1-p)}{n}}$
where $z = \Phi^{-1}\left(\frac{1+\alpha}{2}\right)$.
We again run into the problem that $e$ in this form is not ready for use, since we do not know the value of $p$. In this situation, we have different options: we can either replace $p(1-p)$ by the value that maximizes it, or we can substitute an appropriate estimate for $p$.
0.3.2.1 Conservative Method:
Replace $p(1-p)$ by something that is guaranteed to be at least as large: the function $p(1-p)$ has its maximum at $p = 0.5$, where $p(1-p) = 0.25$, so $\sqrt{p(1-p)/n} \le \sqrt{0.25/n} = \frac{1}{2\sqrt{n}}$.
The conservative $\alpha\cdot 100\%$ confidence interval for $p$ is
$\hat{p} \pm z\cdot\frac{1}{2\sqrt{n}}$
where $z = \Phi^{-1}\left(\frac{1+\alpha}{2}\right)$.
0.3.2.2 Substitution Method:
Substitute $\hat{p}$ for $p$. The $\alpha\cdot 100\%$ confidence interval for $p$ by substitution is
$\hat{p} \pm z\cdot\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$
where $z = \Phi^{-1}\left(\frac{1+\alpha}{2}\right)$.
What is the difference between the two methods?
• for large $n$ there is almost no difference at all
• if $\hat{p}$ is close to 0.5, there is also almost no difference
Besides that, conservative confidence intervals are (as the name says) larger than confidence intervals found by substitution. However, they are at the same time easier to compute.
Example 0.3.4 Complicated queueing system, continued
Suppose that now we are interested in the large-$t$ probability $p$ that a server is available.
Doing 100 simulations has shown that in 60 of them a server was available at time $t = 1000$ hrs.
What is a 95% confidence interval for this probability?
Since 60 out of 100 simulations showed a free server, we can use $\hat{p} = \frac{60}{100} = 0.6$ as an estimate for $p$.
For a 95% confidence interval, $z = \Phi^{-1}(0.975) = 1.96$.
The conservative confidence interval is:
$\hat{p} \pm z\cdot\frac{1}{2\sqrt{n}} = 0.6 \pm 1.96\cdot\frac{1}{2\sqrt{100}} = 0.6 \pm 0.098.$
For the confidence interval using substitution we get:
$\hat{p} \pm z\cdot\sqrt{\frac{\hat{p}(1-\hat{p})}{n}} = 0.6 \pm 1.96\cdot\sqrt{\frac{0.6\cdot 0.4}{100}} = 0.6 \pm 0.096.$
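Both margins are one-liners in code; the sketch below (the helper name is ours) reproduces the server-availability numbers:

```python
from math import sqrt

def prop_ci_margins(p_hat, n, z):
    """Return (conservative, substitution) margins e for a proportion CI."""
    conservative = z / (2 * sqrt(n))                 # z * 1/(2 sqrt(n))
    substitution = z * sqrt(p_hat * (1 - p_hat) / n) # z * sqrt(p_hat(1-p_hat)/n)
    return conservative, substitution

e_cons, e_subst = prop_ci_margins(0.6, 100, 1.96)  # server-availability example
print(e_cons, e_subst)  # roughly 0.098 and 0.096
```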
Example 0.3.5 Batting Average
In the 2002 season the baseball player Sammy Sosa had a batting average of 0.288. (The batting average is the ratio of the number of hits to the number of times at bat.) Sammy Sosa was at bat 555 times in the 2002 season.
Could the "true" batting average still be 0.300?
Compute a 95% confidence interval for the true batting average.
The conservative method gives:
$0.288 \pm 1.96\cdot\frac{1}{2\sqrt{555}} = 0.288 \pm 0.042$
The substitution method gives:
$0.288 \pm 1.96\cdot\sqrt{\frac{0.288(1-0.288)}{555}} = 0.288 \pm 0.038$
The substitution method gives a slightly smaller confidence interval, but both intervals contain 0.3. There is not enough evidence to allow the conclusion that the true average is not 0.3.
Confidence intervals give us a way to measure the precision we get from simulations intended to evaluate probabilities. But besides that, they also give us a way to plan how large a sample size has to be to reach a desired precision.
Example 0.3.6
Suppose we want to estimate the fraction of records in the 2000 IRS database that have a taxable income over \$35K.
We want to get a 98% confidence interval and wish to estimate the quantity to within 0.01.
This means that our boundary $e$ needs to be smaller than 0.01 (we choose a conservative confidence interval for ease of computation):
$e \le 0.01$
$\iff z\cdot\frac{1}{2\sqrt{n}} \le 0.01$   (for 98%, $z$ is 2.33)
$\iff 2.33\cdot\frac{1}{2\sqrt{n}} \le 0.01$
$\iff \sqrt{n} \ge \frac{2.33}{2\cdot 0.01} = 116.5$
$\Rightarrow n \ge 13573$
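The same sample-size calculation as a tiny helper (name is ours), which reproduces the IRS figure:

```python
from math import ceil

def conservative_n(e, z):
    """Smallest n with z * 1/(2*sqrt(n)) <= e (conservative method)."""
    return ceil((z / (2 * e)) ** 2)

n_needed = conservative_n(0.01, 2.33)  # the IRS example: 116.5^2 = 13572.25
print(n_needed)
```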
0.3.3 Related C.I. Methods
Related to the previous confidence intervals are confidence intervals for the difference between two means, $\mu_1 - \mu_2$, or the difference between two proportions, $p_1 - p_2$.
Confidence intervals for these differences are given as:
Large-$n$ confidence interval for $\mu_1 - \mu_2$ (based on independent $\bar{X}_1$ and $\bar{X}_2$):
$\bar{X}_1 - \bar{X}_2 \pm z\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$
Large-$n$ confidence interval for $p_1 - p_2$ (based on independent $\hat{p}_1$ and $\hat{p}_2$):
$\hat{p}_1 - \hat{p}_2 \pm z\,\frac{1}{2}\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}$ (conservative)
or $\hat{p}_1 - \hat{p}_2 \pm z\sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \frac{\hat{p}_2(1-\hat{p}_2)}{n_2}}$ (substitution)
Why? The argument in both cases is very similar - we will only discuss the confidence interval for the difference between means.
$\bar{X}_1 - \bar{X}_2$ is approximately normal, since $\bar{X}_1$ and $\bar{X}_2$ are approximately normal, with ($\bar{X}_1, \bar{X}_2$ independent)
$E[\bar{X}_1 - \bar{X}_2] = E[\bar{X}_1] - E[\bar{X}_2] = \mu_1 - \mu_2$
$Var[\bar{X}_1 - \bar{X}_2] = Var[\bar{X}_1] + (-1)^2 Var[\bar{X}_2] = \frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}$
16<br />
Then we can use the same arguments as before and get a C.I. for µ 1 − µ 2 as shown above.<br />
✷<br />
Example 0.3.7
Assume we have two parts of the IRS database: East Coast and West Coast. We want to compare the mean taxable income reported from the two regions in 2000.

                         East Coast     West Coast
# of sampled records:    n₁ = 1000      n₂ = 2000
mean taxable income:     ¯x₁ = $37000   ¯x₂ = $42000
standard deviation:      s₁ = $10100    s₂ = $15600

We can, for example, compute a two-sided 95% confidence interval for µ₁ − µ₂, the difference in mean taxable income as reported on 2000 tax returns between East and West Coast:

37000 − 42000 ± 1.96 · √(10100²/1000 + 15600²/2000) = −5000 ± 927

Note: this shows pretty conclusively that the mean West Coast taxable income is higher than the mean East Coast taxable income (in the 2000 reports). The interval contains only negative numbers; if it contained 0, the message wouldn't be so clear.
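The interval above can be reproduced with a small helper; a sketch under the large-n formula (the function name is my own):

```python
from math import sqrt

def two_mean_ci(xbar1, xbar2, s1, s2, n1, n2, z):
    """Large-n CI for mu1 - mu2: (xbar1 - xbar2) +/- z*sqrt(s1^2/n1 + s2^2/n2)."""
    center = xbar1 - xbar2
    half = z * sqrt(s1**2 / n1 + s2**2 / n2)
    return center - half, center + half

# The IRS example: z = 1.96 for a two-sided 95% interval.
lo, hi = two_mean_ci(37000, 42000, 10100, 15600, 1000, 2000, z=1.96)
print(round(lo), round(hi))
```

Both endpoints come out negative, matching the conclusion that the West Coast mean is higher.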
One-sided intervals
Idea: use only one of the two end points of ¯x ± z·s/√n.
This yields confidence intervals for µ of the form

(−∞, #)   (upper bound)     or     (#, ∞)   (lower bound)

However, we now need to adjust z to the new situation: instead of worrying about two tails of the normal distribution, a one-sided confidence interval uses only one tail.
[Figure 2: One-sided (upper-bounded) confidence interval for µ (in red).]
Example 0.3.8 complicated queueing system, continued
What is a 95% upper confidence bound for µ, the parameter for the length of the queue?
¯x + z·s/√n is the upper confidence bound. Instead of z = Φ⁻¹((α + 1)/2) we use z = Φ⁻¹(α) (see Fig. 1.2).
This gives: 21.5 + 1.65 · 15/√50 = 25.0 as the upper confidence bound. Therefore the one-sided, upper-bounded confidence interval is (−∞, 25.0).
Critical values z = Φ⁻¹(α) for the one-sided confidence interval are:

α      z = Φ⁻¹(α)
0.90   1.29
0.95   1.65
0.98   2.06
0.99   2.33
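These critical values can be reproduced from the standard normal quantile function in Python's standard library; note that a couple of the tabled entries are rounded up slightly (the exact values are Φ⁻¹(0.95) = 1.645 and Φ⁻¹(0.98) = 2.054):

```python
from statistics import NormalDist

# Exact one-sided critical values z = Phi^{-1}(alpha).
for alpha in (0.90, 0.95, 0.98, 0.99):
    print(alpha, round(NormalDist().inv_cdf(alpha), 3))
```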
Example 0.3.9
Two different digital communication systems each send 100 large messages, and we determine how many are corrupted in transmission: p̂₁ = 0.05 and p̂₂ = 0.10.
What's the difference in the corruption rates? Find a 98% confidence interval, using the substitution method:

0.05 − 0.10 ± 2.33 · √(0.05 · 0.95/100 + 0.10 · 0.90/100) = −0.05 ± 0.086

This calculation tells us that, based on these sample sizes, we don't even have a solid idea about the sign of p₁ − p₂, i.e. we can't tell which of the pᵢ is larger.
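A quick sketch of the substitution-method interval for two proportions (the function name is my own):

```python
from math import sqrt

def two_prop_ci(p1, p2, n1, n2, z):
    # substitution method: p1 - p2 +/- z*sqrt(p1(1-p1)/n1 + p2(1-p2)/n2)
    half = z * sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return p1 - p2 - half, p1 - p2 + half

# The two communication systems: 98% interval, z = 2.33.
lo, hi = two_prop_ci(0.05, 0.10, 100, 100, z=2.33)
print(round(lo, 3), round(hi, 3))
```

The interval straddles 0, which is exactly why the sign of p₁ − p₂ stays unresolved.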
So far we have only considered large-sample confidence intervals. The problem with smaller sample sizes is that the normal approximation from the CLT doesn't work well when the standard deviation σ is unknown. What you need to know is that there exist different methods to compute confidence intervals for smaller sample sizes.
0.4 Hypothesis Testing
Example 0.4.1 Tea-Tasting Lady
It is claimed that a certain lady is able to tell, by tasting a cup of tea with milk, whether the milk was put in first or the tea was put in first.
To put the claim to the test, the lady is given 10 cups of tea to taste and is asked to state in each case whether the milk went in first or the tea went in first. To guard against deliberate or accidental communication of information, before pouring each cup of tea a coin is tossed to decide whether the milk goes in first or the tea goes in first. The person who brings the cup of tea to the lady does not know the outcome of the coin toss.
Either the lady has some skill (she can tell the difference to some extent) or she has not, in which case she is simply guessing.
Suppose the lady tasted 10 cups of tea in this manner and got 9 of them right. This looks rather suspicious; the lady seems to have some skill. But how can we check it?
We start with the sceptical assumption that the lady does not have any skill. If the lady has no skill at all, the probability that she gives a correct answer for any single cup of tea is 1/2. The number of cups she gets right therefore has a Binomial distribution with parameters n = 10 and p = 0.5. The diagram shows the probability mass function of this distribution:
[Figure: probability mass function p(x) of the B(10, 0.5) distribution, with the observed value x marked.]
Events that are as unlikely or less likely are that the lady got all 10 cups right or, very different but nevertheless just as rare, that she got only 1 cup or none right (note, this would be evidence of some "anti-skill", but it would certainly be evidence against her guessing).
The total probability of these events is (remember, the Binomial probability mass function is p(x) = (n choose x) pˣ(1 − p)ⁿ⁻ˣ):

p(0) + p(1) + p(9) + p(10) = 0.5¹⁰ + 10 · 0.5¹⁰ + 10 · 0.5¹⁰ + 0.5¹⁰ = 0.021

i.e. what we have just observed is a fairly rare event under the assumption that the lady is only guessing. This suggests that the lady may have some skill in detecting which was poured first into the cup.
Jargon: 0.021 is called the p-value for testing the hypothesis p = 0.5. The fact that the p-value is small is evidence against the hypothesis.
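The p-value computation above is a one-liner with the Binomial probability mass function:

```python
from math import comb

# Probability, under the guessing hypothesis p = 0.5, of a result at least
# as extreme as 9 of 10 correct: x in {0, 1, 9, 10}.
p_value = sum(comb(10, x) * 0.5**10 for x in (0, 1, 9, 10))
print(round(p_value, 3))  # 0.021
```

The exact value is 22/1024 ≈ 0.0215, which rounds to the 0.021 quoted above.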
Hypothesis testing is a formal procedure to check whether or not some previously made assumption can be rejected based on the data. We are going to abstract the main elements of the previous example into a standard series of steps for hypothesis testing:
Example 0.4.2
University CC administrators have historical records indicating that between August and October 2002 the mean time between hits on the ISU homepage was 2 min. They suspect that the mean time between hits has since decreased (i.e. traffic is up). Sampling 50 inter-arrival times from the records for November 2002 gives ¯x = 1.7 min and s = 1.9 min.
Is this strong evidence for an increase in traffic?
Formal Procedure (with application to the example)
1. State a "null hypothesis" of the form H₀: function of parameter(s) = #, meant to embody a status quo / pre-data view.
   Here: H₀: µ = 2.0 min between hits.
2. State an "alternative hypothesis" of the form Hₐ: function of parameter(s) >, ≠, or < #, meant to identify departure from H₀.
   Here: Hₐ: µ < 2 (traffic is up).
3. State test criteria, consisting of a test statistic, a "reference distribution" giving the behavior of the test statistic if H₀ is true, and the kinds of values of the test statistic that count as evidence against H₀.
   Here: the test statistic will be Z = (¯X − 2.0)/(s/√n); the reference distribution will be standard normal; large negative values of Z count as evidence against H₀ in favor of Hₐ.
4. Show computations.
   Here: the sample gives z = (1.7 − 2.0)/(1.9/√50) = −1.12.
5. Report and interpret a p-value, the "observed level of significance with which H₀ can be rejected". This is the probability of an observed value of the test statistic at least as extreme as the one at hand; the smaller this value is, the less likely it is that H₀ is true.
   Here: the p-value is P(Z ≤ −1.12) = Φ(−1.12) = 0.1314. This value is not terribly small; the evidence of a decrease in mean time between hits is somewhat weak.
Note aside: a 90% confidence interval for µ is ¯x ± 1.65·s/√n = 1.7 ± 0.44. This interval contains the hypothesized value µ = 2.0.
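Steps 3 to 5 of the example can be carried out directly in Python, which avoids the small rounding error from using z = −1.12 (the exact z is about −1.117, giving a p-value of roughly 0.132 instead of 0.1314):

```python
from math import sqrt
from statistics import NormalDist

# Homepage-traffic example: one-sided z-test of H0: mu = 2.0 vs Ha: mu < 2.0.
xbar, s, n, mu0 = 1.7, 1.9, 50, 2.0
z = (xbar - mu0) / (s / sqrt(n))   # test statistic
p_value = NormalDist().cdf(z)      # left-tail probability P(Z <= z)
print(round(z, 2), round(p_value, 3))
```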
There are four basic hypothesis tests of this form, testing a mean, a proportion, or the difference between two means or two proportions. Depending on the hypothesis, the test statistic is different. Here's an overview of the tests we are going to use:

Hypothesis          Statistic                                           Reference Distribution
H₀: µ = #           Z = (¯X − #)/(s/√n)                                 Z is standard normal
H₀: p = #           Z = (p̂ − #)/√(#(1 − #)/n)                           Z is standard normal
H₀: µ₁ − µ₂ = #     Z = (¯X₁ − ¯X₂ − #)/√(s₁²/n₁ + s₂²/n₂)              Z is standard normal
H₀: p₁ − p₂ = #     Z = (p̂₁ − p̂₂ − #)/(√(p̂(1 − p̂)) · √(1/n₁ + 1/n₂)),   Z is standard normal
                    where p̂ = (n₁p̂₁ + n₂p̂₂)/(n₁ + n₂)
Example 0.4.3 tax fraud
Historically, IRS taxpayer compliance audits have revealed that about 5% of individuals do things on their tax returns that invite criminal prosecution.
A sample of n = 1000 tax returns produces p̂ = 0.061 as an estimate of the fraction of fraudulent returns. Does this provide a clear signal of change in taxpayer behavior?
1. State null hypothesis: H₀: p = 0.05
2. Alternative hypothesis: Hₐ: p ≠ 0.05
3. Test statistic:

   Z = (p̂ − 0.05)/√(0.05 · 0.95/n)

   Under the null hypothesis, Z has a standard normal distribution; large values of Z, positive or negative, will count as evidence against H₀.
4. Computation: z = (0.061 − 0.05)/√(0.05 · 0.95/1000) = 1.59
5. p-value: P(|Z| ≥ 1.59) = P(Z ≤ −1.59) + P(Z ≥ 1.59) = 0.11. This is not a very small value; we therefore have only very weak evidence against H₀.
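The same test in code (values match the hand computation up to rounding):

```python
from math import sqrt
from statistics import NormalDist

# Tax-fraud example: two-sided z-test for a single proportion.
phat, p0, n = 0.061, 0.05, 1000
z = (phat - p0) / sqrt(p0 * (1 - p0) / n)
p_value = 2 * (1 - NormalDist().cdf(abs(z)))   # both tails
print(round(z, 2), round(p_value, 2))
```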
Example 0.4.4 lifetime of disk drives
n₁ = 30 and n₂ = 40 disk drives of two different designs were tested under conditions of "accelerated" stress and the times to failure recorded:

Standard Design   New Design
n₁ = 30           n₂ = 40
¯x₁ = 1205 hr     ¯x₂ = 1400 hr
s₁ = 1000 hr      s₂ = 900 hr

Does this provide conclusive evidence that the new design has a larger mean time to failure under "accelerated" stress conditions?
1. State null hypothesis: H₀: µ₁ = µ₂ (µ₁ − µ₂ = 0)
2. Alternative hypothesis: Hₐ: µ₁ < µ₂ (µ₁ − µ₂ < 0)
3. Test statistic:

   Z = (¯x₁ − ¯x₂ − 0)/√(s₁²/n₁ + s₂²/n₂)

   Under the null hypothesis, Z has a standard normal distribution; we will consider large negative values of Z as evidence against H₀.
4. Computation: z = (1205 − 1400 − 0)/√(1000²/30 + 900²/40) = −0.84
5. p-value: P(Z < −0.84) = 0.2005. This is not a very small value; we therefore have only very weak evidence against H₀.
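The disk-drive computation, done directly:

```python
from math import sqrt
from statistics import NormalDist

# Disk-drive example: one-sided z-test for the difference of two means.
x1, s1, n1 = 1205, 1000, 30
x2, s2, n2 = 1400, 900, 40
z = (x1 - x2) / sqrt(s1**2 / n1 + s2**2 / n2)
p_value = NormalDist().cdf(z)   # Ha: mu1 < mu2, so use the left tail
print(round(z, 2), round(p_value, 2))
```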
Example 0.4.5 queueing systems
We have two very complicated queueing systems and would like to know whether there is a difference in the large-t probabilities of there being an available server. We run simulations for each system and check whether at time t = 2000 there is a server available:

System 1          System 2
n₁ = 1000 runs    n₂ = 500 runs (each with a different random seed)
server available at t = 2000:
p̂₁ = 551/1000     p̂₂ = 303/500

How strong is the evidence of a difference between the t = 2000 availability of a server for the two systems?
1. State null hypothesis: H₀: p₁ = p₂ (p₁ − p₂ = 0)
2. Alternative hypothesis: Hₐ: p₁ ≠ p₂ (p₁ − p₂ ≠ 0)
3. Preliminary: note that, if there were no difference between the two systems, a plausible estimate of the availability of a server would be

   p̂ = (n₁p̂₁ + n₂p̂₂)/(n₁ + n₂) = (551 + 303)/(1000 + 500) = 0.569

   A test statistic is:

   Z = (p̂₁ − p̂₂ − 0)/(√(p̂(1 − p̂)) · √(1/n₁ + 1/n₂))

   Under the null hypothesis, Z has a standard normal distribution; we will consider large absolute values of Z as evidence against H₀.
4. Computation: z = (0.551 − 0.606)/(√(0.569 · (1 − 0.569)) · √(1/1000 + 1/500)) = −2.03
5. p-value: P(|Z| > 2.03) = 0.04. This is fairly strong evidence of a real difference in the t = 2000 availability of a server between the two systems.
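The pooled two-proportion test in code, working from the raw counts:

```python
from math import sqrt
from statistics import NormalDist

# Queueing example: pooled two-sided z-test for p1 - p2 = 0.
x1, n1 = 551, 1000
x2, n2 = 303, 500
p1, p2 = x1 / n1, x2 / n2
pooled = (x1 + x2) / (n1 + n2)   # common estimate of p under H0
z = (p1 - p2) / (sqrt(pooled * (1 - pooled)) * sqrt(1 / n1 + 1 / n2))
p_value = 2 * (1 - NormalDist().cdf(abs(z)))
print(round(z, 2), round(p_value, 2))
```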
0.5 Goodness of Fit Tests
The basic situation is still the same as in the previous section: we have n realizations x₁, ..., xₙ (observed data) of independent, identically distributed random variables X₁, ..., Xₙ.
A goodness-of-fit test is different from the previous ones. Here we don't test a single parameter; we test the whole distribution underlying our observations. Basically, the null hypothesis will be H₀: the data follow a specified distribution F vs. Hₐ: the data do not follow the specified distribution.
For this problem there are different approaches, depending on whether the specified distribution is continuous or discrete. For simplicity, we will only consider the case of a finite discrete distribution, i.e. we are dealing with a finite sample space Ω = {1, ..., k} on which we have a probability mass function p.
The above null hypothesis then becomes

H₀: p_X(i) = p(i) for i = 1, ..., k
vs.
Hₐ: p_X(i) ≠ p(i) for at least one i ∈ {1, ..., k}
Example 0.5.1 M&Ms
On the web page of the M&M/Mars company, www.m-ms.com, the percentages of each color in a bag of peanut M&Ms are given. These percentages form a probability mass function for the colors in an M&Ms bag.
A count for two different bags gave the following numbers for each color:

bag          brown  yellow  red  blue  orange  green  sum
129 GM 12      43     28    26    25     33     24    179
129 GM 22      40     38    36    20     24     16    174

How do we check whether these numbers come from the distribution as given on the web site?
To get an answer to that question, we need to think about what kind of results we expected to get.
Think: for each color we have a certain probability p_color of drawing an M&M of this color out of the bag, vs. 1 − p_color for a different color. We can therefore think of the number of M&Ms of each color as a random variable with a Binomial distribution; N_br, the number of brown M&Ms, has parameters n and p_br.
Under the null hypothesis

H₀: p_br = p_r = p_ye = p_bl = 0.2, p_or = p_gr = 0.1
vs.
Hₐ: one of the p_i is different from the above specification,

N_br has a B(n, 0.2) distribution. For the first bag we therefore expect N_br to be 0.2 · 179 = 35.8. In the same manner we can compute the expected values for all the other colors in each bag:

bag          brown  yellow  red   blue  orange  green
129 GM 12     35.8   35.8   35.8  35.8   17.9   17.9
129 GM 22     34.8   34.8   34.8  34.8   17.4   17.4
Now we need a test statistic that measures the difference between what we have observed and what we have expected. As a test statistic, we will use

Q = Σⱼ₌₁ᵏ (obsⱼ − expⱼ)²/expⱼ,

where
obsⱼ: the number of times j is observed among the xᵢ, i = 1, ..., n, and
expⱼ: the expected number of js = n · p(j)
Theorem 0.5.1
The test statistic Q, defined as above, has (for large n, approximately) a χ² distribution with k − 1 degrees of freedom.
In order to be able to use this, we obviously need some more information about the χ² distribution.
The χ² distribution
Given a set of independent standard normal random variables Z₁, ..., Z_r, the distribution of their sum of squares

X := Σᵢ₌₁ʳ Zᵢ²

is called the χ² distribution with r degrees of freedom.
The density function itself is a bit complicated (it's a special case of a Gamma distribution); all we need to know about the distribution at this stage is

E[X] = r,   Var[X] = 2r,

and that, roughly, P(X ≥ 2(r + 1)) ≤ 0.05. For large r this probability is far smaller than 0.05.
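These facts are easy to check by simulation, using only the standard library: generate sums of r squared standard normals and compare the sample mean and tail frequency with the claims above.

```python
import random

# Monte-Carlo check of the chi-square facts: with r = 5 degrees of freedom,
# the mean of a sum of r squared standard normals should be close to r,
# and the frequency of values >= 2(r + 1) should stay below 0.05.
random.seed(1)
r, trials = 5, 100_000
samples = [sum(random.gauss(0, 1) ** 2 for _ in range(r)) for _ in range(trials)]
mean = sum(samples) / trials
tail = sum(x >= 2 * (r + 1) for x in samples) / trials
print(round(mean, 2), round(tail, 3))
```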
Why does the test statistic Q have a χ² distribution with k − 1 degrees of freedom?
This is difficult to prove, but it is at least plausible: the parts from which Q is put together look almost like squared standard normal variables. Since obsⱼ has a Binomial distribution, for large n we may approximate its distribution by N(np(j), np(j)(1 − p(j))). A standardization of obsⱼ would therefore look like:

(obsⱼ − np(j))/√(np(j)(1 − p(j))) = (1/√(1 − p(j))) · (obsⱼ − expⱼ)/√expⱼ.

The degrees of freedom are reduced by one because the random variables are dependent: once we know the counts of five colors in the bag, we get the sixth by subtracting the other counts from the total number of M&Ms in the bag.
A more formal reason for the degrees of freedom is given by computing the expected value of Q, using E[X²] = Var[X] + (E[X])²:

E[Q] = Σⱼ₌₁ᵏ (1/(np(j))) E[(obsⱼ − np(j))²]
     = Σⱼ₌₁ᵏ (1/(np(j))) (Var[obsⱼ − np(j)] + (E[obsⱼ − np(j)])²)     (where Var[obsⱼ − np(j)] = Var[obsⱼ] and E[obsⱼ − np(j)] = 0)
     = Σⱼ₌₁ᵏ (1/(np(j))) · np(j)(1 − p(j)) = Σⱼ₌₁ᵏ (1 − p(j))
     = k − Σⱼ₌₁ᵏ p(j) = k − 1.
Now that we've defined a reference distribution for the test, we need to identify which values count as evidence against H₀. Since we've squared the differences between expected and observed values, only large positive values of Q can count as evidence against H₀.
Now let's get back to our example:
Example 0.5.2 M&Ms, continued
The value of Q with the above null hypothesis about the color distribution is 23.91 for the first bag (129 GM 12) and 10.02 for the second bag (129 GM 22). The p-values for these results are 0.00023 and 0.075, respectively.
For the first bag it's highly unlikely that the M&Ms have the color distribution as posted on the web site; for the second bag, however, we can't quite reject the null hypothesis with the same vigor. The p-value is still quite small, though.
Maybe there is something wrong with the filling routine at M&M's ...
We can also look at which colors contribute most to the Q statistic. The signed, unsquared contributions (obsⱼ − expⱼ)/√expⱼ are called the residuals:

bag          brown  yellow  red    blue   orange  green
129 GM 12     1.20  −1.30   −1.64  −1.81   3.57    1.44
129 GM 22     0.88   0.54    0.20  −2.51   1.58   −0.34

The largest residuals in each bag are too many orange M&Ms in the first, too few blue in the second.
If we had combined the results from the two bags, the result for Q would have been even more extreme:

color        brown  yellow  red    blue   orange  green  sum
# in bags      83     66     62     45      57     40    353
expected     70.6   70.6    70.6   70.6    35.3   35.3   353
residuals    1.48  −0.55   −1.02  −3.05    3.65   0.79   Q = 26.77

Here the two largest residuals are again for the blue and orange M&Ms. It looks as though the missing blue M&Ms have been replaced by orange ones.
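The Q values and residuals quoted in this example can be reproduced with a short helper (the function name is my own):

```python
from math import sqrt

# Chi-square goodness-of-fit statistic Q and residuals for category counts.
def gof(observed, probs):
    n = sum(observed)
    expected = [n * p for p in probs]
    residuals = [(o - e) / sqrt(e) for o, e in zip(observed, expected)]
    return sum(r * r for r in residuals), residuals

# brown, yellow, red, blue, orange, green
probs = [0.2, 0.2, 0.2, 0.2, 0.1, 0.1]
q1, _ = gof([43, 28, 26, 25, 33, 24], probs)    # bag 129 GM 12
q2, _ = gof([40, 38, 36, 20, 24, 16], probs)    # bag 129 GM 22
qc, rc = gof([83, 66, 62, 45, 57, 40], probs)   # both bags combined
print(round(q1, 2), round(q2, 2), round(qc, 2))
```

Comparing each Q against a χ² distribution with k − 1 = 5 degrees of freedom then gives the p-values quoted above.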