ST3239: Survey Methodology
by Wang ZHOU

Chapter 1
Elements of the sampling problem

1.1 Introduction

Often we are interested in some characteristics of a finite population, e.g. the average income of last year's graduates from NUS. Since the population is usually very large, we would like to say something (i.e. make inferences) about the population by collecting and analysing only a part of that population. The principles and methods of collecting and analysing data from a finite population form a branch of statistics known as Sample Survey Methods. The theory involved is called Sampling Theory. Sample surveys are widely used in many areas such as agriculture, education, industry, social affairs and medicine.
1.2 Some technical terms

1. An element is an object on which a measurement is taken.
2. A population is a collection of elements about which we require information.
3. Population characteristic: this is the aspect of the population we wish to measure, e.g. the average income of last year's graduates from NUS, or the total wheat yield of all farmers in a certain country.
4. Sampling units are nonoverlapping collections of elements from the population. Sampling units may be the individual members of the population, or they may be a coarser subdivision of the population, e.g. a household, which may contain more than one individual member.
5. A frame is a list of sampling units, e.g., a telephone directory.
6. A sample is a collection of sampling units drawn from a frame or frames.
1.3 Why sample?

If a sample is equal to the population, then we have a census, which contains all the information one wants. However, a census is rarely conducted, for several reasons:
• cost (money is limited),
• time (time is limited),
• destructiveness (testing a product can be destructive, e.g. light bulbs),
• accessibility (non-response can be a serious issue).
In those cases, sampling is the only alternative.
1.4 How to select the sample: the design of the sample survey

The procedure for selecting the sample is called the sample survey design. The general aim of a sample survey is to draw samples which are "representative" of the whole population. Broadly speaking, we can classify sampling schemes into two categories: probability sampling and other sampling schemes.

1. Probability sampling is a sampling scheme whereby the particular samples are enumerated and each has a non-zero probability of being selected. With probability built into the design, we can make statements such as "our estimate is unbiased and we are 95% confident that it is within 2 percentage points of the true proportion". In this course, we shall concentrate only on probability sampling.

2. Some other sampling schemes
a) 'volunteer sampling': TV telephone polls, medical volunteers for research.
b) 'subjective sampling': we choose samples that we consider to be typical or "representative" of the population.
c) 'quota sampling': one keeps sampling until certain quotas are filled.

All these sampling procedures provide some information about the population, but it is hard to deduce the nature of the population from such studies, as the samples are very subjective and often very biased. Furthermore, it is hard to measure the precision of these estimates.
1.5 How to design a questionnaire and plan a survey

This can be the most important and perhaps most difficult part of the survey sampling problem. We shall come back to this point in more detail later.
Chapter 2
Simple random sampling

Definition: If a sample of size n is drawn from a population of size N in such a way that every possible sample of size n has the same probability of being selected, the sampling procedure is called simple random sampling. The sample thus obtained is called a simple random sample. Simple random sampling is often written as s.r.s. for short and is the simplest sampling procedure.
2.1 How to draw a simple random sample

Suppose that the population of size N has values
$$\{u_1, u_2, \cdots, u_N\}.$$
If we draw n (distinct) items without replacement from the population, there are altogether $\binom{N}{n}$ different ways of doing it. So if we assign probability $1/\binom{N}{n}$ to each of the $\binom{N}{n}$ different samples, then each sample thus obtained is a simple random sample. We denote this sample by
$$\{y_1, y_2, \cdots, y_n\}.$$
Remark: In our previous statistics courses, we always use upper-case letters like X, Y etc. to denote random variables and lower-case letters like x, y etc. to represent fixed values. However, in a sample survey course, by convention, we use lower-case letters like $y_1, y_2$ etc. to denote random variables.
Theorem 2.1.1 For simple random sampling, we have
$$P(y_1 = u_{i_1}, y_2 = u_{i_2}, \cdots, y_n = u_{i_n}) = \frac{1}{N} \cdot \frac{1}{N-1} \cdots \frac{1}{N-n+1} = \frac{(N-n)!}{N!},$$
where $i_1, i_2, \cdots, i_n$ are mutually different.
Proof. By the definition of s.r.s., the probability of obtaining the sample $\{u_{i_1}, u_{i_2}, \cdots, u_{i_n}\}$ (where the order is not important) is $1/\binom{N}{n}$. There are $n!$ ways of ordering $\{u_{i_1}, u_{i_2}, \cdots, u_{i_n}\}$. Therefore,
$$P(y_1 = u_{i_1}, y_2 = u_{i_2}, \cdots, y_n = u_{i_n}) = \frac{1}{\binom{N}{n}\, n!} = \frac{(N-n)!\, n!}{N!\, n!} = \frac{(N-n)!}{N!}.$$
Remark: Recall that the total number of all possible samples is $\binom{N}{n}$, which could be very large if N and n are large. Therefore, getting a simple random sample by first listing all possible samples and then drawing one at random would not be practical. An easier way to get a simple random sample is simply to draw n values at random without replacement from the N population values. That is, we first draw one value at random from the N population values, and then draw another value at random from the remaining N − 1 population values, and so on, until we get a sample of n (different) values.
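The successive-draw procedure above can be sketched in Python; this is only an illustration (the standard library's `random.sample` implements the same idea directly):

```python
import random

def simple_random_sample(population, n):
    """Draw an s.r.s. of size n by successive draws without replacement."""
    pool = list(population)
    sample = []
    for _ in range(n):
        # draw one value at random from the values still remaining
        idx = random.randrange(len(pool))
        sample.append(pool.pop(idx))
    return sample

srs = simple_random_sample(range(1, 101), 10)  # s.r.s. of size 10 from {1, ..., 100}
```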
Theorem 2.1.2 A sample obtained by drawing n values successively without replacement from the N population values is a simple random sample.

Proof. Suppose that our sample obtained by drawing n values without replacement from the N population values is
$$\{a_1, a_2, \cdots, a_n\},$$
where the order is not important. Let $\{a_{i_1}, a_{i_2}, \cdots, a_{i_n}\}$ be any permutation of $\{a_1, a_2, \cdots, a_n\}$. Since the sample is drawn without replacement, we have
$$P(y_1 = a_{i_1}, \cdots, y_n = a_{i_n}) = \frac{1}{N} \cdot \frac{1}{N-1} \cdots \frac{1}{N-n+1} = \frac{(N-n)!}{N!}.$$
Hence, the probability of obtaining the sample $\{a_1, \cdots, a_n\}$ (where the order is not important) is
$$\sum_{\text{all } (i_1,\cdots,i_n)} P(y_1 = a_{i_1}, \cdots, y_n = a_{i_n}) = \sum_{\text{all } (i_1,\cdots,i_n)} \frac{(N-n)!}{N!} = n! \times \frac{(N-n)!}{N!} = \frac{1}{\binom{N}{n}}.$$
The theorem is thus proved by the definition of simple random sampling.
Two special cases, n = 1 and n = 2, will be used later.

Theorem 2.1.3 For any $i, j = 1, \ldots, n$ and $s, t = 1, \ldots, N$,
(i) $P(y_i = u_s) = 1/N$;
(ii) $P(y_i = u_s, y_j = u_t) = \dfrac{1}{N(N-1)}$, for $i \neq j$, $s \neq t$.
Proof.
$$\begin{aligned}
P(y_k = u_s) &= \sum_{\text{all } (i_1,\cdots,i_n) \text{ with } i_k = s} P(y_1 = u_{i_1}, \cdots, y_k = u_{i_k}, \cdots, y_n = u_{i_n}) \\
&= \frac{(N-n)!}{N!} \times \binom{N-1}{n-1}(n-1)! = \frac{(N-n)!}{N!} \times \frac{(N-1)!}{(N-n)!} = \frac{1}{N}.
\end{aligned}$$
$$\begin{aligned}
P(y_k = u_s, y_j = u_t) &= \sum_{\text{all } (i_1,\cdots,i_n) \text{ with } i_k = s,\, i_j = t} P(y_1 = u_{i_1}, \cdots, y_n = u_{i_n}) \\
&= \frac{(N-n)!}{N!} \times \binom{N-2}{n-2}(n-2)! = \frac{(N-n)!}{N!} \times \frac{(N-2)!}{(N-n)!} = \frac{1}{N(N-1)}.
\end{aligned}$$
Example 1. A population contains {a, b, c, d}. We wish to draw an s.r.s. of size 2. List all possible samples and find the probability of drawing {b, d}.
Solution. The possible samples of size 2 are
{a, b}, {a, c}, {a, d}, {b, c}, {b, d}, {c, d}.
The probability of drawing {b, d} is 1/6.
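For a toy population this small, all $\binom{4}{2} = 6$ samples can be enumerated directly; a quick sketch:

```python
from itertools import combinations

population = ["a", "b", "c", "d"]
samples = list(combinations(population, 2))  # all C(4,2) = 6 unordered samples
prob_bd = 1 / len(samples)                   # each sample has probability 1/6
```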
2.2 Estimation of population mean and total

2.2.1 Estimation of population mean

Suppose that the population of size N has values $\{u_1, u_2, \cdots, u_N\}$. We can define
1) the population mean
$$\mu = \frac{u_1 + u_2 + \cdots + u_N}{N} = \frac{1}{N}\sum_{i=1}^N u_i,$$
2) the population variance
$$\sigma^2 = \frac{1}{N}\sum_{i=1}^N (u_i - \mu)^2.$$
We wish to estimate the quantities $\mu$ and $\sigma^2$ and to study the accuracy of their estimators. Suppose that a simple random sample of size n is drawn, resulting in $\{y_1, y_2, \cdots, y_n\}$. Then an obvious estimator for $\mu$ is the sample mean:
$$\hat\mu = \bar y = \frac{1}{n}\sum_{i=1}^n y_i.$$
Theorem 2.2.1
(i) $E(y_i) = \mu$, $Var(y_i) = \sigma^2$.
(ii) $Cov(y_i, y_j) = -\dfrac{\sigma^2}{N-1}$, for $i \neq j$.

Proof.
(i). By an earlier theorem (Theorem 2.1.3),
$$E(y_i) = \sum_{k=1}^N u_k P(y_i = u_k) = \sum_{k=1}^N u_k \frac{1}{N} = \mu,$$
$$Var(y_i) = \sum_{k=1}^N (u_k - \mu)^2 P(y_i = u_k) = \sum_{k=1}^N (u_k - \mu)^2 \frac{1}{N} = \sigma^2.$$
(ii). By definition, $Cov(y_i, y_j) = E(y_i y_j) - E(y_i)E(y_j) = E(y_i y_j) - \mu^2$. Now,
$$\begin{aligned}
E(y_i y_j) &= \sum_{\text{all } s \neq t} u_s u_t P(y_i = u_s, y_j = u_t) = \frac{1}{N(N-1)} \sum_{\text{all } s \neq t} u_s u_t \\
&= \frac{1}{N(N-1)} \left[ \left(\sum_{s=1}^N u_s\right)\left(\sum_{t=1}^N u_t\right) - \sum_{s=1}^N u_s^2 \right] \\
&= \frac{1}{N(N-1)} \left[ (N\mu)^2 - \left( \sum_{s=1}^N (u_s - \mu)^2 + N\mu^2 \right) \right] \\
&= \frac{1}{N(N-1)} \left[ (N\mu)^2 - N\sigma^2 - N\mu^2 \right] = -\frac{\sigma^2}{N-1} + \mu^2.
\end{aligned}$$
Thus, $Cov(y_i, y_j) = E(y_i y_j) - \mu^2 = -\dfrac{\sigma^2}{N-1}$.
Theorem 2.2.2
$$E(\bar y) = \mu, \qquad Var(\bar y) = \frac{\sigma^2}{n}\left(\frac{N-n}{N-1}\right).$$

Proof. Note $\bar y = \frac{1}{n}(y_1 + \cdots + y_n)$. So
$$E(\bar y) = \frac{1}{n}(Ey_1 + \cdots + Ey_n) = \frac{1}{n}(n\mu) = \mu.$$
Now
$$\begin{aligned}
Var(\bar y) &= \frac{1}{n^2} Cov\left(\sum_{i=1}^n y_i, \sum_{j=1}^n y_j\right) = \frac{1}{n^2}\sum_{i=1}^n \sum_{j=1}^n Cov(y_i, y_j) \\
&= \frac{1}{n^2}\left(\sum_{i \neq j} Cov(y_i, y_j) + \sum_{i=j} Cov(y_i, y_j)\right) \\
&= \frac{1}{n^2}\left(\sum_{i \neq j}\left(-\frac{\sigma^2}{N-1}\right) + \sum_{i=1}^n Var(y_i)\right) \\
&= \frac{1}{n^2}\left(n(n-1)\left(-\frac{\sigma^2}{N-1}\right) + n\sigma^2\right) \\
&= \frac{\sigma^2}{n}\left((n-1)\left(-\frac{1}{N-1}\right) + 1\right) = \frac{\sigma^2}{n}\left(\frac{N-n}{N-1}\right).
\end{aligned}$$
Remark: From Theorem 2.2.2, we see that $\bar y$ is an unbiased estimator for $\mu$. Also, as n gets large (but $n \le N$), $Var(\bar y)$ tends to 0. This implies that $\bar y$ will be a more accurate estimator for $\mu$ as n gets larger (but less than N). In particular, when n = N, we have a census and $Var(\bar y) = 0$.

Remark: In our previous statistics courses, we usually sample $\{y_1, y_2, \cdots, y_n\}$ from the population with replacement. Therefore, $\{y_1, y_2, \cdots, y_n\}$ are independent and identically distributed (i.i.d.), and recall we have results like
$$E_{iid}(\bar y) = \mu, \qquad Var_{iid}(\bar y) = \frac{\sigma^2}{n}.$$
Notice that $Var_{iid}(\bar y)$ is different from $Var(\bar y)$ in Theorem 2.2.2. In fact, for $n > 1$,
$$Var(\bar y) = \frac{\sigma^2}{n}\left(\frac{N-n}{N-1}\right) < \frac{\sigma^2}{n} = Var_{iid}(\bar y).$$
Thus, for the same sample size n, sampling without replacement produces a less variable estimator of $\mu$. Why?
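The inequality can be checked by simulation; the population below (the integers 1 to 50) is hypothetical, chosen only for illustration:

```python
import random
import statistics

population = [float(i) for i in range(1, 51)]  # hypothetical population, N = 50
N, n = len(population), 10
mu = sum(population) / N
sigma2 = sum((u - mu) ** 2 for u in population) / N

var_srs = sigma2 / n * (N - n) / (N - 1)  # Theorem 2.2.2: without replacement
var_iid = sigma2 / n                      # i.i.d. case: with replacement

random.seed(1)
reps = 20000
means_wo = [statistics.mean(random.sample(population, n)) for _ in range(reps)]
means_wr = [statistics.mean(random.choices(population, k=n)) for _ in range(reps)]
# the empirical variances of the two kinds of sample means should come out close
# to var_srs and var_iid respectively, with var_srs < var_iid
```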
Summary
1. How to draw a simple random sample? (Purpose, method.) Simple random sampling is the basic survey methodology.
2. After getting an s.r.s., how to describe the population, or how to analyze the data? Estimation of the population mean. (Sample mean.)

Estimation of $\sigma^2$ and $Var(\bar y)$

The population variance $\sigma^2$ is usually unknown. Now define
$$s^2 = \frac{1}{n-1}\sum_{i=1}^n (y_i - \bar y)^2 = \frac{1}{n-1}\left(\sum_{i=1}^n y_i^2 - n\bar y^2\right).$$
Example. When a few data points are repeated in a data set, the results are often arrayed in a frequency table. For example, a quiz given to 25 students was graded on a 4-point scale 0, 1, 2, 3, with 3 being a perfect score. Here are the results:

Score (X)   Frequency (F)   Proportion (P)
3           16              0.64
2           4               0.16
1           2               0.08
0           3               0.12

(a) Calculate the average score by using frequencies.
(b) Calculate the average score by using proportions.
(c) Calculate the standard deviation.

Solution. If the above 25 students constitute a random sample, then
$$s^2 = \frac{n}{n-1} \times 1.0976 = \frac{25}{24} \times 1.0976 = 1.1433.$$
Let us look at some properties of $s^2$. Is it unbiased?
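The three parts can be computed directly from the table; a sketch:

```python
scores = [3, 2, 1, 0]  # quiz scores from the table above
freqs = [16, 4, 2, 3]  # frequencies
n = sum(freqs)         # 25 students

# (a) average via frequencies: sum(x * f) / n
mean = sum(x * f for x, f in zip(scores, freqs)) / n

# (b) average via proportions: sum(x * p) with p = f / n
mean_p = sum(x * (f / n) for x, f in zip(scores, freqs))

# (c) sample variance s^2 and standard deviation
s2 = sum(f * (x - mean) ** 2 for x, f in zip(scores, freqs)) / (n - 1)
sd = s2 ** 0.5
```

Both averages come out to 2.32, and $s^2 \approx 1.1433$, matching the solution above.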
Theorem 2.2.3
$$E(s^2) = \frac{N}{N-1}\sigma^2.$$
Proof.
$$\begin{aligned}
Es^2 &= \frac{1}{n-1}\left(\sum_{i=1}^n Ey_i^2 - nE\bar y^2\right) \\
&= \frac{1}{n-1}\left(\sum_{i=1}^n \left[Var(y_i) + (Ey_i)^2\right] - n\left[Var(\bar y) + (E\bar y)^2\right]\right) \\
&= \frac{1}{n-1}\left(n\left[\sigma^2 + \mu^2\right] - n\left[\frac{\sigma^2}{n}\left(\frac{N-n}{N-1}\right) + \mu^2\right]\right) \\
&= \frac{n\sigma^2}{n-1}\left[1 - \frac{1}{n}\left(\frac{N-n}{N-1}\right)\right] = \frac{n\sigma^2}{n-1}\left(\frac{nN - n - (N-n)}{n(N-1)}\right) = \frac{N\sigma^2}{N-1}.
\end{aligned}$$
The next theorem is an easy consequence of the last theorem.

Theorem 2.2.4 $\hat\sigma^2 := \frac{N-1}{N}s^2$ is an unbiased estimator of $\sigma^2$, i.e.,
$$E\left(\frac{N-1}{N}s^2\right) = \sigma^2.$$

We shall define
$$f = \frac{n}{N}$$
to be the sampling fraction, and
$$1 - f = 1 - \frac{n}{N}$$
to be the finite population correction (abbreviated fpc).
Then we have the following theorem.
Theorem 2.2.5 An unbiased estimator for $Var(\bar y)$ is
$$\widehat{Var}(\bar y) = \frac{s^2}{n}(1-f).$$

Proof.
$$E\,\widehat{Var}(\bar y) = \frac{Es^2}{n}(1-f) = \frac{N\sigma^2}{n(N-1)}\left(1 - \frac{n}{N}\right) = \frac{\sigma^2}{n}\left(\frac{N-n}{N-1}\right) = Var(\bar y).$$
Confidence intervals for µ

It can be shown that the sample average $\bar y$ under simple random sampling is approximately normally distributed provided n is large (≥ 30, say) and f = n/N is not too close to 0 or 1.
Central limit theorem: If $n \to \infty$ such that $n/N \to \lambda \in (0,1)$, then
$$\frac{\bar y - \mu}{\sqrt{Var(\bar y)}} \sim N(0,1) \text{ approximately}.$$
If $Var(\bar y)$ is replaced by its estimator $\widehat{Var}(\bar y)$, we still have
$$\frac{\bar y - \mu}{\sqrt{\widehat{Var}(\bar y)}} \sim N(0,1) \text{ approximately, as } n/N \to \lambda > 0.$$
Thus,
$$1 - \alpha \approx P\left(\left|\frac{\bar y - \mu}{\sqrt{\widehat{Var}(\bar y)}}\right| \le z_{\alpha/2}\right) = P\left(\bar y - z_{\alpha/2}\sqrt{\widehat{Var}(\bar y)} \le \mu \le \bar y + z_{\alpha/2}\sqrt{\widehat{Var}(\bar y)}\right).$$
Therefore, an approximate $(1-\alpha)$ confidence interval for $\mu$ is
$$\bar y \mp z_{\alpha/2}\sqrt{\widehat{Var}(\bar y)} = \bar y \mp z_{\alpha/2}\frac{s}{\sqrt n}\sqrt{1-f}.$$
$B := z_{\alpha/2}\sqrt{\widehat{Var}(\bar y)}$ is called the bound on the error of estimation.
Example. Suppose that an s.r.s. of size n = 200 is taken from a population of size N = 1000, resulting in $\bar y = 94$ and $s^2 = 400$. Find a 95% C.I. for µ.

Solution.
$$94 \mp 1.96 \times \frac{20}{\sqrt{200}}\sqrt{1 - 1/5} = 94 \mp 2.479.$$
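The interval computation can be wrapped in a small helper (a sketch; the function and argument names are my own):

```python
import math

def srs_mean_ci(ybar, s2, n, N, z=1.96):
    """Approximate (1 - alpha) CI for mu under s.r.s. (default z for 95%)."""
    f = n / N                         # sampling fraction
    se = math.sqrt(s2 / n * (1 - f))  # sqrt of the estimated Var(ybar)
    bound = z * se                    # bound on the error of estimation
    return ybar - bound, ybar + bound

lo, hi = srs_mean_ci(ybar=94, s2=400, n=200, N=1000)  # ≈ (91.52, 96.48)
```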
Example. A simple random sample of n = 100 water meters within a community is monitored to estimate the average daily water consumption per household over a specified dry spell. The sample mean and variance are found to be $\bar y = 12.5$ and $s^2 = 1252$. If we assume that there are N = 10,000 households within the community, estimate µ, the true average daily consumption, and find a 95% confidence interval for µ.

Solution
2.3 Selecting the sample size for estimating the population mean

We have seen that $Var(\bar y) = \frac{\sigma^2}{n}\left(\frac{N-n}{N-1}\right)$. So the bigger the sample size n is (but ≤ N), the more accurate our estimate $\bar y$ is. It is of interest to find the minimum n such that our estimate is within an error bound B with certain probability $1-\alpha$, say, i.e.,
$$P(|\bar y - \mu| < B) \approx 1 - \alpha.$$
By the central limit theorem,
$$P\left(\frac{|\bar y - \mu|}{\sqrt{Var(\bar y)}} < \frac{B}{\sqrt{Var(\bar y)}}\right) \approx 1 - \alpha.$$
Thus,
$$\frac{B}{\sqrt{Var(\bar y)}} \approx z_{\alpha/2} \iff \frac{\sigma^2}{n}\left(\frac{N-n}{N-1}\right) = \frac{B^2}{z_{\alpha/2}^2} = D$$
$$\iff \frac{N}{n} - 1 = \frac{(N-1)D}{\sigma^2} \iff \frac{N}{n} = 1 + \frac{(N-1)D}{\sigma^2} = \frac{(N-1)D + \sigma^2}{\sigma^2}.$$
Therefore,
$$n \approx \frac{N\sigma^2}{(N-1)D + \sigma^2}, \quad \text{where } D = \frac{B^2}{z_{\alpha/2}^2}.$$
Remark 1: If α = 5%, then $z_{\alpha/2} = 1.96 \approx 2$, so $D \approx \frac{B^2}{4}$. This coincides with the formula in the textbook (page 93).

Remark 2: The above formula requires knowledge of the population variance $\sigma^2$, which is typically unknown in practice. However, we can approximate $\sigma^2$ by the following methods:
1) from pilot studies,
2) from previous surveys,
3) from other studies.
e.g. Suppose that a total of 1500 students are to graduate next year. Determine the sample size n needed to ensure that the sample average starting salary is within $40 of the population average with probability at least 0.9. From previous studies, we know that the standard deviation of the starting salary is approximately $400.

Solution.
$$n = \frac{1500 \times 400^2}{1499 \times 40^2/1.645^2 + 400^2} = 229.37 \approx 230.$$
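The computation follows the boxed formula; a sketch (z = 1.645 here, for probability 0.9):

```python
import math

def srs_mean_sample_size(N, sigma2, B, z):
    """Minimum n so that P(|ybar - mu| < B) is roughly 1 - alpha under s.r.s."""
    D = B ** 2 / z ** 2
    return N * sigma2 / ((N - 1) * D + sigma2)

n = srs_mean_sample_size(N=1500, sigma2=400 ** 2, B=40, z=1.645)
n_required = math.ceil(n)  # 229.37... rounds up to 230
```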
e.g. Example 4.5 (p. 94, 5th edition). The average amount of money µ for a hospital's accounts receivable must be estimated. Although no prior data are available to estimate the population variance $\sigma^2$, it is known that most accounts lie within a $100 range. There are 1000 open accounts. Find the sample size needed to estimate µ with a bound on the error of estimation B = $3 with probability 0.95.

Remark. The solution depends on how one interprets "most accounts": whether it means 70%, 90%, 95% or 99% of all accounts.

Solution. We need an estimate of $\sigma^2$. For the normal distribution $N(0, \sigma^2)$, we have
$$P(|N(0,\sigma^2)| \le 1.96\sigma) = P(|N(0,1)| \le 1.96) = 95\%, \qquad P(|N(0,\sigma^2)| \le 3\sigma) = P(|N(0,1)| \le 3) = 99.87\%.$$
So 95% of accounts lie within (approximately) a 4σ range and 99.87% of accounts lie within a 6σ range. Here B = 3 and N = 1000.
If "most" means 95%, we take 2 × (2σ) = 100, so σ = 25. Then n = 210.76 ≈ 211.
If "most" means 99.87%, we take 2 × (3σ) = 100, so σ = 50/3. Then n ≈ 107.
2.3.1 A quick summary on estimation of the population mean

The population mean is defined to be
$$\mu = \frac{1}{N}(u_1 + u_2 + \cdots + u_N).$$
Suppose a simple random sample is $\{y_1, \ldots, y_n\}$.
1) Estimators of the population mean $\mu$ and variance $\sigma^2$ are
$$\hat\mu = \bar y = \frac{1}{n}\sum_{i=1}^n y_i, \qquad s^2 = \frac{1}{n-1}\sum_{i=1}^n (y_i - \bar y)^2.$$
2) The mean and variance of $\bar y$ are
$$E\bar y = \mu, \qquad Var(\bar y) = \frac{\sigma^2}{n}\left(\frac{N-n}{N-1}\right).$$
3) An estimator of the variance of $\bar y$ is
$$\widehat{Var}(\bar y) = \frac{s^2}{n}(1-f), \quad \text{where } f = n/N.$$
4) An approximate $(1-\alpha)$ confidence interval for $\mu$ is
$$\bar y \mp z_{\alpha/2}\sqrt{\widehat{Var}(\bar y)} = \bar y \mp z_{\alpha/2}\frac{s}{\sqrt n}\sqrt{1-f}.$$
5) The minimum sample size n needed to have an error bound B with probability $1-\alpha$ is
$$n \approx \frac{N\sigma^2}{(N-1)D + \sigma^2}, \quad \text{where } D = \frac{B^2}{z_{\alpha/2}^2}.$$
2.3.2 Estimation of the population total

The population total is defined to be
$$\tau = u_1 + u_2 + \cdots + u_N = N\mu.$$
Suppose a simple random sample is $\{y_1, \ldots, y_n\}$.
1) An estimator of the population total $\tau$ is
$$\hat\tau = N\bar y.$$
2) The mean and variance of $\hat\tau$ are
$$E\hat\tau = \tau, \qquad Var(\hat\tau) = N^2\frac{\sigma^2}{n}\left(\frac{N-n}{N-1}\right).$$
3) An estimator of the variance of $\hat\tau$ is
$$\widehat{Var}(\hat\tau) = \widehat{Var}(N\bar y) = N^2\frac{s^2}{n}(1-f).$$
Central limit theorem: If $n \to \infty$ such that $n/N \to \lambda \in (0,1)$, then
$$\frac{\hat\tau - \tau}{\sqrt{Var(\hat\tau)}} \sim N(0,1) \text{ approximately}.$$
If $Var(\hat\tau)$ is replaced by its estimator $\widehat{Var}(\hat\tau)$, we still have
$$\frac{\hat\tau - \tau}{\sqrt{\widehat{Var}(\hat\tau)}} \sim N(0,1) \text{ approximately, as } n/N \to \lambda > 0.$$
Thus,
$$1 - \alpha \approx P\left(\left|\frac{\hat\tau - \tau}{\sqrt{\widehat{Var}(\hat\tau)}}\right| \le z_{\alpha/2}\right),$$
and $B := z_{\alpha/2}\sqrt{\widehat{Var}(\hat\tau)} = N z_{\alpha/2}\sqrt{\widehat{Var}(\bar y)}$ is called the bound on the error of estimation.
4) An approximate $(1-\alpha)$ confidence interval for $\tau$ is
$$\hat\tau \mp z_{\alpha/2}\sqrt{\widehat{Var}(\hat\tau)} = \hat\tau \mp z_{\alpha/2} N\frac{s}{\sqrt n}\sqrt{1-f} = N\left(\bar y \mp z_{\alpha/2}\frac{s}{\sqrt n}\sqrt{1-f}\right).$$
5) The minimum sample size n needed to have an error bound B with probability $1-\alpha$ is
$$n \approx \frac{N\sigma^2}{(N-1)D + \sigma^2}, \quad \text{where } D = \frac{B^2}{N^2 z_{\alpha/2}^2}.$$
Example 4.6 (page 95 of the textbook). An investigator is interested in estimating the total weight gain in 0 to 4 weeks for N = 1000 chicks fed on a new ration. Obviously, to weigh each bird would be time-consuming and tedious. Therefore, determine the number of chicks to be sampled in this study in order to estimate τ within a bound on the error of estimation equal to 1000 grams with probability 95%. Many similar studies on chick nutrition have been run in the past. Using data from these studies, the investigator found that $\sigma^2$, the population variance, was approximately 36.00 (grams)². Determine the required sample size.

Solution
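The notes leave this solution as an exercise; a sketch of the computation using the formula in item 5 (taking z = 1.96 for probability 0.95):

```python
import math

def srs_total_sample_size(N, sigma2, B, z=1.96):
    """Minimum n to estimate the total tau = N * mu within bound B."""
    D = B ** 2 / (N ** 2 * z ** 2)
    return N * sigma2 / ((N - 1) * D + sigma2)

n = srs_total_sample_size(N=1000, sigma2=36.0, B=1000)
n_required = math.ceil(n)
```

With z = 1.96 this gives n ≈ 121.6, i.e. about 122 chicks; the textbook's z ≈ 2 shortcut gives a slightly different value.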
2.4 Estimation of the population proportion

Suppose we are interested in the proportion p of the population with a specified characteristic. Let
$$y_i = \begin{cases} 1 & \text{if the $i$th element has the characteristic,} \\ 0 & \text{if not.} \end{cases}$$
It is easy to see that $E(y_i) = E(y_i^2) = p$ (why?). Therefore, we have
$$\mu = E(y_i) = p, \qquad \sigma^2 = Var(y_i) = p - p^2 = pq, \text{ where } q = 1 - p.$$
The total number of elements in the sample of size n possessing the specified characteristic is $\sum_{i=1}^n y_i$. Therefore,
1. An estimator of the population proportion p is
$$\hat p = \bar y = \frac{\sum_{i=1}^n y_i}{n},$$
and an estimator of the population variance $\sigma^2 = pq$ is
$$s^2 = \frac{1}{n-1}\sum_{i=1}^n (y_i - \bar y)^2 = \frac{1}{n-1}\left(\sum_{i=1}^n y_i^2 - n\bar y^2\right) = \frac{1}{n-1}\left(n\hat p - n\hat p^2\right) = \frac{n}{n-1}\hat p\hat q, \text{ where } \hat q = 1 - \hat p.$$
From Theorems 2.2.2 and 2.2.3, we have
$$E(\hat p) = p, \qquad E(s^2) = \frac{N}{N-1}\sigma^2 = \frac{N}{N-1}pq. \tag{4.1}$$
2. Again, from Theorem 2.2.2, the variance of $\hat p$ is
$$Var(\hat p) = \frac{\sigma^2}{n}\left(\frac{N-n}{N-1}\right) = \frac{pq}{n}\left(\frac{N-n}{N-1}\right).$$
3. From equation (4.1) and Theorem 2.2.5, an estimator of the variance of $\hat p$ is
$$\widehat{Var}(\hat p) = \frac{s^2}{n}(1-f) = \frac{\hat p\hat q}{n-1}(1-f).$$
4. An approximate $(1-\alpha)$ confidence interval for p is
$$\hat p \mp z_{\alpha/2}\sqrt{\widehat{Var}(\hat p)} = \hat p \mp z_{\alpha/2}\sqrt{\frac{\hat p\hat q}{n-1}}\sqrt{1-f}.$$
5. The minimum sample size n required to estimate p such that our estimate $\hat p$ is within an error bound B with probability $1-\alpha$ is
$$n \approx \frac{Npq}{(N-1)D + pq}, \quad \text{where } D = \frac{B^2}{z_{\alpha/2}^2}.$$
Note that the right-hand side is an increasing function of $\sigma^2 = pq$.
a) p is often unknown, so we can replace it by some estimate (from a previous study, a pilot study, etc.).
b) If we don't have an estimate of p, we can replace it by p = 1/2, so that pq = 1/4 (the most conservative choice, since pq ≤ 1/4).
e.g. Suppose that a small town has a population of N = 800 people. Let p = the proportion of people with blood type A.
(1) What sample size n must be drawn in order to estimate p to within 0.04 with probability 0.95?
(2) Suppose we know that no more than 10% of the population have blood type A. Find n again as in (1). Comment on the difference between (1) and (2).
(3) A simple random sample of size n = 200 is taken and it is found that 7% of the sample has blood type A. Find a 90% confidence interval for p.

Solution. N = 800, α = 0.05, B = 0.04.
(1) Taking p = 1/2 in the formula, we get n = 344.
(2) p ≤ 0.10, so $\sigma^2 = pq \le 0.09$. A simple calculation yields n = 171. Since the bound on pq is smaller than the conservative value 1/4, a smaller sample suffices.
(3) (0.044, 0.096).
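All three parts can be reproduced with two small helpers (a sketch; the function names are my own):

```python
import math

def srs_prop_sample_size(N, B, z, p=0.5):
    """Minimum n to estimate p within bound B; p = 1/2 is the conservative default."""
    D = B ** 2 / z ** 2
    return N * p * (1 - p) / ((N - 1) * D + p * (1 - p))

def srs_prop_ci(phat, n, N, z):
    """Approximate CI for p: phat -/+ z * sqrt(phat*qhat/(n-1)) * sqrt(1-f)."""
    se = math.sqrt(phat * (1 - phat) / (n - 1) * (1 - n / N))
    return phat - z * se, phat + z * se

n1 = math.ceil(srs_prop_sample_size(800, 0.04, 1.96))          # (1): 344
n2 = math.ceil(srs_prop_sample_size(800, 0.04, 1.96, p=0.10))  # (2): 171
lo, hi = srs_prop_ci(0.07, 200, 800, 1.645)                    # (3): ≈ (0.044, 0.096)
```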
Example. A simple random sample of n = 40 college students was interviewed to determine the proportion of students in favor of converting from the semester to the quarter system. Of these, 25 students answered affirmatively. Estimate p, the proportion of students on campus in favor of the change. (Assume N = 2000.) Find a 95% confidence interval for p.

Solution
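A sketch of the computation for this example:

```python
import math

n, N, z = 40, 2000, 1.96
phat = 25 / n                                        # 0.625
var_hat = phat * (1 - phat) / (n - 1) * (1 - n / N)  # estimated Var(phat)
bound = z * math.sqrt(var_hat)                       # ≈ 0.150
ci = (phat - bound, phat + bound)                    # ≈ (0.475, 0.775)
```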
2.5 Comparing estimates

Suppose $x_1, \cdots, x_m$ is a random sample from a population with mean $\mu_x$ and $y_1, \cdots, y_n$ is a random sample from a population with mean $\mu_y$. We are interested in the difference of means $\mu_y - \mu_x$, which can be estimated without bias by $\bar y - \bar x$, since
$$E(\bar y - \bar x) = \mu_y - \mu_x.$$
Further,
$$Var(\bar y - \bar x) = Var(\bar y) + Var(\bar x) - 2Cov(\bar y, \bar x).$$
Remark: If the two samples $x_1, \cdots, x_m$ and $y_1, \cdots, y_n$ are independent, then $Cov(\bar y, \bar x) = 0$. However, a more interesting case is when the two samples are dependent, as illustrated in the following example.
A dependent example

Suppose an opinion poll asks n people the question "Do you favor abortion?" The possible answers are
YES, NO, NO OPINION.
Let the proportions of people who answer 'YES', 'NO', 'NO OPINION' be $p_1$, $p_2$ and $p_3$, respectively. In particular, we are interested in comparing $p_1$ and $p_2$ by looking at $p_1 - p_2$. Clearly, the estimates of $p_1$ and $p_2$ are dependent proportions, since if one is high, the other is likely to be low.

Let $\hat p_1$, $\hat p_2$ and $\hat p_3$ be the three respective sample proportions in the sample of size n. Then $X = n\hat p_1$, $Y = n\hat p_2$ and $Z = n\hat p_3$ follow a multinomial distribution with parameters $(n, p_1, p_2, p_3)$. That is,
$$P(X = x, Y = y, Z = z) = \binom{n}{x, y, z} p_1^x p_2^y p_3^z = \frac{n!}{x!\,y!\,z!} p_1^x p_2^y p_3^z.$$
Please note that
$$\sum_{x \ge 0,\, y \ge 0,\, x+y+z = n} \frac{n!}{x!\,y!\,z!} p_1^x p_2^y p_3^z = 1.$$
Question: What is the distribution of X? (Hint: classify the people into "Yes" and "Not Yes".)

Theorem 2.5.1
$$E(X) = np_1, \quad E(Y) = np_2, \quad E(Z) = np_3,$$
$$Var(X) = np_1q_1, \quad Var(Y) = np_2q_2,$$
$$Cov(X, Y) = -np_1p_2.$$
Pro<strong>of</strong>. X = number <strong>of</strong> people saying “YES” ∼ Bin(n, p 1 ). So EX = np 1 , V ar(X) = np 1 q 1 .<br />
Now Cov(X, Y ) = E(XY ) − (EX)(EY ) = E(XY ) − n 2 p 1 p 2 . But<br />
$$\begin{aligned}
E(XY) &= \sum_{x,\,y \ge 0,\; x+y \le n} xy\, P(X = x, Y = y) \\
&= \sum_{x,\,y \ge 1,\; x+y \le n} xy\, P(X = x, Y = y, Z = n-x-y) \\
&= \sum_{x,\,y \ge 1,\; x+y \le n} xy\, \frac{n!}{x!\, y!\, (n-x-y)!}\, p_1^x p_2^y p_3^{\,n-x-y} \\
&= \sum_{x,\,y \ge 1,\; x+y \le n} \frac{n!}{(x-1)!\, (y-1)!\, (n-x-y)!}\, p_1^x p_2^y p_3^{\,n-x-y} \\
&= n(n-1)p_1 p_2 \sum_{x-1,\,y-1 \ge 0,\; (x-1)+(y-1) \le n-2} \frac{(n-2)!}{(x-1)!\, (y-1)!\, ((n-2)-(x-1)-(y-1))!}\, p_1^{\,x-1} p_2^{\,y-1} p_3^{\,(n-2)-(x-1)-(y-1)} \\
&= n(n-1)p_1 p_2 \sum_{x_1,\,y_1 \ge 0,\; x_1+y_1 \le n-2} \frac{(n-2)!}{x_1!\, y_1!\, ((n-2)-x_1-y_1)!}\, p_1^{\,x_1} p_2^{\,y_1} p_3^{\,(n-2)-x_1-y_1} \\
&= n(n-1)p_1 p_2 = n^2 p_1 p_2 - n p_1 p_2.
\end{aligned}$$

Therefore, Cov(X, Y) = E(XY) − n²p_1p_2 = −np_1p_2.
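Theorem 2.5.1 can also be sanity-checked by brute-force enumeration over all multinomial outcomes for a small n; the parameter values below are arbitrary illustrations:

```python
from math import factorial

def pmf(x, y, z, p1, p2, p3):
    n = x + y + z
    return factorial(n) / (factorial(x) * factorial(y) * factorial(z)) \
        * p1**x * p2**y * p3**z

n, p1, p2, p3 = 8, 0.4, 0.35, 0.25
EX = EY = EXY = 0.0
for x in range(n + 1):
    for y in range(n - x + 1):
        prob = pmf(x, y, n - x - y, p1, p2, p3)
        EX += x * prob
        EY += y * prob
        EXY += x * y * prob

assert abs(EX - n * p1) < 1e-10                       # E(X) = n p1
assert abs(EXY - n * (n - 1) * p1 * p2) < 1e-10       # E(XY) = n(n-1) p1 p2
assert abs((EXY - EX * EY) + n * p1 * p2) < 1e-10     # Cov(X, Y) = -n p1 p2
```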
Theorem 2.5.2

E(p̂_1) = p_1, E(p̂_2) = p_2,
Var(p̂_1) = p_1q_1/n, Var(p̂_2) = p_2q_2/n,
Cov(p̂_1, p̂_2) = −p_1p_2/n.
19
Proof. Note that p̂_1 = X/n and p̂_2 = Y/n, and apply the last theorem.
From the last theorem, we have

$$\mathrm{Var}(\hat p_1 - \hat p_2) = \mathrm{Var}(\hat p_1) + \mathrm{Var}(\hat p_2) - 2\,\mathrm{Cov}(\hat p_1, \hat p_2) = \frac{p_1 q_1}{n} + \frac{p_2 q_2}{n} + \frac{2 p_1 p_2}{n}.$$

One estimator of Var(p̂_1 − p̂_2) is

$$\widehat{\mathrm{Var}}(\hat p_1 - \hat p_2) = \frac{\hat p_1 \hat q_1}{n} + \frac{\hat p_2 \hat q_2}{n} + \frac{2 \hat p_1 \hat p_2}{n}.$$
Is it unbiased? No! An unbiased estimator of the variance of p̂_1 is

$$\widehat{\mathrm{Var}}(\hat p_1) = (1 - f)\,\frac{\hat p_1 \hat q_1}{n - 1}.$$

Also, E(p̂_1 p̂_2) = E(XY)/n² = p_1 p_2 (1 − 1/n) implies that an unbiased estimator of p_1 p_2 is p̂_1 p̂_2 (1 − 1/n)^{−1}. So

$$\widehat{\mathrm{Var}}(\hat p_1) + \widehat{\mathrm{Var}}(\hat p_2) + 2 n^{-1} \hat p_1 \hat p_2 (1 - 1/n)^{-1}$$

is an unbiased estimator of Var(p̂_1 − p̂_2). In practice, however, it is easier to use

$$\widehat{\mathrm{Var}}(\hat p_1 - \hat p_2) = \frac{\hat p_1 \hat q_1}{n} + \frac{\hat p_2 \hat q_2}{n} + \frac{2 \hat p_1 \hat p_2}{n}.$$
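The correction factor (1 − 1/n)^{−1} can be verified by exact enumeration for a small n; a sketch with arbitrary illustrative parameter values:

```python
from math import factorial

n, p1, p2, p3 = 7, 0.5, 0.3, 0.2
expect = 0.0  # exact value of E[p1_hat * p2_hat]
for x in range(n + 1):
    for y in range(n - x + 1):
        z = n - x - y
        prob = factorial(n) / (factorial(x) * factorial(y) * factorial(z)) \
            * p1**x * p2**y * p3**z
        expect += (x / n) * (y / n) * prob

# E[p1_hat * p2_hat] = p1 p2 (1 - 1/n), so dividing by (1 - 1/n) removes the bias
assert abs(expect - p1 * p2 * (1 - 1 / n)) < 1e-12
assert abs(expect / (1 - 1 / n) - p1 * p2) < 1e-12
```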
Therefore, an approximate (1 − α) confidence interval for p_1 − p_2 is

$$(\hat p_1 - \hat p_2) \mp z_{\alpha/2} \sqrt{\widehat{\mathrm{Var}}(\hat p_1 - \hat p_2)} = (\hat p_1 - \hat p_2) \mp z_{\alpha/2} \sqrt{\frac{\hat p_1 \hat q_1}{n} + \frac{\hat p_2 \hat q_2}{n} + \frac{2 \hat p_1 \hat p_2}{n}}.$$
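As a sketch, this interval can be wrapped in a small helper; diff_ci is an illustrative name, not from the notes, and z defaults to 1.96 (α = 0.05):

```python
from math import sqrt

def diff_ci(p1_hat, p2_hat, n, z=1.96):
    """Approximate CI for p1 - p2 when p1_hat, p2_hat are dependent
    multinomial proportions from one sample of size n."""
    var_hat = (p1_hat * (1 - p1_hat) / n
               + p2_hat * (1 - p2_hat) / n
               + 2 * p1_hat * p2_hat / n)  # PLUS sign: Cov(p1_hat, p2_hat) < 0
    half = z * sqrt(var_hat)
    return p1_hat - p2_hat - half, p1_hat - p2_hat + half
```

For example, diff_ci(0.29, 0.34, 600) gives roughly (−0.113, 0.013). Note the design choice: the negative covariance makes the interval wider than the naive independent-samples formula would suggest.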
e.g. (From the textbook.) Should smoking be banned from the workplace? A Time/Yankelovich poll of 800 adult Americans carried out on April 6–7, 1994 gave the following results.

                   Nonsmokers   Smokers
  Banned               44%         8%
  Special areas        52%        80%
  No restrictions       3%        11%
Based on a sample of 600 nonsmokers and 200 smokers, estimate and construct a 95% C.I. for

(1) the true difference between the proportions choosing "Banned" among nonsmokers and among smokers;

(2) the true difference between the proportions of nonsmokers choosing "Banned" and "Special areas".
Solution

A. The two proportions choosing "Banned" come from independent samples (nonsmokers and smokers), so a high value of one does not force a low value of the other. Thus, an appropriate estimate of this difference is

$$0.44 - 0.08 \pm 2 \sqrt{\frac{0.44 \times 0.56}{600} + \frac{0.08 \times 0.92}{200}} = 0.36 \pm 0.06.$$
B. The proportion of nonsmokers choosing "Special areas" is dependent on the proportion choosing "Banned": if the latter is large, the former must be small. These are multinomial proportions. Thus, an appropriate estimate of this difference is

$$0.52 - 0.44 \pm 2 \sqrt{\frac{0.44 \times 0.56}{600} + \frac{0.52 \times 0.48}{600} + \frac{2 \times 0.44 \times 0.52}{600}} = 0.08 \pm 0.08.$$
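The two margins of error above can be reproduced in a few lines (the factor 2 is the rough z_{0.025} used in the textbook solution):

```python
from math import sqrt

# A: "Banned", nonsmokers (n = 600) vs smokers (n = 200) -- independent samples
var_a = 0.44 * 0.56 / 600 + 0.08 * 0.92 / 200
# B: "Special areas" vs "Banned" among the 600 nonsmokers -- dependent proportions
var_b = 0.44 * 0.56 / 600 + 0.52 * 0.48 / 600 + 2 * 0.44 * 0.52 / 600

print(round(2 * sqrt(var_a), 2))  # → 0.06
print(round(2 * sqrt(var_b), 2))  # → 0.08
```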
Example. The major league baseball season in the US came to an abrupt end in the middle of 1994. In a poll of 600 adult Americans, 29% blamed the players for the strike, 34% blamed the owners, and the rest held various other opinions. Does the evidence suggest that the true proportions who blame the players and the owners, respectively, are really different?

p_1: the proportion of Americans who blamed the players.
p_2: the proportion of Americans who blamed the owners.
$$\widehat{\mathrm{Var}}(\hat p_1 - \hat p_2) = \frac{\hat p_1 \hat q_1}{n} + \frac{\hat p_2 \hat q_2}{n} + \frac{2 \hat p_1 \hat p_2}{n} = \frac{0.29 \times 0.71}{600} + \frac{0.34 \times 0.66}{600} + \frac{2 \times 0.29 \times 0.34}{600} = 1.0458 \times 10^{-3}.$$

So an approximate 95% C.I. for p_1 − p_2 is

$$0.29 - 0.34 \pm z_{0.025} \sqrt{\widehat{\mathrm{Var}}(\hat p_1 - \hat p_2)} = -0.05 \pm 1.96 \times 0.03234 = (-0.11339,\, 0.01339).$$
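A quick check of this computation, and of the conclusion that follows from it:

```python
from math import sqrt

n, p1_hat, p2_hat = 600, 0.29, 0.34
var_hat = (p1_hat * (1 - p1_hat) / n
           + p2_hat * (1 - p2_hat) / n
           + 2 * p1_hat * p2_hat / n)
half = 1.96 * sqrt(var_hat)
lo, hi = p1_hat - p2_hat - half, p1_hat - p2_hat + half

print(round(lo, 4), round(hi, 4))  # → -0.1134 0.0134
# 0 lies inside the interval, so the data do not give strong
# evidence that the two blame proportions really differ.
```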