
ST3239: Survey Methodology

by Wang ZHOU


Chapter 1

Elements of the sampling problem

1.1 Introduction

Often we are interested in some characteristic of a finite population, e.g. the average income of last year's graduates from NUS. Since the population is usually very large, we would like to say something (i.e. make inference) about the population by collecting and analysing only a part of that population. The principles and methods of collecting and analysing data from a finite population form a branch of statistics known as Sample Survey Methods. The theory involved is called Sampling Theory. Sample surveys are widely used in many areas such as agriculture, education, industry, social affairs, and medicine.

1.2 Some technical terms

1. An element is an object on which a measurement is taken.

2. A population is a collection of elements about which we require information.

3. Population characteristic: this is the aspect of the population we wish to measure, e.g. the average income of last year's graduates from NUS, or the total wheat yield of all farmers in a certain country.

4. Sampling units are nonoverlapping collections of elements from the population. Sampling units may be the individual members of the population, or they may be a coarser subdivision of the population, e.g. a household, which may contain more than one individual member.

5. A frame is a list of sampling units, e.g. a telephone directory.

6. A sample is a collection of sampling units drawn from a frame or frames.



1.3 Why sample?

If a sample is equal to the population, then we have a census, which contains all the information one wants. However, a census is rarely conducted, for several reasons:

• cost (money is limited),

• time (time is limited),

• destructiveness (testing a product can be destructive, e.g. light bulbs),

• accessibility (non-response can be a serious issue).

In those cases, sampling is the only alternative.

1.4 How to select the sample: the design of the sample survey

The procedure for selecting the sample is called the sample survey design. The general aim of a sample survey is to draw samples which are "representative" of the whole population. Broadly speaking, we can classify sampling schemes into two categories: probability sampling and other sampling schemes.

1. Probability sampling is a sampling scheme whereby the possible samples are enumerated and each has a non-zero probability of being selected. With probability built into the design, we can make statements such as "our estimate is unbiased and we are 95% confident that it is within 2 percentage points of the true proportion". In this course, we shall concentrate only on probability sampling.

2. Some other sampling schemes

a) 'Volunteer sampling': TV telephone polls, medical volunteers for research.

b) 'Subjective sampling': we choose samples that we consider to be typical or "representative" of the population.

c) 'Quota sampling': one keeps sampling until a certain quota is filled.

All these sampling procedures provide some information about the population, but it is hard to deduce the nature of the population from such studies, as the samples are very subjective and often very biased. Furthermore, it is hard to measure the precision of these estimates.

1.5 How to design a questionnaire and plan a survey

This can be the most important and perhaps most difficult part of the survey sampling problem. We shall come back to this point in more detail later.



Chapter 2

Simple random sampling

Definition: If a sample of size $n$ is drawn from a population of size $N$ in such a way that every possible sample of size $n$ has the same probability of being selected, the sampling procedure is called simple random sampling. The sample thus obtained is called a simple random sample. Simple random sampling is often written as s.r.s. for short and is the simplest sampling procedure.

2.1 How to draw a simple random sample

Suppose that the population of size $N$ has values
$$\{u_1, u_2, \cdots, u_N\}.$$
If we draw $n$ (distinct) items without replacement from the population, there are altogether $\binom{N}{n}$ different ways of doing it. So if we assign probability $1/\binom{N}{n}$ to each of the $\binom{N}{n}$ different samples, then each sample thus obtained is a simple random sample. We denote this sample by
$$\{y_1, y_2, \cdots, y_n\}.$$

Remark: In our previous statistics courses, we always used upper-case letters like $X$, $Y$ etc. to denote random variables and lower-case letters like $x$, $y$ etc. to represent fixed values. However, in this sample survey course, by convention, we use lower-case letters like $y_1, y_2$ etc. to denote random variables.

Theorem 2.1.1 For simple random sampling, we have
$$P(y_1 = u_{i_1}, y_2 = u_{i_2}, \cdots, y_n = u_{i_n}) = \frac{1}{N} \cdot \frac{1}{N-1} \cdots \frac{1}{N-n+1} = \frac{(N-n)!}{N!},$$
where $i_1, i_2, \cdots, i_n$ are mutually different.



Proof. By the definition of s.r.s., the probability of obtaining the sample $\{u_{i_1}, u_{i_2}, \cdots, u_{i_n}\}$ (where the order is not important) is $1/\binom{N}{n}$. There are $n!$ ways of ordering $\{u_{i_1}, u_{i_2}, \cdots, u_{i_n}\}$. Therefore,
$$P(y_1 = u_{i_1}, y_2 = u_{i_2}, \cdots, y_n = u_{i_n}) = \frac{1}{\binom{N}{n}\, n!} = \frac{(N-n)!\, n!}{N!\, n!} = \frac{(N-n)!}{N!}.$$

Remark: Recall that the total number of all possible samples is $\binom{N}{n}$, which could be very large if $N$ and $n$ are large. Therefore, getting a simple random sample by first listing all possible samples and then drawing one at random would not be practical. An easier way to get a simple random sample is simply to draw $n$ values at random without replacement from the $N$ population values. That is, we first draw one value at random from the $N$ population values, then draw another value at random from the remaining $N-1$ population values, and so on, until we get a sample of $n$ (different) values.
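This successive-draw procedure is exactly what standard sampling routines implement. Below is a minimal Python sketch of my own (the function name and toy population are not from the notes) that draws a simple random sample by successive draws without replacement.

```python
import random

def srs_successive(population, n, seed=None):
    """Draw an s.r.s. of size n by successive draws without replacement."""
    rng = random.Random(seed)
    pool = list(population)            # work on a copy of the N population values
    sample = []
    for _ in range(n):
        k = rng.randrange(len(pool))   # pick one of the remaining values at random
        sample.append(pool.pop(k))     # remove it so it cannot be drawn again
    return sample

# Equivalent one-liner from the standard library:
# random.sample(population, n) also returns an s.r.s. without replacement.

population = list(range(1, 101))       # a toy population of size N = 100
print(srs_successive(population, 10, seed=1))
```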

Theorem 2.1.2 A sample obtained by drawing $n$ values successively without replacement from the $N$ population values is a simple random sample.

Proof. Suppose that our sample obtained by drawing $n$ values without replacement from the $N$ population values is
$$\{a_1, a_2, \cdots, a_n\},$$
where the order is not important. Let $\{a_{i_1}, a_{i_2}, \cdots, a_{i_n}\}$ be any permutation of $\{a_1, a_2, \cdots, a_n\}$. Since the sample is drawn without replacement, we have
$$P(y_1 = a_{i_1}, \cdots, y_n = a_{i_n}) = \frac{1}{N} \cdot \frac{1}{N-1} \cdots \frac{1}{N-n+1} = \frac{(N-n)!}{N!}.$$
Hence, the probability of obtaining the sample $\{a_1, \cdots, a_n\}$ (where the order is not important) is
$$\sum_{\text{all } (i_1, \cdots, i_n)} P(y_1 = a_{i_1}, \cdots, y_n = a_{i_n}) = \sum_{\text{all } (i_1, \cdots, i_n)} \frac{(N-n)!}{N!} = n! \times \frac{(N-n)!}{N!} = \frac{1}{\binom{N}{n}}.$$
The theorem is thus proved by the definition of simple random sampling.



Two special cases, for $n = 1$ and $n = 2$, will be used later.

Theorem 2.1.3 For any $i, j = 1, \ldots, n$ and $s, t = 1, \ldots, N$,

(i) $P(y_i = u_s) = 1/N$;

(ii) $P(y_i = u_s, y_j = u_t) = \dfrac{1}{N(N-1)}$, for $i \neq j$, $s \neq t$.

Proof. For (i),
$$P(y_i = u_s) = \sum_{\substack{\text{all } (i_1, \cdots, i_n) \\ \text{with } i\text{th index} = s}} P(y_1 = u_{i_1}, \cdots, y_n = u_{i_n}) = \frac{(N-n)!}{N!} \times \binom{N-1}{n-1}(n-1)! = \frac{(N-n)!}{N!} \times \frac{(N-1)!}{(N-n)!} = \frac{1}{N}.$$
For (ii),
$$P(y_i = u_s, y_j = u_t) = \sum_{\substack{\text{all } (i_1, \cdots, i_n) \\ \text{with } i\text{th, } j\text{th indices} = s, t}} P(y_1 = u_{i_1}, \cdots, y_n = u_{i_n}) = \frac{(N-n)!}{N!} \times \binom{N-2}{n-2}(n-2)! = \frac{(N-n)!}{N!} \times \frac{(N-2)!}{(N-n)!} = \frac{1}{N(N-1)}.$$

Example 1. A population contains $\{a, b, c, d\}$. We wish to draw an s.r.s. of size 2. List all possible samples and find the probability of drawing $\{b, d\}$.

Solution. The possible samples of size 2 are
$$\{a, b\}, \{a, c\}, \{a, d\}, \{b, c\}, \{b, d\}, \{c, d\}.$$
The probability of drawing $\{b, d\}$ is 1/6.
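A quick way to check such small enumerations is to list all $\binom{N}{n}$ samples directly; the illustrative snippet below (not part of the original notes) confirms that each of the 6 samples, including $\{b, d\}$, has probability 1/6 under s.r.s.

```python
from itertools import combinations

population = ["a", "b", "c", "d"]
samples = list(combinations(population, 2))   # all possible samples of size 2
print(samples)                                # 6 samples in total
print(1 / len(samples))                       # probability of any one sample, e.g. {b, d}: 1/6
```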



2.2 Estimation of population mean and total

2.2.1 Estimation of population mean

Suppose that the population of size $N$ has values $\{u_1, u_2, \cdots, u_N\}$. We can define

1) the population mean
$$\mu = \frac{u_1 + u_2 + \cdots + u_N}{N} = \frac{1}{N}\sum_{i=1}^{N} u_i,$$
2) the population variance
$$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N} (u_i - \mu)^2.$$
We wish to estimate the quantities $\mu$ and $\sigma^2$ and to study the accuracy of their estimators. Suppose that a simple random sample of size $n$ is drawn, resulting in $\{y_1, y_2, \cdots, y_n\}$. Then an obvious estimator for $\mu$ is the sample mean:
$$\hat{\mu} = \bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i.$$

Theorem 2.2.1

(i) $E(y_i) = \mu$, $\operatorname{Var}(y_i) = \sigma^2$.

(ii) $\operatorname{Cov}(y_i, y_j) = -\dfrac{\sigma^2}{N-1}$, for $i \neq j$.

Proof.

(i) By an earlier theorem,
$$E(y_i) = \sum_{k=1}^{N} u_k P(y_i = u_k) = \sum_{k=1}^{N} u_k \frac{1}{N} = \mu,$$
$$\operatorname{Var}(y_i) = \sum_{k=1}^{N} (u_k - \mu)^2 P(y_i = u_k) = \sum_{k=1}^{N} (u_k - \mu)^2 \frac{1}{N} = \sigma^2.$$
(ii) By definition, $\operatorname{Cov}(y_i, y_j) = E(y_i y_j) - E(y_i)E(y_j) = E(y_i y_j) - \mu^2$. Now,
$$\begin{aligned}
E(y_i y_j) &= \sum_{\text{all } s \neq t} u_s u_t P(y_i = u_s, y_j = u_t) = \frac{1}{N(N-1)} \sum_{\text{all } s \neq t} u_s u_t \\
&= \frac{1}{N(N-1)} \left[ \sum_{\text{all } s, t} u_s u_t - \sum_{s = t} u_s u_t \right]
= \frac{1}{N(N-1)} \left[ \left( \sum_{s=1}^{N} u_s \right)\left( \sum_{t=1}^{N} u_t \right) - \sum_{s=1}^{N} u_s^2 \right] \\
&= \frac{1}{N(N-1)} \left[ (N\mu)^2 - \left( \sum_{s=1}^{N} (u_s - \mu)^2 + N\mu^2 \right) \right]
= \frac{1}{N(N-1)} \left[ (N\mu)^2 - N\sigma^2 - N\mu^2 \right] = -\frac{\sigma^2}{N-1} + \mu^2.
\end{aligned}$$
Thus, $\operatorname{Cov}(y_i, y_j) = E(y_i y_j) - \mu^2 = -\dfrac{\sigma^2}{N-1}$.


Theorem 2.2.2
$$E(\bar{y}) = \mu, \qquad \operatorname{Var}(\bar{y}) = \frac{\sigma^2}{n}\left(\frac{N-n}{N-1}\right).$$
Proof. Note $\bar{y} = \frac{1}{n}(y_1 + \cdots + y_n)$. So
$$E(\bar{y}) = \frac{1}{n}(Ey_1 + \cdots + Ey_n) = \frac{1}{n}(n\mu) = \mu.$$
Now
$$\begin{aligned}
\operatorname{Var}(\bar{y}) &= \frac{1}{n^2} \operatorname{Cov}\left( \sum_{i=1}^{n} y_i, \sum_{j=1}^{n} y_j \right) = \frac{1}{n^2} \sum_{i=1}^{n}\sum_{j=1}^{n} \operatorname{Cov}(y_i, y_j) \\
&= \frac{1}{n^2} \left( \sum_{i \neq j} \operatorname{Cov}(y_i, y_j) + \sum_{i = j} \operatorname{Cov}(y_i, y_j) \right)
= \frac{1}{n^2} \left( \sum_{i \neq j} \left( -\frac{\sigma^2}{N-1} \right) + \sum_{i=1}^{n} \operatorname{Var}(y_i) \right) \\
&= \frac{1}{n^2} \left( n(n-1)\left( -\frac{\sigma^2}{N-1} \right) + n\sigma^2 \right)
= \frac{\sigma^2}{n} \left( (n-1)\left( -\frac{1}{N-1} \right) + 1 \right)
= \frac{\sigma^2}{n} \left( \frac{N-n}{N-1} \right).
\end{aligned}$$

Remark: From Theorem 2.2.2, we see that $\bar{y}$ is an unbiased estimator for $\mu$. Also, as $n$ gets large (but $n \leq N$), $\operatorname{Var}(\bar{y})$ tends to 0. This implies that $\bar{y}$ will be a more accurate estimator for $\mu$ as $n$ gets larger (but no larger than $N$). In particular, when $n = N$, we have a census and $\operatorname{Var}(\bar{y}) = 0$.

Remark: In our previous statistics courses, we usually sampled $\{y_1, y_2, \cdots, y_n\}$ from the population with replacement. Therefore, $\{y_1, y_2, \cdots, y_n\}$ are independent and identically distributed (i.i.d.). And recall we have results like
$$E_{iid}(\bar{y}) = \mu, \qquad \operatorname{Var}_{iid}(\bar{y}) = \frac{\sigma^2}{n}.$$
Notice that $\operatorname{Var}_{iid}(\bar{y})$ is different from $\operatorname{Var}(\bar{y})$ in Theorem 2.2.2. In fact, for $n > 1$,
$$\operatorname{Var}(\bar{y}) = \frac{\sigma^2}{n}\left(\frac{N-n}{N-1}\right) < \frac{\sigma^2}{n} = \operatorname{Var}_{iid}(\bar{y}).$$
Thus, for the same sample size $n$, sampling without replacement produces a less variable estimator of $\mu$. Why?



Summary

1. How to draw a simple random sample? (purpose, method) Simple random sampling is the basic survey methodology.

2. After getting an s.r.s., how to describe the population, or how to analyze the data? Estimation of the population mean. (Sample mean.)

Estimation of $\sigma^2$ and $\operatorname{Var}(\bar{y})$

The population variance $\sigma^2$ is usually unknown. Now define
$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(y_i - \bar{y})^2 = \frac{1}{n-1}\left(\sum_{i=1}^{n} y_i^2 - n\bar{y}^2\right).$$

Example. When a few data points are repeated in a data set, the results are often arrayed in a frequency table. For example, a quiz given to 25 students was graded on a 4-point scale 0, 1, 2, 3, with 3 being a perfect score. Here are the results:

Score (X)   Frequency (F)   Proportion (P)
3           16              0.64
2           4               0.16
1           2               0.08
0           3               0.12

(a) Calculate the average score by using frequencies.

(b) Calculate the average score by using proportions.

(c) Calculate the standard deviation.

Solution. If the above 25 students constitute a random sample, then $s^2 = \frac{n}{n-1} \times 1.0976 = 1.1433$.

Let us look at some properties of $s^2$. Is it unbiased?
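The computations in (a)-(c) can be checked numerically; the small sketch below is my own (not part of the notes) and simply evaluates the frequency-table formulas above.

```python
scores = [3, 2, 1, 0]
freq   = [16, 4, 2, 3]

n = sum(freq)                                        # 25 students
mean = sum(x * f for x, f in zip(scores, freq)) / n  # (a)/(b): average score = 2.32

# "population-style" variance of the 25 scores (divide by n)
var_n = sum(f * (x - mean) ** 2 for x, f in zip(scores, freq)) / n   # 1.0976

# sample variance s^2 (divide by n - 1), as used in the notes
s2 = n / (n - 1) * var_n                             # 1.1433
sd = s2 ** 0.5                                       # standard deviation for (c)
print(mean, var_n, s2, sd)
```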

Theorem 2.2.3
$$E(s^2) = \frac{N}{N-1}\sigma^2.$$



Proof.
$$\begin{aligned}
E s^2 &= \frac{1}{n-1}\left( \sum_{i=1}^{n} E y_i^2 - n E(\bar{y})^2 \right)
= \frac{1}{n-1}\left( \sum_{i=1}^{n} \left[ \operatorname{Var}(y_i) + (E y_i)^2 \right] - n\left[ \operatorname{Var}(\bar{y}) + (E\bar{y})^2 \right] \right) \\
&= \frac{1}{n-1}\left( n\left[ \sigma^2 + \mu^2 \right] - n\left[ \frac{\sigma^2}{n}\left(\frac{N-n}{N-1}\right) + \mu^2 \right] \right)
= \frac{n\sigma^2}{n-1}\left[ 1 - \frac{1}{n}\left( \frac{N-n}{N-1} \right) \right] \\
&= \frac{n\sigma^2}{n-1}\left( \frac{nN - n - (N-n)}{n(N-1)} \right)
= \frac{N\sigma^2}{N-1}.
\end{aligned}$$

The next theorem is an easy consequence of the last theorem.

Theorem 2.2.4 $\hat{\sigma}^2 := \frac{N-1}{N}s^2$ is an unbiased estimator of $\sigma^2$, i.e.
$$E\left( \frac{N-1}{N}s^2 \right) = \sigma^2.$$

We shall define
$$f = \frac{n}{N}$$
to be the sampling fraction, and
$$1 - f = 1 - \frac{n}{N}$$
to be the finite population correction (abbreviated fpc).

Then we have the following theorem.

Theorem 2.2.5 An unbiased estimator for $\operatorname{Var}(\bar{y})$ is
$$\widehat{\operatorname{Var}}(\bar{y}) = \frac{s^2}{n}(1 - f).$$
Proof.
$$E\,\widehat{\operatorname{Var}}(\bar{y}) = \frac{E s^2}{n}(1 - f) = \frac{N\sigma^2}{n(N-1)}\left(1 - \frac{n}{N}\right) = \frac{\sigma^2}{n}\left(\frac{N-n}{N-1}\right) = \operatorname{Var}(\bar{y}).$$

Confidence intervals for $\mu$

It can be shown that the sample average $\bar{y}$ under simple random sampling is approximately normally distributed provided $n$ is large ($\geq 30$, say) and $f = n/N$ is not too close to 0 or 1.



Central limit theorem: If $n \to \infty$ such that $n/N \to \lambda \in (0, 1)$, then
$$\frac{\bar{y} - \mu}{\sqrt{\operatorname{Var}(\bar{y})}} \sim N(0, 1) \text{ approximately}.$$
If $\operatorname{Var}(\bar{y})$ is replaced by its estimator $\widehat{\operatorname{Var}}(\bar{y})$, we still have
$$\frac{\bar{y} - \mu}{\sqrt{\widehat{\operatorname{Var}}(\bar{y})}} \sim \text{approx. } N(0, 1), \text{ as } n/N \to \lambda > 0.$$
Thus,
$$1 - \alpha \approx P\left( \left| \frac{\bar{y} - \mu}{\sqrt{\widehat{\operatorname{Var}}(\bar{y})}} \right| \leq z_{\alpha/2} \right) = P\left( \bar{y} - z_{\alpha/2}\sqrt{\widehat{\operatorname{Var}}(\bar{y})} \leq \mu \leq \bar{y} + z_{\alpha/2}\sqrt{\widehat{\operatorname{Var}}(\bar{y})} \right).$$
Therefore, an approximate $(1 - \alpha)$ confidence interval for $\mu$ is
$$\bar{y} \mp z_{\alpha/2}\sqrt{\widehat{\operatorname{Var}}(\bar{y})} = \bar{y} \mp z_{\alpha/2}\frac{s}{\sqrt{n}}\sqrt{1 - f}.$$
$B := z_{\alpha/2}\sqrt{\widehat{\operatorname{Var}}(\bar{y})}$ is called the bound on the error of estimation.

Example. Suppose that an s.r.s. of size $n = 200$ is taken from a population of size $N = 1000$, resulting in $\bar{y} = 94$ and $s^2 = 400$. Find a 95% C.I. for $\mu$.

Solution.
$$94 \mp 1.96 \times \frac{20}{\sqrt{200}}\sqrt{1 - 1/5} = 94 \mp 2.479.$$
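A small Python helper (my own naming, not from the notes) that evaluates this confidence-interval formula:

```python
from math import sqrt

def ci_mean(ybar, s2, n, N, z=1.96):
    """Approximate CI for the population mean under s.r.s. without replacement."""
    f = n / N                                  # sampling fraction
    se = sqrt(s2 / n * (1 - f))                # sqrt of estimated Var(ybar), with fpc
    return ybar - z * se, ybar + z * se

print(ci_mean(94, 400, 200, 1000))             # roughly (91.52, 96.48), i.e. 94 -/+ 2.479
```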

Example. A simple random sample of $n = 100$ water meters within a community is monitored to estimate the average daily water consumption per household over a specified dry spell. The sample mean and variance are found to be $\bar{y} = 12.5$ and $s^2 = 1252$. If we assume that there are $N = 10{,}000$ households within the community, estimate $\mu$, the true average daily consumption, and find a 95% confidence interval for $\mu$.

Solution
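The notes leave this solution blank; reusing the ci_mean sketch from the previous example, the numbers below follow from the formula itself and are my own evaluation.

```python
print(ci_mean(12.5, 1252, 100, 10_000))        # roughly 12.5 -/+ 6.9, i.e. (5.6, 19.4)
```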



2.3 Selecting the sample size for estimating the population mean

We have seen that $\operatorname{Var}(\bar{y}) = \frac{\sigma^2}{n}\left(\frac{N-n}{N-1}\right)$. So the bigger the sample size $n$ is (but $\leq N$), the more accurate our estimate $\bar{y}$ is. It is of interest to find out the minimum $n$ such that our estimate is within an error bound $B$ with a certain probability $1 - \alpha$, say, i.e.,
$$P(|\bar{y} - \mu| < B) \approx 1 - \alpha.$$
By the central limit theorem,
$$P\left( \frac{|\bar{y} - \mu|}{\sqrt{\operatorname{Var}(\bar{y})}} < \frac{B}{\sqrt{\operatorname{Var}(\bar{y})}} \right) \approx 1 - \alpha.$$
Thus,
$$\frac{B}{\sqrt{\operatorname{Var}(\bar{y})}} = \frac{B}{\sqrt{\frac{\sigma^2}{n}\left(\frac{N-n}{N-1}\right)}} \approx z_{\alpha/2}
\iff \frac{\sigma^2}{n}\left( \frac{N-n}{N-1} \right) = \frac{B^2}{z_{\alpha/2}^2} = D
\iff \frac{N}{n} - 1 = \frac{(N-1)D}{\sigma^2}
\iff \frac{N}{n} = 1 + \frac{(N-1)D}{\sigma^2} = \frac{(N-1)D + \sigma^2}{\sigma^2}.$$
Thus,
$$n \approx \frac{N\sigma^2}{(N-1)D + \sigma^2}, \quad \text{where } D = \frac{B^2}{z_{\alpha/2}^2}.$$
Remark 1: if $\alpha = 5\%$, then $z_{\alpha/2} = 1.96 \approx 2$, so $D \approx \frac{B^2}{4}$. This coincides with the formula in the textbook (page 93).

Remark 2: the above formula requires knowledge of the population variance $\sigma^2$, which is typically unknown in practice. However, we can approximate $\sigma^2$ by the following methods:

1) from pilot studies;

2) from previous surveys;

3) from other studies.



e.g. Suppose that a total of 1500 students are to graduate next year. Determine the sample size $n$ needed to ensure that the sample average starting salary is within \$40 of the population average with probability at least 0.9. From previous studies, we know that the standard deviation of the starting salary is approximately \$400.

Solution.
$$n = \frac{1500 \times 400^2}{1499 \times 40^2 / 1.645^2 + 400^2} = 229.37 \approx 230.$$
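The sample-size formula is easy to script; the helper below is a sketch of my own (names are not from the notes) and reproduces this example.

```python
from math import ceil

def sample_size_mean(N, sigma2, B, z=1.96):
    """Minimum n so that |ybar - mu| < B with approx. probability 1 - alpha."""
    D = B ** 2 / z ** 2
    n = N * sigma2 / ((N - 1) * D + sigma2)
    return ceil(n)                   # round up to the next whole unit

print(sample_size_mean(1500, 400 ** 2, 40, z=1.645))   # 230
```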

e.g. Example 4.5 (p. 94, 5th edition). The average amount of money $\mu$ for a hospital's accounts receivable must be estimated. Although no prior data are available to estimate the population variance $\sigma^2$, it is known that most accounts lie within a \$100 range. There are 1000 open accounts. Find the sample size needed to estimate $\mu$ with a bound on the error of estimation of \$3 with probability 0.95.

Remark. The solution depends on how one interprets "most accounts": whether it means 70%, 90%, 95% or 99% of all accounts.

Solution. We need an estimate of $\sigma^2$. For the normal distribution $N(0, \sigma^2)$, we have
$$P(|N(0, \sigma^2)| \leq 1.96\sigma) = P(|N(0, 1)| \leq 1.96) = 95\%, \qquad P(|N(0, \sigma^2)| \leq 3\sigma) = P(|N(0, 1)| \leq 3) = 99.73\%.$$
So roughly 95% of accounts lie within a $4\sigma$ range and 99.73% of accounts lie within a $6\sigma$ range. Here $B = 3$ and $N = 1000$.

If "most" means 95%, we take $2 \times (2\sigma) = 100$, so $\sigma = 25$. Then $n = 210.76 \approx 211$.

If "most" means 99.73%, we take $2 \times (3\sigma) = 100$, so $\sigma = 50/3$. Then $n \approx 107$.



2.3.1 A quick summary on estimation of the population mean

The population mean is defined to be
$$\mu = \frac{1}{N}(u_1 + u_2 + \cdots + u_N).$$
Suppose a simple random sample is $\{y_1, \ldots, y_n\}$.

1) Estimators of the population mean $\mu$ and variance $\sigma^2$ are
$$\hat{\mu} = \bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i, \qquad s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(y_i - \bar{y})^2.$$
2) The mean and variance of $\bar{y}$ are
$$E\bar{y} = \mu, \qquad \operatorname{Var}(\bar{y}) = \frac{\sigma^2}{n}\left(\frac{N-n}{N-1}\right).$$
3) An estimator of the variance of $\bar{y}$ is
$$\widehat{\operatorname{Var}}(\bar{y}) = \frac{s^2}{n}(1 - f), \quad \text{where } f = n/N.$$
4) An approximate $(1 - \alpha)$ confidence interval for $\mu$ is
$$\bar{y} \mp z_{\alpha/2}\sqrt{\widehat{\operatorname{Var}}(\bar{y})} = \bar{y} \mp z_{\alpha/2}\frac{s}{\sqrt{n}}\sqrt{1 - f}.$$
5) The minimum sample size $n$ needed to have an error bound $B$ with probability $1 - \alpha$ is
$$n \approx \frac{N\sigma^2}{(N-1)D + \sigma^2}, \quad \text{where } D = \frac{B^2}{z_{\alpha/2}^2}.$$



2.3.2 Estimation of the population total

The population total is defined to be
$$\tau = u_1 + u_2 + \cdots + u_N = N\mu.$$
Suppose a simple random sample is $\{y_1, \ldots, y_n\}$.

1) An estimator of the population total $\tau$ is
$$\hat{\tau} = N\bar{y}.$$
2) The mean and variance of $\hat{\tau}$ are
$$E\hat{\tau} = \tau, \qquad \operatorname{Var}(\hat{\tau}) = N^2 \frac{\sigma^2}{n}\left(\frac{N-n}{N-1}\right).$$
3) An estimator of the variance of $\hat{\tau}$ is
$$\widehat{\operatorname{Var}}(\hat{\tau}) = \widehat{\operatorname{Var}}(N\bar{y}) = N^2 \frac{s^2}{n}(1 - f).$$
Central limit theorem: If $n \to \infty$ such that $n/N \to \lambda \in (0, 1)$, then
$$\frac{\hat{\tau} - \tau}{\sqrt{\operatorname{Var}(\hat{\tau})}} \sim N(0, 1) \text{ approximately}.$$
If $\operatorname{Var}(\hat{\tau})$ is replaced by its estimator $\widehat{\operatorname{Var}}(\hat{\tau})$, we still have
$$\frac{\hat{\tau} - \tau}{\sqrt{\widehat{\operatorname{Var}}(\hat{\tau})}} \sim \text{approx. } N(0, 1), \text{ as } n/N \to \lambda > 0.$$
Thus,
$$1 - \alpha \approx P\left( \left| \frac{\hat{\tau} - \tau}{\sqrt{\widehat{\operatorname{Var}}(\hat{\tau})}} \right| \leq z_{\alpha/2} \right) = P\left( \hat{\tau} - z_{\alpha/2}\sqrt{\widehat{\operatorname{Var}}(\hat{\tau})} \leq \tau \leq \hat{\tau} + z_{\alpha/2}\sqrt{\widehat{\operatorname{Var}}(\hat{\tau})} \right).$$
Therefore, an approximate $(1 - \alpha)$ confidence interval for $\tau$ is
$$\hat{\tau} \mp z_{\alpha/2}\sqrt{\widehat{\operatorname{Var}}(\hat{\tau})} = \hat{\tau} \mp z_{\alpha/2}\, N\frac{s}{\sqrt{n}}\sqrt{1 - f}.$$
$B := z_{\alpha/2}\sqrt{\widehat{\operatorname{Var}}(\hat{\tau})} = N z_{\alpha/2}\sqrt{\widehat{\operatorname{Var}}(\bar{y})}$ is called the bound on the error of estimation.



4) An approximate $(1 - \alpha)$ confidence interval for $\tau$ is
$$\hat{\tau} \mp z_{\alpha/2}\sqrt{\widehat{\operatorname{Var}}(\hat{\tau})} = \hat{\tau} \mp z_{\alpha/2}\, N\frac{s}{\sqrt{n}}\sqrt{1 - f} = N\left( \bar{y} \mp z_{\alpha/2}\frac{s}{\sqrt{n}}\sqrt{1 - f} \right).$$
5) The minimum sample size $n$ needed to have an error bound $B$ with probability $1 - \alpha$ is
$$n \approx \frac{N\sigma^2}{(N-1)D + \sigma^2}, \quad \text{where } D = \frac{B^2}{N^2 z_{\alpha/2}^2}.$$

Example 4.6 (page 95 of the textbook). An investigator is interested in estimating the total weight gain in 0 to 4 weeks for $N = 1000$ chicks fed on a new ration. Obviously, to weigh each bird would be time-consuming and tedious. Therefore, determine the number of chicks to be sampled in this study in order to estimate $\tau$ within a bound on the error of estimation equal to 1000 grams with probability 95%. Many similar studies on chick nutrition have been run in the past. Using data from these studies, the investigator found that $\sigma^2$, the population variance, was approximately 36.00 (grams)$^2$. Determine the required sample size.

Solution
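The solution is left blank in the notes; the sketch below simply plugs the example's numbers into the item-5 formula with $D = B^2/(N^2 z_{\alpha/2}^2)$, so the numerical result is my own evaluation and depends on whether $z_{0.025}$ is kept as 1.96 or rounded to 2.

```python
from math import ceil

def sample_size_total(N, sigma2, B, z=1.96):
    """Minimum n so that |tau_hat - tau| < B with approx. probability 1 - alpha."""
    D = B ** 2 / (N ** 2 * z ** 2)
    return ceil(N * sigma2 / ((N - 1) * D + sigma2))

print(sample_size_total(1000, 36.0, 1000))           # about 122 with z = 1.96
print(sample_size_total(1000, 36.0, 1000, z=2.0))    # about 126 with the z ~ 2 approximation
```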



2.4 Estimation of the population proportion

Suppose we are interested in the proportion $p$ of the population with a specified characteristic. Let
$$y_i = \begin{cases} 1 & \text{if the } i\text{th element has the characteristic,} \\ 0 & \text{if not.} \end{cases}$$
It is easy to see that $E(y_i) = E(y_i^2) = p$ (why?). Therefore, we have
$$\mu = E(y_i) = p, \qquad \sigma^2 = \operatorname{Var}(y_i) = p - p^2 = pq, \quad \text{where } q = 1 - p.$$
The total number of elements in the sample of size $n$ possessing the specified characteristic is $\sum_{i=1}^{n} y_i$. Therefore:

1. An estimator of the population proportion $p$ is
$$\bar{y} = \frac{\sum_{i=1}^{n} y_i}{n} = \hat{p}, \text{ say}.$$
And an estimator of the population variance $\sigma^2 = pq$ is
$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(y_i - \bar{y})^2 = \frac{1}{n-1}\left( \sum_{i=1}^{n} y_i^2 - n\bar{y}^2 \right)
= \frac{1}{n-1}\left( \sum_{i=1}^{n} y_i - n\hat{p}^2 \right) = \frac{1}{n-1}\left( n\hat{p} - n\hat{p}^2 \right) = \frac{n}{n-1}\hat{p}\hat{q}, \quad \text{where } \hat{q} = 1 - \hat{p}.$$
From Theorems 2.2.2 and 2.2.3, we have
$$E(\hat{p}) = p, \qquad E(s^2) = \frac{N}{N-1}\sigma^2 = \frac{N}{N-1}pq. \tag{4.1}$$
2. Again, from Theorem 2.2.2, the variance of $\hat{p}$ is
$$\operatorname{Var}(\hat{p}) = \frac{\sigma^2}{n}\left(\frac{N-n}{N-1}\right) = \frac{pq}{n}\left(\frac{N-n}{N-1}\right).$$
3. From equation (4.1) and Theorem 2.2.5, an estimator of the variance of $\hat{p}$ is
$$\widehat{\operatorname{Var}}(\hat{p}) = \frac{s^2}{n}(1 - f) = \frac{\hat{p}\hat{q}}{n-1}(1 - f).$$
4. An approximate $(1 - \alpha)$ confidence interval for $p$ is
$$\hat{p} \mp z_{\alpha/2}\sqrt{\widehat{\operatorname{Var}}(\hat{p})} = \hat{p} \mp z_{\alpha/2}\sqrt{\frac{\hat{p}\hat{q}}{n-1}}\sqrt{1 - f}.$$



5. The minimum sample size $n$ required to estimate $p$ such that our estimate $\hat{p}$ is within an error bound $B$ with probability $1 - \alpha$ is
$$n \approx \frac{Npq}{(N-1)D + pq}, \quad \text{where } D = \frac{B^2}{z_{\alpha/2}^2}.$$
Note that the right-hand side is an increasing function of $\sigma^2 = pq$.

a) $p$ is often unknown, so we can replace it by some estimate (from a previous study, pilot study, etc.).

b) If we don't have an estimate of $p$, we can replace it by $p = 1/2$, thus $pq = 1/4$, the most conservative choice.

e.g. Suppose that a small town has a population of $N = 800$ people. Let $p$ = the proportion of people with blood type A.

(1) What sample size $n$ must be drawn in order to estimate $p$ to within 0.04 of $p$ with probability 0.95?

(2) Suppose that we know no more than 10% of the population have blood type A. Find $n$ again as in (1). Comment on the difference between (1) and (2).

(3) A simple random sample of size $n = 200$ is taken and it is found that 7% of the sample has blood type A. Find a 90% confidence interval for $p$.

Solution. $N = 800$, $\alpha = 0.05$, $B = 0.04$.

(1) Taking $p = 1/2$ in the formula, we get $n = 344$.

(2) $p \leq 0.10$, so $\sigma^2 = pq \leq 0.09$. A simple calculation yields $n = 171$.

(3) (0.040, 0.096).
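These calculations can be reproduced with a short script; the helper names below are my own, and the interval in part (3) may differ slightly from the notes' rounding.

```python
from math import ceil, sqrt

def sample_size_prop(N, p, B, z=1.96):
    """Minimum n to estimate p within B with approx. probability 1 - alpha."""
    D = B ** 2 / z ** 2
    return ceil(N * p * (1 - p) / ((N - 1) * D + p * (1 - p)))

def ci_prop(p_hat, n, N, z=1.645):
    """Approximate CI for p under s.r.s., using p_hat*q_hat/(n-1) with the fpc."""
    se = sqrt(p_hat * (1 - p_hat) / (n - 1) * (1 - n / N))
    return p_hat - z * se, p_hat + z * se

print(sample_size_prop(800, 0.5, 0.04))    # part (1): 344
print(sample_size_prop(800, 0.1, 0.04))    # part (2): 171
print(ci_prop(0.07, 200, 800))             # part (3): roughly (0.044, 0.096)
```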

Example. A simple random sample of $n = 40$ college students was interviewed to determine the proportion of students in favor of converting from the semester to the quarter system; 25 students answered affirmatively. Estimate $p$, the proportion of students on campus in favor of the change. (Assume $N = 2000$.) Find a 95% confidence interval for $p$.



Solution
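This solution is again left blank; reusing the ci_prop sketch from the previous example with $z = 1.96$, the numbers below are my own evaluation of the formula.

```python
p_hat = 25 / 40                               # 0.625
print(ci_prop(p_hat, 40, 2000, z=1.96))       # roughly 0.625 -/+ 0.15, i.e. (0.47, 0.78)
```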

2.5 Comparing estimates

Suppose $x_1, \cdots, x_m$ is a random sample from a population with mean $\mu_x$ and $y_1, \cdots, y_n$ is a random sample from a population with mean $\mu_y$. We are interested in the difference of means $\mu_y - \mu_x$, which can be estimated unbiasedly by $\bar{y} - \bar{x}$, since
$$E(\bar{y} - \bar{x}) = \mu_y - \mu_x.$$
Further,
$$\operatorname{Var}(\bar{y} - \bar{x}) = \operatorname{Var}(\bar{y}) + \operatorname{Var}(\bar{x}) - 2\operatorname{Cov}(\bar{y}, \bar{x}).$$
Remark: If the two samples $x_1, \cdots, x_m$ and $y_1, \cdots, y_n$ are independent, then $\operatorname{Cov}(\bar{y}, \bar{x}) = 0$. However, a more interesting case is when the two samples are dependent, which will be illustrated in the following example.

A dependent example

Suppose an opinion poll asks $n$ people the question "Do you favor abortion?" The opinions given are YES, NO, or NO OPINION.

Let the proportions of people who answer 'YES', 'NO', 'No opinion' be $p_1$, $p_2$ and $p_3$, respectively. In particular, we are interested in comparing $p_1$ and $p_2$ by looking at $p_1 - p_2$. Clearly, the estimates of $p_1$ and $p_2$ are dependent, since if one is high, the other is likely to be low.

Let $\hat{p}_1$, $\hat{p}_2$ and $\hat{p}_3$ be the three respective sample proportions amongst the sample of size $n$. Then $X = n\hat{p}_1$, $Y = n\hat{p}_2$ and $Z = n\hat{p}_3$ follow a multinomial distribution with parameters $(n, p_1, p_2, p_3)$. That is,
$$P(X = x, Y = y, Z = z) = \binom{n}{x, y, z} p_1^x p_2^y p_3^z = \frac{n!}{x!\,y!\,z!} p_1^x p_2^y p_3^z.$$
Please note that
$$\sum_{x \geq 0,\, y \geq 0,\, x+y+z = n} \frac{n!}{x!\,y!\,z!} p_1^x p_2^y p_3^z = 1.$$



Question: What is the distribution of $X$? (Hint: classify the people into "Yes" and "Not Yes".)

Theorem 2.5.1
$$E(X) = np_1, \quad E(Y) = np_2, \quad E(Z) = np_3,$$
$$\operatorname{Var}(X) = np_1q_1, \quad \operatorname{Var}(Y) = np_2q_2,$$
$$\operatorname{Cov}(X, Y) = -np_1p_2.$$

Proof. $X$ = number of people saying "YES" $\sim \operatorname{Bin}(n, p_1)$. So $EX = np_1$, $\operatorname{Var}(X) = np_1q_1$. Now $\operatorname{Cov}(X, Y) = E(XY) - (EX)(EY) = E(XY) - n^2p_1p_2$. But
$$\begin{aligned}
E(XY) &= \sum_{x, y \geq 0,\; x+y \leq n} xy\, P(X = x, Y = y)
= \sum_{x, y \geq 1,\; x+y \leq n} xy\, P(X = x, Y = y, Z = n - x - y) \\
&= \sum_{x, y \geq 1,\; x+y \leq n} xy\, \frac{n!}{x!\,y!\,(n-x-y)!}\, p_1^x p_2^y p_3^{n-x-y}
= \sum_{x, y \geq 1,\; x+y \leq n} \frac{n!}{(x-1)!\,(y-1)!\,(n-x-y)!}\, p_1^x p_2^y p_3^{n-x-y} \\
&= n(n-1)p_1p_2 \sum_{\substack{x-1,\, y-1 \geq 0 \\ (x-1)+(y-1) \leq n-2}} \frac{(n-2)!}{(x-1)!\,(y-1)!\,((n-2)-(x-1)-(y-1))!}\, p_1^{x-1} p_2^{y-1} p_3^{(n-2)-(x-1)-(y-1)} \\
&= n(n-1)p_1p_2 \sum_{\substack{x_1,\, y_1 \geq 0 \\ x_1+y_1 \leq n-2}} \frac{(n-2)!}{x_1!\,y_1!\,((n-2)-x_1-y_1)!}\, p_1^{x_1} p_2^{y_1} p_3^{(n-2)-x_1-y_1} \\
&= n(n-1)p_1p_2 = n^2p_1p_2 - np_1p_2.
\end{aligned}$$
Therefore, $\operatorname{Cov}(X, Y) = E(XY) - n^2p_1p_2 = -np_1p_2$.

Theorem 2.5.2
$$E(\hat{p}_1) = p_1, \quad E(\hat{p}_2) = p_2,$$
$$\operatorname{Var}(\hat{p}_1) = p_1q_1/n, \quad \operatorname{Var}(\hat{p}_2) = p_2q_2/n,$$
$$\operatorname{Cov}(\hat{p}_1, \hat{p}_2) = -p_1p_2/n.$$



Proof. Note that $\hat{p}_1 = X/n$ and $\hat{p}_2 = Y/n$. Apply the last theorem.

From the last theorem, we have
$$\operatorname{Var}(\hat{p}_1 - \hat{p}_2) = \operatorname{Var}(\hat{p}_1) + \operatorname{Var}(\hat{p}_2) - 2\operatorname{Cov}(\hat{p}_1, \hat{p}_2) = \frac{p_1q_1}{n} + \frac{p_2q_2}{n} + \frac{2p_1p_2}{n}.$$
One estimator of $\operatorname{Var}(\hat{p}_1 - \hat{p}_2)$ is
$$\widehat{\operatorname{Var}}(\hat{p}_1 - \hat{p}_2) = \frac{\hat{p}_1\hat{q}_1}{n} + \frac{\hat{p}_2\hat{q}_2}{n} + \frac{2\hat{p}_1\hat{p}_2}{n}.$$
Is it unbiased? No! An unbiased estimator of the variance of $\hat{p}_1$ is $\widehat{\operatorname{Var}}(\hat{p}_1) = \frac{\hat{p}_1\hat{q}_1}{n-1}(1 - f)$. Also, $E\hat{p}_1\hat{p}_2 = E(XY)/n^2 = p_1p_2(1 - 1/n)$ implies that an unbiased estimator of $p_1p_2$ is $\hat{p}_1\hat{p}_2(1 - 1/n)^{-1}$. So
$$\widehat{\operatorname{Var}}(\hat{p}_1) + \widehat{\operatorname{Var}}(\hat{p}_2) + 2n^{-1}\hat{p}_1\hat{p}_2(1 - 1/n)^{-1}$$
is an unbiased estimator of $\operatorname{Var}(\hat{p}_1 - \hat{p}_2)$. But it is easier to use
$$\widehat{\operatorname{Var}}(\hat{p}_1 - \hat{p}_2) = \frac{\hat{p}_1\hat{q}_1}{n} + \frac{\hat{p}_2\hat{q}_2}{n} + \frac{2\hat{p}_1\hat{p}_2}{n}.$$
Therefore, an approximate $(1 - \alpha)$ confidence interval for $p_1 - p_2$ is
$$(\hat{p}_1 - \hat{p}_2) \mp z_{\alpha/2}\sqrt{\widehat{\operatorname{Var}}(\hat{p}_1 - \hat{p}_2)} = (\hat{p}_1 - \hat{p}_2) \mp z_{\alpha/2}\sqrt{\frac{\hat{p}_1\hat{q}_1}{n} + \frac{\hat{p}_2\hat{q}_2}{n} + \frac{2\hat{p}_1\hat{p}_2}{n}}.$$
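A small sketch (my own helper, not from the notes) of this simpler, slightly biased variance estimator and the resulting confidence interval:

```python
from math import sqrt

def ci_diff_dependent(p1, p2, n, z=1.96):
    """Approx. CI for p1 - p2 when the two sample proportions are dependent multinomial proportions."""
    var_hat = (p1 * (1 - p1) + p2 * (1 - p2) + 2 * p1 * p2) / n   # note the + sign on the covariance term
    half_width = z * sqrt(var_hat)
    return (p1 - p2) - half_width, (p1 - p2) + half_width

print(ci_diff_dependent(0.29, 0.34, 600))   # roughly (-0.113, 0.013); cf. the baseball example below
```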

e.g. (From the textbook.) Should smoking be banned from the workplace? A Time/Yankelovich poll of 800 adult Americans carried out on April 6-7, 1994 gave the following results.

                   Nonsmokers   Smokers
Banned             44%          8%
Special areas      52%          80%
No restrictions    3%           11%

Based on a sample of 600 nonsmokers and 200 smokers, estimate and construct a 95% C.I. for

(1) the true difference between the proportions choosing "Banned" between nonsmokers and smokers;

(2) the true difference between the proportions among nonsmokers choosing between "Banned" and "Special Areas".



Solution

A. The proportions choosing "Banned" are independent of each other; a high value of one does not force a low value of the other. Thus, an appropriate estimate of this difference is
$$0.44 - 0.08 \pm 2\sqrt{\frac{0.44 \times 0.56}{600} + \frac{0.08 \times 0.92}{200}} = 0.36 \pm 0.06.$$
B. The proportion of nonsmokers choosing "Special areas" is dependent on the proportion choosing "Banned"; if the latter is large, the former must be small. These are multinomial proportions. Thus, an appropriate estimate of this difference is
$$0.52 - 0.44 \pm 2\sqrt{\frac{0.44 \times 0.56}{600} + \frac{0.52 \times 0.48}{600} + 2 \times \frac{0.44 \times 0.52}{600}} = 0.08 \pm 0.08.$$

Example. The major league baseball season in the US came to an abrupt end in the middle of 1994. In a poll of 600 adult Americans, 29% blamed the players for the strike, 34% blamed the owners, and the rest held various other opinions. Does the evidence suggest that the true proportions who blame the players and the owners, respectively, are really different?

$p_1$: proportion of Americans who blamed the players.

$p_2$: proportion of Americans who blamed the owners.
$$\widehat{\operatorname{Var}}(\hat{p}_1 - \hat{p}_2) = \frac{\hat{p}_1\hat{q}_1}{n} + \frac{\hat{p}_2\hat{q}_2}{n} + \frac{2\hat{p}_1\hat{p}_2}{n} = \frac{0.29 \times 0.71}{600} + \frac{0.34 \times 0.66}{600} + \frac{2 \times 0.29 \times 0.34}{600} = 1.0458 \times 10^{-3}.$$
So an approximate 95% C.I. for $p_1 - p_2$ is
$$0.29 - 0.34 \pm z_{0.025}\sqrt{\widehat{\operatorname{Var}}(\hat{p}_1 - \hat{p}_2)} = -0.05 \pm 1.96 \times 0.03234 = (-0.11339, 0.01339).$$
Since this interval contains 0, the evidence does not suggest that the two proportions are really different.

