ST3239: Survey Methodology
by Wang ZHOU

Chapter 1
Elements of the sampling problem

1.1 Introduction

Often we are interested in some characteristics of a finite population, e.g. the average income of last year's graduates from NUS. Since the population is usually very large, we would like to say something (i.e. make inferences) about the population by collecting and analysing only a part of that population. The principles and methods of collecting and analysing data from a finite population form a branch of statistics known as Sample Survey Methods. The theory involved is called Sampling Theory. Sample surveys are widely used in many areas such as agriculture, education, industry, social affairs and medicine.
1.2 Some technical terms

1. An element is an object on which a measurement is taken.
2. A population is a collection of elements about which we require information.
3. Population characteristic: this is the aspect of the population we wish to measure, e.g. the average income of last year's graduates from NUS, or the total wheat yield of all farmers in a certain country.
4. Sampling units are nonoverlapping collections of elements from the population. Sampling units may be the individual members of the population, or they may be a coarser subdivision of the population, e.g. a household, which may contain more than one individual member.
5. A frame is a list of sampling units, e.g., a telephone directory.
6. A sample is a collection of sampling units drawn from a frame or frames.
1.3 Why sample?

If a sample is equal to the population, then we have a census, which contains all the information one wants. However, a census is rarely conducted, for several reasons:
• cost (money is limited),
• time (time is limited),
• destructiveness (testing a product can be destructive, e.g. light bulbs),
• accessibility (non-response can be a serious issue).
In those cases, sampling is the only alternative.
1.4 How to select the sample: the design of the sample survey

The procedure for selecting the sample is called the sample survey design. The general aim of a sample survey is to draw samples which are "representative" of the whole population. Broadly speaking, we can classify sampling schemes into two categories: probability sampling and other sampling schemes.

1. Probability sampling is a sampling scheme whereby the particular samples are enumerated and each has a non-zero probability of being selected. With probability built into the design, we can make statements such as "our estimate is unbiased and we are 95% confident that it is within 2 percentage points of the true proportion". In this course, we shall concentrate only on probability sampling.

2. Some other sampling schemes
a) 'volunteer sampling': TV telephone polls, medical volunteers for research.
b) 'subjective sampling': we choose samples that we consider to be typical or "representative" of the population.
c) 'quota sampling': one keeps sampling until certain quotas are filled.

All these sampling procedures provide some information about the population, but it is hard to deduce the nature of the population from such studies, as the samples are very subjective and often very biased. Furthermore, it is hard to measure the precision of these estimates.
1.5 How to design a questionnaire and plan a survey

This can be the most important and perhaps most difficult part of the survey sampling problem. We shall come back to this point in more detail later.
Chapter 2
Simple random sampling

Definition: If a sample of size n is drawn from a population of size N in such a way that every possible sample of size n has the same probability of being selected, the sampling procedure is called simple random sampling. The sample thus obtained is called a simple random sample. Simple random sampling is often written as s.r.s. for short and is the simplest sampling procedure.
2.1 How to draw a simple random sample

Suppose that the population of size N has values
$$\{u_1, u_2, \cdots, u_N\}.$$
If we draw n (distinct) items without replacement from the population, there are altogether $\binom{N}{n}$ different ways of doing it. So if we assign probability $1/\binom{N}{n}$ to each of the $\binom{N}{n}$ different samples, then each sample thus obtained is a simple random sample. We denote this sample by
$$\{y_1, y_2, \cdots, y_n\}.$$
Remark: In our previous statistics courses, we always use upper-case letters like X, Y etc. to denote random variables and lower-case letters like x, y etc. to represent fixed values. However, in a sample survey course, by convention, we use lower-case letters like $y_1, y_2$ etc. to denote random variables.
Theorem 2.1.1 For simple random sampling, we have
$$P(y_1 = u_{i_1}, y_2 = u_{i_2}, \cdots, y_n = u_{i_n}) = \frac{1}{N} \cdot \frac{1}{N-1} \cdots \frac{1}{N-n+1} = \frac{(N-n)!}{N!},$$
where $i_1, i_2, \cdots, i_n$ are mutually different.
Proof. By the definition of s.r.s., the probability of obtaining the sample $\{u_{i_1}, u_{i_2}, \cdots, u_{i_n}\}$ (where the order is not important) is $1/\binom{N}{n}$. There are $n!$ ways of ordering $\{u_{i_1}, u_{i_2}, \cdots, u_{i_n}\}$. Therefore,
$$P(y_1 = u_{i_1}, y_2 = u_{i_2}, \cdots, y_n = u_{i_n}) = \frac{1}{\binom{N}{n}\, n!} = \frac{(N-n)!\, n!}{N!\, n!} = \frac{(N-n)!}{N!}.$$
Remark: Recall that the total number of all possible samples is $\binom{N}{n}$, which could be very large if N and n are large. Therefore, getting a simple random sample by first listing all possible samples and then drawing one at random would not be practical. An easier way to get a simple random sample is simply to draw n values at random without replacement from the N population values. That is, we first draw one value at random from the N population values, and then draw another value at random from the remaining N − 1 population values, and so on, until we get a sample of n (different) values.
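The successive-draw procedure above can be sketched in Python; this is only an illustration (the standard library's `random.sample` implements the same idea directly):

```python
import random

def simple_random_sample(population, n):
    """Draw an s.r.s. of size n by successive draws without replacement."""
    pool = list(population)
    sample = []
    for _ in range(n):
        # draw one value at random from the values still remaining
        idx = random.randrange(len(pool))
        sample.append(pool.pop(idx))
    return sample

srs = simple_random_sample(range(1, 101), 10)  # s.r.s. of size 10 from {1, ..., 100}
```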
Theorem 2.1.2 A sample obtained by drawing n values successively without replacement from the N population values is a simple random sample.

Proof. Suppose that our sample obtained by drawing n values without replacement from the N population values is
$$\{a_1, a_2, \cdots, a_n\},$$
where the order is not important. Let $\{a_{i_1}, a_{i_2}, \cdots, a_{i_n}\}$ be any permutation of $\{a_1, a_2, \cdots, a_n\}$. Since the sample is drawn without replacement, we have
$$P(y_1 = a_{i_1}, \cdots, y_n = a_{i_n}) = \frac{1}{N} \cdot \frac{1}{N-1} \cdots \frac{1}{N-n+1} = \frac{(N-n)!}{N!}.$$
Hence, the probability of obtaining the sample $\{a_1, \cdots, a_n\}$ (where the order is not important) is
$$\sum_{\text{all } (i_1,\cdots,i_n)} P(y_1 = a_{i_1}, \cdots, y_n = a_{i_n}) = \sum_{\text{all } (i_1,\cdots,i_n)} \frac{(N-n)!}{N!} = n! \times \frac{(N-n)!}{N!} = \frac{1}{\binom{N}{n}}.$$
The theorem is thus proved by the definition of simple random sampling.
Two special cases, n = 1 and n = 2, will be used later.

Theorem 2.1.3 For any $i, j = 1, \ldots, n$ and $s, t = 1, \ldots, N$,
(i) $P(y_i = u_s) = 1/N$;
(ii) $P(y_i = u_s, y_j = u_t) = \dfrac{1}{N(N-1)}$, for $i \neq j$, $s \neq t$.
Proof.
$$\begin{aligned}
P(y_k = u_s) &= \sum_{\text{all } (i_1,\cdots,i_n) \text{ with } i_k = s} P(y_1 = u_{i_1}, \cdots, y_k = u_{i_k}, \cdots, y_n = u_{i_n}) \\
&= \frac{(N-n)!}{N!} \times \binom{N-1}{n-1}(n-1)! = \frac{(N-n)!}{N!} \times \frac{(N-1)!}{(N-n)!} = \frac{1}{N}.
\end{aligned}$$
$$\begin{aligned}
P(y_k = u_s, y_j = u_t) &= \sum_{\text{all } (i_1,\cdots,i_n) \text{ with } i_k = s,\, i_j = t} P(y_1 = u_{i_1}, \cdots, y_n = u_{i_n}) \\
&= \frac{(N-n)!}{N!} \times \binom{N-2}{n-2}(n-2)! = \frac{(N-n)!}{N!} \times \frac{(N-2)!}{(N-n)!} = \frac{1}{N(N-1)}.
\end{aligned}$$
Example 1. A population contains {a, b, c, d}. We wish to draw an s.r.s. of size 2. List all possible samples and find the probability of drawing {b, d}.
Solution. The possible samples of size 2 are
{a, b}, {a, c}, {a, d}, {b, c}, {b, d}, {c, d}.
The probability of drawing {b, d} is 1/6.
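For a toy population this small, all $\binom{4}{2} = 6$ samples can be enumerated directly; a quick sketch:

```python
from itertools import combinations

population = ["a", "b", "c", "d"]
samples = list(combinations(population, 2))  # all C(4,2) = 6 unordered samples
prob_bd = 1 / len(samples)                   # each sample has probability 1/6
```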
2.2 Estimation of population mean and total

2.2.1 Estimation of population mean

Suppose that the population of size N has values $\{u_1, u_2, \cdots, u_N\}$. We can define
1) the population mean
$$\mu = \frac{u_1 + u_2 + \cdots + u_N}{N} = \frac{1}{N}\sum_{i=1}^N u_i,$$
2) the population variance
$$\sigma^2 = \frac{1}{N}\sum_{i=1}^N (u_i - \mu)^2.$$
We wish to estimate the quantities $\mu$ and $\sigma^2$ and to study the accuracy of their estimators. Suppose that a simple random sample of size n is drawn, resulting in $\{y_1, y_2, \cdots, y_n\}$. Then an obvious estimator for $\mu$ is the sample mean:
$$\hat\mu = \bar y = \frac{1}{n}\sum_{i=1}^n y_i.$$
Theorem 2.2.1
(i) $E(y_i) = \mu$, $Var(y_i) = \sigma^2$.
(ii) $Cov(y_i, y_j) = -\dfrac{\sigma^2}{N-1}$, for $i \neq j$.

Proof.
(i). By an earlier theorem (Theorem 2.1.3),
$$E(y_i) = \sum_{k=1}^N u_k P(y_i = u_k) = \sum_{k=1}^N u_k \frac{1}{N} = \mu,$$
$$Var(y_i) = \sum_{k=1}^N (u_k - \mu)^2 P(y_i = u_k) = \sum_{k=1}^N (u_k - \mu)^2 \frac{1}{N} = \sigma^2.$$
(ii). By definition, $Cov(y_i, y_j) = E(y_i y_j) - E(y_i)E(y_j) = E(y_i y_j) - \mu^2$. Now,
$$\begin{aligned}
E(y_i y_j) &= \sum_{\text{all } s \neq t} u_s u_t P(y_i = u_s, y_j = u_t) = \frac{1}{N(N-1)} \sum_{\text{all } s \neq t} u_s u_t \\
&= \frac{1}{N(N-1)} \left[ \left(\sum_{s=1}^N u_s\right)\left(\sum_{t=1}^N u_t\right) - \sum_{s=1}^N u_s^2 \right] \\
&= \frac{1}{N(N-1)} \left[ (N\mu)^2 - \left( \sum_{s=1}^N (u_s - \mu)^2 + N\mu^2 \right) \right] \\
&= \frac{1}{N(N-1)} \left[ (N\mu)^2 - N\sigma^2 - N\mu^2 \right] = -\frac{\sigma^2}{N-1} + \mu^2.
\end{aligned}$$
Thus, $Cov(y_i, y_j) = E(y_i y_j) - \mu^2 = -\dfrac{\sigma^2}{N-1}$.
Theorem 2.2.2
$$E(\bar y) = \mu, \qquad Var(\bar y) = \frac{\sigma^2}{n}\left(\frac{N-n}{N-1}\right).$$

Proof. Note $\bar y = \frac{1}{n}(y_1 + \cdots + y_n)$. So
$$E(\bar y) = \frac{1}{n}(Ey_1 + \cdots + Ey_n) = \frac{1}{n}(n\mu) = \mu.$$
Now
$$\begin{aligned}
Var(\bar y) &= \frac{1}{n^2} Cov\left(\sum_{i=1}^n y_i, \sum_{j=1}^n y_j\right) = \frac{1}{n^2}\sum_{i=1}^n \sum_{j=1}^n Cov(y_i, y_j) \\
&= \frac{1}{n^2}\left(\sum_{i \neq j} Cov(y_i, y_j) + \sum_{i=j} Cov(y_i, y_j)\right) \\
&= \frac{1}{n^2}\left(\sum_{i \neq j}\left(-\frac{\sigma^2}{N-1}\right) + \sum_{i=1}^n Var(y_i)\right) \\
&= \frac{1}{n^2}\left(n(n-1)\left(-\frac{\sigma^2}{N-1}\right) + n\sigma^2\right) \\
&= \frac{\sigma^2}{n}\left((n-1)\left(-\frac{1}{N-1}\right) + 1\right) = \frac{\sigma^2}{n}\left(\frac{N-n}{N-1}\right).
\end{aligned}$$
Remark: From Theorem 2.2.2, we see that $\bar y$ is an unbiased estimator for $\mu$. Also, as n gets large (but $n \le N$), $Var(\bar y)$ tends to 0. This implies that $\bar y$ will be a more accurate estimator for $\mu$ as n gets larger (but less than N). In particular, when n = N, we have a census and $Var(\bar y) = 0$.

Remark: In our previous statistics courses, we usually sample $\{y_1, y_2, \cdots, y_n\}$ from the population with replacement. Therefore, $\{y_1, y_2, \cdots, y_n\}$ are independent and identically distributed (i.i.d.), and recall we have results like
$$E_{iid}(\bar y) = \mu, \qquad Var_{iid}(\bar y) = \frac{\sigma^2}{n}.$$
Notice that $Var_{iid}(\bar y)$ is different from $Var(\bar y)$ in Theorem 2.2.2. In fact, for $n > 1$,
$$Var(\bar y) = \frac{\sigma^2}{n}\left(\frac{N-n}{N-1}\right) < \frac{\sigma^2}{n} = Var_{iid}(\bar y).$$
Thus, for the same sample size n, sampling without replacement produces a less variable estimator of $\mu$. Why?
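The inequality can be checked by simulation; the population below (the integers 1 to 50) is hypothetical, chosen only for illustration:

```python
import random
import statistics

population = [float(i) for i in range(1, 51)]  # hypothetical population, N = 50
N, n = len(population), 10
mu = sum(population) / N
sigma2 = sum((u - mu) ** 2 for u in population) / N

var_srs = sigma2 / n * (N - n) / (N - 1)  # Theorem 2.2.2: without replacement
var_iid = sigma2 / n                      # i.i.d. case: with replacement

random.seed(1)
reps = 20000
means_wo = [statistics.mean(random.sample(population, n)) for _ in range(reps)]
means_wr = [statistics.mean(random.choices(population, k=n)) for _ in range(reps)]
# the empirical variances of the two kinds of sample means should come out close
# to var_srs and var_iid respectively, with var_srs < var_iid
```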
Summary
1. How to draw a simple random sample? (Purpose, method.) Simple random sampling is the basic survey methodology.
2. After getting an s.r.s., how to describe the population, or how to analyze the data? Estimation of the population mean. (Sample mean.)

Estimation of $\sigma^2$ and $Var(\bar y)$

The population variance $\sigma^2$ is usually unknown. Now define
$$s^2 = \frac{1}{n-1}\sum_{i=1}^n (y_i - \bar y)^2 = \frac{1}{n-1}\left(\sum_{i=1}^n y_i^2 - n\bar y^2\right).$$
Example. When a few data points are repeated in a data set, the results are often arrayed in a frequency table. For example, a quiz given to 25 students was graded on a 4-point scale 0, 1, 2, 3, with 3 being a perfect score. Here are the results:

Score (X)   Frequency (F)   Proportion (P)
3           16              0.64
2           4               0.16
1           2               0.08
0           3               0.12

(a) Calculate the average score by using frequencies.
(b) Calculate the average score by using proportions.
(c) Calculate the standard deviation.

Solution. If the above 25 students constitute a random sample, then
$$s^2 = \frac{n}{n-1} \times 1.0976 = \frac{25}{24} \times 1.0976 = 1.1433.$$
Let us look at some properties of $s^2$. Is it unbiased?
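The three parts can be computed directly from the table; a sketch:

```python
scores = [3, 2, 1, 0]  # quiz scores from the table above
freqs = [16, 4, 2, 3]  # frequencies
n = sum(freqs)         # 25 students

# (a) average via frequencies: sum(x * f) / n
mean = sum(x * f for x, f in zip(scores, freqs)) / n

# (b) average via proportions: sum(x * p) with p = f / n
mean_p = sum(x * (f / n) for x, f in zip(scores, freqs))

# (c) sample variance s^2 and standard deviation
s2 = sum(f * (x - mean) ** 2 for x, f in zip(scores, freqs)) / (n - 1)
sd = s2 ** 0.5
```

Both averages come out to 2.32, and $s^2 \approx 1.1433$, matching the solution above.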
Theorem 2.2.3
$$E(s^2) = \frac{N}{N-1}\sigma^2.$$
Proof.
$$\begin{aligned}
Es^2 &= \frac{1}{n-1}\left(\sum_{i=1}^n Ey_i^2 - nE\bar y^2\right) \\
&= \frac{1}{n-1}\left(\sum_{i=1}^n \left[Var(y_i) + (Ey_i)^2\right] - n\left[Var(\bar y) + (E\bar y)^2\right]\right) \\
&= \frac{1}{n-1}\left(n\left[\sigma^2 + \mu^2\right] - n\left[\frac{\sigma^2}{n}\left(\frac{N-n}{N-1}\right) + \mu^2\right]\right) \\
&= \frac{n\sigma^2}{n-1}\left[1 - \frac{1}{n}\left(\frac{N-n}{N-1}\right)\right] = \frac{n\sigma^2}{n-1}\left(\frac{nN - n - (N-n)}{n(N-1)}\right) = \frac{N\sigma^2}{N-1}.
\end{aligned}$$
The next theorem is an easy consequence of the last theorem.

Theorem 2.2.4 $\hat\sigma^2 := \frac{N-1}{N}s^2$ is an unbiased estimator of $\sigma^2$, i.e.,
$$E\left(\frac{N-1}{N}s^2\right) = \sigma^2.$$

We shall define
$$f = \frac{n}{N}$$
to be the sampling fraction, and
$$1 - f = 1 - \frac{n}{N}$$
to be the finite population correction (abbreviated fpc).
Then we have the following theorem.
Theorem 2.2.5 An unbiased estimator for $Var(\bar y)$ is
$$\widehat{Var}(\bar y) = \frac{s^2}{n}(1-f).$$

Proof.
$$E\,\widehat{Var}(\bar y) = \frac{Es^2}{n}(1-f) = \frac{N\sigma^2}{n(N-1)}\left(1 - \frac{n}{N}\right) = \frac{\sigma^2}{n}\left(\frac{N-n}{N-1}\right) = Var(\bar y).$$
Confidence intervals for µ

It can be shown that the sample average $\bar y$ under simple random sampling is approximately normally distributed provided n is large (≥ 30, say) and f = n/N is not too close to 0 or 1.
Central limit theorem: If $n \to \infty$ such that $n/N \to \lambda \in (0,1)$, then
$$\frac{\bar y - \mu}{\sqrt{Var(\bar y)}} \sim N(0,1) \text{ approximately}.$$
If $Var(\bar y)$ is replaced by its estimator $\widehat{Var}(\bar y)$, we still have
$$\frac{\bar y - \mu}{\sqrt{\widehat{Var}(\bar y)}} \sim N(0,1) \text{ approximately, as } n/N \to \lambda > 0.$$
Thus,
$$1 - \alpha \approx P\left(\left|\frac{\bar y - \mu}{\sqrt{\widehat{Var}(\bar y)}}\right| \le z_{\alpha/2}\right) = P\left(\bar y - z_{\alpha/2}\sqrt{\widehat{Var}(\bar y)} \le \mu \le \bar y + z_{\alpha/2}\sqrt{\widehat{Var}(\bar y)}\right).$$
Therefore, an approximate $(1-\alpha)$ confidence interval for $\mu$ is
$$\bar y \mp z_{\alpha/2}\sqrt{\widehat{Var}(\bar y)} = \bar y \mp z_{\alpha/2}\frac{s}{\sqrt n}\sqrt{1-f}.$$
$B := z_{\alpha/2}\sqrt{\widehat{Var}(\bar y)}$ is called the bound on the error of estimation.
Example. Suppose that an s.r.s. of size n = 200 is taken from a population of size N = 1000, resulting in $\bar y = 94$ and $s^2 = 400$. Find a 95% C.I. for µ.

Solution.
$$94 \mp 1.96 \times \frac{20}{\sqrt{200}}\sqrt{1 - 1/5} = 94 \mp 2.479.$$
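The interval computation can be wrapped in a small helper (a sketch; the function and argument names are my own):

```python
import math

def srs_mean_ci(ybar, s2, n, N, z=1.96):
    """Approximate (1 - alpha) CI for mu under s.r.s. (default z for 95%)."""
    f = n / N                         # sampling fraction
    se = math.sqrt(s2 / n * (1 - f))  # sqrt of the estimated Var(ybar)
    bound = z * se                    # bound on the error of estimation
    return ybar - bound, ybar + bound

lo, hi = srs_mean_ci(ybar=94, s2=400, n=200, N=1000)  # ≈ (91.52, 96.48)
```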
Example. A simple random sample of n = 100 water meters within a community is monitored to estimate the average daily water consumption per household over a specified dry spell. The sample mean and variance are found to be $\bar y = 12.5$ and $s^2 = 1252$. If we assume that there are N = 10,000 households within the community, estimate µ, the true average daily consumption, and find a 95% confidence interval for µ.

Solution
2.3 Selecting the sample size for estimating the population mean

We have seen that $Var(\bar y) = \frac{\sigma^2}{n}\left(\frac{N-n}{N-1}\right)$. So the bigger the sample size n is (but ≤ N), the more accurate our estimate $\bar y$ is. It is of interest to find the minimum n such that our estimate is within an error bound B with certain probability $1-\alpha$, say, i.e.,
$$P(|\bar y - \mu| < B) \approx 1 - \alpha.$$
By the central limit theorem,
$$P\left(\frac{|\bar y - \mu|}{\sqrt{Var(\bar y)}} < \frac{B}{\sqrt{Var(\bar y)}}\right) \approx 1 - \alpha.$$
Thus,
$$\frac{B}{\sqrt{Var(\bar y)}} \approx z_{\alpha/2} \iff \frac{\sigma^2}{n}\left(\frac{N-n}{N-1}\right) = \frac{B^2}{z_{\alpha/2}^2} = D$$
$$\iff \frac{N}{n} - 1 = \frac{(N-1)D}{\sigma^2} \iff \frac{N}{n} = 1 + \frac{(N-1)D}{\sigma^2} = \frac{(N-1)D + \sigma^2}{\sigma^2}.$$
Therefore,
$$n \approx \frac{N\sigma^2}{(N-1)D + \sigma^2}, \quad \text{where } D = \frac{B^2}{z_{\alpha/2}^2}.$$
Remark 1: If α = 5%, then $z_{\alpha/2} = 1.96 \approx 2$, so $D \approx \frac{B^2}{4}$. This coincides with the formula in the textbook (page 93).

Remark 2: The above formula requires knowledge of the population variance $\sigma^2$, which is typically unknown in practice. However, we can approximate $\sigma^2$ by the following methods:
1) from pilot studies,
2) from previous surveys,
3) from other studies.
e.g. Suppose that a total of 1500 students are to graduate next year. Determine the sample size n needed to ensure that the sample average starting salary is within $40 of the population average with probability at least 0.9. From previous studies, we know that the standard deviation of the starting salary is approximately $400.

Solution.
$$n = \frac{1500 \times 400^2}{1499 \times 40^2/1.645^2 + 400^2} = 229.37 \approx 230.$$
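The computation follows the boxed formula; a sketch (z = 1.645 here, for probability 0.9):

```python
import math

def srs_mean_sample_size(N, sigma2, B, z):
    """Minimum n so that P(|ybar - mu| < B) is roughly 1 - alpha under s.r.s."""
    D = B ** 2 / z ** 2
    return N * sigma2 / ((N - 1) * D + sigma2)

n = srs_mean_sample_size(N=1500, sigma2=400 ** 2, B=40, z=1.645)
n_required = math.ceil(n)  # 229.37... rounds up to 230
```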
e.g. Example 4.5 (p. 94, 5th edition). The average amount of money µ for a hospital's accounts receivable must be estimated. Although no prior data are available to estimate the population variance $\sigma^2$, it is known that most accounts lie within a $100 range. There are 1000 open accounts. Find the sample size needed to estimate µ with a bound on the error of estimation B = $3 with probability 0.95.

Remark. The solution depends on how one interprets "most accounts": whether it means 70%, 90%, 95% or 99% of all accounts.

Solution. We need an estimate of $\sigma^2$. For the normal distribution $N(0, \sigma^2)$, we have
$$P(|N(0,\sigma^2)| \le 1.96\sigma) = P(|N(0,1)| \le 1.96) = 95\%, \qquad P(|N(0,\sigma^2)| \le 3\sigma) = P(|N(0,1)| \le 3) = 99.87\%.$$
So 95% of accounts lie within (approximately) a 4σ range and 99.87% of accounts lie within a 6σ range. Here B = 3 and N = 1000.
If "most" means 95%, we take 2 × (2σ) = 100, so σ = 25. Then n = 210.76 ≈ 211.
If "most" means 99.87%, we take 2 × (3σ) = 100, so σ = 50/3. Then n ≈ 107.
2.3.1 A quick summary on estimation of the population mean

The population mean is defined to be
$$\mu = \frac{1}{N}(u_1 + u_2 + \cdots + u_N).$$
Suppose a simple random sample is $\{y_1, \ldots, y_n\}$.
1) Estimators of the population mean $\mu$ and variance $\sigma^2$ are
$$\hat\mu = \bar y = \frac{1}{n}\sum_{i=1}^n y_i, \qquad s^2 = \frac{1}{n-1}\sum_{i=1}^n (y_i - \bar y)^2.$$
2) The mean and variance of $\bar y$ are
$$E\bar y = \mu, \qquad Var(\bar y) = \frac{\sigma^2}{n}\left(\frac{N-n}{N-1}\right).$$
3) An estimator of the variance of $\bar y$ is
$$\widehat{Var}(\bar y) = \frac{s^2}{n}(1-f), \quad \text{where } f = n/N.$$
4) An approximate $(1-\alpha)$ confidence interval for $\mu$ is
$$\bar y \mp z_{\alpha/2}\sqrt{\widehat{Var}(\bar y)} = \bar y \mp z_{\alpha/2}\frac{s}{\sqrt n}\sqrt{1-f}.$$
5) The minimum sample size n needed to have an error bound B with probability $1-\alpha$ is
$$n \approx \frac{N\sigma^2}{(N-1)D + \sigma^2}, \quad \text{where } D = \frac{B^2}{z_{\alpha/2}^2}.$$
2.3.2 Estimation of the population total

The population total is defined to be
$$\tau = u_1 + u_2 + \cdots + u_N = N\mu.$$
Suppose a simple random sample is $\{y_1, \ldots, y_n\}$.
1) An estimator of the population total $\tau$ is
$$\hat\tau = N\bar y.$$
2) The mean and variance of $\hat\tau$ are
$$E\hat\tau = \tau, \qquad Var(\hat\tau) = N^2\frac{\sigma^2}{n}\left(\frac{N-n}{N-1}\right).$$
3) An estimator of the variance of $\hat\tau$ is
$$\widehat{Var}(\hat\tau) = \widehat{Var}(N\bar y) = N^2\frac{s^2}{n}(1-f).$$
Central limit theorem: If $n \to \infty$ such that $n/N \to \lambda \in (0,1)$, then
$$\frac{\hat\tau - \tau}{\sqrt{Var(\hat\tau)}} \sim N(0,1) \text{ approximately}.$$
If $Var(\hat\tau)$ is replaced by its estimator $\widehat{Var}(\hat\tau)$, we still have
$$\frac{\hat\tau - \tau}{\sqrt{\widehat{Var}(\hat\tau)}} \sim N(0,1) \text{ approximately, as } n/N \to \lambda > 0.$$
Thus,
$$1 - \alpha \approx P\left(\left|\frac{\hat\tau - \tau}{\sqrt{\widehat{Var}(\hat\tau)}}\right| \le z_{\alpha/2}\right),$$
and $B := z_{\alpha/2}\sqrt{\widehat{Var}(\hat\tau)} = N z_{\alpha/2}\sqrt{\widehat{Var}(\bar y)}$ is called the bound on the error of estimation.
4) An approximate $(1-\alpha)$ confidence interval for $\tau$ is
$$\hat\tau \mp z_{\alpha/2}\sqrt{\widehat{Var}(\hat\tau)} = \hat\tau \mp z_{\alpha/2} N\frac{s}{\sqrt n}\sqrt{1-f} = N\left(\bar y \mp z_{\alpha/2}\frac{s}{\sqrt n}\sqrt{1-f}\right).$$
5) The minimum sample size n needed to have an error bound B with probability $1-\alpha$ is
$$n \approx \frac{N\sigma^2}{(N-1)D + \sigma^2}, \quad \text{where } D = \frac{B^2}{N^2 z_{\alpha/2}^2}.$$
Example 4.6 (page 95 of the textbook). An investigator is interested in estimating the total weight gain in 0 to 4 weeks for N = 1000 chicks fed on a new ration. Obviously, to weigh each bird would be time-consuming and tedious. Therefore, determine the number of chicks to be sampled in this study in order to estimate τ within a bound on the error of estimation equal to 1000 grams with probability 95%. Many similar studies on chick nutrition have been run in the past. Using data from these studies, the investigator found that $\sigma^2$, the population variance, was approximately 36.00 (grams)². Determine the required sample size.

Solution
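The notes leave this solution as an exercise; a sketch of the computation using the formula in item 5 (taking z = 1.96 for probability 0.95):

```python
import math

def srs_total_sample_size(N, sigma2, B, z=1.96):
    """Minimum n to estimate the total tau = N * mu within bound B."""
    D = B ** 2 / (N ** 2 * z ** 2)
    return N * sigma2 / ((N - 1) * D + sigma2)

n = srs_total_sample_size(N=1000, sigma2=36.0, B=1000)
n_required = math.ceil(n)
```

With z = 1.96 this gives n ≈ 121.6, i.e. about 122 chicks; the textbook's z ≈ 2 shortcut gives a slightly different value.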
2.4 Estimation of the population proportion

Suppose we are interested in the proportion p of the population with a specified characteristic. Let
$$y_i = \begin{cases} 1 & \text{if the $i$th element has the characteristic,} \\ 0 & \text{if not.} \end{cases}$$
It is easy to see that $E(y_i) = E(y_i^2) = p$ (why?). Therefore, we have
$$\mu = E(y_i) = p, \qquad \sigma^2 = Var(y_i) = p - p^2 = pq, \text{ where } q = 1 - p.$$
The total number of elements in the sample of size n possessing the specified characteristic is $\sum_{i=1}^n y_i$. Therefore,
1. An estimator of the population proportion p is
$$\hat p = \bar y = \frac{\sum_{i=1}^n y_i}{n},$$
and an estimator of the population variance $\sigma^2 = pq$ is
$$s^2 = \frac{1}{n-1}\sum_{i=1}^n (y_i - \bar y)^2 = \frac{1}{n-1}\left(\sum_{i=1}^n y_i^2 - n\bar y^2\right) = \frac{1}{n-1}\left(n\hat p - n\hat p^2\right) = \frac{n}{n-1}\hat p\hat q, \text{ where } \hat q = 1 - \hat p.$$
From Theorems 2.2.2 and 2.2.3, we have
$$E(\hat p) = p, \qquad E(s^2) = \frac{N}{N-1}\sigma^2 = \frac{N}{N-1}pq. \tag{4.1}$$
2. Again, from Theorem 2.2.2, the variance of $\hat p$ is
$$Var(\hat p) = \frac{\sigma^2}{n}\left(\frac{N-n}{N-1}\right) = \frac{pq}{n}\left(\frac{N-n}{N-1}\right).$$
3. From equation (4.1) and Theorem 2.2.5, an estimator of the variance of $\hat p$ is
$$\widehat{Var}(\hat p) = \frac{s^2}{n}(1-f) = \frac{\hat p\hat q}{n-1}(1-f).$$
4. An approximate $(1-\alpha)$ confidence interval for p is
$$\hat p \mp z_{\alpha/2}\sqrt{\widehat{Var}(\hat p)} = \hat p \mp z_{\alpha/2}\sqrt{\frac{\hat p\hat q}{n-1}}\sqrt{1-f}.$$
5. The minimum sample size n required to estimate p such that our estimate $\hat p$ is within an error bound B with probability $1-\alpha$ is
$$n \approx \frac{Npq}{(N-1)D + pq}, \quad \text{where } D = \frac{B^2}{z_{\alpha/2}^2}.$$
Note that the right-hand side is an increasing function of $\sigma^2 = pq$.
a) p is often unknown, so we can replace it by some estimate (from a previous study, a pilot study, etc.).
b) If we don't have an estimate of p, we can replace it by p = 1/2, so that pq = 1/4 (the most conservative choice, since pq ≤ 1/4).
e.g. Suppose that a small town has a population of N = 800 people. Let p = the proportion of people with blood type A.
(1) What sample size n must be drawn in order to estimate p to within 0.04 with probability 0.95?
(2) Suppose we know that no more than 10% of the population have blood type A. Find n again as in (1). Comment on the difference between (1) and (2).
(3) A simple random sample of size n = 200 is taken and it is found that 7% of the sample has blood type A. Find a 90% confidence interval for p.

Solution. N = 800, α = 0.05, B = 0.04.
(1) Taking p = 1/2 in the formula, we get n = 344.
(2) p ≤ 0.10, so $\sigma^2 = pq \le 0.09$. A simple calculation yields n = 171. Since the bound on pq is smaller than the conservative value 1/4, a smaller sample suffices.
(3) (0.044, 0.096).
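All three parts can be reproduced with two small helpers (a sketch; the function names are my own):

```python
import math

def srs_prop_sample_size(N, B, z, p=0.5):
    """Minimum n to estimate p within bound B; p = 1/2 is the conservative default."""
    D = B ** 2 / z ** 2
    return N * p * (1 - p) / ((N - 1) * D + p * (1 - p))

def srs_prop_ci(phat, n, N, z):
    """Approximate CI for p: phat -/+ z * sqrt(phat*qhat/(n-1)) * sqrt(1-f)."""
    se = math.sqrt(phat * (1 - phat) / (n - 1) * (1 - n / N))
    return phat - z * se, phat + z * se

n1 = math.ceil(srs_prop_sample_size(800, 0.04, 1.96))          # (1): 344
n2 = math.ceil(srs_prop_sample_size(800, 0.04, 1.96, p=0.10))  # (2): 171
lo, hi = srs_prop_ci(0.07, 200, 800, 1.645)                    # (3): ≈ (0.044, 0.096)
```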
Example. A simple random sample of n = 40 college students was interviewed to determine the proportion of students in favor of converting from the semester to the quarter system. Of these, 25 students answered affirmatively. Estimate p, the proportion of students on campus in favor of the change. (Assume N = 2000.) Find a 95% confidence interval for p.

Solution
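A sketch of the computation for this example:

```python
import math

n, N, z = 40, 2000, 1.96
phat = 25 / n                                        # 0.625
var_hat = phat * (1 - phat) / (n - 1) * (1 - n / N)  # estimated Var(phat)
bound = z * math.sqrt(var_hat)                       # ≈ 0.150
ci = (phat - bound, phat + bound)                    # ≈ (0.475, 0.775)
```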
2.5 Comparing estimates

Suppose $x_1, \cdots, x_m$ is a random sample from a population with mean $\mu_x$ and $y_1, \cdots, y_n$ is a random sample from a population with mean $\mu_y$. We are interested in the difference of means $\mu_y - \mu_x$, which can be estimated without bias by $\bar y - \bar x$, since
$$E(\bar y - \bar x) = \mu_y - \mu_x.$$
Further,
$$Var(\bar y - \bar x) = Var(\bar y) + Var(\bar x) - 2Cov(\bar y, \bar x).$$
Remark: If the two samples $x_1, \cdots, x_m$ and $y_1, \cdots, y_n$ are independent, then $Cov(\bar y, \bar x) = 0$. However, a more interesting case is when the two samples are dependent, as illustrated in the following example.
A dependent example

Suppose an opinion poll asks n people the question "Do you favor abortion?" The possible answers are
YES, NO, NO OPINION.
Let the proportions of people who answer 'YES', 'NO', 'NO OPINION' be $p_1$, $p_2$ and $p_3$, respectively. In particular, we are interested in comparing $p_1$ and $p_2$ by looking at $p_1 - p_2$. Clearly, the estimates of $p_1$ and $p_2$ are dependent proportions, since if one is high, the other is likely to be low.

Let $\hat p_1$, $\hat p_2$ and $\hat p_3$ be the three respective sample proportions in the sample of size n. Then $X = n\hat p_1$, $Y = n\hat p_2$ and $Z = n\hat p_3$ follow a multinomial distribution with parameters $(n, p_1, p_2, p_3)$. That is,
$$P(X = x, Y = y, Z = z) = \binom{n}{x, y, z} p_1^x p_2^y p_3^z = \frac{n!}{x!\,y!\,z!} p_1^x p_2^y p_3^z.$$
Please note that
$$\sum_{x \ge 0,\, y \ge 0,\, x+y+z = n} \frac{n!}{x!\,y!\,z!} p_1^x p_2^y p_3^z = 1.$$
Question: What is the distribution of X? (Hint: classify the people into "Yes" and "Not Yes".)

Theorem 2.5.1
$$E(X) = np_1, \quad E(Y) = np_2, \quad E(Z) = np_3,$$
$$Var(X) = np_1q_1, \quad Var(Y) = np_2q_2,$$
$$Cov(X, Y) = -np_1p_2.$$
Pro<strong>of</strong>. X = number <strong>of</strong> people saying “YES” ∼ Bin(n, p 1 ). So EX = np 1 , V ar(X) = np 1 q 1 .<br />
Now Cov(X, Y ) = E(XY ) − (EX)(EY ) = E(XY ) − n 2 p 1 p 2 . But<br />
$$\begin{aligned}
E(XY) &= \sum_{x,\,y \ge 0,\; x+y \le n} xy\, P(X = x, Y = y) \\
&= \sum_{x,\,y \ge 1,\; x+y \le n} xy\, P(X = x, Y = y, Z = n-x-y) \\
&= \sum_{x,\,y \ge 1,\; x+y \le n} xy\, \frac{n!}{x!\, y!\, (n-x-y)!}\, p_1^x p_2^y p_3^{\,n-x-y} \\
&= \sum_{x,\,y \ge 1,\; x+y \le n} \frac{n!}{(x-1)!\, (y-1)!\, (n-x-y)!}\, p_1^x p_2^y p_3^{\,n-x-y} \\
&= n(n-1)p_1 p_2 \sum_{x-1,\,y-1 \ge 0,\; (x-1)+(y-1) \le n-2} \frac{(n-2)!}{(x-1)!\, (y-1)!\, ((n-2)-(x-1)-(y-1))!}\, p_1^{\,x-1} p_2^{\,y-1} p_3^{\,(n-2)-(x-1)-(y-1)} \\
&= n(n-1)p_1 p_2 \sum_{x_1,\,y_1 \ge 0,\; x_1+y_1 \le n-2} \frac{(n-2)!}{x_1!\, y_1!\, ((n-2)-x_1-y_1)!}\, p_1^{\,x_1} p_2^{\,y_1} p_3^{\,(n-2)-x_1-y_1} \\
&= n(n-1)p_1 p_2 = n^2 p_1 p_2 - n p_1 p_2.
\end{aligned}$$

Therefore, Cov(X, Y) = E(XY) − n²p_1p_2 = −np_1p_2.
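Theorem 2.5.1 can also be sanity-checked by brute-force enumeration over all multinomial outcomes for a small n; the parameter values below are arbitrary illustrations:

```python
from math import factorial

def pmf(x, y, z, p1, p2, p3):
    n = x + y + z
    return factorial(n) / (factorial(x) * factorial(y) * factorial(z)) \
        * p1**x * p2**y * p3**z

n, p1, p2, p3 = 8, 0.4, 0.35, 0.25
EX = EY = EXY = 0.0
for x in range(n + 1):
    for y in range(n - x + 1):
        prob = pmf(x, y, n - x - y, p1, p2, p3)
        EX += x * prob
        EY += y * prob
        EXY += x * y * prob

assert abs(EX - n * p1) < 1e-10                       # E(X) = n p1
assert abs(EXY - n * (n - 1) * p1 * p2) < 1e-10       # E(XY) = n(n-1) p1 p2
assert abs((EXY - EX * EY) + n * p1 * p2) < 1e-10     # Cov(X, Y) = -n p1 p2
```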
Theorem 2.5.2

E(p̂_1) = p_1, E(p̂_2) = p_2,
Var(p̂_1) = p_1q_1/n, Var(p̂_2) = p_2q_2/n,
Cov(p̂_1, p̂_2) = −p_1p_2/n.
19
Proof. Note that p̂_1 = X/n and p̂_2 = Y/n, and apply the last theorem.
From the last theorem, we have

$$\mathrm{Var}(\hat p_1 - \hat p_2) = \mathrm{Var}(\hat p_1) + \mathrm{Var}(\hat p_2) - 2\,\mathrm{Cov}(\hat p_1, \hat p_2) = \frac{p_1 q_1}{n} + \frac{p_2 q_2}{n} + \frac{2 p_1 p_2}{n}.$$

One estimator of Var(p̂_1 − p̂_2) is

$$\widehat{\mathrm{Var}}(\hat p_1 - \hat p_2) = \frac{\hat p_1 \hat q_1}{n} + \frac{\hat p_2 \hat q_2}{n} + \frac{2 \hat p_1 \hat p_2}{n}.$$
Is it unbiased? No! An unbiased estimator of the variance of p̂_1 is

$$\widehat{\mathrm{Var}}(\hat p_1) = (1 - f)\,\frac{\hat p_1 \hat q_1}{n - 1}.$$

Also, E(p̂_1 p̂_2) = E(XY)/n² = p_1 p_2 (1 − 1/n) implies that an unbiased estimator of p_1 p_2 is p̂_1 p̂_2 (1 − 1/n)^{−1}. So

$$\widehat{\mathrm{Var}}(\hat p_1) + \widehat{\mathrm{Var}}(\hat p_2) + 2 n^{-1} \hat p_1 \hat p_2 (1 - 1/n)^{-1}$$

is an unbiased estimator of Var(p̂_1 − p̂_2). In practice, however, it is easier to use

$$\widehat{\mathrm{Var}}(\hat p_1 - \hat p_2) = \frac{\hat p_1 \hat q_1}{n} + \frac{\hat p_2 \hat q_2}{n} + \frac{2 \hat p_1 \hat p_2}{n}.$$
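The correction factor (1 − 1/n)^{−1} can be verified by exact enumeration for a small n; a sketch with arbitrary illustrative parameter values:

```python
from math import factorial

n, p1, p2, p3 = 7, 0.5, 0.3, 0.2
expect = 0.0  # exact value of E[p1_hat * p2_hat]
for x in range(n + 1):
    for y in range(n - x + 1):
        z = n - x - y
        prob = factorial(n) / (factorial(x) * factorial(y) * factorial(z)) \
            * p1**x * p2**y * p3**z
        expect += (x / n) * (y / n) * prob

# E[p1_hat * p2_hat] = p1 p2 (1 - 1/n), so dividing by (1 - 1/n) removes the bias
assert abs(expect - p1 * p2 * (1 - 1 / n)) < 1e-12
assert abs(expect / (1 - 1 / n) - p1 * p2) < 1e-12
```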
Therefore, an approximate (1 − α) confidence interval for p_1 − p_2 is

$$(\hat p_1 - \hat p_2) \mp z_{\alpha/2} \sqrt{\widehat{\mathrm{Var}}(\hat p_1 - \hat p_2)} = (\hat p_1 - \hat p_2) \mp z_{\alpha/2} \sqrt{\frac{\hat p_1 \hat q_1}{n} + \frac{\hat p_2 \hat q_2}{n} + \frac{2 \hat p_1 \hat p_2}{n}}.$$
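As a sketch, this interval can be wrapped in a small helper; diff_ci is an illustrative name, not from the notes, and z defaults to 1.96 (α = 0.05):

```python
from math import sqrt

def diff_ci(p1_hat, p2_hat, n, z=1.96):
    """Approximate CI for p1 - p2 when p1_hat, p2_hat are dependent
    multinomial proportions from one sample of size n."""
    var_hat = (p1_hat * (1 - p1_hat) / n
               + p2_hat * (1 - p2_hat) / n
               + 2 * p1_hat * p2_hat / n)  # PLUS sign: Cov(p1_hat, p2_hat) < 0
    half = z * sqrt(var_hat)
    return p1_hat - p2_hat - half, p1_hat - p2_hat + half
```

For example, diff_ci(0.29, 0.34, 600) gives roughly (−0.113, 0.013). Note the design choice: the negative covariance makes the interval wider than the naive independent-samples formula would suggest.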
e.g. (From the textbook.) Should smoking be banned from the workplace? A Time/Yankelovich poll of 800 adult Americans carried out on April 6–7, 1994 gave the following results.

                   Nonsmokers   Smokers
  Banned               44%         8%
  Special areas        52%        80%
  No restrictions       3%        11%
Based on a sample of 600 nonsmokers and 200 smokers, estimate and construct a 95% C.I. for

(1) the true difference between the proportions choosing "Banned" among nonsmokers and among smokers;

(2) the true difference between the proportions of nonsmokers choosing "Banned" and "Special areas".
Solution

A. The two proportions choosing "Banned" come from independent samples (nonsmokers and smokers), so a high value of one does not force a low value of the other. Thus, an appropriate estimate of this difference is

$$0.44 - 0.08 \pm 2 \sqrt{\frac{0.44 \times 0.56}{600} + \frac{0.08 \times 0.92}{200}} = 0.36 \pm 0.06.$$
B. The proportion of nonsmokers choosing "Special areas" is dependent on the proportion choosing "Banned": if the latter is large, the former must be small. These are multinomial proportions. Thus, an appropriate estimate of this difference is

$$0.52 - 0.44 \pm 2 \sqrt{\frac{0.44 \times 0.56}{600} + \frac{0.52 \times 0.48}{600} + \frac{2 \times 0.44 \times 0.52}{600}} = 0.08 \pm 0.08.$$
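The two margins of error above can be reproduced in a few lines (the factor 2 is the rough z_{0.025} used in the textbook solution):

```python
from math import sqrt

# A: "Banned", nonsmokers (n = 600) vs smokers (n = 200) -- independent samples
var_a = 0.44 * 0.56 / 600 + 0.08 * 0.92 / 200
# B: "Special areas" vs "Banned" among the 600 nonsmokers -- dependent proportions
var_b = 0.44 * 0.56 / 600 + 0.52 * 0.48 / 600 + 2 * 0.44 * 0.52 / 600

print(round(2 * sqrt(var_a), 2))  # → 0.06
print(round(2 * sqrt(var_b), 2))  # → 0.08
```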
Example. The major league baseball season in the US came to an abrupt end in the middle of 1994. In a poll of 600 adult Americans, 29% blamed the players for the strike, 34% blamed the owners, and the rest held various other opinions. Does the evidence suggest that the true proportions who blame the players and the owners, respectively, are really different?

p_1: the proportion of Americans who blamed the players.
p_2: the proportion of Americans who blamed the owners.
$$\widehat{\mathrm{Var}}(\hat p_1 - \hat p_2) = \frac{\hat p_1 \hat q_1}{n} + \frac{\hat p_2 \hat q_2}{n} + \frac{2 \hat p_1 \hat p_2}{n} = \frac{0.29 \times 0.71}{600} + \frac{0.34 \times 0.66}{600} + \frac{2 \times 0.29 \times 0.34}{600} = 1.0458 \times 10^{-3}.$$

So an approximate 95% C.I. for p_1 − p_2 is

$$0.29 - 0.34 \pm z_{0.025} \sqrt{\widehat{\mathrm{Var}}(\hat p_1 - \hat p_2)} = -0.05 \pm 1.96 \times 0.03234 = (-0.11339,\, 0.01339).$$
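A quick check of this computation, and of the conclusion that follows from it:

```python
from math import sqrt

n, p1_hat, p2_hat = 600, 0.29, 0.34
var_hat = (p1_hat * (1 - p1_hat) / n
           + p2_hat * (1 - p2_hat) / n
           + 2 * p1_hat * p2_hat / n)
half = 1.96 * sqrt(var_hat)
lo, hi = p1_hat - p2_hat - half, p1_hat - p2_hat + half

print(round(lo, 4), round(hi, 4))  # → -0.1134 0.0134
# 0 lies inside the interval, so the data do not give strong
# evidence that the two blame proportions really differ.
```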