14.08.2013 Views

STAT170 Workshop Notes prepared by Nan Carter for Numeracy ...

STAT170 Workshop Notes prepared by Nan Carter for Numeracy ...

STAT170 Workshop Notes prepared by Nan Carter for Numeracy ...

SHOW MORE
SHOW LESS

Create successful ePaper yourself

Turn your PDF publications into a flip-book with our unique Google optimized e-Paper software.

<strong>STAT170</strong> <strong>Workshop</strong> <strong>Notes</strong> <strong>prepared</strong> <strong>by</strong> <strong>Nan</strong> <strong>Carter</strong><br />

<strong>for</strong> <strong>Numeracy</strong> Centre, Macquarie University, 2010.<br />

Week 3 H/Y. Questions <strong>prepared</strong> <strong>for</strong> Stat170 workshop held at the <strong>Numeracy</strong> Centre,<br />

Macquarie University. Prepared <strong>by</strong> <strong>Nan</strong> <strong>Carter</strong> with examples from various sources<br />

and adapted <strong>for</strong> Stat170.<br />

The role of a statistician is to obtain in<strong>for</strong>mation about a ‘population’ using a sample.<br />

Question 1. For each of the following cases, give the<br />

• Population of interest (also called ‘the target population’)<br />

• Sample to be selected (is it a random/representative/convenient sample?)<br />

• Variable(s) of interest<br />

• Type of variable – categorical: binary, nominal, ordinal<br />

Continuous (interval)<br />

a) We are interested in the average annual income of all registered medical<br />

practitioners in the Sydney Metropolitan area. From the register we randomly<br />

select 50 names and hence collect in<strong>for</strong>mation regarding their annual incomes.<br />

b) We are interested in determining, <strong>for</strong> the residents of NSW, which (if any) of the<br />

four major blood groupings (A, B, AB, O) is the most common and which may be<br />

considered rare. A sample of 200 people is randomly selected from blood donors in<br />

all major cities and the blood group of each sample is recorded.<br />

c) A quality control supervisor in a factory is interested in the proportions of defective<br />

items produced per day. At various times during each day he randomly selects<br />

finished items so that <strong>by</strong> the end of the day he has 300 items. He then records the<br />

number of defective items found. Data is collected over 21 days.<br />

d) In a sociological study, a researcher is interested in the number of children in the<br />

family of couples who have been married <strong>for</strong> ten years. She selects a random<br />

sample of 25 couples married <strong>for</strong> that length of time, and records the number of<br />

children in the family.<br />

e) You are interested in the relationship between sex and income of all adults who<br />

reside in Sydney. You select a random sample of 75 adult males and 75 adult<br />

females from the electoral roll, and ask each person to give you their income.<br />

Question 2. For each part of question 1, state what you think would be an appropriate<br />

graphical summary to use to display the data collected.<br />

1<br />

<strong>Nan</strong> <strong>Carter</strong>: workshop notes <strong>prepared</strong> <strong>for</strong> <strong>Numeracy</strong> Centre Macquarie University.


Question 3. A certain bank has many branches throughout Australia. The age of the<br />

branch managers of a random sample of 156 of these branches is given as follows:<br />

Age Frequency<br />

30-34 8<br />

35-39 21<br />

40-44 24<br />

45-49 32<br />

50-54 40<br />

55-59 23<br />

60-64 8<br />

Give a graphical summary <strong>for</strong> these data.<br />

CHECK ANSWERS: 1.a)<br />

• Popln: All registered medical practitioners in the Sydney Metropolitan Area<br />

• Sample: the 50 randomly selected names<br />

• Variable: annual incomes<br />

• Type of data: continuous (numerical)<br />

1.b)<br />

• Popln: all residents in NSW<br />

• Sample: 200 blood donors from all major cities<br />

• Variable: major blood grouping<br />

• Type of data: categorical, nominal (A,B,AB,O)<br />

1.c)<br />

• Popln: days of production<br />

• Sample: the 21 days <strong>for</strong> which data were collected<br />

• Variable: the number of defectives found<br />

• Type of data: numerical, but is it ordinal or continuous?<br />

1.d)<br />

• Popln: all couples married <strong>for</strong> 10 years<br />

• Sample: 25 such couples<br />

• Variable: number of children in the family<br />

• Type of data: categorical, ordinal<br />

1.e)<br />

• Popln: adults residing in Sydney<br />

• Sample: 75 males and 75 females registered to vote<br />

• Variables: income and sex<br />

• Type of data: continuous <strong>for</strong> income and binary <strong>for</strong> sex.<br />

Question 2. a): histogram, boxplot, stem and leaf b): barchart (or pie chart) c):<br />

barchart or histogram d): barchart e): comparative boxplot<br />

Question 3. Histogram.<br />

2<br />

<strong>Nan</strong> <strong>Carter</strong>: workshop notes <strong>prepared</strong> <strong>for</strong> <strong>Numeracy</strong> Centre Macquarie University.


Week 4 H/Y. Exercises <strong>for</strong> <strong>STAT170</strong> workshop held at <strong>Numeracy</strong> Centre, Macquarie<br />

University; <strong>prepared</strong> <strong>by</strong> <strong>Nan</strong> <strong>Carter</strong>. Wk4_5h<br />

Observations on many variables are ‘normally distributed’ (the histogram is close to<br />

being symmetrical, and has one hump, is unimodal). We can relate the area under a<br />

curve between any two values of the variable to the probability of a random value<br />

occurring <strong>by</strong> chance. This concept of relating the probability of a value occurring <strong>by</strong><br />

chance and the area under a curve, is used <strong>for</strong> other distributions e.g. the ‘tdistribution,<br />

the chi-square distribution. By convention, the symbol z is used <strong>for</strong> a<br />

variable that is normal, with mean zero and standard deviation 1. N(0,1) is the<br />

‘standard normal variable’.<br />

Suppose we have a normal distribution with mean μ = 10, and standard deviation σ = 2.<br />

What is the probability that a value, x, selected at random from the population, will be:<br />

a) between 7 and 14? b) between 12 and 15? c)less than 6.5?; c) equal to 15 or<br />

more extreme?<br />

1. a) If 37.7% of a normal population has a standard score between zero some<br />

positive z-value, find that z-value.<br />

b) what is the probability that a z-score will lie between zero and 1.16?<br />

c) what is the probability that a randomly chosen z-score will lie between zero<br />

and 1.16?<br />

In the following questions, use the table of single-tail areas <strong>for</strong> the standardised normal<br />

curve that is given in your study guides and textbook. Y is the symbol <strong>for</strong> the normal<br />

variable.<br />

3. If μ = 5, σ = 1 and z = 2.05, what is the value of y? and<br />

if probability, p, is 0.02, what is the value of z?<br />

4. Let μ = 150 and σ = 24<br />

a) if p = 0.0188, what is z?<br />

b) if z = -1.25, what is y?<br />

b) if z = 0.42 what is p?<br />

c) what is the probability of a value of z that is more extreme than z = -2.08?<br />

Finding the probability that a value could be more extreme than a given value is<br />

used in testing nhypotheses.<br />

5. If the probability that a box of kiwi fruit is underweight is 0.1992, how many<br />

underweight boxes would you expect in a consignment of 150 boxes? If the probability<br />

were 0.0150, how many would you expect?<br />

6. a) what is the probability of a value of z that is more extreme than z = -1.43?<br />

b)If z = 0.71, what is p?<br />

c)What is the area under the curve between z = -1.43 and z = 0.71?<br />

d) If the area under the curve in the upper tail is 0.22, what is the value of z?<br />

7. a) If z = 1.00 what is p?<br />

b) If p = 0.03, what is the value of z?<br />

3<br />

<strong>Nan</strong> <strong>Carter</strong>: workshop notes <strong>prepared</strong> <strong>for</strong> <strong>Numeracy</strong> Centre Macquarie University.


CHECKS: Q1: a) 0.9104 b) 0.1525 c) 0.0401 Q2: a) 1.16<br />

b) 0.3770 c) 0.3770 Q3: 7.05 and 2.05 Q4: a) 2.08 b) 120<br />

c) 0.3372 d) 0.0188 Q5: 29 ; 27 Q6: a) 0.0764 b) 0.2389<br />

c) 0.6847 d) 0.77 Q7: a) 0.1587 b) 1.88<br />

ALL the questions on this page need to be interpreted as finding z-values, y-values or<br />

probabilities (i.e. area under the standard normal curve). The calculations that you<br />

need have already been done above; BUT, not all the above answers are needed, AND<br />

the order of the questions has been scrambled.<br />

1. The height of adult women in Australia is approx normally distributed with mean<br />

μ=160cm, and standard deviation σ=7cm.<br />

a) What is the probability that a single randomly selected woman has height<br />

between 150cm and 165cm?<br />

b) What is the probability that a randomly selected woman has a height less than<br />

or equal to your own height? (eg my height is 158cm).<br />

c) What is the greatest height exceeded <strong>by</strong> 22% of the population of adult women?<br />

2. The useful life of a certain brand of batteries is normally distributed with a mean of<br />

150 hours and a standard deviation of 24 hours. If a battery is selected at random,<br />

find the probability that its useful life is:<br />

a) less than 100 hours<br />

b) more than 120 hours<br />

c) between 160 and 200 hours.<br />

3. The weights of bags of fertiliser are normally distributed with mean 210 kg and<br />

standard deviation of 2.3 kg. For a consignment of 150 bags, what is the expected<br />

number of bags that will weigh less than 208 kg? How many of the bags weighing<br />

less that 208 kg would be expected to weigh more than 205 kg?<br />

4. A soft drink machine discharges on average 250ml per cup, with a standard<br />

deviation of 14 ml. The ml of fill is approximately normally distributed.<br />

e) What is the probability of overfilling a 264 ml cup?<br />

f) How large a cup is required so that the probability that the cup will overflow is<br />

only 3%?<br />

5. The average life of an electric sandwich maker is 5 years with a standard deviation<br />

of one year. If the manufacturer replaces free any appliance that fails while under<br />

guarantee, how long a guarantee should be offered if the profit depends on no more<br />

than 2% being replaced? (Assume lifetimes are normally distributed.)<br />

CHECKS: Q1. a) 0.6847 b) <strong>for</strong> my height of 158cm: 0.3859 c) 165.4<br />

Q2. a) 0.0188 b) 0.8944 c) 0.3184<br />

Q3. 29 ; 27 Q4. a) 0.1587 b) 276 Q5. 7 years<br />

4<br />

<strong>Nan</strong> <strong>Carter</strong>: workshop notes <strong>prepared</strong> <strong>for</strong> <strong>Numeracy</strong> Centre Macquarie University.


Week 5 H/Y. Questions on SAMPLING DISTRIBUTION of MEANS and<br />

PROPORTIONS <strong>prepared</strong> <strong>for</strong> Stat170 workshop held at the <strong>Numeracy</strong><br />

Centre, Macquarie University, <strong>by</strong> <strong>Nan</strong> <strong>Carter</strong>, with examples<br />

from various sources and adapted <strong>for</strong> Stat170.<br />

Here, we need to keep in mind that we are working with two<br />

distributions:<br />

• the distribution of the individuals in the population from<br />

which we take a random sample; this distribution has mean μ<br />

and standard deviation σ.<br />

• the distribution of means of samples that are all the same<br />

size (n); this is the sampling distribution of means of<br />

samples of size n; this distribution has mean μ and standard<br />

deviation σ/√n (also referred to as the ‘standard error’).<br />

Notice that<br />

• The two distributions have the same mean μ<br />

• the means of samples are less variable than the individuals<br />

in the ‘parent’ population (the means have s.d. σ/√n where<br />

the individuals have s.d. σ)<br />

• if you take larger samples (i.e. n is larger) the means of<br />

these samples are even less variable (s.d. σ/√n gets smaller<br />

as n increases).<br />

EXERCISE 1. From past experience it is known that IQs are<br />

normally distributed with mean μ = 100 and standard deviation<br />

σ = 14.<br />

• what are the mean and the standard deviation of the sampling<br />

distribution of means of samples of size 16?<br />

• what would be the mean and standard deviation of the<br />

sampling distribution if the samples were of size 49?<br />

EXERCISE 2. Recall that if our variable of interest in the<br />

parent population is normally distributed then the sampling<br />

distribution of means is also a normal distribution.<br />

• If we take a sample of size 16 from our population and find<br />

their IQs, what is the probability that the mean IQ <strong>for</strong> the<br />

sample will lie between 95 and 105?<br />

• If the sample were of size 49, what would the probability<br />

be?<br />

• If the sample were of size 100, what would the probability<br />

be?<br />

5<br />

<strong>Nan</strong> <strong>Carter</strong>: workshop notes <strong>prepared</strong> <strong>for</strong> <strong>Numeracy</strong> Centre Macquarie University.


EXERCISE 3. The heights of the adult males in a particular<br />

country are known to be normally distributed with a mean of<br />

1.7 metres and a standard deviation of 0.09 metres. Find the<br />

probability that<br />

• an individual man, randomly selected, has a height of less<br />

than 1.5 metres.<br />

• a random sample of 25 men has average height of more than<br />

1.72 metres.<br />

**** In all the above exercises the distribution of the variable of interest <strong>for</strong><br />

individuals in the parent population was ‘normal’. What happens when it is not<br />

normally distributed (or when its distribution is unknown)? There is a theorem that<br />

proves that if the samples are large enough then the sampling distribution of means is<br />

approximately normal (Central Limit Theorem). At Macquarie we can apply the CLT if<br />

n>25. This is a most important and useful result: we can always get a normal variable <strong>by</strong><br />

taking a large-enough sample and working with the sample mean.<br />

EXERCISE 4. A certain brand of rope is known to have a<br />

breaking strength on average of 25 kg with a standard<br />

deviation of 0.5 kg. A random sample of 50 pieces of rope is<br />

tested <strong>for</strong> breaking strength. What is the probability that<br />

the mean strength is found to be<br />

• between 24.9 and 25.1 kg?<br />

• less than 24.9 kg?<br />

EXERCISE 5. A wire company that manufactures wires <strong>for</strong> circus<br />

acts has taken a sample of 100 pieces of wire and wishes to<br />

see if the thickness of a batch of wires meets minimum<br />

specifications. Assume that the population data have a mean<br />

of μ = 0.45 cm with a standard deviation of σ= 0.03 cm.<br />

• Calculate the mean and standard error of the sampling<br />

distribution of means of such samples.<br />

• What may be said about the shape of this sampling<br />

distribution?<br />

• What is the probability that the sample mean is at least<br />

0.448 cm?<br />

EXERCISE 6. A certain trucking company wants to estimate the<br />

average tonnage of freight handled per month, and it has taken<br />

a sample of 36 months. Assume the true average tonnage per<br />

month is 225 tonnes, with SD of 30 tonnes. What is the<br />

probability that the sample mean will have a value within 7<br />

tonnes of the true mean?<br />

6<br />

<strong>Nan</strong> <strong>Carter</strong>: workshop notes <strong>prepared</strong> <strong>for</strong> <strong>Numeracy</strong> Centre Macquarie University.


EXAMPLES ON WORKING WITH PROPORTIONS.<br />

EXERCISE 7. A manufacturer claims that 95% of the components<br />

supplied <strong>for</strong> a new jet transport meet a specified rigid<br />

standard of per<strong>for</strong>mance (ie 5% do not).<br />

If a sample of 400 of the components is tested, what is the<br />

probability that 30 do NOT meet the per<strong>for</strong>mance standard?<br />

EXERCISE 8. A report claims that the percentage of<br />

miscarriages among LSD users is 12%. From the files in the<br />

maternity ward of a Sydney hospital, out of 100 pregnant women<br />

who used LSD, 24 had miscarriages. Comment on the claim made<br />

in the report<br />

EXERCISE 9.The president of a certain company believes that<br />

30% of the company’s orders come from new or first-time<br />

customers. What would be the probability that, in a random<br />

sample of 100 orders, 40 are from new customers?<br />

CHECKS (not solutions): Q1: 100 and 3.5; 100 and 2.0 Q2:<br />

0.8472 ; 0.9876 ; 0.99964 Q3: 0.0132 ; 0.1335 Q4: The<br />

distribution of breaking strength is not known, but the sample<br />

is large (50) so, <strong>by</strong> the CLT, the sample mean is normally<br />

distributed; 0.8414 ; 0.0793.<br />

Q5: 0.45 and 0.003; it is reasonably symmetric about 0.45 cm;<br />

0.7486 Q6: 0.8384 Q7: z=2.294 Q8: z = 3.693;<br />

Q9: z = 2.182.<br />

7<br />

<strong>Nan</strong> <strong>Carter</strong>: workshop notes <strong>prepared</strong> <strong>for</strong> <strong>Numeracy</strong> Centre Macquarie University.


Week 6 H/Y. Questions on CONFIDENCE INTERVALS & HYPOTHESIS<br />

TESTING <strong>prepared</strong> <strong>for</strong> Stat170 workshop held at the <strong>Numeracy</strong><br />

Centre, Macquarie University.<br />

NOTE. FOR THIS TOPIC we still need to keep in mind that we<br />

are working with two distributions:<br />

• the distribution of the individuals in the population from<br />

which we take a random sample; this distribution has mean μ<br />

and standard deviation σ.<br />

• the distribution of means of samples that are all the same<br />

size (n); this is the sampling distribution of means of<br />

samples of size n; this distribution has mean μ and standard<br />

deviation σ/√n (also referred to as the ‘standard error’).<br />

• If we need to use the sample standard deviation, s, in a<br />

test of hypothesis about a mean μ, the variable is no longer<br />

a z variable but a t-variable with n-1 degrees of freedom.<br />

Similarly, in a 95% confidence interval the z-value 1.96 is<br />

replaced <strong>by</strong> the appropriate t-value with (n-1) degrees of<br />

freedom.<br />

EXERCISE 1. The daily consumption of electric power in a<br />

certain city is known to be normally distributed, and it is<br />

claimed to have a mean of 6.5 units with a standard deviation<br />

of σ = 1.5. In order to estimate the true mean daily power<br />

consumption, the consumption was sampled on 18 randomly<br />

selected days and the mean consumption was 6.973. (a) Test<br />

the hypothesis that that daily consumption has not, on<br />

average, changed. (b)Using the sample, find a 95% confidence<br />

interval <strong>for</strong> the true mean daily power consumption, μ. How<br />

confident are you that the claim that mean daily consumption<br />

is 6.5 units is true?<br />

EXERCISE 2. Washers are sold as having an inside diameter of 1.75cm. A sample of<br />

200 washers produced <strong>by</strong> the machine has mean inside diameter of 1.77 cm. It is known<br />

from past experience that the standard deviation of the diameter of washers produced<br />

<strong>by</strong> this machine is 0.18 cm. (a) Test the hypothesis that the machine is still producing<br />

washers with an inside diameter of 1.75 cm on average. (b) Using the sample, find a<br />

95% confidence interval <strong>for</strong> the mean inside diameter of all washers currently being<br />

produced <strong>by</strong> the machine. Are you at least 95% confident that the washers do on<br />

average meet the claimed diameter size.<br />

EXERCISE 3. Telecom wants to estimate the average length of telephone calls made<br />

between two cities. Manager A claims it is close to 2 minutes, and Manager B disputes<br />

this saying it is shorter, more likely 1.5 minutes. From a sample of 36 randomly selected<br />

calls, it finds that the average length is 1.90 minutes. Historically, the standard deviation<br />

of lengths of calls has been found to be 0.53 minutes.<br />

• Using the data from the random sample, find a 95% confidence<br />

interval <strong>for</strong> the true average length of calls. Which<br />

Manager’s claim was closer? (A or B)<br />

8<br />

<strong>Nan</strong> <strong>Carter</strong>: workshop notes <strong>prepared</strong> <strong>for</strong> <strong>Numeracy</strong> Centre Macquarie University.


EXERCISE 4. Suppose we wanted to place an order <strong>for</strong> wire that<br />

had a breaking strength of 8kg on average with a standard<br />

deviation of 1.6kg. We first bought a random sample of 50<br />

lengths, measured the breaking strength of each and found the<br />

mean was 7.8kg. Could we be 95% confident that a large order<br />

would meet our specification on average?<br />

EXERCISE 5. A quality control engineer wishes to check the<br />

mean weight of potato chip bags produced on his machines. He<br />

takes a random sample of 36 bags and finds their mean weight<br />

is 199.0 grams. Assuming that the standard deviation of all<br />

weights is σ = 3.6 grams, Can he be 95% sure that the machines<br />

could be producing bags that are on average 200 grams? Check<br />

your answer using an appropriate hypothesis test.<br />

EXERCISE 6. A metallurgist claims that the mean hardness of a<br />

die-cast aluminium product is 13.7 with a standard deviation<br />

of 0.8. He gives us a random sample of 50 pieces and we find<br />

the mean hardness is 14.3. We claim these are likely to be<br />

too hard <strong>for</strong> our purposes. Are we making the correct<br />

decision?<br />

ANSWERS: Q1: (6.280,7.666) Q2: (1.75,1.79) Q3: (1.73,2.07) ; (1.67,2.13)<br />

Q4: Yes, the 95% CI is (7.36,8.24) Q5: Yes, the 95% CI is (196.8,199.2) Q6: Yes, it is<br />

possible because we are 95% confident that the true mean is between 14.08 and 14.52.<br />

EXAMPLES ON WORKING WITH PROPORTIONS. EXERCISE 7. A<br />

manufacturer claims that 95% of the components supplied <strong>for</strong> a<br />

new jet transport meet a specified rigid standard of<br />

per<strong>for</strong>mance (ie 5% do not). If a sample of 400 of the<br />

components is tested, what is the probability that 30 do NOT<br />

meet the per<strong>for</strong>mance standard?<br />

EXERCISE 8. A report claims that the percentage of<br />

miscarriages among LSD users is 12%. From the files in the<br />

maternity ward of a Sydney hospital, out of 100 pregnant women<br />

who used LSD, 24 had miscarriages. Comment on the claim made<br />

in the report.<br />

EXERCISE 9.The president of a certain company believes that<br />

30% of the company’s orders come from new or first-time<br />

customers. What would be the probability that, in a random<br />

sample of 100 orders, 40 are from new customers?<br />

CHECKS (not solutions): Q1: 100 and 3.5; 100 and 2.0 Q2:<br />

0.8472 ; 0.9876 ; 0.99964 Q3: 0.0132 ; 0.1335 Q4: The<br />

distribution of breaking strength is not known, but the sample<br />

is large (50) so, <strong>by</strong> the CLT, the sample mean is normally<br />

distributed; 0.8414 ; 0.0793. Q5: 0.45 and 0.003; it is<br />

reasonably symmetric about 0.45 cm; 0.7486 Q6: 0.8384 Q7:<br />

z=2.294 Q8: z = 3.693; Q9: z = 2.182.<br />

9<br />

<strong>Nan</strong> <strong>Carter</strong>: workshop notes <strong>prepared</strong> <strong>for</strong> <strong>Numeracy</strong> Centre Macquarie University.


Week 7 H/Y TESTING HYPOTHESES ABOUT POPULATION MEANS<br />

i. Hypotheses about a population mean, µ, using one random sample of<br />

observations, without value <strong>for</strong> σ .<br />

Without a known value <strong>for</strong> the population standard deviation, σ, the<br />

hypothesis is tested using the standard deviation, s, of the random sample.<br />

For a sample size n, the sample estimate s has (n-1) degrees of freedom. The<br />

test statistic is no longer a z-statistic, but a t-statistic.<br />

Go back to the examples in part (i) on pages 10-11, and in the description<br />

suppose that the given value of σ is the sample estimate, s. Now answer the<br />

questions Q1 to Q12 using these values of s, and the appropriate tdistribution.<br />

ii. One random sample of items, two variables observed on each<br />

item, and we are interested in the difference between the means of the<br />

variables. The two variables are said to be ‘paired’ or ‘matched’ because they<br />

are observed on the same item.<br />

PAIRED t-TEST:<br />

QUESTION 1. (from Stat171 lecture overheads, <strong>prepared</strong> <strong>by</strong> Stephen Brown)<br />

In an experiment using 15 hypertensive patients, the effect of the drug Captopril on their<br />

blood pressure is assessed <strong>by</strong> recording their blood pressure, giving them Captopril and then<br />

measuring their blood pressure again two hours later. Does the drug have any effect on blood<br />

pressure?<br />

10<br />

Patient Blood pressure Blood pressure DIFFERENCE<br />

BEFORE drug AFTER drug BEFORE - AFTER<br />

1 130 125 5<br />

2 122 121 1<br />

3 124 121 3<br />

4 104 106 -2<br />

5 112 101 11<br />

6 101 85 16<br />

7 121 98 23<br />

8 124 105 19<br />

9 115 103 12<br />

10 102 98 4<br />

11 98 90 8<br />

12 119 98 21<br />

13 106 110 -4<br />

14 107 103 4<br />

15 100 82 18<br />

Notice that<br />

• there is only one random sample of patients<br />

• there are two observations on each patient and we are interested in the difference between<br />

the two<br />

• the change is expressed as the drop in blood pressure.<br />

<strong>Nan</strong> <strong>Carter</strong>: workshop notes <strong>prepared</strong> <strong>for</strong> <strong>Numeracy</strong> Centre Macquarie University.


We now have a single sample of differences, and the researcher is expecting that there will be<br />

a drop in blood pressure after taking the Captopril.<br />

The Statistical hypothesis is that <strong>for</strong> population of treated patients the mean drop is zero.<br />

So, H0 : μd = 0<br />

Assumptions: we only assume that the differences are Normally distributed, and that the<br />

differences are independent.<br />

COMMENTS<br />

• To do a paired t-test, the experimenter must plan ahead to collect the data in the<br />

appropriate way.<br />

• As students in Stat170, you are expected to recognise paired data from the in<strong>for</strong>mation<br />

given in the question.<br />

• Remember that you will be interested in the difference between the two observations, that<br />

is, the null hypothesis will be Ho: μd = 0.<br />

• The pairing is not always as obvious as the ‘be<strong>for</strong>e’ and ‘after’ example above; see <strong>for</strong><br />

example, question 2, where we have observations on pairs of dogs, each pair taken from<br />

the same litter (dogs from the same litter have the same parents, and hence should be<br />

more alike genetically than dogs from different litters). Here, the random sample is of<br />

litters of dogs arriving at the pound.<br />

• See also question 2 in the 1998 mid-year examination, where you have a random sample<br />

of women and in<strong>for</strong>mation on their incomes and the incomes of their partners: the<br />

question relates to the difference between a woman’s and her partner’s income.<br />

• ***** ASSUMPTIONS: We need make absolutely no assumptions about the original<br />

observations. If they are not normally distributed, it does not matter, as long as the<br />

differences are. Plot the differences to check that the assumption is not seriously violated.<br />

QUESTION 2. (from Stat171 tutorial exercises)<br />

In order to investigate whether the time of neutering of male<br />

dogs had any effect on their rate of growth, Crenshaw and<br />

<strong>Carter</strong> (1995) neutered dogs at 2 months and 7 months of age<br />

and measured their weight at 18 months of age. As litters of<br />

any breed became available at a pound, two male dogs were<br />

randomly selected from each litter. One was neuterd at 2<br />

months, the other was neutered at 7 months. All the dogs were<br />

kept under identical conditions during the 18 months of the<br />

trial. The weights (in kg) of the dogs at 18 months were:<br />

Litter 1 2 3 4 5 6 7 8 9 10<br />

2 months 19.0 23.2 16.1 30.3 27.1 22.3 18.1 27.8 34.2 17.3<br />

7 months 18.9 23.5 16.7 30.3 27.8 22.4 18.2 28.1 34.1 17.7<br />

a) Is there any evidence that early neutering affects the rate of growth of the dogs?<br />

b) Comment on why you think the experimenter chose this particular design in setting up<br />

the experiment. Do you think the right choice of design was made?<br />

11<br />

<strong>Nan</strong> <strong>Carter</strong>: workshop notes <strong>prepared</strong> <strong>for</strong> <strong>Numeracy</strong> Centre Macquarie University.


QUESTION 3. A researcher wants to know the average sucrose content in a certain<br />

concentration of sugar-beet juice. There are two methods of measurement and the researcher<br />

wants to find out whether they both give the same mean dry weight. Ten sample containers<br />

of beet juice were randomly selected from ten different batches. Half the liquid in each<br />

container is measured using method A and the remaining half using method B. The<br />

measuring process yields the following data:<br />

METH<br />

OD<br />

12<br />

BATCH<br />

1 2 3 4 5 6 7 8 9 10<br />

A 11.0 5.0 9.8 5.7 6.5 8.2 5.9 6.0 7.5 5.4<br />

B 11.1 5.6 9.7 5.3 6.7 8.5 5.6 5.8 7.1 5.5<br />

Do these data support the hypothesis that both methods of<br />

measuring yield the same mean concentration?<br />

CHECKS:<br />

Q1: mean diffce = 9.267; s = 8.614; t=4.166; df=14; 0.0005


Week 8 H/Y. <strong>STAT170</strong> workshop on the comparison of the means of two<br />

samples. (Prepared <strong>by</strong> <strong>Nan</strong> <strong>Carter</strong> <strong>for</strong> workshops at the <strong>Numeracy</strong> Centre; material<br />

taken from various sources and adapted <strong>for</strong> Stat170.)<br />

TWO INDEPENDENT RANDOM SAMPLES and the SAME VARIABLE OBSERVED<br />

ON BOTH SAMPLES<br />

We are interested in whether the two samples could come from two populations with the<br />

same mean:<br />

H0: μ1 = μ2<br />

OR: to test if two samples come from the same population.<br />

THIS TEST IS ONLY VALID if both populations are NORMAL with the SAME SD.<br />

13<br />

POOL the two estimates of the unknown SD σ to find sp,<br />

then use the two-independent sample t-test.<br />

The SE(ybar1 - ybar2) = sp√(1/n1 + 1/n2), and<br />

t =(ybar1 ­ ybar2)/SE(ybar1 ­ ybar2) with (n1+n2–2) df<br />

QUESTION 1: The Chapin Social Insight Test is a psychological test designed to measure<br />

how accurately the subject appraises other people. The possible scores on the test range from<br />

0 to 41. During the development of the Chapin Test, it was given to several groups of people.<br />

Here are the results <strong>for</strong> male and female students majoring in the liberal arts:<br />

Group Sex N X-bar S.D.<br />

1 Male 133 25.34 5.05<br />

2 Female 162 24.94 5.44<br />

Do these data support the contention that female and male students differ in average social<br />

insight?<br />

QUESTION 2: Two groups of children were given visual acuity tests. Group 1 was<br />

composed of 11 children who receive their health care from private physicians. The mean<br />

score <strong>for</strong> this group was 26 with a standard deviation of 5. The second group, consisting of<br />

14 children who receive their health care from the health department, had an average score of<br />

21 with a standard deviation of 6. Assuming normally distributed scores <strong>for</strong> the populations<br />

of private and health department patients, find if the two groups of children have different<br />

scores on average. Find 90% and 95% confidence intervals <strong>for</strong> the difference between the<br />

mean scores of the two populations.<br />

QUESTION 3: An educator believes that new, directed reading activities in the classroom<br />

will help elementary school pupils improve some aspects of their reading ability. She<br />

arranges <strong>for</strong> a third grade class of 26 students to take part in these activities <strong>for</strong> an 8-week<br />

period. A control class room of 28 third graders follows the same curriculum without the<br />

activities. At the end of 8 weeks, all students are given a Degree of Reading Power (DRP)<br />

test, which measures the aspects of reading ability that the treatment is designed to improve.<br />

<strong>Nan</strong> <strong>Carter</strong>: workshop notes <strong>prepared</strong> <strong>for</strong> <strong>Numeracy</strong> Centre Macquarie University.


The summary statistics are:<br />

Group number x-bar SD<br />

Activity group 26 51.48 11.01<br />

Control group 28 41.52 17.15<br />

Is there a difference, on average, between the two groups of students? State any assumptions<br />

that you have made.<br />

ANSWERS: Q1: pooled SD = 5.2679 SE(y1bar – y2bar) = 0.6164<br />

t= 0.649 df = 293 p-value < 0.05<br />

Q2: pooled sd = 5.5873 SE(y1bar – y2bar) = 2.2512<br />

t= 2.221 0.02


Week 9 H/Y. Stat170 workshop on Hypothesis Testing, <strong>prepared</strong> <strong>by</strong> <strong>Nan</strong> <strong>Carter</strong> <strong>for</strong><br />

<strong>Numeracy</strong> Centre, Macquarie University. Examples taken from various sources and<br />

adapted <strong>for</strong> Stat170.<br />

This document is made up from workshops that cover hypothesis testing about means and<br />

proportions: the hypotheses are about<br />

i. population mean, µ, using one random sample of observations, with value <strong>for</strong> σ<br />

ii. population proportion, π, using one random sample of observations<br />

iii. population mean, µ, using one random sample of observations, without value <strong>for</strong> σ<br />

iv. two population means and interested in the difference, µd = µ1- µ2, using one<br />

random sample of items and the two variables observed on each item<br />

v. two population means using two random samples of items, without values <strong>for</strong> σ1 , σ2<br />

i. SAMPLING DISTRIBUTION of MEANS We start <strong>by</strong> working with<br />

distributions <strong>for</strong> which the standard deviation σ is known, AND we are<br />

interested in testing an hypothesis about the mean of the population using data.<br />

from a single random sample of items.<br />

NOTE 1. Recall that we need to keep in mind that we are working with two distributions:<br />

• the distribution of the individuals in the population from which we take a random sample;<br />

this distribution has mean μ and standard deviation σ.<br />

• the distribution of means of samples that are all the same size (n); this is the sampling<br />

distribution of means of samples of size n; this distribution has mean μ and standard<br />

deviation σ/√n (also referred to as the ‘standard error’).<br />

Notice that<br />

• The two distributions have the same mean μ<br />

• the means of samples are less variable than the individuals in the ‘parent’ population (the<br />

means have s.d. σ/√n where the individuals have s.d. σ)<br />

• if you take larger samples (i.e. n is larger) the means of these samples are even less<br />

variable (s.d. σ/√n gets smaller as n increases).<br />

{RECALL: the Central Limit Theorem, which proves that, if the parent distribution is<br />

NOT NORMAL (or NOT KNOWN) BUT the samples are large enough, then the sampling<br />

distribution of means is approximately normal. At Macquarie we can apply the CLT if n>25.<br />

THIS MEANS THAT, ONE WAY OR ANOTHER, WE CAN HAVE A SAMPLING<br />

DISTRIBUTION OF MEANS THAT IS NORMAL (APPROX at least).}<br />

NOTE 2. If we have a normal distribution with mean μ and standard deviation σ/√n, then<br />

95% of the values in the distribution lie in the interval from (μ - 1.96 x σ/√n to μ + 1.96 x<br />

σ/√n). This means that, if we took lots and lots of samples and found the mean of each, we<br />

would expect that 95% of those means would lie in that interval.<br />

NOTE 3. IT ALSO MEANS THAT (think this through carefully):<br />

• if we do not know μ, but have a lot of estimates ybar from samples (all of the same size,<br />

n) and <strong>for</strong> each ybar we calculate the interval (ybar - 1.96 x σ/√n to ybar + 1.96 x σ/√n),<br />

then we expect that 95% of these intervals will include the true value of μ. IN OTHER<br />

WORDS: we can expect that the true value of μ will be included in 95% of the intervals.<br />

15<br />

<strong>Nan</strong> <strong>Carter</strong>: workshop notes <strong>prepared</strong> <strong>for</strong> <strong>Numeracy</strong> Centre Macquarie University.


• SO: if we have only one sample and find its mean, ybar, we can say that we are 95%<br />

confident that the true mean μ lies within the interval (ybar - 1.96 x σ/√n to ybar + 1.96 x<br />

σ/√n). This is what is called a 95% confidence interval <strong>for</strong> the population mean μ.<br />

EXERCISE 1. The daily consumption of electric power in a certain city is known to be<br />

normally distributed with a standard deviation of σ = 1.5. In order to estimate the true<br />

mean daily power consumption, the consumption was sampled on 18 randomly selected days<br />

and the mean consumption was 6.973. Find a 95% confidence interval <strong>for</strong> the true mean<br />

power consumption, μ.<br />

EXERCISE 2. The mean inside diameter of a sample of 200<br />

washers produced <strong>by</strong> a machine is 1.77 cm. It is known from<br />

past experience that the standard deviation of the diameter of<br />

washers produced <strong>by</strong> this machine is 0.18 cm. Find a 95%<br />

confidence interval <strong>for</strong> the mean inside diameter of all<br />

washers produced <strong>by</strong> the machine.<br />

EXERCISE 3. Telecom wants to estimate the average length of<br />

telephone calls made between two cities. From a sample of 36<br />

randomly selected calls, it finds that the average length is<br />

1.90 minutes. Historically, the standard deviation of lengths<br />

of calls has been found to be 0.53 minutes.<br />

• Find a 95% confidence interval <strong>for</strong> the true mean length of calls.<br />

• (optional) What would be a 99% confidence interval?<br />

EXERCISE 4. Suppose we wanted to place an order <strong>for</strong> wire that had a breaking strength of<br />

8kg on average with a standard deviation of 1.6kg. We first bought a random sample of 50<br />

lengths, measured the breaking strength of each and found the mean was 7.8kg. Could we be<br />

95% confident that a large order would meet our specification on average?<br />

EXERCISE 5. A quality control engineer wishes to check the mean weight of potato chip<br />

bags produced on his machines. He takes a random sample of 36 bags and find their mean<br />

weight is 198 grams. Assuming that the standard deviation of all weights is σ = 3.6 grams,<br />

Can he be 95% sure that the machines are producing bags that are on average 198.8 grams?<br />

EXERCISE 6. A metallurgist claims that the mean hardness of die-cast aluminium is 13.7<br />

with a standard deviation of 0.8. He gives us a random sample of 50 pieces and we find the<br />

mean hardness is 14.3. We claim these are likely to be too hard <strong>for</strong> our purposes. Are we<br />

making the correct decision?<br />

EXERCISE 7. Go back to Q1, and use a test of hypothesis to answer the research question: is<br />

the consumption of power equal to 6.5?<br />

EXERCISE 8: In Q2 the research question is whether the mean inside diameter is on average<br />

1.80cm.<br />

EXERCISE 9: In Q3 test the hypothesis that the mean length of call is 2 minutes.<br />

EXERCISE 10: In Q4 use a hypothesis test to find if a large order would meet our<br />

specification on average.<br />

16<br />

<strong>Nan</strong> <strong>Carter</strong>: workshop notes <strong>prepared</strong> <strong>for</strong> <strong>Numeracy</strong> Centre Macquarie University.


EXERCISE 11: In Q5 use a hypothesis test to check the average weight of potato chip bags.<br />

EXERCISE 12: Use a hypothesis test to answer our question 6.<br />

CHECK ANSWERS: Q1: (6.280,7.666) Q2: (1.75,1.79) Q3: (1.73,2.07) ; (1.67,2.13)<br />

Q4: Yes, the 95% CI is (7.36,8.24) Q5: Yes, the 95% CI is (196.8,199.2) Q6: Yes, it is<br />

possible because we are 95% confident that the true mean is between 14.08 and 14.52.<br />

Q7: H 0: μ = 6.5; z = 1.33; do not reject H 0 and conclude that the mean daily<br />

consumption could be 6.5.<br />

ii. Examples on working with hypotheses about population proportions,<br />

<strong>prepared</strong> <strong>for</strong> <strong>STAT170</strong> workshop at <strong>Numeracy</strong> Centre, <strong>by</strong> <strong>Nan</strong> <strong>Carter</strong> from<br />

various sources.<br />

17<br />

A population proportion, π,is estimated <strong>by</strong> a random<br />

sample proportion, p. The population proportion has SE =<br />

√{ π(1- π) / (n-1)} and, if π is not known, SE is estimated<br />

<strong>by</strong> substituting the sample proportion p <strong>for</strong> π. The<br />

proportion in a random sample of size n can be assumed<br />

normally distributed if n π and n(1- π) are BOTH at least 5.<br />

1. A manufacturer claims that 95% of the components supplied<br />

<strong>for</strong> a new jet transport meet a specified rigid standard of<br />

per<strong>for</strong>mance (ie 5% do not). A random sample of 400 of the<br />

components is tested, and 30 do not meet the per<strong>for</strong>mance<br />

standard. Comment on the manufacturer’s claim.<br />

2. A report claims that the percentage of miscarriages among LSD users is 12%. From the<br />

files in the maternity ward of a Sydney hospital, out of 100 pregnant women who used<br />

LSD, 24 had miscarriages. Comment on the claim made in the report<br />

3. The president of a certain company believes that 30% of the<br />

company’s orders come from new or first-time customers. In<br />

a random sample of 100 orders, 40 are from new customers.<br />

Comment on the president’s claim.<br />

CHECKS: Q1 z=2.294 Q2: z = 3.693; Q3: z = 2.182;<br />

iii. Hypotheses about a population mean, µ, using one random sample of<br />

observations, without value <strong>for</strong> σ .<br />

Without a known value <strong>for</strong> the population standard deviation, σ, the<br />

hypothesis is tested using the random sample standard deviation, s. For a<br />

sample size n, the sample estimate s has (n-1) degrees of freedom. The test<br />

statistic is no longer z, but the t-statistic.<br />

Go back to the examples in part (i) on pages 2-3, and in the description<br />

suppose that the given value of σ is the sample estimate, s. Now answer the<br />

<strong>Nan</strong> <strong>Carter</strong>: workshop notes <strong>prepared</strong> <strong>for</strong> <strong>Numeracy</strong> Centre Macquarie University.


questions Q1 to Q12 using these values of s, and the appropriate tdistribution.<br />

iv. One random sample of items, two variables observed on each<br />

item, and we are interested in the difference between the means of the<br />

variables. The two variables are said to be ‘paired’ or ‘matched’ because they<br />

are observed on the same item.<br />

18<br />

<strong>Nan</strong> <strong>Carter</strong>: workshop notes <strong>prepared</strong> <strong>for</strong> <strong>Numeracy</strong> Centre Macquarie University.


PAIRED t-TEST:<br />

QUESTION 1. (from Stat171 lecture overheads, <strong>prepared</strong> <strong>by</strong> Stephen Brown)<br />

In an experiment using 15 hypertensive patients, the effect of the drug Captopril on their<br />

blood pressure is assessed <strong>by</strong> recording their blood pressure, giving them Captopril and then<br />

measuring their blood pressure again two hours later. Does the drug have any effect on blood<br />

pressure?<br />

19<br />

Patient Blood pressure Blood pressure DIFFERENCE<br />

BEFORE drug AFTER drug BEFORE - AFTER<br />

1 130 125 5<br />

2 122 121 1<br />

3 124 121 3<br />

4 104 106 -2<br />

5 112 101 11<br />

6 101 85 16<br />

7 121 98 23<br />

8 124 105 19<br />

9 115 103 12<br />

10 102 98 4<br />

11 98 90 8<br />

12 119 98 21<br />

13 106 110 -4<br />

14 107 103 4<br />

15 100 82 18<br />

Notice that<br />

• there is only one random sample of patients<br />

• there are two observations on each patient and we are interested in the difference between<br />

the two<br />

• the change is expressed as the drop in blood pressure.<br />

We now have a single sample of differences, and the researcher is expecting that there will be<br />

a drop in blood pressure after taking the Captopril.<br />

The Statistical hypothesis is that <strong>for</strong> population of treated patients the mean drop is zero.<br />

So, H0 : μd = 0<br />

Assumptions: we only assume that the differences are Normally distributed, and that the<br />

differences are independent.<br />

COMMENTS<br />

• To do a paired t-test, the experimenter must plan ahead to collect the data in the<br />

appropriate way.<br />

• As students in Stat170, you are expected to recognise paired data from the in<strong>for</strong>mation<br />

given in the question.<br />

• Remember that you will be interested in the difference between the two observations, that<br />

is, the null hypothesis will be Ho: μd = 0.<br />

<strong>Nan</strong> <strong>Carter</strong>: workshop notes <strong>prepared</strong> <strong>for</strong> <strong>Numeracy</strong> Centre Macquarie University.


• The pairing is not always as obvious as the ‘be<strong>for</strong>e’ and ‘after’ example above; see <strong>for</strong><br />

example, question 2, where we have observations on pairs of dogs, each pair taken from<br />

the same litter (dogs from the same litter have the same parents, and hence should be<br />

more alike genetically than dogs from different litters). Here, the random sample is of<br />

litters of dogs arriving at the pound.<br />

• See also question 2 in the 1998 mid-year examination, where you have a random sample<br />

of women and in<strong>for</strong>mation on their incomes and the incomes of their partners: the<br />

question relates to the difference between a woman’s and her partner’s income.<br />

• ***** ASSUMPTIONS: We need make absolutely no assumptions about the original<br />

observations. If they are not normally distributed, it does not matter, as long as the<br />

differences are. Plot the differences to check that the assumption is not seriously violated.<br />

QUESTION 2. (from Stat171 tutorial exercises)<br />

In order to investigate whether the time of neutering of male<br />

dogs had any effect on their rate of growth, Crenshaw and<br />

<strong>Carter</strong> (1995) neutered dogs at 2 months and 7 months of age<br />

and measured their weight at 18 months of age. As litters of<br />

any breed became available at a pound, two male dogs were<br />

randomly selected from each litter. One was neuterd at 2<br />

months, the other was neutered at 7 months. All the dogs were<br />

kept under identical conditions during the 18 months of the<br />

trial. The weights (in kg) of the dogs at 18 months were:<br />

Litter 1 2 3 4 5 6 7 8 9 10<br />

2 months 19.0 23.2 16.1 30.3 27.1 22.3 18.1 27.8 34.2 17.3<br />

7 months 18.9 23.5 16.7 30.3 27.8 22.4 18.2 28.1 34.1 17.7<br />

c) Is there any evidence that early neutering affects the rate of growth of the dogs?<br />

d) Comment on why you think the experimenter chose this particular design in setting up<br />

the experiment. Do you think the right choice of design was made?<br />

QUESTION 3. A researcher wants to know the average sucrose content in a certain<br />

concentration of sugar-beet juice. There are two methods of measurement and the researcher<br />

wants to find out whether they both give the same mean dry weight. Ten sample containers<br />

of beet juice were randomly selected from ten different batches. Half the liquid in each<br />

container is measured using method A and the remaining half using method B. The<br />

measuring process yields the following data:<br />

METH<br />

OD<br />

20<br />

BATCH<br />

1 2 3 4 5 6 7 8 9 10<br />

A 11.0 5.0 9.8 5.7 6.5 8.2 5.9 6.0 7.5 5.4<br />

B 11.1 5.6 9.7 5.3 6.7 8.5 5.6 5.8 7.1 5.5<br />

Do these data support the hypothesis that both methods of<br />

measuring yield the same mean concentration?<br />

<strong>Nan</strong> <strong>Carter</strong>: workshop notes <strong>prepared</strong> <strong>for</strong> <strong>Numeracy</strong> Centre Macquarie University.


QUESTION 4. “You can taste the luscious grapes in Lindeman’s<br />

Montillo Sherry” - advertisement in ‘The SUN’.<br />

On the presumption that this claim was meant to imply that this product is more appealing<br />

than its competitors, ten allegedly experienced tasters were asked to rate samples of this<br />

product and also of another sherry on a scale from 0 to 100.<br />

Taster A B C D E F G H I J<br />

Montillo 83 80 94 81 90 68 74 81 86 80<br />

Competitor 64 82 88 86 91 72 66 71 82 80<br />

Is there significant evidence of a substantial preference <strong>for</strong> Montillo over the competing<br />

brand.<br />

CHECKS:<br />

Q1: mean diffce = 9.267; s = 8.614; t=4.166; df=14; 0.0005


QUESTION 1: The Chapin Social Insight Test is a psychological test designed<br />

to measure how accurately the subject appraises other people. The possible<br />

scores on the test range from 0 to 41. During the development of the Chapin<br />

Test, it was given to several groups of people. Here are the results <strong>for</strong> male and<br />

female students majoring in the liberal arts:<br />

Group Sex N X-bar S.D.<br />

1 Male 133 25.34 5.05<br />

2 Female 162 24.94 5.44<br />

Do these data support the contention that female and male students differ in<br />

average social insight?<br />

The variable is a score on a psychological test. It is reasonable to assume the<br />

variable, score, is normally distributed.<br />

Do these data support the contention that female and male students differ in<br />

average score?<br />

CHECKS: sp = 5.2679; SE(y1bar – y2bar) = 0.6164; t = 0.649; df = 293; pvalue<br />

> 0.05; no.<br />

METHOD Two independent random samples: do the samples come from<br />

populations with the same mean?<br />

Population 1 male 2 female<br />

Mean μ1 μ2<br />

Standard deviation σ1 σ2<br />

Sample size n1 133 n 2 162<br />

Sample mean y1bar 25.34 y2bar 24.94<br />

Sample SD s1 5.05 s 2 5.44<br />

Null hypothesis H0: μ1 = μ2<br />

Assumptions:<br />

• the populations are both normal<br />

• the standard deviations are equal, or in symbols: σ1 = σ 2;<br />

• to check this second assumption use:<br />

smax / s min < about 2.<br />

The value of the t-statistic is given <strong>by</strong><br />

y1bar – y2bar<br />

t = __________________ where<br />

SE(y1bar – y2bar)<br />

SE(y1bar – y2bar) = sp √( 1/n1 + 1/n2) and<br />

22<br />

<strong>Nan</strong> <strong>Carter</strong>: workshop notes <strong>prepared</strong> <strong>for</strong> <strong>Numeracy</strong> Centre Macquarie University.


sp =√[{(n1-1)s1 2 + (n2-1)s2 2 } /{n1 + n2 – 2}]<br />

the t-statistic has (n1 + n2 - 2) degrees of freedom.<br />

QUESTION 2: Two groups of children were given visual acuity tests. Group 1<br />

was composed of 11 children who receive their health care from private<br />

physicians. The mean score <strong>for</strong> this group was 26 with a standard deviation of<br />

5. The second group, consisting of 14 children who receive their health care<br />

from the health department, had an average score of 21 with a standard<br />

deviation of 6. Assuming normally distributed scores <strong>for</strong> the populations of<br />

private and health department patients, find if the two groups of children have<br />

different scores on average. Find 95% confidence intervals <strong>for</strong> the difference<br />

between the mean scores of the two populations.<br />

QUESTION 3: An educator believes that new, directed reading activities in the<br />

classroom will help elementary school pupils improve some aspects of their<br />

reading ability. She arranges <strong>for</strong> a third grade class of 26 students to take part<br />

in these activities <strong>for</strong> an 8-week period. A control class room of 28 third<br />

graders follows the same curriculum without the activities. At the end of 8<br />

weeks, all students are given a Degree of Reading Power (DRP) test, which<br />

measures the aspects of reading ability that the treatment is designed to<br />

improve.<br />

The summary statistics are:<br />

Group number x-bar SD<br />

Activity group 26 51.48 11.01<br />

Control group 28 41.52 17.15<br />

Is there a difference, on average, between the two groups of<br />

students?<br />

ANSWERS: Q1: pooled SD = 5.2679 SE(y1bar – y2bar) = 0.6164<br />

t= 0.649 df = 293 p-value < 0.05<br />

Q2: pooled sd = 5.5873 SE(y1bar – y2bar) = 2.2512<br />

t= 2.221 0.02


Week 9 H/Y. <strong>STAT170</strong> workshop, REGRESSION ANALYSIS,<br />

held at <strong>Numeracy</strong> Centre, Macquarie University.<br />

Examples taken from various sources and adapted <strong>for</strong><br />

Stat170 <strong>by</strong> <strong>Nan</strong> <strong>Carter</strong>.<br />

EXAMPLE: Prior to opening a new shop, management<br />

requires an estimate of yearly sales revenue. Such<br />

an estimate is used in planning the appropriate shop<br />

capacity, making initial staffing decisions, and<br />

deciding whether or not the potential revenues<br />

justify the cost of the operation.<br />

Suppose management believes that the size of the<br />

student population on a near<strong>by</strong> campus is related to<br />

annual sales revenues. On an intuitive basis,<br />

management believes that shops located near larger<br />

campuses generate more revenue than those near small<br />

campuses.<br />

To evaluate the relationship between student<br />

population (X)<br />

and annual sales (Y), management collected data from<br />

a random sample of 10 of its shops located near<br />

university campuses. These data are summarised<br />

below:<br />

Shop Student<br />

population<br />

(thousands)<br />

(X)<br />

24<br />

Annual sales<br />

($thousands)<br />

(Y)<br />

1 2 58<br />

2 6 105<br />

3 8 88<br />

4 8 118<br />

5 12 117<br />

6 16 137<br />

7 20 157<br />

8 20 169<br />

9 22 149<br />

10 26 202<br />

MEAN 14.0 130.0<br />

<strong>Nan</strong> <strong>Carter</strong>: workshop notes <strong>prepared</strong> <strong>for</strong> <strong>Numeracy</strong> Centre Macquarie University.


a) Draw the scatterplot of the data, and comment.<br />

(Look first <strong>for</strong> an overall pattern; its direction,<br />

<strong>for</strong>m, and strength of relationship.)<br />

b) Use your calculator to confirm the values of the<br />

means, and to find<br />

SX = 7.94425.<br />

25<br />

Then find S = (n-1)( sx 2 )<br />

XX<br />

= 568<br />

c) Given that SX = 2840, find the value of the slope<br />

Y<br />

b = SXY / SXX = 5<br />

d) Given that a = y-bar – b(x-bar), find the value of<br />

the intercept a = 130 – 5(14) = 60<br />

e) state the regression equation<br />

f) interpret the regression equation<br />

g) Use two values of X to find two corresponding<br />

predicted values of Y, and then draw the line on<br />

your scatter plot.<br />

h) What are the assumptions of the linear model <strong>for</strong><br />

the relationship between annual sales and student<br />

population?<br />

i) Given that the standard error of the estimated<br />

slope is 0.5803, find the 95% confidence interval<br />

<strong>for</strong> the true value of the slope β.<br />

sales<br />

relayionship between annual sales and<br />

near<strong>by</strong> student population<br />

250<br />

200<br />

150<br />

100<br />

50<br />

0<br />

0 10 20 30<br />

student population<br />

sales<br />

<strong>Nan</strong> <strong>Carter</strong>: workshop notes <strong>prepared</strong> <strong>for</strong> <strong>Numeracy</strong> Centre Macquarie University.


Relation: Survey of stores<br />

outcome: sales<br />

df: 8<br />

predictor<br />

coeff SE t p-value 95% C.I.<br />

constant<br />

60.0000 9.226 6.5033 0.000 38.725 81.275<br />

no. of students 5.0000 0.580 8.6167 0.000 3.662 6.338<br />

r-sq: 0.903 Resid SS: 1530.000 s: 13.829<br />

Fitted line: sales = 60 + 5 no. of students<br />

sales<br />

220<br />

200<br />

180<br />

160<br />

140<br />

120<br />

100<br />

80<br />

60<br />

40<br />

Residual<br />

20<br />

15<br />

10<br />

5<br />

0<br />

-5<br />

-10<br />

-15<br />

-20<br />

26<br />

Survey of stores<br />

0 10 20<br />

no. of students<br />

Normal Scores Plot<br />

-25<br />

-30<br />

p = 00.65<br />

-1.5 -1 -0.5 0 0.5 1 1.5<br />

normal score<br />

<strong>Nan</strong> <strong>Carter</strong>: workshop notes <strong>prepared</strong> <strong>for</strong> <strong>Numeracy</strong> Centre Macquarie University.<br />

30


Residual<br />

20<br />

15<br />

10<br />

5<br />

0<br />

-5<br />

-10<br />

-15<br />

-20<br />

-25<br />

EXERCISE 1. The relationship between production<br />

level and price of bananas.<br />

The following sample data have been collected over a<br />

period of twenty years <strong>for</strong> Queensland bananas:<br />

Production<br />

000s cases<br />

(X)<br />

27<br />

Residuals vs Fitted<br />

-30<br />

60 110 160<br />

Fitted value<br />

Price $ per<br />

case (Y)<br />

Production<br />

000s cases<br />

(X)<br />

32.0 1.27 38.3 0.57<br />

29.0 0.63 37.4 0.89<br />

33.0 0.26 31.1 0.98<br />

30.2 0.67 35.2 1.04<br />

24.2 1.79 30.5 1.05<br />

33.2 0.94 31.6 0.96<br />

36.0 0.52 35.0 0.67<br />

32.5 0.76 30.0 1.24<br />

42.0 0.49 26.3 2.24<br />

34.8 0.63 31.9 1.47<br />

<strong>Nan</strong> <strong>Carter</strong>: workshop notes <strong>prepared</strong> <strong>for</strong> <strong>Numeracy</strong> Centre Macquarie University.<br />

Price $ per<br />

case (Y)<br />

a) plot the data on a scatterplot and comment on the<br />

display.<br />

b) Given that X-bar = 32.71, sX = 4.08681,<br />

Y-bar = 0.9535,<br />

SS XY = -24.5867,<br />

find the equation of the regression line.<br />

c) Draw the regression line on the scatterplot.


d) Interpret the regression line.<br />

e) Check the assumptions of the model.<br />

f) What average price would be expected if<br />

• production were 30,000 cases?<br />

• Production rose to 45,000 cases?<br />

• Production were 40,000 cases?<br />

g) A grower needs a price of $2 per case to break<br />

even, what production should she aim <strong>for</strong>?<br />

SCATTERPLOT<br />

of PRICE against PRODUCTION<br />

2.5<br />

2.0<br />

1.5<br />

1.0<br />

0.5<br />

0.0<br />

22 27 32 37 42 47<br />

28<br />

price per case $<br />

No. of cases (ooos)<br />

Relation:<br />

Banana Production<br />

outcome:<br />

price per case $<br />

df: 18<br />

predictor<br />

coeff SE t p-value 95% C.I.<br />

constant<br />

3.4878 0.670 5.2085 0.000 2.081 4.895<br />

No. of cases (ooos) -0.0775 0.020 -3.8126 0.001 -0.120 -0.035<br />

r-sq: 0.447 Resid SS: 2.359 s: 0.362<br />

Fitted line: price per case $ = 3.4878 - 0.0775 No. of cases (ooos)<br />

<strong>Nan</strong> <strong>Carter</strong>: workshop notes <strong>prepared</strong> <strong>for</strong> <strong>Numeracy</strong> Centre Macquarie University.


Residual<br />

Residual<br />

1<br />

0.8<br />

0.6<br />

0.4<br />

0.2<br />

0<br />

-0.2<br />

-0.4<br />

29<br />

Normal Scores Plot<br />

-0.6<br />

-0.8<br />

p = 00.46<br />

-2 -1.5 -1 -0.5 0 0.5 1 1.5 2<br />

price per<br />

case $<br />

2.5<br />

2<br />

1.5<br />

1<br />

0.5<br />

0<br />

1<br />

0.8<br />

0.6<br />

0.4<br />

0.2<br />

0<br />

-0.2<br />

-0.4<br />

-0.6<br />

-0.8<br />

normal score<br />

Banana Production<br />

22 27 32 37 42<br />

No. of cases (ooos)<br />

Residuals vs Fitted<br />

0 0.5 1 1.5<br />

Fitted value<br />

<strong>Nan</strong> <strong>Carter</strong>: workshop notes <strong>prepared</strong> <strong>for</strong> <strong>Numeracy</strong> Centre Macquarie University.


Question 2.<br />

The following table gives a sample of advertising<br />

expenditures and associated sales volumes <strong>for</strong> a<br />

company during 10 randomly selected months. We are<br />

interested in being able to predict gross monthly<br />

sales volume from the amount spent on advertising.<br />

(from J. Gosling, Introductory Statistics)<br />

Month 1 2 3 4 5 6 7 8 9 10<br />

Advertising<br />

expenses<br />

(x $10 000)<br />

1.2 0.8 1.0 1.3 0.7 0.8 1.0 0.6 0.9 1.1<br />

Sales Volume 101<br />

(x $10 000)<br />

92 110 120 90 82 93 75 91 105<br />

a) Draw, and comment on, a scatter plot of the data.<br />

b) State the slope of the regression line<br />

c) State the intercept of the regression line.<br />

d) State the equation of the regression line.<br />

e) Draw the line on your scatter plot.<br />

f) Interpret the regression equation; would it be<br />

useful <strong>for</strong> prediction?<br />

g) Check the assumptions of the model.<br />

Scatterplot of sales against advertising expenses<br />

1.4<br />

1.3<br />

1.2<br />

1.1<br />

1.0<br />

0.9<br />

0.8<br />

0.7<br />

0.6<br />

0.5<br />

Advertising expenses (x $10000)<br />

70 80 90 100 110 120 130<br />

30<br />

Sales Volume (x $10 000)<br />

<strong>Nan</strong> <strong>Carter</strong>: workshop notes <strong>prepared</strong> <strong>for</strong> <strong>Numeracy</strong> Centre Macquarie University.


Relation:<br />

Monthly sales volume<br />

outcome:<br />

Sales Volume (x $10 000) df: 8<br />

predictor<br />

coeff SE t p-value 95% C.I.<br />

constant<br />

46.4865 9.885 4.7029 0.002 23.693 69.280<br />

Advert. Exp (x $10000) 52.5676 10.261 5.1231 0.001 28.906 76.229<br />

r-sq: 0.766 Resid SS: 373.973 s: 6.837<br />

PLOT of sales against expenses<br />

PLOT of residuals against fitted<br />

NORMAL PROBABILITY PLOT<br />

Residual<br />

130<br />

response<br />

120<br />

15<br />

10<br />

110<br />

100<br />

5<br />

0<br />

-5<br />

90<br />

80<br />

70<br />

31<br />

Fitted line: Sales (x $10 000) = 46.4865 + 52.5676 Advert. Exp.(x $10000)<br />

0.5 0.7 0.9 1.1 1.3 1.5<br />

Normal Scores Plot<br />

-10<br />

p = 00.88<br />

-1.5 -1 -0.5 0 0.5 1 1.5<br />

normal score<br />

determ. 1<br />

15<br />

residuals<br />

10<br />

<strong>Nan</strong> <strong>Carter</strong>: workshop notes <strong>prepared</strong> <strong>for</strong> <strong>Numeracy</strong> Centre Macquarie University.<br />

5<br />

0<br />

-5<br />

-10<br />

predicted<br />

75 85 95 105 115 125


EXERCISE 3. In a large random sample of 91 homesales,<br />

two of the variables reported were selling<br />

price (thousands of $) and size of the home<br />

(thousands of square feet). The scatter plot is shown<br />

below, and an extract from the computer analysis of<br />

the relationship between price and size is<br />

reproduced.<br />

PRICE<br />

a) comment on the display.<br />

b) Given that the estimate of the intercept is –10.3,<br />

and the estimate of the slope is 65.7, state the<br />

equation of the regression line.<br />

b) Draw the regression line on the scatterplot.<br />

c) Interpret the regression line.<br />

d) Given that the standard error of the slope, b, is<br />

4.007, find the 95% confidence interval <strong>for</strong><br />

the population slope, β.<br />

e) Check the assumptions of the model using the plot<br />

below:<br />

32<br />

200<br />

100<br />

0<br />

0.0<br />

SIZE<br />

.5<br />

1.0<br />

1.5<br />

<strong>Nan</strong> <strong>Carter</strong>: workshop notes <strong>prepared</strong> <strong>for</strong> <strong>Numeracy</strong> Centre Macquarie University.<br />

2.0<br />

2.5<br />

3.0<br />

3.5


Unstandardized Residual<br />

33<br />

40<br />

20<br />

0<br />

-20<br />

-40<br />

-60<br />

0.0<br />

SIZE<br />

.5<br />

1.0<br />

1.5<br />

2.0<br />

<strong>Nan</strong> <strong>Carter</strong>: workshop notes <strong>prepared</strong> <strong>for</strong> <strong>Numeracy</strong> Centre Macquarie University.<br />

2.5<br />

3.0<br />

3.5


Week 11 <strong>STAT170</strong> workshop on CATEGORICAL variables. Prepared <strong>for</strong><br />

<strong>Numeracy</strong> Centre Macquarie University, <strong>by</strong> <strong>Nan</strong> <strong>Carter</strong>.<br />

Examples taken from various sources and adapted <strong>for</strong> <strong>STAT170</strong>.<br />

CATEGORICAL VARIABLES: WE CONSIDER ONLY ONE RANDOM SAMPLE IN<br />

<strong>STAT170</strong><br />

SUMMARY:<br />

1. ONE CATEGORICAL VARIABLE OBSERVED ON EACH ITEM<br />

• TWO CATEGORIES: use EITHER<br />

34<br />

z-TEST OF PROPORTIONS FOR H0: π = π0 where π0 has a<br />

particular value, and<br />

provided that nπ > 5 and n(1-π) > 5<br />

z = (p - π0 )/SE(p) where p is the sample estimate of π<br />

and SE(p)= √[(π)(1-π)/n]<br />

The 95% confidence interval <strong>for</strong> π uses the<br />

estimated SE(p) = √[(p)(1-p)/n]<br />

and is given <strong>by</strong> (p – 1.96 SE(p), p + 1.96 SE(p))<br />

OR χ 2 GOODNESS OF FIT TEST:<br />

The same null hypothesis as <strong>for</strong> the z-test, and it<br />

is used to find expected values, E, <strong>for</strong> both<br />

categories. The observed COUNTS are denoted <strong>by</strong> ‘O’.<br />

The χ 2 -test is valid if all expected values, E, are<br />

>5;<br />

χ 2 = Σ[(O-E) 2 /E] with 1 degree of freedom<br />

• MORE THAN TWO CATEGORIES: use only the χ 2<br />

goodness of fit test with degrees of freedom =<br />

(no. of categories–1)<br />

Use info from the question to <strong>for</strong>m the null<br />

hypothesis, and thence the expected values. The<br />

observed counts are given in the data.<br />

WARNING: SMALL EXPECTED FREQUENCIES: The chi-square<br />

test is not valid when any expected frequency is less<br />

than 5. To counter this problem, we group classes<br />

together to create larger observed counts and larger<br />

expected counts until the problem goes away! There<br />

is a good example of the need to do this in the 1998<br />

<strong>Nan</strong> <strong>Carter</strong>: workshop notes <strong>prepared</strong> <strong>for</strong> <strong>Numeracy</strong> Centre Macquarie University.


mid-year exam paper (See below: the question on no.<br />

of children and level of education of working women.)<br />

2. TWO CATEGORICAL VARIABLES OBSERVED ON EACH ITEM<br />

and we are INTERESTED IN WHETHER THE<br />

VARIABLES ARE INDEPENDENT. THE DATA ARE A TABLE OF<br />

OBSERVED COUNTS, O.<br />

• χ 2 TEST OF INDEPENDENCE:<br />

H0: the variables are independent<br />

For ‘expected values’, E, use the totals of<br />

observed counts in the table:<br />

E = (row total)(column total) / (grand total)<br />

WE STILL REQUIRE THAT ALL Es BE >5.<br />

Then, χ 2 = Σ[(O-E) 2 /E] where the sum is over all<br />

cells;<br />

and degrees of freedom are (r-1)(c-1) where there<br />

are r rows and c columns. If H0 is rejected then<br />

the variables are not independent, they are<br />

associated (or related).<br />

35<br />

<strong>Nan</strong> <strong>Carter</strong>: workshop notes <strong>prepared</strong> <strong>for</strong> <strong>Numeracy</strong> Centre Macquarie University.


Examples on working with proportions, <strong>prepared</strong> <strong>for</strong><br />

<strong>STAT170</strong> workshop at <strong>Numeracy</strong> Centre, Macquarie<br />

University, <strong>by</strong> <strong>Nan</strong> <strong>Carter</strong> from various sources.<br />

CATEGORICAL VARIABLE with ONLY TWO CATEGORIES.<br />

4. A manufacturer claims that 95% of the components<br />

supplied <strong>for</strong> a new jet transport meet a specified<br />

rigid standard of per<strong>for</strong>mance (ie 5% do not).<br />

A sample of 400 of the components is tested, and 30<br />

do NOT meet the per<strong>for</strong>mance standard. Comment on<br />

the manufacturer’s claim.<br />

5. A report claims that the percentage of miscarriages<br />

among LSD users is 12%. From the files in the<br />

maternity ward of a Sydney hospital, out of 100<br />

pregnant women who used LSD, 24 had miscarriages.<br />

Comment on the claim made in the report<br />

6. The president of a certain company believes that<br />

30% of the company’s orders come from new or firsttime<br />

customers. In a random sample of 100 orders,<br />

40 are from new customers; does this meet the<br />

company’s claim?<br />

CHECKS: Q1 z=2.294 Q2: z = 3.693; Q3: z = 2.182;<br />

Now we want to do the same questions using χ 2 .<br />

The problem is how to set up the hypothesis H0, and<br />

how to find values <strong>for</strong> the expected numbers E.<br />

Try thinking of it this way: we are often interested<br />

in<br />

• testing a theory which is used to give the<br />

hypothesis H0 and the expected numbers if the<br />

hypothesis is true.<br />

• testing a ‘claim’ made <strong>by</strong> others, and which is used<br />

to give the hypothesis H0 and the expected numbers<br />

36<br />

<strong>Nan</strong> <strong>Carter</strong>: workshop notes <strong>prepared</strong> <strong>for</strong> <strong>Numeracy</strong> Centre Macquarie University.


if the hypothesis is true. FOR EXAMPLE: In a study<br />

of the genetics of eye colour in fruit flies, there<br />

is a theory that the numbers of red-eyed and whiteeyed<br />

flies will be in the proportions of 1:3. So<br />

we have<br />

H0: the proportions are as 1:3 (i.e ¼ are red-eyed and<br />

¾ white-eyed)<br />

If the total number of flies is 9693, we can find the<br />

expected numbers in each class, assuming H0 is true.<br />

Expected values are<br />

1/4 x 9693 = 2423.25 and ¾ x 9693 = 7269.75. The data<br />

supplies the observed numbers (1981 and 7712<br />

respectively), and so we can calculate<br />

37<br />

χ 2 = Σ[(O-E) 2 /E] where the sum is over all<br />

cells,<br />

and degrees of freedom = no of classes – 1<br />

χ 2 = (1981-2423.5) 2 /2423.25 +(7712-7269.75) 2 /7269.75 =<br />

107.6 with 1 df.<br />

From χ 2 table, p-value < 0.05, so we reject H0 and<br />

conclude that the theory does not hold <strong>for</strong> the<br />

inheritance of red and white eyes in these fruit<br />

flies. In the population from which this sample was<br />

taken, there are fewer red-eyed and more white-eyed<br />

flies than the theory claims.<br />

4. Do the above exercise using a different<br />

statistical test.<br />

5. In tossing a die 300 times the frequency of<br />

occurrence of 1, 2, 3, 4, 5, 6 are, respectively, 43,<br />

49, 56, 45, 66 and 41.<br />

• Is there any basis <strong>for</strong> the claim that the die is<br />

biased? If there is a bias, which number appears<br />

more frequently than expected?<br />

6. In a survey taken 5 years ago, 25% of students<br />

watched less than 5 hours of TV per week, 10% watched<br />

20 or more hours per week, and the rest watched<br />

<strong>Nan</strong> <strong>Carter</strong>: workshop notes <strong>prepared</strong> <strong>for</strong> <strong>Numeracy</strong> Centre Macquarie University.


etween 5 and 20 hours. A recent survey of 450<br />

students showed that 117 students watched TV <strong>for</strong> less<br />

than 5 hours per week, 267 watched between 5 and 20<br />

hours, and the remainder watched 20 or more hours per<br />

week. Has the pattern of watching TV changed <strong>for</strong><br />

students in the five years? If there has been a<br />

significant change, describe where it has occurred.<br />

CHECKS: Q4: z = -11.14, p


sandwich, only 4 became ill. We would like to test<br />

whether there is an association between the<br />

consumption of a sandwich and the onset of<br />

gastroenteritis.<br />

Question 9. In a sample of 110 working women in<br />

Sydney, all aged between 30 and 50 years, the<br />

following table shows the data classified <strong>by</strong> ‘highest<br />

level of education attained’ [coded 1: did not<br />

complete secondary school; 2: graduated from<br />

secondary school; 3: attended TAFE college or other<br />

institution; 4: bachelor degree from university; 5:<br />

higher degree from university] and <strong>by</strong> ‘number of<br />

children’.<br />

Number of Children<br />

Education<br />

39<br />

0 1 2 3 4 5,6<br />

<strong>Nan</strong> <strong>Carter</strong>: workshop notes <strong>prepared</strong> <strong>for</strong> <strong>Numeracy</strong> Centre Macquarie University.<br />

All<br />

1 4 2 2 9 3 0 20<br />

2 1 4 7 11 4 2 29<br />

3 0 3 11 7 2 0 23<br />

4 1 6 12 10 0 0 29<br />

5 2 3 3 1 0 0 9<br />

All 8 18 35 38 9 2 110<br />

a) Are all the expected frequencies at least 5?<br />

b) If not, collapse the table until this criterion<br />

is met.<br />

Note: to identify the smallest expected frequency,<br />

multiply smallest row total <strong>by</strong> smallest column<br />

total; if


Week 12 Questions on ODDS RATIO TESTS <strong>prepared</strong> <strong>for</strong><br />

Stat170 workshop held at the <strong>Numeracy</strong> Centre, Macquarie<br />

University. Prepared <strong>by</strong> <strong>Nan</strong> <strong>Carter</strong> with examples from<br />

various sources and adapted <strong>for</strong> Stat170.<br />

ODDS RATIO TESTS<br />

The χ 2 test of independence showed that wearing a helmet and<br />

receiving a head injury in a bicycle accident are associated,<br />

but does not indicate how strong, or weak, the association is.<br />

To get that in<strong>for</strong>mation we need to look at odds ratios.<br />

DEFINITION: if p is the probability of an event occurring,<br />

and (1-p) the probability of it not occurring, then p/(1-p) is<br />

the odds of it occurring.<br />

We can show that the odds of a cyclist wearing a helmet will<br />

receive a head injury is [17/147]/[130/147], which simplifies<br />

to 17/130.<br />

Similarly, we can show that the odds that a cyclist not<br />

wearing a helmet will receive a head injury is<br />

[218/646]/[428/646], which simplifies to 218/428.<br />

So, the odds ratio of a cyclist receiving a head injury if<br />

wearing a helmet relative to a cyclist receiving a head injury<br />

if not wearing a helmet is [17/130]/[218/428], which can be<br />

written [17x428]/[218x130] = 0.26.<br />

Similarly, a cyclist receiving a head injury if not wearing a<br />

helmet relative to a cyclist receiving a head injury if<br />

wearing a helmet, is [218/428]/[17/130], which can be written<br />

[218x130]/[17x428]= 3.89.<br />

40<br />

<strong>Nan</strong> <strong>Carter</strong>: workshop notes <strong>prepared</strong> <strong>for</strong> <strong>Numeracy</strong> Centre Macquarie University.


Notice that<br />

a) 3.89 = 1/0.26, so if we find one ratio, we also have the<br />

other as the reciprocal.<br />

b) If the two odds were equal, then their ratio would be 1.<br />

As statisticians, we calculate a 95% confidence interval<br />

<strong>for</strong> the ratio and if it contains 1, we say that with 95%<br />

confidence the ratio could be 1, which is to say that the<br />

odds could be equal.<br />

c) For interpretation, we say that ‘a cyclist in an accident<br />

and not wearing a helmet is nearly four times as likely<br />

to receive a head injury than a cyclist who was wearing a<br />

helmet’.<br />

d) Or we might say that a bicyclist wearing a helmet is only<br />

25% as likely to receive a head injury if in an accident<br />

as one who is not wearing a helmet.<br />

This is EcStat output <strong>for</strong> this problem: read off the 95%<br />

confidence interval.<br />

Helmets and head injuries<br />

Observed counts<br />

head injury<br />

wearing helmet yes no total<br />

yes 17 130 147<br />

no 218 428 646<br />

total 235 558 793<br />

Expected counts<br />

wearing helmet<br />

head injury<br />

yes no total<br />

yes 43.6 103.4 147.0<br />

no 191.4 454.6 646.0<br />

total 235.0 558.0 793.0<br />

Chi-squared decomposition<br />

head injury<br />

wearing helmet yes no total<br />

yes -4.024 2.612 23.018<br />

no 1.920 -1.246 5.238<br />

total 19.882 8.373 28.255<br />

Test: df chi-sq p-val<br />

Indep 1 28.26 0.000<br />

wearing helmet head injury OR 95% CI<br />

yes / no yes / no 0.26 0.15 0.44<br />

41<br />

<strong>Nan</strong> <strong>Carter</strong>: workshop notes <strong>prepared</strong> <strong>for</strong> <strong>Numeracy</strong> Centre Macquarie University.


EXERCISE: In a Canadian experiment to examine the claim that<br />

the use of vitamin C helps to prevent the common cold, 818<br />

volunteers were randomly divided into 2 groups (one of 411 and<br />

the other of 407), and received a supply of either vitamin C<br />

or placebo. The results are shown below:<br />

TREATMENT<br />

OUTCOME<br />

placebo Vitamin C TOTALS<br />

Has Colds 335 302 637<br />

No Colds 76 105 181<br />

TOTALS 411 407 818<br />

Do these data suggest that getting a cold is independent of<br />

whether or not the subjects received vitamin C?<br />

Can the risk of a cold be reduced <strong>by</strong> using vitamin C?<br />

vitamin C<br />

Observed counts<br />

response<br />

determinant yes no total<br />

placebo 335 76 411<br />

vitamin C 302 105 407<br />

total 637 181 818<br />

Expected counts<br />

response<br />

determinant yes no total<br />

placebo 320.1 90.9 411.0<br />

vitamin C 316.9 90.1 407.0<br />

total 637.0 181.0 818.0<br />

Chi-squared decomposition<br />

response<br />

determinant yes no total<br />

placebo 0.698 2.455 3.153<br />

vitamin C 0.704 2.479 3.184<br />

total 1.402 4.934 6.337<br />

Test:<br />

df<br />

chisq<br />

p-val<br />

Indep 1 6.34 0.012<br />

determinant response OR 95% CI<br />

placebo /<br />

vitamin C yes / no 1.53 1.10 2.14<br />

42<br />

<strong>Nan</strong> <strong>Carter</strong>: workshop notes <strong>prepared</strong> <strong>for</strong> <strong>Numeracy</strong> Centre Macquarie University.

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!