CONTENTS - Department of Mathematics and Statistics - University ...
CONTENTS
Introduction, General Information and Administration, Overview
SECTION 1
This section covers an introduction to the package R-cmdr and presents an overview of biostatistics and research methodology.
Biostatistics and Research Methodology; R-cmdr
Types of Data
Numerical Data and Histograms
Measures of Centre: Mean and Median
Measures of Variability: Standard Deviation, Variance and Interquartile Range
Box-and-Whisker Plots
SECTION 2
This covers the measures of disease frequency and disease association, with several examples looking at prevalence, incidence, relative risks, attributable risk and odds ratios.
Prevalence and Incidence
Cumulative Incidence
Incidence Rate
Disease Association
Relative Risk
Attributable Risk
Odds Ratio
SECTION 3
This section covers a brief introduction to probability definitions, notation, rules and random variables, with examples, several involving tree diagram use.
Definitions including mutually exclusive and independent events
The Addition Rule for combining probabilities
The Multiplication Rule for probabilities
Tree diagrams with examples
Screening test terminology
Probability Distributions and Random Variables
Rules for combining Random Variables
SECTION 4
This section introduces both the Binomial and Normal Distributions, which model many phenomena arising in the real world. Consequently these distributions allow us to answer some important and relevant questions.
The Binomial Distribution: Definition, mean and variance
The Binomial Table: Examples
The Normal Distribution: Definition
Standard Normal Distribution and Table
General Normal Distribution
Normal Approximation to the Binomial
Transforming Data to Normal
SECTION 5
This section defines sampling distributions, establishes the standard deviations of these distributions (called standard errors), and sets up confidence intervals for population means, differences between the means of two populations, proportions, and differences between proportions, based on random samples drawn from the populations.
An outline of the Research Process
The Distribution of Sample Means
The Standard Error of the Mean
Confidence Interval for a Mean
The t-distribution and Its Use
Comparison of Two Independent Groups
The Standard Error of the Difference Between Two Means
Pooled Estimate for the Common Variance
Comparison of Two Dependent Groups (Paired Data)
Confidence Interval for a Proportion
Confidence Interval for Difference Between Two Proportions
Summary of Distributions and Confidence Intervals
SECTION 6
This section reviews hypothesis testing, Type I and Type II errors, conclusive and inconclusive results, and the power of a study.
Null and Alternative Hypotheses
Study Based and Data Driven Hypotheses
One and Two Sided Tests
Four Steps in the Hypothesis Testing Procedure
Examples
Pooled Proportion Estimate
Clinical and Ecological Importance
Conclusive and Inconclusive Results
Errors in Hypothesis Testing
Power of a Study
Examples
SECTION 7
One Factor Analysis of Variance
Post Analysis of Variance Tests on Means
Multiple Comparison Procedures
SECTION 8
This section covers the analysis of count data, including the Chi-square test for contingency, the Chi-square test for trend, as well as relative risks, attributable risks and odds ratios along with their confidence intervals. The analysis of a three-way table and Simpson’s paradox are investigated as a way of introducing the concept of a confounding variable in the lead-up to regression analyses.
Categorical Data Examples
Relative Risk and its Confidence Interval
Attributable Risk and its Confidence Interval
Odds Ratio and its Confidence Interval
Chi-square Test for Contingency
Chi-square Test for Trend
Interpretation of Confidence Intervals
Simpson’s Paradox and Confounder Control
SECTION 9
This section introduces Simple Linear Regression, which fits a straight line through what is called a scatter diagram. One purpose of this analysis is to establish whether a predictor variable influences the outcomes of a response variable, and to measure the magnitude of the effect of this predictor variable on the outcome. The fitted straight line can also be used to make predictions.
Simple linear regression is also the first step in controlling for a confounder variable. This occurs with the extension to multiple regression, which is considered in the next section.
Scatter Diagrams and Examples
Equation of Fitted Straight Line
Analysis of Variance for Regression Model
Confidence Interval for Slope
Confidence Interval for Prediction
Correlation as Measure of Linear Association
Review Exercises
SECTION 10
Multiple regression models and logistic regression models are introduced in this section. In ordinary multiple regression the response (outcome) variable is on a continuous scale, whereas in logistic regression the outcome measure is binary, taking only two possible values interpreted as success versus failure.
The models allow us to identify those variables which have an effect on the outcomes and those which do not.
Adding variables leads to adjusted values for the estimated parameters, and it is this that allows us to control for confounding.
The Multiple Regression Model
R-cmdr Printout for Multiple Regression
Dummy Variables
Checking Model Fit
Parallel Regression Lines and Analysis of Covariance
Binary Outcomes and Logistic Regression
Study Design Principles
Critical Appraisal
Confounding Analysis
Sources of Bias
SECTION 11
Appendix 1: The Basics – mathematical rules and statistical concepts
Appendix 2: Some summaries
Appendix 3: Formulae
STAT115 INTRODUCTION TO BIOSTATISTICS 2012
Advances in our understanding of factors which affect health and wellbeing come through research in the health sciences. Examples of such research include surveys to describe patterns of disease in a community or risk factors for disease such as diet and smoking; studies trying to find out whether a newly developed treatment works; studies of factors which may prevent disease, such as physical activity; and studies of barriers to improving health, such as reasons for declining vaccination rates in children or obstacles to smoking prevention. Biostatistics (statistics applied in the health sciences) is a vital tool in our mission to improve health and wellbeing for all people.
STAT115 provides an introduction to the core principles and methods of biostatistics. In this course you will gain an understanding of how statistics is used to answer research questions: how to look for patterns in data, how to test hypotheses about disease causation and prevention and improvement in well-being. The understanding and skills gained in STAT115 can be a starting point for a career in biostatistics, or can be used to assist understanding of research in other disciplines including physiology, anatomy, human nutrition, sports science, and psychology.
GENERAL INFORMATION AND ADMINISTRATION
Lecturers
Dr Katrina Sharples, Dept of Preventive and Social Medicine, Adams Building
Dr Janine Wright, Room 237, Science III Building
Mr Daniel Turek, Room 231, Science III Building
Dr David Bryant, Room 514, Science III Building
Lectures
Lectures are held as follows: Monday, Tuesday, Thursday and Friday at 11.00 am, commencing Monday 9 July. Although these notes are extensive, experience shows that students who miss lectures are at a severe disadvantage.
Help Sessions and Tutorials
These will be held in the 539 Castle St laboratory, which has 36 computers. Tutorials are cafeteria style, which means that you can attend at any scheduled time when tutors are available to help with weekly exercises. Times can be found on the STAT115 paper page on the Mathematics and Statistics Department website. In addition, you may access the computers to complete assignments outside of scheduled sessions. Attend early in the week to avoid the inevitable rush before submission day.
STAT 115 Web Page and Resource Area
The STAT 115 web page, www.maths.otago.ac.nz/stat115, will contain course resource material. Answers to weekly exercises, notices, old exam papers with solutions and any other useful information will be posted here. You can access such information by clicking on the Resources button. You are strongly advised to read through the solutions to weekly exercises, as students who fail to do this are at a severe disadvantage.
Introduction & overview
Support Classes
There is also a Wednesday evening support class for students worried about their mathematics background for this course. This class will be held in 539 Castle St at 6pm on Wednesday evenings. If you wish to attend the support class you will need to register using the form which is available on the resource page or from the Maths and Statistics Reception, Science III, 2nd floor. Our experience is that only a small number of students will need to use the support class. Note that there is no mathematics prerequisite for this course. If you have difficulty carrying out the calculations in the Basics Booklet of Appendix 1 of these notes, you may find it helpful to attend the support class. In addition, you can access Mathercize by going to the web page mathercize.otago.ac.nz, log-in password line. The options
• STAT115 Exercises for Biostatistics
• STAT115 Revision mathematics
will take you through background material for this course in an easy-to-use self-testing environment.
Study Centre
A Study Centre will operate in a room at the back of 539 Castle St. This is an area where you can go to work with fellow students. There will also be statistics help available at times as shown on the door.
References
There is no set text for the course, as this course booklet contains all necessary material. The book Harraway, J., Introductory Statistical Methods for Biological, Health and Social Sciences (University of Otago Press) has multiple copies on reserve in the Science Library at the Loans Desk. The first 17 chapters are relevant for this course. A second book, Clark, M.J. and Randal, J.A., A First Course in Applied Statistics (Pearson), is on close reserve.
Computing
The R-commander (R-cmdr) package will be used in tutorials. No prior knowledge of the package is needed, as a handout and full instructions will be available in the tutorials. All students will have their own User Name and Password. The User Name is the name on your student ID card and the Password is your student ID number.
Time Commitment
STAT 115 is a one-semester course worth 18 points. It is expected that students should spend an average of 12 hours per week on this course. After allowing four hours per week for attending lectures, this leaves eight hours for other course-related activities such as assignments, reading notes and revising.
Calculators
There is no restriction on the type of calculator that can be used, except that no device with communication capability shall be accepted as a calculator.
Course content (in approximate lecture order)
Introduction: research methods and study design; designed experiments versus observational studies; case control, cohort and intervention studies. (2 lectures)
Data description and presentation: the use of R-commander; histograms, box-and-whisker plots, measures of centre and spread of data, measures of disease frequency and association. (6 lectures)
Probability: the nature of random variation; diagnostic tests; probability distributions including the binomial and normal distributions. (8 lectures)
Estimation: sampling distributions; confidence intervals for means, differences and proportions. (5 lectures)
Hypothesis testing: classical procedures for means, proportions, and differences; the p-value; statistical vs clinical significance; power and sample size. (3 lectures)
Analysis of variance: completely randomised design; Bonferroni procedure for multiple comparisons. (3 lectures)
Categorical data: tests for association; rates, relative risk and risk differences, odds ratios; confidence intervals for relative risk and odds ratio. (4 lectures)
Regression and correlation: the simple linear regression model; tests on the slope; predictions; confidence intervals for predictions; correlation. (5 lectures)
Multiple regression: tests on the estimated parameters; dummy variables for qualitative predictors; parallel regressions and control of confounding. (4 lectures)
Ethics and study design: ethical issues, bias and confounding. (7 lectures)
Internal Assessment
There will be eight assignments and three mastery tests. Each assessment will have a mark recorded out of 20. These assessments will be administered on-line. The assignments can be completed anywhere you have an internet connection. The mastery tests will be conducted in the Castle St Computer Laboratory. A booking system for half-hour slots in which to attempt the tests will operate. Cutoff times for each assignment will be announced in lectures.
Exam format
A three-hour exam will produce a mark out of 100.
Final mark
In your overall mark we will count your exam mark for 2/3 of the total and the internal assessment for 1/3. However, if your final exam mark out of 100 is greater than this, we will use just the final exam mark. That is, the final mark F will be calculated as:
F = max{E, (2E + A)/3}
where E (exam mark) is out of 100 and A (internal assessment) is out of 100. The internal assessment mark will be made up 1/3 from the eight assignments and 2/3 from the three mastery tests.
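The final-mark rule above can be sketched as a short calculation. This is an illustration only; the function and variable names are our own, not part of the course software:

```python
def final_mark(exam, assignments, mastery):
    """Illustrative final-mark calculation (names are hypothetical).

    exam        -- exam mark E, out of 100
    assignments -- average assignment mark, scaled out of 100
    mastery     -- average mastery-test mark, scaled out of 100
    """
    # Internal assessment A: 1/3 from assignments, 2/3 from mastery tests.
    internal = assignments / 3 + 2 * mastery / 3
    # F = max{E, (2E + A)/3}: the weighted mix only counts if it helps.
    return max(exam, (2 * exam + internal) / 3)

# A student with exam 60, assignments averaging 90 and mastery tests
# averaging 75 has A = 80, so F = max(60, (120 + 80)/3) = 66.7 (1 d.p.).
print(round(final_mark(60, 90, 75), 1))
```

Note that the internal assessment can only raise the final mark, never lower it below the exam mark itself.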
Email Contact with Students
From time to time lecturers may wish to email students taking STAT 115. This will be done by contacting you at your student email address. You should check your student address regularly. If you have another address, then you might like to arrange for emails sent to your student address to be forwarded automatically.
Disability and Impairment Support
The Department of Mathematics and Statistics encourages students to seek support if they find they are having difficulty with their studies due to a disability, temporary or permanent impairment, injury, chronic illness or deafness.
Contact either the Course Convenor, or Disability Information and Support:
Telephone: 479 8235
Email: disabilities@otago.ac.nz
Website: http://www.otago.ac.nz/disabilities
Plagiarism
Students should make sure that all submitted work is their own. “Plagiarism is a form of dishonest practice. Plagiarism is defined as copying or paraphrasing another’s work and presenting it as one’s own” (University Council, December 2004). In practice this means that plagiarism includes any attempt in any piece of submitted work (e.g. an assignment or test) to present as one’s own work the work of another (whether of another student or a published authority). Any student found to be responsible for plagiarism in any piece of work submitted for assessment shall be subject to the University’s dishonest practice regulations, which may result in various penalties, including forfeiture of marks for the piece of work submitted, a zero grade for the paper or, in extreme cases, exclusion from the University.
SURV 102 Computational Methods for Surveyors
Students enrolled for SURV102 will attend lectures in STAT115 for four weeks beginning on Monday 23 July.
A separate notice about assessment in SURV102 will be made in the Surveying Department.
Biostatistics and Health Research - An Overview
1 Health Research
Billions of dollars are spent every year in a quest to improve human health and well-being. The broad goal of this quest is to acquire new knowledge to help prevent, detect, diagnose and treat disease.
What sort of knowledge do we look for?
What causes a disease?
Once you have a disease, what happens?
Who has the disease?
What is the best strategy for treatment or prevention?
How do societal factors affect health?
What causes a disease?
Understanding the factors which lead to the development of disease gives ideas about how to prevent disease. For example:
• Drinking water is treated to kill bacteria, viruses and other contaminants such as Giardia.
• Our ability to prevent heart disease has improved with our understanding of specific dietary components which increase risk, and with our understanding of how exercise works to reduce risk.
• The realization that the cause of AIDS was a virus (HIV) which could be transmitted through sexual intercourse and blood transfusions led to prevention strategies to reduce transmission. These included use of condoms, screening of blood products and drugs to reduce transmission from mother to baby.
• Understanding how and when sports injuries occur helps to develop rules of play and training schedules which reduce injury burden.
Once you have a disease, what happens?
Understanding how a disease progresses gives ideas about how to cure disease, or to prolong survival, or to improve quality of life. For example:
• Understanding how HIV affects the immune system has led to the development of drugs such as zidovudine which prevent the virus from reproducing and seem to slow the destruction of the immune system.
• Understanding how bacteria work allowed the development of different types of antibiotics with different actions.
• Cancer develops when cells in a part of the body begin to grow out of control. Knowledge of the cell cycle was important in developing cancer drugs (chemotherapy) which work only on actively reproducing cells.
Who has the disease?
Detecting who has a disease and diagnosing disease are the first steps in delivering effective treatments. For example:
• Development of non-invasive technologies for looking inside the body (such as ultrasound, CT scans and MRI) provided techniques for making the initial diagnosis of cancer, or for identifying the form of damage to a knee following injury.
• Tests which look at cells from biopsies or blood can give a more accurate diagnosis of cancer than the non-invasive technologies.
• We identify people with HIV infection through a blood test which detects antibodies to the virus.
What is the best strategy for treatment or prevention?
Once we have developed a new treatment or approach to prevention, we need to evaluate the risks and benefits of that treatment before it is made available for use. For example:
• Exercise and balance programmes have been demonstrated to reduce the risk of falling in the elderly.
• The statin family of drugs has been demonstrated to reduce the risk of death from cardiovascular disease.
• Evaluations of the use of beta-carotene (which the body converts to vitamin A) found that, contrary to expectations, it did not prevent lung cancer; in fact it increased the risk of lung cancer.
How do societal factors affect health?
Working with individuals can lead to significant improvements in health, but societal factors can also have an impact.
• Societal attitudes to alcohol and smoking can make it difficult for individuals to change behaviour.
• Understanding how societal factors operate is important for developing systems of health care.
Where does knowledge come from?
During the last century we have gained an enormous amount of knowledge, but there are still many gaps.
• Cancer and cardiovascular disease still end many people’s lives prematurely.
• Back pain is very common. We are still not very good at treating it or preventing it.
• Diabetes is becoming increasingly common, particularly among Maori and Pacific Island populations. It has many serious health consequences.
• New diseases provide additional challenges. HIV/AIDS, a disease thought to have jumped the species barrier into humans, has had an enormous impact. Avian influenza is common in birds in Asia, and can cause severe disease in humans, but does not currently spread directly from human to human. However, it would take only a small change in the genome of the virus to make it highly infectious amongst humans.
Knowledge can come from ‘experience’ or ‘research’
Experience is a very unreliable way of obtaining knowledge. Humans are not objective; our recall is very selective. The history of medicine is littered with treatments which doctors were convinced, through their own experience, worked, but which time has shown to be ineffective or harmful in many of the settings where they were used: bloodletting, ground woodlice, mercury, arsenic, and so on. These treatments were widely used centuries ago, but there are more modern examples.
• An early treatment for heart attack, where blood flow to part of the heart muscle is blocked, involved sprinkling powdered asbestos on to the heart to increase blood flow to the affected areas. It was never truly shown to work, but thousands of these operations were done.
• Hormone replacement therapy was widely used initially for treatment of the symptoms of menopause, but was also believed to reduce the risk of heart disease in post-menopausal women. The results of a recently published study found that it in fact increased the risk of heart disease.
That leaves research.
2 The Research Process and Biostatistics
What is research?
Research is a systematic process for providing answers to questions.
Examples of research questions:
• What are the causes of meningococcal meningitis?
• What is the best treatment strategy for chronic back pain?
• What are the genetic events that lead to childhood cancer?
• Can this new drug improve survival in people with colon cancer?
• What is the role of selenium as an antioxidant in the protection against risk factors for cardiovascular disease?
• To what extent do western diet and exercise habits need to change in order to reduce insulin resistance?
• Does this conditioning programme reduce serious knee injury in team sports?
Biostatistics is the field of development and application of statistical methods to research in health-related fields, including medicine, public health, and biology. Since early in the twentieth century, biostatistics has become an indispensable tool for health research.
Statistics is often defined as the art and science of collecting, summarising, presenting and interpreting data. Statistics is a set of techniques which formally implement the fundamental principles of the scientific method. The scientific method underlies the research process: observation and theories lead to the development of hypotheses. We work out the best test of the hypothesis, then collect data and determine to what extent the data are consistent with the hypothesis.
The research process
When we carry out research we often collect data on a sample or subgroup from a population. Our goal is to use the information collected on that sample to draw inferences about a larger population.
[Diagram: statistics computed on a sample are used to make inferences about the underlying populations.]
Examples
• We use the frequency with which diabetes occurs in a sample to estimate the frequency with which diabetes occurs in the population the sample came from.
• We study a new treatment in a subgroup of patients in order to be able to make claims about the effects of the treatment in all such patients.
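The sample-to-population step in the first example above can be sketched in a few lines. The following estimates a population prevalence from a sample proportion and attaches a normal-approximation 95% confidence interval; the survey numbers are invented for illustration, not course data:

```python
import math

def prevalence_with_ci(cases, n, z=1.96):
    """Sample proportion as an estimate of population prevalence,
    with a normal-approximation confidence interval.

    cases -- number in the sample with the condition
    n     -- sample size
    z     -- normal quantile (1.96 gives a 95% interval)
    """
    p = cases / n                    # point estimate of prevalence
    se = math.sqrt(p * (1 - p) / n)  # standard error of a proportion
    return p, (p - z * se, p + z * se)

# Hypothetical survey: 40 of 500 people sampled have diabetes.
p, (lo, hi) = prevalence_with_ci(40, 500)
print(f"prevalence {p:.3f}, 95% CI ({lo:.3f}, {hi:.3f})")
```

With these made-up figures the sample prevalence is 0.08, and the interval quantifies how far the population value could plausibly lie from that estimate; confidence intervals of this kind are developed properly in Section 5.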
Steps in the research process
Development of the research questions
Design of the study
Collection of information
Data description and analysis
Interpretation of results
Ideas for research come from many places: from reading the literature, from observation and clinical experience, from talking to colleagues and from just sitting and thinking.
The first step is to refine the idea into a question, or series of questions, which can be answered in a single study; that is, we need to be able to design a study to answer the question. The question may be framed as a hypothesis. For example, we might wish to answer the question “Does a low fat diet reduce the risk of diabetes?” The hypothesis would be “A low fat diet reduces the risk of diabetes”. We then need to work out how best to test the hypothesis.
The study design specifies the methods for selecting people (or other units) for the study and for collecting the information that will be used to answer the questions. It needs to be feasible and ethical. We need to identify which study designs can give us appropriate data, and how to maximize our chance of being able to distinguish a true relationship from random noise.
Once we have collected the data we use statistical methods to describe and analyse the data and interpret the results. The analysis and the interpretation of the results will depend on the study design.
Biostatisticians work with scientists to identify <strong>and</strong> implement the correct statistical methods<br />
for designing studies <strong>and</strong> analyzing <strong>and</strong> interpreting the results.<br />
3. Introduction to study design
Understanding where data come from is vital for making sensible choices about statistical
analysis. At this stage in the course we give an overview of some of the study designs
that are commonly used in epidemiology and clinical research. We will return to this material
in the second half of the course.
There are several different ways to classify study designs, and several specific 'named' study
designs. This can be confusing, since different epidemiology books use the terms differently. The
classifications and definitions exist to help us think about the strengths and weaknesses of a
particular study for addressing the research questions. The differences in usage arise because
textbooks emphasise the relative strengths and weaknesses a little differently.
3.1 Classifications of Study Designs
1. Descriptive versus analytic
This classification relates to the primary aims or objectives of the study. Where the study aims
to test a hypothesis we say the study is analytic. For example, does this vaccine reduce the
risk of meningococcal disease? Here we hypothesise a relationship between vaccine and risk
of meningococcal disease (we hypothesise that the vaccine reduces risk) and aim to test that
hypothesis. Analytic studies are studies which test hypotheses.
Descriptive studies are used where the aim is simply to describe something, with no pre-specified
hypothesis. For example, if we wish to describe trends in incidence of
meningococcal disease over time we carry out a descriptive study. Here there are no pre-specified
hypotheses about the reasons for a change over time.
Many descriptive studies in epidemiology describe patterns of disease in populations. This can
provide clues about causes of disease and lead on to further studies. The standard approach is to
examine the characteristics of disease according to time, place, and person:
TIME: A descriptive study can be repeated in order to examine trends over time.
Examples: epidemics, seasonality (e.g. influenza)
PLACE: Many diseases vary between countries, or even within countries.
Examples: breast cancer incidence by country, multiple sclerosis and latitude
PERSON: Characteristics of people with the disease can be studied, for instance age, sex,
ethnic group, socioeconomic group, occupation.
Example: heart disease in New Zealand according to age, sex and ethnic group
2. Experimental versus observational
In experimental studies the investigators intervene in the natural order (hence the alternative
name intervention study). The investigator decides the exact nature of the intervention,
chooses a control strategy, and decides who will receive the intervention under study and who
will be part of the control group. The goal is to control the conditions so that the effect of
interest can be isolated and studied. For example, if investigators want to know whether a
drug (nevirapine) reduces maternal-infant transmission of HIV they can construct an
experiment which isolates the effect of the drug from any other factors which might affect risk
of transmission. The extent to which we can isolate the effect of the intervention (e.g. the drug)
determines how good the experiment is. Of course, ethics are a fundamental consideration.
In observational studies we simply observe a naturally occurring process without intervening.
It is much harder to test a hypothesis in an observational study, but for many research
questions in the health sciences it is not ethical or feasible to conduct an experiment. We aim
to design our observational studies to get as close as possible to the information we would
have obtained if the experiment could have been done.
3. Randomised versus non-randomised (applies to experiments only)
Experiments should always have a control group as well as a group (or groups) which gets
the intervention(s) under study. Randomisation is a process we can use to allocate people to
either the intervention group or the control group – the simplest version of randomisation is
like flipping a coin: each person has a 50% chance of being in the intervention group. Careful
use of randomisation gives the best test of a hypothesis.
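The coin-flip allocation just described can be sketched in a few lines of code. (This is an illustration only, in Python; the participant names and the seed are invented, and the course software is R-cmdr, not Python.)

```python
import random

def randomise(participants, seed=None):
    """Allocate each participant to the intervention or control group
    with probability 1/2 each, like flipping a fair coin."""
    rng = random.Random(seed)
    return {p: rng.choice(["intervention", "control"]) for p in participants}

# a hypothetical list of ten study participants
groups = randomise(["person_%d" % i for i in range(10)], seed=1)
```

Fixing the seed makes the allocation reproducible, which is useful when an allocation list must be documented in advance.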
In some experiments the investigators use a method other than randomisation to decide who
will be in the intervention group and who will be in the control group. For example, in a
community intervention study the investigators might choose a set of communities to get the
intervention (often those interested or with structures in place to take part), and then choose a
matched set of control communities. Non-randomised experiments like this are
sometimes referred to as quasi-experiments. Sometimes they are the only practical alternative,
but they never provide the same strength of evidence as a randomised trial.
Note that the process of randomisation is not the same as random sampling. The purpose of
random sampling is to select a single group which is representative of a population (see
below).
4. Cross-sectional versus longitudinal
This classification refers to the data themselves and the (calendar) time points or periods
about which the information is collected. For example, we might do a study looking at the
relationship between oral contraceptive use and coronary heart disease. Fully cross-sectional
data would refer to one point in (calendar) time. For example, in a survey we might ask: do
you have coronary heart disease today? Are you taking oral contraceptives today? Note that if
we are collecting data on existing disease we are working with prevalence of coronary heart
disease rather than incidence of coronary heart disease, and so cross-sectional data are not
very good for testing hypotheses about the causes of disease. (The exposures may have changed after
disease was diagnosed.)
Longitudinal data have some time course present. The ideal for testing hypotheses about
disease causation is to get information about things that occurred before the disease
developed. Since the time between developing disease and diagnosis is often unclear, often the
best we can do is collect information about exposures that occurred before diagnosis.
Longitudinal studies collect information over a period of time, e.g. exposures which
occur before disease is diagnosed.
5. Study unit
The majority of studies in epidemiology collect data on individuals. However, there are some
where the 'unit' under study is something bigger – such as a family, a community or a
country. In some studies it is the group that is of interest, not the individual, and we might
want to test a hypothesis relating to the group (an analytic study). For example, the COMMIT
study asked: does a community prevention programme reduce the prevalence of smoking in
the community? The intervention is carried out at the community level, and we can evaluate it
by examining whether the prevalence of smoking in the community changes. Note that the
outcome data are collected on the individual (whether someone smokes or not), to measure the
effect of the intervention in a community.
3.2 Common study designs in epidemiology and clinical research
1. Case report
A case report usually describes the occurrence of disease in one person. The purpose is to alert others to the
fact that this combination of factors can occur, and to encourage people to keep a look out for
other similar cases. Such case reports (to a central registry) led to the initial recognition of
AIDS. Case reports are always descriptive and observational. The cross-sectional/longitudinal
classification doesn't really apply, but they could be considered 'longitudinal' in the sense
that they may collect data on the person's experience over time.
2. Case series
A case series takes a group of people with a recognised disease and describes patterns among
them. A study of the initial case series of men diagnosed with AIDS recognised a common
dysfunction of the immune system and that the disease occurred in gay men, injecting drug
users and blood product recipients. This led to the hypothesis that it was caused by a
transmissible agent, and gave clues as to the modes of transmission. Case series are always
descriptive and observational, and are generally cross-sectional, but could be longitudinal if they
describe changes in individuals over time.
3. Descriptive study using population data
Many descriptive epidemiological studies make use of data that are collected routinely on a
population. This includes census data, death certificates, data reported to cancer registries,
hospital morbidity and mortality data, and data on infectious diseases reported as 'notifiable'
diseases. Provided the data sources are reliable, this can provide valuable descriptions of the
disease (or risk factor) experience in a population. These studies are descriptive and
observational.
4. Sample survey
Where data are collected specifically for a research study, they generally involve collecting
data for only a sample (subset) of the population of interest. This gives the opportunity to
collect more information about each person, at the cost of the random variation that comes
with sampling from a population. There are many ways to go about selecting a sample. In
quantitative research we generally choose random samples. In a random sample everyone has
a known chance of being selected for the study; this allows us to use statistical methods to
accurately determine the influence of random error (through the use of confidence intervals),
and hence to make valid inferences regarding the population the sample came from. Random
sampling gives us the best chance of getting a sample which is representative of the
population.
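As a sketch of how such an inference works in practice, the usual estimate of a population proportion from a random sample, together with its approximate 95% confidence interval (normal approximation to the binomial), can be computed as follows. Python and the survey numbers here are illustrative only; they are not from the course.

```python
import math

def proportion_ci(successes, n, z=1.96):
    """Sample proportion with an approximate 95% confidence interval
    (normal approximation to the binomial)."""
    p = successes / n
    se = math.sqrt(p * (1 - p) / n)   # standard error of the proportion
    return p, (p - z * se, p + z * se)

# hypothetical survey: 40 of 200 randomly sampled people have the condition
p_hat, (lower, upper) = proportion_ci(40, 200)
```

The interval quantifies the random error introduced by sampling: a larger sample gives a smaller standard error and so a narrower interval.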
The simplest type of random sample is a simple random sample, where everyone has the same
chance of being chosen. We can also draw stratified samples or cluster samples. In stratified
sampling we divide the population into groups (or strata) – for example ethnic groups. We
then choose to sample a fixed number from each stratum to ensure all groups are adequately
represented in the study. For example, we might wish to choose the same number of people
from each ethnic group to ensure we have enough data for reliable estimates in each group.
Cluster sampling is used where we can't easily select a sample of individuals. For example, if
we wish to study children, we cannot select a simple random sample because we have no
list of children from which to select the sample. One approach commonly used is to select
schools at random, classrooms within a school at random, and children from a class at
random.
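The difference between a simple random sample and a stratified sample can be sketched as follows (illustrative Python only; the population and the two strata are invented):

```python
import random

def simple_random_sample(population, n, seed=0):
    """Every member of the population has the same chance of selection."""
    return random.Random(seed).sample(population, n)

def stratified_sample(strata, n_per_stratum, seed=0):
    """Draw a fixed number from each stratum (e.g. each ethnic group)
    so that every group is adequately represented."""
    rng = random.Random(seed)
    return {name: rng.sample(members, n_per_stratum)
            for name, members in strata.items()}

population = list(range(100))            # 100 hypothetical people
srs = simple_random_sample(population, 10)
strat = stratified_sample({"group A": population[:50],
                           "group B": population[50:]}, 5)
```

Note how the stratified design guarantees five people from each group, whereas the simple random sample leaves the group balance to chance.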
A true survey generally means getting people to fill in a questionnaire. However, people have
extended the idea to include other forms of data collection: we may take measurements of
height and weight, fitness tests, blood tests and so on.
These studies are most often descriptive, but can be analytic; they are observational, and can be
cross-sectional or longitudinal.
5. Cross-sectional study
In epidemiology the term cross-sectional study often refers to a survey. The data are often not
fully cross-sectional according to the definition above. For example, we might carry out a
survey of use of hormone replacement therapy (HRT) among New Zealand women.
Such a survey would generally ask about past life experiences and past use of HRT, rather
than just current use, which gives a longitudinal element to the data. When the study collects
information about disease status, it is generally prevalent disease. So while cross-sectional
studies can be used to test hypotheses, they are not very good for testing hypotheses about
disease causation.
6. Case-control study
Two groups:
Group with disease (cases)
Group free from disease (controls)
In a case-control study, people are selected for the study according to whether they have the
disease of interest (cases) or not (controls). Generally case-control studies identify incident cases
and collect information about the cases' experiences before diagnosis of disease, and for an
equivalent time period for the controls. Case-control studies are sometimes called retrospective
studies because information is collected about exposures that occurred in the past. For example, a
case-control study of cervical cancer selected a group of women with cervical cancer and a
control group of women who did not have cervical cancer. Information was collected about past
experiences which were hypothesised to be related to risk of cervical cancer, including number of
sexual partners. Case-control studies are analytic, observational and longitudinal.
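The measure of association usually reported from a case-control study is the odds ratio, calculated from a 2x2 table of exposure by case/control status (the odds ratio is covered in Section 2). A minimal sketch with invented numbers:

```python
def odds_ratio(a, b, c, d):
    """Odds ratio from a 2x2 table:
    a = exposed cases,   b = exposed controls,
    c = unexposed cases, d = unexposed controls."""
    return (a * d) / (b * c)

# hypothetical data: 30 of 100 cases exposed, 15 of 100 controls exposed
or_estimate = odds_ratio(30, 15, 70, 85)   # (30*85)/(15*70), about 2.4
```

An odds ratio above 1 suggests the exposure is more common among cases than controls.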
7. Cohort study
A group of people is observed over a period of time in order to measure the frequency of the
disease being investigated. A cohort study starts by documenting exposures and then measures
the subsequent risk of developing disease, according to exposure. Cohort studies aim to identify
associations between exposure to suspected causal agents and the development of disease. The
cohort may be selected by taking a random sample from a population (e.g. the Scottish Heart
Study), by selecting some geographical areas (e.g. the Framingham study) or by taking a particular
group (e.g. the British Doctors study, the Nurses' Health Study). Researchers may also identify an
exposed group of interest (e.g. people working in a particular industry) and find an appropriate
control group who are not exposed to the substance under study. Exposure can be measured at the
beginning of the study (baseline) and also periodically during the follow-up period. The entire
cohort of people is followed up to determine if and when disease develops.
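Because a cohort study follows whole exposed and unexposed groups forward in time, it can estimate the risk of disease directly in each group, and hence the relative risk (covered in Section 2). A sketch with invented numbers:

```python
def relative_risk(exposed_cases, exposed_total,
                  unexposed_cases, unexposed_total):
    """Risk ratio: the risk of disease in the exposed group
    divided by the risk in the unexposed group."""
    risk_exposed = exposed_cases / exposed_total
    risk_unexposed = unexposed_cases / unexposed_total
    return risk_exposed / risk_unexposed

# hypothetical cohort: 24 of 300 exposed and 10 of 250 unexposed develop disease
rr = relative_risk(24, 300, 10, 250)   # risks 0.08 and 0.04, ratio 2.0
```

A relative risk of 2 would mean the exposed group has twice the risk of the unexposed group; case-control studies cannot compute this directly because they fix the numbers of cases and controls by design.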
8. Randomised controlled trial (RCT)
In a randomised controlled trial a group of study participants is selected and then randomly
allocated to an intervention group (or groups), who get the intervention under study, and a control
group. Since group allocation is entirely by chance, this is the best approach for getting two
groups who are comparable in all respects. This means that if there is a difference in outcome
between the two groups it can be attributed to the intervention (provided other aspects of the
study are well carried out).
9. Clinical trial
This is the term used for an experiment which evaluates a treatment. Clinical trials are often, but not
always, randomised controlled trials.
10. Prevention trial
This is the term used for an experiment which evaluates a prevention strategy. Prevention trials can be
randomised controlled trials.
11. Community intervention study
This is the term used for a study which evaluates a community intervention. These studies are usually
experiments, but often not randomised, and may not involve a control group.
4. Content of STAT 115
Learning aims and objectives
By the end of the course students should
• be aware of the appropriate use of common study designs and their strengths and
weaknesses
• be able to describe the information contained in a data set
• be able to carry out common statistical data analyses
• be able to interpret the results of common statistical analyses in the context of the
particular study design used
• be aware of ethical issues relating to research involving humans
• be able to critically evaluate selected research articles published in health sciences
journals.
The material in this course will provide skills for interpreting research in your chosen field of
study, as well as some basic skills for analysing data that you collect through course projects
or labs, using a computer and a statistical software package. If you have mathematical skills,
and are stimulated by the idea of being involved in health research, you may wish to pursue a
career in biostatistics. There are many jobs available for biostatisticians, in New Zealand and
overseas. Most are employed in research groups at universities or in government, or in
pharmaceutical or biotech companies.
Types of research questions covered in STAT 115
There are many types of research question in the health sciences:
• Laboratory studies: research involving understanding how cells and cell components
work, identifying compounds which can be used to treat disease, and how those
compounds affect cells.
• Animal studies: used as models for humans
• Human studies:
– anatomy and physiology consider the structure and function of the human body
– clinical research asks questions relating to patient care, including evaluation of
new treatments
– epidemiology is the study of the distribution and causes of disease
• Studies of public health: the science and art of promoting health, preventing disease
and prolonging life through organised efforts of society
• Studies of society:
– medical sociology examines topics such as the social aspects of physical and
mental illness, physician-patient relationships, the organisation and structure of
health organisations, and the socio-economic basis of the health care system.
In STAT 115 we will focus on research questions involving humans, mainly clinical research
and epidemiology. There are many research questions in these areas which can be understood
without specialised knowledge. In the other areas, particularly laboratory studies, an in-depth
understanding of the field (e.g. biochemistry, molecular biology, anatomy or physiology) is
needed to understand the research questions.
Studying humans brings particular challenges, and it is these challenges which have driven the
specialised development of biostatistics from its statistical basis. The challenges arise from the
more complex ethical issues in research involving humans, as well as the complexities of the
biological system and the consequent research questions we wish to answer.
SECTION 1
This section covers an introduction to the package R-cmdr and presents an overview of
biostatistics and research methodology.
Biostatistics and Research Methodology; R-cmdr
Types of Data
Numerical Data and Histograms
Measures of Centre: Mean and Median
Measures of Variability: Standard Deviation, Variance and Interquartile Range
Box-and-Whisker Plots
1
Section 1
Biostatistics and research: an overview
Course aim:
An introduction to the core biostatistical methods
essential to the health sciences
• scientific method
• design of research studies
• description and analysis of data
The scientific method underpins the design of
research studies. Sound research design is vital
for obtaining reliable information. A major part
of this course is about techniques for describing
data and understanding the principles of analysis.
This enables us to make sense of the mass of
information collected in a research study.
Learning aims and objectives
By the end of the course students should
• be aware of the appropriate use of
common study designs and their strengths
and weaknesses
• be able to describe the information
contained in a data set
• be able to carry out common statistical
data analyses
• be able to interpret the results of common
statistical analyses in the context of the
particular study design used
• be aware of ethical issues relating to
research involving humans
• be able to critically evaluate selected
research articles published in health
sciences journals
Goal of health sciences professions
To improve the health and well-being of
individuals and communities
This involves
• treatment of disease
• prevention of disease
• promotion of health
In order to do this we need knowledge about
• causes of disease
• diagnosis
• disease processes
• effectiveness of treatments
• societal factors which affect health
Examples of current gaps in knowledge
• causes of meningococcal meningitis
How to prevent it? A vaccine?
• SARS, avian influenza
New diseases
• back pain
Not good at treating
• cancer
Nasty treatments for child cancer
• diabetes
Common in Pacific communities
• cardiovascular disease
Common cause of death
• prevention of overweight and obesity
• effective promotion of behaviour change
Prevention of smoking
Knowledge may come from
• teaching
• experience
• research
Research
A process for providing answers to questions for
which the answer is not immediately available
General research areas
What are the causes of meningococcal
meningitis?
Can we develop a vaccine to prevent SARS?
What are the genetic events which lead to
childhood cancer?
Can a new drug improve survival in people with
colorectal cancer?
How can we prevent childhood overweight and
obesity?
What are the main factors affecting quality of life
of people with a chronic illness?
Research provides a systematic process for
answering these questions.
Iron Deficiency – Should NZ Parents Be
Concerned?
[Dr Elaine Ferguson, Dept of Human
Nutrition]
A survey randomly selecting 323 children
aged 6-24 months in Dunedin, Christchurch
and Invercargill.
To assess prevalence of iron deficiency.
To explore factors associated with low body
iron stores. Possible factors are:
Categorical: Sex, Ethnicity, Maternal
Education, Household Income, Breast feeding
Continuous: Age, Meat intake
Regression methods are used as well as
procedures for summarising data.
Does early childhood circumcision reduce the
risk of acquiring genital herpes?
[Dr Nigel Dickson, Dept of Preventive and
Social Medicine]
• Cohort of over 1000 births in 1972 in
Dunedin.
• Called the Dunedin Multidisciplinary
Health and Development study.
• Does early circumcision reduce the risk
of genital herpes?
• Initially this appears to be the case, but it
is an observational study.
• Number of sexual partners is a
confounder.
• When the confounder is allowed for, early
circumcision appears not to be
protective.
• Designed experiments (or clinical trials)
set up in Africa to investigate the effect of
circumcision on HIV.
The research process
The objective for most studies is to use data from
a sample to draw inference about a larger
population:

[Diagram: statistics from a Sample are used to draw Inference about the Underlying population]

Examples:
• we use the frequency with which a disease
occurs in a sample to estimate the
frequency with which disease occurs in the
population
• we study a new treatment in a group of
patients in order to be able to make claims
about the effects of the treatment in all such
patients
Steps in the research process:
Development of the research question
Design of the study
Collection of information
Data description and analysis
Interpretation of results
• the research question
- needs to be framed very carefully
- must be specific enough to be
answerable by a research study
• the study design
- is determined by the research
question
- describes the methods used to collect
the information
• analysis and interpretation
- depend on the study design
Research questions relevant to this course:
Epidemiology: the study of the distribution
and determinants of disease frequency
Clinical research: the study of questions
relating to care of patients
Descriptive questions:
What is the distribution of a disease?
What is the natural history of a disease?
Analytic questions:
What are the causes of a disease?
Will this approach prevent disease?
Does this treatment improve outcome?
Data Analysis and Computer Software
Easy-to-use software is essential for data
management and data analysis. In this course R-cmdr
(R Commander, a menu-driven interface to the R
statistical package) will be used. This package is
widely available on campus, used in most Departments
which specify first year statistics as a prerequisite,
and widely available internationally.
At school, or possibly at University, you may
have used EXCEL. EXCEL
is excellent for data management and reporting
but is poor for statistical analyses and clumsy for
graphical procedures.
R-cmdr is easy to use with good pull-down menu
options. There are three windows in R-cmdr:
• Data Editor (where the data being analysed are
located)
• Output Window (where results appear)
• Syntax Window (not used in this course)
12<br />
Section 1
Introduction to study design
1. Descriptive studies
2. Analytic studies
Experimental studies
Observational studies
Examples of analytic study types
3. Summary
Classification of research designs
Classification of common study types

There are two types of research questions:
Descriptive – describing things
Analytic – testing hypotheses
Strengths and weaknesses of the different designs will be discussed.
1. Descriptive studies
Aim: to describe, for example:
• the characteristics of people with a disease (person, place, time)
• lifestyle patterns of a population
• attitudes to health care
• etc.
Descriptive studies are often called surveys or cross-sectional studies.
Descriptive studies generally use a sample from a population.
Example: What are the serum cholesterol levels of New Zealanders?
Method: Select a subgroup (sample) of people and measure their serum cholesterol levels.
Random sampling
• choose the sample in such a way that every individual in the population has a known chance of being selected
• in a simple random sample, everyone has an equal chance of being chosen
• this method is the best way of obtaining a sample which is representative of the population
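The mechanics of simple random sampling can be sketched in a few lines of Python (the course itself uses R-cmdr; the population values below are invented purely for illustration):

```python
import random
import statistics

random.seed(1)
# Hypothetical population of 10,000 serum cholesterol values (mmol/L).
population = [random.gauss(5.2, 1.0) for _ in range(10_000)]

# Simple random sample: every individual has an equal chance of being
# chosen, and selection is without replacement.
sample = random.sample(population, k=100)

print(round(statistics.mean(population), 2))  # population mean
print(round(statistics.mean(sample), 2))      # sample estimate, close to it
```

Because every individual has the same chance of selection, the sample mean should land close to the population mean, which is the point of the next page.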
Suppose we want to estimate the mean cholesterol level in the population:
Sample average = true mean + error
(the true mean is unknown; the error is made up of systematic error and random error)

random error
• due to natural biological variability
• increasing the sample size will reduce the random fluctuations in the sample mean
systematic error (= bias)
• due to aspects of the design or conduct of the study which systematically distort the results
• occurs if a sample is not representative of the population
• cannot be reduced by increasing the sample size
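The contrast between random and systematic error can be illustrated with a small simulation (a Python sketch with an invented true mean of 5.2):

```python
import random
import statistics

random.seed(42)
true_mean = 5.2  # hypothetical true mean cholesterol (mmol/L)

def sample_mean(n, bias=0.0):
    """Mean of n measurements; `bias` mimics a systematic error."""
    return statistics.mean(random.gauss(true_mean + bias, 1.0) for _ in range(n))

# Random error shrinks as the sample size grows...
for n in (10, 100, 10_000):
    print(n, round(sample_mean(n), 3))

# ...but systematic error (bias) does not: even a huge sample stays off-target.
print(round(sample_mean(100_000, bias=0.5), 3))  # near 5.7, not 5.2
```

Increasing n pulls the biased estimate ever more tightly around the wrong value, which is exactly why bias cannot be fixed by a larger sample.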
2. Analytic studies
Purpose: to test hypotheses about, for example:
• causes of disease
• methods for prevention of disease
• the effects of treatments
Experimental studies
• the researcher intervenes and records the result of the intervention
• the aim is to control all other factors to isolate the effects of the intervention
• the best way to study causation
Observational studies
• the investigator does not intervene, but simply observes a naturally occurring process and collects information
• the ideal is to get as close as possible to the information that would have been obtained if the experimental study could have been done
Example: Options for studying the relationship between smoking and lung cancer
Experimental study: randomly assign people to smoke or not smoke at the start, follow them for 20 years, and compare lung cancer rates in the two groups. Clearly unethical.
Observational study
Cohort: identify known smokers and known non-smokers, follow both groups for 20 years, and compare the percentage developing lung cancer in each.
Problem: the groups may differ in other ways that are related to cancer risk – confounding.
Case-control: take people with lung cancer now (cases) and people without lung cancer now (controls), and compare the percentage of smokers in each group over the past 20 years.
No long-term follow-up is needed and smaller samples suffice, but there could be recall bias about events from 20 years ago. Confounding is also a problem.
Examples of analytic study types
Randomised controlled trial (RCT)
• the “gold standard” analytic study (best)
• experimental
Characteristics of an RCT:
• select a group of people
• randomly allocate them to either an intervention or a control group
• follow participants up over time, and measure the outcome
A control group is used to isolate the effects of the intervention.
Random allocation, or randomisation, means every person has the same chance of being in each group. This gives the best chance of getting two groups which are comparable in all respects.
Used to evaluate new treatments.
Often not ethical in studies of disease causation.
Example RCT: LIPID study (NEJM, 1998)
Does treatment with pravastatin reduce the risk of death in patients with coronary heart disease?
Study participants:
9014 patients
age 31–75
coronary heart disease
cholesterol 155–271 mg/decilitre
The selected participants were randomly allocated to an intervention group (pravastatin, n = 4512) or a control group (n = 4502). After 6 years of follow-up, mortality was 6.4% in the pravastatin group and 8.3% in the control group.
Advantages of an RCT:
• an experiment is the best way to test a hypothesis
• differences in outcome can be attributed to the exposure
Disadvantages of an RCT:
• may not be ethical
Cohort study
An observational study, generally carried out to test hypotheses.
Characteristics:
• participants are selected before disease has developed
• they are followed over time to determine the development of disease
• information is collected about exposures at baseline and during follow-up
• longitudinal
Example of a cohort study:
Study to investigate the relationship between smoking and lung cancer (e.g. the British Doctors study).
Start with a group of people without lung cancer, made up of smokers and non-smokers. Follow both groups for 20 years, then compare the percentage who develop lung cancer in each group.
Case-control study
An observational study, generally carried out to test hypotheses.
Characteristics:
• participants are chosen on the basis of their disease status: a group with the disease (cases) and a group without (controls)
• information is collected from people with and without the disease about exposures that occurred in the past
• longitudinal (retrospective)
Example of a case-control study
Study to investigate the relationship between smoking and lung cancer.
Known at the start: a group of people with lung cancer (cases) and a group of people without lung cancer (controls). Document the smoking history of each group, then compare the percentage of past smokers among the cases with the percentage among the controls.
Cohort vs case-control studies
Cohort study
Advantages:
• the closest observational study to a randomised controlled trial
• good for examining common outcomes
• can evaluate the effect of an exposure on multiple outcomes
Disadvantages:
• a long duration is needed if the disease takes a long time to develop after exposure
• if the disease is rare, the number of participants needs to be very large
Case-control study
Advantages:
• relatively quick
• smaller than cohort studies, particularly for rare diseases
• can examine the effects of multiple exposures
Disadvantages:
• events have already occurred, so the potential for bias is higher
3. Summary
Classification of research designs
Note: these classifications provide a useful framework for thinking about the strengths and weaknesses of different study designs, but they will not always work.
i) Classification by purpose of the study
descriptive (describing things)
versus
analytic (testing hypotheses)
ii) Classification by form of the design
experimental (researcher intervenes)
versus
observational (researcher observes)
iii) Classification by time
cross-sectional (information collected about one point in time)
versus
longitudinal (information collected over a period of time)
Classification of common study types
Randomised controlled trial
• analytic
• experimental
• longitudinal (prospective)
Cohort study
• analytic
• observational
• longitudinal (usually prospective)
Case-control studies
• analytic
• observational
• longitudinal (retrospective)
Types of data and graphical summaries
[A] Data and variables
There are two types of measurement of interest in many scientific studies.
• First, the outcomes measured on each experimental unit (plant, animal, person) provide values of what is called a response variable.
• Second, the characteristics or levels of exposure that explain at least some of the differences in the observed values of the response variable are called explanatory variables.
e.g. iron levels in newborn children are the outcome or response – what are the explanatory variables?
e.g. presence of diabetes is the outcome – what are the explanatory variables?
Data forming the response and exposure variables can be either categorical or numerical (otherwise known as qualitative and quantitative).
1. Categorical data:
The simplest case involves two categories. For example, a person could be
• male/female
• smoker/non-smoker
• diabetic/non-diabetic
Such data have other names such as binary data, dichotomous data, yes/no data and 0–1 data (the last is particularly important; for example, 0 represents non-diabetic and 1 represents diabetic).
A problem could be to establish the chance (or probability) that a woman with a certain profile (defining the explanatory variables) may drink alcohol during pregnancy (the response), or equivalently to find the proportion of pregnant women who will drink alcohol. Ultimately, we are interested in who will do this.
More than two categories can occur:
• blood group: A/B/AB/O
• Maori/Pacific Island/Caucasian/Asian.
In these examples the data are said to be nominal. But this type of data is said to be ordinal if the categories are in some order. For example, “degree of pain” may be minimal/moderate/severe/unbearable.
With more than two ordinal categories it is not possible to use 0/1/2/3 to identify the classes, since “unbearable” is not three times “moderate” even though the data are ordered. Consequences of this will be important in the second half of the semester.
2. Numerical data:
(a) Discrete. Here observations take only certain numerical values. Usually they are counts of events. For example,
• number of possums caught in traps
• number of children in a family (0/1/2/3/4)
These are not like categorical data, as 3 children is three times as many as one. This type of data can be treated as though it is categorical, but this discards information about the magnitude of the relationships between successive outcomes. Ordinal categorical data are important.
(b) Continuous quantitative measures. Here recorded values or observations result from some form of measurement [e.g. height, age, blood pressure, serum cholesterol, oxygen levels in a lake].
• Often there is no restriction on values other than that caused by the accuracy of the equipment used to record them.
• Often the values show a pattern similar to what is called the bell-shaped normal curve, with many values clustered around a central point and few values in the tails.
3. Rates, Ratios and Proportions
These are constructed from categorical data and include, for example, measures of disease frequency and disease association. Examples of disease frequency are
• prevalence or proportion (concerned with existing cases)
• incidence rate (concerned with new cases)
e.g. the prevalence of obesity in the New Zealand population
(gives an indication of the burden on the country by identifying the proportion affected)
e.g. the incidence rate of HIV in New Zealand in 2008
(deals with the number of new cases and is useful when looking at causes)
Examples of disease association are
• absolute (or attributable) risk
• relative risk
• odds ratio
e.g. the relative risk of melanoma for a farmer compared with an office worker. Here, the prevalence of melanoma among farmers is divided by the prevalence among office workers. After an appropriate analysis, essentially a comparison of the two groups, this will show whether there is any association between the prevalence of melanoma and occupation.
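These association measures are covered in detail in Section 2, but the arithmetic can be previewed from a two-by-two table of counts (the counts below are invented purely for illustration):

```python
# Hypothetical 2x2 table (invented counts, for illustration only):
#                 disease   no disease
# exposed            30          70
# unexposed          10          90
a, b = 30, 70   # exposed: diseased, not diseased
c, d = 10, 90   # unexposed: diseased, not diseased

risk_exposed = a / (a + b)      # 0.30
risk_unexposed = c / (c + d)    # 0.10

relative_risk = risk_exposed / risk_unexposed       # 3.0
attributable_risk = risk_exposed - risk_unexposed   # approx. 0.20
odds_ratio = (a / b) / (c / d)                      # approx. 3.86

print(relative_risk, round(attributable_risk, 2), round(odds_ratio, 2))
```

A relative risk of 3 says the exposed group's risk is three times the unexposed group's; the attributable risk is the excess risk in absolute terms.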
4. Other types of response data
• Scores (direct measurement is not possible; instead a patient is assessed on several subjective scales and the values on each are added to give a score for the patient)
e.g. 30 questions on a health survey. A respondent gives values 0 to 3 on each question, and a score out of 90 is then given. This total has convenient properties whereas the individual values may not.
• Patients assess their degree of low back pain after treatment on a scale from 1 (no pain) to 5 (unbearable pain).
Two treatments may be assessed from the two sets of values for patients given a new treatment compared with a standard. The data may be viewed as categorical or continuous, but there are problems, as the difference between 1 and 2 is not necessarily the same as the distance between 4 and 5. The data are certainly ordinal.
• In the social sciences, data are often ordinal.
e.g. In a questionnaire, people are asked to respond by checking the category that best describes their level of agreement with a statement:
a great deal / somewhat / not much / not at all
usually coded as 4, 3, 2, 1.
Such data can be regarded as continuous or categorical (ordinal). If ordinal, then one question is how many categories should be chosen (e.g. 4, as here, or 5 or 7 or 9), and is the distance between 1 and 2 the same as that between 2 and 3, etc.?
[B] Describing Numerical Data
Graphs can be used to summarise data, but many graphs can be highly misleading, especially if too much information is presented. We shall summarise numerical data graphically using
• histograms
• box-and-whisker plots
Particular values which summarise numerical data are:
• mean; median; mode
• standard deviation; interquartile range
These describe the centre and the variability of the data collected, respectively.
Example for Continuous Data: In a hypertension study, 56 men who are heavy smokers (smoked for 25 years) have their blood pressures measured (in mm of Hg). Summarise the outcomes.
Blood pressures are classified into intervals to form a frequency table, and interval frequencies (f_j) are obtained as shown below.

Frequency Table
Pressure (mm of Hg)    Frequency (f_j)
59.5 – (69.5)             2
69.5 – (79.5)             7
79.5 – (84.5)             9
84.5 – (89.5)            10
89.5 – (94.5)            11
94.5 – (99.5)             7
99.5 – (109.5)            8
109.5 – (119.5)           2
Total                    56 (sample size)

Although the readings are likely to be recorded to the nearest mm and hence appear to be discrete, the data are actually continuous, and for this reason the intervals are recorded as 59.5 – (69.5), meaning 59.5 up to but not including 69.5.
Relative frequency: this is f_j/n in the j-th interval, where n is the sample size.

Pressure (mm of Hg)    Freq (f_j)    Relative Freq (f_j/n)
59.5 – (69.5)             2              0.036
69.5 – (79.5)             7              0.125
79.5 – (84.5)             9              0.161
84.5 – (89.5)            10              0.179
89.5 – (94.5)            11              0.196
94.5 – (99.5)             7              0.125
99.5 – (109.5)            8              0.143
109.5 – (119.5)           2              0.036
Total                    56              1.00

Here, 2/56 = 0.036 (rounded to 3 d.p.) and 7/56 = 0.125.
Percentage frequency: the relative frequency multiplied by 100.
e.g. 0.036 = 3.6% (or 3.6 per 100), meaning that 3.6% of the values are in 59.5 – (69.5).
Note: Relative (or percentage) frequencies allow comparison of samples when the samples are of unequal size. Absolute frequencies f_j will not allow this, since all the f_j will be large for a large sample of outcomes but small for a small sample.
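The relative frequency column can be reproduced directly from the interval frequencies (a Python sketch; the course software is R-cmdr):

```python
# Interval frequencies from the blood pressure table above.
freqs = [2, 7, 9, 10, 11, 7, 8, 2]
n = sum(freqs)  # 56, the sample size

rel_freqs = [round(f / n, 3) for f in freqs]
pct_freqs = [round(100 * f / n, 1) for f in freqs]

print(n)          # 56
print(rel_freqs)  # [0.036, 0.125, 0.161, 0.179, 0.196, 0.125, 0.143, 0.036]
print(pct_freqs)  # the same values as percentages
```

The relative frequencies sum to 1 (up to rounding), which is a quick check on the table.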
Histograms: These are simple pictures of the data. The base of each rectangle is the interval length, and the area of each rectangle is proportional to the class frequency (or relative frequency). When the class intervals are all equal, the rectangle heights are proportional to the frequencies as well.
Example: Return to the blood pressure readings.

Pressure (mm)         (f_j)    (f_j/n)
59.5 – (69.5)            2      0.036
69.5 – (79.5)            7      0.125
79.5 – (84.5)            9      0.161
84.5 – (89.5)           10      0.179
89.5 – (94.5)           11      0.196
94.5 – (99.5)            7      0.125
99.5 – (109.5)           8      0.143
109.5 – (119.5)          2      0.036
Total                   56      1.00

[Frequency histogram: frequency per 5 mm interval (vertical axis, 0 to 12) against blood pressure (horizontal axis, with boundaries at 59.5, 69.5, 79.5, 89.5, 99.5, 109.5 and 119.5 mm Hg).]
N.B. (1) The heights of the first two and last two rectangles are halved, but their bases are doubled from 5 to 10 mm. (Area therefore remains proportional to frequency in these intervals if 5 mm is regarded as the horizontal “unit”.)
(2) The label on the vertical axis is given as “Freq. per unit interval”, where “unit” = five.
(3) The relative frequency histogram follows:
[Relative frequency histogram: relative frequency per 5 mm interval against blood pressure over the same boundaries; the first two and last two bars again have doubled bases, with heights 0.018, 0.063, 0.072 and 0.018, and the central bars rise to 0.196.]
(4) The frequency and relative frequency histograms have the same shape; only the scales on the vertical axis differ. Both give some idea of the centre of the data, the extent of the variability in the data and the distribution of the data.
(5) The relative (or percentage) frequency histogram is used when comparing two (or more) samples of data, for example one sample of values from a control group and the other from a treated group of experimental units.
(6) Notice how a histogram with rectangle heights proportional to class frequencies would give a misleading picture of the data when the class intervals are unequal.
(7) You will find that most of the histograms produced by statistical packages like R-cmdr have class intervals of equal length, and you can decide the number of intervals you want in the graph. Usually between 5 and 20 intervals of equal length are chosen for a good summary of the data.
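The height adjustment for unequal intervals described in note (1) can be checked numerically: each bar's height is its frequency per 5 mm "unit", so a bar with a 10 mm base has its height halved (Python sketch):

```python
# With unequal class intervals the bar AREA, not the height, must be
# proportional to frequency. Heights are therefore frequency per 5 mm unit.
edges = [59.5, 69.5, 79.5, 84.5, 89.5, 94.5, 99.5, 109.5, 119.5]
freqs = [2, 7, 9, 10, 11, 7, 8, 2]

heights = []
for (lo, hi), f in zip(zip(edges, edges[1:]), freqs):
    width = hi - lo                  # 10 mm for the outer intervals, else 5 mm
    heights.append(f / (width / 5))  # frequency per 5 mm interval

print(heights)  # [1.0, 3.5, 9.0, 10.0, 11.0, 7.0, 4.0, 1.0]
```

The first two and last two heights are half their raw frequencies, exactly as in the drawn histogram.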
Measures of Central Tendency
The mean is “typical” of the majority of data in a sample.
Example: Six patients lived the following years after diagnosis of HIV.

Datum (Outcome)    Symbol
1.8                x_1
3.2                x_2
6.8                x_3
4.6                x_4
2.8                x_5
7.9                x_6

Mean = (1/6)(1.8 + 3.2 + 6.8 + 4.6 + 2.8 + 7.9) = 27.1/6 = 4.52 years

Notation: mean x̄ = (x_1 + x_2 + x_3 + x_4 + x_5 + x_6)/6, or in general

x̄ = (1/n) Σ_{i=1}^{n} x_i
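The arithmetic of this example can be verified in a couple of lines (a Python sketch of the formula above):

```python
years = [1.8, 3.2, 6.8, 4.6, 2.8, 7.9]  # survival times from the example

n = len(years)
mean = sum(years) / n  # (1/n) * sum of the x_i

print(round(sum(years), 1))  # 27.1
print(round(mean, 2))        # 4.52 years
```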
Note: The mean need not be one of the outcome values, and i is a suffix taking values i = 1 to i = n (or 6 here). Any symbol can be used for this suffix.
Example: The 56 blood pressure readings just considered have a mean of 89.54 mm of Hg. This value is “typical” of the data in the sense that it is near the centre of the region where most values are located.
The Median is a second measure “typical” of data in a sample and is the “middle value” of the data after arranging the numbers in order from smallest to largest.
Example: Data: 95 86 78 90 62 73 89
Rearrange: 62 73 78 86 89 90 95
Median = 86 (the middle value)
Note:
1. If 62 is replaced by 5, the median is unchanged (the mean would be much smaller). This indicates that, in general, the median is not affected by a few very extreme values whereas the mean is.
2. If there is an even number of values, average the two centre values.
Example: For the 56 blood pressure readings, the median turns out to be 89.30 (compare the mean of 89.54).
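A short sketch confirms that the median resists an extreme value while the mean does not:

```python
import statistics

data = [95, 86, 78, 90, 62, 73, 89]
print(statistics.median(data))          # 86
print(round(statistics.mean(data), 1))  # 81.9

# Replace 62 by the extreme value 5: the median is unchanged, the mean drops.
data2 = [95, 86, 78, 90, 5, 73, 89]
print(statistics.median(data2))          # still 86
print(round(statistics.mean(data2), 1))  # 73.7
```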
The mode is another measure of centre. It is the commonest value in the data. This only makes sense for discrete data. For continuous grouped data it coincides with the peak in the histogram. The histogram is bimodal if there is more than one peak.
Further Notes
(1) The mean (89.54) and median (89.30) for the blood pressure readings are close because the data are almost “symmetrical”.
(2) For “non-symmetrical” data the mean and median are different, since the mean is pulled in the direction of the extreme values. The data are said to be skew.
[Sketch: a positively skewed histogram starting at 0, with the median to the left of the mean.]
The mean may be unsuitable as a measure of centre, while the median is more “typical” of most values.
(3) For measurements which cannot be negative it is quite common to have many values close to zero, thus presenting a skew distribution. This is called positive skewness. (The histogram above represents positively skewed data.)
(4) The opposite phenomenon, with an extended left-hand tail, is called negative skewness and is rare.
(5) A trimmed mean is the mean with the lower 5% and upper 5% of values removed.
Measures of Variability
“Looking at the world using data is like looking through a window with ripples in the glass.” (Professor Chris Wild, Auckland University)
Statistics is about variability. Variability reflects differences in the values collected for the different units being measured, for example people, animals, plants, companies, or readings on different days. Two sets of values can have the same mean and median yet show quite different patterns.
Variability can be random or caused by different treatments or “factors” acting on the experimental units in a study in different ways. The hope is that the random variation will be relatively small or controlled by the choice of an appropriate study design. This will result in the identification of important treatment effects explaining key aspects of the variation.
If data are highly variable there are problems analysing the data, and it will be necessary to select larger samples.
The first measure of variation is the range (the distance between the lowest and highest values). It is sensitive to any extreme values and hence not very useful. But reduced ranges (encompassing, say, the central 95% of the data) are useful, as extreme values (outliers) are excluded.
Note: In clinical chemistry (e.g. cholesterol measures) a reference range encompassing the central 95% of values describes variability in normal people and allows test results for other individuals to be assessed to see if corrective action is needed.
A second measure is the (sample) variance, defined by

s² = (1/(n − 1)) Σ_{i=1}^{n} (x_i − x̄)²

Although the divisor is (n − 1) in this equation, we can see that s² is effectively the “average” of the squared deviations of the individual data values
(x_i) from their mean x̄. For technical reasons we do not divide by n.
Notes: 1. The variance is an overall measure of the extent to which the values x_i differ from their mean x̄.
2. Squaring is essential. If the deviations from x̄ are simply added, the value 0 is always obtained.
A third convenient measure is the standard deviation (s), given by

s = √variance = √[ (1/(n − 1)) Σ_{i=1}^{n} (x_i − x̄)² ]

Note: The standard deviation s is measured in the same units as the original data (taking the square root cancels the squaring).
Example: Find the standard deviation of 11, 18, 14, 15, 12.

x_i    x_i − x̄          (x_i − x̄)²
11     11 − 14 = −3       9
18     18 − 14 =  4      16
14     14 − 14 =  0       0
15     15 − 14 =  1       1
12     12 − 14 = −2       4
70                  0    30

x̄ = 70/5 = 14    s = √(30/4) = 2.74

Note that 2.74 is a “typical” or “average” deviation from the mean x̄ = 14.
Example: Return to the 56 blood pressure readings.

Pressure Interval      f_j
59.5 – (69.5)            2
69.5 – (79.5)            7
79.5 – (84.5)            9
84.5 – (89.5)           10
89.5 – (94.5)           11
94.5 – (99.5)            7
99.5 – (109.5)           8
109.5 – (119.5)          2
Total                   56

The standard deviation is s = 11.21. This value is “typical” of deviations from x̄ = 89.54.
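The worked example for 11, 18, 14, 15, 12 can be verified directly (a Python sketch of the variance and standard deviation formulas):

```python
import math

data = [11, 18, 14, 15, 12]
n = len(data)

mean = sum(data) / n                     # 70/5 = 14.0
ss = sum((x - mean) ** 2 for x in data)  # sum of squared deviations: 30.0
variance = ss / (n - 1)                  # divisor is n - 1, not n
sd = math.sqrt(variance)

print(mean, ss)       # 14.0 30.0
print(round(sd, 2))   # 2.74
```

The built-in `statistics.stdev` uses the same n − 1 divisor and gives the same answer.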
The Interquartile Range is another measure of variability.
[Diagram: the range split into four parts, each containing 25% of the data, with boundaries at Q_L, the median and Q_U; the interquartile range runs from Q_L to Q_U.]
The lower quartile Q_L is the value below which a quarter of the data lie. The upper quartile Q_U has ¾ of the data below it. (These are also known as the 25th and 75th percentiles.)
Notes: 1. The interquartile range can be a helpful measure of variability. It is not affected by extreme values.
2. Computer packages also give Q_L and Q_U for large data sets, and the approximations for grouped data are no longer needed.
Example: For the 56 blood pressure readings, Q_L = 82.2 and Q_U = 96.6, with Q_U − Q_L = 14.4.
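Quartiles are easily computed in software; note that packages interpolate slightly differently near the quartiles, so R-cmdr's answers may differ a little from the sketch below, which uses one common convention:

```python
import statistics

data = [95, 86, 78, 90, 62, 73, 89]  # the small data set used earlier

# method="inclusive" interpolates between the sorted data points.
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")

print(q1, q2, q3)  # 75.5 86.0 89.5
print(q3 - q1)     # interquartile range: 14.0
```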
Box-and-whisker plot
This is a second way of summarising data graphically. Like relative frequencies, it is useful when comparing samples of unequal size.

Example: Blood pressures
Q_L = 82.2; Q_U = 96.6; Median = 89.3
Suppose 63 and 116 are the lowest and highest values.

[Boxplot of the blood pressures on a scale from 60 to 120.]

The centre of the data, its variation, its symmetry (or lack of symmetry) and extreme values are displayed.

Notes: (1) Two samples can be compared.

[Two boxplots drawn on a common scale.]

Both samples are skewed; the second is more variable (larger interquartile range) with a larger median.
(2) The points at the ends of the whiskers depend on the package and are
• the extreme values, or
• the 2½% and 97½% values (centiles), or
• points 1½ times the interquartile range away from the boxes.

Outliers beyond these points are shown in R-cmdr by an asterisk or a small circle (as below), where there are obvious changes in the ozone readings recorded over summer in a New Zealand city. An asterisk will represent an extreme outlier.

[Boxplots of the ozone readings by month (11, 12, 1, 2, 3).]
Example: Thirty-two traps were placed in each of three habitats on Stephens Island: pasture, replanted forest and tussock. The data are the counts of skinks per trap totalled over a ten-day period in each habitat. The boxplots are below. Summarize conclusions about skink density.

Pasture:           4 3 0 2 2 1 4 1 2 5 0 1 5 6 5 6
                   11 3 1 1 4 8 5 14 6 8 10 7 4 8 13 6
Replanted forest:  15 24 31 8 4 18 14 33 11 16 20 1 17 12 27 26
                   18 6 12 16 11 8 13 12 11 8 10 17 29 3 12 5
Tussock:           14 23 15 14 5 16 10 16 14 10 7 10 8 12 19 17
                   7 12 29 10 11 11 10 10 6 13 7 10 8 12 6 12

Greater skink density in replanted forest and tussock; greater variation in replanted forest. Some outliers in all three habitats.

Means: 4.88; 14.63; 12.00
Medians: 4.50; 12.50; 11.00
Std deviations: 3.64; 8.18; 5.07
Example: Thirty-four adult hoki were caught off the Kapiti coast, with individual lengths as follows:

Males:   18.7 19.0 18.8 18.4 19.3 19.6 20.3 19.9 19.3 18.9
         18.9 19.0 19.7 20.4 18.6 19.5 20.3 19.9 19.2 18.7
Females: 18.6 19.6 18.3 17.5 18.3 19.0 18.5 18.7 19.3 18.5
         19.1 18.7 19.1 18.8

Boxplots indicate male hoki are longer than female hoki, with slightly greater variation in the males but no outliers. The distributions are almost symmetric.

Means: 19.32; 18.71
Medians: 19.25; 18.70
Std deviations: 0.61; 0.51
Interpreting box-and-whisker plots (Ref: Professor Chris Wild, Auckland University)

[Two pairs of boxplots of observed data, samples A and B, each with B shifted clearly to the right of A. In both cases the call is "B values bigger".]

The above two calls hold for all sample sizes. Larger random samples have more information about the populations from which they come. With large random samples we can make the "B values bigger" call from smaller shifts. Avoid using box-and-whisker plots for samples smaller than about 20.
[Five more pairs of boxplots of observed data, samples A and B, with progressively smaller shifts between the boxes. The calls are: "B values bigger, if both sample sizes > 20"; "What is my call? Cannot tell unless both samples are huge"; and "What is my call? Cannot tell, for all sample sizes".]
How to make the call
This is based on a confidence interval idea (see later), but the result is easy to calculate. In the following, IQR is the interquartile range and n is a sample size. For each sample, form the interval

  Med − 1.5 × IQR/√n   to   Med + 1.5 × IQR/√n

where Med is the sample median. We can claim the values of B tend to be bigger than the values of A back in the populations from which the samples have been taken if these intervals do not overlap.

[Diagram: boxplots for samples A and B with their comparison intervals marked; the intervals do not overlap.]
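The non-overlap rule is easy to code; a hypothetical Python sketch (function names are illustrative, the intervals being median ± 1.5 × IQR/√n):

```python
import math

def comparison_interval(median, iqr, n):
    """Informal comparison interval: median +/- 1.5 * IQR / sqrt(n)."""
    half_width = 1.5 * iqr / math.sqrt(n)
    return (median - half_width, median + half_width)

def call_b_bigger(med_a, iqr_a, n_a, med_b, iqr_b, n_b):
    """True when B's interval lies entirely above A's (no overlap)."""
    lo_a, hi_a = comparison_interval(med_a, iqr_a, n_a)
    lo_b, hi_b = comparison_interval(med_b, iqr_b, n_b)
    return lo_b > hi_a

# Hypothetical sample A: median 10, IQR 4, n 25 -> interval (8.8, 11.2)
print(comparison_interval(10, 4, 25))
# Sample B: median 13, IQR 4, n 25 -> interval (11.8, 14.2), clear of A's
print(call_b_bigger(10, 4, 25, 13, 4, 25))   # True
```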
SECTION 2
This covers the measures of disease frequency and disease association, with several examples looking at prevalence, incidence, relative risks, attributable risk and odds ratios.
Prevalence and Incidence
Cumulative Incidence
Incidence Rate
Disease Association
Relative Risk
Attributable Risk
Odds Ratio
[C] Measures of Disease Frequency
All measures of disease frequency are ratios of the form numerator/denominator.
There are two types of ratio:
1. Proportion: everyone in the numerator must be included in the denominator.
2. Rate: a measure of time is included in the denominator.

The measures of disease frequency are:
1. Prevalence
• gives the frequency of existing cases of disease
• is useful for measuring the disease burden in a community
• is often measured in a cross-sectional survey
e.g. the proportion of Otago students at 3pm Tuesday who have swine flu.
2. Incidence
• measures the frequency of new cases of disease
• is useful for looking at causes of disease
e.g. the number of new cases of cold that develop in a week.

Example: Frequency of hepatitis in two regions.

            New cases       Reporting
Location    of hepatitis    period        Population
Region A         58         1985            25,000
Region B         35         1984–1985        7,000

Region A: 58/25,000 per year
  = 232 per 100,000 per year
  = 23.2 per 10,000 per year
  = 2.32 per 1,000 per year

Region B: 35/7,000 over 2 years = 17.5/7,000 per year
  = 250 per 100,000 per year
  = 2.50 per 1,000 per year

Note: The time period must be specified for the results and comparisons to be meaningful.
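The arithmetic above is just cases ÷ population ÷ years, rescaled; a small Python check (the function name is illustrative):

```python
def incidence_per_100k(new_cases, population, years):
    """New cases per 100,000 population per year."""
    return new_cases / population / years * 100_000

region_a = incidence_per_100k(58, 25_000, 1)   # 1985 only
region_b = incidence_per_100k(35, 7_000, 2)    # 1984-1985, two years

print(round(region_a, 1))   # 232.0
print(round(region_b, 1))   # 250.0
```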
Example: In a survey of eye disease among 2477 people aged 52–85 in Framingham, Massachusetts, there were 310 with cataracts and 22 blind.

Prevalence of cataracts = 310/2477 = 0.125 = 125 per 1000 (or 12.5%)
Prevalence of blindness = 22/2477 = 0.009 = 9 per 1000 (or 0.9%)
Example: In the following diagram the time a person has the disease is shaded.

[Diagram: five subjects plotted against time; reading up the vertical line at each of four successive time points, the prevalence is 1/5, 2/5, 3/5 and 2/5.]

Note on Prevalence:
Prevalence is the proportion of people in a population who have the disease at a given point in time. The time point may refer to calendar time, or to a fixed point in the course of events.
e.g. the proportion of people free from back pain 2 months after back injury.

Note on Incidence:
Incidence, on the other hand, quantifies the number of new cases of disease in a given time period. There are two measures:
• cumulative incidence
• incidence rate
2.1 Cumulative incidence is the proportion of people who become diseased during a specified period of time:

  Cumulative incidence = (number of new cases of disease) / (total population at risk)

This provides an estimate of the probability, or risk, that an individual will develop the disease during the specified period of time.

Example: In a study in Evans County, Georgia, there were 609 men aged 40–76 who had no detected heart disease in 1960. These men were followed for 7 years, and 71 cases of heart disease were detected during this period.

  Cumulative incidence = 71/609 = 0.117 (or 11.7%) over the 7-year period
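A quick check of the Evans County figure (a sketch; cumulative incidence is just a proportion):

```python
def cumulative_incidence(new_cases, population_at_risk):
    """Proportion of the at-risk population that developed disease."""
    return new_cases / population_at_risk

ci = cumulative_incidence(71, 609)
print(round(ci, 3))   # 0.117, over the 7-year period
```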
Notes: (1) The time period over which cumulative incidence is calculated must be specified for it to be interpretable.
(2) Cumulative incidence assumes the entire population at risk at the beginning of the study period has been followed for the whole study period. But often:
• people are lost to follow-up
• people are enrolled in the study at different times
The length of the follow-up period is therefore not the same for everyone in the study. It is the incidence rate that takes account of varying amounts of follow-up time.
2.2 Incidence rate:

  Incidence rate = (number of new cases of disease) / (total person-time at risk)

The same amount of person-time results from following:
• 16 people for one year
• 4 people for four years
Both give 16 person-years of observation.
Example: Calculation of person-years for an incidence rate. Five subjects were followed between January 1997 and January 2002; follow-up began at different times, two subjects developed the disease, and one was lost to follow-up:

Subject   Time at risk (years)
A          2.0   (lost to follow-up)
B          3.0   (developed disease)
C          5.0
D          4.0
E          2.5   (developed disease)
Total     16.5

Number of new cases = 2
Number of person-years at risk = 16.5
Incidence rate = 2/16.5 = 0.121
That is, 12.1 cases per 100 person-years of observation.
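The person-time bookkeeping can be sketched in Python (follow-up times as in the table; the structure is illustrative):

```python
# (years at risk, developed disease?) for subjects A-E, from the table above
follow_up = {
    "A": (2.0, False),   # lost to follow-up
    "B": (3.0, True),    # developed disease
    "C": (5.0, False),
    "D": (4.0, False),
    "E": (2.5, True),    # developed disease
}

person_years = sum(years for years, _ in follow_up.values())
new_cases = sum(1 for _, diseased in follow_up.values() if diseased)

rate = new_cases / person_years
print(person_years)            # 16.5
print(round(rate * 100, 1))    # 12.1 cases per 100 person-years
```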
Example: A study in the United States measured the incidence rate of stroke in a group of 118,539 women aged 30–55 years. The women were free from stroke in 1986, and were followed for 8 years.

Smoking        No. of cases   Person-years of   Stroke incidence rate
category       of stroke      observation       (per 100,000
                              (over 8 years)    person-years)
Never smoked        70           395,594             17.7
Ex-smoker           65           232,712             27.9
Smoker             139           280,141             49.6
Total              274           908,447             30.2
Incidence rate = (274/908,447) × 100,000 = 30.2 cases of stroke per 100,000 person-years of observation.

Average follow-up per woman = 908,447/118,539 = 7.7 years
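The rates in the stroke table follow the same recipe, cases ÷ person-years × 100,000; a Python sketch:

```python
# (cases, person-years) by smoking category, from the table above
study = {
    "never smoked": (70, 395_594),
    "ex-smoker":    (65, 232_712),
    "smoker":       (139, 280_141),
}

rates = {category: round(cases / py * 100_000, 1)
         for category, (cases, py) in study.items()}

total_cases = sum(c for c, _ in study.values())   # 274
total_py = sum(p for _, p in study.values())      # 908,447
total_rate = round(total_cases / total_py * 100_000, 1)

print(rates)         # {'never smoked': 17.7, 'ex-smoker': 27.9, 'smoker': 49.6}
print(total_rate)    # 30.2
```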
Note: The denominator for measures of incidence should include only those who are at risk of developing the disease. It should exclude
• those who already have the disease
• those who cannot develop the disease
Failure to do this will lead to an underestimate of the true incidence, since fewer will develop the condition.
For example, when studying the incidence of endometrial cancer we should exclude women who have had a hysterectomy.
Example: In (a)–(c) calculate a relevant measure of disease frequency and give its name.

(a) You survey 346 travellers returning from overseas travel and find that 95 of them experienced a diarrhoeal illness on their trip. (1 mark)

(b) A tour of 143 people is travelling through Central America for 2 weeks. During this trip 28 of the people experience a diarrhoeal illness. (1 mark)

(c) A group of 18 Peace Corps volunteers in Guatemala kept daily records of their exposure to various risk factors (such as untreated water) and whether or not they had diarrhoea. The following values are the numbers of new episodes of diarrhoea, with the number of weeks of records in brackets, for each of the 18 individuals:

12(88) 12(46) 19(77) 7(102) 8(73) 15(110) 7(101) 9(94) 2(62)
8(25) 1(90) 1(17) 15(28) 9(30) 5(101) 7(21) 14(109) 17(93)

NOTE: You should assume that the reported number of weeks does not include weeks in which the individual had diarrhoea when the week started (i.e., each person was disease-free at the start of each week). (1 mark)
Solution
(a) 95/346 = 0.275. Prevalence = 27.5 per 100: 27.5% of the overseas travellers report experiencing diarrhoea during their trip.
(b) 28/143 = 0.196. Cumulative incidence = 19.6 cases per 100 exposed per 2 weeks.
(c) In this problem you are calculating an incidence rate. You generally calculate the incidence rate as the total number of episodes divided by the total exposure time:

(12+12+19+7+8+15+7+9+2+8+1+1+15+9+5+7+14+17) / (88+46+77+102+73+110+101+94+62+25+90+17+28+30+101+21+109+93)
= 169/1269 = 0.133

Thus, incidence rate = 13.3 cases per 100 person-weeks of observation.
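The incidence-rate calculation in (c) can be reproduced by summing the two lists; a Python sketch using the data as printed above:

```python
episodes = [12, 12, 19, 7, 8, 15, 7, 9, 2, 8, 1, 1, 15, 9, 5, 7, 14, 17]
weeks    = [88, 46, 77, 102, 73, 110, 101, 94, 62, 25, 90, 17, 28, 30, 101, 21, 109, 93]

# Incidence rate = total episodes / total person-weeks at risk;
# rounds to 0.133, agreeing with the worked answer
rate = sum(episodes) / sum(weeks)
print(round(rate, 3))   # 0.133, i.e. 13.3 per 100 person-weeks
```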
Relationship between prevalence and incidence

Example: Disease A
[Diagram: five subjects followed over a t-year period; each develops the disease, but the shaded disease periods show that only two subjects still have the disease at time L.]
Cumulative incidence = 5/5 in t years
Prevalence at time L = 2/5

Disease B
[Diagram: all five subjects develop the disease and still have it at time L.]
Cumulative incidence = 5/5 in t years
Prevalence at time L = 5/5
Note: Prevalence depends on
• the incidence rate
• the duration of disease

Diabetes (adult onset)
• annual incidence rate is low
• duration is long, as the disease is neither curable nor fatal
so prevalence is high relative to incidence.

Cold
• incidence is high
• duration is short
so prevalence is low relative to incidence.
HIV/AIDS
Many with HIV will live for a long time, so the prevalence of HIV in the community will be high. There is also an issue related to the fact that a person may not know they are HIV positive; hence we are likely to underestimate the prevalence.

If diagnosed with AIDS, death comes quickly, i.e. few are living with AIDS. Hence AIDS prevalence is relatively low.

There are obvious issues related to health care provision and planning.
[D] Measures of disease association
Comparisons of disease frequency are made between different groups of people. In the simplest (and very common) setting there are two groups, one exposed and the other unexposed.

Example: Data from a cohort study of oral contraceptive (OC) use and bacteria in the urine among women aged 16–49 years over 3 years.

                     Bacteria present
                     Yes     No     Total
OC use   Yes          27    455       482
         No           77   1831      1908
         Total       104   2286      2390

Data from D.A. Evans et al., NEJM (1978)

Bacteria is the disease category (the outcome measure). OC use is the exposure category.
Cumulative Incidence
OC users: 27/482 = 0.056, i.e. 56 cases per 1000 in 3 years
Non-users: 77/1908 = 0.040, i.e. 40 cases per 1000 in 3 years

Measures of Association:
Difference (absolute effect): 56 − 40 = 16 cases per 1000 in 3 years
Ratio (relative effect): 56/40 = 1.4
The proportion of OC users with bacteria is 1.4 times that for non-users.
[Note that the ratio does not include the time interval.]
1. Relative effect = Relative Risk (RR)
• the ratio of the incidence in the exposed group (I_e) to the incidence in the unexposed group (I_0):

  RR = I_e / I_0   with   RR > 1 (exposure → disease)
                          RR = 1 if I_e = I_0
                          RR < 1 (exposure is protective)

• indicates how much more likely disease is to develop in the exposed group than in the unexposed group
• no association between exposure and disease: RR = 1 (I_e = I_0)
• a good measure of the strength of an association
• the usual measure in studies of causation of disease
• ratios of prevalences can also be calculated, but the interpretation is different
2. Absolute effect = Attributable Risk (AR)
• the difference in incidence between the exposed and unexposed groups:

  AR = I_e − I_0

• indicates how many more people with disease there are in the exposed than in the unexposed group
• no association between exposure and disease: AR = 0 (I_e = I_0)
• assuming a cause-effect relationship between exposure and disease, we say:
if AR > 0, AR is the number of cases of the disease among the exposed that can be attributed to their exposure;
if AR < 0, |AR| is the number of cases among the exposed that have been prevented by the exposure.
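Both measures are simple functions of the two incidences; a Python sketch, applied to the oral-contraceptive data from earlier (I_e = 27/482, I_0 = 77/1908):

```python
def relative_risk(i_exposed, i_unexposed):
    """RR = I_e / I_0."""
    return i_exposed / i_unexposed

def attributable_risk(i_exposed, i_unexposed):
    """AR = I_e - I_0."""
    return i_exposed - i_unexposed

i_e = 27 / 482     # OC users, 3-year cumulative incidence
i_0 = 77 / 1908    # non-users

print(round(relative_risk(i_e, i_0), 1))           # 1.4
print(round(attributable_risk(i_e, i_0) * 1000))   # 16 per 1000 in 3 years
```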
Example: A randomised trial of the effectiveness of infra-red stimulation compared with placebo on pain caused by cervical osteoarthritis (degenerative joint disease in the neck), carried out over two months. (Placebo or control: mock stimulation.)

                           Treatment   Control
Improvement in pain            18          8
No improvement in pain          7         17
Total                          25         25

Exposure is Treatment/Control.
Disease is Improvement/No improvement in pain [the outcome classification].

Cumulative incidence of improvement (in 2 months):
Treatment group: 18/25
Control group: 8/25

Relative risk = (18/25) / (8/25) = 2.25 ≈ 2.3

The chance of improvement in the treatment group is 2.3 times the chance in the control group.
Example: Prevalence of coronary heart disease (CHD) at initial examination among 4469 persons aged 30–62 years in the Framingham Study.

           Number      Number      Prevalence
           examined    with CHD    per 1,000
Males        2024         48          23.7
Females      2445         28          11.5

Note that 23.7 = (48/2024) × 1,000, hence called prevalence per 1,000. Similarly, 11.5 = (28/2445) × 1,000.

Relative risk = 23.7/11.5 = 2.1
[Heart disease is twice as common in males as in females.]

Attributable risk = 23.7 − 11.5 = 12.2 per 1000
[There are 12.2 more cases of heart disease in 1000 men than in 1000 women.]
Example: Data from a cohort study of postmenopausal hormone use and coronary heart disease (CHD) among female nurses.

                           CHD cases   Person-years
Postmenopausal    Yes          30        54,308.7
hormone use       No           60        51,477.5

Data from Stampfer et al., NEJM (1985)

Incidence rates:
Users: 30/54,308.7 = 55 per 100,000 person-years
Non-users: 60/51,477.5 = 117 per 100,000 person-years

Attributable risk: 55 − 117 = −62 cases of CHD per 100,000 person-years
Hormone use prevents 62 cases per 100,000 person-years.

Relative risk: 55/117 = 0.47
The risk of CHD among users is 0.47 times the risk in non-users (i.e. a 53% reduction in risk).
Example: Relative and attributable risks of mortality from lung cancer and coronary heart disease among cigarette smokers in a cohort study of British male physicians.

                        Annual mortality rate per 100,000
                        Lung cancer    Heart disease
Cigarette smokers           140             669
Non-smokers                  10             413
Relative risk              14.0             1.6
Attributable risk           130             256
(per 100,000 per year)

Data from Doll and Peto, Br Med J (1976)

RR: 140/10 = 14.0        669/413 = 1.6
AR: 140 − 10 = 130       669 − 413 = 256

Heart disease is more common, therefore a smaller relative increase in risk produces more people with disease.
Note
Relative risks
• provide information on the strength of an association
• can be used to assist in assessing the likelihood of a causal association
Attributable risks
• measure the impact of an exposure (assuming that it is causal)
If a disease is common, a small relative risk will translate to a large attributable risk [see the previous example].
3. Odds Ratio: a third measure of association
This can be used in case-control studies, where measures of disease frequency in the study population are not available.

  Odds of disease = (chance, or probability, of disease) / (chance, or probability, of no disease)

See later.
SECTION 3
This section covers a brief introduction to probability definitions, notation, rules and random variables with examples, several involving tree diagram use.
Definitions including mutually exclusive and independent events
The Addition Rule for combining probabilities
The Multiplication Rule for probabilities
Tree diagrams with examples
Screening test terminology
Probability Distributions and Random Variables
Rules for combining Random Variables
Introduction To Probability
To define what we mean by probability we need to talk about experiments and events.
• An experiment is the process by which observations or measurements are obtained.
• The outcome of an experiment is referred to as an event, and may also represent a group of possible outcomes.
• The set of all possible individual outcomes is the sample space.

Example: Toss a coin once. Observe event A – the coin comes up a head (H) – or B – the coin comes up a tail (T). The sample space is {H, T}.

An experiment results in outcomes that cannot be predicted in advance. This uncertainty about an outcome is measured by the probability of the event. Different events have different probabilities. We define the probability of an event A as

  Pr(A) = n_A / N

where n_A is the number of experiments resulting in event A in a very large number (N) of repetitions of the experiment.
A probability is therefore like a relative frequency. It is a measure on a scale from 0, representing absolute impossibility, to 1, representing absolute certainty. Subjective estimates of probability are "unlikely", "possibly", "almost never", etc., which all convey an idea of the likelihood of occurrence of an event. But different people attach different values to these (and this is a problem). For example, what is the probability that God exists (0 or 1)?

Probability calculations began with games of chance over 3000 years ago. The games involve coins, dice, cards, roulette, etc. With such objects we can develop exact probabilities of possible outcomes or events by making sensible assumptions:
• a die (plural dice) is fair (1/6 is the probability of any outcome)
• a coin is fair (1/2 is the probability of a head)
• a card is drawn (1/52 is the probability of any particular card)
• a birth date (1/365 is the probability of a particular day)
Probabilities associated with these objects can be calculated using our knowledge of the properties of these objects.
Example: An experiment involves throwing a fair die. The event is "obtaining an even number". The answer is 3/6 or 1/2 (easy). This probability could also be found by experiment, tossing the die many times.
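That "tossing the die many times" idea is exactly the relative-frequency definition of probability; a Python simulation sketch:

```python
import random

random.seed(1)  # make the run reproducible

N = 100_000
# Count how many of N simulated throws of a fair die come up even
n_even = sum(1 for _ in range(N) if random.randint(1, 6) % 2 == 0)

estimate = n_even / N
print(round(estimate, 2))   # close to the exact answer 3/6 = 0.5
```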
In practice, experiments are much more complex than this in situations of interest to researchers. Events result from such experiments, and event probabilities are needed if we are to draw conclusions from the sample data collected.
Further Examples
1. An experiment treats 20 patients in a clinical investigation involving a new drug.
An event is "at least 12 patients are cured".
What is the probability of this event?
2. An experiment selects 500 voters in a survey.
An event is "at least 300 support windmill farms in Central Otago".
3. An experiment treats two "equal" samples of cancer patients, one by surgery and one by chemotherapy.
An event is "more chemotherapy patients are cured". The probability will give insight into the better treatment.

Theoretical probabilities are unknown in such situations, hence these probabilities must be estimated from experimental data by observing outcomes or noting historical information.
Combining Probabilities for Multiple Events
Example: Consider the probability of being in each of the four blood groups. The probabilities from the Dunedin blood donor centre are:

Blood type    Pr(blood type)
A                 0.38
B                 0.11
AB                0.04
O                 0.47

(These probabilities can also be estimated by "experiment": the relative frequencies will approach these values if many people are sampled.)

1. What is the probability that a person is either A or B?
2. What is the probability that 3 unconnected (or independent) people are all in blood group O?

Solution:
1. For two mutually exclusive outcomes, the probability that either occurs is the sum of the individual probabilities. The probability of being either A or B is

  Pr(A) + Pr(B) = 0.38 + 0.11 = 0.49

Note: Pr(A) + Pr(B) + Pr(AB) + Pr(O) = 1
Here we have used the fact that the outcomes are mutually exclusive: a person cannot be in both blood groups A and B.

2. For any two independent outcomes, the probability that both are observed is the product of the individual probabilities. This can be extended to three people in the obvious way. Therefore, the probability that three people all have blood group O can be shown to be (see later)

  Pr(O) × Pr(O) × Pr(O) = 0.47 × 0.47 × 0.47 = 0.104

Note: Independent events arise if the outcome of one event tells us nothing about the other event. We obviously must exclude the possibility that the three people are in the same family.

Note: This example illustrates the two laws for combining probabilities:
• the addition rule in part 1
• the multiplication rule in part 2
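The two rules can be checked numerically; a Python sketch with the blood-group probabilities:

```python
pr = {"A": 0.38, "B": 0.11, "AB": 0.04, "O": 0.47}

# Addition rule (mutually exclusive events): Pr(A or B) = Pr(A) + Pr(B)
p_a_or_b = pr["A"] + pr["B"]
print(round(p_a_or_b, 2))          # 0.49

# The four blood groups exhaust all possibilities
print(round(sum(pr.values()), 2))  # 1.0

# Multiplication rule (independent people): Pr(all three are O)
p_three_o = pr["O"] ** 3
print(round(p_three_o, 3))         # 0.104
```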
Properties of Probabilities and Probability Laws
Notation: There is a convenient notation for representing event probabilities. Suppose S represents all possible outcomes of an experiment, A is the collection of these outcomes representing an event, and Ā is the collection of outcomes which are not in A.
• Ā is the event called the complement of A
• A and Ā are said to be mutually exclusive (no overlap)
• Also Pr(A) + Pr(Ā) = 1, since A and Ā together represent every possible outcome.

Now suppose two events A and B may overlap.
• The event A or B, denoted by A ∪ B, occurs if at least one of A or B occurs. It is called the union of A and B.
• The event A and B, denoted by A ∩ B, occurs if both A and B occur. It is called the intersection of A and B.

Example: A fair die is thrown. A is the event "a number greater than 3 is thrown" and B is the event "an even number is thrown".
Then S = {1, 2, 3, 4, 5, 6}
A = {4, 5, 6}, Pr(A) = 3/6
B = {2, 4, 6}, Pr(B) = 3/6
A ∩ B = {4, 6} and A ∪ B = {2, 4, 5, 6}
Pr(A ∩ B) = 2/6 and Pr(A ∪ B) = 4/6

[Venn diagrams over the set of all outcomes: Fig (i) shows A and B overlapping, A ∩ B not empty; Fig (ii) shows A and B disjoint, A ∩ B empty (mutual exclusiveness).]

The addition rule for combining probabilities:

  Pr(A or B) = Pr(A ∪ B) = Pr(A) + Pr(B) − Pr(A ∩ B)
since values in the intersection A ∩ B are counted twice. The special case when A and B are mutually exclusive is

  Pr(A ∪ B) = Pr(A) + Pr(B)

This was illustrated in the blood group example, part (1).

Example: The die again:
  Pr(A ∪ B) = 3/6 + 3/6 − 2/6 = 4/6
using the addition rule.
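Representing events as Python sets makes the rule easy to verify for the die example (exact arithmetic via fractions):

```python
from fractions import Fraction

S = {1, 2, 3, 4, 5, 6}   # sample space for a fair die
A = {4, 5, 6}            # a number greater than 3
B = {2, 4, 6}            # an even number

def pr(event):
    """Equally likely outcomes: Pr = |event| / |S|."""
    return Fraction(len(event), len(S))

# Addition rule: Pr(A ∪ B) = Pr(A) + Pr(B) - Pr(A ∩ B)
lhs = pr(A | B)
rhs = pr(A) + pr(B) - pr(A & B)
print(lhs, lhs == rhs)   # 2/3 True
```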
The Multiplication Rule
The intersection of two events A and B is the event that both occur. The probability of this is

  Pr(A and B) = Pr(A ∩ B) = Pr(A) Pr(B|A)

In words, this says that for both of the two events to occur, first one must occur [Pr(A)] and then, given that the first has occurred, the second must occur [Pr(B|A)].

If both Pr(A) and Pr(A and B) are given, this rule can be used to define conditional probability as

  Pr(B|A) = Pr(A ∩ B) / Pr(A)
92<br />
Section 3
Independence<br />
The idea behind the term Pr(B|A) is that the<br />
occurrence <strong>of</strong> event A may cause a reassignment<br />
<strong>of</strong> probability to event B that makes it differ from<br />
the original value Pr(B). When the occurrence <strong>of</strong><br />
A gives no additional information about B, A <strong>and</strong><br />
B are independent.<br />
That is Pr(B|A) = Pr(B)<br />
In this situation the multiplication rule is
Pr(A ∩ B) = Pr(A) Pr(B)
Otherwise it is the original
Pr(A ∩ B) = Pr(A) Pr(B|A)
The first rule was illustrated in the blood group example, where the probability of 3 independent people all having blood group O was
Pr(A ∩ B ∩ C) = Pr(A) Pr(B) Pr(C) = 0.47 × 0.47 × 0.47 = 0.104
Example: A survey <strong>of</strong> hospital patients shows<br />
that the probability a patient has high blood<br />
pressure given he/she is diabetic is 0.85. If 10%<br />
<strong>of</strong> patients are diabetic <strong>and</strong> 25% have high blood<br />
pressure:<br />
(a) Find prob. a patient has both diabetes <strong>and</strong><br />
high blood pressure.<br />
(b) Are the conditions of diabetes and high blood pressure independent?
Solution: (a) A is the event “patient has high blood pressure” and B is the event “patient is diabetic”.
Pr(A|B) = 0.85, Pr(B) = 0.10 and Pr(A) = 0.25
∴ Pr(A ∩ B) = Pr(A|B) Pr(B) by the multiplication rule
= 0.85 × 0.10 = 0.085
(b) Pr(A) = 0.25 ≠ Pr(A|B) = 0.85. Hence the two conditions are not independent.
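A minimal Python sketch of this solution (the probabilities are taken from the example above):

```python
# Multiplication rule and independence check for the
# diabetes / high blood pressure example.
pr_A_given_B = 0.85  # Pr(high blood pressure | diabetic)
pr_B = 0.10          # Pr(patient is diabetic)
pr_A = 0.25          # Pr(high blood pressure)

pr_A_and_B = pr_A_given_B * pr_B       # multiplication rule: 0.085
# independence would require Pr(A | B) == Pr(A)
independent = abs(pr_A_given_B - pr_A) < 1e-12
```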
A tree diagram is useful for helping calculate the<br />
probability <strong>of</strong> a combined event. The stages <strong>of</strong><br />
the combined event can be dependent or<br />
independent.<br />
Example: Independent Stages.<br />
Stephens Isl<strong>and</strong> is an uninhabited isl<strong>and</strong> in Cook<br />
Strait where tuatara are being re-established. For<br />
some years three locations have been visited on<br />
the isl<strong>and</strong> <strong>and</strong> tuatara have been found at a<br />
location with probability 0.4. At any visit X<br />
represents the number <strong>of</strong> locations out <strong>of</strong> three at<br />
which tuatara are observed. X can take values 0,<br />
1, 2 or 3. Find the probabilities that 0, 1, 2, or 3<br />
locations have tuatara on a visit.<br />
T is the event “location has tuatara’’ <strong>and</strong> N is the<br />
complementary event “location has no tuatara”.<br />
[Tree diagram over Locations 1, 2 and 3: each location branches into T (Pr = 0.40) and N (Pr = 0.60), giving eight outcomes.]
Outcome  Pr(Outcome)  No. of locations with tuatara
TTT      0.064        3
TTN      0.096        2
TNT      0.096        2
TNN      0.144        1
NTT      0.096        2
NTN      0.144        1
NNT      0.144        1
NNN      0.216        0
Then Pr(T) = 0.40 (known historically).
The second location is independent of the first, so
Pr(both T) = Pr(T ∩ T) = Pr(T) Pr(T) = (0.40)(0.40) = 0.160
using the multiplication rule, and
Pr(TTT) = (0.4)(0.4)(0.4) = 0.064
The tree diagram shows all possible outcomes.<br />
Branch probabilities are multiplied to give the<br />
probabilities <strong>of</strong> the 8 possible outcomes.<br />
The addition rule tells us that the probability <strong>of</strong><br />
seeing tuatara at two <strong>of</strong> the three sites, Pr(X = 2),<br />
adds the probabilities <strong>of</strong> the three possible<br />
outcomes, TTN, TNT <strong>and</strong> NTT.<br />
That is, Pr(X = 2) = 0.096 + 0.096 + 0.096<br />
= 0.288<br />
Similarly, Pr(X = 0) = 0.216, Pr(X = 1) = 0.432<br />
<strong>and</strong> Pr(X = 3) = 0.064.<br />
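The whole tree can be enumerated in Python; multiplying branch probabilities and adding over outcomes reproduces the distribution of X (a sketch, not part of the original notes):

```python
# Enumerate the eight branches of the tuatara tree diagram.
from itertools import product

p_t = 0.40                         # Pr(location has tuatara)
dist = {k: 0.0 for k in range(4)}  # Pr(X = k), k = number of T's

for outcome in product("TN", repeat=3):
    pr = 1.0
    for site in outcome:           # multiplication rule along a branch
        pr *= p_t if site == "T" else 1 - p_t
    dist[outcome.count("T")] += pr  # addition rule across branches
```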
In the next examples, the probability at each branch of the tree is conditional on earlier outcomes; i.e. the events are no longer independent, but branch probabilities are still multiplied according to the multiplication law for probabilities.
Example: Dependent stages. Andrew, John, and Mark play a game. There are six similar cars, two of which have had the brake cylinders removed. Each player chooses a car at random, drives at high speed towards a cliff, and attempts to brake in time to stop. The boys decide to proceed in alphabetical order. Find Pr(each will lose) and Pr(no loser), assuming that the game stops when the first boy drives over the cliff.
[Tree diagram: Andrew picks a faulty car (2/6) → Andrew loses; a good car (4/6) → John picks a faulty car (2/5) → John loses; a good car (3/5) → Mark picks a faulty car (2/4) → Mark loses; a good car (2/4) → no loser.]
Pr(Andrew loses) = Pr(Andrew picks a faulty car) = 2/6
Pr(John loses) = Pr(Andrew picks a good car and John picks a faulty car) = (4/6)(2/5) = 4/15
Pr(Mark loses) = Pr(Andrew and John pick good cars, and Mark picks a faulty car) = (4/6)(3/5)(2/4) = 3/15
In probability notation we get:
A is the event “Andrew loses”; Ā is the event “Andrew does not lose”.
Pr(A) = 2/6 and Pr(Ā) = 4/6
J is the event “John loses”; J̄ is the event “John does not lose”.
It is not Pr(J) = 2/6. Instead, Pr(J) is revised using the extra information:
Pr(J) = Pr(J|Ā) Pr(Ā) = (2/5)(4/6) = 4/15
and so on.
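The dependent-stage calculation can be written with exact fractions (a sketch of the car-game example above):

```python
# Car game: 6 cars, 2 with faulty brakes, drawn without replacement.
from fractions import Fraction as F

pr_andrew = F(2, 6)                        # Andrew picks a faulty car
pr_john = F(4, 6) * F(2, 5)                # Andrew good, then John faulty
pr_mark = F(4, 6) * F(3, 5) * F(2, 4)      # two good cars, then Mark faulty
pr_no_loser = F(4, 6) * F(3, 5) * F(2, 4)  # all three pick good cars

# the four outcomes are exhaustive and mutually exclusive
assert pr_andrew + pr_john + pr_mark + pr_no_loser == 1
```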
Example: Screening Programmes<br />
A patient with certain symptoms consulted her<br />
doctor to be checked for a cancer. The patient<br />
undergoes a biopsy. With this test there is a<br />
probability <strong>of</strong> 0.90 that a woman with the cancer<br />
shows a positive biopsy, <strong>and</strong> a probability <strong>of</strong> only<br />
0.001 that a healthy woman incorrectly shows a<br />
positive biopsy.<br />
Historical information also suggests that 1 in<br />
10,000 women have the cancer. [This is the<br />
prevalence <strong>of</strong> the cancer in the population.]<br />
Find the probability that a woman has the cancer<br />
given the biopsy says she does.<br />
(Essentially the problem is to decide the ability <strong>of</strong><br />
the biopsy to diagnose true patient status. The<br />
principle applies to breast <strong>and</strong> cervical cancer in<br />
New Zeal<strong>and</strong>.)<br />
Solution: A is event “woman has the cancer”<br />
B is event “biopsy is positive” (indicating cancer)<br />
Pr(A) = 0.0001 (the disease prevalence)
Pr(B|A) = 0.90 (a conditional probability)
Pr(B|Ā) = 0.001 (Ā is the complement of A)
The problem is to find Pr(A|B).
[Tree diagram:
A, Pr(A) = 0.0001: B, Pr(B|A) = 0.90 → biopsy +ve (true positive); B̄, Pr(B̄|A) = 0.10 → biopsy −ve (false negative).
Ā, Pr(Ā) = 0.9999 (the complement): B, Pr(B|Ā) = 0.001 → biopsy +ve (false positive); B̄, Pr(B̄|Ā) = 0.999 → biopsy −ve (true negative).]
By the multiplication rule for dependent events,
Pr(True positive) = Pr(A ∩ B) = Pr(B|A) Pr(A) = 0.90 × 0.0001 = 0.00009 (nine out of 100 000 show a true positive)
Pr(False negative) = Pr(B̄|A) Pr(A) = 0.10 × 0.0001 = 0.00001
Pr(False positive) = Pr(B|Ā) Pr(Ā) = 0.001 × 0.9999 = 0.00100 (100 out of 100 000 show a false positive)
Pr(True negative) = Pr(B̄|Ā) Pr(Ā) = 0.999 × 0.9999 = 0.99890
Pr(Test positive) = Pr(B) = 0.00009 + 0.00100 = 0.00109 (109 out of 100 000 show a positive test)
Therefore,
Pr(A|B) = Pr(A ∩ B) / Pr(B) = 0.00009 / (0.00009 + 0.00100) = 0.00009 / 0.00109 = 0.083
(nine of the 109 with a positive biopsy have the cancer)
Conclusion: Only 8.3% <strong>of</strong> those women<br />
identified as having the disease actually do.<br />
(This is not at all what we would expect <strong>and</strong> is<br />
rather unsatisfactory.)<br />
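The biopsy calculation is an application of Bayes' theorem; a brief Python sketch using the rates above:

```python
# Positive predictive value of the biopsy via Bayes' theorem.
prevalence = 0.0001  # Pr(A): woman has the cancer
sensitivity = 0.90   # Pr(B | A)
false_pos = 0.001    # Pr(B | not A)

pr_pos = sensitivity * prevalence + false_pos * (1 - prevalence)  # Pr(B)
ppv = sensitivity * prevalence / pr_pos                           # Pr(A | B)
```

Even with a test this specific, the tiny prevalence drives the positive predictive value down to about 8%.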
1. Pr(B|A) is called the sensitivity of the test (the probability that a person with the disease returns a positive result, or the proportion of positives that are correctly identified).
2. Pr(B̄|Ā) is called the specificity of the test (the proportion of negatives that are correctly identified by the test).
3. From a practical point of view, sensitivity and specificity alone are not helpful, as the point of diagnostic testing is to make a diagnosis; i.e. we need to know the probability that the test gives the correct diagnosis, whether it is positive or negative. That is Pr(A|B), not Pr(B|A).
4. Pr(A|B) is the positive predictive value (the proportion of patients with positive test results who are correctly diagnosed).
5. The negative predictive value is the proportion of patients with negative test results who are correctly diagnosed, i.e. Pr(Ā|B̄).
Example: A patient consulted his GP because<br />
he had intermittent chest pain. The description<br />
<strong>of</strong> such pain is known to suggest a patient has<br />
heart disease with a probability <strong>of</strong> 0.48. The<br />
patient took an ECG test which has a sensitivity<br />
<strong>of</strong> 0.90 <strong>and</strong> a specificity <strong>of</strong> 0.84. The patient<br />
returns a positive ECG. Now find the<br />
probability he has heart disease in light <strong>of</strong> this<br />
additional information. Also find the positive<br />
<strong>and</strong> negative predictive values.<br />
Solution: H is the event “patient has heart disease”; T is the event “ECG test is positive”.
[Tree diagram:
H, Pr(H) = 0.48: T (sensitivity 0.90) → (0.90)(0.48) = 0.4320; T̄ (0.10) → (0.10)(0.48) = 0.0480.
H̄, Pr(H̄) = 0.52: T (0.16) → (0.16)(0.52) = 0.0832; T̄ (specificity 0.84) → (0.84)(0.52) = 0.4368.]
Pr(T) = 0.4320 + 0.0832 = 0.5152
Pr(H|T) = 0.4320/0.5152 = 0.839
Notice how the probability <strong>of</strong> heart disease has<br />
been revised up from 0.48 to 0.839 as a result <strong>of</strong><br />
the test.<br />
Positive predictive value = 0.839<br />
Pr(Test negative) = 0.0480 + 0.4368 = 0.4848<br />
Negative predictive value = 0.4368/0.4848 = 0.901<br />
Example<br />
Like swine flu’ today, about six years ago SARS was a threat to world health. In the early days<br />
<strong>of</strong> the SARS epidemic emergency measures were put in place by the World Health<br />
Organisation in an attempt to control the spread <strong>of</strong> SARS <strong>and</strong> to identify the condition. But no<br />
adequate screening tests existed to identify the condition when it first appeared in Hong Kong.<br />
A study was carried out in the early days to evaluate a WHO criteria for identifying patients<br />
with SARS in the SARS screening clinic in Hong Kong. Of 556 consecutive clinic attendees,<br />
97 were confirmed with SARS. Of these 97 patients with confirmed SARS, 25 met the WHO<br />
criteria for suspected SARS. Of the 459 patients in whom SARS was not confirmed, 438 were<br />
negative according to the WHO criteria.<br />
(a) Find the prevalence of confirmed SARS at the clinic (i.e. the proportion with SARS). (1 mark)
(b) Estimate the sensitivity and specificity of the WHO test from the numbers above. (2 marks)
(c) Estimate the probability that the WHO test produces a positive result. (1 mark)
(d) Estimate the positive predictive value of the test. (1 mark)
(e) Estimate the negative predictive value of the test. (1 mark)
(f) How would the positive predictive value of the test be affected if the prevalence of SARS among clinic attendees were to decrease? (1 mark)
WHO Result   SARS: Yes   SARS: No   Total
Positive     25          [21]       46
Negative     [72]        438        510
Total        97          459        556
(bracketed counts are obtained by subtraction)
(a) Prevalence = 97/556 = 0.174<br />
(b) Sensitivity = 25/97 = 0.258; specificity = 438/459 = 0.954<br />
[Tree diagram:
S, Pr(S) = 0.174: T+ (0.258); T− (0.742).
S̄, Pr(S̄) = 0.826: T+ (0.046); T− (0.954).]
(c) Pr (T + ) = (0.174)(0.258) + (0.826)(0.046)<br />
= 0.0449 + 0.0380<br />
= 0.083<br />
(d) Positive predictive value = 0.045/0.083 = 0.542<br />
(e) Pr(T – ) = (0.174)(0.742) + (0.826)(0.954)<br />
= 0.917<br />
Negative predictive value = 0.788/0.917 = 0.859<br />
(f) The positive predictive value will decrease.<br />
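The same quantities can be computed directly from the 2 × 2 counts (a sketch; the answers from raw counts differ very slightly from the tree values above, which use rounded branch probabilities):

```python
# SARS example: screening-test measures from the 2x2 table.
tp, fp = 25, 21    # WHO positive: SARS confirmed / not confirmed
fn, tn = 72, 438   # WHO negative: SARS confirmed / not confirmed
n = tp + fp + fn + tn

prevalence = (tp + fn) / n     # 97/556
sensitivity = tp / (tp + fn)   # 25/97
specificity = tn / (fp + tn)   # 438/459
ppv = tp / (tp + fp)           # 25/46
npv = tn / (fn + tn)           # 438/510
```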
Example: Sensitive Survey Questions.<br />
This is an important way <strong>of</strong> gaining information<br />
on sensitive or controversial issues.<br />
The question is: do you have, or have you ever had, a sexually transmitted disease (STD)?
Asked directly, it is unlikely a truthful response, or any response, will be given.
In a mail survey of 268 young people, five said they had had an STD.
Probability = 5/268 = 0.019 (or 19 per 1000)
Instead, proceed as follows:<br />
1. Roll a die, allowing no one to see the outcome.
2. Toss a fair coin.
3. If the die shows “1”, answer truthfully the question: “Have you thrown a head?”
4. If the die shows 2, 3, 4, 5 or 6, answer truthfully the question:
“Have you ever had a sexually transmitted disease?”
A tree diagram summarises this procedure where<br />
θ is the proportion <strong>of</strong> response “YES” to the STD<br />
question.<br />
[Tree diagram:
Roll die → “1” (1/6): toss coin (1/2 each) → “Head, Yes” 1/12; “Head, No” 1/12.
Roll die → “2 to 6” (5/6): STD question → “Yes” 5θ/6; “No” 5(1 − θ)/6.]
Pr(Yes) = 1/12 + 5θ/6
There were 54 “Yes” and 214 “No” responses from 268 people.
Estimate Pr(Yes) = 54/268 = 0.2015
∴ 0.2015 = 1/12 + 5θ/6
∴ 12(0.2015) = 1 + 10θ
∴ 2.418 – 1 = 10θ<br />
∴ 1.418 = 10θ<br />
∴ θ = 0.1418<br />
or 142 per 1000 have STD<br />
(compare 19 per 1000 previously)<br />
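The randomized-response estimate is obtained by solving Pr(Yes) = 1/12 + 5θ/6 for θ; in Python:

```python
# Randomized response: back out theta from the observed "Yes" rate.
p_yes = 54 / 268               # observed proportion answering "Yes"
theta = (12 * p_yes - 1) / 10  # rearranged from p_yes = 1/12 + 5*theta/6
```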
Probability Distribution <strong>and</strong> R<strong>and</strong>om Variables<br />
A r<strong>and</strong>om variable has values which depend on<br />
the outcome <strong>of</strong> a r<strong>and</strong>om experiment. R<strong>and</strong>om<br />
variables are labelled with a capital letter (X<br />
say). They can be discrete or continuous. The<br />
number <strong>of</strong> locations with tuatara on Stephens<br />
Isl<strong>and</strong> is discrete (possible values 0, 1, 2, 3)<br />
while cholesterol levels are continuous.<br />
Example: (Tuatara again) Three locations are<br />
visited on 50 occasions in the tuatara study <strong>and</strong><br />
the number <strong>of</strong> locations with tuatara found are<br />
recorded each time. Results follow along with<br />
values calculated previously in the fourth column.<br />
X = x_j (tuatara at locations)   f_j (frequency)   f_j/n (rel. freq)   Pr(X = x_j) (as n becomes large)
0                                8                 0.16                0.216
1                                22                0.44                0.432
2                                15                0.30                0.288
3                                5                 0.10                0.064
Total                            n = 50            1.00                1.000
X is the r<strong>and</strong>om variable. X is discrete here<br />
because all possible outcomes x j can be counted.<br />
The 50 results in the study are summarised by the<br />
relative frequencies.<br />
If many trials (n large) are carried out, the relative<br />
frequencies <strong>of</strong> each x j stabilise to give<br />
probabilities<br />
Pr(X = x j )<br />
for each outcome. Together these probabilities<br />
form the probability distribution rather than a<br />
relative frequency distribution.<br />
NB (1) Σ_{j=1}^{4} Pr(X = x_j) = 1, as for relative frequencies.
(2) All probabilities are between 0 and 1.
Describing Probability Distributions<br />
Let X be a symbol for a probability distribution<br />
<strong>and</strong> let μ X be the mean <strong>of</strong> X. (Assume X is<br />
discrete for the moment.)<br />
For a sample <strong>of</strong> n values from the distribution<br />
suppose each possible x j occurs f j times <strong>and</strong><br />
there are k possible values <strong>of</strong> j. Then the sample<br />
mean is
x̄ = (1/n) Σ_{j=1}^{k} x_j f_j = Σ_{j=1}^{k} x_j (f_j/n)
As the sample size becomes large, the relative frequencies become probabilities and the mean of the probability distribution X is μ_X, where
μ_X = Σ_{j=1}^{k} x_j Pr(X = x_j)
A similar argument shows that the variance σ²_X of the probability distribution X is
σ²_X = Σ_{j=1}^{k} (x_j − μ_X)² Pr(X = x_j)
Take the square root to get the st<strong>and</strong>ard deviation<br />
<strong>of</strong> the probability distribution σ X .<br />
Note: The sample mean x <strong>and</strong> variance s 2 are<br />
estimates for population mean μ X <strong>and</strong> variance<br />
σ X 2 .<br />
Ex: Find the mean <strong>and</strong> st<strong>and</strong>ard deviation <strong>of</strong> the<br />
distribution <strong>of</strong> the number <strong>of</strong> locations at which<br />
tuatara are found.<br />
X = x_j   Pr(X = x_j)   x_j Pr(X = x_j)   (x_j − μ_X)²        (x_j − μ_X)² Pr(X = x_j)
0         0.216         0.000             (0 − 1.2)² = 1.44   0.311
1         0.432         0.432             0.04                0.017
2         0.288         0.576             0.64                0.184
3         0.064         0.192             3.24                0.207
Total     1.000         1.200                                 0.720
μ_X = Σ_{j=1}^{4} x_j Pr(X = x_j) = 1.20
On average just over one location per visit will have tuatara present.
σ²_X = Σ_{j=1}^{4} (x_j − μ_X)² Pr(X = x_j) = 0.72 and σ_X = 0.85
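The mean and standard deviation of any discrete distribution follow the same two sums; a sketch using the tuatara distribution:

```python
# Mean, variance and sd of a discrete probability distribution.
import math

dist = {0: 0.216, 1: 0.432, 2: 0.288, 3: 0.064}

mu = sum(x * p for x, p in dist.items())               # E[X]
var = sum((x - mu) ** 2 * p for x, p in dist.items())  # Var[X]
sd = math.sqrt(var)
```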
Example: A person infected with a disease can<br />
pass it on to others. Let the r<strong>and</strong>om variable, X,<br />
be the number <strong>of</strong> others infected by this person.<br />
X is found to have the following probability<br />
distribution.<br />
Find μ_X and σ²_X.
X = x_j   Pr(X = x_j)
0         0.10
1         0.25
2         0.40
3         0.20
4         0.05
μ_X = 0(0.10) + 1(0.25) + 2(0.40) + 3(0.20) + 4(0.05) = 1.85
σ²_X = (0 − 1.85)²(0.10) + (1 − 1.85)²(0.25) + (2 − 1.85)²(0.40) + (3 − 1.85)²(0.20) + (4 − 1.85)²(0.05) = 1.0275
Also, σ_X = √1.0275 = 1.0137
Rules for combining r<strong>and</strong>om variables<br />
Often we are interested in the mean <strong>and</strong><br />
variance <strong>of</strong> a rescaled r<strong>and</strong>om variable, or in the<br />
mean <strong>and</strong> variance <strong>of</strong> sums (or differences) <strong>of</strong><br />
r<strong>and</strong>om variables. The following properties are<br />
true <strong>of</strong> all numerical r<strong>and</strong>om variables, discrete<br />
or continuous.<br />
If X <strong>and</strong> Y are independent r<strong>and</strong>om variables<br />
<strong>and</strong> a <strong>and</strong> b are constants, then:<br />
1. The mean of the new random variable a + bX is
μ_{a+bX} = a + bμ_X
2. The variance of a + bX is
σ²_{a+bX} = b²σ²_X
3. The mean of the new random variable aX + bY is
μ_{aX+bY} = aμ_X + bμ_Y
4. The variance of aX + bY is
σ²_{aX+bY} = a²σ²_X + b²σ²_Y
Note: Properties 3 and 4 tell us that
μ_{X+Y} = μ_X + μ_Y and σ²_{X+Y} = σ²_X + σ²_Y
Also, μ_{X−Y} = μ_X − μ_Y and σ²_{X−Y} = σ²_X + σ²_Y
Example: Temperatures used to be recorded in<br />
degrees Fahrenheit. Suppose a r<strong>and</strong>om variable F<br />
measures January temperature (in Fahrenheit) in<br />
Dunedin <strong>and</strong> daily maximum summer temperatures<br />
have a mean <strong>of</strong> 70°F with a st<strong>and</strong>ard deviation <strong>of</strong><br />
5°F.<br />
Use the conversion formula C = (5/9)(F − 32) to find the mean and standard deviation for the temperatures in degrees Celsius.
Solution:
We will let the random variable C represent the temperature in Celsius. The equation C = (5/9)(F − 32) may be rearranged by expanding the brackets to become
C = (5/9)F − (5/9)(32), or C = (5/9)F − 160/9
We have μ_{a+bX} = a + bμ_X, with a = −160/9 and b = 5/9.
μ_C = a + bμ_F = −160/9 + (5/9)(70) = 21.1 °C
We also have σ²_{a+bX} = b²σ²_X, so
σ²_C = (5/9)² × 5² = (25/81) × 25 = 7.716
Therefore σ_C = √7.716 = 2.78 °C
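The linear-transformation rules used above can be checked numerically (a sketch with the Dunedin figures):

```python
# Mean and sd of C = a + b*F for a linear change of units.
import math

mu_F, sd_F = 70.0, 5.0  # Fahrenheit mean and sd
a, b = -160 / 9, 5 / 9  # C = a + b*F

mu_C = a + b * mu_F         # rule 1: mean shifts and rescales
var_C = b ** 2 * sd_F ** 2  # rule 2: variance picks up b^2 only
sd_C = math.sqrt(var_C)
```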
Example: What is the difference between T = X + X + X and T = 3X?
Note: These results can be extended to several<br />
r<strong>and</strong>om variables.<br />
Example: (Infected person continued)<br />
Three people living in separate areas have the<br />
disease. R<strong>and</strong>om variables X 1 , X 2 , X 3 are<br />
numbers <strong>of</strong> other people infected by them. Find<br />
mean <strong>and</strong> variance <strong>of</strong> total number infected by<br />
the original three.<br />
Total T = X_1 + X_2 + X_3 (X_1, X_2, X_3 assumed independent as the people are in different areas)
μ_T = μ_{X_1} + μ_{X_2} + μ_{X_3} = 1.85 + 1.85 + 1.85 = 5.55
σ²_T = σ²_{X_1} + σ²_{X_2} + σ²_{X_3} = 1.0275 + 1.0275 + 1.0275 = 3.0825
Note: Do not say T = 3X_1. Although μ_T = 3μ_{X_1} = 5.55,
σ²_T = 9σ²_{X_1} = 9.2475 ≠ 3.0825
This is a very common source of error.
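The distinction between summing independent copies and rescaling one copy shows up only in the variance; a short sketch:

```python
# T = X1 + X2 + X3 (independent copies) versus T = 3*X1 (one rescaled copy).
mu_X, var_X = 1.85, 1.0275

mu_sum = 3 * mu_X    # means agree either way: 5.55
var_sum = 3 * var_X  # independent variances add: 3.0825

mu_scaled = 3 * mu_X
var_scaled = 3 ** 2 * var_X  # rescaling multiplies variance by 9: 9.2475
```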
SECTION 4<br />
This section introduces both the Binomial <strong>and</strong> Normal Distributions which model many<br />
phenomena arising in the real world. Consequently the distributions allow us to answer some<br />
important <strong>and</strong> relevant questions.<br />
The Binomial Distribution: Definition, mean <strong>and</strong> variance<br />
The Binomial Table: Examples<br />
The Normal Distribution: Definition<br />
St<strong>and</strong>ard Normal Distribution <strong>and</strong> Table<br />
General Normal Distribution<br />
Normal Approximation to the Binomial<br />
Transforming Data to Normal<br />
121<br />
Section 4
The Binomial Distribution<br />
The binomial distribution arises when<br />
investigating proportions. e.g. the proportion <strong>of</strong><br />
adult population with diabetes. Each individual<br />
has or does not have diabetes.<br />
Let Y be the r<strong>and</strong>om variable for an individual<br />
outcome <strong>of</strong> a person in the population. Two<br />
outcomes occur, namely Y = 1 (e.g. diabetes<br />
present or success) <strong>and</strong> Y = 0 (e.g. diabetes not<br />
present or failure). The parameter π represents<br />
the unknown proportion <strong>of</strong> 1’s occurring.<br />
The probability distribution <strong>of</strong> Y is<br />
Y =<br />
y Pr(Y =<br />
j<br />
y j<br />
)<br />
1 π “success”<br />
0 1 – π “failure”<br />
Then μ Y = 1(π) + 0(1 – π) = π<br />
σ = (1 – π) 2 π + (0 – π) 2 (1 – π)<br />
2<br />
Y<br />
= (1 – π) [π(1 – π) + π 2 ]<br />
= π(1 – π)<br />
Now suppose that we take a sample of size n from the underlying population. What is the distribution of the number of successes?
The total number of successes is X, where
X = Y_1 + Y_2 + Y_3 + … + Y_n
with all the Y_j independent of each other.
∴ μ_X = π + π + π + … + π = nπ
σ²_X = σ²_{Y_1} + σ²_{Y_2} + … + σ²_{Y_n}
     = π(1 − π) + π(1 − π) + … + π(1 − π)
     = nπ(1 − π)
X is said to have a binomial distribution, with
μ_X = nπ and σ²_X = nπ(1 − π)
where π is the parameter giving Pr(“success”) or Pr(diabetes present).
The mean number <strong>of</strong> successes is nπ <strong>and</strong> the<br />
variance <strong>of</strong> the number <strong>of</strong> successes is nπ(1 – π)<br />
The binomial distribution results from n trials<br />
involving independent binary outcomes.<br />
e.g. melanoma (Yes/No)<br />
Smoking (smokes/does not smoke)<br />
Diabetes (present/absent)<br />
Tuatara (present/absent)<br />
Example: X = number <strong>of</strong> locations in group <strong>of</strong><br />
n that have tuatara present.<br />
It is known that Pr(success) = π = 0.40 <strong>and</strong><br />
Pr(failure) = 1 – π = 0.60.<br />
Each location is assumed independent <strong>of</strong> other<br />
locations.<br />
Also assume the probability <strong>of</strong> tuatara being<br />
present remains constant at each location.<br />
Notes 1. If these conditions are met, if n (the<br />
number <strong>of</strong> trials) <strong>and</strong> π (the probability <strong>of</strong><br />
success) are known, all probabilities in the<br />
distribution are known exactly.<br />
2. n <strong>and</strong> π are said to be the parameters <strong>of</strong> the<br />
distribution.<br />
3. The binomial distribution requires<br />
independent trials <strong>and</strong> a probability <strong>of</strong><br />
success which remains constant for each<br />
trial.<br />
4. We use binomial tables to approximate<br />
these binomial probabilities for values <strong>of</strong> n<br />
up to 20. (See table section <strong>of</strong> these notes.)<br />
For example suppose n = 8 <strong>and</strong> π = 0.40 are the<br />
two defining parameters.<br />
π<br />
n x 0.05 0.10 0.15 … 0.40 0.50<br />
8 0 0.6634 -- -- 0.0160 0.0039<br />
1 0.2793 -- -- 0.0896 0.0312<br />
2 0.0515 -- -- 0.2090 0.1094<br />
3 0.0054 -- -- 0.2787 0.2187<br />
4 0.0004 -- -- … 0.2322 0.2734<br />
5 0.0000 -- -- 0.1239 0.2188<br />
6 0.0000 -- -- 0.0413 0.1094<br />
7 0.0000 -- -- 0.0079 0.0313<br />
8 0.0000 -- -- 0.0007 0.0039<br />
9 0 -- --<br />
1 -- … --<br />
2 -- --<br />
3 -- --<br />
etc<br />
Notice that Pr(X = 3) = 0.2787 for π = 0.40 <strong>and</strong> n = 8<br />
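The table entries can be reproduced from the binomial probability function (a sketch; the notes themselves read these values from printed tables):

```python
# Reproduce binomial-table entries from the probability function.
from math import comb

def binom_pmf(n, k, pi):
    # Pr(X = k) for a binomial with parameters n and pi
    return comb(n, k) * pi ** k * (1 - pi) ** (n - k)

# mean of the distribution matches n * pi
mean = sum(k * binom_pmf(8, k, 0.40) for k in range(9))
```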
Example: Records show that twenty percent <strong>of</strong><br />
violin pupils are known to develop OOS during<br />
the course <strong>of</strong> their training. Define X to be the<br />
number <strong>of</strong> violin pupils out <strong>of</strong> 9 who develop<br />
OOS during their training.<br />
(a) Find the probability distribution of X.
(b) What is the probability that none of the 9 pupils develop OOS?
(c) What is the probability that more than 4 out of the 9 pupils develop OOS?
(d) In 2005 a certain violin teacher had 9 new pupils and 5 developed OOS during training. What conclusion would you draw about the training methods of this teacher?
Solution<br />
(a) Here X is binomial with n = 9; π = 0.20<br />
(<strong>and</strong> assume the pupils are all independent<br />
<strong>of</strong> each other). The binomial table gives<br />
n x π = 0.20<br />
9 0 0.1342 = Pr(X = 0)<br />
1 0.3020 = Pr(X = 1)<br />
2 0.3020 etc<br />
3 0.1762<br />
4 0.0661<br />
5 0.0165<br />
6 0.0028<br />
7 0.0003<br />
8 0.0000<br />
9 0.0000<br />
(b) Pr(X = 0) = 0.1342<br />
(c) Pr(X > 4) = Pr(X = 5) + Pr(X = 6)<br />
+ Pr(X = 7) + Pr(X = 8) + Pr(X = 9)<br />
= 0.0196<br />
(d) It would be rare or unusual (probability = 0.0196)<br />
for more than four violin pupils to develop OOS<br />
if 20% is the overall percentage known to develop<br />
OOS historically. We conclude the training<br />
methods <strong>of</strong> this teacher are likely to result in a<br />
greater occurrence <strong>of</strong> OOS among pupils.<br />
If the violin teacher has no effect on OOS,<br />
π will remain on 0.20 <strong>and</strong> the probability<br />
that more than four <strong>of</strong> the pupils will<br />
develop OOS is 0.0196.<br />
This is viewed (by convention) to be a small<br />
probability indicating a rare or unusual event<br />
has arisen if the value <strong>of</strong> π = 0.20 still holds<br />
for the pupils <strong>of</strong> this teacher.<br />
Either π = 0.20 is unchanged for this teacher<br />
<strong>and</strong> a rare event has been observed<br />
or the teacher is at fault and more pupils develop OOS. This second alternative is usually taken, and therefore we conclude that this teacher's pupils have a higher incidence of OOS.
Notes
1. It is the size of the probability of the observed “event”, or one more extreme and convincing, which leads to the conclusion (here, “more than 4”).
2. 0.0196 is a chance <strong>of</strong> just under 2 per 100<br />
(2%).<br />
3. A probability less than 0.05 is (by<br />
convention) taken to imply an event is rare or<br />
unlikely to occur.<br />
4. A probability above 0.05 <strong>of</strong>ten means an<br />
event is not unusual. If the violin teacher had<br />
produced such a probability then the teaching<br />
would not be at all unusual in relation to<br />
incidence <strong>of</strong> OOS.<br />
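The tail probability used in this argument can be computed directly (a sketch; the notes read it from the binomial table):

```python
# Pr(X > 4) for n = 9 pupils and pi = 0.20.
from math import comb

def binom_pmf(n, k, pi):
    return comb(n, k) * pi ** k * (1 - pi) ** (n - k)

tail = sum(binom_pmf(9, k, 0.20) for k in range(5, 10))  # Pr(X >= 5)
```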
Binomial Examples and Normal Distribution

Example: (artificial data) A sociological report suggests that 75% of Maori children under 18 live with both parents. A random sample of 20 Maori children is selected, and X is the binomial random variable for the number of these 20 who live with both parents.
(a) Define the parameters of the distribution of X.
(b) Find Pr(X = 15).
(c) Find the probability that 11 or fewer live with both parents (i.e. Pr(X ≤ 11)).
(d) A random sample of 20 New Zealand Caucasian children had only 11 living with both parents. Does this result provide any evidence to support the claim that 75% of NZ Caucasian children live with both parents?
Solution
(a) X is binomial with n = 20, π = 0.75.
(b) The problem is that 0.75 does not occur in the binomial table directly. Whenever π > 0.50, we replace the event "success" by its complement "failure", because the binomial table does not have values greater than 0.50. In this case, "failure" is the event "child does not live with both parents". For easy analysis, define the new random variable
Y = number not living with both parents.
Y is binomial with n = 20 and new π′ = 0.25 [here y = n − x and π′ = 1 − π].
∴ Pr(X = 15 given π = 0.75) = Pr(Y = 5 given π′ = 0.25) = 0.2023 from the table.
(c) Pr(X ≤ 11) = Pr(Y ≥ 9)
 = Pr(Y = 9) + Pr(Y = 10) + … + Pr(Y = 20)
 = 0.0271 + 0.0099 + … + 0.0000
 = 0.0410
(d) No. In fact there is evidence it is less than 75% for NZ Caucasian children. If π = 0.75 is assumed for Caucasian families, then the probability of observing 11 or fewer living with both parents is, by our convention, small (less than 0.05), providing evidence against 75%. Hence reject the claim that π = 0.75 for Caucasian families and conclude fewer live with both parents (because 11 is in the direction of fewer rather than more).

Note: Suppose instead 12 out of 20 of the NZ Caucasian children were living with both parents. Then Pr(X ≤ 12) = Pr(Y ≥ 8) = 0.1019 if π = 0.75 (meaning π′ = 0.25). This probability is not small, and now there is no evidence from our data to suppose the situation is any different among Caucasian families.
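With a computer, the complement trick in part (b) is unnecessary: π = 0.75 can be used directly. A minimal Python sketch, standard library only, reproducing parts (b)–(d); small differences from the table values reflect four-decimal table rounding.

```python
from math import comb

def pmf(k, n, p):
    # Binomial probability Pr(X = k)
    return comb(n, k) * p**k * (1 - p)**(n - k)

def cdf(k, n, p):
    # Pr(X <= k)
    return sum(pmf(j, n, p) for j in range(k + 1))

n, pi = 20, 0.75
print(round(pmf(15, n, pi), 4))  # Pr(X = 15): 0.2023
print(round(cdf(11, n, pi), 4))  # Pr(X <= 11): 0.0409 (table sum gave 0.0410)
print(round(cdf(12, n, pi), 4))  # Pr(X <= 12): 0.1018 (table: 0.1019); not small
```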
Example (Revision)
The standard drug for treating a cancer is claimed to halve the tumor size in 30% of all patients treated. Suppose X is the binomial random variable for the number of patients in a sample of seven who have their tumor size halved.
(a) List the conditions which must be met if X is binomial.
Patients independent. Two outcomes only. Constant probability that the tumor size is halved over all the patients.
(b) Using the appropriate table, write down the distribution of probabilities for the number (X) who have their tumor size halved.

x_j    Pr(X = x_j)
0      0.0824
1      0.2471
2      0.3177
3      0.2269
4      0.0972
5      0.0250
6      0.0036
7      0.0002
(c) Write down the probability that three of the patients have their tumor size halved.
Probability = 0.2269
(d) Find the probability that three or more of the patients have their tumor size halved.
Probability = 0.3529
(e) In a pilot study in Auckland, three out of seven patients given a new drug had their tumor size halved. What conclusion, if any, can be drawn about the new drug? Explain how you reach your conclusion.
Conclusion: There is no reason to suppose the new drug is any different to the standard.
Explanation: The probability of three or more is 0.3529, which is large, meaning the result with the new drug is consistent with the 30% before.
Note: This study involves a very small number of patients and will be reconsidered later with a larger sample.
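The table in part (b) and the tail in part (d) can be checked directly; a short Python sketch using only the standard library:

```python
from math import comb

n, pi = 7, 0.30  # seven patients; 30% success rate claimed for the standard drug

def pmf(k):
    # Binomial probability Pr(X = k)
    return comb(n, k) * pi**k * (1 - pi)**(n - k)

for k in range(n + 1):
    print(k, round(pmf(k), 4))  # reproduces the table in part (b)

p_three_or_more = sum(pmf(k) for k in range(3, n + 1))
print(round(p_three_or_more, 4))  # 0.3529, as in part (d)
```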
The Normal Distribution
This distribution will allow us to calculate probabilities associated with observed sample results when we are dealing with continuous outcome measures and sample means. First we develop properties of the normal distribution.
A relative frequency histogram tends to a probability distribution as the sample size n becomes large.
[Figure: a relative frequency HISTOGRAM becomes a smooth DISTRIBUTION curve f(X) as n increases and the class width decreases. In the histogram, the shaded area between a and b is the proportion of observations between a and b (this represents a sample with a small number of individuals); under the curve, the shaded area between a and b is the probability of a value between a and b (this represents a population with a very large number of individuals).]
The resulting curve is known as a probability function (or probability density function) and is described by a curve y = f(X). The area under this curve, say between two points X = a and X = b, is the probability Pr(a < X < b). X is a random variable taking values on a continuous scale.

We have seen several sets of sample data which produce symmetrical, bell-shaped histograms with a concentration of values at the centre and few values at the extremes (e.g. cholesterol levels in the pravastatin study). Such data are said to be collected from a normal distribution, or from a population of values which are normally distributed.

[Gauss, 1777-1855, first developed the equation of such a normal curve while observing the pattern in errors made while taking measurements in astronomy.]
[Figure: a normal curve Y = f(X), symmetric about the centre X = μ.]

The equation of such a normal curve is

f(X) = (1 / (σ√(2π))) e^(−½((X − μ)/σ)²)

where the parameter μ is the mean and the parameter σ is the standard deviation of the distribution (in practice, μ and σ will be estimated from sample data by the values x̄ and s).

Notes
1. The graph is symmetrical about the centre point denoted by μ.
2. The two parameters μ and σ completely define a normal distribution (recall that the parameters n and π define a binomial distribution).
Notation: X ∼ N(μ, σ²)
3. Increasing μ moves the curve but does not alter its shape.
[Figure: two normal curves with μ₂ > μ₁ and σ unchanged; the curve moves but keeps its shape.]

4. Increasing σ spreads the curve more widely about X = μ, but does not alter the centre of the distribution.

[Figure: two normal curves with σ₂ > σ₁ and μ unchanged.]

Both of the above could be normal distributions.

5. Areas under these curves can be found from tables. The table is based on what is known as the standard normal distribution, which has μ = 0 and σ = 1.
Normal distribution calculations
The Standard Normal Distribution (Z)
Z ∼ N(0, 1), i.e. Z is distributed with μ_Z = 0, σ_Z² = 1.
∴ f(Z) = (1/√(2π)) e^(−½Z²)
The shaded area between 0 and z under this curve is Pr(0 < Z < z) (see tables).

Extract from the standard normal table (body entries are Pr(0 < Z < z)):

z      .00     .01     .02     .03     .04     .05  …  .09
.0    .0000
.1
.2
.3
⋮
1.5
1.6                           0.4484  0.4495
1.7
⋮
3.0   0.4990
Some calculations:
1. Find Pr(0 < Z < 1.63).
From the table choose z = 1.63.
∴ Pr(0 < Z < 1.63) = 0.4484
Also, Pr(0 < Z < 1.64) = 0.4495
∴ Pr(0 < Z < 1.633) ≈ 0.4484 + (3/10)(0.0011) = 0.4487
[The final calculation need not be this accurate; 0.4484 would be accepted for our purposes using this table.]
2. Find Pr(Z > 1.64).
Pr(Z > 1.64) = 0.5 − Pr(0 < Z < 1.64)
 = 0.5 − 0.4495
 = 0.0505
3. Pr(1 < Z < 1.64) = Pr(0 < Z < 1.64) − Pr(0 < Z < 1)
 = 0.4495 − 0.3413
 = 0.1082
4. Pr(−1 < Z < 1.64) = Pr(0 < Z < 1.64) + Pr(−1 < Z < 0)
 = Pr(0 < Z < 1.64) + Pr(0 < Z < 1) by symmetry
 = 0.4495 + 0.3413
 = 0.7908
5. Pr(−1 < Z < 1) = 2 Pr(0 < Z < 1) = 2(0.3413) = 0.6826
 Pr(−2 < Z < 2) = 2 Pr(0 < Z < 2) = 0.9546
Since σ_Z = 1, a value z of Z is a count of the number of standard deviations to this point. Notice that approximately 68% of the area is within one and 95% within two standard deviations of the centre.
6. Find the value z above which 25% of the area lies.
Here, find a value close to 0.25 in the centre of the normal table, then read back to the margins.
Pr(0 < Z < 0.67) = 0.2486
Pr(0 < Z < 0.68) = 0.2517
Hence, z = 0.675 approximately.
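In software there is no need for the printed table: the standard normal CDF is available through the error function. A Python sketch, standard library only, reproducing calculations 1–6; the quantile in calculation 6 is found by bisection rather than by scanning the table.

```python
from math import erf, sqrt

def phi(z):
    # Standard normal cumulative distribution function Pr(Z < z)
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def table_area(z):
    # Pr(0 < Z < z), the quantity printed in the body of the table
    return phi(z) - 0.5

print(round(table_area(1.63), 4))       # calc 1: 0.4484
print(round(1 - phi(1.64), 4))          # calc 2: 0.0505
print(round(phi(1.64) - phi(1.0), 4))   # calc 3: 0.1082
print(round(phi(1.64) - phi(-1.0), 4))  # calc 4: 0.7908
print(round(phi(1.0) - phi(-1.0), 4))   # calc 5: 0.6827 (table rounding gives 0.6826)

# calc 6: the z leaving 25% of the area above it, i.e. phi(z) = 0.75,
# located by bisection
lo, hi = 0.0, 4.0
for _ in range(60):
    mid = 0.5 * (lo + hi)
    if phi(mid) < 0.75:
        lo = mid
    else:
        hi = mid
print(round(lo, 4))  # 0.6745, close to the 0.675 read from the table
```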
The General Normal Distribution (X)
X ∼ N(μ_X, σ_X²), say.
Areas under this curve cannot be found directly from the normal table, but X is related to the standard normal Z ∼ N(0, 1²) by

Z = (X − μ_X)/σ_X

Notes
1. The distribution X is said to be standardised when μ_X is subtracted and the result divided by σ_X.
2. Z is essentially the number of standard deviations (σ_X) from μ_X to a value x of X.
Some calculations
1. Pr(μ_X − σ_X < X < μ_X + σ_X)
 = Pr(−σ_X < X − μ_X < +σ_X)
 = Pr(−1 < (X − μ_X)/σ_X < +1)
 = Pr(−1 < Z < +1)
 = 2 Pr(0 < Z < 1) = 0.6826
[68.26% of the distribution lies within one standard deviation of the centre.]
2. In general,
Pr(a < X < b) = Pr(a − μ_X < X − μ_X < b − μ_X)
 = Pr((a − μ_X)/σ_X < (X − μ_X)/σ_X < (b − μ_X)/σ_X)
 = Pr((a − μ_X)/σ_X < Z < (b − μ_X)/σ_X)
[Figure: normal curve with shaded area between a and b, centre μ_X.]
Example: Assume that diastolic blood pressures for men aged 35-44 have a normal distribution with mean μ_X = 80 and standard deviation σ_X = 12.
(a) Find Pr(90 < X < 100).
(b) Find the percentage of men in this age range who are hypertensive (a level over 100).
Solution
(a) Pr(90 < X < 100) = Pr((90 − 80)/12 < Z < (100 − 80)/12)
 = Pr(0.833 < Z < 1.667)
 = Pr(0 < Z < 1.667) − Pr(0 < Z < 0.833)
 = 0.4525 − 0.2967
 = 0.1558
(b) X ∼ N(80, 144). Find Pr(X > 100).
Pr(X > 100) = Pr(Z > (100 − 80)/12)
 = Pr(Z > 1.67)
 = 0.5 − Pr(0 < Z < 1.67)
 = 0.5 − 0.4525
 = 0.0475
We expect 4.8% of men in this age group to be hypertensive.
(c) Find the diastolic blood pressure which is exceeded by 10% of men aged 35-44.
X ∼ N(80, 144)
[Figure: normal curve with area 0.40 between the centre and x, and the upper 0.10 beyond x, shown on both the original scale X and the standard scale Z.]
(It is helpful, initially, to sketch the standard scale as well as the original scale.)
From the standard normal table, find the value z which cuts off area 0.40 as shown. Reading to the margins from the value 0.40 in the centre of the table gives z = 1.282 (part way between 1.28 and 1.29).
Use z = (x − μ_X)/σ_X to get 1.282 = (x − 80)/12
∴ x = 80 + 12(1.282) = 95.38
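Parts (a)–(c) can be checked in software; a Python sketch using only the standard library, with the percentile again found by bisection rather than from the table. The small differences from the worked answers reflect two-decimal table rounding.

```python
from math import erf, sqrt

def phi(z):
    # Standard normal CDF Pr(Z < z)
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

mu, sigma = 80.0, 12.0  # diastolic blood pressure, men aged 35-44

# (a) Pr(90 < X < 100): standardise both limits
p_a = phi((100 - mu) / sigma) - phi((90 - mu) / sigma)
print(round(p_a, 4))  # 0.1545 (the two-decimal table gave 0.1558)

# (b) Pr(X > 100)
p_b = 1 - phi((100 - mu) / sigma)
print(round(p_b, 4))  # 0.0478 (table: 0.0475)

# (c) the level exceeded by 10% of men: solve phi(z) = 0.90 by bisection
lo, hi = 0.0, 4.0
for _ in range(60):
    mid = 0.5 * (lo + hi)
    if phi(mid) < 0.90:
        lo = mid
    else:
        hi = mid
x = mu + sigma * lo
print(round(x, 2))  # 95.38
```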
The Normal Approximation to the Binomial (n, π)
If a large sample is selected from a population of binary values (e.g. people with or without diabetes), probabilities of observed outcomes are found from the normal N(μ_X, σ_X²) distribution, where μ_X = nπ and σ_X = √(nπ(1 − π)).
[Figure: a binomial probability histogram with an overlaid normal curve centred at μ_X = nπ with standard deviation σ_X = √(nπ(1 − π)); the block at x extends from x − ½ to x + ½.]
The area of the shaded block (if x is an integer) is the binomial probability of obtaining x successes. This is approximately the area under the normal curve between x − ½ and x + ½.
∴ Pr(X = x) ≈ Pr( ((x − ½) − nπ)/√(nπ(1 − π)) < Z < ((x + ½) − nπ)/√(nπ(1 − π)) )

Notes:
1. This approximation is good provided n is large and π is not too close to 0 or 1. (Under these conditions the binomial distribution is reasonably close to symmetrical and hence the normal curve is seen to be a good approximation.)
2. The normal approximation is good if
nπ ± 3√(nπ(1 − π))
gives two values between 0 and n (the minimum and maximum values of the binomial counts), since almost all (about 99.7%) of the possible values should lie within these limits, indicating a near-symmetrical distribution.
We know Pr(blood group B) = 0.11.

n = 2, π = 0.11: nπ = 0.22 and √(nπ(1 − π)) = 0.44, hence nπ ± 3√(nπ(1 − π)) is 0.22 ± 3(0.44), which extends below 0.
[Figure 1: Binomial distribution of the number of people out of two in blood group B.]

n = 10, π = 0.11: nπ = 1.10 and √(nπ(1 − π)) = 0.99, hence 1.10 ± 3(0.99), which still extends below 0.
[Figure 2: Binomial distribution showing the number of subjects out of ten in blood group B based on the probability of being in blood group B.]

n = 100, π = 0.11: nπ = 11 and √(nπ(1 − π)) = 3.13, hence 11 ± 3(3.13), which lies entirely between 0 and 100, so the normal approximation is reasonable.
[Figure 3: Binomial distribution showing the number of subjects out of 100 in blood group B based on the probability of being in blood group B.]
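The rule of thumb illustrated by the three figures can be wrapped in a few lines; a Python sketch (standard library only) reproducing the three cases:

```python
from math import sqrt

def approx_ok(n, pi):
    # Rule of thumb: the normal approximation is reasonable when
    # n*pi +/- 3*sqrt(n*pi*(1 - pi)) stays inside the range 0 to n
    mean = n * pi
    sd = sqrt(n * pi * (1 - pi))
    return 0 <= mean - 3 * sd and mean + 3 * sd <= n

pi = 0.11  # Pr(blood group B)
for n in (2, 10, 100):
    mean = n * pi
    sd = sqrt(n * pi * (1 - pi))
    print(n, round(mean, 2), round(sd, 2), approx_ok(n, pi))
```

Only n = 100 passes, matching the increasingly symmetrical shapes in the figures.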
More on the Normal and Statistical Inference
Example: One in 40 adults on average develops a respiratory condition. A random sample of 400 workers in a certain occupation has 16 with the condition. Find the probability that 16 or more suffer from this condition in general. What conclusion would you draw about the possible effect of this occupation on the occurrence of the condition? Justify your answer.
Solution: Let X be the distribution of the number in a sample of 400 with the condition.
Then X ∼ Binomial(n = 400, π = 1/40)
μ_X = nπ = 10; σ_X = √(nπ(1 − π)) = 3.123
Since nπ ± 2√(nπ(1 − π)) is 10 ± 6.2, the normal approximation can be used.
Pr(X ≥ 16) ≈ Pr(Z > (15.5 − 10)/3.123)
 = Pr(Z > 1.761)
 = 0.0391
[Figure: the block for X = 16 extends from 15½ to 16½ on the X scale, so the area above 15½ is used.]
This is the p-value associated with a study result of 16. There is evidence of a higher incidence of the respiratory condition than expected in this occupation. (The probability 0.0391 is small, indicating that the event X = 16 or more is rare if π = 1/40 were to hold in this occupation.) Therefore, π is likely to be greater than 1/40 for workers in this occupation. (If this is the case, the event observed would not be unusual.)
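A sketch of the same p-value computation in Python (standard library only), with the continuity correction made explicit:

```python
from math import erf, sqrt

def phi(z):
    # Standard normal CDF Pr(Z < z)
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

n, pi = 400, 1 / 40
mean = n * pi                 # 10
sd = sqrt(n * pi * (1 - pi))  # 3.123

# Continuity correction: Pr(X >= 16) uses the area above 15.5
z = (15.5 - mean) / sd
p_value = 1 - phi(z)
print(round(z, 3), round(p_value, 4))  # 1.761 0.0391
```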
Example:
It is claimed cancer tumor size is halved in 30% of all patients using the current treatment. A new drug was used on 70 patients with the cancer. (Last week we looked at a case where the drug was tried on 7 patients with 3 successes.)
(a) Suppose Y is the binomial random variable for the number of patients who have their tumor size halved. Write down the values for the mean and standard deviation of Y.
μ_Y = nπ = 70(0.3) = 21
σ_Y = √(nπ(1 − π)) = √(21(0.7)) = 3.83
(b) In a study, thirty out of seventy patients (previously 3 out of 7) administered the standard drug experience a halving of their tumors. Find the probability that 30 or more out of 70 have their tumors halved.
Pr(Y ≥ 30) = Pr(Z > (29.5 − 21)/3.83)
 = Pr(Z > 2.22)
 = 0.5 − 0.4868
 = 0.0132
(c) In a study, 30 out of 70 patients in Auckland administered this new drug had their tumor size halved. What conclusion can be drawn about the new drug?
There is evidence that the new drug is more effective than the standard, because the probability of 30 or more successes is less than 0.05, indicating the observed 30 (or more) is not likely to occur unless the new drug has a beneficial effect.
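Because n = 70 is still small enough for exact arithmetic, the normal approximation can be compared with the exact binomial tail; a Python sketch, standard library only:

```python
from math import comb, erf, sqrt

def phi(z):
    # Standard normal CDF Pr(Z < z)
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

n, pi = 70, 0.30
mean, sd = n * pi, sqrt(n * pi * (1 - pi))  # 21 and 3.83

# Normal approximation with continuity correction
p_approx = 1 - phi((29.5 - mean) / sd)
print(round(p_approx, 4))  # 0.0133 (the table's z = 2.22 gave 0.0132)

# Exact binomial tail Pr(Y >= 30) for comparison
p_exact = sum(comb(n, k) * pi**k * (1 - pi)**(n - k) for k in range(30, n + 1))
print(round(p_exact, 4))
```

Both values are well below 0.05, so the conclusion in part (c) is unchanged.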
Transforming Data
If data being analysed are continuous but not normally distributed, it may be necessary to modify the data by transforming each value in order to create new values which are normal, and then to work with the transformed values. Typical transformations involve logs, square roots or reciprocals.
There are three reasons for transforming data.
1. Statistical procedures which we develop may only be valid if the data are approximately normal, and non-normal data can be converted to normal by transforming.
2. When comparing, for example, two samples of data (e.g. cholesterol levels after treatment with pravastatin or a control), the two groups should have similar standard deviations for some testing procedures to be valid. Transforming such data can produce two sets of values with similar standard deviations.
3. Transforming can also reduce the influence of outlying values on the results of an analysis.
(e.g. suppose most values are around 10 in a data set with one value of 100; then ln 10 = 2.30 and ln 100 = 4.61.)

EXAMPLE: A sample of 216 values of serum bilirubin (μmol/l) has mean 60.7 and standard deviation 77.9.
[Figure: histogram of the serum values in 216 patients with fitted normal distribution. The normal fit is terrible!]
The data are transformed by using the ln function. Mean = 3.547 and standard deviation = 1.03.
[Figure: histogram of the ln serum values with fitted normal distribution; this looks reasonably normal.]
Now suppose we want the range of values containing the central 95% of all patients. If data are normal, 95% of the population lie in
mean ± 1.96 (standard deviations)
[Figure: standard normal curve with area 0.475 on each side of the centre, between −1.96 and 1.96 (from the standard normal table).]
For the raw data, mean = 60.7 and s.d. = 77.9. Hence, the interval would be 60.7 ± 1.96(77.9), which cannot be correct: its lower limit is negative.
But the transformed data have approximately a normal distribution. For the transformed data, mean = 3.547 and standard deviation = 1.030. Hence, 95% of the patients will have ln(serum) levels in the range
3.547 ± 1.96(1.030)
That is, 95% of the distribution (or values) lies between
3.547 − 2.019 and 3.547 + 2.019, or 1.528 and 5.566.
Transforming back to the original scale,
e^1.528 = 4.61 and e^5.566 = 261.4
Hence, 95% of patients would have serum levels between 4.61 and 261.4 μmol/l.
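The back-transformation is a two-line calculation; a Python sketch using only the standard library:

```python
from math import exp

# 95% normal range on the ln scale, then back-transform to the original units
mean_ln, sd_ln = 3.547, 1.030
lower_ln = mean_ln - 1.96 * sd_ln
upper_ln = mean_ln + 1.96 * sd_ln
print(round(lower_ln, 3), round(upper_ln, 3))  # 1.528 5.566
print(exp(lower_ln), exp(upper_ln))  # about 4.61 and 261 micromol/l
```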
REVIEW EXERCISES
4. For the standard normal distribution find the following:
(a) The area below −1.58.
(b) The two points between which the central 85% of the area lies. (2 marks)
5. In the Framingham Study, serum cholesterol levels were measured for a large number of healthy males. The population was then followed for 16 years. At the end of this time, the men were divided into two groups: those who had developed coronary heart disease and those who had not. The distributions of the initial serum cholesterol levels for each group were found to be approximately normal. Among individuals who eventually developed coronary heart disease, the mean serum cholesterol level was μ_d = 244 mg/100 ml and the standard deviation was σ_d = 51 mg/100 ml; for those who did not develop the disease, the mean serum cholesterol level was μ_nd = 219 mg/100 ml and the standard deviation was σ_nd = 41 mg/100 ml.
(a) Suppose that an initial serum cholesterol level of 260 mg/100 ml or higher is used to predict coronary heart disease. What is the probability of correctly predicting heart disease for a man who will develop it?
(b) What is the probability of predicting heart disease for a man who will not develop it?
(c) What is the probability of failing to predict heart disease for a man who will develop it?
(3 marks)
6. The length of human pregnancies from conception to birth varies according to a distribution that is approximately normal with mean 266 days and standard deviation 16 days.
(a) What percent of pregnancies last less than 240 days (that's about 8 months)?
(b) What percent of pregnancies last between 240 and 270 days (roughly between 8 months and 9 months)?
(c) How long do the longest 20% of pregnancies last? (3 marks)
1. The probability of recovery for patients who are administered an established treatment for a stomach complaint is 0.8. A random sample of 100 patients with the complaint is monitored. Suppose X is the binomial random variable for the number of patients in this sample who recover when the established treatment is used.
(a) Specify the parameters of X.
(b) Find the mean and standard deviation of X.
(c) Find the probability that at least 90 of the patients administered the treatment recover. Here you should first verify that the normal approximation to the binomial distribution can be used.
(d) In a trial involving a new drug for the treatment of this stomach complaint, 90 out of 100 patients who are administered the new drug recover. What conclusion can you draw about the new drug? State your reason.
(7 marks)
SOLUTIONS
4. [Note to markers: Since students only have access to a table with z values to two decimal places, be prepared to accept calculations based on the nearest values in the table. Many students will, of course, interpolate between table values.]
(a) Area below −1.58 = 0.5 − Pr(0 < Z < 1.58)
 = 0.5 − 0.4429
 = 0.0571
(b) Pr(0 < Z < 1.44) = 0.425, so the central 85% lies between −1.44 and +1.44.
5. (a) For men who develop chd, Pr(X > 260) = Pr(Z > (260 − 244)/51)
 = Pr(Z > 0.314)
 = 0.5 − 0.1217
 = 0.3783
(b) For men who do not develop chd, Pr(X > 260) = Pr(Z > (260 − 219)/41)
 = Pr(Z > 1)
 = 0.5 − 0.3413
 = 0.1587
(c) The probability of failing to predict chd for a man who will develop it is 1 − 0.3783 = 0.6217.
6. X ∼ N(266, 16²), or X is normal with μ_X = 266 and σ_X² = 256.
(a) Pr(X < 240) = Pr(Z < (240 − 266)/16)
 = Pr(Z < −1.625)
 = 0.5 − Pr(0 < Z < 1.625)
 = 0.5 − 0.4479
 = 0.0521
That is, 5.2% of pregnancies last less than 8 months.
(b) Pr(240 < X < 270) = Pr(−1.625 < Z < (270 − 266)/16)
 = Pr(−1.625 < Z < 0.25)
 = 0.4479 + 0.0987
 = 0.5466, i.e. 54.7%
(c) z = 0.842 approximately, from the table (30% of the standard normal lies between 0 and z = 0.842, leaving 20% above z).
∴ (x − 266)/16 = 0.842
∴ x = 266 + 16(0.842) = 279.47 days
That is, approximately 280 days or more.
1. (a) n = 100; π = 0.8
(b) μ_X = nπ = 80; σ_X = √(80(0.2)) = 4.
(c) μ_X ± 2σ_X gives 80 ± 2(4), or 72 to 88. Both values lie in the range of possible values 0 to 100, hence the normal approximation can be used. (1.96 instead of 2 is also acceptable.)
Pr(X > 89.5) = Pr(Z > (89.5 − 80)/4)
 = Pr(Z > 2.375)
 = 0.5 − 0.4912
 = 0.0088
(d) There is evidence that the new drug produces a greater number who recover from the stomach complaint than expected from the established treatment; the probability 0.0088 is very small for a recovery rate of 80%.
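The solutions above can be verified against the exact normal CDF; a Python sketch, standard library only. Small differences from the printed answers reflect two-decimal table rounding.

```python
from math import erf, sqrt

def phi(z):
    # Standard normal CDF Pr(Z < z)
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

# Exercise 4(a): area below -1.58
print(round(phi(-1.58), 4))  # 0.0571

# Exercise 5: Framingham screening probabilities
p_correct = 1 - phi((260 - 244) / 51)  # develops chd, level over 260
p_false = 1 - phi((260 - 219) / 41)    # does not develop chd, level over 260
print(round(p_correct, 4))             # 0.3769 (table: 0.3783)
print(round(p_false, 4))               # 0.1587
print(round(1 - p_correct, 4))         # 0.6231 (table: 0.6217)

# Exercise 6: pregnancy lengths, X ~ N(266, 16^2)
print(round(phi((240 - 266) / 16), 4))                          # 6(a): 0.0521
print(round(phi((270 - 266) / 16) - phi((240 - 266) / 16), 4))  # 6(b): 0.5466

# Exercise 1(c): normal approximation with continuity correction
print(round(1 - phi((89.5 - 80) / 4), 4))  # 0.0088
```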
SECTION 5
This section defines sampling distributions, establishes the standard deviations of these distributions (called standard errors), and sets up confidence intervals for population means, differences between the means of two populations, proportions and differences between proportions, based on random samples drawn from the populations.
An Outline of the Research Process
The Distribution of Sample Means
The Standard Error of the Mean
Confidence Interval for a Mean
The t-distribution and Its Use
Comparison of Two Independent Groups
The Standard Error of the Difference Between Two Means
Pooled Estimate for the Common Variance
Comparison of Two Dependent Groups (Paired Data)
Confidence Interval for a Proportion
Confidence Interval for Difference Between Two Proportions
Summary of Distributions and Confidence Intervals
159
Section 5
The Research Process in Two Situations

Binomial
Underlying population: Bernoulli outcomes, success or failure (Y = 1 or 0).
A sample of size n gives the result of the study (the statistics): the number of successes, X, which is binomial.
Inference: use the probability of the outcome, or an estimate of the success proportion.
e.g. Prevalence (π) of asthma in women aged 20 to 40. This can be estimated as the proportion (p) in a sample chosen from the population.
Normal
Underlying population: continuous outcomes, X ∼ N(μ, σ²) say.
A sample of size n gives the result of the study (the statistics): How does the sample mean behave? What is the distribution of the sample mean, X̄?
Inference: use the probability of the outcome, or an estimate based on the sample mean.
e.g. What is the mean resting pulse rate (μ) in beats per minute for men in the age range 20 to 25 years? The mean x̄ from the sample is an estimate for the mean μ in the population of all men in this age range.
Sampling Distributions
Statistical inference is the process of using information from a sample to infer something about the population from which the sample was drawn, thus completing the research loops just described.
How reliable are these estimates for π and μ?
To answer these questions, focus first on the sample mean x̄ for a sample of size n, say. Proportions will be discussed later. The argument proceeds as follows:
Successive samples of size n can be drawn from the population. These produce means x̄₁, x̄₂, x̄₃, x̄₄, … and these form what is called a distribution of sample means, X̄, which is quite different to the original distribution, X, of values in the population.
The problem is now to find μ_X̄ and σ_X̄. [Here, σ_X̄ is the standard deviation of the distribution of means and hence is the "typical" variation in these means, i.e. the "typical" error.]
162<br />
Section 5
The Distribution of Sample Means
Suppose a population with distribution X has known mean μ_X and standard deviation σ_X.
Ex: Female adult heights. Suppose μ_X = 169 cm and σ_X = 3.20 cm.
A sample of size n = 4 drawn randomly from the population has values 163, 172, 166, 166, say, with mean x̄_1 = 667/4 = 166.8 cm.
[Figure: the distribution of individual heights (X), centred at μ_X = 169 cm with σ_X = 3.20 cm; axis marked 160 to 178 cm. The four sample values and their mean x̄_1 are plotted.]
The average x̄_1 is not as extreme as the individual values in the sample. x̄_1 is an estimate of μ_X (usually unknown in the real situation).
A second sample of n = 4 gives x̄_2 = 170.5 cm.
A third sample of n = 4 gives x̄_3 = 169.5 cm.
163<br />
Section 5
If this process is continued we can obtain a distribution of sample means. What are the properties of this distribution? These will allow us to decide how well a sample mean estimates μ_X.
[Figure: distributions of means for samples of size n = 10, n = 25 and n = 100, where the population from which the samples are taken is Normal.]
[Figure: distributions of means for samples of size n = 10, n = 25 and n = 100, where the population from which the samples are taken is not Normal. But the sampling distributions are normal.]
Derivation:
Suppose a random sample of size n is taken from a population with distribution X. The sample can be viewed as values from n random variables X_1, X_2, …, X_n, each with mean μ_X and variance σ²_X. X_1, X_2, …, X_n are independent (if the population is large), and are identically distributed.
A value, x̄, from one sample is one value of X̄, the distribution of sample means for samples of size n. Then
X̄ = (1/n)(X_1 + X_2 + … + X_n)
∴ μ_X̄ = (1/n)(μ_X1 + μ_X2 + … + μ_Xn)
      = (1/n)(n μ_X)    (X_1, X_2, etc. identical)
∴ μ_X̄ = μ_X
The addition rule for the variance of independent random variables gives
σ²_X̄ = (1/n)² σ²_X1 + (1/n)² σ²_X2 + … + (1/n)² σ²_Xn
     = (1/n)² (n σ²_X)
     = σ²_X / n
[i.e. if T = aX + bY, then σ²_T = a² σ²_X + b² σ²_Y]
Therefore, the standard deviation of the distribution of sample means is
σ_X̄ = σ_X / √n
The derivations of μ_X̄ = μ_X and σ_X̄ = σ_X / √n need not be known. These two formulae are in fact very important and you must know how to use them.
Note: 1. σ_X̄ is called the standard error of the mean. (It is the "typical" deviation in the mean, i.e. a measure of the precision of the mean.)
2. If μ_X = 169 and σ_X = 3.20 for heights of women, then for a sample of size n = 4, μ_X̄ = 169 and σ_X̄ = σ_X/√4 = 3.20/2 = 1.60.
3. If the sample size (n) is greater than 4, σ_X̄ is smaller, meaning the distribution X̄ is more compact about μ_X̄ = μ_X.
4. If X is normal, it can be shown that X̄ is normal no matter what the sample size.
5. If X is not normal but n is large, then X̄ is approximately normal. (This result is a consequence of the Central Limit Theorem in note 6.)
6. For random samples of size n, the sample means x̄_i fluctuate around the population mean μ_X with standard error σ_X̄ = σ_X/√n. As n increases, the means fluctuate less and less, and their distribution gets closer to a normal distribution.
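The behaviour in notes 1–6 is easy to check by simulation. Below is an illustrative Python sketch (not part of the course material, which uses R-cmdr): it draws repeated samples of size n = 4 from the female-heights population N(169, 3.20²) used earlier and compares the observed spread of the sample means with σ_X/√n = 1.60.

```python
import random
import statistics

# Illustrative simulation (not in the notes): sample means of size n
# should have standard error sigma_X / sqrt(n).
random.seed(1)
mu, sigma, n = 169.0, 3.20, 4      # female-heights example above

means = [
    statistics.mean(random.gauss(mu, sigma) for _ in range(n))
    for _ in range(20000)
]

se_observed = statistics.stdev(means)
se_theory = sigma / n ** 0.5       # 3.20 / 2 = 1.60
print(se_observed, se_theory)
```

With 20 000 simulated samples the observed standard error agrees with 1.60 to about two decimal places.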
Example: Suppose a population has values which are normally distributed (distribution X) with μ_X = 7.9 and σ_X = 0.60.
Find (i) Pr(X > 7.7);
(ii) Pr(X̄ > 7.7), where X̄ is the distribution of means for samples of size n = 9.
Solution:
(i) Pr(X > 7.7) = Pr(Z > (7.7 − 7.9)/0.60) = Pr(Z > −0.333) = 0.6304
(ii) Since μ_X̄ = 7.9 and σ_X̄ = σ_X/√n = 0.60/√9 = 0.2,
Pr(X̄ > 7.7) = Pr(Z > (7.7 − 7.9)/0.2) = Pr(Z > −1) = 0.8413
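As a check of this worked example, Python's standard-library NormalDist gives the same probabilities (a sketch only; the first value differs in the fourth decimal place from the table answer 0.6304 because the z-value −1/3 is not rounded to −0.333).

```python
from statistics import NormalDist

# X ~ N(7.9, 0.60^2); Xbar uses sigma / sqrt(9) = 0.2
p_single = 1 - NormalDist(7.9, 0.60).cdf(7.7)           # Pr(X > 7.7)
p_mean = 1 - NormalDist(7.9, 0.60 / 9 ** 0.5).cdf(7.7)  # Pr(Xbar > 7.7)
print(round(p_single, 4), round(p_mean, 4))
```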
Example: Serum values for a sample of n = 216 give x̄ = 34.46 and s = 5.84. What is the standard error of x̄?
Standard error = σ/√n, where σ is the (unknown) population standard deviation. In practice, we estimate σ by s.
∴ estimated standard error = s/√n = 5.84/√216 = 0.397.
Suppose the sample had been twice the size, n = 432, with the same mean and standard deviation. Estimated s.e. = 5.84/√432 = 0.281 (compare 0.397 for n = 216).
A Confidence Interval for the Mean
The problem here is to use sample data to find an estimate for an unknown population mean μ_X. This estimate reflects the random variation in the data collected by establishing an interval in which we are fairly certain that the mean μ_X lies.
As can be seen, this will complete the research loop concerning the unknown population.
To motivate the procedure we work with the distribution of sample means, X̄, which is N(μ_X, σ²_X̄), or alternatively N(μ_X, σ²_X/n).
First consider the standard Normal:
[Figure: the standard Normal density, with 0.95 of the area between Z = −1.96 and Z = +1.96 and area 0.025 in each tail.]
0.95 = Pr(−1.96 < Z < +1.96)
     = Pr(−1.96 < (X̄ − μ_X)/(σ_X/√n) < +1.96)
     = Pr(−1.96 σ_X/√n < X̄ − μ_X < +1.96 σ_X/√n)
     = Pr(μ_X − 1.96 σ_X/√n < X̄ < μ_X + 1.96 σ_X/√n)
This result is used to construct a 95% confidence interval as follows:
For a sample x_1, x_2, …, x_n of n values from a population, we are said to be 95% confident that the sample mean satisfies
μ_X − 1.96 σ_X/√n < x̄ < μ_X + 1.96 σ_X/√n
But x̄ < μ_X + 1.96 σ_X/√n implies x̄ − 1.96 σ_X/√n < μ_X,
while μ_X − 1.96 σ_X/√n < x̄ implies μ_X < x̄ + 1.96 σ_X/√n.
Therefore, we are 95% confident that the unknown population mean μ_X satisfies
x̄ − 1.96 σ_X/√n < μ_X < x̄ + 1.96 σ_X/√n
Alternatively, we are 95% confident that the true population mean lies in the interval
x̄ ± 1.96 σ_X/√n
Notes: 1. The sample has produced an interval estimate for the unknown population mean.
2. A 99% confidence interval replaces the value 1.96 by 2.58, since the tail areas beyond +2.58 and −2.58 are both 0.005.
[Figure: the standard Normal density, with 0.99 of the area between Z = −2.58 and Z = +2.58 and area 0.005 in each tail.]
3. The 99% confidence interval x̄ ± 2.58 σ_X/√n is wider, hence less precise, but we are now 99% certain μ_X is in this interval.
4. As n increases, σ_X/√n decreases and the confidence interval is narrower, meaning a more precise estimate; i.e. a large sample leads to greater accuracy.
Example: A pharmacologist is investigating the length of time that a sedative is effective. Eight patients are selected at random for a study, and the eight times for which the sedative is effective have mean x̄ = 8.4 hours. (It is also known that the standard deviation for such measures is σ_X = 1.5 hours.)
Find 95% and 99% confidence intervals for the true mean number of hours, μ_X.
Solution: Here, n = 8; x̄ = 8.4; σ_X̄ = 1.5/√8 = 0.53 (assuming that X is normal).
The 95% confidence interval is
8.4 ± 1.96 (0.53)
or 8.4 ± 1.04
That is, 7.36 < μ_X < 9.44 or (7.36, 9.44).
The 99% confidence interval is
8.4 ± 2.58 (0.53)
or 8.4 ± 1.37
That is, 7.03 < μ_X < 9.77 or (7.03, 9.77).
The second interval is much wider.
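A sketch of the same computation in Python (illustrative only; `inv_cdf` supplies the 1.96 and 2.58 multipliers rather than hard-coding them):

```python
from statistics import NormalDist

# Sedative example: n = 8, xbar = 8.4 h, known sigma_X = 1.5 h.
n, xbar, sigma = 8, 8.4, 1.5
se = sigma / n ** 0.5                      # sigma_Xbar = 0.53

z95 = NormalDist().inv_cdf(0.975)          # ~1.96
z99 = NormalDist().inv_cdf(0.995)          # ~2.58
ci95 = (xbar - z95 * se, xbar + z95 * se)
ci99 = (xbar - z99 * se, xbar + z99 * se)
print(ci95, ci99)
```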
Example: The pharmacologist is required to find the value of μ_X to within 15 minutes with 95% confidence. Assuming that the standard deviation is σ_X = 1.5 hours, find the size of the sample which must be taken in order to achieve this accuracy.
Solution: Since 15 minutes is 1/4 hour, for a sample of size n we need x̄ ± 1/4 to be an interval which is wider than
x̄ ± 1.96 σ_X/√n, or x̄ ± 1.96 (1.5)/√n
∴ 1.96 (1.5)/√n ≤ 1/4
Rearranging, 1.96 (1.5)(4) ≤ √n, or 11.76 ≤ √n
Squaring, n ≥ 138.3
Hence, 139 patients must be selected.
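The sample-size calculation can be sketched as (illustrative Python):

```python
import math

# Sample size so that the 95% half-width 1.96 * sigma / sqrt(n)
# is at most 0.25 hours (15 minutes).
sigma, half_width = 1.5, 0.25
n_min = (1.96 * sigma / half_width) ** 2   # (11.76)^2 = 138.2976
n = math.ceil(n_min)                       # round up to a whole patient
print(n_min, n)
```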
Use of the t-table when σ_X is unknown
In all practical contexts, σ_X is not known. In this case it is estimated in the best possible way by the sample standard deviation s_X. In this situation, the t-table provides alternative, larger values in place of 1.96 and 2.58. The confidence intervals are wider and hence there is less precision.
The 95% confidence interval is
x̄ − t_ν s_X/√n < μ_X < x̄ + t_ν s_X/√n
where ν = n − 1 is the "number of degrees of freedom" and t_ν is found in the appropriate column in the t-table for 95% confidence (see table at end of notes).
(Note: ν = n − 1 is also the divisor in the estimate s²_X for the variance.)
Exercise: Now suppose that the pharmacologist did not know the value of σ_X and was forced to take the sample standard deviation from the sample of size n = 8 as the best estimate of σ_X, namely s_X = 1.5 hours. Find 95% and 99% confidence intervals for μ_X.
Solution: x̄ = 8.4 and
estimated standard error = s_X/√n = 1.5/√8 = 0.53
The 95% confidence interval for the mean sedative time μ_X for all such patients is
8.4 ± t_7 (0.53), where t_7 = 2.365
That is, 8.4 ± 1.25
or 7.15 < μ_X < 9.65
The 99% interval is
8.4 ± t_7 (0.53), where t_7 = 3.500
That is, 8.4 ± 1.86
or 6.54 < μ_X < 10.26
Both are wider than before.
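A sketch of this exercise in Python; the t_7 values are copied from the t-table, since the standard library has no t quantile function:

```python
# t-based intervals: n = 8, xbar = 8.4, sigma estimated by s = 1.5.
n, xbar, s = 8, 8.4, 1.5
se = s / n ** 0.5                          # estimated standard error
t95, t99 = 2.365, 3.500                    # t_7 for 95% and 99%
ci95 = (xbar - t95 * se, xbar + t95 * se)
ci99 = (xbar - t99 * se, xbar + t99 * se)
print(ci95, ci99)
```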
Student’s t distribution
[Figure: the t density, with area p (or probability) in the tail beyond t_ν; 2p is the combined area of both tails.]
  ν     2p:  0.100   0.050   0.020   0.010
         p:  0.050   0.025   0.010   0.005
  ⋮
  7          1.895   2.365   2.998   3.500
  ⋮
  ∞          1.645   1.960   2.326   2.576
p refers to the area of one tail; 2p gives the combined area of both tails. (View the t-distribution as a slight modification of the normal distribution Z.)
Notes: 1. The interval is wide when samples are small; that is, there is less precision in the estimates.
2. This last example is the most common situation, where: the population is assumed to be normal; μ_X and σ_X are both unknown; σ_X is estimated by s_X from a random sample of size n.
3. Even for large n the t-table is used. The last row of the table (ν = ∞) gives the normal values 1.96 and 2.58.
4. From the point of view of exams we shall accept the normal distribution value for degrees of freedom greater than 30.
Example: Tablets must be produced which weigh 200 milligrams. A sample of n = 20 is chosen from the production line, giving x̄ = 201.7 mg and s_X = 5.13 mg. Does this sample confirm that μ_X = 200 mg?
Solution: There are ν = 19 degrees of freedom, and t_19 = 2.093 for a 95% confidence interval. Therefore,
201.7 − 2.093 (5.13/√20) < μ_X < 201.7 + 2.093 (5.13/√20)
or 199.3 < μ_X < 204.1
The weight of 200 milligrams lies in this interval. Hence, 200 milligrams is an acceptable value of the mean μ_X with 95% confidence.
The Meaning of a Confidence Interval
[Figure: the intervals from Samples 1 to 100 plotted against the true mean μ_X = 200, over the range 199.3 to 204.1; the interval from Sample 5 does not include μ_X = 200 mg.]
In general, if 100 different samples construct 100 intervals, then five of the 100 will miss μ_X if we are working at 95% confidence levels.
(This is the possible error which must be accepted. With 99% confidence intervals, which are wider, only one will miss μ_X.)
100 Confidence Intervals (95%)
[Figure: 100 individual 95% confidence intervals, from Samples 1 to 100, plotted side by side over the range 199.3 to 204.1.]
In the above, the position of the true mean μ_X is unknown. Also, in practice we only have one of the above intervals. We say we are 95% confident the true mean lies in this interval.
Example:<br />
It is claimed that males committed for trial for<br />
minor <strong>of</strong>fences are spending more time in prison on<br />
rem<strong>and</strong> than females committed for trial for similar<br />
<strong>of</strong>fences. A sample <strong>of</strong> 40 females <strong>and</strong> 49 males<br />
awaiting trial gave the following information. The<br />
outcome measure is time on rem<strong>and</strong> (X days).<br />
                                  Female    Male
Sample mean (x̄_i)                 16.3      29.5
Sample standard deviation (s_i)   14.6      17.2
Sample size (n_i)                 40        49
The difference between the sample means is
x̄_M − x̄_F = 29.5 − 16.3 = 13.2 days
Is this an important difference?
If μ_M and μ_F are the population mean times for males and females, a 95% confidence interval for μ_M − μ_F is
x̄_M − x̄_F ± 1.96 √(s²_M/n_M + s²_F/n_F)
= 13.2 ± 1.96 √((17.2)²/49 + (14.6)²/40)
= 13.2 ± 6.61
= (6.59, 19.81)
or 6.59 < μ_M − μ_F < 19.81
The population male remand time is likely to be between 6.59 and 19.81 days longer than that for females (alternatively, the true mean difference is between 6.59 and 19.81 days).
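A quick computational check of the interval (illustrative Python sketch):

```python
# Remand-time example: large-sample 95% CI for mu_M - mu_F.
xm, sm, nm = 29.5, 17.2, 49   # males
xf, sf, nf = 16.3, 14.6, 40   # females
diff = xm - xf
se = (sm**2 / nm + sf**2 / nf) ** 0.5      # standard error of difference
ci = (diff - 1.96 * se, diff + 1.96 * se)
print(diff, ci)
```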
Case 2: Comparing means when samples are small
In this situation the CLT no longer holds for the difference between the sample means. Instead we need to assume that the population from which the difference is drawn is normally distributed. This should be the case if the populations from which the two small samples are drawn are normal.
In addition to assuming normality, we assume the two populations have equal variances.
Suppose σ²_1 and σ²_2 are similar and equal to σ², say. Then the 95% confidence interval for μ_1 − μ_2 is
(x̄_1 − x̄_2) ± 1.96 σ √(1/n_1 + 1/n_2)
that is,
(x̄_1 − x̄_2) − 1.96 σ √(1/n_1 + 1/n_2) < μ_1 − μ_2 < (x̄_1 − x̄_2) + 1.96 σ √(1/n_1 + 1/n_2)
The common variance σ² needs to be estimated from sample data. If both populations have the same variance, the best estimate for σ² is found when the variation in both samples is averaged to give the pooled estimate s²_p, where
s²_p = [(n_1 − 1) s²_1 + (n_2 − 1) s²_2] / (n_1 + n_2 − 2)
with
s²_1 = Σ(x_1i − x̄_1)²/(n_1 − 1) and s²_2 = Σ(x_2i − x̄_2)²/(n_2 − 1)
When sample estimates for the variances are used, replace 1.96 by the t-value to get
(x̄_1 − x̄_2) ± t_ν s_p √(1/n_1 + 1/n_2)
with degrees of freedom ν = n_1 + n_2 − 2.
Example 3: The following data are 24-hour total energy expenditures (MJ/day) in groups of lean and obese patients (1986 study).
Lean (n = 13): 6.13, 7.05, 7.48, 7.48, 7.53, 7.58, 7.90, 8.08, 8.09, 8.11, 8.40, 10.15, 10.88
  Mean: 8.066; S.D.: 1.238
Obese (n = 9): 8.79, 9.19, 9.21, 9.68, 9.69, 9.97, 11.51, 11.85, 12.79
  Mean: 10.298; S.D.: 1.398
Question: Is there a difference in energy expenditure between lean and obese patients?
Possible explanations for the difference between<br />
samples in above situations:<br />
1. bias (need to r<strong>and</strong>omise)<br />
2. confounding (e.g. gender, age)<br />
3. chance (r<strong>and</strong>om variation)<br />
4. true difference<br />
The methods we discuss in next few lectures assume<br />
that bias <strong>and</strong> confounding are not the explanation.<br />
n_1 = 13; x̄_1 = 8.066; s_1 = 1.238 (lean)
n_2 = 9; x̄_2 = 10.298; s_2 = 1.398 (obese)
Solution: x̄_2 − x̄_1 = 2.232 (obese − lean)
s²_p = [(13 − 1)(1.238)² + (9 − 1)(1.398)²] / (13 + 9 − 2)
     = [12(1.533) + 8(1.954)] / 20
     = 1.7014
∴ s_p = √1.7014 = 1.304
ν = 20, giving t_20 = 2.086 for a 95% interval.
∴ the 95% confidence interval is
2.232 ± 2.086 (1.304) √(1/13 + 1/9)
or 2.232 ± 1.180
That is, 1.05 < μ_obese − μ_lean < 3.41 MJ/day
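The pooled calculation can be sketched as (illustrative Python; t_20 = 2.086 taken from the t-table):

```python
# Pooled two-sample t interval for the energy-expenditure example.
n1, x1, s1 = 13, 8.066, 1.238   # lean
n2, x2, s2 = 9, 10.298, 1.398   # obese
sp2 = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)
sp = sp2 ** 0.5                  # pooled standard deviation
t20 = 2.086                      # t-table, nu = n1 + n2 - 2 = 20
half = t20 * sp * (1 / n1 + 1 / n2) ** 0.5
ci = (x2 - x1 - half, x2 - x1 + half)
print(sp2, ci)
```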
Note: This confidence interval tells us that we<br />
can be 95% sure that the true difference in energy<br />
expenditure between obese <strong>and</strong> lean patients is<br />
between 1.05 <strong>and</strong> 3.41 MJ/day.<br />
Since this interval is entirely positive, it means<br />
that we can conclude that lean patients have lower<br />
energy expenditure than obese patients.<br />
Notes: 1. ν = n_1 + n_2 − 2 is the divisor in the formula for s²_p, the variance estimate. (The degrees of freedom are always the divisor in the variance estimate, e.g. n − 1 in the single-sample case.)
2. Both populations should have values which are normally distributed if the samples are small.
3. The two population variances, σ²_1 and σ²_2, should be approximately equal. (Otherwise we may need to transform the data or use another test.) R-cmdr has an option which confirms this.
4. The two samples from the two populations are random and independent of each other.
5. Testing whether μ_1 = μ_2 can be achieved by seeing if μ_1 − μ_2 = 0, i.e. checking whether 0 lies in the confidence interval for the difference.
6. It is possible to obtain the probability value associated with the study outcome value of 2.232 (see later).
Example: A nutrition scientist is assessing a weight-loss programme to evaluate its effectiveness. Ten people are randomly selected; their initial weight is recorded, together with their followup weight 20 weeks later.
Subject Initial Weight (x Ii ) Weight at followup (x Fi )<br />
1 180 165<br />
2 142 138<br />
3 126 128<br />
4 138 136<br />
5 175 170<br />
6 205 197<br />
7 116 115<br />
8 142 128<br />
9 157 144<br />
10 136 130<br />
Find a 95% confidence interval for the reduction in weight (assuming the two sets of values are independent).
x̄_I = 151.7    x̄_F = 145.1
s²_I = 750.76   s²_F = 620.01
s²_p = [9(750.76) + 9(620.01)] / 18 = 685.39
Since ν = 18, giving t_18 = 2.101, we get
(151.7 − 145.1) ± 2.101 √685.39 √(1/10 + 1/10)
or 6.6 ± 24.6
That is, −18.0 < μ_I − μ_F < 31.2
Note 1: Since the confidence interval includes 0,<br />
conclude there is no evidence to indicate that the<br />
weight loss programme has altered weights.<br />
Note 2: In this study the two sets <strong>of</strong> data are not<br />
independent. One person produces two values<br />
here.<br />
Case 3: Comparing means with matched data
It is natural to consider the differences d_i in the weights for each person rather than considering the two samples separately. The d_i are the data now, and a confidence interval is constructed for the average difference μ_d based on the single sample of differences. The 95% confidence interval is
d̄ ± t_ν s_d/√n
where d̄ is the average of the d_i, n is the number of data pairs, ν = n − 1, and s_d is the standard deviation of the differences. We have
s_d = √[Σ(d_i − d̄)²/(n − 1)]
with n − 1 degrees of freedom.
Example: Weight loss programme again
Subject   x_Ii   x_Fi   d_i = x_Ii − x_Fi   (d_i − d̄)²
1         180    165    15                  70.56
2         142    138    4                   6.76
3         126    128    −2                  73.96
4         138    136    2                   21.16
5         175    170    5                   2.56
6         205    197    8                   1.96
7         116    115    1                   31.36
8         142    128    14                  54.76
9         157    144    13                  40.96
10        136    130    6                   0.36
Total                   66                  304.40
d̄ = 66/10 = 6.6
s²_d = Σ(d_i − d̄)²/(n − 1) = 304.4/9 = 33.82
ν = n − 1 = 9, giving t_9 = 2.262 for a 95% interval.
The 95% confidence interval for the average difference is
6.6 ± 2.262 √33.82/√10
or 6.6 ± 4.2
That is, 2.4 < μ_d < 10.8
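The paired analysis can be sketched directly from the raw weights (illustrative Python; t_9 = 2.262 from the t-table):

```python
import statistics

# Paired analysis of the weight-loss data: CI for the mean difference.
initial = [180, 142, 126, 138, 175, 205, 116, 142, 157, 136]
followup = [165, 138, 128, 136, 170, 197, 115, 128, 144, 130]
d = [i - f for i, f in zip(initial, followup)]

dbar = statistics.mean(d)                  # 6.6
sd = statistics.stdev(d)                   # sqrt(304.4 / 9)
n = len(d)
t9 = 2.262                                 # t-table, nu = n - 1 = 9
half = t9 * sd / n ** 0.5
ci = (dbar - half, dbar + half)
print(dbar, sd**2, ci)
```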
There is evidence that the weight loss programme<br />
has reduced weights since the difference <strong>of</strong> 0 is<br />
not in this interval (we are 95% sure).<br />
Notes: (1) The “pr<strong>of</strong>ile” <strong>of</strong> each person is<br />
constant in this study because the same<br />
person has produced the two values.<br />
(2) A test involving paired data based on d̄ is called a paired t-test. The earlier test on μ_1 − μ_2 is called an unpaired t-test.
(3) Negative differences are possible in this<br />
analysis when subtracting. Be consistent with<br />
subtraction process.<br />
Confidence Intervals for a Proportion
Suppose X is a binomial distribution with parameters n and π (i.e. the number of "successes" lies between 0 and n). Then
μ_X = nπ    σ_X = √(nπ(1 − π))
Suppose one sample produces a proportion of successes p = x/n in n trials. Many such samples can be taken to get different values of p. The resulting distribution (P) of these proportions is normal (by the Central Limit Theorem). It follows that
P = X/n
where X is binomial. The mean and standard deviation of P are then
μ_P = (1/n) μ_X = (1/n) nπ = π
and, since σ²_P = (1/n)² σ²_X = (1/n²) nπ(1 − π),
σ_P = √(π(1 − π)/n)
The sample proportion (p) estimates the unknown true population proportion (π) (e.g. the prevalence of asthma in women is not known). Thus the estimated standard error is
√(p(1 − p)/n)
and the 95% confidence interval for π is
p ± 1.96 √(p(1 − p)/n)
Note: 1.96 (or the 99% equivalent 2.58) is always used for confidence intervals for proportions. (If the sample is small, the normal distribution is not a good approximation.)
Example: A r<strong>and</strong>om sample <strong>of</strong> 500 Auckl<strong>and</strong>ers<br />
taken in 1996 had 173 supporting aerial spraying<br />
to eradicate tussock moth. Estimate the<br />
proportion (π) <strong>of</strong> Auckl<strong>and</strong>ers who support this.<br />
Solution:
p = x/n = 173/500 = 0.346
and
√(p(1 − p)/n) = √(0.346(1 − 0.346)/500) = 0.021
The 95% confidence interval is<br />
0.346 ± 1.96(0.021)<br />
or 0.346 ± 0.041<br />
Therefore, 0.305 < π < 0.387<br />
We are 95% sure that between 30.5% <strong>and</strong> 38.7%<br />
<strong>of</strong> the Auckl<strong>and</strong> population support the spraying.<br />
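A computational sketch of this interval (illustrative Python; the end-points differ from the values above in the third decimal place because the standard error is not rounded to 0.021):

```python
# Aerial-spraying example: 95% CI for the proportion pi.
x, n = 173, 500
p = x / n                                  # 0.346
se = (p * (1 - p) / n) ** 0.5              # ~0.021
ci = (p - 1.96 * se, p + 1.96 * se)
print(p, ci)
```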
Note: Alternatively, we could say 34.6% of the population support spraying with a margin of error of 4.1%. But the 'margin of error' concept must be used with caution. It is reasonable if the value of p lies between 0.3 and 0.7, but the margin of error should be adjusted if p lies outside this range. (We omit this adjustment.)
Example: An epidemiologist estimates the proportion of women with asthma. Find the sample size (n) needed to give an estimate for this proportion with an error of no more than 0.03 with 95% confidence.
Solution: The largest possible value of p(1 − p) occurs when p = 1/2 (verify this by choosing several p values). The most conservative (or safest) sample size is obtained using this value p = 1/2. The requested accuracy requires the confidence interval p ± 0.03 to be the largest interval. But the actual interval is
p ± 1.96 √(0.5(1 − 0.5)/n)
for sample size n. Therefore
1.96 √(0.5(1 − 0.5)/n) < 0.03
∴ (1.96)²(0.5)(0.5)/n < (0.03)²
∴ n > (1.96)²(0.5)(0.5)/(0.03)² = 1067.11
It follows that 1068 women must be tested.
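The conservative sample-size rule can be sketched as (illustrative Python):

```python
import math

# Conservative sample size: worst case p = 1/2, margin 0.03 at 95%.
z, margin = 1.96, 0.03
n_min = z**2 * 0.5 * 0.5 / margin**2       # largest p(1 - p) is 0.25
n = math.ceil(n_min)                       # round up to a whole person
print(n_min, n)
```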
Now consider the Confidence Interval for the Difference Between Two Proportions (derivation not examined).
The difference π_1 − π_2 is estimated by p_1 − p_2, where p_1 = x_1/n_1 and p_2 = x_2/n_2 for the two samples.
The distribution P_1 − P_2 of sample proportion differences is a normal distribution with
μ_(P_1 − P_2) = π_1 − π_2
and standard deviation (standard error)
σ_(P_1 − P_2) = √(π_1(1 − π_1)/n_1 + π_2(1 − π_2)/n_2)
using the addition rule for the mean <strong>and</strong> variance<br />
<strong>of</strong> two independent r<strong>and</strong>om variables, P 1 <strong>and</strong> P 2 .<br />
If π 1 <strong>and</strong> π 2 are estimated from sample data, the<br />
95% confidence interval is<br />
(p_1 − p_2) ± 1.96 √(p_1(1 − p_1)/n_1 + p_2(1 − p_2)/n_2)
Exercise: To study the effectiveness of a drug for arthritis, two samples of patients were randomly selected. One sample of 100 was injected with the drug; the other sample of 60 received a placebo injection. After a period of time the patients were asked if their arthritic condition had improved. The results were:
EXPOSURE<br />
DRUG(1) PLACEBO(2)<br />
IMPROVED 59 22<br />
NOT IMPROVED 41 38<br />
TOTAL 100 60<br />
Solution: The proportions improved are
p_1 = 59/100 = 0.59 and p_2 = 22/60 = 0.37
p_1 − p_2 = 0.22, and the estimated standard error of the difference between the proportions is
√(0.59(1 − 0.59)/100 + 0.37(1 − 0.37)/60) = 0.0794
The 95% confidence interval is<br />
0.22 ± 1.96 (0.0794)<br />
or 0.22 ± 0.156<br />
or 0.064 < π₁ − π₂ < 0.376<br />
Since 0 is excluded from the interval and the interval is entirely positive, there is evidence that π₁ − π₂ > 0. That is, we conclude the proportion improved is higher when the drug is used.<br />
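The arithmetic above can be checked with a short script. This is a sketch in Python rather than the R-cmdr used in the course, and the function name is mine; the formula is the confidence interval given above.<br />

```python
import math

def two_proportion_ci(x1, n1, x2, n2, z=1.96):
    """95% CI for pi1 - pi2: (p1 - p2) +/- z*sqrt(p1(1-p1)/n1 + p2(1-p2)/n2)."""
    p1, p2 = x1 / n1, x2 / n2
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    diff = p1 - p2
    return diff, diff - z * se, diff + z * se

# Arthritis trial: 59/100 improved on the drug, 22/60 on placebo
diff, lo, hi = two_proportion_ci(59, 100, 22, 60)
# Unrounded p2 = 22/60 = 0.3667, so diff is about 0.223
# (the worked solution rounds p2 to 0.37, giving 0.22)
```

Because the lower limit is positive, 0 is excluded from the interval, matching the conclusion above.<br />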
REVIEW EXERCISES<br />
2. A population is known to be normally distributed with a mean µx = 60 <strong>and</strong> st<strong>and</strong>ard deviation σx =<br />
15. Let X be the distribution <strong>of</strong> means <strong>of</strong> samples <strong>of</strong> size 25 drawn from the population.<br />
(a) Define completely the probability distribution X.<br />
(b) What is the probability that a value in the population will lie between 55 and 65?<br />
(c) What is the probability that the mean of a sample of size 25 will lie between 55 and 65? (4 marks)<br />
3. Large studies indicate that the mean cholesterol level in children aged 2 – 14 is 175 mg%/mL <strong>and</strong><br />
the st<strong>and</strong>ard deviation is 30 mg%/mL.<br />
The problem here is to see if there is a familial aggregation <strong>of</strong> cholesterol levels. A group <strong>of</strong> fathers<br />
who have had a heart attack <strong>and</strong> have elevated cholesterol levels (≥ 250 mg%/mL) are identified.<br />
The cholesterol levels <strong>of</strong> their <strong>of</strong>fspring within the 2-14 age range are measured. The mean<br />
cholesterol level in a group <strong>of</strong> 100 such children is 207.3 mg%/mL. The problem is to decide if this<br />
value is sufficiently far from 175 mg%/mL for us to believe that the underlying mean cholesterol<br />
level μ in the population <strong>of</strong> all children selected in this way is greater than 175 mg%/mL.<br />
(a) Construct a 95% confidence interval for μ on the basis <strong>of</strong> the sample data. State your conclusion<br />
about familial aggregation <strong>of</strong> cholesterol levels.<br />
(2 marks)<br />
(b) Find the probability <strong>of</strong> obtaining the sample mean <strong>of</strong> 207.3 mg%/mL or a value which is greater<br />
under the assumption that there is no familial aggregation. State your conclusion from this<br />
probability.<br />
(2 marks)<br />
4. Patients with chronic kidney failure may be treated by dialysis, using a machine that removes toxic<br />
wastes from the blood, a function normally performed by the kidneys. Kidney failure <strong>and</strong> dialysis<br />
can cause other changes, such as retention <strong>of</strong> phosphorus, that must be corrected by changes in diet.<br />
A study <strong>of</strong> the nutrition <strong>of</strong> dialysis patients measured the level <strong>of</strong> phosphorus in the blood on six<br />
occasions. Here are the data for one patient (milligrams of phosphorus per decilitre of blood):<br />
5.5 6.1 4.8 5.8 6.2 4.6<br />
The measurements are separated in time <strong>and</strong> can be considered a r<strong>and</strong>om sample <strong>of</strong> the patient’s<br />
blood phosphorus level.<br />
(a) If the level varies normally with σ = 0.8 mg/dl, find a 95% confidence interval for the mean blood phosphorus level of this patient. (1 mark)<br />
(b) If the value of σ is unknown but estimated by the sample standard deviation s = 0.669, find a 95% confidence interval for the mean blood phosphorus level of this patient. (1 mark)<br />
(c) The normal range of phosphorus in the blood is considered to be 2.6 to 4.8 mg/dl. Is there evidence that the patient has a mean phosphorus level that exceeds 4.8? Explain. (1 mark)<br />
5. A salmon fishing company is monitoring the weight <strong>of</strong> salmon in its ponds prior to harvest. A pilot<br />
sample <strong>of</strong> ten fish, r<strong>and</strong>omly selected, shows a mean weight <strong>of</strong> 2.31 kilograms with a st<strong>and</strong>ard<br />
deviation <strong>of</strong> 0.17 kilogram.<br />
(a) Obtain a 95% confidence interval for the mean weight of all salmon in the ponds. (2 marks)<br />
(b) Using the standard deviation from the pilot survey as an estimate of the true variation of weights of salmon in the ponds, establish how many fish should be sampled to obtain an estimate of the mean weight of all the salmon in the ponds to within 0.03 kilogram with 95% confidence. (Take 2 as an approximation to the value of t.) (3 marks)<br />
SOLUTIONS<br />
2. (a) X̄ is a normal distribution with μ_X̄ = 60 and σ_X̄ = 15/√25 = 3,<br />
i.e. X̄ ~ N(60, 9)<br />
(b) Pr(55 < X < 65) = Pr((55 − 60)/15 < Z < (65 − 60)/15)<br />
= Pr(–0.33 < Z < 0.33)<br />
= 2(0.1293)<br />
= 0.2586 approx.<br />
(c) Pr(55 < X̄ < 65) = Pr((55 − 60)/3 < Z < (65 − 60)/3)<br />
= Pr(–1.67 < Z < 1.67)<br />
= 2(0.4525)<br />
= 0.9050 approx.<br />
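The table look-ups in (b) and (c) can be reproduced numerically. This is a sketch in Python (not the course's R-cmdr), using the standard normal CDF Φ built from the error function; the small differences from the worked answers come from rounding z to two decimal places before using the table.<br />

```python
import math

def phi(z):
    """Standard normal CDF, Phi(z), via the error function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def prob_between(lo, hi, mu, sigma):
    """Pr(lo < X < hi) for X ~ N(mu, sigma^2)."""
    return phi((hi - mu) / sigma) - phi((lo - mu) / sigma)

# (b) an individual value: X ~ N(60, 15^2)
p_individual = prob_between(55, 65, 60, 15)
# (c) a sample mean with n = 25, so the SD is 15/sqrt(25) = 3
p_mean = prob_between(55, 65, 60, 15 / math.sqrt(25))
# Unrounded z gives about 0.2611 and 0.9044; the table answers
# (0.2586 and 0.9050) use z rounded to 0.33 and 1.67
```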
REVIEW EXERCISES<br />
2. The extent to which X-rays can penetrate tooth enamel has been suggested as a suitable<br />
mechanism for differentiating between males <strong>and</strong> females in forensic medicine. Listed<br />
below in appropriate units are the ‘spectropenetration gradients’ for eight female teeth <strong>and</strong><br />
eight male teeth:<br />
Male (x₁): 4.9 5.4 5.0 5.5 5.4 6.6 6.3 4.3<br />
Female (x₂): 4.8 5.3 3.7 4.1 5.6 4.0 3.6 5.0<br />
The data give sample means x̄₁ = 5.4250, x̄₂ = 4.5125 and sample variances s₁² = 0.5536, s₂² = 0.5784.<br />
(a) Calculate the pooled estimate for the variance common to the male <strong>and</strong> female<br />
populations.<br />
(1 mark)<br />
(b) Estimate the st<strong>and</strong>ard error <strong>of</strong> the difference between the population means. (1 mark)<br />
(c) Construct a 95% confidence interval for the difference between the two population<br />
means.<br />
(1 mark)<br />
(d) What conclusion do you now draw about the procedure for differentiating between males and females?<br />
(1 mark)<br />
SOLUTIONS<br />
2. (a) s_p² = [(n₁ − 1)s₁² + (n₂ − 1)s₂²] / (n₁ + n₂ − 2)<br />
= [7(0.5536) + 7(0.5784)] / (8 + 8 − 2)<br />
= ½(1.132)<br />
= 0.566<br />
(b) Estimated standard error of difference = √[s_p²(1/n₁ + 1/n₂)] = √[0.566(1/8 + 1/8)] = 0.376<br />
(c) The 95% confidence interval is (x̄₁ − x̄₂) ± t₁₄(0.376)<br />
That is, (5.4250 − 4.5125) ± 2.145(0.376)<br />
or 0.9125 ± 0.8065<br />
giving 0.106 < μ₁ − μ₂ < 1.719<br />
(d) We are 95% sure that there is a difference in the mean tooth penetrations for males <strong>and</strong><br />
females since 0 does not lie in the confidence interval in (c). (Because the confidence<br />
interval is positive the male tooth penetration will be greater.)<br />
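Parts (a) to (c) follow a fixed recipe, so they are easy to verify in code. This is a sketch in Python (not the course's R-cmdr); the function name is mine, and the t critical value 2.145 for 14 degrees of freedom is taken from the worked solution rather than computed.<br />

```python
import math

def pooled_t_ci(x1bar, s1sq, n1, x2bar, s2sq, n2, t_crit):
    """CI for mu1 - mu2 using a pooled variance estimate (equal-variance model)."""
    sp2 = ((n1 - 1) * s1sq + (n2 - 1) * s2sq) / (n1 + n2 - 2)
    se = math.sqrt(sp2 * (1 / n1 + 1 / n2))
    diff = x1bar - x2bar
    return sp2, se, diff - t_crit * se, diff + t_crit * se

# Tooth data: n = 8 in each group, t_crit = 2.145 for 14 d.f. at 95%
sp2, se, lo, hi = pooled_t_ci(5.4250, 0.5536, 8, 4.5125, 0.5784, 8, t_crit=2.145)
```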
[A] DISTRIBUTION SUMMARY<br />
1. Binomial (X): n trials; π is the probability of success (discrete)<br />
μ_X = nπ and σ_X = √[nπ(1 − π)]<br />
2. Normal (X): (continuous)<br />
Parameters are μ_X and σ_X<br />
3. Standard Normal (Z = (X − μ_X)/σ_X)<br />
Parameters are μ_Z = 0 and σ_Z = 1<br />
4. Normal Approximation to Binomial<br />
Original binomial has parameters n and π. The normal approximation has parameters μ_X = nπ and σ_X = √[nπ(1 − π)]<br />
5. Distribution of Sample Means (X̄)<br />
Normal with μ_X̄ = μ_X and σ_X̄ = σ_X/√n. The standard deviation σ_X̄ is also called the standard error of the mean.<br />
6. Distribution of Differences between Sample Means (X̄₁ − X̄₂)<br />
μ = μ₁ − μ₂<br />
σ = √[σ₁²/n₁ + σ₂²/n₂] = σ√[1/n₁ + 1/n₂] if σ₁ = σ₂ = σ<br />
7. Distribution of Sample Proportions (P)<br />
μ_P = π and σ_P = √[π(1 − π)/n]<br />
8. Distribution of Differences between Sample Proportions (P₁ − P₂)<br />
μ = π₁ − π₂<br />
σ = √[π₁(1 − π₁)/n₁ + π₂(1 − π₂)/n₂]<br />
Estimates for π, μ and σ are found from sample data and are given by p, x̄ and s.<br />
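Item 5 of the summary (σ_X̄ = σ_X/√n) can be illustrated by simulation. This is a sketch in Python (not the course's R-cmdr): it draws many samples of size 25 from the N(60, 15²) population of review exercise 2 and checks that the sample means have mean near 60 and standard deviation near 15/√25 = 3.<br />

```python
import math
import random
import statistics

# Population: mu = 60, sigma = 15; samples of n = 25
random.seed(1)
mu, sigma, n = 60, 15, 25
means = [statistics.fmean(random.gauss(mu, sigma) for _ in range(n))
         for _ in range(20000)]
mean_of_means = statistics.fmean(means)  # close to mu = 60
sd_of_means = statistics.stdev(means)    # close to sigma/sqrt(n) = 3
```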
[B] SUMMARY: CONFIDENCE INTERVALS<br />
1. Mean: x̄ ± t_ν s/√n with ν = n − 1 D.F.<br />
2. Difference Between Means (small samples and independent, normal populations with equal variances):<br />
(x̄₁ − x̄₂) ± t_ν s_p √[1/n₁ + 1/n₂] with ν = n₁ + n₂ − 2.<br />
Here, s_p² = [(n₁ − 1)s₁² + (n₂ − 1)s₂²] / (n₁ + n₂ − 2)<br />
Note: If samples ≥ 30, (x̄₁ − x̄₂) ± 1.96 √[s₁²/n₁ + s₂²/n₂]<br />
3. Difference Between Means (paired populations): d̄ ± t_ν s_d/√n with ν = n − 1<br />
4. Proportion: p ± 1.96 √[p(1 − p)/n]<br />
5. Difference Between Two Proportions: (p₁ − p₂) ± 1.96 √[p₁(1 − p₁)/n₁ + p₂(1 − p₂)/n₂]<br />
SECTION 6<br />
This section reviews hypothesis testing, type 1 <strong>and</strong> type 2 errors, conclusive <strong>and</strong> inconclusive<br />
results <strong>and</strong> the power <strong>of</strong> a study.<br />
Null <strong>and</strong> Alternative Hypotheses<br />
Study Based <strong>and</strong> Data Driven Hypotheses<br />
One <strong>and</strong> Two Sided Tests<br />
Four Steps in the Hypothesis Testing Procedure<br />
Examples<br />
Pooled proportion estimate<br />
Clinical <strong>and</strong> Ecological Importance<br />
Conclusive <strong>and</strong> Inconclusive Results<br />
Errors in Hypothesis Testing<br />
Power <strong>of</strong> a Study<br />
Examples<br />
215<br />
Section 6
Hypothesis Testing<br />
In most scientific studies we set up hypotheses beforehand about the treatments (or populations) which are the focus of the study. A null hypothesis (H₀) is a claim about a treatment which is assumed to be true unless the data collected in our study show substantial evidence against H₀. At the same time we propose a research or alternative hypothesis (H_A) which will be adopted if there is sufficient evidence against the null hypothesis.<br />
There are two types of alternative hypotheses:<br />
(i) a study based hypothesis, which implies that we do not know at the outset whether a new treatment is beneficial or possibly harmful, and<br />
(ii) a data based hypothesis, which is suggested by the very nature of the collected data and which will usually suggest treatment benefit.<br />
If the data suggest harm we are likely to terminate<br />
the study quickly but if the data suggest benefit we<br />
need to know if the benefit is clinically important.<br />
The study based alternative will usually lead to a<br />
two sided test while the data based alternative<br />
will lead to a one sided test. In the literature, the<br />
two sided test is by far the most common.<br />
There are FOUR STEPS in the st<strong>and</strong>ard<br />
hypothesis testing procedure.<br />
Step (1) A null hypothesis (H 0 ) is assumed about<br />
a population parameter.<br />
Step (2) An alternative (research) hypothesis is<br />
proposed. This is accepted if H 0 is rejected.<br />
Step (3) A test statistic is computed from data.<br />
It is the st<strong>and</strong>ardised value <strong>of</strong> a sample<br />
mean, sample proportion or sample<br />
difference obtained from the data. It is either<br />
a z-score (large sample) or a t-score (for<br />
small samples) given by<br />
test statistic = (observed sample value − null value) / (estimated standard error)<br />
That is, the number <strong>of</strong> st<strong>and</strong>ard deviations<br />
from null value to the sample value. It is this<br />
test statistic which allows calculation <strong>of</strong> the<br />
p-value associated with the outcome <strong>of</strong> a<br />
particular study.<br />
Step (4) The probability <strong>of</strong> observing the value <strong>of</strong><br />
the test statistic in step (3), or a value which is<br />
even more extreme, is calculated under the<br />
assumption that the null hypothesis is true.<br />
This probability is the p-value for the test<br />
statistic. The test statistic has <strong>of</strong> course<br />
summarized the data in the study. We draw<br />
appropriate conclusions if the p-value is less<br />
than 0.05.<br />
Examples Hypothesis Testing<br />
Exercise: Suppose the resting pulse rates for<br />
young women are normally distributed with mean<br />
μ = 66 <strong>and</strong> st<strong>and</strong>ard deviation σ = 9.2 beats per<br />
minute. A drug for the treatment <strong>of</strong> a medical<br />
condition is administered to 100 young women<br />
<strong>and</strong> their average pulse rate is found to be x = 68<br />
beats per minute. Because the drug had for a long<br />
time been observed to increase pulse rates, test<br />
the claim that the drug does in fact increase the<br />
pulse rates. (i.e. H A is data based.)<br />
Solution:<br />
Step (1) H 0 : μ = 66 (the null hypothesis)<br />
Step (2) H A : μ > 66 (the research hypothesis)<br />
Step(3) x = 68 from sample data. Assuming H 0<br />
is true, <strong>and</strong> noting that population st<strong>and</strong>ard<br />
deviation is known, st<strong>and</strong>ardising x leads to<br />
z = (observed sample mean − null mean) / (standard error of the mean)<br />
= (x̄ − μ) / (σ/√n)<br />
= (68 − 66) / (9.2/√100)<br />
= 2.174<br />
Step (4): Calculate the p-value assuming μ = 66<br />
[sketch: normal curve centred at 66, with x̄ = 68 (z = 2.174) marked in the upper tail]<br />
p-value = Pr(X̄ > 68 given μ = 66)<br />
= Pr(Z > (68 − 66)/(9.2/√100))<br />
= Pr(Z > 2.174) = 0.015<br />
This means that if H 0 is true, there is only a<br />
probability <strong>of</strong> 0.015 <strong>of</strong> observing a sample mean<br />
as large or larger than 68. Hence there is little<br />
support for H 0 . Reject H 0 <strong>and</strong> conclude the mean<br />
pulse rate has been increased by the treatment.<br />
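The four steps for this example reduce to a few lines of code. This is a sketch in Python (not the course's R-cmdr); the function name is mine, and the p-value is the upper-tail area matching the one-sided, data based alternative.<br />

```python
import math

def phi(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def one_sample_z(xbar, mu0, sigma, n):
    """z statistic and upper-tail p-value for a one-sided test of mu > mu0."""
    z = (xbar - mu0) / (sigma / math.sqrt(n))
    return z, 1 - phi(z)

# Pulse rates: x-bar = 68, H0: mu = 66, sigma = 9.2, n = 100
z, p = one_sample_z(68, 66, 9.2, 100)  # z about 2.174, p about 0.015
```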
Notes: 1. R-cmdr <strong>and</strong> other statistical<br />
packages give a p-value directly beside the<br />
study result or the test statistic. If the p-value<br />
is less than 0.05 we have significance at the<br />
5% level <strong>and</strong> if p-value is less than 0.01 we<br />
have significance at the 1% level.<br />
2. If σ is unknown but estimated from the sample,<br />
the st<strong>and</strong>ardised statistic is t <strong>and</strong> the p-value is<br />
found from the t-table with appropriate degrees<br />
<strong>of</strong> freedom. (The exact p-value is not possible<br />
since only a few values are given at top <strong>of</strong><br />
columns in t-table.)<br />
e.g. Suppose s = 9.2 rather than σ = 9.2 <strong>and</strong><br />
sample size is n = 100.<br />
Then, p-value = Pr(t > (68 − 66)/(9.2/√100))<br />
= Pr(t > 2.174) with 99 DF<br />
t = 2.174 lies between the values in the columns headed p = 0.025 and p = 0.010. Hence the p-value lies between these two numbers. (R-cmdr gives the exact value.)<br />
Exercise: In a large overseas city it was<br />
estimated that 15% <strong>of</strong> girls between the ages <strong>of</strong><br />
14 <strong>and</strong> 18 became pregnant. Concerned parents<br />
<strong>and</strong> health workers introduced an educational<br />
programme in an effort to lower this percentage.<br />
After four years <strong>of</strong> the programme, a r<strong>and</strong>om<br />
sample <strong>of</strong> n = 293 18-year-old girls revealed that<br />
27 had become pregnant.<br />
(a) Define null and alternative hypotheses for investigating whether the proportion becoming pregnant after the educational programme has decreased. (Suppose the alternative hypothesis is one sided.)<br />
(b) Calculate the probability value.<br />
(c) State your conclusion.<br />
Step(1): H 0 : π = 0.15 (15% still become<br />
pregnant)<br />
Step(2): H A : π < 0.15 (less than 15% become<br />
pregnant)<br />
Step (3): Sample gives p = 27/293 = 0.092<br />
z = (observed proportion − null proportion) / (standard error of the proportion)<br />
= (p − π) / √[π(1 − π)/n] under H₀: π = 0.15<br />
= (0.092 − 0.15) / √[0.15(1 − 0.15)/293]<br />
= –2.78<br />
(use π = 0.15 and not 0.092 in the standard error)<br />
Step (4): p-value = Pr(Z < –2.78)<br />
= 0.5000 − 0.4973<br />
= 0.0027<br />
[sketch: standard normal curve with the lower tail below z = –2.78 shaded]<br />
There is evidence that after the education<br />
campaign the proportion becoming pregnant has<br />
reduced.<br />
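The same one-proportion test can be scripted. This is a sketch in Python (not the course's R-cmdr); the function name is mine. Note that, as in the worked steps, the standard error uses the null value π = 0.15, not the sample proportion.<br />

```python
import math

def phi(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def one_proportion_z(x, n, pi0):
    """z statistic and lower-tail p-value; the SE uses the null value pi0."""
    p_hat = x / n
    se = math.sqrt(pi0 * (1 - pi0) / n)
    z = (p_hat - pi0) / se
    return z, phi(z)

z, p = one_proportion_z(27, 293, 0.15)
# Unrounded p_hat = 27/293 gives z about -2.77; the worked steps round
# p_hat to 0.092 first, giving -2.78
```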
Exercise: The birthweight <strong>of</strong> a baby is thought to<br />
be associated with the smoking habits <strong>of</strong> the<br />
mother during pregnancy. The means <strong>and</strong><br />
variances <strong>of</strong> the INDIVIDUAL values in the two<br />
samples <strong>of</strong> birthweights, one for non-smoking<br />
<strong>and</strong> the other for smoking mothers, are in the<br />
following table.<br />
                        Mother non-smoker   Mother smoker<br />
Sample Size (n_i)              100               50<br />
Sample Mean (x̄_i)             3.45             3.30<br />
Sample Variance (s_i²)         0.36             0.32<br />
Investigate the claim that the mean birthweights<br />
are different in the two groups. In this case we<br />
shall suppose the alternative is study driven rather<br />
than data driven.<br />
Step(1): H 0 : μ NS – μ S = 0 (no difference in the<br />
mean birth weight)<br />
Step(2): H A : μ NS – μ S ≠ 0 (there is a difference<br />
in the mean birth weight)<br />
Step (3): Sample gives x̄_NS − x̄_S = 3.45 − 3.30 = 0.15<br />
Standardising gives the test statistic<br />
t = (observed difference of means − null difference) / (estimated standard error of the difference)<br />
= (0.15 − 0) / [s_p √(1/100 + 1/50)]<br />
where s_p² = [(n₁ − 1)s₁² + (n₂ − 1)s₂²] / (n₁ + n₂ − 2)<br />
= [99(0.36) + 49(0.32)] / 148<br />
= 0.3468<br />
so t = 0.15 / [√0.3468 √(1/100 + 1/50)]<br />
= 0.15 / 0.102<br />
= 1.47 (use of pooling optional)<br />
Since the sample is large we can use the st<strong>and</strong>ard<br />
normal z in place <strong>of</strong> t with 148 degrees <strong>of</strong><br />
freedom.<br />
Step (4): In this case (two sided H_A)<br />
p-value = Pr(|z| > 1.47)<br />
= Pr(z > 1.47 or z < –1.47)<br />
[sketch: standard normal curve with both tails beyond ±1.47 shaded]<br />
p-value = 2(0.5 − 0.4292) = 2(0.0708) = 0.1416<br />
There is no evidence that the mean birthweights<br />
for the smoking <strong>and</strong> non-smoking groups are<br />
different.<br />
Note: If the test had been one-sided [H_A: μ_NS – μ_S > 0],<br />
p-value = Pr(z > 1.47) = 0.0708<br />
[sketch: standard normal curve with the upper tail beyond z = 1.47 shaded]<br />
There is again no evidence the non-smoking<br />
group has a greater mean birthweight than the<br />
smoking group.<br />
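The pooled two-sample calculation can be checked in code. This is a sketch in Python (not the course's R-cmdr); the function name is mine, and it uses the normal approximation for the two-sided p-value, which the notes justify because 148 degrees of freedom is large.<br />

```python
import math

def phi(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def two_sample_pooled(x1bar, s1sq, n1, x2bar, s2sq, n2):
    """Pooled test statistic and a two-sided p-value from the
    normal approximation (reasonable for large samples)."""
    sp2 = ((n1 - 1) * s1sq + (n2 - 1) * s2sq) / (n1 + n2 - 2)
    se = math.sqrt(sp2 * (1 / n1 + 1 / n2))
    t = (x1bar - x2bar) / se
    return t, 2 * (1 - phi(abs(t)))

# Birthweights: non-smokers (n=100) vs smokers (n=50)
t, p = two_sample_pooled(3.45, 0.36, 100, 3.30, 0.32, 50)
# t about 1.47, p about 0.14: no evidence of a difference
```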
Notes on Hypothesis Testing<br />
1. There is some terminology for reporting the<br />
result <strong>of</strong> a test.<br />
(a) If the p-value < 0.05 the result is<br />
“significant at α = 0.05 level” (5% level)<br />
or “There is some evidence that …”<br />
(b) If the p-value < 0.01 the result is<br />
“significant at α = 0.01 level” (1% level)<br />
or “There is strong evidence that …”<br />
(c) If p-value > 0.05 the result is “not<br />
significant” or “There is no evidence that<br />
…”<br />
In the above α is generally a pre-selected cut<strong>of</strong>f<br />
value.<br />
2. Choosing a smaller level of significance requires the test statistic to be more extreme before H₀ is rejected.<br />
3. Whether the test is one or two sided<br />
depends on whether the alternative<br />
hypothesis is data based or study based.<br />
4. If H A is one sided, the p-value is the area<br />
in one tail <strong>of</strong> the distribution <strong>of</strong> the<br />
st<strong>and</strong>ardised test statistic.<br />
5. If H A is two-sided, the p-value is the area<br />
in the two tails <strong>of</strong> the distribution <strong>of</strong> the<br />
st<strong>and</strong>ardised test statistic.<br />
6. If using t-table choose column heading 2p<br />
for a two sided alternative hypothesis <strong>and</strong><br />
p for a one sided alternative hypothesis.<br />
7. When a test statistic leads to rejection <strong>of</strong><br />
H 0 , there are two possible explanations<br />
(a) H 0 is true but r<strong>and</strong>om variation has given<br />
an improbable test statistic.<br />
(b) H 0 is not true, <strong>and</strong> the observed statistic is<br />
consistent with H A .<br />
The second alternative (b) is taken, but there is a possible error. The probability of this error is α, the level of significance, which is usually 0.05 or 0.01. α is called the type one error rate (it is the chance of a false conviction in a court of law; i.e. we must operate beyond reasonable doubt, hence choose a small α).<br />
8. In published work a p-value is quoted<br />
beside the study result (indicating whether a<br />
new treatment, say, has an effect) <strong>and</strong> a<br />
confidence interval is reported (giving some<br />
idea <strong>of</strong> the magnitude <strong>of</strong> an effect).<br />
But one problem still remains when reporting<br />
conclusions from a scientific study. It is possible<br />
to obtain a result which is statistically<br />
significant (with a small p-value) yet from a<br />
clinical point <strong>of</strong> view the result is unimportant.<br />
That is, it is not clinically important.<br />
(Ecological importance is an equivalent concept.)<br />
Example: There are two treatments for raising<br />
iron levels in infants, a st<strong>and</strong>ard treatment A <strong>and</strong><br />
a new treatment B.<br />
A mean for treatment B that is 20 units greater<br />
than the mean for treatment A is recognised as a<br />
clinically important improvement which would<br />
lead to widespread introduction <strong>of</strong> treatment B.<br />
An experiment produces the following mean differences, x̄_B − x̄_A, with a 95% confidence<br />
interval. Decide in each case whether the p-value<br />
is less than or greater than 0.05. Report whether<br />
the scientific result is conclusive or inconclusive<br />
by considering clinical importance.<br />
(a) Mean Diff = 40. Confidence Interval is<br />
(33, 47)<br />
The confidence interval does not include the<br />
null hypothesis value so the p-value is less<br />
than 0.05 (a statistically significant result).<br />
The point estimate <strong>of</strong> 40 is in the direction<br />
indicating treatment benefit. The result is<br />
conclusive <strong>and</strong> there is evidence the benefit<br />
is enough to be important.<br />
(b) Mean Diff = 36. Confidence interval is (18, 54)<br />
p-value < 0.05. The result is conclusive.<br />
There is treatment benefit but it may not be<br />
as large as hoped.<br />
(c) Mean Diff = 27. Confidence interval is (–4, 58)<br />
p-value > 0.05 <strong>and</strong> inconclusive result. The<br />
confidence interval includes H 0 . The new<br />
treatment is probably better than treatment<br />
A but we cannot completely rule out the<br />
possibility that it is worse.<br />
(d) Mean Diff = –7. Confidence interval is (–55, 41)<br />
p-value > 0.05 and the result is inconclusive. The<br />
new treatment is likely to be harmful but we<br />
cannot rule out the possibility that there is a<br />
clinically important benefit.<br />
(e) Mean Diff = –12. Confidence Interval =<br />
(–34, 10)<br />
p-value > 0.05 <strong>and</strong> result is conclusive. Any<br />
benefit is not clinically important <strong>and</strong> it is<br />
more likely there will be treatment harm.<br />
Treatment B should not be pursued as a<br />
potential treatment.<br />
(f) Mean Diff = –13. Confidence interval =<br />
(–19, –7)<br />
p-value < 0.05 <strong>and</strong> result very conclusive.<br />
The new treatment is harmful.<br />
(g) Mean Diff = 11. Confidence interval =<br />
(4, 18)<br />
p-value < 0.05. The result is conclusive.<br />
There is treatment benefit but not enough to<br />
lead to the introduction <strong>of</strong> treatment B.<br />
Note: In practice you decide what is clinically<br />
important. This is difficult but as you gain<br />
experience with your own area <strong>of</strong> research it<br />
becomes easier <strong>and</strong> you are able to critique any<br />
published research.<br />
Summary <strong>of</strong> previous results<br />
0 = null value<br />
20 = clinically important improvement.<br />
p-value < 0.05 implies confidence interval<br />
excludes the null value <strong>of</strong> zero.<br />
p-value > 0.05 implies null value included<br />
The result can be conclusive or inconclusive.<br />
(a)–(g): [sketch: the seven confidence intervals drawn on a number line marked with the null value 0 and the clinically important improvement 20]<br />
(a) Conclusive p-value < 0.05<br />
(b) Conclusive p-value < 0.05<br />
(c) Inconclusive p-value > 0.05<br />
(d) Inconclusive p-value > 0.05<br />
(e) Conclusive p-value > 0.05<br />
(f) Conclusive p-value < 0.05<br />
(g) Conclusive p-value < 0.05<br />
Clearly, if the confidence interval is too wide, there is a greater chance of an inconclusive result.<br />
Example<br />
A clinical trial is set up to compare two drugs<br />
(pravastatin, A, <strong>and</strong> a control, B) for lowering<br />
cholesterol. The mean cholesterol reductions in<br />
the two groups are compared. The probability<br />
that such a study will correctly detect a clinically<br />
important difference between the effects <strong>of</strong> the<br />
drugs is called the power <strong>of</strong> the study. Power<br />
depends on the size <strong>of</strong> the difference, the<br />
variability <strong>of</strong> estimates, sample size, <strong>and</strong> the level<br />
<strong>of</strong> significance.<br />
Figure 12.4: 95% confidence intervals for different sample sizes (n = 10, 20, 50 and 200). [sketch: the vertical scale runs from "mean reduction greater in B" through 0 (no difference) to "mean reduction greater in A", with the target treatment difference (clinically important) marked; the intervals narrow as n increases]<br />
If the two samples are <strong>of</strong> size 5 (giving total<br />
n = 10), the three 95% confidence intervals<br />
include zero difference <strong>and</strong> the important<br />
difference. As n increases, the confidence<br />
intervals become smaller <strong>and</strong> it is possible to<br />
detect the difference.<br />
NB 1. It is helpful to aim for a confidence interval whose width (or range) is no greater than the clinically important treatment difference, as in this case the result obtained must be conclusive (rather than inconclusive).<br />
2. If the clinically important effect size is large,<br />
the confidence interval can be wider <strong>and</strong> hence<br />
a smaller sample taken.<br />
3. A larger sample gives a smaller confidence interval.<br />
4. Less random variation in the data gives a smaller confidence interval. (That is, the value of σ is smaller.)<br />
5. A smaller level <strong>of</strong> significance (α), say 0.01,<br />
gives a wider confidence interval <strong>and</strong> hence<br />
smaller power as there is less chance <strong>of</strong><br />
detecting a clinically important effect in a<br />
conclusive way.<br />
Errors in Hypothesis Testing<br />
The level <strong>of</strong> significance (α) is chosen by the<br />
researcher, usually 0.05, <strong>and</strong> is the chance that the<br />
null hypothesis (H 0 ) will be rejected when in<br />
actual fact it is true. It would seem sensible for α<br />
to be made as small as possible. Then the<br />
probability <strong>of</strong> correctly not rejecting H 0 when it is<br />
true will be large. But this is not the real issue in<br />
a scientific study involving hypothesis testing.<br />
The real issue is to have high probability <strong>of</strong><br />
rejecting H 0 when in fact H 0 is false or needing to<br />
be rejected. That is, a high probability that a test<br />
will correctly detect a real treatment effect <strong>of</strong> a<br />
given magnitude. This is known as the power <strong>of</strong><br />
the test, <strong>and</strong> involves detecting clinically<br />
worthwhile improvements as defined by<br />
researchers. Power is related to the level <strong>of</strong><br />
significance. A smaller value for the level <strong>of</strong><br />
significance results in a smaller power. A power<br />
between 80% <strong>and</strong> 90% is desirable.<br />
These ideas have a parallel in the courts of law in this country. To illustrate, suppose we are interested in testing a new treatment to see if it has an effect.

1. The treatment is "arrested".
2. The treatment is charged with having an effect (HA).
3. The treatment is assumed "innocent" (no effect, H0) until the evidence (the data) shows otherwise. The evidence is summarised in the test statistic.
4. The level of significance (α) is the probability that an innocent treatment will be convicted, that is, the probability of a false conviction. This error must be kept small.
5. The power is the probability that a guilty treatment will be convicted. This is the best outcome for a court case as it is a correct conviction. This probability should be large, since then we correctly convict the treatment and conclude there is an important treatment effect. Power should be at least 0.80 or 0.90.
Some computer packages (Minitab is one) have an excellent routine for analysing the power of a study and for showing how power, data variability, sample size, level of significance and clinically important effects are related.

EXAMPLE: The problem is to design a milk feeding trial in 5 year old children to see if a daily supplement of milk for a year leads to an increased gain in height compared with a control group (such a study would be both expensive and difficult for practical and ethical reasons). It is known that at this age children grow 6 cm in a year with a standard deviation of 2 cm (σ). The effect of milk on height gain is important if it results in an extra gain of at least 0.5 cm. We want a high probability of detecting such a difference, so we set the power at 0.9 (90%) and choose a 1% (α = 0.01) significance level.

Known: σ = 2 (data variability)
       α = 0.01 (chosen level of significance)
       Clinically important difference = 0.5 cm
       Target power = 0.90 (90%)
Find:  Sample size.
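Before turning to the Minitab output, the required sample size can be sketched with the standard normal-approximation formula n = 2σ²(z₁₋α/₂ + z_power)²/δ² per group. This is a check, not the course software; the exact t-based calculation that Minitab performs (iterating because the degrees of freedom depend on n) gives an answer one larger.

```python
from math import ceil
from statistics import NormalDist

def n_per_group(sigma, delta, alpha, power):
    """Normal-approximation sample size per group for detecting a mean
    difference delta with a two-sided two-sample test."""
    z = NormalDist()
    za = z.inv_cdf(1 - alpha / 2)   # two-sided critical value
    zb = z.inv_cdf(power)           # quantile for the target power
    return ceil(2 * (sigma / delta) ** 2 * (za + zb) ** 2)

print(n_per_group(sigma=2.0, delta=0.5, alpha=0.01, power=0.90))  # 477
```

The approximation gives 477 per group; Minitab's exact t-based routine reports 478.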
(a) Find the sample size required to meet these conditions (σ = 2.0 cm; clinically important difference = 0.5 cm; power = 0.9; α = 0.01).

Step 1. STAT > POWER AND SAMPLE SIZE > 2-SAMPLE t (i.e. choose an unpaired t-test).
Step 2. Specify a power value of 0.9, a clinically important difference of 0.5 and sigma of 2.0.
Step 3. Choose "Not equal" for a study based on a two sided alternative hypothesis, and a significance level alpha of 0.01.
A printout is as follows:

Power and Sample Size
2-Sample t Test
Testing mean 1 = mean 2 (versus not =)
Calculating power for mean 1 = mean 2 + difference
Alpha = 0.01  Sigma = 2

              Sample   Target   Actual
  Difference    Size    Power    Power
         0.5     478   0.9000   0.9001

There need to be 478 children in each sample, meaning 956 children in total (the printout gives the size of one sample).
[Note: the actual power differs slightly from the target because the sample size is rounded up to a whole number.]
(b) Now consider clinically important differences of 0.5, 0.6, 0.7, 0.8, 0.9, 1.0.

A printout gives:

Power and Sample Size
2-Sample t Test
Testing mean 1 = mean 2 (versus not =)
Calculating power for mean 1 = mean 2 + difference
Alpha = 0.01  Sigma = 2

              Sample   Target   Actual
  Difference    Size    Power    Power
         0.5     478   0.9000   0.9001
         0.6     333   0.9000   0.9007
         0.7     245   0.9000   0.9006
         0.8     188   0.9000   0.9006
         0.9     149   0.9000   0.9009
         1.0     121   0.9000   0.9008

Notice that smaller samples suffice to detect the larger clinically important differences. The necessary total sample size falls from 956 (for a difference of 0.5) to 242 (for a difference of 1.0) [similar to moving from a high resolution microscope to a pocket magnifying glass, which is all that is needed to detect the larger difference].
(c) Halve the value of sigma to 1.0 and repeat the analysis in (b).

Power and Sample Size
2-Sample t Test
Testing mean 1 = mean 2 (versus not =)
Calculating power for mean 1 = mean 2 + difference
Alpha = 0.01  Sigma = 1

              Sample   Target   Actual
  Difference    Size    Power    Power
         0.5     121   0.9000   0.9008
         0.6      85   0.9000   0.9027
         0.7      63   0.9000   0.9032
         0.8      49   0.9000   0.9058
         0.9      39   0.9000   0.9051
         1.0      32   0.9000   0.9060

Notice how greater precision (a smaller standard deviation) in the data reduces the sample size required to achieve the desired power: the total is now only 64 for a difference of 1.0.
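The pattern in (b) and (c) reflects the approximate scaling law n ∝ (σ/δ)²: halving σ, or doubling the clinically important difference δ, cuts the required sample size to roughly a quarter. The ratio is exactly 4 under the normal approximation; the t-based Minitab values drift from it at small n. A quick arithmetic check against the printed sample sizes:

```python
# Ratios predicted by n proportional to (sigma/delta)^2, checked against
# the Minitab per-group sample sizes quoted in parts (b) and (c).
pairs = [
    (478, 121),   # delta 0.5 -> 1.0 at sigma = 2: predicted ratio 4
    (121, 32),    # sigma 2 -> 1 at delta = 0.5:   predicted ratio 4
]
for big, small in pairs:
    # near 4 (3.95 and 3.78); the gap grows at small n, where the
    # t correction built into Minitab's exact calculation matters more
    print(big / small)
```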
(d) A doctor set up a study involving 100 children (50 in each group) and monitored the children for one year. The doctor wanted to detect a clinically important difference of 0.5, knew from historical information that sigma = 2.0, and set up a study based (two sided) test at the α = 0.05 (5%) level of significance. The printout obtained for the doctor after the study was carried out follows.

Power and Sample Size
2-Sample t Test
Testing mean 1 = mean 2 (versus not =)
Calculating power for mean 1 = mean 2 + difference
Alpha = 0.05  Sigma = 2

              Sample
  Difference    Size    Power
         0.5      50   0.2358

The power of this study is only 0.2358. The probability of detecting the clinically important difference of 0.5 is far too small. The study was a waste of effort in the sense that it is unlikely to detect a difference as small as 0.5, even though a difference of this size is important.
If α = 0.01, the power drops to 0.0891.
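The low power reported above can be checked with a normal-approximation sketch; the exact t-based calculation (as Minitab performs) gives the slightly smaller 0.2358 and 0.0891.

```python
from math import sqrt
from statistics import NormalDist

def power_two_sample(n, sigma, delta, alpha=0.05):
    """Approximate power of a two-sided two-sample z test with n
    observations per group (normal approximation to the t test)."""
    z = NormalDist()
    za = z.inv_cdf(1 - alpha / 2)
    lam = delta / (sigma * sqrt(2 / n))   # standardised true difference
    # probability the test statistic falls in either rejection region
    return z.cdf(lam - za) + z.cdf(-lam - za)

print(round(power_two_sample(50, 2.0, 0.5, alpha=0.05), 4))  # ~0.2395
print(round(power_two_sample(50, 2.0, 0.5, alpha=0.01), 4))  # ~0.0925
```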
Revision Examples

1. Exam 2006:
In a study to assess the impact of an industrial development on a nearby river, water temperature was measured. It has been suggested that the mean water temperature is higher in this river than in a similar river 30 km away that is not affected by the development. Daily temperatures in degrees Celsius were taken at midday for a fortnight in February from both rivers. Two readings from the "unaffected" river were spoiled. The data are summarised below:

                          Unaffected   Affected
                               river      river
Sample Size (n_i)                 12         14
Sample Mean (x̄_i)             15.41      16.49
Sample Variance (s_i²)         1.963      2.132

(a) (4 marks) Assuming that temperature has a common variability in both rivers and the values are approximately normal, calculate the pooled estimate for the common variance and an estimate for the standard error of the difference between the two means.
s_p² = [11(1.963) + 13(2.132)] / 24 = 2.055

Pooled variance = 2.055

standard error = √[2.055(1/12 + 1/14)] = 0.564

Estimated standard error = 0.564
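The two calculations above can be checked in a few lines (a sketch of the arithmetic, not part of the exam answer):

```python
from math import sqrt

n1, n2 = 12, 14            # sample sizes (unaffected, affected river)
s1sq, s2sq = 1.963, 2.132  # sample variances

# pooled estimate of the common variance, weighted by degrees of freedom
sp2 = ((n1 - 1) * s1sq + (n2 - 1) * s2sq) / (n1 + n2 - 2)

# standard error of the difference between the two sample means
se = sqrt(sp2 * (1 / n1 + 1 / n2))

print(round(sp2, 3), round(se, 3))  # 2.055 0.564
```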
(b) (2 marks) Using the appropriate value from the t-table, construct the 95% confidence interval for the difference in mean temperature between the affected and unaffected rivers.

1.08 ± t₂₄(0.564) where t₂₄ = 2.064
or 1.08 ± 1.164

Confidence interval: –0.084 < μ_A – μ_U < 2.244
(c) (2 marks) A mean temperature increase of 0.6 degrees Celsius is ecologically important. State your conclusion about the true mean temperature difference from the confidence interval in (b).

Conclusion: Result inconclusive. There is no evidence of a difference in mean temperature, but an important increase cannot be ruled out.

(d) (1 mark) State one way in which you might increase the power of this study.

Statement: Increase the sample size.

(e) (5 marks) A more powerful study is to be set up which has a 95% confidence interval for the difference between the mean river temperatures no greater than 0.6 degrees Celsius wide. Assuming the same number of measurements is taken from each river, and that the pooled estimate for the common variance from (a) is the best estimate for the variability, approximately how many readings should be taken from each river?
Taking 1.96 as the multiplier, the 95% C.I. is

(x̄₂ – x̄₁) ± 1.96 √[2.054(1/n + 1/n)]

But the required precision needs (x̄₂ – x̄₁) ± 0.3, so

1.96 √(2 × 2.054 / n) ≤ 0.3

∴ n ≥ (1.96)²(2.054)(2) / (0.3)²
∴ n ≥ 175.3

Number of readings from each river: 176
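A quick check of this sample-size calculation (using the pooled variance 2.054 carried over from (a), as in the working above):

```python
from math import ceil

sp2 = 2.054        # pooled variance estimate from part (a)
half_width = 0.3   # CI no wider than 0.6, so half-width 0.3
z = 1.96           # 95% normal multiplier, as in the working

# solve 1.96 * sqrt(2 * sp2 / n) <= 0.3 for n, rounding up
n = ceil(z**2 * 2 * sp2 / half_width**2)
print(n)  # 176
```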
(f) (2 marks) The 95% confidence interval from the study in (e) is (0.49, 1.12). What conclusion would you now reach about the true mean temperature difference?

Conclusion: Result conclusive. There is evidence of increased temperatures, but the increase may not be ecologically important.
2. Exam 2005
An ecologist must determine whether a cleanup project at a lake has been effective. This is to be done by recording dissolved oxygen content (in parts per million, ppm) in the lake, with higher values indicating less pollution. Prior to the cleanup project a random sample of 50 dissolved oxygen readings was recorded around the lake. Six months after the initiation of the cleanup a second random sample of 70 readings was recorded. Results are summarised in the following table.

                         Before Cleanup   After Cleanup
Sample Size (n_i)                    50              70
Sample Mean (x̄_i)                10.30           10.46
Sample Variance (s_i²)             0.32            0.36

(a) (1 mark) State null and alternative hypotheses for testing the data driven hypothesis that the cleanup has resulted in an increase in the dissolved oxygen content.

Null hypothesis, H0: μ_BC = μ_AC
Alternative hypothesis, HA: μ_BC < μ_AC
(b) (6 marks) Calculate the pooled estimate for the common variance of the two samples, an estimate for the standard error of the difference between the two means, and a standardised normal z statistic for testing the hypotheses.

s_p² = [49(0.32) + 69(0.36)] / 118 = 0.3434

Pooled variance = 0.3434

estimated standard error = √[0.3434(1/50 + 1/70)] = 0.1085

Standard error = 0.1085

z = (10.46 – 10.30) / 0.1085 = 1.475

Standardised z statistic = 1.475
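The same arithmetic in code, as a check. The one-sided p-value comes out near 0.070; the 0.0694 quoted in (c) reflects rounding z before using normal tables.

```python
from math import sqrt
from statistics import NormalDist

n1, n2 = 50, 70
xbar1, xbar2 = 10.30, 10.46
s1sq, s2sq = 0.32, 0.36

sp2 = ((n1 - 1) * s1sq + (n2 - 1) * s2sq) / (n1 + n2 - 2)  # pooled variance
se = sqrt(sp2 * (1 / n1 + 1 / n2))     # SE of the difference in means
z = (xbar2 - xbar1) / se               # standardised test statistic
p = 1 - NormalDist().cdf(z)            # one-sided p-value (HA: increase)

print(round(sp2, 4), round(se, 4), round(z, 3))  # 0.3434 0.1085 1.475
```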
(c) (2 marks) Find the probability value (p-value) for the z statistic in (b) and state your conclusion from the p-value (using a 5% level of significance).

p-value = 0.5 – 0.4306 = 0.0694

Conclusion: There is no evidence (at the 5% level) that the cleanup has raised the mean dissolved oxygen reading.

(d) (2 marks) Construct the 95% confidence interval for the difference in the dissolved oxygen means between the readings before cleanup and the readings after cleanup.

(10.46 – 10.30) ± t₁₁₈(0.1085) where t₁₁₈ = 1.98 (accept 1.96)
i.e. 0.160 ± 0.215

Confidence interval: –0.055 < μ_AC – μ_BC < 0.375
(e) (1 mark) The power of this study is small. Suggest one way in which you might increase the power of this study.

Answer: Select larger samples.

(f) (3 marks) A more powerful study produced the 95% confidence interval (0.04, 0.27). What conclusions would you reach about the p-value of this study result and the effect of the cleanup project, if an increase of 0.25 in the dissolved oxygen mean is ecologically important?

Conclusion: p-value < 0.05. There is evidence the oxygen mean has increased after the cleanup, but it may not be an important increase (or may not be as great as hoped).

[Question 4: 15 marks]
SECTION 7

One factor analysis of variance, post analysis of variance tests on means, and multiple comparison procedures.
ONE FACTOR ANALYSIS OF VARIANCE

This section of the course returns to the continuous outcome theme.

In the studies of this type considered so far there have been two treatments, usually with a new treatment compared against a control or placebo. In the first half of the semester we answered questions about the effect of the new treatment by using the two sample t-test to find p-values and confidence intervals for the comparison of means. These studies involved an outcome measured on a continuous scale, and the scores under the two treatments were compared.

Regression procedures were then developed which allowed us to introduce potential confounding variables and hence obtain adjusted or modified confidence intervals and different p-values.

We are now going to investigate how to analyse continuous data when there are more than two treatments of interest.
Example: A general surgeon believes that providing pain relief immediately following surgery improves the level of comfort post-surgery. Three pain killing drugs and a placebo are randomly administered to patients immediately following tonsillectomies. The times in hours until onset of pain are as follows. The study is double blind.

Placebo   Drug A   Drug B   Drug C
   1.6      2.6      1.2      3.6
   0.3     12.6      1.7      3.2
   1.1      2.8      0.9      3.4
   0.4      4.5      2.1      3.9
   1.4      5.3      1.3      4.9
   2.4                        4.4
                              3.9

Which drugs, if any, may be better than placebo?

Notice that there are now three comparisons with placebo. We can do better than just making the three comparisons using three unpaired t-tests.
Example: A comparison was made of protein intake among three groups of post-menopausal women: (1) women eating a standard American diet (STD), (2) women eating a lacto-ovo-vegetarian diet (LAC), and (3) women eating a strict vegetarian diet (VEG). It was hypothesized that protein intake was affected by diet. The protein intakes (mg) for 30 women are:

STD   LAC   VEG
 76    62    47
 63    76    75
 84    71    32
 72    61    40
 66    35    52
 83    56    37
 77    44    56
 79    58    35
 72    55    27
 69    49    66

What are the effects of diet on protein intake?

Notice that there are three comparisons which could be of interest.
We now investigate the problem of how to deal with multiple comparisons. The unpaired t test for comparing two sample means will be extended to situations involving more than two samples. As with simple linear regression, the idea is again to partition the total variability of a response or outcome measure into components due to different sources of variation.

Example: The effect of five drug treatments (A to E) on reduction of fever is investigated. Four children are assigned to each treatment and temperature reductions are measured in appropriate units, with high values showing greater reduction. The responses are as follows:

          A     B     C     D     E
          9     7     2     4     4
          8     4     3     8     9
          6     9     4     1     6
          9     6     3     3     3
Total    32    26    12    16    22   108
Mean    8.0   6.5   3.0   4.0   5.5   5.4
One source of variation is due to differences between the effects of the drugs; the other source of variation is the random variation between the individual children within each drug treatment. But which of these is most responsible for explaining the variation in the responses?

The Method

Each response can be divided into three components as follows:

Response = overall effect present in each value
         + a drug treatment (factor) effect
         + random error (or residual effect)

From the estimates for these components we find a number measuring treatment variation and a number measuring residual (random error) variation. These values are compared using an F statistic, as in regression.
Estimation of Components (for reference)

1. Overall mean = 5.4 (this is the estimate for the overall effect, with one degree of freedom).

2. The five treatment effects are estimated as follows:
   A: 8.0 – 5.4 =  2.6
   B: 6.5 – 5.4 =  1.1
   C: 3.0 – 5.4 = –2.4
   D: 4.0 – 5.4 = –1.4
   E: 5.5 – 5.4 =  0.1

   These add to zero (as they are deviations from their mean). There are 5 – 1 = 4 degrees of freedom.

   Note: The responses for A are, on average, 2.6 units above the overall mean, while the responses for D are, on average, 1.4 units below the overall mean.
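A short sketch of this step of the decomposition (plain Python, not part of the notes' hand calculation):

```python
data = {"A": [9, 8, 6, 9], "B": [7, 4, 9, 6], "C": [2, 3, 4, 3],
        "D": [4, 8, 1, 3], "E": [4, 9, 6, 3]}

grand = sum(sum(v) for v in data.values()) / 20   # overall mean, 5.4

# treatment effect = treatment mean minus the overall mean
effects = {k: sum(v) / len(v) - grand for k, v in data.items()}

print({k: round(e, 1) for k, e in effects.items()})
# {'A': 2.6, 'B': 1.1, 'C': -2.4, 'D': -1.4, 'E': 0.1}
print(sum(effects.values()))  # effects sum to zero (up to float error)
```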
3. The residuals (including random error) are estimated by subtracting the overall mean and the treatment effect from each response:

   A: 9 = 5.4 + 2.6 + 1.0
      8 = 5.4 + 2.6 + 0.0
      6 = 5.4 + 2.6 – 2.0
      9 = 5.4 + 2.6 + 1.0
   B: 7 = 5.4 + 1.1 + 0.5
      4 = 5.4 + 1.1 – 2.5
      9 = 5.4 + 1.1 + 2.5
      6 = 5.4 + 1.1 – 0.5
   C: 2 = 5.4 + (–2.4) – 1.0
      3 = 5.4 + (–2.4) + 0.0
      4 = 5.4 + (–2.4) + 1.0
      3 = 5.4 + (–2.4) + 0.0
   D: 4 = 5.4 + (–1.4) + 0.0
      8 = 5.4 + (–1.4) + 4.0
      1 = 5.4 + (–1.4) – 3.0
      3 = 5.4 + (–1.4) – 1.0
   E: 4 = 5.4 + 0.1 – 1.5
      9 = 5.4 + 0.1 + 3.5
      6 = 5.4 + 0.1 + 0.5
      3 = 5.4 + 0.1 – 2.5

   The residuals are the third values on the right.
There are 20 data values altogether and hence 20 degrees of freedom, but 5 degrees of freedom have been used up (1 for the overall mean, 4 for the treatment effects), leaving 15 for the residual effect.

Sums of Squares Computation

Σ(responses²) = 9² + 8² + 6² + 9² + … + 6² + 3²
              = 714 (with 20 DF)

Σ(overall means²) = 5.4² + … + 5.4²
                  = 20(5.4)²
                  = 583.2 (with 1 DF)

Σ(treatment effects²) = 2.6² + … + 2.6² + … + 0.1² + … + 0.1²
                      = 4[(2.6)² + (1.1)² + (–2.4)² + (–1.4)² + (0.1)²]
                      = 62.8 (with 5 – 1 = 4 DF)

Σ(residuals²) = (1.0)² + (0.0)² + (–2.0)² + … + (–2.5)²
              = 68.0 (with 15 DF)

From these, 714 = 583.2 + 62.8 + 68.0.

In general,
Total response Sum of Squares = overall mean SS + treatments SS + residuals (error) SS
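This whole partition, and the F statistic built from it in the table that follows, can be computed directly (plain Python rather than the R-cmdr output the notes use later):

```python
data = {"A": [9, 8, 6, 9], "B": [7, 4, 9, 6], "C": [2, 3, 4, 3],
        "D": [4, 8, 1, 3], "E": [4, 9, 6, 3]}

values = [x for v in data.values() for x in v]
n, k = len(values), len(data)                    # 20 values, 5 treatments
grand = sum(values) / n                          # overall mean 5.4

total_ss = sum(x**2 for x in values)             # 714.0
mean_ss = n * grand**2                           # 583.2
treat_ss = sum(len(v) * (sum(v) / len(v) - grand)**2 for v in data.values())
resid_ss = total_ss - mean_ss - treat_ss         # 68.0 by subtraction

F = (treat_ss / (k - 1)) / (resid_ss / (n - k))  # MS ratio, about 3.46
print(round(treat_ss, 1), round(resid_ss, 1), round(F, 2))
```

Unrounded, F comes out near 3.46; the notes' 3.47 arises from rounding the mean squares to 15.70 and 4.53 before dividing.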
Notes: 1. If there are no treatment differences, the treatment effects will all be close to zero, and hence the treatments SS will be small. But how does this compare with the random variation measured by the residuals?

2. We find the mean (or average) squares (MS) for the treatment and residual effects and compare these with an F statistic. The sums of squares are divided by their degrees of freedom.

The Analysis of Variance (ANOVA) Table

The calculations are summarised in a table similar to those arising in a regression analysis.

Source of             Sum of          Mean
Variation            Squares    DF   Square      F
Overall mean           583.2     1
Treatment effects       62.8     4    15.70   3.47
Residual (error)        68.0   (15)    4.53
Total                  714.0    20

F = 15.70/4.53 = 3.47, comparing the effect of the treatments on the responses with the chance (residual) effect on the responses.

Is this value large enough to be significant? The critical value is found from the F table (5%).
[Extract from the 5% F table: columns give the numerator DF ν₁ = 1, 2, 3, 4, …, 30 and rows the denominator DF ν₂ = 1, …, 15, …, 120; the entry for ν₁ = 4, ν₂ = 15 is 3.056.]

If ν₁ = 4 and ν₂ = 15 then the critical F = 3.056, meaning Pr(F₄,₁₅ > 3.056) = 0.05. Since 3.47 > 3.056 we have significance at the 5% level. This means that the treatment effects outweigh the chance (residual) effect.

Conclusion: There is evidence of a difference between the mean temperature reductions resulting from the five treatments.

Note:
Because the overall mean appears in each data value, it makes no impact on the variability between data values, and the ANOVA table becomes:

Source                  SS    DF      MS      F
Treatment effects     62.8     4   15.70   3.47
Residual (error)      68.0   (15)   4.53
Total (mean deleted) 130.8    19
SYSTEMATIC CALCULATIONS

The calculations for a one factor analysis of variance can be carried out easily using statistical software, or by the following computational method, which is quicker than the previous partitioning approach.

                   A      B      C      D      E
                   9      7      2      4      4
                   8      4      3      8      9
                   6      9      4      1      6
                   9      6      3      3      3
Col Total (C_j)   32     26     12     16     22     108
C_j²            1024    676    144    256    484    2584

The between treatments (or samples) sum of squares is

C₁²/n₁ + C₂²/n₂ + … + C_k²/n_k – (overall mean SS)

where n₁, n₂, etc. are the sample sizes, and k = 5 here.
If n₁ = n₂ = … = n_k = n (say), this becomes

(1/n)[C₁² + C₂² + … + C_k²] – (overall mean SS)

Total SS = 9² + … + 3² = 714.0, as before.
Overall mean SS = 20(108/20)² = 583.2, as before.
Treatment effects SS = (1/4)[1024 + 676 + 144 + 256 + 484] – 583.2
                     = 62.8, as n₁ = n₂ = … = 4

SOURCE                  SS    DF      MS      F
Overall mean         583.2     1
Treatment effects     62.8     4   15.70   3.47*
Residual (error)    (68.0)   (15)   4.53
Total                714.0    20

Brackets indicate numbers found by subtraction. If the effect of the overall mean is again deleted, the reduced table is produced:

SOURCE                  SS    DF      MS      F
Treatment effects     62.8     4   15.70   3.47*
Residual (error)    (68.0)   (15)   4.53
Total                130.8    19
A Note on the Residual Mean Square (s_p² or s_e²)

The four treatment A residuals were 1.0, 0.0, –2.0, 1.0. These are the values 9, 8, 6, 9 with the A mean of 8 subtracted, i.e. they are of the form x_Ai – x̄_A. An estimate of the variance for treatment A is therefore

s_A² = Σ(x_Ai – x̄_A)² / (n_A – 1)
     = (1.0² + 0.0² + [–2.0]² + 1.0²) / 3

For the other four treatments the variance estimates are

s_B² = Σ(x_Bi – x̄_B)² / (n_B – 1)
⋮
s_E² = Σ(x_Ei – x̄_E)² / (n_E – 1)

where in this case n_A = n_B = n_C = n_D = n_E = 4.

If it is assumed that the variance is the same in all five treatments, then the common or pooled variance estimate is

s_p² = (1/5)[s_A² + s_B² + s_C² + s_D² + s_E²]
     = (1/5)[(1/3)Σ(x_Ai – x̄_A)² + … + (1/3)Σ(x_Ei – x̄_E)²]
     = (1/15)[Σ(x_Ai – x̄_A)² + … + Σ(x_Ei – x̄_E)²]
     = (1/15)[1.0² + 0.0² + (–2.0)² + 1.0² + …]
     = Residual SS / Residual DF
     = Residual Mean Square (s_e²)

The residual mean square is just the pooled variance estimate for all five samples. (It is a direct extension of the pooled variance estimate in an unpaired t test.)

Notes: (1) For the F test to be valid, the variances in all the samples compared (here 5) should be approximately equal.
(2) The square root of the residual mean square, s_e, is the standard deviation of the residuals.
(3) In the R-cmdr printout for such an analysis the overall mean effect is deleted from the ANOVA table (as in the equivalent regression printout). The important section of the table remains:

SOURCE                     SS   DF      MS      F
Treatment effects        62.8    4   15.70   3.47*
Residual (error) effect  68.0   15    4.53
Total (less mean)       130.8   19
Example: 20 children were allocated randomly to four equal groups and subjected to different treatments. After 3 months, progress was measured by a test, with the responses below (one child in group 3 died). Test for treatment mean differences.

             TREATMENT
         1      2      3      4
         4     31     30     19
        12     49     41     66
        44     22     13     65
         9     56     26     46
        17     19            89
C_j     86    177    110    285     658
C_j²  7396  31329  12100  81225
Total SS = 4² + 12² + … + 89² = 32214
Overall mean SS = 19(658/19)² = 22787.58
Total SS (less mean SS) = 9426.42

Treatment effect SS = 7396/5 + 31329/5 + 12100/4 + 81225/5 – 22787.58
                    = 4227.43

The ANOVA table becomes

SOURCE                    SS    DF        MS       F
Treatment effect     4227.43     3   1409.14   4.066
Error (residual)   (5198.99)   (15)   346.60
Total (less mean)    9426.42    18

The critical value at the 5% level of significance is 3.287 < 4.066 (using 3 and 15 DF).

Conclusion: There is some evidence that the mean outcomes under the four treatments differ.
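The shortcut formula handles unequal group sizes naturally, and is easy to script; this sketch reproduces the table's key numbers (differences in the last decimal place are rounding in the notes):

```python
groups = [[4, 12, 44, 9, 17], [31, 49, 22, 56, 19],
          [30, 41, 13, 26], [19, 66, 65, 46, 89]]

values = [x for g in groups for x in g]
N, k = len(values), len(groups)                # 19 children, 4 groups

total_ss = sum(x**2 for x in values)           # 32214
mean_ss = sum(values)**2 / N                   # 658^2 / 19 = 22787.58
treat_ss = sum(sum(g)**2 / len(g) for g in groups) - mean_ss
resid_ss = total_ss - mean_ss - treat_ss       # found by subtraction

F = (treat_ss / (k - 1)) / (resid_ss / (N - k))
print(round(treat_ss, 2), round(F, 3))  # about 4227.42 and 4.066
```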
POST ANALYSIS OF VARIANCE RESULTS

For further interpretation it is important to set up confidence intervals for individual sample means, or for differences between pairs of sample means. The useful new development here is that the residual mean square is an excellent estimate of the data variance, meaning there is no need additionally to calculate the usual pooled variance estimate for each pair of samples. The advantage of using the residual mean square is that it involves all the data, not just the data in the individual samples.

Example: Set up a 95% confidence interval for the mean of treatment 2.

Solution: Here,

x̄₂ = 177/5 = 35.4

Estimated standard error = √(s_e²/n) = √(346.60/5) = 8.33

which has 15 degrees of freedom (the same as the residual).
The 95% C.I. is 35.4 ± t₁₅(8.33), where t₁₅ = 2.132.
That is, 35.4 ± 17.76, or 17.64 < μ₂ < 53.16.

N.B. (1) We use 15 DF rather than the 5 – 1 = 4 DF of the single second sample, and hence gain precision, as t₁₅ < t₄ (note t₄ = 2.776).
(2) R-cmdr gives confidence intervals for these treatment means automatically.
(3) As we have seen, use of the residual mean square requires the variances to be equal in each sample.

Example: Compare the mean scores for treatments 3 and 4 by setting up a 95% C.I. for the difference.
Solution: x̄₃ = 110/4 = 27.5, x̄₄ = 285/5 = 57.0

Estimated standard error of the difference
= √[s_p²(1/n₃ + 1/n₄)]
= √[s_e²(1/4 + 1/5)]
= √[346.60(1/4 + 1/5)]
= 12.49

with 15 DF again, rather than the n₃ + n₄ – 2 = 7 DF of the usual unpaired t-test.

The 95% C.I. for μ₄ – μ₃ is
(57.0 – 27.5) ± t₁₅(12.49), where t₁₅ = 2.132
That is, 29.5 ± 26.63, or 2.87 < μ₄ – μ₃ < 56.13.

Since zero is excluded, there is evidence that treatment 4 has a higher average score than treatment 3.
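Both post-ANOVA intervals can be checked in a few lines. The t quantile 2.132 is hard-coded from the t-table, since the Python standard library has no inverse t distribution; tiny differences from the worked answers come from rounding the standard errors.

```python
from math import sqrt

ms_resid = 346.60   # residual mean square (15 DF) from the ANOVA table
t15 = 2.132         # t-table value for 15 DF, 95% two-sided

# 95% CI for the treatment 2 mean
mean2, n2 = 177 / 5, 5
half2 = t15 * sqrt(ms_resid / n2)
print(round(mean2 - half2, 2), round(mean2 + half2, 2))  # about (17.65, 53.15)

# 95% CI for the difference between the treatment 4 and treatment 3 means
mean3, n3 = 110 / 4, 4
mean4, n4 = 285 / 5, 5
half_d = t15 * sqrt(ms_resid * (1 / n3 + 1 / n4))
lo, hi = mean4 - mean3 - half_d, mean4 - mean3 + half_d
print(round(lo, 2), round(hi, 2))  # about (2.87, 56.13); zero excluded
```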
A NOTE ON ASSUMPTIONS IN ANOVA

Residuals and residual plots can be used to check the required assumptions. As in a regression analysis, the residuals should be
(i) normally distributed,
(ii) randomly distributed about 0, and
(iii) of similar variation within each of the samples chosen.

The following graph shows the variability within each of the drugs in the temperature reduction (fever) data. There could be some concern about unequal variation within the five treatments (but the samples are very small in this case, so this is not too surprising).

The next two residual plots confirm that the variation is similar for each drug treatment and that the residuals are close to being normally distributed.
SECTION 8

This section covers the analysis of count data, including the chi-square test for contingency and the chi-square test for trend, as well as relative risks, attributable risks and odds ratios along with their confidence intervals. The analysis of a three way table and Simpson's paradox are investigated as a way of introducing the concept of a confounding variable in the lead up to regression analyses.

Categorical Data Examples
Relative Risk and its Confidence Interval
Attributable Risk and its Confidence Interval
Odds Ratio and its Confidence Interval
Chi-square Test for Contingency
Chi-square Test for Trend
Interpretation of Confidence Intervals
Simpson's Paradox and Confounder Control
Analysis <strong>of</strong> categorical data<br />
Categorical Data arise when individuals or<br />
experimental units are classified into one <strong>of</strong> two<br />
or more mutually exclusive groups. For example,<br />
• binary e.g. sex (M/F); dead/alive;<br />
diseased/disease free;<br />
treatment/placebo; smoker (yes/no)<br />
Tuatara present/absent<br />
herpes present/absent<br />
melanoma present/absent<br />
• nominal e.g. ethnicity<br />
• ordinal e.g. disease severity; socio economic<br />
status; smoking (never/ex/current)<br />
In a sample <strong>of</strong> units, the number falling into a<br />
particular group is the frequency. The analysis <strong>of</strong><br />
such data is sometimes referred to as the analysis<br />
<strong>of</strong> frequencies or counts.<br />
Examples <strong>of</strong> research questions that we shall<br />
look at.<br />
Estimation of one proportion:
Ex 1. What is the prevalence of asthma in a population?
Associations between two factors:
Ex 2. Is a vaccine effective in reducing the risk of catching influenza?
Ex 3. Is there an association between exposure to chlorinated water and dental enamel erosion?
Ex 4. Does infra-red stimulation (IRS) provide effective pain relief in patients with cervical osteoarthritis?
Ex 5. Is there an association between income level and severity of cardiovascular disease in a group of people presenting for treatment?
What tools do we need to answer these types of questions? Recall the research loop:
[Diagram: the research loop. A Sample is drawn from the Underlying Population (selection bias can enter here), information is collected from study participants (information bias), a statistical analysis is carried out (confounding must be dealt with), and inference is made back to the underlying population.]
Possible explanations for an association include
• bias (selection bias is controlled with study design when selecting the people for a study; information bias is systematic error arising from the way information was collected from study participants)
• confounding (must be allowed for)
• chance (or random error)
• a true association
We shall use proportions, relative <strong>and</strong> attributable<br />
risks, odds ratios, confidence intervals <strong>and</strong><br />
probability values.<br />
Example 1: What is the prevalence of asthma in a population?
Population: adult males on a general practice<br />
register.<br />
Study<br />
• r<strong>and</strong>om sample from population, n = 215<br />
• 39 have history <strong>of</strong> asthma<br />
Sample proportion p = 39/215 = 0.18<br />
Standard error of proportion = √[0.18(1 − 0.18)/215] = 0.026
95% confidence interval for the true proportion<br />
(0.13, 0.24)<br />
Conclusion<br />
We can be 95% sure that the true prevalence <strong>of</strong><br />
asthma among men attending this general practice<br />
is between 13% <strong>and</strong> 24%.<br />
Confidence intervals for very small proportions
• If the number of events is small, the distribution of sample proportions is not normal and the interval could include negative values.
• An ‘exact’ method based on the binomial distribution must be used instead.
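The prevalence calculation above can be sketched in a few lines of Python. This is an illustrative sketch, not part of the original notes; the variable names are my own, and keeping the unrounded proportion gives an upper limit of 0.23 rather than the 0.24 quoted.

```python
import math

n, cases = 215, 39
p = cases / n                      # sample proportion with asthma history
se = math.sqrt(p * (1 - p) / n)    # standard error of the proportion
lower = p - 1.96 * se              # normal-approximation 95% CI
upper = p + 1.96 * se
print(round(p, 2), round(se, 3))   # 0.18 0.026
print(round(lower, 2), round(upper, 2))
```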
Evaluating associations in 2 × 2 tables
Example 2: Is a vaccine effective in reducing the risk of catching influenza?
Study
169 people were randomly allocated to receive a flu vaccine or a placebo. At the end of winter they were asked if they had contracted flu.
Flu’ No Flu’ Total<br />
Vaccine 9 75 84<br />
Placebo 22 63 85<br />
Total 31 138 169<br />
This is what is called a prospective cohort study: the cohort of people is followed into the future. Such studies can be expensive as they may be of long duration. Also, if a disease is rare (say a cancer), many participants will be needed. The Dunedin Multidisciplinary Study is one of these. Recall the example on circumcision and sexually transmitted disease.
Example 3: Is there an association between exposure to chlorinated water and dental enamel erosion?
Study
Of 49 swimmers with enamel erosion (the cases), 32 reported swimming 6 or more hours per week, compared with 118 of 245 swimmers without enamel erosion (the controls).
Swim time Erosion <strong>of</strong> enamel Total<br />
per week Yes No<br />
(Cases) (Controls)<br />
≥ 6 hrs 32 118 150<br />
< 6 hrs 17 127 144<br />
Total 49 245 294<br />
This is what is called a retrospective case-control study. The advantage is that such a study is relatively quick and smaller than a cohort study, particularly for rare diseases. But there is greater potential for bias, as there may be inaccurate recall.
The analysis <strong>of</strong> this 2 × 2 table is not the same as<br />
the analysis in the 2 × 2 table in the previous<br />
cohort study. (We shall see that odds ratio rather<br />
than relative risk must be used.)<br />
Both these data summaries are in the form of a 2 × 2 table. Usually there is an exposure (or predictor) category and an outcome (or response) category.
Outcome (disease)<br />
Exposed Present Absent Total<br />
Yes a b a + b<br />
No c d c + d<br />
Total a + c b + d n<br />
We know how to summarize data from tables like<br />
these<br />
• the choice <strong>of</strong> measure depends on the study<br />
design<br />
• options include relative risk, attributable risk<br />
(difference in proportions), odds ratio<br />
The tools needed for statistical inference are<br />
• confidence intervals for relative risks<br />
attributable risks <strong>and</strong> odds ratios<br />
• hypothesis tests (p-values) for these<br />
associations<br />
Prospective Studies<br />
• groups are followed up to see if an outcome<br />
<strong>of</strong> interest occurs<br />
• the proportions in each group who develop<br />
the outcome are found (these are <strong>of</strong>ten called<br />
the incidence which defines numbers <strong>of</strong> new<br />
cases <strong>of</strong> a disease)<br />
• the ratio <strong>of</strong> these proportions is the relative<br />
risk<br />
• the difference in these proportions is the<br />
attributable risk<br />
General form <strong>of</strong> 2 × 2 table:<br />
Outcome (disease)<br />
Exposed Present Absent Total<br />
Yes a b a + b<br />
No c d c + d<br />
Total a + c b + d n<br />
Relative risk, RR = [a/(a + b)] ÷ [c/(c + d)]
Attributable risk, AR = a/(a + b) − c/(c + d)
Example 2: Is a vaccine effective in reducing the risk of catching influenza?
Flu’ No Flu’ Total<br />
Vaccine 9 75 84<br />
Placebo 22 63 85<br />
Total 31 138 169<br />
Risk in vaccine group = 9/84<br />
Risk in placebo group = 22/85<br />
Relative risk, RR = (9/84) ÷ (22/85) = 0.41
Those who were vaccinated were about 0.4 times as likely to develop the flu as those who were not vaccinated. So flu vaccine was associated with a 60% reduction in risk of flu.
Notes:<br />
• if a RR = 1.00, then rates are equal <strong>and</strong> there<br />
is no association between flu’ <strong>and</strong> vaccine<br />
• the convention is to calculate the relative risk<br />
this way round so that a ‘protective’ exposure<br />
gives a relative risk less than 1.<br />
Confidence interval for relative risk<br />
One method for finding confidence intervals for<br />
RR is as follows:<br />
The sampling distribution for ln(RR) is<br />
approximately normal with st<strong>and</strong>ard deviation (or<br />
st<strong>and</strong>ard error) given by<br />
s.e.[ln(RR)] = √[1/a − 1/(a + b) + 1/c − 1/(c + d)]
Then the 95% confidence interval for ln(RR) is<br />
ln(RR) ± 1.96 s.e.[ln(RR)]<br />
For example,
s.e.[ln(RR)] = √[1/9 − 1/84 + 1/22 − 1/85] = 0.364
Now RR = 0.414, giving ln(RR) = –0.882<br />
The confidence interval (95%) becomes<br />
–0.882 ± 1.96 (0.364)<br />
i.e. –0.882 ± 0.714<br />
Therefore, –1.596 < ln(RR) < –0.168<br />
Taking exponentials, 0.20 < RR < 0.85<br />
So the 95% confidence interval for the true<br />
relative risk is (0.20, 0.85)<br />
Since 1 is not contained in this confidence interval, we conclude that there is evidence of an association between vaccine use and a reduced risk of contracting flu.
Note:<br />
• this method will give a correct CI only if the<br />
numbers in each cell are not too small<br />
• in order to complete our evaluation <strong>of</strong> the<br />
effectiveness <strong>of</strong> the vaccine we need to also<br />
consider possible sources <strong>of</strong> bias <strong>and</strong><br />
confounding<br />
• regression procedures allow us to take<br />
account <strong>of</strong> confounding effects (see later).<br />
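The relative risk calculation and its log-scale confidence interval can be sketched as follows. This is an illustrative Python sketch under the formulas above, not part of the original notes; variable names are my own.

```python
import math

# 2x2 table from the flu vaccine trial: rows = vaccine / placebo
a, b = 9, 75    # vaccine: flu / no flu
c, d = 22, 63   # placebo: flu / no flu

rr = (a / (a + b)) / (c / (c + d))                    # relative risk
se = math.sqrt(1/a - 1/(a + b) + 1/c - 1/(c + d))     # s.e. of ln(RR)
ci = (math.exp(math.log(rr) - 1.96 * se),
      math.exp(math.log(rr) + 1.96 * se))
print(round(rr, 2))                       # 0.41
print(round(ci[0], 2), round(ci[1], 2))   # 0.2 0.85
```

Since the interval excludes 1, it agrees with the conclusion in the notes that the vaccine is associated with a reduced risk of flu.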
Confidence interval for attributable risk
Once we have determined the treatment is effective, we may also wish to consider how many cases of flu the vaccine is likely to prevent:
Attributable risk:
22/85 − 9/84 = 0.26 − 0.11 = 0.15
Use the normal approximation to get a confidence<br />
interval for this difference in proportions. The<br />
estimated st<strong>and</strong>ard error for the difference<br />
between the proportions is<br />
√[p1(1 − p1)/n1 + p2(1 − p2)/n2] = √[0.26(0.74)/85 + 0.11(0.89)/84] = 0.059
and the 95% confidence interval for the attributable risk (risk difference) is
0.15 ± 1.96(0.059)
giving (0.04, 0.27)
So, assuming the treatment is effective, in every 100 people vaccinated there will be between 4 and 27 fewer cases of flu than if they had not been vaccinated (i.e. vaccination prevents between 4 and 27 cases of flu in every 100 people).
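The risk-difference interval can be checked with a short sketch (illustrative only, not part of the original notes; note that keeping unrounded proportions gives a standard error of 0.058 rather than the 0.059 obtained from the rounded values 0.26 and 0.11).

```python
import math

p1, n1 = 22/85, 85   # risk in placebo group
p2, n2 = 9/84, 84    # risk in vaccine group

ar = p1 - p2                                        # attributable risk
se = math.sqrt(p1*(1 - p1)/n1 + p2*(1 - p2)/n2)     # s.e. of the difference
ci = (ar - 1.96 * se, ar + 1.96 * se)
print(round(ar, 2))                       # 0.15
print(round(ci[0], 2), round(ci[1], 2))   # 0.04 0.27
```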
Case control studies
• a group of individuals with a disease (called the cases) is compared to a control group who do not have the disease. In such a study we choose the number of people with the disease and the number without.
General form <strong>of</strong> 2 × 2 table<br />
Outcome (disease)<br />
Exposed Present Absent Total<br />
Yes a b a + b<br />
No c d c + d<br />
Total a + c b + d n<br />
The measure <strong>of</strong> association used in case-control<br />
studies is the odds ratio, not the relative risk<br />
• In terms of probabilities, the odds of an event A is defined as Pr(A)/Pr(Ā) = Pr(A)/[1 − Pr(A)]. With the notation in the table above, in the exposed group the odds of disease present equals
[a/(a + b)] ÷ [b/(a + b)], which simplifies to a/b.
For the unexposed group, odds = c/d.
Example: Is there an association between exposure to chlorinated water and dental enamel erosion?
Study<br />
Of 49 swimmers with enamel erosion (the cases)<br />
32 reported swimming 6 or more hours per week<br />
compared with 118 <strong>of</strong> 245 swimmers without<br />
enamel erosion (the controls).<br />
Swim time Erosion <strong>of</strong> enamel Total<br />
per week Yes No<br />
(Cases) (Controls)<br />
≥ 6 hrs 32 118 150<br />
< 6 hrs 17 127 144<br />
Total 49 245 294<br />
For ≥ 6 hrs, odds = a/b = 32/118
For < 6 hrs, odds = c/d = 17/127
The odds ratio, OR = (a/b) ÷ (c/d) = (32/118) ÷ (17/127) = 2.026 (≈ 2.0)
Note 1: why we use the odds ratio<br />
Compare the numbers in the previous table to a<br />
study which is identical except that we chose to<br />
have only 49 controls:<br />
Swim time Erosion <strong>of</strong> enamel Total<br />
per week Yes No<br />
(Cases) (Controls)<br />
≥ 6 hrs 32 24 56<br />
< 6 hrs 17 25 42<br />
Total 49 49 98<br />
The values 24 and 25 give the same proportions with slight rounding.
Odds ratio = (32/24) ÷ (17/25) = 2.0 with rounding, which is the same as the previous result.
But now suppose we were to try <strong>and</strong> calculate the<br />
relative risk in both cases:<br />
                    ‘Risk’    ‘RR’
Study 1   ≥ 6 hrs   32/150
          < 6 hrs   17/144    1.81
Study 2   ≥ 6 hrs   32/56
          < 6 hrs   17/42     1.41
Notice that there is disagreement. The<br />
consequence is that the relative risk can be made<br />
to take any value by choice <strong>of</strong> numbers <strong>of</strong> cases<br />
<strong>and</strong> controls. This is unacceptable.<br />
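The point can be demonstrated numerically: the odds ratio is essentially unchanged by the choice of how many controls to sample, while a naively computed ‘relative risk’ shifts. This is an illustrative Python sketch (not part of the original notes); the dictionary layout and function names are my own.

```python
# Two versions of the enamel-erosion table: full controls vs reduced controls
study1 = {"a": 32, "b": 118, "c": 17, "d": 127}   # 245 controls
study2 = {"a": 32, "b": 24,  "c": 17, "d": 25}    # only 49 controls

def odds_ratio(t):
    return (t["a"] / t["b"]) / (t["c"] / t["d"])

def naive_rr(t):
    # 'relative risk' from row totals: NOT valid in a case-control design
    return (t["a"] / (t["a"] + t["b"])) / (t["c"] / (t["c"] + t["d"]))

print(round(odds_ratio(study1), 1), round(odds_ratio(study2), 1))  # 2.0 2.0
print(round(naive_rr(study1), 2), round(naive_rr(study2), 2))      # 1.81 1.41
```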
Note 2: When are the odds ratio and relative risk close?
Consider a retrospective case-control study:<br />
If disease (the outcome <strong>of</strong> interest) is rare,<br />
then a <strong>and</strong> c will be small in the table.<br />
Disease No Disease<br />
Exposed (Case) (Control) Total<br />
Yes a b a + b<br />
No c d c + d<br />
so
a/(a + b) ≈ a/b and c/(c + d) ≈ c/d
Then relative risk = [a/(a + b)] ÷ [c/(c + d)] ≈ (a/b) ÷ (c/d)
Thus, in a case-control study investigating a rare disease the odds ratio gives a good estimate of the true, otherwise unestimable, relative risk.
Confidence interval for odds ratio
In repeated sampling, values of ln(OR) are approximately normal with standard deviation (or standard error) given by
s.e.[ln(OR)] = √[1/a + 1/b + 1/c + 1/d]
The 95% confidence interval for ln(OR) is
ln(OR) ± 1.96 s.e.[ln(OR)]
For the example,
s.e.[ln(OR)] = √[1/32 + 1/118 + 1/17 + 1/127] = 0.326
and ln(OR) = ln(2.026) = 0.706
The confidence interval becomes<br />
0.706 ± 1.96 (0.326)<br />
i.e. 0.706 ± 0.639<br />
Therefore, 0.067 < ln(OR) < 1.345<br />
∴ e^0.067 < OR < e^1.345
∴ 1.069 < OR < 3.838
We conclude the odds of erosion in dental enamel are raised among those swimming 6 or more hours per week. We would reject the null hypothesis as the p-value < 0.05.
Note: An odds ratio simply measures whether an association is present between outcome and exposure. With a relative risk we are interested in whether treatment improves outcome status. A protective exposure gives a relative risk less than 1.
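The odds ratio and its confidence interval can be sketched as follows (an illustrative sketch following the formulas above, not part of the original notes; variable names are my own):

```python
import math

a, b, c, d = 32, 118, 17, 127   # enamel erosion case-control table

or_hat = (a / b) / (c / d)                    # odds ratio
se = math.sqrt(1/a + 1/b + 1/c + 1/d)         # s.e. of ln(OR)
ci = (math.exp(math.log(or_hat) - 1.96 * se),
      math.exp(math.log(or_hat) + 1.96 * se))
print(round(or_hat, 2), round(se, 3))     # 2.03 0.326
print(round(ci[0], 2), round(ci[1], 2))   # 1.07 3.84
```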
Chi Square Test for Contingency Tables<br />
The above examples (2 × 2 tables) are very<br />
common in health research <strong>and</strong> other areas.<br />
However, we may want:<br />
• p-values to formally test for an association<br />
• to answer questions relating to larger<br />
contingency tables.<br />
Note:<br />
• as long as one <strong>of</strong> the variables is binary we<br />
can think <strong>of</strong> comparing proportions <strong>and</strong><br />
calculate RRs or ORs<br />
• if both variables have more than 2 categories<br />
the analysis is more complex<br />
Example 4
Does infra-red stimulation (IRS) provide effective pain relief in patients with cervical osteoarthritis?
A r<strong>and</strong>omised controlled trial was carried out<br />
with 100 patients: 20 were r<strong>and</strong>omly allocated to<br />
a double dose <strong>and</strong> 40 each to a single dose <strong>and</strong><br />
control (placebo) treatment. The patients were<br />
classified according to improvement levels over a<br />
period <strong>of</strong> one week as follows:<br />
(hypothetical data)
                      Pain score
IRS             Improve  No change  Worse  Total
Double dose        10        5        5    20 = r1
Single dose        15       20        5    40 = r2
Control             5       20       15    40 = r3
Total           30 = c1  45 = c2  25 = c3  100 = n
• we can look at the percentage improved, no<br />
better <strong>and</strong> worse for each treatment category<br />
We wish to know whether the data indicate that either IRS does provide effective pain relief (and in what dose) or it is no better than the control.
Calculating a p-value for the following<br />
hypotheses will tell us whether there is evidence<br />
that IRS is effective, or whether the differences<br />
we have observed between the treatment groups<br />
are consistent with r<strong>and</strong>om variation.<br />
Hypotheses:<br />
H 0 : The response <strong>and</strong> the type <strong>of</strong> treatment are<br />
independent (i.e. no association)<br />
H A : response <strong>and</strong> type <strong>of</strong> treatment are not<br />
independent (i.e. are associated in some way<br />
or one <strong>of</strong> the responses may occur more <strong>of</strong>ten<br />
with one <strong>of</strong> the treatments)<br />
If there were no association between treatment <strong>and</strong><br />
outcome (H 0 ), I would expect to have the same<br />
fraction <strong>of</strong> improved responses using the three<br />
treatments <strong>and</strong> this fraction should be<br />
c 1 /n = 30/100 (i.e. 30 <strong>of</strong> the 100 patients show<br />
improvement).<br />
Suppose E 11 , E 21 <strong>and</strong> E 31 are the numbers <strong>of</strong><br />
improvements expected if RESPONSE <strong>and</strong><br />
TREATMENT are independent. Then<br />
30/100 = E11/20 = E21/40 = E31/40
∴ E11 = 20(30)/100 = 6
E21 = 40(30)/100 = 12
E31 = 40(30)/100 = 12
In general,
Eij = ri cj / n
for each “cell” or “class” in the contingency table.
Using this formula, expected numbers can be<br />
calculated for each cell:<br />
RESPONSE<br />
TREATMENT Improve No change Worse Total<br />
Double dose 6 9 [5] 20<br />
Single Dose 12 18 [10] 40<br />
Control [12] [18] [10] 40<br />
Total 30 45 25 100<br />
Each row <strong>and</strong> column total has to be met by the<br />
entries in the table <strong>and</strong> for this reason the numbers<br />
in brackets can be found by subtraction.<br />
The observed frequencies (the data counts) are now<br />
compared with the expected counts calculated<br />
under H 0 .<br />
If H 0 is true, then the expected counts will agree<br />
closely with those observed. [But how closely must they agree?]
This is answered by calculating the chi-square (χ 2 )<br />
statistic<br />
χ² = Σ over all cells (Observed − Expected)² / Expected
i.e.
χ² = Σ over all cells (i, j) (Oij − Eij)² / Eij
Observed Counts (O ij )<br />
Treatment Response<br />
1 2 3<br />
1 10 5 5<br />
2 15 20 5<br />
3 5 20 15<br />
Expected Counts (E ij ) [Under H 0 : independent]<br />
Treatment<br />
Response<br />
1 2 3<br />
(Improved) (No change) (Worse)<br />
Double 1 6 9 5<br />
Single 2 12 18 10<br />
Control 3 12 18 10<br />
χ 2 is large if O ij <strong>and</strong> E ij seriously disagree – hence<br />
χ 2 being large will result in H 0 rejection.<br />
Example: For the drug responses,
χ² = (10 − 6)²/6 + (5 − 9)²/9 + (5 − 5)²/5
   + (15 − 12)²/12 + (20 − 18)²/18 + (5 − 10)²/10
   + (5 − 12)²/12 + (20 − 18)²/18 + (15 − 10)²/10
= 14.72 (χ² will always be positive)
In repeated sampling these χ 2 values are distributed<br />
as a chi-square distribution which has<br />
υ = (number <strong>of</strong> rows – 1) × (number <strong>of</strong> columns – 1)<br />
degrees <strong>of</strong> freedom.<br />
Here, υ = (3 – 1) × (3 – 1) = 4<br />
which is just the number <strong>of</strong> values that can be<br />
freely inserted in the table!! (the remaining values<br />
are fixed if the row <strong>and</strong> column totals are to be<br />
met.)<br />
The critical χ 2 value is found from the table at the<br />
end <strong>of</strong> the notes.<br />
[Sketch: the χ² density curve with υ degrees of freedom, the critical value cutting off an upper-tail area α (the level of significance).]
From the χ² table (υ = 4 row): the critical value at α = 0.05 is 9.488, and at α = 0.005 it is 14.86.
Since 14.72 > 9.488, the null hypothesis of no association is rejected.
Note: when we do this on the computer we get the exact p-value, p = 0.005
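The whole calculation can be reproduced from the observed table alone. This is an illustrative pure-Python sketch (not part of the original notes; in practice one would use a ready-made function such as scipy.stats.chi2_contingency). The tail probability uses the closed-form chi-square survival function that holds for even degrees of freedom.

```python
import math

observed = [[10, 5, 5],    # double dose: improve / no change / worse
            [15, 20, 5],   # single dose
            [5, 20, 15]]   # control

row = [sum(r) for r in observed]            # row totals r_i
col = [sum(c) for c in zip(*observed)]      # column totals c_j
n = sum(row)

# E_ij = r_i c_j / n, then chi-square = sum (O - E)^2 / E over all cells
chi2 = sum((observed[i][j] - row[i] * col[j] / n) ** 2 / (row[i] * col[j] / n)
           for i in range(3) for j in range(3))
df = (3 - 1) * (3 - 1)

# survival function for even df: exp(-x/2) * sum_{k<df/2} (x/2)^k / k!
half = chi2 / 2
p = math.exp(-half) * sum(half**k / math.factorial(k) for k in range(df // 2))
print(round(chi2, 2), df, round(p, 3))   # 14.72 4 0.005
```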
• the p-value gives the probability <strong>of</strong> observing a<br />
difference this large or larger between what we<br />
observed <strong>and</strong> what is expected under H 0 , if H 0<br />
is true.<br />
• since the p-value is small, it is unlikely we<br />
would observe a difference this big just by<br />
chance, it is more likely that the null hypothesis<br />
is false.<br />
• there is evidence that the pain levels depend on<br />
the treatment administered.<br />
• closer inspection of the observed frequencies indicates
• more patients improved on double dose than expected
• few patients experienced an improved response on the control
• fewer patients than expected were worse on single dose.
Notes<br />
1. Check the observed counts in order to interpret<br />
a significant association.<br />
2. Maximum power is achieved if there are equal<br />
numbers in each ‘exposure’ group. This is<br />
<strong>of</strong>ten not possible to achieve in observational<br />
studies.<br />
3. This chi-square procedure is unreliable if<br />
counts are small, in particular less than 5.<br />
• For larger contingency tables it is possible<br />
to combine classes in order to raise<br />
frequencies.<br />
• For 2 × 2 tables if expected frequencies<br />
are between 5 <strong>and</strong> 10, a correction called<br />
Yates correction will modify the χ 2<br />
statistic.<br />
• For 2 × 2 tables, if expected frequencies<br />
are less than 5, there is a test called<br />
Fisher’s Exact Test which can be used.<br />
Example 5
Is there an association between income level and severity of cardiovascular disease in a group of people presenting for treatment?
Study<br />
A group <strong>of</strong> people presenting to a hospital with<br />
acute myocardial infarction or unstable angina are<br />
enrolled in a study. Cross-sectional data are<br />
collected at baseline.<br />
                 Income level (Exposure)
Disease level
(Outcome)        1      2      3      4   Total
0              100    107    111    122    440
≥1 (Severe)    115    112    104     97    428
Total          215    219    215    219    868
% ≥1          53.5   51.1   48.4   44.3
RR            1.00   0.96   0.90   0.83
Each RR compares the risk of severe disease at that income level with income level 1, e.g. 0.96 = (112/219) ÷ (115/215).
To test whether or not there is an association between disease severity and income level:
H0: there is no association between disease severity and income (i.e. the proportion with severe disease is the same for all income levels)
HA: there is some association (i.e. the percentage with severe disease varies by income)
Expected frequencies:
                  Income level
Disease level     1        2        3        4    Total
0             108.99   111.01   108.99   111.01    440
≥1            106.01   107.99   106.01   107.99    428
Total            215      219      215      219    868

E11 = 215 × (440/868) = 108.99
E12 = 219 × (440/868) = 111.01
E13 = 215 × (440/868) = 108.99, and so on.
χ² = (100 − 108.99)²/108.99 + (107 − 111.01)²/111.01 + (111 − 108.99)²/108.99 + (122 − 111.01)²/111.01
   + (115 − 106.01)²/106.01 + (112 − 107.99)²/107.99 + (104 − 106.01)²/106.01 + (97 − 107.99)²/107.99
= 4.1
The appropriate sampling distribution is a χ 2 with<br />
3 d.f.<br />
From the χ 2 table<br />
Pr(χ 2 (3 d.f.) > 6.251) = 0.1<br />
so p-value > 0.1<br />
From the computer, p-value = 0.25<br />
Hence the observed differences in proportions we<br />
have seen are <strong>of</strong> the order we might expect to see<br />
by chance. There is no evidence supporting<br />
rejection <strong>of</strong> the null hypothesis.<br />
We conclude that there is no evidence <strong>of</strong> an<br />
association between disease severity <strong>and</strong> income.<br />
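This test can also be sketched in pure Python (illustrative only, not part of the original notes). For the 3 degrees of freedom here the chi-square tail probability has a closed form involving the complementary error function, which reproduces the computer p-value of 0.25.

```python
import math

observed = [[100, 107, 111, 122],   # disease level 0, by income level 1-4
            [115, 112, 104, 97]]    # disease level >= 1 (severe)

row = [sum(r) for r in observed]
col = [sum(c) for c in zip(*observed)]
n = sum(row)

chi2 = sum((observed[i][j] - row[i] * col[j] / n) ** 2 / (row[i] * col[j] / n)
           for i in range(2) for j in range(4))
df = (2 - 1) * (4 - 1)   # = 3

# survival function of chi-square with df = 3
p = math.erfc(math.sqrt(chi2 / 2)) + math.sqrt(2 * chi2 / math.pi) * math.exp(-chi2 / 2)
print(round(chi2, 1), round(p, 2))   # 4.1 0.25
```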
Contingency Tables (Continued)<br />
Tests for trend<br />
Example 5 (continued): Do people with lower incomes tend to present with more severe disease?
The chi-squared test <strong>of</strong> association may not<br />
provide the best answer to this question. It does<br />
not take account <strong>of</strong> the ordering in the income<br />
variable. Specifically, our prior hypothesis is<br />
that the percentage with severe disease decreases<br />
as income increases.<br />
We can test this hypothesis directly using a χ 2<br />
test for trend. The main difference is that this<br />
test has only one degree <strong>of</strong> freedom rather than<br />
the three for the test <strong>of</strong> association.<br />
Note: You will NOT be asked to calculate a test for trend in this course. You may be asked to interpret the p-value or a χ²trend value with one degree of freedom.
This page for reference only
                 Income level (xi)
Disease level    1     2     3     4    Total
0              100   107   111   122     440
≥1 (ri)        115   112   104    97   R = 428
Total (ni)     215   219   215   219   N = 868
ri xi          115   224   312   388   (sum = 1039)
ni xi          215   438   645   876   (sum = 2174)
ni xi²         215   876  1935  3504   (sum = 6530)

p = R/N = 428/868 = 0.49
x̄ = Σ ni xi / N = 2174/868 = 2.505

χ²trend = [Σ ri xi − R x̄]² ÷ { p(1 − p) [Σ ni xi² − N x̄²] }
        = [1039 − 428 × 2.505]² ÷ { 0.49(1 − 0.49) [6530 − 868 × 2.505²] }
        = 4.06
The trend statistic has only 1 degree of freedom.
From the χ² table, Pr(χ²(1 d.f.) > 3.841) = 0.05
Since 4.06 > 3.841, the p-value < 0.05, so we conclude there is evidence that the proportion with severe disease decreases as income increases.
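The reference formula can be sketched in Python (illustrative only, not part of the original notes). Carrying unrounded intermediates gives a statistic of about 4.01 rather than the hand-rounded 4.06; either way it exceeds the 3.841 critical value.

```python
# Chi-square test for trend, following the reference formula above
x = [1, 2, 3, 4]           # income scores
r = [115, 112, 104, 97]    # severe cases per income level
n = [215, 219, 215, 219]   # totals per income level

R, N = sum(r), sum(n)
p = R / N
xbar = sum(ni * xi for ni, xi in zip(n, x)) / N

num = (sum(ri * xi for ri, xi in zip(r, x)) - R * xbar) ** 2
den = p * (1 - p) * (sum(ni * xi * xi for ni, xi in zip(n, x)) - N * xbar ** 2)
chi2_trend = num / den
print(round(chi2_trend, 2))
```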
Overview
• interpretation of confidence intervals for RR and OR
• relationship between confidence intervals, p-values and sample size
Example: (Hypothetical Data)<br />
The following confidence intervals are from a<br />
study into the erosion <strong>of</strong> tooth enamel as a result<br />
<strong>of</strong> exposure to chlorinated water.<br />
They are the ratio <strong>of</strong> odds for those exposed<br />
(swim ≥ 6 hours per week) to those not exposed<br />
(swim < 6 hours per week).<br />
Suppose an odds ratio greater than 1.5 is<br />
considered clinically important.<br />
(a) OR = 1.90 with CI (1.23, 2.92)<br />
• p < 0.05 <strong>and</strong> conclusive.<br />
• 1 is not contained in the CI, so there is<br />
evidence <strong>of</strong> an association between<br />
exposure <strong>and</strong> outcome.<br />
• the CI is above 1 indicating harm.<br />
(Swimming bad for teeth.)<br />
• note we have not ruled out a non-clinically<br />
important association<br />
(b) OR = 1.69 with CI (0.83, 3.45)<br />
• p > 0.05 <strong>and</strong> inconclusive.<br />
• point estimate indicates possible clinically<br />
important association but “protection”!! <strong>of</strong><br />
tooth enamel (rather than “harm”) is also<br />
plausible.<br />
(c) OR = 0.81 with CI (0.39, 1.70)<br />
• p > 0.05, inconclusive.<br />
• conclude no evidence <strong>of</strong> an association<br />
even though CI includes clinically<br />
important effects.<br />
• the point estimate is in the “protection”<br />
range (harm is above 1).<br />
(d) OR = 0.85 with CI (0.53, 1.37)<br />
• p > 0.05, conclusive.<br />
• point estimate in protection range <strong>and</strong> CI<br />
excludes any clinically important harm.<br />
(e) OR = 0.81 with CI (0.67, 0.97)<br />
• p < 0.05 <strong>and</strong> conclusive<br />
• CI excludes 1<br />
• CI entirely less than 1, indicating benefit<br />
from swimming<br />
(f) OR = 1.23 with CI (1.03, 1.48)<br />
• p < 0.05 <strong>and</strong> conclusive<br />
• CI excludes 1<br />
• CI entirely above 1, but excludes the<br />
clinically important difference<br />
• there is evidence <strong>of</strong> an association between<br />
exposure to chlorinated water for more than<br />
6 hours per week but the increased odds are<br />
not clinically important.<br />
(g) OR = 1.15 with CI (0.73, 1.80)<br />
p > 0.05 <strong>and</strong> inconclusive. A clinically<br />
important association is not ruled out.<br />
Advice: Probably continue swimming.<br />
[Figure: the seven odds ratios (a)–(g) plotted with their confidence intervals on a scale from 0 to 3.5, with reference lines at 1 (no association) and 1.5 (clinically important).]
Notice that these confidence intervals are not<br />
symmetric.<br />
A Problem when Contingency Tables are<br />
combined<br />
Example: A <strong>University</strong> has a Law School <strong>and</strong> a<br />
Medical Sciences School with men <strong>and</strong> women<br />
being admitted or declined admission as follows:<br />
Admit Decline Total<br />
Male 490 210 700<br />
Female 280 220 500<br />
Total 770 430 1200<br />
Is there gender bias concerning admission (i.e. is there an association between gender and admission decision)?
Expected frequencies under H0: no association are
           Admit     Decline   Total
Male       449.2    [250.8]     700
Female   [320.8]    [179.2]     500
Total        770        430    1200
where E11 = 700(770)/1200 = 449.2 and the bracketed entries follow by subtraction.
χ² = (490 − 449.2)²/449.2 + … + … + … = 24.82
with υ = 1 degree <strong>of</strong> freedom. Since critical<br />
value at α = 0.01 level <strong>of</strong> significance is 6.635,<br />
there is strong evidence <strong>of</strong> an association.<br />
Inspection <strong>of</strong> the observed frequencies shows a<br />
tendency to admit a higher number <strong>of</strong> men than<br />
expected i.e. O 11 = 490 but E 11 = 449.2. This<br />
means fewer women are admitted than expected<br />
under equal opportunity. The admission patterns<br />
for the two schools are also known as follows:<br />
LAW SCHOOL<br />
Admit Decline Total<br />
M 480 120 600<br />
F 180 20 200<br />
Total 660 140 800<br />
MEDICAL SCIENCES<br />
Admit Decline Total<br />
M 10 90 100<br />
F 100 200 300<br />
Total 110 290 400<br />
The expected frequencies under H 0 are:<br />
LAW: Admit Decline<br />
M 495 105<br />
F 165 35<br />
MEDICAL: Admit Decline<br />
M 27.5 72.5<br />
F 82.5 217.5<br />
For Law School χ 2 = 10.38**<br />
For Medical Sciences School, χ 2 = 20.45**<br />
There is strong evidence <strong>of</strong> an association in both<br />
schools.<br />
HOWEVER, inspection <strong>of</strong> the observed counts<br />
indicates a higher number <strong>of</strong> women than<br />
expected are admitted to both schools.<br />
For LAW, O 21 = 180 with E 21 = 165<br />
For MEDICAL SCIENCES, O 21 = 100 with<br />
E 21 = 82.5<br />
This is the opposite conclusion to that when the<br />
schools are combined. Is there discrimination<br />
against men or women?<br />
This is known as Simpson’s Paradox.<br />
The reason for this discrepancy is that more<br />
women applied to the Medical Sciences school to<br />
which it was more difficult to be admitted. The<br />
final conclusion is therefore unclear.<br />
Notice that there are essentially three factors of<br />
classification here, and we have summed over<br />
one of these factors, namely the “TYPE OF<br />
SCHOOL”.<br />
COMBINED<br />
Admit Decline<br />
Male 490 (449.2) 210 (250.8)<br />
Female 280 (320.8) 220 (179.2)<br />
LAW<br />
Admit Decline<br />
M 480 (495) 120 (105)<br />
F 180 (165) 20 (35)<br />
MEDICAL<br />
Admit Decline<br />
M 10 (27.5) 90 (72.5)<br />
F 100 (82.5) 200 (217.5)<br />
(Expected numbers are in parentheses)<br />
“Variable” 1 = GENDER<br />
“Variable” 2 = ADMISSION DECISION<br />
“Variable” 3 = SCHOOL TYPE<br />
Note how careful we must be with such an<br />
observational study which fails to recognise an<br />
important “variable” (here school type).<br />
This phenomenon can occur whenever we sum<br />
over a classification in categorical data.<br />
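The paradox can be verified directly from the admission rates. A short Python sketch (illustrative only; the numbers are those in the tables above):<br />

```python
# Simpson's paradox: within each school women are admitted at a higher
# rate, yet in the combined table men are.
law     = {"M": (480, 600), "F": (180, 200)}   # (admitted, applicants)
medical = {"M": (10, 100),  "F": (100, 300)}

rates = {}
for sex in ("M", "F"):
    combined_admit = law[sex][0] + medical[sex][0]
    combined_total = law[sex][1] + medical[sex][1]
    rates[sex] = (law[sex][0] / law[sex][1],          # law admission rate
                  medical[sex][0] / medical[sex][1],  # medical admission rate
                  combined_admit / combined_total)    # combined rate

print(rates["M"])   # (0.8, 0.1, 0.7)
print(rates["F"])   # (0.9, 0.33..., 0.56)
```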
REVIEW EXERCISES<br />
1. A r<strong>and</strong>omized double blind study (prospective) was set up to test for an association between<br />
the use <strong>of</strong> aspirin <strong>and</strong> the incidence <strong>of</strong> fatal or nonfatal strokes in a five year period from the<br />
start <strong>of</strong> the study. The results (Journal <strong>of</strong> the American Medical Association, 243: 661-669)<br />
are summarised in the following contingency table:<br />
Stroke No stroke<br />
Placebo 45 2257<br />
Aspirin 29 2238<br />
(b) Calculate and interpret the risk of stroke for people in the placebo group relative to the<br />
aspirin group. Set up a 95% confidence interval for the relative risk. (3 marks)<br />
(c) The use of aspirin was felt to increase the occurrence of gastrointestinal irritation. In<br />
the study, 229 of 2267 patients in the aspirin treatment suffered irritation as opposed to<br />
22 of the 2302 in the placebo treatment. Calculate the relative risk of gastrointestinal<br />
irritation for people in the aspirin group compared with those in the control. Set up a<br />
95% confidence interval for the relative risk and interpret the result. (3 marks)<br />
(d) Calculate the attributable risk for aspirin compared with control. Set up a 95%<br />
confidence interval for the attributable risk and interpret the result. (3 marks)<br />
3. Long-term Mobile Phone Use <strong>and</strong> Brain Tumour Risk.<br />
Lonn et al (2005), American Journal <strong>of</strong> Epidemiology, 161: 526-535<br />
Human exposure to radi<strong>of</strong>requency has increased dramatically during recent years from<br />
widespread use <strong>of</strong> mobile phones. If radi<strong>of</strong>requency radiation has a carcinogenic effect, the<br />
exposure poses an important public health problem, <strong>and</strong> intracranial tumours would be <strong>of</strong><br />
primary interest. H<strong>and</strong>held mobile phones were introduced in Sweden during the late<br />
1980’s. This case-control study was carried out to test the hypothesis that long-term mobile<br />
phone use increases the risk <strong>of</strong> brain tumours.<br />
(a) This was a case-control study. Describe one advantage and one disadvantage of using a<br />
case-control study instead of a cohort study to investigate the association between long-term<br />
use of mobile phones and the risk of brain tumour.<br />
(b) The information is summarised below.<br />
Brain Tumour (Outcome)<br />
Mobile phone use Yes No Total<br />
Never/rarely 155 275 430<br />
Regularly 118 399 517<br />
Total 273 674 947<br />
(i) Calculate the odds ratio for the association between long-term mobile phone use<br />
<strong>and</strong> the risk <strong>of</strong> brain tumour.<br />
(ii) Interpret the odds ratio.<br />
(iii) Calculate the 95% confidence interval for the odds ratio.<br />
(iv) Interpret the confidence interval.<br />
SOLUTIONS<br />
1. (b) Risk (aspirin group) = 29/2267 and risk (placebo group) = 45/2302<br />
Relative risk, RR = (45/2302)/(29/2267) = 1.53<br />
The risk of stroke is 1.53 times greater for those in the placebo group.<br />
Also, s.e.(ln RR) = √( 1/45 − 1/2302 + 1/29 − 1/2267 ) = 0.236<br />
and since ln(RR) = 0.424 the 95% confidence interval is<br />
0.424 ± 1.96(0.236)<br />
or 0.424 ± 0.463<br />
or −0.039 < ln(RR) < 0.887<br />
Therefore 0.96 < RR < 2.43, taking exponentials<br />
(notice that the null value for the relative risk is 1, hence no evidence against the null hypothesis)<br />
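These confidence-interval calculations are easy to script. A minimal Python check of part (b) (illustrative; the course itself uses R-cmdr):<br />

```python
import math

# Relative risk and 95% CI on the log scale, as in part (b).
def rr_ci(a, n1, b, n2, z=1.96):
    """Risk a/n1 relative to risk b/n2, with the CI found via ln(RR)."""
    rr = (a / n1) / (b / n2)
    se = math.sqrt(1/a - 1/n1 + 1/b - 1/n2)   # s.e. of ln(RR)
    lo = math.exp(math.log(rr) - z * se)
    hi = math.exp(math.log(rr) + z * se)
    return rr, lo, hi

rr, lo, hi = rr_ci(45, 2302, 29, 2267)   # placebo risk relative to aspirin
print(round(rr, 2), round(lo, 2), round(hi, 2))   # 1.53 0.96 2.43
```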
(c) Irritation No irritation Total<br />
Placebo 22 2280 2302<br />
Aspirin 229 2038 2267<br />
RR = (229/2267)/(22/2302) = 10.57<br />
ln(RR) = 2.358<br />
s.e.(ln RR) = √( 1/229 − 1/2267 + 1/22 − 1/2302 ) = 0.221<br />
The 95% C.I. for ln(RR) is 2.358 ± 1.96(0.221)<br />
That is, 2.358 ± 0.433<br />
Giving 1.925 < ln(RR) < 2.791<br />
Taking exponentials, 6.86 < RR < 16.30<br />
The null value of equal risk is rejected.<br />
The true relative risk of irritation if aspirin is used is between 6.86 and 16.30.<br />
(d) Attributable risk = 229/2267 − 22/2302 = 0.10101 − 0.00956 = 0.09145<br />
Estimated standard error = √( 0.10101(0.89899)/2267 + 0.00956(0.99044)/2302 ) = 0.00665<br />
The 95% C.I. for attributable risk is 0.09145 ± 1.96(0.00665)<br />
or 0.091 ± 0.013<br />
or 0.078 < AR < 0.104<br />
Between 78 and 104 in every 1000 people have an increased occurrence of gastrointestinal irritation<br />
as a result of using aspirin.<br />
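Part (d) can be checked the same way (a sketch; the CI here is on the risk-difference scale, so no log transform is needed):<br />

```python
import math

# Attributable risk and 95% CI, as in part (d).
p1, p2 = 229/2267, 22/2302          # risk with aspirin, risk with placebo
ar = p1 - p2
se = math.sqrt(p1*(1 - p1)/2267 + p2*(1 - p2)/2302)
lo, hi = ar - 1.96*se, ar + 1.96*se
print(round(ar, 5), round(se, 5))   # 0.09146 0.00665
print(round(lo, 3), round(hi, 3))   # 0.078 0.104
```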
3. (a) Advantage: A case-control study is quicker and cheaper, since information on exposure and disease<br />
status is obtained at the same time. Brain tumours are also rare, so the number of participants needed for a<br />
cohort study would be large.<br />
Disadvantage: The information collected is likely to be affected by recall bias, since the events have already<br />
occurred.<br />
(b) (i) OR = (118/399)/(155/275) = 0.52<br />
(ii) Those who use mobile phones have 0.52 times the odds <strong>of</strong> a brain tumour compared with those<br />
who do not. [Protective effect from using mobile phones – the odds are 48% less for mobile<br />
phone users compared with those who do not use mobile phones.]<br />
(iii) ln(0.52) = −0.654<br />
The 95% C.I. for ln(OR) is<br />
−0.654 ± 1.96 √( 1/155 + 1/275 + 1/118 + 1/399 )<br />
or −0.654 ± 0.284<br />
or −0.938 < ln(OR) < −0.370<br />
Therefore, 0.39 < OR < 0.69<br />
(iv) 95% confident true OR between 0.39 <strong>and</strong> 0.69. The value (1) is excluded hence<br />
chance is an unlikely explanation.<br />
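A Python sketch of 3(b) (illustrative; carrying full precision gives an upper limit of 0.70, while the notes' 0.69 comes from rounding ln(OR) to −0.654 first):<br />

```python
import math

# Odds ratio and 95% CI for the mobile-phone data, as in solution 3(b).
a, b = 118, 399    # regular use:      tumour, no tumour
c, d = 155, 275    # never/rarely use: tumour, no tumour

or_ = (a/b) / (c/d)
se = math.sqrt(1/a + 1/b + 1/c + 1/d)   # s.e. of ln(OR)
lo = math.exp(math.log(or_) - 1.96*se)
hi = math.exp(math.log(or_) + 1.96*se)
print(round(or_, 2))                    # 0.52
print(round(lo, 2), round(hi, 2))       # 0.39 0.7
```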
SECTION 9<br />
This section introduces the topic of Simple Linear Regression, which sets out to fit a straight line<br />
through what is called a scatter diagram. One purpose of this analysis is to establish whether a<br />
predictor variable is influencing the outcomes of a response variable, and to measure the<br />
magnitude of the effect of this predictor variable on the outcome. It is possible to use the fitted<br />
straight line to make predictions.<br />
Simple linear regression is also the first step in controlling for a confounder variable. This occurs<br />
with the extension to multiple regression which will be considered in the next section.<br />
Scatter Diagrams <strong>and</strong> Examples<br />
Equation <strong>of</strong> Fitted Straight Line<br />
Analysis <strong>of</strong> Variance for Regression Model<br />
Confidence Interval for Slope<br />
Confidence Interval for Prediction<br />
Correlation as Measure <strong>of</strong> Linear Association<br />
Review Exercises<br />
325<br />
Section 9
Regression Procedures Introduction<br />
During the semester we have analysed data from<br />
1. studies which have measured outcomes on<br />
continuous scales [e.g. blood pressure; lung<br />
capacity; cholesterol] resulting from different<br />
treatments<br />
2. studies which have measured binary<br />
outcomes, establishing odds ratios <strong>and</strong><br />
relative risks as a result <strong>of</strong> exposure to<br />
certain conditions. [e.g. effect <strong>of</strong> chlorine on<br />
tooth enamel; effect <strong>of</strong> sun exposure on<br />
melanoma]<br />
In both cases there are potentially other variables<br />
which have an effect <strong>and</strong>/or possible confounding<br />
factors other than the treatments or exposures<br />
which influence the outcomes.<br />
We must allow for these confounders otherwise<br />
invalid conclusions will be drawn about the real<br />
effects <strong>of</strong> the treatments or exposures.<br />
Regression methods are used to introduce these<br />
controls. We now develop:<br />
1. Simple linear Regression (now)<br />
• to describe the relationship between two<br />
variables <strong>and</strong> test whether changes in an<br />
outcome measure may be linked to<br />
changes in the other variable.<br />
• to enable the prediction <strong>of</strong> the value <strong>of</strong><br />
the outcome measure from the other<br />
variable.<br />
2. Multiple Regression (later)<br />
• to identify the main factors influencing a<br />
continuous outcome<br />
• to adjust the means <strong>of</strong> outcomes for<br />
confounders or other factors.<br />
3. Logistic Regression (later)<br />
• to identify the main factors influencing<br />
binary outcomes <strong>and</strong> hence odds ratios<br />
<strong>and</strong> relative risks<br />
• to adjust odds ratios for confounding or<br />
other factors.<br />
Show Hans Rosling’s website gapminder.<br />
Example: Blood Alcohol Concentration in<br />
mg/100mL <strong>and</strong> Body Mass in kg for 8 adults after<br />
drinking 12 glasses <strong>of</strong> regular beer.<br />
MASS (kg) BAC (mg/100mL)<br />
55 0.140<br />
85 0.102<br />
69 0.120<br />
65 0.126<br />
80 0.106<br />
90 0.092<br />
67 0.128<br />
73 0.120<br />
[Figure: scatter diagram of BAC (mg/100mL, vertical axis) against MASS (kg, horizontal axis, 50–100); the points fall steadily as mass increases.]<br />
Does BAC drop as Body Mass increases?<br />
Other variables which could be important are:<br />
gender amount eaten alcohol level <strong>of</strong> the beer<br />
Eventually we shall see how to determine which<br />
<strong>of</strong> these may be important.<br />
[Figure: sketch of BAC against MASS with the two genders plotted with different symbols (x and •); one group lies consistently above the other on two roughly parallel downward lines.]<br />
• Women consistently above men<br />
• Lines could be parallel<br />
[Figure: a second sketch of BAC against MASS in which the two lines are not parallel: a large gap between the groups at low body mass closes at high body mass.]<br />
• Lines not parallel. (If low body mass, large<br />
difference, if high body mass there is no<br />
difference.)<br />
Example: Lung function in children as measured<br />
by a lung capacity variable called FEV.<br />
[Figure: scatter diagram of FEV against Age for children aged 3 to 19; FEV rises steadily with age.]<br />
FEV values are increasing as the children grow.<br />
But now see the next two graphs.<br />
[Figure: the FEV–Age scatter diagram with smokers and non-smokers distinguished; the smokers' FEV values sit below the non-smokers'.]<br />
• Once smoking starts, FEV is reduced for the smokers.<br />
[Figure: the FEV–Age scatter diagram again, with separate, non-parallel trend lines labelled Non-smoker and Smoker; the smokers' line begins near age 9 and rises more slowly.]<br />
• This is more accurate, as children may only begin<br />
smoking at about age 9, and the rate of increase is much<br />
smaller for smokers (the non-parallel lower line).<br />
• Multiple regression needed for this analysis.<br />
With a simple linear regression take one variable<br />
as response <strong>and</strong> one variable as a predictor.<br />
The response is plotted on the vertical Y axis.<br />
The predictor is plotted on the horizontal X axis.<br />
Equivalent terms for response <strong>and</strong> predictor:<br />
response = outcome = dependent variable = (Y-variable)<br />
predictor = explanatory variable = covariate = independent variable = (X-variable)<br />
Simple regression deals with the case where the<br />
relationship is approximately a straight line.<br />
Example: The values <strong>of</strong> a response variable (Y)<br />
<strong>and</strong> the values <strong>of</strong> a predictor variable (X) are as<br />
follows<br />
X Y<br />
100 39.7<br />
200 51.1<br />
300 49.9<br />
400 69.8<br />
500 65.2<br />
600 65.1<br />
700 80.7<br />
The scatter diagram below shows the relationship between Y and X.<br />
[Figure: scatter diagram of Y (about 40 to 80) against X (100 to 700); the points rise as X increases.]<br />
Y increases as X increases. The question is<br />
whether this apparent increase in Y is caused by<br />
changing X, or has it been caused by some other<br />
factor, or has it arisen by chance alone?<br />
The values <strong>of</strong> X, the independent variable, are<br />
known exactly (i.e. no error) whereas the values<br />
<strong>of</strong> Y, the dependent variable, have some r<strong>and</strong>om<br />
error associated with them.<br />
The relationship between Y and X could be linear<br />
so we attempt to “fit” a straight line through the<br />
data. This line gives the predicted values ŷ i for<br />
each value x i of X.<br />
[Figure: the fitted straight line drawn through the scatter diagram; at x = 400 the observed value y 4 , the predicted value ŷ 4 on the line, and the difference d 4 between them are marked.]<br />
An attempt is made to minimise the differences<br />
d i = y i − ŷ i between the observed values (y i ) and<br />
the predicted values (ŷ i ). The d i are positive for<br />
points above the fitted line and negative for<br />
points below the line. The expression ∑ d i ,<br />
summed over the n data points (i.e. the sample is<br />
of size n), does not measure “fit” due to cancellation<br />
of negative and positive values.<br />
Therefore, minimise ∑ d i ² = ∑ (y i − ŷ i )².<br />
i i<br />
Suppose the straight line which does this has<br />
slope “β 1 ” and intercept “β 0 ”. That is,<br />
y = β 0 + β 1 x<br />
The method of least squares finds the values of β 0<br />
and β 1 which minimise<br />
∑ (y i − ŷ i )² = ∑ (y i − [β 0 + β 1 x i ])²<br />
The estimates of β 0 and β 1 are β̂ 0 and β̂ 1 , which<br />
turn out to be<br />
β̂ 1 = ∑ (x i − x̄)(y i − ȳ) / ∑ (x i − x̄)²<br />
β̂ 0 = ȳ − β̂ 1 x̄<br />
The line which best “fits” the data is<br />
ŷ = (ȳ − β̂ 1 x̄) + β̂ 1 x<br />
= ȳ + β̂ 1 (x − x̄)<br />
= ȳ + [ ∑ (x i − x̄)(y i − ȳ) / ∑ (x i − x̄)² ] (x − x̄)<br />
Example:<br />
x i y i (x i − x̄) (x i − x̄)² (y i − ȳ) (y i − ȳ)(x i − x̄)<br />
100 39.7 –300 90000 –20.51 6153<br />
200 51.1 –200 40000 –9.11 1822<br />
300 49.9 –100 10000 –10.31 1031<br />
400 69.8 0 0 9.59 0<br />
500 65.2 100 10000 4.99 499<br />
600 65.1 200 40000 4.89 978<br />
700 80.7 300 90000 20.49 6147<br />
2800 421.5 280000 16630<br />
x = 400 y = 60.21<br />
Therefore, β̂ 1 = 16630/280000 = 0.059<br />
β̂ 0 = 60.21 − 0.059(400) = 36.61<br />
giving ŷ = 36.61 + 0.059x<br />
To draw this line on the scatter diagram two<br />
points are needed:<br />
e.g. if x = 400, ŷ = 36.61 + 0.059(400) = 60.21<br />
if x = 100, ŷ = 42.51<br />
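The same estimates can be obtained in a few lines of Python (a sketch; carrying full precision gives β̂ 1 = 0.0594 and β̂ 0 = 36.46, while the 36.61 above comes from rounding the slope to 0.059 first):<br />

```python
# Least-squares slope and intercept for the worked example.
xs = [100, 200, 300, 400, 500, 600, 700]
ys = [39.7, 51.1, 49.9, 69.8, 65.2, 65.1, 80.7]

n = len(xs)
xbar, ybar = sum(xs)/n, sum(ys)/n
sxy = sum((x - xbar)*(y - ybar) for x, y in zip(xs, ys))   # 16630
sxx = sum((x - xbar)**2 for x in xs)                       # 280000
b1 = sxy/sxx            # slope, 0.0594 to 4 d.p.
b0 = ybar - b1*xbar     # intercept, 36.46 to 2 d.p.
print(round(b1, 4), round(b0, 2))
```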
N.B. 1. In this situation we have regressed Y on<br />
X.<br />
This implies the X values are known without<br />
error but the Y values are influenced by<br />
r<strong>and</strong>om variation.<br />
2. Numerically, we could regress X on Y. But<br />
the “slope” <strong>of</strong> this regression is not the same<br />
as that for Y on X. The reason is that now the<br />
Y values are known exactly with the X values<br />
influenced by r<strong>and</strong>om variation.<br />
3. ŷ = ȳ + β̂ 1 (x − x̄)<br />
When x = x̄, ŷ = ȳ + β̂ 1 (0) = ȳ<br />
This means that the point (x̄, ȳ) always lies<br />
on the least squares straight line, i.e. the<br />
regression line always passes through the<br />
centre of the scatter diagram.<br />
4. We say the least squares line “fits” or<br />
“models” the relationship between Y <strong>and</strong> X.<br />
5. A straight line may give poor fit e.g.<br />
[Figure: a scatter diagram in which the points follow a curve, so a straight line fits poorly.]<br />
Here, it is not appropriate to use the line to<br />
predict values <strong>of</strong> Y for given values <strong>of</strong> X.<br />
The next step in our regression analysis is to<br />
establish how well this fitted line is able to model<br />
or explain the effect X has on Y; <strong>and</strong> also, if the<br />
fitted line is used to make forecasts <strong>of</strong> the values<br />
<strong>of</strong> Y, how accurate these forecasts turn out to be.<br />
(We set up confidence intervals for these<br />
forecasts.)<br />
Definition: The value d i = y i − ŷ i is called the<br />
residual at the value x i of X. These residuals are<br />
important as they represent the error made when<br />
using the line to make a forecast.<br />
Analysis <strong>of</strong> Variance for a Regression Model<br />
The diagram below shows that any numerical value y i<br />
can be partitioned into three components as<br />
follows:<br />
[Figure: the regression line through the scatter diagram; at x i the point (x i , y i ) is split into the overall mean ȳ, the explained piece β̂ 1 (x i − x̄) between ȳ and the line, and the residual d i = (y i − ŷ i ) between the line and the point.]<br />
That is, any value<br />
y i = ȳ + β̂ 1 (x i − x̄) + (y i − ŷ i )<br />
i.e. y i = an overall average<br />
+ an amount explained by a<br />
predictor variable X<br />
+ a residual (or random error)<br />
The amount explained by the independent<br />
variable X is called the regression effect. This is<br />
also known as the explained component <strong>of</strong> the<br />
outcomes y i .<br />
The magnitude <strong>of</strong> the regression effect is related<br />
to the slope <strong>of</strong> the line <strong>and</strong> the distance x i is away<br />
from the overall mean x <strong>of</strong> the values x i .<br />
The mean y is the overall average effect.<br />
The term (y i − ŷ i ) is the residual effect. This is<br />
also known as the unexplained component of the<br />
outcomes.<br />
Therefore,<br />
data value = overall average effect<br />
+ regression effect + residual (error)<br />
effect.<br />
= overall average effect<br />
+ explained amount + unexplained<br />
amount<br />
To illustrate, the example has x̄ = 400,<br />
ȳ = 60.21 and β̂ 1 = 0.059<br />
x i y i = ȳ + 0.059(x i − 400) + residual<br />
100 39.7 = 60.21 + (–17.82) + (–2.69)<br />
200 51.1 = 60.21 + (–11.88) + 2.77<br />
300 49.9 = 60.21 + (–5.94) + (–4.37)<br />
400 69.8 = 60.21 + 0.00 + 9.59<br />
500 65.2 = 60.21 + 5.95 + (–0.95)<br />
600 65.1 = 60.21 + 11.88 + (–6.99)<br />
700 80.7 = 60.21 + 17.82 + 2.67<br />
(overall mean, common to each data value) + (explained effect) + (unexplained effect, chosen to give equality)<br />
It is important to establish if the explained effect<br />
has a much greater impact on the values y i than<br />
the unexplained residual effect, i.e. does the<br />
regression effect explain more of the variation in<br />
the y i values? It turns out that the total variation<br />
in the y i values can be partitioned into an overall<br />
mean component, a regression component <strong>and</strong> a<br />
residual component as follows:<br />
[This page just for reference]<br />
Total sum <strong>of</strong> Squares (SS) <strong>of</strong> y i values<br />
= (39.7) 2 + (51.1) 2 + (49.9) 2 + (69.8) 2<br />
+ (65.2) 2 + (65.1) 2 + (80.7) 2<br />
= 26550.89<br />
The overall mean SS<br />
= (60.21) 2 + … + (60.21) 2 (7 times)<br />
= 7(60.21) 2<br />
= 25380.32<br />
The regression effect SS<br />
= (–17.82) 2 + (–11.88) 2 + … + (17.82) 2<br />
= 987.70<br />
The residual effect SS<br />
= (–2.69) 2 + (2.77) 2 + … + (2.67) 2<br />
= 182.87<br />
Now notice that<br />
26550.89 = 25380.32 + 987.70 + 182.87<br />
i.e. Total SS = overall mean SS + regression SS<br />
+ residual SS<br />
That is, the total variation is partitioned into these<br />
components which should now be compared. But<br />
the three component values cannot be compared<br />
directly. Note that:<br />
(i) There are seven data values y i hence seven<br />
degrees <strong>of</strong> freedom.<br />
(ii) One overall mean has one DF.<br />
(iii) The seven regression values depend on the<br />
one slope estimate β̂ 1 , hence one DF.<br />
(iv) The seven residuals have the remaining<br />
7 – 2 = 5 DF.<br />
The average or mean squares (MS) are then found<br />
by dividing the sums <strong>of</strong> squares by the degrees <strong>of</strong><br />
freedom. These mean squares can be compared.<br />
The procedure is summarised in the following<br />
analysis <strong>of</strong> variance table:<br />
SOURCE OF VARIATION SS DF MS<br />
Overall mean 25380.32 1<br />
Regression effect 987.70 1 987.70<br />
Residual effect 182.87 (5) 36.57<br />
Total 26550.89 7<br />
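The partition can be verified directly from the data. A Python sketch of the sums of squares (illustrative):<br />

```python
# Sum-of-squares partition and F statistic for the worked example.
xs = [100, 200, 300, 400, 500, 600, 700]
ys = [39.7, 51.1, 49.9, 69.8, 65.2, 65.1, 80.7]
n = len(xs)
xbar, ybar = sum(xs)/n, sum(ys)/n
sxx = sum((x - xbar)**2 for x in xs)
b1 = sum((x - xbar)*(y - ybar) for x, y in zip(xs, ys)) / sxx

total_ss = sum(y*y for y in ys)          # 26550.89
mean_ss = n*ybar**2                      # ~25380.32
reg_ss = b1**2 * sxx                     # ~987.70
res_ss = total_ss - mean_ss - reg_ss     # ~182.87
f = (reg_ss/1) / (res_ss/(n - 2))        # ~27.01
print(round(reg_ss, 2), round(res_ss, 2), round(f, 2))
```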
The average regression effect (or the average<br />
effect of X on the Y values) far exceeds the<br />
average residual effect (unexplained) since<br />
987.70 far exceeds 36.57. But is this difference<br />
large enough to be important? The question of<br />
whether the average regression effect is large<br />
enough is answered by defining F = 987.70/36.57<br />
= 27.01 and testing this F-statistic for<br />
significance by reference to the F-table as<br />
follows (note that the DF here are 1 and 5<br />
respectively for numerator and denominator).<br />
Since 27.01 > 6.608 there is evidence that the<br />
regression (or explained) effect dominates the<br />
residual (or unexplained) effect. Since the key<br />
part of the regression effect is the slope β̂ 1 , this<br />
effectively means β 1 ≠ 0; alternatively, there is<br />
evidence that changes in the values x i of X<br />
explain the variation in the values y i of Y (and this<br />
dominates any left-over residual or unexplained<br />
effects).<br />
The F-distribution (Table in Appendix)<br />
υ 1 = numerator DF, υ 2 = denominator DF, α = 0.05 (say)<br />
[Figure: an F density curve with the upper-tail area α to the right of the critical value F υ1,υ2 .]<br />
Extract from the F-table (α = 0.05), rows υ 2 , columns υ 1 = 1, 2, 3, …:<br />
υ 2 = 5: 6.608 5.786 5.409 …<br />
υ 2 = 120: 3.920 3.072 2.680 …<br />
Note: 1. The residual effect includes any<br />
r<strong>and</strong>om error plus the effects <strong>of</strong> other<br />
variables which may be affecting the<br />
outcome Y values.<br />
2. Computer s<strong>of</strong>tware produces the analysis <strong>of</strong><br />
variance table directly.<br />
3. It is a slightly modified form because the<br />
overall mean effect is never used. Therefore,<br />
this is subtracted (with appropriate changes<br />
to the total SS <strong>and</strong> the degrees <strong>of</strong> freedom)<br />
SOURCE OF VARIATION SS DF MS F<br />
Regression effect 987.70 1 987.70 27.01*<br />
Residual effect 182.87 (5) 36.57<br />
Total (overall mean removed) 1170.57 6<br />
4. The “fitted” straight line should pass through<br />
the middle <strong>of</strong> the scatter diagram, <strong>and</strong> hence<br />
the residuals should take positive <strong>and</strong><br />
negative values as X increases. (This can be<br />
checked by studying plots <strong>of</strong> the residuals<br />
produced by the program.)<br />
5. For the validity <strong>of</strong> the F-test, residuals should<br />
be approximately normally distributed. This<br />
can also be checked by obtaining the normal<br />
probability plot using the program.<br />
Analyse > Regression > Linear, with Y in<br />
the Dependent Variable box and X in the<br />
Independent Variable box, produces the<br />
corresponding printout. [Printout not reproduced.]<br />
A Confidence Interval for the Slope of the line.<br />
Our sample of n = 7 produced an estimate<br />
β̂ 1 = 0.059<br />
Repeated samples of size n = 7 give values β̂ 1<br />
which follow a normal distribution (just the<br />
Central Limit Theorem again).<br />
If β 1 is the true slope of the regression line then<br />
the standard error of β̂ 1 is<br />
σ β̂1 = σ e / √( ∑ (x i − x̄)² )<br />
where σ e ² is estimated from the data by the<br />
formula<br />
s e ² = ∑ (y i − ŷ i )² / (n − 2)<br />
Notes<br />
1. (y i − ŷ i ) is the residual (or error) at the value<br />
x i of X.<br />
2. The divisor is (n − 2) rather than the (n − 1)<br />
used in the calculation of an ordinary variance<br />
because here two values “β 0 ” and “β 1 ” are<br />
estimated from the data and used to find the<br />
ŷ i from which the deviations are measured.<br />
[For an ordinary variance, s² = ∑ (x i − x̄)²/(n − 1),<br />
only x̄ is estimated.]<br />
The estimated standard error of the slope of the<br />
regression line is<br />
s β̂1 = s e / √( ∑ (x i − x̄)² )<br />
Therefore, the 95% confidence interval for β 1 is<br />
β̂ 1 ± t n−2 s e / √( ∑ (x i − x̄)² )<br />
Notes.<br />
(1) There are υ = n – 2 degrees <strong>of</strong> freedom for<br />
use with the t-table.<br />
349<br />
Section 9
(2) If σ e were known exactly (which it never is)<br />
the 95% confidence interval would be<br />
β̂ 1 ± 1.96 σ e / √( ∑ (x i − x̄)² ).<br />
(3) In practice, σ e is always estimated by<br />
s e = √( ∑ (y i − ŷ i )² / (n − 2) )<br />
(4) s e ² is just the residual mean square and this<br />
can be read directly from the analysis of<br />
variance.<br />
Example<br />
Refer to the earlier data which gave<br />
∑ (x i − x̄)² = 280000, β̂ 1 = 0.059 and<br />
ŷ = 36.6 + 0.059x<br />
x i y i (y i − ŷ i ) (y i − ŷ i )²<br />
100 39.7 −2.69 7.24<br />
200 51.1 2.77 7.67<br />
300 49.9 −4.37 19.10<br />
400 69.8 9.59 91.97<br />
500 65.2 −0.95 0.90<br />
600 65.1 −6.99 48.86<br />
700 80.7 2.67 7.13<br />
Residual sum of squares = 182.87<br />
(the residuals are those found earlier)<br />
Therefore, s e ² = 182.87/(7 − 2) = 36.58 (the residual mean square)<br />
with n − 2 = 7 − 2 = 5 D.F. giving<br />
t 5 = 2.571 for 95% confidence.<br />
The standard error of the slope is estimated to be<br />
s e / √( ∑ (x i − x̄)² ) = √36.58 / √280000 = 0.0114<br />
The 95% confidence interval is<br />
0.059 ± 2.571(0.0114) or 0.059 ± 0.029<br />
Hence 0.030 < β 1 < 0.088<br />
[Figure: a scatter diagram with a rising trend line. As X changes, the values of Y tend to show an increasing trend with random variation about the trend line.]<br />
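The interval can be checked numerically (a sketch; the residual sum of squares and t-value are those quoted above):<br />

```python
import math

# 95% CI for the slope using the residual mean square from the ANOVA table.
res_ss, n, sxx = 182.87, 7, 280000
b1 = 0.0594                 # slope carried to 4 d.p.
se_slope = math.sqrt(res_ss/(n - 2) / sxx)    # ~0.0114
t5 = 2.571                  # t-table value, 5 d.f., 95%
lo, hi = b1 - t5*se_slope, b1 + t5*se_slope
print(round(se_slope, 4))          # 0.0114
print(round(lo, 3), round(hi, 3))  # 0.03 0.089
```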
Example<br />
A test has been designed to measure patient stress<br />
level (X). Blood pressure (Y) is recorded for<br />
different stress levels.<br />
Stress (X) 55 94 64 73 96 86<br />
Blood Pr. (Y) 72 91 76 78 94 81<br />
These data give x̄ = 78; ȳ = 82;<br />
∑ (x i − x̄)² = 1394 and ∑ (x i − x̄)(y i − ȳ) = 686.<br />
Find the least squares line and a 95% confidence<br />
interval for the slope, and test the research proposal<br />
that higher stress results in higher blood pressure<br />
levels.<br />
Solution:<br />
β̂₁ = ∑(x_i − x̄)(y_i − ȳ) / ∑(x_i − x̄)² = 686/1394 = 0.492<br />
∴ ŷ = ȳ + β̂₁(x − x̄) = 82 + 0.492(x − 78)<br />
Suppose a computer analysis gives the analysis <strong>of</strong><br />
variance as follows:<br />
SOURCE OF VARIATION SS DF MS F<br />
Regression effect 337.59 1 337.59 33.41<br />
Residual effect 40.41 4 10.10<br />
Then s_e² = ∑(y_i − ŷ_i)² / (n − 2) = 40.41/4 = 10.10<br />
giving s_e = 3.178 as the residual standard deviation.<br />
For 95% confidence, t_4 = 2.776 and the standard error<br />
of the slope = 3.178 / √1394 = 0.085.<br />
The 95% confidence interval is<br />
0.492 ± 2.776(0.085)<br />
It follows that 0.256 < β₁ < 0.728<br />
Since this interval excludes zero, the test of β₁ = 0 has p-value less than 0.05: there is evidence that higher stress results in higher blood pressure levels.<br />
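For readers checking by hand rather than in R-cmdr, the whole calculation follows from the summary quantities (a plain-Python sketch; the residual sum of squares 40.41 is taken from the ANOVA table above):

```python
import math

sxy = 686        # sum of (x_i - xbar)(y_i - ybar)
sxx = 1394       # sum of (x_i - xbar)^2
rss = 40.41      # residual sum of squares from the ANOVA table
n = 6            # six stress / blood pressure pairs
t4 = 2.776       # t value with n - 2 = 4 d.f. for 95% confidence

b1 = sxy / sxx                      # least squares slope
s_e = math.sqrt(rss / (n - 2))      # residual standard deviation
se_b1 = s_e / math.sqrt(sxx)        # standard error of the slope
ci = (b1 - t4 * se_b1, b1 + t4 * se_b1)

print(round(b1, 3))                      # 0.492
print(round(se_b1, 3))                   # 0.085
print(round(ci[0], 3), round(ci[1], 3))  # 0.256 0.728
```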
Confidence Interval for Prediction using a<br />
Regression Line<br />
The predicted value at a value x_i of X is found by<br />
substituting x_i in the regression equation.<br />
e.g. For our data, ŷ = 36.6 + 0.059x<br />
When x_i = 750, ŷ = 36.6 + 0.059(750) = 80.85<br />
But what error is associated with this prediction?<br />
At a value X = x_k, say, the estimated standard error<br />
of the prediction is<br />
s_ŷ = s_e √( 1 + 1/n + (x_k − x̄)² / ∑(x_i − x̄)² )<br />
where s_e is the residual standard deviation.<br />
But s_e = √36.58 = 6.05 (see ANOVA table)<br />
∴ s_ŷ = 6.05 √( 1 + 1/7 + (750 − 400)²/280000 ) = 7.604<br />
The 95% confidence interval is<br />
ŷ ± t_5 s_ŷ where t_5 = 2.571<br />
That is 80.85 ± 2.571(7.604)<br />
Therefore, 61.30 < ŷ₇₅₀ < 100.40<br />
where ŷ₇₅₀ is the prediction at x_k = 750.<br />
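A plain-Python sketch of the same prediction-interval arithmetic (R-cmdr produces this interval directly when requested):

```python
import math

s_e2 = 36.58    # residual mean square from the ANOVA table
n = 7
xbar = 400
sxx = 280000    # sum of (x_i - xbar)^2
t5 = 2.571      # t value with 5 d.f. for 95% confidence

def prediction_interval(xk):
    """95% interval for a new observation at X = xk, simple linear regression."""
    yhat = 36.6 + 0.059 * xk
    s_pred = math.sqrt(s_e2) * math.sqrt(1 + 1/n + (xk - xbar)**2 / sxx)
    half = t5 * s_pred
    return yhat - half, yhat + half

lo, hi = prediction_interval(750)
print(round(lo, 1), round(hi, 1))   # about 61.3 and 100.4
```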
Notes<br />
(1) R-cmdr (and other packages) give this<br />
interval when requested.<br />
(2) A graph showing the confidence bands<br />
around the regression line can also be<br />
produced, as sketched below.<br />
(3) Essentially, the confidence interval for the<br />
prediction involves both line error and natural<br />
variation when predicting a data point.<br />
[Figure: regression line with prediction-interval bands plotted against X.]<br />
EXAMPLE: 2003 EXAM<br />
The data for this question are a sample <strong>of</strong> 100 low<br />
birth weight infants. Measurements <strong>of</strong> systolic<br />
blood pressure (sbp) <strong>and</strong> values <strong>of</strong> gestational age<br />
(gestage) are recorded. The following table<br />
shows the layout <strong>of</strong> the data along with the results<br />
<strong>of</strong> some calculations using the 100 data values.<br />
sbp (Y mm Hg)    gestage (X weeks)<br />
43    29<br />
51    31<br />
42    33<br />
39    31<br />
⋮     ⋮<br />
40    33<br />
50    28<br />
Summary calculations from the 100 data values:<br />
ȳ = 47.31, x̄ = 28.89<br />
∑(x_i − x̄)² = 635.69<br />
∑(y_i − ȳ)² = 15222.24<br />
∑(x_i − x̄)(y_i − ȳ) = 806.31<br />
(a) (4 marks) Using systolic blood pressure as<br />
the response <strong>and</strong> gestational age as the<br />
predictor variable, compute the least squares<br />
regression line. Interpret the slope <strong>of</strong> this<br />
regression line.<br />
(b) (5 marks) The st<strong>and</strong>ard deviation <strong>of</strong> the<br />
sample points about the regression line in (a)<br />
is s e = 3.47. Obtain an estimate for the<br />
st<strong>and</strong>ard error <strong>of</strong> the slope <strong>of</strong> the regression<br />
<strong>and</strong> hence set up a 95% confidence interval<br />
for the slope <strong>of</strong> the regression line. State<br />
whether you would reject the null hypothesis<br />
that the true slope is equal to 0.<br />
(c) (3 marks) What is the predicted systolic<br />
blood pressure for a low birth weight infant<br />
whose gestational age is 31 weeks?<br />
Construct a 95% confidence interval for the<br />
prediction.<br />
(d) (1 mark) The value <strong>of</strong> the coefficient <strong>of</strong><br />
determination is R – Sq = 67%. Interpret this<br />
value. (discussed next lecture)<br />
(e) (3 marks) What conclusions would you draw<br />
from the two residual plots below arising<br />
from the fitted regression in (a)?<br />
SOLUTION<br />
(a) β̂₁ = 806.31/635.69 = 1.27<br />
β̂₀ = 47.31 − 1.27(28.89) = 10.62<br />
ŷ = 10.62 + 1.27x<br />
For infants with gestational age one week<br />
higher, the model predicts sbp increases by<br />
1.27 mmHg.<br />
(b) Estimated standard error<br />
= 3.47 / √635.69 = 0.138<br />
95% C.I. is 1.27 ± 1.98(0.138)<br />
giving 1.27 ± 0.273<br />
or 1.00 < β₁ < 1.54<br />
The confidence interval excludes zero (p-value<br />
< 0.05), hence reject the null hypothesis.<br />
(c) Prediction = 10.62 + 1.27(31) = 49.99<br />
95% C.I. is<br />
49.99 ± 1.98(3.47) √( 1 + 1/100 + (31 − 28.89)²/635.69 )<br />
giving 49.99 ± 6.92<br />
or 43.07 < ŷ₃₁ < 56.91<br />
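A plain-Python check of parts (a) to (c), using the rounded values as in the notes (the course itself would do this in R-cmdr):

```python
import math

sxx, sxy = 635.69, 806.31     # summaries from the data table
xbar, ybar = 28.89, 47.31
n, s_e, t = 100, 3.47, 1.98   # t with 98 d.f. is approximately 1.98

b1 = round(sxy / sxx, 2)      # (a) slope, rounded as in the notes
b0 = round(ybar - b1 * xbar, 2)
print(b0, b1)                 # 10.62 1.27

se_b1 = s_e / math.sqrt(sxx)  # (b) standard error of the slope
print(round(se_b1, 3))        # 0.138

pred = b0 + b1 * 31           # (c) predicted sbp at 31 weeks
half = t * s_e * math.sqrt(1 + 1/n + (31 - xbar)**2 / sxx)
print(round(pred, 2), round(half, 1))  # 49.99 and about 6.9
```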
(d) 67% <strong>of</strong> the total sum <strong>of</strong> squares <strong>of</strong> the sbp<br />
values is explained by changes in the number<br />
<strong>of</strong> weeks <strong>of</strong> gestation. (Alternatively, 67% <strong>of</strong><br />
the variation in the sbp values is explained.)<br />
(Discussed next lecture.)<br />
(e) Variation about the fitted line is constant for<br />
different gestation times. The residuals<br />
appear close to a normal distribution except<br />
for a possible outlier at x = 29.<br />
Correlation<br />
The correlation coefficient is a measure of linear<br />
association. The Pearson correlation coefficient r<br />
is defined as<br />
r = ∑(x_i − x̄)(y_i − ȳ) / √( ∑(x_i − x̄)² ∑(y_i − ȳ)² )<br />
with all sums running over i = 1, …, n.<br />
This measures the 'strength' of linear association<br />
between X and Y (as we shall now see). Recall<br />
that the regression line passes through the point (x̄, ȳ).<br />
[Figure: scatter plot of Y against X, divided into four quadrants, labelled 1 to 4, by a vertical line through x̄ and a horizontal line through ȳ.]<br />
The denominator in the formula for r is always<br />
positive. In quadrant 1, x_i − x̄ > 0 and<br />
y_i − ȳ > 0, meaning (x_i − x̄)(y_i − ȳ) > 0. In<br />
quadrant 3, x_i − x̄ < 0 and y_i − ȳ < 0, again giving<br />
(x_i − x̄)(y_i − ȳ) > 0. In quadrants 2 and 4,<br />
(x_i − x̄)(y_i − ȳ) < 0.<br />
Therefore, r is large and positive if the points lie mainly<br />
in quadrants 1 and 3; it is large and negative if the<br />
points lie mainly in quadrants 2 and 4.<br />
[Figures (i) and (ii): (i) a patternless scatter of points; (ii) a strong but non-linear (curved) pattern of points.]<br />
In case (i) the contributions from the four quadrants<br />
are equal and cancel, and therefore r = 0; there is no<br />
relationship between Y and X. In case (ii) there is<br />
again cancellation and r = 0, but here there is a strong<br />
relationship between Y and X; it is simply non-linear.<br />
r therefore measures the strength of the linear<br />
association between X and Y. But we must be<br />
careful, as r = 0 in the following case (iii) where<br />
β₁ = 0. In fact r is directly related to β̂₁ and is zero<br />
if β̂₁ is zero.<br />
[Figure (iii): a band of points with no overall slope, so the fitted slope β₁ = 0 and hence r = 0.]<br />
Example: A researcher investigates the<br />
relationship between reading and spelling tests<br />
administered to nine students.<br />
Student 1 2 3 4 5 6 7 8 9<br />
X (spelling) 52 90 63 81 93 51 48 99 85<br />
Y (reading) 56 81 75 72 50 45 39 87 59<br />
x_i    y_i    (x_i − x̄)²    (y_i − ȳ)²    (x_i − x̄)(y_i − ȳ)<br />
52     56     …             …             …<br />
90     81     …             …             …<br />
63     75     …             …             …<br />
81     72     …             …             …<br />
93     50     …             …             …<br />
51     45     …             …             …<br />
48     39     …             …             …<br />
99     87     …             …             …<br />
85     59     …             …             …<br />
Totals:       3220.2225     2258.0001     1718.6665<br />
x̄ = 73.55, ȳ = 62.67<br />
r = 1718.6665 / √( 3220.2225 × 2258.0001 ) = +0.6374<br />
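The elided table columns can be filled in by machine; a short plain-Python computation of r from the nine score pairs (the notes would use R-cmdr for this):

```python
import math

# Spelling (x) and reading (y) scores for the nine students
x = [52, 90, 63, 81, 93, 51, 48, 99, 85]
y = [56, 81, 75, 72, 50, 45, 39, 87, 59]

n = len(x)
xbar = sum(x) / n
ybar = sum(y) / n

sxx = sum((xi - xbar) ** 2 for xi in x)
syy = sum((yi - ybar) ** 2 for yi in y)
sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))

r = sxy / math.sqrt(sxx * syy)  # Pearson correlation coefficient
print(round(r, 4))              # 0.6374
```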
But what does this mean?<br />
Very strong correlation:<br />
[Figures: tightly clustered points along an upward-sloping line (r near +1) and along a downward-sloping line (r near −1).]<br />
[Figures: progressively looser scatters illustrating smaller correlations: r ≈ +0.7 and r ≈ −0.7 show clear but imperfect linear trends; r ≈ +0.2 and r ≈ −0.2 show very small, near-random scatter.]<br />
Notes:<br />
1. The largest value <strong>of</strong> r turns out to be +1. In<br />
this case all points lie on a straight line in<br />
quadrants 1 <strong>and</strong> 3 . This implies perfect<br />
positive linear association. i.e. as X increases,<br />
Y increases in the same ratio (if the increase<br />
<strong>of</strong> X is doubled, the increase in Y would also<br />
be doubled).<br />
2. r = −1 is the smallest value, which implies perfect<br />
negative linear association when all points lie<br />
in quadrants 2 and 4, i.e. as X increases, Y<br />
decreases in the same ratio.<br />
3. |r| > 0.7 implies strong linear relationship.<br />
|r| < 0.3 implies negligible linear relationship.<br />
4. The correlation coefficient is an index. It<br />
does not depend on the units <strong>of</strong> either X or Y.<br />
(numerator <strong>and</strong> denominator in same units)<br />
5. r is called the Pearson Correlation<br />
Coefficient.<br />
6. An important correlation does not imply a<br />
causal link between the two variables. (The<br />
correlation is <strong>of</strong>ten caused by the effect <strong>of</strong> a<br />
third variable influencing both X <strong>and</strong> Y).<br />
e.g. smoking and lung cancer incidence being<br />
correlated does not, by itself, establish that<br />
smoking causes lung cancer.<br />
7. If r is large, a regression line will fit the data<br />
well.<br />
8. r² gives the fraction of variability in the Y<br />
values associated with the predictor variable X.<br />
e.g. In the example, r = 0.6374, so r² = 0.406:<br />
40.6% of the variability in Y is explained<br />
by changes in X.<br />
That is,<br />
r² = SS(Regression) / SS(Total)<br />
where SS(Total) = SS(Regression) + SS(Residual),<br />
for a simple linear regression.<br />
Some examples on correlation <strong>and</strong> association discussed in lectures.<br />
Correlation measures association but association is not the same as causation.<br />
Example: For school children, shoe size is strongly correlated with reading skills.<br />
Learning new words does not make the feet get bigger.<br />
Instead, there is a third factor, age. As children get older, they learn to read better <strong>and</strong> they outgrow<br />
their shoes.<br />
Age is a confounder. Here, this confounder is easy to spot. Often this is not so easy. The<br />
arithmetic <strong>of</strong> the correlation coefficient does not give protection against third factors.<br />
Example: Education level <strong>and</strong> unemployment.<br />
In the Great Depression (1929 – 1933), better educated people had shorter spells <strong>of</strong> unemployment.<br />
(Education level and days unemployed were very highly<br />
correlated, negatively: more education was associated with<br />
fewer days unemployed.) Does education protect you against unemployment?<br />
Discussion:<br />
Perhaps, but the data were observational. Age is a confounding variable. Younger people were<br />
better educated as education level had been increasing over time. (It still is!!)<br />
Employers seemed to prefer younger job seekers.<br />
Controlling for age made the effect <strong>of</strong> education on unemployment much weaker.<br />
Example:<br />
In countries where people eat lots <strong>of</strong> fat, rates <strong>of</strong> breast <strong>and</strong> colon cancer are high. This correlation<br />
is often used to argue that fat in the diet causes cancer. How good is this evidence?<br />
[Figure: scatter plot of death rate (per 100 000) against fat intake per capita per day (grams), one point per country; Thailand, Sri Lanka and Japan lie at the low end and Denmark, NZ, UK, Holland, Spain and Finland at the high end, with the points rising steeply.]<br />
Discussion: There is a very high correlation as shown by the scatter diagram which is very<br />
elongated. If fat in diet causes cancer, then the points should slope up as shown. So the diagram is<br />
some evidence for the theory. But the evidence is weak.<br />
For example, countries with lots <strong>of</strong> fat in diet also have lots <strong>of</strong> sugar, <strong>and</strong> a similar plot for sugar<br />
would be found.<br />
As it turns out, fat <strong>and</strong> sugar are relatively expensive. In rich countries people can afford to eat fat<br />
<strong>and</strong> sugar rather than starchier grain products.<br />
Some aspects <strong>of</strong> diet in these countries or these life-style factors probably do cause certain kinds <strong>of</strong><br />
cancer. Epidemiologists can identify only a few <strong>of</strong> these factors with confidence. Fat is not among<br />
them.<br />
Example: Ultrasound <strong>and</strong> low birthweight.<br />
Babies can be examined in the womb using ultrasound. Several experiments on lab animals have<br />
shown ultrasound exams can cause low birthweight. If true for humans, there are grounds for<br />
concern. Scientists at Johns Hopkins Hospital in Baltimore ran an observational study to find out.<br />
Babies exposed to ultrasound differ from unexposed babies in many ways besides exposure; this<br />
investigation was only an observational study.<br />
The scientists found a number <strong>of</strong> confounding variables <strong>and</strong> adjusted for them. There was still an<br />
association. Babies exposed to ultrasound in the womb had lower birthweight, on average.<br />
Is this evidence that ultrasound causes lower birthweight?<br />
Discussion: Obstetricians suggest ultrasound examination when something seems wrong. The<br />
investigators concluded that the ultrasound exams <strong>and</strong> low birthweights had a common cause –<br />
problem pregnancies.<br />
Later, a r<strong>and</strong>omized controlled experiment was carried out to get more definite evidence. If<br />
anything, ultrasound was protective.<br />
Journal of Obstetrics and Gynaecology, 71 (1988), pp. 513–517.<br />
Also Lancet (1988), pp. 585–588.<br />
REVIEW EXERCISES<br />
1. Physical fitness testing is an important aspect of athletic training. A common measure of<br />
cardiovascular fitness is the maximum volume of oxygen uptake during strenuous exercise. A study was<br />
conducted on 18 middle-aged men to study the influence on oxygen uptake of the time taken to complete a 2-mile run.<br />
The oxygen uptake measure was obtained with standard laboratory methods as the subjects performed<br />
on a motor-driven treadmill. The data (Ribisl et al., Journal of Sports Medicine, 9: 17-22) are below:<br />
Maximum Volume of O₂ (Y)    Time in Seconds (X)<br />
42.33    918<br />
53.10    805<br />
42.08    892<br />
42.45    968<br />
42.46    907<br />
49.92    743<br />
36.23    1045<br />
49.66    810<br />
41.49    927<br />
46.16    813<br />
48.18    858<br />
51.81    760<br />
53.28    747<br />
53.29    743<br />
47.18    803<br />
56.91    683<br />
47.80    844<br />
53.69    700<br />
Data summary: x̄ = 831.40, ȳ = 47.67<br />
∑(x_i − x̄)² = 160613.28<br />
∑(x_i − x̄)(y_i − ȳ) = −8698.33<br />
∑(y_i − ŷ_i)² = 55.25<br />
(a) Use the data summary to find an estimate for the equation of the least squares regression line of Y on X.<br />
(2 marks)<br />
(b) Find an estimate for the standard error of the slope of the regression line and set up a 95% confidence<br />
interval for the slope of the regression line.<br />
(4 marks)<br />
(c) What does the confidence interval in (b) tell you about the effect of time (X) on maximum volume of<br />
oxygen uptake (Y)?<br />
(1 mark)<br />
(d) If a man in this age group takes 50 seconds longer to run the 2-mile distance, what is the change in his<br />
maximum volume of oxygen uptake? Write down the 95% confidence interval for this change using the<br />
result from (b).<br />
(2 marks)<br />
(e) Set up a 95% confidence interval for the maximum volume of oxygen uptake for a man who takes 11<br />
minutes (660 seconds) to complete a two-mile run.<br />
(3 marks)<br />
SOLUTIONS<br />
1. (a) b_YX = −8698.33/160613.28 = −0.054<br />
ŷ = 47.67 − 0.054(x − 831.4)<br />
= 92.566 − 0.054x<br />
(b) Estimated standard error = s_e / √( ∑(x_i − x̄)² )<br />
where s_e = √( ∑(y_i − ŷ_i)² / (n − 2) ) = √(55.25/16)<br />
That is, standard error = √(55.25/16) / √160613.28 = 0.004637<br />
A 95% confidence interval for the true slope is<br />
−0.054 ± t₁₆(0.004637) where t₁₆ = 2.120<br />
That is, −0.054 ± 0.0098<br />
giving −0.064 < β_YX < −0.044<br />
(c) The maximum volume of oxygen uptake is smaller for men who take longer to run 2<br />
miles.<br />
(d) Oxygen uptake reduces by 50(0.054) = 2.7 units.<br />
The 95% confidence interval for this change extends from 50(0.044) to 50(0.064), i.e. a reduction<br />
of between 2.2 and 3.2 units.<br />
(e) When x = 660 seconds, ŷ = 92.566 − 0.054(660) = 56.93<br />
The 95% confidence interval is<br />
56.93 ± t₁₆ s_e √( 1 + 1/n + (x_k − x̄)² / ∑(x_i − x̄)² )<br />
That is, 56.93 ± 2.120 √(55.25/16) √( 1 + 1/18 + (660 − 831.4)²/160613.28 )<br />
or 56.93 ± 4.38<br />
giving 52.55 < ŷ₆₆₀ < 61.31<br />
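All of these review-exercise numbers can be reproduced from the data summary alone; a plain-Python sketch (not part of the original solutions, which assume R-cmdr):

```python
import math

# Summary quantities from the review exercise
n = 18
xbar, ybar = 831.40, 47.67
sxx = 160613.28          # sum of (x_i - xbar)^2
sxy = -8698.33           # sum of (x_i - xbar)(y_i - ybar)
rss = 55.25              # sum of (y_i - yhat_i)^2
t16 = 2.120              # t value, 16 d.f., 95% confidence

b1 = sxy / sxx                         # (a) slope
s_e = math.sqrt(rss / (n - 2))         # residual standard deviation
se_b1 = s_e / math.sqrt(sxx)           # (b) standard error of the slope
print(round(b1, 3))                    # -0.054
print(round(se_b1, 6))                 # about 0.004637

lo = b1 - t16 * se_b1
hi = b1 + t16 * se_b1
print(round(lo, 3), round(hi, 3))      # -0.064 -0.044

# (e) 95% interval for a man taking 660 seconds
yhat = ybar + b1 * (660 - xbar)
half = t16 * s_e * math.sqrt(1 + 1/n + (660 - xbar)**2 / sxx)
print(round(yhat, 2), round(half, 2))  # about 56.95 and 4.38
# (the notes' 56.93 uses the slope rounded to -0.054)
```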
SECTION 10<br />
Multiple regression models <strong>and</strong> logistic regression models are introduced in this section. In the case<br />
of ordinary multiple regression the response or outcome variable is on a continuous scale, whereas<br />
in the case of a logistic regression the outcome measure is binary, taking only two possible<br />
values, interpreted as success versus failure.<br />
The models allow us to identify those variables which have an effect on the outcomes <strong>and</strong> those<br />
variables which do not.<br />
Adding additional variables leads to adjusted values for estimated parameters <strong>and</strong> it is this that<br />
allows us to control for confounding.<br />
The Multiple Regression Model<br />
R-cmdr Printout for Multiple Regression<br />
Dummy Variables<br />
Checking Model Fit<br />
Parallel Regression Lines <strong>and</strong> Analysis <strong>of</strong> Covariance<br />
Binary Outcomes <strong>and</strong> Logistic Regression<br />
373<br />
Section 10
Multiple regression<br />
• Simple linear regression (SLR) allowed us to<br />
assess the effect <strong>of</strong> a single independent<br />
variable (X) on a response variable (Y).<br />
• But what do we do if we think that the<br />
response may change according to more<br />
than one independent variable?<br />
• SLR can be extended to handle this.<br />
• Multiple regression allows us to assess the<br />
effects <strong>of</strong> several independent variables on<br />
the outcome variable <strong>and</strong> it allows the<br />
prediction <strong>of</strong> a response from the values <strong>of</strong><br />
several independent variables.<br />
• In multiple regression, there is a single<br />
dependent (outcome) variable <strong>and</strong> two or<br />
more independent (explanatory, predictor)<br />
variables or covariates.<br />
• The predictor variables can be:<br />
Continuous (e.g. blood pressure, height)<br />
Categorical – binary (e.g. sex)<br />
• The type <strong>of</strong> multiple regression that is<br />
performed depends on the data type <strong>of</strong> the<br />
outcome variable.<br />
• If the outcome variable is continuous, we use<br />
multiple linear regression.<br />
• If the outcome variable is binary, we use<br />
multiple logistic regression.<br />
The possible applications <strong>of</strong> multiple<br />
regression include:<br />
1. Adjusting for the effect <strong>of</strong> confounding<br />
variables.<br />
2. Establishing which variables are important in<br />
explaining the values <strong>of</strong> the outcome<br />
(response) variable.<br />
3. Predicting values <strong>of</strong> the outcome variable.<br />
4. Describing the strength <strong>of</strong> the association<br />
between the outcome variable <strong>and</strong> explanatory<br />
variables <strong>and</strong> reducing residual variation by<br />
introducing further effects as predictor<br />
variables.<br />
Multiple regression investigates <strong>and</strong> tests the joint<br />
effect <strong>of</strong> all predictors on the outcome variable as<br />
well as the measurement <strong>of</strong> individual effects <strong>of</strong><br />
each predictor.<br />
Example: Predict lung capacity from age, sex<br />
<strong>and</strong> height <strong>of</strong> patient.<br />
Lung capacity itself is difficult to measure. For<br />
heart-lung transplants to have the best chance of<br />
success it is desirable to have donor and recipient<br />
lungs of similar size.<br />
The multiple linear regression model:<br />
y = β₀ + β₁x₁ + β₂x₂ + β₃x₃ + … + ε (error)<br />
For simple linear regression the model is:<br />
y = β₀ + β₁x + ε<br />
The fitted straight line then becomes<br />
ŷ = β̂₀ + β̂₁x<br />
where β̂₀ and β̂₁ are chosen to minimise the sum<br />
of the squared errors (residuals).<br />
In the case of two explanatory variables, the<br />
multiple linear regression model can be written in<br />
the following form:<br />
y = β₀ + β₁x₁ + β₂x₂ + ε<br />
where ε is the residual (including random error)<br />
with mean of zero (for all data values i) and<br />
constant variance.<br />
The fitted regression equation is<br />
ŷ = β̂₀ + β̂₁x₁ + β̂₂x₂<br />
The estimates β̂₀, β̂₁ and β̂₂ are found from the<br />
data in such a way that the sum of the squared<br />
residuals (errors), that is<br />
∑ [ y_i − (β̂₀ + β̂₁x₁ᵢ + β̂₂x₂ᵢ) ]²,<br />
is minimised.<br />
The results are complicated <strong>and</strong> statistical<br />
s<strong>of</strong>tware is always used for calculations.<br />
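To make the least squares idea concrete, here is a small self-contained sketch (plain Python, not R-cmdr) that fits y = β₀ + β₁x₁ + β₂x₂ by solving the normal equations; the tiny data set is invented so that the exact coefficients are known in advance:

```python
def fit_two_predictor_ols(x1, x2, y):
    """Fit y = b0 + b1*x1 + b2*x2 by least squares via the normal equations."""
    n = len(y)
    # Build X'X and X'y for the design matrix with columns [1, x1, x2]
    cols = [[1.0] * n, x1, x2]
    xtx = [[sum(a * b for a, b in zip(c1, c2)) for c2 in cols] for c1 in cols]
    xty = [sum(c * yi for c, yi in zip(col, y)) for col in cols]
    # Solve the 3x3 system by Gauss-Jordan elimination
    m = [row[:] + [t] for row, t in zip(xtx, xty)]
    for i in range(3):
        p = m[i][i]
        m[i] = [v / p for v in m[i]]
        for j in range(3):
            if j != i:
                f = m[j][i]
                m[j] = [vj - f * vi for vj, vi in zip(m[j], m[i])]
    return [m[k][3] for k in range(3)]

# Invented data generated exactly from y = 1 + 2*x1 + 3*x2 (no error term),
# so least squares must recover b0 = 1, b1 = 2, b2 = 3
x1 = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [1.0, 0.0, 2.0, 1.0, 3.0, 2.0]
y = [1 + 2 * a + 3 * b for a, b in zip(x1, x2)]
print([round(c, 6) for c in fit_two_predictor_ols(x1, x2, y)])  # [1.0, 2.0, 3.0]
```

With real data (and an error term) the recovered coefficients are the least squares estimates rather than exact values; statistical software adds the standard errors and tests discussed below.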
Example<br />
For lung transplantation it is desirable for the<br />
donor’s lungs to be <strong>of</strong> a similar size as those <strong>of</strong><br />
the recipient. Total lung capacity (TLC) is<br />
difficult to measure, so it is useful to be able to<br />
predict TLC from other information. The<br />
following table shows the pre-transplant TLC <strong>of</strong><br />
32 recipients <strong>of</strong> heart-lung transplants, <strong>and</strong> their<br />
age, sex and height:<br />
Age Sex Height(cm) TLC(l) Age Sex Height(cm) TLC(l)<br />
1 35 F 149 3.40 17 30 F 172 6.30<br />
2 11 F 138 3.41 18 21 F 163 6.55<br />
3 12 M 148 3.80 19 21 F 164 6.60<br />
4 16 F 156 3.90 20 20 M 189 6.62<br />
5 32 F 152 4.00 21 34 M 182 6.89<br />
6 16 F 157 4.10 22 43 M 184 6.90<br />
7 14 F 165 4.46 23 35 M 174 7.00<br />
8 16 M 152 4.55 24 39 M 177 7.20<br />
9 35 F 177 4.83 25 43 M 183 7.30<br />
10 33 F 158 5.10 26 37 M 175 7.65<br />
11 40 F 166 5.44 27 32 M 173 7.80<br />
12 28 F 165 5.50 28 24 M 173 7.90<br />
13 23 F 160 5.73 29 20 F 162 8.05<br />
14 52 M 178 5.77 30 25 M 180 8.10<br />
15 46 F 169 5.80 31 22 M 173 8.70<br />
16 29 M 173 6.00 32 25 M 171 9.45<br />
Step 1: First look at some plots in order to gain an<br />
underst<strong>and</strong>ing <strong>of</strong> the data<br />
1. Plot each predictor variable against the<br />
outcome.<br />
Relationship between total lung capacity and age<br />
[Figure: scatter plot of total lung capacity (l) against age (yrs).]<br />
It appears that total lung capacity is not affected<br />
by age.<br />
It appears total lung capacity increases as height<br />
increases.<br />
The effect <strong>of</strong> sex is not clear.<br />
Step 2: Fit (in R-cmdr) Simple Linear<br />
Regression models for each predictor variable.<br />
1. Age alone:<br />
TLC = 5.07 + 0.036 × age<br />
If age increases by one year, TLC increases<br />
by 0.036 litre (which is not significant if<br />
tested).<br />
2. Height alone:<br />
TLC = −9.74 + 0.095 × height<br />
If height increases by 1 cm, TLC increases<br />
by 0.095 litre (which is significant if tested).<br />
Step 3: Fit (in R-cmdr) Multiple Linear<br />
Regression Model.<br />
3. Age <strong>and</strong> height<br />
From the regression equation for the model including<br />
age and height, the predicted TLC for someone<br />
aged 25 and with a height of 160 cm is:<br />
TLC = -11.218 – 0.030 × 25 + 0.108 × 160<br />
= 5.322 litres<br />
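A quick plain-Python version of this prediction; note that with the rounded coefficients shown above the arithmetic gives about 5.31 litres, while the 5.322 in the notes reflects the unrounded R-cmdr coefficients:

```python
def predict_tlc(age, height):
    """Predicted TLC (litres) from the age + height model, rounded coefficients."""
    return -11.218 - 0.030 * age + 0.108 * height

print(round(predict_tlc(25, 160), 3))  # 5.312 with the rounded coefficients
```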
Regressions which include binary (e.g. sex)<br />
predictor variables<br />
The predictor variable, SEX, has two categories<br />
only, female <strong>and</strong> male. We need a technique for<br />
including such binary variables in the regression<br />
models.<br />
Define a dummy variable (D) as follows:<br />
D = 0 if female, 1 if male<br />
If there are two other predictors X₁ and X₂ then<br />
we fit the model<br />
y = β₀ + β₁x₁ + β₂x₂ + β₃d + ε<br />
The fitted equation is therefore<br />
ŷ = β̂₀ + β̂₁x₁ + β̂₂x₂ + β̂₃d<br />
We find estimates β̂₀, β̂₁, β̂₂ and β̂₃ by minimising<br />
the sum of squared residuals as before (using the<br />
computer).<br />
4. Model with age, height <strong>and</strong> sex.<br />
Model interpretation:<br />
* TLC decreases with increasing age.<br />
For a person 10 years older, the predicted TLC will<br />
be 0.25 litres lower.<br />
* TLC increases with increasing height.<br />
For a person 10 cm taller, the predicted TLC will<br />
be 0.9 litres higher.<br />
* Males have higher TLC than females:<br />
For males, the predicted TLC is 0.697 litres higher<br />
than for females with the same age and height.<br />
For females, sex = 0,<br />
so TLC = −8.54 − 0.025 age + 0.0895 height + 0.697 × 0<br />
For males, sex = 1,<br />
so TLC = −8.54 − 0.025 age + 0.0895 height + 0.697 × 1<br />
Therefore, the difference in average TLC between<br />
males and females is 0.697.<br />
Note: compare this to the crude difference in mean<br />
TLC between males <strong>and</strong> females<br />
It is 6.98 − 5.20 = 1.78 litres,<br />
where 6.98 and 5.20 are the male and female<br />
averages.<br />
Some <strong>of</strong> this difference between males <strong>and</strong><br />
females can be explained by differences in age<br />
<strong>and</strong> height.<br />
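The crude difference can be checked directly from the data table above (plain Python; the adjusted difference of 0.697 litres is the sex coefficient from the fitted model):

```python
# TLC values (litres) read from the table of 32 heart-lung transplant recipients
female_tlc = [3.40, 3.41, 3.90, 4.00, 4.10, 4.46, 4.83, 5.10,
              5.44, 5.50, 5.73, 5.80, 6.30, 6.55, 6.60, 8.05]
male_tlc = [3.80, 4.55, 5.77, 6.00, 6.62, 6.89, 6.90, 7.00,
            7.20, 7.30, 7.65, 7.80, 7.90, 8.10, 8.70, 9.45]

crude_diff = sum(male_tlc) / len(male_tlc) - sum(female_tlc) / len(female_tlc)
print(round(crude_diff, 2))  # 1.78, versus the adjusted difference of 0.697
```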
Overall, how well does the model fit?<br />
The analysis of variance is:<br />
1. The regression effect has 3 degrees <strong>of</strong><br />
freedom since there are 3 predictor variables<br />
in the model.<br />
2. The ANOVA table shows the ‘usefulness’ <strong>of</strong><br />
the linear regression model – we want the p-<br />
value to be < 0.05.<br />
Here, p-value = 0.000, implying that at least<br />
one <strong>of</strong> the explanatory variables has a<br />
significant linear relationship with the<br />
outcome variable.<br />
3. The strength <strong>of</strong> the relationship between<br />
TLC <strong>and</strong> the three predictors can be<br />
expressed as the proportion <strong>of</strong> the total SS<br />
explained by the regression equation.<br />
The coefficient <strong>of</strong> determination is:<br />
R 2 = 44.305/81.712 = 54.2%<br />
Thus, 54.2% <strong>of</strong> the total sum <strong>of</strong> squares<br />
(variation) is explained by age, height <strong>and</strong> sex<br />
together.<br />
Notice how the value <strong>of</strong> R 2 has increased from<br />
0.510 or 51.0% to the value <strong>of</strong> 0.542 or 54.2%<br />
when all three predictor variables are included.<br />
Are all three variables needed in the model?<br />
There are 3 ways <strong>of</strong> evaluating the importance <strong>of</strong><br />
a variable in the model:<br />
1. Construct a test <strong>of</strong> the null hypothesis that<br />
the regression coefficient = 0.<br />
2. Calculate a 95% confidence interval for the<br />
regression coefficient.<br />
Note: Regardless of whether an additional<br />
variable is significant or not, the real point<br />
at issue is that the other regression<br />
parameters are adjusted for the influence<br />
of these new confounding variables to<br />
produce adjusted tests or confidence<br />
intervals.<br />
Model is<br />
TLC = β₀ + β₁ age + β₂ height + β₃ sex + ε<br />
giving R-cmdr printout as follows:<br />
Std Error is the standard error of the corresponding regression coefficient. (See how the coefficients of age and height change when allowance is made for sex.)
1. Test of the hypothesis H0: β3 = 0
Is the variable sex an important predictor in the model?
T = (β̂3 − 0)/s.e.(β̂3) = (0.697 − 0)/0.499 = 1.396
p-value = 0.174. There is no evidence sex is important in predicting TLC: the coefficient is not significantly different from 0.
(Note: the t-test has 28 degrees of freedom, the DF of the residual (error) effect.)
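The t statistic is just the estimate divided by its standard error; a quick recomputation from the printout values (the notes' 1.396 reflects rounding of the unrounded estimates):

```python
# t statistic for H0: beta3 = 0, using the rounded printout values
beta3_hat = 0.697   # estimated coefficient of sex
se_beta3 = 0.499    # its standard error

t = (beta3_hat - 0) / se_beta3
print(round(t, 3))  # 1.397 (printout: 1.396, from unrounded values)
```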
Test of the hypothesis H0: β1 = 0
Age: t = −0.025/0.024 = −1.063
with 28 degrees of freedom (residual DF)
p-value = 0.297
No evidence age affects TLC.
Test of the hypothesis H0: β2 = 0
Height: t = 3.647 (p-value = 0.001)
Strong evidence height is important in predicting TLC.
2. Calculating a confidence interval for a regression parameter
A true parameter βi is estimated by β̂i. For sex, the parameter estimates the difference in average TLC between males and females after taking into account age and height.
The C.I. for βi is: β̂i ± t28 × s.e.(β̂i)
For sex, this becomes
0.697 ± t28(0.499)
where t28 = 2.048 for a 95% confidence interval.
That is 0.697 ± 1.022, i.e. (−0.326, 1.720).
This includes zero, so there is no evidence of a difference in average TLC between men and women.
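This adjusted interval can be reproduced from the printout values (a sketch; small differences from the quoted (−0.326, 1.720) come from rounding in the coefficient and standard error):

```python
beta3_hat = 0.697   # coefficient of sex
se_beta3 = 0.499    # its standard error
t28 = 2.048         # 95% two-sided critical value, 28 df

half_width = t28 * se_beta3
ci = (beta3_hat - half_width, beta3_hat + half_width)
print(tuple(round(v, 3) for v in ci))  # approximately (-0.325, 1.719)
```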
Note:
The above interval is called an adjusted confidence interval. Recall the unadjusted difference in means was −1.78. The unadjusted 95% confidence interval for the true difference in mean TLC between males and females is (−2.77, −0.79).
Adjusting for age and height has removed the statistically significant association between sex and TLC.
95% confidence interval for the coefficient of age:
−0.025 ± t28(0.024), or (−0.073, 0.023)
95% confidence interval for the coefficient of height:
0.0895 ± t28(0.025), or (0.039, 0.140)
Note the correspondence between the 95% confidence interval and the t-test carried out at the 0.05 (2-sided) significance level.
Note:
(i) The effect of sex was contained in the residual when TLC was expressed in terms of age and height only. The effect of the residual was therefore greater.
(ii) The real effect of interest can be hidden by residual variability; reducing this residual variability by including more predictors in the model can improve the analysis (and therefore the study). The p-values associated with hypothesis tests for the parameters of interest will generally be smaller.
(iii) Confounders can affect the parameter estimates of the predictor variables of interest as well as the residual variability. Therefore, including confounders in the model is important for obtaining valid estimates of the coefficients of interest, regardless of the reduction in residual variability.
Checking the fit of the model
We do not expect our model to be correct. We want it to capture the important aspects of the process under investigation, but also to simplify things enough to aid understanding. Choosing an appropriate model is a complex art which is covered more fully in higher-level courses on regression. Here we consider some basic principles.
Rule of thumb:
We should not perform a multiple linear regression analysis if the number of variables in the model is greater than the number of individuals divided by 10.
Residual plots
1. The residuals associated with each data value should be normally distributed with mean = 0 and constant variance. (In R-cmdr we can save the residuals for subsequent plotting, e.g. a normal probability plot.)
2. The printouts also identify any unusual data point which has a very large residual. The residuals can be standardised to have mean zero and standard deviation one, so that unusual cases can be seen clearly. (One of the options in R-cmdr is to save the standardised residuals.)
(1) Checking the normality assumption for the residuals
The matching histogram will present the usual bell-shaped pattern for the 32 residuals.
The points in the normal P-P plot lie along a straight line, confirming the distribution of the residuals is close to normal.
Two extreme points correspond to:
i) female, aged 20, height 162 cm: predicted value from the model is 5.46 and actual TLC is 8.05;
ii) male, aged 25, height 171 cm: predicted TLC from the model is 6.84, actual TLC is 9.45.
(2) Plot of residuals vs independent variables
Residuals versus age plot
This plot identifies the negative residuals for the people under 20 years and also shows the two large outliers. Otherwise the plot is reasonably random about zero.
Residuals versus height plot
Again the plot has negative residuals for the shorter people and identifies the two large outliers. These plots indicate special thought should be given to whether the young people should be retained in the analysis.
Analysis of Covariance
This analysis uses a multiple regression to compare simple regressions corresponding to the categories of a qualitative explanatory variable.
Example: A study investigates the effect of a treatment for hypertension on systolic blood pressure (BP) compared with a control treatment. Age was also known for all patients, and it was thought that age might confound the differences in BP between the groups.

TREATMENT          CONTROL
BP(Y)   AGE(X)     BP(Y)   AGE(X)
120     26         109     33
114     37         145     62
132     31         131     54
130     48         129     44
146     55         101     31
122     35         115     39
136     40         133     60
118     29         105     38

Control mean = 121.00 mm (of mercury)
Treatment mean = 127.25 mm (of mercury)
But note:
average age of control group = 45.13 years
average age of treated group = 37.63 years
[A] First, an ordinary unpaired t-test is performed on the BP(Y) values using the pooled variance of the Y values.
Analyze > Compare Means > Independent-Samples t-test
y is the test variable and d is the grouping variable; d is 0 for control and 1 for treatment.
There is no evidence of a difference between the two means as t = −0.932. The 95% confidence interval for μT − μC is (−8.1, 20.6), which includes 0, confirming no evidence of a difference between the means. Also p-value = 0.367.
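The unpaired t-test in [A] can be reproduced from the raw BP values in the table (a plain-Python sketch; the critical value t14 = 2.145 comes from tables, and the sign of t depends on which group mean is subtracted):

```python
import math
from statistics import mean

# BP values from the table above
treatment = [120, 114, 132, 130, 146, 122, 136, 118]
control = [109, 145, 131, 129, 101, 115, 133, 105]

m_t, m_c = mean(treatment), mean(control)   # 127.25 and 121.00

# pooled variance from the within-group sums of squares (14 df)
ss = sum((x - m_t) ** 2 for x in treatment) + sum((x - m_c) ** 2 for x in control)
df = len(treatment) + len(control) - 2
se_diff = math.sqrt((ss / df) * (1 / len(treatment) + 1 / len(control)))

t = (m_t - m_c) / se_diff                   # about 0.93 in magnitude
t14 = 2.145                                 # 95% two-sided critical value
ci = (m_t - m_c - t14 * se_diff, m_t - m_c + t14 * se_diff)   # about (-8.14, 20.64)
```

Note that `ss` comes out as 2519.5, the residual sum of squares quoted later for the regression of Y on d alone.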
At this stage the ages have been ignored. Age could be increasing the residual variation, hiding the true treatment difference, i.e. age could be a confounder.
[B] Second, a regression analysis of Y on d is performed, where d = 0 for the control and d = 1 for the treatment.
Analyze > Regression > Linear
(Again, the 16 Y values are in one column and the values of d in a second column.)
The estimated regression equation is
ŷ = 121 + 6.25d
The estimated coefficient for d is 6.25 with a standard error of 6.708. Note that when d = 0, ŷ = 121.00 and when d = 1, ŷ = 127.25, so the coefficient of d is the difference between the two means. The 95% confidence interval for the treatment difference is
6.25 ± t14(6.708), where t14 = 2.145
giving 6.25 ± 14.39, or (−8.14, 20.64)
as before. This regression is equivalent to the unpaired t-test. The age variable effect remains hidden in the residual.
Note: the confidence interval can also be obtained on the printout if requested.
[C] Third, a regression analysis of Y on X and d together is performed, where d = 0 for control, otherwise 1.
Analyze > Regression > Linear
(Values of X are now in a third column.)
The estimated regression equation is
ŷ = 73.9 + 1.04x + 14.1d
The estimated coefficient of d is now 14.082 with a standard error of 3.818. The coefficient of d represents the difference between patients of the same age, one in the control and one in the treated group.
e.g. Let X = xk be the age of two such patients. Then
ŷT − ŷC = (73.9 + 1.04xk + 14.1) − (73.9 + 1.04xk + 0) = 14.1
The 95% confidence interval for the difference is
14.082 ± t13(3.818), where t13 = 2.160
giving 14.082 ± 8.247, or (5.84, 22.33)
Now there is evidence that the treatment raises blood pressure, as 0 is excluded from the confidence interval. The 13 DF are n − 3, namely those of the residual.
Also note that the t-test value associated with d is 3.69 with a p-value of 0.003.
Also note how the effect of age has effectively been removed from the residual, whose sum of squares is substantially reduced from 2519.5 to 669.4.
The value of R² has risen from 0.058 or 5.8% to 0.750 or 75% when X is added to the model involving d only.
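The adjusted coefficients in [C] can be recovered from the raw data by solving the normal equations of the least-squares fit (a self-contained sketch; in practice the linear-model routine in R-cmdr does this internally):

```python
# Least-squares fit of y = b0 + b1*x + b2*d via the normal equations
bp  = [120, 114, 132, 130, 146, 122, 136, 118,   # treatment (d = 1)
       109, 145, 131, 129, 101, 115, 133, 105]   # control   (d = 0)
age = [26, 37, 31, 48, 55, 35, 40, 29,
       33, 62, 54, 44, 31, 39, 60, 38]
d   = [1] * 8 + [0] * 8

X = [[1.0, x, g] for x, g in zip(age, d)]        # design matrix rows

# Normal equations: (X'X) b = X'y
xtx = [[sum(r[i] * r[j] for r in X) for j in range(3)] for i in range(3)]
xty = [sum(r[i] * y for r, y in zip(X, bp)) for i in range(3)]

# Solve the 3x3 system by Gauss-Jordan elimination (pivots are nonzero here)
m = [row + [rhs] for row, rhs in zip(xtx, xty)]
for i in range(3):
    piv = m[i][i]
    m[i] = [v / piv for v in m[i]]
    for j in range(3):
        if j != i:
            factor = m[j][i]
            m[j] = [vj - factor * vi for vj, vi in zip(m[j], m[i])]

b0, b1, b2 = (m[k][3] for k in range(3))
# b0 ≈ 73.9, b1 ≈ 1.04, b2 ≈ 14.08, matching the printout
```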
The confidence interval here is the ADJUSTED CONFIDENCE INTERVAL after allowing for the effect of age.
Unadjusted interval: (−8.14, 20.64)
Adjusted interval: (5.84, 22.33)
It is helpful to put a geometrical interpretation on this analysis. The scatter diagram of Y(BP) against X(age) follows for all 16 patients.
[Scatter plot of Y(BP), 100–150, against X(Age), 25–65, for all 16 patients.]
Notice the difference between the treated group (dots) and the control group (crosses).
Suppose we fit the equation (by least squares)
ŷ = β̂0 + β̂1x + β̂2d
where d = 0 for control, d = 1 for treatment.
If d = 0, ŷ = β̂0 + β̂1x
If d = 1, ŷ = β̂0 + β̂1x + β̂2 = (β̂0 + β̂2) + β̂1x
These two lines are PARALLEL (same slope β̂1) but the intercepts are β̂0 and (β̂0 + β̂2). Thus, β̂2 is the vertical distance between the two parallel straight lines.
[Scatter plot of Y(BP) against X(Age) with the two parallel fitted lines; the vertical gap between the lines is β̂2.]
β̂2 is the effect of the treated group relative to the control. If β̂2 is significant, then there is evidence of different blood pressure values in the two groups. We see how to test β̂2 for significance shortly. The next printout gives Y regressed on X only, and Y regressed on X and d together.
Notes:
1. β̂2 (the coefficient of d) = 14.082 is the increase in blood pressure level due to administering the treatment (regardless of the age of a patient, since the two lines, being parallel, have a constant difference).
2. The 95% confidence interval for the coefficient of d (namely β2) is
14.082 ± t13(3.818), where t13 = 2.160
It follows that 5.84 < β2 < 22.33.
3. Without taking age into account, the treatment raised blood pressure by only 6.25 mm of mercury. Taking age into account, the treatment raised blood pressure by 14.082 mm.
4. The ordinary unpaired t-test originally suggested for this problem is equivalent to regressing Y on d alone. In this case, the variable x (or age) remains as part of the residual, which is therefore inflated, hiding the true treatment effect. In addition, correlation between age and treatment group distorts the estimate of the treatment effect on blood pressure.
Binary outcomes: Logistic Regression
Recall: for simple and multiple linear regression the outcome variable was continuous. What do we do if the outcome variable Y is binary?
e.g. disease present: yes/no
e.g. tuatara: present/absent
e.g. claim to ACC goes to litigation: yes/no
e.g. depression (yes/no) in 18-year-olds if bullied at school earlier
We use logistic regression (LR).
In a logistic regression the explanatory or predictor X variables can be either continuous or categorical (e.g. binary).
Like multiple regression, we can use logistic regression to:
(1) control for confounding;
(2) investigate the effect of several variables on the outcome variable at one time.
We can use the method of LR with data from any study type as long as we have a binary outcome.
The logistic regression model is:
ln(p/(1 − p)) = β0 + β1X1 + β2X2 + … + βkXk + ε
where
Y is the binary outcome variable (values 0 or 1)
p is the probability that a particular event will occur, i.e. Pr(Y = 1)
X1, X2, ..., Xk are the explanatory variables
β0 is the intercept
β1, β2, ..., βk are the regression coefficients
ε is the random error
Interpreting the model:
p/(1 − p) is the ‘odds’ of the event occurring
ln(p/(1 − p)) is the ‘log odds’
The regression coefficient βi represents the change in the log odds for a 1-unit change in Xi.
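The probability–odds–log-odds relationship can be illustrated numerically (a small sketch with a hypothetical probability of 0.25):

```python
import math

p = 0.25                       # hypothetical probability of the event
odds = p / (1 - p)             # 0.25/0.75 = 1/3
log_odds = math.log(odds)      # this is what the linear predictor models

# inverting the log odds recovers the probability
p_back = math.exp(log_odds) / (1 + math.exp(log_odds))
```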
Fitted logistic model:
The formulae to estimate the values of β0, β1, etc. are computationally complex. We shall not worry about the details here; instead we shall focus on understanding the results from a logistic regression R-cmdr printout.
Example:
A study was conducted to investigate the relationship between physical inactivity and myocardial infarction (MI). It was found that people who were physically inactive had an increased risk of MI. Age was considered to be a potential confounder.
Compared to younger people, older people:
• are more likely to be physically inactive;
• have a higher risk of MI.
Hence, we would expect that age can explain some of the association between physical inactivity and MI.
Outcome: whether a person has an MI (Y), where Y = 0 or 1
Exposure of interest: whether a person was physically inactive (exposure variable, X1)
Possible confounder: age (X2) of the person
(1) Investigating the relationship between physical inactivity and MI
Option 1: Calculate the odds ratio as shown earlier in the semester.
The 2 × 2 contingency table for outcome and exposure is constructed from the 924 people.

                       Outcome – MI
Exposure (X1)          Yes    No
Physically inactive    136    98
Physically active      343    347

Odds ratio of MI in exposed to unexposed:
OR = (136/98)/(343/347) = 1.40
with 95% confidence interval 1.04 < OR < 1.89.
Interpretation: the odds of having an MI are 40% higher for a person who is physically inactive compared to a physically active person. The result is significant.
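The odds ratio and its interval can be recomputed from the 2 × 2 table (a sketch; the interval assumes the usual log-odds-ratio standard error, which reproduces the quoted (1.04, 1.89)):

```python
import math

# 2x2 table: MI yes/no by physically inactive/active
a, b = 136, 98     # inactive: MI yes, MI no
c, d = 343, 347    # active:   MI yes, MI no

odds_ratio = (a / b) / (c / d)                 # about 1.40

# 95% CI on the log scale, then back-transformed
se_log_or = math.sqrt(1/a + 1/b + 1/c + 1/d)
lo = math.exp(math.log(odds_ratio) - 1.96 * se_log_or)
hi = math.exp(math.log(odds_ratio) + 1.96 * se_log_or)
# (lo, hi) is about (1.04, 1.89)
```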
Option 2: Alternatively, we can fit a logistic regression model using R-cmdr.
Y = MI                     1 = Yes, 0 = No
X1 = Physically inactive   1 = Yes, 0 = No
Fitted regression model:
ln(p̂/(1 − p̂)) = β̂0 + β̂1X1
where p̂ is the probability that a person has an MI.
R-cmdr commands:
Analyze > Regression > Binary Logistic
Dependent: enter MI
Covariate: enter Physical Inactivity. OK.
Results from R-cmdr:
Fitted equation is ln(p̂/(1 − p̂)) = −0.01 + 0.34X1
Odds ratio = 1.40 as before
95% confidence interval for OR is (1.04, 1.89)
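The odds ratio in the printout is just the exponentiated coefficient of X1 (a one-line check using the rounded coefficient):

```python
import math

b1 = 0.34                    # coefficient of physical inactivity from the printout
odds_ratio = math.exp(b1)
print(round(odds_ratio, 2))  # 1.4
```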
BUT what about the potential confounding effect of age? How can we control for that?
Note: the odds ratio calculated previously is a crude odds ratio; it (and its corresponding 95% confidence interval) is not adjusted for the potential confounder age.
To control for age, we include age as a second explanatory variable in our logistic regression.
(2) Investigating the relationship between physical inactivity and MI, adjusting (controlling) for age
Now add age (X2) to the regression in order to obtain the adjusted OR and its 95% confidence interval.
Y = MI                     1 = Yes, 0 = No
X1 = Physically inactive   1 = Yes, 0 = No
X2 = age
Results from R-cmdr:<br />
The fitted regression is
ln(p̂/(1 − p̂)) = −0.41 + 0.17X1 + 0.68X2
This leads to the age-adjusted odds ratio of 1.19, which has 95% confidence interval (0.87, 1.62). These values are read from the printout and compare with the crude ratio of 1.40 with confidence interval (1.04, 1.89).
Conclusion: after adjusting for age, the OR decreased from 1.40 to 1.19. Therefore, age was making the association between physical inactivity and MI appear more extreme than it actually was.
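Again the adjusted OR is the exponentiated coefficient of X1, now from the larger model (a quick check; 1.19 in the notes is this value rounded):

```python
import math

b1_adj = 0.17               # coefficient of inactivity after adding age
or_adj = math.exp(b1_adj)
print(round(or_adj, 2))     # 1.19
```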
SECTION 11
Study design principles, critical appraisal, sources of bias and confounding.
415<br />
Section 11
Study Design and Critical Appraisal
Research process:
1. Development of research question
2. Design of study
3. Collection of information
4. Description of data
5. Interpretation of results
Study design
• Study design refers to the methods used to select the study participants, control any experimental conditions, and collect the information.
• Interpretation of results depends on the study design.
• The study design should be tailored to the research question.
• Methods of statistical analysis and information produced will depend on the study design.
“The data from a good study can be analysed in many ways, but no amount of clever analysis can compensate for problems with the design of the study.” Altman.
Critical appraisal
Critical appraisal is the process of reviewing a study with the goal of identifying its strengths and weaknesses, the major results, and its broader implications.
Why teach study design and critical appraisal?
• It is not possible to sensibly interpret the results of statistical analysis without understanding the context in which, and methods with which, the data were collected.
• Health sciences practice and policy need to be based on sound evidence (as far as possible).
• Poorly conducted research should not influence policy or practice.
• Because even well-conducted research is not perfect, it is necessary to understand the nature of evidence so that you can begin to learn to interpret research findings for yourselves.
• For this you need to gain an understanding of the scientific method as used in the health sciences.
• This understanding is enhanced by learning to critique research.
Outline of next four lectures
1. Introduction to critical appraisal (lecture 1)
• process for critical appraisal
• structure of a research paper
2. Design and appraisal of surveys (lecture 1)
• review of surveys
• internal validity
  – bias
  – chance
• external validity
• example
3. Design and appraisal of analytic studies (lectures 2 – 4)
• review of analytic study designs
• internal validity
  – bias
  – confounding
  – chance
• external validity
• causation
• examples: randomised controlled trials, cohort studies, case-control studies
1. Introduction to critical appraisal
Guideline for critical appraisal
Study summary
What were the study objectives?
Why was the study necessary?
What type of study design was used?
How were the participants selected?
What information was collected?
What were the key results?
Internal validity
What do the findings of the study tell us about the population studied?
External validity / Generalisability
Can the findings of the study be applied to other populations?
Causation (for analytic studies only)
Implications
What are the implications of the study?
Structure of a scientific paper
Abstract or summary
• usually contains the key results of the study.
Introduction
• gives the background, necessity and objectives.
Methods
• summarises the study design, including the source of participants and the methods used to collect data.
Results
• description of the study participants, including response rates.
• summary of analyses.
Discussion
• provides the authors’ views of the internal and external validity of the study, and their conclusions about the implications of the study.
2. Design and appraisal of descriptive studies
Aim: to describe characteristics of a group or groups of people at a given point in time.
Generally, a sample is taken from the population and the distribution of variables within that sample is described.
Examples: a descriptive study can be used to
• describe characteristics of a group of people, e.g. prevalence of asthma, prevalence of smoking, average cholesterol level.
• find out people’s opinions and attitudes, e.g. attitudes to alternative health care; satisfaction with health care delivery.
• find out the extent of people’s knowledge, e.g. knowledge of risk factors for melanoma, risk factors for coronary heart disease.
• compare subgroups, e.g. comparison of attitudes of men and women to alternative health care; comparison of prevalence of smoking among different ethnic groups in NZ.
A descriptive study is concerned with, and designed only to describe, the existing distribution of variables, without regard to causal or other hypotheses.
Descriptive studies can generate hypotheses.
Descriptive studies are often called surveys or cross-sectional studies.
Descriptive studies generally use a sample from a population.
[Diagram: a sample is drawn from the underlying population (parameters e.g. μ, π); statistics (e.g. x̄, p) calculated from the sample are used for inference back to that population (internal validity); whether the findings carry over to other populations is a question of external validity.]
Recall
Suppose we want to estimate the mean cholesterol in the population:
sample mean = population mean + “error”
where the “error” comprises systematic error (bias) and random variation.
Random error (chance):
• due to natural biological variability.
• increasing the sample size will reduce the random fluctuations in the sample mean.
Systematic error (= bias):
• due to aspects of the design or conduct of the study which systematically distort the results.
• occurs if a sample is not representative of the population.
• cannot be reduced by increasing the sample size.
Internal validity for descriptive studies
• bias
• chance (random error)
Bias
Selection bias
• systematic error arising from the way people are selected for the study.
• includes biases from sample selection and from non-response to the study.
Information bias
• systematic error arising from the way information was collected from the study participants.
Chance
• Confidence intervals around estimates indicate the degree of precision with which the sample value estimates the population value.
Selection bias
• systematic error arising from the way people are selected for the study.
• includes biases from sample selection and from non-response to the study.
Questions to ask:
• Is the sample representative of the population?
• What was the response rate?
Example: a study was conducted to estimate the prevalence of smoking among males and females in NZ.
Design:
A random sample of households was selected using random digit dialling. If a call was not answered, the machine automatically went on to the next number. All interviews were conducted from 8 am – 5 pm (weekdays only).
63% of people agreed to participate in the study.
Information bias
• systematic error arising from the way information was collected from the study participants.
Question to ask:
Is the information gathered correct?
Example: suppose an investigator wished to estimate the prevalence of depression in NZ. To do this, he carried out face-to-face interviews around the country with a random sample of adults. Can you think of how information bias might enter into his study?
Example
Life in New Zealand Survey, Hillary Commission for Recreation and Sport, 1990, David Russell and Noela Wilson.
Objectives
• to provide a snapshot of New Zealanders from a health perspective.
• included questions on physical activity, leisure patterns, dietary habits and other risk factors for disease.
Necessity for the study
• the study provides a benchmark for comparison in future years.
• the information is useful for generating hypotheses and for designing interventions to improve health.
Type of study design
• survey of New Zealanders 15 years and over.
• carried out April 1989 – May 1990.
Selection of participants
• over 18 years:
  – selected from electoral rolls.
  – each month 10 people were selected at random from each of the 97 electoral rolls, plus 19 from each of the 4 Maori rolls.
• 15 – 18 years:
  – a snowball sample was used.
  – people already selected were asked to identify up to 5 people aged 15 – 18.
• total number selected: 12,463.
427<br />
Section 11
Results
Physical activity
Activity level (% of respondents)

             low   moderate   high
Male
  15 – 18     17      20       64
  19 – 24     23      27       51
  25 – 44     34      31       35
  45 – 64     50      34       16
  64+         58      39        3
  All         37      31       32
Female
  15 – 18     24      22       54
  19 – 24     30      30       40
  25 – 44     20      53       26
  45 – 64     25      64       11
  64+         34      63        3
  All         25      51       24

Can you summarise these results?
Internal validity
Bias
Selection bias:
• random sampling was used for those 18 and over.
• bias from the snowball sample (note multiple starting points based on random sampling).
• response rate.
Information bias:
• questionnaire.
• accuracy of recall.
• tendency to report what people think the researchers will want to see.
Chance
• the study is large, so the confidence intervals for overall proportions will be fairly narrow, but for smaller subgroups the proportions may not be so well estimated.
e.g.: women aged 64+, n = 814
proportion with low activity level = 34%, CI = (30.8 to 37.3)
proportion with high activity level = 3%, CI = (2.0 to 4.5)
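The intervals quoted above can be reproduced approximately. Below is a minimal sketch (function names are mine, not from the course): the simple normal-approximation (Wald) interval comes close for the 34% estimate, while for the 3% estimate, where the Wald interval is known to behave poorly, the Wilson score interval lands much nearer the printed (2.0 to 4.5) — the slide's intervals were presumably computed with a method of that kind.

```python
import math

def wald_ci(p, n, z=1.96):
    """Normal-approximation (Wald) 95% CI for a proportion."""
    se = math.sqrt(p * (1 - p) / n)
    return p - z * se, p + z * se

def wilson_ci(p, n, z=1.96):
    """Wilson score 95% CI; better behaved when p is near 0 or 1."""
    centre = (p + z**2 / (2 * n)) / (1 + z**2 / n)
    half = (z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
            / (1 + z**2 / n))
    return centre - half, centre + half

# Women aged 64+, n = 814 (from the slide above)
print(wald_ci(0.34, 814))    # ~ (0.307, 0.373), close to the printed (30.8 to 37.3)
print(wilson_ci(0.03, 814))  # ~ (0.020, 0.044), close to the printed (2.0 to 4.5)
```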
External validity
Are the results applicable to other populations?
• this calls for a judgement as to whether the other populations are likely to be similar to New Zealand in terms of their exercise patterns.
Implications
• high activity levels are the levels recommended to maintain cardio-respiratory fitness.
• programmes to increase activity levels may be useful in preventing cardiovascular disease.
• efforts to increase the activity levels of men over the age of 45 may be particularly useful.
Study design and critical appraisal sessions: 2
1. Introduction to critical appraisal (lecture 1)
• process for critical appraisal
• structure of a research paper
2. Design and appraisal of surveys (lecture 1)
• review of surveys
• internal validity
- bias
- chance
• external validity
• example
3. Design and appraisal of analytic studies (lectures 2 – 4)
• review of analytic study designs
• internal validity
- bias
- confounding
- chance
• external validity
• causation
• examples: randomised controlled trials, cohort studies, case-control studies
3. Design and appraisal of analytic studies
Review of analytic study designs
Purpose
To test hypotheses regarding
• causes of disease
• disease prevention strategies
• effectiveness of treatments
Examples:
• Is a statin drug more effective than a diet high in plant sterols, soy proteins and almonds in reducing serum cholesterol levels?
• Do people who are physically inactive have an increased risk of developing colon cancer?
When we are conducting an analytical study we are studying associations among two or more variables. We will have:
• an outcome variable (eg …)
• exposure variables (eg …)
• confounding variables – these are variables which distort the association of interest (eg age)
Types of design:
• experimental (intervention)
- e.g. randomised controlled trials.
• observational
- e.g. case-control studies, cohort studies.
Key features of common designs
Randomised controlled trials
• people are assigned to an intervention or control group using random allocation, then followed up over a period of time.
Cohort studies
• participants are selected before they develop disease.
• exposure status is measured, and they are followed up over a period of time.
Case-control studies
• two groups of people are chosen: a group with disease (cases) and a group without disease (controls).
• information is collected from both groups about exposures that occurred in the past.
Key ideas:
• control (or comparison) groups are essential.
• experimental studies provide much stronger tests of hypotheses than observational studies.
• experimental studies allow testing of causal relationships.
• with observational studies it is much harder to isolate the effects of the exposure of interest, so much harder to determine whether an association is causal.
Example
Does smoking cause coronary heart disease?
1. Estimate the association between smoking and coronary heart disease (eg relative risk).
2. Does this relative risk represent the true association between smoking and CHD in the population studied (internal validity)?
If yes:
3. Can this result be generalised to other populations (external validity)?
4. Is the association causal?
Internal validity
Does the observed association represent the true association?
Specifically: what are the possible explanations for the observed results?
• bias
• confounding
• chance
• true relationship
Assessing internal validity:
Bias
Selection bias – systematic error arising from the way participants are selected for inclusion in the study.
In an analytic study, selection bias occurs if the selection processes cause a systematic difference between the groups of people selected for the study.
It includes bias from non-response.
Information bias – systematic error arising from the way study information is obtained, interpreted and recorded.
In an analytic study, information bias is a particular problem if there are systematic differences in the information obtained from the different groups of people in the study.
Information bias may be introduced by the:
• observer
• study individual (respondent)
• instruments used to collect the data (e.g. a badly-designed questionnaire)
Example
Case-control study to examine the relationship between stress and coronary heart disease:
cases: people with coronary heart disease identified through opportunistic screening by GPs
controls: random sample from the population
Information on stress was collected through a structured interview.
Selection bias:
Information bias:
Evaluation and control of bias
• Statistical methods cannot control for bias in the selection of subjects or in the measurement of the variables of interest. Control of bias can only be done during the design and data collection phases of the study.
• General inaccuracy which is the same in both groups generally results in an underestimate of the true association.
• If inaccuracy is different in the two groups, the association can be over- or under-estimated.
• It is important to identify sources of bias and estimate the magnitude and direction of their effect on the association.
Confounding
A distortion of the association between exposure and disease caused by the presence of a third factor.
• A confounder is a variable which causes this distortion.
• To be a confounder a variable must be:
- associated with the exposure (independent of disease);
- associated with the disease (independent of exposure);
- not just an intermediate link in the causal chain.
Example of confounding:
A study was conducted to investigate the relationship between coffee consumption and oral cancer. It was found that coffee drinkers had an increased risk of oral cancer. Smoking is a potential confounder in this study.
Compared to non-smokers:
• smokers are more likely to drink coffee;
• smoking is an independent risk factor for oral cancer.
Hence, the observed association may be due to smoking habits rather than coffee drinking.
Can you think of any other potential confounders?
Example of non-confounding:
diet → cholesterol level → coronary heart disease
In this case, the raised cholesterol levels are likely to be due in part to diet, so are part of the causal pathway. Therefore in studies of diet and coronary heart disease raised cholesterol would not be considered a confounder.
Example of a confounder in a cohort study:
Results from a cohort study investigating the relationship between myocardial infarction and exercise.

Table A: all subjects (n=8000 person-years)
                Myocardial    Person-    Incidence
                infarctions   years      /1000
Low exercise        105        4000       26.25
High exercise        25        4000        6.25
Relative risk = 26.25/6.25 = 4.2

Subgroup analysis
Obese subjects (n=4000 person-years):
Low exercise         90        3000       30.0
High exercise        10        1000       10.0
Relative risk = 3.0
Non-obese subjects (n=4000 person-years):
Low exercise         15        1000       15.0
High exercise        15        3000        5.0
Relative risk = 3.0
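As a check, the crude and stratum-specific rate ratios in the table can be computed directly, and the two strata can be pooled with a Mantel-Haenszel rate-ratio estimator (a standard method, though not named on the slide; this sketch and its variable names are mine):

```python
# Rates per 1000 person-years and rate ratios, from the table above
def rate(events, person_years):
    return 1000 * events / person_years

# Crude (all subjects): 26.25 / 6.25 = 4.2
rr_crude = rate(105, 4000) / rate(25, 4000)

# Stratified by obesity: both strata give 3.0
rr_obese = rate(90, 3000) / rate(10, 1000)
rr_non_obese = rate(15, 1000) / rate(15, 3000)

# Mantel-Haenszel pooled rate ratio across the two strata
# Each stratum: (a, PT1, b, PT0) = (exposed cases, exposed person-time,
#                                   unexposed cases, unexposed person-time)
strata = [(90, 3000, 10, 1000), (15, 1000, 15, 3000)]
num = sum(a * pt0 / (pt1 + pt0) for a, pt1, b, pt0 in strata)
den = sum(b * pt1 / (pt1 + pt0) for a, pt1, b, pt0 in strata)
rr_mh = num / den
print(rr_crude, rr_obese, rr_non_obese, rr_mh)  # 4.2 3.0 3.0 3.0
```

The obesity-adjusted estimate (3.0) differs markedly from the crude 4.2, which is exactly the signature of confounding by obesity.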
Positive and Negative Confounding
Positive confounder – a confounding variable which makes an association look more extreme, or creates a spurious association.
Example: A study was conducted to investigate the relationship between physical inactivity and MI. It was found that people who were physically inactive had an increased risk of MI. Age was considered to be a potential confounder.
Physical inactivity → Myocardial infarction (with age associated with both)
Crude odds ratio = 2.5
But compared to younger people, older people:
• are more likely to be physically inactive.
• have a higher risk of MI.
Hence, age can explain some of the association between physical inactivity and MI.
After "adjusting" for the confounding effect of age the OR decreases to 1.4. So confounding by age is making the association between physical inactivity and MI seem more extreme than it should be, i.e. it is a positive confounder.
Negative confounder – a confounding variable which makes an association look less extreme, or even in the opposite direction. It can mask a real difference.
Example: A study was conducted to investigate the relationship between physical inactivity and MI. It was found that people who were physically inactive had an increased risk of MI. Sex was considered to be a potential confounder.
Physical inactivity → Myocardial infarction (with sex associated with both)
Crude OR = 2.5
But compared to females, males:
• are less likely to be physically inactive.
• have a higher risk of MI.
Hence, sex masks some of the association between physical inactivity and MI.
After "adjusting" for the confounding effect of sex, the OR becomes 3.9.
So confounding by sex makes the association between physical inactivity and MI seem less extreme than it should be, i.e. it is a negative confounder.
Some comments on confounding:
AGE and SEX are the most common confounding variables. This is because these two variables are not only associated with most exposures we are interested in, such as diet, smoking habits etc., but they are also independent risk factors for most diseases.
Control of confounding
Confounders can be controlled for during the study design, during the analysis, or both. The aim is to make the groups being compared as similar as possible with respect to the confounders.
(1) Identify potential confounders. A review of previous literature in the area should give you an idea of potential confounders.
Also ask: What are the known risk factors for the outcome of interest? What factors are associated with the exposure?
Data should be collected on all potential confounders, since if you do not obtain the information you cannot control for it.
(2) Control of confounding during the study design.
Restriction:
• limits participation in a study to specific groups that are similar to each other with respect to the confounder.
e.g. include only non-smokers in a study of exercise and risk of CHD.
• Disadvantages:
- residual confounding if restriction criteria are too wide.
- lack of generalisability.
- smaller number of available participants.
Matching:
• particular subjects are selected in such a way that the potential confounders are distributed in an identical manner among each of the study groups.
Case-control study: matching cases and controls.
Cohort study: matching exposed and unexposed.
• matching needs to be accounted for in the analysis.
Randomisation (see randomised controlled trials).
(3) Control of confounding during the analysis.
Multivariate analysis – multiple regression.
Evaluating confounding
• check for associations between the suspected confounder and both exposure and disease.
• see whether controlling for the confounder affects the association.
Chance
• study design: ensure the study has sufficient power.
• confidence intervals and p-values for the association indicate the role of chance in the study.
• when multiple statistical tests are carried out in a study, there is an increased chance of "false positive" results.
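The last point can be quantified. If each test uses significance level α = 0.05 and, as an idealising assumption, the tests are independent with all null hypotheses true, the chance of at least one false positive grows quickly with the number of tests:

```python
# Chance of at least one "false positive" across k independent tests,
# each at significance level alpha, when every null hypothesis is true.
alpha = 0.05
for k in (1, 5, 10, 20):
    p_any = 1 - (1 - alpha) ** k
    print(f"{k:2d} tests: P(at least one false positive) = {p_any:.2f}")
```

With 10 tests the probability is already about 0.40, and with 20 tests about 0.64.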
Study design and critical appraisal sessions: 3
Randomised controlled trials (RCTs)
Aim: to evaluate the effects of an intervention.
• considered the "gold standard" for evaluation of interventions.
Why?
• allows isolation of the effects of the intervention through controlling the experimental conditions:
- experiment ("trial")
- comparison/control group ("controlled")
- randomisation ("randomised")
Randomisation
• the process for deciding who will get the experimental intervention and who will be the control.
Basic structure of an RCT
• population to be studied
• choice of comparison group
• allocation of subjects to intervention or control group
• choice of outcome measure
Population to be studied:
Usually not a representative sample from the population.
• eg in trials of treatments they will be patients coming to see the doctors who have agreed to take part in the study.
Chosen to maximise internal validity, with some cost in terms of generalisability.
• eg we may choose participants who are likely to be able to complete the requirements of the trial.
Choice of comparison group:
• the control group should provide information on what would have happened without the experimental intervention.
• in trials of disease treatment or prevention the control group should in general receive the best available "standard" treatment.
• sometimes there is no standard treatment or practice, in which case a "placebo" control group may be used.
• "placebos" are substances with no biological effect on the disease process.
• placebos are used to isolate the particular effect of interest from effects that may occur because of people's belief that they are getting a particular intervention.
• use of a placebo allows "blinding" of intervention and control groups, so that the results are not biased through knowledge of who got the new intervention.
Allocation of subjects to treatment groups:
Example: Is the new treatment more effective than the standard treatment?
How would we test this?
(1) We could compare the results of the new treatment on patients with records of previous results from other patients using the old treatment (historical controls).
Do you think this is a good idea?
(2) Ask people to volunteer for the new treatment and give the standard treatment to those who do not volunteer.
Do you think this is a good idea?
(3) Allocate patients to the new treatment or the old treatment using an "objective" method and observe the outcome.
The way in which patients are allocated to treatments can influence the results enormously.
We need a method of allocation to treatments in which the characteristics of subjects will not affect their chance of being put into any particular group – RANDOM ALLOCATION.
Volunteers are assigned to intervention groups using randomisation, then followed up over a period of time.
Randomisation:
• is the best way to control for both known and unknown confounders.
• but does not guarantee control of confounding.
• is ethical when there is genuine uncertainty about whether the new intervention or the comparison strategy is better ("equipoise").
Choice of outcome measure:
• needs to be sensitive to the effects of the intervention.
• early in the process of evaluation, short-term outcomes are used to screen for promising interventions.
• ultimately, we need to demonstrate that the intervention has tangible benefits for the individual and society.
Example: Zidovudine in treatment of people with asymptomatic HIV infection.
Studies found:
• a statistically significant improvement in immune function (measured by CD4 count),
but
• no difference in survival at 3 years.
Randomised controlled trials: example
Nichol et al. "The effectiveness of vaccination against influenza in healthy working adults." New England J. Med (1995).
Objectives
• to clarify the benefits of immunisation for influenza in a population not at high risk for complications.
Background
• most deaths from influenza occur among elderly people, but all age groups are affected.
• influenza accounts for millions of days lost from work each year.
• current recommendations of the US Advisory Committee on Immunisation Practices target persons at increased risk for complications of influenza, although all people who wish to avoid illness are encouraged to consider vaccination.
Type of study
Randomised controlled trial
Selection of participants
• recruited in Minneapolis-St Paul through newspaper advertisements, advertisements at work sites and recruitment sessions at shopping malls.
• aged 18 – 64 years.
• employed full time.
• no medical conditions which would place them at high risk for complications of influenza.
• not allergic to eggs.
• not pregnant or planning pregnancy within 3 months.
• had not had a previous vaccination for influenza.
Information collected
"Exposure" (= treatment):
• influenza group: active vaccine
• placebo group: vaccine diluent
Outcome measures (structured telephone interviews):
• week 1: side effects
• monthly for 4 months:
- occurrence of upper respiratory illness
- use of sick leave
- visits to the doctor
Key results
849 randomised:
placebo n=425, vaccine n=424
Complete follow-up:
placebo n=416 (98%), vaccine n=409 (96%)
Internal validity
Chance
• 95% confidence intervals around the differences exclude zero.
• p-values are small, indicating that differences this large (or larger) are very unlikely to occur by chance if the vaccine is not effective.
• several outcome measures were used, increasing the chance of false positive results, but since the p-values are very small this is not likely to affect the conclusions.
Confounding
Randomisation + intention-to-treat analysis
Intention-to-treat analysis
"once randomised, always analysed"
• the outcome is compared in the group randomised to placebo and the group randomised to vaccine.
• this preserves the control of confounding achieved by randomisation.
Bias
Selection bias is not a problem in randomised controlled trials (see generalisability, though).
Information bias in randomised trials arises from:
• incomplete follow-up of participants
• error in measurement of outcome
Information bias in the vaccine trial:
Completeness of follow-up:
• placebo: 98% (416/425)
• vaccine: 96% (409/424)
Measurement of illness:
• definition of influenza
• recall of symptoms
Blinding
• means participants' experience or recall of symptoms is not affected by knowledge of whether they had the vaccine (single blind).
• people collecting the information from the participants cannot introduce bias through their knowledge of whether or not participants had the vaccine (double blind).
Generalisability
• broad group of working adults
• risk of influenza
• strain of influenza
Implications
• the trial demonstrates that vaccination against influenza can be effective in reducing symptoms, sick leave and visits to the doctor.
Study design and critical appraisal sessions: 4
1. Introduction to critical appraisal (lecture 1)
• process for critical appraisal
• structure of a research paper
2. Design and appraisal of surveys (lecture 1)
• review of surveys
• internal validity
- bias
- chance
• external validity
• example
3. Design and appraisal of analytic studies (lectures 2 – 4)
• review of analytic study designs
• internal validity
- bias
- confounding
- chance
• external validity
• causation
• examples: randomised controlled trials, cohort studies, case-control studies
Cohort study
Ref: "Cohort studies: marching towards outcomes", Lancet 2002; 359: 341-45.
Prospective cohort study (concurrent): the cohort is defined and characterised at the start of the study and followed up into the future.
• Assemble the cohort.
• Measure predictor variables and potential confounders.
• Follow up the cohort and measure outcomes.
Retrospective (historical) cohort: the cohort is defined and characterised in the past, based on data already recorded, and followed up toward the present to some cut-off time.
• Identify a suitable cohort.
• Collect data about predictor variables from past records.
• Collect data about subsequent outcomes that occurred at a later time.
Cohort studies: example
Hart C, Davey Smith G. "Coffee consumption and coronary heart disease mortality in Scottish men: a 21 year follow-up study." J Epidemiol Commun Health (1997); 51: 461-2.
Objective
• to examine the effects of coffee on coronary heart disease mortality.
Background / Necessity
• recent studies of this hypothesis have produced conflicting results.
• data on confounding factors have often been limited in those studies.
Type of study design
• cohort (prospective)
Selection of participants
• 5,766 men aged 35 – 64 from workplaces in an area in the west of Scotland.
• enrolled between 1970 and 1973.
Information collected
• at enrolment:
- how many cups of coffee they usually drank per day;
- information on confounders such as smoking and social class.
• followed up for 20 years.
• information about deaths from coronary heart disease was obtained from the national registry.
Key results

No. of cups of     CHD
coffee per day    deaths    RR      95% CI
0                  308      1.0
1                   94      0.89    (0.70, 1.12)
2                  104      0.98    (0.78, 1.23)
3-4                 82      0.90    (0.70, 1.16)
5+                  37      0.96    (0.67, 1.37)
p-value from trend test = 0.71

Chance
• all confidence intervals include the null value, 1.
• the upper limits of the confidence intervals for < 5 cups per day are fairly close to 1.
• for 5+ cups per day we cannot exclude a true RR as big as 1.37 (a 37% increase in risk).
• the test for trend gave a p-value >> 0.05.
Bias
Selection bias
• because there is only one selection process, selection bias is minimised.
• the study sample may not be representative of the population in west Scotland, but in analytic studies that issue is addressed under generalisability.
Information bias
• information bias could come from:
- inaccuracy in exposure information;
- loss to follow-up;
- inaccuracy in determining death from CHD.
• the crude measure of coffee consumption used may bias the RR towards the null.
• follow-up will be nearly complete using the national registry.
• there may be some misclassification of cause of death.
Confounding
• the RRs presented were adjusted for a number of confounding factors including: age, diastolic blood pressure, cholesterol, smoking, social class and body mass index.
Generalisability
• type of coffee drunk (instant vs ground).
Implications
• found no clear evidence of an association between instant coffee use and risk of CHD.
• cannot rule out an increase in those drinking 5+ cups per day (small numbers).
• other types of coffee may have detrimental effects on CHD risk.
Case-control studies
Ref: "Case-control studies: research in reverse", Lancet 2002; 359: 431-34.
• Subjects are ascertained based on whether they have experienced the outcome of interest (cases) or not (controls).
• Information is collected from cases and controls about their past exposures.
Case-control studies: example
Shinton R and Sagar G. “Lifelong exercise and stroke.” BMJ (1993); 307: 231–4.
Objective
• to examine the potential of lifelong patterns of increased physical activity to prevent stroke.
Background / Necessity
• there is growing evidence that exercise can protect against stroke.
• the importance of exercise in early adult life in protection from stroke has received little attention.
• previous studies had not adequately controlled for confounding.
Type of study design
Case-control study
Selection of participants
Study population: people registered with a GP in west Birmingham, England.
Cases:
• men and women aged 35–74 who had just had their first stroke.
• obtained by phoning GPs weekly, and by checking admissions at the local hospital.
Controls:
• randomly selected from the general practice population.
• no history of stroke.
Information collected
• structured questionnaire.
• one interviewer for all cases and controls.
• when disability prevented an adequate response, the closest friend or relative was interviewed.
• people were classified by their responses into those who did or did not engage in vigorous exercise during:
youth (15–25)
early middle age (25–40)
late middle age (40–55)
• information on confounders (e.g. age, sex, smoking)
Key results
Response rates:
Cases:
• 125 patients were eligible for inclusion.
• no patient or relative declined to participate.
(100% response rate)
Controls:
• 220 controls were selected and contacted.
• 13 excluded.
• 198 of the remainder (207) agreed to participate.
(95.7% response rate)

Table I. Odds ratios* (95% confidence interval) of stroke according to when exercise undertaken.

Age undertaken    Exercise: no    Exercise: yes
15–25             1.0             0.33 (0.2 to 0.6)
25–40             1.0             0.43 (0.2 to 0.8)
40–55             1.0             0.63 (0.3 to 1.5)

* Odds ratios are adjusted for age and sex
Now, let’s consider possible explanations for an association: Internal validity
Chance
• confidence intervals show the range of plausible values of the true odds ratio which are consistent with the study results.
• if the confidence interval for an odds ratio excludes 1, then the study provides evidence of an association in the population studied.
• if the confidence interval for the odds ratio includes 1, then the study results are consistent with the possibility that there is no true association.
• to conclude definitely that there is no association, the confidence interval must include 1 and be narrow, so that important differences in the risk of disease can be excluded.
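The calculation behind such an interval can be sketched in a few lines of Python. The 2×2 counts below are invented for illustration (they are not the study's data); the interval uses the usual normal approximation on the log odds ratio scale.

```python
import math

# Hypothetical 2x2 table (invented counts, NOT the study's data):
#                exposed   unexposed
# cases             20         80
# controls          50         50
a, b, c, d = 20, 80, 50, 50

odds_ratio = (a * d) / (b * c)      # (a/b) divided by (c/d)

# 95% CI from the standard error of the log odds ratio
se = math.sqrt(1/a + 1/b + 1/c + 1/d)
lower = math.exp(math.log(odds_ratio) - 1.96 * se)
upper = math.exp(math.log(odds_ratio) + 1.96 * se)

# Here the interval works out to about (0.13, 0.47); it excludes 1,
# so this hypothetical table would be evidence of an association.
```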
In this study:
• the odds ratios increase with increasing age at which the exercise was undertaken.
• the confidence intervals for ages 15–25 and 25–40 exclude 1, so there is some evidence of an association between exercise at those ages and a reduction in the risk of stroke.
• the odds ratio for exercise undertaken at age 40–55 is less than 1, but the confidence interval contains 1, indicating that this apparent beneficial effect could just be due to random variation or chance.
Bias
Case-control studies are particularly susceptible to bias because at the time the study is done both exposure and disease have already occurred.
Selection bias
cases: all non-fatal cases which arose from the GP population were included.
controls: randomly selected from the population the cases arose from.
Therefore, the controls are representative of the population the cases arose from, and selection bias is minimised.
Response rates were high.
(100% for cases and 95.7% for controls)
Information bias
Things done to minimise bias:
• cases and controls were all interviewed by the same interviewer.
• a structured questionnaire was used.
Possible sources of information bias:
recall bias:
• cases and controls may both have trouble accurately recalling exercise patterns from when they were young.
• similar patterns of poor recall in cases and controls will bias an odds ratio towards 1, so this could not explain the observed association.
• cases have had a stroke, so they may be less likely to remember than the controls.
• if cases were less likely than controls to report exercise, an apparent protective association between exercise and stroke would be created.
bias from surrogate interviewee:
• information on exercise for cases unable to respond was obtained from a friend or relative.
Interviewer bias:
• the interviewer will have known whether or not people were cases or controls.
• if he/she prodded the controls harder for information on exercise, an apparent protective effect would be created.
Confounding
• risk factors for stroke include age, sex, and smoking.
• since all 3 of these are likely to be associated with exercise, they may be confounding the relationship between exercise and stroke.
• analyses were adjusted to remove the effects of confounding variables including age, sex and smoking.
Generalisability
Could we apply the results of this study to the New Zealand population?
• need to think about whether or not New Zealanders would be likely to experience the same apparent benefit from exercise.
• depends on the nature of the exercise and the biological mechanism by which exercise reduces the risk of stroke.
Causation
• it is difficult to show causation conclusively with a single observational study, primarily because of the susceptibility to bias and confounding.
• an association is more likely to be causal if:
• the observed association is very strong;
• a dose-response effect can be demonstrated;
• the results from several different studies are consistent;
• there is a known biological mechanism.
Appendix One: The Basics
This appendix contains some background material to help you prepare for the course.
1. Basic Mathematical Rules
1. BEDMAS – how to work things out in the right order
2. Rounding
3. Dealing with Negatives
4. Fractions
5. Solving Equations
6. Powers and Logarithms
7. Sigma means Add Up
2. Basic Statistical Concepts
1. Mean
2. Median
3. Range
4. Variance and Standard Deviation
5. Quartiles and Interquartile Range
6. Scatterplot
3. Sample Exercises
MATHERCIZE
Practice examples for many of the topics covered in this booklet are available on the computer package MATHERCIZE. This program is available at: http://mathercize.otago.ac.nz, and the login password is line.
Appendix 1 – Basic rules and concepts
Section 1: Basic Mathematical Rules
1. BEDMAS – how to work things out in the right order
Brackets
Exponents (also known as Powers)
Division and Multiplication
Addition and Subtraction
When Division and Multiplication occur together, work from the left. Similarly, when Addition and Subtraction occur together, work from the left. Otherwise follow the order suggested by the word BEDMAS.
Note that a scientific calculator will maintain this order, provided care is taken, but other calculators do not.
Example 1
Evaluate (3 + 2) × 6 + 9² ÷ (2 + 7 − 6)
• First evaluate both brackets: (3 + 2) = 5 and (2 + 7 − 6) = 3
• Then the exponent: 9² = 81
• Then the division and multiplication: 5 × 6 = 30 and 81 ÷ 3 = 27
• Finally the addition: 30 + 27 = 57
Setting this out on paper:
(3 + 2) × 6 + 9² ÷ (2 + 7 − 6) = 5 × 6 + 9² ÷ 3
                               = 5 × 6 + 81 ÷ 3
                               = 30 + 27
                               = 57
Example 2
Evaluate 5 + (9 − 5 ÷ 5 × 2²) − 9
• First evaluate the bracket. The exponent inside it is evaluated first, then the division and multiplication from the left:
(9 − 5 ÷ 5 × 2²) = (9 − 5 ÷ 5 × 4)
                 = (9 − 1 × 4)
                 = 9 − 4 = 5
• Finally the addition and subtraction: 5 + 5 − 9 = 1.
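Both examples can be checked in Python, which follows the same order of operations (note that / does true division, so the results come back as decimals):

```python
# Example 1: brackets first, then the exponent, then x and / from
# the left, then the addition
result1 = (3 + 2) * 6 + 9**2 / (2 + 7 - 6)   # 5*6 + 81/3

# Example 2: inside the bracket the exponent is done first, then
# the division and multiplication from the left
result2 = 5 + (9 - 5 / 5 * 2**2) - 9         # 5 + (9 - 4) - 9
```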
Example 3
If Z = (X − μ)/σ, calculate Z if X = 15, μ = 8, and σ = 2.75
• First carry out a “clean” substitution. This means that each variable whose value is known is replaced by that value without any calculation being done:
Z = (15 − 8)/2.75
• The division implies brackets around 15 − 8, although when the expression is written as a fraction the brackets are seldom shown. Nevertheless the expression 15 − 8 is evaluated first.
• Finally the division: 7 ÷ 2.75 = 2.55 (to two decimal places).
• Note that using brackets on a standard calculator should let you evaluate the expression directly. Try ( 15 − 8 ) ÷ 2.75 = (Missing out the brackets will almost certainly lead to an incorrect answer.)
Example 4
If t = 2.086, s = 3.44, and n = 21, evaluate the expression t × s/√n
• Clean substitution: 2.086 × 3.44/√21
• Note the implied multiplication: t × s/√n means t multiplied by the fraction s/√n
• A square root is an exponent, so evaluate √21 = 4.583 (to three d.p.)
• There is no addition or subtraction involved, so work from the left:
2.086 × 3.44 ÷ 4.583 = 1.57 (to two d.p.) (Rounding is discussed below.)
• Again this may be calculated directly on a calculator. Press the buttons:
2.086 × 3.44 ÷ √21 =
Example 5
Evaluate the expression (x̄ − μ)/(s/√n) if x̄ = 215.8, μ = 246, s = 64.5, and n = 10.
• For this example, only the calculator working is shown. Press the buttons:
( 215.8 − 246 ) ÷ ( 64.5 ÷ √10 ) =
The answer is −1.48 (to two d.p.)
• Try to obtain the same answer using the rules of BEDMAS.
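The three substitution examples above translate directly into Python; math.sqrt plays the role of the √ key:

```python
import math

# Example 3: Z = (X - mu) / sigma
z3 = (15 - 8) / 2.75

# Example 4: t * s / sqrt(n), worked from the left
z4 = 2.086 * 3.44 / math.sqrt(21)

# Example 5: (x-bar - mu) / (s / sqrt(n))
z5 = (215.8 - 246) / (64.5 / math.sqrt(10))
```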
Example 6
Evaluate 1.96 × √(4.5²/18 + 3.6²/22)
• The square root sign implies brackets around the expression 4.5²/18 + 3.6²/22, i.e. we have to evaluate 1.96 × √(4.5²/18 + 3.6²/22)
• All the exponents inside the brackets are calculated first, followed by the divisions:
4.5²/18 = 20.25/18 = 1.125 and 3.6²/22 = 12.96/22 = 0.589
• Next the addition, followed by the remaining exponent (the square root):
1.125 + 0.589 = 1.714 and √1.714 = 1.309
• Finally the multiplication: 1.96 × 1.309 = 2.57 (to two d.p.)
• Again note that this could be calculated directly on a calculator (although a single small mistake will make everything wrong). Try
1.96 × √( 4.5 x² ÷ 18 + 3.6 x² ÷ 22 ) =
The result should be 2.566, which also rounds to 2.57. Note that x² refers to the squaring button on a Casio calculator. Other brands may have different notations for squaring, although they should be similar.
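The same expression in Python, evaluated without rounding any intermediate step:

```python
import math

# 1.96 x sqrt(4.5^2/18 + 3.6^2/22): exponents, then divisions,
# then the addition, then the square root, then the multiplication
result = 1.96 * math.sqrt(4.5**2 / 18 + 3.6**2 / 22)
```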
2. Rounding
When you have decided how many digits you want to round to, look at the next digit. If this value is 0, 1, 2, 3, or 4, the previous digit is rounded down. Otherwise (if the value is 5, 6, 7, 8, or 9), the previous digit is rounded up.
Example:
By calculator, 8/√30 = 1.460593487
• To three d.p. (decimal places), 8/√30 = 1.461 because the next digit (5) causes the third decimal value (0) to be rounded up.
• To four d.p., 8/√30 = 1.4606
• To five d.p., 8/√30 = 1.46059
• To six d.p., 8/√30 = 1.460593
There are no hard and fast rules concerning how many digits you should round a value to, although a few general principles should be noted:
• When you are calculating an expression, do not round too soon. For example, consider the expression 150/√10. To eight decimal places, √10 = 3.16227766.
• If you use a calculator to evaluate 150/√10 and round your final answer to three decimal places, the result is 47.434.
• However, if you first round √10 to 3.16 and then calculate 150/3.16, the result is 47.468 (to three d.p.). This may not appear to be much different to the value 47.434, but it could make a substantial difference if you have to use the value in further calculations.
• Do not round your working to fewer figures than your final answer. In the previous example, the value 3.16 has three significant figures, while the (slightly incorrect) answer 47.468 has five figures. Having rounded to three figures in the working, three figures (or fewer) should be used for the final answer. You should not give an answer “more” accurate than the data or working.
• As a rule of thumb, round probabilities to four decimal places.
• Historically, Z-scores have been rounded to two decimal places. The reason for this is that normal distribution tables use two decimal place Z-scores.
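The round-too-soon example can be reproduced in Python. One caveat: Python's built-in round() uses round-half-to-even at exact halves, which can occasionally differ from the “5 rounds up” rule described above.

```python
import math

kept_precise = 150 / math.sqrt(10)   # keep full precision while working
rounded_too_soon = 150 / 3.16        # sqrt(10) rounded to 3.16 first

final_late = round(kept_precise, 3)      # 47.434
final_soon = round(rounded_too_soon, 3)  # 47.468
```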
3. Dealing with Negatives
Adding a negative number is the same as subtracting the corresponding positive number:
• Example: 5 + (−4) = 5 − 4 = 1
Subtracting a negative number is like adding a positive number:
• Example: 5 − (−4) = 5 + 4 = 9
Multiplying two negative numbers gives a positive number:
• Example: (−5) × (−4) = 20
Multiplying a negative number by a positive number gives a negative number:
• Example: (−5) × 4 = −20
4. Fractions
Many people have difficulty with fractions. Sometimes the difficulty is in the interpretation rather than with the actual calculations.
Example
Imagine that you have attended a course and you are trying to work out your final mark. You have been told that you scored:
• 8.5 out of 10 for the assignments
• 20 out of 40 for the test
• 32 out of 50 for the exam
If you add these three values up as if they were fractions, you would get
8.5/10 + 20/40 + 32/50 = 1.99 (Check this using a calculator.)
This is clearly a silly answer because the values were not actually fractions as such, but marks from different sections of the assessment scheme for the course.
If you just add up the marks you get 60.5. This is a more reasonable answer, because it gives a total out of 100.
But suppose that in the course mentioned in this example, the assessment scheme states that if the internal mark is higher than the exam mark, your final mark is the average. Otherwise the final mark is the exam mark. For this example, the internal total is 28.5 out of 50, or 57%, while the exam mark translates to 64%. As the exam mark is higher than the internal mark, the final mark in this case would be 64.
Using Calculators for Fractions
When probabilities are involved, dealing with fractions is important. This section aims to show how to use a calculator to handle problems involving fractions.
As long as you estimate whether the final answer is sensible, practically all fraction work can be carried out using a calculator. The key button to use is [a b/c] on a Casio. Other calculators should have equivalent buttons.
Simplifying Fractions
Example 1: 12/20
On your calculator type 12 [a b/c] 20 =
The answer is given as 3⌟5, i.e. 12/20 = 3/5
Example 2: 21/105
Type 21 [a b/c] 105 =
The answer is 1⌟5, i.e. 21/105 = 1/5
Converting Fractions to Decimals
The [a b/c] button will often do this, although not always!
Example 1: Convert 11/15 into decimal form.
On the calculator type 11 [a b/c] 15 =
The screen shows 11⌟15. Now press the [a b/c] button and the fraction is converted to the decimal 0.733333... Press [a b/c] again, and the fraction version reappears.
Example 2: Convert 0.6875 to a fraction.
Type .6875 = Now press the [a b/c] button. The screen shows 11⌟16, i.e. 0.6875 = 11/16
Example 3: Convert 0.1234567 to a fraction.
Type .1234567 = Now press the [a b/c] button. Nothing happens. The calculator leaves the decimal alone. If you want to convert this one to a fraction you will have to carry out the working yourself:
0.1234567 = 1234567/10000000
Adding and Subtracting Fractions
Example: 3/5 + 2/3
On your calculator type 3 [a b/c] 5 + 2 [a b/c] 3 =
The screen shows 1⌟4⌟15, i.e. 3/5 + 2/3 = 1 4/15
(Incidentally, if you now press the [a b/c] button, the decimal equivalent to this fraction appears on screen: 1.266666...)
Remember that if these two fractions represent probabilities that you are adding together, and the final answer was also meant to represent a probability, then there has to be an error somewhere because a probability cannot be larger than 1.
Multiplying and Dividing Fractions
Example 1: 5/8 × 5/3
Type 5 [a b/c] 8 × 5 [a b/c] 3 =
The result is 1⌟1⌟24, i.e. 5/8 × 5/3 = 25/24
Example 2: 5/7 ÷ 10/11
Type 5 [a b/c] 7 ÷ 10 [a b/c] 11 =
The result is 11⌟14, i.e. 5/7 ÷ 10/11 = 11/14
More Complicated Calculations
As soon as you have a problem involving both addition and multiplication, brackets become very useful.
Example: 3/4 × (1/8 + 3/7)
Note that the fraction in front of the brackets implies multiplication.
Type 3 [a b/c] 4 × ( 1 [a b/c] 8 + 3 [a b/c] 7 ) =
The answer is 93/224 or 0.4152 (to four d.p.)
Note that as an alternative approach you could use BEDMAS and work out the brackets first:
1 [a b/c] 8 + 3 [a b/c] 7 = gives 31/56
Now type × 3 [a b/c] 4 = to reach 93/224 as before.
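If you do not have a calculator with an [a b/c] button, Python's fractions module does the same exact arithmetic. This is offered as an alternative tool, not part of the calculator instructions above:

```python
from fractions import Fraction

simplified = Fraction(12, 20)                    # reduces to 3/5
total = Fraction(3, 5) + Fraction(2, 3)          # 19/15, i.e. 1 4/15
product = Fraction(5, 8) * Fraction(5, 3)        # 25/24
quotient = Fraction(5, 7) / Fraction(10, 11)     # 11/14
mixed = Fraction(3, 4) * (Fraction(1, 8) + Fraction(3, 7))  # 93/224
```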
5. Solving Equations
Solving equations involves more than evaluating expressions, which was covered earlier. To solve an equation you should make a clean substitution, then rearrange the expression so that the required variable is on its own.
Loosely speaking, solving equations involves “undoing BEDMAS”. For example, anything inside brackets is dealt with last.
In STAT 115 one particular type of equation will need to be solved:
Example 1: If Z = (X − μ)/σ, calculate X if Z = 1.96, μ = 8.5, and σ = 1.8
• First make a “clean substitution”, i.e. substitute each of the known variables into the equation without trying to simplify at all:
1.96 = (X − 8.5)/1.8
• The division sign implies brackets around X − 8.5. We are “undoing” the equation, so this part will be left to last.
• This means we “undo” the value 1.8 first. Because the right hand side of the equation reads “(X − 8.5) divided by 1.8”, we multiply by 1.8, since multiplication is the inverse operation to division:
1.96 × 1.8 = (X − 8.5)
• Because the brackets arose from the original division sign, and we have dealt with the division, the brackets are no longer needed:
3.528 = X − 8.5
• To undo subtraction we perform the opposite operation, addition:
3.528 + 8.5 = X
• We have now rearranged the equation so that X is on its own:
X = 12.0 (one decimal place)
Example 2: If Z = (X − μ)/(σ/√n), calculate X if Z = 2.58, μ = −2.5, σ = 0.85, and n = 60.
• Clean substitution:
2.58 = (X − (−2.5))/(0.85/√60)
• Simplify a little:
2.58 = (X + 2.5)/0.1097
• Solve the equation:
2.58 × 0.1097 = X + 2.5
0.2830 = X + 2.5
0.2830 − 2.5 = X
X = −2.22 (to two d.p.)
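Once an equation has been rearranged, the rest is just substitution. Both examples above, using the rearranged form X = Z × (scale) + μ:

```python
import math

# Example 1: Z = (X - mu)/sigma  rearranges to  X = Z*sigma + mu
x1 = 1.96 * 1.8 + 8.5

# Example 2: Z = (X - mu)/(sigma/sqrt(n))  rearranges to
#            X = Z*(sigma/sqrt(n)) + mu
x2 = 2.58 * (0.85 / math.sqrt(60)) + (-2.5)
```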
6. Powers and Logarithms
The following power rules may be needed occasionally, and examples will be given where necessary.
x^a × x^b = x^(a+b)
(x^a)^b = x^(ab)
x^a ÷ x^b = x^(a−b)
x^(−a) = 1/x^a
x^(1/2) = √x
The following log rules may also be needed. Note that in this paper, log means log_e (or natural log, i.e. ln).
log_e x = ln x
ln x = y ⟷ e^y = x (where e = 2.71828 (five d.p.))
ln(x^y) = y ln(x)
ln(x) + ln(y) = ln(xy)
ln(x) − ln(y) = ln(x/y)
Example:
If log(π̂/(1 − π̂)) = 3.1305 − 1.1499 − 0.027729 × 45, find the value of the expression π̂/(1 − π̂).
• First use BEDMAS to evaluate the RHS (Right Hand Side) of the expression:
3.1305 − 1.1499 − 0.027729 × 45 = 3.1305 − 1.1499 − 1.247805
                                = 0.732795
• We now have log(π̂/(1 − π̂)) = 0.732795. Remembering that log here means ln, we are able to rewrite this in exponential form using the formula
ln x = y ⟷ e^y = x
Therefore
π̂/(1 − π̂) = e^0.732795 = 2.08 (two d.p.)
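The same working in Python; math.exp undoes the natural log:

```python
import math

# RHS first, following BEDMAS (multiplication before subtraction)
rhs = 3.1305 - 1.1499 - 0.027729 * 45   # 0.732795

# ln(odds) = rhs, so odds = e**rhs
odds = math.exp(rhs)
```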
7. Sigma means Add Up
The Greek letter Σ (capital sigma) means “add up what follows”.
Example 1: Evaluate Σ_{i=1}^{3} 3^i
Each of the values 1, 2, and 3 is substituted into the expression one by one in place of the variable i. Then the three values are added:
Σ_{i=1}^{3} 3^i = 3¹ + 3² + 3³ = 3 + 9 + 27 = 39
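Sigma notation maps directly onto Python's sum() over a range (range(1, 4) produces i = 1, 2, 3):

```python
# Sum of 3**i for i = 1 to 3: 3 + 9 + 27
total = sum(3**i for i in range(1, 4))
```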
Example 2: Expand the expression Σ_{i=1}^{n} x_i, where x_1 is the first observation, x_2 the second observation, etc. in a data set.
There are n observations. Write out the sum of the first two or three observations, use three dots to indicate the other values, and add on the final observation:
Σ_{i=1}^{n} x_i = x_1 + x_2 + x_3 + ... + x_n
Notation
x_i is the i-th term from the data set x_1, x_2, x_3, ..., x_n.
x_ij is the (i, j)-th term from the data set
x_11, x_21, ..., x_n1
x_12, x_22, ..., x_n2
...
x_1k, x_2k, ..., x_nk
Example 3: If we select 50 female and 50 male Stat 115 students and measure their heights, we obtain the data set
x_ij, i = 1, 2; j = 1, 2, ..., 50
Here i represents sex (1 for female and 2 for male), and j the individual. For example, x_29 is the height of the 9th male in the sample.
Example 4: Evaluate the expression x̄ = (1/4) Σ_{i=1}^{4} x_i, where x_i is the i-th observation in the set {4, 7.5, 3.5, 8}.
• Substitute each of the x_i values into the expression and follow BEDMAS:
x̄ = (1/4)(4 + 7.5 + 3.5 + 8)
  = (1/4)(23)
  = 5.75
Example 5: Evaluate the expression v = (1/3) Σ_{i=1}^{4} (x_i − x̄)², where x_i is the i-th observation in the set {4, 7.5, 3.5, 8} and x̄ = 5.75 (calculated in Example 4).
• Substitute each of the x_i values into the expression, along with x̄ = 5.75:
v = (1/3)((4 − 5.75)² + (7.5 − 5.75)² + (3.5 − 5.75)² + (8 − 5.75)²)
• Follow BEDMAS and evaluate each one of the four inner brackets:
v = (1/3)((−1.75)² + (1.75)² + (−2.25)² + (2.25)²)
• The exponents (squares) are calculated and then the four terms are added:
v = (1/3)(3.0625 + 3.0625 + 5.0625 + 5.0625)
  = (1/3)(16.25)
• The multiplication by 1/3 is outside the brackets so it is calculated last:
v = 5.417 (to three d.p.)
Example 6: Evaluate the expression χ² = Σ_{all cells} (observed − expected)²/expected for the table below, where the expected values are given in brackets and the observed values are not in brackets:
15 (26)   50 (39)
33 (22)   22 (33)
• Note that for this type of sigma expression, the notation means we have to add up the result from each of the four cells.
• Substitute each value into the expression:
χ² = (15 − 26)²/26 + (50 − 39)²/39 + (33 − 22)²/22 + (22 − 33)²/33
• Evaluate each bracket, and then square the result:
χ² = (−11)²/26 + (11)²/39 + (11)²/22 + (−11)²/33
   = 121/26 + 121/39 + 121/22 + 121/33
• Use the [a b/c] (or equivalent) button to calculate the sum:
χ² = 16.923 (to three d.p.)
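The chi-squared sum is a one-liner once the observed and expected counts are paired up:

```python
# Observed and expected counts from the four cells of the table
observed = [15, 50, 33, 22]
expected = [26, 39, 22, 33]

# Sum of (O - E)^2 / E over all cells
chi_sq = sum((o - e)**2 / e for o, e in zip(observed, expected))
```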
Section 2: Basic Statistical Concepts
1. Mean
The mean x̄ is commonly referred to as the “average”. It is used as a measure of the “centre” of a data set. To find the mean, simply add up all your data values (observations) and divide by the number of values (sample size):
x̄ = (x_1 + x_2 + ... + x_n)/n  or  x̄ = (1/n) Σ_{i=1}^{n} x_i
Example:
Calculate the mean of the data set 2, 4, 6, 8, 10, 12.
There are six values in the data set, i.e. n = 6.
x̄ = (2 + 4 + 6 + 8 + 10 + 12)/6 = 42/6 = 7
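In Python, the mean is the sum of the data divided by the sample size:

```python
data = [2, 4, 6, 8, 10, 12]
mean = sum(data) / len(data)   # 42 / 6
```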
2. Median
The median is defined as the middle observation in the data set, and is another measure of the centre of the data. Note that the data must be in order before you calculate the median!
• In general, the median is the ((n + 1)/2)-th observation, where n is the sample size.
• If there is an odd number of observations, the median will be the middle observation.
• If there is an even number of observations, the median will be the mean of the two middle observations.
Example 1: Calculate the median of the data set 10, 1, 3, 8, 9.
• First sort the data into order: 1, 3, 8, 9, 10
• There are n = 5 observations, so the median is the (5 + 1)/2 = 3rd observation, i.e. 8.
Example 2: Calculate the median of the data set 32, 2, 36, 14, 6, 33.
• First sort the data into order: 2, 6, 14, 32, 33, 36
• There are n = 6 observations, so the median is the (6 + 1)/2 = 3.5th observation.
• Take the mean of the 3rd and 4th observations, i.e. (14 + 32)/2 = 23.
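A small function capturing the rule (sort first, then take the middle observation, or the mean of the two middle observations):

```python
def median(values):
    ordered = sorted(values)   # the data must be in order first
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:             # odd n: the middle observation
        return ordered[mid]
    # even n: the mean of the two middle observations
    return (ordered[mid - 1] + ordered[mid]) / 2

m1 = median([10, 1, 3, 8, 9])        # Example 1: 8
m2 = median([32, 2, 36, 14, 6, 33])  # Example 2: (14 + 32) / 2 = 23
```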
3. Range
The range is the difference between the largest and smallest observations in the data set. It is a measure of the variation in the data.
Example:
The range of the data set 2, 5, 6, 9, 16, 2, 13 is 16 – 2 = 14.
4. Variance and Standard Deviation
• The variance (s²) is calculated as follows:
s² = (1/(n − 1)) Σ_{i=1}^{n} (x_i − x̄)²
The standard deviation (s) is the most commonly used measure of variation in a set of data. It is the square root of the variance,
i.e. s = √( (1/(n − 1)) Σ_{i=1}^{n} (x_i − x̄)² )
Usually we calculate the variance first, then we take the square root to give the standard deviation. (This follows the order of operation indicated by BEDMAS.)
Example:
The mean for the data set 9, 5, 6, 4, 16, 2 is 7.0. Calculate the standard deviation:
• First calculate the variance. Substitute in each value, including x̄ = 7 and n = 6:
s² = (1/5)((9 − 7)² + (5 − 7)² + (6 − 7)² + (4 − 7)² + (16 − 7)² + (2 − 7)²)
• Evaluate the expression, following BEDMAS:
s² = (1/5)((2)² + (−2)² + (−1)² + (−3)² + (9)² + (−5)²)
   = (1/5)(4 + 4 + 1 + 9 + 81 + 25)
   = (1/5)(124) = 24.8
• Take the square root of the variance to give the standard deviation:
s = √24.8 = 4.98 (to two decimal places)
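The same calculation in Python (dividing by n − 1, as in the formula above):

```python
import math

data = [9, 5, 6, 4, 16, 2]
n = len(data)
xbar = sum(data) / n                                   # 7.0
variance = sum((x - xbar)**2 for x in data) / (n - 1)  # 124 / 5 = 24.8
sd = math.sqrt(variance)                               # about 4.98
```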
5. Quartiles and Interquartile Range
There are two quartiles: a lower quartile (Q1) and an upper quartile (Q3). The lower quartile has 25% of the data below it, and the upper quartile has 25% of the data above it.
To find a quartile, first find the median of the data set. Then treat the data above the median (upper set) and the data below the median (lower set) as separate sets. The lower quartile is the median of the lower set, while the upper quartile is the median of the upper set.
The interquartile range is the upper quartile minus the lower quartile; the interval between the two quartiles contains the middle 50% of the data. It is a measure of the variation in the data.
Example 1:
The data set 1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21 has a median of 11.
• Therefore the lower set is 1, 3, 5, 7, 9, which has a median of 5. So the lower quartile is 5.
• The upper set is 13, 15, 17, 19, 21, and has a median of 17. So the upper quartile is 17.
• The interquartile range is 17 – 5 = 12.
Example 2:
The data set 1, 5, 6, 8, 12, 16, 19, 22, 29, 31, 36, 40 has a median of 17.5.
• The lower set is 1, 5, 6, 8, 12, 16, which has a median of 7, so the lower quartile is 7.
• The upper set is 19, 22, 29, 31, 36, 40, which has a median of 30, so the upper quartile is 30.
• The interquartile range is 30 – 7 = 23.
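The median-split method described above can be sketched in Python. The helper names are ours, and note that statistical software often uses other quartile conventions, so its answers may differ slightly from this booklet's method:

```python
# Quartiles by the booklet's median-split method.
def median(xs):
    xs = sorted(xs)
    n = len(xs)
    mid = n // 2
    return xs[mid] if n % 2 else (xs[mid - 1] + xs[mid]) / 2

def quartiles(xs):
    xs = sorted(xs)
    half = len(xs) // 2
    lower = xs[:half]     # the values below the median position
    upper = xs[-half:]    # the values above the median position
    return median(lower), median(upper)

q1, q3 = quartiles([1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21])
print(q1, q3, q3 - q1)    # Example 1: quartiles 5 and 17, IQR 12
```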
6. Scatterplot
A scatterplot shows the relationship between two variables. Each observation consists of two measurements. Often we are interested in the "response" of one measurement to the value of the other. We try to distinguish between the "response" variable and the "explanatory" variable. The response variable is plotted on the y-axis (vertical axis) and the explanatory variable on the x-axis (horizontal axis).
Example:
The weight of 13 students and the amount of time it took them to drink a particular beverage are plotted below: the explanatory variable is the student's weight (x-axis) and the response variable is the time taken to drink the beverage (y-axis).
[Scatterplot: Weight of Student (kg) on the x-axis, 50–120, and Time taken to drink beverage on the y-axis, 0–9.]
Section 3: Sample Exercise
This sample exercise contains questions based on the Basics Booklet, plus a few questions from material taught during the first week of the course.
1. In a recent study looking at rainbow trout, researchers measured the lengths of juvenile fish. The lengths (in cm) for five randomly selected fish were:
18.6, 15.4, 13.4, 17.0, 12.9
Calculate to one decimal place the mean for these data.
2. For a second random sample of six juvenile fish the lengths (in cm) were:
15.5, 12.6, 17.5, 17.4, 13.8, 12.2
Calculate the median for these data.
3. Calculate the range for the data in Question 2.
4. For a third random sample of five juvenile fish the lengths (in cm) were:
14.5, 14.8, 16.5, 18.4, 13.8
The mean for these data is 15.6 (cm). Calculate to one decimal place the standard deviation for these data.
5. The mean value of 15.6 (cm) in Question 4 is a:
A. Parameter
B. Statistic
C. Distribution
D. Population value
E. Measure of Spread
6. The following list contains five values:
3.2%
0.096
0.048
0.32
0.58%
Beside each value select "True" if the value is less than 0.05 or "False" if the value is greater than 0.05.
7. Calculate the value of the expression $\sum_{i=1}^{5} 3i$.
8. If $Z = \dfrac{X - \mu}{\sigma/\sqrt{n}}$, with $X = 43.6$, $\mu = 48$, $\sigma = 8.6$ and $n = 50$, then calculate the value of $Z$.
9. If $1.96 = \dfrac{X - 2.8}{5}$, calculate the value of $X$.
10. In a previous STAT110 class at Otago University, 64% of students sitting the paper were known to be first year students. In a study of students sitting the paper, a random sample of 40 students was taken, and 60% of the students in this sample were found to be first year students.
To earn the mark for this question you must answer both questions below correctly. For each question select your answer from these five options:
A. 64%
B. 40 students
C. 60%
D. all first year students at Otago University
E. students sitting the paper
Question 1: The statistic in the paragraph above is: . . . . .
Question 2: What is the population? . . . . .
Answers
Answers without working are provided. For the working, look through the Basics Booklet above, or consult your notes for the first week of the course. If you need help, go to one of the help sessions. Details of these sessions are provided in the Course Outline at the start of this book.
1. 15.5 cm (1 d.p.)
2. 14.65 cm
3. 5.3 cm
4. 1.9 cm
5. B
6. True, False, True, False, True
7. 45
8. –3.62 (2 d.p.)
9. 12.6
10. B, E
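If you want to check your working for the numerical questions, the calculations can be sketched in Python (the helper names are ours; the data are the exercise's):

```python
# Check the numerical answers to Questions 1-4 and 7-9.
from math import sqrt

def mean(xs): return sum(xs) / len(xs)

def median(xs):
    xs = sorted(xs); n = len(xs); m = n // 2
    return xs[m] if n % 2 else (xs[m - 1] + xs[m]) / 2

def sd(xs):
    m = mean(xs)
    return sqrt(sum((x - m) ** 2 for x in xs) / (len(xs) - 1))

print(round(mean([18.6, 15.4, 13.4, 17.0, 12.9]), 1))   # Q1: 15.5
q2 = [15.5, 12.6, 17.5, 17.4, 13.8, 12.2]
print(median(q2))                                        # Q2: 14.65
print(round(max(q2) - min(q2), 1))                       # Q3: 5.3
print(round(sd([14.5, 14.8, 16.5, 18.4, 13.8]), 1))      # Q4: 1.9
print(sum(3 * i for i in range(1, 6)))                   # Q7: 45
print(round((43.6 - 48) / (8.6 / sqrt(50)), 2))          # Q8: -3.62
print(round(1.96 * 5 + 2.8, 1))                          # Q9: 12.6
```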
Appendix Two: Some Summaries
1. Some Useful Rules of Probability
2. Random Variables
3. Binomial Distribution
4. Normal Distribution
Basic Probability Rules and Distributions
1. Some Useful Rules of Probability
• Pr(A or B) = Pr(A) + Pr(B) – Pr(A and B)
If we use set notation for this rule, it can be rewritten as
Pr(A ∪ B) = Pr(A) + Pr(B) – Pr(A ∩ B)
[Venn diagrams illustrating A ∪ B and A ∩ B]
• If A and B are mutually exclusive (disjoint) then:
Pr(A and B) = 0, or Pr(A ∩ B) = 0
[Venn diagram: two non-overlapping circles A and B]
• If Ā represents the complement of A (every event not in A) then
Pr(A) + Pr(Ā) = 1
[Venn diagram: event A and its complement Ā]
Appendix 2 – Some summaries
• Probability of B given A: Pr(A ∩ B) = Pr(A) × Pr(B | A)
This may be rewritten as

$$\Pr(B \mid A) = \frac{\Pr(A \cap B)}{\Pr(A)}$$

[Tree diagram: the first branches lead to A, with probability Pr(A), or Ā; the second branches lead to B or B̄, with conditional probabilities Pr(B | A), Pr(B̄ | A), Pr(B | Ā) and Pr(B̄ | Ā); multiplying along each path gives the joint probabilities Pr(A ∩ B), Pr(A ∩ B̄), Pr(Ā ∩ B) and Pr(Ā ∩ B̄) at the leaves.]
• If A and B are independent then: (i) Pr(B | A) = Pr(B) and (ii) Pr(A ∩ B) = Pr(A) × Pr(B)
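These rules can be verified by direct enumeration on a small sample space. The sketch below uses two fair dice, an illustrative example of our own rather than one from the booklet:

```python
# Verify the probability rules by enumerating all 36 outcomes of two fair dice.
from fractions import Fraction
from itertools import product

space = list(product(range(1, 7), repeat=2))

def pr(event):
    # exact probability of an event over the equally likely sample space
    return Fraction(sum(1 for w in space if event(w)), len(space))

def A(w): return w[0] % 2 == 0      # first die shows an even number
def B(w): return w[0] + w[1] == 7   # the two dice total 7

# Addition rule: Pr(A or B) = Pr(A) + Pr(B) - Pr(A and B)
assert pr(lambda w: A(w) or B(w)) == pr(A) + pr(B) - pr(lambda w: A(w) and B(w))

# Complement rule: Pr(A) + Pr(not A) = 1
assert pr(A) + pr(lambda w: not A(w)) == 1

# Conditional probability: Pr(B | A) = Pr(A and B) / Pr(A)
pr_b_given_a = pr(lambda w: A(w) and B(w)) / pr(A)
print(pr_b_given_a)   # 1/6, the same as Pr(B): these events are independent

# Independence: Pr(A and B) = Pr(A) x Pr(B)
assert pr(lambda w: A(w) and B(w)) == pr(A) * pr(B)
```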
2. Random Variables
• A random variable is one whose value is determined by a random mechanism.
• A continuous random variable can take any value in an interval.
• A discrete random variable can take one of a countable number of values.
3. Binomial Distribution
Suppose
1. We have a fixed number of trials (n)
2. Trials are independent
3. Each trial has only two outcomes ("success" or "failure")
4. The probability of success (π) is the same for each trial
The total number of successes (X) is a discrete random variable and has a Binomial distribution, with
$$\Pr(X = x) = \binom{n}{x}\pi^x(1-\pi)^{n-x}$$

The mean and variance of the distribution are $\mu = n\pi$ and $\sigma^2 = n\pi(1-\pi)$.
Example:
If n = 30 and π = 0.6 then
• μ = nπ = 30 × 0.6 = 18
• σ² = nπ(1 − π) = 30 × 0.6 × 0.4 = 7.2
• The standard deviation is σ = √7.2 = 2.68 (to two d.p.)
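The Binomial calculations above can be sketched in Python; the function name is ours:

```python
# Binomial probability, mean, variance and standard deviation for n = 30, pi = 0.6.
from math import comb, sqrt

def binom_pmf(x, n, p):
    # Pr(X = x) = C(n, x) * p^x * (1 - p)^(n - x)
    return comb(n, x) * p ** x * (1 - p) ** (n - x)

n, p = 30, 0.6
mean = n * p                   # 18
variance = n * p * (1 - p)     # 7.2
print(mean, round(variance, 1), round(sqrt(variance), 2))

# Sanity check: the probabilities over x = 0..n sum to 1.
assert abs(sum(binom_pmf(x, n, p) for x in range(n + 1)) - 1) < 1e-12
```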
4. Normal Distribution
A distribution that is commonly used to describe the behaviour of continuous random variables is the normal distribution.
• X ~ N(μ, σ²) means "X has a normal distribution with mean μ and variance σ²"
• X ~ N(0, 1) means X has a standard normal distribution
• If X ~ N(μ, σ²), then the standardised random variable $Z = \dfrac{X - \mu}{\sigma} \sim N(0, 1)$
For any Normal distribution, approximately:
• 68% of the observations are between μ − σ and μ + σ.
• 95% of the observations are between μ − 2σ and μ + 2σ.
• 99.7% of the observations are between μ − 3σ and μ + 3σ.
Example:
If X ~ N(45, 30) then
• μ = 45
• the standard deviation σ = √30 = 5.477 (to three d.p.)
• Approximately 68% of the observations are expected to be between μ − σ = 39.5 and μ + σ = 50.5.
• Approximately 95% of the observations are expected to be between 34 and 56.
• Over 99% (i.e. almost all) of the observations are expected to be between 28.5 and 61.
• $\Pr(X < 40) = \Pr\left(Z < \dfrac{40 - 45}{5.477}\right) = \Pr(Z < -0.913) = 0.1806$
[Sketch: normal curve with the area below X = 40 shaded, i.e. below Z = –0.913 on the standard normal scale.]
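The probability Pr(X < 40) can be computed without normal tables using the error function; a Python sketch (the function name is ours):

```python
# Normal probabilities via the identity Phi(z) = (1 + erf(z / sqrt(2))) / 2.
from math import erf, sqrt

def normal_cdf(x, mu=0.0, sigma=1.0):
    # Pr(X < x) for X ~ N(mu, sigma^2)
    return 0.5 * (1 + erf((x - mu) / (sigma * sqrt(2))))

mu, var = 45, 30
sigma = sqrt(var)          # about 5.477
z = (40 - mu) / sigma      # about -0.913
print(round(z, 3), round(normal_cdf(40, mu, sigma), 4))
```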
Summary of Formulae
1. Normal Distribution
If X is a normal random variable with parameters $\mu_X$ (mean) and $\sigma^2_X$ (variance):
• Mean: $\mu_X$
• Standard deviation: $\sigma_X = \sqrt{\sigma^2_X}$
A standard normal random variable Z has mean $\mu_Z = 0$ and variance $\sigma^2_Z = 1$. To transform a normal random variable X into a standard normal (and vice versa):

$$Z = \frac{X - \mu_X}{\sigma_X} \quad \text{and} \quad X = Z\sigma_X + \mu_X.$$
2. Binomial Distribution<br />
If X is a binomial random variable with n trials and probability π then
• Mean: $\mu_X = n\pi$
• Standard deviation: $\sigma_X = \sqrt{n\pi(1-\pi)}$
• If $n\pi$ and $n(1-\pi)$ are both greater than 5, then X is approximately normally distributed with mean $\mu_X$ and variance $\sigma^2_X$.
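The rule of thumb above can be checked numerically. The sketch below compares the exact Binomial probability Pr(X ≤ 15) with its normal approximation for n = 30, π = 0.6; it uses a continuity correction of 0.5, which is an addition of ours rather than something covered in this booklet:

```python
# Compare an exact Binomial probability with its normal approximation.
from math import comb, erf, sqrt

def binom_cdf(k, n, p):
    # exact Pr(X <= k) for X ~ Binomial(n, p)
    return sum(comb(n, x) * p ** x * (1 - p) ** (n - x) for x in range(k + 1))

def normal_cdf(x, mu, sigma):
    # Pr(X < x) for X ~ N(mu, sigma^2)
    return 0.5 * (1 + erf((x - mu) / (sigma * sqrt(2))))

n, p = 30, 0.6                         # n*pi = 18 and n*(1 - pi) = 12, both > 5
mu, sigma = n * p, sqrt(n * p * (1 - p))
exact = binom_cdf(15, n, p)
approx = normal_cdf(15.5, mu, sigma)   # 15.5 rather than 15: continuity correction
print(round(exact, 3), round(approx, 3))
```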
3. Distributions of Statistics
• The mean $\bar{X}$ of a random sample of size n has mean $\mu_{\bar{X}} = \mu_X$ and standard deviation $\sigma_{\bar{X}} = \sigma_X/\sqrt{n}$.
• The sample proportion P computed from a binomial distribution with parameters n and π has a mean of $\mu_P = \pi$ and standard deviation $\sigma_P = \sqrt{\pi(1-\pi)/n}$. If $n\pi$ and $n(1-\pi)$ are both greater than 5, then P will be approximately normally distributed.
• The distribution of the difference between two sample means $\bar{X}_1 - \bar{X}_2$ has a mean of $\mu_{\bar{X}_1 - \bar{X}_2} = \mu_1 - \mu_2$ and a standard deviation of $\sigma_{\bar{X}_1 - \bar{X}_2} = \sqrt{\dfrac{\sigma^2_1}{n_1} + \dfrac{\sigma^2_2}{n_2}}$.
- In large random samples ($n_1$ and $n_2 \geq 30$), $\sigma_{\bar{X}_1 - \bar{X}_2}$ can be estimated by $\hat{\sigma}_{\bar{X}_1 - \bar{X}_2} = \sqrt{\dfrac{s^2_1}{n_1} + \dfrac{s^2_2}{n_2}}$.
- If $\sigma^2_1 = \sigma^2_2$ then we can estimate $\sigma_{\bar{X}_1 - \bar{X}_2}$ by $\hat{\sigma}_{\bar{X}_1 - \bar{X}_2} = \sqrt{\dfrac{(n_1-1)s^2_1 + (n_2-1)s^2_2}{n_1+n_2-2}}\sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}$.
4. Contingency tables
                 Factor 2
Factor 1     Level 1       Level 2       Total
Level 1      w             x             r1 = w + x
Level 2      y             z             r2 = y + z
Total        c1 = w + y    c2 = x + z    n = w + x + y + z
$$\chi^2 = \sum_{i=1}^{2}\sum_{j=1}^{2}\frac{(o_{ij} - e_{ij})^2}{e_{ij}}, \quad \text{where } e_{ij} = \frac{r_i c_j}{n}$$

and $o_{ij}$ is the observed value in row i, column j.
Odds ratio: OR = (w/x)/(y/z) = (w × z)/(x × y)
Relative risk: RR = (w/(w + x))/(y/(y + z))
Attributable risk: AR = w/(w + x) − y/(y + z)
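A Python sketch of these 2 × 2 table measures, using the booklet's w, x, y, z labels on made-up illustration counts:

```python
# Chi-square statistic, odds ratio, relative risk and attributable risk
# for a 2x2 contingency table. The counts are made-up illustration data.
w, x, y, z = 20, 80, 10, 90
n = w + x + y + z
rows = (w + x, y + z)                 # row totals r1, r2
cols = (w + y, x + z)                 # column totals c1, c2
observed = [[w, x], [y, z]]

# chi2 = sum over cells of (o - e)^2 / e, with e = (row total * column total) / n
chi2 = sum((observed[i][j] - rows[i] * cols[j] / n) ** 2 / (rows[i] * cols[j] / n)
           for i in range(2) for j in range(2))

OR = (w * z) / (x * y)
RR = (w / (w + x)) / (y / (y + z))
AR = w / (w + x) - y / (y + z)
print(round(chi2, 2), OR, RR, round(AR, 2))
```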
Appendix 3 - Formulae
5. Confidence Intervals<br />
All of the 100(1 − α)% confidence intervals calculated in this course are of the form:
Estimate ± multiplier × standard error.
In the following, $\bar{x}$, p, etc. are the values calculated from the samples.
Population mean
• Random sample, $\sigma_X$ known: estimate $\bar{x}$; df NA; multiplier $z_{\alpha/2}$; standard error $\dfrac{\sigma_X}{\sqrt{n}}$
• Random normal sample, $\sigma_X$ unknown and estimated by s: estimate $\bar{x}$; df $\nu = n - 1$; multiplier $t_{\alpha/2,\nu}$; standard error $\dfrac{s}{\sqrt{n}}$
Difference between population means
• Small random samples, normal population, $\sigma_1 = \sigma_2 = \sigma$ unknown: estimate $\bar{x}_1 - \bar{x}_2$; df $\nu = n_1 + n_2 - 2$; multiplier $t_{\alpha/2,\nu}$; standard error $\sqrt{\dfrac{(n_1-1)s^2_1 + (n_2-1)s^2_2}{n_1+n_2-2}}\sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}$
• Large random samples (both ≥ 30): estimate $\bar{x}_1 - \bar{x}_2$; df NA; multiplier $z_{\alpha/2}$; standard error $\sqrt{\dfrac{s^2_1}{n_1} + \dfrac{s^2_2}{n_2}}$
• Paired difference in small random samples from a normal population: estimate $\bar{d}$; df $\nu = n - 1$; multiplier $t_{\alpha/2,\nu}$; standard error $\dfrac{s_d}{\sqrt{n}}$
After ANOVA and Regression
• Estimate, multiplier and standard errors determined from output
Population proportions
• Population proportion: estimate p; df NA; multiplier $z_{\alpha/2}$; standard error $\sqrt{\dfrac{p(1-p)}{n}}$
• Difference between 2 population proportions: estimate $p_1 - p_2$; df NA; multiplier $z_{\alpha/2}$; standard error $\sqrt{\dfrac{p_1(1-p_1)}{n_1} + \dfrac{p_2(1-p_2)}{n_2}}$
Odds ratio, relative risk, attributable risk (see contingency tables above for w, x, y and z)
• Log (natural) odds ratio: estimate ln(OR); df NA; multiplier $z_{\alpha/2}$; standard error $\sqrt{\dfrac{1}{w} + \dfrac{1}{x} + \dfrac{1}{y} + \dfrac{1}{z}}$
• Log (natural) relative risk: estimate ln(RR); df NA; multiplier $z_{\alpha/2}$; standard error $\sqrt{\dfrac{1}{w} - \dfrac{1}{w+x} + \dfrac{1}{y} - \dfrac{1}{y+z}}$
• Attributable risk: as for the difference of two population proportions, with $p_1 = w/(w+x)$ and $p_2 = y/(y+z)$
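The estimate ± multiplier × standard error pattern can be sketched in Python for the simplest case, a population mean with a z multiplier; the function name and the sample values are made-up illustrations, not course data:

```python
# 95% confidence interval for a population mean:
# estimate +/- multiplier * standard error.
from math import sqrt

def ci_mean_large(xbar, s, n, z=1.96):
    # large-sample / sigma-known form with standard error s / sqrt(n)
    se = s / sqrt(n)
    return xbar - z * se, xbar + z * se

lo, hi = ci_mean_large(xbar=15.6, s=1.9, n=50)
print(round(lo, 2), round(hi, 2))
```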
6. Regression<br />
$$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x \quad \text{where} \quad \hat{\beta}_1 = \frac{\sum(x_i - \bar{x})(y_i - \bar{y})}{\sum(x_i - \bar{x})^2} \quad \text{and} \quad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1\bar{x}.$$

Standard error of the slope: $SE(\hat{\beta}_1) = \dfrac{s_e}{\sqrt{\sum(x_i - \bar{x})^2}}$, where $s_e = \sqrt{\dfrac{\sum(y_i - \hat{y}_i)^2}{n-2}} = \sqrt{\text{MS Residual}}$.

Standard error of a forecast at $x_k$: $s_e\sqrt{1 + \dfrac{1}{n} + \dfrac{(x_k - \bar{x})^2}{\sum(x_i - \bar{x})^2}}$.

7. ANOVA
1. Total SS = Treatment SS + Error SS
2. Total df = Treatment df + Error df
3. MS Treatment = Treatment SS/Treatment df and MS Error = Error SS/Error df
4. Overall mean SS = $n\bar{y}^2$ where $n = n_1 + \ldots + n_k$ and $\bar{y} = \frac{1}{n}(n_1\bar{y}_1 + \ldots + n_k\bar{y}_k)$.
5. Treatment SS = $\dfrac{C^2_1}{n_1} + \dfrac{C^2_2}{n_2} + \ldots + \dfrac{C^2_k}{n_k} - n\bar{y}^2$ where $C_j$ is the jth column total.
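The least-squares formulae in Section 6 can be sketched in Python on a small made-up data set:

```python
# Least-squares slope and intercept from the regression formulae:
# b1 = sum((x - xbar)(y - ybar)) / sum((x - xbar)^2), b0 = ybar - b1 * xbar.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.1, 8.0, 9.9]   # made-up, roughly linear in xs

n = len(xs)
xbar = sum(xs) / n
ybar = sum(ys) / n

sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
sxx = sum((x - xbar) ** 2 for x in xs)

b1 = sxy / sxx           # fitted slope
b0 = ybar - b1 * xbar    # fitted intercept
print(round(b1, 3), round(b0, 3))
```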