CONTENTS - Department of Mathematics and Statistics - University ...

CONTENTS

Introduction, General Information and Administration, Overview

SECTION 1

This covers an introduction to the package R-cmdr and presents an overview of biostatistics and research methodology.

Biostatistics and Research Methodology; R-cmdr
Types of Data
Numerical Data and Histograms
Measures of Centre: Mean and Median
Measures of Variability: Standard Deviation, Variance and Interquartile Range
Box-and-Whisker Plots

SECTION 2

This covers the measures of disease frequency and disease association, with several examples looking at prevalence, incidence, relative risks, attributable risk and odds ratios.

Prevalence and Incidence
Cumulative Incidence
Incidence Rate
Disease Association
Relative Risk
Attributable Risk
Odds Ratio
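The measures of association listed above can all be computed directly from a 2×2 table of counts. A minimal sketch in Python; the counts a, b, c, d are invented purely for illustration:

```python
# Hypothetical 2x2 table (counts are invented for illustration only):
#                 diseased   healthy
#   exposed         a=30       b=70
#   unexposed       c=10       d=90
a, b, c, d = 30, 70, 10, 90

risk_exposed = a / (a + b)      # proportion diseased among the exposed
risk_unexposed = c / (c + d)    # proportion diseased among the unexposed

relative_risk = risk_exposed / risk_unexposed       # RR: ratio of risks
attributable_risk = risk_exposed - risk_unexposed   # AR: difference in risks
odds_ratio = (a * d) / (b * c)                      # OR: cross-product ratio
```

The odds ratio approximates the relative risk only when the disease is rare; here, with 20% of the sample diseased, the two differ noticeably (RR = 3.0 versus OR ≈ 3.86).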

SECTION 3

This section covers a brief introduction to probability definitions, notation, rules and random variables, with examples, several involving the use of tree diagrams.

Definitions including mutually exclusive and independent events
The Addition Rule for combining probabilities
The Multiplication Rule for probabilities
Tree diagrams with examples
Screening test terminology
Probability Distributions and Random Variables
Rules for combining Random Variables
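Screening-test calculations combine the two rules above in tree-diagram fashion: multiply along each branch, then add across branches. A sketch with invented values for prevalence, sensitivity and specificity:

```python
# Invented screening-test inputs (illustration only):
prevalence = 0.01      # P(disease)
sensitivity = 0.95     # P(test positive | disease)
specificity = 0.90     # P(test negative | no disease)

# Multiplication rule along each branch of the tree:
true_pos = prevalence * sensitivity                 # P(disease and positive)
false_pos = (1 - prevalence) * (1 - specificity)    # P(no disease and positive)

# Addition rule across the two branches that end in a positive test:
p_positive = true_pos + false_pos

# Positive predictive value: P(disease | positive test)
ppv = true_pos / p_positive
```

Even with a fairly accurate test, low prevalence drags the positive predictive value down; in this example fewer than 9% of positive results are true cases.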

SECTION 4

This section introduces both the Binomial and Normal Distributions, which model many phenomena arising in the real world. Consequently, these distributions allow us to answer some important and relevant questions.

The Binomial Distribution: Definition, mean and variance
The Binomial Table: Examples
The Normal Distribution: Definition
Standard Normal Distribution and Table
General Normal Distribution
Normal Approximation to the Binomial
Transforming Data to Normal
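As a sketch of the binomial distribution and its normal approximation using only the Python standard library; the values n = 20, p = 0.3 and k = 6 are arbitrary illustrative choices:

```python
import math
from statistics import NormalDist

# Arbitrary illustrative values: n trials, success probability p, target count k.
n, p, k = 20, 0.3, 6

# Exact binomial probability P(X = k) = C(n, k) * p^k * (1-p)^(n-k)
exact = math.comb(n, k) * p**k * (1 - p)**(n - k)

# Binomial mean np and standard deviation sqrt(np(1-p))
mu, sigma = n * p, math.sqrt(n * p * (1 - p))

# Normal approximation with continuity correction: area between k-0.5 and k+0.5
approx = NormalDist(mu, sigma).cdf(k + 0.5) - NormalDist(mu, sigma).cdf(k - 0.5)
```

With np and n(1-p) both reasonably large, the approximation tracks the exact value closely (here both are about 0.19).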

Contents


SECTION 5

This section defines sampling distributions, establishes the standard deviations of these distributions (called standard errors), and sets up confidence intervals for population means, differences between the means of two populations, proportions, and differences between proportions, based on random samples drawn from the populations.

An outline of the Research Process
The Distribution of Sample Means
The Standard Error of the Mean
Confidence Interval for a Mean
The t-distribution and Its Use
Comparison of Two Independent Groups
The Standard Error of the Difference Between Two Means
Pooled Estimate for the Common Variance
Comparison of Two Dependent Groups (Paired Data)
Confidence Interval for a Proportion
Confidence Interval for Difference Between Two Proportions
Summary of Distributions and Confidence Intervals
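A confidence interval for a mean has the form estimate ± critical value × standard error. A standard-library sketch with made-up data; note that for a sample this small the t-based interval covered in this section would be slightly wider, but the normal critical value keeps the sketch simple:

```python
import math
from statistics import NormalDist, mean, stdev

# Made-up sample data for illustration only.
sample = [5.1, 4.8, 5.6, 5.0, 4.9, 5.3, 5.2, 4.7, 5.4, 5.0]

xbar = mean(sample)
se = stdev(sample) / math.sqrt(len(sample))   # standard error of the mean
z = NormalDist().inv_cdf(0.975)               # 95% critical value, approx 1.96

ci = (xbar - z * se, xbar + z * se)           # 95% confidence interval
```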

SECTION 6

This section reviews hypothesis testing, Type I and Type II errors, conclusive and inconclusive results, and the power of a study.

Null and Alternative Hypotheses
Study-Based and Data-Driven Hypotheses
One- and Two-Sided Tests
Four Steps in the Hypothesis Testing Procedure
Examples
Pooled proportion estimate
Clinical and Ecological Importance
Conclusive and Inconclusive Results
Errors in Hypothesis Testing
Power of a Study
Examples
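The pooled proportion estimate listed above arises when testing whether two groups share a common success probability. A sketch of a two-sided, two-proportion z-test, with invented counts:

```python
import math
from statistics import NormalDist

# Invented counts: x1 successes out of n1 in group 1, x2 out of n2 in group 2.
x1, n1 = 45, 100
x2, n2 = 30, 100

p1, p2 = x1 / n1, x2 / n2
p_pool = (x1 + x2) / (n1 + n2)     # pooled estimate under H0: p1 = p2

# Standard error of the difference, using the pooled proportion:
se = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))

z = (p1 - p2) / se                             # test statistic
p_value = 2 * (1 - NormalDist().cdf(abs(z)))   # two-sided p-value
```

Here z exceeds 1.96, so at the 5% level the null hypothesis of equal proportions would be rejected.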

SECTION 7

One factor analysis of variance
Post analysis of variance tests on means
Multiple comparison procedures

SECTION 8

This section covers the analysis of count data, including the chi-square test for contingency and the chi-square test for trend, as well as relative risks, attributable risks and odds ratios along with their confidence intervals. The analysis of a three-way table and Simpson’s paradox are investigated as a way of introducing the concept of a confounding variable in the lead-up to regression analyses.

Categorical Data Examples
Relative Risk and its Confidence Interval
Attributable Risk and its Confidence Interval
Odds Ratio and its Confidence Interval
Chi-square Test for Contingency
Chi-square Test for Trend
Interpretation of Confidence Intervals
Simpson’s Paradox and Confounder Control
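The chi-square test for contingency compares observed counts with the counts expected under independence. A sketch for a 2×2 table with invented counts; for 1 degree of freedom the chi-square tail probability equals 2·(1 − Φ(√x)), which keeps the sketch within the standard library:

```python
import math
from statistics import NormalDist

# Invented 2x2 contingency table (rows: exposed/unexposed; cols: diseased/healthy).
table = [[30, 70],
         [10, 90]]

row_totals = [sum(row) for row in table]
col_totals = [sum(col) for col in zip(*table)]
grand = sum(row_totals)

# Pearson chi-square statistic: sum of (observed - expected)^2 / expected
chi2 = 0.0
for i, row in enumerate(table):
    for j, obs in enumerate(row):
        expected = row_totals[i] * col_totals[j] / grand
        chi2 += (obs - expected) ** 2 / expected

# For df = 1, P(chi-square > x) = 2 * (1 - Phi(sqrt(x)))
p_value = 2 * (1 - NormalDist().cdf(math.sqrt(chi2)))
```

Here chi-square = 12.5, well beyond the 5% critical value of 3.84, so the test would declare an association between the rows and columns.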



SECTION 9

This section introduces the topic of Simple Linear Regression, which sets out to fit a straight line through what is called a scatter diagram. One purpose of this analysis is to establish whether a predictor variable is influencing the outcomes of a response variable, and also to measure the magnitude of the effect of this predictor variable on the outcome. The fitted straight line can also be used to make predictions.

Simple linear regression is also the first step in controlling for a confounder variable. This occurs with the extension to multiple regression, which will be considered in the next section.

Scatter Diagrams and Examples
Equation of Fitted Straight Line
Analysis of Variance for Regression Model
Confidence Interval for Slope
Confidence Interval for Prediction
Correlation as Measure of Linear Association
Review Exercises
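The least-squares line and the correlation coefficient both come from the same sums of squares. A sketch over a small invented scatter:

```python
from statistics import mean

# Invented scatter-diagram data for illustration only.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 8.1, 9.9]

xbar, ybar = mean(x), mean(y)
sxx = sum((xi - xbar) ** 2 for xi in x)
sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
syy = sum((yi - ybar) ** 2 for yi in y)

slope = sxy / sxx                    # least-squares slope
intercept = ybar - slope * xbar      # least-squares intercept
r = sxy / (sxx * syy) ** 0.5         # correlation coefficient

y_at_6 = intercept + slope * 6       # prediction from the fitted line at x = 6
```

With these points the fitted line is y ≈ 0.10 + 1.98x and r is close to 1, reflecting the nearly perfect linear trend built into the made-up data.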

SECTION 10

Multiple regression models and logistic regression models are introduced in this section. In ordinary multiple regression the response or outcome variable is on a continuous scale, whereas in logistic regression the outcome measure is binary, taking only two possible values interpreted as success versus failure.

The models allow us to identify those variables which have an effect on the outcomes and those variables which do not.

Adding additional variables leads to adjusted values for the estimated parameters, and it is this that allows us to control for confounding.

The Multiple Regression Model
R-cmdr Printout for Multiple Regression
Dummy Variables
Checking Model Fit
Parallel Regression Lines and Analysis of Covariance
Binary Outcomes and Logistic Regression
Study Design Principles
Critical Appraisal
Confounding Analysis
Sources of Bias
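In a logistic model the log-odds of success are linear in the predictors, so exponentiating a coefficient gives the odds ratio for a one-unit increase in that predictor. A sketch with invented coefficients b0 and b1:

```python
import math

# Invented logistic-model coefficients for illustration only.
b0, b1 = -2.0, 0.8

def prob(x):
    """P(success | x) under the logistic model."""
    return 1 / (1 + math.exp(-(b0 + b1 * x)))

def odds(x):
    """Odds of success at x: p / (1 - p)."""
    p = prob(x)
    return p / (1 - p)

# The odds ratio for a one-unit increase in x equals exp(b1),
# regardless of the starting value of x:
odds_ratio = odds(1) / odds(0)
```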

SECTION 11

Appendix 1: The Basics – mathematical rules and statistical concepts
Appendix 2: Some summaries
Appendix 3: Formulae



STAT115 INTRODUCTION TO BIOSTATISTICS 2012

Advances in our understanding of factors which affect health and wellbeing come through research in the health sciences. Examples of such research include surveys to describe patterns of disease in a community or risk factors for disease such as diet and smoking; studies trying to find out whether a newly developed treatment works; studies of factors which may prevent disease, such as physical activity; and studies of barriers to improving health, such as reasons for declining vaccination rates in children and the prevention of smoking. Biostatistics (statistics applied in the health sciences) is a vital tool in our mission to improve health and wellbeing for all people.

STAT115 provides an introduction to the core principles and methods of biostatistics. In this course you will gain an understanding of how statistics is used to answer research questions: how to look for patterns in data, and how to test hypotheses about disease causation, prevention and improvement in wellbeing. The understanding and skills gained in STAT115 can be a starting point for a career in biostatistics, or can be used to assist understanding of research in other disciplines including physiology, anatomy, human nutrition, sports science, and psychology.

GENERAL INFORMATION AND ADMINISTRATION

Lecturers

Dr Katrina Sharples, Dept of Preventive and Social Medicine, Adams Building
Dr Janine Wright, Room 237, Science III building
Mr Daniel Turek, Room 231, Science III building
Dr David Bryant, Room 514, Science III building

Lectures

Lectures are held on Monday, Tuesday, Thursday and Friday at 11.00 am, commencing Monday 9 July. Although these notes are extensive, experience shows that students who miss lectures are at a severe disadvantage.

Help Sessions and Tutorials

These will be held in the 539 Castle St laboratory, which has 36 computers. Tutorials are cafeteria style, which means that you can attend at any scheduled time when tutors are available to help with the weekly exercises. Times can be found on the STAT115 paper page on the Mathematics and Statistics Department website. In addition, you may access the computers to complete assignments outside of scheduled sessions. Attend early in the week to avoid the inevitable rush before submission day.

STAT 115 Web Page and Resource Area

The STAT 115 web page, www.maths.otago.ac.nz/stat115, will contain course resource material. Answers to weekly exercises, notices, old exam papers with solutions and any other useful information will be posted here. You can access this information by clicking on the Resources button. You are strongly advised to read through the solutions to the weekly exercises, as students who fail to do this are at a severe disadvantage.


Introduction & overview


Support Classes

There is also a Wednesday evening support class for students worried about their mathematics background for this course. This class will be held in 539 Castle St at 6pm on Wednesday evenings. If you wish to attend the support class you will need to register using the form which is available on the resource page or from the Maths and Statistics Reception, Science III, 2nd floor. Our experience is that only a small number of students will need to use the support class. Note that there is no mathematics prerequisite for this course. If you have difficulty in carrying out the calculations in the Basics Booklet of Appendix 1 of these notes, you may find it helpful to attend the support class. In addition, you can access Mathercize by going to the web page mathercize.otago.ac.nz (log-in password: line). The options

STAT115 Exercises for Biostatistics
STAT115 Revision mathematics

will take you through background material for this course in an easy-to-use self-testing environment.

Study Centre

A Study Centre will operate in a room at the back of 539 Castle St. This is an area where you can go to work with fellow students. There will also be statistics help available at the times shown on the door.

References

There is no set text for the course, as this course booklet contains all the material necessary. The book Harraway, J., Introductory Statistical Methods for Biological, Health and Social Sciences (University of Otago Press) has multiple copies on reserve in the Science Library at the Loans Desk. The first 17 chapters are relevant for this course. A second book on close reserve is Clark, M.J. and Randal, J.A., A First Course in Applied Statistics (Pearson).

Computing

The R-commander (R-cmdr) package will be used in tutorials. No prior knowledge of the package is needed, as a handout and full instructions will be available in the tutorials. All students will have their own User Name and Password. The User Name is the name on your student ID card and the Password is your student ID number.

Time Commitment

STAT 115 is a one-semester course worth 18 points. It is expected that students should spend an average of 12 hours per week on this course. After allowing four hours per week for attending lectures, this leaves eight hours for other course-related activities such as assignments, reading notes and revising.

Calculators

There is no restriction on the type of calculator that can be used, except that no device with communication capability will be accepted as a calculator.



Course content (in approximate lecture order)

Introduction: research methods and study design; designed experiments versus observational studies; case control, cohort and intervention studies. (2 lectures)

Data description and presentation: the use of R-commander; histograms, box-and-whisker plots, measures of centre and spread of data, measures of disease frequency and association. (6 lectures)

Probability: the nature of random variation; diagnostic tests; probability distributions including the binomial and normal distributions. (8 lectures)

Estimation: sampling distributions; confidence intervals for means, differences and proportions. (5 lectures)

Hypothesis testing: classical procedures for means, proportions, and differences; the p-value; statistical vs clinical significance; power and sample size. (3 lectures)

Analysis of variance: completely randomised design; Bonferroni procedure for multiple comparisons. (3 lectures)

Categorical data: tests for association; rates, relative risk and risk differences, odds ratios; confidence intervals for relative risk and odds ratio. (4 lectures)

Regression and correlation: the simple linear regression model; tests on the slope; predictions; confidence intervals for predictions; correlation. (5 lectures)

Multiple regression: tests on the estimated parameters; dummy variables for qualitative predictors; parallel regressions and control of confounding. (4 lectures)

Ethics and study design: ethical issues, bias and confounding. (7 lectures)

Internal Assessment

There will be eight assignments and three mastery tests. Each assessment will have a mark recorded out of 20. These assessments will be administered on-line. The assignments can be completed anywhere you have an internet connection. The mastery tests will be conducted in the Castle St Computer Laboratory. A booking system for half-hour slots in which to attempt the tests will operate. Cutoff times for each assignment will be announced in lectures.

Exam format

A three-hour exam will produce a mark out of 100.

Final mark

In your overall mark we will count your exam mark for 2/3 of the total and the internal assessment for 1/3. However, if your final exam mark out of 100 is greater than this, we will use just the final exam mark. That is, the final mark F will be calculated as:

F = max{E, (2E + A)/3}

where E (exam mark) is out of 100 and A (internal assessment) is out of 100. The internal assessment marks will be made up 1/3 from the eight assignments and 2/3 from the three mastery tests.
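The final-mark rule above can be written as a short function; the max expresses the "whichever is greater" clause, and all marks are taken out of 100:

```python
def final_mark(exam, assignments, mastery):
    """Final mark out of 100.

    exam: final exam mark out of 100.
    assignments: combined assignment mark out of 100.
    mastery: combined mastery-test mark out of 100.
    """
    # Internal assessment: 1/3 assignments, 2/3 mastery tests.
    internal = assignments / 3 + 2 * mastery / 3
    # Exam counts 2/3 and internal 1/3, unless the exam alone is higher.
    return max(exam, (2 * exam + internal) / 3)
```

For example, a student with exam 60 but internal assessment 90 receives (2·60 + 90)/3 = 70, while a student whose exam mark exceeds the weighted combination simply keeps the exam mark.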



Email Contact with Students

From time to time lecturers may wish to email students taking STAT 115. This will be done using your student email address, so you should check it regularly. If you have another address, you might like to arrange for emails sent to your student address to be forwarded automatically.

Disability and Impairment Support

The Department of Mathematics and Statistics encourages students to seek support if they find they are having difficulty with their studies due to a disability, temporary or permanent impairment, injury, chronic illness or deafness.

Contact either the Course Convenor, or Disability Information and Support:
Telephone: 479 8235
Email: disabilities@otago.ac.nz
Website: http://www.otago.ac.nz/disabilities

Plagiarism

Students should make sure that all submitted work is their own. “Plagiarism is a form of dishonest practice. Plagiarism is defined as copying or paraphrasing another’s work and presenting it as one’s own” (University Council, December 2004). In practice this means that plagiarism includes any attempt in any piece of submitted work (e.g. an assignment or test) to present as one’s own the work of another (whether of another student or a published authority). Any student found to be responsible for plagiarism in any piece of work submitted for assessment shall be subject to the University’s dishonest practice regulations, which may result in various penalties, including forfeiture of marks for the piece of work submitted, a zero grade for the paper or, in extreme cases, exclusion from the University.

SURV 102 Computational Methods for Surveyors

Students enrolled for SURV102 will attend lectures in STAT115 for four weeks beginning on Monday 23 July. A separate notice about assessment in SURV102 will be made in the Surveying Department.



Biostatistics and Health Research - An Overview

1 Health Research

Billions of dollars are spent every year in a quest to improve human health and well-being. The broad goal of this quest is to acquire new knowledge to help prevent, detect, diagnose and treat disease.

What sort of knowledge do we look for?

What causes a disease?
Once you have a disease, what happens?
Who has the disease?
What is the best strategy for treatment or prevention?
How do societal factors affect health?

What causes a disease?

Understanding the factors which lead to the development of disease gives ideas about how to prevent disease. For example:

• Drinking water is treated to kill bacteria, viruses and other contaminants such as giardia.
• Our ability to prevent heart disease has improved with our understanding of specific dietary components which increase risk, and with our understanding of how exercise works to reduce risk.
• The realization that the cause of AIDS was a virus (HIV) which could be transmitted through sexual intercourse and blood transfusions led to prevention strategies to reduce transmission. These included the use of condoms, screening of blood products and drugs to reduce transmission from mother to baby.
• Understanding how and when sports injuries occur helps to develop rules of play and training schedules which reduce the injury burden.

Once you have a disease, what happens?

Understanding how a disease progresses gives ideas about how to cure disease, prolong survival or improve quality of life. For example:

• Understanding how HIV affects the immune system has led to the development of drugs such as zidovudine which prevent the virus from reproducing and seem to slow the destruction of the immune system.
• Understanding how bacteria work allowed the development of different types of antibiotics with different actions.
• Cancer develops when cells in a part of the body begin to grow out of control. Knowledge of the cell cycle was important in developing cancer drugs (chemotherapy) which work only on actively reproducing cells.

Who has the disease?

Detecting who has a disease and diagnosing disease are the first steps in delivering effective treatments. For example:

• The development of non-invasive technologies for looking inside the body (such as ultrasound, CT scans and MRI) provided techniques for making the initial diagnosis of cancer, or for identifying the form of damage to a knee following injury.
• Tests which look at cells from biopsies or blood can give a more accurate diagnosis of cancer than the non-invasive technologies.
• We identify people with HIV infection through a blood test which detects antibodies to the virus.



What is the best strategy for treatment or prevention?

Once we have developed a new treatment or approach to prevention, we need to evaluate the risks and benefits of that treatment before it is made available for use. For example:

• Exercise and balance programmes have been demonstrated to reduce the risk of falling in the elderly.
• The statin family of drugs has been demonstrated to reduce the risk of death from cardiovascular disease.
• Evaluations of the use of beta-carotene (which the body converts to vitamin A) found that, contrary to expectations, it did not prevent lung cancer; in fact it increased the risk of lung cancer.

How do societal factors affect health?

Working with individuals can lead to significant improvements in health, but societal factors can also have an impact.

• Societal attitudes to alcohol and smoking can make it difficult for individuals to change behaviour.
• Understanding how societal factors operate is important for developing systems of health care.

Where does knowledge come from?

During the last century we have gained an enormous amount of knowledge, but there are still many gaps.

• Cancer and cardiovascular disease still end many people’s lives prematurely.
• Back pain is very common. We are still not very good at treating it or preventing it.
• Diabetes is becoming increasingly common, particularly among Maori and Pacific Island populations. It has many serious health consequences.
• New diseases provide additional challenges. HIV/AIDS, a disease thought to have jumped the species barrier into humans, has had an enormous impact. Avian influenza is common in birds in Asia and can cause severe disease in humans, but doesn’t currently spread directly from human to human. It would only take a small change in the genome of the virus, however, to make it highly infectious amongst humans.

Knowledge can come from ‘experience’ or ‘research’

Experience is a very unreliable way of obtaining knowledge. Humans are not objective; our recall is very selective. The history of medicine is littered with treatments which doctors were convinced, through their own experience, worked, but which time has shown to be ineffective or harmful in many of the settings where they were used: bloodletting, ground woodlice, mercury, arsenic, and so on. These treatments were widely used centuries ago, but there are more modern examples.

• An early treatment for heart attack, where blood flow to part of the heart muscle is blocked, involved sprinkling powdered asbestos on to the heart to increase blood flow to the affected areas. It was never truly shown to work, but thousands of these operations were done.
• Hormone replacement therapy was widely used initially for treatment of the symptoms of menopause, but was also believed to reduce the risk of heart disease in post-menopausal women. The results of a recently published study found that, in fact, it increased the risk of heart disease.

That leaves research.



2 The Research Process and Biostatistics

What is research?

Research is a systematic process for providing answers to questions.

Examples of research questions:

• What are the causes of meningococcal meningitis?
• What is the best treatment strategy for chronic back pain?
• What are the genetic events that lead to childhood cancer?
• Can this new drug improve survival in people with colon cancer?
• What is the role of selenium as an antioxidant in the protection against risk factors for cardiovascular disease?
• To what extent do western diet and exercise habits need to change in order to reduce insulin resistance?
• Does this conditioning programme reduce serious knee injury in team sports?

Biostatistics is the field of development and application of statistical methods to research in health-related fields, including medicine, public health, and biology. Since early in the twentieth century, biostatistics has become an indispensable tool for health research.

Statistics is often defined as the art and science of collecting, summarising, presenting and interpreting data. Statistics is a set of techniques which formally implement the fundamental principles of the scientific method. The scientific method underlies the research process: observation and theories lead to the development of hypotheses. We work out the best test of the hypothesis, then collect data and determine to what extent the data are consistent with the hypothesis.

The research process

When we carry out research we often collect data on a sample or subgroup from a population. Our goal is to use the information collected on that sample to draw inferences about a larger population.

[Diagram: statistics computed on a sample support inference about the underlying population.]



Examples

• We use the frequency with which diabetes occurs in a sample to estimate the frequency with which diabetes occurs in the population the sample came from.
• We study a new treatment in a subgroup of patients in order to be able to make claims about the effects of the treatment in all such patients.

Steps in the research process<br />

Development <strong>of</strong> the research questions<br />

Design <strong>of</strong> the study<br />

Collection <strong>of</strong> information<br />

Data description <strong>and</strong> analysis<br />

Interpretation <strong>of</strong> results<br />

Ideas for research come from many places – from reading the literature, observation and clinical experience, from talking to colleagues and from just sitting and thinking.

The first step is to refine the idea into a question, or series of questions, which can be answered in a single study; that is, we need to be able to design a study to answer the question. The question may be framed as a hypothesis. For example, we might wish to answer the question “Does a low fat diet reduce the risk of diabetes?” The hypothesis would be “A low fat diet reduces the risk of diabetes”. We then need to work out how best to test the hypothesis.

The study design specifies the methods for selecting people (or other units) for the study and for collecting the information that will be used to answer the questions. It needs to be feasible and ethical. We need to identify which study designs can give us appropriate data, and how to maximize our chance of being able to distinguish a true relationship from random noise.

Once we have collected the data we use statistical methods to describe and analyse the data and interpret the results. The analysis and the interpretation of the results will depend on the study design.

Biostatisticians work with scientists to identify and implement the correct statistical methods for designing studies and for analyzing and interpreting the results.

3. Introduction to study design

Understanding where data come from is vital for making sensible choices about statistical analysis. At this stage in the course we will give an overview of some of the study designs that are commonly used in epidemiology and clinical research. We will return to this material in the second half of the course.

There are several different ways to classify study designs, and several specific ‘named’ study designs. It can be confusing, since different epidemiology books use the terms differently. The classifications and definitions exist to help us think about the strengths and weaknesses of a particular study for addressing the research questions. The differences in the ways the definitions are used arise where textbooks emphasize the relative strengths and weaknesses a little differently.



3.1 Classifications of Study Designs

1. Descriptive versus analytic

This classification relates to the primary aims or objectives of the study. Where the study aims to test an hypothesis we say the study is analytic. For example, does this vaccine reduce the risk of meningococcal disease? Here we hypothesize a relationship between vaccine and risk of meningococcal disease (we hypothesize that the vaccine reduces risk) and aim to test that hypothesis. Analytic studies are studies which test hypotheses.

Descriptive studies are used where the aims are simply to describe something, with no pre-specified hypothesis. For example, if we wish to describe trends in incidence of meningococcal disease over time we carry out a descriptive study. Here there are no pre-specified hypotheses about the reasons for a change over time.

Many descriptive studies in epidemiology describe patterns of disease in populations. This can provide clues about causes of disease and lead on to further studies. The standard approach is to examine the characteristics of disease according to time, place, and person:

TIME: A descriptive study can be repeated in order to examine trends over time (examples: epidemics, seasonality, eg influenza).

PLACE: Many diseases vary according to country, or even within countries (examples: breast cancer incidence by country, multiple sclerosis and latitude).

PERSON: Characteristics of people with the disease can be studied, for instance age, sex, ethnic group, socioeconomic group, occupation (example: heart disease in New Zealand according to age, sex and ethnic group).

2. Experimental versus observational

In experimental studies the investigators intervene in the natural order (hence the alternative name intervention study). The investigator decides the exact nature of the intervention, chooses a control strategy, and decides who will receive the intervention under study and who will be part of the control group. The goal is to control the conditions so that the effect of interest can be isolated and studied. For example, if investigators want to know whether a drug (nevirapine) reduces maternal-infant transmission of HIV, they can construct an experiment which isolates the effect of the drug from any other factors which might affect risk of transmission. The extent to which we can isolate the effect of the intervention (eg the drug) determines how good the experiment is. Of course ethics are a fundamental consideration.

In observational studies we simply observe a naturally occurring process without intervening. It is much harder to test a hypothesis in an observational study, but for many research questions in the health sciences it is not ethical or feasible to conduct an experiment. We aim to design our observational studies to get as close as possible to the information we would have got if the experiment could have been done.

3. Randomised versus non-randomised (applies to experiments only)

Experiments should always have a control group as well as a group (or groups) which gets the intervention(s) under study. Randomisation is a process we can use to allocate people to either the intervention group or the control group – the simplest version of randomisation is like flipping a coin: each person has a 50% chance of being in the intervention group. Careful use of randomisation gives the best test of an hypothesis.
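The coin-flip allocation described above can be sketched in a few lines of Python; the participant labels and the `randomise` helper are invented for illustration:

```python
import random

def randomise(participants, seed=None):
    """Allocate each participant to the intervention or control group
    by an independent 'coin flip' (a 50% chance of intervention)."""
    rng = random.Random(seed)
    return {person: ("intervention" if rng.random() < 0.5 else "control")
            for person in participants}

# Six invented participants; fixing the seed makes the allocation repeatable.
allocation = randomise(["P01", "P02", "P03", "P04", "P05", "P06"], seed=1)
for person, group in allocation.items():
    print(person, group)
```

In practice, trials often use restricted randomisation (eg blocking) to keep the group sizes balanced, but the chance mechanism is the same idea.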



In some experiments the investigators use a method other than randomisation to decide who will be in the intervention group and who will be in the control group. For example, in a community intervention study the investigators might choose a set of communities to get the intervention (often those interested or with structures in place to take part), and then choose a matched set of control communities. Experiments like this which are non-randomised are sometimes referred to as quasi-experiments. Sometimes they are the only practical alternative, but they never provide the same strength of evidence as a randomised trial.

Note that the process of randomisation is not the same as random sampling. The purpose of random sampling is to select a single group which is representative of a population (see below).

4. Cross-sectional versus longitudinal

This classification refers to the data themselves and the (calendar) time points or periods about which the information is collected. For example, we might do a study looking at the relationship between oral contraceptive use and coronary heart disease. Fully cross-sectional data would refer to one point in (calendar) time. For example, in a survey we might ask: do you have coronary heart disease today? Are you taking oral contraceptives today? Note that if we are collecting data on existing disease we are working with prevalence of coronary heart disease rather than incidence of coronary heart disease, and so cross-sectional data are not very good for testing hypotheses about the causes of disease. (The exposures may have changed after disease was diagnosed.)

Longitudinal data have some time course present. The ideal for testing hypotheses about disease causation is to get information about things that occurred before the disease developed. Often the best we can do is collect information about exposures that occurred before diagnosis of disease, since the time between developing disease and diagnosis is often unclear. Longitudinal studies collect information over a period of time, eg exposures which occur before disease is diagnosed.

5. Study unit

The majority of studies in epidemiology collect data on individuals. However, there are some where the ‘unit’ under study is something bigger – such as a family, a community or a country. In some studies it is the group that is of interest, not the individual, and we might want to test a hypothesis relating to the group (an analytic study). For example, the COMMIT study asked: does a community prevention programme reduce the prevalence of smoking in the community? The intervention is carried out at the community level, and we can evaluate it by examining whether the prevalence of smoking in the community changes. Note the outcome data are collected on the individual (whether someone smokes or not), to measure the effect of the intervention in a community.

3.2 Common study designs in epidemiology and clinical research

1. Case report

Usually describes the occurrence of disease in one person. The purpose is to alert others to the fact that this combination of factors can occur, and to encourage people to keep a look out for other similar cases. Such case reports (to a central registry) led to the initial recognition of AIDS. Case reports are always descriptive and observational. The cross-sectional/longitudinal classification doesn’t really apply, but they could be considered ‘longitudinal’ in the sense that they may collect data on the person’s experience over time.

2. Case series

A case series takes a group of people with a recognised disease and describes patterns among them. A study of the initial case series of men diagnosed with AIDS recognised a common dysfunction of the immune system, and that the disease occurred in gay men, injecting drug users and blood product recipients. This led to the hypothesis that it was caused by a transmissible agent, and gave clues as to the modes of transmission. Case series are always descriptive and observational, and are generally cross-sectional, but could be longitudinal if they describe changes in individuals over time.

3. Descriptive study using population data

Many descriptive epidemiological studies make use of data that are collected routinely on a population. This includes census data, death certificates, data reported to cancer registries, hospital morbidity and mortality data, and infectious disease data reported as ‘notifiable’ diseases. Provided the data sources are reliable, this can provide valuable descriptions of the disease (or risk factor) experience in a population. These studies are descriptive and observational.

4. Sample survey

Where data are collected specifically for a research study, they generally involve collecting data for only a sample (subset) of the population of interest. This gives the opportunity to collect more information about each person, at the cost of the random variation that comes with sampling from a population. There are many ways to go about selecting a sample. In quantitative research we generally choose random samples. In a random sample everyone has a known chance of being selected for the study; this allows us to use statistical methods to accurately determine the influence of random error (through use of confidence intervals), and hence to make valid inferences regarding the population the sample came from. Random sampling gives us the best chance of getting a sample which is representative of the population.

The simplest type of random sample is a simple random sample, where everyone has the same chance of being chosen. We can also draw stratified samples or cluster samples. In stratified sampling we divide the population into groups (or strata) – for example, ethnic groups. We then choose to sample a fixed number from each stratum to ensure all groups are adequately represented in the study. For example, we might wish to choose the same number of people from each ethnic group to ensure we have enough data for reliable estimates in each group.

Cluster sampling is used where we can’t easily select a sample of individuals. For example, if we wish to study children, we can’t select a simple random sample because we have no list of children from which to select the sample. One approach commonly used is to select schools at random, classrooms within a school at random, and children from a class at random.
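The simple random and stratified schemes above can be sketched in Python; the population, the stratum labels and the per-stratum sample size of 5 are all invented for illustration:

```python
import random

rng = random.Random(2024)  # fixed seed so the draw is repeatable

# Invented sampling frame: 100 people, each with an ethnic-group label.
population = [(i, "A" if i % 3 else "B") for i in range(1, 101)]

# Simple random sample: everyone has the same chance of being chosen.
simple_sample = rng.sample(population, 10)

# Stratified sample: divide the population into strata (here by group
# label), then draw a fixed number (5) from each stratum so that every
# group is adequately represented.
strata = {}
for person in population:
    strata.setdefault(person[1], []).append(person)
stratified_sample = [p for group in strata.values()
                     for p in rng.sample(group, 5)]
```

Cluster sampling would instead draw whole units (schools, then classrooms) at random and sample individuals within them.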

A true survey generally means getting people to fill in a questionnaire. However, people have extended the idea to include other forms of data collection: we may take measurements of height and weight, fitness tests, blood tests and so on.



These studies are most often descriptive but can be analytic; they are observational, and can be cross-sectional or longitudinal.

5. Cross-sectional study

In epidemiology the term cross-sectional study often refers to a survey. The data are often not fully cross-sectional according to the definition above. For example, we might carry out a survey of use of hormone replacement therapy (HRT) among New Zealand women. Such a survey would generally ask about past life experiences and past use of HRT, rather than just current use, which gives a longitudinal element to the data. When the study collects information about disease status, it is generally prevalent disease. So while cross-sectional studies can be used to test hypotheses, they are not very good for testing hypotheses about disease causation.

6. Case-control study

Two groups:
• Group with disease (cases)
• Group free from disease (controls)

In a case-control study, people are selected for the study according to whether they have the disease of interest (cases) or not (controls). Generally case-control studies identify incident cases and collect information about experiences before diagnosis of disease for the cases, and for an equivalent time period for the controls. Case-control studies are sometimes called retrospective studies because information is collected about exposures that occurred in the past. For example, a case-control study of cervical cancer selected a group of women with cervical cancer and a control group of women who did not have cervical cancer. Information was collected about past experiences which were hypothesised to be related to risk of cervical cancer, including number of sexual partners. Case-control studies are analytic, observational and longitudinal.

7. Cohort study

A group of people is observed over a period of time in order to measure the frequency of the disease being investigated. A cohort study starts by documenting exposures and then measuring the subsequent risk of developing disease, according to exposure. Cohort studies aim to identify associations between exposure to suspected causal agents and the development of disease. The cohort may be selected by taking a random sample from a population (eg the Scottish Heart Study), by selecting some geographical areas (eg the Framingham Study) or by taking a particular group (eg the British Doctors Study, the Nurses' Health Study). Researchers may also identify an exposed group of interest (eg people working in a particular industry) and find an appropriate control group who are not exposed to the substance under study. Exposure can be measured at the beginning of the study (baseline) and also periodically during the follow-up period. The entire cohort of people is followed up to determine if and when disease develops.

8. Randomised controlled trial (RCT)

In a randomised controlled trial a group of study participants is selected and then randomly allocated to an intervention group or groups (who get the intervention under study) and a control group. Since group allocation is entirely by chance, this is the best approach for getting two groups who are comparable in all respects. This means that if there is a difference in outcome between the two groups it can be attributed to the intervention (provided other aspects of the study are well carried out).

9. Clinical trial

This is the term used for an experiment which evaluates a treatment. Clinical trials are often, but not always, randomised controlled trials.

10. Prevention trial

This is the term used for an experiment used to evaluate a prevention strategy. They can be randomised controlled trials.

11. Community intervention study

This is the term used for a study to evaluate a community intervention. They are usually experiments, but often not randomised, and may not involve a control group.

4. Content of STAT 115

Learning aims and objectives

By the end of the course students should:

• be aware of the appropriate use of common study designs and their strengths and weaknesses
• be able to describe the information contained in a data set
• be able to carry out common statistical data analyses
• be able to interpret the results of common statistical analyses in the context of the particular study design used
• be aware of ethical issues relating to research involving humans
• be able to critically evaluate selected research articles published in health sciences journals.

The material in this course will provide skills for interpreting research in your chosen field of study, as well as some basic skills for analysing data that you collect through course projects or labs, using a computer and a statistical software package. If you have mathematical skills, and are stimulated by the idea of being involved in health research, you may wish to pursue a career in biostatistics. There are many jobs available for biostatisticians, in New Zealand and overseas. Most are employed in research groups at universities, in government, or in pharmaceutical or biotech companies.

Types of research questions covered in STAT 115

There are many types of research question in the health sciences:

• Laboratory studies: research involves understanding how cells and cell components work, and identifying compounds which can be used to treat disease and how they affect cells.

• Animal studies: used as models for humans



• Human studies:
  – anatomy and physiology consider the structure and function of the human body
  – clinical research asks questions relating to patient care, including evaluation of new treatments
  – epidemiology is the study of the distribution and causes of disease

• Studies of public health: the science and art of promoting health, preventing disease and prolonging life through organised efforts of society

• Studies of society:
  – medical sociology examines topics such as the social aspects of physical and mental illness, physician-patient relationships, the organization and structure of health organizations and the socio-economic basis of the health care system.

In STAT 115 we will focus on research questions involving humans, mainly clinical research and epidemiology. There are many research questions in these areas which can be understood without specialised knowledge. In the other areas, particularly laboratory studies, an in-depth understanding of the field (eg biochemistry, molecular biology, anatomy or physiology) is needed to understand the research questions.

Studying humans brings particular challenges, and it is these challenges which have driven the specialised development of biostatistics from its statistical basis. The challenges arise from the more complex ethical issues in research involving humans, as well as the complexities of the biological system and the consequent research questions we wish to answer.



SECTION 1

This section covers an introduction to the package R-cmdr and presents an overview of biostatistics and research methodology.

Biostatistics and Research Methodology; R-cmdr
Types of Data
Numerical Data and Histograms
Measures of Centre: Mean and Median
Measures of Variability: Standard Deviation, Variance and Interquartile Range
Box-and-Whisker Plots

1

Section 1


Biostatistics and research: an overview

Course aim: an introduction to the core biostatistical methods essential to the health sciences

• scientific method
• design of research studies
• description and analysis of data

The scientific method underpins the design of research studies. Sound research design is vital for obtaining reliable information. A major part of this course is about techniques for describing data and understanding the analysis principles. This enables us to make sense of the mass of information collected in a research study.



Learning aims and objectives

By the end of the course students should:

• be aware of the appropriate use of common study designs and their strengths and weaknesses
• be able to describe the information contained in a data set
• be able to carry out common statistical data analyses
• be able to interpret the results of common statistical analyses in the context of the particular study design used
• be aware of ethical issues relating to research involving humans
• be able to critically evaluate selected research articles published in health sciences journals



Goal of health sciences professions

To improve the health and well-being of individuals and communities.

This involves:
• treatment of disease
• prevention of disease
• promotion of health

In order to do this we need knowledge about:
• causes of disease
• diagnosis
• disease processes
• effectiveness of treatments
• societal factors which affect health



Examples of current gaps in knowledge

• causes of meningococcal meningitis – how to prevent? a vaccine?
• SARS, avian influenza – new diseases
• back pain – not good at treating
• cancer – nasty treatments for child cancer
• diabetes – common in Pacific communities
• cardiovascular disease – common cause of death
• prevention of overweight and obesity
• effective promotion of behaviour change – eg prevention of smoking

Knowledge may come from:
• teaching
• experience
• research



Research

A process for providing answers to questions for which the answer is not immediately available.

General research areas:

What are the causes of meningococcal meningitis?
Can we develop a vaccine to prevent SARS?
What are the genetic events which lead to childhood cancer?
Can a new drug improve survival in people with colorectal cancer?
How can we prevent childhood overweight and obesity?
What are the main factors affecting quality of life of people with a chronic illness?

Research provides a systematic process for answering these questions.



Iron Deficiency – Should NZ parents be concerned?

[Dr Elaine Ferguson, Dept of Human Nutrition]

A survey randomly selecting 323 children aged 6–24 months in Dunedin, Christchurch and Invercargill:
• to assess the prevalence of iron deficiency
• to explore factors associated with low body iron store.

Possible factors are:
Categorical: Sex, Ethnicity, Maternal Education, Household Income, Breast feeding
Continuous: Age, Meat intake

Regression methods are used as well as procedures for summarising data.



Does early childhood circumcision reduce the risk of acquiring genital herpes?

[Dr Nigel Dickson, Dept of Preventive and Social Medicine]

• Cohort of over 1000 births in 1972 in Dunedin.
• Called the Dunedin Multidisciplinary Health and Development Study.
• Does early circumcision reduce the risk of genital herpes?
• Initially this appears to be the case, but it is an observational study.
• Number of sexual partners is a confounder.
• When the confounder is allowed for, early circumcision appears not to be protective.
• Designed experiments (or clinical trials) have been set up in Africa to investigate the effect of circumcision on HIV.



The research process

The objective for most studies is to use data from a sample to draw inference about a larger population:

[Diagram: statistics computed on a Sample support Inference about the Underlying population.]

Examples:
• we use the frequency with which a disease occurs in a sample to estimate the frequency with which the disease occurs in the population
• we study a new treatment in a group of patients in order to be able to make claims about the effects of the treatment in all such patients



Steps in the research process:

Development of the research question
Design of the study
Collection of information
Data description and analysis
Interpretation of results

• the research question
  - needs to be framed very carefully
  - must be specific enough to be answerable by a research study
• the study design
  - is determined by the research question
  - describes the methods used to collect the information
• analysis and interpretation
  - depend on the study design



Research questions relevant to this course:<br />

Epidemiology:<br />

the study <strong>of</strong> the distribution<br />

<strong>and</strong> determinants <strong>of</strong> disease<br />

frequency<br />

Clinical research: the study <strong>of</strong> questions<br />

relating to care <strong>of</strong> patients<br />

Descriptive questions:<br />

What is the distribution <strong>of</strong> a disease?<br />

What is the natural history <strong>of</strong> a disease?<br />

Analytic questions:<br />

What are the causes <strong>of</strong> a disease?<br />

Will this approach prevent disease?<br />

Does this treatment improve outcome?<br />



Data Analysis <strong>and</strong> Computer S<strong>of</strong>tware<br />

Easy to use s<strong>of</strong>tware is essential for data<br />

management <strong>and</strong> data analysis. In this course R-<br />

cmdr (R Commander, a menu-driven interface to<br />

the statistical package R) will be used. The package is widely available on<br />

campus, used in most <strong>Department</strong>s which specify<br />

first year statistics as a pre-requisite, <strong>and</strong> widely<br />

available internationally.<br />

You may have used EXCEL at school or at<br />

<strong>University</strong>. EXCEL<br />

is excellent for data management <strong>and</strong> reporting<br />

but is poor for statistical analyses <strong>and</strong> clumsy for<br />

graphical procedures.<br />

R-cmdr is easy to use with good pull down menu<br />

options. There are three windows in R-cmdr<br />

• Data Editor (where data being analysed are<br />

located)<br />

• Output Window (where results appear)<br />

• Syntax Window (not used in this course)<br />



Introduction to study design<br />

1. Descriptive studies<br />

2. Analytic studies<br />

Experimental studies<br />

Observational studies<br />

Examples <strong>of</strong> analytic study types<br />

3. Summary<br />

Classification <strong>of</strong> research designs<br />

Classification <strong>of</strong> common study types<br />

There are two types <strong>of</strong> research questions.<br />

Descriptive – describing things<br />

Analytic – testing hypotheses<br />

Strengths <strong>and</strong> weaknesses <strong>of</strong> the different designs<br />

will be discussed.<br />



1. Descriptive studies<br />

Aim: to describe, for example:<br />

• the characteristics <strong>of</strong> people with a disease<br />

(person, place, time)<br />

• lifestyle patterns <strong>of</strong> a population<br />

• attitudes to health care<br />

• etc<br />

Descriptive studies are <strong>of</strong>ten called surveys or<br />

cross-sectional studies<br />

Descriptive studies generally use a sample from a<br />

population<br />



Example: What are the serum cholesterol levels<br />

<strong>of</strong> New Zealanders?<br />

Method:<br />

Select a subgroup (sample) <strong>of</strong> people<br />

<strong>and</strong> measure their serum cholesterol<br />

levels<br />

R<strong>and</strong>om sampling<br />

• choose the sample in such a way that<br />

every individual in the population has a<br />

known chance <strong>of</strong> being selected<br />

• in a simple r<strong>and</strong>om sample, everyone has<br />

an equal chance <strong>of</strong> being chosen<br />

• this method is the best way <strong>of</strong> obtaining a<br />

sample which is representative <strong>of</strong> the<br />

population<br />

Suppose we want to estimate mean cholesterol in<br />

the population:<br />



Sample average = true mean + error<br />

(the true mean is unknown; the error splits into<br />

random error and systematic error)<br />

random error:<br />

• due to natural biological variability<br />

• increasing the sample size will reduce<br />

the r<strong>and</strong>om fluctuations in the sample<br />

mean<br />

systematic error (=bias)<br />

• due to aspects <strong>of</strong> the design or<br />

conduct <strong>of</strong> the study which<br />

systematically distort the results<br />

• occurs if a sample is not representative <strong>of</strong><br />

the population<br />

• cannot be reduced by increasing the<br />

sample size<br />
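The two kinds of error behave very differently as the sample grows: a standard result (not derived in these notes) is that the typical random error in a sample mean shrinks like 1/√n, while a systematic error is untouched by n. A minimal Python sketch, using an assumed population standard deviation and an illustrative bias:<br />

```python
import math

sigma = 1.0  # assumed population standard deviation (hypothetical units)

def random_error(n):
    """Typical size of the random error in a sample mean: sigma / sqrt(n)."""
    return sigma / math.sqrt(n)

def total_offset(n, bias):
    """Approximate offset of a biased sample mean from the true mean."""
    return bias + random_error(n)

# Quadrupling the sample size halves the random error...
print(random_error(100))             # 0.1
print(random_error(400))             # 0.05
# ...but a systematic error of 0.2 persists no matter how large n is.
print(total_offset(1_000_000, 0.2))  # ~0.201
```

The values 1.0 and 0.2 are illustrative only, not taken from any study in these notes.<br />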



2. Analytic studies<br />

Purpose: to test hypotheses, about, for<br />

example:<br />

• causes <strong>of</strong> disease<br />

• methods for prevention <strong>of</strong> disease<br />

• the effects <strong>of</strong> treatments<br />

Experimental studies<br />

• the researcher intervenes <strong>and</strong> records the<br />

result <strong>of</strong> their intervention<br />

• the aim is to control all other factors to<br />

isolate the effects <strong>of</strong> the intervention<br />

• best way to study causation<br />

Observational studies<br />

• the investigator does not intervene, simply<br />

observes a naturally occurring process,<br />

<strong>and</strong> collects information<br />

• ideal is to get as close as possible to the<br />

information that would have been<br />

obtained if the experimental study could<br />

have been done<br />



Example: Options for studying the<br />

relationship between smoking <strong>and</strong> lung cancer<br />

Experimental study<br />

Randomly assign people to be smokers or<br />

non-smokers at the start, follow both groups for<br />

20 years, then check lung cancer rates.<br />

Clearly unethical.<br />

Observational study<br />

Cohort: start with known smokers and known<br />

non-smokers, follow each group for 20 years, then<br />

compare the % with lung CA in the two groups.<br />

Problem: groups may differ in other ways that are<br />

related to CA risk – confounding.<br />

Case control: take people with lung cancer now<br />

(cases) and people without lung cancer now<br />

(controls), and compare the % of each group who<br />

smoked over the past 20 years.<br />

No long term follow up needed. Smaller<br />

samples. Could be recall bias from 20 years ago.<br />

Also confounding.<br />



Examples <strong>of</strong> analytic study types<br />

R<strong>and</strong>omised controlled trial (RCT)<br />

• a “Gold st<strong>and</strong>ard” analytic study (best)<br />

• experimental<br />

Characteristics <strong>of</strong> a RCT:<br />

• select a group <strong>of</strong> people<br />

• r<strong>and</strong>omly allocate them to either an<br />

intervention or a control group<br />

• follow participants up over time, <strong>and</strong><br />

measure outcome<br />

A control group is used to isolate the effects <strong>of</strong><br />

the intervention<br />

R<strong>and</strong>om allocation, or r<strong>and</strong>omisation means<br />

every person has the same chance <strong>of</strong> being in<br />

each group. This gives the best chance <strong>of</strong> getting two<br />

groups which are comparable in all respects<br />

Used to evaluate new treatments<br />

Often not ethical in studies <strong>of</strong> disease causation<br />



Example RCT: LIPID study (NEJM, 1998)<br />

Does treatment with pravastatin reduce the risk<br />

<strong>of</strong> death in patients with coronary heart disease?<br />

Study participants:<br />

9014 patients<br />

age 31-75<br />

coronary heart disease<br />

cholesterol 155 - 271mg/decilitre<br />

Participants were randomly allocated to a control<br />

group (n=4502) or an intervention group receiving<br />

pravastatin (n=4512). After 6 years, mortality was<br />

8.3% in the control group and 6.4% in the<br />

pravastatin group.<br />
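Two measures of association met later in the course can be computed from the mortality figures quoted above; a short Python check of the arithmetic:<br />

```python
# LIPID trial 6-year mortality, as quoted in the notes.
control_mortality = 0.083      # control group, n = 4502
pravastatin_mortality = 0.064  # pravastatin group, n = 4512

# Relative risk: risk in the treated group divided by risk in the control group.
relative_risk = pravastatin_mortality / control_mortality
# Risk difference: absolute reduction in risk.
risk_difference = control_mortality - pravastatin_mortality

print(round(relative_risk, 2))    # 0.77
print(round(risk_difference, 3))  # 0.019 (1.9 percentage points)
```

These derived figures are computed here for illustration; they are not quoted in the notes themselves.<br />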



Advantages <strong>of</strong> RCT:<br />

• experiment – the best way to test a<br />

hypothesis<br />

• differences in outcome can be attributed to<br />

the exposure<br />

Disadvantages <strong>of</strong> RCT:<br />

• may not be ethical<br />

Cohort study<br />

Observational study, generally carried out to test<br />

hypotheses<br />

Characteristics:<br />

• participants are selected before<br />

disease has developed<br />

• followed over time to determine<br />

development <strong>of</strong> disease<br />

• information is collected about<br />

exposures at baseline <strong>and</strong> during<br />

follow-up<br />

• longitudinal<br />



Example <strong>of</strong> cohort study:<br />

Study to investigate the relationship between<br />

smoking <strong>and</strong> lung cancer (eg British Doctors<br />

study)<br />

Start with a group of people without lung cancer,<br />

split into smokers and non-smokers. Follow both<br />

groups for 20 years, then compare the % who<br />

develop lung cancer in each group.<br />



Case-control study<br />

Observational study, generally carried out to test<br />

hypotheses<br />

Characteristics<br />

• participants are chosen on the basis <strong>of</strong><br />

their disease status: a group with disease<br />

(cases) <strong>and</strong> a group without (controls)<br />

• information is collected from people with<br />

<strong>and</strong> without disease about exposures that<br />

occurred in the past<br />

• longitudinal (retrospective)<br />



Example <strong>of</strong> case-control study<br />

Study to investigate the relationship between<br />

smoking <strong>and</strong> lung cancer<br />

Known at the start: a group of people with lung<br />

cancer (cases) and a group of people without lung<br />

cancer (controls). Document the smoking history<br />

of each group, then compare the % who smoked<br />

in the past.<br />



Cohort vs case-control studies<br />

Cohort study<br />

Advantages:<br />

• closest observational study to<br />

r<strong>and</strong>omised controlled trial<br />

• good for examining common outcomes<br />

• can evaluate the effect <strong>of</strong> exposure on<br />

multiple outcomes<br />

Disadvantages:<br />

• long duration needed if the disease takes<br />

a long time to develop after exposure<br />

• if the disease is rare, the number <strong>of</strong><br />

participants needs to be very large<br />

Case-control study<br />

Advantages<br />

• relatively quick<br />

• smaller than cohort studies, particularly<br />

for rare diseases<br />

• can examine the effects <strong>of</strong> multiple<br />

exposures<br />

Disadvantages<br />

• events have already occurred so the<br />

potential for bias is higher<br />



3. Summary<br />

Classification <strong>of</strong> research designs<br />

Note: these provide a useful framework for<br />

thinking about the strengths <strong>and</strong> weaknesses <strong>of</strong><br />

different study designs, but they will not always<br />

work.<br />

i) Classification by purpose <strong>of</strong> the study<br />

descriptive (describe things)<br />

versus<br />

analytic (testing hypotheses)<br />

ii) Classification by form <strong>of</strong> the design<br />

experimental (researcher intervenes)<br />

versus<br />

observational (researcher observes)<br />

iii) Classification by time<br />

cross-sectional<br />

(information collected about one point in time)<br />

versus<br />

longitudinal<br />



Classification <strong>of</strong> common study types<br />

R<strong>and</strong>omised controlled trial<br />

• analytic<br />

• experimental<br />

• longitudinal<br />

(prospective)<br />

Cohort study<br />

• analytic<br />

• observational<br />

• longitudinal (usually prospective)<br />

Case-control studies<br />

• analytic<br />

• observational<br />

• longitudinal (retrospective)<br />



Types <strong>of</strong> data <strong>and</strong> graphical summaries<br />

[A] Data <strong>and</strong> variables<br />

There are two types <strong>of</strong> measurement <strong>of</strong> interest in<br />

many scientific studies.<br />

• First, the outcomes measured on each<br />

experimental unit (plant, animal, person)<br />

provide values <strong>of</strong> what is called a response<br />

variable.<br />

• Second the characteristics or levels <strong>of</strong> exposure<br />

that explain at least some <strong>of</strong> the differences in<br />

the observed values <strong>of</strong> the response variable<br />

are called explanatory variables.<br />

e.g. iron level in newborn children is the<br />

outcome or response – what are the<br />

explanatory variables?<br />

e.g. presence <strong>of</strong> diabetes is the outcome – what are<br />

the explanatory variables?<br />

Data forming the response <strong>and</strong> exposure<br />

variables can be either categorical or numerical<br />

(otherwise known as qualitative <strong>and</strong><br />

quantitative).<br />



1. Categorical data:<br />

The simplest case involves two categories.<br />

For example a person could be<br />

• male/female<br />

• smoker/non-smoker<br />

• diabetic/non-diabetic<br />

Such data have other names such as binary<br />

data, dichotomous data, yes/no data <strong>and</strong> 0 – 1<br />

data (the last is particularly important, for<br />

example 0 represents non-diabetic <strong>and</strong> 1<br />

represents diabetic).<br />

A problem could be to establish the chance<br />

(or probability) that a woman with a certain<br />

pr<strong>of</strong>ile (defining the explanatory variables)<br />

may drink alcohol during pregnancy (the<br />

response) or equivalently to find the<br />

proportion <strong>of</strong> pregnant women who will drink<br />

alcohol. Ultimately, we are interested in who<br />

will do this.<br />

More than two categories can occur.<br />

• blood group: A/B/AB/O<br />

• Maori/Pacific Isl<strong>and</strong>/Caucasian/Asian.<br />



In these examples the data are said to be<br />

nominal. But this type <strong>of</strong> data is said to be<br />

ordinal if the categories are in some order.<br />

For example, “degree <strong>of</strong> pain” may be<br />

minimal/moderate/severe/unbearable<br />

If there are more than two ordinal categories it is<br />

not appropriate to treat codes such as 0/1/2/3 as<br />

numerical values, since “unbearable” is not three<br />

times “moderate” even though the data are<br />

ordered. Consequences <strong>of</strong> this will be<br />

important in the second half <strong>of</strong> the<br />

semester.<br />

2. Numerical data:<br />

(a) Discrete Here observations take only<br />

certain numerical values. Usually they are<br />

counts <strong>of</strong> events. For example,<br />

• number <strong>of</strong> possums caught in traps<br />

• number <strong>of</strong> children in a family<br />

(0/1/2/3/4)<br />



These are not like categorical data as 3<br />

children is three times as many as one.<br />

This type <strong>of</strong> data can be treated as though it<br />

is categorical but this discards information<br />

about the magnitude <strong>of</strong> the relationships<br />

between successive outcomes; if it is done, the<br />

ordering of the categories remains important.<br />

(b) Continuous quantitative measures. Here<br />

recorded values or observations result from<br />

some form <strong>of</strong> measurement [e.g. height,<br />

age, blood pressure, serum cholesterol,<br />

oxygen levels in a lake].<br />

• Often no restriction on values other than<br />

that caused by accuracy <strong>of</strong> equipment<br />

for recording values.<br />

• Often the values show pattern similar to<br />

what is called the bell-shaped normal<br />

curve with many values clustered<br />

around a central point <strong>and</strong> few values in<br />

the tails.<br />



3. Rates, Ratios <strong>and</strong> Proportions<br />

These are constructed from categorical data<br />

<strong>and</strong> include for example measures <strong>of</strong> disease<br />

frequency <strong>and</strong> disease association. Examples<br />

<strong>of</strong> disease frequency are<br />

• prevalence or proportion (concerned with<br />

existing cases)<br />

• incidence rate (concerned with new cases)<br />

e.g. the prevalence <strong>of</strong> obesity in the New<br />

Zeal<strong>and</strong> population<br />

(Gives indication <strong>of</strong> burden on the country<br />

by identifying proportion affected)<br />

e.g. the incidence rate <strong>of</strong> HIV in New Zeal<strong>and</strong><br />

in 2008.<br />

(This deals with number <strong>of</strong> new cases <strong>and</strong><br />

is useful if looking at causes)<br />

Examples <strong>of</strong> disease association are<br />

• absolute (or attributable) risk<br />

• relative risk<br />

• odds ratio<br />



e.g. the relative risk <strong>of</strong> melanoma for a<br />

farmer compared with an <strong>of</strong>fice worker.<br />

Here, the prevalence <strong>of</strong> melanoma<br />

among farmers is divided by the<br />

prevalence among <strong>of</strong>fice workers. This<br />

will show if there is any association<br />

between prevalence <strong>of</strong> melanoma <strong>and</strong><br />

occupation after an appropriate analysis<br />

by essentially comparing the two<br />

groups.<br />

4. Other types <strong>of</strong> response data<br />

• Scores (direct measurement not possible;<br />

instead a patient is assessed on several<br />

subjective scales then the values on each<br />

are added to give a score for a patient)<br />

e.g. 30 questions on a health survey. A<br />

respondent gives values 0 to 3 on each<br />

question, then a score out <strong>of</strong> 90 is given. This<br />

total has convenient properties whereas<br />

individual values may not.<br />

• Patients assess their degree <strong>of</strong> low back<br />

pain after treatment on scale 1 (no pain) to<br />

5 (unbearable pain).<br />



Two treatments may be assessed from the<br />

two sets <strong>of</strong> values for patients in a new<br />

treatment compared with a st<strong>and</strong>ard. The<br />

data may be viewed as categorical or<br />

continuous but there are problems as the<br />

difference between 1 <strong>and</strong> 2 is not<br />

necessarily the same as the distance<br />

between 4 <strong>and</strong> 5. The data are certainly<br />

ordinal.<br />

• In social sciences, data are <strong>of</strong>ten ordinal.<br />

e.g. In a questionnaire people are asked to<br />

respond by checking the category that best<br />

describes their level <strong>of</strong> agreement with a<br />

statement from<br />

a great deal somewhat not much not at all<br />

usually coded as 4, 3, 2, 1.<br />

Such data can be regarded as continuous or<br />

categorical (ordinal). If ordinal then a<br />

question is how many categories should be<br />

chosen e.g. 4 (as here) or 5 or 7 or 9, <strong>and</strong> is<br />

the distance between 1 <strong>and</strong> 2 the same as that<br />

between 2 <strong>and</strong> 3, etc.?<br />



[B] Describing Numerical Data<br />

Graphs can be used to summarise data but many<br />

graphs can be highly misleading especially if too<br />

much information is presented. We shall<br />

summarise numerical data graphically using<br />

• histograms<br />

• box-whisker plots<br />

Particular values which summarise numerical<br />

data are:<br />

• mean; median; mode<br />

• st<strong>and</strong>ard deviation; interquartile range<br />

These approximate the centre <strong>and</strong> the variability<br />

<strong>of</strong> the data collected respectively.<br />



Example for Continuous Data: In a<br />

hypertension study 56 men who are heavy<br />

smokers (smoked for 25 years) have blood<br />

pressures measured (in mm <strong>of</strong> Hg). Summarise<br />

the outcomes.<br />

Blood pressures are classified into intervals to<br />

form a frequency table <strong>and</strong> interval frequencies<br />

(f j ) are obtained as shown below.<br />

Frequency Table<br />

Pressure(mm <strong>of</strong> Hg) Frequency (f j )<br />

59.5 – (69.5) 2<br />

69.5 – (79.5) 7<br />

79.5 – (84.5) 9<br />

84.5 – (89.5) 10<br />

89.5 – (94.5) 11<br />

94.5 – (99.5) 7<br />

99.5 – (109.5) 8<br />

109.5 – (119.5) 2<br />

Total<br />

56 (sample size)<br />

Although the readings are likely to be recorded to<br />

the nearest mm <strong>and</strong> hence appear to be discrete,<br />

the data are actually continuous <strong>and</strong> for this<br />

reason the intervals are recorded as 59.5 – (69.5)<br />

which is 59.5 up to but not including 69.5.<br />



Relative frequency: this is f j /n in the j th interval<br />

where n is the sample size.<br />

Pressure<br />

Relative<br />

(mm <strong>of</strong> Hg) Freq(f j ) Freq(f j /n)<br />

59.5 – (69.5) 2 0.036<br />

69.5 – (79.5) 7 0.125<br />

79.5 – (84.5) 9 0.161<br />

84.5 – (89.5) 10 0.179<br />

89.5 – (94.5) 11 0.196<br />

94.5 – (99.5) 7 0.125<br />

99.5 – (109.5) 8 0.143<br />

109.5 – (119.5) 2 0.036<br />

Total 56 1.00<br />

Here, 2/56 = 0.036 (rounded to 3 d.p.)<br />

7/56 = 0.125<br />

Percentage frequency: the relative frequency<br />

multiplied by 100.<br />

e.g. 0.036 = 3.6% (or 3.6 per 100) meaning that<br />

3.6% <strong>of</strong> the values are in 59.5 – (69.5)<br />

Note: Relative (or percentage) frequencies allow<br />

comparison <strong>of</strong> samples when samples are <strong>of</strong><br />

unequal size. Absolute frequencies f j will not<br />

allow this since all f j will be large for a large<br />

sample <strong>of</strong> outcomes but small for a small sample.<br />
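The relative frequencies in the table above can be recomputed directly; a short Python check:<br />

```python
# The f_j column of the blood-pressure frequency table.
freqs = [2, 7, 9, 10, 11, 7, 8, 2]
n = sum(freqs)  # sample size

# Relative frequency in each interval is f_j / n, rounded to 3 d.p.
rel_freqs = [round(f / n, 3) for f in freqs]

print(n)          # 56
print(rel_freqs)  # [0.036, 0.125, 0.161, 0.179, 0.196, 0.125, 0.143, 0.036]
```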



Histograms: These are simple pictures <strong>of</strong> the<br />

data. The base <strong>of</strong> a rectangle is interval length<br />

<strong>and</strong> area <strong>of</strong> a rectangle is proportional to class<br />

frequency (or relative frequency). When class<br />

intervals are all equal, rectangle heights are<br />

proportional to the frequencies as well.<br />

Example: Return to the blood pressure readings.<br />

Pressure (mm) (f j ) (f j /n)<br />

59.5 – (69.5) 2 0.036<br />

69.5 – (79.5) 7 0.125<br />

79.5 – (84.5) 9 0.161<br />

84.5 – (89.5) 10 0.179<br />

89.5 – (94.5) 11 0.196<br />

94.5 – (99.5) 7 0.125<br />

99.5 – (109.5) 8 0.143<br />

109.5 – (119.5) 2 0.036<br />

Total 56 1.00<br />

[Frequency histogram <strong>of</strong> the 56 blood pressures:<br />

vertical axis “Freq per 5mm interval” from 0 to 12,<br />

horizontal axis “Bl pr” from 59.5 to 119.5.]<br />


N.B. (1) The heights <strong>of</strong> the first two <strong>and</strong> last<br />

two rectangles are halved but their bases are<br />

doubled from 5 to 10mm. (Area therefore<br />

remains proportional to frequency in these<br />

intervals if 5mm is regarded as the horizontal<br />

“unit”)<br />

(2) The label on the vertical axis is given as<br />

“Freq. Per unit interval” where “unit” = five.<br />

(3) The relative frequency histogram follows:<br />

[Relative frequency histogram <strong>of</strong> the same data:<br />

vertical axis “Rel freq per 5mm interval” from 0.00<br />

to 0.20 (bar heights include 0.018, 0.063, 0.072,<br />

0.125, 0.196), horizontal axis “Bl pr” from 59.5<br />

to 119.5.]<br />

(4) The frequency <strong>and</strong> relative frequency<br />

histograms have the same shape. Only the<br />

scales on vertical axis differ. Both give some<br />

idea <strong>of</strong> the data centre, the extent <strong>of</strong> the<br />

variability in the data <strong>and</strong> the distribution <strong>of</strong><br />

the data.<br />



(5) The relative (or percentage) frequency<br />

histogram is used if comparing two (or more)<br />

samples <strong>of</strong> data, one sample <strong>of</strong> values from a<br />

control group <strong>and</strong> the other from a treated<br />

group <strong>of</strong> experiment units.<br />

(6) Notice how a histogram with rectangle<br />

heights proportional to class frequencies<br />

would give a misleading picture <strong>of</strong> the data.<br />

(7) You will find that most <strong>of</strong> the histograms<br />

produced by statistical packages like R-cmdr<br />

have class intervals <strong>of</strong> equal length <strong>and</strong> you<br />

can decide the number <strong>of</strong> intervals you want<br />

in the graph. Usually between 5 <strong>and</strong> 20<br />

intervals <strong>of</strong> equal length are chosen for a<br />

good summary <strong>of</strong> the data.<br />
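In R-cmdr the binning happens behind the menus; as a language-neutral sketch of what such a package does, the following Python function counts values into k equal-length class intervals (the ten data values are invented, not the 56 readings):<br />

```python
def bin_counts(data, low, high, k):
    """Count how many values fall in each of k equal intervals of [low, high)."""
    width = (high - low) / k
    counts = [0] * k
    for x in data:
        if low <= x < high:
            counts[int((x - low) // width)] += 1
    return counts

# Six 10mm-wide intervals spanning 59.5 to 119.5.
print(bin_counts([61, 72, 74, 83, 88, 91, 93, 97, 104, 112], 59.5, 119.5, 6))
# [1, 2, 2, 3, 1, 1]
```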



Measures <strong>of</strong> Central Tendency.<br />

The mean is “typical” <strong>of</strong> the majority <strong>of</strong> data in a<br />

sample.<br />

Example: Six patients lived the following years<br />

after diagnosis <strong>of</strong> HIV.<br />

Datum (Outcome) Symbol<br />

1.8 x 1<br />

3.2 x 2<br />

6.8 x 3<br />

4.6 x 4<br />

2.8 x 5<br />

7.9 x 6<br />

Mean = (1/6)(1.8 + 3.2 + 6.8 + 4.6 + 2.8 + 7.9)<br />

= 27.1/6<br />

= 4.52 years<br />

Notation: mean x̄ = (x_1 + x_2 + x_3 + x_4 + x_5 + x_6)/6<br />

or, in general, x̄ = (1/n) ∑_{i=1}^{n} x_i<br />



Note The mean need not be one <strong>of</strong> the outcome<br />

values <strong>and</strong> i is a suffix taking values i = 1 to i = n<br />

(or 6 here). Any symbol can be used for this<br />

suffix.<br />
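The worked mean above can be reproduced in a couple of lines of Python:<br />

```python
# Years lived after diagnosis, from the table above.
x = [1.8, 3.2, 6.8, 4.6, 2.8, 7.9]

# Mean = (1/n) * sum of the x_i.
mean = sum(x) / len(x)
print(round(mean, 2))  # 4.52
```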

Example: The 56 blood pressure readings just<br />

considered have a mean <strong>of</strong> 89.54mm <strong>of</strong> Hg. This<br />

value is “typical” <strong>of</strong> the data in the sense that it is<br />

near the centre <strong>of</strong> the region where most values<br />

are located.<br />

The Median is a second measure “typical” <strong>of</strong> data<br />

in a sample <strong>and</strong> is the “middle value” <strong>of</strong> the data<br />

after arranging the numbers in order from<br />

smallest to largest.<br />

Example: Data: 95 86 78 90 62 73 89<br />

Rearrange: 62 73 78 86 89 90 95<br />

Median = 86 (the middle value)<br />

Note:<br />

1. If 62 replaced by 5, the median is unchanged<br />

(the mean would be much smaller). This<br />

indicates that in general the median is not<br />

affected by a few very extreme values<br />

whereas the mean is.<br />



2. If there is an even number <strong>of</strong> values, average<br />

the two centre values.<br />

Example: For the 56 blood pressure readings,<br />

the median turns out to be 89.30 (compare mean<br />

<strong>of</strong> 89.54)<br />
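Both points are easy to verify with a small Python median function that sorts the data and, for an even count, averages the two centre values:<br />

```python
def median(values):
    """Middle value of the sorted data; mean of the two middle values if even."""
    s = sorted(values)
    n = len(s)
    mid = n // 2
    if n % 2 == 1:
        return s[mid]
    return (s[mid - 1] + s[mid]) / 2

print(median([95, 86, 78, 90, 62, 73, 89]))  # 86
print(median([95, 86, 78, 90, 5, 73, 89]))   # 86 (unchanged when 62 becomes 5)
print(median([1, 2, 3, 4]))                  # 2.5 (even count)
```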

The mode is another measure <strong>of</strong> centre. It is the<br />

commonest value in the data. This only makes<br />

sense for discrete data. For continuous grouped<br />

data it coincides with the peak in the histogram.<br />

The histogram is bimodal if there is more than<br />

one peak.<br />

Further Notes<br />

(1) The mean (89.54) <strong>and</strong> median (89.30) for the<br />

blood pressure readings are close because the<br />

data are almost “symmetrical.”<br />

(2) For “non-symmetrical” data mean <strong>and</strong><br />

median are different since the mean is pulled<br />

in the direction <strong>of</strong> the extreme values. The<br />

data are said to be skew.<br />



0<br />

Median<br />

Mean<br />

The mean may be unsuitable as a measure <strong>of</strong><br />

centre while the median is more “typical” <strong>of</strong><br />

most values.<br />

(3) For measurements which cannot be negative<br />

it is quite common to have many values close<br />

to zero thus presenting a skew distribution.<br />

This is called positive skewness. (The<br />

histogram above represents positively skewed<br />

data.)<br />

(4)The opposite phenomenon with an extended<br />

left h<strong>and</strong> tail is called negative skewness <strong>and</strong><br />

is rare.<br />

(5) A trimmed mean is the mean with the lower<br />

5% <strong>and</strong> upper 5% <strong>of</strong> values removed.<br />
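A minimal Python sketch of the 5% trimmed mean (20 invented values, so that 5% is exactly one value at each end); a wild outlier drags the ordinary mean a long way but barely touches the trimmed mean:<br />

```python
def trimmed_mean(values, prop=0.05):
    """Mean after dropping the lowest and highest prop of the sorted values."""
    s = sorted(values)
    k = int(len(s) * prop)  # number of values to drop at each end
    kept = s[k:len(s) - k] if k else s
    return sum(kept) / len(kept)

data = list(range(1, 20)) + [1000]  # 1..19 plus one outlier
print(sum(data) / len(data))        # 59.5 (ordinary mean, dragged upward)
print(trimmed_mean(data))           # 10.5 (drops 1 and 1000)
```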



Measures <strong>of</strong> Variability<br />

“Looking at the world using data is like looking<br />

through a window with ripples in the glass”.<br />

(Pr<strong>of</strong>essor Chris Wild, Auckl<strong>and</strong> <strong>University</strong>)<br />

<strong>Statistics</strong> is about variability. Variability reflects<br />

differences in the values collected for different<br />

units being measured, for example people, or<br />

animals or plants or companies or readings on<br />

different days etc. Two sets <strong>of</strong> values can have<br />

the same mean <strong>and</strong> median yet show quite<br />

different patterns.<br />

Variability can be r<strong>and</strong>om or caused by different<br />

treatments or “factors” acting on the experiment<br />

units in a study in different ways. The hope is<br />

that the r<strong>and</strong>om variation will be relatively small<br />

or controlled by choice <strong>of</strong> appropriate study<br />

designs. This will result in the identification <strong>of</strong><br />

important treatment effects explaining key<br />

aspects <strong>of</strong> the variation.<br />



If data are highly variable there are problems<br />

analysing the data <strong>and</strong> it will be necessary to<br />

select larger samples.<br />

The first measure <strong>of</strong> variation is the range (the<br />

distance between the lowest <strong>and</strong> highest values).<br />

It is sensitive to any extreme values <strong>and</strong> hence<br />

not useful. But reduced ranges (encompassing<br />

the central 95% say <strong>of</strong> the data) are useful as<br />

extreme values (outliers) are excluded.<br />

Note: In clinical chemistry (e.g. cholesterol<br />

measures) a reference range encompassing the<br />

central 95% <strong>of</strong> values describes variability in<br />

normal people <strong>and</strong> allows test results for other<br />

individuals to be assessed to see if corrective<br />

action is needed.<br />

A second measure is the (sample) variance defined<br />

by<br />

s² = (1/(n − 1)) ∑_{i=1}^{n} (x_i − x̄)²<br />

Although the divisor is (n – 1) in this equation, we<br />

can see that s² is effectively the “average” <strong>of</strong> the<br />

squared deviations <strong>of</strong> the individual data values<br />


(x_i) from their mean x̄. For technical reasons we do<br />

not divide by n.<br />

Notes: 1. The variance is an overall measure <strong>of</strong><br />

the extent to which values x_i differ from their<br />

mean x̄.<br />

2. Squaring is essential. If the deviations from x̄<br />

are added, the value 0 is always obtained.<br />

A third convenient measure is the st<strong>and</strong>ard<br />

deviation (s) given by<br />

s = √variance = √[(1/(n − 1)) ∑_{i=1}^{n} (x_i − x̄)²]<br />

Note: The st<strong>and</strong>ard deviation s is measured in<br />

the same units as the original data (taking the<br />

square root cancels the squaring).<br />



Example: Find the st<strong>and</strong>ard deviation <strong>of</strong> 11, 18, 14,<br />

15, 12<br />

x_i    x_i − x̄    (x_i − x̄)²<br />

11 11 –14 = – 3 9<br />

18 18 – 14 = 4 16<br />

14 14 – 14 = 0 0<br />

15 15 – 14 = 1 1<br />

12 12 – 14 = – 2 4<br />

70 0 30<br />

x = 70/5 = 14 s = 30 / 4 = 2. 74<br />

Note that 2.74 is a “typical” or “average”<br />

deviation from the mean x = 14.<br />
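The hand calculation above can be checked with a short Python sketch (standard library only; `statistics.stdev` divides by n − 1, matching the definition in these notes):

```python
import statistics

data = [11, 18, 14, 15, 12]

mean = statistics.mean(data)                 # 70/5 = 14
ss = sum((x - mean) ** 2 for x in data)      # sum of squared deviations = 30
s = (ss / (len(data) - 1)) ** 0.5            # sqrt(30/4)

print(round(s, 2))                           # 2.74
assert abs(s - statistics.stdev(data)) < 1e-12
```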

Example: Return to the 56 blood pressure readings<br />
Pressure interval    fⱼ<br />
59.5 – 69.5      2<br />
69.5 – 79.5      7<br />
79.5 – 84.5      9<br />
84.5 – 89.5      10<br />
89.5 – 94.5      11<br />
94.5 – 99.5      7<br />
99.5 – 109.5     8<br />
109.5 – 119.5    2<br />
Total            56<br />
The standard deviation is s = 11.21. This value is<br />
“typical” of deviations from x̄ = 89.54.<br />



The Interquartile Range is another measure of variability.<br />
[Diagram: the ordered data split into four 25% blocks by Q_L, the median and Q_U; the interquartile range spans Q_L to Q_U, while the range spans all the values.]<br />
The lower quartile Q_L is the value below which a<br />
quarter of the data lie. The upper quartile Q_U has<br />
3/4 of the data below it. (These are also known as<br />
the 25th and 75th percentiles.)<br />

Notes: 1. Interquartile range can be a helpful<br />

measure <strong>of</strong> variability. It is not affected by<br />

extreme values.<br />

2. Computer packages also give Q L <strong>and</strong> Q U for<br />

large data sets <strong>and</strong> the approximations for<br />

grouped data are no longer needed.<br />

Example: For the 56 blood pressure readings,<br />
Q_L = 82.2 and Q_U = 96.6, with Q_U − Q_L = 14.4<br />
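For small data sets the quartiles can be computed directly. A sketch using the standard library, run here on the five-value sample from the earlier standard-deviation example (the raw blood-pressure readings are not listed in these notes):

```python
import statistics

data = [11, 18, 14, 15, 12]

# method="inclusive" interpolates between the closest ranks of the
# sorted data, treating the sample min and max as the extremes.
q1, med, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

print(q1, med, q3, iqr)  # 12.0 14.0 15.0 3.0
```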



Box-<strong>and</strong>-whisker plot<br />

This is a second way <strong>of</strong> summarising data<br />

graphically. Like relative frequencies it is useful<br />

when comparing samples <strong>of</strong> unequal size.<br />

Example: Blood pressures<br />
Q_L = 82.2; Q_U = 96.6; Median = 89.3<br />
Suppose 63 and 116 are the lowest and highest values.<br />
[Box-and-whisker plot of the blood pressures on a scale from 60 to 120.]<br />
The centre of the data, its variation, its symmetry<br />
(or lack of symmetry) and extreme values are<br />
displayed.<br />
Notes: (1) Two samples can be compared.<br />
[Two box-and-whisker plots drawn on a common scale.]<br />
Both samples are skew; the second is more variable<br />
(larger interquartile range) with a larger median.<br />



(2) The points at the ends of the whiskers depend<br />
on the package and are<br />
• the extreme values, or<br />
• the 2½% and 97½% values (centiles), or<br />
• points 1½ times the interquartile range<br />
away from the boxes.<br />
Outliers beyond these points are shown in R-cmdr<br />
by an asterisk or a small circle (as below), where<br />
there are obvious changes in the ozone readings<br />
recorded over summer in a New Zealand city. An<br />
asterisk will represent an extreme outlier.<br />
[Boxplots of the ozone readings for months 11, 12, 1, 2 and 3.]<br />
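The third convention can be sketched numerically. Using the blood-pressure quartiles quoted earlier (Q_L = 82.2, Q_U = 96.6), the whisker-end “fences” sit 1½ × IQR beyond the quartiles, and values outside them are flagged as outliers:

```python
ql, qu = 82.2, 96.6          # lower and upper quartiles
iqr = qu - ql                # interquartile range = 14.4

lower_fence = ql - 1.5 * iqr
upper_fence = qu + 1.5 * iqr

print(round(lower_fence, 1), round(upper_fence, 1))  # 60.6 118.2

# The quoted extremes (63 and 116) lie inside the fences,
# so that sample shows no outliers by the 1.5*IQR rule.
for x in (63, 116):
    assert lower_fence <= x <= upper_fence
```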



Example: Thirty-two traps were placed in each<br />

<strong>of</strong> three habitats: pasture, replanted forest <strong>and</strong><br />

tussock on Stephens Isl<strong>and</strong>. The data are the<br />

counts <strong>of</strong> skinks per trap totalled over a ten-day<br />

period in each habitat. The boxplots are below.<br />

Summarize conclusions about skink density.<br />

Pasture 4 3 0 2 2 1 4 1 2 5 0 1 5 6 5 6<br />

11 3 1 1 4 8 5 14 6 8 10 7 4 8 13 6<br />

Replant 15 24 31 8 4 18 14 33 11 16 20 1 17 12 27 26<br />

forest 18 6 12 16 11 8 13 12 11 8 10 17 29 3 12 5<br />

Tussock 14 23 15 14 5 16 10 16 14 10 7 10 8 12 19 17<br />

7 12 29 10 11 11 10 10 6 13 7 10 8 12 6 12<br />

Greater skink density in replanted forest <strong>and</strong><br />

tussock. Greater variation in replanted forests.<br />

Some outliers in the three habitats:<br />

Means: 4.88; 14.63; 12.00<br />

Medians: 4.50; 12.50; 11.00<br />

Std Deviations: 3.64; 8.18; 5.07<br />
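The pasture figures above can be reproduced directly (a sketch; the same pattern applies to the other two habitats):

```python
import statistics

pasture = [4, 3, 0, 2, 2, 1, 4, 1, 2, 5, 0, 1, 5, 6, 5, 6,
           11, 3, 1, 1, 4, 8, 5, 14, 6, 8, 10, 7, 4, 8, 13, 6]

print(round(statistics.mean(pasture), 2))   # 4.88
print(statistics.median(pasture))           # 4.5
```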



Example: Thirty-four adult hoki were caught off the<br />
Kapiti coast. Individual lengths were as follows:<br />

Males: 18.7 19.0 18.8 18.4 19.3 19.6 20.3 19.9 19.3 18.9<br />

18.9 19.0 19.7 20.4 18.6 19.5 20.3 19.9 19.2 18.7<br />

Females: 18.6 19.6 18.3 17.5 18.3 19.0 18.5 18.7 19.3 18.5<br />

19.1 18.7 19.1 18.8<br />

Boxplots indicate the male hoki are longer than the<br />
female hoki. There is slightly greater variation in the<br />
males, but no outliers. The distributions are almost<br />
symmetric.<br />

Mean: 19.32; 18.71<br />

Median: 19.25; 18.70<br />

Std Deviation: 0.61; 0.51<br />



Interpreting Box whisker plots (Ref: Pr<strong>of</strong>essor<br />

Chris Wild, Auckl<strong>and</strong> <strong>University</strong>)<br />

Observed data:<br />
[Two panels, each showing boxplots of samples A and B, with B clearly shifted to the right.]<br />
The call in each case is: B values bigger.<br />

The above two hold for all sample sizes. Larger<br />

r<strong>and</strong>om samples have more information about the<br />

populations from which they come. With large<br />

r<strong>and</strong>om samples we can make the “B values<br />

bigger” call from smaller shifts. Avoid using the<br />

box whisker plots for samples smaller than about<br />

20.<br />



Observed data:<br />
[Five further panels of boxplots comparing samples A and B, with progressively smaller shifts between them.]<br />
The calls, in order, are:<br />
• B values bigger, if both sample sizes > 20<br />
• What is my call?<br />
• What is my call?<br />
• Cannot tell, unless both samples are huge<br />
• Cannot tell, for all sample sizes<br />



How to make the call.<br />
This is based on a confidence interval idea (see<br />
later), but the result is easy to calculate. In the<br />
following, IQR is the interquartile range and n is a<br />
sample size. For each sample form the interval<br />
Med − 1.5 × IQR/√n   to   Med + 1.5 × IQR/√n<br />
We can claim the values of B tend to be bigger than<br />
the values of A back in the populations from which<br />
the samples have been taken if these horizontal<br />
lines (intervals) do not overlap.<br />
[Boxplots of A and B with the two intervals marked around each median.]<br />
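A sketch of the overlap check in Python, using the blood-pressure sample from earlier (median 89.3, IQR 14.4, n = 56) as one of the two samples; the helper function names are made up for illustration:

```python
import math

def call_interval(med, iqr, n):
    """Interval med +/- 1.5*IQR/sqrt(n) used for making the call."""
    half = 1.5 * iqr / math.sqrt(n)
    return med - half, med + half

# Blood-pressure sample from earlier in the notes.
lo_a, hi_a = call_interval(89.3, 14.4, 56)
print(round(lo_a, 2), round(hi_a, 2))  # 86.41 92.19

def b_values_bigger(interval_a, interval_b):
    # Call "B values bigger" only when B's interval lies wholly above A's.
    return interval_b[0] > interval_a[1]
```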



SECTION 2<br />

This covers the measures <strong>of</strong> disease frequency <strong>and</strong> disease association with several examples looking<br />

at prevalence, incidence, relative risks, attributable risk <strong>and</strong> odds ratios.<br />

Prevalence <strong>and</strong> Incidence<br />

Cumulative Incidence<br />

Incidence Rate<br />

Disease Association<br />

Relative Risk<br />

Attributable Risk<br />

Odds Ratio<br />



[C] Measures <strong>of</strong> Disease Frequency<br />

All measures <strong>of</strong> disease frequency are ratios <strong>of</strong> the<br />

form numerator/denominator.<br />

There are two types <strong>of</strong> ratio:<br />

1. Proportion: everyone in numerator must be<br />

included in the denominator.<br />

2. Rate: a measure <strong>of</strong> time is included in the<br />

denominator.<br />

The measures <strong>of</strong> disease frequency are:<br />

1. Prevalence<br />

• gives frequency <strong>of</strong> existing cases <strong>of</strong> disease<br />

• is useful for measuring the disease burden in a<br />

community<br />

• <strong>of</strong>ten measured in a cross-sectional survey<br />

e.g. the proportion of Otago students at 3pm on<br />
Tuesday who have swine flu.<br />



2. Incidence:<br />

• measures frequency <strong>of</strong> new cases <strong>of</strong> disease<br />

• is useful for looking at causes <strong>of</strong> disease<br />

e.g. number <strong>of</strong> new cases <strong>of</strong> cold that develop in a<br />

week.<br />

Example: Frequency of hepatitis in two regions.<br />
Location    New cases of hepatitis    Reporting period    Population<br />
Region A    58                        1985                25,000<br />
Region B    35                        1984–1985           7,000<br />
Region A:<br />
58/25,000/year<br />
= 232 per 100,000 per year<br />
= 23.2 per 10,000 per year<br />
= 2.32 per 1,000 per year<br />
Region B:<br />
35/7,000/2 years = 17.5/7,000/year<br />
= 250 per 100,000 per year<br />
= 2.50 per 1,000 per year<br />

Note: The time period must be specified for the<br />

results <strong>and</strong> comparisons to be meaningful.<br />
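The rate arithmetic can be sketched as:

```python
def incidence_per_100_000(new_cases, population, years):
    # cases / population / time, scaled to a standard denominator
    return new_cases / population / years * 100_000

rate_a = incidence_per_100_000(58, 25_000, 1)   # Region A, 1985 only
rate_b = incidence_per_100_000(35, 7_000, 2)    # Region B, 1984-1985

print(round(rate_a), round(rate_b))  # 232 250
```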



Example: In a survey <strong>of</strong> eye disease among 2477<br />

people aged 52-85 in Framingham, Massachusetts,<br />

there were 310 with cataracts <strong>and</strong> 22 blind.<br />

Prevalence of cataracts<br />
= 310/2477 = 0.125 = 125 per 1,000 (or 12.5%)<br />
Prevalence of blindness<br />
= 22/2477 = 0.009 = 9 per 1,000 (or 0.9%)<br />



Example: In the following diagram the time a<br />
person has the disease is shaded.<br />
[Diagram: timelines for subjects 1–5 with disease periods shaded; at four successive time points the prevalence is 1/5, 2/5, 3/5 and 2/5.]<br />

Note on Prevalence:<br />

Prevalence is the proportion <strong>of</strong> people in a<br />

population who have the disease at a given point in<br />

time. The time point may refer to calendar time, or<br />

to a fixed point in the course <strong>of</strong> events.<br />

e.g. the proportion <strong>of</strong> people free from back pain 2<br />

months after back injury.<br />

Note on Incidence<br />

Incidence on the other h<strong>and</strong> quantifies the number<br />

<strong>of</strong> new cases <strong>of</strong> disease in a given time period.<br />

There are two measures:<br />

• cumulative incidence<br />

• incidence rate<br />



2.1 Cumulative incidence is the proportion of<br />
people who become diseased during a specified<br />
period of time:<br />
= (number of new cases of disease) / (total population at risk)<br />

This provides an estimate <strong>of</strong> the probability, or risk,<br />

that an individual will develop the disease during<br />

the specified period <strong>of</strong> time.<br />

Example: In a study in Evans County, Georgia,<br />

there were 609 men aged 40 – 76 who had no<br />

detected heart disease in 1960. These men were<br />

followed for 7 years <strong>and</strong> 71 cases <strong>of</strong> heart disease<br />

were detected during this period.<br />

Cumulative incidence = 71/609<br />

= 0.117 (or 11.7%)<br />

over the 7 year period<br />

Notes (1) The time period over which cumulative<br />

incidence is calculated must be specified for it to be<br />

interpretable.<br />

(2) Cumulative incidence assumes the entire<br />

population at risk at the beginning <strong>of</strong> the study<br />

period has been followed for the whole study<br />

period. But <strong>of</strong>ten -<br />



• people are lost to follow-up<br />

• people are enrolled in the study at different<br />

times<br />

The length <strong>of</strong> the follow-up period is not therefore<br />

the same for everyone in the study. It is the<br />

incidence rate that takes account <strong>of</strong> varying amounts<br />

<strong>of</strong> follow-up time.<br />

2.2 Incidence rate:<br />

=<br />

the number <strong>of</strong> new cases <strong>of</strong> disease<br />

total person-time at risk<br />

The same amount of person-time results if we follow:<br />
16 people for one year, or<br />
4 people for four years.<br />
All give 16 person-years of observation.<br />

Example: Calculation of person-years for an incidence rate<br />
[Timeline chart, Jan 1997 – Jan 2002; each subject's follow-up starts at •:]<br />
Subject    Follow-up outcome          Total time at risk (years)<br />
A          lost to follow-up          2.0<br />
B          developed disease (×)      3.0<br />
C          no disease observed        5.0<br />
D          no disease observed        4.0<br />
E          developed disease (×)      2.5<br />
Total years at risk                   16.5<br />



• = Initiation <strong>of</strong> follow-up<br />

× = Development <strong>of</strong> disease<br />

Number <strong>of</strong> new cases = 2<br />

Number <strong>of</strong> person-years at risk = 16.5<br />

Incidence rate = 2/16.5 = 0.121<br />

That is, 12.1 cases per 100 person years <strong>of</strong><br />

observation<br />
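The same calculation in a short sketch:

```python
# Person-time contributed by subjects A-E in the table above.
time_at_risk = [2.0, 3.0, 5.0, 4.0, 2.5]
new_cases = 2  # subjects B and E developed the disease

total_py = sum(time_at_risk)           # 16.5 person-years
rate = new_cases / total_py

print(total_py, round(rate * 100, 1))  # 16.5 12.1
```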

Example:<br />
A study in the United States measured the incidence<br />
rate of stroke in a group of 118,539 women aged<br />
30–55 years. The women were free from stroke in<br />
1986, and were followed for 8 years.<br />
Smoking category    No. of cases of stroke    Person-years of observation (over 8 years)    Stroke incidence rate (per 100,000 person-years)<br />
Never smoked        70     395,594    17.7<br />
Ex-smoker           65     232,712    27.9<br />
Smoker              139    280,141    49.6<br />
Total               274    908,447    30.2<br />



Incidence rate = (274/908,447) × 100,000 = 30.2 cases of<br />
stroke per 100,000 person-years of observation.<br />
Average follow-up per woman<br />
= 908,447/118,539 = 7.7 years<br />

Note: The denominator for measures <strong>of</strong> incidence<br />

should include only those who are at risk <strong>of</strong><br />

developing the disease. It should exclude<br />

• those who already have the disease<br />

• those who cannot develop the disease<br />

Failure to do this will lead to an underestimate <strong>of</strong><br />

the true incidence since fewer will develop the<br />

condition.<br />

For example when studying the incidence <strong>of</strong><br />

endometrial cancer we should exclude women<br />

who have had a hysterectomy.<br />



Example: In (a)–(c) calculate a relevant measure<br />

<strong>of</strong> disease frequency <strong>and</strong> give its name.<br />

(a) You survey 346 travellers returning from overseas<br />

travel <strong>and</strong> find that 95 <strong>of</strong> them experienced a<br />

diarrhoeal illness on their trip.<br />

(1 mark)<br />

(b) A tour <strong>of</strong> 143 people is travelling through Central<br />

America for 2 weeks. During this trip 28 <strong>of</strong> the<br />

people experience a diarrhoeal illness. (1 mark)<br />

(c) A group <strong>of</strong> 18 Peace Corps volunteers in Guatemala<br />

kept daily records <strong>of</strong> their exposure to various risk<br />

factors (such as untreated water) <strong>and</strong> whether or not<br />

they had diarrhoea. The following values are the<br />

numbers <strong>of</strong> new episodes <strong>of</strong> diarrhoea with the<br />

number <strong>of</strong> weeks <strong>of</strong> records (in brackets) for each<br />

<strong>of</strong> the 18 individuals:<br />

12(88) 12(46) 19(77) 7(102) 8(73) 15(110) 7(101) 9(94) 2(62)<br />

8(25) 1(90) 1(17) 15(28) 9(30) 5(101) 7(21) 14(109) 17(93)<br />

NOTE: You should assume that the reported number<br />

<strong>of</strong> weeks does not include weeks in which the<br />

individual had diarrhoea when the week started (i.e.,<br />

each person was disease free at the start <strong>of</strong> each<br />

week).<br />

(1 mark)<br />



Solution<br />

(a) 95/346 = 0.275. Prevalence = 27.5 per 100;<br />
i.e. 27.5% of the surveyed travellers experienced<br />
diarrhoea during their trip.<br />

(b) 28/143 = 0.196. Cumulative incidence = 19.6<br />

cases per 100 exposed per 2 weeks.<br />

(c) In this problem you are calculating an<br />

incidence rate. You generally calculate the<br />

incidence rate as the total number <strong>of</strong><br />

episodes divided by the total exposure<br />

time:<br />

12+12+19+7+8+15+7+9+2+8+1+1+15+9+5+7+14+17<br />

88+46+77+102+73+110+101+94+62+25+90+17+28+30+101+21+109+93<br />

= 169/1269 = 0.133<br />

Thus, incidence rate = 13.3 cases per 100<br />

person-weeks <strong>of</strong> observation.<br />
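Summing the listed values directly gives a check. Note the sums of the listed data come to 168 and 1267 rather than the printed 169 and 1269 (likely a small transcription slip in one value), but the rounded rate is 13.3 either way:

```python
# (episodes, weeks of records) for the 18 volunteers as listed.
records = [(12, 88), (12, 46), (19, 77), (7, 102), (8, 73), (15, 110),
           (7, 101), (9, 94), (2, 62), (8, 25), (1, 90), (1, 17),
           (15, 28), (9, 30), (5, 101), (7, 21), (14, 109), (17, 93)]

episodes = sum(e for e, w in records)
weeks = sum(w for e, w in records)
rate = episodes / weeks

print(episodes, weeks, round(rate * 100, 1))  # 168 1267 13.3
```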



Relationship between prevalence <strong>and</strong> incidence<br />

Example: Disease A<br />
[Diagram: timelines for subjects 1–5, disease periods shaded; all five develop the disease within t years, and two still have it at time L.]<br />
Cumulative Incidence = 5/5 in t years<br />
Prevalence at time L = 2/5<br />
Disease B<br />
[Diagram: timelines for subjects 1–5; all five develop the disease within t years, and all five still have it at time L.]<br />
Cumulative Incidence = 5/5 in t years<br />
Prevalence at time L = 5/5<br />



Note: Prevalence depends on<br />

• incidence rate<br />

• duration <strong>of</strong> disease<br />

Diabetes (adult onset)<br />
• annual incidence rate is low<br />
• duration is long, as the disease is neither curable<br />
nor rapidly fatal<br />
so prevalence is high relative to incidence.<br />
Cold<br />
• incidence is high<br />
• duration is short<br />
so prevalence is low relative to incidence.<br />



HIV/AIDS<br />

Many with HIV will live for a long time.<br />

Prevalence <strong>of</strong> HIV in the community will be high.<br />

There is also an issue related to the fact that a<br />
person may not know they are HIV positive;<br />
hence we are likely to underestimate the prevalence.<br />



If diagnosed with AIDS, death comes quickly, i.e.<br />
few are living with AIDS. Hence AIDS prevalence<br />
is relatively low.<br />

There are obvious issues related to health care<br />

provision <strong>and</strong> planning.<br />



[D] Measures of disease association<br />
Comparisons of disease frequency are made between<br />
different groups of people. In the simplest (and<br />
very common) setting there are two groups, one<br />
exposed and the other unexposed.<br />

Example: Data from cohort study <strong>of</strong> oral<br />

contraceptive use (OC) <strong>and</strong> bacteria in the urine<br />

among women aged 16-49 years over 3 years.<br />

                      Bacteria present<br />
                      Yes    No      Total<br />
OC use    Yes         27     455     482<br />
          No          77     1831    1908<br />
          Total       104    2286    2390<br />

Data from D.A. Evans et al. NEJM (1978)<br />

Bacteria is the Disease Category. (Outcome<br />

measure.)<br />

OC use is the Exposure Category.<br />



Cumulative Incidence<br />

OC users: 27/482 = 0.056<br />

56 cases per 1000 in 3 years<br />

Non users: 77/1908 = 0.040<br />

40 cases per 1000 in 3 years<br />

Measures <strong>of</strong> Association:<br />

Difference (Absolute effect)<br />

56-40 = 16 cases per 1000 in 3 years<br />

Ratio (Relative effect)<br />

56/40 = 1.4<br />

The risk of bacteria among OC users is 1.4 times<br />
the risk for non-users.<br />
[Note that the ratio does not include the time interval.]<br />
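A sketch of the whole calculation from the 2×2 table:

```python
# 2x2 table: (cases, total) for exposed (OC users) and unexposed.
cases_e, total_e = 27, 482      # OC users
cases_0, total_0 = 77, 1908     # non-users

ci_e = cases_e / total_e        # cumulative incidence, exposed
ci_0 = cases_0 / total_0        # cumulative incidence, unexposed

per1000_e = round(ci_e * 1000)  # 56 cases per 1000 in 3 years
per1000_0 = round(ci_0 * 1000)  # 40 cases per 1000 in 3 years

diff = per1000_e - per1000_0    # absolute effect: 16 per 1000
ratio = ci_e / ci_0             # relative effect

print(per1000_e, per1000_0, diff, round(ratio, 1))  # 56 40 16 1.4
```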



1. Relative effect = Relative Risk (RR)<br />

• ratio of incidence in the exposed group (I_e) to<br />
incidence in the unexposed group (I_0):<br />
RR = I_e/I_0, which is > 1 (exposure → disease),<br />
= 1 if I_e = I_0, and < 1 (exposure is protective)<br />

• indicates how much more likely disease is to<br />

develop in the exposed group than in the<br />

unexposed group<br />

• no association between exposure <strong>and</strong> disease:<br />

RR = 1 (I e = I 0 )<br />

• good measure <strong>of</strong> strength <strong>of</strong> an association<br />

• the usual measure in studies <strong>of</strong> causation <strong>of</strong><br />

disease<br />

• can also calculate ratios <strong>of</strong> prevalences, but the<br />

interpretation is different<br />



2. Absolute effect = Attributable Risk (AR)<br />

• difference in incidence between exposed <strong>and</strong><br />

unexposed groups<br />

AR = I e – I 0<br />

• indicates how many more people with disease<br />

there are in the exposed than the unexposed<br />

group<br />

• no association between exposure <strong>and</strong> disease:<br />

AR = 0 (I e = I 0 )<br />

• assuming a cause-effect relationship between<br />

exposure <strong>and</strong> disease, we say:<br />

if AR>0, AR is the number <strong>of</strong> cases <strong>of</strong> the disease<br />

among the exposed that can be attributed to their<br />

exposure;<br />

if AR<0, −AR is the number of cases of the disease<br />
that the exposure has prevented among the exposed.<br />


Example: A r<strong>and</strong>omised trial <strong>of</strong> the effectiveness<br />

<strong>of</strong> infra-red stimulation compared with placebo on<br />

pain caused by cervical osteoarthritis (degenerative<br />

joint disease in the neck) carried out over two<br />

months.<br />

(Placebo or Control: mock stimulation)<br />

                           Treatment    Control<br />
Improvement in pain        18           8<br />
No improvement in pain     7            17<br />
Total                      25           25<br />

Exposure is Treatment/Control<br />

Disease is Improvement/No improvement in pain<br />

[The outcome classification]<br />

Cumulative incidence <strong>of</strong> improvement (in 2<br />

months)<br />

Treatment group: 18/25<br />
Control group: 8/25<br />
Rel. Risk = (18/25) / (8/25) = 2.3<br />
The chance of improvement in the treatment group<br />
is 2.3 times the chance in the control group.<br />



Example: Prevalence <strong>of</strong> coronary heart disease<br />

(CHD) at initial examination among 4469 persons<br />

age 30-62 years <strong>of</strong> age in the Framingham Study<br />

            Number examined    Number with CHD    Prevalence per 1,000<br />
Males       2024               48                 23.7<br />
Females     2445               28                 11.5<br />

Note that 23.7 = (48/2024) x 1,000 hence<br />

called prevalence per 1,000<br />

Similarly, 11.5 = (28/ 2445) x 1,000<br />

Relative risk = (23.7/11.5) = 2.1<br />

[Heart disease is twice as common in males as in<br />

females]<br />

Attributable risk = 23.7-11.5 = 12.2 per 1000<br />

[There are 12.2 more cases <strong>of</strong> heart disease in 1000<br />

men than in 1000 women]<br />



Example: Data from a cohort study <strong>of</strong><br />

postmenopausal hormone use <strong>and</strong> coronary heart<br />

disease among female nurses<br />

                                Coronary heart disease<br />
Postmenopausal hormone use      Yes    No    Person-years<br />
Yes                             30     –     54,308.7<br />
No                              60     –     51,477.5<br />
Data from Stamfer et al, NEJM (1985)<br />

Incidence rate:<br />

Users: 30/54308.7 = 55 per 100,000 person-years<br />

Non-users: 60/51477.5 = 117 per 100,000 person<br />

years<br />

Attributable Risk:<br />
55 − 117 = −62 cases of CHD per 100,000 person-years<br />
Hormone use prevents 62 cases per 100,000 person-years.<br />

Relative Risk: 55/117 = 0.47<br />

The risk <strong>of</strong> CHD among users is 0.47 times the risk<br />

in non-users (ie a 53% reduction in risk)<br />
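The person-time version follows the same pattern:

```python
def rate_per_100_000_py(cases, person_years):
    return cases / person_years * 100_000

users = rate_per_100_000_py(30, 54_308.7)
non_users = rate_per_100_000_py(60, 51_477.5)

ar = round(users) - round(non_users)   # attributable risk (per 100,000 PY)
rr = users / non_users                 # relative risk

print(round(users), round(non_users), ar, round(rr, 2))  # 55 117 -62 0.47
```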



Example: Relative <strong>and</strong> attributable risks <strong>of</strong><br />

mortality from lung cancer <strong>and</strong> coronary heart<br />

disease among cigarette smokers in a cohort study<br />

in British male physicians<br />

Annual mortality rate per 100,000<br />

Lung cancer Heart disease<br />

Cigarette smokers 140 669<br />

Non-smokers 10 413<br />

Relative risk 14.0 1.6<br />

Attributable risk 130 256<br />

(per 100,000 per year)<br />

Data from Doll <strong>and</strong> Peto, Br Med J (1976)<br />

RR: 140/10 = 14.0 669/413 = 1.6<br />

AR: 140 – 10 = 130 669 – 413 = 256<br />

Heart disease is more common therefore a smaller<br />

relative increase in risk produces more people with<br />

disease.<br />



Note<br />

Relative risks<br />

• provide information on the strength <strong>of</strong> an<br />

association<br />

• can be used to assist in assessment <strong>of</strong> the<br />

likelihood <strong>of</strong> a causal association<br />

Attributable risks<br />

• measure the impact <strong>of</strong> an exposure, (assuming<br />

that it is causal)<br />

If a disease is common a small relative risk will<br />

translate to a large attributable risk.<br />

[see previous example]<br />



3. Odds Ratio: A third measure <strong>of</strong> association<br />

This can be used in case-control studies, where<br />

measures <strong>of</strong> disease frequency in the study<br />

population are not available<br />

Odds of disease = [Chance (or Probability) of disease] / [Chance (or Probability) of no disease]<br />

See later<br />



SECTION 3<br />

This section covers a brief introduction to probability definitions, notation, rules <strong>and</strong> r<strong>and</strong>om<br />

variables with examples, several involving tree diagram use.<br />

Definitions including mutually exclusive <strong>and</strong> independent events<br />

The Addition Rule for combining probabilities<br />

The Multiplication Rule for probabilities<br />

Tree diagrams with examples<br />

Screening test terminology<br />

Probability Distributions <strong>and</strong> R<strong>and</strong>om Variables<br />

Rules for combining R<strong>and</strong>om Variables<br />



Introduction To Probability<br />

To define what we mean by probability we need<br />

to talk about experiments <strong>and</strong> events<br />

• An experiment is the process by which<br />

observations or measurements are obtained.<br />

• The outcome <strong>of</strong> an experiment is referred to as<br />

an event <strong>and</strong> may also represent a group <strong>of</strong><br />

possible outcomes.<br />

• The set <strong>of</strong> all possible individual outcomes is<br />

the sample space.<br />

Example: Toss a coin once. Observe event A –<br />

the coin comes up a head (H) or B – the coin<br />

comes up a tail (T). The sample space is {H, T}.<br />

An experiment results in outcomes that cannot be<br />

predicted in advance. This uncertainty about an<br />

outcome is measured by the probability <strong>of</strong> the<br />

event. Different events have different<br />

probabilities. We define the probability of an<br />
event A as Pr(A) = n_A / N, where n_A is the number of<br />
experiments resulting in event A in a very large<br />
number (N) of repetitions of the experiment.<br />



A probability is therefore like a relative<br />

frequency. It is a measure on a scale from 0<br />

representing absolute impossibility to 1<br />

representing absolute certainty. Subjective<br />
estimates of probability are “unlikely”,<br />
“possibly”, “almost never”, etc., which all convey<br />
an idea of the likelihood of occurrence of an event.<br />
But different people attach different values to<br />
these (and this is a problem). For example, what<br />
is the probability that God exists (0 or 1)?<br />

Probability calculations began with games <strong>of</strong><br />

chance over 3000 years ago. The games involve<br />

coins, dice, cards, roulette etc. With such objects<br />

we can develop exact probabilities <strong>of</strong> possible<br />

outcomes or events by making sensible<br />

assumptions:<br />

• a die (plural dice) is fair (1/6 is the probability<br />
of any outcome)<br />
• a coin is fair (1/2 is the probability of a head)<br />
• a card is drawn (1/52 is the probability)<br />
• a birth date (1/365 is the probability of a particular day)<br />

Probabilities associated with these objects can be<br />

calculated using our knowledge <strong>of</strong> the properties<br />

<strong>of</strong> these objects.<br />



Example: An experiment involves throwing a<br />
fair die. The event is “obtaining an even number”.<br />
The answer is 3/6 or 1/2 (easy). This probability<br />
could also be found by experiment, involving<br />
tossing the die many times.<br />

In practice, experiments are much more complex<br />

than this in situations <strong>of</strong> interest to researchers.<br />

Events result from such experiments <strong>and</strong> event<br />

probabilities are needed if we are to draw<br />

conclusions from the sample data collected.<br />

Further Examples<br />

1. An experiment treats 20 patients in a clinical<br />

investigation involving a new drug.<br />

An event is “at least 12 patients are cured”.<br />
What is the probability of the event?<br />

2. An experiment selects 500 voters in a survey.<br />

An event is “at least 300 support windmill<br />

farms in Central Otago”.<br />

3. Experiment treats two “equal” samples <strong>of</strong><br />

cancer patients, one by surgery <strong>and</strong> one by<br />

chemotherapy.<br />



An event is “more chemotherapy patients are<br />

cured”. The probability will give insight into<br />

the better treatment.<br />

Theoretical probabilities are unknown in such<br />

situations, hence these probabilities must be<br />

estimated from experimental data by observing<br />

outcomes or noting historical information.<br />

Combining Probabilities for Multiple Events<br />

Example: Consider the probability <strong>of</strong> being in<br />

each <strong>of</strong> the four blood groups. The probabilities<br />

from the Dunedin blood donor centre are:<br />

Blood Type Pr(Blood Type)<br />

A 0.38<br />

B 0.11<br />

AB 0.04<br />

O 0.47<br />

(These numbers can also be estimated by<br />

“experiment” <strong>and</strong> will take these values if<br />

many people are sampled)<br />

1. What is the probability that a person is either<br />
A or B?<br />



2. What is the probability that 3 unconnected (or<br />
independent) people are all in blood group O?<br />

Solution:<br />

1. For any two mutually exclusive outcomes the<br />
probability of either occurring is the sum of<br />
the individual probabilities.<br />

The probability <strong>of</strong> being either A or B is<br />

Pr(A) + Pr(B) = 0.38 + 0.11 = 0.49<br />

Note: Pr(A) + Pr(B) + Pr(AB) + Pr(O) = 1<br />

Here we have assumed that the outcomes are<br />

mutually exclusive: that is, a person cannot be<br />

in blood groups A <strong>and</strong> B.<br />

2. For any two independent outcomes, the<br />

probability that both are observed is the<br />

product <strong>of</strong> the individual probabilities. This<br />

can be extended to three people in the obvious<br />

way.<br />

Therefore, probability three people have blood<br />

group O can be shown to be (see later)<br />



Pr(O) × Pr(O) × Pr(O)<br />

= 0.47 × 0.47 × 0.47<br />

= 0.104<br />

Note: Independent events arise if the outcome <strong>of</strong><br />

one event tells us nothing about the other<br />

event. We obviously must exclude the<br />

possibility that the three people are in the<br />

same family.<br />

Note: This example illustrates the two laws for<br />

combining probabilities:<br />

• the addition rule in part 1.<br />

• the multiplication rule in part 2.<br />
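Both rules can be checked numerically from the blood-group table (a sketch):

```python
p = {"A": 0.38, "B": 0.11, "AB": 0.04, "O": 0.47}

# Addition rule for mutually exclusive outcomes.
p_a_or_b = p["A"] + p["B"]
print(round(p_a_or_b, 2))        # 0.49

# The four groups exhaust all possibilities.
assert round(sum(p.values()), 10) == 1.0

# Multiplication rule for independent outcomes.
p_three_o = p["O"] ** 3
print(round(p_three_o, 3))       # 0.104
```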



Properties <strong>of</strong> Probabilities <strong>and</strong> Probability<br />

Laws.<br />

Notation: There is a convenient notation for<br />
representing event probabilities. Suppose S<br />
represents all possible outcomes of an experiment,<br />
A is the collection of these outcomes representing<br />
an event, and A′ is the collection of outcomes<br />
which are not in A.<br />
• A′ is the event called the complement of A<br />
• A and A′ are said to be mutually exclusive (no<br />
overlap)<br />
• Also Pr(A) + Pr(A′) = 1, since A and A′ must<br />
represent every possible outcome.<br />

Now suppose two events A <strong>and</strong> B may overlap.<br />

• Event A or B denoted by A∪ B occurs if at<br />

least one <strong>of</strong> A or B occurs. Called the union <strong>of</strong><br />

A <strong>and</strong> B.<br />



• Event A <strong>and</strong> B denoted by A∩ B occurs if both<br />

A <strong>and</strong> B occur. Called the intersection <strong>of</strong> A <strong>and</strong><br />

B.<br />

Example: A fair die is thrown. A is the event “a<br />

number greater than 3 is thrown” <strong>and</strong> B is the event<br />

“an even number is thrown”.<br />

Then S = {1, 2, 3, 4, 5, 6}<br />
A = {4, 5, 6} Pr(A) = 3/6<br />
B = {2, 4, 6} Pr(B) = 3/6<br />
A ∩ B = {4, 6} and A ∪ B = {2, 4, 5, 6}<br />
Pr(A ∩ B) = 2/6 and Pr(A ∪ B) = 4/6<br />

[Venn diagrams: Fig (i) shows the set of all outcomes with overlapping events A and B and their intersection A ∩ B; Fig (ii) shows A and B with A ∩ B empty (mutual exclusiveness).]<br />

The addition rule for combining probabilities<br />
Pr(A or B) = Pr(A ∪ B) = Pr(A) + Pr(B) – Pr(A ∩ B)<br />



since values in the intersection A∩ B are counted<br />

twice. The special case when A <strong>and</strong> B are<br />

mutually exclusive is<br />

Pr( A∪ B) = Pr(A) + Pr(B)<br />

This was illustrated in the blood group example, part (1).<br />

Example: The dice again:<br />
Pr(A ∪ B) = 3/6 + 3/6 – 2/6 = 4/6<br />
using the addition rule.<br />
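Because the die outcomes can be listed, the addition rule is easy to verify by enumeration. A small Python sketch (an illustration, not part of the original notes):

```python
# Enumerate the six equally likely outcomes of a fair die.
S = {1, 2, 3, 4, 5, 6}
A = {s for s in S if s > 3}        # "a number greater than 3"
B = {s for s in S if s % 2 == 0}   # "an even number"

def pr(event):
    return len(event) / len(S)

direct  = pr(A | B)                  # Pr(A ∪ B) counted directly
by_rule = pr(A) + pr(B) - pr(A & B)  # the addition rule
print(direct, by_rule)
```

Both calculations give 4/6, as above.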

The Multiplication Rule<br />

The intersection <strong>of</strong> two events A <strong>and</strong> B is the<br />

event that both occur. The probability <strong>of</strong> this is<br />

Pr(A <strong>and</strong> B) = Pr(A ∩ B) = Pr(A) Pr(B|A)<br />

In words this says that for both <strong>of</strong> the two events<br />

to occur, first one must occur [Pr(A)] <strong>and</strong> then<br />

given that the first has occurred, the second must<br />

occur [Pr(B|A)].<br />

If both Pr(A) <strong>and</strong> Pr(A <strong>and</strong> B) are given, this rule<br />

can be used to define conditional probability as<br />

Pr(B|A) = Pr(A ∩ B) / Pr(A)<br />



Independence<br />

The idea behind the term Pr(B|A) is that the<br />

occurrence <strong>of</strong> event A may cause a reassignment<br />

<strong>of</strong> probability to event B that makes it differ from<br />

the original value Pr(B). When the occurrence <strong>of</strong><br />

A gives no additional information about B, A <strong>and</strong><br />

B are independent.<br />

That is Pr(B|A) = Pr(B)<br />

In this situation the multiplication rule is<br />
Pr(A ∩ B) = Pr(A)Pr(B)<br />
Otherwise it is the original<br />
Pr(A ∩ B) = Pr(A)Pr(B|A)<br />
The first rule was illustrated in the blood group<br />
example, where the probability of 3 independent<br />
people all having blood group O was<br />
Pr(A ∩ B ∩ C) = Pr(A)Pr(B)Pr(C)<br />
= 0.47 × 0.47 × 0.47 = 0.104<br />



Example: A survey <strong>of</strong> hospital patients shows<br />

that the probability a patient has high blood<br />

pressure given he/she is diabetic is 0.85. If 10%<br />

<strong>of</strong> patients are diabetic <strong>and</strong> 25% have high blood<br />

pressure:<br />

(a) Find prob. a patient has both diabetes <strong>and</strong><br />

high blood pressure.<br />

(b) Are the conditions <strong>of</strong> diabetes <strong>and</strong> high<br />

blood pressure independent?<br />

Solution (a) A is event “patient has high blood<br />

pressure”<br />

B is event “patient is diabetic”<br />

Pr(A|B) = 0.85, Pr(B) = 0.10 and Pr(A) = 0.25<br />
∴ Pr(A ∩ B) = Pr(A|B)Pr(B) by the multiplication rule<br />
= 0.85 × 0.10<br />
= 0.085<br />
(b) Pr(A) = 0.25 ≠ Pr(A|B) = 0.85. Hence not independent.<br />



A tree diagram is useful for helping calculate the<br />

probability <strong>of</strong> a combined event. The stages <strong>of</strong><br />

the combined event can be dependent or<br />

independent.<br />

Example: Independent Stages.<br />

Stephens Isl<strong>and</strong> is an uninhabited isl<strong>and</strong> in Cook<br />

Strait where tuatara are being re-established. For<br />

some years three locations have been visited on<br />

the isl<strong>and</strong> <strong>and</strong> tuatara have been found at a<br />

location with probability 0.4. At any visit X<br />

represents the number <strong>of</strong> locations out <strong>of</strong> three at<br />

which tuatara are observed. X can take values 0,<br />

1, 2 or 3. Find the probabilities that 0, 1, 2, or 3<br />

locations have tuatara on a visit.<br />

T is the event “location has tuatara’’ <strong>and</strong> N is the<br />

complementary event “location has no tuatara”.<br />




[Tree diagram over Locations 1, 2 and 3: at each location the branches are T (Pr = 0.40) and N (Pr = 0.60).]<br />
Outcome Pr(Outcome) No. of T<br />
TTT 0.064 3<br />
TTN 0.096 2<br />
TNT 0.096 2<br />
TNN 0.144 1<br />
NTT 0.096 2<br />
NTN 0.144 1<br />
NNT 0.144 1<br />
NNN 0.216 0<br />

Then Pr(T) = 0.40 (known historically)<br />

The second location is independent <strong>of</strong> the first<br />

Pr(both T) = Pr(T ∩ T) = Pr(T)Pr(T)<br />

= (0.40)(0.40) = 0.160<br />

using the multiplication rule <strong>and</strong><br />

Pr(TTT) = (0.4) (0.4) (0.4) = 0.064<br />



The tree diagram shows all possible outcomes.<br />

Branch probabilities are multiplied to give the<br />

probabilities <strong>of</strong> the 8 possible outcomes.<br />

The addition rule tells us that the probability <strong>of</strong><br />

seeing tuatara at two <strong>of</strong> the three sites, Pr(X = 2),<br />

adds the probabilities <strong>of</strong> the three possible<br />

outcomes, TTN, TNT <strong>and</strong> NTT.<br />

That is, Pr(X = 2) = 0.096 + 0.096 + 0.096<br />

= 0.288<br />

Similarly, Pr(X = 0) = 0.216, Pr(X = 1) = 0.432<br />

<strong>and</strong> Pr(X = 3) = 0.064.<br />
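The whole tree can be generated in a few lines. This Python sketch (illustrative, not part of the original notes) multiplies along each branch and adds across branches:

```python
from itertools import product

# Each location independently shows tuatara (T, Pr = 0.4) or not (N, Pr = 0.6).
p = {"T": 0.4, "N": 0.6}
dist = {k: 0.0 for k in range(4)}          # Pr(X = k)
for outcome in product("TN", repeat=3):    # the 8 branches of the tree
    prob = 1.0
    for branch in outcome:
        prob *= p[branch]                  # multiplication rule along a branch
    dist[outcome.count("T")] += prob       # addition rule across branches

print({k: round(v, 3) for k, v in dist.items()})
# {0: 0.216, 1: 0.432, 2: 0.288, 3: 0.064}
```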

In the next examples, the probability at each<br />
branch of the tree is conditional on earlier<br />
outcomes; that is, the events are no longer<br />
independent, but branch probabilities are still<br />
multiplied according to the multiplication law for<br />
probabilities.<br />



Example: Dependent stages. Andrew, John,<br />

<strong>and</strong> Mark play a game. There are six similar cars,<br />

two <strong>of</strong> which have had the brake cylinders<br />

removed. The player chooses a car at r<strong>and</strong>om,<br />

drives at high speed towards a cliff, and tries to brake in<br />
time to stop. The boys decide to proceed in<br />

alphabetical order. Find Pr(each will lose) <strong>and</strong><br />

Pr(no loser), assuming that the game stops when<br />

the first boy drives over the cliff.<br />

[Tree diagram: Andrew picks a faulty car with Pr = 2/6 (Andrew loses) or a good car with Pr = 4/6; John then picks a faulty car with Pr = 2/5 (John loses) or a good car with Pr = 3/5; Mark then picks a faulty car with Pr = 2/4 (Mark loses) or a good car with Pr = 2/4 (no loser).]<br />

Pr(Andrew loses) = Pr(Andrew picks a faulty car) = 2/6<br />
Pr(John loses) = Pr(Andrew picks a good car and John<br />
picks a faulty car)<br />
= (4/6)(2/5) = 4/15<br />
Pr(Mark loses) = Pr(Andrew and John pick good cars,<br />
and Mark picks a faulty car)<br />
= (4/6)(3/5)(2/4) = 3/15<br />



In probability notation we get:<br />
A is the event Andrew loses<br />
Ā is the event Andrew does not lose<br />
Pr(A) = 2/6 and Pr(Ā) = 4/6<br />
J is the event John loses<br />
J̄ is the event John does not lose.<br />
It is not Pr(J) = 2/6.<br />
Instead, Pr(J) is revised using the extra information:<br />
Pr(J) = Pr(J|Ā)Pr(Ā)<br />
= (2/5)(4/6)<br />
= 4/15<br />
and so on.<br />
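Working with exact fractions avoids rounding in this kind of sequential calculation. A Python sketch (illustrative, not from the original notes):

```python
from fractions import Fraction as F

# 6 cars, 2 faulty; cars are chosen without replacement, so each
# stage is conditional on the earlier ones (multiplication law).
pr_andrew   = F(2, 6)                      # Andrew picks a faulty car
pr_john     = F(4, 6) * F(2, 5)            # Andrew good, John faulty
pr_mark     = F(4, 6) * F(3, 5) * F(2, 4)  # Andrew, John good, Mark faulty
pr_no_loser = F(4, 6) * F(3, 5) * F(2, 4)  # everyone picks a good car

print(pr_andrew, pr_john, pr_mark, pr_no_loser)  # 1/3 4/15 1/5 1/5
```

Note 3/15 = 1/5, agreeing with the calculation above, and the four probabilities sum to 1.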



Example: Screening Programmes<br />

A patient with certain symptoms consulted her<br />

doctor to be checked for a cancer. The patient<br />

undergoes a biopsy. With this test there is a<br />

probability <strong>of</strong> 0.90 that a woman with the cancer<br />

shows a positive biopsy, <strong>and</strong> a probability <strong>of</strong> only<br />

0.001 that a healthy woman incorrectly shows a<br />

positive biopsy.<br />

Historical information also suggests that 1 in<br />

10,000 women have the cancer. [This is the<br />

prevalence <strong>of</strong> the cancer in the population.]<br />

Find the probability that a woman has the cancer<br />

given the biopsy says she does.<br />

(Essentially the problem is to decide the ability <strong>of</strong><br />

the biopsy to diagnose true patient status. The<br />

principle applies to breast <strong>and</strong> cervical cancer in<br />

New Zeal<strong>and</strong>.)<br />

Solution: A is event “woman has the cancer”<br />

B is event “biopsy is positive” (indicating cancer)<br />



Pr(A) = 0.0001 (disease prevalence)<br />
Pr(B|A) = 0.90 (a conditional prob.)<br />
Pr(B|Ā) = 0.001 (Ā is the complement of A)<br />
The problem is to find Pr(A|B)<br />

[Tree diagram:<br />
Pr(A) = 0.0001; branches Pr(B|A) = 0.90, biopsy +ve (true positive), and Pr(B̄|A) = 0.10, biopsy –ve (false negative).<br />
Pr(Ā) = 0.9999 (the complement); branches Pr(B|Ā) = 0.001, biopsy +ve (false positive), and Pr(B̄|Ā) = 0.999, biopsy –ve (true negative).]<br />

By the multiplication rule for dependent events,<br />
Pr(True positive) = Pr(A ∩ B)<br />
= Pr(B|A)Pr(A)<br />
= 0.90 × 0.0001<br />
= 0.00009 (nine out of 100 000 show a true positive)<br />
Pr(False negative) = Pr(B̄|A)Pr(A)<br />
= 0.10 × 0.0001<br />
= 0.00001<br />




Pr(False positive) = 0.001 × 0.9999<br />
= 0.00100 (100 out of 100 000 show a false positive)<br />
Pr(True negative) = 0.999 × 0.9999<br />
= 0.99890<br />
Pr(Test positive) = Pr(B)<br />
= 0.00009 + 0.00100<br />
= 0.00109 (109 out of 100 000 show a positive test)<br />

Therefore,<br />
Pr(A|B) = Pr(A ∩ B) / Pr(B)<br />
= 0.00009 / (0.00009 + 0.00100)<br />
= 0.00009 / 0.00109<br />
= 0.083 (nine of the 109 with a positive biopsy have the cancer)<br />

Conclusion: Only 8.3% <strong>of</strong> those women<br />

identified as having the disease actually do.<br />

(This is not at all what we would expect <strong>and</strong> is<br />

rather unsatisfactory.)<br />
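The whole calculation is a Bayes'-theorem computation, and is easy to re-run with other prevalences. A Python sketch (illustrative; the variable names are not from the notes):

```python
# Biopsy example: reverse the conditional probability with Bayes' theorem.
prevalence  = 0.0001   # Pr(A)
sensitivity = 0.90     # Pr(B|A)
false_pos   = 0.001    # Pr(B|not A)

pr_true_pos  = sensitivity * prevalence          # Pr(A ∩ B)
pr_false_pos = false_pos * (1 - prevalence)
pr_positive  = pr_true_pos + pr_false_pos        # Pr(B)
ppv = pr_true_pos / pr_positive                  # Pr(A|B)
print(round(ppv, 3))  # 0.083
```

Re-running with a higher prevalence (say 0.01) shows how strongly the positive predictive value depends on how common the disease is.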



1. Pr(B|A) is called the sensitivity of the test (the<br />
probability a person with the disease returns a<br />
positive result, or the proportion of positives<br />
that are correctly identified).<br />
2. Pr(B̄|Ā) is called the specificity of the test<br />
(the proportion of negatives that are correctly<br />
identified by the test).<br />
3. Sensitivity and specificity are, from a practical<br />
point of view, not directly helpful, as the point of<br />
diagnostic testing is to make a diagnosis; i.e.<br />
we need to know the probability of the test<br />
giving the correct diagnosis, whether it is<br />
positive or negative. That is Pr(A|B), not<br />
Pr(B|A).<br />
4. Pr(A|B) is the positive predictive value (the<br />
proportion of patients with positive test<br />
results who are correctly diagnosed).<br />
5. The negative predictive value is the<br />
proportion of patients with negative test<br />
results who are correctly diagnosed, i.e.<br />
Pr(Ā|B̄).<br />



Example: A patient consulted his GP because<br />

he had intermittent chest pain. The description<br />

<strong>of</strong> such pain is known to suggest a patient has<br />

heart disease with a probability <strong>of</strong> 0.48. The<br />

patient took an ECG test which has a sensitivity<br />

<strong>of</strong> 0.90 <strong>and</strong> a specificity <strong>of</strong> 0.84. The patient<br />

returns a positive ECG. Now find the<br />

probability he has heart disease in light <strong>of</strong> this<br />

additional information. Also find the positive<br />

<strong>and</strong> negative predictive values.<br />

Solution: H is the event “patient has heart disease”;<br />
T is the event “ECG test is positive”.<br />
[Tree diagram:<br />
Pr(H) = 0.48; branches Pr(T|H) = 0.90 (sensitivity), giving (0.90)(0.48) = 0.4320, and Pr(T̄|H) = 0.10, giving (0.10)(0.48) = 0.0480.<br />
Pr(H̄) = 0.52; branches Pr(T|H̄) = 0.16, giving (0.16)(0.52) = 0.0832, and Pr(T̄|H̄) = 0.84 (specificity), giving (0.84)(0.52) = 0.4368.]<br />

Pr(T) = 0.4320 + 0.0832 = 0.5152<br />

Pr(H|T) = 0.4320/0.5152 = 0.839<br />



Notice how the probability <strong>of</strong> heart disease has<br />

been revised up from 0.48 to 0.839 as a result <strong>of</strong><br />

the test.<br />

Positive predictive value = 0.839<br />

Pr(Test negative) = 0.0480 + 0.4368 = 0.4848<br />

Negative predictive value = 0.4368/0.4848 = 0.901<br />



Example<br />

Like swine flu today, about six years ago SARS was a threat to world health. In the early days<br />

<strong>of</strong> the SARS epidemic emergency measures were put in place by the World Health<br />

Organisation in an attempt to control the spread <strong>of</strong> SARS <strong>and</strong> to identify the condition. But no<br />

adequate screening tests existed to identify the condition when it first appeared in Hong Kong.<br />

A study was carried out in the early days to evaluate the WHO criteria for identifying patients<br />

with SARS in the SARS screening clinic in Hong Kong. Of 556 consecutive clinic attendees,<br />

97 were confirmed with SARS. Of these 97 patients with confirmed SARS, 25 met the WHO<br />

criteria for suspected SARS. Of the 459 patients in whom SARS was not confirmed, 438 were<br />

negative according to the WHO criteria.<br />

(a)<br />

Find the prevalence <strong>of</strong> confirmed SARS at the clinic (i.e. the proportion<br />

with SARS).<br />

(1 mark)<br />

(b) Estimate the sensitivity <strong>and</strong> specificity <strong>of</strong> the WHO test from the numbers above. (2<br />

marks)<br />

(c) Estimate the probability that the WHO test produces a positive result. (1 mark)<br />

(d) Estimate the positive predictive value <strong>of</strong> the test. (1 mark)<br />

(e) Estimate the negative predictive value <strong>of</strong> the test. (1 mark)<br />

(f)<br />

How would the positive predictive value <strong>of</strong> the test be affected if the prevalence <strong>of</strong><br />

SARS among clinic attendees were to decrease?<br />

(1 mark)<br />



SARS Confirmed<br />
WHO Result Yes No Total<br />
Positive 25 [21] 46<br />
Negative [72] 438 510<br />
Total 97 459 556<br />
(Bracketed counts are deduced from the row and column totals.)<br />

(a) Prevalence = 97/556 = 0.174<br />

(b) Sensitivity = 25/97 = 0.258; specificity = 438/459 = 0.954<br />

[Tree diagram: Pr(S) = 0.174 with branches Pr(T+|S) = 0.258 and Pr(T–|S) = 0.742; Pr(S̄) = 0.826 with branches Pr(T+|S̄) = 0.046 and Pr(T–|S̄) = 0.954.]<br />

(c) Pr (T + ) = (0.174)(0.258) + (0.826)(0.046)<br />

= 0.0449 + 0.0380<br />

= 0.083<br />

(d) Positive predictive value = 0.045/0.083 = 0.542<br />

(e) Pr(T – ) = (0.174)(0.742) + (0.826)(0.954)<br />

= 0.917<br />

Negative predictive value = 0.788/0.917 = 0.859<br />

(f) The positive predictive value will decrease.<br />
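All of these answers can be recomputed directly from the 2 × 2 counts, which avoids the small rounding introduced by the tree probabilities. A Python sketch (illustrative, not part of the original notes):

```python
# SARS screening: counts from the 2x2 table above.
tp, fn = 25, 72     # confirmed SARS: WHO positive / negative
fp, tn = 21, 438    # SARS not confirmed: WHO positive / negative
n = tp + fn + fp + tn                     # 556

prevalence  = (tp + fn) / n               # 97/556
sensitivity = tp / (tp + fn)              # 25/97
specificity = tn / (fp + tn)              # 438/459
ppv = tp / (tp + fp)                      # 25/46
npv = tn / (fn + tn)                      # 438/510

print(round(prevalence, 3), round(sensitivity, 3), round(specificity, 3))
print(round(ppv, 3), round(npv, 3))
```

The exact positive predictive value 25/46 = 0.543 differs slightly from the 0.542 obtained above from the rounded tree probabilities.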



Example: Sensitive Survey Questions.<br />

This is an important way <strong>of</strong> gaining information<br />

on sensitive or controversial issues.<br />

The question is: do you have, or have you ever<br />
had, a sexually transmitted disease (STD)?<br />
It is unlikely that a truthful response, or any<br />
response at all, will be given.<br />

In a mail survey <strong>of</strong> 268 young people five said<br />

they had an STD.<br />

Probability = 5/268 = 0.019 (or 19 per 1000)<br />

Instead, proceed as follows:<br />

1. Roll a die, allowing no one to see the<br />

outcome.<br />

2. Toss a fair coin.<br />

3. If the die shows “1”, answer truthfully the<br />
question: “Have you thrown a head?”<br />

4. If the die shows 2, 3, 4, 5 or 6, answer<br />
truthfully the question:<br />



“Have you ever had a sexually transmitted<br />
disease?”<br />

A tree diagram summarises this procedure where<br />

θ is the proportion <strong>of</strong> response “YES” to the STD<br />

question.<br />

[Tree diagram: roll the die. With Pr = 1/6 it shows “1”; the coin then gives Head (1/2) → answer “Yes” (Pr = 1/12) or Tail (1/2) → answer “No” (Pr = 1/12). With Pr = 5/6 it shows “2 to 6”; then STD Yes (θ) → Pr = 5θ/6 or STD No (1 – θ) → Pr = 5(1 – θ)/6.]<br />
Pr(Yes) = 1/12 + 5θ/6<br />
There were 54 “Yes” and 214 “No” for 268<br />
people.<br />
Estimate Pr(Yes) = 54/268 = 0.2015<br />
∴ 0.2015 = 1/12 + 5θ/6<br />
∴ 12(0.2015) = 1 + 10θ<br />



∴ 2.418 – 1 = 10θ<br />

∴ 1.418 = 10θ<br />

∴ θ = 0.1418<br />

or 142 per 1000 have an STD<br />

(compare 19 per 1000 previously)<br />
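The unscrambling step is just solving a linear equation in θ. A Python sketch (illustrative only):

```python
# Randomised response: Pr(Yes) = 1/12 + 5*theta/6, so
# theta = (12*Pr(Yes) - 1) / 10.
yes, n = 54, 268
pr_yes = yes / n
theta = (12 * pr_yes - 1) / 10
print(round(theta, 4))  # 0.1418
```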



Probability Distribution <strong>and</strong> R<strong>and</strong>om Variables<br />

A r<strong>and</strong>om variable has values which depend on<br />

the outcome <strong>of</strong> a r<strong>and</strong>om experiment. R<strong>and</strong>om<br />

variables are labelled with a capital letter (X<br />

say). They can be discrete or continuous. The<br />

number <strong>of</strong> locations with tuatara on Stephens<br />

Isl<strong>and</strong> is discrete (possible values 0, 1, 2, 3)<br />

while cholesterol levels are continuous.<br />

Example: (Tuatara again) Three locations are<br />

visited on 50 occasions in the tuatara study <strong>and</strong><br />

the number <strong>of</strong> locations with tuatara found are<br />

recorded each time. Results follow along with<br />

values calculated previously in the fourth column.<br />

X = x_j (tuatara at locations) | f_j (frequency) | f_j/n (rel. freq) | f_j/n → Pr(X = x_j) (as n becomes large)<br />
0 8 0.16 0.216<br />
1 22 0.44 0.432<br />
2 15 0.30 0.288<br />
3 5 0.10 0.064<br />
Total n = 50 1.00 1.000<br />



X is the r<strong>and</strong>om variable. X is discrete here<br />

because all possible outcomes x j can be counted.<br />

The 50 results in the study are summarised by the<br />

relative frequencies.<br />

If many trials (n large) are carried out, the relative<br />

frequencies <strong>of</strong> each x j stabilise to give<br />

probabilities<br />

Pr(X = x j )<br />

for each outcome. Together these probabilities<br />

form the probability distribution rather than a<br />

relative frequency distribution.<br />

NB (1) Σ_{j=1}^{4} Pr(X = x_j) = 1, as for relative<br />
frequencies<br />
(2) All probabilities are between 0 and 1.<br />



Describing Probability Distributions<br />

Let X be a symbol for a probability distribution<br />

<strong>and</strong> let μ X be the mean <strong>of</strong> X. (Assume X is<br />

discrete for the moment.)<br />

For a sample <strong>of</strong> n values from the distribution<br />

suppose each possible x j occurs f j times <strong>and</strong><br />

there are k possible values <strong>of</strong> j. Then the sample<br />

mean is<br />

x̄ = (1/n) Σ_{j=1}^{k} x_j f_j = Σ_{j=1}^{k} x_j (f_j / n)<br />

As the sample size becomes large, the relative<br />

frequencies become probabilities <strong>and</strong> the mean <strong>of</strong><br />

the probability distribution X is μ X where<br />

μ_X = Σ_{j=1}^{k} x_j Pr(X = x_j)<br />

A similar argument shows that the variance σ²_X of<br />
the probability distribution X is<br />
σ²_X = Σ_{j=1}^{k} (x_j − μ_X)² Pr(X = x_j)<br />



Take the square root to get the st<strong>and</strong>ard deviation<br />

<strong>of</strong> the probability distribution σ X .<br />

Note: The sample mean x̄ and variance s² are<br />
estimates of the population mean μ_X and variance<br />
σ²_X.<br />

Ex: Find the mean <strong>and</strong> st<strong>and</strong>ard deviation <strong>of</strong> the<br />

distribution <strong>of</strong> the number <strong>of</strong> locations at which<br />

tuatara are found.<br />

X = x_j | Pr(X = x_j) | x_j Pr(X = x_j) | (x_j – μ_X)² | (x_j – μ_X)² Pr(X = x_j)<br />
0 0.216 0.000 (0 – 1.2)² = 1.44 0.311<br />
1 0.432 0.432 0.04 0.017<br />
2 0.288 0.576 0.64 0.184<br />
3 0.064 0.192 3.24 0.207<br />
Total 1.000 1.200 5.36 0.720<br />

μ_X = Σ_{j=1}^{4} x_j Pr(X = x_j) = 1.20<br />
On average just over one location per visit will<br />
have tuatara present.<br />
σ²_X = Σ_{j=1}^{4} (x_j − μ_X)² Pr(X = x_j) = 0.72<br />
and σ_X = √0.72 = 0.85<br />
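The same arithmetic as the table can be written as a short calculation. A Python sketch (illustrative; the probabilities are those derived earlier):

```python
import math

# Mean and standard deviation of a discrete probability distribution.
dist = {0: 0.216, 1: 0.432, 2: 0.288, 3: 0.064}  # Pr(X = x) for each x

mu  = sum(x * p for x, p in dist.items())
var = sum((x - mu) ** 2 * p for x, p in dist.items())
sd  = math.sqrt(var)
print(round(mu, 2), round(var, 2), round(sd, 2))  # 1.2 0.72 0.85
```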



Example: A person infected with a disease can<br />

pass it on to others. Let the r<strong>and</strong>om variable, X,<br />

be the number <strong>of</strong> others infected by this person.<br />

X is found to have the following probability<br />

distribution.<br />

Find μ_X and σ²_X.<br />
X = x_j | Pr(X = x_j)<br />
0 0.10<br />
1 0.25<br />
2 0.40<br />
3 0.20<br />
4 0.05<br />
μ_X = 0(0.10) + 1(0.25) + 2(0.40) + 3(0.20) + 4(0.05)<br />
= 1.85<br />
σ²_X = (0 – 1.85)²(0.10) + (1 – 1.85)²(0.25) + (2 – 1.85)²(0.40)<br />
+ (3 – 1.85)²(0.20) + (4 – 1.85)²(0.05)<br />
= 1.0275<br />
Also, σ_X = √1.0275 = 1.0137<br />



Rules for combining r<strong>and</strong>om variables<br />

Often we are interested in the mean <strong>and</strong><br />

variance <strong>of</strong> a rescaled r<strong>and</strong>om variable, or in the<br />

mean <strong>and</strong> variance <strong>of</strong> sums (or differences) <strong>of</strong><br />

r<strong>and</strong>om variables. The following properties are<br />

true <strong>of</strong> all numerical r<strong>and</strong>om variables, discrete<br />

or continuous.<br />

If X <strong>and</strong> Y are independent r<strong>and</strong>om variables<br />

<strong>and</strong> a <strong>and</strong> b are constants, then:<br />

1. The mean <strong>of</strong> the new r<strong>and</strong>om variable<br />

a + bX is<br />

μ a+bX = a + bμ X<br />

2. The variance <strong>of</strong> a + bX is:<br />

σ 2 a+bX = b 2 σ 2 X<br />

3. The mean <strong>of</strong> the new r<strong>and</strong>om variable<br />

aX + bY is<br />

μ aX+bY = aμ X + bμ Y<br />

4. The variance <strong>of</strong> aX + bY is<br />

σ 2 aX+bY = a 2 σ 2 X + b 2 σ 2 Y<br />



Note: Properties 3 and 4 tell us that<br />
μ_{X+Y} = μ_X + μ_Y and σ²_{X+Y} = σ²_X + σ²_Y.<br />
Also, μ_{X−Y} = μ_X − μ_Y and σ²_{X−Y} = σ²_X + σ²_Y.<br />

Example: Temperatures used to be recorded in<br />

degrees Fahrenheit. Suppose a r<strong>and</strong>om variable F<br />

measures January temperature (in Fahrenheit) in<br />

Dunedin <strong>and</strong> daily maximum summer temperatures<br />

have a mean <strong>of</strong> 70°F with a st<strong>and</strong>ard deviation <strong>of</strong><br />

5°F.<br />

Use the conversion formula C = (5/9)(F − 32) to find<br />
the mean and standard deviation for the temperatures<br />
in degrees Celsius.<br />
Solution:<br />
We will let the random variable C represent the<br />
temperature in Celsius. The equation<br />
C = (5/9)(F − 32) may be rearranged by expanding the<br />
brackets to become<br />
C = (5/9)F − (5/9) × 32 or C = (5/9)F − 160/9<br />



We have μ_{a+bX} = a + bμ_X<br />
with a = −160/9 and b = 5/9.<br />
Therefore μ_C = a + bμ_F<br />
= −160/9 + (5/9) × 70<br />
= 21.1 °C<br />
We also have σ²_{a+bX} = b²σ²_X<br />
σ²_C = (5/9)² × 5²<br />
= (25/81) × 25<br />
= 7.716<br />
Therefore σ_C = √7.716 = 2.78 °C<br />
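Properties 1 and 2 say that for a linear change of units the mean transforms the same way as the data, while the standard deviation is only scaled by |b|. A Python sketch (illustrative only):

```python
# C = a + b*F with a = -160/9, b = 5/9 (Fahrenheit to Celsius).
mu_F, sd_F = 70.0, 5.0
a, b = -160 / 9, 5 / 9

mu_C = a + b * mu_F        # property 1: mean shifts and scales
sd_C = abs(b) * sd_F       # property 2: sd only scales
print(round(mu_C, 1), round(sd_C, 2))  # 21.1 2.78
```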

Example: What is the difference<br />
between T = X + X + X and T = 3X?<br />



Note: These results can be extended to several<br />

r<strong>and</strong>om variables.<br />

Example: (Infected person continued)<br />

Three people living in separate areas have the<br />

disease. R<strong>and</strong>om variables X 1 , X 2 , X 3 are<br />

numbers <strong>of</strong> other people infected by them. Find<br />

mean <strong>and</strong> variance <strong>of</strong> total number infected by<br />

the original three.<br />

Total T = X₁ + X₂ + X₃ (X₁, X₂, X₃ assumed<br />
independent as the people are in different areas)<br />
μ_T = μ_{X₁} + μ_{X₂} + μ_{X₃} = 1.85 + 1.85 + 1.85 = 5.55<br />
σ²_T = σ²_{X₁} + σ²_{X₂} + σ²_{X₃} = 1.0275 + 1.0275 + 1.0275 = 3.0825<br />

Note: Do not say T = 3X₁. Although<br />
μ_{3X₁} = 3μ_{X₁} = 5.55,<br />
σ²_{3X₁} = 9σ²_{X₁} = 9.2475 ≠ 3.0825<br />
This is a very common source of error.<br />
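The distinction can be seen by simulation: T = X₁ + X₂ + X₃ and 3X₁ have the same mean but very different variances. A Python sketch (illustrative; the seed and sample size are arbitrary):

```python
import random

random.seed(1)
values = [0, 1, 2, 3, 4]
probs  = [0.10, 0.25, 0.40, 0.20, 0.05]

def draw():
    return random.choices(values, weights=probs)[0]

n = 100_000
sums    = [draw() + draw() + draw() for _ in range(n)]  # X1 + X2 + X3
tripled = [3 * draw() for _ in range(n)]                # 3 * X1

def mean(xs):
    return sum(xs) / len(xs)

def var(xs):
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

print(round(mean(sums), 2), round(var(sums), 2))        # ~5.55 and ~3.08
print(round(mean(tripled), 2), round(var(tripled), 2))  # ~5.55 and ~9.25
```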





SECTION 4<br />

This section introduces both the Binomial <strong>and</strong> Normal Distributions which model many<br />

phenomena arising in the real world. Consequently the distributions allow us to answer some<br />

important <strong>and</strong> relevant questions.<br />

The Binomial Distribution: Definition, mean <strong>and</strong> variance<br />

The Binomial Table: Examples<br />

The Normal Distribution: Definition<br />

St<strong>and</strong>ard Normal Distribution <strong>and</strong> Table<br />

General Normal Distribution<br />

Normal Approximation to the Binomial<br />

Transforming Data to Normal<br />

121<br />

Section 4


The Binomial Distribution<br />

The binomial distribution arises when<br />

investigating proportions. e.g. the proportion <strong>of</strong><br />

adult population with diabetes. Each individual<br />

has or does not have diabetes.<br />

Let Y be the r<strong>and</strong>om variable for an individual<br />

outcome <strong>of</strong> a person in the population. Two<br />

outcomes occur, namely Y = 1 (e.g. diabetes<br />

present or success) <strong>and</strong> Y = 0 (e.g. diabetes not<br />

present or failure). The parameter π represents<br />

the unknown proportion <strong>of</strong> 1’s occurring.<br />

The probability distribution of Y is<br />
Y = y_j | Pr(Y = y_j)<br />
1 π (“success”)<br />
0 1 – π (“failure”)<br />
Then μ_Y = 1(π) + 0(1 – π) = π<br />
σ²_Y = (1 – π)²π + (0 – π)²(1 – π)<br />
= (1 – π)[π(1 – π) + π²]<br />
= π(1 – π)<br />



Now suppose that we take a sample <strong>of</strong> size n<br />

from the underlying population. What is the<br />

distribution <strong>of</strong> the number <strong>of</strong> successes<br />

The total number of successes is X where<br />
X = Y₁ + Y₂ + Y₃ + … + Y_n<br />
with all the Y_j independent of each other.<br />
∴ μ_X = π + π + π + … + π = nπ<br />
σ²_X = σ²_{Y₁} + σ²_{Y₂} + … + σ²_{Y_n}<br />
= π(1 – π) + π(1 – π) + … + π(1 – π)<br />
= nπ(1 – π)<br />
X is said to have a binomial distribution, with<br />
μ_X = nπ<br />
σ²_X = nπ(1 − π)<br />
where π is the parameter giving Pr(“success”) or<br />
Pr(diabetes present).<br />
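These two formulas can be checked against the exact binomial probabilities. A Python sketch (illustrative; uses the n = 8, π = 0.40 case that appears in the table later in this section):

```python
from math import comb

# Exact binomial probabilities for n = 8, pi = 0.40.
n, pi = 8, 0.40
pmf = {x: comb(n, x) * pi**x * (1 - pi)**(n - x) for x in range(n + 1)}

mu  = sum(x * p for x, p in pmf.items())
var = sum((x - mu) ** 2 * p for x, p in pmf.items())
print(round(mu, 2), round(var, 2))   # 3.2 1.92  (n*pi and n*pi*(1 - pi))
print(round(pmf[3], 4))              # 0.2787
```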



The mean number <strong>of</strong> successes is nπ <strong>and</strong> the<br />

variance <strong>of</strong> the number <strong>of</strong> successes is nπ(1 – π)<br />

The binomial distribution results from n trials<br />

involving independent binary outcomes.<br />

e.g. melanoma (Yes/No)<br />

Smoking (smokes/does not smoke)<br />

Diabetes (present/absent)<br />

Tuatara (present/absent)<br />

Example: X = number <strong>of</strong> locations in group <strong>of</strong><br />

n that have tuatara present.<br />

It is known that Pr(success) = π = 0.40 <strong>and</strong><br />

Pr(failure) = 1 – π = 0.60.<br />

Each location is assumed independent <strong>of</strong> other<br />

locations.<br />

Also assume the probability <strong>of</strong> tuatara being<br />

present remains constant at each location.<br />



Notes 1. If these conditions are met, if n (the<br />

number <strong>of</strong> trials) <strong>and</strong> π (the probability <strong>of</strong><br />

success) are known, all probabilities in the<br />

distribution are known exactly.<br />

2. n <strong>and</strong> π are said to be the parameters <strong>of</strong> the<br />

distribution.<br />

3. The binomial distribution requires<br />

independent trials <strong>and</strong> a probability <strong>of</strong><br />

success which remains constant for each<br />

trial.<br />

4. We use binomial tables to approximate<br />

these binomial probabilities for values <strong>of</strong> n<br />

up to 20. (See table section <strong>of</strong> these notes.)<br />



For example suppose n = 8 <strong>and</strong> π = 0.40 are the<br />

two defining parameters.<br />

π<br />

n x 0.05 0.10 0.15 … 0.40 0.50<br />

8 0 0.6634 -- -- 0.0160 0.0039<br />

1 0.2793 -- -- 0.0896 0.0312<br />

2 0.0515 -- -- 0.2090 0.1094<br />

3 0.0054 -- -- 0.2787 0.2187<br />

4 0.0004 -- -- … 0.2322 0.2734<br />

5 0.0000 -- -- 0.1239 0.2188<br />

6 0.0000 -- -- 0.0413 0.1094<br />

7 0.0000 -- -- 0.0079 0.0313<br />

8 0.0000 -- -- 0.0007 0.0039<br />

9 0 -- --<br />

1 -- … --<br />

2 -- --<br />

3 -- --<br />

etc<br />

Notice that Pr(X = 3) = 0.2787 for π = 0.40 <strong>and</strong> n = 8<br />
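The tabulated values can be checked against the binomial formula Pr(X = x) = C(n, x) π^x (1 − π)^(n−x). A minimal Python sketch (the notes themselves work from the printed table; the function name binom_pmf is my own):

```python
from math import comb

def binom_pmf(x, n, pi):
    # Pr(X = x) for a binomial with n trials and success probability pi
    return comb(n, x) * pi**x * (1 - pi)**(n - x)

# Reproduce the tabulated entry for n = 8, pi = 0.40, x = 3:
print(round(binom_pmf(3, 8, 0.40), 4))  # 0.2787
```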

Example: Records show that twenty percent <strong>of</strong><br />

violin pupils are known to develop OOS during<br />

the course <strong>of</strong> their training. Define X to be the<br />

number <strong>of</strong> violin pupils out <strong>of</strong> 9 who develop<br />

OOS during their training.<br />



(a) Find the probability distribution <strong>of</strong> X.<br />

(b) What is the probability that none <strong>of</strong> the 9<br />

pupils develop OOS?<br />

(c) What is the probability that more than 4 out<br />

of the 9 pupils develop OOS?<br />

(d) In 2005 a certain violin teacher had 9 new<br />

pupils <strong>and</strong> 5 developed OOS during training.<br />

What conclusion would you draw about the<br />

training methods of this teacher?<br />

Solution<br />

(a) Here X is binomial with n = 9; π = 0.20<br />

(<strong>and</strong> assume the pupils are all independent<br />

<strong>of</strong> each other). The binomial table gives<br />

n x π = 0.20<br />

9 0 0.1342 = Pr(X = 0)<br />

1 0.3020 = Pr(X = 1)<br />

2 0.3020 etc<br />

3 0.1762<br />

4 0.0661<br />

5 0.0165<br />

6 0.0028<br />

7 0.0003<br />

8 0.0000<br />

9 0.0000<br />



(b) Pr(X = 0) = 0.1342<br />

(c) Pr(X > 4) = Pr(X = 5) + Pr(X = 6)<br />

+ Pr(X = 7) + Pr(X = 8) + Pr(X = 9)<br />

= 0.0196<br />
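The tail sum in (c) can be checked directly from the binomial formula rather than the printed table; a sketch for n = 9, π = 0.20:

```python
from math import comb

n, pi = 9, 0.20
# Pr(X > 4) = Pr(X = 5) + ... + Pr(X = 9)
p_tail = sum(comb(n, x) * pi**x * (1 - pi)**(n - x) for x in range(5, n + 1))
print(round(p_tail, 4))  # 0.0196
```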

(d) It would be rare or unusual (probability = 0.0196)<br />

for more than four violin pupils to develop OOS<br />

if 20% is the overall percentage known to develop<br />

OOS historically. We conclude the training<br />

methods <strong>of</strong> this teacher are likely to result in a<br />

greater occurrence <strong>of</strong> OOS among pupils.<br />

If the violin teacher has no effect on OOS,<br />

π will remain at 0.20 and the probability<br />

that more than four <strong>of</strong> the pupils will<br />

develop OOS is 0.0196.<br />

This is viewed (by convention) to be a small<br />

probability indicating a rare or unusual event<br />

has arisen if the value <strong>of</strong> π = 0.20 still holds<br />

for the pupils <strong>of</strong> this teacher.<br />

Either π = 0.20 is unchanged for this teacher<br />

<strong>and</strong> a rare event has been observed<br />



or the teacher is at fault <strong>and</strong> more pupils<br />

develop OOS. This second alternative is<br />

usually taken <strong>and</strong> therefore we conclude the<br />

teacher has a higher incidence <strong>of</strong> OOS.<br />

Notes<br />

1. It is the size <strong>of</strong> the probability <strong>of</strong> this<br />

observed “event” or a more extreme <strong>and</strong><br />

convincing event which results in the<br />

conclusion (more than 4).<br />

2. 0.0196 is a chance <strong>of</strong> just under 2 per 100<br />

(2%).<br />

3. A probability less than 0.05 is (by<br />

convention) taken to imply an event is rare or<br />

unlikely to occur.<br />

4. A probability above 0.05 <strong>of</strong>ten means an<br />

event is not unusual. If the violin teacher had<br />

produced such a probability then the teaching<br />

would not be at all unusual in relation to<br />

incidence <strong>of</strong> OOS.<br />



Binomial Examples <strong>and</strong> Normal Distribution<br />

Example: (artificial data) A sociological<br />

report suggests that 75% <strong>of</strong> Maori children<br />

under 18 live with both parents. A r<strong>and</strong>om<br />

sample <strong>of</strong> 20 Maori children is selected, <strong>and</strong> X<br />

is the binomial r<strong>and</strong>om variable for the number<br />

<strong>of</strong> these 20 who live with both parents.<br />

(a) Define the parameters <strong>of</strong> the distribution <strong>of</strong> X.<br />

(b) Find Pr(X = 15).<br />

(c) Find the probability that 11 or fewer live<br />

with both parents (i.e. Pr(X ≤ 11)).<br />

(d) A r<strong>and</strong>om sample <strong>of</strong> 20 New Zeal<strong>and</strong><br />

Caucasian children had only 11 living with<br />

both parents. Does this result provide any<br />

evidence to support the claim that 75% <strong>of</strong> NZ<br />

Caucasian children live with both parents?<br />



Solution<br />

(a) X is binomial with n = 20, π = 0.75.<br />

(b) The problem is that 0.75 does not occur in<br />

the binomial table directly.<br />

Whenever π > 0.50, we replace the event<br />

“success” by its complement “failure”. This<br />

is because the binomial table does not have<br />

values greater than 0.50. In this case,<br />

“failure” is the event “child does not live<br />

with both parents”. For easy analysis, define<br />

new r<strong>and</strong>om variable<br />

Y = number not living with both parents.<br />

Y is binomial, n = 20 <strong>and</strong> new π′ = 0.25<br />

[here y = n – x <strong>and</strong> π′ = 1 – π]<br />

∴ Pr(X = 15 given π = 0.75)<br />

= Pr(Y = 5 given π′ = 0.25)<br />

= 0.2023 from table<br />
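The complement trick can be verified numerically: Pr(X = 15 | n = 20, π = 0.75) and Pr(Y = 5 | n = 20, π′ = 0.25) are the same number because C(20, 15) = C(20, 5). A sketch:

```python
from math import comb

n = 20
p_x = comb(n, 15) * 0.75**15 * 0.25**5   # Pr(X = 15) with pi = 0.75
p_y = comb(n, 5) * 0.25**5 * 0.75**15    # Pr(Y = 5)  with pi' = 0.25
print(round(p_x, 4))  # 0.2023
```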

(c) Pr(X ≤ 11) = Pr(Y ≥ 9)<br />

= Pr(Y = 9) + Pr(Y = 10)<br />

+ … + Pr(Y = 20)<br />

= 0.0271 + 0.0099<br />

+ … + 0.0000<br />

= 0.0410<br />



(d) No. In fact there is evidence it is less than 75%<br />

for NZ Caucasian children.<br />

If π = 0.75 is assumed for Caucasian families,<br />

then the probability <strong>of</strong> observing 11 or fewer<br />

living with both parents is, by our convention,<br />

small (less than 0.05) providing evidence<br />

against 75%. Hence reject claim that π = 0.75<br />

for Caucasian families <strong>and</strong> conclude fewer live<br />

with both parents (because 11 is the direction <strong>of</strong><br />

fewer rather than more).<br />

Note: Suppose instead 12 out <strong>of</strong> 20 <strong>of</strong> the NZ<br />

Caucasian children were living with both<br />

parents.<br />

Pr(X ≤ 12) = Pr(Y ≥ 8) = 0.1019 if<br />

π = 0.75 meaning π′ = 0.25.<br />

This probability is not small, <strong>and</strong> now there is<br />

no evidence from our data to suppose the<br />

situation is any different among Caucasian<br />

families.<br />



Example (Revision)<br />

The st<strong>and</strong>ard drug for treating a cancer is claimed<br />

to halve the tumor size in 30% <strong>of</strong> all patients<br />

treated. Suppose X is the binomial r<strong>and</strong>om<br />

variable for the number <strong>of</strong> patients in a sample <strong>of</strong><br />

seven who have their tumor size halved.<br />

(a) List the conditions which must be met if X is<br />

binomial.<br />

Patients independent. Two outcomes only.<br />

Constant probability tumor size halved over<br />

all the patients.<br />

(b) Using the appropriate table, write down the<br />

distribution <strong>of</strong> probabilities for the number<br />

(X) who have their tumor size halved.<br />

X = x j Pr(X = x j )<br />

0 0.0824<br />

1 0.2471<br />

2 0.3177<br />

3 0.2269<br />

4 0.0972<br />

5 0.0250<br />

6 0.0036<br />

7 0.0002<br />



(c) Write down the probability that three <strong>of</strong> the<br />

patients have their tumor size halved.<br />

Probability = 0.2269<br />

(d) Find the probability that three or more <strong>of</strong> the<br />

patients have their tumor size halved.<br />

Probability = 0.3529<br />
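Parts (c) and (d) can be checked with the complement rule, Pr(X ≥ 3) = 1 − [Pr(0) + Pr(1) + Pr(2)]; a sketch for n = 7, π = 0.30:

```python
from math import comb

n, pi = 7, 0.30
pmf = [comb(n, x) * pi**x * (1 - pi)**(n - x) for x in range(n + 1)]
p_three = pmf[3]                   # Pr(X = 3)
p_three_or_more = 1 - sum(pmf[:3]) # Pr(X >= 3)
print(round(p_three, 4), round(p_three_or_more, 4))  # 0.2269 0.3529
```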

(e) In a pilot study in Auckl<strong>and</strong>, three out <strong>of</strong> seven<br />

patients given a new drug had their tumor size<br />

halved. What conclusion if any can be drawn<br />

about the new drug Explain how you reach<br />

your conclusion.<br />

Conclusion: There is no reason to suppose the<br />

new drug is any different to the st<strong>and</strong>ard.<br />

Explanation: Prob. <strong>of</strong> three or more is 0.3529<br />

which is large meaning the result with the new<br />

drug is consistent with the 30% before.<br />

Note: This study involves a very small number <strong>of</strong><br />

patients <strong>and</strong> will be reconsidered later with a larger<br />

sample.<br />



The Normal Distribution<br />

This distribution will allow us to calculate<br />

probabilities associated with observed sample<br />

results when we are dealing with continuous<br />

outcome measures <strong>and</strong> sample means. First we<br />

develop properties <strong>of</strong> the normal distribution.<br />

A relative frequency histogram tends to a<br />

probability distribution as sample size n becomes<br />

large.<br />

[Figure: a relative frequency HISTOGRAM becomes a smooth probability DISTRIBUTION curve f(X) as n increases and the class width decreases.<br />

Left (histogram): shaded area = proportion of observations between a and b. This represents a sample with a small number of individuals.<br />

Right (distribution): shaded area = probability of a value between a and b. This represents a population with a very large number of individuals.]<br />



The resulting curve is known as a probability<br />

function (or probability density function) <strong>and</strong> is<br />

described by a curve y = f(X).<br />

The area under this curve, say between two points<br />

X = a <strong>and</strong> X = b, is the probability Pr(a < X < b)<br />

X is a r<strong>and</strong>om variable taking values on a<br />

continuous scale.<br />

We have seen several sets <strong>of</strong> sample data which<br />

produce symmetrical histograms, bell shaped<br />

with a concentration <strong>of</strong> values at the centre <strong>and</strong><br />

few values at extremes. (e.g. cholesterol levels in<br />

the pravastatin study) Such data are said to be<br />

collected from a normal distribution or from a<br />

population <strong>of</strong> values which are normally<br />

distributed.<br />

[Gauss, 1777-1855, first developed the equation<br />

<strong>of</strong> such a normal curve while observing pattern in<br />

errors made while making measurements in<br />

astronomy]<br />



[Figure: normal curve Y = f(X), symmetrical about the centre μ.]<br />

The equation of such a normal curve is<br />

f(X) = (1/(σ√(2π))) e^(−½((X − μ)/σ)²)<br />

where parameter μ is the mean <strong>and</strong> parameter σ is<br />

the st<strong>and</strong>ard deviation <strong>of</strong> the distribution (in<br />

practice, μ <strong>and</strong> σ will be estimated from sample<br />

data by the values x <strong>and</strong> s).<br />

Notes 1. The graph is symmetrical about centre<br />

point denoted by μ.<br />

2. The two parameters μ <strong>and</strong> σ completely define<br />

a normal distribution (recall that parameters n<br />

<strong>and</strong> π define a binomial distribution).<br />

Notation: X ∼ N(μ, σ²)<br />

3. Increasing μ moves the curve but does not<br />

alter its shape<br />



[Figure: two normal curves with μ₂ > μ₁ and σ unchanged; the curve slides along the X axis from μ₁ to μ₂.]<br />

4. Increasing σ spreads the curve more widely<br />

about X = μ, but does not alter the centre <strong>of</strong> the<br />

distribution.<br />

[Figure: two normal curves centred at the same μ; the curve with σ₂ > σ₁ is lower and more spread out (μ unchanged).]<br />

Both the above could be normal distributions.<br />

5. Areas under these curves can be found from<br />

tables. The table is based on what is known as<br />

the st<strong>and</strong>ard normal distribution which has μ =<br />

0 <strong>and</strong> σ = 1.<br />



Normal distribution calculations.<br />

The St<strong>and</strong>ard Normal Distribution (Z)<br />

Z ∼ N(0, 1), i.e. Z is distributed with μ_Z = 0, σ_Z² = 1<br />

∴ f(Z) = (1/√(2π)) e^(−½Z²)<br />

[Figure: standard normal curve; shaded area between 0 and z = Pr(0 < Z < z), as given in the tables.]<br />

z     .00    .01    .02    .03     .04    .05  ……  .09<br />

.0    .0000<br />

.1<br />

.2<br />

.3<br />

⋮<br />

1.5<br />

1.6                        0.4484  0.4495<br />

1.7<br />

⋮<br />

3     0.4990<br />



Some calculations:<br />

1. Find Pr(0 < Z < 1.63)<br />

From table choose z = 1.63<br />

∴ Pr(0 < Z < 1.63) = 0.4484<br />

O 1.63 Z<br />

Also, Pr(0 < Z < 1.64) = 0.4495<br />

∴ Pr(0 < Z < 1.633) ≈ 0.4484 + (3/10)(0.0011)<br />

= 0.4487<br />

[final calculation need not be this accurate<br />

+ 0.4484 would be accepted for our purposes<br />

using this table.]<br />

2. Find Pr(Z > 1.64)<br />

Pr(Z > 1.64)<br />

= 0.5 – Pr(0 < Z < 1.64)<br />

= 0.5 – 0.4495<br />

= 0.0505<br />

3. Pr(1 < Z < 1.64)<br />

= Pr(0 < Z < 1.64) - Pr(0 < Z < 1)<br />

= 0.4495 – 0.3413<br />

= 0.1082<br />



4. Pr(-1 < Z < 1.64) = Pr(0 < Z < 1.64)<br />

+ Pr(-1 < Z < 0)<br />

= Pr(0 < Z < 1.64)<br />

+ Pr(0 < Z < 1) by symmetry<br />

= 0.4495 + 0.3413<br />

= 0.7908<br />

–1<br />

O 1.64 Z<br />

5. Pr(-1 < Z < 1) = 2Pr(0 < Z < 1)<br />

= 2(0.3413)<br />

= 0.6826<br />

Pr(-2 < Z < 2) = 2Pr(0 < Z < 2) = 0.9546<br />

Since σ Z = 1, a value z <strong>of</strong> Z is a count <strong>of</strong> the number<br />

<strong>of</strong> st<strong>and</strong>ard deviations to this point. Notice that<br />

approx 68% <strong>of</strong> the area is within one <strong>and</strong> 95%<br />

within two st<strong>and</strong>ard deviations <strong>of</strong> the centre.<br />
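Calculations 1 to 5 can be reproduced without the printed table using Python's statistics.NormalDist (an assumption of this sketch; the course reads Pr(0 < Z < z) from the table instead):

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal, mu = 0, sigma = 1

def area(a, b):
    # Pr(a < Z < b)
    return Z.cdf(b) - Z.cdf(a)

print(round(area(0, 1.63), 4))    # 0.4484
print(round(1 - Z.cdf(1.64), 4))  # 0.0505
print(round(area(1, 1.64), 4))    # 0.1082
print(round(area(-1, 1.64), 4))   # 0.7908
print(round(area(-1, 1), 4))      # 0.6827 (table rounding gives 0.6826)
```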

6. Find the value z above which 25% <strong>of</strong> the area lies.<br />

Here, find a value close to 0.25 in the centre <strong>of</strong><br />

normal table, then read back to margins.<br />

[Sketch: standard normal curve with area 0.50 below 0, area 0.25 between 0 and z, and area 0.25 above z.]<br />

Pr(0 < Z < 0.67) = 0.2486<br />

Pr(0 < Z < 0.68) = 0.2517<br />

Hence, z = 0.675 approx.<br />
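Reading the table backwards is an inverse-CDF computation; a sketch with statistics.NormalDist (assumed here in place of the printed table):

```python
from statistics import NormalDist

# Value z with 25% of the area above it, i.e. Pr(Z < z) = 0.75
z = NormalDist().inv_cdf(0.75)
print(round(z, 4))  # 0.6745 (the table gives z = 0.675 approx.)
```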



The General Normal Distribution (X)<br />

X ~ N(μ_X, σ_X²), say.<br />

Areas under this curve cannot be found directly<br />

from the normal table but X is related to the<br />

standard normal Z ~ N(0, 1²) by<br />

Z = (X − μ_X)/σ_X<br />

Notes 1. The distribution X is said to be<br />

standardised when μ_X is subtracted and the<br />

result divided by σ_X.<br />

2. Z is essentially the number of standard<br />

deviations (σ_X) from μ_X to a value x of X.<br />



Some calculations<br />

1. Pr(μ_X − σ_X < X < μ_X + σ_X)<br />

= Pr(−σ_X < X − μ_X < +σ_X)<br />

= Pr(−1 < (X − μ_X)/σ_X < +1)<br />

= Pr(−1 < Z < +1)<br />

= 2 Pr(0 < Z < 1) = 0.6826<br />

[68.26% of distribution within one standard deviation of the centre]<br />

2. In general, Pr(a < X < b)<br />

= Pr(a − μ_X < X − μ_X < b − μ_X)<br />

= Pr((a − μ_X)/σ_X < (X − μ_X)/σ_X < (b − μ_X)/σ_X)<br />

= Pr((a − μ_X)/σ_X < Z < (b − μ_X)/σ_X)<br />

[Sketch: area under the curve between a and b, with centre μ_X.]<br />



Example: Assume that diastolic blood pressures<br />

for men aged 35-44 have a normal distribution with<br />

mean μ X = 80 <strong>and</strong> st<strong>and</strong>ard deviation σ X = 12<br />

(a) Find Pr(90 < X < 100)<br />

(b) The percentage <strong>of</strong> men in this age range who<br />

are hypertensive (a level over 100).<br />

Solution<br />

(a) Pr(90 < X < 100) = Pr((90 − 80)/12 < Z < (100 − 80)/12)<br />

= Pr(0.833 < Z < 1.667)<br />

= Pr(0 < Z < 1.667)<br />

− Pr(0 < Z < 0.833)<br />

= 0.4525 − 0.2967<br />

= 0.1558<br />

(b) X ~ N(80, 144). Find Pr(X > 100).<br />

Pr(X > 100) = Pr(Z > (100 − 80)/12)<br />

= Pr(Z > 1.67)<br />

= 0.5 − Pr(0 < Z < 1.67)<br />

= 0.5 − 0.4525<br />

= 0.0475<br />

We expect 4.8% <strong>of</strong> men in this age group to be<br />

hypertensive.<br />
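Parts (a) and (b) can be checked by standardising in code; a sketch with statistics.NormalDist (the exact CDF values differ slightly from the two-decimal table, e.g. 0.1545 rather than 0.1558 in (a)):

```python
from statistics import NormalDist

X = NormalDist(mu=80, sigma=12)  # diastolic blood pressure model

p_a = X.cdf(100) - X.cdf(90)     # Pr(90 < X < 100)
p_b = 1 - X.cdf(100)             # Pr(X > 100), the hypertensive proportion
print(round(p_a, 4), round(p_b, 4))  # 0.1545 0.0478
```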



(c) Find the diastolic blood pressure which is<br />

exceeded by 10% <strong>of</strong> men aged 35-44.<br />

X ~ N(80, 144)<br />

[Sketch: on the X (original scale), mean 80 and the unknown cut-off x; on the Z (standard scale), 0 and z; area 0.40 between the centre and the cut-off, area 0.10 above it.]<br />

(It is helpful, initially, to sketch the st<strong>and</strong>ard<br />

scale as well as the original scale).<br />

From the st<strong>and</strong>ard normal table, find the<br />

value, z, which cuts <strong>of</strong>f area 0.40 as shown.<br />

Reading to the margins from the value 0.40 in<br />

centre <strong>of</strong> table gives z = 1.282 (part way<br />

between 1.28 <strong>and</strong> 1.29).<br />

Use z = (x − μ_X)/σ_X to get 1.282 = (x − 80)/12<br />

∴ x = 80 + 12(1.282)<br />

= 95.38<br />
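This is an inverse-CDF calculation: the level exceeded by 10% of men is the 90th percentile. A sketch:

```python
from statistics import NormalDist

X = NormalDist(mu=80, sigma=12)
x = X.inv_cdf(0.90)  # 90th percentile: 10% of men lie above this level
print(round(x, 2))   # 95.38
```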



The Normal Approximation to the Binomial<br />

(n, π)<br />

If there is a large sample selected from a<br />

population of binary values (e.g. people with or<br />

without diabetes), probabilities of observed<br />

outcomes are found from the normal N(μ_X, σ_X²)<br />

distribution where μ_X = nπ and<br />

σ_X = √(nπ(1 − π))<br />

[Figure: normal curve superimposed on the binomial bars; the bar for an integer x extends from x − ½ to x + ½.]<br />

Area of shaded block (if x integer) is the binomial<br />

probability of obtaining x successes.<br />

This is approximately the area under the normal<br />

curve between x − ½ and x + ½.<br />



∴ Pr(X = x) ≈ Pr( ((x − ½) − nπ)/√(nπ(1 − π)) < Z < ((x + ½) − nπ)/√(nπ(1 − π)) )<br />

Notes: 1. This approximation is good provided n<br />

is large <strong>and</strong> π is not too close to 0 or 1. (Under<br />

these conditions the binomial distribution is<br />

reasonably close to symmetrical <strong>and</strong> hence the<br />

normal curve is seen to be a good<br />

approximation.)<br />

2. The normal approximation is good if<br />

nπ ± 3√(nπ(1 − π))<br />

gives two values between 0 <strong>and</strong> n (the min.<br />

<strong>and</strong> max values <strong>of</strong> the binomial counts) since<br />

95% <strong>of</strong> the possible values should lie within<br />

these limits indicating a near symmetrical<br />

distribution.<br />
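Note 2's rule of thumb is easy to script; a sketch (the helper name normal_ok is my own):

```python
from math import sqrt

def normal_ok(n, pi):
    """Rule of thumb: the normal approximation is reasonable if
    n*pi +/- 3*sqrt(n*pi*(1 - pi)) stays inside [0, n]."""
    mu = n * pi
    sd = sqrt(n * pi * (1 - pi))
    return 0 <= mu - 3 * sd and mu + 3 * sd <= n

# Blood-group-B illustration that follows (pi = 0.11):
print(normal_ok(2, 0.11), normal_ok(10, 0.11), normal_ok(100, 0.11))
# False False True
```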



We know Pr(blood group B) = 0.11<br />

n = 2, π = 0.11: nπ = 0.22, √(nπ(1 − π)) = 0.44. Hence 0.22 ± 3(0.44).<br />

Figure 1 Binomial distribution of number of people out of two in blood group B (x-axis: number in blood group B, 0–2; y-axis: probability).<br />

n = 10, π = 0.11: nπ = 1.10, √(nπ(1 − π)) = 0.99. Hence 1.10 ± 3(0.99).<br />

Figure 2 Binomial distribution showing the number of subjects out of ten in blood group B based on the probability of being in blood group B (x-axis: number of subjects, 0–7; y-axis: probability).<br />

n = 100, π = 0.11: nπ = 11, √(nπ(1 − π)) = 3.13. Hence 11 ± 3(3.13).<br />

Figure 3 Binomial distribution showing the number of subjects out of 100 in blood group B based on the probability of being in blood group B (x-axis: number of subjects, 0–20; y-axis: probability).<br />



More on the normal <strong>and</strong> Statistical Inference<br />

Example: One in 40 adults on average develops<br />

a respiratory condition. A r<strong>and</strong>om sample <strong>of</strong> 400<br />

workers in a certain occupation has 16 with the<br />

condition. Find the probability that 16 or more<br />

suffer from this condition in general. What<br />

conclusion would you draw about the possible<br />

effect <strong>of</strong> this occupation on the occurrence <strong>of</strong> the<br />

condition? Justify your answer.<br />

Solution: Let X be the distribution <strong>of</strong> the number<br />

in a sample <strong>of</strong> 400 with the condition.<br />

Then X ~ Binomial (n =400; π = 1/40)<br />

μ_X = nπ = 10; σ_X = √(nπ(1 − π)) = 3.123<br />

Since nπ ± 2√(nπ(1 − π)) is 10 ± 6.2, the normal<br />

approximation can be used.<br />

Pr(X ≥ 16) ≈ Pr(Z > (15.5 − 10)/3.123)<br />

= Pr(Z > 1.761)<br />

= 0.0391<br />

[Sketch: the block for X = 16 runs from 15½ to 16½; the continuity-corrected boundary is 15.5.]<br />
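The continuity-corrected normal approximation can be reproduced with statistics.NormalDist; a sketch:

```python
from math import sqrt
from statistics import NormalDist

n, pi = 400, 1 / 40
mu = n * pi                   # 10
sd = sqrt(n * pi * (1 - pi))  # 3.123
# Continuity correction: Pr(X >= 16) ~ Pr(Z > (15.5 - mu) / sd)
p = 1 - NormalDist().cdf((15.5 - mu) / sd)
print(round(p, 4))  # 0.0391
```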



This is the p-value associated with a study result<br />

<strong>of</strong> 16. There is evidence <strong>of</strong> a higher incidence <strong>of</strong><br />

the respiratory condition than expected in this<br />

occupation. (The probability 0.0391 is small<br />

indicating that the event X = 16 or more is rare if<br />

π = 1/40 were to hold in this occupation.)<br />

Therefore, π is likely to be greater than 1/40 for<br />

workers in this occupation. (If this is the case,<br />

the event observed would not be unusual.)<br />



Example:<br />

It is claimed cancer tumor size is halved in 30%<br />

<strong>of</strong> all patients using current treatment. A new<br />

drug was used on 70 patients with the cancer.<br />

(Last week we looked at a case where the drug<br />

was tried on 7 patients with 3 successes.)<br />

(a) Suppose Y is the binomial r<strong>and</strong>om variable<br />

for the number <strong>of</strong> patients who have their<br />

tumor size halved. Write down the values for<br />

the mean <strong>and</strong> st<strong>and</strong>ard deviation <strong>of</strong> Y.<br />

μ_Y = nπ = 70(0.3) = 21<br />

σ_Y = √(nπ(1 − π)) = √(21(0.7)) = √14.7 = 3.83<br />



(b) In a study, thirty out <strong>of</strong> seventy patients<br />

(previously 3 out <strong>of</strong> 7) administered the<br />

st<strong>and</strong>ard drug experience a halving <strong>of</strong> their<br />

tumors. Find the probability that 30 or more<br />

out <strong>of</strong> 70 have their tumors halved.<br />

Pr(Y ≥ 30) ≈ Pr(Z > (29.5 − 21)/3.83)<br />

= Pr(Z > 2.22)<br />

= 0.5 – 0.4868<br />

= 0.0132<br />
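The same continuity-corrected recipe applies here (n = 70, π = 0.30); a sketch:

```python
from math import sqrt
from statistics import NormalDist

n, pi = 70, 0.30
mu = n * pi                   # 21
sd = sqrt(n * pi * (1 - pi))  # sqrt(14.7) = 3.83
p = 1 - NormalDist().cdf((29.5 - mu) / sd)  # Pr(Y >= 30)
print(round(p, 4))  # 0.0133 (the table-based working gives 0.0132)
```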

(c) In a study 30 out <strong>of</strong> 70 patients in Auckl<strong>and</strong><br />

administered this new drug had their tumor<br />

size halved. What conclusion can be drawn<br />

about the new drug?<br />

There is evidence that the new drug is more<br />

effective than the st<strong>and</strong>ard because the<br />

probability <strong>of</strong> 30 or more successes is less<br />

than 0.05 indicating the observed 30 (or<br />

more) is not likely to occur unless the new<br />

drug has a beneficial effect.<br />



Transforming Data<br />

If data being analysed are continuous but not<br />

normally distributed, it may be necessary to<br />

modify the data by transforming each value in<br />

order to create new values which are normal.<br />

Then work with the transformed values. Typical<br />

transformations involve logs, square roots or<br />

reciprocals.<br />

There are three reasons for transforming data.<br />

1. Statistical procedures which we develop may<br />

only be valid if the data are approximately<br />

normal, <strong>and</strong> non-normal data can be converted<br />

to normal by transforming.<br />

2. When comparing for example two samples <strong>of</strong><br />

data (e.g. cholesterol levels after treatment with<br />

pravastatin or a control) the two groups should<br />

have similar st<strong>and</strong>ard deviations for some<br />

testing procedures to be valid. Transforming<br />

such data can produce two sets <strong>of</strong> values with<br />

similar st<strong>and</strong>ard deviations.<br />

3. Transforming can also reduce the influence <strong>of</strong><br />

outlying values on the results <strong>of</strong> an analysis.<br />



(e.g. suppose most values are around 10 in a<br />

data set with one value <strong>of</strong> 100.<br />

Then ln10 = 2.30 <strong>and</strong> ln100 = 4.61)<br />

EXAMPLE: A sample of 216 values of serum bilirubin (μmol/l)<br />

has mean = 60.7 and standard deviation 77.9.<br />

[Figure: histogram of the serum values in 216 patients with fitted normal distribution. The normal fit is terrible!]<br />

The data are transformed by using the ln function.<br />

Mean = 3.547 and standard dev. = 1.03<br />

[Figure: histogram of the ln(serum) values with fitted normal distribution; the ln values look reasonably normal.]<br />



Now suppose we want the range <strong>of</strong> values<br />

containing the central 95% <strong>of</strong> all patients. If data<br />

are normal, 95% <strong>of</strong> the population lie in<br />

mean ± 1.96 (st<strong>and</strong>ard deviations)<br />

[Sketch: standard normal with area 0.475 on each side of 0, between −1.96 and +1.96 (from standard normal table).]<br />

For the raw data, mean = 60.7 <strong>and</strong> s.d. = 77.9.<br />

Hence, interval could be 60.7 ± 1.96(77.9) which<br />

cannot be correct with the negative values.<br />

But the transformed data have approximately a<br />

normal distribution. For transformed data,<br />

mean = 3.547 <strong>and</strong> st<strong>and</strong>ard deviation = 1.030.<br />

Hence, 95% <strong>of</strong> the patients will have ln (serum)<br />

levels in the range<br />

3.547 ± 1.96(1.030)<br />



That is, 95% <strong>of</strong> distribution (or values) between<br />

3.547 – 2.019 <strong>and</strong> 3.547 + 2.019<br />

or 1.528 <strong>and</strong> 5.566<br />

Transforming back to original scale,<br />

e^1.528 = 4.61 and e^5.566 = 261.4<br />

Hence, 95% <strong>of</strong> patients would have serum levels<br />

between 4.61 <strong>and</strong> 261.4 μmol/l<br />
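The back-transformation can be scripted; a sketch (the small difference from 261.4 comes from rounding 1.96 × 1.030 to 2.019 in the notes):

```python
from math import exp

mean_ln, sd_ln = 3.547, 1.030  # summary of the ln(serum) values
lo = exp(mean_ln - 1.96 * sd_ln)
hi = exp(mean_ln + 1.96 * sd_ln)
print(round(lo, 2), round(hi, 1))  # 4.61 261.3
```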



REVIEW EXERCISES<br />

4. For the st<strong>and</strong>ard normal distribution find the following:<br />

(a) The area below –1.58.<br />

(b) The two points between which the central 85% <strong>of</strong> the area lies. (2 marks)<br />

5. In the Framingham Study, serum cholesterol levels were measured for a large number <strong>of</strong> healthy<br />

males. The population was then followed for 16 years. At the end <strong>of</strong> this time, the men were<br />

divided into two groups: those who had developed coronary heart disease <strong>and</strong> those who had not.<br />

The distributions <strong>of</strong> the initial serum cholesterol levels for each group were found to be<br />

approximately normal. Among individuals who eventually developed coronary heart disease, the<br />

mean serum cholesterol level was μ d = 244 mg/100 ml <strong>and</strong> the st<strong>and</strong>ard deviation was σ d = 51<br />

mg/100ml; for those who did not develop the disease, the mean serum cholesterol level was μ nd =<br />

219 mg/100 ml <strong>and</strong> the st<strong>and</strong>ard deviation was σ nd = 41 mg/100ml.<br />

(a) Suppose that an initial serum cholesterol level <strong>of</strong> 260 mg/100ml or higher is used to predict<br />

coronary heart disease. What is the probability <strong>of</strong> correctly predicting heart disease for a man<br />

who will develop it?<br />

(b)<br />

(c)<br />

What is the probability of predicting heart disease for a man who will not develop it?<br />

What is the probability of failing to predict heart disease for a man who will develop it?<br />

(3 marks)<br />

6. The length <strong>of</strong> human pregnancies from conception to birth varies according to a distribution that is<br />

approximately normal with mean 266 days <strong>and</strong> st<strong>and</strong>ard deviation 16 days.<br />

(a) What percent of pregnancies last less than 240 days (that’s about 8 months)?<br />

(b) What percent <strong>of</strong> pregnancies last between 240 <strong>and</strong> 270 days (roughly between 8 months <strong>and</strong> 9<br />

months)?<br />

(c) How long do the longest 20% of pregnancies last? (3 marks)<br />

1. The probability <strong>of</strong> recovery for patients who are administered an established treatment for a<br />

stomach complaint is 0.8. A r<strong>and</strong>om sample <strong>of</strong> 100 patients with the complaint is monitored.<br />

Suppose X is the binomial r<strong>and</strong>om variable for the number <strong>of</strong> patients in this sample who recover<br />

when the established treatment is used.<br />

(a) Specify the parameters <strong>of</strong> X.<br />

(b) Find the mean <strong>and</strong> st<strong>and</strong>ard deviation <strong>of</strong> X.<br />

(c) Find the probability that at least 90 <strong>of</strong> the patients administered the treatment recover. Here you<br />

should first verify that the normal approximation to the binomial distribution can be used.<br />

(d) In a trial involving a new drug for the treatment <strong>of</strong> this stomach complaint, 90 out <strong>of</strong> 100<br />

patients who are administered the new drug recover. What conclusion can you draw about the<br />

new drug? State your reason.<br />

(7 marks)<br />



SOLUTIONS<br />

4. [Note to markers: Since students only have access to a table with z values to two decimal places, be<br />

prepared to accept calculations based on the nearest values in the table. Many students will, <strong>of</strong> course,<br />

interpolate between table values.]<br />

(a) The area below −1.58 = 0.5 − Pr(0 < Z < 1.58)<br />

= 0.5 − 0.4429<br />

= 0.0571<br />

[Sketch: standard normal, shaded area below −1.58.]<br />

(b)<br />

Pr(0 < Z < 1.44) = 0.425<br />

The central 85% lies<br />

between –1.44 <strong>and</strong> + 1.44<br />

5. (a) For men who develop chd, Pr(X > 260) = Pr(Z > (260 − 244)/51)<br />

= Pr(Z > 0.314)<br />

= 0.5 − 0.1217<br />

= 0.3783<br />

(b) For men who do not develop chd, Pr(X > 260) = Pr(Z > (260 − 219)/41)<br />

= Pr(Z > 1)<br />

= 0.5 − 0.3413<br />

= 0.1587<br />

(c) The probability <strong>of</strong> failing to predict chd for a man who will develop it is 1 – 0.3783 = 0.6217<br />

6. X ~ N(266, 16²), i.e. X is normal with μ_X = 266 and σ_X² = 256<br />

(a) Pr(X < 240) = Pr(Z < (240 − 266)/16)<br />

= Pr(Z < −1.625)<br />

= 0.5 − Pr(0 < Z < 1.625)<br />

= 0.5 − 0.4479<br />

= 0.0521<br />

That is, 5.2% of pregnancies last less than 8 months.<br />

(b) Pr(240 < X < 270) = Pr(–1.625 < Z < (270 − 266)/16)<br />
= Pr(–1.625 < Z < 0.25)<br />
= 0.4479 + 0.0987<br />
= 0.5466, i.e. 54.7%<br />

(c) From the table, Pr(0 < Z < 0.842) ≈ 0.30, so 30% of the standard normal lies between 0 and z = 0.842 and the area above z = 0.842 is 0.20.<br />
∴ (x − 266)/16 = 0.842<br />
∴ x = 266 + 16(0.842) = 279.47 days<br />
That is, approximately 280 days or more.<br />

1. (a) n = 100; π = 0.8<br />
(b) μ_X = nπ = 80; σ_X = √(nπ(1 − π)) = √(80(0.2)) = 4.<br />
(c) μ_X ± 2σ_X gives 80 ± 2(4), or 72 to 88. Both values lie in the range of possible values 0 to 100, hence the normal approximation can be used. (1.96 instead of 2 is also acceptable.)<br />
Pr(X > 89.5) = Pr(Z > (89.5 − 80)/4)<br />
= Pr(Z > 2.375)<br />
= 0.5 – 0.4912<br />
= 0.0088<br />
(d) There is evidence that the new drug produces a greater number who recover from the stomach complaint than would be expected from the established treatment; the probability 0.0088 is very small for a recovery rate of 80%.<br />
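As a quick check (an addition to these notes, not part of the original solutions), the normal approximation in part (c) can be reproduced with Python's standard library; the exact normal CDF is used in place of a two-decimal z-table, so the last digit may differ slightly.

```python
# Normal approximation (with continuity correction) to the binomial
# probability Pr(X >= 90) from part (c). NormalDist is from the
# Python standard library.
from statistics import NormalDist
import math

n, pi = 100, 0.8
mu = n * pi                            # mean: n*pi = 80
sigma = math.sqrt(n * pi * (1 - pi))   # sd: sqrt(n*pi*(1-pi)) = 4

# continuity correction: Pr(X >= 90) is approximated by Pr(normal > 89.5)
p = 1 - NormalDist(mu, sigma).cdf(89.5)
print(round(p, 4))  # close to the table answer 0.0088
```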



SECTION 5<br />

This section defines sampling distributions, establishes the standard deviations of these distributions (called standard errors), and sets up confidence intervals for population means, differences between the means of two populations, proportions, and differences between proportions, based on random samples drawn from the populations.<br />

An outline <strong>of</strong> the Research Process<br />

The Distribution <strong>of</strong> Sample Means<br />

The St<strong>and</strong>ard Error <strong>of</strong> the Mean<br />

Confidence Interval for a Mean<br />

The t-distribution <strong>and</strong> Its Use<br />

Comparison <strong>of</strong> Two Independent Groups<br />

The Standard Error of the Difference Between Two Means<br />

Pooled Estimate for the Common Variance<br />

Comparison <strong>of</strong> Two Dependent Groups (Paired Data)<br />

Confidence Interval for a Proportion<br />

Confidence Interval for Difference Between Two Proportions<br />

Summary <strong>of</strong> Distributions <strong>and</strong> Confidence Intervals<br />

159<br />

Section 5


The Research Process in Two Situations<br />

Binomial<br />
Underlying population: Bernoulli outcomes (success or failure, Y = 1 or 0).<br />
Sample (n) → Statistics: the result of the study is the number of successes, X (binomial).<br />
Inference: use the probability of an outcome, or an estimate of the success proportion.<br />
e.g. Prevalence (π) of asthma in women aged 20 to 40. This can be estimated as the proportion (p) in a sample chosen from the population.<br />



Normal<br />
Underlying population: continuous outcomes, X ~ N(μ, σ²) say.<br />
Sample (n) → Statistics: the result of the study is the sample mean. How does the sample mean behave? What is the distribution of the sample mean X̄?<br />
Inference: use the probability of an outcome, or an estimate based on the sample mean.<br />
e.g. What is the mean resting pulse rate (μ) in beats per minute for men in the age range 20 to 25 years? The mean x̄ from the sample is an estimate of the mean μ in the population of all men in this age range.<br />



Sampling Distributions<br />

Statistical inference is the process of using information from a sample to infer something about the population from which the sample was drawn, thus completing the research loop just described.<br />
How reliable are these estimates for π and μ?<br />
To answer this question, focus first on the sample mean x̄ for a sample of size n, say. Proportions will be discussed later. The argument proceeds as follows:<br />
Successive samples of size n can be drawn from the population. These produce means x̄₁, x̄₂, x̄₃, x̄₄, … and these form what is called a distribution of sample means, X̄, which is quite different from the original distribution, X, of values in the population.<br />
The problem is now to find μ_X̄ and σ_X̄. [Here, σ_X̄ is the standard deviation of the distribution of means and hence is the “typical” variation in these means, i.e. the “typical” error.]<br />



The Distribution <strong>of</strong> Sample Means<br />

Suppose a population with distribution X has known mean μ_X and standard deviation σ_X.<br />
Ex: Female adult heights. Suppose μ_X = 169 cm and σ_X = 3.20 cm.<br />
A sample of size n = 4 drawn randomly from the population has values 163, 172, 166, 166, say, with mean x̄₁ = 667/4 = 166.8 cm.<br />
[Figure: the distribution of individual heights (X), centred at μ_X = 169 cm with σ_X = 3.20 cm; the four sample values and their mean x̄₁ are plotted on the axis.]<br />
The average x̄₁ is not as extreme as the individual values in the sample. x̄₁ is an estimate of μ_X (usually unknown in the real situation).<br />
A second sample of n = 4 gives x̄₂ = 170.5 cm.<br />
A third sample of n = 4 gives x̄₃ = 169.5 cm.<br />



If this process is continued we can obtain a distribution of sample means. What are the properties of this distribution? These will allow us to decide how well a sample mean estimates μ_X.<br />
[Figures: distributions of means for samples of size n = 10, n = 25 and n = 100. The population from which the samples are taken is Normal.]<br />



[Figures: distributions of means for samples of size n = 10, n = 25 and n = 100. The population from which the samples are taken is not Normal, but the sampling distributions are approximately normal.]<br />



Derivation:<br />
Suppose a random sample of size n is taken from a population with distribution X. The sample can be viewed as values from n random variables X₁, X₂, …, Xₙ, each with mean μ_X and variance σ²_X. X₁, X₂, …, Xₙ are independent (if the population is large) and identically distributed.<br />
A value, x̄, from one sample is one value of X̄, the distribution of sample means for samples of size n.<br />

Then<br />
X̄ = (1/n)(X₁ + X₂ + … + Xₙ)<br />
∴ μ_X̄ = (1/n)(μ_X₁ + μ_X₂ + … + μ_Xₙ)<br />
= (1/n)(nμ_X)   (X₁, X₂ etc identical)<br />
∴ μ_X̄ = μ_X<br />



The addition rule for the variance of independent random variables gives<br />
σ²_X̄ = (1/n)²σ²_X₁ + (1/n)²σ²_X₂ + … + (1/n)²σ²_Xₙ<br />
= (1/n)² nσ²_X<br />
= σ²_X / n<br />

[i.e. if T = aX + bY, then σ²_T = a²σ²_X + b²σ²_Y]<br />

Therefore, the standard deviation of the distribution of sample means is<br />
σ_X̄ = σ_X / √n<br />
The derivations of μ_X̄ = μ_X and σ_X̄ = σ_X/√n need not be known, but these two formulae are very important and you must know how to use them.<br />
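These two results can be checked empirically with a short simulation (a sketch added to these notes; Python is used for illustration rather than R-cmdr), using the female-height figures μ_X = 169 and σ_X = 3.20 from the earlier example:

```python
# Simulate many samples of size n = 4 and look at the distribution
# of their means: its mean should be close to mu_X = 169 and its
# standard deviation close to sigma_X / sqrt(n) = 3.20 / 2 = 1.60.
import random
import statistics

random.seed(1)
mu, sigma, n = 169.0, 3.20, 4

means = [
    statistics.mean(random.gauss(mu, sigma) for _ in range(n))
    for _ in range(20000)
]

print(round(statistics.mean(means), 2))   # close to 169
print(round(statistics.stdev(means), 2))  # close to 1.60
```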



Note: 1. σ_X̄ is called the standard error of the mean. (It is the “typical” deviation in the mean, i.e. a measure of the precision of the mean.)<br />
2. If μ_X = 169 and σ_X = 3.20 for heights of women, then for a sample of size n = 4, μ_X̄ = 169 and σ_X̄ = σ_X/√4 = 3.20/2 = 1.60.<br />
3. If the sample size n is greater than 4, σ_X̄ is smaller, meaning the distribution X̄ is more compact about μ_X̄ = μ_X.<br />

4. If X is normal, it can be shown that X̄ is normal no matter what the sample size.<br />
5. If X is not normal but n is large, then X̄ is approximately normal. (This result is a consequence of the Central Limit Theorem in note 6.)<br />
6. For random samples of size n, the sample means x̄ᵢ fluctuate around the population mean μ_X with standard error σ_X̄ = σ_X/√n. As n increases, the distribution fluctuates less and less, getting closer to a normal distribution.<br />



Example: Suppose a population has values which are normally distributed (distribution X) and μ_X = 7.9 with σ_X = 0.60.<br />
Find (i) Pr(X > 7.7);<br />
(ii) Pr(X̄ > 7.7), where X̄ is the distribution of means for samples of size n = 9.<br />
Solution:<br />
(i) Pr(X > 7.7) = Pr(Z > (7.7 − 7.9)/0.60) = Pr(Z > −0.333) = 0.6304<br />
(ii) Since μ_X̄ = 7.9 and σ_X̄ = σ_X/√n = 0.60/√9 = 0.2,<br />
Pr(X̄ > 7.7) = Pr(Z > (7.7 − 7.9)/0.2) = Pr(Z > −1) = 0.8413<br />
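The same two probabilities can be checked with Python's standard library (an added illustration; NormalDist gives the exact normal CDF, so the answers agree with the table values to the quoted accuracy):

```python
# (i) one observation X ~ N(7.9, 0.60^2); (ii) a sample mean of
# n = 9 observations, whose standard error is 0.60/sqrt(9) = 0.2.
from statistics import NormalDist
import math

mu, sigma, n = 7.9, 0.60, 9

p_single = 1 - NormalDist(mu, sigma).cdf(7.7)
p_mean = 1 - NormalDist(mu, sigma / math.sqrt(n)).cdf(7.7)

print(round(p_single, 4))  # about 0.6306 (table: 0.6304)
print(round(p_mean, 4))    # about 0.8413
```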



Example: Serum values for a sample of n = 216 give x̄ = 34.46 and s = 5.84. What is the standard error of x̄?<br />
Standard error = σ/√n, where σ is the (unknown) population standard deviation. In practice, we estimate σ by s.<br />
∴ estimated standard error = s/√n = 5.84/√216 = 0.397.<br />
Suppose the sample had been twice the size, n = 432, with the same mean and standard deviation. Estimated s.e. = 5.84/√432 = 0.281 (compare 0.397 for n = 216).<br />
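The effect of sample size is easy to verify directly (an added sketch): doubling n divides the standard error by √2, so quadrupling n would be needed to halve it.

```python
# Estimated standard error s/sqrt(n) for the serum example,
# and the effect of doubling the sample size.
import math

s = 5.84
se_216 = s / math.sqrt(216)      # about 0.397
se_432 = s / math.sqrt(432)      # about 0.281 = 0.397/sqrt(2)
print(round(se_216, 3), round(se_432, 3))
```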



A Confidence Interval for the Mean<br />
The problem here is to use sample data to find an estimate for an unknown population mean. This estimate reflects the random variation in the data collected, by establishing an interval in which we are fairly certain that the mean μ_X lies.<br />
As can be seen, this will complete the research loop concerning the unknown population.<br />

To motivate the procedure we work with the distribution of sample means, X̄, which is N(μ_X, σ²_X̄), or alternatively N(μ_X, σ²_X/n).<br />
First consider the standard Normal:<br />
[Figure: standard normal curve with 0.95 of the area between −1.96 and +1.96, and area 0.025 in each tail.]<br />



0.95 = Pr(−1.96 < Z < +1.96)<br />
= Pr(−1.96 < (X̄ − μ_X)/(σ_X/√n) < +1.96)<br />
= Pr(−1.96 σ_X/√n < X̄ − μ_X < +1.96 σ_X/√n)<br />
= Pr(μ_X − 1.96 σ_X/√n < X̄ < μ_X + 1.96 σ_X/√n)<br />

This result is used to construct a 95% confidence interval as follows:<br />
For a sample x₁, x₂, …, xₙ of n values from a population, we are said to be 95% confident that the sample mean satisfies<br />
μ_X − 1.96 σ_X/√n < x̄ < μ_X + 1.96 σ_X/√n<br />
But x̄ < μ_X + 1.96 σ_X/√n implies x̄ − 1.96 σ_X/√n < μ_X,<br />
while μ_X − 1.96 σ_X/√n < x̄ implies μ_X < x̄ + 1.96 σ_X/√n.<br />



Therefore, we are 95% confident that the unknown population mean μ_X satisfies<br />
x̄ − 1.96 σ_X/√n < μ_X < x̄ + 1.96 σ_X/√n<br />
Alternatively, we are 95% confident that the true population mean lies in the interval<br />
x̄ ± 1.96 σ_X/√n<br />
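As a small illustration (added here, using the interval formula just stated), a helper that returns x̄ ± 1.96 σ/√n; the numbers plugged in are illustrative:

```python
# 95% confidence interval for a mean when sigma is known.
import math

def ci_mean_known_sigma(xbar, sigma, n, z=1.96):
    half_width = z * sigma / math.sqrt(n)
    return xbar - half_width, xbar + half_width

# illustrative numbers: xbar = 8.4 hours, sigma = 1.5, n = 8
lo, hi = ci_mean_known_sigma(8.4, 1.5, 8)
print(round(lo, 2), round(hi, 2))  # about 7.36 to 9.44
```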



Notes: 1. The sample has produced an interval estimate for the unknown population mean.<br />
2. A 99% confidence interval replaces the value 1.96 by 2.58, since the tail areas beyond +2.58 and −2.58 are both 0.005.<br />
[Figure: standard normal curve with 0.99 of the area between −2.58 and +2.58, and area 0.005 in each tail.]<br />
3. The 99% confidence interval x̄ ± 2.58 σ_X/√n is wider, hence less precise, but we are now 99% certain μ_X is in this interval.<br />
4. As n increases, σ_X/√n decreases and the confidence interval is narrower, meaning a more precise estimate; i.e. a large sample leads to greater accuracy.<br />



Example: A pharmacologist is investigating the length of time that a sedative is effective. Eight patients are selected at random for a study, and the eight times for which the sedative is effective have mean x̄ = 8.4 hours. (It is also known that the standard deviation for such measures is σ_X = 1.5 hours.) Find 95% and 99% confidence intervals for the true mean number of hours μ_X.<br />
Solution: Here, n = 8; x̄ = 8.4; σ_X̄ = 1.5/√8 = 0.53 (assuming that X is normal).<br />
The 95% confidence interval is 8.4 ± 1.96(0.53), or 8.4 ± 1.04.<br />
That is, 7.36 < μ_X < 9.44, or (7.36, 9.44).<br />
The 99% confidence interval is 8.4 ± 2.58(0.53), or 8.4 ± 1.37.<br />
That is, 7.03 < μ_X < 9.77, or (7.03, 9.77).<br />
The second interval is much wider.<br />



Example: The pharmacologist is required to find the value of μ_X to within 15 minutes with 95% confidence. Assuming that the standard deviation is σ_X = 1.5 hours, find the size of the sample which must be taken in order to achieve this accuracy.<br />
Solution: Since 15 minutes is ¼ hour, for a sample of size n we need x̄ ± ¼ to be an interval which is wider than x̄ ± 1.96 σ_X/√n, i.e. x̄ ± 1.96(1.5)/√n.<br />
∴ 1.96(1.5)/√n ≤ ¼<br />
Rearranging, 1.96(1.5)(4) ≤ √n, or 11.76 ≤ √n.<br />
Squaring, n ≥ 138.3.<br />
Hence, 139 patients must be selected.<br />
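The rearrangement above amounts to a standard sample-size formula, n ≥ (1.96 σ / d)², where d is the required half-width; a minimal sketch:

```python
# Smallest n with 1.96*sigma/sqrt(n) <= d (d = required accuracy).
import math

def sample_size(sigma, d, z=1.96):
    return math.ceil((z * sigma / d) ** 2)

print(sample_size(1.5, 0.25))  # 139, as in the worked example
```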



Use of t-table when σ_X is unknown<br />
In all practical contexts, σ_X is not known. In this case it is estimated in the best possible way by the sample standard deviation s_X. In this situation, the t-table provides alternative larger values in place of 1.96 and 2.58.<br />
The confidence intervals are wider and hence there is less precision.<br />
The 95% confidence interval is<br />
x̄ − t_ν s_X/√n < μ_X < x̄ + t_ν s_X/√n<br />
where ν = n − 1 is the “number of degrees of freedom” and t_ν is found in the appropriate column in the t-table for 95% confidence (see table at end of notes).<br />
Note: ν = n − 1 is the divisor in the estimate s²_X for the variance.<br />



Exercise: Now suppose that the pharmacologist did not know the value of σ_X and was forced to take the sample standard deviation from the sample of size n = 8 as the best estimate of σ_X, namely s_X = 1.5 hours. Find 95% and 99% confidence intervals for μ_X.<br />
Solution: x̄ = 8.4 and estimated standard error = s_X/√n = 1.5/√8 = 0.53.<br />
The 95% confidence interval for the mean sedative time μ_X for all such patients is<br />
8.4 ± t₇(0.53), where t₇ = 2.365<br />
That is, 8.4 ± 1.25, or 7.15 < μ_X < 9.65.<br />
The 99% interval is<br />
8.4 ± t₇(0.53), where t₇ = 3.500<br />
That is, 8.4 ± 1.86, or 6.54 < μ_X < 10.26.<br />
Both are wider than before.<br />
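The same t-based interval computed directly (an added check; the multiplier t₇ = 2.365 is taken from the t-table in these notes rather than computed, since Python's standard library has no t quantile function):

```python
# 95% t interval: xbar ± t7 * s/sqrt(n), with t7 = 2.365 (nu = 7).
import math

xbar, s, n = 8.4, 1.5, 8
t7 = 2.365              # table value for nu = n - 1 = 7
se = s / math.sqrt(n)
lo, hi = xbar - t7 * se, xbar + t7 * se
print(round(lo, 2), round(hi, 2))  # about 7.15 to 9.65
```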



Student’s t distribution<br />
p refers to the area of one tail; 2p gives the combined area of both tails. (View the t-distribution as a slight modification of the normal distribution Z.)<br />
ν      2p: 0.100  0.050  0.020  0.010<br />
        p: 0.050  0.025  0.010  0.005<br />
1 to 6     (values omitted here)<br />
7          1.895  2.365  2.998  3.500<br />
8 to 120   (values omitted here)<br />
∞          1.645  1.960  2.326  2.576<br />



Notes: 1. The interval is wide when samples are small. That is, there is less precision in the estimates.<br />
2. This last example is the most common situation, where: the population is assumed to be normal; μ_X and σ_X are both unknown; σ_X is estimated by s_X from a random sample of size n.<br />
3. Even for large n the t-table is used. The last row of the table (ν = ∞) contains the normal values 1.96 and 2.58.<br />
4. From the point of view of exams, we shall accept the normal distribution value for degrees of freedom greater than 30.<br />



Example: Tablets must be produced which weigh 200 milligrams. A sample of n = 20 is chosen from the production line, giving x̄ = 201.7 mg and s_X = 5.13 mg. Does this sample confirm that μ_X = 200 mg?<br />
Solution: There are 19 degrees of freedom, and t₁₉ = 2.093 for a 95% confidence interval. Therefore,<br />
201.7 − 2.093(5.13)/√20 < μ_X < 201.7 + 2.093(5.13)/√20<br />
or 199.3 < μ_X < 204.1<br />
The weight of 200 milligrams lies in this interval. Hence, 200 milligrams is an acceptable value of the mean μ_X with 95% confidence.<br />



The Meaning of a Confidence Interval<br />
[Figure: 100 confidence intervals (samples 1 to 100) plotted against the true mean μ_X = 200, with the interval (199.3, 204.1) marked; the interval from sample 5 does not include μ_X = 200 mg.]<br />
In general, if 100 different samples construct 100 intervals, then about five of the 100 will miss μ_X if we are working at the 95% confidence level.<br />
(This is the possible error which must be accepted. With 99% confidence intervals, which are wider, only about one in 100 will miss μ_X.)<br />



100 Confidence Intervals (95%)<br />
[Figure: the same 100 intervals (samples 1 to 100), drawn without the true mean marked.]<br />
In the above, the position of the true mean μ_X is unknown. Also, in practice we only have one of the above intervals. We say we are 95% confident the true mean lies in this interval.<br />
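This picture can be simulated (an added sketch; σ is treated as known for simplicity, whereas the tablet example estimates it by s_X):

```python
# Draw 100 samples from N(200, 5.13^2), build a 95% interval from
# each, and count how many intervals contain the true mean.
import math
import random
import statistics

random.seed(4)
mu, sigma, n, z = 200.0, 5.13, 20, 1.96

hits = 0
for _ in range(100):
    sample = [random.gauss(mu, sigma) for _ in range(n)]
    xbar = statistics.mean(sample)
    half = z * sigma / math.sqrt(n)   # sigma assumed known here
    if xbar - half <= mu <= xbar + half:
        hits += 1

print(hits)  # typically about 95 of the 100 intervals contain mu
```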





Example:<br />
It is claimed that males committed for trial for minor offences are spending more time in prison on remand than females committed for trial for similar offences. A sample of 40 females and 49 males awaiting trial gave the following information. The outcome measure is time on remand (X days).<br />
                                 Female   Male<br />
Sample mean (x̄ᵢ)                  16.3    29.5<br />
Sample standard deviation (sᵢ)     14.6    17.2<br />
Sample size (nᵢ)                     40      49<br />
The difference between the sample means is<br />
x̄_M − x̄_F = 29.5 − 16.3 = 13.2 days<br />
Is this an important difference?<br />



If μ_M and μ_F are the population mean times for males and females, a 95% confidence interval for μ_M − μ_F is<br />
x̄_M − x̄_F ± 1.96 √(s²_M/n_M + s²_F/n_F)<br />
= 13.2 ± 1.96 √((17.2)²/49 + (14.6)²/40)<br />
= 13.2 ± 6.61<br />
= (6.59, 19.81)<br />
or 6.59 < μ_M − μ_F < 19.81<br />
The population male remand time is likely to be between 6.59 and 19.81 days longer than that for females (alternatively, the true mean difference is between 6.59 and 19.81 days).<br />
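The interval can be reproduced from the summary statistics (an added check using the large-sample formula above):

```python
# Large-sample 95% CI for mu_M - mu_F from the remand summary data.
import math

xm, sm, nm = 29.5, 17.2, 49   # males
xf, sf, nf = 16.3, 14.6, 40   # females

diff = xm - xf
se = math.sqrt(sm**2 / nm + sf**2 / nf)
lo, hi = diff - 1.96 * se, diff + 1.96 * se
print(round(lo, 2), round(hi, 2))  # about 6.59 to 19.81 days
```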



Case 2: Comparing means when samples are small<br />
In this situation the CLT no longer holds for the difference between the sample means. Instead we need to assume that the population from which each sample is drawn is normally distributed; this should be the case if the populations for the two small samples are normal.<br />
In addition to assuming normality, we assume the two populations have equal variances.<br />



If σ²₁ and σ²₂ are similar and equal to σ², say, then the 95% confidence interval for μ₁ − μ₂ is<br />
(x̄₁ − x̄₂) ± 1.96 σ √(1/n₁ + 1/n₂)<br />
i.e. (x̄₁ − x̄₂) − 1.96 σ √(1/n₁ + 1/n₂) < μ₁ − μ₂ < (x̄₁ − x̄₂) + 1.96 σ √(1/n₁ + 1/n₂)<br />
The common variance σ² needs to be estimated from sample data. If both populations have the same variance, the best estimate for σ² is found when the variation in both samples is averaged to give the pooled estimate s²_p, where<br />
s²_p = [(n₁ − 1)s²₁ + (n₂ − 1)s²₂] / (n₁ + n₂ − 2)<br />
with s²₁ = Σ(x₁ᵢ − x̄₁)²/(n₁ − 1) and s²₂ = Σ(x₂ᵢ − x̄₂)²/(n₂ − 1).<br />
When sample estimates for the variances are used, replace 1.96 by the t-value to get<br />
(x̄₁ − x̄₂) ± t_ν s_p √(1/n₁ + 1/n₂)<br />
with degrees of freedom ν = n₁ + n₂ − 2.<br />



Example 3: The following data are 24-hour total energy expenditures (MJ/day) in groups of lean and obese patients (1986 study).<br />
Lean (n = 13): 6.13, 7.05, 7.48, 7.48, 7.53, 7.58, 7.90, 8.08, 8.09, 8.11, 8.40, 10.15, 10.88 (mean 8.066, s.d. 1.238)<br />
Obese (n = 9): 8.79, 9.19, 9.21, 9.68, 9.69, 9.97, 11.51, 11.85, 12.79 (mean 10.298, s.d. 1.398)<br />
Question: Is there a difference in energy expenditure between lean and obese patients?<br />
Possible explanations for the difference between samples in situations such as this:<br />
1. bias (need to randomise)<br />
2. confounding (e.g. gender, age)<br />
3. chance (random variation)<br />
4. true difference<br />
The methods we discuss in the next few lectures assume that bias and confounding are not the explanation.<br />



n₁ = 13; x̄₁ = 8.066; s₁ = 1.238 (lean)<br />
n₂ = 9; x̄₂ = 10.298; s₂ = 1.398 (obese)<br />
Solution: x̄₂ − x̄₁ = 2.232 (obese − lean)<br />
s²_p = [(13 − 1)(1.238)² + (9 − 1)(1.398)²] / (13 + 9 − 2)<br />
= [12(1.533) + 8(1.954)] / 20<br />
= 1.7014<br />
∴ s_p = √1.7014 = 1.304<br />
ν = 20, giving t₂₀ = 2.086 for a 95% interval.<br />
∴ The 95% confidence interval is<br />
2.232 ± 2.086(1.304)√(1/13 + 1/9)<br />
or 2.232 ± 1.180<br />
That is, 1.05 < μ_obese − μ_lean < 3.41 MJ/day<br />
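The pooled calculation can be checked directly (an added sketch; t₂₀ = 2.086 is the table value for ν = 20):

```python
# Pooled-variance 95% CI for the energy-expenditure example.
import math

n1, x1, s1 = 13, 8.066, 1.238   # lean
n2, x2, s2 = 9, 10.298, 1.398   # obese

sp2 = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)
sp = math.sqrt(sp2)             # pooled sd, about 1.304
t20 = 2.086
half = t20 * sp * math.sqrt(1 / n1 + 1 / n2)

diff = x2 - x1                  # 2.232 (obese - lean)
lo, hi = diff - half, diff + half
print(round(lo, 2), round(hi, 2))  # about 1.05 to 3.41 MJ/day
```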



Note: This confidence interval tells us that we<br />

can be 95% sure that the true difference in energy<br />

expenditure between obese <strong>and</strong> lean patients is<br />

between 1.05 <strong>and</strong> 3.41 MJ/day.<br />

Since this interval is entirely positive, it means<br />

that we can conclude that lean patients have lower<br />

energy expenditure than obese patients.<br />

Notes: 1. ν = n₁ + n₂ − 2 is the divisor in the formula for s²_p, the variance estimate (the degrees of freedom are always the divisor in the variance estimate, e.g. n − 1 in the single-sample case).<br />
2. Both populations should have values which are normally distributed if the samples are small.<br />
3. The two population variances, σ²₁ and σ²₂, should be approximately equal. (Otherwise we may need to transform the data or use another test.) R-cmdr has an option which checks this.<br />



4. The two samples from the two populations are random and independent of each other.<br />
5. Testing whether μ₁ = μ₂ can be achieved by seeing whether μ₁ − μ₂ = 0 is plausible, i.e. checking whether 0 lies in the confidence interval for the difference.<br />
6. It is possible to obtain the probability value associated with the study outcome value of 2.232 (see later).<br />

Example: A nutrition scientist is assessing a weight-loss programme to evaluate its effectiveness. Ten people are randomly selected; their initial weight is recorded, and their follow-up weight 20 weeks later.<br />

Subject   Initial Weight (x_Ii)   Weight at follow-up (x_Fi)<br />

1 180 165<br />

2 142 138<br />

3 126 128<br />

4 138 136<br />

5 175 170<br />

6 205 197<br />

7 116 115<br />

8 142 128<br />

9 157 144<br />

10 136 130<br />



Find a 95% confidence interval for the reduction in weight (assuming the two sets of values are independent).<br />
x̄_I = 151.7   x̄_F = 145.1<br />
s²_I = 750.76   s²_F = 620.01<br />
s²_p = [9(750.76) + 9(620.01)] / 18 = 685.17<br />
Since ν = 18, giving t₁₈ = 2.101, we get<br />
(151.7 − 145.1) ± 2.101 √685.17 √(1/10 + 1/10)<br />
or 6.6 ± 24.6<br />
That is, −18.0 < μ_I − μ_F < 31.2<br />
Note 1: Since the confidence interval includes 0, we conclude there is no evidence to indicate that the weight-loss programme has altered weights.<br />
Note 2: In this study the two sets of data are not independent: one person produces two values here.<br />



Case 3: Comparing means for matched (paired) data<br />
It is natural to consider the differences dᵢ in the weights for each person rather than considering the two samples separately. The dᵢ are now the data, and a confidence interval is constructed for the average difference μ_d based on the single sample of differences. The 95% confidence interval is<br />
d̄ ± t_ν s_d/√n<br />
where d̄ is the average of the dᵢ, n is the number of data pairs, ν = n − 1, and s_d is the standard deviation of the differences. We have<br />
s_d = √[ Σ(dᵢ − d̄)² / (n − 1) ]<br />
with n − 1 degrees of freedom.<br />



Example: Weight-loss programme again<br />
Subject   x_Ii   x_Fi   dᵢ = x_Ii − x_Fi   (dᵢ − d̄)²<br />
1    180   165   15    70.56<br />
2    142   138    4     6.76<br />
3    126   128   –2    73.96<br />
4    138   136    2    21.16<br />
5    175   170    5     2.56<br />
6    205   197    8     1.96<br />
7    116   115    1    31.36<br />
8    142   128   14    54.76<br />
9    157   144   13    40.96<br />
10   136   130    6     0.36<br />
Total            66   304.40<br />
d̄ = 66/10 = 6.6<br />
s²_d = Σ(dᵢ − d̄)²/(n − 1) = 304.4/9 = 33.82<br />
ν = n − 1 = 9, giving t_ν = 2.262 for a 95% interval.<br />
The 95% confidence interval for the average difference is<br />
6.6 ± 2.262 √33.82/√10<br />
or 6.6 ± 4.2<br />
That is, 2.4 < μ_d < 10.8<br />
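Computed from the raw pairs (an added check; t₉ = 2.262 is the table value for ν = 9):

```python
# Paired 95% CI: dbar ± t9 * s_d/sqrt(n) from the weight data.
import math
import statistics

initial  = [180, 142, 126, 138, 175, 205, 116, 142, 157, 136]
followup = [165, 138, 128, 136, 170, 197, 115, 128, 144, 130]

d = [i - f for i, f in zip(initial, followup)]
dbar = statistics.mean(d)        # 6.6
sd = statistics.stdev(d)         # sqrt(33.82) or about 5.82
t9 = 2.262
half = t9 * sd / math.sqrt(len(d))

lo, hi = dbar - half, dbar + half
print(round(lo, 1), round(hi, 1))  # about 2.4 to 10.8
```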



There is evidence that the weight-loss programme has reduced weights, since the difference of 0 is not in this interval (we are 95% sure).<br />
Notes: (1) The “profile” of each person is held constant in this study because the same person has produced the two values.<br />
(2) A test involving paired data based on d̄ is called a paired t-test. The earlier test on μ₁ − μ₂ is called an unpaired t-test.<br />
(3) Negative differences are possible in this analysis when subtracting. Be consistent with the subtraction process.<br />



Confidence Intervals for a Proportion<br />
Suppose X is a binomial distribution with parameters n and π (i.e. the number of “successes” lies between 0 and n). Then<br />
μ_X = nπ   and   σ_X = √(nπ(1 − π))<br />
Suppose one sample produces a proportion of successes p = x/n in n trials. Many such samples can be taken to get different values of p. The resulting distribution (P) of these proportions is approximately normal (by the Central Limit Theorem). It follows that<br />
P = X/n<br />
where X is binomial. The mean and standard deviation of P are then<br />



μ_P = (1/n) μ_X = (1/n) nπ = π

and, since σ_P² = (1/n)² σ_X² = nπ(1 − π)/n²,

σ_P = √( π(1 − π)/n )

The sample proportion (p) estimates the unknown true population proportion (π) (e.g. the prevalence of asthma in women is not known). Thus the estimated standard error is

√( p(1 − p)/n )

and the 95% confidence interval for π is

p ± 1.96 √( p(1 − p)/n )

Note: 1.96 (or the 99% equivalent 2.58) is always used for confidence intervals for proportions. (If the sample is small, the normal distribution is not a good approximation.)

Example: A random sample of 500 Aucklanders taken in 1996 had 173 supporting aerial spraying to eradicate tussock moth. Estimate the proportion (π) of Aucklanders who support this.

Solution:

p = x/n = 173/500 = 0.346

and

√( p(1 − p)/n ) = √( 0.346(1 − 0.346)/500 ) = 0.021

The 95% confidence interval is

0.346 ± 1.96(0.021), or 0.346 ± 0.041

Therefore, 0.305 < π < 0.387

We are 95% sure that between 30.5% and 38.7% of the Auckland population support the spraying.



Note: Alternatively, we could say 34.6% of the population support spraying with a margin of error of 4.1%. But the "margin of error" concept must be used with caution. It is reasonable if the value of p lies between 0.3 and 0.7, but the margin of error should be adjusted if p is outside this range. (We omit this adjustment.)
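The spraying example can be reproduced in a few lines of Python (a sketch; the course uses R-cmdr). The exact standard error is 0.0213, so the endpoints differ slightly from the text's (0.305, 0.387), which uses the rounded value 0.021.

```python
import math

# 95% CI for a proportion: x = 173 supporters in n = 500 sampled.
x, n = 173, 500
p = x / n                                  # 0.346
se = math.sqrt(p * (1 - p) / n)            # about 0.021
lo, hi = p - 1.96 * se, p + 1.96 * se      # about (0.304, 0.388)
```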

Example: An epidemiologist wants to estimate the proportion of women with asthma. Find the sample size (n) needed to give an estimate for this proportion with an error of no more than 0.03 with 95% confidence.

Solution: The largest possible value of p(1 − p) occurs when p = 1/2 (verify this by choosing several p values). The most conservative (or safest) sample size is obtained using this value p = 1/2. The requested accuracy requires the confidence interval p ± 0.03 to be the largest interval. But the actual interval is

p ± 1.96 √( 0.5(1 − 0.5)/n )

for sample size n. Therefore

1.96 √( 0.5(1 − 0.5)/n ) < 0.03

∴ (1.96)²(0.5)(0.5)/n < (0.03)²

∴ n > (1.96)²(0.5)(0.5)/(0.03)² = 1067.11

It follows that 1068 women must be tested.
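The sample-size inequality rearranges to a one-line calculation; a sketch in Python:

```python
import math

# Conservative sample size for estimating a proportion to within 0.03
# with 95% confidence, using p = 0.5 to maximise p(1 - p).
z, error, p = 1.96, 0.03, 0.5
n = z ** 2 * p * (1 - p) / error ** 2      # 1067.11...
n_required = math.ceil(n)                  # round up: 1068 women
```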

Now consider the Confidence Interval for the Difference Between Two Proportions (derivation not examined).

The difference π₁ − π₂ is estimated by p₁ − p₂, where p₁ = x₁/n₁ and p₂ = x₂/n₂ for the two samples.

The distribution P₁ − P₂ of sample proportion differences is a normal distribution with

μ_{P₁−P₂} = π₁ − π₂

and standard deviation (standard error)

σ_{P₁−P₂} = √( π₁(1 − π₁)/n₁ + π₂(1 − π₂)/n₂ )

using the addition rule for the mean and variance of two independent random variables, P₁ and P₂.

If π₁ and π₂ are estimated from sample data, the 95% confidence interval is

(p₁ − p₂) ± 1.96 √( p₁(1 − p₁)/n₁ + p₂(1 − p₂)/n₂ )

Exercise: To study the effectiveness of a drug for arthritis, two samples of patients were randomly selected. One sample of 100 was injected with the drug; the other sample of 60 received a placebo injection. After a period of time the patients were asked if their arthritic condition had improved. Results were

                  EXPOSURE
               DRUG (1)   PLACEBO (2)
IMPROVED          59          22
NOT IMPROVED      41          38
TOTAL            100          60



Solution: The proportions improved are

p₁ = 59/100 = 0.59 and p₂ = 22/60 = 0.37

p₁ − p₂ = 0.22, and the estimated standard error of the difference between the proportions is

√( 0.59(1 − 0.59)/100 + 0.37(1 − 0.37)/60 ) = 0.0794

The 95% confidence interval is

0.22 ± 1.96(0.0794), or 0.22 ± 0.156

or 0.064 < π₁ − π₂ < 0.376

Since 0 is excluded from the interval and the interval is positive, there is evidence that π₁ − π₂ > 0. That is, we conclude the proportion improved is higher when the drug is used.
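The same interval can be computed directly from the counts; a sketch in Python. Working with the exact p₂ = 22/60 = 0.3667 (the text rounds to 0.37) gives endpoints close to (0.064, 0.376).

```python
import math

# 95% CI for the difference of two proportions: arthritis drug vs placebo.
x1, n1 = 59, 100   # improved on drug
x2, n2 = 22, 60    # improved on placebo
p1, p2 = x1 / n1, x2 / n2
diff = p1 - p2                                           # about 0.22
se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)  # about 0.0793
lo, hi = diff - 1.96 * se, diff + 1.96 * se              # interval excludes 0
```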



REVIEW EXERCISES

2. A population is known to be normally distributed with a mean μ_x = 60 and standard deviation σ_x = 15. Let X̄ be the distribution of means of samples of size 25 drawn from the population.

(a) Define completely the probability distribution of X̄.
(b) What is the probability that a value in the population will lie between 55 and 65?
(c) What is the probability that the mean of a sample of size 25 will lie between 55 and 65? (4 marks)

3. Large studies indicate that the mean cholesterol level in children aged 2–14 is 175 mg%/mL and the standard deviation is 30 mg%/mL. The problem here is to see if there is a familial aggregation of cholesterol levels. A group of fathers who have had a heart attack and have elevated cholesterol levels (≥ 250 mg%/mL) are identified. The cholesterol levels of their offspring within the 2–14 age range are measured. The mean cholesterol level in a group of 100 such children is 207.3 mg%/mL. The problem is to decide if this value is sufficiently far from 175 mg%/mL for us to believe that the underlying mean cholesterol level μ in the population of all children selected in this way is greater than 175 mg%/mL.

(a) Construct a 95% confidence interval for μ on the basis of the sample data. State your conclusion about familial aggregation of cholesterol levels. (2 marks)

(b) Find the probability of obtaining the sample mean of 207.3 mg%/mL or a value which is greater under the assumption that there is no familial aggregation. State your conclusion from this probability. (2 marks)

4. Patients with chronic kidney failure may be treated by dialysis, using a machine that removes toxic wastes from the blood, a function normally performed by the kidneys. Kidney failure and dialysis can cause other changes, such as retention of phosphorus, that must be corrected by changes in diet. A study of the nutrition of dialysis patients measured the level of phosphorus in the blood on six occasions. Here are the data for one patient (milligrams of phosphorus per decilitre of blood):

5.5 6.1 4.8 5.8 6.2 4.6

The measurements are separated in time and can be considered a random sample of the patient's blood phosphorus level.

(a) If the level varies normally with σ = 0.8 mg/dl, find a 95% confidence interval for the mean blood phosphorus level of this patient. (1 mark)

(b) If the value of σ is unknown but estimated by the sample standard deviation s = 0.669, find a 95% confidence interval for the mean blood phosphorus level of this patient. (1 mark)

(c) The normal range of phosphorus in the blood is considered to be 2.6 to 4.8 mg/dl. Is there evidence that the patient has a mean phosphorus level that exceeds 4.8? Explain. (1 mark)

5. A salmon fishing company is monitoring the weight of salmon in its ponds prior to harvest. A pilot sample of ten fish, randomly selected, shows a mean weight of 2.31 kilograms with a standard deviation of 0.17 kilogram.

(a) Obtain a 95% confidence interval for the mean weight of all salmon in the ponds. (2 marks)

(b) Using the standard deviation from the pilot survey as an estimate of the true variation of weights of salmon in the ponds, establish how many fish should be sampled to obtain an estimate of the mean weight of all the salmon in the ponds to within 0.03 kilogram with 95% confidence. (Take 2 as an approximation to the value of t.) (3 marks)



SOLUTIONS

2. (a) X̄ is a normal distribution with μ_X̄ = 60 and σ_X̄ = 15/√25 = 3,
   i.e. X̄ ~ N(60, 9)

(b) Pr(55 < X < 65) = Pr( (55 − 60)/15 < Z < (65 − 60)/15 )
    = Pr(−0.33 < Z < 0.33)
    = 2(0.1293)
    = 0.2586 approx.

(c) Pr(55 < X̄ < 65) = Pr( (55 − 60)/3 < Z < (65 − 60)/3 )
    = Pr(−1.67 < Z < 1.67)
    = 2(0.4525)
    = 0.9050 approx.
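Both probabilities can be computed without a Z-table using the error function; a sketch in Python. The exact values differ slightly from the table-based answers because the table working rounds z to 0.33 and 1.67.

```python
import math

def phi(z):
    # Standard normal cumulative distribution function via the error function.
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

mu, sigma, n = 60, 15, 25
# (b) probability a single population value lies between 55 and 65
p_single = phi((65 - mu) / sigma) - phi((55 - mu) / sigma)   # about 0.26
# (c) probability the mean of a sample of size 25 lies between 55 and 65
se = sigma / math.sqrt(n)                                    # 15/5 = 3
p_mean = phi((65 - mu) / se) - phi((55 - mu) / se)           # about 0.90
```

The sample mean is far more likely to fall in (55, 65) than a single value, because its standard deviation is σ/√n = 3 rather than 15.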


REVIEW EXERCISES

2. The extent to which X-rays can penetrate tooth enamel has been suggested as a suitable mechanism for differentiating between males and females in forensic medicine. Listed below in appropriate units are the 'spectropenetration gradients' for eight female teeth and eight male teeth:

Male (x₁)    4.9 5.4 5.0 5.5 5.4 6.6 6.3 4.3
Female (x₂)  4.8 5.3 3.7 4.1 5.6 4.0 3.6 5.0

The data give sample means x̄₁ = 5.4250, x̄₂ = 4.5125 and sample variances s₁² = 0.5536, s₂² = 0.5784.

(a) Calculate the pooled estimate for the variance common to the male and female populations. (1 mark)

(b) Estimate the standard error of the difference between the population means. (1 mark)

(c) Construct a 95% confidence interval for the difference between the two population means. (1 mark)

(d) What conclusion do you now draw about the procedure for differentiating between males and females? (1 mark)

SOLUTIONS

2. (a) s_p² = [ (n₁ − 1)s₁² + (n₂ − 1)s₂² ] / (n₁ + n₂ − 2)
       = [ 7(0.5536) + 7(0.5784) ] / (8 + 8 − 2)
       = 0.566

(b) Estimated standard error of the difference = √( 0.566(1/8 + 1/8) ) = 0.376

(c) The 95% confidence interval is x̄₁ − x̄₂ ± t₁₄(0.376).
    That is, (5.4250 − 4.5125) ± 2.145(0.376), or 0.9125 ± 0.8065,
    giving 0.106 < μ₁ − μ₂ < 1.719

(d) We are 95% sure that there is a difference in the mean tooth penetrations for males and females, since 0 does not lie in the confidence interval in (c). (Because the confidence interval is positive, the male tooth penetration will be greater.)
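The pooled calculation in (a)-(c) can be verified with a short script (a sketch; the critical value t₁₄ = 2.145 is read from the t-table):

```python
import math

# Pooled two-sample 95% CI for the tooth-enamel data.
n1 = n2 = 8
xbar1, xbar2 = 5.4250, 4.5125
s1_sq, s2_sq = 0.5536, 0.5784
sp_sq = ((n1 - 1) * s1_sq + (n2 - 1) * s2_sq) / (n1 + n2 - 2)  # 0.566
se = math.sqrt(sp_sq * (1 / n1 + 1 / n2))                      # about 0.376
t_crit = 2.145                                                 # t with nu = 14 df
lo = (xbar1 - xbar2) - t_crit * se
hi = (xbar1 - xbar2) + t_crit * se                             # about (0.106, 1.719)
```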



[A] DISTRIBUTION SUMMARY

1. Binomial (X): n trials; π is the probability of success (discrete)
   μ_X = nπ
   σ_X = √( nπ(1 − π) )

2. Normal (X): (continuous)
   Parameters are μ_X and σ_X

3. Standard Normal ( Z = (X − μ_X)/σ_X )
   Parameters are μ_Z = 0 and σ_Z = 1

4. Normal Approximation to the Binomial
   The original binomial has parameters n and π. The normal approximation has parameters μ_X = nπ, σ_X = √( nπ(1 − π) )

5. Distribution of Sample Means (X̄)
   Normal with μ_X̄ = μ_X and σ_X̄ = σ_X/√n. The standard deviation σ_X̄ is also called the standard error of the mean.



6. Distribution of Differences between Sample Means (X̄₁ − X̄₂)
   μ_{X̄₁−X̄₂} = μ_X̄₁ − μ_X̄₂ (or μ₁ − μ₂)
   σ_{X̄₁−X̄₂} = √( σ₁²/n₁ + σ₂²/n₂ ) = σ √( 1/n₁ + 1/n₂ ) if σ₁ = σ₂

7. Distribution of Sample Proportions (P)
   μ_P = π
   σ_P = √( π(1 − π)/n )

8. Distribution of Differences between Sample Proportions (P₁ − P₂)
   μ_{P₁−P₂} = π₁ − π₂
   σ_{P₁−P₂} = √( π₁(1 − π₁)/n₁ + π₂(1 − π₂)/n₂ )

Estimates for π, μ, σ are found from sample data and given by p, x̄ and s.



[B] SUMMARY: CONFIDENCE INTERVALS

1. Mean:
   x̄ ± t_ν s/√n with ν = n − 1 D.F.

2. Difference Between Means (small samples and independent, normal populations with equal variances):
   (x̄₁ − x̄₂) ± t_ν s_p √( 1/n₁ + 1/n₂ ) with ν = n₁ + n₂ − 2.
   Here, s_p² = [ (n₁ − 1)s₁² + (n₂ − 1)s₂² ] / (n₁ + n₂ − 2)
   Note: If the samples are ≥ 30, use x̄₁ − x̄₂ ± 1.96 √( s₁²/n₁ + s₂²/n₂ )

3. Difference Between Means (paired populations):
   d̄ ± t_ν s_d/√n with ν = n − 1

4. Proportion:
   p ± 1.96 √( p(1 − p)/n )

5. Difference Between Two Proportions:
   (p₁ − p₂) ± 1.96 √( p₁(1 − p₁)/n₁ + p₂(1 − p₂)/n₂ )



SECTION 6

This section reviews hypothesis testing, type 1 and type 2 errors, conclusive and inconclusive results and the power of a study.

Null and Alternative Hypotheses
Study Based and Data Driven Hypotheses
One and Two Sided Tests
Four Steps in the Hypothesis Testing Procedure
Examples
Pooled proportion estimate
Clinical and Ecological Importance
Conclusive and Inconclusive Results
Errors in Hypothesis Testing
Power of a Study
Examples



Hypothesis Testing

In most scientific studies we set up hypotheses beforehand about the treatments (or populations) which are the focus of the study. A null hypothesis (H₀) is a claim about a treatment which is assumed to be true unless the data collected in our study show substantial evidence against H₀. At the same time we propose a research or alternative hypothesis (H_A) which will be adopted if there is sufficient evidence against the null hypothesis.

There are two types of alternative hypotheses:

(i) a study based hypothesis, which implies that we do not know at the outset whether a new treatment is beneficial or possibly harmful, and

(ii) a data based hypothesis, which is suggested by the very nature of the collected data and which will usually suggest treatment benefit.



If the data suggest harm we are likely to terminate the study quickly, but if the data suggest benefit we need to know if the benefit is clinically important. The study based alternative will usually lead to a two sided test while the data based alternative will lead to a one sided test. In the literature, the two sided test is by far the most common.

There are FOUR STEPS in the standard hypothesis testing procedure.

Step (1) A null hypothesis (H₀) is assumed about a population parameter.

Step (2) An alternative (research) hypothesis is proposed. This is accepted if H₀ is rejected.



Step (3) A test statistic is computed from the data. It is the standardised value of a sample mean, sample proportion or sample difference obtained from the data. It is either a z-score (large sample) or a t-score (small sample) given by

test statistic = (observed sample value − null value) / (estimated standard error)

That is, the number of standard deviations from the null value to the sample value. It is this test statistic which allows calculation of the p-value associated with the outcome of a particular study.

Step (4) The probability of observing the value of the test statistic in step (3), or a value which is even more extreme, is calculated under the assumption that the null hypothesis is true. This probability is the p-value for the test statistic. (The test statistic has, of course, summarised the data in the study.) We draw appropriate conclusions if the p-value is less than 0.05.



Examples: Hypothesis Testing

Exercise: Suppose the resting pulse rates for young women are normally distributed with mean μ = 66 and standard deviation σ = 9.2 beats per minute. A drug for the treatment of a medical condition is administered to 100 young women and their average pulse rate is found to be x̄ = 68 beats per minute. Because the drug had for a long time been observed to increase pulse rates, test the claim that the drug does in fact increase pulse rates. (i.e. H_A is data based.)

Solution:

Step (1) H₀: μ = 66 (the null hypothesis)

Step (2) H_A: μ > 66 (the research hypothesis)

Step (3) x̄ = 68 from the sample data. Assuming H₀ is true, and noting that the population standard deviation is known, standardising x̄ leads to

z = (observed sample mean − null mean) / (standard error of the mean)



  = (x̄ − μ) / (σ/√n)
  = (68 − 66) / (9.2/√100)
  = 2.174

Step (4): Calculate the p-value assuming μ = 66.

[Sketch: normal curve centred at μ = 66, with the upper tail beyond x̄ = 68 (z = 2.174) shaded.]

p-value = Pr(X̄ > 68 given μ = 66)
        = Pr( Z > (68 − 66)/(9.2/√100) )
        = Pr( Z > 2.174 ) = 0.015

This means that if H₀ is true, there is only a probability of 0.015 of observing a sample mean as large as or larger than 68. Hence there is little support for H₀. Reject H₀ and conclude the mean pulse rate has been increased by the treatment.
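The four steps reduce to two lines of arithmetic once the normal tail area is available; a sketch in Python (the course uses R-cmdr, which reports the p-value directly):

```python
import math

# One-sided z-test for the pulse-rate example: H0 mu = 66 vs HA mu > 66.
mu0, sigma, n, xbar = 66, 9.2, 100, 68
z = (xbar - mu0) / (sigma / math.sqrt(n))     # about 2.174
# Upper-tail p-value from the standard normal distribution.
p_value = 0.5 * math.erfc(z / math.sqrt(2))   # about 0.015
```

Since the p-value is below 0.05, H₀ is rejected at the 5% level, matching the conclusion above.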



Notes: 1. R-cmdr and other statistical packages give a p-value directly beside the study result or the test statistic. If the p-value is less than 0.05 we have significance at the 5% level, and if the p-value is less than 0.01 we have significance at the 1% level.

2. If σ is unknown but estimated from the sample, the standardised statistic is t and the p-value is found from the t-table with the appropriate degrees of freedom. (The exact p-value is not available since only a few values are given at the top of the columns in the t-table.)

e.g. Suppose s = 9.2 rather than σ = 9.2 and the sample size is n = 100. Then

p-value = Pr( t > (68 − 66)/(9.2/√100) ) = Pr( t > 2.174 ) with 99 DF

t = 2.174 lies between the values in the columns headed p = 0.025 and p = 0.010. Hence the p-value lies between these two numbers (R-cmdr gives the exact value).



Exercise: In a large overseas city it was estimated that 15% of girls between the ages of 14 and 18 became pregnant. Concerned parents and health workers introduced an educational programme in an effort to lower this percentage. After four years of the programme, a random sample of n = 293 18-year-old girls revealed that 27 had become pregnant.

(a) Define null and alternative hypotheses for investigating whether the proportion becoming pregnant after the educational programme has decreased. (Suppose the alternative hypothesis is one sided.)

(b) Calculate the probability value.

(c) State your conclusion.

Step (1): H₀: π = 0.15 (15% still become pregnant)

Step (2): H_A: π < 0.15 (less than 15% become pregnant)



Step (3): The sample gives p = 27/293 = 0.092

z = (observed proportion − null proportion) / (standard error of the proportion)
  = (p − π) / √( π(1 − π)/n )
  = (0.092 − 0.15) / √( 0.15(1 − 0.15)/293 )   (under H₀: π = 0.15; use 0.15, not 0.092, in the standard error)
  = −2.78

Step (4): p-value = Pr(Z < −2.78) = 0.5000 − 0.4973 = 0.0027

[Sketch: standard normal curve with the lower tail below z = −2.78 shaded.]

There is evidence that after the education campaign the proportion becoming pregnant has reduced.
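The proportion test above can be reproduced in a few lines (a sketch in Python); note that the standard error uses the null value π₀ = 0.15, as emphasised in Step (3):

```python
import math

# One-sided z-test for a proportion: H0 pi = 0.15 vs HA pi < 0.15.
x, n, pi0 = 27, 293, 0.15
p = x / n                                               # about 0.092
se0 = math.sqrt(pi0 * (1 - pi0) / n)                    # SE under H0 (uses pi0, not p)
z = (p - pi0) / se0                                     # about -2.78
# Lower-tail p-value from the standard normal distribution.
p_value = 0.5 * (1 + math.erf(z / math.sqrt(2)))        # about 0.0027
```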



Exercise: The birthweight of a baby is thought to be associated with the smoking habits of the mother during pregnancy. The means and variances of the INDIVIDUAL values in the two samples of birthweights, one for non-smoking and the other for smoking mothers, are in the following table.

                        Mother non-smoker   Mother smoker
Sample Size (n_i)             100                50
Sample Mean (x̄_i)            3.45              3.30
Sample Variance (s_i²)        0.36              0.32

Investigate the claim that the mean birthweights are different in the two groups. In this case we shall suppose the alternative is study driven rather than data driven.

Step (1): H₀: μ_NS − μ_S = 0 (no difference in the mean birth weight)

Step (2): H_A: μ_NS − μ_S ≠ 0 (there is a difference in the mean birth weight)



Step (3): The sample gives x̄_NS − x̄_S = 3.45 − 3.30 = 0.15. Standardising gives the test statistic

t = (observed difference of means − null difference) / (estimated standard error of the difference)
  = (0.15 − 0) / ( s_p √( 1/100 + 1/50 ) )

where

s_p² = [ (n₁ − 1)s₁² + (n₂ − 1)s₂² ] / (n₁ + n₂ − 2)
     = [ 99(0.36) + 49(0.32) ] / 148
     = 0.3468

so

t = 0.15 / ( √0.3468 √( 1/100 + 1/50 ) ) = 0.15/0.102 = 1.47

(use of pooling optional)

Since the samples are large we can use the standard normal z in place of t with 148 degrees of freedom.



Step (4): In this case (two sided H_A)

p-value = Pr(|z| > 1.47)
        = Pr(z > 1.47 or z < −1.47)
        = 2(0.5 − 0.4292) = 2(0.0708) = 0.1416

[Sketch: standard normal curve with both tails beyond ±1.47 shaded.]

There is no evidence that the mean birthweights for the smoking and non-smoking groups are different.

Note: If the test had been one-sided [H_A: μ_NS − μ_S > 0],

p-value = Pr(z > 1.47) = 0.0708

There is again no evidence the non-smoking group has a greater mean birthweight than the smoking group.
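The pooled two-sample calculation can be checked end to end (a sketch in Python, using the large-sample z approximation exactly as the text does):

```python
import math

# Two-sided pooled test for the birthweight example (large samples, z approximation).
n1, xbar1, s1_sq = 100, 3.45, 0.36   # non-smoking mothers
n2, xbar2, s2_sq = 50, 3.30, 0.32    # smoking mothers
sp_sq = ((n1 - 1) * s1_sq + (n2 - 1) * s2_sq) / (n1 + n2 - 2)   # about 0.3468
se = math.sqrt(sp_sq * (1 / n1 + 1 / n2))                        # about 0.102
z = (xbar1 - xbar2) / se                                         # about 1.47
# Two-sided p-value: area in both tails beyond |z|.
p_value = math.erfc(abs(z) / math.sqrt(2))                       # about 0.14
```

Since the p-value is well above 0.05, the result is not significant, matching the conclusion above.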



Notes on Hypothesis Testing

1. There is some terminology for reporting the result of a test.

(a) If the p-value < 0.05 the result is "significant at the α = 0.05 level" (5% level), or "There is some evidence that …"

(b) If the p-value < 0.01 the result is "significant at the α = 0.01 level" (1% level), or "There is strong evidence that …"

(c) If the p-value > 0.05 the result is "not significant", or "There is no evidence that …"

In the above, α is generally a pre-selected cut-off value.



2. Choosing a smaller level of significance requires the test statistic to be more extreme before H₀ is rejected.

3. Whether the test is one or two sided depends on whether the alternative hypothesis is data based or study based.

4. If H_A is one sided, the p-value is the area in one tail of the distribution of the standardised test statistic.

5. If H_A is two-sided, the p-value is the area in the two tails of the distribution of the standardised test statistic.

6. If using the t-table, choose the column heading 2p for a two sided alternative hypothesis and p for a one sided alternative hypothesis.

7. When a test statistic leads to rejection of H₀, there are two possible explanations:

(a) H₀ is true but random variation has given an improbable test statistic.



(b) H₀ is not true, and the observed statistic is consistent with H_A.

The second alternative (b) is taken, but there is possible error. This error is the value α, the level of significance, which is usually 0.05 or 0.01. α is called the type one error. (It is the chance of a false conviction in a court of law; i.e. we must operate beyond reasonable doubt, hence choose a small α.)

8. In published work a p-value is quoted beside the study result (indicating whether a new treatment, say, has an effect) and a confidence interval is reported (giving some idea of the magnitude of an effect).

But one problem still remains when reporting conclusions from a scientific study. It is possible to obtain a result which is statistically significant (with a small p-value) yet from a clinical point of view the result is unimportant. That is, it is not clinically important. (Ecological importance is an equivalent concept.)



Example: There are two treatments for raising iron levels in infants, a standard treatment A and a new treatment B. A mean for treatment B that is 20 units greater than the mean for treatment A is recognised as a clinically important improvement which would lead to widespread introduction of treatment B. An experiment produces the following mean differences, x̄_B − x̄_A, with a 95% confidence interval. Decide in each case whether the p-value is less than or greater than 0.05. Report whether the scientific result is conclusive or inconclusive by considering clinical importance.

(a) Mean Diff = 40. Confidence interval is (33, 47).
The confidence interval does not include the null hypothesis value, so the p-value is less than 0.05 (a statistically significant result). The point estimate of 40 is in the direction indicating treatment benefit. The result is conclusive and there is evidence the benefit is enough to be important.



(b) Mean Diff = 36. Confidence interval is (18, 54).
p-value < 0.05. The result is conclusive. There is treatment benefit but it may not be as large as hoped.

(c) Mean Diff = 27. Confidence interval is (−4, 58).
p-value > 0.05 and the result is inconclusive. The confidence interval includes the H₀ value. The new treatment is probably better than treatment A but we cannot completely rule out the possibility that it is worse.

(d) Mean Diff = −7. Confidence interval is (−55, 41).
p-value > 0.05 and the result is inconclusive. The new treatment is likely to be harmful but we cannot rule out the possibility that there is a clinically important benefit.

(e) Mean Diff = −12. Confidence interval is (−34, 10).



p-value > 0.05 <strong>and</strong> result is conclusive. Any<br />

benefit is not clinically important <strong>and</strong> it is<br />

more likely there will be treatment harm.<br />

Treatment B should not be pursued as a<br />

potential treatment.<br />

(f) Mean Diff = –13. Confidence interval =<br />

(–19, –7)<br />

p-value < 0.05 <strong>and</strong> result very conclusive.<br />

The new treatment is harmful.<br />

(g) Mean Diff = 11. Confidence interval =<br />

(4, 18)<br />

p-value < 0.05. The result is conclusive.<br />

There is treatment benefit but not enough to<br />

lead to the introduction <strong>of</strong> treatment B.<br />

Note: In practice you decide what is clinically<br />

important. This is difficult but as you gain<br />

experience with your own area <strong>of</strong> research it<br />

becomes easier <strong>and</strong> you are able to critique any<br />

published research.<br />



Summary <strong>of</strong> previous results<br />

0 = null value<br />

20 = clinically important improvement.<br />

p-value < 0.05 implies confidence interval<br />

excludes the null value <strong>of</strong> zero.<br />

p-value > 0.05 implies null value included<br />

The result can be conclusive or inconclusive.<br />

[Diagram: the seven confidence intervals (a)–(g) drawn on a number line against the null value 0 and the clinically important value 20:<br />

(a) (33, 47)  (b) (18, 54)  (c) (–4, 58)  (d) (–55, 41)  (e) (–34, 10)  (f) (–19, –7)  (g) (4, 18)]<br />



(a) Conclusive p-value < 0.05<br />

(b) Conclusive p-value < 0.05<br />

(c) Inconclusive p-value > 0.05<br />

(d) Inconclusive p-value > 0.05<br />

(e) Conclusive p-value > 0.05<br />

(f) Conclusive p-value < 0.05<br />

(g) Conclusive p-value < 0.05<br />

Clearly, if the confidence interval is too large,<br />

there is greater chance for an inconclusive result.<br />
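The decision logic used in examples (a)–(g) can be written as a short program. This is an illustrative Python sketch only (the course software is R-cmdr); the function name `classify` is mine, and the null value 0 and clinically important value 20 are taken from the worked example.<br />

```python
# Classify a 95% confidence interval for a mean difference against the
# null value (0) and a clinically important difference (20), following
# examples (a)-(g): the p-value is below 0.05 exactly when the interval
# excludes 0, and the result is inconclusive when the interval contains
# both "no effect" and a clinically important effect.

def classify(lo, hi, null=0.0, important=20.0):
    significant = not (lo <= null <= hi)          # CI excludes the null value
    inconclusive = (lo <= null) and (hi >= important)
    return ("p<0.05" if significant else "p>0.05",
            "inconclusive" if inconclusive else "conclusive")

intervals = {"a": (33, 47), "b": (18, 54), "c": (-4, 58), "d": (-55, 41),
             "e": (-34, 10), "f": (-19, -7), "g": (4, 18)}
for case, (lo, hi) in intervals.items():
    print(case, classify(lo, hi))
```

Running this reproduces the answers above: (a), (b), (f) and (g) are significant and conclusive, (e) is conclusive despite p > 0.05, and (c) and (d) are inconclusive.<br />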



Example<br />

A clinical trial is set up to compare two drugs<br />

(pravastatin, A, <strong>and</strong> a control, B) for lowering<br />

cholesterol. The mean cholesterol reductions in<br />

the two groups are compared. The probability<br />

that such a study will correctly detect a clinically<br />

important difference between the effects <strong>of</strong> the<br />

drugs is called the power <strong>of</strong> the study. Power<br />

depends on the size <strong>of</strong> the difference, the<br />

variability <strong>of</strong> estimates, sample size, <strong>and</strong> the level<br />

<strong>of</strong> significance.<br />

Figure 12.4: 95% confidence intervals for different sample sizes<br />

[Diagram: 95% confidence intervals for the treatment difference at n = 10, 20, 50 and 200, plotted against a vertical axis running from "mean reduction greater in A" through 0 (no difference) to "mean reduction greater in B", with the target (clinically important) treatment difference marked. The intervals narrow as n increases.]<br />



If the two samples are <strong>of</strong> size 5 (giving total<br />

n = 10), the three 95% confidence intervals<br />

include zero difference <strong>and</strong> the important<br />

difference. As n increases, the confidence<br />

intervals become smaller <strong>and</strong> it is possible to<br />

detect the difference.<br />

NB 1. It is helpful to aim for a confidence<br />

interval which has diameter (or range) no<br />

greater than the clinically important treatment<br />

difference as in this case the result obtained<br />

must be conclusive (rather than inconclusive).<br />

2. If the clinically important effect size is large,<br />

the confidence interval can be wider <strong>and</strong> hence<br />

a smaller sample taken.<br />

3. A larger sample gives smaller confidence interval.<br />

4. Less r<strong>and</strong>om variation in the data gives<br />

smaller confidence interval. (That is, the value<br />

<strong>of</strong> σ is smaller.)<br />

5. A smaller level <strong>of</strong> significance (α), say 0.01,<br />

gives a wider confidence interval <strong>and</strong> hence<br />

smaller power as there is less chance <strong>of</strong><br />

detecting a clinically important effect in a<br />

conclusive way.<br />
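Notes 3–5 can be checked numerically. This is a minimal Python sketch (illustrative only), using the normal-approximation half-width z·σ·√(2/n) for the confidence interval of a difference between two means; the function name `half_width` is mine, and n denotes the number of observations in each group.<br />

```python
# The half-width of a normal-theory confidence interval for a
# difference of two means (n observations per group) is
#   z * sigma * sqrt(1/n + 1/n) = z * sigma * sqrt(2/n),
# so it shrinks with larger n and smaller sigma, and grows when a
# smaller significance level alpha (hence larger z) is chosen.
from math import sqrt
from statistics import NormalDist

def half_width(sigma, n, alpha=0.05):
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return z * sigma * sqrt(2.0 / n)

print(half_width(2.0, 10))           # small n: wide interval
print(half_width(2.0, 200))          # larger n: narrower (note 3)
print(half_width(1.0, 200))          # smaller sigma: narrower still (note 4)
print(half_width(2.0, 200, 0.01))    # alpha = 0.01: wider, so lower power (note 5)
```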



Errors in Hypothesis Testing<br />

The level <strong>of</strong> significance (α) is chosen by the<br />

researcher, usually 0.05, <strong>and</strong> is the chance that the<br />

null hypothesis (H 0 ) will be rejected when in<br />

actual fact it is true. It would seem sensible for α<br />

to be made as small as possible. Then the<br />

probability <strong>of</strong> correctly not rejecting H 0 when it is<br />

true will be large. But this is not the real issue in<br />

a scientific study involving hypothesis testing.<br />

The real issue is to have high probability <strong>of</strong><br />

rejecting H 0 when in fact H 0 is false or needing to<br />

be rejected. That is, a high probability that a test<br />

will correctly detect a real treatment effect <strong>of</strong> a<br />

given magnitude. This is known as the power <strong>of</strong><br />

the test, <strong>and</strong> involves detecting clinically<br />

worthwhile improvements as defined by<br />

researchers. Power is related to the level <strong>of</strong><br />

significance. A smaller value for the level <strong>of</strong><br />

significance results in a smaller power. A power<br />

between 80% <strong>and</strong> 90% is desirable.<br />



These ideas have a parallel in the courts <strong>of</strong> law in<br />

this country. To illustrate, suppose we are<br />

interested in testing a new treatment to see if it<br />

has an effect.<br />

1. The treatment is “arrested”.<br />

2. The treatment is charged with having an<br />

effect (H A ).<br />

3. It is assumed treatment is “innocent” (has no<br />

effect, H 0 ) until the evidence (data) shows<br />

otherwise. The evidence is summarized in the<br />

test statistic.<br />

4. The level <strong>of</strong> significance (α) is the probability<br />

that an innocent treatment will be convicted.<br />

This error must be made small. That is, the<br />

probability <strong>of</strong> a false conviction.<br />

5. The power is the probability that a guilty<br />

treatment will be convicted. This is the best<br />

outcome for a court case as it is a correct<br />

conviction. This probability should be large<br />

since then we correctly convict the treatment<br />

concluding there is an important treatment<br />

effect. Power should be at least 0.80 or 0.90.<br />



Some computer packages (Minitab is one) have<br />

an excellent routine for analysing the power <strong>of</strong> a<br />

study <strong>and</strong> showing how power, data variability,<br />

sample size, level <strong>of</strong> significance <strong>and</strong> clinically<br />

important effects are related.<br />

EXAMPLE: The problem is to design a milk<br />

feeding trial in 5 year old children to see if a daily<br />

supplement <strong>of</strong> milk for a year leads to an<br />

increased gain in height compared with a control<br />

group (such a study would be both expensive <strong>and</strong><br />

difficult for practical <strong>and</strong> ethical reasons). It is<br />

known that at this age children grow 6cm in a<br />

year with a st<strong>and</strong>ard deviation <strong>of</strong> 2cm (σ). The<br />

effect <strong>of</strong> milk on height gain is important if it<br />

results in a gain <strong>of</strong> at least 0.5cm. We want a<br />

high probability <strong>of</strong> detecting such a difference so<br />

we set the power to be 0.9 (90%) <strong>and</strong> choose a<br />

1% (α = 0.01) significance level.<br />

Known: σ = 2 (data variability);<br />

α = 0.01 (chosen level of sig.);<br />

clinically important diff = 0.5 cm;<br />

target power = 0.90 (90%).<br />

Find: Sample size.<br />



(a)<br />

Find the sample size required to meet these<br />

conditions. (i.e. σ = 2.0cm; clinically<br />

important difference = 0.5cm; power = 0.9;<br />

α = 0.01)<br />

Step 1. STAT > POWER AND SAMPLE<br />

SIZE > 2-SAMPLE t (i.e. choose an<br />

unpaired t-test)<br />

Step 2. Specify power value <strong>of</strong> 0.9, a clinically<br />

important difference <strong>of</strong> 0.5 <strong>and</strong> sigma 2.0<br />

Step 3. Choose Not equal (i.e. a two-sided<br />

alternative hypothesis) and set the<br />

significance level alpha to 0.01<br />



A printout is as follows:<br />

Power <strong>and</strong> Sample Size<br />

2-Sample t Test<br />

Testing mean 1 = mean 2 (versus not =)<br />

Calculating power for mean 1 = mean 2 + difference<br />

Alpha = 0.01 Sigma = 2<br />

Sample Target Actual<br />

Difference Size Power Power<br />

0.5 478 0.9000 0.9001<br />

There need to be 478 children in each sample<br />

meaning 956 children in total.<br />

(i.e. the size <strong>of</strong> one sample is given)<br />

[Note: the actual power will be different as a<br />

result <strong>of</strong> rounding to the sample size.]<br />
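The figure of 478 per group can be checked by hand with the usual normal-approximation sample-size formula, n per group ≈ 2((z₁₋α/₂ + z_power)·σ/δ)². A Python sketch (illustrative; the function name is mine, and the normal approximation gives 477 where Minitab's exact t-based calculation gives 478):<br />

```python
# Normal-approximation check of the Minitab sample-size calculation:
#   n per group ~= 2 * ((z_{1-alpha/2} + z_{power}) * sigma / delta)^2
# The exact t-based calculation (as Minitab performs) gives 478.
from math import ceil
from statistics import NormalDist

def n_per_group(delta, sigma, alpha=0.01, power=0.90):
    z = NormalDist().inv_cdf
    return ceil(2 * ((z(1 - alpha / 2) + z(power)) * sigma / delta) ** 2)

print(n_per_group(0.5, 2.0))  # 477 per group (Minitab's t-based value: 478)
```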



(b)<br />

Now consider clinically important<br />

differences <strong>of</strong> 0.5, 0.6, 0.7, 0.8, 0.9, 1.0<br />

A printout gives<br />

Power <strong>and</strong> Sample Size<br />

2-Sample t Test<br />

Testing mean 1 = mean 2 (versus not =)<br />

Calculating power for mean 1 = mean 2 + difference<br />

Alpha = 0.01 Sigma = 2<br />

Sample Target Actual<br />

Difference Size Power Power<br />

0.5 478 0.9000 0.9001<br />

0.6 333 0.9000 0.9007<br />

0.7 245 0.9000 0.9006<br />

0.8 188 0.9000 0.9006<br />

0.9 149 0.9000 0.9009<br />

1.0 121 0.9000 0.9008<br />

Notice that smaller samples will detect the larger<br />

clinically important differences. Necessary<br />

sample size reduces from 956 to 242 [similar<br />

to moving from a high-resolution microscope to<br />

a pocket magnifying glass, which is all that is<br />

needed to detect 1.0]<br />



(c)<br />

Halve the value <strong>of</strong> sigma to 1.0 <strong>and</strong> repeat<br />

the analysis in (b)<br />

Power <strong>and</strong> Sample Size<br />

2-Sample t Test<br />

Testing mean 1 = mean 2 (versus not =)<br />

Calculating power for mean 1 = mean 2 + difference<br />

Alpha = 0.01 Sigma = 1<br />

Sample Target Actual<br />

Difference Size Power Power<br />

0.5 121 0.9000 0.9008<br />

0.6 85 0.9000 0.9027<br />

0.7 63 0.9000 0.9032<br />

0.8 49 0.9000 0.9058<br />

0.9 39 0.9000 0.9051<br />

1.0 32 0.9000 0.9060<br />

Notice how greater precision (decreased st<strong>and</strong>ard<br />

deviation) in the data results in smaller sample<br />

sizes required to achieve the desired power: the<br />

total needed is now only 64 children (32 per group) for a difference of 1.0.<br />



(d) A doctor set up a study involving 100<br />

children (50 in each group) <strong>and</strong> monitored<br />

the children for one year. The doctor<br />

wanted to detect a clinically important<br />

difference <strong>of</strong> 0.5, knew from historical<br />

information that sigma = 2.0, <strong>and</strong> set up a<br />

two-sided test at the α = 0.05 (5%)<br />

level <strong>of</strong> significance. The printout obtained<br />

for the doctor after the study was carried out<br />

follows.<br />

Power <strong>and</strong> Sample Size<br />

2-Sample t Test<br />

Testing mean 1 = mean 2 (versus not =)<br />

Calculating power for mean 1 = mean 2 + difference<br />

Alpha = 0.05 Sigma = 2<br />

Sample<br />

Difference Size Power<br />

0.5 50 0.2358<br />

The power for this study is only 0.2358. The<br />

probability <strong>of</strong> detecting the clinically important<br />

difference <strong>of</strong> 0.5 is too small. The study was a<br />

waste <strong>of</strong> effort in the sense that it is unlikely to<br />

detect a difference as small as 0.5 when this size<br />

difference is important.<br />

If α = 0.01, power = 0.0891<br />
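The quoted powers can also be approximated by hand. This Python sketch (illustrative; the function name is mine) uses the normal approximation for a two-sided, two-sample test, which comes close to Minitab's exact t-based values of 0.2358 and 0.0891.<br />

```python
# Normal approximation to the power of a two-sided two-sample test:
#   power ~= Phi(d/se - z) + Phi(-d/se - z),   se = sigma * sqrt(2/n),
# where z = z_{1-alpha/2} and n is the size of each group.
from math import sqrt
from statistics import NormalDist

def power(delta, sigma, n, alpha=0.05):
    nd = NormalDist()
    z = nd.inv_cdf(1 - alpha / 2)
    ncp = delta / (sigma * sqrt(2.0 / n))   # standardised effect size
    return nd.cdf(ncp - z) + nd.cdf(-ncp - z)

print(round(power(0.5, 2.0, 50), 4))        # ~0.24 (Minitab: 0.2358)
print(round(power(0.5, 2.0, 50, 0.01), 4))  # ~0.09 (Minitab: 0.0891)
```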



Revision Examples<br />

1. Exam 2006:<br />

In a study to assess the impact <strong>of</strong> an industrial<br />

development on a nearby river, water temperature<br />

was measured. It has been suggested the mean<br />

water temperature is higher in this river than in a<br />

similar river 30 km away that is not affected by<br />

the development. Daily temperatures in degrees<br />

Celsius were taken at midday for a fortnight in<br />

February from both rivers. Two readings from<br />

the “unaffected” river were spoiled. The data are<br />

summarised below:<br />

                        Unaffected river   Affected river<br />

Sample Size (n_i)              12               14<br />

Sample Mean (x̄_i)            15.41            16.49<br />

Sample Variance (s_i²)        1.963            2.132<br />

(a)<br />

(4 marks) Assuming that temperature has a<br />

common variability in both rivers <strong>and</strong> the<br />

values are approximately normal, calculate<br />

the pooled estimate for the common<br />

variance <strong>and</strong> an estimate for the st<strong>and</strong>ard<br />

error <strong>of</strong> the difference between the two<br />

means.<br />



s_p² = [11(1.963) + 13(2.132)] / 24 = 2.055<br />

Pooled variance = 2.055<br />

standard error = √(2.055(1/12 + 1/14)) = 0.564<br />

Estimated standard error = 0.564<br />
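The arithmetic in (a) can be verified with a few lines of code (Python shown for illustration; the variable names are mine):<br />

```python
# Check of part (a): pooled variance and standard error of the
# difference between the two river means.
from math import sqrt

n1, n2 = 12, 14            # unaffected, affected river sample sizes
s1, s2 = 1.963, 2.132      # sample variances

sp2 = ((n1 - 1) * s1 + (n2 - 1) * s2) / (n1 + n2 - 2)   # pooled variance
se = sqrt(sp2 * (1 / n1 + 1 / n2))                      # standard error

print(round(sp2, 3), round(se, 3))  # 2.055 0.564
```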

(b)<br />

(2 marks) Using the appropriate value from<br />

the t-table construct the 95% confidence<br />

interval for the difference in mean<br />

temperature in the affected <strong>and</strong> unaffected<br />

rivers.<br />

1.08 ± t 24 (0.564) where t 24 = 2.064<br />

or 1.08 ± 1.164<br />

Confidence interval:<br />

–0.084 < μ A – μ u < 2.244<br />



(c) (2 marks) A mean temperature increase of<br />

0.6 degrees Celsius is ecologically<br />

important. State your conclusion about the<br />

true mean temperature from the confidence<br />

interval in (b).<br />

Conclusion:<br />

Result inconclusive. There is no evidence <strong>of</strong> a<br />

temperature mean difference but an important<br />

increase cannot be ruled out.<br />

(d) (1 mark) State one way in which you might<br />

increase the power <strong>of</strong> this study.<br />

Statement:<br />

Increase sample size.<br />

(e) (5 marks) A more powerful study is to be<br />

set up which has a 95% confidence interval<br />

for the difference between the mean river<br />

temperatures no greater than 0.6 degrees<br />

Celsius. Assuming the same number <strong>of</strong><br />

measurements is taken from each river <strong>and</strong><br />

the pooled estimate for the common<br />

variance from (a) is the best estimate for the<br />

variability, approximately how many<br />

readings should be taken from each river<br />



Taking 1.96 as multiplier, the 95% C.I. is<br />

(μ_2 − μ_1) ± 1.96 √(2.054(1/n + 1/n))<br />

But the required precision needs (μ_2 − μ_1) ± 0.3<br />

Therefore, 1.96 √(2(2.054)/n) ≤ 0.3<br />

∴ n ≥ (1.96)²(2)(2.054)/(0.3)² = 175.3<br />

Number of readings from each river: 176<br />
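The same calculation as a short sketch (Python for illustration; the variable names are mine):<br />

```python
# Check of part (e): smallest equal sample size n per river so that the
# 95% confidence interval half-width 1.96*sqrt(sp2*(2/n)) is at most 0.3.
from math import ceil

sp2 = 2.054        # pooled variance estimate from (a)
half_width = 0.3   # target: total CI width no greater than 0.6

n = ceil(1.96 ** 2 * 2 * sp2 / half_width ** 2)
print(n)  # 176
```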

(f)<br />

(2 marks) The 95% confidence interval from<br />

the study in (e) is (0.49, 1.12). What<br />

conclusion would you now reach about the true<br />

mean temperature difference?<br />

Conclusion:<br />

Result conclusive. There is evidence <strong>of</strong><br />

increased temperatures but the increase may<br />

not be ecologically important.<br />



2. Exam 2005<br />

An ecologist must determine whether a cleanup<br />

project at a lake has been effective. This is to be<br />

done by recording dissolved oxygen content (in<br />

parts per million, ppm) in the lake, with higher<br />

values indicating less pollution. Prior to the<br />

cleanup project a r<strong>and</strong>om sample <strong>of</strong> 50 dissolved<br />

oxygen readings was recorded around the lake. Six<br />

months after the initiation <strong>of</strong> the cleanup a second<br />

r<strong>and</strong>om sample <strong>of</strong> 70 readings was recorded.<br />

Results are summarised in the following table.<br />

                        Before Cleanup   After Cleanup<br />

Sample Size (n_i)             50               70<br />

Sample Mean (x̄_i)           10.30            10.46<br />

Sample Variance (s_i²)        0.32             0.36<br />

(a) (1 mark) State null <strong>and</strong> alternative hypotheses<br />

for testing the data driven hypothesis that the<br />

cleanup has resulted in an increase in the<br />

dissolved oxygen content.<br />

Null hypothesis, H 0 : μ BC = μ AC<br />

Alternative hypothesis, H A : μ BC < μ AC<br />



(b) (6 marks) Calculate the pooled estimate for<br />

the common variance <strong>of</strong> the two samples, an<br />

estimate for the st<strong>and</strong>ard error <strong>of</strong> the difference<br />

between the two means, <strong>and</strong> a st<strong>and</strong>ardised<br />

normal z statistic for testing the hypotheses.<br />

s_p² = [49(0.32) + 69(0.36)] / 118 = 0.3434<br />

Pooled variance = 0.3434<br />

estimated standard error = √(0.3434(1/50 + 1/70)) = 0.1085<br />

Standard error = 0.1085<br />

z = (10.46 − 10.30)/0.1085 = 1.475<br />

Standardised z statistic = 1.475<br />
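The calculation in (b), with the one-sided p-value, can be checked in a few lines (Python for illustration; variable names are mine, and the p-value here uses the normal distribution directly rather than the rounded z-table lookup):<br />

```python
# Check of part (b): pooled variance, standard error and z statistic
# for the lake cleanup data, plus the one-sided p-value.
from math import sqrt
from statistics import NormalDist

n1, n2 = 50, 70            # before, after cleanup sample sizes
s1, s2 = 0.32, 0.36        # sample variances
m1, m2 = 10.30, 10.46      # sample means

sp2 = ((n1 - 1) * s1 + (n2 - 1) * s2) / (n1 + n2 - 2)
se = sqrt(sp2 * (1 / n1 + 1 / n2))
z = (m2 - m1) / se
p = 1 - NormalDist().cdf(z)   # one-sided p-value

print(round(sp2, 4), round(se, 4), round(z, 3))  # 0.3434 0.1085 1.475
print(round(p, 4))  # ~0.070 (the rounded table lookup above gives 0.0694)
```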



(c) (2 marks) Find the probability value (p-value)<br />

for the z statistic in (b) <strong>and</strong> state your<br />

conclusion from the p-value (using a 5% level<br />

<strong>of</strong> significance).<br />

p-value = 0.5 – 0.4306 = 0.0694<br />

Conclusion: There is no evidence that the<br />

clean-up has raised the mean dissolved<br />

oxygen reading.<br />

(d) (2 marks) Construct the 95% confidence<br />

interval for the difference in the dissolved oxygen<br />

means for the readings before cleanup <strong>and</strong> the<br />

readings after cleanup.<br />

(10.46 – 10.30) ± t 118 (0.1085) where<br />

t 118 = 1.98 (accept 1.96)<br />

i.e. 0.160 ± 0.215<br />

Confidence interval:<br />

–0.055 < μ AC – μ BC < 0.375<br />



(e) (1 mark) The power <strong>of</strong> this study is small.<br />

Suggest one way in which you might increase the<br />

power <strong>of</strong> this study.<br />

Answer: Select a larger sample<br />

(f) (3 marks) A more powerful study produced<br />

the 95% confidence interval (0.04, 0.27).<br />

What conclusions would you reach about the<br />

p-value <strong>of</strong> this study result <strong>and</strong> the effect <strong>of</strong><br />

the cleanup project if an increase <strong>of</strong> 0.25 in the<br />

dissolved oxygen mean is ecologically<br />

important?<br />

Conclusion: p-value < 0.05<br />

There is evidence the oxygen mean has<br />

increased after the cleanup but it may not be<br />

an important increase (or may not be as great<br />

as hoped)<br />

[Question 4 : 15 marks]<br />



SECTION 7<br />

One factor analysis <strong>of</strong> variance, post analysis <strong>of</strong><br />

variance tests on means, <strong>and</strong> multiple comparison<br />

procedures.<br />



ONE FACTOR ANALYSIS OF VARIANCE<br />

This section <strong>of</strong> the course returns to the<br />

continuous outcome theme.<br />

In the studies <strong>of</strong> this type considered so far there<br />

have been two treatments when usually a new<br />

treatment is compared with a control or placebo.<br />

In the first half <strong>of</strong> the semester we answered the<br />

question about the effect <strong>of</strong> the new treatment by<br />

using the two sample t-test to find p-values <strong>and</strong><br />

confidence intervals for the comparison <strong>of</strong> means.<br />

These studies involved an outcome measured on a<br />

continuous scale <strong>and</strong> the scores in the two<br />

treatments were compared.<br />

Regression procedures were developed which<br />

allowed us to introduce potential confounding<br />

variables <strong>and</strong> hence obtain adjusted or modified<br />

confidence intervals <strong>and</strong> different p-values.<br />

We are now going to investigate how to analyse<br />

continuous data when there are more than two<br />

treatments <strong>of</strong> interest.<br />



Example A general surgeon believes that<br />

providing pain relief immediately following<br />

surgery improves the level <strong>of</strong> comfort postsurgery.<br />

Three pain killing drugs <strong>and</strong> a placebo<br />

are r<strong>and</strong>omly administered to patients<br />

immediately following tonsillectomies. The<br />

times in hours until onset <strong>of</strong> pain are as follows.<br />

The study is double blind.<br />

Placebo Drug A Drug B Drug C<br />

1.6 2.6 1.2 3.6<br />

0.3 12.6 1.7 3.2<br />

1.1 2.8 0.9 3.4<br />

0.4 4.5 2.1 3.9<br />

1.4 5.3 1.3 4.9<br />

2.4 4.4<br />

3.9<br />

Which drugs, if any, may be better than placebo?<br />

Notice that there are now three comparisons with<br />

placebo. We can do better than just make the<br />

three comparisons using three unpaired t-tests.<br />



Example: A comparison was made <strong>of</strong> protein<br />

intake among three groups <strong>of</strong> post-menopausal<br />

women: (1) women eating a st<strong>and</strong>ard American<br />

diet (STD), (2) women eating a lacto-ovovegetarian<br />

diet (LAC), <strong>and</strong> (3) women eating a<br />

strict vegetarian diet (VEG). It was hypothesized<br />

that protein intake was affected by diet. The<br />

protein intakes (mg) for 30 women are:<br />

STD LAC VEG<br />

76 62 47<br />

63 76 75<br />

84 71 32<br />

72 61 40<br />

66 35 52<br />

83 56 37<br />

77 44 56<br />

79 58 35<br />

72 55 27<br />

69 49 66<br />

What are the effects of diet on protein intake?<br />

Notice that there are three comparisons which<br />

could be <strong>of</strong> interest.<br />



We now investigate the problem <strong>of</strong> how to deal<br />

with multiple comparisons. The unpaired t test<br />

for comparing two sample means will be<br />

extended to situations involving more than two<br />

samples. As with simple linear regression the<br />

idea is again to partition the total variability <strong>of</strong> a<br />

response or outcome measure into components<br />

due to different sources <strong>of</strong> variation.<br />

Example: The effect <strong>of</strong> five drug treatments (A<br />

to E) on reduction <strong>of</strong> fever is investigated. Four<br />

children are assigned each treatment <strong>and</strong><br />

temperature reductions measured in appropriate<br />

units with high values showing greater reduction.<br />

Responses as follows:<br />

A B C D E<br />

9 7 2 4 4<br />

8 4 3 8 9<br />

6 9 4 1 6<br />

9 6 3 3 3<br />

Total 32 26 12 16 22 108<br />

Mean 8.0 6.5 3.0 4.0 5.5 5.4<br />



One source <strong>of</strong> variation is due to differences<br />

between the effects <strong>of</strong> the drugs, the other source<br />

<strong>of</strong> variation is the r<strong>and</strong>om variation between the<br />

individual children within each drug treatment.<br />

But which <strong>of</strong> these is most responsible for<br />

explaining the variation in the responses?<br />

The Method<br />

Each response can be divided into three<br />

components as follows:<br />

Response = overall effect present in each value<br />

+ a drug treatment (factor) effect<br />

+ r<strong>and</strong>om error (or residual effect)<br />

From the estimates for these components we find<br />

a number measuring treatment variation <strong>and</strong> a<br />

number measuring residual (including error)<br />

variation. These values are compared using an F<br />

statistic as in regression.<br />



Estimation <strong>of</strong> Components (for reference)<br />

1. Overall mean = 5.4 (this is the estimate for<br />

the overall effect with one degree <strong>of</strong><br />

freedom)<br />

2. The five treatment effects estimated as<br />

follows:<br />

A: 8.0 – 5.4 = 2.6<br />

B: 6.5 – 5.4 = 1.1<br />

C: 3.0 – 5.4 = – 2.4<br />

D: 4.0 – 5.4 = – 1.4<br />

E: 5.5 – 5.4 = 0.1<br />

These add to zero (as they are deviations<br />

from their mean).<br />

There are 5 – 1 = 4 degrees <strong>of</strong> freedom.<br />

Note: The responses for A are, on average,<br />

2.6 units above the overall mean, while<br />

responses for D are, on average 1.4 units<br />

below overall mean.<br />



3. The residuals (including r<strong>and</strong>om error) are<br />

estimated by subtracting the overall mean<br />

<strong>and</strong> the treatment effect from each response<br />

to get:<br />

A: 9 = 5.4 + 2.6 + 1.0<br />

8 = 5.4 + 2.6 + 0.0<br />

6 = 5.4 + 2.6 – 2.0<br />

9 = 5.4 + 2.6 + 1.0<br />

B: 7 = 5.4 + 1.1 + 0.5<br />

4 = 5.4 + 1.1 – 2.5<br />

9 = 5.4 + 1.1 + 2.5<br />

6 = 5.4 + 1.1 – 0.5<br />

C: 2 = 5.4 + (– 2.4) – 1.0<br />

3 = 5.4 + (– 2.4) + 0.0<br />

4 = 5.4 + (– 2.4) + 1.0<br />

3 = 5.4 + (– 2.4) + 0.0<br />

D: 4 = 5.4 + (– 1.4) + 0.0<br />

8 = 5.4 + (– 1.4) + 4.0<br />

1 = 5.4 + (– 1.4) – 3.0<br />

3 = 5.4 + (– 1.4) – 1.0<br />

E: 4 = 5.4 + 0.1 – 1.5<br />

9 = 5.4 + 0.1 + 3.5<br />

6 = 5.4 + 0.1 + 0.5<br />

3 = 5.4 + 0.1 – 2.5<br />

The residuals are the third values on right.<br />



There are 20 data values altogether <strong>and</strong> hence 20<br />

degrees <strong>of</strong> freedom but 5 degrees <strong>of</strong> freedom<br />

have been used up leaving 15 for the residual<br />

effect.<br />

Sums of Squares Computation<br />

Σ(responses²) = 9² + 8² + 6² + 9² + … + 6² + 3²<br />

= 714 (with 20 DF)<br />

Σ(overall means²) = 5.4² + … + 5.4² = 20(5.4)²<br />

= 583.2 (with 1 DF)<br />

Σ(treatment effects²) = 4[(2.6)² + (1.1)² + (–2.4)² + (–1.4)² + (0.1)²]<br />

= 62.8 (with 5 − 1 = 4 DF)<br />

Σ(residuals²) = (1.0)² + (0.0)² + (–2.0)² + … + (–2.5)²<br />

= 68.0 (with 15 DF)<br />

From these, 714 = 583.2 + 62.8 + 68.0.<br />

In general, Total response Sum of Squares<br />

= overall mean SS + treatments SS<br />

+ residuals (error) SS<br />
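This partition can be reproduced directly from the data. A minimal Python sketch (illustrative only — the course itself uses R-cmdr or Minitab for such calculations; the variable names are mine):<br />

```python
# Partition of the total (uncorrected) sum of squares for the
# five-drug fever-reduction data: 714 = 583.2 + 62.8 + 68.0,
# and F = treatment MS / residual MS.
data = {"A": [9, 8, 6, 9], "B": [7, 4, 9, 6], "C": [2, 3, 4, 3],
        "D": [4, 8, 1, 3], "E": [4, 9, 6, 3]}

values = [x for xs in data.values() for x in xs]
N = len(values)                      # 20 responses in total
grand_mean = sum(values) / N         # 108/20 = 5.4

total_ss = sum(x * x for x in values)                         # 714.0
mean_ss = N * grand_mean ** 2                                 # 583.2
treat_ss = sum(len(xs) * (sum(xs) / len(xs) - grand_mean) ** 2
               for xs in data.values())                       # 62.8
resid_ss = total_ss - mean_ss - treat_ss                      # 68.0

f_stat = (treat_ss / 4) / (resid_ss / 15)
print(round(f_stat, 2))  # 3.46 (the table's 3.47 comes from the rounded MS values)
```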



Notes: 1. If there are no treatment differences,<br />

treatment effects will all be close to zero,<br />

hence treatments SS will be small. But how<br />

does this compare with the r<strong>and</strong>om SS<br />

measured by the residuals?<br />

2. We find the mean (or average) squares (MS)<br />

for treatment <strong>and</strong> residual effects <strong>and</strong><br />

compare these with an F statistic. Sums <strong>of</strong><br />

squares are divided by degrees <strong>of</strong> freedom.<br />

The Analysis <strong>of</strong> Variance (ANOVA) Table<br />

The calculations are summarised in a table similar<br />

to those arising with a regression analysis.<br />

Source of Variation     SS      DF     MS      F<br />

Overall mean          583.2      1<br />

Treatment effects      62.8      4    15.70   3.47<br />

Residual (error)       68.0    (15)    4.53<br />

Total                 714.0     20<br />

F = 15.70/4.53 = 3.47 (giving effect <strong>of</strong> the<br />

treatments on the responses compared with the<br />

chance (residual) effect on responses.<br />

Is this value large enough to be significant?<br />

The critical value is found from F table (5%)<br />



With υ_1 = 4 (treatment DF) and υ_2 = 15 (residual<br />

DF), the F table gives critical F = 3.056,<br />

meaning Pr(F_4,15 > 3.056) = 0.05. Since 3.47 ><br />

3.056 we have significance at the 5% level. This<br />

means that treatment effects outweigh the chance<br />

(residual) effect.<br />

Conclusion: There is evidence <strong>of</strong> a difference<br />

between the mean temperature reductions<br />

resulting from the five treatments.<br />

Note:<br />

Because the overall mean appears in each data<br />

value, it makes no impact on variability between<br />

data values <strong>and</strong> the ANOVA table becomes.<br />

Source SS DF MS F<br />

Treatment effects 62.8 4 15.70 3.47<br />

Residual (error) 68.0 (15) 4.53<br />

Total (mean deleted) 130.8 19<br />



SYSTEMATIC CALCULATIONS<br />

The calculations for a one factor analysis <strong>of</strong><br />

variance can be carried out easily using statistical<br />

s<strong>of</strong>tware or by the following computation method<br />

which is quicker than the previous partitioning<br />

approach.<br />

A B C D E<br />

9 7 2 4 4<br />

8 4 3 8 9<br />

6 9 4 1 6<br />

9 6 3 3 3<br />

Col Total (C j ) 32 26 12 16 22 108<br />

C_j²  1024  676  144  256  484  (total 2584)<br />

The between treatments (or samples) sum of<br />

squares is<br />

C_1²/n_1 + C_2²/n_2 + … + C_k²/n_k − (overall mean SS)<br />

where n_1, n_2 etc are sample sizes and k = 5 here.<br />



If n_1 = n_2 = … = n_k = n (say) this becomes<br />

(1/n)[C_1² + C_2² + … + C_k²] − (overall mean SS)<br />

Total SS = 9² + … + 3² = 714.0 as before.<br />

Overall mean SS = 20(108/20)² = 583.2 as before.<br />

Treatment effects SS<br />

= (1/4)[1024 + 676 + 144 + 256 + 484] − 583.2<br />

= 62.8, as n_1 = n_2 = … = 4<br />

SOURCE SS DF MS F<br />

Overall mean 583.2 1<br />

Treatment effects 62.8 4 15.70 3.47*<br />

Residual (error) (68.0) (15) 4.53<br />

Total 714.0 20<br />

Brackets indicate numbers found by subtraction.<br />

If the effect <strong>of</strong> the overall mean is deleted again,<br />

the reduced table is produced.<br />

SOURCE SS DF MS F<br />

Treatment effects 62.8 4 15.70 3.47*<br />

Residual (error) (68.0) (15) 4.53<br />

Total 130.8 19<br />



A Note on the Residual Mean Square<br />

s or s )<br />

( 2 p<br />

2<br />

e<br />

The four treatment A residuals were 1.0, 0.0, –2.0,<br />

1.0. These are values 9, 8, 6, 9 with A mean <strong>of</strong> 8<br />

subtracted. i.e. they are <strong>of</strong> form x − x . An<br />

Ai A<br />

estimate <strong>of</strong> the variance for treatment A is therefore<br />

s<br />

2<br />

A<br />

2<br />

Ai A A<br />

1 2 2 2 2<br />

= ∑ ( x − x ) /( n −1)<br />

= (1.0 + 0.0 + [ −2.0]<br />

+ 1.0 )<br />

3<br />

For the other four treatments the variance estimates are

s²_B = Σ(x_Bi − x̄_B)²/(n_B − 1)
⋮
s²_E = Σ(x_Ei − x̄_E)²/(n_E − 1)

where in this case n_A = n_B = n_C = n_D = n_E = 4.

If it is assumed that the variance is the same in all five treatments, then the common or pooled variance estimate is



s²_p = (1/5)[s²_A + s²_B + s²_C + s²_D + s²_E]

= (1/5)[(1/3)Σ(x_Ai − x̄_A)² + … + (1/3)Σ(x_Ei − x̄_E)²]

= (1/15)[Σ(x_Ai − x̄_A)² + … + Σ(x_Ei − x̄_E)²]

= (1/15)[1.0² + 0.0² + (−2.0)² + 1.0² + …]

= Residual SS/Residual DF

= Residual Mean Square (s²_e)

The residual mean square is just the pooled<br />

variance estimate for all five samples. (It is a<br />

direct extension <strong>of</strong> the pooled variance estimate<br />

in an unpaired t test.)<br />

Notes (1) For the F test to be valid, the<br />

variances in all samples compared (here 5)<br />

should be approximately equal.<br />

(2) The square root of the residual mean square, s_e, is the standard deviation of the residuals.



(3) In the R-cmdr printout for such an analysis<br />

the overall mean effect is deleted from the<br />

ANOVA table (as in the equivalent<br />

regression printout). The important section<br />

<strong>of</strong> the table remains.<br />

SOURCE SS DF MS F<br />

Treatment effects 62.8 4 15.70 3.47*<br />

Residual (error) effect 68.0 15 4.53<br />

Total (less mean) 130.8 19<br />

Example: 20 children allocated r<strong>and</strong>omly to four<br />

equal groups subjected to different treatments.<br />

After 3 months, progress measured by a test, with<br />

responses below (one child in group 3 died). Test<br />

for treatment mean differences.<br />

TREATMENT<br />

1 2 3 4<br />

4 31 30 19<br />

12 49 41 66<br />

44 22 13 65<br />

9 56 26 46<br />

17 19 89<br />

C_j      86    177    110    285  |  658

C_j²   7396  31329  12100  81225



Total SS = 4² + 12² + … + 89² = 32214

Overall mean SS = 19(658/19)² = 22787.58

Total SS (less mean SS) = 9426.42<br />

Treatment effect SS = 7396/5 + 31329/5 + 12100/4 + 81225/5 − 22787.58

= 4227.43<br />

The ANOVA table becomes<br />

SOURCE SS DF MS F<br />

Treatment effect 4227.43 3 1409.14 4.066<br />

Error (residual) (5198.99) (15) 346.60<br />

Total (less mean) 9426.42 18<br />

Critical value at 5% level <strong>of</strong> significance is 3.287<br />

< 4.066 (Using 3 <strong>and</strong> 15 DF)<br />

Conclusion: There is some evidence that the<br />

mean outcomes in the four treatments differ.<br />
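Since the raw responses are given in full here, the whole table can be checked by machine. A hedged sketch in Python (the notes use R-cmdr; scipy.stats.f_oneway would report the same F statistic directly):

```python
# Sketch: one-way ANOVA arithmetic for the 20-children example.
groups = [[4, 12, 44, 9, 17],
          [31, 49, 22, 56, 19],
          [30, 41, 13, 26],        # one child in group 3 died
          [19, 66, 65, 46, 89]]
values = [x for g in groups for x in g]
N = len(values)                                             # 19

total_SS = sum(x * x for x in values)                       # 32214
mean_SS = sum(values) ** 2 / N                              # 658^2/19 = 22787.58
treat_SS = sum(sum(g) ** 2 / len(g) for g in groups) - mean_SS
error_SS = total_SS - mean_SS - treat_SS

F = (treat_SS / 3) / (error_SS / (N - 4))                   # about 4.07
```

Comparing F with the F(3, 15) critical value 3.287 reproduces the conclusion reached above.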



POST ANALYSIS OF VARIANCE RESULTS<br />

It is important for further interpretation to set up<br />

confidence intervals for individual sample means<br />

or for differences between pairs <strong>of</strong> sample means.<br />

The useful new development here is that the<br />

residual mean square is an excellent estimate of the data variance, so there is no need to separately calculate the usual pooled variance

estimate for each pair <strong>of</strong> samples. The advantage<br />

<strong>of</strong> using the residual mean square is that it<br />

involves all the data, not just data in individual<br />

samples.<br />

Example: Set up a 95% confidence interval for<br />

the mean <strong>of</strong> treatment 2.<br />

Solution: Here,

x̄₂ = 177/5 = 35.4

Estimated standard error = √(s²_e/n) = √(346.60/5) = 8.33

which has 15 degrees <strong>of</strong> freedom (same as<br />

residual)<br />



The 95% C.I. is 35.4 ± t 15 (8.33)<br />

where t 15 = 2.132<br />

That is, 35.4 ± 17.76<br />

or 17.64 < μ 2 < 53.16<br />

N.B. (1) Use 15 DF rather than 5 – 1 = 4 DF for<br />

the single second sample. Hence greater<br />

precision as t 15 < t 4 (note t 4 = 2.776)<br />

(2) R-cmdr gives confidence intervals for these<br />

treatment means automatically.<br />

(3) As we have seen, use <strong>of</strong> the residual mean<br />

square requires the variances to be equal in<br />

each sample.<br />

Example: Compare the mean scores for<br />

treatments 3 <strong>and</strong> 4 by setting up a 95% C.I. for<br />

the difference.<br />



Solution: x̄₃ = 110/4 = 27.5, x̄₄ = 285/5 = 57.0

Estimated standard error of the difference

= s_p √(1/n₃ + 1/n₄) = √(s²_e (1/4 + 1/5)) = √(346.60 (1/4 + 1/5))

= 12.49

with 15 DF again rather than n 3 + n 4 – 2 = 7 DF<br />

as for the usual unpaired t-test.<br />

The 95% C.I. for μ 4 – μ 3 is<br />

(57.0 – 27.5) ± t 15 (12.49)<br />

where t 15 = 2.132<br />

That is 29.5 ± 26.63<br />

or 2.87 < μ 4 – μ 3 < 56.13<br />

Since zero is excluded, there is evidence that treatment 4 has a higher average score than treatment 3.
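Both post-ANOVA intervals can be sketched with the quantities already computed. A minimal Python sketch (not the notes' R-cmdr output), taking s²_e = 346.60 with 15 DF and t₁₅ = 2.132 as quoted:

```python
import math

s2e, t15 = 346.60, 2.132       # residual mean square (15 DF) and t critical value

# 95% CI for the treatment 2 mean
mean2 = 177 / 5                                   # 35.4
se2 = math.sqrt(s2e / 5)                          # about 8.33
ci_mean2 = (mean2 - t15 * se2, mean2 + t15 * se2)

# 95% CI for mu_4 - mu_3
diff = 285 / 5 - 110 / 4                          # 57.0 - 27.5 = 29.5
se_diff = math.sqrt(s2e * (1 / 4 + 1 / 5))        # about 12.49
ci_diff = (diff - t15 * se_diff, diff + t15 * se_diff)
```

The lower limit of `ci_diff` is positive, matching the conclusion that zero is excluded.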



A NOTE ON ASSUMPTIONS IN ANOVA<br />

Residuals <strong>and</strong> residual plots can be used to check<br />

the required assumptions. As in a regression<br />

analysis, the residuals should be<br />

(i) normally distributed,<br />

(ii) r<strong>and</strong>omly distributed about 0,<br />

(iii) have similar variation within each <strong>of</strong> the<br />

samples chosen<br />

The following graph shows the variability in each<br />

<strong>of</strong> the drugs in the temperature reduction fever<br />

data. There could be some concern about unequal<br />

variation within each <strong>of</strong> the five treatments (but<br />

the samples are very small in this case so this is<br />

not too surprising).<br />



The next two residual plots confirm that the<br />

variation is similar for each drug treatment <strong>and</strong><br />

the residuals are close to being normally<br />

distributed.<br />



SECTION 8<br />

This section covers the analysis of count data, including the chi-square test for contingency and the chi-square test for trend, as well as relative risks, attributable risks and odds ratios along with their confidence intervals. The analysis of a three-way table and Simpson's paradox are investigated as a way of introducing the concept of a confounding variable in the lead-up to regression analyses.

Categorical Data Examples<br />

Relative Risk <strong>and</strong> its Confidence Interval<br />

Attributable Risk <strong>and</strong> its Confidence Interval<br />

Odds Ratio <strong>and</strong> its Confidence Interval<br />

Chi-square Test for Contingency<br />

Chi-square Test for Trend<br />

Interpretation <strong>of</strong> Confidence Intervals<br />

Simpson’s Paradox <strong>and</strong> Confounder Control<br />

279<br />

Section 8


Analysis <strong>of</strong> categorical data<br />

Categorical Data arise when individuals or<br />

experimental units are classified into one <strong>of</strong> two<br />

or more mutually exclusive groups. For example,<br />

• binary e.g. sex (M/F); dead/alive;<br />

diseased/disease free;<br />

treatment/placebo; smoker (yes/no)<br />

Tuatara present/absent<br />

herpes present/absent<br />

melanoma present/absent<br />

• nominal e.g. ethnicity<br />

• ordinal e.g. disease severity; socio economic<br />

status; smoking (never/ex/current)<br />

In a sample <strong>of</strong> units, the number falling into a<br />

particular group is the frequency. The analysis <strong>of</strong><br />

such data is sometimes referred to as the analysis<br />

<strong>of</strong> frequencies or counts.<br />



Examples <strong>of</strong> research questions that we shall<br />

look at.<br />

Estimation of one proportion:

Ex 1. What is the prevalence of asthma in a population?

Associations between two factors:

Ex 2. Is a vaccine effective in reducing the risk of catching influenza?

Ex 3. Is there an association between exposure to chlorinated water and dental enamel erosion?

Ex 4. Does infra-red stimulation (IRS) provide effective pain relief in patients with cervical osteoarthritis?

Ex 5. Is there an association between income level and severity of cardiovascular disease in a group of people presenting for treatment?



What tools do we need to answer these types of questions? Recall the research loop:

[Diagram: the research loop. A sample is drawn from the underlying population (selection bias can enter here); information is collected from the sample (information bias); statistical analysis of the sample supports inference back to the population, with confounding to be considered along the way.]

Possible explanations for an association include<br />

• bias (controlled through study design: selection bias when choosing the people for a study, or systematic error arising from the way information was collected from study participants)

• confounding (must be allowed for)<br />

• chance (or r<strong>and</strong>om error)<br />

• true association<br />

We shall use proportions, relative <strong>and</strong> attributable<br />

risks, odds ratios, confidence intervals <strong>and</strong><br />

probability values.<br />



Example 1: What is the prevalence of asthma in a population?

Population: adult males on a general practice<br />

register.<br />

Study<br />

• r<strong>and</strong>om sample from population, n = 215<br />

• 39 have history <strong>of</strong> asthma<br />

Sample proportion p = 39/215 = 0.18<br />

Standard error of proportion = √(0.18(1 − 0.18)/215) = 0.026

95% confidence interval for the true proportion<br />

(0.13, 0.24)<br />

Conclusion<br />

We can be 95% sure that the true prevalence <strong>of</strong><br />

asthma among men attending this general practice<br />

is between 13% <strong>and</strong> 24%.<br />

Confidence intervals for very small proportions<br />

• If the number of events is small, the distribution of sample proportions is not normal and the interval could include negative values.

• an ‘exact’ method based on the binomial<br />

distribution must be used.<br />
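The normal-approximation interval can be sketched in a few lines of Python (a sketch, not R-cmdr output; the notes' quoted upper limit of 0.24 comes from a slightly different rounding or an exact binomial calculation, and the approximation below gives 0.23):

```python
import math

# Sketch: normal-approximation 95% CI for the asthma prevalence.
n, x = 215, 39
p = x / n                              # about 0.18
se = math.sqrt(p * (1 - p) / n)        # about 0.026
ci = (p - 1.96 * se, p + 1.96 * se)    # about (0.13, 0.23)
```

For very small counts, an exact binomial interval (e.g. scipy.stats.binomtest) should be used instead, as noted above.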



Evaluating associations in 2 × 2 tables<br />

Example 2: Is a vaccine effective in reducing the risk of catching influenza?

Study<br />

169 people were r<strong>and</strong>omly allocated to receive a<br />

flu vaccine or a placebo. At the end of winter they were asked if they had contracted flu.

Flu’ No Flu’ Total<br />

Vaccine 9 75 84<br />

Placebo 22 63 85<br />

Total 31 138 169<br />

This is what is called a prospective cohort

study. In a cohort study the cohort <strong>of</strong> people is<br />

followed into the future. Such studies can be<br />

expensive as they may be <strong>of</strong> long duration. Also<br />

if a disease is rare (say a cancer) many<br />

participants will be needed. The Dunedin<br />

Multidisciplinary Study is one <strong>of</strong> these. Recall<br />

the example on circumcision <strong>and</strong> sexually<br />

transmitted disease.<br />



Example 3: Is there an association between exposure to chlorinated water and dental enamel erosion?

Study<br />

Of 49 swimmers with enamel erosion (the cases), 32 reported swimming 6 or more hours per week, compared with 118 of 245 swimmers without enamel erosion (the controls).

Swim time Erosion <strong>of</strong> enamel Total<br />

per week Yes No<br />

(Cases) (Controls)<br />

≥ 6 hrs 32 118 150<br />

< 6 hrs 17 127 144<br />

Total 49 245 294<br />

This is what is called a retrospective case control<br />

study. Advantage is that such a study is relatively<br />

quick <strong>and</strong> smaller than a cohort study particularly<br />

for rare diseases. But greater potential for bias as<br />

there may be inaccurate recall.<br />

The analysis <strong>of</strong> this 2 × 2 table is not the same as<br />

the analysis in the 2 × 2 table in the previous<br />

cohort study. (We shall see that odds ratio rather<br />

than relative risk must be used.)<br />



Both these data summaries are in the form <strong>of</strong> a<br />

2 × 2 table. Usually there is an exposure (or<br />

predictor) category <strong>and</strong> an outcome (response<br />

category).<br />

Outcome (disease)<br />

Exposed Present Absent Total<br />

Yes a b a + b<br />

No c d c + d<br />

Total a + c b + d n<br />

We know how to summarize data from tables like<br />

these<br />

• the choice <strong>of</strong> measure depends on the study<br />

design<br />

• options include relative risk, attributable risk<br />

(difference in proportions), odds ratio<br />

The tools needed for statistical inference are<br />

• confidence intervals for relative risks<br />

attributable risks <strong>and</strong> odds ratios<br />

• hypothesis tests (p-values) for these<br />

associations<br />



Prospective Studies<br />

• groups are followed up to see if an outcome<br />

<strong>of</strong> interest occurs<br />

• the proportions in each group who develop<br />

the outcome are found (these are <strong>of</strong>ten called<br />

the incidence which defines numbers <strong>of</strong> new<br />

cases <strong>of</strong> a disease)<br />

• the ratio <strong>of</strong> these proportions is the relative<br />

risk<br />

• the difference in these proportions is the<br />

attributable risk<br />

General form <strong>of</strong> 2 × 2 table:<br />

Outcome (disease)<br />

Exposed Present Absent Total<br />

Yes a b a + b<br />

No c d c + d<br />

Total a + c b + d n<br />

Relative risk, RR = [a/(a + b)] / [c/(c + d)]

Attributable risk, AR = a/(a + b) − c/(c + d)



Example 2: Is a vaccine effective in reducing the risk of catching influenza?

Flu’ No Flu’ Total<br />

Vaccine 9 75 84<br />

Placebo 22 63 85<br />

Total 31 138 169<br />

Risk in vaccine group = 9/84<br />

Risk in placebo group = 22/85<br />

Relative risk, RR = (9/84) / (22/85) = 0.414 ≈ 0.4

Those who were vaccinated were 0.4 times as<br />

likely to develop the flu as those who were not<br />

vaccinated. So flu vaccine was associated with a<br />

60% reduction in risk <strong>of</strong> flu.<br />

Notes:<br />

• if a RR = 1.00, then rates are equal <strong>and</strong> there<br />

is no association between flu’ <strong>and</strong> vaccine<br />

• the convention is to calculate the relative risk<br />

this way round so that a ‘protective’ exposure<br />

gives a relative risk less than 1.<br />



Confidence interval for relative risk<br />

One method for finding confidence intervals for<br />

RR is as follows:<br />

The sampling distribution for ln(RR) is<br />

approximately normal with st<strong>and</strong>ard deviation (or<br />

st<strong>and</strong>ard error) given by<br />

s.e.[ln(RR)] = √(1/a − 1/(a + b) + 1/c − 1/(c + d))

Then the 95% confidence interval for ln(RR) is<br />

ln(RR) ± 1.96 s.e.[ln(RR)]<br />

For example,<br />

s.e.[ln(RR)] = √(1/9 − 1/84 + 1/22 − 1/85) = 0.364

Now RR = 0.414, giving ln(RR) = –0.882<br />



The confidence interval (95%) becomes<br />

–0.882 ± 1.96 (0.364)<br />

i.e. –0.882 ± 0.714<br />

Therefore, –1.596 < ln(RR) < –0.168<br />

Taking exponentials, 0.20 < RR < 0.85<br />

So the 95% confidence interval for the true<br />

relative risk is (0.20, 0.85)<br />

Since 1 is not contained in this confidence<br />

interval we conclude that there is evidence <strong>of</strong><br />

association between vaccine use <strong>and</strong> a reduced<br />

risk <strong>of</strong> contracting flu’<br />

Note:<br />

• this method will give a correct CI only if the<br />

numbers in each cell are not too small<br />

• in order to complete our evaluation <strong>of</strong> the<br />

effectiveness <strong>of</strong> the vaccine we need to also<br />

consider possible sources <strong>of</strong> bias <strong>and</strong><br />

confounding<br />

• regression procedures allow us to take<br />

account <strong>of</strong> confounding effects (see later).<br />
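The log-scale interval can be sketched end to end in Python (a sketch under the notes' formulas, not R-cmdr output):

```python
import math

# Sketch: relative risk and 95% CI for the flu-vaccine table.
a, b, c, d = 9, 75, 22, 63                         # vaccine / placebo counts
rr = (a / (a + b)) / (c / (c + d))                 # about 0.414
se = math.sqrt(1/a - 1/(a + b) + 1/c - 1/(c + d))  # about 0.364

# exponentiate the log-scale limits back to the RR scale
lo = math.exp(math.log(rr) - 1.96 * se)            # about 0.20
hi = math.exp(math.log(rr) + 1.96 * se)            # about 0.85
```

Working on the log scale and exponentiating at the end is what keeps both limits positive, which a symmetric interval on the RR scale would not guarantee.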



Confidence interval for attributable risk<br />

Once we have determined the treatment is effective, we may also wish to consider how many cases of flu the vaccine is likely to prevent:

Attributable risk:<br />

22/85–9/84 = 0.26 – 0.11 = 0.15<br />

Use the normal approximation to get a confidence<br />

interval for this difference in proportions. The<br />

estimated st<strong>and</strong>ard error for the difference<br />

between the proportions is<br />

√(p₁(1 − p₁)/n₁ + p₂(1 − p₂)/n₂) = √(0.26(0.74)/85 + 0.11(0.89)/84) = 0.059

and the 95% confidence interval for the attributable risk (risk difference) is

0.15 ± 1.96(0.059)

giving

(0.04, 0.27)



So, assuming the treatment is effective, in every<br />

100 people vaccinated there will be between 4<br />

<strong>and</strong> 27 fewer cases <strong>of</strong> flu than if they had not<br />

been vaccinated (i.e. vaccination prevents<br />

between 4 <strong>and</strong> 27 cases <strong>of</strong> flu in every 100<br />

people)<br />
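The risk-difference interval can be sketched the same way (using unrounded proportions, so the limits match the hand calculation only to the second decimal):

```python
import math

# Sketch: attributable risk (risk difference) and 95% CI.
p1, n1 = 22 / 85, 85                             # placebo flu risk
p2, n2 = 9 / 84, 84                              # vaccine flu risk
ar = p1 - p2                                     # about 0.15
se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
ci = (ar - 1.96 * se, ar + 1.96 * se)            # about (0.04, 0.27)
```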



Case control studies<br />

• a group <strong>of</strong> individuals with a disease (called<br />

the cases) is compared to a control group who<br />

do not have the disease. In these cases we<br />

choose the number <strong>of</strong> people with the disease<br />

<strong>and</strong> the number without.<br />

General form <strong>of</strong> 2 × 2 table<br />

Outcome (disease)<br />

Exposed Present Absent Total<br />

Yes a b a + b<br />

No c d c + d<br />

Total a + c b + d n<br />

The measure <strong>of</strong> association used in case-control<br />

studies is the odds ratio, not the relative risk<br />

• In terms of probabilities, the odds of an event A are defined as Pr(A)/Pr(Ā) = Pr(A)/(1 − Pr(A)). With the notation in the table above, in the exposed group the odds of disease present equal



[a/(a + b)] / [b/(a + b)], which simplifies to a/b.

For the unexposed group, odds = c/d.

Example: Is there an association between exposure to chlorinated water and dental enamel erosion?

Study<br />

Of 49 swimmers with enamel erosion (the cases)<br />

32 reported swimming 6 or more hours per week<br />

compared with 118 <strong>of</strong> 245 swimmers without<br />

enamel erosion (the controls).<br />

Swim time Erosion <strong>of</strong> enamel Total<br />

per week Yes No<br />

(Cases) (Controls)<br />

≥ 6 hrs 32 118 150<br />

< 6 hrs 17 127 144<br />

Total 49 245 294<br />

For ≥ 6 hrs, odds = a/b = 32/118<br />

For < 6 hrs, odds = c/d =17/127<br />



The odds ratio, OR = (a/b) / (c/d) = (32/118) / (17/127) = 2.026 (≈ 2.0)

Note 1: why we use the odds ratio<br />

Compare the numbers in the previous table to a<br />

study which is identical except that we chose to<br />

have only 49 controls:<br />

Swim time Erosion <strong>of</strong> enamel Total<br />

per week Yes No<br />

(Cases) (Controls)<br />

≥ 6 hrs 32 24 56<br />

< 6 hrs 17 25 42<br />

Total 49 49 98<br />

The values 24 and 25 give the same proportions as before, with slight rounding.

Odds ratio = (32/24) / (17/25) = 2.0 with rounding, which is the same as the previous result.



But now suppose we were to try <strong>and</strong> calculate the<br />

relative risk in both cases:<br />

                  ‘Risk’    ‘RR’
Study 1  ≥ 6 hrs  32/150
         < 6 hrs  17/144    1.81
Study 2  ≥ 6 hrs  32/56
         < 6 hrs  17/42     1.41

Notice that there is disagreement. The<br />

consequence is that the relative risk can be made<br />

to take any value by choice <strong>of</strong> numbers <strong>of</strong> cases<br />

<strong>and</strong> controls. This is unacceptable.<br />
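This point can be checked numerically: the odds ratio is (nearly) invariant to how many controls are sampled, while the "relative risk" is not. A sketch:

```python
# Sketch: OR is stable under case-control sampling; "RR" is not.
def odds_ratio(a, b, c, d):
    return (a / b) / (c / d)

def risk_ratio(a, b, c, d):
    return (a / (a + b)) / (c / (c + d))

study1 = (32, 118, 17, 127)   # all 245 controls
study2 = (32, 24, 17, 25)     # only 49 controls, same exposure proportions

or1, or2 = odds_ratio(*study1), odds_ratio(*study2)   # both about 2.0
rr1, rr2 = risk_ratio(*study1), risk_ratio(*study2)   # clearly disagree
```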

Note 2: When are the odds ratio and relative risk close?

Consider a retrospective case-control study:<br />

If disease (the outcome <strong>of</strong> interest) is rare,<br />

then a <strong>and</strong> c will be small in the table.<br />

Disease No Disease<br />

Exposed (Case) (Control) Total<br />

Yes a b a + b<br />

No c d c + d<br />



so

a/(a + b) ≈ a/b and c/(c + d) ≈ c/d

Then relative risk = [a/(a + b)] / [c/(c + d)] ≈ (a/b) / (c/d) = odds ratio.

Thus, in a case-control study investigating a rare<br />

disease the odds ratio gives a good estimate <strong>of</strong> the<br />

true unestimable relative risk.<br />

Confidence interval for odds ratio<br />

In repeated sampling, values of ln(OR) are approximately normally distributed with standard deviation (or standard error) given by

s.e.[ln(OR)] = √(1/a + 1/b + 1/c + 1/d)

The 95% confidence interval for ln(OR) is

ln(OR) ± 1.96 s.e.[ln(OR)]

For the example,



s.e.[ln(OR)] = √(1/32 + 1/118 + 1/17 + 1/127) = 0.326

and ln(OR) = ln(2.026) = 0.706

The confidence interval becomes<br />

0.706 ± 1.96 (0.326)<br />

i.e. 0.706 ± 0.639<br />

Therefore, 0.067 < ln(OR) < 1.345<br />

∴ e 0.067 < OR < e 1.345<br />

∴ 1.069 < OR < 3.838<br />

We conclude that the odds of dental enamel erosion are raised among those swimming six or more hours per week. We would reject the null hypothesis, as the p-value < 0.05.

Note: An odds ratio simply measures whether an association is present between outcome and exposure. With a relative risk we are interested in whether treatment improves outcome status. A protective exposure gives a relative risk less than 1.
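The odds-ratio interval can be sketched just like the relative-risk one (again a Python sketch, not R-cmdr output):

```python
import math

# Sketch: odds ratio and 95% CI for the enamel-erosion table.
a, b, c, d = 32, 118, 17, 127
odds_ratio = (a / b) / (c / d)                   # about 2.026
se = math.sqrt(1/a + 1/b + 1/c + 1/d)            # about 0.326

lo = math.exp(math.log(odds_ratio) - 1.96 * se)  # about 1.07
hi = math.exp(math.log(odds_ratio) + 1.96 * se)  # about 3.84
```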



Chi Square Test for Contingency Tables<br />

The above examples (2 × 2 tables) are very<br />

common in health research <strong>and</strong> other areas.<br />

However, we may want:<br />

• p-values to formally test for an association<br />

• to answer questions relating to larger<br />

contingency tables.<br />

Note:<br />

• as long as one <strong>of</strong> the variables is binary we<br />

can think <strong>of</strong> comparing proportions <strong>and</strong><br />

calculate RRs or ORs<br />

• if both variables have more than 2 categories<br />

the analysis is more complex<br />



Example 4<br />

Does infra-red stimulation (IRS) provide effective pain relief in patients with cervical osteoarthritis?

A r<strong>and</strong>omised controlled trial was carried out<br />

with 100 patients: 20 were r<strong>and</strong>omly allocated to<br />

a double dose <strong>and</strong> 40 each to a single dose <strong>and</strong><br />

control (placebo) treatment. The patients were<br />

classified according to improvement levels over a<br />

period <strong>of</strong> one week as follows:<br />

(hypothetical data)<br />

                       Pain score
IRS            Improve   No change   Worse    Total
Double dose      10          5          5      20 = r₁
Single dose      15         20          5      40 = r₂
Control           5         20         15      40 = r₃
Total         30 = c₁    45 = c₂    25 = c₃   100 = n

• we can look at the percentage improved, no<br />

better <strong>and</strong> worse for each treatment category<br />



We wish to know whether the data indicate that<br />

either<br />

or<br />

IRS does provide effective pain relief (<strong>and</strong> in<br />

what dose)<br />

it is no better than the control.<br />

Calculating a p-value for the following<br />

hypotheses will tell us whether there is evidence<br />

that IRS is effective, or whether the differences<br />

we have observed between the treatment groups<br />

are consistent with r<strong>and</strong>om variation.<br />

Hypotheses:<br />

H 0 : The response <strong>and</strong> the type <strong>of</strong> treatment are<br />

independent (i.e. no association)<br />

H A : response <strong>and</strong> type <strong>of</strong> treatment are not<br />

independent (i.e. are associated in some way<br />

or one <strong>of</strong> the responses may occur more <strong>of</strong>ten<br />

with one <strong>of</strong> the treatments)<br />



If there were no association between treatment and outcome (H₀), we would expect the same fraction of improved responses under each of the three treatments, and this fraction should be c₁/n = 30/100 (i.e. 30 of the 100 patients show improvement).

Suppose E 11 , E 21 <strong>and</strong> E 31 are the numbers <strong>of</strong><br />

improvements expected if RESPONSE <strong>and</strong><br />

TREATMENT are independent. Then<br />

E₁₁/20 = E₂₁/40 = E₃₁/40 = 30/100

∴ E₁₁ = 20(30)/100 = 6

E₂₁ = 40(30)/100 = 12

E₃₁ = 40(30)/100 = 12

In general,

E_ij = (r_i c_j)/n

for each “cell” or “class” in the contingency table.



Using this formula, expected numbers can be<br />

calculated for each cell:<br />

RESPONSE<br />

TREATMENT Improve No change Worse Total<br />

Double dose 6 9 [5] 20<br />

Single Dose 12 18 [10] 40<br />

Control [12] [18] [10] 40<br />

Total 30 45 25 100<br />

Each row <strong>and</strong> column total has to be met by the<br />

entries in the table <strong>and</strong> for this reason the numbers<br />

in brackets can be found by subtraction.<br />
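The E_ij = r_i c_j / n rule can be sketched directly from the margins (a Python sketch of the formula above, not the notes' R-cmdr output):

```python
# Sketch: expected counts for the IRS table under independence.
row_totals = [20, 40, 40]   # r_1, r_2, r_3
col_totals = [30, 45, 25]   # c_1, c_2, c_3
n = sum(row_totals)         # 100

expected = [[r * c / n for c in col_totals] for r in row_totals]
```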

The observed frequencies (the data counts) are now<br />

compared with the expected counts calculated<br />

under H 0 .<br />

If H 0 is true, then the expected counts will agree<br />

closely with those observed. [But how closely must they agree?]

This is answered by calculating the chi-square (χ 2 )<br />

statistic<br />



χ² = Σ over all cells (Observed − Expected)²/Expected

i.e.

χ² = Σ over all cells (i, j) (O_ij − E_ij)²/E_ij

Observed Counts (O ij )<br />

Treatment Response<br />

1 2 3<br />

1 10 5 5<br />

2 15 20 5<br />

3 5 20 15<br />

Expected Counts (E ij ) [Under H 0 : independent]<br />

Treatment<br />

Response<br />

1 2 3<br />

(Improved) (No change) (Worse)<br />

Double 1 6 9 5<br />

Single 2 12 18 10<br />

Control 3 12 18 10<br />

χ² is large if O_ij and E_ij seriously disagree; hence a large χ² will lead to rejection of H₀.



Example: For the drug responses,

χ² = (10 − 6)²/6 + (5 − 9)²/9 + (5 − 5)²/5
   + (15 − 12)²/12 + (20 − 18)²/18 + (5 − 10)²/10
   + (5 − 12)²/12 + (20 − 18)²/18 + (15 − 10)²/10

= 14.72 (χ² will always be positive)

In repeated sampling these χ² values follow a chi-square distribution, which has

υ = (number of rows − 1) × (number of columns − 1)

degrees of freedom. Here, υ = (3 − 1) × (3 − 1) = 4

which is just the number <strong>of</strong> values that can be<br />

freely inserted in the table!! (the remaining values<br />

are fixed if the row <strong>and</strong> column totals are to be<br />

met.)<br />

The critical χ 2 value is found from the table at the<br />

end <strong>of</strong> the notes.<br />



[Figure: χ² density curve; the critical value cuts off an upper-tail area α, the level of significance.]

Excerpt from the χ² table:

                          α
 υ     0.1    0.05    0.025    0.01    0.005    0.001
 ⋮
 4             9.488                    14.86
 ⋮
 100

Since 14.72 > 9.488, the null hypothesis <strong>of</strong> no<br />

association is rejected.<br />

Note: when we do this on the computer we get the<br />

exact p-value, p = 0.005<br />
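The whole χ² computation can be sketched generically from the observed table alone (a Python sketch; scipy.stats.chi2_contingency would report the same statistic and its p-value directly):

```python
# Sketch: Pearson chi-square statistic for the IRS pain-relief table.
obs = [[10, 5, 5],
       [15, 20, 5],
       [5, 20, 15]]
row_tot = [sum(r) for r in obs]
col_tot = [sum(c) for c in zip(*obs)]
n = sum(row_tot)

# sum of (O - E)^2 / E over all cells, with E_ij = r_i * c_j / n
chi2 = sum((obs[i][j] - row_tot[i] * col_tot[j] / n) ** 2
           / (row_tot[i] * col_tot[j] / n)
           for i in range(len(obs)) for j in range(len(obs[0])))
df = (len(obs) - 1) * (len(obs[0]) - 1)          # 4
```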




• the p-value gives the probability <strong>of</strong> observing a<br />

difference this large or larger between what we<br />

observed <strong>and</strong> what is expected under H 0 , if H 0<br />

is true.<br />

• since the p-value is small, it is unlikely we<br />

would observe a difference this big just by<br />

chance, it is more likely that the null hypothesis<br />

is false.<br />

• there is evidence that the pain levels depend on<br />

the treatment administered.<br />

• closer inspection <strong>of</strong> the observed frequencies<br />

indicates<br />

• more patients improved on double dose<br />

than expected<br />

• few patients experienced an improved response on the control

• fewer than expected being worse on single<br />

dose.<br />



Notes<br />

1. Check the observed counts in order to interpret<br />

a significant association.<br />

2. Maximum power is achieved if there are equal<br />

numbers in each ‘exposure’ group. This is<br />

<strong>of</strong>ten not possible to achieve in observational<br />

studies.<br />

3. This chi-square procedure is unreliable if<br />

counts are small, in particular less than 5.<br />

• For larger contingency tables it is possible<br />

to combine classes in order to raise<br />

frequencies.<br />

• For 2 × 2 tables if expected frequencies<br />

are between 5 <strong>and</strong> 10, a correction called<br />

Yates correction will modify the χ 2<br />

statistic.<br />

• For 2 × 2 tables, if expected frequencies<br />

are less than 5, there is a test called<br />

Fisher’s Exact Test which can be used.<br />



Example 5<br />

Is there an association between income level and severity of cardiovascular disease in a group of people presenting for treatment?

Study<br />

A group <strong>of</strong> people presenting to a hospital with<br />

acute myocardial infarction or unstable angina are<br />

enrolled in a study. Cross-sectional data are<br />

collected at baseline.<br />

Income level (Exposure)

Disease level (Outcome)    1       2       3       4     Total
0                         100     107     111     122     440
≥1 (Severe)               115     112     104      97     428
Total                     215     219     215     219     868
% ≥1                     53.5    51.1    48.4    44.3
RR                       1.00    0.96    0.90    0.83

(Income level 1 is the reference group: the relative risks are (115/215)/(115/215), (112/219)/(115/215), (104/215)/(115/215) and (97/219)/(115/215).)
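The RR row can be checked directly in Python, taking income level 1 as the reference group (a sketch; the variable names are ours):

```python
# Severe-disease counts and group totals by income level (1-4)
severe = [115, 112, 104, 97]
totals = [215, 219, 215, 219]

baseline = severe[0] / totals[0]          # risk in income level 1
rr = [(s / t) / baseline for s, t in zip(severe, totals)]
print([round(v, 2) for v in rr])          # level 4 comes out at 0.83
```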



To test whether or not there is an association between disease severity and income level:

H₀: there is no association between disease severity and income (i.e. the proportion with severe disease is the same for all income levels)

Hₐ: there is some association (i.e. the percentage with severe disease varies by income)

Expected frequencies:

                Income level
Disease level    1        2        3        4      Total
0             108.99   111.01   108.99   111.01    440
≥1            106.01   107.99   106.01   107.99    428
Total           215      219      215      219     868

Each expected count is (row total × column total)/grand total, e.g.
E₁₁ = 440 × 215/868 = 108.99, E₁₂ = 440 × 219/868 = 111.01,
E₂₁ = 428 × 215/868 = 106.01, E₂₂ = 428 × 219/868 = 107.99.



χ² = (100 − 108.99)²/108.99 + (107 − 111.01)²/111.01 + (111 − 108.99)²/108.99 + (122 − 111.01)²/111.01
   + (115 − 106.01)²/106.01 + (112 − 107.99)²/107.99 + (104 − 106.01)²/106.01 + (97 − 107.99)²/107.99
   = 4.1

The appropriate sampling distribution is a χ 2 with<br />

3 d.f.<br />

From the χ 2 table<br />

Pr(χ 2 (3 d.f.) > 6.251) = 0.1<br />

so p-value > 0.1<br />

From the computer, p-value = 0.25<br />

Hence the observed differences in proportions we<br />

have seen are <strong>of</strong> the order we might expect to see<br />

by chance. There is no evidence supporting<br />

rejection <strong>of</strong> the null hypothesis.<br />

We conclude that there is no evidence <strong>of</strong> an<br />

association between disease severity <strong>and</strong> income.<br />
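The whole test can be reproduced in a few lines of Python (our sketch, not part of the notes); the closed-form tail probability used at the end is specific to 3 degrees of freedom.

```python
import math

observed = [[100, 107, 111, 122],   # disease level 0
            [115, 112, 104, 97]]    # disease level >= 1 (severe)

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

chi2 = 0.0
for i, row in enumerate(observed):
    for j, o in enumerate(row):
        e = row_totals[i] * col_totals[j] / n       # expected count under H0
        chi2 += (o - e) ** 2 / e

df = (len(observed) - 1) * (len(observed[0]) - 1)   # 3 d.f.

# Chi-square survival function: closed form valid for df = 3 only
p = math.erfc(math.sqrt(chi2 / 2)) + math.sqrt(2 * chi2 / math.pi) * math.exp(-chi2 / 2)
print(round(chi2, 2), df, round(p, 2))
```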



Contingency Tables (Continued)<br />

Tests for trend<br />

Example 5 (continued): Do people with lower incomes tend to present with more severe disease?

The chi-squared test <strong>of</strong> association may not<br />

provide the best answer to this question. It does<br />

not take account <strong>of</strong> the ordering in the income<br />

variable. Specifically, our prior hypothesis is<br />

that the percentage with severe disease decreases<br />

as income increases.<br />

We can test this hypothesis directly using a χ 2<br />

test for trend. The main difference is that this<br />

test has only one degree <strong>of</strong> freedom rather than<br />

the three for the test <strong>of</strong> association.<br />

Note: You will NOT be asked to calculate a test for trend in this course. You may be asked to interpret the p-value or a χ²trend value with one degree of freedom.



This page for reference only

              Income level (xᵢ)
Disease level    1     2     3     4     Total
0              100   107   111   122      440
≥1 (rᵢ)        115   112   104    97    R = 428
Total (nᵢ)     215   219   215   219    N = 868
rᵢxᵢ           115   224   312   388
nᵢxᵢ           215   438   645   876
nᵢxᵢ²          215   876  1935  3504

p = R/N = 428/868 = 0.49
x̄ = Σnᵢxᵢ/N = 2174/868 = 2.505

χ²trend = [Σrᵢxᵢ − Rx̄]² / ( p(1 − p)[Σnᵢxᵢ² − Nx̄²] )
        = [1039 − 428 × 2.505]² / ( 0.49(1 − 0.49)[6530 − 868 × 2.505²] )
        = 4.06
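The reference-page arithmetic can be verified in Python (our sketch, not part of the notes). Carrying the unrounded p and x̄ through gives ≈ 4.01 rather than the hand-rounded 4.06; either value exceeds the 5% critical value of 3.841 on 1 d.f.

```python
r = [115, 112, 104, 97]     # severe cases r_i
n = [215, 219, 215, 219]    # group totals n_i
x = [1, 2, 3, 4]            # income-level scores x_i

R, N = sum(r), sum(n)                                # 428, 868
p = R / N
xbar = sum(ni * xi for ni, xi in zip(n, x)) / N      # 2174/868

srx = sum(ri * xi for ri, xi in zip(r, x))           # sum of r_i x_i
snx2 = sum(ni * xi ** 2 for ni, xi in zip(n, x))     # sum of n_i x_i^2

chi2_trend = (srx - R * xbar) ** 2 / (p * (1 - p) * (snx2 - N * xbar ** 2))
print(round(chi2_trend, 2))
```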



The trend statistic has only 1 degree <strong>of</strong> freedom.<br />

From χ 2 table, Pr(χ 2 (1 d.f.) > 3.841) = 0.05<br />

Since 4.06 > 3.841 the p-value < 0.05, so we

conclude there is evidence that the proportion<br />

with severe disease decreases as income<br />

increases.<br />

Overview<br />

• interpretation <strong>of</strong> confidence intervals for RR<br />

<strong>and</strong> OR<br />

• relationship between confidence intervals, p-<br />

values <strong>and</strong> sample size<br />

Example: (Hypothetical Data)<br />

The following confidence intervals are from a<br />

study into the erosion <strong>of</strong> tooth enamel as a result<br />

<strong>of</strong> exposure to chlorinated water.<br />

They are the ratio <strong>of</strong> odds for those exposed<br />

(swim ≥ 6 hours per week) to those not exposed<br />

(swim < 6 hours per week).<br />

Suppose an odds ratio greater than 1.5 is<br />

considered clinically important.<br />



(a) OR = 1.90 with CI (1.23, 2.92)<br />

• p < 0.05 <strong>and</strong> conclusive.<br />

• 1 is not contained in the CI, so there is<br />

evidence <strong>of</strong> an association between<br />

exposure <strong>and</strong> outcome.<br />

• the CI is above 1 indicating harm.<br />

(Swimming bad for teeth.)<br />

• note we have not ruled out a non-clinically<br />

important association<br />

(b) OR = 1.69 with CI (0.83, 3.45)<br />

• p > 0.05 <strong>and</strong> inconclusive.<br />

• point estimate indicates a possible clinically important association, but “protection” of tooth enamel (rather than “harm”) is also plausible.

(c) OR = 0.81 with CI (0.39, 1.70)<br />

• p > 0.05, inconclusive.<br />

• conclude no evidence <strong>of</strong> an association<br />

even though CI includes clinically<br />

important effects.<br />

• the point estimate is in the “protection”<br />

range (harm is above 1).<br />



(d) OR = 0.85 with CI (0.53, 1.37)<br />

• p > 0.05, conclusive.<br />

• point estimate in protection range <strong>and</strong> CI<br />

excludes any clinically important harm.<br />

(e) OR = 0.81 with CI (0.67, 0.97)<br />

• p < 0.05 <strong>and</strong> conclusive<br />

• CI excludes 1<br />

• CI entirely less than 1, indicating benefit<br />

from swimming<br />

(f) OR = 1.23 with CI (1.03, 1.48)<br />

• p < 0.05 <strong>and</strong> conclusive<br />

• CI excludes 1<br />

• CI entirely above 1, but excludes the<br />

clinically important difference<br />

• there is evidence <strong>of</strong> an association between<br />

exposure to chlorinated water for more than<br />

6 hours per week but the increased odds are<br />

not clinically important.<br />

(g) OR = 1.15 with CI (0.73, 1.80)<br />

p > 0.05 <strong>and</strong> inconclusive. A clinically<br />

important association is not ruled out.<br />

Advice: Probably continue swimming.<br />



[Figure: the seven odds ratios (marked ×) and their confidence intervals (a)–(g) plotted on a common horizontal axis running from 0 to 3.5, with reference points at 1 (the null value) and 1.5 (the clinically important value).]

Notice that these confidence intervals are not<br />

symmetric.<br />



A Problem when Contingency Tables are<br />

combined<br />

Example: A <strong>University</strong> has a Law School <strong>and</strong> a<br />

Medical Sciences School with men <strong>and</strong> women<br />

being admitted or declined admission as follows:<br />

Admit Decline Total<br />

Male 490 210 700<br />

Female 280 220 500<br />

Total 770 430 1200<br />

Is there gender bias concerning admission (i.e. is there an association between gender and admission decision)?

Expected frequencies under H₀ (no association) are:

         Admit     Decline   Total
Male    [449.2]   [250.8]     700
Female  [320.8]   [179.2]     500
Total     770       430      1200

e.g. E₁₁ = 700 × 770/1200 = 449.2

χ² = (490 − 449.2)²/449.2 + … + … + … = 24.82



with υ = 1 degree <strong>of</strong> freedom. Since critical<br />

value at α = 0.01 level <strong>of</strong> significance is 6.635,<br />

there is strong evidence <strong>of</strong> an association.<br />

Inspection <strong>of</strong> the observed frequencies shows a<br />

tendency to admit a higher number <strong>of</strong> men than<br />

expected i.e. O 11 = 490 but E 11 = 449.2. This<br />

means fewer women are admitted than expected<br />

under equal opportunity. The admission patterns<br />

for the two schools are also known as follows:<br />

LAW SCHOOL                           MEDICAL SCIENCES
        Admit  Decline  Total                Admit  Decline  Total
M        480     120     600          M        10      90     100
F        180      20     200          F       100     200     300
Total    660     140     800          Total   110     290     400

The expected frequencies under H₀ are:

        Admit  Decline                       Admit  Decline
M        495     105                  M      27.5    72.5
F        165      35                  F      82.5   217.5

For Law School χ 2 = 10.38**<br />

For Medical Sciences School, χ 2 = 20.45**<br />



There is strong evidence <strong>of</strong> an association in both<br />

schools.<br />

HOWEVER, inspection <strong>of</strong> the observed counts<br />

indicates a higher number <strong>of</strong> women than<br />

expected are admitted to both schools.<br />

For LAW, O 21 = 180 with E 21 = 165<br />

For MEDICAL SCIENCES, O 21 = 100 with<br />

E 21 = 82.5<br />

This is the opposite conclusion to that when the schools are combined. Is there discrimination against men or women?

This is known as Simpson’s Paradox.<br />

The reason for this discrepancy is that more<br />

women applied to the Medical Sciences school to<br />

which it was more difficult to be admitted. The<br />

final conclusion is therefore unclear.<br />

Notice that there are essentially three factors <strong>of</strong><br />

classification here, <strong>and</strong> we have summed over<br />

one <strong>of</strong> these factors, namely the “TYPE OF<br />

SCHOOL”<br />



COMBINED
          ADMIT          DECLINE
Male     490 (449.2)    210 (250.8)
Female   280 (320.8)    220 (179.2)

LAW                                  MEDICAL
      Admit      Decline                  Admit      Decline
M    480 (495)   120 (105)           M   10 (27.5)    90 (72.5)
F    180 (165)    20 (35)            F  100 (82.5)   200 (217.5)

(Expected numbers are in parentheses)

“Variable” 1 = GENDER<br />

“Variable” 2 = ADMISSION DECISION<br />

“Variable” 3 = SCHOOL TYPE<br />

Note how careful we must be with such an<br />

observational study which fails to recognise an<br />

important “variable” (here school type).<br />

This phenomenon can occur whenever we sum<br />

over a classification in categorical data.<br />
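The reversal is easy to see by computing the admission rates (a Python sketch; the dictionary layout is ours, the counts are from the tables above):

```python
# (admitted, total) by gender: overall and within each school
combined = {"male": (490, 700), "female": (280, 500)}
law      = {"male": (480, 600), "female": (180, 200)}
medical  = {"male": (10, 100),  "female": (100, 300)}

def rate(group):
    admitted, total = group
    return admitted / total

for name, table in [("combined", combined), ("law", law), ("medical", medical)]:
    print(name, {g: round(rate(t), 2) for g, t in table.items()})
# Combined, men are admitted at the higher rate; within each school, women are.
```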



REVIEW EXERCISES<br />

1. A r<strong>and</strong>omized double blind study (prospective) was set up to test for an association between<br />

the use <strong>of</strong> aspirin <strong>and</strong> the incidence <strong>of</strong> fatal or nonfatal strokes in a five year period from the<br />

start <strong>of</strong> the study. The results (Journal <strong>of</strong> the American Medical Association, 243: 661-669)<br />

are summarised in the following contingency table:<br />

Stroke No stroke<br />

Placebo 45 2257<br />

Aspirin 29 2238<br />

(b) Calculate and interpret the risk of stroke for people in the placebo group relative to the aspirin group. Set up a 95% confidence interval for the relative risk. (3 marks)

(c) The use of aspirin was felt to increase the occurrence of gastrointestinal irritation. In the study, 229 of 2267 patients in the aspirin treatment suffered irritation as opposed to 22 of the 2302 in the placebo treatment. Calculate the relative risk of gastrointestinal irritation for people in the aspirin group compared with those in the control. Set up a 95% confidence interval for the relative risk and interpret the result. (3 marks)

(d) Calculate the attributable risk for aspirin compared with control. Set up a 95% confidence interval for the attributable risk and interpret the result.

(3 marks)<br />

3. Long-term Mobile Phone Use <strong>and</strong> Brain Tumour Risk.<br />

Lonn et al (2005), American Journal <strong>of</strong> Epidemiology, 161: 526-535<br />

Human exposure to radiofrequency has increased dramatically during recent years from widespread use of mobile phones. If radiofrequency radiation has a carcinogenic effect, the exposure poses an important public health problem, and intracranial tumours would be of primary interest. Handheld mobile phones were introduced in Sweden during the late 1980s. This case-control study was carried out to test the hypothesis that long-term mobile phone use increases the risk of brain tumours.

(a) This was a case-control study. Describe one advantage and one disadvantage of using a case-control study instead of a cohort study to investigate the association between long-term use of mobile phones and the risk of brain tumour.

(b) The information is summarised below.

Brain Tumour (Outcome)<br />

Mobile phone use Yes No Total<br />

Never/rarely 155 275 430<br />

Regularly 118 399 517<br />

Total 273 674 947<br />

(i) Calculate the odds ratio for the association between long-term mobile phone use<br />

<strong>and</strong> the risk <strong>of</strong> brain tumour.<br />

(ii) Interpret the odds ratio.<br />

(iii) Calculate the 95% confidence interval for the odds ratio.<br />

(iv) Interpret the confidence interval.<br />



SOLUTIONS<br />

1. (b) Risk (aspirin group) = 29/2267 and risk (placebo group) = 45/2302

Relative risk, RR = (45/2302)/(29/2267) = 1.53

The risk of stroke is 1.53 times greater for those in the placebo group.

Also, s.e.(ln RR) = √(1/45 − 1/2302 + 1/29 − 1/2267) = 0.236

and since ln(RR) = 0.424 the 95% confidence interval is
0.424 ± 1.96(0.236)
or 0.424 ± 0.463
or −0.039 < ln(RR) < 0.887
Therefore 0.96 < RR < 2.43, taking exponentials.
(Notice that the null value for the relative risk is 1, hence there is no evidence against the null hypothesis.)
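The same interval can be computed directly (a Python sketch; the variable names are ours):

```python
import math

placebo_strokes, placebo_n = 45, 2302
aspirin_strokes, aspirin_n = 29, 2267

rr = (placebo_strokes / placebo_n) / (aspirin_strokes / aspirin_n)
se = math.sqrt(1/45 - 1/2302 + 1/29 - 1/2267)        # s.e. of ln(RR)
lo = math.exp(math.log(rr) - 1.96 * se)
hi = math.exp(math.log(rr) + 1.96 * se)
print(round(rr, 2), round(lo, 2), round(hi, 2))
```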

(c)
          Irritation   No irritation   Total
Placebo       22           2280         2302
Aspirin      229           2038         2267

RR = (229/2267)/(22/2302) = 10.57
ln(RR) = 2.358
s.e. ln(RR) = √(1/229 − 1/2267 + 1/22 − 1/2302) = 0.221

The 95% C.I. for ln(RR) is 2.358 ± 1.96(0.221)
That is, 2.358 ± 0.433
Giving 1.925 < ln RR < 2.791
Taking exponentials, 6.86 < RR < 16.30
The null value of equal risk is rejected. The true relative risk of irritation if aspirin is used is between 6.86 and 16.30.

(d) Attributable risk = 229/2267 − 22/2302 = 0.10101 − 0.00956 = 0.09145

Estimated standard error = √(0.10101(0.89899)/2267 + 0.00956(0.99044)/2302) = 0.00665

The 95% C.I. for attributable risk is 0.09145 ± 1.96(0.00665)
or 0.091 ± 0.013
or 0.078 < AR < 0.104

Between 78 and 104 in every 1000 people have an increased occurrence of gastrointestinal irritation as a result of using aspirin.
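Part (d) in Python (a sketch; names ours):

```python
import math

p_aspirin, p_placebo = 229/2267, 22/2302
ar = p_aspirin - p_placebo                            # attributable risk
se = math.sqrt(p_aspirin * (1 - p_aspirin) / 2267
               + p_placebo * (1 - p_placebo) / 2302)
lo, hi = ar - 1.96 * se, ar + 1.96 * se
print(round(ar, 3), round(lo, 3), round(hi, 3))
```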

3. (a) Advantage: A case-control study is quicker and cheaper since information on exposure and disease status is obtained at the same time. Brain tumours are also rare, so the number of participants required for a cohort study would be very large.

Disadvantage: The information collected is likely to be affected by recall bias since the events have already occurred.

(b) (i) OR = (118/399)/(155/275) = 0.52<br />

(ii) Those who use mobile phones have 0.52 times the odds <strong>of</strong> a brain tumour compared with those<br />

who do not. [Protective effect from using mobile phones – the odds are 48% less for mobile<br />

phone users compared with those who do not use mobile phones.]<br />

(iii) ln(0.52) = −0.654
The 95% C.I. for ln(OR) is −0.654 ± 1.96 √(1/155 + 1/275 + 1/118 + 1/399)
or −0.654 ± 0.284
or −0.938 < ln(OR) < −0.370
Therefore, 0.39 < OR < 0.69

(iv) 95% confident true OR between 0.39 <strong>and</strong> 0.69. The value (1) is excluded hence<br />

chance is an unlikely explanation.<br />
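Parts (i) and (iii) in Python (a sketch; names ours). Keeping full precision the upper limit comes out at 0.70 rather than the 0.69 above, a rounding difference only:

```python
import math

cases_reg, controls_reg = 118, 399    # regular mobile-phone users
cases_nev, controls_nev = 155, 275    # never/rarely users

odds_ratio = (cases_reg / controls_reg) / (cases_nev / controls_nev)
se = math.sqrt(1/118 + 1/399 + 1/155 + 1/275)         # s.e. of ln(OR)
lo = math.exp(math.log(odds_ratio) - 1.96 * se)
hi = math.exp(math.log(odds_ratio) + 1.96 * se)
print(round(odds_ratio, 2), round(lo, 2), round(hi, 2))
```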





SECTION 9<br />

This section introduces the topic <strong>of</strong> Simple Linear Regression which sets out to fit a straight line<br />

through what is called a scatter diagram. One purpose <strong>of</strong> this analysis is to establish whether one<br />

predictor variable is influencing the outcomes <strong>of</strong> a response variable <strong>and</strong> also measuring the<br />

magnitude <strong>of</strong> the effect <strong>of</strong> this predictor variable on the outcome. It is possible to use the fitted<br />

straight line to make predictions.<br />

Simple linear regression is also the first step in controlling for a confounder variable. This occurs<br />

with the extension to multiple regression which will be considered in the next section.<br />

Scatter Diagrams <strong>and</strong> Examples<br />

Equation <strong>of</strong> Fitted Straight Line<br />

Analysis <strong>of</strong> Variance for Regression Model<br />

Confidence Interval for Slope<br />

Confidence Interval for Prediction<br />

Correlation as Measure <strong>of</strong> Linear Association<br />

Review Exercises<br />

325<br />

Section 9


Regression Procedures Introduction<br />

During the semester we have analysed data from<br />

1. studies which have measured outcomes on<br />

continuous scales [e.g. blood pressure; lung<br />

capacity; cholesterol] resulting from different<br />

treatments<br />

2. studies which have measured binary<br />

outcomes, establishing odds ratios <strong>and</strong><br />

relative risks as a result <strong>of</strong> exposure to<br />

certain conditions. [e.g. effect <strong>of</strong> chlorine on<br />

tooth enamel; effect <strong>of</strong> sun exposure on<br />

melanoma]<br />

In both cases there are potentially other variables<br />

which have an effect <strong>and</strong>/or possible confounding<br />

factors other than the treatments or exposures<br />

which influence the outcomes.<br />

We must allow for these confounders otherwise<br />

invalid conclusions will be drawn about the real<br />

effects <strong>of</strong> the treatments or exposures.<br />



Regression methods are used to introduce these<br />

controls. We now develop:<br />

1. Simple linear Regression (now)<br />

• to describe the relationship between two<br />

variables <strong>and</strong> test whether changes in an<br />

outcome measure may be linked to<br />

changes in the other variable.<br />

• to enable the prediction <strong>of</strong> the value <strong>of</strong><br />

the outcome measure from the other<br />

variable.<br />

2. Multiple Regression (later)<br />

• to identify the main factors influencing a<br />

continuous outcome<br />

• to adjust the means <strong>of</strong> outcomes for<br />

confounders or other factors.<br />

3. Logistic Regression (later)<br />

• to identify the main factors influencing<br />

binary outcomes <strong>and</strong> hence odds ratios<br />

<strong>and</strong> relative risks<br />

• to adjust odds ratios for confounding or<br />

other factors.<br />

Show Hans Rosling’s website gapminder.<br />



Example: Blood Alcohol Concentration in<br />

mg/100mL <strong>and</strong> Body Mass in kg for 8 adults after<br />

drinking 12 glasses <strong>of</strong> regular beer.<br />


MASS (kg) BAC (mg/100mL)<br />

55 0.140<br />

85 0.102<br />

69 0.120<br />

65 0.126<br />

80 0.106<br />

90 0.092<br />

67 0.128<br />

73 0.120<br />

[Scatter diagram: BAC (mg/100 mL), from 0.00 to 0.14 on the vertical axis, against Mass (kg), from 50 to 100 on the horizontal axis; the eight points trend steadily downward.]

Does BAC drop as Body Mass increases?

Other variables which could be important are: gender, amount eaten, and the alcohol level of the beer. Eventually we shall see how to determine which of these may be important.



[Sketch: BAC against Mass with women (×) and men (•) plotted separately.]
• Women consistently above men
• Lines could be parallel

[Sketch: BAC against Mass where the two groups' trend lines converge as mass increases.]
• Lines not parallel. (If low body mass, large difference; if high body mass there is no difference.)



Example: Lung function in children as measured<br />

by a lung capacity variable called FEV.<br />

[Scatter diagram: FEV (+) against Age, 3 to 19 years, for all children together.]

FEV values are increasing as the children grow.<br />

But now see the next two graphs.<br />



[Scatter diagram: FEV against Age with smokers marked separately; the smokers' points fall below the non-smokers'.]
• Once smoking starts, FEV is reduced for the smokers.

[Scatter diagram: FEV against Age with separate fitted lines for non-smokers and smokers; the smokers' line begins at about age 9 and rises more slowly, so the lines are not parallel.]
• This is more accurate, as children may only begin smoking at age 9, and the rate of increase is much smaller along the non-parallel lower (smokers') line.
• Multiple regression is needed for this analysis.



With a simple linear regression take one variable<br />

as response <strong>and</strong> one variable as a predictor.<br />

The response is plotted on the vertical Y axis.<br />

The predictor is plotted on the horizontal X axis.<br />

Equivalent terms for response and predictor:

response  = outcome = dependent variable = (Y-variable)
predictor = explanatory variable = covariate = independent variable = (X-variable)

Simple regression deals with the case where the<br />

relationship is approximately a straight line.<br />

Example: The values <strong>of</strong> a response variable (Y)<br />

<strong>and</strong> the values <strong>of</strong> a predictor variable (X) are as<br />

follows<br />

  X      Y
 100   39.7
 200   51.1
 300   49.9
 400   69.8
 500   65.2
 600   65.1
 700   80.7

The scatter diagram below shows the relationship between Y and X.



[Scatter diagram: Y (from 40 to 80) against X (from 100 to 700); the seven points rise roughly linearly.]

Y increases as X increases. The question is<br />

whether this apparent increase in Y is caused by<br />

changing X, or has it been caused by some other<br />

factor, or has it arisen by chance alone.<br />

The values <strong>of</strong> X, the independent variable, are<br />

known exactly (i.e. no error) whereas the values<br />

<strong>of</strong> Y, the dependent variable, have some r<strong>and</strong>om<br />

error associated with them.<br />

The relationship between Y and X could be linear so we attempt to “fit” a straight line through the data. This line gives the predicted values ŷᵢ for each value xᵢ of X.



[Scatter diagram with the fitted line: for the point at x₄ the vertical distance d₄ between the observed value y₄ and the fitted value ŷ₄ is marked.]

An attempt is made to minimise the differences dᵢ = yᵢ − ŷᵢ between the observed values (yᵢ) and the predicted values (ŷᵢ). The dᵢ are positive for points above the fitted line and negative for points below the line. The expression Σᵢ dᵢ, where there are n data points (i.e. the sample is of size n), does not measure “fit” due to cancellation of negative and positive values.

Therefore, minimise Σᵢ dᵢ² = Σᵢ (yᵢ − ŷᵢ)².

Suppose the straight line which does this has slope β₁ and intercept β₀. That is,

y = β₀ + β₁x



The method of least squares finds the values of β₀ and β₁ which minimise

Σᵢ (yᵢ − ŷᵢ)² = Σᵢ (yᵢ − [β₀ + β₁xᵢ])²

The estimates of β₀ and β₁ are β̂₀ and β̂₁, which turn out to be

β̂₁ = Σᵢ (xᵢ − x̄)(yᵢ − ȳ) / Σᵢ (xᵢ − x̄)²

β̂₀ = ȳ − β̂₁x̄

The line which best “fits” the data is

ŷ = (ȳ − β̂₁x̄) + β̂₁x
  = ȳ + β̂₁(x − x̄)
  = ȳ + [ Σᵢ (xᵢ − x̄)(yᵢ − ȳ) / Σᵢ (xᵢ − x̄)² ] (x − x̄)


Example:

 xᵢ     yᵢ    (xᵢ − x̄)   (xᵢ − x̄)²   (yᵢ − ȳ)   (xᵢ − x̄)(yᵢ − ȳ)
 100   39.7    −300       90000      −20.51        6153
 200   51.1    −200       40000       −9.11        1822
 300   49.9    −100       10000      −10.31        1031
 400   69.8       0           0        9.59           0
 500   65.2     100       10000        4.99         499
 600   65.1     200       40000        4.89         978
 700   80.7     300       90000       20.49        6147
2800  421.5              280000                   16630

x̄ = 400    ȳ = 60.21

Therefore, β̂₁ = 16630/280000 = 0.059
β̂₀ = 60.21 − 0.059(400) = 36.61
giving ŷ = 36.61 + 0.059x

To draw this line on the scatter diagram two points are needed:
e.g. if x = 400, ŷ = 36.61 + 0.059(400) = 60.21
     if x = 100, ŷ = 42.51
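The least-squares fit can be checked in Python (our sketch, not part of the notes). With the unrounded slope the intercept is 36.46; the 36.61 above comes from rounding the slope to 0.059 first.

```python
xs = [100, 200, 300, 400, 500, 600, 700]
ys = [39.7, 51.1, 49.9, 69.8, 65.2, 65.1, 80.7]

n = len(xs)
xbar, ybar = sum(xs) / n, sum(ys) / n

sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))   # 16630
sxx = sum((x - xbar) ** 2 for x in xs)                       # 280000

b1 = sxy / sxx             # slope estimate
b0 = ybar - b1 * xbar      # intercept estimate
print(round(b1, 3), round(b0, 2))
```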



N.B. 1. In this situation we have regressed Y on X. This implies the X values are known without error but the Y values are influenced by random variation.

2. Numerically, we could regress X on Y. But the “slope” of this regression is not the same as that for Y on X. The reason is that now the Y values are known exactly, with the X values influenced by random variation.

3. ŷ = ȳ + β̂₁(x − x̄)
When x = x̄, ŷ = ȳ + β̂₁(0) = ȳ.
This means that the point (x̄, ȳ) always lies on the least squares straight line, i.e. the regression line always passes through the centre of the scatter diagram.



4. We say the least squares line “fits” or<br />

“models” the relationship between Y <strong>and</strong> X.<br />

5. A straight line may give a poor fit, e.g.

[Scatter diagram: Y against X where the points follow a curve, so a straight line fits poorly.]

Here, it is not appropriate to use the line to<br />

predict values <strong>of</strong> Y for given values <strong>of</strong> X.<br />

The next step in our regression analysis is to<br />

establish how well this fitted line is able to model<br />

or explain the effect X has on Y; <strong>and</strong> also, if the<br />

fitted line is used to make forecasts <strong>of</strong> the values<br />

<strong>of</strong> Y, how accurate these forecasts turn out to be.<br />

(We set up confidence intervals for these<br />

forecasts.)<br />

Definition: The value dᵢ = yᵢ − ŷᵢ is called the residual at the value xᵢ of X. These residuals are important as they represent the error made when using the line to make a forecast.

Analysis of Variance for a Regression Model

[Diagram: the regression line with an observation (xᵢ, yᵢ); yᵢ is split into the overall mean ȳ, the regression effect β̂₁(xᵢ − x̄), and the residual dᵢ = yᵢ − ŷᵢ.]

The diagram shows that any numerical value yᵢ can be partitioned into three components as follows:

yᵢ = ȳ + β̂₁(xᵢ − x̄) + (yᵢ − ŷᵢ)

That is, any value
yᵢ = an overall average
   + an amount explained by a predictor variable X
   + a residual (or random error)



The amount explained by the independent<br />

variable X is called the regression effect. This is<br />

also known as the explained component <strong>of</strong> the<br />

outcomes y i .<br />

The magnitude of the regression effect is related to the slope of the line and the distance xᵢ lies from the overall mean x̄ of the values xᵢ.

The mean y is the overall average effect.<br />

The term (yᵢ − ŷᵢ) is the residual effect. This is also known as the unexplained component of the outcomes.

outcomes.<br />

Therefore,<br />

data value = overall average effect<br />

+ regression effect + residual (error)<br />

effect.<br />

= overall average effect<br />

+ explained amount + unexplained<br />

amount<br />



To illustrate, the example has x̄ = 400, ȳ = 60.21 and β̂₁ = 0.059.

x i y i = y + 0.059(x i – 400) + residual<br />

100 39.7 = 60.21 + (–17.82) + (–2.69)<br />

200 51.1 = 60.21 + (–11.88) + 2.77<br />

300 49.9 = 60.21 + (–5.94) + (–4.37)<br />

400 69.8 = 60.21 + 0.00 + 9.59<br />

500 65.2 = 60.21 + 5.95 + (–0.95)<br />

600 65.1 = 60.21 + 11.88 + (–6.99)<br />

700 80.7 = 60.21 + 17.82 + 2.67<br />

overall mean explained unexplained effect<br />

common to effect. chosen to give<br />

each data<br />

equality.<br />

value.<br />
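The notes do the partition by hand; as a check it can be reproduced in Python with numpy (an illustrative choice, since the course software is R-cmdr):

```python
import numpy as np

# Data from the worked example in the notes.
x = np.array([100, 200, 300, 400, 500, 600, 700], dtype=float)
y = np.array([39.7, 51.1, 49.9, 69.8, 65.2, 65.1, 80.7])

x_bar, y_bar = x.mean(), y.mean()
beta1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)

explained = beta1 * (x - x_bar)   # regression effect for each observation
y_hat = y_bar + explained         # fitted values
residual = y - y_hat              # unexplained effect

# Each observation partitions exactly: y_i = ȳ + explained + residual
assert np.allclose(y, y_bar + explained + residual)
print(round(beta1, 4))            # ≈ 0.0594, quoted as 0.059 in the notes
```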

It is important to establish whether the explained effect has a much greater impact on the values yᵢ than the unexplained residual effect, i.e. does the regression effect explain more of the variation in the yᵢ values? It turns out that the total variation in the yᵢ values can be partitioned into an overall mean component, a regression component and a residual component as follows:



[This page just for reference]

Total sum of squares (SS) of the yᵢ values
    = (39.7)² + (51.1)² + (49.9)² + (69.8)² + (65.2)² + (65.1)² + (80.7)²
    = 26550.89

The overall mean SS
    = (60.21)² + … + (60.21)²  (7 times)
    = 7(60.21)²
    = 25380.32

The regression effect SS
    = (−17.82)² + (−11.88)² + … + (17.82)²
    = 987.70

The residual effect SS
    = (−2.69)² + (2.77)² + … + (2.67)²
    = 182.87

Now notice that

    26550.89 = 25380.32 + 987.70 + 182.87

i.e. Total SS = overall mean SS + regression SS + residual SS
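The same sum-of-squares partition can be verified numerically; a minimal Python/numpy sketch (illustrative only, not part of the notes):

```python
import numpy as np

# Data from the worked example in the notes.
x = np.array([100, 200, 300, 400, 500, 600, 700], dtype=float)
y = np.array([39.7, 51.1, 49.9, 69.8, 65.2, 65.1, 80.7])

y_bar = y.mean()
beta1 = np.sum((x - x.mean()) * (y - y_bar)) / np.sum((x - x.mean()) ** 2)
y_hat = y_bar + beta1 * (x - x.mean())

total_ss = np.sum(y ** 2)                    # ≈ 26550.89
mean_ss = len(y) * y_bar ** 2                # ≈ 25380.32
regression_ss = np.sum((y_hat - y_bar) ** 2) # ≈ 987.70
residual_ss = np.sum((y - y_hat) ** 2)       # ≈ 182.87

# Total SS = overall mean SS + regression SS + residual SS
assert np.isclose(total_ss, mean_ss + regression_ss + residual_ss)
```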



That is, the total variation is partitioned into these components, which should now be compared. But the three component values cannot be compared directly. Note that:

(i) There are seven data values yᵢ, hence seven degrees of freedom.
(ii) The one overall mean has one DF.
(iii) The seven regression values depend on the one slope estimate β̂₁, hence one DF.
(iv) The seven residuals have the remaining 7 − 2 = 5 DF.

The average or mean squares (MS) are then found by dividing the sums of squares by the degrees of freedom. These mean squares can be compared. The procedure is summarised in the following analysis of variance table:

SOURCE OF VARIATION    SS         DF    MS
Overall mean           25380.32   1
Regression effect      987.70     1     987.70
Residual effect        182.87     (5)   36.57
Total                  26550.89   7



The average regression effect (or the average effect of X on the Y values) far exceeds the average residual effect (unexplained), since 987.70 far exceeds 36.57. But is this difference large enough to be important? The question of whether the average regression effect is large enough is answered by defining F = 987.70/36.57 = 27.01 and testing this F-statistic for significance by reference to the F-table as follows (note that the DF here are 1 and 5 respectively for numerator and denominator).

Since 27.01 > 6.608 there is evidence that the regression (or explained) effect dominates the residual (or unexplained) effect. Since the key part of the regression effect is the slope β̂₁, this effectively means β₁ ≠ 0; alternatively, there is evidence that changes in the values xᵢ of X explain the variation in the values yᵢ of Y (and this dominates any left-over residual or unexplained effects).
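The F comparison can be checked with scipy (an illustrative tool choice; the notes read the critical value from the printed F-table):

```python
from scipy import stats

# F statistic from the ANOVA table in the notes: MS(regression) / MS(residual)
f_stat = 987.70 / 36.57            # ≈ 27.01
df_num, df_den = 1, 5

# Critical value and p-value for the F(1, 5) distribution at α = 0.05
f_crit = stats.f.ppf(0.95, df_num, df_den)    # ≈ 6.608, as in the F-table
p_value = stats.f.sf(f_stat, df_num, df_den)
print(round(f_crit, 3), p_value < 0.05)
```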



The F-distribution (Table in Appendix)

υ₁ = numerator DF, υ₂ = denominator DF, α = 0.05 (say).

[Diagram: F density curve with upper-tail area α beyond the critical value F(υ₁, υ₂).]

Extract from the F-table (α = 0.05):

υ₂ \ υ₁     1       2       3      4 … 60
1           …       …       …      …
⋮
5           6.608   5.786   5.409  …
⋮
120         3.920   3.072   2.680  …



Note: 1. The residual effect includes any random error plus the effects of other variables which may be affecting the outcome Y values.

2. Computer software produces the analysis of variance table directly.

3. It is in a slightly modified form because the overall mean effect is never used. It is therefore subtracted (with appropriate changes to the total SS and the degrees of freedom):

SOURCE OF VARIATION          SS        DF    MS      F
Regression effect            987.70    1     987.70  27.01*
Residual effect              182.87    (5)   36.57
Total (overall mean removed) 1170.57   6

4. The “fitted” straight line should pass through<br />

the middle <strong>of</strong> the scatter diagram, <strong>and</strong> hence<br />

the residuals should take positive <strong>and</strong><br />

negative values as X increases. (This can be<br />

checked by studying plots <strong>of</strong> the residuals<br />

produced by the program.)<br />



5. For the validity of the F-test, residuals should be approximately normally distributed. This can also be checked by obtaining the normal probability plot using the program.

Analyse > Regression > Linear, with Y in the Dependent Variable box and X in the Independent Variable box, produces the printout (not reproduced here).



A Confidence Interval for the Slope of the Line

Our sample of n = 7 produced an estimate β̂₁ = 0.059.

Repeated samples of size n = 7 give values β̂₁ which follow a normal distribution (just the Central Limit Theorem again).

If β₁ is the true slope of the regression line, then the standard error of β̂₁ is

    σ_β̂₁ = σₑ / √( Σᵢ₌₁ⁿ (xᵢ − x̄)² )

where σₑ² is estimated from the data by the formula

    sₑ² = Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)² / (n − 2)

Notes
1. (yᵢ − ŷᵢ) is the residual (or error) at the value xᵢ of X.



2. The divisor is (n − 2) rather than the (n − 1) used in the calculation of an ordinary variance because here two values, β₀ and β₁, are estimated from the data and used to find the ŷᵢ from which the deviations are measured.

[For an ordinary variance, s² = Σ(xᵢ − x̄)² / (n − 1), only x̄ is estimated.]

The estimated standard error of the slope of the regression line is

    s_β̂₁ = sₑ / √( Σᵢ₌₁ⁿ (xᵢ − x̄)² )

Therefore, the 95% confidence interval for β₁ is

    β̂₁ ± t₍ₙ₋₂₎ · sₑ / √( Σᵢ₌₁ⁿ (xᵢ − x̄)² )

Notes.
(1) There are υ = n − 2 degrees of freedom for use with the t-table.



(2) If σₑ were known exactly (which it never is), the 95% confidence interval would be

    β̂₁ ± 1.96 σₑ / √( Σ (xᵢ − x̄)² ).

(3) In practice, σₑ is always estimated by

    sₑ = √( Σ (yᵢ − ŷᵢ)² / (n − 2) )

(4) sₑ² is just the residual mean square and this can be read directly from the analysis of variance.

Example

Refer to the earlier data, which gave Σ(xᵢ − x̄)² = 280000, β̂₁ = 0.059 and

    ŷᵢ = 36.6 + 0.059xᵢ



              Residuals (see earlier)
xᵢ     yᵢ     (yᵢ − ŷᵢ)    (yᵢ − ŷᵢ)²
100   39.7     −2.69          7.24
200   51.1      2.77          7.67
300   49.9     −4.37         19.10
400   69.8      9.59         91.97
500   65.2     −0.95          0.90
600   65.1     −6.99         48.86
700   80.7      2.67          7.13
       Residual sum of squares: 182.87

Therefore,

    sₑ² = 182.87 / (7 − 2) = 36.57  (the residual mean square)

with n − 2 = 7 − 2 = 5 D.F., giving t₅ = 2.571 for 95% confidence.

The standard error of the slope is estimated to be

    sₑ / √( Σ(xᵢ − x̄)² ) = √36.57 / √280000 = 0.0114



The 95% confidence interval is 0.059 ± 2.571(0.0114), or 0.059 ± 0.029.

Hence 0.030 < β₁ < 0.088.

[Diagram: scatter of Y against X with an upward-sloping trend line.] As X changes, the values of Y tend to show an increasing trend with random variation about the trend line.
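The slope interval above can be reproduced in Python with scipy (illustrative only; the notes use the printed t-table):

```python
import math
from scipy import stats

# Quantities from the worked example in the notes.
beta1_hat = 0.059
s_e = math.sqrt(182.87 / (7 - 2))   # residual standard deviation
sxx = 280000                        # Σ(xᵢ − x̄)²

se_slope = s_e / math.sqrt(sxx)     # ≈ 0.0114
t_crit = stats.t.ppf(0.975, df=5)   # ≈ 2.571
lo = beta1_hat - t_crit * se_slope
hi = beta1_hat + t_crit * se_slope
print(round(se_slope, 4), round(lo, 3), round(hi, 3))
```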

Example<br />

A test has been designed to measure patient stress<br />

level (X). Blood pressure (Y) is recorded for<br />

different stress levels.<br />



Stress (X)       55   94   64   73   96   86
Blood Pr. (Y)    72   91   76   78   94   81

These data give x̄ = 78; ȳ = 82; Σ(xᵢ − x̄)² = 1394 and Σ(xᵢ − x̄)(yᵢ − ȳ) = 686.

Find the least squares line and a 95% confidence interval for the slope, and test the research proposal that higher stress results in higher blood pressure levels.



Solution:

    β̂₁ = Σ(xᵢ − x̄)(yᵢ − ȳ) / Σ(xᵢ − x̄)² = 686/1394 = 0.492

    ∴ ŷ = ȳ + β̂₁(x − x̄) = 82 + 0.492(x − 78)

Suppose a computer analysis gives the analysis of variance as follows:

SOURCE OF VARIATION    SS       DF    MS       F
Regression effect      337.59   1     337.59   33.41
Residual effect        40.41    4     10.10

Then sₑ² = Σ(yᵢ − ŷᵢ)² / (n − 2) = 40.41/4 = 10.10, giving sₑ = 3.178 as the residual standard deviation.

For 95% confidence, t₄ = 2.776 and the standard error of the slope = 3.178/√1394 = 0.085.

The 95% confidence interval is 0.492 ± 2.776(0.085).

It follows that 0.256 < β₁ < 0.728.

The test has a p-value less than 0.05.
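The whole stress/blood-pressure calculation can be checked end to end in Python (numpy/scipy are illustrative choices, not part of the notes):

```python
import numpy as np
from scipy import stats

# Stress/blood-pressure example data from the notes.
x = np.array([55, 94, 64, 73, 96, 86], dtype=float)
y = np.array([72, 91, 76, 78, 94, 81], dtype=float)

sxx = np.sum((x - x.mean()) ** 2)                # 1394
sxy = np.sum((x - x.mean()) * (y - y.mean()))    # 686
beta1 = sxy / sxx                                # ≈ 0.492

y_hat = y.mean() + beta1 * (x - x.mean())
s_e = np.sqrt(np.sum((y - y_hat) ** 2) / (len(x) - 2))   # ≈ 3.178

se_slope = s_e / np.sqrt(sxx)                    # ≈ 0.085
t_crit = stats.t.ppf(0.975, df=len(x) - 2)       # ≈ 2.776
print(round(beta1, 3), round(beta1 - t_crit * se_slope, 3),
      round(beta1 + t_crit * se_slope, 3))
```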



Confidence Interval for Prediction using a Regression Line

The prediction at the value xᵢ of X is found by substituting xᵢ in the regression equation, e.g. for our data, ŷ = 36.6 + 0.059x.

When xᵢ = 750, ŷ = 36.6 + 0.059(750) = 80.85.

But what error is associated with this prediction?

At the value X = xₖ, say, the estimated standard error of the prediction is

    s_ŷ = sₑ √( 1 + 1/n + (xₖ − x̄)² / Σ(xᵢ − x̄)² )

where sₑ is the residual standard deviation.

But sₑ = √36.57 = 6.05 (see ANOVA table)

    ∴ s_ŷ = 6.05 √( 1 + 1/7 + (750 − 400)²/280000 ) = 7.604



The 95% confidence interval is ŷ ± t₅ s_ŷ, where t₅ = 2.571.

That is, 80.85 ± 2.571(7.604).

Therefore, 61.30 < ŷ₇₅₀ < 100.40, where ŷ₇₅₀ is the prediction at xₖ = 750.

Notes
(1) R-cmdr (and other packages) give this interval when requested.
(2) A graph showing the confidence bands around the regression line can also be produced.
(3) Essentially the confidence interval for the prediction involves line error and natural variation to predict a data point.
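A short Python check of the prediction interval at xₖ = 750 (scipy used here as an illustrative stand-in for the t-table):

```python
import math
from scipy import stats

# Prediction interval at x_k = 750 for the worked example ŷ = 36.6 + 0.059x.
n, x_bar, sxx = 7, 400, 280000
s_e = math.sqrt(182.87 / (n - 2))     # residual standard deviation ≈ 6.05

x_k = 750
y_pred = 36.6 + 0.059 * x_k           # 80.85
se_pred = s_e * math.sqrt(1 + 1/n + (x_k - x_bar) ** 2 / sxx)   # ≈ 7.60

t_crit = stats.t.ppf(0.975, df=n - 2) # ≈ 2.571
print(round(y_pred - t_crit * se_pred, 1), round(y_pred + t_crit * se_pred, 1))
```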



[Diagram: regression line with prediction-interval bands, Y against X.]

EXAMPLE: 2003 EXAM<br />

The data for this question are a sample <strong>of</strong> 100 low<br />

birth weight infants. Measurements <strong>of</strong> systolic<br />

blood pressure (sbp) <strong>and</strong> values <strong>of</strong> gestational age<br />

(gestage) are recorded. The following table<br />

shows the layout <strong>of</strong> the data along with the results<br />

<strong>of</strong> some calculations using the 100 data values.<br />



sbp (Y mm Hg)   gestage (X weeks)
43              29                  ȳ = 47.31
51              31                  x̄ = 28.89
42              33                  Σ(xᵢ − x̄)² = 635.69
39              31                  Σ(yᵢ − ȳ)² = 15222.24
⋮               ⋮                   Σ(xᵢ − x̄)(yᵢ − ȳ) = 806.31
40              33
50              28

(a) (4 marks) Using systolic blood pressure as<br />

the response <strong>and</strong> gestational age as the<br />

predictor variable, compute the least squares<br />

regression line. Interpret the slope <strong>of</strong> this<br />

regression line.<br />

(b) (5 marks) The st<strong>and</strong>ard deviation <strong>of</strong> the<br />

sample points about the regression line in (a)<br />

is s e = 3.47. Obtain an estimate for the<br />

st<strong>and</strong>ard error <strong>of</strong> the slope <strong>of</strong> the regression<br />

<strong>and</strong> hence set up a 95% confidence interval<br />

for the slope <strong>of</strong> the regression line. State<br />

whether you would reject the null hypothesis<br />

that the true slope is equal to 0.<br />



(c) (3 marks) What is the predicted systolic blood pressure for a low birth weight infant whose gestational age is 31 weeks? Construct a 95% confidence interval for the prediction.

(d) (1 mark) The value of the coefficient of determination is R-Sq = 67%. Interpret this value. (Discussed next lecture.)

(e) (3 marks) What conclusions would you draw from the two residual plots below arising from the fitted regression in (a)?



SOLUTION

(a) β̂₁ = 806.31/635.69 = 1.27

    β̂₀ = 47.31 − 1.27(28.89) = 10.62

    ŷ = 10.62 + 1.27x

    For infants with gestational age one week higher, the model predicts sbp increases by 1.27 mm Hg.



(b) Estimated standard error = 3.47/√635.69 = 0.138

    95% C.I. is 1.27 ± 1.98(0.138), giving 1.27 ± 0.273, or 1.00 < β₁ < 1.54.

    The confidence interval excludes zero (p-value < 0.05), hence reject the null hypothesis.

(c) Prediction = 10.62 + 1.27(31) = 49.99

    95% C.I. is 49.99 ± 1.98(3.47)√( 1 + 1/100 + (31 − 28.89)²/635.69 )

    giving 49.99 ± 6.92, or 43.07 < ŷ₃₁ < 56.91.

(d) 67% of the total sum of squares of the sbp values is explained by changes in the number of weeks of gestation. (Alternatively, 67% of the variation in the sbp values is explained.) (Discussed next lecture.)

(e) Variation about the fitted line is constant for different gestation times. The residuals appear close to a normal distribution except for a possible outlier at x = 29.
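The exam solution works from summary statistics only; the same arithmetic in Python (an illustrative sketch, not the course software):

```python
import math

# Exam example summary statistics (from the notes).
x_bar, y_bar = 28.89, 47.31
sxx, sxy = 635.69, 806.31
s_e = 3.47

beta1 = sxy / sxx                 # ≈ 1.27
# Intercept with the unrounded slope is ≈ 10.67; the notes use the
# rounded slope 1.27 and quote 10.62.
beta0 = y_bar - beta1 * x_bar

se_slope = s_e / math.sqrt(sxx)   # ≈ 0.138
print(round(beta1, 2), round(se_slope, 3))
```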



Correlation

The correlation coefficient is a measure of linear association. The Pearson correlation coefficient r is defined as

    r = Σᵢ₌₁ⁿ (xᵢ − x̄)(yᵢ − ȳ) / √( Σᵢ₌₁ⁿ (xᵢ − x̄)² · Σᵢ₌₁ⁿ (yᵢ − ȳ)² )

This measures the "strength" of linear association between X and Y (as we shall now see). Recall that the regression line passes through the point (x̄, ȳ).

[Diagram: scatter plot divided into four quadrants, labelled 1 to 4, by a vertical line through x̄ and a horizontal line through ȳ.]



The denominator in the formula for r is always positive. In quadrant 1, xᵢ − x̄ > 0 and yᵢ − ȳ > 0, meaning (xᵢ − x̄)(yᵢ − ȳ) > 0. In quadrant 3, (xᵢ − x̄) < 0 and yᵢ − ȳ < 0, giving (xᵢ − x̄)(yᵢ − ȳ) > 0. In quadrants 2 and 4, (xᵢ − x̄)(yᵢ − ȳ) < 0.

Therefore, r is large and positive if the points lie mainly in quadrants 1 and 3; it is large and negative if the points lie mainly in quadrants 2 and 4.

[Diagrams: (i) points scattered evenly over all four quadrants; (ii) points following a strong curved pattern.]

In case (i) the contribution is equal from each quadrant, the contributions cancel, and therefore r = 0, i.e. there is no relationship between Y and X.

In case (ii) there is again cancellation and r = 0, but here there is a strong relationship between Y and X; it is simply non-linear.



r therefore measures the strength of the linear association between X and Y. But we must be careful, as r = 0 in the following case (iii) where β₁ = 0. In fact r is directly related to β₁ and is zero if β₁ is zero.

[Diagram (iii): points scattered about a horizontal line, so the slope β₁ = 0 and r = 0.]

Example: A researcher investigates the<br />

relationship between reading <strong>and</strong> spelling tests<br />

administered to nine students<br />

Student 1 2 3 4 5 6 7 8 9<br />

X (spelling) 52 90 63 81 93 51 48 99 85<br />

Y (reading) 56 81 75 72 50 45 39 87 59<br />



xᵢ    yᵢ    (xᵢ − x̄)²    (yᵢ − ȳ)²    (xᵢ − x̄)(yᵢ − ȳ)
52    56       …            …             …
90    81       …            …             …
63    75       …            …             …
81    72       …            …             …
93    50       …            …             …
51    45       …            …             …
48    39       …            …             …
99    87       …            …             …
85    59       …            …             …
Totals:    3220.2225    2258.0001    1718.6665

x̄ = 73.55,  ȳ = 62.67

    r = 1718.6665 / √( 3220.2225 × 2258.0001 ) = +0.6374

But what does this mean?
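The correlation for the spelling/reading example can be reproduced in Python with numpy (illustrative only):

```python
import numpy as np

# Spelling (X) and reading (Y) scores for the nine students in the notes.
x = np.array([52, 90, 63, 81, 93, 51, 48, 99, 85], dtype=float)
y = np.array([56, 81, 75, 72, 50, 45, 39, 87, 59], dtype=float)

num = np.sum((x - x.mean()) * (y - y.mean()))
den = np.sqrt(np.sum((x - x.mean()) ** 2) * np.sum((y - y.mean()) ** 2))
r = num / den
print(round(r, 4))   # ≈ 0.6374; np.corrcoef(x, y)[0, 1] gives the same value
```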

[Diagrams: example scatter plots showing very strong correlation (r near +1 and near −1), moderate correlation (r about +0.7 and −0.7), and very small correlation (r about +0.2 and −0.2).]

Notes:<br />

1. The largest value <strong>of</strong> r turns out to be +1. In<br />

this case all points lie on a straight line in<br />

quadrants 1 <strong>and</strong> 3 . This implies perfect<br />

positive linear association. i.e. as X increases,<br />

Y increases in the same ratio (if the increase<br />

<strong>of</strong> X is doubled, the increase in Y would also<br />

be doubled).<br />



2. r = −1 is the smallest value; it implies perfect negative linear association, when all points lie in quadrants 2 and 4, i.e. as X increases, Y decreases in the same ratio.

3. |r| > 0.7 implies strong linear relationship.<br />

|r| < 0.3 implies negligible linear relationship.<br />

4. The correlation coefficient is an index. It<br />

does not depend on the units <strong>of</strong> either X or Y.<br />

(numerator <strong>and</strong> denominator in same units)<br />

5. r is called the Pearson Correlation<br />

Coefficient.<br />

6. An important correlation does not imply a<br />

causal link between the two variables. (The<br />

correlation is <strong>of</strong>ten caused by the effect <strong>of</strong> a<br />

third variable influencing both X <strong>and</strong> Y).<br />

e.g. smoking <strong>and</strong> lung cancer incidence<br />

correlated – not smoking causing lung<br />

cancer.<br />

7. If r is large, a regression line will fit the data<br />

well.<br />



8. r² gives the fraction of variability in the Y values associated with the predictor variable X.

   e.g. In the example, r = 0.6374, so r² = 0.406, i.e. 40.6% of the variability in Y is explained by changes in X.

   That is,

       r² = SS(Regression) / SS(Total)    where Total = Regression + Residual

   for a simple linear regression.
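This identity between r² and the sum-of-squares ratio can be verified numerically on the earlier regression example (Python/numpy used illustratively):

```python
import numpy as np

# Worked regression example: r² should equal SS(Regression)/SS(Total).
x = np.array([100, 200, 300, 400, 500, 600, 700], dtype=float)
y = np.array([39.7, 51.1, 49.9, 69.8, 65.2, 65.1, 80.7])

r = np.corrcoef(x, y)[0, 1]
beta1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
y_hat = y.mean() + beta1 * (x - x.mean())

ss_reg = np.sum((y_hat - y.mean()) ** 2)   # ≈ 987.70
ss_tot = np.sum((y - y.mean()) ** 2)       # ≈ 1170.57 (overall mean removed)
assert np.isclose(r ** 2, ss_reg / ss_tot)
print(round(r ** 2, 3))
```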



Some examples on correlation <strong>and</strong> association discussed in lectures.<br />

Correlation measures association but association is not the same as causation.<br />

Example: For school children, shoe size is strongly correlated with reading skills.<br />

Learning new words does not make the feet get bigger.<br />

Instead, there is a third factor, age. As children get older, they learn to read better <strong>and</strong> they outgrow<br />

their shoes.<br />

Age is a confounder. Here, this confounder is easy to spot. Often this is not so easy. The<br />

arithmetic <strong>of</strong> the correlation coefficient does not give protection against third factors.<br />

Example: Education level and unemployment.

In the Great Depression (1929–1933), better-educated people had shorter spells of unemployment. (Education level and days unemployed were very highly correlated: negatively, as more education was associated with fewer days unemployed.) Does education protect you against unemployment?

Discussion:<br />

Perhaps, but the data were observational. Age is a confounding variable. Younger people were<br />

better educated as education level had been increasing over time. (It still is!!)<br />

Employers seemed to prefer younger job seekers.<br />

Controlling for age made the effect <strong>of</strong> education on unemployment much weaker.<br />

Example:

In countries where people eat lots of fat, rates of breast and colon cancer are high. This correlation is often used to argue that fat in the diet causes cancer. How good is this evidence?

[Diagram: scatter plot of death rate (per 100000) against fat intake per capita per day (grams). The points slope upward, with Thailand, Sri Lanka and Japan at the low end and Denmark, NZ, UK, Holland, Spain and Finland toward the high end.]



Discussion: There is a very high correlation as shown by the scatter diagram which is very<br />

elongated. If fat in diet causes cancer, then the points should slope up as shown. So the diagram is<br />

some evidence for the theory. But the evidence is weak.<br />

For example, countries with lots <strong>of</strong> fat in diet also have lots <strong>of</strong> sugar, <strong>and</strong> a similar plot for sugar<br />

would be found.<br />

As it turns out, fat <strong>and</strong> sugar are relatively expensive. In rich countries people can afford to eat fat<br />

<strong>and</strong> sugar rather than starchier grain products.<br />

Some aspects <strong>of</strong> diet in these countries or these life-style factors probably do cause certain kinds <strong>of</strong><br />

cancer. Epidemiologists can identify only a few <strong>of</strong> these factors with confidence. Fat is not among<br />

them.<br />

Example: Ultrasound <strong>and</strong> low birthweight.<br />

Babies can be examined in the womb using ultrasound. Several experiments on lab animals have<br />

shown ultrasound exams can cause low birthweight. If true for humans, there are grounds for<br />

concern. Scientists at Johns Hopkins Hospital in Baltimore ran an observational study to find out.<br />

Babies exposed to ultrasound differ from unexposed babies in many ways beside exposure; this<br />

investigation was only an observational study.<br />

The scientists found a number <strong>of</strong> confounding variables <strong>and</strong> adjusted for them. There was still an<br />

association. Babies exposed to ultrasound in the womb had lower birthweight, on average.<br />

Is this evidence that ultrasound causes lower birthweight?<br />

Discussion: Obstetricians suggest ultrasound examination when something seems wrong. The<br />

investigators concluded that the ultrasound exams <strong>and</strong> low birthweights had a common cause –<br />

problem pregnancies.<br />

Later, a r<strong>and</strong>omized controlled experiment was carried out to get more definite evidence. If<br />

anything, ultrasound was protective.<br />

Journal <strong>of</strong> Obstetrics <strong>and</strong> Gynaecology. Volume 71 (1988) pp 513-517<br />

Also Lancet (1988) pp 585-588<br />



REVIEW EXERCISES<br />

1. Physical fitness testing is an important aspect <strong>of</strong> athletic training. A common measure <strong>of</strong> the magnitude <strong>of</strong><br />

cardiovascular fitness is the maximum volume <strong>of</strong> oxygen uptake during a strenuous exercise. A study was<br />

conducted on 18 middle-aged men to study the influence <strong>of</strong> the time that it takes to complete a 2-mile run.<br />

The oxygen uptake measure was accomplished with st<strong>and</strong>ard laboratory methods as the subjects performed<br />

on a motor driven treadmill. The data (Ribisl et al. Journal <strong>of</strong> Sports Medicine, 9: 17-22) are below:<br />

Maximum Volume of O₂ (Y)   Time in Seconds (X)
42.33                       918        Data summary:
53.10                       805        x̄ = 831.40
42.08                       892        ȳ = 47.67
42.45                       968        Σ(xᵢ − x̄)² = 160613.28
42.46                       907        Σ(xᵢ − x̄)(yᵢ − ȳ) = −8698.33
49.92                       743        Σ(yᵢ − ŷᵢ)² = 55.25
36.23                      1045
49.66                       810
41.49                       927
46.16                       813
48.18                       858
51.81                       760
53.28                       747
53.29                       743
47.18                       803
56.91                       683
47.80                       844
53.69                       700

(a) Use the data summary to find an estimate for the equation of the least squares regression line of Y on X. (2 marks)

(b) Find an estimate for the standard error of the slope of the regression line and set up a 95% confidence interval for the slope of the regression line. (4 marks)

(c) What does the confidence interval in (b) tell you about the effect of time (X) on maximum volume of oxygen uptake (Y)? (1 mark)

(d) If a man in this age group takes 50 seconds longer to run the 2-mile distance, what is the change in his maximum volume of oxygen uptake? Write down the 95% confidence interval for this change using the result from (c). (2 marks)

(e) Set up a 95% confidence interval for the maximum volume of oxygen uptake for a man who takes 11 minutes (660 seconds) to complete a two-mile run. (3 marks)



SOLUTIONS

1. (a) b_YX = −8698.33/160613.28 = −0.054

       ŷ = 47.67 − 0.054(x − 831.4) = 92.566 − 0.054x

   (b) Estimated standard error = sₑ / √( Σ(xᵢ − x̄)² ), where sₑ = √( Σ(yᵢ − ŷᵢ)² / (n − 2) )

       That is, standard error = √(55.25/16) / √160613.28 = 0.004637

       A 95% confidence interval for the true slope is

           −0.054 ± t₁₆(0.004637), where t₁₆ = 2.120

       That is, −0.054 ± 0.0098, giving −0.064 < β_YX < −0.044.

   (c) The maximum volume of oxygen uptake is smaller for men who take longer to run 2 miles.

   (d) Oxygen uptake reduces by 50(0.054) = 2.7 units. The 95% confidence interval extends from 50(0.044) to 50(0.064), i.e. a reduction of between 2.2 and 3.2 units.

   (e) When x = 660 seconds, ŷ = 92.566 − 0.054(660) = 56.93

       The 95% confidence interval is

           56.93 ± t₁₆ sₑ √( 1 + 1/n + (xₖ − x̄)² / Σ(xᵢ − x̄)² )

       That is, 56.93 ± 2.120 √(55.25/16) √( 1 + 1/18 + (660 − 831.4)²/160613.28 )

       or 56.93 ± 4.38, giving 52.55 < ŷ₆₆₀ < 61.31.
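The slope part of this solution can be checked from the summary statistics in Python (scipy stands in for the t-table; illustrative only):

```python
import math
from scipy import stats

# Review exercise summary statistics (oxygen uptake vs 2-mile run time).
n = 18
sxx = 160613.28        # Σ(xᵢ − x̄)²
sxy = -8698.33         # Σ(xᵢ − x̄)(yᵢ − ȳ)
rss = 55.25            # Σ(yᵢ − ŷᵢ)²

beta1 = sxy / sxx                       # ≈ −0.054
s_e = math.sqrt(rss / (n - 2))
se_slope = s_e / math.sqrt(sxx)         # ≈ 0.004637
t_crit = stats.t.ppf(0.975, df=n - 2)   # ≈ 2.120
print(round(beta1, 3), round(beta1 - t_crit * se_slope, 3),
      round(beta1 + t_crit * se_slope, 3))
```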



SECTION 10<br />

Multiple regression models <strong>and</strong> logistic regression models are introduced in this section. In the case<br />

<strong>of</strong> ordinary multiple regression the response or outcome variable is on a continuous scale whereas<br />

in the case <strong>of</strong> a logistic regression the outcome measure is binary taking therefore only two possible<br />

values interpreted as success versus failure.<br />

The models allow us to identify those variables which have an effect on the outcomes <strong>and</strong> those<br />

variables which do not.<br />

Adding additional variables leads to adjusted values for estimated parameters <strong>and</strong> it is this that<br />

allows us to control for confounding.<br />

The Multiple Regression Model<br />

R-cmdr Printout for Multiple Regression<br />

Dummy Variables<br />

Checking Model Fit<br />

Parallel Regression Lines <strong>and</strong> Analysis <strong>of</strong> Covariance<br />

Binary Outcomes <strong>and</strong> Logistic Regression<br />



Multiple regression<br />

• Simple linear regression (SLR) allowed us to<br />

assess the effect <strong>of</strong> a single independent<br />

variable (X) on a response variable (Y).<br />

• But what do we do if we think that the response may change according to more than one independent variable?

• SLR can be extended.

• Multiple regression allows us to assess the<br />

effects <strong>of</strong> several independent variables on<br />

the outcome variable <strong>and</strong> it allows the<br />

prediction <strong>of</strong> a response from the values <strong>of</strong><br />

several independent variables.<br />

• In multiple regression, there is a single<br />

dependent (outcome) variable <strong>and</strong> two or<br />

more independent (explanatory, predictor)<br />

variables or covariates.<br />

• The predictor variables can be:<br />

Continuous (e.g. blood pressure, height)<br />

Categorical – binary (e.g. sex)<br />



• The type of multiple regression that is performed depends on the data type of the outcome variable.<br />

• If the outcome variable is continuous, we use multiple linear regression.<br />

• If the outcome variable is binary, we use multiple logistic regression.<br />

The possible applications of multiple regression include:<br />

1. Adjusting for the effect of confounding variables.<br />

2. Establishing which variables are important in explaining the values of the outcome (response) variable.<br />

3. Predicting values of the outcome variable.<br />



4. Describing the strength of the association between the outcome variable and the explanatory variables, and reducing residual variation by introducing further effects as predictor variables.<br />

Multiple regression investigates and tests the joint effect of all predictors on the outcome variable, as well as measuring the individual effect of each predictor.<br />

Example: Predict lung capacity from the age, sex and height of a patient.<br />

Lung capacity itself is difficult to measure. For heart-lung transplants to have the best chance of success it is desirable to have donor and recipient lungs of similar size.<br />



The multiple linear regression model:<br />

y = β₀ + β₁x₁ + β₂x₂ + β₃x₃ + … + error<br />

For simple linear regression the model is:<br />

y = β₀ + β₁x + ε<br />

The fitted straight line then becomes<br />

ŷ = β̂₀ + β̂₁x<br />

where β̂₀ and β̂₁ are chosen to minimise the sum of the squared errors (residuals).<br />

In the case of two explanatory variables, the multiple linear regression model can be written in the following form:<br />

y = β₀ + β₁x₁ + β₂x₂ + ε<br />

where ε is the residual (including random error) with mean zero (for all data values i) and constant variance.<br />



The fitted regression equation is<br />

ŷ = β̂₀ + β̂₁x₁ + β̂₂x₂<br />

The estimates β̂₀, β̂₁ and β̂₂ are found from the data in such a way that the sum of the squared residuals (errors), that is<br />

Σ [ yᵢ − (β̂₀ + β̂₁x₁ᵢ + β̂₂x₂ᵢ) ]²,<br />

is minimised.<br />

The results are complicated and statistical software is always used for calculations.<br />
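To see what the software is doing, here is a minimal pure-Python sketch of least squares with two predictors: build the normal equations (XᵀX)b = Xᵀy and solve them. The data are made up so that y = 1 + 2·x1 + 0.5·x2 exactly (they are not the lung-capacity data).

```python
x1 = [1, 2, 3, 4, 5]
x2 = [2, 1, 4, 3, 5]
y = [4.0, 5.5, 9.0, 10.5, 13.5]   # exactly 1 + 2*x1 + 0.5*x2

X = [[1.0, a, b] for a, b in zip(x1, x2)]  # column of ones for the intercept

# Normal equations: (X'X) b = X'y
XtX = [[sum(X[i][r] * X[i][c] for i in range(len(X))) for c in range(3)]
       for r in range(3)]
Xty = [sum(X[i][r] * y[i] for i in range(len(X))) for r in range(3)]

# Solve the 3x3 system by Gaussian elimination with back-substitution.
A = [row[:] + [t] for row, t in zip(XtX, Xty)]
for col in range(3):
    pivot = max(range(col, 3), key=lambda r: abs(A[r][col]))
    A[col], A[pivot] = A[pivot], A[col]
    for r in range(col + 1, 3):
        f = A[r][col] / A[col][col]
        for c in range(col, 4):
            A[r][c] -= f * A[col][c]
beta = [0.0, 0.0, 0.0]
for r in (2, 1, 0):
    beta[r] = (A[r][3] - sum(A[r][c] * beta[c] for c in range(r + 1, 3))) / A[r][r]

b0, b1, b2 = beta  # recovers (1, 2, 0.5)
```

In practice R-cmdr (or any statistics package) performs this computation, with more numerically careful methods, behind the scenes.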



Example<br />

For lung transplantation it is desirable for the donor’s lungs to be of a similar size to those of the recipient. Total lung capacity (TLC) is difficult to measure, so it is useful to be able to predict TLC from other information. The following table shows the pre-transplant TLC of 32 recipients of heart-lung transplants, and their age, sex and height.<br />

Age Sex Height(cm) TLC(l) Age Sex Height(cm) TLC(l)<br />

1 35 F 149 3.40 17 30 F 172 6.30<br />

2 11 F 138 3.41 18 21 F 163 6.55<br />

3 12 M 148 3.80 19 21 F 164 6.60<br />

4 16 F 156 3.90 20 20 M 189 6.62<br />

5 32 F 152 4.00 21 34 M 182 6.89<br />

6 16 F 157 4.10 22 43 M 184 6.90<br />

7 14 F 165 4.46 23 35 M 174 7.00<br />

8 16 M 152 4.55 24 39 M 177 7.20<br />

9 35 F 177 4.83 25 43 M 183 7.30<br />

10 33 F 158 5.10 26 37 M 175 7.65<br />

11 40 F 166 5.44 27 32 M 173 7.80<br />

12 28 F 165 5.50 28 24 M 173 7.90<br />

13 23 F 160 5.73 29 20 F 162 8.05<br />

14 52 M 178 5.77 30 25 M 180 8.10<br />

15 46 F 169 5.80 31 22 M 173 8.70<br />

16 29 M 173 6.00 32 25 M 171 9.45<br />



Step 1: First look at some plots in order to gain an understanding of the data.<br />

1. Plot each predictor variable against the outcome.<br />

[Scatter plot: total lung capacity (l) against age (yrs)]<br />

It appears that total lung capacity is not affected by age.<br />



It appears that total lung capacity increases as height increases.<br />

The effect of sex is not clear.<br />



Step 2: Fit (in R-cmdr) simple linear regression models for each predictor variable.<br />

1. Age alone:<br />

TLC = 5.07 + 0.036 × age<br />

If age increases by one year, TLC increases by 0.036 litre (not significant if tested).<br />

2. Height alone:<br />

TLC = −9.74 + 0.095 × height<br />

If height increases by 1 cm, TLC increases by 0.095 litre (significant if tested).<br />



Step 3: Fit (in R-cmdr) the multiple linear regression model.<br />

3. Age and height<br />



From the regression equation for the model including age and height, the predicted TLC for someone aged 25 with a height of 160 cm is:<br />

TLC = −11.218 − 0.030 × 25 + 0.108 × 160 = 5.322 litres<br />
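The same prediction can be reproduced directly (a sketch using the rounded coefficients quoted above; note the small rounding effect):

```python
# Prediction from the fitted age + height model.
b0, b_age, b_height = -11.218, -0.030, 0.108   # rounded coefficients

tlc_hat = b0 + b_age * 25 + b_height * 160
# ≈ 5.31 litres with these rounded coefficients; the notes quote 5.322,
# which comes from the unrounded coefficients on the R-cmdr printout.
```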

Regressions which include binary (e.g. sex) predictor variables<br />

The predictor variable, SEX, has two categories only: female and male. We need a technique for including such binary variables in the regression models.<br />

Define a dummy variable (D) as follows:<br />

D = 0 if female, 1 if male<br />

If there are two other predictors X₁ and X₂ then we fit the model<br />

y = β₀ + β₁x₁ + β₂x₂ + β₃d + ε<br />



The fitted equation is therefore<br />

ŷ = β̂₀ + β̂₁x₁ + β̂₂x₂ + β̂₃d<br />

We find the estimates β̂₀, β̂₁, β̂₂ and β̂₃ by minimising the squared residuals as before (using the computer).<br />

4. Model with age, height and sex.<br />



Model interpretation:<br />

* TLC decreases with increasing age.<br />

For a person 10 years older, the predicted TLC will be 0.25 litres lower.<br />

* TLC increases with increasing height.<br />

For a person 10 cm taller, the predicted TLC will be 0.9 litres higher.<br />

* Males have higher TLC than females:<br />

For males, the predicted TLC is 0.697 litres higher than for females of the same age and height.<br />

For females, sex = 0, so TLC = −8.54 − 0.025 age + 0.0895 height + 0.697 × 0<br />

For males, sex = 1, so TLC = −8.54 − 0.025 age + 0.0895 height + 0.697 × 1<br />

Therefore, the difference in average TLC between males and females is 0.697 litres.<br />

Note: compare this to the crude difference in mean TLC between males and females.<br />



It is 6.98 − 5.20 = 1.78 litres, where 6.98 and 5.20 are the male and female averages.<br />

Some of this difference between males and females can be explained by differences in age and height.<br />
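These crude means can be checked directly from the data table given earlier:

```python
# Crude (unadjusted) male and female mean TLC, from the table of
# 32 heart-lung transplant recipients.
tlc_female = [3.40, 3.41, 3.90, 4.00, 4.10, 4.46, 4.83, 5.10,
              5.44, 5.50, 5.73, 5.80, 6.30, 6.55, 6.60, 8.05]
tlc_male = [3.80, 4.55, 5.77, 6.00, 6.62, 6.89, 6.90, 7.00,
            7.20, 7.30, 7.65, 7.80, 7.90, 8.10, 8.70, 9.45]

mean_f = sum(tlc_female) / len(tlc_female)   # ≈ 5.20 litres
mean_m = sum(tlc_male) / len(tlc_male)       # ≈ 6.98 litres
crude_diff = mean_m - mean_f                 # ≈ 1.78 litres
```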

Overall, how well does the model fit?<br />

The analysis of variance:<br />

1. The regression effect has 3 degrees of freedom since there are 3 predictor variables in the model.<br />

2. The ANOVA table shows the ‘usefulness’ of the linear regression model – we want the p-value to be < 0.05.<br />

Here, p-value < 0.001 (reported as 0.000), implying that at least one of the explanatory variables has a significant linear relationship with the outcome variable.<br />



3. The strength of the relationship between TLC and the three predictors can be expressed as the proportion of the total SS explained by the regression equation.<br />

The coefficient of determination is:<br />

R² = 44.305/81.712 = 0.542 or 54.2%<br />

Thus, 54.2% of the total sum of squares (variation) is explained by age, height and sex together.<br />

Notice how the value of R² has increased from 0.510 or 51.0% to 0.542 or 54.2% when all three predictor variables are included.<br />
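The calculation is a single division (SS values from the ANOVA table on the printout):

```python
# R-squared as the proportion of the total sum of squares explained
# by the regression.
ss_regression = 44.305
ss_total = 81.712

r_squared = ss_regression / ss_total   # ≈ 0.542, i.e. 54.2%
```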



Are all three variables needed in the model?<br />

There are two main ways of evaluating the importance of a variable in the model:<br />

1. Construct a test of the null hypothesis that the regression coefficient = 0.<br />

2. Calculate a 95% confidence interval for the regression coefficient.<br />

Note: Regardless of whether an additional variable is significant or not, the real point at issue is that the other regression parameters are adjusted for the influence of these new confounding variables, producing adjusted tests and confidence intervals.<br />

The model is<br />

TLC = β₀ + β₁ age + β₂ height + β₃ sex + ε<br />

giving the R-cmdr printout as follows:<br />



Std Error is the standard error of the corresponding regression coefficient. (See how the coefficients of age and height change when allowance is made for sex.)<br />

1. Test of the hypothesis H₀: β₃ = 0<br />

Is the variable sex an important predictor in the model?<br />

T = (β̂₃ − 0) / s.e.(β̂₃) = (0.697 − 0) / 0.499 = 1.396<br />

p-value = 0.174. There is no evidence that sex is important in predicting TLC – the coefficient is not significantly different from 0.<br />

(Note: the t-test has 28 degrees of freedom, the DF of the residual (error) effect.)<br />
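The t statistic is simply the estimate divided by its standard error (both taken from the R-cmdr printout):

```python
# t statistic for the sex coefficient.
beta3_hat = 0.697
se_beta3 = 0.499

t_stat = (beta3_hat - 0) / se_beta3   # ≈ 1.397 (quoted as 1.396 in the notes)
```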



Test of the hypothesis H₀: β₁ = 0<br />

Age: t = −0.025/0.024 = −1.063, with 28 degrees of freedom (residual DF), p-value = 0.297.<br />

There is no evidence that age affects TLC.<br />

Test of the hypothesis H₀: β₂ = 0<br />

Height: t = 3.647 (p-value = 0.001)<br />

There is strong evidence that height is important in predicting TLC.<br />

2. Calculating a confidence interval for a regression parameter<br />

A true parameter βᵢ is estimated by β̂ᵢ.<br />

For sex, the parameter estimates the difference in average TLC between males and females after taking into account age and height.<br />

The C.I. for βᵢ is: β̂ᵢ ± t₂₈ s.e.(β̂ᵢ)<br />



For sex, this becomes<br />

0.697 ± t₂₈ (0.499), where t₂₈ = 2.048 for a 95% confidence interval.<br />

That is, 0.697 ± 1.022, or (−0.326, 1.720).<br />

This includes zero, so there is no evidence of a difference in average TLC between men and women.<br />

Note:<br />

The above interval is called an adjusted confidence interval. Recall that the unadjusted difference in means (female − male) was −1.78. The unadjusted 95% confidence interval for the true difference in mean TLC between males and females is (−2.77, −0.79).<br />

Adjusting for age and height has removed the statistically significant association between sex and TLC.<br />
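The adjusted interval can be reproduced from the printout values (rounded inputs give a slightly different third decimal place from the notes):

```python
# Adjusted 95% CI for the sex coefficient.
beta3_hat, se_beta3 = 0.697, 0.499
t28 = 2.048                      # t critical value, 28 df, 95% confidence

half_width = t28 * se_beta3
ci = (beta3_hat - half_width, beta3_hat + half_width)
# ≈ (−0.32, 1.72) with these rounded inputs; the notes quote
# (−0.326, 1.720) from the unrounded standard error.
# The interval contains zero, so sex is not significant after adjustment.
```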



95% confidence interval for the coefficient of age:<br />

−0.0250 ± t₂₈ (0.024), or (−0.073, 0.023)<br />

95% confidence interval for the coefficient of height:<br />

0.0895 ± t₂₈ (0.025), or (0.039, 0.140)<br />

Note the correspondence between the 95% confidence interval and the t-test carried out at the 0.05 (two-sided) significance level.<br />



Note:<br />

(i) The effect of sex was contained in the residual when TLC was expressed in terms of age and height only. The residual variability was therefore greater.<br />

(ii) The real effect of interest can be hidden by residual variability – reducing this residual variability by including more predictors in the model can improve the analysis (and therefore the study). The p-values associated with hypothesis tests for the parameters of interest will generally be smaller.<br />

(iii) Confounders can affect the parameter estimates of the predictor variables of interest as well as the residual variability. Therefore including confounders in the model is important for obtaining valid estimates of the coefficients of interest, regardless of the reduction in residual variability.<br />



Checking the fit of the model<br />

We do not expect our model to be correct. We want it to capture the important aspects of the process under investigation, but also to simplify things enough to aid understanding. Choosing an appropriate model is a complex art which is covered more fully in higher-level courses on regression. Here we consider some basic principles.<br />

Rule of thumb:<br />

We should not perform a multiple linear regression analysis if the number of variables in the model is greater than the number of individuals divided by 10.<br />

Residual plots<br />

1. The residuals associated with each data value should be normally distributed with mean 0 and constant variance. (In R-cmdr we can save the residuals for subsequent plotting, e.g. a normal probability plot.)<br />



2. The printouts also identify any unusual data point which has a very large residual. The residuals can be standardised to have mean zero and standard deviation one, so that unusual cases can be seen clearly. (One of the options in R-cmdr is to save the standardised residuals.)<br />

(1) Checking the normality assumption for the residuals.<br />

The matching histogram shows the usual bell-shaped pattern for the 32 residuals.<br />



The points in the normal P-P plot lie along a straight line, confirming that the distribution of the residuals is close to normal.<br />

Two extreme points correspond to:<br />

i) female, aged 20, height 162 cm. The predicted value from the model is 5.46 and the actual TLC is 8.05.<br />

ii) male, aged 25, height 171 cm. The predicted TLC from the model is 6.84, the actual TLC is 9.45.<br />

(2) Plot of residuals vs independent variables<br />

Residuals versus age plot<br />



This plot identifies the negative residuals for the people under 20 years and also shows the two large outliers. Otherwise the plot is reasonably random about zero.<br />

Residuals versus height plot<br />

Again the plot has negative residuals for the shorter people and identifies the two large outliers. These plots indicate that special thought should be given to whether the young people should be retained in the model.<br />



Analysis of Covariance<br />

This analysis uses a multiple regression to compare simple regressions corresponding to the categories of a qualitative explanatory variable.<br />

Example: A study investigates the effect of a treatment for hypertension on systolic blood pressure (BP), compared with a control treatment. Age was also known for all patients, and it was thought that age might confound the differences in BP between the groups.<br />

TREATMENT CONTROL<br />

BP(Y) AGE(X) BP(Y) AGE(X)<br />

120 26 109 33<br />

114 37 145 62<br />

132 31 131 54<br />

130 48 129 44<br />

146 55 101 31<br />

122 35 115 39<br />

136 40 133 60<br />

118 29 105 38<br />

Control mean = 121.00 mm (of mercury)<br />

Treatment mean = 127.25 mm (of mercury)<br />
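The group means follow directly from the BP table above:

```python
# Group means computed from the blood-pressure table.
bp_treatment = [120, 114, 132, 130, 146, 122, 136, 118]
bp_control = [109, 145, 131, 129, 101, 115, 133, 105]

mean_t = sum(bp_treatment) / len(bp_treatment)   # 127.25 mm Hg
mean_c = sum(bp_control) / len(bp_control)       # 121.00 mm Hg
diff = mean_t - mean_c                           # 6.25 mm Hg
```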



But note:<br />

average age of control group = 45.13 years<br />

average age of treated group = 37.63 years<br />

[A] First, an ordinary unpaired t-test is performed on the BP (Y) values using the pooled variance of the Y values.<br />

Analyze > Compare Means > Independent-Samples t-test<br />

y is the test variable and d is the grouping variable; d is 0 for control and 1 for treatment.<br />

There is no evidence of a difference between the two means, as t = −0.932. The 95% confidence interval for μ_T − μ_C is (−8.1, 20.6), which includes 0, confirming no evidence of a difference between the means. Also p-value = 0.367.<br />



At this stage the ages have been ignored. Age could be increasing the residual variation, hiding the true treatment difference, i.e. age could be a confounder.<br />

[B] Second, a regression analysis of Y on d is performed, where d = 0 for the control and d = 1 for the treatment.<br />

Analyze > Regression > Linear<br />

(Again, the 16 Y values are in one column and the values of d in a second column.)<br />

The estimated regression equation is<br />

ŷ = 121 + 6.25d<br />



The estimated coefficient for d is 6.25 with a standard error of 6.708. Note that when d = 0, ŷ = 121.00 and when d = 1, ŷ = 127.25, so the coefficient of d is the difference between the two means. The 95% confidence interval for the treatment difference is<br />

6.25 ± t₁₄ (6.708), where t₁₄ = 2.145,<br />

giving 6.25 ± 14.39 or (−8.14, 20.64)<br />

as before. This regression is equivalent to the unpaired t-test. The age variable effect remains hidden in the residual.<br />

Note: The confidence interval can also be obtained on the printout if requested.<br />



[C] Third, a regression analysis of Y on X and D together is performed, where D = 0 for control, otherwise 1.<br />

Analyze > Regression > Linear<br />

(Values of X are now in a third column.)<br />

The estimated regression equation is<br />

ŷ = 73.9 + 1.04x + 14.1d<br />

The estimated coefficient of d is now 14.082 with a standard error of 3.818. The coefficient of d represents the difference between patients of the same age, one in the control and one in the treated group.<br />



e.g. Let X = xₖ be the age of two such patients. Then<br />

ŷ_T − ŷ_C = (73.9 + 1.04xₖ + 14.1) − (73.9 + 1.04xₖ + 0) = 14.1<br />

The 95% confidence interval for the difference is<br />

14.082 ± t₁₃ (3.818), where t₁₃ = 2.160,<br />

giving 14.082 ± 8.247, or (5.84, 22.33).<br />

Now there is evidence that the treatment raises blood pressure, as 0 is excluded from the confidence interval. The 13 DF are n − 3, namely those of the residual.<br />

Also note that the t-test value associated with d is 3.69, with a p-value of 0.003.<br />

Also note how the effect of age has effectively been removed from the residual, which is substantially reduced from 2519.5 to 669.4.<br />

The value of R² has risen from 0.058 or 5.8% to 0.750 or 75% when X is added to the model involving d only.<br />
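The adjusted interval and t statistic can be reproduced from the printout values:

```python
# Adjusted treatment effect from the regression of BP on age and d.
coef_d, se_d = 14.082, 3.818
t13 = 2.160                      # t critical value, 13 df, 95% confidence

half_width = t13 * se_d          # ≈ 8.247
ci = (coef_d - half_width, coef_d + half_width)   # ≈ (5.84, 22.33)
t_stat = coef_d / se_d           # ≈ 3.69
```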



The confidence interval here is the ADJUSTED CONFIDENCE INTERVAL, after allowing for the effect of age.<br />

Unadjusted interval: (−8.14, 20.64)<br />

Adjusted interval: (5.84, 22.33)<br />

It is helpful to put a geometrical interpretation on this analysis. The scatter diagram of Y (BP) against X (age) follows, for all 16 patients.<br />

[Scatter plot: Y (BP, 100–150 mm Hg) against X (age, 25–65 years); dots = treated group, crosses = control group]<br />

Notice the difference between the treated group (dots) and the control group (crosses).<br />

Suppose we fit the equation (by least squares)<br />

ŷ = β̂₀ + β̂₁x + β̂₂d<br />

where d = 0 for control, d = 1 for treatment.<br />



If d = 0, ŷ = β̂₀ + β̂₁x<br />

If d = 1, ŷ = β̂₀ + β̂₁x + β̂₂ = (β̂₀ + β̂₂) + β̂₁x<br />

These two lines are PARALLEL (same slope β̂₁) but the intercepts are β̂₀ and (β̂₀ + β̂₂). Thus, β̂₂ is the vertical distance between the two parallel straight lines.<br />

[Scatter plot: Y (BP) against X (age), with the two parallel fitted lines; the vertical gap between the lines is β̂₂]<br />

β̂₂ is the effect of the treated group relative to the control. If β̂₂ is significant, then there is evidence of different blood pressure values in the two groups. We see how to test β̂₂ for significance shortly. The next printout gives Y regressed on X only, and Y regressed on X and d together.<br />



Notes:<br />

1. β̂₂ (the coefficient of d) = 14.082 is the increase in blood pressure level due to administering the treatment (regardless of the age of a patient, since the two lines, being parallel, have a constant difference).<br />

2. The 95% confidence interval for the coefficient of d (namely β₂) is<br />

14.082 ± t₁₃ (3.818), where t₁₃ = 2.160.<br />

It follows that 5.84 < β₂ < 22.33.<br />

3. Without taking age into account, the treatment raised blood pressure by only 6.25 mm of mercury. Taking age into account, the treatment raised blood pressure by 14.082 mm.<br />

4. The ordinary unpaired t-test originally suggested for this problem is equivalent to regressing Y on d alone. In this case, the variable x (age) remains part of the residual, which is therefore inflated, hiding the true treatment effect. In addition, correlation between age and treatment group distorts the estimate of the treatment effect on blood pressure.<br />



Binary outcomes: Logistic Regression<br />

Recall: For simple and multiple linear regression the outcome variable was continuous.<br />

What do we do if the outcome variable Y is binary?<br />

e.g. disease present: yes/no<br />

e.g. tuatara: present/absent<br />

e.g. claim to ACC goes to litigation: yes/no<br />

e.g. depression: yes/no in 18-year-olds who were bullied at school earlier<br />

We use logistic regression (LR).<br />

In a logistic regression the explanatory or predictor X variables can be either continuous or categorical (binary).<br />

Like multiple regression, we can use logistic regression to:<br />

(1) control for confounding;<br />

(2) investigate the effect of several variables on the outcome variable at one time.<br />

We can use the method of LR with data from any study type as long as we have a binary outcome.<br />



The logistic regression model is:<br />

ln( p / (1 − p) ) = β₀ + β₁X₁ + β₂X₂ + … + βₖXₖ + ε<br />

where<br />

Y is the binary outcome variable (values 0 or 1)<br />

p is the probability that a particular event will occur, i.e. Pr(Y = 1)<br />

X₁, X₂, ..., Xₖ are the explanatory variables<br />

β₀ is the intercept<br />

β₁, β₂, ..., βₖ are the regression coefficients<br />

ε is the random error<br />

Interpreting the model:<br />

p / (1 − p) is the ‘odds’ of the event occurring<br />

ln( p / (1 − p) ) is the ‘log odds’<br />

The regression coefficient βᵢ represents the change in the log odds for a 1-unit change in Xᵢ.<br />

Fitted logistic model:<br />

The formulae used to estimate the values of β₀, β₁, etc. are computationally complex. We shall not worry<br />



about the details here; instead we shall focus on understanding the results from a logistic regression R-cmdr printout.<br />

Example:<br />

A study was conducted to investigate the relationship between physical inactivity and myocardial infarction (MI). It was found that people who were physically inactive had an increased risk of MI. Age was considered to be a potential confounder.<br />

Compared to younger people, older people:<br />

• are more likely to be physically inactive.<br />

• have a higher risk of MI.<br />

Hence, we would expect that age can explain some of the association between physical inactivity and MI.<br />

Outcome:<br />

whether a person has an MI (Y), where Y = 0 or 1<br />

Exposure of interest:<br />

whether a person was physically inactive (exposure variable, X₁)<br />

Possible confounder:<br />

age (X₂) of the person.<br />



(1) Investigating the relationship between physical inactivity and MI.<br />

Option 1: Calculate the odds ratio, as shown earlier in the semester.<br />

The 2 × 2 contingency table for outcome and exposure is constructed from the 924 people.<br />

Outcome – MI<br />

Exposure (X₁) Yes No<br />

Physically inactive 136 98<br />

Physically active 343 347<br />

Odds ratio of MI in the exposed relative to the unexposed:<br />

OR = (136/98) / (343/347) = 1.40<br />

with 95% confidence interval 1.04 < OR < 1.89.<br />

Interpretation: The odds of having an MI are 40% higher for a person who is physically inactive compared to a physically active person. The result is significant.<br />
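The crude OR and its interval can be reproduced from the table. The notes do not show the interval calculation, so the sketch below uses the standard Woolf (log-odds) method, which matches the quoted limits:

```python
import math

# Crude odds ratio and Woolf 95% confidence interval from the 2x2 table.
a, b = 136, 98    # MI yes / no among the physically inactive
c, d = 343, 347   # MI yes / no among the physically active

odds_ratio = (a / b) / (c / d)                    # ≈ 1.40
se_log_or = math.sqrt(1/a + 1/b + 1/c + 1/d)      # SE of ln(OR)
lower = math.exp(math.log(odds_ratio) - 1.96 * se_log_or)   # ≈ 1.04
upper = math.exp(math.log(odds_ratio) + 1.96 * se_log_or)   # ≈ 1.89
```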



Option 2: Alternatively, we can fit a logistic regression model, using R-cmdr.<br />

Y = MI: 1 = Yes, 0 = No<br />

X₁ = Physically inactive: 1 = Yes, 0 = No<br />

Fitted regression model:<br />

ln( p̂ / (1 − p̂) ) = β̂₀ + β̂₁X₁<br />

where p̂ is the probability that a person has an MI.<br />

R-cmdr commands:<br />

Analyze > Regression > Binary Logistic<br />

Dependent: enter MI<br />

Covariate: enter Physical Inactivity. OK.<br />

Results from R-cmdr:<br />

The fitted equation is ln( p̂ / (1 − p̂) ) = −0.01 + 0.34 X₁<br />

Odds ratio = 1.40, as before.<br />

95% confidence interval for the OR is (1.04, 1.89).<br />
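The link between the two approaches is that the odds ratio is the exponential of the logistic regression coefficient:

```python
import math

# The odds ratio equals exp(coefficient of the exposure variable).
beta1_hat = 0.34          # coefficient of X1 (physical inactivity)

odds_ratio = math.exp(beta1_hat)   # ≈ 1.40, matching the 2x2-table result
```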



BUT what about the potential confounding effect of age? How can we control for that?<br />

Note: The odds ratio calculated previously is a crude odds ratio – it (and its corresponding 95% confidence interval) is not adjusted for the potential confounder, age.<br />

To control for age, we include age as a second explanatory variable in our logistic regression.<br />

(2) Investigating the relationship between physical inactivity and MI, adjusting (controlling) for age.<br />

Now add age (X₂) to the regression in order to obtain the adjusted OR and its 95% confidence interval.<br />

Y = MI: 1 = Yes, 0 = No<br />

X₁ = Physically inactive: 1 = Yes, 0 = No<br />

X₂ = age<br />

Results from R-cmdr:<br />



The fitted regression is<br />

ln( p̂ / (1 − p̂) ) = −0.41 + 0.17 X₁ + 0.68 X₂<br />

This leads to the age-adjusted odds ratio of 1.19, which has 95% confidence interval (0.87, 1.62).<br />

These values are read from the printout; compare them with the crude ratio of 1.40 with confidence interval (1.04, 1.89).<br />

Conclusion: After adjusting for age, the OR decreased from 1.40 to 1.19. Therefore, age was making the association between physical inactivity and MI appear more extreme than it actually was.<br />



SECTION 11<br />

Study design principles, critical appraisal, sources<br />

<strong>of</strong> bias <strong>and</strong> confounding.<br />



Study Design <strong>and</strong> Critical Appraisal<br />

Research process:<br />

1. Development <strong>of</strong> research question<br />

2. Design <strong>of</strong> study<br />

3. Collection <strong>of</strong> information<br />

4. Description <strong>of</strong> data<br />

5. Interpretation <strong>of</strong> results<br />

Study design<br />

• Study design refers to the methods used to select the<br />

study participants, control any experimental<br />

conditions, <strong>and</strong> collect the information.<br />

• Interpretation <strong>of</strong> results depends on the study design.<br />

• The study design should be tailored to the research<br />

question.<br />

• Methods <strong>of</strong> statistical analysis <strong>and</strong> information<br />

produced will depend on the study design.<br />

“The data from a good study can be analysed in many<br />

ways, but no amount <strong>of</strong> clever analysis can compensate<br />

for problems with the design <strong>of</strong> the study.” Altman.<br />



Critical appraisal<br />

Critical appraisal is the process <strong>of</strong> reviewing a study with the<br />

goal <strong>of</strong> identifying its strengths <strong>and</strong> weaknesses, the major<br />

results, <strong>and</strong> its broader implications.<br />

Why teach study design and critical appraisal?<br />

• it is not possible to sensibly interpret the results of<br />

statistical analysis without understanding the context in<br />

which, and the methods by which, the data were collected.<br />

• health sciences practice and policy need to be based on<br />

sound evidence (as far as possible).<br />

• poorly conducted research should not influence policy or<br />

practice.<br />

• because even well conducted research is not perfect, it is<br />

necessary to underst<strong>and</strong> the nature <strong>of</strong> evidence so that<br />

you can begin to learn to interpret research findings for<br />

yourselves.<br />

• for this you need to gain an underst<strong>and</strong>ing <strong>of</strong> the<br />

scientific method as used in the health sciences.<br />

• this underst<strong>and</strong>ing is enhanced by learning to critique<br />

research.<br />



Outline <strong>of</strong> next four lectures<br />

1. Introduction to critical appraisal (lecture 1)<br />

• process for critical appraisal<br />

• structure <strong>of</strong> a research paper<br />

2. Design <strong>and</strong> appraisal <strong>of</strong> surveys (lecture 1)<br />

• review <strong>of</strong> surveys<br />

• internal validity<br />

bias<br />

chance<br />

• external validity<br />

• example<br />

3. Design <strong>and</strong> appraisal <strong>of</strong> analytic studies<br />

(lectures 2 – 4)<br />

• review <strong>of</strong> analytic study designs<br />

• internal validity<br />

bias<br />

confounding<br />

chance<br />

• external validity<br />

• causation<br />

• examples: r<strong>and</strong>omised controlled trials<br />

cohort studies<br />

case-control studies<br />



1. Introduction to critical appraisal<br />

Guideline for critical appraisal<br />

Study summary<br />

What were the study objectives?<br />

Why was the study necessary?<br />

What type of study design was used?<br />

How were the participants selected?<br />

What information was collected?<br />

What were the key results?<br />

Internal validity<br />

What do the findings of the study tell us about the<br />

population studied?<br />

External validity / Generalisability<br />

Can the findings of the study be applied to other<br />

populations?<br />

Causation (for analytic studies only)<br />

Implications<br />

What are the implications of the study?<br />



Structure <strong>of</strong> a scientific paper<br />

Abstract or summary<br />

• usually contains the key results <strong>of</strong> the study.<br />

Introduction<br />

• gives the background, necessity <strong>and</strong> objectives.<br />

Methods<br />

• summarises the study design including source <strong>of</strong><br />

participants <strong>and</strong> methods used to collect data.<br />

Results<br />

• description <strong>of</strong> the study participants including response<br />

rates.<br />

• summary <strong>of</strong> analyses.<br />

Discussion<br />

• provides the authors’ views of the internal and external<br />

validity <strong>of</strong> the study, <strong>and</strong> their conclusions about the<br />

implications <strong>of</strong> the study.<br />



2. Design <strong>and</strong> appraisal <strong>of</strong> Descriptive studies<br />

Aim: To describe characteristics <strong>of</strong> a group or groups <strong>of</strong><br />

people at a given point in time.<br />

Generally, a sample is taken from the population <strong>and</strong> the<br />

distribution <strong>of</strong> variables within that sample is described.<br />

Examples: A descriptive study can be used to<br />

• describe characteristics <strong>of</strong> a group <strong>of</strong> people,<br />

e.g. prevalence <strong>of</strong> asthma, prevalence <strong>of</strong> smoking,<br />

average cholesterol level.<br />

• find out people’s opinions and attitudes,<br />

e.g. attitudes to alternative health care; satisfaction<br />

with health care delivery.<br />

• find out the extent of people’s knowledge,<br />

e.g. knowledge <strong>of</strong> risk factors for melanoma, risk factors<br />

for coronary heart disease.<br />

• comparisons <strong>of</strong> subgroups may well be part <strong>of</strong> a survey,<br />

e.g. comparison <strong>of</strong> attitudes <strong>of</strong> men <strong>and</strong> women to<br />

alternative health care; comparison <strong>of</strong> prevalence <strong>of</strong><br />

smoking among different ethnic groups in NZ.<br />

A descriptive study is concerned with <strong>and</strong> designed only to<br />

describe the existing distribution <strong>of</strong> variables, without regard<br />

to causal or other hypotheses.<br />

Descriptive studies can generate hypotheses.<br />

Descriptive studies are <strong>of</strong>ten called surveys or cross-sectional<br />

studies.<br />



Descriptive studies generally use a sample from a population.<br />

Descriptive studies<br />

[Diagram: a sample is drawn from the underlying population (parameters eg μ, π); sample statistics (eg x̄, p) are used to make inferences about the population (internal validity), and the findings may then be generalised to other populations (external validity).]<br />



Recall<br />

Suppose we want to estimate mean cholesterol in the<br />

population:<br />

sample mean = population mean + "error"<br />

where the "error" has two components: systematic error (bias)<br />

and random variation (chance).<br />

r<strong>and</strong>om error (chance):<br />

• due to natural biological variability.<br />

• increasing the sample size will reduce the r<strong>and</strong>om<br />

fluctuations in the sample mean.<br />

systematic error (=bias)<br />

• due to aspects <strong>of</strong> the design or conduct <strong>of</strong> the study which<br />

systematically distort the results.<br />

• occurs if a sample is not representative <strong>of</strong> the population.<br />

• cannot be reduced by increasing the sample size.<br />
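The different behaviour of the two error components as the sample grows can be shown with a small simulation. This is an illustrative sketch only: the population mean of 5.5 mmol/L and the +0.4 selection shift are invented numbers, not data from the course.

```python
import random
import statistics

random.seed(1)   # fixed seed so the example is reproducible
TRUE_MEAN = 5.5  # assumed population mean cholesterol (mmol/L) - invented

def biased_sample(n):
    # Every draw is shifted by +0.4, mimicking a selection process that
    # systematically over-represents people with high cholesterol (bias).
    return [random.gauss(TRUE_MEAN, 1.0) + 0.4 for _ in range(n)]

for n in (100, 10_000):
    m = statistics.mean(biased_sample(n))
    print(n, round(m, 2))
# The random scatter shrinks as n grows, but the sample mean stays about
# 0.4 above the true value 5.5: bias is not reduced by a larger sample.
```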



Internal validity for descriptive studies<br />

Bias<br />

• bias<br />

• chance (r<strong>and</strong>om error)<br />

Selection bias<br />

• systematic error arising from the way people are selected<br />

for the study.<br />

• includes biases from sample selection and from non-response<br />

to the study.<br />

Information bias<br />

• systematic error arising from the way information was<br />

collected from the study participants.<br />

Chance<br />

• Confidence intervals around estimates indicate the degree<br />

<strong>of</strong> precision with which the sample value estimates the<br />

population value.<br />



Selection bias<br />

• systematic error arising from the way people are selected<br />

for the study.<br />

• includes biases from sample selection and from non-response<br />

to the study.<br />

Questions to ask:<br />

• Is the sample representative of the population?<br />

• What was the response rate?<br />

Example: A study was conducted to estimate the prevalence<br />

<strong>of</strong> smoking among males <strong>and</strong> females in NZ.<br />

Design:<br />

A r<strong>and</strong>om sample <strong>of</strong> households was selected using r<strong>and</strong>om<br />

digit dialling. If the call was not answered, the machine<br />

automatically went on to the next number. All interviews<br />

were conducted from 8am – 5pm (weekdays only).<br />

63% <strong>of</strong> people agreed to participate in the study.<br />



Information bias<br />

• systematic error arising from the way information was<br />

collected from the study participants.<br />

Question to ask:<br />

Is the information gathered correct?<br />

Example: Suppose an investigator wished to estimate the<br />

prevalence <strong>of</strong> depression in NZ. To do this, he carried out<br />

face-to-face interviews around the country with a r<strong>and</strong>om<br />

sample of adults. Can you think of how information bias<br />

might enter into his study?<br />



Example<br />

Life in New Zeal<strong>and</strong> Survey, Hillary Commission for<br />

Recreation <strong>and</strong> Sport, 1990, David Russell <strong>and</strong> Noela<br />

Wilson.<br />

Objectives<br />

• to provide a snapshot <strong>of</strong> New Zeal<strong>and</strong>ers from a health<br />

perspective.<br />

• included questions on physical activity, leisure patterns,<br />

dietary habits <strong>and</strong> other risk factors for disease.<br />

Necessity for the study<br />

• study provides a benchmark for comparison in future<br />

years.<br />

• the information is useful for generating hypotheses <strong>and</strong><br />

for designing interventions to improve health.<br />

Type <strong>of</strong> study design<br />

• survey <strong>of</strong> New Zeal<strong>and</strong>ers 15 years <strong>and</strong> over.<br />

• carried out April 1989 – May 1990.<br />

Selection <strong>of</strong> participants<br />

• over 18 years:<br />

- selected from electoral rolls.<br />

- each month 10 people were selected at r<strong>and</strong>om from<br />

each <strong>of</strong> the 97 electoral rolls, plus 19 from each <strong>of</strong><br />

the 4 Maori rolls.<br />

• 15 – 18 years:<br />

- snowball sample was used.<br />

- people already selected were asked to identify up to 5<br />

people aged 15 – 18.<br />

• total number selected: 12,463.<br />



Results<br />

Physical activity<br />

Activity level (%)<br />

Age group      low   moderate   high<br />

Male<br />

15 – 18         17      20       64<br />

19 – 24         23      27       51<br />

25 – 44         34      31       35<br />

45 – 64         50      34       16<br />

64+             58      39        3<br />

All             37      31       32<br />

Female<br />

15 – 18         24      22       54<br />

19 – 24         30      30       40<br />

25 – 44         20      53       26<br />

45 – 64         25      64       11<br />

64+             34      63        3<br />

All             25      51       24<br />

Can you summarise these results?<br />



Internal validity<br />

Bias<br />

Selection bias:<br />

• r<strong>and</strong>om sampling was used for those 18 <strong>and</strong> over.<br />

• bias from snowball sample (note multiple starting<br />

points based on r<strong>and</strong>om sampling)<br />

• response rate<br />

Information bias:<br />

• questionnaire<br />

• accuracy <strong>of</strong> recall<br />

• tendency to report what people think the researchers<br />

will want to see<br />

Chance<br />

• the study is large so the confidence intervals for overall<br />

proportions will be fairly narrow, but for smaller<br />

subgroups the proportions may not be so well estimated.<br />

e.g.: women aged 64+, n=814<br />

proportion with low activity level = 34%, CI = (30.8 to 37.3)<br />

proportion with high activity level = 3%, CI = (2.0 to 4.5)<br />
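These intervals are consistent with the usual normal-approximation formula p ± 1.96√(p(1−p)/n). A Python sketch; the small differences from the printed figures reflect rounding of the reported percentages:

```python
import math

# 95% CI for a proportion, normal approximation: p +/- 1.96*sqrt(p(1-p)/n)
def prop_ci(p, n):
    se = math.sqrt(p * (1 - p) / n)
    return p - 1.96 * se, p + 1.96 * se

# Women aged 64+ in the survey: n = 814
low_lo, low_hi = prop_ci(0.34, 814)    # 34% with low activity
high_lo, high_hi = prop_ci(0.03, 814)  # 3% with high activity

print(round(low_lo * 100, 1), round(low_hi * 100, 1))    # 30.7 37.3
print(round(high_lo * 100, 1), round(high_hi * 100, 1))  # 1.8 4.2
```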



External validity<br />

Are the results applicable to other populations?<br />

• this calls for a judgement as to whether the other<br />

populations are likely to be similar to New Zeal<strong>and</strong> in<br />

terms <strong>of</strong> their exercise patterns.<br />

Implications<br />

• high activity levels are the levels recommended to<br />

maintain cardio-respiratory fitness.<br />

• programmes to increase activity levels may be useful in<br />

preventing cardiovascular disease.<br />

• efforts to increase activity levels <strong>of</strong> men over the age <strong>of</strong><br />

45 may be particularly useful.<br />



Study design <strong>and</strong> critical appraisal sessions: 2<br />

1. Introduction to critical appraisal (lecture 1)<br />

• process for critical appraisal<br />

• structure <strong>of</strong> a research paper<br />

2. Design <strong>and</strong> appraisal <strong>of</strong> surveys (lecture 1)<br />

• review <strong>of</strong> surveys<br />

• internal validity<br />

bias<br />

chance<br />

• external validity<br />

• example<br />

3. Design <strong>and</strong> appraisal <strong>of</strong> analytic studies<br />

(lectures 2 – 4)<br />

• review <strong>of</strong> analytic study designs<br />

• internal validity<br />

bias<br />

confounding<br />

• chance<br />

• external validity<br />

• causation<br />

• examples: r<strong>and</strong>omised controlled trials<br />

cohort studies<br />

case-control studies<br />



3. Design <strong>and</strong> appraisal <strong>of</strong> analytic studies<br />

Review <strong>of</strong> analytic study designs<br />

Purpose<br />

To test hypotheses regarding<br />

• causes <strong>of</strong> disease<br />

• disease prevention strategies<br />

• effectiveness <strong>of</strong> treatments<br />

Example:<br />

• Is a statin drug more effective than a diet high in plant<br />

sterols, soy proteins and almonds in reducing serum<br />

cholesterol levels?<br />

• Do people who are physically inactive have an increased<br />

risk of developing colon cancer?<br />

When we are conducting an analytical study we are studying<br />

associations among two or more variables. We will have<br />

• an outcome variable (eg )<br />

• exposure variables (eg )<br />

• confounding variables – these are variables which distort<br />

the association of interest (eg age)<br />



Types <strong>of</strong> design:<br />

• experimental (intervention)<br />

• e.g. r<strong>and</strong>omised controlled trials.<br />

• observational<br />

• e.g. case-control studies, cohort studies.<br />

Key features <strong>of</strong> common designs<br />

R<strong>and</strong>omised controlled trials<br />

• people are assigned to an intervention or control group<br />

using r<strong>and</strong>om allocation, then followed up over a period<br />

<strong>of</strong> time.<br />

Cohort studies<br />

• participants are selected before they develop disease.<br />

• exposure status is measured, <strong>and</strong> they are followed up<br />

over a period <strong>of</strong> time.<br />

Case-control studies<br />

• two groups <strong>of</strong> people are chosen: a group with disease<br />

(cases) <strong>and</strong> a group without disease (controls).<br />

• information is collected from both groups about<br />

exposures that occurred in the past.<br />



Key ideas:<br />

• control (or comparison) groups are essential.<br />

• experimental studies provide much stronger tests <strong>of</strong><br />

hypotheses than observational studies.<br />

• experimental studies allow testing <strong>of</strong> causal relationships<br />

• with observational studies it is much harder to isolate the<br />

effects <strong>of</strong> the exposure <strong>of</strong> interest, so much harder to<br />

determine whether an association is causal<br />



Example<br />

Does smoking cause coronary heart disease?<br />

1. Estimate the association between smoking <strong>and</strong> coronary<br />

heart disease (eg relative risk).<br />

2. Does this relative risk represent the true association<br />

between smoking and CHD in the population studied<br />

(internal validity)?<br />

if yes<br />

3. Can this result be generalised to other populations<br />

(external validity)?<br />

4. Is the association causal?<br />



Internal validity<br />

Does the observed association represent the true association?<br />

Specifically:<br />

What are the possible explanations for the observed results?<br />

• bias<br />

• confounding<br />

• chance<br />

• true relationship<br />



Assessing internal validity:<br />

Bias<br />

Selection bias – systematic error arising from the way<br />

participants are selected for inclusion in the study.<br />

In an analytic study, selection bias occurs if the selection<br />

processes cause a systematic difference between the groups<br />

<strong>of</strong> people selected for the study.<br />

It includes bias from non-response.<br />

Information bias – systematic error arising from the way<br />

study information is obtained, interpreted <strong>and</strong> recorded.<br />

In an analytic study information bias is a particular problem<br />

if there are systematic differences in the information obtained<br />

from the different groups <strong>of</strong> people in the study.<br />

Information bias may be introduced by the:<br />

• Observer<br />

• Study individual (respondent)<br />

• Instruments used to collect the data (e.g. badly-designed<br />

questionnaire)<br />



Example<br />

Case-control study to examine relationship between stress<br />

<strong>and</strong> coronary heart disease:<br />

cases:<br />

controls:<br />

people with coronary heart disease<br />

identified through opportunistic<br />

screening by GPs<br />

r<strong>and</strong>om sample from the population<br />

Information on stress collected through a structured interview<br />

Selection bias:<br />

Information bias:<br />



Evaluation <strong>and</strong> control <strong>of</strong> bias<br />

• Statistical methods cannot control for bias in the selection<br />

<strong>of</strong> subjects or in the measurement <strong>of</strong> the variables <strong>of</strong><br />

interest. Control <strong>of</strong> bias can only be done during the<br />

design <strong>and</strong> data collection phases <strong>of</strong> the study.<br />

• General inaccuracy which is the same in both groups<br />

generally results in an underestimate <strong>of</strong> the true<br />

association.<br />

• If inaccuracy is different in the two groups, the<br />

association can be an over or under estimation.<br />

• It is important to identify sources <strong>of</strong> bias <strong>and</strong> estimate the<br />

magnitude <strong>and</strong> direction <strong>of</strong> their effect on the association.<br />



Confounding<br />

A distortion <strong>of</strong> the association between exposure <strong>and</strong> disease<br />

caused by the presence <strong>of</strong> a third factor.<br />

• A confounder is a variable which causes this distortion<br />

• To be a confounder a variable must be<br />

• associated with the exposure (independent <strong>of</strong> disease).<br />

• associated with disease (independent <strong>of</strong> exposure).<br />

• it must not just be an intermediate link in the causal<br />

chain.<br />



Example <strong>of</strong> confounding:<br />

A study was conducted to investigate the relationship<br />

between c<strong>of</strong>fee consumption <strong>and</strong> oral cancer. It was found<br />

that c<strong>of</strong>fee drinkers had an increased risk <strong>of</strong> oral cancer.<br />

Smoking is a potential confounder in this study.<br />

Compared to non-smokers:<br />

• Smokers are more likely to drink c<strong>of</strong>fee;<br />

• Smoking is an independent risk factor for oral cancer.<br />

Hence, the observed association may be due to smoking<br />

habits rather than c<strong>of</strong>fee drinking.<br />

Can you think of any other potential confounders?<br />



Example <strong>of</strong> non-confounding:<br />

diet → cholesterol level → coronary heart disease<br />

In this case, the raised cholesterol levels are likely to be due<br />

in part to diet, so are part <strong>of</strong> the causal pathway. Therefore in<br />

studies <strong>of</strong> diet <strong>and</strong> coronary heart disease raised cholesterol<br />

would not be considered a confounder.<br />

Example <strong>of</strong> a confounder in a cohort study:<br />

Results from a cohort study investigating the relationship<br />

between myocardial infarction <strong>and</strong> exercise.<br />

                    Myocardial infarctions   Person-years   Incidence/1000<br />

Table A: all subjects (n=8000 person-years)<br />

Low exercise                 105                 4000            26.25<br />

High exercise                 25                 4000             6.25<br />

Relative risk = 26.25/6.25 = 4.2<br />

Subgroup Analysis<br />

Obese subjects (n=4000)<br />

Low exercise                  90                 3000            30.0<br />

High exercise                 10                 1000            10.0<br />

Relative risk = 3.0<br />

Non-obese subjects (n=4000)<br />

Low exercise                  15                 1000            15.0<br />

High exercise                 15                 3000             5.0<br />

Relative risk = 3.0<br />
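The rates and relative risks in the table can be verified directly from the counts and person-years; a short Python check:

```python
# Incidence rates (per 1000 person-years) and relative risks from the table.
def rate_per_1000(events, person_years):
    return events / person_years * 1000

# All subjects
rr_crude = rate_per_1000(105, 4000) / rate_per_1000(25, 4000)
# Strata of the suspected confounder (obesity)
rr_obese = rate_per_1000(90, 3000) / rate_per_1000(10, 1000)
rr_non_obese = rate_per_1000(15, 1000) / rate_per_1000(15, 3000)

print(round(rr_crude, 1), rr_obese, rr_non_obese)  # 4.2 3.0 3.0
```

The crude relative risk (4.2) is larger than the relative risk within each stratum (3.0), which is what confounding by obesity looks like in the data.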



Positive <strong>and</strong> Negative Confounding<br />

Positive confounder – a confounding variable which makes<br />

an association look more extreme or creates a spurious<br />

association.<br />

Example: A study was conducted to investigate the<br />

relationship between physical inactivity <strong>and</strong> MI. It was found<br />

that people who were physically inactive had an increased<br />

risk <strong>of</strong> MI. Age was considered to be a potential confounder.<br />

[Diagram: age is associated with both physical inactivity and myocardial infarction.]<br />

Crude odds ratio =2.5<br />

But compared to younger people, older people:<br />

• are more likely to be physically inactive.<br />

• have a higher risk <strong>of</strong> MI.<br />

Hence, age can explain some <strong>of</strong> the association between<br />

physical inactivity <strong>and</strong> MI.<br />

After “adjusting” for the confounding association <strong>of</strong> age the<br />

OR decreases to 1.4. So confounding by age is making the<br />

association between physical inactivity <strong>and</strong> MI seem more<br />

extreme than it should be, i.e. it is a positive confounder.<br />



Negative confounder – a confounding variable which<br />

makes an association look less extreme or even in the<br />

opposite direction. It can mask a real difference.<br />

Example: A study was conducted to investigate the<br />

relationship between physical inactivity <strong>and</strong> MI. It was found<br />

that people who were physically inactive had an increased<br />

risk <strong>of</strong> MI. Sex was considered to be a potential confounder.<br />

[Diagram: sex is associated with both physical inactivity and myocardial infarction.]<br />

Crude OR = 2.5<br />

But compared to females, males:<br />

• are less likely to be physically inactive.<br />

• have a higher risk <strong>of</strong> MI.<br />

Hence, sex masks some <strong>of</strong> the association between physical<br />

inactivity <strong>and</strong> MI.<br />

After “adjusting” for the confounding effect <strong>of</strong> sex, the OR<br />

becomes 3.9.<br />

So confounding by sex makes the association between<br />

physical activity <strong>and</strong> MI seem less extreme than it should be,<br />

i.e. it is a negative confounder.<br />



Some comments on confounding:<br />

AGE <strong>and</strong> SEX are the most common confounding variables.<br />

This is because these two variables are not only associated<br />

with most exposures we are interested in such as diet,<br />

smoking habits etc., but they are also independent risk factors<br />

for most diseases.<br />

Control <strong>of</strong> confounding<br />

Confounders can be controlled for during the study design,<br />

during the analysis, or both in the design <strong>and</strong> the analysis.<br />

The aim is to make the groups being compared as similar as<br />

possible with respect to the confounders.<br />

(1) Identify potential confounders. A review <strong>of</strong> previous<br />

literature in the area should give you an idea <strong>of</strong> potential<br />

confounders.<br />

Also: What are the known risk factors for the outcome of<br />

interest? What factors are associated with the exposure?<br />

Data should be collected on all potential confounders since if<br />

you do not obtain the information you cannot control for it.<br />



(2) Control <strong>of</strong> confounding during the study design.<br />

Restriction:<br />

• Limits participation in a study to specific groups that<br />

are similar to each other with respect to the<br />

confounder.<br />

e.g. Include only non-smokers in a study <strong>of</strong> exercise<br />

<strong>and</strong> risk <strong>of</strong> CHD.<br />

• Disadvantages<br />

• residual confounding if restriction criteria are too<br />

wide.<br />

• lack <strong>of</strong> generalisability.<br />

• smaller number <strong>of</strong> available participants.<br />

Matching:<br />

Particular subjects are selected in such a way that the<br />

potential confounders are distributed in an identical<br />

manner among each <strong>of</strong> the study groups.<br />

Case-control study: Matching cases <strong>and</strong> controls.<br />

Cohort study: Matching exposed <strong>and</strong> unexposed.<br />

Matching needs to be accounted for in the analysis.<br />
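As a toy illustration of 1:1 matching (all identifiers and age bands below are invented), each case is paired with a control drawn from the same level of the confounder:

```python
# Pair each case with an unused control from the same age band
# (the potential confounder), as in a matched case-control study.
cases = [("case1", "45-64"), ("case2", "25-44"), ("case3", "45-64")]
pool = [("ctrlA", "25-44"), ("ctrlB", "45-64"),
        ("ctrlC", "45-64"), ("ctrlD", "15-24")]

matched = []
available = list(pool)
for case_id, band in cases:
    for ctrl in available:
        if ctrl[1] == band:          # same age band as the case
            matched.append((case_id, ctrl[0]))
            available.remove(ctrl)   # each control used at most once
            break

print(matched)  # [('case1', 'ctrlB'), ('case2', 'ctrlA'), ('case3', 'ctrlC')]
```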

R<strong>and</strong>omisation<br />



(3) Control <strong>of</strong> confounding during the analysis.<br />

Multivariate analysis – multiple regression.<br />

Evaluating confounding<br />

• Check for associations between suspected confounder <strong>and</strong><br />

exposure <strong>and</strong> disease.<br />

• See whether controlling for confounding affects the<br />

association.<br />

Chance<br />

• Study design: ensure study has sufficient power.<br />

• Confidence intervals <strong>and</strong> p-values for the association<br />

indicate the role <strong>of</strong> chance in the study.<br />

• When multiple statistical tests are carried out in a study,<br />

there is an increased chance <strong>of</strong> “false positive” results.<br />
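The inflation of false positives is easy to quantify: with k independent tests each carried out at the 5% significance level, the chance of at least one false positive is 1 − 0.95^k.

```python
# Chance of at least one false positive among k independent tests,
# each carried out at the 5% significance level.
for k in (1, 5, 10, 20):
    print(k, round(1 - 0.95 ** k, 2))
# 1 -> 0.05, 5 -> 0.23, 10 -> 0.4, 20 -> 0.64
```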



Study design <strong>and</strong> critical appraisal sessions: 3<br />

R<strong>and</strong>omised controlled trials (RCTs)<br />

Aim: To evaluate the effects of an intervention.<br />

• considered the “Gold st<strong>and</strong>ard” for evaluation <strong>of</strong><br />

interventions<br />

Why?<br />

• allows isolation <strong>of</strong> the effects <strong>of</strong> the intervention through<br />

controlling the experimental condition<br />

• experiment (“trial”)<br />

• comparison/control group (“controlled”)<br />

• r<strong>and</strong>omisation (“r<strong>and</strong>omised”)<br />

R<strong>and</strong>omisation<br />

• process for deciding who will get the experimental<br />

intervention <strong>and</strong> who will be the control<br />



Basic structure <strong>of</strong> a RCT<br />

• population to be studied<br />

• choice <strong>of</strong> comparison group<br />

• allocation <strong>of</strong> subjects to intervention or control group<br />

• choice <strong>of</strong> outcome measure<br />

Population to be studied:<br />

Usually not a representative sample from the population<br />

• eg in trials <strong>of</strong> treatments they will be patients coming to<br />

see the doctors who have agreed to take part in the<br />

study<br />

Chosen to maximise internal validity with some cost in terms<br />

<strong>of</strong> generalisability.<br />

• eg we may choose participants who are likely to be able<br />

to complete the requirements <strong>of</strong> the trial<br />

Choice <strong>of</strong> comparison group:<br />

• the control group should provide information on what<br />

would have happened without the experimental<br />

intervention<br />

• in trials <strong>of</strong> disease treatment or prevention the control<br />

group should in general receive the best available<br />

“st<strong>and</strong>ard” treatment.<br />

• sometimes there is no st<strong>and</strong>ard treatment or practice, in<br />

which case a “placebo” control group may be used.<br />



• “placebos” are substances with no biological effect on<br />

the disease process.<br />

• placebos are used to isolate the particular effect <strong>of</strong><br />

interest from effects that may occur because <strong>of</strong> people’s<br />

belief they are getting a particular intervention<br />

• use <strong>of</strong> a placebo allows “blinding” <strong>of</strong> intervention <strong>and</strong><br />

control groups, so that the results are not biased through<br />

knowledge <strong>of</strong> who got the new intervention<br />

Allocation <strong>of</strong> subjects to treatment groups:<br />

Example: Is the new treatment more effective than the<br />

standard treatment?<br />

How would we test this?<br />

(1) We could compare the results <strong>of</strong> the new treatment<br />

on patients with records <strong>of</strong> previous results from<br />

other patients using the old treatment (historical<br />

controls).<br />

Do you think this is a good idea?<br />

(2) Ask people to volunteer for the new treatment <strong>and</strong><br />

give the st<strong>and</strong>ard treatment to those who do not<br />

volunteer<br />

Do you think this is a good idea?<br />

(3) Allocate patients to the new treatment or the old<br />

treatment using an “objective” method <strong>and</strong> observe<br />

the outcome.<br />



The way in which patients are allocated to treatments can influence the results enormously.
We need a method of allocation to treatments in which the characteristics of subjects will not affect their chance of being put into any particular group – RANDOM ALLOCATION.
Volunteers are assigned to intervention groups using randomisation, then followed up over a period of time.
Randomisation:
• is the best way to control for both known and unknown confounders.
• but does not guarantee control of confounding.
• is ethical when there is genuine uncertainty about whether the new intervention or the comparison strategy is better ("equipoise").
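The idea of random allocation can be illustrated with a short script (a sketch in Python, since no particular software is assumed in these notes): each subject is assigned to the new or standard treatment purely at random, so their characteristics cannot influence which group they end up in.

```python
import random

def randomise(subjects, seed=1):
    """Allocate each subject to 'new' or 'standard' treatment at random."""
    rng = random.Random(seed)  # fixed seed so the allocation can be reproduced
    return {s: rng.choice(["new", "standard"]) for s in subjects}

# ten hypothetical patient labels, for illustration only
groups = randomise([f"patient{i}" for i in range(10)])
print(groups)
```

A fixed seed is used here only so the example is reproducible; in a real trial the allocation sequence would be generated once and concealed from investigators.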



Choice of outcome measure:
• needs to be sensitive to the effects of the intervention.
• early in the process of evaluation, short-term outcomes are used to screen for promising interventions.
• ultimately, we need to demonstrate that the intervention has tangible benefits for the individual and society.
Example: Zidovudine in treatment of people with asymptomatic HIV infection.
Studies found
• a statistically significant improvement in immune function (measured by CD4 count),
but
• no difference in survival at 3 years.



Randomised controlled trials: Example
Nichol et al. "The effectiveness of vaccination against influenza in healthy working adults." New England J. Med (1995).
Objectives
• to clarify the benefits of immunisation against influenza in a population not at high risk for complications.
Background
• most deaths from influenza occur among elderly people, but all age groups are affected.
• influenza accounts for millions of days lost from work each year.
• current recommendations of the US Advisory Committee on Immunisation Practices target persons at increased risk of complications of influenza, although all people who wish to avoid illness are encouraged to consider vaccination.
Type of study
Randomised controlled trial



Selection of participants
• recruited in Minneapolis-St Paul through newspaper advertisements, advertisements at work sites and recruitment sessions at shopping malls.
• aged 18–64 years.
• employed full time.
• no medical conditions which would place them at high risk of complications of influenza.
• not allergic to eggs.
• not pregnant, and no pregnancy planned within 3 months.
• had not had a previous vaccination for influenza.
Information collected
"Exposure" (= treatment)
• vaccine group: active vaccine
• placebo group: vaccine diluent
Outcome measures:
• structured telephone interviews
Week 1:
• side effects
Monthly for 4 months:
• occurrence of upper respiratory illness
• use of sick leave
• visits to the doctor



Key results
849 randomised:

                      placebo         vaccine
randomised            n = 425         n = 424
complete follow-up    n = 416 (98%)   n = 409 (96%)



Internal validity
Chance
• 95% confidence intervals around the differences exclude zero.
• p-values are small, indicating that differences this large (or larger) are very unlikely to occur by chance if the vaccine is not effective.
• several outcome measures were used, increasing the chance of false positive results, but since the p-values are very small this is not likely to affect the conclusions.



Confounding
Randomisation + intention-to-treat analysis



Intention-to-treat analysis
"once randomised, always analysed"
• outcome is compared in
  • the group randomised to placebo, and
  • the group randomised to vaccine.
• this preserves the control of confounding achieved by randomisation.
Bias
Selection bias is not a problem in randomised controlled trials (see generalisability, though).
Information bias in randomised trials arises from
• incomplete follow-up of participants
• error in measurement of outcome
Information bias in the vaccine trial:
Completeness of follow-up:
• placebo: 98% (416/425)
• vaccine: 96% (409/424)
Measurement of illness:
• definition of influenza
• recall of symptoms



Blinding
• means participants' experience or recall of symptoms is not affected by knowledge of whether they had the vaccine (single blind).
• people collecting the information from the participants cannot introduce bias through their knowledge of whether or not participants had the vaccine (double blind).
Generalisability
• broad group of working adults
• risk of influenza
• strain of influenza
Implications
• the trial demonstrates that vaccination against influenza can be effective in reducing symptoms, sick leave and visits to the doctor.



Study design and critical appraisal sessions: 4
1. Introduction to critical appraisal (lecture 1)
• process for critical appraisal
• structure of a research paper
2. Design and appraisal of surveys (lecture 1)
• review of surveys
• internal validity
• bias
• chance
• external validity
• example
3. Design and appraisal of analytic studies (lectures 2–4)
• review of analytic study designs
• internal validity
• bias
• confounding
• chance
• external validity
• causation
• examples: randomised controlled trials, cohort studies, case-control studies



Cohort study
Ref: "Cohort studies: marching towards outcomes", Lancet 2002; 359: 341–45.



Prospective cohort study (concurrent): the cohort is defined and characterised at the start of the study and followed up into the future.
• Assemble the cohort.
• Measure predictor variables and potential confounders.
• Follow up the cohort and measure outcomes.
Retrospective (historical) cohort: the cohort is defined and characterised in the past, based on data already recorded, and followed up toward the present to some cut-off time.
• Identify a suitable cohort.
• Collect data about predictor variables from past records.
• Collect data about subsequent outcomes that occurred at a later time.



Cohort studies: example
Hart C, Davey Smith G. "Coffee consumption and coronary heart disease mortality in Scottish men: a 21 year follow-up study." J Epidemiol Commun Health (1997); 51: 461–2.
Objective
• to examine the effects of coffee on coronary heart disease mortality.
Background / Necessity
• recent studies of this hypothesis have produced conflicting results.
• data on confounding factors have often been limited in those studies.
Type of study design
• cohort (prospective)



Selection of participants
• 5,766 men aged 35–64 from workplaces in an area in the west of Scotland.
• enrolled between 1970 and 1973.
Information collected
• at enrolment:
  • how many cups of coffee they usually drank per day;
  • information on confounders such as smoking and social class.
• followed up for 20 years.
• information about deaths from coronary heart disease was obtained from the national registry.



Key results

No. of cups of coffee per day   CHD deaths   RR     95% CI
0                               308          1.0
1                                94          0.89   (0.70, 1.12)
2                               104          0.98   (0.78, 1.23)
3–4                              82          0.90   (0.70, 1.16)
5+                               37          0.96   (0.67, 1.37)

p-value from trend test = 0.71

Chance
• all confidence intervals include the null value, 1.
• the upper limits of the confidence intervals for < 5 cups per day are fairly close to 1.
• for 5+ cups per day we cannot exclude a true RR as big as 1.37 (a 37% increase in risk).
• the test for trend gave a p-value >> 0.05.



Bias
Selection bias
• because there is only one selection process, selection bias is minimised.
• the study sample may not be representative of the population in west Scotland, but in analytic studies that issue is addressed under generalisability.
Information bias
• information bias could come from:
  • inaccuracy in exposure information;
  • loss to follow-up;
  • inaccuracy in determining death from CHD.
• the crude measure of coffee consumption used may bias the RR towards the null.
• follow-up will be nearly complete using the national registry.
• there may be some misclassification of cause of death.



Confounding
• the RRs presented were adjusted for a number of confounding factors, including age, diastolic blood pressure, cholesterol, smoking, social class and body mass index.
Generalisability
• type of coffee drunk (instant vs ground).
Implications
• found no clear evidence of an association between instant coffee use and risk of CHD.
• cannot rule out an increase in those drinking 5+ cups per day (small numbers).
• other types of coffee may have detrimental effects on CHD risk.



Case-control studies
Ref: "Case-control studies: research in reverse", Lancet 2002; 359: 431–34.
• Subjects are ascertained based on whether they have experienced the outcome of interest (cases) or not (controls).
• Information is collected from cases and controls about their past exposures.



Case-control studies: example
Shinton R and Sagar G. "Lifelong exercise and stroke." BMJ (1993); 307: 231–4.
Objective
• to examine the potential of lifelong patterns of increased physical activity to prevent stroke.
Background / Necessity
• there is growing evidence that exercise can protect against stroke.
• the importance of exercise in early adult life in protection from stroke has received little attention.
• previous studies had not adequately controlled for confounding.
Type of study design
Case-control study



Selection of participants
Study population: people registered with a GP in west Birmingham, England.
Cases:
• men and women aged 35–74 who had just had their first stroke.
• obtained by phoning GPs weekly, and by checking admissions at the local hospital.
Controls:
• randomly selected from the general practice population.
• no history of stroke.



Information collected
• structured questionnaire.
• one interviewer for all cases and controls.
• when disability prevented an adequate response, the closest friend or relative was interviewed.
• people were classified by their responses into those who did or did not engage in vigorous exercise during:
  • youth (15–25)
  • early middle age (25–40)
  • late middle age (40–55)
• information on confounders (e.g. age, sex, smoking).



Key results
Response rates:
Cases:
• 125 patients were eligible for inclusion.
• no patient or relative declined to participate. (100% response rate)
Controls:
• 220 controls were selected and contacted.
• 13 were excluded.
• 198 of the remaining 207 agreed to participate. (95.7% response rate)

Table I. Odds ratios* (95% confidence interval) of stroke according to when exercise was undertaken.

Age undertaken   Exercise: no   Exercise: yes
15–25            1.0            0.33 (0.2 to 0.6)
25–40            1.0            0.43 (0.2 to 0.8)
40–55            1.0            0.63 (0.3 to 1.5)

* Odds ratios are adjusted for age and sex.



Now, let's consider possible explanations for an association: Internal validity
Chance
• confidence intervals show the range of plausible values of the true odds ratio which are consistent with the study results.
• if the confidence interval for an odds ratio excludes 1, then the study provides evidence of an association in the population studied.
• if the confidence interval for the odds ratio includes 1, then the study results are consistent with the possibility that there is no true association.
• to conclude definitely that there is no association, the confidence interval must include 1 and be narrow, so that important differences in the risk of disease can be excluded.
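For a crude (unadjusted) odds ratio from a 2×2 table, the usual large-sample 95% confidence interval is exp(ln OR ± 1.96 × SE) with SE = √(1/a + 1/b + 1/c + 1/d). A sketch in Python; the counts a, b, c, d below are hypothetical, for illustration only, not taken from the stroke study (whose published ORs were adjusted for age and sex):

```python
from math import log, exp, sqrt

def odds_ratio_ci(a, b, c, d, z=1.96):
    """Crude odds ratio and 95% CI for a 2x2 table:
                exposed   unexposed
    cases          a          b
    controls       c          d
    """
    or_ = (a * d) / (b * c)
    se = sqrt(1/a + 1/b + 1/c + 1/d)   # standard error of log(OR)
    lo = exp(log(or_) - z * se)
    hi = exp(log(or_) + z * se)
    return or_, lo, hi

# hypothetical counts for illustration only
or_, lo, hi = odds_ratio_ci(20, 80, 50, 60)
print(f"OR = {or_:.2f}, 95% CI ({lo:.2f}, {hi:.2f})")
```

With these made-up counts the interval lies entirely below 1, so (by the rules above) the data would provide evidence of a protective association.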



In this study:
• the odds ratios increase with increasing age at which the exercise was undertaken.
• the confidence intervals for ages 15–25 and 25–40 exclude 1, so there is some evidence of an association between exercise at those ages and a reduction in risk of stroke.
• the odds ratio for exercise undertaken at age 40–55 is less than 1, but the confidence interval contains 1, indicating that this apparent beneficial effect could just be due to random variation or chance.



Bias
Case-control studies are particularly susceptible to bias because at the time the study is done both exposure and disease have already occurred.
Selection bias
Cases: all non-fatal cases which arose from the GP population were included.
Controls: randomly selected from the population the cases arose from.
Therefore, the controls are representative of the population the cases arose from, and selection bias is minimised.
Response rates were high (100% for cases and 95.7% for controls).



Information bias
Things done to minimise bias:
• cases and controls were all interviewed by the same interviewer.
• a structured questionnaire was used.
Possible sources of information bias:
Recall bias:
• cases and controls may both have trouble accurately recalling exercise patterns from when they were young.
• similar patterns of poor recall in cases and controls will bias an odds ratio towards 1, so this could not explain the observed association.
• cases have had a stroke, so they may be less likely to remember than the controls.
• if cases were less likely than controls to report exercise, an apparent protective association between exercise and stroke would be created.
Bias from surrogate interviewees:
• information on exercise for cases unable to respond was obtained from a friend or relative.



Interviewer bias:
• the interviewer will have known whether people were cases or controls.
• if he/she prodded the controls harder for information on exercise, an apparent protective effect would be created.
Confounding
• risk factors for stroke include age, sex and smoking.
• since all three of these are likely to be associated with exercise, they may be confounding the relationship between exercise and stroke.
• analyses were adjusted to remove the effects of confounding variables, including age, sex and smoking.



Generalisability
Could we apply the results of this study to the New Zealand population?
• we need to think about whether or not New Zealanders would be likely to experience the same apparent benefit from exercise.
• this depends on the nature of the exercise and the biological mechanism by which exercise reduces risk of stroke.
Causation
• it is difficult to show causation conclusively with a single observational study, primarily because of the susceptibility to bias and confounding.
• an association is more likely to be causal if:
  • the observed association is very strong;
  • a dose-response effect can be demonstrated;
  • the results from several different studies are consistent;
  • there is a known biological mechanism.



Appendix One: The Basics
This appendix contains some background material to help you prepare for the course.
1. Basic Mathematical Rules
  1. BEDMAS – how to work things out in the right order
  2. Rounding
  3. Dealing with Negatives
  4. Fractions
  5. Solving Equations
  6. Powers and Logarithms
  7. Sigma means Add Up
2. Basic Statistical Concepts
  1. Mean
  2. Median
  3. Range
  4. Variance and Standard Deviation
  5. Quartiles and Interquartile Range
  6. Scatterplot
3. Sample Exercises
MATHERCIZE
Practice examples for many of the topics covered in this booklet are available in the computer package MATHERCIZE. This program is available at: http://mathercize.otago.ac.nz, and the login password is line.

Appendix 1 – Basic rules and concepts


Section 1: Basic Mathematical Rules
1. BEDMAS – how to work things out in the right order
Brackets
Exponents (also known as Powers)
Division and Multiplication
Addition and Subtraction
When Division and Multiplication occur together, work from the left. Similarly, when Addition and Subtraction occur together, work from the left. Otherwise follow the order suggested by the word BEDMAS.
Note that a scientific calculator will maintain this order, provided care is taken, but other calculators do not.

Example 1
Evaluate (3 + 2) × 6 + 9² ÷ (2 + 7 − 6)
• First evaluate both brackets: (3 + 2) = 5 and (2 + 7 − 6) = 3
• Then the exponent: 9² = 81
• Then the division and multiplication: 5 × 6 = 30 and 81 ÷ 3 = 27
• Finally the addition: 30 + 27 = 57
Setting this out on paper:
(3 + 2) × 6 + 9² ÷ (2 + 7 − 6) = 5 × 6 + 9² ÷ 3
                               = 5 × 6 + 81 ÷ 3
                               = 30 + 27
                               = 57
Example 2
Evaluate 5 + (9 − 5 ÷ 5 × 2²) − 9
• First evaluate the bracket. The exponent is evaluated first, then the division and multiplication from the left:
(9 − 5 ÷ 5 × 2²) = (9 − 5 ÷ 5 × 4)
                 = (9 − 1 × 4)
                 = 9 − 4 = 5
• Finally the addition and subtraction, from the left: 5 + 5 − 9 = 1.
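Programming languages apply the same precedence rules, so the two worked examples can be checked directly (a Python sketch; `**` is the exponent operator and `/` is ordinary division):

```python
# Example 1: brackets first, then the exponent, then × and ÷, then +
ex1 = (3 + 2) * 6 + 9**2 / (2 + 7 - 6)

# Example 2: inside the bracket, the exponent comes first,
# then 5 / 5 * 4 is worked left to right
ex2 = 5 + (9 - 5 / 5 * 2**2) - 9

print(ex1, ex2)  # 57.0 1.0
```

Note that `/` gives a decimal result (57.0 rather than 57), but the value agrees with the hand working above.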



Example 3
If Z = (X − μ) ÷ σ, calculate Z if X = 15, μ = 8, and σ = 2.75.
• First carry out a "clean" substitution. This means that each variable whose value is known is replaced by that value without any calculation being done:
Z = (15 − 8) ÷ 2.75
• The division sign implies brackets. The expression could be rewritten as Z = (15 − 8) ÷ 2.75, although the brackets are seldom shown. Nevertheless the expression 15 − 8 is evaluated first.
• Finally the division: 7 ÷ 2.75 = 2.55 (to two decimal places).
• Note that using brackets on a standard calculator should let you evaluate the expression directly. Try ( 15 − 8 ) ÷ 2.75 = (Missing out the brackets will almost certainly lead to an incorrect answer.)

Example 4
If t = 2.086, s = 3.44, and n = 21, evaluate the expression t × s ÷ √n.
• Clean substitution: 2.086 × 3.44 ÷ √21
• Note the implied multiplication: writing the terms side by side means t × (s ÷ √n).
• A square root is an exponent, so evaluate √21 = 4.583 (to three d.p.)
• There is no addition or subtraction involved, so work from the left:
2.086 × 3.44 ÷ 4.583 = 1.57 (to two d.p.) (Rounding is discussed below.)
• Again this may be calculated directly on a calculator. Press the buttons: 2.086 × 3.44 ÷ √21 =
Example 5
Evaluate the expression (x − μ) ÷ (s ÷ √n) if x = 215.8, μ = 246, s = 64.5, and n = 10.
• For this example, only the calculator working is shown. Press the buttons:
( 215.8 − 246 ) ÷ ( 64.5 ÷ √10 ) =
The answer is −1.48 (to two d.p.)
• Try to obtain the same answer using the rules of BEDMAS.

Example 6
Evaluate 1.96 × √(4.5²/18 + 3.6²/22)
• The square root sign implies brackets around the expression under it, i.e. we have to evaluate 1.96 × √( 4.5²/18 + 3.6²/22 ).
• All the exponents inside the brackets are calculated first, followed by the divisions:
4.5²/18 = 20.25/18 = 1.125 and 3.6²/22 = 12.96/22 = 0.589
• Next the addition, followed by the remaining exponent (the square root):
1.125 + 0.589 = 1.714 and √1.714 = 1.309
• Finally the multiplication: 1.96 × 1.309 = 2.57 (to two d.p.)
• Again note that this could be calculated directly on a calculator (although a single small mistake will make everything wrong). Try
1.96 × √( 4.5 x² ÷ 18 + 3.6 x² ÷ 22 ) =
The result is 2.566, which also rounds to 2.57; any small discrepancy between the two routes is due to rounding in the working. Note that x² refers to the squaring button on a Casio calculator. Other brands may have different notations for squaring, although they should be similar.
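The same expressions can also be evaluated in software rather than on a calculator. A Python sketch reproducing Examples 3–6 (the variable names are just labels for the given values):

```python
from math import sqrt

ex3 = (15 - 8) / 2.75                          # Example 3: Z-score
ex4 = 2.086 * 3.44 / sqrt(21)                  # Example 4: t × s ÷ √n
ex5 = (215.8 - 246) / (64.5 / sqrt(10))        # Example 5
ex6 = 1.96 * sqrt(4.5**2 / 18 + 3.6**2 / 22)   # Example 6

print(round(ex3, 2), round(ex4, 2), round(ex5, 2), round(ex6, 2))
# 2.55 1.57 -1.48 2.57
```

Because the computer keeps full precision throughout, there is no intermediate-rounding discrepancy of the kind noted in Example 6.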

2. Rounding
When you have decided how many digits you want to round to, look at the next digit. If this value is 0, 1, 2, 3, or 4, the previous digit is rounded down. Otherwise (if the value is 5, 6, 7, 8, or 9), the previous digit is rounded up.
Example:
By calculator, 8 ÷ √30 = 1.460593487
• To three d.p. (decimal places), 8 ÷ √30 = 1.461, because the next digit (5) causes the third decimal value (0) to be rounded up.
• To four d.p., 8 ÷ √30 = 1.4606
• To five d.p., 8 ÷ √30 = 1.46059
• To six d.p., 8 ÷ √30 = 1.460593



There are no hard and fast rules concerning how many digits you should round a value to, although a few general principles should be noted:
• When you are calculating an expression, do not round too soon. For example, consider the expression 150 ÷ √10. To eight decimal places, √10 = 3.16227766.
• If you use a calculator to evaluate 150 ÷ √10 and round your final answer to three decimal places, the result is 47.434.
• However, if you first round √10 to 3.16 and then calculate 150 ÷ 3.16, the result is 47.468 (to three d.p.). This may not appear to be much different from 47.434, but it could make a substantial difference if you have to use the value in further calculations.
• Do not round your working to fewer figures than your final answer. In the previous example, the value 3.16 has three significant figures, while the (slightly incorrect) answer 47.468 has five figures. Having rounded to three figures in the working, three figures (or fewer) should be used for the final answer. You should not give an answer "more" accurate than the data or working.
• As a rule of thumb, round probabilities to four decimal places.
• Historically, Z-scores have been rounded to two decimal places, because normal distribution tables use two decimal place Z-scores.
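The effect of rounding too soon is easy to demonstrate (a short Python sketch of the 150 ÷ √10 example above):

```python
from math import sqrt

exact = 150 / sqrt(10)   # keep full precision until the final answer
early = 150 / 3.16       # sqrt(10) rounded to 3.16 too soon

print(round(exact, 3))   # 47.434
print(round(early, 3))   # 47.468
```

The two answers differ in the second decimal place, and the gap would grow if either value were fed into further calculations.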

3. Dealing with Negatives
Adding a negative number is the same as subtracting the corresponding positive number:
• Example: 5 + (−4) = 5 − 4 = 1
Subtracting a negative number is like adding a positive number:
• Example: 5 − (−4) = 5 + 4 = 9
Multiplying two negative numbers gives a positive number:
• Example: (−5) × (−4) = 20
Multiplying a negative number by a positive number gives a negative number:
• Example: (−5) × 4 = −20



4. Fractions
Many people have difficulty with fractions. Sometimes the difficulty is in the interpretation rather than with the actual calculations.
Example
Imagine that you have attended a course and you are trying to work out your final mark. You have been told that you scored:
• 8.5 out of 10 for the assignments
• 20 out of 40 for the test
• 32 out of 50 for the exam
If you add these three values up as if they were fractions, you get
8.5/10 + 20/40 + 32/50 = 1.99 (Check this using a calculator.)
This is clearly a silly answer, because the values were not actually fractions as such, but marks from different sections of the assessment scheme for the course.
If you just add up the marks you get 60.5. This is a more reasonable answer, because it gives a total out of 100.
But suppose that in the course mentioned in this example, the assessment scheme states that if the internal mark is higher than the exam mark, your final mark is the average. Otherwise the final mark is the exam mark. For this example, the internal total is 28.5 out of 50, or 57%, while the exam mark translates to 64%. As the exam mark is higher than the combined internal marks, the final mark in this case would be 64.

Using Calculators for Fractions
When probabilities are involved, dealing with fractions is important. This section aims to show how to use a calculator to handle problems involving fractions.
As long as you estimate whether the final answer is sensible, practically all fraction work can be carried out using a calculator. The key button to use is a b/c on a Casio. Other calculators should have equivalent buttons.
Simplifying Fractions
Example 1: 12/20
On your calculator type 12 a b/c 20 =
The answer is given as 3⌟5, i.e. 12/20 = 3/5
Example 2: 21/105
Type 21 a b/c 105 =
The answer is 1⌟5, i.e. 21/105 = 1/5



Converting Fractions to Decimals
The a b/c button will often do this, although not always!
Example 1: Convert 11/15 into decimal form.
On the calculator type 11 a b/c 15 =
The screen shows 11⌟15. Now press the a b/c button and the fraction is converted to the decimal 0.733333... Press a b/c again, and the fraction version reappears.
Example 2: Convert 0.6875 to a fraction.
Type .6875 = Now press the a b/c button. The screen shows 11⌟16, i.e. 0.6875 = 11/16
Example 3: Convert 0.1234567 to a fraction.
Type .1234567 = Now press the a b/c button.
Nothing happens; the calculator leaves the decimal alone. If you want to convert this one to a fraction you will have to carry out the working yourself:
0.1234567 = 1234567/10000000

Adding and Subtracting Fractions
Example: 3/5 + 2/3
On your calculator type 3 a b/c 5 + 2 a b/c 3 =
The screen shows 1⌟4⌟15, i.e. 3/5 + 2/3 = 1 4/15
(Incidentally, if you now press the a b/c button, the decimal equivalent of this fraction appears on screen: 1.266666...)
Remember that if these two fractions represent probabilities that you are adding together, and the final answer is also meant to represent a probability, then there has to be an error somewhere, because a probability cannot be larger than 1.



Multiplying <strong>and</strong> Dividing Fractions<br />

Example 1: 5/8 × 5/3<br />

Type 5 a b/c 8 × 5 a b/c 3 =<br />

The result is 1 1 24, i.e. 5/8 × 5/3 = 25/24 = 1 1/24.<br />

Example 2: 5/7 ÷ 10/11<br />

Type 5 a b/c 7 ÷ 10 a b/c 11 =<br />

The result is 11 14, i.e. 5/7 ÷ 10/11 = 11/14.<br />

More Complicated Calculations<br />

As soon as you have a problem involving both addition <strong>and</strong> multiplication, brackets become very<br />

useful.<br />

Example: (3/4) × (1/8 + 3/7)<br />

Note that the fraction in front of the brackets implies multiplication.<br />

Type 3 a b/c 4 × ( 1 a b/c 8 + 3 a b/c 7 ) =<br />

The answer is 93/224, or 0.4152 (to four d.p.)<br />

Note that as an alternative approach you could use BEDMAS and work out the brackets first:<br />

1 a b/c 8 + 3 a b/c 7 = gives 31/56<br />

Now type × 3 a b/c 4 = to reach 93/224 as before.<br />
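If you prefer to check fraction arithmetic without a calculator, Python's standard fractions module does exact fraction arithmetic; here is a quick sketch of the examples above:<br />

```python
from fractions import Fraction

# 3/5 + 2/3 = 19/15, i.e. the mixed number 1 4/15
total = Fraction(3, 5) + Fraction(2, 3)

# 5/8 x 5/3 = 25/24, i.e. 1 1/24
product = Fraction(5, 8) * Fraction(5, 3)

# 5/7 divided by 10/11 = 11/14
quotient = Fraction(5, 7) / Fraction(10, 11)

# Brackets first (BEDMAS): (3/4) x (1/8 + 3/7) = 93/224
result = Fraction(3, 4) * (Fraction(1, 8) + Fraction(3, 7))
```

float(result) gives the decimal form, 0.4152 to four decimal places.<br />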

5. Solving Equations<br />

Solving equations involves more than evaluating expressions, which was covered earlier. To solve<br />

an equation you should make a clean substitution, then rearrange the expression so that the required<br />

variable is on its own.<br />

Loosely speaking, solving equations involves “undoing BEDMAS”. For example, anything inside<br />

brackets is dealt with last.<br />

In STAT 115 one particular type <strong>of</strong> equation will need to be solved:<br />



Example 1: If Z = (X − μ)/σ, calculate X if Z = 1.96, μ = 8.5, and σ = 1.8.<br />

• First make a “clean substitution”, i.e. substitute each <strong>of</strong> the known variables into the<br />

equation without trying to simplify at all:<br />

1.96 = (X − 8.5)/1.8<br />

• The division sign implies brackets around X – 8.5. We are “undoing” the equation, so<br />

this part will be left to last.<br />

1.96 = (X − 8.5)/1.8<br />

• This means we “undo” the value 1.8 first. Because the right h<strong>and</strong> side <strong>of</strong> the equation reads<br />

“(X – 8.5) divided by 1.8”, we will multiply by 1.8, since multiplication is the inverse<br />

operation to division:<br />

1.96 × 1.8 = (X – 8.5)<br />

• Because the brackets came from the original division sign, and the division has now been dealt with, the brackets are no longer needed:<br />

3.528 = X – 8.5<br />

• To undo subtraction we perform the opposite operation, addition:<br />

3.528 + 8.5 = X<br />

• We have now rearranged the equation so that X is on its own.<br />

X = 12.0 (one decimal place)<br />

Example 2: If Z = (X − μ)/(σ/√n), calculate X if Z = 2.58, μ = −2.5, σ = 0.85, and n = 60.<br />

• Clean substitution: 2.58 = (X − (−2.5)) / (0.85/√60)<br />

• Simplify a little: 2.58 = (X + 2.5) / 0.1097<br />

• Solve the equation: 2.58 × 0.1097 = X + 2.5<br />

0.2830 = X + 2.5<br />

0.2830 − 2.5 = X<br />

X = −2.22 (to two d.p.)<br />
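Both worked examples can be checked numerically; a minimal sketch (the variable names are mine):<br />

```python
import math

# Example 1: Z = (X - mu) / sigma, rearranged to X = Z * sigma + mu
z, mu, sigma = 1.96, 8.5, 1.8
x1 = z * sigma + mu          # 3.528 + 8.5 = 12.028, i.e. 12.0 to one d.p.

# Example 2: Z = (X - mu) / (sigma / sqrt(n)), rearranged the same way
z, mu, sigma, n = 2.58, -2.5, 0.85, 60
x2 = z * sigma / math.sqrt(n) + mu
```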



6. Powers <strong>and</strong> Logarithms<br />

The following power rules may be needed occasionally, <strong>and</strong> examples will be given where<br />

necessary.<br />

x^a × x^b = x^(a+b)<br />

(x^a)^b = x^(ab)<br />

x^a / x^b = x^(a−b)<br />

x^(−a) = 1/x^a<br />

x^(1/2) = √x<br />

The following log rules may also be needed. Note that in this paper, log means log e (or natural log<br />

i.e. ln).<br />

log_e x = ln x<br />

ln x = y ⟷ e^y = x (where e = 2.71828 (five d.p.))<br />

ln(x^y) = y ln(x)<br />

ln(x) + ln(y) = ln(xy)<br />

ln(x) − ln(y) = ln(x/y)<br />

Example: If log(π̂/(1 − π̂)) = 3.1305 – 1.1499 – 0.027729 × 45, find the value of the expression π̂/(1 − π̂).<br />

• First use BEDMAS to evaluate the RHS (Right Hand Side) of the expression:<br />

3.1305 – 1.1499 – 0.027729 × 45 = 3.1305 – 1.1499 – 1.247805<br />

= 0.732795<br />

• We now have log(π̂/(1 − π̂)) = 0.732795.<br />

Remembering that log here means ln, we are able to rewrite this in exponential form using the formula<br />

ln x = y ⟷ e^y = x<br />

Therefore π̂/(1 − π̂) = e^0.732795 = 2.08 (two d.p.)<br />
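The same two steps translate directly to Python's math module (pi_ratio here stands for the quantity π̂/(1 − π̂)):<br />

```python
import math

# Step 1: evaluate the right-hand side, multiplication before subtraction (BEDMAS)
rhs = 3.1305 - 1.1499 - 0.027729 * 45      # 0.732795

# Step 2: undo the natural log, using ln(a) = b  <=>  a = e**b
pi_ratio = math.exp(rhs)                   # about 2.08
```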



7. Sigma means Add Up<br />

The Greek letter Σ (capital sigma) means “add up what follows”.<br />

Example 1: Evaluate ∑_{i=1}^{3} 3^i<br />

Each of the values 1, 2, and 3 is substituted into the expression one by one in place of the variable i. Then the three values are added:<br />

∑_{i=1}^{3} 3^i = 3^1 + 3^2 + 3^3<br />

= 3 + 9 + 27 = 39<br />

Example 2: Expand the expression ∑_{i=1}^{n} x_i, where x_1 is the first observation, x_2 the second observation, etc. in a data set.<br />

There are n observations. Write out the sum of the first two or three observations, use three dots to indicate the other values, and add on the final observation:<br />

∑_{i=1}^{n} x_i = x_1 + x_2 + x_3 + . . . + x_n<br />
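Sigma notation corresponds directly to a sum over a range of index values; a one-line sketch of Example 1:<br />

```python
# Substitute i = 1, 2, 3 into 3**i in turn, then add the results: 3 + 9 + 27
total = sum(3 ** i for i in range(1, 4))
```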

Notation<br />

x_i is the i-th term from the data set x_1, x_2, x_3, . . ., x_n<br />

x_ij is the (i, j)-th term from the data set<br />

x_11, x_21, . . ., x_n1<br />

x_12, x_22, . . ., x_n2<br />

. . .<br />

x_1k, x_2k, . . ., x_nk<br />

Example 3:<br />

If we select 50 female <strong>and</strong> 50 male Stat 115 students <strong>and</strong> measure their<br />

heights, we obtain the data set<br />

x_ij, i = 1, 2; j = 1, 2, . . . , 50<br />

Here i represents sex (1 for female and 2 for male), and j the individual.<br />

For example, x_29 is the height of the 9th male in the sample.<br />

Example 4: Evaluate the expression x̄ = (1/4) ∑_{i=1}^{4} x_i, where x_i is the i-th observation in the set {4, 7.5, 3.5, 8}.<br />

• Substitute each of the x_i values into the expression and follow BEDMAS:<br />

x̄ = (1/4)(4 + 7.5 + 3.5 + 8)<br />

= (1/4)(23)<br />

= 5.75<br />

Example 5: Evaluate the expression v = (1/3) ∑_{i=1}^{4} (x_i − x̄)², where x_i is the i-th observation in the set {4, 7.5, 3.5, 8} and x̄ = 5.75 (calculated in Example 4).<br />

• Substitute each of the x_i values into the expression, along with x̄ = 5.75:<br />

v = (1/3)( (4 − 5.75)² + (7.5 − 5.75)² + (3.5 − 5.75)² + (8 − 5.75)² )<br />

• Follow BEDMAS and evaluate each one of the four inner brackets:<br />

v = (1/3)( (−1.75)² + (1.75)² + (−2.25)² + (2.25)² )<br />

• The exponents (squares) are calculated and then the four terms are added:<br />

v = (1/3)( 3.0625 + 3.0625 + 5.0625 + 5.0625 )<br />

= (1/3)(16.25)<br />

• The multiplication by 1/3 is outside the brackets so it is calculated last:<br />

v = 5.417 (to three d.p.)<br />
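Examples 4 and 5 can be written out as explicit sums; a sketch (the names x_bar and v mirror x̄ and v above):<br />

```python
data = [4, 7.5, 3.5, 8]

# Example 4: x_bar = (1/4) * sum of the observations
x_bar = sum(data) / 4

# Example 5: v = (1/3) * sum of squared deviations from x_bar
v = sum((x - x_bar) ** 2 for x in data) / 3
```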

Example 6: Evaluate the expression χ² = ∑_{all cells} (observed − expected)²/expected for the table below, where the expected values are given in brackets and the observed values are not in brackets:<br />

15 (26) 50 (39)<br />

33 (22) 22 (33)<br />

• Note that for this type of sigma expression, the notation means we have to add up the result from each of the four cells.<br />

• Substitute each value into the expression:<br />

χ² = (15 − 26)²/26 + (50 − 39)²/39 + (33 − 22)²/22 + (22 − 33)²/33<br />

• Evaluate each bracket, and then square the result:<br />

χ² = (−11)²/26 + (11)²/39 + (11)²/22 + (−11)²/33<br />

= 121/26 + 121/39 + 121/22 + 121/33<br />

• Use the a b/c (or equivalent) button to calculate the sum:<br />

χ² = 16.923 (to three d.p.)<br />
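The four-cell sum can be sketched as a loop over (observed, expected) pairs:<br />

```python
# (observed, expected) for each of the four cells in the table
cells = [(15, 26), (50, 39), (33, 22), (22, 33)]

# chi-squared = sum over all cells of (observed - expected)**2 / expected
chi_sq = sum((o - e) ** 2 / e for o, e in cells)
```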



Section 2: Basic Statistical Concepts<br />

1. Mean<br />

The mean x̄ is commonly referred to as the “average”. It is used as a measure of the “centre” of a<br />

data set. To find the mean, simply add up all your data values (observations) <strong>and</strong> divide by the<br />

number <strong>of</strong> values (sample size):<br />

x̄ = (x_1 + x_2 + . . . + x_n)/n or x̄ = (1/n) ∑_{i=1}^{n} x_i<br />

Example:<br />

Calculate the mean <strong>of</strong> the data set 2, 4, 6, 8, 10, 12.<br />

There are six values in the data set i.e. n = 6.<br />

x̄ = (2 + 4 + 6 + 8 + 10 + 12)/6 = 42/6 = 7<br />
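The same calculation using Python's standard statistics module (a quick check, not part of the original booklet):<br />

```python
import statistics

data = [2, 4, 6, 8, 10, 12]
x_bar = statistics.mean(data)   # (2 + 4 + ... + 12) / 6 = 7
```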

2. Median<br />

The median is defined as the middle observation in the data set, <strong>and</strong> is another measure <strong>of</strong> the centre<br />

<strong>of</strong> the data. Note that the data must be in order before you calculate the median!<br />

• In general, the median is the ((n + 1)/2)-th observation, where n is the sample size.<br />

• If there is an odd number <strong>of</strong> observations, the median will be the middle observation.<br />

• If there is an even number <strong>of</strong> observations, the median will be the mean <strong>of</strong> the two middle<br />

observations.<br />

Example 1: Calculate the median <strong>of</strong> the data set 10, 1, 3, 8, 9.<br />

• First sort the data into order: 1, 3, 8, 9, 10<br />

• There are n = 5 observations, so the median is the ((5 + 1)/2) = 3rd observation, i.e. 8.<br />

Example 2: Calculate the median <strong>of</strong> the data set 32, 2, 36, 14, 6, 33.<br />

• First sort the data into order: 2, 6, 14, 32, 33, 36<br />

• There are n = 6 observations, so the median is the ((6 + 1)/2) = 3.5th observation.<br />

• Take the mean of the 3rd and 4th observations, i.e. (14 + 32)/2 = 23.<br />
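statistics.median applies both rules automatically (it sorts the data for you); a sketch of the two examples:<br />

```python
import statistics

# Odd n: the middle value of the sorted data (sorted: 1, 3, 8, 9, 10)
median_odd = statistics.median([10, 1, 3, 8, 9])

# Even n: the mean of the two middle values (sorted: 2, 6, 14, 32, 33, 36)
median_even = statistics.median([32, 2, 36, 14, 6, 33])
```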



3. Range<br />

The range is the difference between the largest <strong>and</strong> smallest observations in the data set. It is a<br />

measure <strong>of</strong> the variation in the data.<br />

Example:<br />

The range <strong>of</strong> the data set 2, 5, 6, 9, 16, 2, 13 is 16 – 2 = 14.<br />

4. Variance <strong>and</strong> St<strong>and</strong>ard Deviation<br />

• The variance (s²) is calculated as follows:<br />

s² = (1/(n − 1)) ∑_{i=1}^{n} (x_i − x̄)²<br />

The standard deviation (s) is the most commonly used measure of variation in a set of data. It is the square root of the variance,<br />

i.e. s = √[ (1/(n − 1)) ∑_{i=1}^{n} (x_i − x̄)² ]<br />

Usually we calculate the variance first, then we take the square root to give the st<strong>and</strong>ard deviation.<br />

(This follows the order <strong>of</strong> operation indicated by BEDMAS)<br />

Example:<br />

The mean for the data set 9, 5, 6, 4, 16, 2 is 7.0. Calculate the st<strong>and</strong>ard deviation:<br />

• First calculate the variance. Substitute in each value, including x̄ = 7 and n = 6:<br />

s² = (1/5)( (9 − 7)² + (5 − 7)² + (6 − 7)² + (4 − 7)² + (16 − 7)² + (2 − 7)² )<br />

• Evaluate the expression, following BEDMAS:<br />

s² = (1/5)( (2)² + (−2)² + (−1)² + (−3)² + (9)² + (−5)² )<br />

= (1/5)( 4 + 4 + 1 + 9 + 81 + 25 )<br />

= (1/5)(124) = 24.8<br />

• Take the square root of the variance to give the standard deviation:<br />

s = √24.8 = 4.98 (to two decimal places)<br />
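The statistics module computes both quantities with the same n − 1 divisor; a quick check:<br />

```python
import statistics

data = [9, 5, 6, 4, 16, 2]
s_squared = statistics.variance(data)   # sample variance, divisor n - 1
s = statistics.stdev(data)              # standard deviation = sqrt(variance)
```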



5. Quartiles <strong>and</strong> Interquartile Range<br />

There are two quartiles: a lower quartile (Q 1 ) <strong>and</strong> an upper quartile (Q 3 ). The lower quartile has<br />

25% <strong>of</strong> the data below it, <strong>and</strong> the upper quartile has 25% <strong>of</strong> the data above it.<br />

To find a quartile, first find the median <strong>of</strong> the data set. Then treat the data above the median (upper<br />

set) <strong>and</strong> the data below the median (lower set) as separate sets. The lower quartile is the median <strong>of</strong><br />

the lower set, while the upper quartile is the median <strong>of</strong> the upper set.<br />

The interquartile range is the upper quartile minus the lower quartile, <strong>and</strong> contains 50% <strong>of</strong> the data.<br />

It is a measure <strong>of</strong> the variation in the data.<br />

Example 1:<br />

The data set 1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21 has a median <strong>of</strong> 11.<br />

• Therefore the lower set is 1, 3, 5, 7, 9, which has a median <strong>of</strong> 5. So the lower quartile is 5.<br />

• The upper set is 13, 15, 17, 19, 21, <strong>and</strong> has a median <strong>of</strong> 17. So the upper quartile is 17.<br />

• The interquartile range is 17 – 5 = 12<br />

Example 2:<br />

The data set 1, 5, 6, 8, 12, 16, 19, 22, 29, 31, 36, 40 has a median <strong>of</strong> 17.5.<br />

• The lower set is 1, 5, 6, 8, 12, 16 which has a median <strong>of</strong> 7, so the lower quartile is 7.<br />

• The upper set is 19, 22, 29, 31, 36, 40, which has a median <strong>of</strong> 30, so the upper quartile is 30.<br />

• The interquartile range is 30 – 7 = 23.<br />
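The median-split method above can be sketched directly. Note this follows the booklet's rule; statistical software often uses slightly different quartile conventions:<br />

```python
import statistics

def quartiles(data):
    """Lower and upper quartiles by the median-split rule."""
    ordered = sorted(data)
    half = len(ordered) // 2          # size of each half (median excluded when n is odd)
    lower_set = ordered[:half]
    upper_set = ordered[-half:]
    return statistics.median(lower_set), statistics.median(upper_set)

q1, q3 = quartiles([1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21])
iqr = q3 - q1                          # interquartile range: 17 - 5 = 12
```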

6. Scatterplot<br />

A scatterplot shows the relationship between two variables. Each observation consists <strong>of</strong> two<br />

measurements. Often we are interested in the “response” <strong>of</strong> one measurement to the value <strong>of</strong> the<br />

other. We try to distinguish between the “response” variable <strong>and</strong> the “explanatory” variable. The<br />

response variable is plotted on the y-axis (vertical axis) <strong>and</strong> the explanatory variable on the x-axis<br />

(horizontal axis).<br />

Example:<br />

The weight <strong>of</strong> 13 students <strong>and</strong> the amount <strong>of</strong> time it took them to drink a particular beverage are<br />

plotted below: the explanatory variable is the student’s weight (x-axis) <strong>and</strong> the response variable is<br />

the time taken to drink the beverage (y-axis).<br />

[Scatterplot: Weight of Student (kg), from 50 to 120, on the x-axis; Time taken to drink beverage, from 0 to 9, on the y-axis.]<br />



Section 3: Sample Exercise<br />

This sample exercise contains questions based on the Basics Booklet, plus a few questions from<br />

material taught during the first week <strong>of</strong> the course.<br />

1. In a recent study looking at rainbow trout, researchers measured the lengths <strong>of</strong> juvenile fish.<br />

The lengths (in cm) for five r<strong>and</strong>omly selected fish were:<br />

18.6, 15.4, 13.4, 17.0, 12.9<br />

Calculate to one decimal place the mean for these data.<br />

2. For a second random sample of six juvenile fish the lengths (in cm) were:<br />

15.5, 12.6, 17.5, 17.4, 13.8, 12.2<br />

Calculate the median for these data.<br />

3. Calculate the range for the data in Question 2.<br />

4. For a third r<strong>and</strong>om sample <strong>of</strong> five juvenile fish the lengths (in cm) were:<br />

14.5, 14.8, 16.5, 18.4, 13.8<br />

The mean for these data is 15.6 (cm). Calculate to one decimal place the st<strong>and</strong>ard deviation<br />

for these data.<br />

5. The mean value <strong>of</strong> 15.6 (cm) in Question 4 is a:<br />

A. Parameter<br />

B. Statistic<br />

C. Distribution<br />

D. Population value<br />

E. Measure <strong>of</strong> Spread<br />

6. The following list contains five values:<br />

3.2%<br />

0.096<br />

0.048<br />

0.32<br />

0.58%<br />

Beside each value select “True” if the value is less than 0.05 or “False” if the value is greater<br />

than 0.05.<br />

7. Calculate the value of the expression ∑_{i=1}^{5} 3i.<br />

8. If Z = (X − μ)/(σ/√n), with X = 43.6, μ = 48, σ = 8.6 and n = 50, then calculate the value of Z.<br />



9. If 1.96 = (X − 2.8)/5, calculate the value of X.<br />

10. In a previous STAT110 class at Otago <strong>University</strong>, 64% <strong>of</strong> students sitting the paper were<br />

known to be first year students. In a study <strong>of</strong> students sitting the paper, a r<strong>and</strong>om sample <strong>of</strong><br />

40 students was taken, <strong>and</strong> 60% <strong>of</strong> the students in this sample were found to be first year<br />

students.<br />

To earn the mark in this question you will have to answer correctly both questions below. For<br />

each question select your answer from these five options:<br />

A. 64%<br />

B. 40 students<br />

C. 60%<br />

D. all first year students at Otago <strong>University</strong><br />

E. students sitting the paper<br />

Question 1: The statistic in the paragraph above is: . . . . .<br />

Question 2: What is the population? . . . . .<br />

Answers<br />

Answers without working are provided. For the working, look through the Basic Booklet above, or<br />

consult your Notes for the first week <strong>of</strong> the course. If you need help, go to one <strong>of</strong> the help sessions.<br />

Details <strong>of</strong> these sessions are provided in the Course Outline at the start <strong>of</strong> this book.<br />

1. 15.5 cm (1 d.p.)<br />

2. 14.65 cm<br />

3. 5.3 cm<br />

4. 1.9 cm<br />

5. B<br />

6. True, false, true, false, true<br />

7. 45<br />

8. –3.62 (2 d.p.)<br />

9. 12.6<br />

10. B, E<br />
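If you want to check your own working for Questions 1–4, a quick sketch (this is not part of the marked exercise):<br />

```python
import statistics

q1 = round(statistics.mean([18.6, 15.4, 13.4, 17.0, 12.9]), 1)    # mean, 1 d.p.

sample2 = [15.5, 12.6, 17.5, 17.4, 13.8, 12.2]
q2 = statistics.median(sample2)                                    # median
q3 = max(sample2) - min(sample2)                                   # range

q4 = round(statistics.stdev([14.5, 14.8, 16.5, 18.4, 13.8]), 1)   # standard deviation, 1 d.p.
```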



Appendix Two: Some Summaries<br />

1. Some Useful Rules <strong>of</strong> Probability<br />

2. R<strong>and</strong>om Variables<br />

3. Binomial Distribution<br />

4. Normal Distribution<br />

Basic Probability Rules <strong>and</strong> Distributions<br />

1. Some Useful Rules <strong>of</strong> Probability<br />

• Pr(A or B) = Pr(A) + Pr(B) – Pr(A <strong>and</strong> B)<br />

If we use set notation for this rule, it can be rewritten as<br />

Pr(A ∪ B) = Pr(A) + Pr(B) – Pr(A ∩ B)<br />

[Venn diagrams: A ∪ B shades everything in either set; A ∩ B shades the overlap.]<br />

• If A <strong>and</strong> B are mutually exclusive<br />

(disjoint) then:<br />

Pr(A <strong>and</strong> B) = 0<br />

or Pr(A ∩ B) = 0<br />

[Venn diagram: disjoint sets A and B with no overlap.]<br />

• If Ā represents the complement of A (every event not in A) then<br />

Pr(A) + Pr(Ā) = 1<br />

[Venn diagram: A and its complement Ā.]<br />

Appendix 2 – Some summaries


• Probability of B given A: Pr(A ∩ B) = Pr(A) × Pr(B | A)<br />

This may be rewritten as Pr(B | A) = Pr(A ∩ B) / Pr(A)<br />

[Tree diagram: the first branch splits into A and Ā; each then splits into B and B̄, with branch probabilities Pr(B | A), Pr(B̄ | A), Pr(B | Ā), Pr(B̄ | Ā) and end-point probabilities Pr(A ∩ B), Pr(A ∩ B̄), Pr(Ā ∩ B), Pr(Ā ∩ B̄).]<br />

• If A <strong>and</strong> B are independent then: (i) P(B | A) = P(B)<br />

(ii) P(A ∩B) = P(A) × P(B)<br />

2. R<strong>and</strong>om Variables<br />

• A r<strong>and</strong>om variable is one whose value is determined by a r<strong>and</strong>om mechanism.<br />

• A continuous r<strong>and</strong>om variable can take any value in an interval.<br />

• A discrete r<strong>and</strong>om variable can take one <strong>of</strong> a countable number <strong>of</strong> values.<br />

3. Binomial Distribution<br />

Suppose<br />

1. We have a fixed number <strong>of</strong> trials (n)<br />

2. Trials are independent<br />

3. Each trial has only two outcomes (“success” or “failure”)<br />

4. The probability <strong>of</strong> success (π) is the same for each trial<br />

The total number <strong>of</strong> successes (X) is a discrete r<strong>and</strong>om variable <strong>and</strong> has a Binomial distribution,<br />

with<br />

Pr(X = x) = (n choose x) π^x (1 − π)^(n − x)<br />

The mean and variance of the distribution are<br />

μ = nπ and σ² = nπ(1 − π)<br />

Example:<br />

If n = 30 and π = 0.6 then<br />

• μ = nπ = 30 × 0.6 = 18<br />

• σ² = nπ(1 − π) = 30 × 0.6 × 0.4 = 7.2<br />

• The standard deviation is σ = √7.2 = 2.68 (to two d.p.)<br />
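A sketch of the same quantities in Python (math.comb, available from Python 3.8, gives the binomial coefficient):<br />

```python
import math

n, pi = 30, 0.6

mu = n * pi                  # mean = n * pi = 18
var = n * pi * (1 - pi)      # variance = n * pi * (1 - pi) = 7.2
sd = math.sqrt(var)          # standard deviation, about 2.68

def binom_pmf(x, n, pi):
    """Pr(X = x) = C(n, x) * pi**x * (1 - pi)**(n - x)"""
    return math.comb(n, x) * pi ** x * (1 - pi) ** (n - x)
```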

4. Normal Distribution<br />

A distribution that is commonly used to describe the behaviour <strong>of</strong> continuous r<strong>and</strong>om variables is<br />

the normal distribution.<br />

• X ~ N(μ, σ²) means “X has a normal distribution with mean μ and variance σ²”<br />

• X ~ N(0, 1) means X has a standard normal distribution<br />

• If X ~ N(μ, σ²), then the standardised random variable Z = (X − μ)/σ ~ N(0, 1)<br />

For any normal distribution, approximately:<br />

• 68% of the observations are between μ − σ and μ + σ.<br />

• 95% of the observations are between μ − 2σ and μ + 2σ.<br />

• 99.7% of the observations are between μ − 3σ and μ + 3σ.<br />

Example:<br />

If X ~ N(45, 30) then<br />

• μ = 45<br />

• the standard deviation σ = √30 = 5.477 (to three d.p.)<br />

• Approximately 68% of the observations are expected to be between μ − σ = 39.5 and μ + σ = 50.5.<br />

• Approximately 95% of the observations are expected to be between 34 and 56.<br />

• Over 99% (i.e. almost all) of the observations are expected to be between 28.5 and 61.<br />

• Pr(X < 40) = Pr(Z < (X − μ)/σ) = Pr(Z < (40 − 45)/5.477) = Pr(Z < −0.913) = 0.1806<br />

[Sketch: normal curve with the area to the left of X = 40 (Z = −0.913) shaded.]<br />
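The final probability can be reproduced with statistics.NormalDist (Python 3.8+); note that NormalDist takes the standard deviation, not the variance:<br />

```python
from statistics import NormalDist

X = NormalDist(mu=45, sigma=30 ** 0.5)   # X ~ N(45, 30), so sigma = sqrt(30)

z = (40 - 45) / (30 ** 0.5)              # standardised value, about -0.913
p = X.cdf(40)                            # Pr(X < 40), about 0.181
```

Any small difference from the booklet's 0.1806 comes from rounding z to three decimal places before using tables.<br />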



Summary <strong>of</strong> Formulae<br />

1. Normal Distribution<br />

If X is a normal random variable with parameters μ_X (mean) and σ²_X (variance)<br />

• Mean: μ_X<br />

• Standard deviation: σ_X = √(σ²_X)<br />

A standard normal random variable Z has mean μ_Z = 0 and σ²_Z = 1. To transform a normal random variable X into a standard normal (and vice versa):<br />

Z = (X − μ_X)/σ_X and X = Zσ_X + μ_X.<br />

2. Binomial Distribution<br />

If X is a binomial random variable with n trials and probability π then<br />

• Mean: μ_X = nπ<br />

• Standard deviation: σ_X = √(nπ(1 − π))<br />

• If nπ and n(1 − π) are both greater than 5, then X is approximately normally distributed with mean μ_X and variance σ²_X.<br />

3. Distributions of Statistics<br />

• The mean X̄ of a random sample of size n has mean μ_X̄ = μ_X and standard deviation σ_X̄ = σ_X/√n.<br />

• The sample proportion P computed from a binomial distribution with parameters n and π has a mean of μ_P = π and standard deviation σ_P = √(π(1 − π)/n). If nπ and n(1 − π) are both greater than 5, then P will be approximately normally distributed.<br />

• The distribution of the difference between two sample means X̄_1 − X̄_2 has a mean of μ_{X̄1 − X̄2} = μ_1 − μ_2 and a standard deviation of σ_{X̄1 − X̄2} = √(σ²_1/n_1 + σ²_2/n_2).<br />

- In large random samples (n_1 and n_2 ≥ 30), σ_{X̄1 − X̄2} can be estimated by σ̂_{X̄1 − X̄2} = √(s²_1/n_1 + s²_2/n_2).<br />

- If σ²_1 = σ²_2, then we can estimate σ_{X̄1 − X̄2} by σ̂_{X̄1 − X̄2} = √( ((n_1 − 1)s²_1 + (n_2 − 1)s²_2)/(n_1 + n_2 − 2) ) × √(1/n_1 + 1/n_2).<br />

4. Contingency tables<br />

Factor 1 \ Factor 2: Level 1 | Level 2 | Total<br />

Level 1: w | x | r_1 = w + x<br />

Level 2: y | z | r_2 = y + z<br />

Total: c_1 = w + y | c_2 = x + z | n = w + x + y + z<br />

χ² = ∑_{i=1}^{2} ∑_{j=1}^{2} (o_ij − e_ij)²/e_ij, where e_ij = r_i c_j / n and o_ij is the observed value in row i, column j.<br />

Odds ratio: OR = (w/x)/(y/z) = (w × z)/(x × y)<br />

Relative risk: RR = (w/(w + x)) / (y/(y + z))<br />

Attributable risk: AR = w/(w + x) − y/(y + z)<br />
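With the table's cell counts w, x, y and z, the three measures translate directly; a sketch (the example counts are hypothetical):<br />

```python
def two_by_two_measures(w, x, y, z):
    """Odds ratio, relative risk and attributable risk for a 2 x 2 table."""
    odds_ratio = (w * z) / (x * y)
    relative_risk = (w / (w + x)) / (y / (y + z))
    attributable_risk = w / (w + x) - y / (y + z)
    return odds_ratio, relative_risk, attributable_risk

# Hypothetical counts: w = 20, x = 80, y = 10, z = 90
or_value, rr_value, ar_value = two_by_two_measures(20, 80, 10, 90)
```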

Appendix 3 - Formulae


5. Confidence Intervals<br />

All <strong>of</strong> the 100(1 − α)% confidence intervals calculated in this course are <strong>of</strong> the form:<br />

Estimate ± multiplier × st<strong>and</strong>ard error.<br />

In the following ¯x, p etc are the values calculated from the samples.<br />

Each interval below is listed as: Estimate; df (ν); Multiplier; Standard Error.<br />

Population mean<br />

• Random sample, σ_X known: x̄; NA; z_{α/2}; σ_X/√n<br />

• Random normal sample, σ_X unknown and estimated by s: x̄; n − 1; t_{α/2,ν}; s/√n<br />

Difference between population means<br />

• Small random samples, normal population, σ_1 = σ_2 = σ unknown: x̄_1 − x̄_2; n_1 + n_2 − 2; t_{α/2,ν}; √( ((n_1 − 1)s²_1 + (n_2 − 1)s²_2)/(n_1 + n_2 − 2) ) × √(1/n_1 + 1/n_2)<br />

• Large random samples (both ≥ 30): x̄_1 − x̄_2; NA; z_{α/2}; √(s²_1/n_1 + s²_2/n_2)<br />

• Paired difference in small random samples from a normal population: d̄; ν = n − 1; t_{α/2,ν}; s_d/√n<br />

After ANOVA and Regression<br />

• Estimate, multiplier and standard errors determined from output<br />

Population proportions<br />

• Population proportion: p; NA; z_{α/2}; √(p(1 − p)/n)<br />

• Difference between 2 population proportions: p_1 − p_2; NA; z_{α/2}; √(p_1(1 − p_1)/n_1 + p_2(1 − p_2)/n_2)<br />

Odds ratio, relative risk, attributable risk (see contingency tables above for w, x, y and z)<br />

• Log (natural) odds ratio: ln(OR); NA; z_{α/2}; √(1/w + 1/x + 1/y + 1/z)<br />

• Log (natural) relative risk: ln(RR); NA; z_{α/2}; √(1/w − 1/(w + x) + 1/y − 1/(y + z))<br />

• Attributable risk: as for the difference of two population proportions with p_1 = w/(w + x) and p_2 = y/(y + z)<br />
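As one instance of the Estimate ± multiplier × standard error pattern, here is a sketch of the large-sample interval for a population mean (1.96 is z_{α/2} for 95% confidence; the data are hypothetical):<br />

```python
import statistics

def mean_ci_95(data):
    """95% confidence interval for the mean, large-sample z multiplier."""
    n = len(data)
    x_bar = statistics.mean(data)
    std_err = statistics.stdev(data) / n ** 0.5   # estimated standard error of the mean
    return x_bar - 1.96 * std_err, x_bar + 1.96 * std_err

low, high = mean_ci_95(list(range(1, 101)))       # hypothetical sample: 1, 2, ..., 100
```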

6. Regression<br />

ŷ = β̂_0 + β̂_1 x, where β̂_1 = ∑(x_i − x̄)(y_i − ȳ) / ∑(x_i − x̄)² and β̂_0 = ȳ − β̂_1 x̄.<br />

Standard error of the slope: SE(β̂_1) = s_e / √(∑(x_i − x̄)²), where s_e = √( ∑(y_i − ŷ_i)²/(n − 2) ) = √(MS Residual).<br />

Standard error of a forecast at x_k = s_e √( 1 + 1/n + (x_k − x̄)²/∑(x_i − x̄)² ).<br />

7. ANOVA<br />

1. Total SS = Treatment SS + Error SS<br />

2. Total df = Treatment df + Error df<br />

3. MS Treatment = Treatment SS/Treatment df and MS Error = Error SS/Error df<br />

4. Overall mean SS = nȳ², where n = n_1 + . . . + n_k and ȳ = (1/n)(n_1ȳ_1 + . . . + n_k ȳ_k).<br />

5. Treatment SS = C²_1/n_1 + C²_2/n_2 + . . . + C²_k/n_k − nȳ², where C_j is the jth column total.<br />
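The slope and intercept formulas translate directly into a short function; a sketch (the data used to exercise it are hypothetical):<br />

```python
def least_squares(xs, ys):
    """Fitted intercept b0 and slope b1 for y-hat = b0 + b1 * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))   # sum (x - xbar)(y - ybar)
    sxx = sum((x - x_bar) ** 2 for x in xs)                        # sum (x - xbar)^2
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1

# Hypothetical data lying exactly on y = 1 + 2x
b0, b1 = least_squares([1, 2, 3, 4], [3, 5, 7, 9])
```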

