CONTENTS - Department of Mathematics and Statistics - University ...
CONTENTS
Introduction, General Information and Administration, Overview
SECTION 1
This section covers an introduction to the package R-cmdr and presents an overview of biostatistics and research methodology.
Biostatistics and Research Methodology; R-cmdr
Types of Data
Numerical Data and Histograms
Measures of Centre: Mean and Median
Measures of Variability: Standard Deviation, Variance and Interquartile Range
Box-and-Whisker Plots
SECTION 2
This covers the measures of disease frequency and disease association, with several examples looking at prevalence, incidence, relative risks, attributable risk and odds ratios.
Prevalence and Incidence
Cumulative Incidence
Incidence Rate
Disease Association
Relative Risk
Attributable Risk
Odds Ratio
SECTION 3
This section covers a brief introduction to probability definitions, notation, rules and random variables, with examples, several involving tree diagram use.
Definitions including mutually exclusive and independent events
The Addition Rule for combining probabilities
The Multiplication Rule for probabilities
Tree diagrams with examples
Screening test terminology
Probability Distributions and Random Variables
Rules for combining Random Variables
SECTION 4
This section introduces both the Binomial and Normal Distributions, which model many phenomena arising in the real world. Consequently these distributions allow us to answer some important and relevant questions.
The Binomial Distribution: Definition, mean and variance
The Binomial Table: Examples
The Normal Distribution: Definition
Standard Normal Distribution and Table
General Normal Distribution
Normal Approximation to the Binomial
Transforming Data to Normal
SECTION 5
This section defines sampling distributions, establishes the standard deviations of these distributions (called standard errors), and sets up confidence intervals for population means, differences between the means of two populations, proportions, and differences between proportions, based on random samples drawn from the populations.
An outline of the Research Process
The Distribution of Sample Means
The Standard Error of the Mean
Confidence Interval for a Mean
The t-distribution and Its Use
Comparison of Two Independent Groups
The Standard Error of the Difference Between Two Means
Pooled Estimate for the Common Variance
Comparison of Two Dependent Groups (Paired Data)
Confidence Interval for a Proportion
Confidence Interval for Difference Between Two Proportions
Summary of Distributions and Confidence Intervals
SECTION 6
This section reviews hypothesis testing, Type I and Type II errors, conclusive and inconclusive results, and the power of a study.
Null and Alternative Hypotheses
Study Based and Data Driven Hypotheses
One and Two Sided Tests
Four Steps in the Hypothesis Testing Procedure
Examples
Pooled Proportion Estimate
Clinical and Ecological Importance
Conclusive and Inconclusive Results
Errors in Hypothesis Testing
Power of a Study
Examples
SECTION 7
One Factor Analysis of Variance
Post Analysis of Variance Tests on Means
Multiple Comparison Procedures
SECTION 8
This section covers the analysis of count data, including the Chi-square test for contingency, the Chi-square test for trend, as well as relative risks, attributable risks and odds ratios along with their confidence intervals. The analysis of a three-way table and Simpson’s paradox are investigated as a way of introducing the concept of a confounding variable in the lead-up to regression analyses.
Categorical Data Examples
Relative Risk and its Confidence Interval
Attributable Risk and its Confidence Interval
Odds Ratio and its Confidence Interval
Chi-square Test for Contingency
Chi-square Test for Trend
Interpretation of Confidence Intervals
Simpson’s Paradox and Confounder Control
SECTION 9
This section introduces Simple Linear Regression, which fits a straight line through what is called a scatter diagram. One purpose of this analysis is to establish whether a predictor variable influences the outcomes of a response variable, and to measure the magnitude of the effect of this predictor variable on the outcome. The fitted straight line can also be used to make predictions.
Simple linear regression is also the first step in controlling for a confounder variable. This occurs with the extension to multiple regression, which is considered in the next section.
Scatter Diagrams and Examples
Equation of Fitted Straight Line
Analysis of Variance for Regression Model
Confidence Interval for Slope
Confidence Interval for Prediction
Correlation as Measure of Linear Association
Review Exercises
SECTION 10
Multiple regression models and logistic regression models are introduced in this section. In ordinary multiple regression the response (outcome) variable is on a continuous scale, whereas in logistic regression the outcome measure is binary, taking only two possible values interpreted as success versus failure.
The models allow us to identify those variables which have an effect on the outcomes and those which do not.
Adding variables leads to adjusted values for the estimated parameters, and it is this that allows us to control for confounding.
The Multiple Regression Model
R-cmdr Printout for Multiple Regression
Dummy Variables
Checking Model Fit
Parallel Regression Lines and Analysis of Covariance
Binary Outcomes and Logistic Regression
Study Design Principles
Critical Appraisal
Confounding Analysis
Sources of Bias
SECTION 11
Appendix 1: The Basics – mathematical rules and statistical concepts
Appendix 2: Some summaries
Appendix 3: Formulae
STAT115 INTRODUCTION TO BIOSTATISTICS 2012
Advances in our understanding of factors which affect health and wellbeing come through research in the health sciences. Examples of such research include surveys to describe patterns of disease in a community or risk factors for disease such as diet and smoking; studies trying to find out whether a newly developed treatment works; studies of factors which may prevent disease, such as physical activity; and studies of barriers to improving health, such as reasons for declining vaccination rates in children or obstacles to smoking prevention. Biostatistics (statistics applied in the health sciences) is a vital tool in our mission to improve health and wellbeing for all people.
STAT115 provides an introduction to the core principles and methods of biostatistics. In this course you will gain an understanding of how statistics is used to answer research questions: how to look for patterns in data, how to test hypotheses about disease causation and prevention and improvement in well-being. The understanding and skills gained in STAT115 can be a starting point for a career in biostatistics, or can be used to assist understanding of research in other disciplines including physiology, anatomy, human nutrition, sports science, and psychology.
GENERAL INFORMATION AND ADMINISTRATION
Lecturers
Dr Katrina Sharples, Dept of Preventive and Social Medicine, Adams Building
Dr Janine Wright, Room 237, Science III Building
Mr Daniel Turek, Room 231, Science III Building
Dr David Bryant, Room 514, Science III Building
Lectures
Lectures are held as follows: Monday, Tuesday, Thursday and Friday at 11.00 am, commencing Monday 9 July. Although these notes are extensive, experience shows that students who miss lectures are at a severe disadvantage.
Help Sessions and Tutorials
These will be held in the 539 Castle St laboratory, which has 36 computers. Tutorials are cafeteria style, which means that you can attend at any scheduled time when tutors are available to help with weekly exercises. Times can be found on the STAT115 paper page on the Mathematics and Statistics Department website. In addition, you may access the computers to complete assignments outside of scheduled sessions. Attend early in the week to avoid the inevitable rush before submission day.
STAT 115 Web Page and Resource Area
The STAT 115 web page, www.maths.otago.ac.nz/stat115, will contain course resource material. Answers to weekly exercises, notices, old exam papers with solutions and any other useful information will be posted here. You can access such information by clicking on the Resources button. You are strongly advised to read through the solutions to weekly exercises, as students who fail to do this are at a severe disadvantage.
Introduction & overview
Support Classes
There is also a Wednesday evening support class for students worried about their mathematics background for this course. This class will be held in 539 Castle St at 6pm on Wednesday evenings. If you wish to attend the support class you will need to register using the form which is available on the resource page or from the Maths and Statistics Reception, Science III, 2nd floor. Our experience is that only a small number of students will need to use the support class. Note that there is no mathematics prerequisite for this course. If you have difficulty carrying out the calculations in the Basics Booklet of Appendix 1 of these notes, you may find it helpful to attend the support class. In addition, you can access Mathercize by going to the web page mathercize.otago.ac.nz, log-in password line. The options
• STAT115 Exercises for Biostatistics
• STAT115 Revision mathematics
will take you through background material for this course in an easy-to-use self-testing environment.
Study Centre
A Study Centre will operate in a room at the back of 539 Castle St. This is an area where you can go to work with fellow students. There will also be statistics help available at times as shown on the door.
References
There is no set text for the course, as this course booklet contains all necessary material. The book Harraway, J., Introductory Statistical Methods for Biological, Health and Social Sciences (University of Otago Press) has multiple copies on reserve in the Science Library at the Loans Desk. The first 17 chapters are relevant for this course. A second book, Clark, M.J. and Randal, J.A., A First Course in Applied Statistics (Pearson), is on close reserve.
Computing
The R-commander (R-cmdr) package will be used in tutorials. No prior knowledge of the package is needed, as a handout and full instructions will be available in the tutorials. All students will have their own User Name and Password. The User Name is the name on your student ID card and the Password is your student ID number.
Time Commitment
STAT 115 is a one-semester course worth 18 points. It is expected that students should spend an average of 12 hours per week on this course. After allowing four hours per week for attending lectures, this leaves eight hours for other course-related activities such as assignments, reading notes and revising.
Calculators
There is no restriction on the type of calculator that can be used, except that no device with communication capability shall be accepted as a calculator.
Course content (in approximate lecture order)
Introduction: research methods and study design; designed experiments versus observational studies; case control, cohort and intervention studies. (2 lectures)
Data description and presentation: the use of R-commander; histograms, box-and-whisker plots, measures of centre and spread of data, measures of disease frequency and association. (6 lectures)
Probability: the nature of random variation; diagnostic tests; probability distributions including the binomial and normal distributions. (8 lectures)
Estimation: sampling distributions; confidence intervals for means, differences and proportions. (5 lectures)
Hypothesis testing: classical procedures for means, proportions, and differences; the p-value; statistical vs clinical significance; power and sample size. (3 lectures)
Analysis of variance: completely randomised design; Bonferroni procedure for multiple comparisons. (3 lectures)
Categorical data: tests for association; rates, relative risk and risk differences, odds ratios; confidence intervals for relative risk and odds ratio. (4 lectures)
Regression and correlation: the simple linear regression model; tests on the slope; predictions; confidence intervals for predictions; correlation. (5 lectures)
Multiple regression: tests on the estimated parameters; dummy variables for qualitative predictors; parallel regressions and control of confounding. (4 lectures)
Ethics and study design: ethical issues, bias and confounding. (7 lectures)
Internal Assessment
There will be eight assignments and three mastery tests. Each assessment will have a mark recorded out of 20. These assessments will be administered on-line. The assignments can be completed anywhere you have an internet connection. The mastery tests will be conducted in the Castle St Computer Laboratory. A booking system for half-hour slots in which to attempt the tests will operate. Cutoff times for each assignment will be announced in lectures.
Exam format
A three-hour exam will produce a mark out of 100.
Final mark
In your overall mark we will count your exam mark for 2/3 of the total and the internal assessment for 1/3. However, if your final exam mark out of 100 is greater than this, we will use just the final exam mark. That is, the final mark F will be calculated as:
F = max{E, (2E + A)/3}
where E (exam mark) is out of 100 and A (internal assessment) is out of 100. The internal assessment mark will be made up 1/3 from the eight assignments and 2/3 from the three mastery tests.
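The final-mark rule above can be sketched as a short calculation. This is an illustration only; the function and variable names are our own, not part of the course software:

```python
def final_mark(exam, assignments, mastery):
    """Illustrative final-mark calculation (names are hypothetical).

    exam        -- exam mark E, out of 100
    assignments -- average assignment mark, scaled out of 100
    mastery     -- average mastery-test mark, scaled out of 100
    """
    # Internal assessment A: 1/3 from assignments, 2/3 from mastery tests.
    internal = assignments / 3 + 2 * mastery / 3
    # F = max{E, (2E + A)/3}: the weighted mix only counts if it helps.
    return max(exam, (2 * exam + internal) / 3)

# A student with exam 60, assignments averaging 90 and mastery tests
# averaging 75 has A = 80, so F = max(60, (120 + 80)/3) = 66.7 (1 d.p.).
print(round(final_mark(60, 90, 75), 1))
```

Note that the internal assessment can only raise the final mark, never lower it below the exam mark itself.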
Email Contact with Students
From time to time lecturers may wish to email students taking STAT 115. This will be done by contacting you at your student email address. You should check your student address regularly. If you have another address, then you might like to arrange for emails sent to your student address to be forwarded automatically.
Disability and Impairment Support
The Department of Mathematics and Statistics encourages students to seek support if they find they are having difficulty with their studies due to a disability, temporary or permanent impairment, injury, chronic illness or deafness.
Contact either the Course Convenor, or Disability Information and Support:
Telephone: 479 8235
Email: disabilities@otago.ac.nz
Website: http://www.otago.ac.nz/disabilities
Plagiarism
Students should make sure that all submitted work is their own. “Plagiarism is a form of dishonest practice. Plagiarism is defined as copying or paraphrasing another’s work and presenting it as one’s own” (University Council, December 2004). In practice this means that plagiarism includes any attempt in any piece of submitted work (e.g. an assignment or test) to present as one’s own work the work of another (whether of another student or a published authority). Any student found to be responsible for plagiarism in any piece of work submitted for assessment shall be subject to the University’s dishonest practice regulations, which may result in various penalties, including forfeiture of marks for the piece of work submitted, a zero grade for the paper or, in extreme cases, exclusion from the University.
SURV 102 Computational Methods for Surveyors
Students enrolled for SURV102 will attend lectures in STAT115 for four weeks beginning on Monday 23 July.
A separate notice about assessment in SURV102 will be made in the Surveying Department.
Biostatistics and Health Research - An Overview
1 Health Research
Billions of dollars are spent every year in a quest to improve human health and well-being. The broad goal of this quest is to acquire new knowledge to help prevent, detect, diagnose and treat disease.
What sort of knowledge do we look for?
What causes a disease?
Once you have a disease, what happens?
Who has the disease?
What is the best strategy for treatment or prevention?
How do societal factors affect health?
What causes a disease?
Understanding the factors which lead to the development of disease gives ideas about how to prevent disease. For example:
• Drinking water is treated to kill bacteria, viruses and other contaminants such as Giardia.
• Our ability to prevent heart disease has improved with our understanding of specific dietary components which increase risk, and with our understanding of how exercise works to reduce risk.
• The realization that the cause of AIDS was a virus (HIV) which could be transmitted through sexual intercourse and blood transfusions led to prevention strategies to reduce transmission. These included use of condoms, screening of blood products and drugs to reduce transmission from mother to baby.
• Understanding how and when sports injuries occur helps to develop rules of play and training schedules which reduce injury burden.
Once you have a disease, what happens?
Understanding how a disease progresses gives ideas about how to cure disease, or to prolong survival, or to improve quality of life. For example:
• Understanding how HIV affects the immune system has led to the development of drugs such as zidovudine which prevent the virus from reproducing and seem to slow the destruction of the immune system.
• Understanding how bacteria work allowed the development of different types of antibiotics with different actions.
• Cancer develops when cells in a part of the body begin to grow out of control. Knowledge of the cell cycle was important in developing cancer drugs (chemotherapy) which work only on actively reproducing cells.
Who has the disease?
Detecting who has a disease and diagnosing disease are the first steps in delivering effective treatments. For example:
• Development of non-invasive technologies for looking inside the body (such as ultrasound, CT scans and MRI) provided techniques for making the initial diagnosis of cancer, or for identifying the form of damage to a knee following injury.
• Tests which look at cells from biopsies or blood can give a more accurate diagnosis of cancer than the non-invasive technologies.
• We identify people with HIV infection through a blood test which detects antibodies to the virus.
What is the best strategy for treatment or prevention?
Once we have developed a new treatment or approach to prevention, we need to evaluate the risks and benefits of that treatment before it is made available for use. For example:
• Exercise and balance programmes have been demonstrated to reduce the risk of falling in the elderly.
• The statin family of drugs has been demonstrated to reduce the risk of death from cardiovascular disease.
• Evaluations of the use of beta-carotene (which the body converts to vitamin A) found that, contrary to expectations, it did not prevent lung cancer; in fact it increased the risk of lung cancer.
How do societal factors affect health?
Working with individuals can lead to significant improvements in health, but societal factors can also have an impact.
• Societal attitudes to alcohol and smoking can make it difficult for individuals to change behaviour.
• Understanding how societal factors operate is important for developing systems of health care.
Where does knowledge come from?
During the last century we have gained an enormous amount of knowledge, but there are still many gaps.
• Cancer and cardiovascular disease still end many people’s lives prematurely.
• Back pain is very common. We are still not very good at treating it or preventing it.
• Diabetes is becoming increasingly common, particularly among Maori and Pacific Island populations. It has many serious health consequences.
• New diseases provide additional challenges. HIV/AIDS, a disease thought to have jumped the species barrier into humans, has had an enormous impact. Avian influenza is common in birds in Asia, and can cause severe disease in humans, but does not currently spread directly from human to human. However, it would take only a small change in the genome of the virus to make it highly infectious amongst humans.
Knowledge can come from ‘experience’ or ‘research’
Experience is a very unreliable way of obtaining knowledge. Humans are not objective; our recall is very selective. The history of medicine is littered with treatments which doctors were convinced, through their own experience, worked, but which time has shown to be ineffective or harmful in many of the settings where they were used: bloodletting, ground woodlice, mercury, arsenic, and so on. These treatments were widely used centuries ago, but there are more modern examples.
• An early treatment for heart attack, where blood flow to part of the heart muscle is blocked, involved sprinkling powdered asbestos on to the heart to increase blood flow to the affected areas. It was never truly shown to work, but thousands of these operations were done.
• Hormone replacement therapy was widely used initially for treatment of the symptoms of menopause, but was also believed to reduce the risk of heart disease in post-menopausal women. The results of a recently published study found that it in fact increased the risk of heart disease.
That leaves research.
2 The Research Process and Biostatistics
What is research?
Research is a systematic process for providing answers to questions.
Examples of research questions:
• What are the causes of meningococcal meningitis?
• What is the best treatment strategy for chronic back pain?
• What are the genetic events that lead to childhood cancer?
• Can this new drug improve survival in people with colon cancer?
• What is the role of selenium as an antioxidant in the protection against risk factors for cardiovascular disease?
• To what extent do western diet and exercise habits need to change in order to reduce insulin resistance?
• Does this conditioning programme reduce serious knee injury in team sports?
Biostatistics is the field of development and application of statistical methods to research in health-related fields, including medicine, public health, and biology. Since early in the twentieth century, biostatistics has become an indispensable tool for health research.
Statistics is often defined as the art and science of collecting, summarising, presenting and interpreting data. Statistics is a set of techniques which formally implement the fundamental principles of the scientific method. The scientific method underlies the research process: observation and theories lead to the development of hypotheses. We work out the best test of the hypothesis, then collect data and determine to what extent the data are consistent with the hypothesis.
The research process
When we carry out research we often collect data on a sample or subgroup from a population. Our goal is to use the information collected on that sample to draw inferences about a larger population.
[Diagram: statistics computed on a sample are used to make inferences about the underlying populations.]
Examples
• We use the frequency with which diabetes occurs in a sample to estimate the frequency with which diabetes occurs in the population the sample came from.
• We study a new treatment in a subgroup of patients in order to be able to make claims about the effects of the treatment in all such patients.
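The sample-to-population step in the first example above can be sketched in a few lines. The following estimates a population prevalence from a sample proportion and attaches a normal-approximation 95% confidence interval; the survey numbers are invented for illustration, not course data:

```python
import math

def prevalence_with_ci(cases, n, z=1.96):
    """Sample proportion as an estimate of population prevalence,
    with a normal-approximation confidence interval.

    cases -- number in the sample with the condition
    n     -- sample size
    z     -- normal quantile (1.96 gives a 95% interval)
    """
    p = cases / n                    # point estimate of prevalence
    se = math.sqrt(p * (1 - p) / n)  # standard error of a proportion
    return p, (p - z * se, p + z * se)

# Hypothetical survey: 40 of 500 people sampled have diabetes.
p, (lo, hi) = prevalence_with_ci(40, 500)
print(f"prevalence {p:.3f}, 95% CI ({lo:.3f}, {hi:.3f})")
```

With these made-up figures the sample prevalence is 0.08, and the interval quantifies how far the population value could plausibly lie from that estimate; confidence intervals of this kind are developed properly in Section 5.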
Steps in the research process
Development of the research questions
Design of the study
Collection of information
Data description and analysis
Interpretation of results
Ideas for research come from many places: from reading the literature, from observation and clinical experience, from talking to colleagues and from just sitting and thinking.
The first step is to refine the idea into a question, or series of questions, which can be answered in a single study; that is, we need to be able to design a study to answer the question. The question may be framed as a hypothesis. For example, we might wish to answer the question “Does a low fat diet reduce the risk of diabetes?” The hypothesis would be “A low fat diet reduces the risk of diabetes”. We then need to work out how best to test the hypothesis.
The study design specifies the methods for selecting people (or other units) for the study and for collecting the information that will be used to answer the questions. It needs to be feasible and ethical. We need to identify which study designs can give us appropriate data, and how to maximize our chance of being able to distinguish a true relationship from random noise.
Once we have collected the data we use statistical methods to describe and analyse the data and interpret the results. The analysis and the interpretation of the results will depend on the study design.
Biostatisticians work with scientists to identify <strong>and</strong> implement the correct statistical methods<br />
for designing studies <strong>and</strong> analyzing <strong>and</strong> interpreting the results.<br />
3. Introduction to study design
Understanding where data come from is vital for making sensible choices about statistical
analysis. At this stage in the course we give an overview of some of the study designs
that are commonly used in epidemiology and clinical research. We will return to this material
in the second half of the course.
There are several different ways to classify study designs, and several specific 'named' study
designs. This can be confusing, since different epidemiology books use the terms differently. The
classifications and definitions exist to help us think about the strengths and weaknesses of a
particular study for addressing the research questions. The differences in usage arise because
textbooks emphasise the relative strengths and weaknesses a little differently.
3.1 Classifications of Study Designs
1. Descriptive versus analytic
This classification relates to the primary aims or objectives of the study. Where the study aims
to test a hypothesis we say the study is analytic. For example, does this vaccine reduce the
risk of meningococcal disease? Here we hypothesise a relationship between vaccine and risk
of meningococcal disease (we hypothesise that the vaccine reduces risk) and aim to test that
hypothesis. Analytic studies are studies which test hypotheses.
Descriptive studies are used where the aim is simply to describe something, with no pre-specified
hypothesis. For example, if we wish to describe trends in incidence of
meningococcal disease over time we carry out a descriptive study. Here there are no pre-specified
hypotheses about the reasons for a change over time.
Many descriptive studies in epidemiology describe patterns of disease in populations. This can
provide clues about causes of disease and lead on to further studies. The standard approach is to
examine the characteristics of disease according to time, place, and person:
TIME: A descriptive study can be repeated in order to examine trends over time.
Examples: epidemics, seasonality (e.g. influenza)
PLACE: Many diseases vary between countries, or even within countries.
Examples: breast cancer incidence by country, multiple sclerosis and latitude
PERSON: Characteristics of people with the disease can be studied, for instance age, sex,
ethnic group, socioeconomic group, occupation.
Example: heart disease in New Zealand according to age, sex and ethnic group
2. Experimental versus observational
In experimental studies the investigators intervene in the natural order (hence the alternative
name intervention study). The investigator decides the exact nature of the intervention,
chooses a control strategy, and decides who will receive the intervention under study and who
will be part of the control group. The goal is to control the conditions so that the effect of
interest can be isolated and studied. For example, if investigators want to know whether a
drug (nevirapine) reduces maternal-infant transmission of HIV they can construct an
experiment which isolates the effect of the drug from any other factors which might affect risk
of transmission. The extent to which we can isolate the effect of the intervention (e.g. the drug)
determines how good the experiment is. Of course, ethics are a fundamental consideration.
In observational studies we simply observe a naturally occurring process without intervening.
It is much harder to test a hypothesis in an observational study, but for many research
questions in the health sciences it is not ethical or feasible to conduct an experiment. We aim
to design our observational studies to get as close as possible to the information we would
have obtained if the experiment could have been done.
3. Randomised versus non-randomised (applies to experiments only)
Experiments should always have a control group as well as a group (or groups) which gets
the intervention(s) under study. Randomisation is a process we can use to allocate people to
either the intervention group or the control group – the simplest version of randomisation is
like flipping a coin: each person has a 50% chance of being in the intervention group. Careful
use of randomisation gives the best test of a hypothesis.
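The coin-flip allocation just described can be sketched in a few lines of code. (This is an illustration only, in Python; the participant names and the seed are invented, and the course software is R-cmdr, not Python.)

```python
import random

def randomise(participants, seed=None):
    """Allocate each participant to the intervention or control group
    with probability 1/2 each, like flipping a fair coin."""
    rng = random.Random(seed)
    return {p: rng.choice(["intervention", "control"]) for p in participants}

# a hypothetical list of ten study participants
groups = randomise(["person_%d" % i for i in range(10)], seed=1)
```

Fixing the seed makes the allocation reproducible, which is useful when an allocation list must be documented in advance.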
In some experiments the investigators use a method other than randomisation to decide who
will be in the intervention group and who will be in the control group. For example, in a
community intervention study the investigators might choose a set of communities to get the
intervention (often those interested or with structures in place to take part), and then choose a
matched set of control communities. Non-randomised experiments like this are
sometimes referred to as quasi-experiments. Sometimes they are the only practical alternative,
but they never provide the same strength of evidence as a randomised trial.
Note that the process of randomisation is not the same as random sampling. The purpose of
random sampling is to select a single group which is representative of a population (see
below).
4. Cross-sectional versus longitudinal
This classification refers to the data themselves and the (calendar) time points or periods
about which the information is collected. For example, we might do a study looking at the
relationship between oral contraceptive use and coronary heart disease. Fully cross-sectional
data would refer to one point in (calendar) time. For example, in a survey we might ask: do
you have coronary heart disease today? Are you taking oral contraceptives today? Note that if
we are collecting data on existing disease we are working with prevalence of coronary heart
disease rather than incidence of coronary heart disease, and so cross-sectional data are not
very good for testing hypotheses about the causes of disease. (The exposures may have changed after
disease was diagnosed.)
Longitudinal data have some time course present. The ideal for testing hypotheses about
disease causation is to get information about things that occurred before the disease
developed. Since the time between developing disease and diagnosis is often unclear, often the
best we can do is collect information about exposures that occurred before diagnosis.
Longitudinal studies collect information over a period of time, e.g. exposures which
occur before disease is diagnosed.
5. Study unit
The majority of studies in epidemiology collect data on individuals. However, there are some
where the 'unit' under study is something bigger – such as a family, a community or a
country. In some studies it is the group that is of interest, not the individual, and we might
want to test a hypothesis relating to the group (an analytic study). For example, the COMMIT
study asked: does a community prevention programme reduce the prevalence of smoking in
the community? The intervention is carried out at the community level, and we can evaluate it
by examining whether the prevalence of smoking in the community changes. Note that the
outcome data are collected on the individual (whether someone smokes or not), to measure the
effect of the intervention in a community.
3.2 Common study designs in epidemiology and clinical research
1. Case report
A case report usually describes the occurrence of disease in one person. The purpose is to alert others to the
fact that this combination of factors can occur, and to encourage people to keep a look out for
other similar cases. Such case reports (to a central registry) led to the initial recognition of
AIDS. Case reports are always descriptive and observational. The cross-sectional/longitudinal
classification doesn't really apply, but they could be considered 'longitudinal' in the sense
that they may collect data on the person's experience over time.
2. Case series
A case series takes a group of people with a recognised disease and describes patterns among
them. A study of the initial case series of men diagnosed with AIDS recognised a common
dysfunction of the immune system and that the disease occurred in gay men, injecting drug
users and blood product recipients. This led to the hypothesis that it was caused by a
transmissible agent, and gave clues as to the modes of transmission. Case series are always
descriptive and observational, and are generally cross-sectional, but could be longitudinal if they
describe changes in individuals over time.
3. Descriptive study using population data
Many descriptive epidemiological studies make use of data that are collected routinely on a
population. This includes census data, death certificates, data reported to cancer registries,
hospital morbidity and mortality data, and data on infectious diseases reported as 'notifiable'
diseases. Provided the data sources are reliable, this can provide valuable descriptions of the
disease (or risk factor) experience in a population. These studies are descriptive and
observational.
4. Sample survey
Where data are collected specifically for a research study, they generally involve collecting
data for only a sample (subset) of the population of interest. This gives the opportunity to
collect more information about each person, at the cost of the random variation that comes
with sampling from a population. There are many ways to go about selecting a sample. In
quantitative research we generally choose random samples. In a random sample everyone has
a known chance of being selected for the study; this allows us to use statistical methods to
accurately determine the influence of random error (through the use of confidence intervals),
and hence to make valid inferences regarding the population the sample came from. Random
sampling gives us the best chance of getting a sample which is representative of the
population.
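As a sketch of how such an inference works in practice, the usual estimate of a population proportion from a random sample, together with its approximate 95% confidence interval (normal approximation to the binomial), can be computed as follows. Python and the survey numbers here are illustrative only; they are not from the course.

```python
import math

def proportion_ci(successes, n, z=1.96):
    """Sample proportion with an approximate 95% confidence interval
    (normal approximation to the binomial)."""
    p = successes / n
    se = math.sqrt(p * (1 - p) / n)   # standard error of the proportion
    return p, (p - z * se, p + z * se)

# hypothetical survey: 40 of 200 randomly sampled people have the condition
p_hat, (lower, upper) = proportion_ci(40, 200)
```

The interval quantifies the random error introduced by sampling: a larger sample gives a smaller standard error and so a narrower interval.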
The simplest type of random sample is a simple random sample, where everyone has the same
chance of being chosen. We can also draw stratified samples or cluster samples. In stratified
sampling we divide the population into groups (or strata) – for example ethnic groups. We
then choose to sample a fixed number from each stratum to ensure all groups are adequately
represented in the study. For example, we might wish to choose the same number of people
from each ethnic group to ensure we have enough data for reliable estimates in each group.
Cluster sampling is used where we can't easily select a sample of individuals. For example, if
we wish to study children, we cannot select a simple random sample because we have no
list of children from which to select the sample. One approach commonly used is to select
schools at random, classrooms within a school at random, and children from a class at
random.
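The difference between a simple random sample and a stratified sample can be sketched as follows (illustrative Python only; the population and the two strata are invented):

```python
import random

def simple_random_sample(population, n, seed=0):
    """Every member of the population has the same chance of selection."""
    return random.Random(seed).sample(population, n)

def stratified_sample(strata, n_per_stratum, seed=0):
    """Draw a fixed number from each stratum (e.g. each ethnic group)
    so that every group is adequately represented."""
    rng = random.Random(seed)
    return {name: rng.sample(members, n_per_stratum)
            for name, members in strata.items()}

population = list(range(100))            # 100 hypothetical people
srs = simple_random_sample(population, 10)
strat = stratified_sample({"group A": population[:50],
                           "group B": population[50:]}, 5)
```

Note how the stratified design guarantees five people from each group, whereas the simple random sample leaves the group balance to chance.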
A true survey generally means getting people to fill in a questionnaire. However, people have
extended the idea to include other forms of data collection: we may take measurements of
height and weight, fitness tests, blood tests and so on.
These studies are most often descriptive, but can be analytic; they are observational, and can be
cross-sectional or longitudinal.
5. Cross-sectional study
In epidemiology the term cross-sectional study often refers to a survey. The data are often not
fully cross-sectional according to the definition above. For example, we might carry out a
survey of use of hormone replacement therapy (HRT) among New Zealand women.
Such a survey would generally ask about past life experiences and past use of HRT, rather
than just current use, which gives a longitudinal element to the data. When the study collects
information about disease status, it is generally prevalent disease. So while cross-sectional
studies can be used to test hypotheses, they are not very good for testing hypotheses about
disease causation.
6. Case-control study
Two groups:
Group with disease (cases)
Group free from disease (controls)
In a case-control study, people are selected for the study according to whether they have the
disease of interest (cases) or not (controls). Generally case-control studies identify incident cases
and collect information about the cases' experiences before diagnosis of disease, and for an
equivalent time period for the controls. Case-control studies are sometimes called retrospective
studies because information is collected about exposures that occurred in the past. For example, a
case-control study of cervical cancer selected a group of women with cervical cancer and a
control group of women who did not have cervical cancer. Information was collected about past
experiences which were hypothesised to be related to risk of cervical cancer, including number of
sexual partners. Case-control studies are analytic, observational and longitudinal.
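The measure of association usually reported from a case-control study is the odds ratio, calculated from a 2x2 table of exposure by case/control status (the odds ratio is covered in Section 2). A minimal sketch with invented numbers:

```python
def odds_ratio(a, b, c, d):
    """Odds ratio from a 2x2 table:
    a = exposed cases,   b = exposed controls,
    c = unexposed cases, d = unexposed controls."""
    return (a * d) / (b * c)

# hypothetical data: 30 of 100 cases exposed, 15 of 100 controls exposed
or_estimate = odds_ratio(30, 15, 70, 85)   # (30*85)/(15*70), about 2.4
```

An odds ratio above 1 suggests the exposure is more common among cases than controls.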
7. Cohort study
A group of people is observed over a period of time in order to measure the frequency of the
disease being investigated. A cohort study starts by documenting exposures and then measures
the subsequent risk of developing disease, according to exposure. Cohort studies aim to identify
associations between exposure to suspected causal agents and the development of disease. The
cohort may be selected by taking a random sample from a population (e.g. the Scottish Heart
Study), by selecting some geographical areas (e.g. the Framingham study) or by taking a particular
group (e.g. the British Doctors study, the Nurses' Health Study). Researchers may also identify an
exposed group of interest (e.g. people working in a particular industry) and find an appropriate
control group who are not exposed to the substance under study. Exposure can be measured at the
beginning of the study (baseline) and also periodically during the follow-up period. The entire
cohort of people is followed up to determine if and when disease develops.
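Because a cohort study follows whole exposed and unexposed groups forward in time, it can estimate the risk of disease directly in each group, and hence the relative risk (covered in Section 2). A sketch with invented numbers:

```python
def relative_risk(exposed_cases, exposed_total,
                  unexposed_cases, unexposed_total):
    """Risk ratio: the risk of disease in the exposed group
    divided by the risk in the unexposed group."""
    risk_exposed = exposed_cases / exposed_total
    risk_unexposed = unexposed_cases / unexposed_total
    return risk_exposed / risk_unexposed

# hypothetical cohort: 24 of 300 exposed and 10 of 250 unexposed develop disease
rr = relative_risk(24, 300, 10, 250)   # risks 0.08 and 0.04, ratio 2.0
```

A relative risk of 2 would mean the exposed group has twice the risk of the unexposed group; case-control studies cannot compute this directly because they fix the numbers of cases and controls by design.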
8. Randomised controlled trial (RCT)
In a randomised controlled trial a group of study participants is selected and then randomly
allocated to an intervention group (or groups), who get the intervention under study, and a control
group. Since group allocation is entirely by chance, this is the best approach for getting two
groups who are comparable in all respects. This means that if there is a difference in outcome
between the two groups it can be attributed to the intervention (provided other aspects of the
study are well carried out).
9. Clinical trial
This is the term used for an experiment which evaluates a treatment. Clinical trials are often, but not
always, randomised controlled trials.
10. Prevention trial
This is the term used for an experiment which evaluates a prevention strategy. Prevention trials can be
randomised controlled trials.
11. Community intervention study
This is the term used for a study which evaluates a community intervention. These studies are usually
experiments, but often not randomised, and may not involve a control group.
4. Content of STAT 115
Learning aims and objectives
By the end of the course students should
• be aware of the appropriate use of common study designs and their strengths and
weaknesses
• be able to describe the information contained in a data set
• be able to carry out common statistical data analyses
• be able to interpret the results of common statistical analyses in the context of the
particular study design used
• be aware of ethical issues relating to research involving humans
• be able to critically evaluate selected research articles published in health sciences
journals.
The material in this course will provide skills for interpreting research in your chosen field of
study, as well as some basic skills for analysing data that you collect through course projects
or labs, using a computer and a statistical software package. If you have mathematical skills,
and are stimulated by the idea of being involved in health research, you may wish to pursue a
career in biostatistics. There are many jobs available for biostatisticians, in New Zealand and
overseas. Most are employed in research groups at universities or in government, or in
pharmaceutical or biotech companies.
Types of research questions covered in STAT 115
There are many types of research question in the health sciences:
• Laboratory studies: research involving understanding how cells and cell components
work, identifying compounds which can be used to treat disease, and how those
compounds affect cells.
• Animal studies: used as models for humans
• Human studies:
– anatomy and physiology consider the structure and function of the human body
– clinical research asks questions relating to patient care, including evaluation of
new treatments
– epidemiology is the study of the distribution and causes of disease
• Studies of public health: the science and art of promoting health, preventing disease
and prolonging life through organised efforts of society
• Studies of society:
– medical sociology examines topics such as the social aspects of physical and
mental illness, physician-patient relationships, the organisation and structure of
health organisations, and the socio-economic basis of the health care system.
In STAT 115 we will focus on research questions involving humans, mainly clinical research
and epidemiology. There are many research questions in these areas which can be understood
without specialised knowledge. In the other areas, particularly laboratory studies, an in-depth
understanding of the field (e.g. biochemistry, molecular biology, anatomy or physiology) is
needed to understand the research questions.
Studying humans brings particular challenges, and it is these challenges which have driven the
specialised development of biostatistics from its statistical basis. The challenges arise from the
more complex ethical issues in research involving humans, as well as the complexities of the
biological system and the consequent research questions we wish to answer.
SECTION 1
This section covers an introduction to the package R-cmdr and presents an overview of
biostatistics and research methodology.
Biostatistics and Research Methodology; R-cmdr
Types of Data
Numerical Data and Histograms
Measures of Centre: Mean and Median
Measures of Variability: Standard Deviation, Variance and Interquartile Range
Box-and-Whisker Plots
1
Section 1
Biostatistics and research: an overview
Course aim:
An introduction to the core biostatistical methods
essential to the health sciences
• scientific method
• design of research studies
• description and analysis of data
The scientific method underpins the design of
research studies. Sound research design is vital
for obtaining reliable information. A major part
of this course is about techniques for describing
data and understanding the principles of analysis.
This enables us to make sense of the mass of
information collected in a research study.
Learning aims and objectives
By the end of the course students should
• be aware of the appropriate use of
common study designs and their strengths
and weaknesses
• be able to describe the information
contained in a data set
• be able to carry out common statistical
data analyses
• be able to interpret the results of common
statistical analyses in the context of the
particular study design used
• be aware of ethical issues relating to
research involving humans
• be able to critically evaluate selected
research articles published in health
sciences journals
Goal of health sciences professions
To improve the health and well-being of
individuals and communities
This involves
• treatment of disease
• prevention of disease
• promotion of health
In order to do this we need knowledge about
• causes of disease
• diagnosis
• disease processes
• effectiveness of treatments
• societal factors which affect health
Examples of current gaps in knowledge
• causes of meningococcal meningitis
How to prevent it? A vaccine?
• SARS, avian influenza
New diseases
• back pain
Not good at treating
• cancer
Nasty treatments for child cancer
• diabetes
Common in Pacific communities
• cardiovascular disease
Common cause of death
• prevention of overweight and obesity
• effective promotion of behaviour change
Prevention of smoking
Knowledge may come from
• teaching
• experience
• research
Research
A process for providing answers to questions for
which the answer is not immediately available
General research areas
What are the causes of meningococcal
meningitis?
Can we develop a vaccine to prevent SARS?
What are the genetic events which lead to
childhood cancer?
Can a new drug improve survival in people with
colorectal cancer?
How can we prevent childhood overweight and
obesity?
What are the main factors affecting quality of life
of people with a chronic illness?
Research provides a systematic process for
answering these questions.
Iron Deficiency – Should NZ Parents Be
Concerned?
[Dr Elaine Ferguson, Dept of Human
Nutrition]
A survey randomly selecting 323 children
aged 6-24 months in Dunedin, Christchurch
and Invercargill.
To assess prevalence of iron deficiency.
To explore factors associated with low body
iron stores. Possible factors are:
Categorical: Sex, Ethnicity, Maternal
Education, Household Income, Breast feeding
Continuous: Age, Meat intake
Regression methods are used as well as
procedures for summarising data.
Does early childhood circumcision reduce the
risk of acquiring genital herpes?
[Dr Nigel Dickson, Dept of Preventive and
Social Medicine]
• Cohort of over 1000 births in 1972 in
Dunedin.
• Called the Dunedin Multidisciplinary
Health and Development study.
• Does early circumcision reduce the risk
of genital herpes?
• Initially this appears to be the case, but it
is an observational study.
• Number of sexual partners is a
confounder.
• When the confounder is allowed for, early
circumcision appears not to be
protective.
• Designed experiments (or clinical trials)
set up in Africa to investigate the effect of
circumcision on HIV.
The research process
The objective for most studies is to use data from
a sample to draw inference about a larger
population:

[Diagram: statistics from a Sample are used to draw Inference about the Underlying population]

Examples:
• we use the frequency with which a disease
occurs in a sample to estimate the
frequency with which disease occurs in the
population
• we study a new treatment in a group of
patients in order to be able to make claims
about the effects of the treatment in all such
patients
Steps in the research process:
Development of the research question
Design of the study
Collection of information
Data description and analysis
Interpretation of results
• the research question
- needs to be framed very carefully
- must be specific enough to be
answerable by a research study
• the study design
- is determined by the research
question
- describes the methods used to collect
the information
• analysis and interpretation
- depend on the study design
Research questions relevant to this course:
Epidemiology: the study of the distribution
and determinants of disease frequency
Clinical research: the study of questions
relating to care of patients
Descriptive questions:
What is the distribution of a disease?
What is the natural history of a disease?
Analytic questions:
What are the causes of a disease?
Will this approach prevent disease?
Does this treatment improve outcome?
Data Analysis and Computer Software
Easy-to-use software is essential for data
management and data analysis. In this course R-cmdr
(R Commander, a menu-driven interface to the R
statistical package) will be used. This package is
widely available on campus, used in most Departments
which specify first year statistics as a prerequisite,
and widely available internationally.
At school, or possibly at University, you may
have used EXCEL. EXCEL
is excellent for data management and reporting
but is poor for statistical analyses and clumsy for
graphical procedures.
R-cmdr is easy to use with good pull-down menu
options. There are three windows in R-cmdr:
• Data Editor (where the data being analysed are
located)
• Output Window (where results appear)
• Syntax Window (not used in this course)
12<br />
Section 1
Introduction to study design
1. Descriptive studies
2. Analytic studies
Experimental studies
Observational studies
Examples of analytic study types
3. Summary
Classification of research designs
Classification of common study types

There are two types of research questions:
Descriptive – describing things
Analytic – testing hypotheses
Strengths and weaknesses of the different designs will be discussed.
1. Descriptive studies
Aim: to describe, for example:
• the characteristics of people with a disease (person, place, time)
• lifestyle patterns of a population
• attitudes to health care
• etc.
Descriptive studies are often called surveys or cross-sectional studies.
Descriptive studies generally use a sample from a population.
Example: What are the serum cholesterol levels of New Zealanders?
Method: Select a subgroup (sample) of people and measure their serum cholesterol levels.
Random sampling
• choose the sample in such a way that every individual in the population has a known chance of being selected
• in a simple random sample, everyone has an equal chance of being chosen
• this method is the best way of obtaining a sample which is representative of the population
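The mechanics of simple random sampling can be sketched in a few lines of Python (the course itself uses R-cmdr; the population values below are invented purely for illustration):

```python
import random
import statistics

random.seed(1)
# Hypothetical population of 10,000 serum cholesterol values (mmol/L).
population = [random.gauss(5.2, 1.0) for _ in range(10_000)]

# Simple random sample: every individual has an equal chance of being
# chosen, and selection is without replacement.
sample = random.sample(population, k=100)

print(round(statistics.mean(population), 2))  # population mean
print(round(statistics.mean(sample), 2))      # sample estimate, close to it
```

Because every individual has the same chance of selection, the sample mean should land close to the population mean, which is the point of the next page.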
Suppose we want to estimate the mean cholesterol level in the population:
Sample average = true mean + error
(the true mean is unknown; the error is made up of systematic error and random error)

random error
• due to natural biological variability
• increasing the sample size will reduce the random fluctuations in the sample mean
systematic error (= bias)
• due to aspects of the design or conduct of the study which systematically distort the results
• occurs if a sample is not representative of the population
• cannot be reduced by increasing the sample size
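The contrast between random and systematic error can be illustrated with a small simulation (a Python sketch with an invented true mean of 5.2):

```python
import random
import statistics

random.seed(42)
true_mean = 5.2  # hypothetical true mean cholesterol (mmol/L)

def sample_mean(n, bias=0.0):
    """Mean of n measurements; `bias` mimics a systematic error."""
    return statistics.mean(random.gauss(true_mean + bias, 1.0) for _ in range(n))

# Random error shrinks as the sample size grows...
for n in (10, 100, 10_000):
    print(n, round(sample_mean(n), 3))

# ...but systematic error (bias) does not: even a huge sample stays off-target.
print(round(sample_mean(100_000, bias=0.5), 3))  # near 5.7, not 5.2
```

Increasing n pulls the biased estimate ever more tightly around the wrong value, which is exactly why bias cannot be fixed by a larger sample.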
2. Analytic studies
Purpose: to test hypotheses about, for example:
• causes of disease
• methods for prevention of disease
• the effects of treatments
Experimental studies
• the researcher intervenes and records the result of the intervention
• the aim is to control all other factors to isolate the effects of the intervention
• the best way to study causation
Observational studies
• the investigator does not intervene, but simply observes a naturally occurring process and collects information
• the ideal is to get as close as possible to the information that would have been obtained if the experimental study could have been done
Example: Options for studying the relationship between smoking and lung cancer
Experimental study: randomly assign people to smoke or not smoke at the start, follow them for 20 years, and compare lung cancer rates in the two groups. Clearly unethical.
Observational study
Cohort: identify known smokers and known non-smokers, follow both groups for 20 years, and compare the percentage developing lung cancer in each.
Problem: the groups may differ in other ways that are related to cancer risk – confounding.
Case-control: take people with lung cancer now (cases) and people without lung cancer now (controls), and compare the percentage of smokers in each group over the past 20 years.
No long-term follow-up is needed and smaller samples suffice, but there could be recall bias about events from 20 years ago. Confounding is also a problem.
Examples of analytic study types
Randomised controlled trial (RCT)
• the “gold standard” analytic study (best)
• experimental
Characteristics of an RCT:
• select a group of people
• randomly allocate them to either an intervention or a control group
• follow participants up over time, and measure the outcome
A control group is used to isolate the effects of the intervention.
Random allocation, or randomisation, means every person has the same chance of being in each group. This gives the best chance of getting two groups which are comparable in all respects.
Used to evaluate new treatments.
Often not ethical in studies of disease causation.
Example RCT: LIPID study (NEJM, 1998)
Does treatment with pravastatin reduce the risk of death in patients with coronary heart disease?
Study participants:
9014 patients
age 31–75
coronary heart disease
cholesterol 155–271 mg/decilitre
The selected participants were randomly allocated to an intervention group (pravastatin, n = 4512) or a control group (n = 4502). After 6 years of follow-up, mortality was 6.4% in the pravastatin group and 8.3% in the control group.
Advantages of an RCT:
• an experiment is the best way to test a hypothesis
• differences in outcome can be attributed to the exposure
Disadvantages of an RCT:
• may not be ethical
Cohort study
An observational study, generally carried out to test hypotheses.
Characteristics:
• participants are selected before disease has developed
• they are followed over time to determine the development of disease
• information is collected about exposures at baseline and during follow-up
• longitudinal
Example of a cohort study:
Study to investigate the relationship between smoking and lung cancer (e.g. the British Doctors study).
Start with a group of people without lung cancer, made up of smokers and non-smokers. Follow both groups for 20 years, then compare the percentage who develop lung cancer in each group.
Case-control study
An observational study, generally carried out to test hypotheses.
Characteristics:
• participants are chosen on the basis of their disease status: a group with the disease (cases) and a group without (controls)
• information is collected from people with and without the disease about exposures that occurred in the past
• longitudinal (retrospective)
Example of a case-control study
Study to investigate the relationship between smoking and lung cancer.
Known at the start: a group of people with lung cancer (cases) and a group of people without lung cancer (controls). Document the smoking history of each group, then compare the percentage of past smokers among the cases with the percentage among the controls.
Cohort vs case-control studies
Cohort study
Advantages:
• the closest observational study to a randomised controlled trial
• good for examining common outcomes
• can evaluate the effect of an exposure on multiple outcomes
Disadvantages:
• a long duration is needed if the disease takes a long time to develop after exposure
• if the disease is rare, the number of participants needs to be very large
Case-control study
Advantages:
• relatively quick
• smaller than cohort studies, particularly for rare diseases
• can examine the effects of multiple exposures
Disadvantages:
• events have already occurred, so the potential for bias is higher
3. Summary
Classification of research designs
Note: these classifications provide a useful framework for thinking about the strengths and weaknesses of different study designs, but they will not always work.
i) Classification by purpose of the study
descriptive (describing things)
versus
analytic (testing hypotheses)
ii) Classification by form of the design
experimental (researcher intervenes)
versus
observational (researcher observes)
iii) Classification by time
cross-sectional (information collected about one point in time)
versus
longitudinal (information collected over a period of time)
Classification of common study types
Randomised controlled trial
• analytic
• experimental
• longitudinal (prospective)
Cohort study
• analytic
• observational
• longitudinal (usually prospective)
Case-control studies
• analytic
• observational
• longitudinal (retrospective)
Types of data and graphical summaries
[A] Data and variables
There are two types of measurement of interest in many scientific studies.
• First, the outcomes measured on each experimental unit (plant, animal, person) provide values of what is called a response variable.
• Second, the characteristics or levels of exposure that explain at least some of the differences in the observed values of the response variable are called explanatory variables.
e.g. iron levels in newborn children are the outcome or response – what are the explanatory variables?
e.g. presence of diabetes is the outcome – what are the explanatory variables?
Data forming the response and exposure variables can be either categorical or numerical (otherwise known as qualitative and quantitative).
1. Categorical data:
The simplest case involves two categories. For example, a person could be
• male/female
• smoker/non-smoker
• diabetic/non-diabetic
Such data have other names such as binary data, dichotomous data, yes/no data and 0–1 data (the last is particularly important; for example, 0 represents non-diabetic and 1 represents diabetic).
A problem could be to establish the chance (or probability) that a woman with a certain profile (defining the explanatory variables) may drink alcohol during pregnancy (the response), or equivalently to find the proportion of pregnant women who will drink alcohol. Ultimately, we are interested in who will do this.
More than two categories can occur:
• blood group: A/B/AB/O
• Maori/Pacific Island/Caucasian/Asian.
In these examples the data are said to be nominal. But this type of data is said to be ordinal if the categories are in some order. For example, “degree of pain” may be minimal/moderate/severe/unbearable.
With more than two ordinal categories it is not possible to use 0/1/2/3 to identify the classes, since “unbearable” is not three times “moderate” even though the data are ordered. Consequences of this will be important in the second half of the semester.
2. Numerical data:
(a) Discrete. Here observations take only certain numerical values. Usually they are counts of events. For example,
• number of possums caught in traps
• number of children in a family (0/1/2/3/4)
These are not like categorical data, as 3 children is three times as many as one. This type of data can be treated as though it is categorical, but this discards information about the magnitude of the relationships between successive outcomes. Ordinal categorical data are important.
(b) Continuous quantitative measures. Here recorded values or observations result from some form of measurement [e.g. height, age, blood pressure, serum cholesterol, oxygen levels in a lake].
• Often there is no restriction on values other than that caused by the accuracy of the equipment used to record them.
• Often the values show a pattern similar to what is called the bell-shaped normal curve, with many values clustered around a central point and few values in the tails.
3. Rates, Ratios and Proportions
These are constructed from categorical data and include, for example, measures of disease frequency and disease association. Examples of disease frequency are
• prevalence or proportion (concerned with existing cases)
• incidence rate (concerned with new cases)
e.g. the prevalence of obesity in the New Zealand population
(gives an indication of the burden on the country by identifying the proportion affected)
e.g. the incidence rate of HIV in New Zealand in 2008
(deals with the number of new cases and is useful when looking at causes)
Examples of disease association are
• absolute (or attributable) risk
• relative risk
• odds ratio
e.g. the relative risk of melanoma for a farmer compared with an office worker. Here, the prevalence of melanoma among farmers is divided by the prevalence among office workers. After an appropriate analysis, essentially a comparison of the two groups, this will show whether there is any association between the prevalence of melanoma and occupation.
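These association measures are covered in detail in Section 2, but the arithmetic can be previewed from a two-by-two table of counts (the counts below are invented purely for illustration):

```python
# Hypothetical 2x2 table (invented counts, for illustration only):
#                 disease   no disease
# exposed            30          70
# unexposed          10          90
a, b = 30, 70   # exposed: diseased, not diseased
c, d = 10, 90   # unexposed: diseased, not diseased

risk_exposed = a / (a + b)      # 0.30
risk_unexposed = c / (c + d)    # 0.10

relative_risk = risk_exposed / risk_unexposed       # 3.0
attributable_risk = risk_exposed - risk_unexposed   # approx. 0.20
odds_ratio = (a / b) / (c / d)                      # approx. 3.86

print(relative_risk, round(attributable_risk, 2), round(odds_ratio, 2))
```

A relative risk of 3 says the exposed group's risk is three times the unexposed group's; the attributable risk is the excess risk in absolute terms.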
4. Other types of response data
• Scores (direct measurement is not possible; instead a patient is assessed on several subjective scales and the values on each are added to give a score for the patient)
e.g. 30 questions on a health survey. A respondent gives values 0 to 3 on each question, and a score out of 90 is then given. This total has convenient properties whereas the individual values may not.
• Patients assess their degree of low back pain after treatment on a scale from 1 (no pain) to 5 (unbearable pain).
Two treatments may be assessed from the two sets of values for patients given a new treatment compared with a standard. The data may be viewed as categorical or continuous, but there are problems, as the difference between 1 and 2 is not necessarily the same as the distance between 4 and 5. The data are certainly ordinal.
• In the social sciences, data are often ordinal.
e.g. In a questionnaire, people are asked to respond by checking the category that best describes their level of agreement with a statement:
a great deal / somewhat / not much / not at all
usually coded as 4, 3, 2, 1.
Such data can be regarded as continuous or categorical (ordinal). If ordinal, then one question is how many categories should be chosen (e.g. 4, as here, or 5 or 7 or 9), and is the distance between 1 and 2 the same as that between 2 and 3, etc.?
[B] Describing Numerical Data
Graphs can be used to summarise data, but many graphs can be highly misleading, especially if too much information is presented. We shall summarise numerical data graphically using
• histograms
• box-and-whisker plots
Particular values which summarise numerical data are:
• mean; median; mode
• standard deviation; interquartile range
These describe the centre and the variability of the data collected, respectively.
Example for Continuous Data: In a hypertension study, 56 men who are heavy smokers (smoked for 25 years) have their blood pressures measured (in mm of Hg). Summarise the outcomes.
Blood pressures are classified into intervals to form a frequency table, and interval frequencies (f_j) are obtained as shown below.

Frequency Table
Pressure (mm of Hg)    Frequency (f_j)
59.5 – (69.5)             2
69.5 – (79.5)             7
79.5 – (84.5)             9
84.5 – (89.5)            10
89.5 – (94.5)            11
94.5 – (99.5)             7
99.5 – (109.5)            8
109.5 – (119.5)           2
Total                    56 (sample size)

Although the readings are likely to be recorded to the nearest mm and hence appear to be discrete, the data are actually continuous, and for this reason the intervals are recorded as 59.5 – (69.5), meaning 59.5 up to but not including 69.5.
Relative frequency: this is f_j/n in the j-th interval, where n is the sample size.

Pressure (mm of Hg)    Freq (f_j)    Relative Freq (f_j/n)
59.5 – (69.5)             2              0.036
69.5 – (79.5)             7              0.125
79.5 – (84.5)             9              0.161
84.5 – (89.5)            10              0.179
89.5 – (94.5)            11              0.196
94.5 – (99.5)             7              0.125
99.5 – (109.5)            8              0.143
109.5 – (119.5)           2              0.036
Total                    56              1.00

Here, 2/56 = 0.036 (rounded to 3 d.p.) and 7/56 = 0.125.
Percentage frequency: the relative frequency multiplied by 100.
e.g. 0.036 = 3.6% (or 3.6 per 100), meaning that 3.6% of the values are in 59.5 – (69.5).
Note: Relative (or percentage) frequencies allow comparison of samples when the samples are of unequal size. Absolute frequencies f_j will not allow this, since all the f_j will be large for a large sample of outcomes but small for a small sample.
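The relative frequency column can be reproduced directly from the interval frequencies (a Python sketch; the course software is R-cmdr):

```python
# Interval frequencies from the blood pressure table above.
freqs = [2, 7, 9, 10, 11, 7, 8, 2]
n = sum(freqs)  # 56, the sample size

rel_freqs = [round(f / n, 3) for f in freqs]
pct_freqs = [round(100 * f / n, 1) for f in freqs]

print(n)          # 56
print(rel_freqs)  # [0.036, 0.125, 0.161, 0.179, 0.196, 0.125, 0.143, 0.036]
print(pct_freqs)  # the same values as percentages
```

The relative frequencies sum to 1 (up to rounding), which is a quick check on the table.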
Histograms: These are simple pictures of the data. The base of each rectangle is the interval length, and the area of each rectangle is proportional to the class frequency (or relative frequency). When the class intervals are all equal, the rectangle heights are proportional to the frequencies as well.
Example: Return to the blood pressure readings.

Pressure (mm)         (f_j)    (f_j/n)
59.5 – (69.5)            2      0.036
69.5 – (79.5)            7      0.125
79.5 – (84.5)            9      0.161
84.5 – (89.5)           10      0.179
89.5 – (94.5)           11      0.196
94.5 – (99.5)            7      0.125
99.5 – (109.5)           8      0.143
109.5 – (119.5)          2      0.036
Total                   56      1.00

[Frequency histogram: frequency per 5 mm interval (vertical axis, 0 to 12) against blood pressure (horizontal axis, with boundaries at 59.5, 69.5, 79.5, 89.5, 99.5, 109.5 and 119.5 mm Hg).]
N.B. (1) The heights of the first two and last two rectangles are halved, but their bases are doubled from 5 to 10 mm. (Area therefore remains proportional to frequency in these intervals if 5 mm is regarded as the horizontal “unit”.)
(2) The label on the vertical axis is given as “Freq. per unit interval”, where “unit” = five.
(3) The relative frequency histogram follows:
[Relative frequency histogram: relative frequency per 5 mm interval against blood pressure over the same boundaries; the first two and last two bars again have doubled bases, with heights 0.018, 0.063, 0.072 and 0.018, and the central bars rise to 0.196.]
(4) The frequency and relative frequency histograms have the same shape; only the scales on the vertical axis differ. Both give some idea of the centre of the data, the extent of the variability in the data and the distribution of the data.
(5) The relative (or percentage) frequency histogram is used when comparing two (or more) samples of data, for example one sample of values from a control group and the other from a treated group of experimental units.
(6) Notice how a histogram with rectangle heights proportional to class frequencies would give a misleading picture of the data when the class intervals are unequal.
(7) You will find that most of the histograms produced by statistical packages like R-cmdr have class intervals of equal length, and you can decide the number of intervals you want in the graph. Usually between 5 and 20 intervals of equal length are chosen for a good summary of the data.
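The height adjustment for unequal intervals described in note (1) can be checked numerically: each bar's height is its frequency per 5 mm "unit", so a bar with a 10 mm base has its height halved (Python sketch):

```python
# With unequal class intervals the bar AREA, not the height, must be
# proportional to frequency. Heights are therefore frequency per 5 mm unit.
edges = [59.5, 69.5, 79.5, 84.5, 89.5, 94.5, 99.5, 109.5, 119.5]
freqs = [2, 7, 9, 10, 11, 7, 8, 2]

heights = []
for (lo, hi), f in zip(zip(edges, edges[1:]), freqs):
    width = hi - lo                  # 10 mm for the outer intervals, else 5 mm
    heights.append(f / (width / 5))  # frequency per 5 mm interval

print(heights)  # [1.0, 3.5, 9.0, 10.0, 11.0, 7.0, 4.0, 1.0]
```

The first two and last two heights are half their raw frequencies, exactly as in the drawn histogram.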
Measures of Central Tendency
The mean is “typical” of the majority of data in a sample.
Example: Six patients lived the following years after diagnosis of HIV.

Datum (Outcome)    Symbol
1.8                x_1
3.2                x_2
6.8                x_3
4.6                x_4
2.8                x_5
7.9                x_6

Mean = (1/6)(1.8 + 3.2 + 6.8 + 4.6 + 2.8 + 7.9) = 27.1/6 = 4.52 years

Notation: mean x̄ = (x_1 + x_2 + x_3 + x_4 + x_5 + x_6)/6, or in general

x̄ = (1/n) Σ_{i=1}^{n} x_i
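The arithmetic of this example can be verified in a couple of lines (a Python sketch of the formula above):

```python
years = [1.8, 3.2, 6.8, 4.6, 2.8, 7.9]  # survival times from the example

n = len(years)
mean = sum(years) / n  # (1/n) * sum of the x_i

print(round(sum(years), 1))  # 27.1
print(round(mean, 2))        # 4.52 years
```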
Note: The mean need not be one of the outcome values, and i is a suffix taking values i = 1 to i = n (or 6 here). Any symbol can be used for this suffix.
Example: The 56 blood pressure readings just considered have a mean of 89.54 mm of Hg. This value is “typical” of the data in the sense that it is near the centre of the region where most values are located.
The Median is a second measure “typical” of data in a sample and is the “middle value” of the data after arranging the numbers in order from smallest to largest.
Example: Data: 95 86 78 90 62 73 89
Rearrange: 62 73 78 86 89 90 95
Median = 86 (the middle value)
Note:
1. If 62 is replaced by 5, the median is unchanged (the mean would be much smaller). This indicates that, in general, the median is not affected by a few very extreme values whereas the mean is.
2. If there is an even number of values, average the two centre values.
Example: For the 56 blood pressure readings, the median turns out to be 89.30 (compare the mean of 89.54).
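A short sketch confirms that the median resists an extreme value while the mean does not:

```python
import statistics

data = [95, 86, 78, 90, 62, 73, 89]
print(statistics.median(data))          # 86
print(round(statistics.mean(data), 1))  # 81.9

# Replace 62 by the extreme value 5: the median is unchanged, the mean drops.
data2 = [95, 86, 78, 90, 5, 73, 89]
print(statistics.median(data2))          # still 86
print(round(statistics.mean(data2), 1))  # 73.7
```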
The mode is another measure of centre. It is the commonest value in the data. This only makes sense for discrete data. For continuous grouped data it coincides with the peak in the histogram. The histogram is bimodal if there is more than one peak.
Further Notes
(1) The mean (89.54) and median (89.30) for the blood pressure readings are close because the data are almost “symmetrical”.
(2) For “non-symmetrical” data the mean and median are different, since the mean is pulled in the direction of the extreme values. The data are said to be skew.
[Sketch: a positively skewed histogram starting at 0, with the median to the left of the mean.]
The mean may be unsuitable as a measure of centre, while the median is more “typical” of most values.
(3) For measurements which cannot be negative it is quite common to have many values close to zero, thus presenting a skew distribution. This is called positive skewness. (The histogram above represents positively skewed data.)
(4) The opposite phenomenon, with an extended left-hand tail, is called negative skewness and is rare.
(5) A trimmed mean is the mean with the lower 5% and upper 5% of values removed.
Measures of Variability
“Looking at the world using data is like looking through a window with ripples in the glass.” (Professor Chris Wild, Auckland University)
Statistics is about variability. Variability reflects differences in the values collected for the different units being measured, for example people, animals, plants, companies, or readings on different days. Two sets of values can have the same mean and median yet show quite different patterns.
Variability can be random or caused by different treatments or “factors” acting on the experimental units in a study in different ways. The hope is that the random variation will be relatively small or controlled by the choice of an appropriate study design. This will result in the identification of important treatment effects explaining key aspects of the variation.
If data are highly variable there are problems analysing the data, and it will be necessary to select larger samples.
The first measure of variation is the range (the distance between the lowest and highest values). It is sensitive to any extreme values and hence not very useful. But reduced ranges (encompassing, say, the central 95% of the data) are useful, as extreme values (outliers) are excluded.
Note: In clinical chemistry (e.g. cholesterol measures) a reference range encompassing the central 95% of values describes variability in normal people and allows test results for other individuals to be assessed to see if corrective action is needed.
A second measure is the (sample) variance, defined by

s² = (1/(n − 1)) Σ_{i=1}^{n} (x_i − x̄)²

Although the divisor is (n − 1) in this equation, we can see that s² is effectively the “average” of the squared deviations of the individual data values
(x_i) from their mean x̄. For technical reasons we do not divide by n.
Notes: 1. The variance is an overall measure of the extent to which the values x_i differ from their mean x̄.
2. Squaring is essential. If the deviations from x̄ are simply added, the value 0 is always obtained.
A third convenient measure is the standard deviation (s), given by

s = √variance = √[ (1/(n − 1)) Σ_{i=1}^{n} (x_i − x̄)² ]

Note: The standard deviation s is measured in the same units as the original data (taking the square root cancels the squaring).
Example: Find the standard deviation of 11, 18, 14, 15, 12.

x_i    x_i − x̄          (x_i − x̄)²
11     11 − 14 = −3       9
18     18 − 14 =  4      16
14     14 − 14 =  0       0
15     15 − 14 =  1       1
12     12 − 14 = −2       4
70                  0    30

x̄ = 70/5 = 14    s = √(30/4) = 2.74

Note that 2.74 is a “typical” or “average” deviation from the mean x̄ = 14.
Example: Return to the 56 blood pressure readings.

Pressure Interval      f_j
59.5 – (69.5)            2
69.5 – (79.5)            7
79.5 – (84.5)            9
84.5 – (89.5)           10
89.5 – (94.5)           11
94.5 – (99.5)            7
99.5 – (109.5)           8
109.5 – (119.5)          2
Total                   56

The standard deviation is s = 11.21. This value is “typical” of deviations from x̄ = 89.54.
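The worked example for 11, 18, 14, 15, 12 can be verified directly (a Python sketch of the variance and standard deviation formulas):

```python
import math

data = [11, 18, 14, 15, 12]
n = len(data)

mean = sum(data) / n                     # 70/5 = 14.0
ss = sum((x - mean) ** 2 for x in data)  # sum of squared deviations: 30.0
variance = ss / (n - 1)                  # divisor is n - 1, not n
sd = math.sqrt(variance)

print(mean, ss)       # 14.0 30.0
print(round(sd, 2))   # 2.74
```

The built-in `statistics.stdev` uses the same n − 1 divisor and gives the same answer.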
The Interquartile Range is another measure of variability.
[Diagram: the range split into four parts, each containing 25% of the data, with boundaries at Q_L, the median and Q_U; the interquartile range runs from Q_L to Q_U.]
The lower quartile Q_L is the value below which a quarter of the data lie. The upper quartile Q_U has ¾ of the data below it. (These are also known as the 25th and 75th percentiles.)
Notes: 1. The interquartile range can be a helpful measure of variability. It is not affected by extreme values.
2. Computer packages also give Q_L and Q_U for large data sets, and the approximations for grouped data are no longer needed.
Example: For the 56 blood pressure readings, Q_L = 82.2 and Q_U = 96.6, with Q_U − Q_L = 14.4.
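Quartiles are easily computed in software; note that packages interpolate slightly differently near the quartiles, so R-cmdr's answers may differ a little from the sketch below, which uses one common convention:

```python
import statistics

data = [95, 86, 78, 90, 62, 73, 89]  # the small data set used earlier

# method="inclusive" interpolates between the sorted data points.
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")

print(q1, q2, q3)  # 75.5 86.0 89.5
print(q3 - q1)     # interquartile range: 14.0
```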
Box-and-whisker plot
This is a second way of summarising data graphically. Like relative frequencies, it is useful when comparing samples of unequal size.

Example: Blood pressures
Q_L = 82.2; Q_U = 96.6; Median = 89.3
Suppose 63 and 116 are the lowest and highest values.

[Boxplot of the blood pressures on a scale from 60 to 120.]

The centre of the data, its variation, its symmetry (or lack of symmetry) and extreme values are displayed.

Notes: (1) Two samples can be compared.

[Two boxplots drawn on a common scale.]

Both samples are skewed; the second is more variable (larger interquartile range) with a larger median.
(2) The points at the ends of the whiskers depend on the package and are
• the extreme values, or
• the 2½% and 97½% values (centiles), or
• points 1½ times the interquartile range away from the boxes.

Outliers beyond these points are shown in R-cmdr by an asterisk or a small circle (as below), where there are obvious changes in the ozone readings recorded over summer in a New Zealand city. An asterisk will represent an extreme outlier.

[Boxplots of the ozone readings by month (11, 12, 1, 2, 3).]
Example: Thirty-two traps were placed in each of three habitats on Stephens Island: pasture, replanted forest and tussock. The data are the counts of skinks per trap totalled over a ten-day period in each habitat. The boxplots are below. Summarize conclusions about skink density.

Pasture:           4 3 0 2 2 1 4 1 2 5 0 1 5 6 5 6
                   11 3 1 1 4 8 5 14 6 8 10 7 4 8 13 6
Replanted forest:  15 24 31 8 4 18 14 33 11 16 20 1 17 12 27 26
                   18 6 12 16 11 8 13 12 11 8 10 17 29 3 12 5
Tussock:           14 23 15 14 5 16 10 16 14 10 7 10 8 12 19 17
                   7 12 29 10 11 11 10 10 6 13 7 10 8 12 6 12

Greater skink density in replanted forest and tussock; greater variation in replanted forest. Some outliers in all three habitats.

Means: 4.88; 14.63; 12.00
Medians: 4.50; 12.50; 11.00
Std deviations: 3.64; 8.18; 5.07
Example: Thirty-four adult hoki were caught off the Kapiti coast, with individual lengths as follows:

Males:   18.7 19.0 18.8 18.4 19.3 19.6 20.3 19.9 19.3 18.9
         18.9 19.0 19.7 20.4 18.6 19.5 20.3 19.9 19.2 18.7
Females: 18.6 19.6 18.3 17.5 18.3 19.0 18.5 18.7 19.3 18.5
         19.1 18.7 19.1 18.8

Boxplots indicate male hoki are longer than female hoki, with slightly greater variation in the males but no outliers. The distributions are almost symmetric.

Means: 19.32; 18.71
Medians: 19.25; 18.70
Std deviations: 0.61; 0.51
Interpreting box-and-whisker plots (Ref: Professor Chris Wild, Auckland University)

[Two pairs of boxplots of observed data, samples A and B, each with B shifted clearly to the right of A. In both cases the call is "B values bigger".]

The above two calls hold for all sample sizes. Larger random samples have more information about the populations from which they come. With large random samples we can make the "B values bigger" call from smaller shifts. Avoid using box-and-whisker plots for samples smaller than about 20.
[Five more pairs of boxplots of observed data, samples A and B, with progressively smaller shifts between the boxes. The calls are: "B values bigger, if both sample sizes > 20"; "What is my call? Cannot tell unless both samples are huge"; and "What is my call? Cannot tell, for all sample sizes".]
How to make the call
This is based on a confidence interval idea (see later), but the result is easy to calculate. In the following, IQR is the interquartile range and n is a sample size. For each sample, form the interval

  Med − 1.5 × IQR/√n   to   Med + 1.5 × IQR/√n

where Med is the sample median. We can claim the values of B tend to be bigger than the values of A back in the populations from which the samples have been taken if these intervals do not overlap.

[Diagram: boxplots for samples A and B with their comparison intervals marked; the intervals do not overlap.]
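The non-overlap rule is easy to code; a hypothetical Python sketch (function names are illustrative, the intervals being median ± 1.5 × IQR/√n):

```python
import math

def comparison_interval(median, iqr, n):
    """Informal comparison interval: median +/- 1.5 * IQR / sqrt(n)."""
    half_width = 1.5 * iqr / math.sqrt(n)
    return (median - half_width, median + half_width)

def call_b_bigger(med_a, iqr_a, n_a, med_b, iqr_b, n_b):
    """True when B's interval lies entirely above A's (no overlap)."""
    lo_a, hi_a = comparison_interval(med_a, iqr_a, n_a)
    lo_b, hi_b = comparison_interval(med_b, iqr_b, n_b)
    return lo_b > hi_a

# Hypothetical sample A: median 10, IQR 4, n 25 -> interval (8.8, 11.2)
print(comparison_interval(10, 4, 25))
# Sample B: median 13, IQR 4, n 25 -> interval (11.8, 14.2), clear of A's
print(call_b_bigger(10, 4, 25, 13, 4, 25))   # True
```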
SECTION 2
This covers the measures of disease frequency and disease association, with several examples looking at prevalence, incidence, relative risks, attributable risk and odds ratios.
Prevalence and Incidence
Cumulative Incidence
Incidence Rate
Disease Association
Relative Risk
Attributable Risk
Odds Ratio
[C] Measures of Disease Frequency
All measures of disease frequency are ratios of the form numerator/denominator.
There are two types of ratio:
1. Proportion: everyone in the numerator must be included in the denominator.
2. Rate: a measure of time is included in the denominator.

The measures of disease frequency are:
1. Prevalence
• gives the frequency of existing cases of disease
• is useful for measuring the disease burden in a community
• is often measured in a cross-sectional survey
e.g. the proportion of Otago students at 3pm Tuesday who have swine flu.
2. Incidence
• measures the frequency of new cases of disease
• is useful for looking at causes of disease
e.g. the number of new cases of cold that develop in a week.

Example: Frequency of hepatitis in two regions.

            New cases       Reporting
Location    of hepatitis    period        Population
Region A         58         1985            25,000
Region B         35         1984–1985        7,000

Region A: 58/25,000 per year
  = 232 per 100,000 per year
  = 23.2 per 10,000 per year
  = 2.32 per 1,000 per year

Region B: 35/7,000 over 2 years = 17.5/7,000 per year
  = 250 per 100,000 per year
  = 2.50 per 1,000 per year

Note: The time period must be specified for the results and comparisons to be meaningful.
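The arithmetic above is just cases ÷ population ÷ years, rescaled; a small Python check (the function name is illustrative):

```python
def incidence_per_100k(new_cases, population, years):
    """New cases per 100,000 population per year."""
    return new_cases / population / years * 100_000

region_a = incidence_per_100k(58, 25_000, 1)   # 1985 only
region_b = incidence_per_100k(35, 7_000, 2)    # 1984-1985, two years

print(round(region_a, 1))   # 232.0
print(round(region_b, 1))   # 250.0
```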
Example: In a survey of eye disease among 2477 people aged 52–85 in Framingham, Massachusetts, there were 310 with cataracts and 22 blind.

Prevalence of cataracts = 310/2477 = 0.125 = 125 per 1000 (or 12.5%)
Prevalence of blindness = 22/2477 = 0.009 = 9 per 1000 (or 0.9%)
Example: In the following diagram the time a person has the disease is shaded.

[Diagram: five subjects plotted against time; reading up the vertical line at each of four successive time points, the prevalence is 1/5, 2/5, 3/5 and 2/5.]

Note on Prevalence:
Prevalence is the proportion of people in a population who have the disease at a given point in time. The time point may refer to calendar time, or to a fixed point in the course of events.
e.g. the proportion of people free from back pain 2 months after back injury.

Note on Incidence:
Incidence, on the other hand, quantifies the number of new cases of disease in a given time period. There are two measures:
• cumulative incidence
• incidence rate
2.1 Cumulative incidence is the proportion of people who become diseased during a specified period of time:

  Cumulative incidence = (number of new cases of disease) / (total population at risk)

This provides an estimate of the probability, or risk, that an individual will develop the disease during the specified period of time.

Example: In a study in Evans County, Georgia, there were 609 men aged 40–76 who had no detected heart disease in 1960. These men were followed for 7 years, and 71 cases of heart disease were detected during this period.

  Cumulative incidence = 71/609 = 0.117 (or 11.7%) over the 7-year period
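A quick check of the Evans County figure (a sketch; cumulative incidence is just a proportion):

```python
def cumulative_incidence(new_cases, population_at_risk):
    """Proportion of the at-risk population that developed disease."""
    return new_cases / population_at_risk

ci = cumulative_incidence(71, 609)
print(round(ci, 3))   # 0.117, over the 7-year period
```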
Notes: (1) The time period over which cumulative incidence is calculated must be specified for it to be interpretable.
(2) Cumulative incidence assumes the entire population at risk at the beginning of the study period has been followed for the whole study period. But often:
• people are lost to follow-up
• people are enrolled in the study at different times
The length of the follow-up period is therefore not the same for everyone in the study. It is the incidence rate that takes account of varying amounts of follow-up time.
2.2 Incidence rate:

  Incidence rate = (number of new cases of disease) / (total person-time at risk)

The same amount of person-time results from following:
• 16 people for one year
• 4 people for four years
Both give 16 person-years of observation.
Example: Calculation of person-years for an incidence rate. Five subjects were followed between January 1997 and January 2002; follow-up began at different times, two subjects developed the disease, and one was lost to follow-up:

Subject   Time at risk (years)
A          2.0   (lost to follow-up)
B          3.0   (developed disease)
C          5.0
D          4.0
E          2.5   (developed disease)
Total     16.5

Number of new cases = 2
Number of person-years at risk = 16.5
Incidence rate = 2/16.5 = 0.121
That is, 12.1 cases per 100 person-years of observation.
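The person-time bookkeeping can be sketched in Python (follow-up times as in the table; the structure is illustrative):

```python
# (years at risk, developed disease?) for subjects A-E, from the table above
follow_up = {
    "A": (2.0, False),   # lost to follow-up
    "B": (3.0, True),    # developed disease
    "C": (5.0, False),
    "D": (4.0, False),
    "E": (2.5, True),    # developed disease
}

person_years = sum(years for years, _ in follow_up.values())
new_cases = sum(1 for _, diseased in follow_up.values() if diseased)

rate = new_cases / person_years
print(person_years)            # 16.5
print(round(rate * 100, 1))    # 12.1 cases per 100 person-years
```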
Example: A study in the United States measured the incidence rate of stroke in a group of 118,539 women aged 30–55 years. The women were free from stroke in 1986, and were followed for 8 years.

Smoking        No. of cases   Person-years of   Stroke incidence rate
category       of stroke      observation       (per 100,000
                              (over 8 years)    person-years)
Never smoked        70           395,594             17.7
Ex-smoker           65           232,712             27.9
Smoker             139           280,141             49.6
Total              274           908,447             30.2
Incidence rate = (274/908,447) × 100,000 = 30.2 cases of stroke per 100,000 person-years of observation.

Average follow-up per woman = 908,447/118,539 = 7.7 years
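The rates in the stroke table follow the same recipe, cases ÷ person-years × 100,000; a Python sketch:

```python
# (cases, person-years) by smoking category, from the table above
study = {
    "never smoked": (70, 395_594),
    "ex-smoker":    (65, 232_712),
    "smoker":       (139, 280_141),
}

rates = {category: round(cases / py * 100_000, 1)
         for category, (cases, py) in study.items()}

total_cases = sum(c for c, _ in study.values())   # 274
total_py = sum(p for _, p in study.values())      # 908,447
total_rate = round(total_cases / total_py * 100_000, 1)

print(rates)         # {'never smoked': 17.7, 'ex-smoker': 27.9, 'smoker': 49.6}
print(total_rate)    # 30.2
```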
Note: The denominator for measures of incidence should include only those who are at risk of developing the disease. It should exclude
• those who already have the disease
• those who cannot develop the disease
Failure to do this will lead to an underestimate of the true incidence, since fewer will develop the condition.
For example, when studying the incidence of endometrial cancer we should exclude women who have had a hysterectomy.
Example: In (a)–(c) calculate a relevant measure of disease frequency and give its name.

(a) You survey 346 travellers returning from overseas travel and find that 95 of them experienced a diarrhoeal illness on their trip. (1 mark)

(b) A tour of 143 people is travelling through Central America for 2 weeks. During this trip 28 of the people experience a diarrhoeal illness. (1 mark)

(c) A group of 18 Peace Corps volunteers in Guatemala kept daily records of their exposure to various risk factors (such as untreated water) and whether or not they had diarrhoea. The following values are the numbers of new episodes of diarrhoea, with the number of weeks of records in brackets, for each of the 18 individuals:

12(88) 12(46) 19(77) 7(102) 8(73) 15(110) 7(101) 9(94) 2(62)
8(25) 1(90) 1(17) 15(28) 9(30) 5(101) 7(21) 14(109) 17(93)

NOTE: You should assume that the reported number of weeks does not include weeks in which the individual had diarrhoea when the week started (i.e., each person was disease-free at the start of each week). (1 mark)
Solution
(a) 95/346 = 0.275. Prevalence = 27.5 per 100: 27.5% of the overseas travellers report experiencing diarrhoea during their trip.
(b) 28/143 = 0.196. Cumulative incidence = 19.6 cases per 100 exposed per 2 weeks.
(c) In this problem you are calculating an incidence rate. You generally calculate the incidence rate as the total number of episodes divided by the total exposure time:

(12+12+19+7+8+15+7+9+2+8+1+1+15+9+5+7+14+17) / (88+46+77+102+73+110+101+94+62+25+90+17+28+30+101+21+109+93)
= 169/1269 = 0.133

Thus, incidence rate = 13.3 cases per 100 person-weeks of observation.
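The incidence-rate calculation in (c) can be reproduced by summing the two lists; a Python sketch using the data as printed above:

```python
episodes = [12, 12, 19, 7, 8, 15, 7, 9, 2, 8, 1, 1, 15, 9, 5, 7, 14, 17]
weeks    = [88, 46, 77, 102, 73, 110, 101, 94, 62, 25, 90, 17, 28, 30, 101, 21, 109, 93]

# Incidence rate = total episodes / total person-weeks at risk;
# rounds to 0.133, agreeing with the worked answer
rate = sum(episodes) / sum(weeks)
print(round(rate, 3))   # 0.133, i.e. 13.3 per 100 person-weeks
```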
Relationship between prevalence and incidence

Example: Disease A
[Diagram: five subjects followed over a t-year period; each develops the disease, but the shaded disease periods show that only two subjects still have the disease at time L.]
Cumulative incidence = 5/5 in t years
Prevalence at time L = 2/5

Disease B
[Diagram: all five subjects develop the disease and still have it at time L.]
Cumulative incidence = 5/5 in t years
Prevalence at time L = 5/5
Note: Prevalence depends on
• the incidence rate
• the duration of disease

Diabetes (adult onset)
• annual incidence rate is low
• duration is long, as the disease is neither curable nor fatal
so prevalence is high relative to incidence.

Cold
• incidence is high
• duration is short
so prevalence is low relative to incidence.
HIV/AIDS
Many with HIV will live for a long time, so the prevalence of HIV in the community will be high. There is also an issue related to the fact that a person may not know they are HIV positive; hence we are likely to underestimate the prevalence.

If diagnosed with AIDS, death comes quickly, i.e. few are living with AIDS. Hence AIDS prevalence is relatively low.

There are obvious issues related to health care provision and planning.
[D] Measures of disease association
Comparisons of disease frequency are made between different groups of people. In the simplest (and very common) setting there are two groups, one exposed and the other unexposed.

Example: Data from a cohort study of oral contraceptive (OC) use and bacteria in the urine among women aged 16–49 years over 3 years.

                     Bacteria present
                     Yes     No     Total
OC use   Yes          27    455       482
         No           77   1831      1908
         Total       104   2286      2390

Data from D.A. Evans et al., NEJM (1978)

Bacteria is the disease category (the outcome measure). OC use is the exposure category.
Cumulative Incidence
OC users: 27/482 = 0.056, i.e. 56 cases per 1000 in 3 years
Non-users: 77/1908 = 0.040, i.e. 40 cases per 1000 in 3 years

Measures of Association:
Difference (absolute effect): 56 − 40 = 16 cases per 1000 in 3 years
Ratio (relative effect): 56/40 = 1.4
The proportion of OC users with bacteria is 1.4 times that for non-users.
[Note that the ratio does not include the time interval.]
1. Relative effect = Relative Risk (RR)
• the ratio of the incidence in the exposed group (I_e) to the incidence in the unexposed group (I_0):

  RR = I_e / I_0   with   RR > 1 (exposure → disease)
                          RR = 1 if I_e = I_0
                          RR < 1 (exposure is protective)

• indicates how much more likely disease is to develop in the exposed group than in the unexposed group
• no association between exposure and disease: RR = 1 (I_e = I_0)
• a good measure of the strength of an association
• the usual measure in studies of causation of disease
• ratios of prevalences can also be calculated, but the interpretation is different
2. Absolute effect = Attributable Risk (AR)
• the difference in incidence between the exposed and unexposed groups:

  AR = I_e − I_0

• indicates how many more people with disease there are in the exposed than in the unexposed group
• no association between exposure and disease: AR = 0 (I_e = I_0)
• assuming a cause-effect relationship between exposure and disease, we say:
if AR > 0, AR is the number of cases of the disease among the exposed that can be attributed to their exposure;
if AR < 0, |AR| is the number of cases among the exposed that have been prevented by the exposure.
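Both measures are simple functions of the two incidences; a Python sketch, applied to the oral-contraceptive data from earlier (I_e = 27/482, I_0 = 77/1908):

```python
def relative_risk(i_exposed, i_unexposed):
    """RR = I_e / I_0."""
    return i_exposed / i_unexposed

def attributable_risk(i_exposed, i_unexposed):
    """AR = I_e - I_0."""
    return i_exposed - i_unexposed

i_e = 27 / 482     # OC users, 3-year cumulative incidence
i_0 = 77 / 1908    # non-users

print(round(relative_risk(i_e, i_0), 1))           # 1.4
print(round(attributable_risk(i_e, i_0) * 1000))   # 16 per 1000 in 3 years
```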
Example: A randomised trial of the effectiveness of infra-red stimulation compared with placebo on pain caused by cervical osteoarthritis (degenerative joint disease in the neck), carried out over two months. (Placebo or control: mock stimulation.)

                           Treatment   Control
Improvement in pain            18          8
No improvement in pain          7         17
Total                          25         25

Exposure is Treatment/Control.
Disease is Improvement/No improvement in pain [the outcome classification].

Cumulative incidence of improvement (in 2 months):
Treatment group: 18/25
Control group: 8/25

Relative risk = (18/25) / (8/25) = 2.25 ≈ 2.3

The chance of improvement in the treatment group is 2.3 times the chance in the control group.
Example: Prevalence of coronary heart disease (CHD) at initial examination among 4469 persons aged 30–62 years in the Framingham Study.

           Number      Number      Prevalence
           examined    with CHD    per 1,000
Males        2024         48          23.7
Females      2445         28          11.5

Note that 23.7 = (48/2024) × 1,000, hence called prevalence per 1,000. Similarly, 11.5 = (28/2445) × 1,000.

Relative risk = 23.7/11.5 = 2.1
[Heart disease is twice as common in males as in females.]

Attributable risk = 23.7 − 11.5 = 12.2 per 1000
[There are 12.2 more cases of heart disease in 1000 men than in 1000 women.]
Example: Data from a cohort study of postmenopausal hormone use and coronary heart disease (CHD) among female nurses.

                           CHD cases   Person-years
Postmenopausal    Yes          30        54,308.7
hormone use       No           60        51,477.5

Data from Stampfer et al., NEJM (1985)

Incidence rates:
Users: 30/54,308.7 = 55 per 100,000 person-years
Non-users: 60/51,477.5 = 117 per 100,000 person-years

Attributable risk: 55 − 117 = −62 cases of CHD per 100,000 person-years
Hormone use prevents 62 cases per 100,000 person-years.

Relative risk: 55/117 = 0.47
The risk of CHD among users is 0.47 times the risk in non-users (i.e. a 53% reduction in risk).
Example: Relative and attributable risks of mortality from lung cancer and coronary heart disease among cigarette smokers in a cohort study of British male physicians.

                        Annual mortality rate per 100,000
                        Lung cancer    Heart disease
Cigarette smokers           140             669
Non-smokers                  10             413
Relative risk              14.0             1.6
Attributable risk           130             256
(per 100,000 per year)

Data from Doll and Peto, Br Med J (1976)

RR: 140/10 = 14.0        669/413 = 1.6
AR: 140 − 10 = 130       669 − 413 = 256

Heart disease is more common, therefore a smaller relative increase in risk produces more people with disease.
Note
Relative risks
• provide information on the strength of an association
• can be used to assist in assessing the likelihood of a causal association
Attributable risks
• measure the impact of an exposure (assuming that it is causal)
If a disease is common, a small relative risk will translate to a large attributable risk [see the previous example].
3. Odds Ratio: a third measure of association
This can be used in case-control studies, where measures of disease frequency in the study population are not available.

  Odds of disease = (chance, or probability, of disease) / (chance, or probability, of no disease)

See later.
SECTION 3
This section covers a brief introduction to probability definitions, notation, rules and random variables with examples, several involving tree diagram use.
Definitions including mutually exclusive and independent events
The Addition Rule for combining probabilities
The Multiplication Rule for probabilities
Tree diagrams with examples
Screening test terminology
Probability Distributions and Random Variables
Rules for combining Random Variables
Introduction To Probability
To define what we mean by probability we need to talk about experiments and events.
• An experiment is the process by which observations or measurements are obtained.
• The outcome of an experiment is referred to as an event, and may also represent a group of possible outcomes.
• The set of all possible individual outcomes is the sample space.

Example: Toss a coin once. Observe event A – the coin comes up a head (H) – or B – the coin comes up a tail (T). The sample space is {H, T}.

An experiment results in outcomes that cannot be predicted in advance. This uncertainty about an outcome is measured by the probability of the event. Different events have different probabilities. We define the probability of an event A as

  Pr(A) = n_A / N

where n_A is the number of experiments resulting in event A in a very large number (N) of repetitions of the experiment.
A probability is therefore like a relative frequency. It is a measure on a scale from 0, representing absolute impossibility, to 1, representing absolute certainty. Subjective estimates of probability are "unlikely", "possibly", "almost never", etc., which all convey an idea of the likelihood of occurrence of an event. But different people attach different values to these (and this is a problem). For example, what is the probability that God exists (0 or 1)?

Probability calculations began with games of chance over 3000 years ago. The games involve coins, dice, cards, roulette, etc. With such objects we can develop exact probabilities of possible outcomes or events by making sensible assumptions:
• a die (plural dice) is fair (1/6 is the probability of any outcome)
• a coin is fair (1/2 is the probability of a head)
• a card is drawn (1/52 is the probability of any particular card)
• a birth date (1/365 is the probability of a particular day)
Probabilities associated with these objects can be calculated using our knowledge of the properties of these objects.
Example: An experiment involves throwing a fair die. The event is "obtaining an even number". The answer is 3/6 or 1/2 (easy). This probability could also be found by experiment, tossing the die many times.
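That "tossing the die many times" idea is exactly the relative-frequency definition of probability; a Python simulation sketch:

```python
import random

random.seed(1)  # make the run reproducible

N = 100_000
# Count how many of N simulated throws of a fair die come up even
n_even = sum(1 for _ in range(N) if random.randint(1, 6) % 2 == 0)

estimate = n_even / N
print(round(estimate, 2))   # close to the exact answer 3/6 = 0.5
```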
In practice, experiments are much more complex than this in situations of interest to researchers. Events result from such experiments, and event probabilities are needed if we are to draw conclusions from the sample data collected.
Further Examples
1. An experiment treats 20 patients in a clinical investigation involving a new drug.
An event is "at least 12 patients are cured".
What is the probability of this event?
2. An experiment selects 500 voters in a survey.
An event is "at least 300 support windmill farms in Central Otago".
3. An experiment treats two "equal" samples of cancer patients, one by surgery and one by chemotherapy.
An event is "more chemotherapy patients are cured". The probability will give insight into the better treatment.

Theoretical probabilities are unknown in such situations, hence these probabilities must be estimated from experimental data by observing outcomes or noting historical information.
Combining Probabilities for Multiple Events
Example: Consider the probability of being in each of the four blood groups. The probabilities from the Dunedin blood donor centre are:

Blood type    Pr(blood type)
A                 0.38
B                 0.11
AB                0.04
O                 0.47

(These probabilities can also be estimated by "experiment": the relative frequencies will approach these values if many people are sampled.)

1. What is the probability that a person is either A or B?
2. What is the probability that 3 unconnected (or independent) people are all in blood group O?

Solution:
1. For two mutually exclusive outcomes, the probability that either occurs is the sum of the individual probabilities. The probability of being either A or B is

  Pr(A) + Pr(B) = 0.38 + 0.11 = 0.49

Note: Pr(A) + Pr(B) + Pr(AB) + Pr(O) = 1
Here we have used the fact that the outcomes are mutually exclusive: a person cannot be in both blood groups A and B.

2. For any two independent outcomes, the probability that both are observed is the product of the individual probabilities. This can be extended to three people in the obvious way. Therefore, the probability that three people all have blood group O can be shown to be (see later)

  Pr(O) × Pr(O) × Pr(O) = 0.47 × 0.47 × 0.47 = 0.104

Note: Independent events arise if the outcome of one event tells us nothing about the other event. We obviously must exclude the possibility that the three people are in the same family.

Note: This example illustrates the two laws for combining probabilities:
• the addition rule in part 1
• the multiplication rule in part 2
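The two rules can be checked numerically; a Python sketch with the blood-group probabilities:

```python
pr = {"A": 0.38, "B": 0.11, "AB": 0.04, "O": 0.47}

# Addition rule (mutually exclusive events): Pr(A or B) = Pr(A) + Pr(B)
p_a_or_b = pr["A"] + pr["B"]
print(round(p_a_or_b, 2))          # 0.49

# The four blood groups exhaust all possibilities
print(round(sum(pr.values()), 2))  # 1.0

# Multiplication rule (independent people): Pr(all three are O)
p_three_o = pr["O"] ** 3
print(round(p_three_o, 3))         # 0.104
```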
Properties of Probabilities and Probability Laws
Notation: There is a convenient notation for representing event probabilities. Suppose S represents all possible outcomes of an experiment, A is the collection of these outcomes representing an event, and Ā is the collection of outcomes which are not in A.
• Ā is the event called the complement of A
• A and Ā are said to be mutually exclusive (no overlap)
• Also Pr(A) + Pr(Ā) = 1, since A and Ā together represent every possible outcome.

Now suppose two events A and B may overlap.
• The event A or B, denoted by A ∪ B, occurs if at least one of A or B occurs. It is called the union of A and B.
• The event A and B, denoted by A ∩ B, occurs if both A and B occur. It is called the intersection of A and B.

Example: A fair die is thrown. A is the event "a number greater than 3 is thrown" and B is the event "an even number is thrown".
Then S = {1, 2, 3, 4, 5, 6}
A = {4, 5, 6}, Pr(A) = 3/6
B = {2, 4, 6}, Pr(B) = 3/6
A ∩ B = {4, 6} and A ∪ B = {2, 4, 5, 6}
Pr(A ∩ B) = 2/6 and Pr(A ∪ B) = 4/6

[Venn diagrams over the set of all outcomes: Fig (i) shows A and B overlapping, A ∩ B not empty; Fig (ii) shows A and B disjoint, A ∩ B empty (mutual exclusiveness).]

The addition rule for combining probabilities:

  Pr(A or B) = Pr(A ∪ B) = Pr(A) + Pr(B) − Pr(A ∩ B)
since values in the intersection A ∩ B are counted twice. The special case when A and B are mutually exclusive is

  Pr(A ∪ B) = Pr(A) + Pr(B)

This was illustrated in the blood group example, part (1).

Example: The die again:
  Pr(A ∪ B) = 3/6 + 3/6 − 2/6 = 4/6
using the addition rule.
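Representing events as Python sets makes the rule easy to verify for the die example (exact arithmetic via fractions):

```python
from fractions import Fraction

S = {1, 2, 3, 4, 5, 6}   # sample space for a fair die
A = {4, 5, 6}            # a number greater than 3
B = {2, 4, 6}            # an even number

def pr(event):
    """Equally likely outcomes: Pr = |event| / |S|."""
    return Fraction(len(event), len(S))

# Addition rule: Pr(A ∪ B) = Pr(A) + Pr(B) - Pr(A ∩ B)
lhs = pr(A | B)
rhs = pr(A) + pr(B) - pr(A & B)
print(lhs, lhs == rhs)   # 2/3 True
```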
The Multiplication Rule
The intersection of two events A and B is the event that both occur. The probability of this is

  Pr(A and B) = Pr(A ∩ B) = Pr(A) Pr(B|A)

In words, this says that for both of the two events to occur, first one must occur [Pr(A)] and then, given that the first has occurred, the second must occur [Pr(B|A)].

If both Pr(A) and Pr(A and B) are given, this rule can be used to define conditional probability as

  Pr(B|A) = Pr(A ∩ B) / Pr(A)
92<br />
Section 3
Independence<br />
The idea behind the term Pr(B|A) is that the<br />
occurrence <strong>of</strong> event A may cause a reassignment<br />
<strong>of</strong> probability to event B that makes it differ from<br />
the original value Pr(B). When the occurrence <strong>of</strong><br />
A gives no additional information about B, A <strong>and</strong><br />
B are independent.<br />
That is Pr(B|A) = Pr(B)<br />
In this situation the multiplication rule is
Pr(A ∩ B) = Pr(A) Pr(B)
Otherwise it is the original
Pr(A ∩ B) = Pr(A) Pr(B|A)
The first rule was illustrated in the blood group example, where the probability of 3 independent people all having blood group O was
Pr(A ∩ B ∩ C) = Pr(A) Pr(B) Pr(C) = 0.47 × 0.47 × 0.47 = 0.104
Example: A survey <strong>of</strong> hospital patients shows<br />
that the probability a patient has high blood<br />
pressure given he/she is diabetic is 0.85. If 10%<br />
<strong>of</strong> patients are diabetic <strong>and</strong> 25% have high blood<br />
pressure:<br />
(a) Find prob. a patient has both diabetes <strong>and</strong><br />
high blood pressure.<br />
(b) Are the conditions of diabetes and high blood pressure independent?
Solution: (a) A is the event “patient has high blood pressure” and B is the event “patient is diabetic”.
Pr(A|B) = 0.85, Pr(B) = 0.10 and Pr(A) = 0.25
∴ Pr(A ∩ B) = Pr(A|B) Pr(B) by the multiplication rule
= 0.85 × 0.10 = 0.085
(b) Pr(A) = 0.25 ≠ Pr(A|B) = 0.85. Hence the two conditions are not independent.
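A minimal Python sketch of this solution (the probabilities are taken from the example above):

```python
# Multiplication rule and independence check for the
# diabetes / high blood pressure example.
pr_A_given_B = 0.85  # Pr(high blood pressure | diabetic)
pr_B = 0.10          # Pr(patient is diabetic)
pr_A = 0.25          # Pr(high blood pressure)

pr_A_and_B = pr_A_given_B * pr_B       # multiplication rule: 0.085
# independence would require Pr(A | B) == Pr(A)
independent = abs(pr_A_given_B - pr_A) < 1e-12
```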
A tree diagram is useful for helping calculate the<br />
probability <strong>of</strong> a combined event. The stages <strong>of</strong><br />
the combined event can be dependent or<br />
independent.<br />
Example: Independent Stages.<br />
Stephens Isl<strong>and</strong> is an uninhabited isl<strong>and</strong> in Cook<br />
Strait where tuatara are being re-established. For<br />
some years three locations have been visited on<br />
the isl<strong>and</strong> <strong>and</strong> tuatara have been found at a<br />
location with probability 0.4. At any visit X<br />
represents the number <strong>of</strong> locations out <strong>of</strong> three at<br />
which tuatara are observed. X can take values 0,<br />
1, 2 or 3. Find the probabilities that 0, 1, 2, or 3<br />
locations have tuatara on a visit.<br />
T is the event “location has tuatara’’ <strong>and</strong> N is the<br />
complementary event “location has no tuatara”.<br />
[Tree diagram over Locations 1, 2 and 3: each location branches into T (Pr = 0.40) and N (Pr = 0.60), giving eight outcomes.]
Outcome  Pr(Outcome)  No. of locations with tuatara
TTT      0.064        3
TTN      0.096        2
TNT      0.096        2
TNN      0.144        1
NTT      0.096        2
NTN      0.144        1
NNT      0.144        1
NNN      0.216        0
Then Pr(T) = 0.40 (known historically).
The second location is independent of the first, so
Pr(both T) = Pr(T ∩ T) = Pr(T) Pr(T) = (0.40)(0.40) = 0.160
using the multiplication rule, and
Pr(TTT) = (0.4)(0.4)(0.4) = 0.064
The tree diagram shows all possible outcomes.<br />
Branch probabilities are multiplied to give the<br />
probabilities <strong>of</strong> the 8 possible outcomes.<br />
The addition rule tells us that the probability <strong>of</strong><br />
seeing tuatara at two <strong>of</strong> the three sites, Pr(X = 2),<br />
adds the probabilities <strong>of</strong> the three possible<br />
outcomes, TTN, TNT <strong>and</strong> NTT.<br />
That is, Pr(X = 2) = 0.096 + 0.096 + 0.096<br />
= 0.288<br />
Similarly, Pr(X = 0) = 0.216, Pr(X = 1) = 0.432<br />
<strong>and</strong> Pr(X = 3) = 0.064.<br />
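The whole tree can be enumerated in Python; multiplying branch probabilities and adding over outcomes reproduces the distribution of X (a sketch, not part of the original notes):

```python
# Enumerate the eight branches of the tuatara tree diagram.
from itertools import product

p_t = 0.40                         # Pr(location has tuatara)
dist = {k: 0.0 for k in range(4)}  # Pr(X = k), k = number of T's

for outcome in product("TN", repeat=3):
    pr = 1.0
    for site in outcome:           # multiplication rule along a branch
        pr *= p_t if site == "T" else 1 - p_t
    dist[outcome.count("T")] += pr  # addition rule across branches
```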
In the next examples, the probability at each branch of the tree is conditional on earlier outcomes; i.e. the events are no longer independent, but branch probabilities are still multiplied according to the multiplication law for probabilities.
Example: Dependent stages. Andrew, John, and Mark play a game. There are six similar cars, two of which have had the brake cylinders removed. Each player chooses a car at random, drives at high speed towards a cliff, and attempts to brake in time to stop. The boys decide to proceed in alphabetical order. Find Pr(each will lose) and Pr(no loser), assuming that the game stops when the first boy drives over the cliff.
[Tree diagram: Andrew picks a faulty car (2/6) → Andrew loses; a good car (4/6) → John picks a faulty car (2/5) → John loses; a good car (3/5) → Mark picks a faulty car (2/4) → Mark loses; a good car (2/4) → no loser.]
Pr(Andrew loses) = Pr(Andrew picks a faulty car) = 2/6
Pr(John loses) = Pr(Andrew picks a good car and John picks a faulty car) = (4/6)(2/5) = 4/15
Pr(Mark loses) = Pr(Andrew and John pick good cars, and Mark picks a faulty car) = (4/6)(3/5)(2/4) = 3/15
In probability notation we get:
A is the event “Andrew loses”; Ā is the event “Andrew does not lose”.
Pr(A) = 2/6 and Pr(Ā) = 4/6
J is the event “John loses”; J̄ is the event “John does not lose”.
It is not Pr(J) = 2/6. Instead, Pr(J) is revised using the extra information:
Pr(J) = Pr(J|Ā) Pr(Ā) = (2/5)(4/6) = 4/15
and so on.
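The dependent-stage calculation can be written with exact fractions (a sketch of the car-game example above):

```python
# Car game: 6 cars, 2 with faulty brakes, drawn without replacement.
from fractions import Fraction as F

pr_andrew = F(2, 6)                        # Andrew picks a faulty car
pr_john = F(4, 6) * F(2, 5)                # Andrew good, then John faulty
pr_mark = F(4, 6) * F(3, 5) * F(2, 4)      # two good cars, then Mark faulty
pr_no_loser = F(4, 6) * F(3, 5) * F(2, 4)  # all three pick good cars

# the four outcomes are exhaustive and mutually exclusive
assert pr_andrew + pr_john + pr_mark + pr_no_loser == 1
```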
Example: Screening Programmes<br />
A patient with certain symptoms consulted her<br />
doctor to be checked for a cancer. The patient<br />
undergoes a biopsy. With this test there is a<br />
probability <strong>of</strong> 0.90 that a woman with the cancer<br />
shows a positive biopsy, <strong>and</strong> a probability <strong>of</strong> only<br />
0.001 that a healthy woman incorrectly shows a<br />
positive biopsy.<br />
Historical information also suggests that 1 in<br />
10,000 women have the cancer. [This is the<br />
prevalence <strong>of</strong> the cancer in the population.]<br />
Find the probability that a woman has the cancer<br />
given the biopsy says she does.<br />
(Essentially the problem is to decide the ability <strong>of</strong><br />
the biopsy to diagnose true patient status. The<br />
principle applies to breast <strong>and</strong> cervical cancer in<br />
New Zeal<strong>and</strong>.)<br />
Solution: A is event “woman has the cancer”<br />
B is event “biopsy is positive” (indicating cancer)<br />
Pr(A) = 0.0001 (the disease prevalence)
Pr(B|A) = 0.90 (a conditional probability)
Pr(B|Ā) = 0.001 (Ā is the complement of A)
The problem is to find Pr(A|B).
[Tree diagram:
A, Pr(A) = 0.0001: B, Pr(B|A) = 0.90 → biopsy +ve (true positive); B̄, Pr(B̄|A) = 0.10 → biopsy −ve (false negative).
Ā, Pr(Ā) = 0.9999 (the complement): B, Pr(B|Ā) = 0.001 → biopsy +ve (false positive); B̄, Pr(B̄|Ā) = 0.999 → biopsy −ve (true negative).]
By the multiplication rule for dependent events,
Pr(True positive) = Pr(A ∩ B) = Pr(B|A) Pr(A) = 0.90 × 0.0001 = 0.00009 (nine out of 100 000 show a true positive)
Pr(False negative) = Pr(B̄|A) Pr(A) = 0.10 × 0.0001 = 0.00001
Pr(False positive) = Pr(B|Ā) Pr(Ā) = 0.001 × 0.9999 = 0.00100 (100 out of 100 000 show a false positive)
Pr(True negative) = Pr(B̄|Ā) Pr(Ā) = 0.999 × 0.9999 = 0.99890
Pr(Test positive) = Pr(B) = 0.00009 + 0.00100 = 0.00109 (109 out of 100 000 show a positive test)
Therefore,
Pr(A|B) = Pr(A ∩ B) / Pr(B) = 0.00009 / (0.00009 + 0.00100) = 0.00009 / 0.00109 = 0.083
(nine of the 109 with a positive biopsy have the cancer)
Conclusion: Only 8.3% <strong>of</strong> those women<br />
identified as having the disease actually do.<br />
(This is not at all what we would expect <strong>and</strong> is<br />
rather unsatisfactory.)<br />
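The biopsy calculation is an application of Bayes' theorem; a brief Python sketch using the rates above:

```python
# Positive predictive value of the biopsy via Bayes' theorem.
prevalence = 0.0001  # Pr(A): woman has the cancer
sensitivity = 0.90   # Pr(B | A)
false_pos = 0.001    # Pr(B | not A)

pr_pos = sensitivity * prevalence + false_pos * (1 - prevalence)  # Pr(B)
ppv = sensitivity * prevalence / pr_pos                           # Pr(A | B)
```

Even with a test this specific, the tiny prevalence drives the positive predictive value down to about 8%.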
1. Pr(B|A) is called the sensitivity of the test (the probability that a person with the disease returns a positive result, or the proportion of positives that are correctly identified).
2. Pr(B̄|Ā) is called the specificity of the test (the proportion of negatives that are correctly identified by the test).
3. From a practical point of view, sensitivity and specificity alone are not helpful, as the point of diagnostic testing is to make a diagnosis; i.e. we need to know the probability that the test gives the correct diagnosis, whether it is positive or negative. That is Pr(A|B), not Pr(B|A).
4. Pr(A|B) is the positive predictive value (the proportion of patients with positive test results who are correctly diagnosed).
5. The negative predictive value is the proportion of patients with negative test results who are correctly diagnosed, i.e. Pr(Ā|B̄).
Example: A patient consulted his GP because<br />
he had intermittent chest pain. The description<br />
<strong>of</strong> such pain is known to suggest a patient has<br />
heart disease with a probability <strong>of</strong> 0.48. The<br />
patient took an ECG test which has a sensitivity<br />
<strong>of</strong> 0.90 <strong>and</strong> a specificity <strong>of</strong> 0.84. The patient<br />
returns a positive ECG. Now find the<br />
probability he has heart disease in light <strong>of</strong> this<br />
additional information. Also find the positive<br />
<strong>and</strong> negative predictive values.<br />
Solution: H is the event “patient has heart disease”; T is the event “ECG test is positive”.
[Tree diagram:
H, Pr(H) = 0.48: T (sensitivity 0.90) → (0.90)(0.48) = 0.4320; T̄ (0.10) → (0.10)(0.48) = 0.0480.
H̄, Pr(H̄) = 0.52: T (0.16) → (0.16)(0.52) = 0.0832; T̄ (specificity 0.84) → (0.84)(0.52) = 0.4368.]
Pr(T) = 0.4320 + 0.0832 = 0.5152
Pr(H|T) = 0.4320/0.5152 = 0.839
Notice how the probability <strong>of</strong> heart disease has<br />
been revised up from 0.48 to 0.839 as a result <strong>of</strong><br />
the test.<br />
Positive predictive value = 0.839<br />
Pr(Test negative) = 0.0480 + 0.4368 = 0.4848<br />
Negative predictive value = 0.4368/0.4848 = 0.901<br />
Example<br />
Like swine flu’ today, about six years ago SARS was a threat to world health. In the early days<br />
<strong>of</strong> the SARS epidemic emergency measures were put in place by the World Health<br />
Organisation in an attempt to control the spread <strong>of</strong> SARS <strong>and</strong> to identify the condition. But no<br />
adequate screening tests existed to identify the condition when it first appeared in Hong Kong.<br />
A study was carried out in the early days to evaluate a WHO criteria for identifying patients<br />
with SARS in the SARS screening clinic in Hong Kong. Of 556 consecutive clinic attendees,<br />
97 were confirmed with SARS. Of these 97 patients with confirmed SARS, 25 met the WHO<br />
criteria for suspected SARS. Of the 459 patients in whom SARS was not confirmed, 438 were<br />
negative according to the WHO criteria.<br />
(a) Find the prevalence of confirmed SARS at the clinic (i.e. the proportion with SARS). (1 mark)
(b) Estimate the sensitivity and specificity of the WHO test from the numbers above. (2 marks)
(c) Estimate the probability that the WHO test produces a positive result. (1 mark)
(d) Estimate the positive predictive value of the test. (1 mark)
(e) Estimate the negative predictive value of the test. (1 mark)
(f) How would the positive predictive value of the test be affected if the prevalence of SARS among clinic attendees were to decrease? (1 mark)
WHO Result   SARS: Yes   SARS: No   Total
Positive     25          [21]       46
Negative     [72]        438        510
Total        97          459        556
(bracketed counts are obtained by subtraction)
(a) Prevalence = 97/556 = 0.174<br />
(b) Sensitivity = 25/97 = 0.258; specificity = 438/459 = 0.954<br />
[Tree diagram:
S, Pr(S) = 0.174: T+ (0.258); T− (0.742).
S̄, Pr(S̄) = 0.826: T+ (0.046); T− (0.954).]
(c) Pr (T + ) = (0.174)(0.258) + (0.826)(0.046)<br />
= 0.0449 + 0.0380<br />
= 0.083<br />
(d) Positive predictive value = 0.045/0.083 = 0.542<br />
(e) Pr(T – ) = (0.174)(0.742) + (0.826)(0.954)<br />
= 0.917<br />
Negative predictive value = 0.788/0.917 = 0.859<br />
(f) The positive predictive value will decrease.<br />
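The same quantities can be computed directly from the 2 × 2 counts (a sketch; the answers from raw counts differ very slightly from the tree values above, which use rounded branch probabilities):

```python
# SARS example: screening-test measures from the 2x2 table.
tp, fp = 25, 21    # WHO positive: SARS confirmed / not confirmed
fn, tn = 72, 438   # WHO negative: SARS confirmed / not confirmed
n = tp + fp + fn + tn

prevalence = (tp + fn) / n     # 97/556
sensitivity = tp / (tp + fn)   # 25/97
specificity = tn / (fp + tn)   # 438/459
ppv = tp / (tp + fp)           # 25/46
npv = tn / (fn + tn)           # 438/510
```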
Example: Sensitive Survey Questions.<br />
This is an important way <strong>of</strong> gaining information<br />
on sensitive or controversial issues.<br />
The question is: do you have, or have you ever had, a sexually transmitted disease (STD)?
Asked directly, it is unlikely a truthful response, or any response, will be given.
In a mail survey of 268 young people, five said they had had an STD.
Probability = 5/268 = 0.019 (or 19 per 1000)
Instead, proceed as follows:<br />
1. Roll a die, allowing no one to see the outcome.
2. Toss a fair coin.
3. If the die shows “1”, answer truthfully the question: “Have you thrown a head?”
4. If the die shows 2, 3, 4, 5 or 6, answer truthfully the question:
“Have you ever had a sexually transmitted disease?”
A tree diagram summarises this procedure where<br />
θ is the proportion <strong>of</strong> response “YES” to the STD<br />
question.<br />
[Tree diagram:
Roll die → “1” (1/6): toss coin (1/2 each) → “Head, Yes” 1/12; “Head, No” 1/12.
Roll die → “2 to 6” (5/6): STD question → “Yes” 5θ/6; “No” 5(1 − θ)/6.]
Pr(Yes) = 1/12 + 5θ/6
There were 54 “Yes” and 214 “No” responses from 268 people.
Estimate Pr(Yes) = 54/268 = 0.2015
∴ 0.2015 = 1/12 + 5θ/6
∴ 12(0.2015) = 1 + 10θ
∴ 2.418 – 1 = 10θ<br />
∴ 1.418 = 10θ<br />
∴ θ = 0.1418<br />
or 142 per 1000 have STD<br />
(compare 19 per 1000 previously)<br />
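The randomized-response estimate is obtained by solving Pr(Yes) = 1/12 + 5θ/6 for θ; in Python:

```python
# Randomized response: back out theta from the observed "Yes" rate.
p_yes = 54 / 268               # observed proportion answering "Yes"
theta = (12 * p_yes - 1) / 10  # rearranged from p_yes = 1/12 + 5*theta/6
```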
Probability Distribution <strong>and</strong> R<strong>and</strong>om Variables<br />
A r<strong>and</strong>om variable has values which depend on<br />
the outcome <strong>of</strong> a r<strong>and</strong>om experiment. R<strong>and</strong>om<br />
variables are labelled with a capital letter (X<br />
say). They can be discrete or continuous. The<br />
number <strong>of</strong> locations with tuatara on Stephens<br />
Isl<strong>and</strong> is discrete (possible values 0, 1, 2, 3)<br />
while cholesterol levels are continuous.<br />
Example: (Tuatara again) Three locations are<br />
visited on 50 occasions in the tuatara study <strong>and</strong><br />
the number <strong>of</strong> locations with tuatara found are<br />
recorded each time. Results follow along with<br />
values calculated previously in the fourth column.<br />
X = x_j (tuatara at locations)   f_j (frequency)   f_j/n (rel. freq)   Pr(X = x_j) (as n becomes large)
0                                8                 0.16                0.216
1                                22                0.44                0.432
2                                15                0.30                0.288
3                                5                 0.10                0.064
Total                            n = 50            1.00                1.000
X is the r<strong>and</strong>om variable. X is discrete here<br />
because all possible outcomes x j can be counted.<br />
The 50 results in the study are summarised by the<br />
relative frequencies.<br />
If many trials (n large) are carried out, the relative<br />
frequencies <strong>of</strong> each x j stabilise to give<br />
probabilities<br />
Pr(X = x j )<br />
for each outcome. Together these probabilities<br />
form the probability distribution rather than a<br />
relative frequency distribution.<br />
NB (1) Σ_{j=1}^{4} Pr(X = x_j) = 1, as for relative frequencies.
(2) All probabilities are between 0 and 1.
Describing Probability Distributions<br />
Let X be a symbol for a probability distribution<br />
<strong>and</strong> let μ X be the mean <strong>of</strong> X. (Assume X is<br />
discrete for the moment.)<br />
For a sample <strong>of</strong> n values from the distribution<br />
suppose each possible x j occurs f j times <strong>and</strong><br />
there are k possible values <strong>of</strong> j. Then the sample<br />
mean is
x̄ = (1/n) Σ_{j=1}^{k} x_j f_j = Σ_{j=1}^{k} x_j (f_j/n)
As the sample size becomes large, the relative frequencies become probabilities and the mean of the probability distribution X is μ_X, where
μ_X = Σ_{j=1}^{k} x_j Pr(X = x_j)
A similar argument shows that the variance σ²_X of the probability distribution X is
σ²_X = Σ_{j=1}^{k} (x_j − μ_X)² Pr(X = x_j)
Take the square root to get the st<strong>and</strong>ard deviation<br />
<strong>of</strong> the probability distribution σ X .<br />
Note: The sample mean x <strong>and</strong> variance s 2 are<br />
estimates for population mean μ X <strong>and</strong> variance<br />
σ X 2 .<br />
Ex: Find the mean <strong>and</strong> st<strong>and</strong>ard deviation <strong>of</strong> the<br />
distribution <strong>of</strong> the number <strong>of</strong> locations at which<br />
tuatara are found.<br />
X = x_j   Pr(X = x_j)   x_j Pr(X = x_j)   (x_j − μ_X)²        (x_j − μ_X)² Pr(X = x_j)
0         0.216         0.000             (0 − 1.2)² = 1.44   0.311
1         0.432         0.432             0.04                0.017
2         0.288         0.576             0.64                0.184
3         0.064         0.192             3.24                0.207
Total     1.000         1.200                                 0.720
μ_X = Σ_{j=1}^{4} x_j Pr(X = x_j) = 1.20
On average just over one location per visit will have tuatara present.
σ²_X = Σ_{j=1}^{4} (x_j − μ_X)² Pr(X = x_j) = 0.72 and σ_X = 0.85
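The mean and standard deviation of any discrete distribution follow the same two sums; a sketch using the tuatara distribution:

```python
# Mean, variance and sd of a discrete probability distribution.
import math

dist = {0: 0.216, 1: 0.432, 2: 0.288, 3: 0.064}

mu = sum(x * p for x, p in dist.items())               # E[X]
var = sum((x - mu) ** 2 * p for x, p in dist.items())  # Var[X]
sd = math.sqrt(var)
```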
Example: A person infected with a disease can<br />
pass it on to others. Let the r<strong>and</strong>om variable, X,<br />
be the number <strong>of</strong> others infected by this person.<br />
X is found to have the following probability<br />
distribution.<br />
Find μ_X and σ²_X.
X = x_j   Pr(X = x_j)
0         0.10
1         0.25
2         0.40
3         0.20
4         0.05
μ_X = 0(0.10) + 1(0.25) + 2(0.40) + 3(0.20) + 4(0.05) = 1.85
σ²_X = (0 − 1.85)²(0.10) + (1 − 1.85)²(0.25) + (2 − 1.85)²(0.40) + (3 − 1.85)²(0.20) + (4 − 1.85)²(0.05) = 1.0275
Also, σ_X = √1.0275 = 1.0137
Rules for combining r<strong>and</strong>om variables<br />
Often we are interested in the mean <strong>and</strong><br />
variance <strong>of</strong> a rescaled r<strong>and</strong>om variable, or in the<br />
mean <strong>and</strong> variance <strong>of</strong> sums (or differences) <strong>of</strong><br />
r<strong>and</strong>om variables. The following properties are<br />
true <strong>of</strong> all numerical r<strong>and</strong>om variables, discrete<br />
or continuous.<br />
If X <strong>and</strong> Y are independent r<strong>and</strong>om variables<br />
<strong>and</strong> a <strong>and</strong> b are constants, then:<br />
1. The mean of the new random variable a + bX is
μ_{a+bX} = a + bμ_X
2. The variance of a + bX is
σ²_{a+bX} = b²σ²_X
3. The mean of the new random variable aX + bY is
μ_{aX+bY} = aμ_X + bμ_Y
4. The variance of aX + bY is
σ²_{aX+bY} = a²σ²_X + b²σ²_Y
Note: Properties 3 and 4 tell us that
μ_{X+Y} = μ_X + μ_Y and σ²_{X+Y} = σ²_X + σ²_Y
Also, μ_{X−Y} = μ_X − μ_Y and σ²_{X−Y} = σ²_X + σ²_Y
Example: Temperatures used to be recorded in<br />
degrees Fahrenheit. Suppose a r<strong>and</strong>om variable F<br />
measures January temperature (in Fahrenheit) in<br />
Dunedin <strong>and</strong> daily maximum summer temperatures<br />
have a mean <strong>of</strong> 70°F with a st<strong>and</strong>ard deviation <strong>of</strong><br />
5°F.<br />
Use the conversion formula C = (5/9)(F − 32) to find the mean and standard deviation for the temperatures in degrees Celsius.
Solution:
We will let the random variable C represent the temperature in Celsius. The equation C = (5/9)(F − 32) may be rearranged by expanding the brackets to become
C = (5/9)F − (5/9)(32), or C = (5/9)F − 160/9
We have μ_{a+bX} = a + bμ_X, with a = −160/9 and b = 5/9.
μ_C = a + bμ_F = −160/9 + (5/9)(70) = 21.1 °C
We also have σ²_{a+bX} = b²σ²_X, so
σ²_C = (5/9)² × 5² = (25/81) × 25 = 7.716
Therefore σ_C = √7.716 = 2.78 °C
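The linear-transformation rules used above can be checked numerically (a sketch with the Dunedin figures):

```python
# Mean and sd of C = a + b*F for a linear change of units.
import math

mu_F, sd_F = 70.0, 5.0  # Fahrenheit mean and sd
a, b = -160 / 9, 5 / 9  # C = a + b*F

mu_C = a + b * mu_F         # rule 1: mean shifts and rescales
var_C = b ** 2 * sd_F ** 2  # rule 2: variance picks up b^2 only
sd_C = math.sqrt(var_C)
```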
Example: What is the difference between T = X + X + X and T = 3X?
Note: These results can be extended to several<br />
r<strong>and</strong>om variables.<br />
Example: (Infected person continued)<br />
Three people living in separate areas have the<br />
disease. R<strong>and</strong>om variables X 1 , X 2 , X 3 are<br />
numbers <strong>of</strong> other people infected by them. Find<br />
mean <strong>and</strong> variance <strong>of</strong> total number infected by<br />
the original three.<br />
Total T = X_1 + X_2 + X_3 (X_1, X_2, X_3 assumed independent as the people are in different areas)
μ_T = μ_{X_1} + μ_{X_2} + μ_{X_3} = 1.85 + 1.85 + 1.85 = 5.55
σ²_T = σ²_{X_1} + σ²_{X_2} + σ²_{X_3} = 1.0275 + 1.0275 + 1.0275 = 3.0825
Note: Do not say T = 3X_1. Although μ_T = 3μ_{X_1} = 5.55,
σ²_T = 9σ²_{X_1} = 9.2475 ≠ 3.0825
This is a very common source of error.
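The distinction between summing independent copies and rescaling one copy shows up only in the variance; a short sketch:

```python
# T = X1 + X2 + X3 (independent copies) versus T = 3*X1 (one rescaled copy).
mu_X, var_X = 1.85, 1.0275

mu_sum = 3 * mu_X    # means agree either way: 5.55
var_sum = 3 * var_X  # independent variances add: 3.0825

mu_scaled = 3 * mu_X
var_scaled = 3 ** 2 * var_X  # rescaling multiplies variance by 9: 9.2475
```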
SECTION 4<br />
This section introduces both the Binomial <strong>and</strong> Normal Distributions which model many<br />
phenomena arising in the real world. Consequently the distributions allow us to answer some<br />
important <strong>and</strong> relevant questions.<br />
The Binomial Distribution: Definition, mean <strong>and</strong> variance<br />
The Binomial Table: Examples<br />
The Normal Distribution: Definition<br />
St<strong>and</strong>ard Normal Distribution <strong>and</strong> Table<br />
General Normal Distribution<br />
Normal Approximation to the Binomial<br />
Transforming Data to Normal<br />
121<br />
Section 4
The Binomial Distribution<br />
The binomial distribution arises when<br />
investigating proportions. e.g. the proportion <strong>of</strong><br />
adult population with diabetes. Each individual<br />
has or does not have diabetes.<br />
Let Y be the r<strong>and</strong>om variable for an individual<br />
outcome <strong>of</strong> a person in the population. Two<br />
outcomes occur, namely Y = 1 (e.g. diabetes<br />
present or success) <strong>and</strong> Y = 0 (e.g. diabetes not<br />
present or failure). The parameter π represents<br />
the unknown proportion <strong>of</strong> 1’s occurring.<br />
The probability distribution <strong>of</strong> Y is<br />
Y =<br />
y Pr(Y =<br />
j<br />
y j<br />
)<br />
1 π “success”<br />
0 1 – π “failure”<br />
Then μ Y = 1(π) + 0(1 – π) = π<br />
σ = (1 – π) 2 π + (0 – π) 2 (1 – π)<br />
2<br />
Y<br />
= (1 – π) [π(1 – π) + π 2 ]<br />
= π(1 – π)<br />
Now suppose that we take a sample of size n from the underlying population. What is the distribution of the number of successes?
The total number of successes is X, where
X = Y_1 + Y_2 + Y_3 + … + Y_n
with all the Y_j independent of each other.
∴ μ_X = π + π + π + … + π = nπ
σ²_X = σ²_{Y_1} + σ²_{Y_2} + … + σ²_{Y_n}
     = π(1 − π) + π(1 − π) + … + π(1 − π)
     = nπ(1 − π)
X is said to have a binomial distribution, with
μ_X = nπ and σ²_X = nπ(1 − π)
where π is the parameter giving Pr(“success”) or Pr(diabetes present).
The mean number <strong>of</strong> successes is nπ <strong>and</strong> the<br />
variance <strong>of</strong> the number <strong>of</strong> successes is nπ(1 – π)<br />
The binomial distribution results from n trials<br />
involving independent binary outcomes.<br />
e.g. melanoma (Yes/No)<br />
Smoking (smokes/does not smoke)<br />
Diabetes (present/absent)<br />
Tuatara (present/absent)<br />
Example: X = number <strong>of</strong> locations in group <strong>of</strong><br />
n that have tuatara present.<br />
It is known that Pr(success) = π = 0.40 <strong>and</strong><br />
Pr(failure) = 1 – π = 0.60.<br />
Each location is assumed independent <strong>of</strong> other<br />
locations.<br />
Also assume the probability <strong>of</strong> tuatara being<br />
present remains constant at each location.<br />
Notes 1. If these conditions are met, if n (the<br />
number <strong>of</strong> trials) <strong>and</strong> π (the probability <strong>of</strong><br />
success) are known, all probabilities in the<br />
distribution are known exactly.<br />
2. n <strong>and</strong> π are said to be the parameters <strong>of</strong> the<br />
distribution.<br />
3. The binomial distribution requires<br />
independent trials <strong>and</strong> a probability <strong>of</strong><br />
success which remains constant for each<br />
trial.<br />
4. We use binomial tables to approximate<br />
these binomial probabilities for values <strong>of</strong> n<br />
up to 20. (See table section <strong>of</strong> these notes.)<br />
For example suppose n = 8 <strong>and</strong> π = 0.40 are the<br />
two defining parameters.<br />
π<br />
n x 0.05 0.10 0.15 … 0.40 0.50<br />
8 0 0.6634 -- -- 0.0160 0.0039<br />
1 0.2793 -- -- 0.0896 0.0312<br />
2 0.0515 -- -- 0.2090 0.1094<br />
3 0.0054 -- -- 0.2787 0.2187<br />
4 0.0004 -- -- … 0.2322 0.2734<br />
5 0.0000 -- -- 0.1239 0.2188<br />
6 0.0000 -- -- 0.0413 0.1094<br />
7 0.0000 -- -- 0.0079 0.0313<br />
8 0.0000 -- -- 0.0007 0.0039<br />
9 0 -- --<br />
1 -- … --<br />
2 -- --<br />
3 -- --<br />
etc<br />
Notice that Pr(X = 3) = 0.2787 for π = 0.40 <strong>and</strong> n = 8<br />
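The table entries can be reproduced from the binomial probability function (a sketch; the notes themselves read these values from printed tables):

```python
# Reproduce binomial-table entries from the probability function.
from math import comb

def binom_pmf(n, k, pi):
    # Pr(X = k) for a binomial with parameters n and pi
    return comb(n, k) * pi ** k * (1 - pi) ** (n - k)

# mean of the distribution matches n * pi
mean = sum(k * binom_pmf(8, k, 0.40) for k in range(9))
```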
Example: Records show that twenty percent <strong>of</strong><br />
violin pupils are known to develop OOS during<br />
the course <strong>of</strong> their training. Define X to be the<br />
number <strong>of</strong> violin pupils out <strong>of</strong> 9 who develop<br />
OOS during their training.<br />
(a) Find the probability distribution of X.
(b) What is the probability that none of the 9 pupils develop OOS?
(c) What is the probability that more than 4 out of the 9 pupils develop OOS?
(d) In 2005 a certain violin teacher had 9 new pupils and 5 developed OOS during training. What conclusion would you draw about the training methods of this teacher?
Solution<br />
(a) Here X is binomial with n = 9; π = 0.20<br />
(<strong>and</strong> assume the pupils are all independent<br />
<strong>of</strong> each other). The binomial table gives<br />
n x π = 0.20<br />
9 0 0.1342 = Pr(X = 0)<br />
1 0.3020 = Pr(X = 1)<br />
2 0.3020 etc<br />
3 0.1762<br />
4 0.0661<br />
5 0.0165<br />
6 0.0028<br />
7 0.0003<br />
8 0.0000<br />
9 0.0000<br />
(b) Pr(X = 0) = 0.1342<br />
(c) Pr(X > 4) = Pr(X = 5) + Pr(X = 6)<br />
+ Pr(X = 7) + Pr(X = 8) + Pr(X = 9)<br />
= 0.0196<br />
(d) It would be rare or unusual (probability = 0.0196)<br />
for more than four violin pupils to develop OOS<br />
if 20% is the overall percentage known to develop<br />
OOS historically. We conclude the training<br />
methods <strong>of</strong> this teacher are likely to result in a<br />
greater occurrence <strong>of</strong> OOS among pupils.<br />
If the violin teacher has no effect on OOS,<br />
π will remain on 0.20 <strong>and</strong> the probability<br />
that more than four <strong>of</strong> the pupils will<br />
develop OOS is 0.0196.<br />
This is viewed (by convention) to be a small<br />
probability indicating a rare or unusual event<br />
has arisen if the value <strong>of</strong> π = 0.20 still holds<br />
for the pupils <strong>of</strong> this teacher.<br />
Either π = 0.20 is unchanged for this teacher<br />
<strong>and</strong> a rare event has been observed<br />
or the teacher is at fault and more pupils develop OOS. This second alternative is usually taken, and therefore we conclude that this teacher's pupils have a higher incidence of OOS.
Notes
1. It is the size of the probability of the observed “event”, or one more extreme and convincing, which leads to the conclusion (here, “more than 4”).
2. 0.0196 is a chance <strong>of</strong> just under 2 per 100<br />
(2%).<br />
3. A probability less than 0.05 is (by<br />
convention) taken to imply an event is rare or<br />
unlikely to occur.<br />
4. A probability above 0.05 <strong>of</strong>ten means an<br />
event is not unusual. If the violin teacher had<br />
produced such a probability then the teaching<br />
would not be at all unusual in relation to<br />
incidence <strong>of</strong> OOS.<br />
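The tail probability used in this argument can be computed directly (a sketch; the notes read it from the binomial table):

```python
# Pr(X > 4) for n = 9 pupils and pi = 0.20.
from math import comb

def binom_pmf(n, k, pi):
    return comb(n, k) * pi ** k * (1 - pi) ** (n - k)

tail = sum(binom_pmf(9, k, 0.20) for k in range(5, 10))  # Pr(X >= 5)
```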
Binomial Examples and Normal Distribution

Example: (artificial data) A sociological report suggests that 75% of Maori children under 18 live with both parents. A random sample of 20 Maori children is selected, and X is the binomial random variable for the number of these 20 who live with both parents.
(a) Define the parameters of the distribution of X.
(b) Find Pr(X = 15).
(c) Find the probability that 11 or fewer live with both parents (i.e. Pr(X ≤ 11)).
(d) A random sample of 20 New Zealand Caucasian children had only 11 living with both parents. Does this result provide any evidence to support the claim that 75% of NZ Caucasian children live with both parents?
Solution
(a) X is binomial with n = 20, π = 0.75.
(b) The problem is that 0.75 does not occur in the binomial table directly. Whenever π > 0.50, we replace the event "success" by its complement "failure", because the binomial table does not have values greater than 0.50. In this case, "failure" is the event "child does not live with both parents". For easy analysis, define the new random variable
Y = number not living with both parents.
Y is binomial with n = 20 and new π′ = 0.25 [here y = n − x and π′ = 1 − π].
∴ Pr(X = 15 given π = 0.75) = Pr(Y = 5 given π′ = 0.25) = 0.2023 from the table.
(c) Pr(X ≤ 11) = Pr(Y ≥ 9)
 = Pr(Y = 9) + Pr(Y = 10) + … + Pr(Y = 20)
 = 0.0271 + 0.0099 + … + 0.0000
 = 0.0410
(d) No. In fact there is evidence it is less than 75% for NZ Caucasian children. If π = 0.75 is assumed for Caucasian families, then the probability of observing 11 or fewer living with both parents is, by our convention, small (less than 0.05), providing evidence against 75%. Hence reject the claim that π = 0.75 for Caucasian families and conclude fewer live with both parents (because 11 is in the direction of fewer rather than more).

Note: Suppose instead 12 out of 20 of the NZ Caucasian children were living with both parents. Then Pr(X ≤ 12) = Pr(Y ≥ 8) = 0.1019 if π = 0.75 (meaning π′ = 0.25). This probability is not small, and now there is no evidence from our data to suppose the situation is any different among Caucasian families.
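With a computer, the complement trick in part (b) is unnecessary: π = 0.75 can be used directly. A minimal Python sketch, standard library only, reproducing parts (b)–(d); small differences from the table values reflect four-decimal table rounding.

```python
from math import comb

def pmf(k, n, p):
    # Binomial probability Pr(X = k)
    return comb(n, k) * p**k * (1 - p)**(n - k)

def cdf(k, n, p):
    # Pr(X <= k)
    return sum(pmf(j, n, p) for j in range(k + 1))

n, pi = 20, 0.75
print(round(pmf(15, n, pi), 4))  # Pr(X = 15): 0.2023
print(round(cdf(11, n, pi), 4))  # Pr(X <= 11): 0.0409 (table sum gave 0.0410)
print(round(cdf(12, n, pi), 4))  # Pr(X <= 12): 0.1018 (table: 0.1019); not small
```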
Example (Revision)
The standard drug for treating a cancer is claimed to halve the tumor size in 30% of all patients treated. Suppose X is the binomial random variable for the number of patients in a sample of seven who have their tumor size halved.
(a) List the conditions which must be met if X is binomial.
Patients independent. Two outcomes only. Constant probability that the tumor size is halved over all the patients.
(b) Using the appropriate table, write down the distribution of probabilities for the number (X) who have their tumor size halved.

x_j    Pr(X = x_j)
0      0.0824
1      0.2471
2      0.3177
3      0.2269
4      0.0972
5      0.0250
6      0.0036
7      0.0002
(c) Write down the probability that three of the patients have their tumor size halved.
Probability = 0.2269
(d) Find the probability that three or more of the patients have their tumor size halved.
Probability = 0.3529
(e) In a pilot study in Auckland, three out of seven patients given a new drug had their tumor size halved. What conclusion, if any, can be drawn about the new drug? Explain how you reach your conclusion.
Conclusion: There is no reason to suppose the new drug is any different to the standard.
Explanation: The probability of three or more is 0.3529, which is large, meaning the result with the new drug is consistent with the 30% before.
Note: This study involves a very small number of patients and will be reconsidered later with a larger sample.
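The table in part (b) and the tail in part (d) can be checked directly; a short Python sketch using only the standard library:

```python
from math import comb

n, pi = 7, 0.30  # seven patients; 30% success rate claimed for the standard drug

def pmf(k):
    # Binomial probability Pr(X = k)
    return comb(n, k) * pi**k * (1 - pi)**(n - k)

for k in range(n + 1):
    print(k, round(pmf(k), 4))  # reproduces the table in part (b)

p_three_or_more = sum(pmf(k) for k in range(3, n + 1))
print(round(p_three_or_more, 4))  # 0.3529, as in part (d)
```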
The Normal Distribution
This distribution will allow us to calculate probabilities associated with observed sample results when we are dealing with continuous outcome measures and sample means. First we develop properties of the normal distribution.
A relative frequency histogram tends to a probability distribution as the sample size n becomes large.
[Figure: a relative frequency HISTOGRAM becomes a smooth DISTRIBUTION curve f(X) as n increases and the class width decreases. In the histogram, the shaded area between a and b is the proportion of observations between a and b (this represents a sample with a small number of individuals); under the curve, the shaded area between a and b is the probability of a value between a and b (this represents a population with a very large number of individuals).]
The resulting curve is known as a probability function (or probability density function) and is described by a curve y = f(X). The area under this curve, say between two points X = a and X = b, is the probability Pr(a < X < b). X is a random variable taking values on a continuous scale.

We have seen several sets of sample data which produce symmetrical, bell-shaped histograms with a concentration of values at the centre and few values at the extremes (e.g. cholesterol levels in the pravastatin study). Such data are said to be collected from a normal distribution, or from a population of values which are normally distributed.

[Gauss, 1777-1855, first developed the equation of such a normal curve while observing the pattern in errors made while taking measurements in astronomy.]
[Figure: a normal curve Y = f(X), symmetric about the centre X = μ.]

The equation of such a normal curve is

f(X) = (1 / (σ√(2π))) e^(−½((X − μ)/σ)²)

where the parameter μ is the mean and the parameter σ is the standard deviation of the distribution (in practice, μ and σ will be estimated from sample data by the values x̄ and s).

Notes
1. The graph is symmetrical about the centre point denoted by μ.
2. The two parameters μ and σ completely define a normal distribution (recall that the parameters n and π define a binomial distribution).
Notation: X ∼ N(μ, σ²)
3. Increasing μ moves the curve but does not alter its shape.
[Figure: two normal curves with μ₂ > μ₁ and σ unchanged; the curve moves but keeps its shape.]

4. Increasing σ spreads the curve more widely about X = μ, but does not alter the centre of the distribution.

[Figure: two normal curves with σ₂ > σ₁ and μ unchanged.]

Both of the above could be normal distributions.

5. Areas under these curves can be found from tables. The table is based on what is known as the standard normal distribution, which has μ = 0 and σ = 1.
Normal distribution calculations
The Standard Normal Distribution (Z)
Z ∼ N(0, 1), i.e. Z is distributed with μ_Z = 0, σ_Z² = 1.
∴ f(Z) = (1/√(2π)) e^(−½Z²)
The shaded area between 0 and z under this curve is Pr(0 < Z < z) (see tables).

Extract from the standard normal table (body entries are Pr(0 < Z < z)):

z      .00     .01     .02     .03     .04     .05  …  .09
.0    .0000
.1
.2
.3
⋮
1.5
1.6                           0.4484  0.4495
1.7
⋮
3.0   0.4990
Some calculations:
1. Find Pr(0 < Z < 1.63).
From the table choose z = 1.63.
∴ Pr(0 < Z < 1.63) = 0.4484
Also, Pr(0 < Z < 1.64) = 0.4495
∴ Pr(0 < Z < 1.633) ≈ 0.4484 + (3/10)(0.0011) = 0.4487
[The final calculation need not be this accurate; 0.4484 would be accepted for our purposes using this table.]
2. Find Pr(Z > 1.64).
Pr(Z > 1.64) = 0.5 − Pr(0 < Z < 1.64)
 = 0.5 − 0.4495
 = 0.0505
3. Pr(1 < Z < 1.64) = Pr(0 < Z < 1.64) − Pr(0 < Z < 1)
 = 0.4495 − 0.3413
 = 0.1082
4. Pr(−1 < Z < 1.64) = Pr(0 < Z < 1.64) + Pr(−1 < Z < 0)
 = Pr(0 < Z < 1.64) + Pr(0 < Z < 1) by symmetry
 = 0.4495 + 0.3413
 = 0.7908
5. Pr(−1 < Z < 1) = 2 Pr(0 < Z < 1) = 2(0.3413) = 0.6826
 Pr(−2 < Z < 2) = 2 Pr(0 < Z < 2) = 0.9546
Since σ_Z = 1, a value z of Z is a count of the number of standard deviations to this point. Notice that approximately 68% of the area is within one and 95% within two standard deviations of the centre.
6. Find the value z above which 25% of the area lies.
Here, find a value close to 0.25 in the centre of the normal table, then read back to the margins.
Pr(0 < Z < 0.67) = 0.2486
Pr(0 < Z < 0.68) = 0.2517
Hence, z = 0.675 approximately.
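In software there is no need for the printed table: the standard normal CDF is available through the error function. A Python sketch, standard library only, reproducing calculations 1–6; the quantile in calculation 6 is found by bisection rather than by scanning the table.

```python
from math import erf, sqrt

def phi(z):
    # Standard normal cumulative distribution function Pr(Z < z)
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def table_area(z):
    # Pr(0 < Z < z), the quantity printed in the body of the table
    return phi(z) - 0.5

print(round(table_area(1.63), 4))       # calc 1: 0.4484
print(round(1 - phi(1.64), 4))          # calc 2: 0.0505
print(round(phi(1.64) - phi(1.0), 4))   # calc 3: 0.1082
print(round(phi(1.64) - phi(-1.0), 4))  # calc 4: 0.7908
print(round(phi(1.0) - phi(-1.0), 4))   # calc 5: 0.6827 (table rounding gives 0.6826)

# calc 6: the z leaving 25% of the area above it, i.e. phi(z) = 0.75,
# located by bisection
lo, hi = 0.0, 4.0
for _ in range(60):
    mid = 0.5 * (lo + hi)
    if phi(mid) < 0.75:
        lo = mid
    else:
        hi = mid
print(round(lo, 4))  # 0.6745, close to the 0.675 read from the table
```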
The General Normal Distribution (X)
X ∼ N(μ_X, σ_X²), say.
Areas under this curve cannot be found directly from the normal table, but X is related to the standard normal Z ∼ N(0, 1²) by

Z = (X − μ_X)/σ_X

Notes
1. The distribution X is said to be standardised when μ_X is subtracted and the result divided by σ_X.
2. Z is essentially the number of standard deviations (σ_X) from μ_X to a value x of X.
Some calculations
1. Pr(μ_X − σ_X < X < μ_X + σ_X)
 = Pr(−σ_X < X − μ_X < +σ_X)
 = Pr(−1 < (X − μ_X)/σ_X < +1)
 = Pr(−1 < Z < +1)
 = 2 Pr(0 < Z < 1) = 0.6826
[68.26% of the distribution lies within one standard deviation of the centre.]
2. In general,
Pr(a < X < b) = Pr(a − μ_X < X − μ_X < b − μ_X)
 = Pr((a − μ_X)/σ_X < (X − μ_X)/σ_X < (b − μ_X)/σ_X)
 = Pr((a − μ_X)/σ_X < Z < (b − μ_X)/σ_X)
[Figure: normal curve with shaded area between a and b, centre μ_X.]
Example: Assume that diastolic blood pressures for men aged 35-44 have a normal distribution with mean μ_X = 80 and standard deviation σ_X = 12.
(a) Find Pr(90 < X < 100).
(b) Find the percentage of men in this age range who are hypertensive (a level over 100).
Solution
(a) Pr(90 < X < 100) = Pr((90 − 80)/12 < Z < (100 − 80)/12)
 = Pr(0.833 < Z < 1.667)
 = Pr(0 < Z < 1.667) − Pr(0 < Z < 0.833)
 = 0.4525 − 0.2967
 = 0.1558
(b) X ∼ N(80, 144). Find Pr(X > 100).
Pr(X > 100) = Pr(Z > (100 − 80)/12)
 = Pr(Z > 1.67)
 = 0.5 − Pr(0 < Z < 1.67)
 = 0.5 − 0.4525
 = 0.0475
We expect 4.8% of men in this age group to be hypertensive.
(c) Find the diastolic blood pressure which is exceeded by 10% of men aged 35-44.
X ∼ N(80, 144)
[Figure: normal curve with area 0.40 between the centre and x, and the upper 0.10 beyond x, shown on both the original scale X and the standard scale Z.]
(It is helpful, initially, to sketch the standard scale as well as the original scale.)
From the standard normal table, find the value z which cuts off area 0.40 as shown. Reading to the margins from the value 0.40 in the centre of the table gives z = 1.282 (part way between 1.28 and 1.29).
Use z = (x − μ_X)/σ_X to get 1.282 = (x − 80)/12
∴ x = 80 + 12(1.282) = 95.38
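Parts (a)–(c) can be checked in software; a Python sketch using only the standard library, with the percentile again found by bisection rather than from the table. The small differences from the worked answers reflect two-decimal table rounding.

```python
from math import erf, sqrt

def phi(z):
    # Standard normal CDF Pr(Z < z)
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

mu, sigma = 80.0, 12.0  # diastolic blood pressure, men aged 35-44

# (a) Pr(90 < X < 100): standardise both limits
p_a = phi((100 - mu) / sigma) - phi((90 - mu) / sigma)
print(round(p_a, 4))  # 0.1545 (the two-decimal table gave 0.1558)

# (b) Pr(X > 100)
p_b = 1 - phi((100 - mu) / sigma)
print(round(p_b, 4))  # 0.0478 (table: 0.0475)

# (c) the level exceeded by 10% of men: solve phi(z) = 0.90 by bisection
lo, hi = 0.0, 4.0
for _ in range(60):
    mid = 0.5 * (lo + hi)
    if phi(mid) < 0.90:
        lo = mid
    else:
        hi = mid
x = mu + sigma * lo
print(round(x, 2))  # 95.38
```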
The Normal Approximation to the Binomial (n, π)
If a large sample is selected from a population of binary values (e.g. people with or without diabetes), probabilities of observed outcomes are found from the normal N(μ_X, σ_X²) distribution, where μ_X = nπ and σ_X = √(nπ(1 − π)).
[Figure: a binomial probability histogram with an overlaid normal curve centred at μ_X = nπ with standard deviation σ_X = √(nπ(1 − π)); the block at x extends from x − ½ to x + ½.]
The area of the shaded block (if x is an integer) is the binomial probability of obtaining x successes. This is approximately the area under the normal curve between x − ½ and x + ½.
∴ Pr(X = x) ≈ Pr( ((x − ½) − nπ)/√(nπ(1 − π)) < Z < ((x + ½) − nπ)/√(nπ(1 − π)) )

Notes:
1. This approximation is good provided n is large and π is not too close to 0 or 1. (Under these conditions the binomial distribution is reasonably close to symmetrical and hence the normal curve is seen to be a good approximation.)
2. The normal approximation is good if
nπ ± 3√(nπ(1 − π))
gives two values between 0 and n (the minimum and maximum values of the binomial counts), since almost all (about 99.7%) of the possible values should lie within these limits, indicating a near-symmetrical distribution.
We know Pr(blood group B) = 0.11.

n = 2, π = 0.11: nπ = 0.22 and √(nπ(1 − π)) = 0.44, hence nπ ± 3√(nπ(1 − π)) is 0.22 ± 3(0.44), which extends below 0.
[Figure 1: Binomial distribution of the number of people out of two in blood group B.]

n = 10, π = 0.11: nπ = 1.10 and √(nπ(1 − π)) = 0.99, hence 1.10 ± 3(0.99), which still extends below 0.
[Figure 2: Binomial distribution showing the number of subjects out of ten in blood group B based on the probability of being in blood group B.]

n = 100, π = 0.11: nπ = 11 and √(nπ(1 − π)) = 3.13, hence 11 ± 3(3.13), which lies entirely between 0 and 100, so the normal approximation is reasonable.
[Figure 3: Binomial distribution showing the number of subjects out of 100 in blood group B based on the probability of being in blood group B.]
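The rule of thumb illustrated by the three figures can be wrapped in a few lines; a Python sketch (standard library only) reproducing the three cases:

```python
from math import sqrt

def approx_ok(n, pi):
    # Rule of thumb: the normal approximation is reasonable when
    # n*pi +/- 3*sqrt(n*pi*(1 - pi)) stays inside the range 0 to n
    mean = n * pi
    sd = sqrt(n * pi * (1 - pi))
    return 0 <= mean - 3 * sd and mean + 3 * sd <= n

pi = 0.11  # Pr(blood group B)
for n in (2, 10, 100):
    mean = n * pi
    sd = sqrt(n * pi * (1 - pi))
    print(n, round(mean, 2), round(sd, 2), approx_ok(n, pi))
```

Only n = 100 passes, matching the increasingly symmetrical shapes in the figures.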
More on the Normal and Statistical Inference
Example: One in 40 adults on average develops a respiratory condition. A random sample of 400 workers in a certain occupation has 16 with the condition. Find the probability that 16 or more suffer from this condition in general. What conclusion would you draw about the possible effect of this occupation on the occurrence of the condition? Justify your answer.
Solution: Let X be the distribution of the number in a sample of 400 with the condition.
Then X ∼ Binomial(n = 400, π = 1/40)
μ_X = nπ = 10; σ_X = √(nπ(1 − π)) = 3.123
Since nπ ± 2√(nπ(1 − π)) is 10 ± 6.2, the normal approximation can be used.
Pr(X ≥ 16) ≈ Pr(Z > (15.5 − 10)/3.123)
 = Pr(Z > 1.761)
 = 0.0391
[Figure: the block for X = 16 extends from 15½ to 16½ on the X scale, so the area above 15½ is used.]
This is the p-value associated with a study result of 16. There is evidence of a higher incidence of the respiratory condition than expected in this occupation. (The probability 0.0391 is small, indicating that the event X = 16 or more is rare if π = 1/40 were to hold in this occupation.) Therefore, π is likely to be greater than 1/40 for workers in this occupation. (If this is the case, the event observed would not be unusual.)
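A sketch of the same p-value computation in Python (standard library only), with the continuity correction made explicit:

```python
from math import erf, sqrt

def phi(z):
    # Standard normal CDF Pr(Z < z)
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

n, pi = 400, 1 / 40
mean = n * pi                 # 10
sd = sqrt(n * pi * (1 - pi))  # 3.123

# Continuity correction: Pr(X >= 16) uses the area above 15.5
z = (15.5 - mean) / sd
p_value = 1 - phi(z)
print(round(z, 3), round(p_value, 4))  # 1.761 0.0391
```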
Example:
It is claimed cancer tumor size is halved in 30% of all patients using the current treatment. A new drug was used on 70 patients with the cancer. (Last week we looked at a case where the drug was tried on 7 patients with 3 successes.)
(a) Suppose Y is the binomial random variable for the number of patients who have their tumor size halved. Write down the values for the mean and standard deviation of Y.
μ_Y = nπ = 70(0.3) = 21
σ_Y = √(nπ(1 − π)) = √(21(0.7)) = 3.83
(b) In a study, thirty out of seventy patients (previously 3 out of 7) administered the standard drug experience a halving of their tumors. Find the probability that 30 or more out of 70 have their tumors halved.
Pr(Y ≥ 30) = Pr(Z > (29.5 − 21)/3.83)
 = Pr(Z > 2.22)
 = 0.5 − 0.4868
 = 0.0132
(c) In a study, 30 out of 70 patients in Auckland administered this new drug had their tumor size halved. What conclusion can be drawn about the new drug?
There is evidence that the new drug is more effective than the standard, because the probability of 30 or more successes is less than 0.05, indicating the observed 30 (or more) is not likely to occur unless the new drug has a beneficial effect.
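Because n = 70 is still small enough for exact arithmetic, the normal approximation can be compared with the exact binomial tail; a Python sketch, standard library only:

```python
from math import comb, erf, sqrt

def phi(z):
    # Standard normal CDF Pr(Z < z)
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

n, pi = 70, 0.30
mean, sd = n * pi, sqrt(n * pi * (1 - pi))  # 21 and 3.83

# Normal approximation with continuity correction
p_approx = 1 - phi((29.5 - mean) / sd)
print(round(p_approx, 4))  # 0.0133 (the table's z = 2.22 gave 0.0132)

# Exact binomial tail Pr(Y >= 30) for comparison
p_exact = sum(comb(n, k) * pi**k * (1 - pi)**(n - k) for k in range(30, n + 1))
print(round(p_exact, 4))
```

Both values are well below 0.05, so the conclusion in part (c) is unchanged.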
Transforming Data
If data being analysed are continuous but not normally distributed, it may be necessary to modify the data by transforming each value in order to create new values which are normal, and then to work with the transformed values. Typical transformations involve logs, square roots or reciprocals.
There are three reasons for transforming data.
1. Statistical procedures which we develop may only be valid if the data are approximately normal, and non-normal data can be converted to normal by transforming.
2. When comparing, for example, two samples of data (e.g. cholesterol levels after treatment with pravastatin or a control), the two groups should have similar standard deviations for some testing procedures to be valid. Transforming such data can produce two sets of values with similar standard deviations.
3. Transforming can also reduce the influence of outlying values on the results of an analysis.
(e.g. suppose most values are around 10 in a data set with one value of 100; then ln 10 = 2.30 and ln 100 = 4.61.)

EXAMPLE: A sample of 216 values of serum bilirubin (μmol/l) has mean 60.7 and standard deviation 77.9.
[Figure: histogram of the serum values in 216 patients with fitted normal distribution. The normal fit is terrible!]
The data are transformed by using the ln function. Mean = 3.547 and standard deviation = 1.03.
[Figure: histogram of the ln serum values with fitted normal distribution; this looks reasonably normal.]
Now suppose we want the range of values containing the central 95% of all patients. If data are normal, 95% of the population lie in
mean ± 1.96 (standard deviations)
[Figure: standard normal curve with area 0.475 on each side of the centre, between −1.96 and 1.96 (from the standard normal table).]
For the raw data, mean = 60.7 and s.d. = 77.9. Hence, the interval would be 60.7 ± 1.96(77.9), which cannot be correct: its lower limit is negative.
But the transformed data have approximately a normal distribution. For the transformed data, mean = 3.547 and standard deviation = 1.030. Hence, 95% of the patients will have ln(serum) levels in the range
3.547 ± 1.96(1.030)
That is, 95% of the distribution (or values) lies between
3.547 − 2.019 and 3.547 + 2.019, or 1.528 and 5.566.
Transforming back to the original scale,
e^1.528 = 4.61 and e^5.566 = 261.4
Hence, 95% of patients would have serum levels between 4.61 and 261.4 μmol/l.
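The back-transformation is a two-line calculation; a Python sketch using only the standard library:

```python
from math import exp

# 95% normal range on the ln scale, then back-transform to the original units
mean_ln, sd_ln = 3.547, 1.030
lower_ln = mean_ln - 1.96 * sd_ln
upper_ln = mean_ln + 1.96 * sd_ln
print(round(lower_ln, 3), round(upper_ln, 3))  # 1.528 5.566
print(exp(lower_ln), exp(upper_ln))  # about 4.61 and 261 micromol/l
```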
REVIEW EXERCISES
4. For the standard normal distribution find the following:
(a) The area below −1.58.
(b) The two points between which the central 85% of the area lies. (2 marks)
5. In the Framingham Study, serum cholesterol levels were measured for a large number of healthy males. The population was then followed for 16 years. At the end of this time, the men were divided into two groups: those who had developed coronary heart disease and those who had not. The distributions of the initial serum cholesterol levels for each group were found to be approximately normal. Among individuals who eventually developed coronary heart disease, the mean serum cholesterol level was μ_d = 244 mg/100 ml and the standard deviation was σ_d = 51 mg/100 ml; for those who did not develop the disease, the mean serum cholesterol level was μ_nd = 219 mg/100 ml and the standard deviation was σ_nd = 41 mg/100 ml.
(a) Suppose that an initial serum cholesterol level of 260 mg/100 ml or higher is used to predict coronary heart disease. What is the probability of correctly predicting heart disease for a man who will develop it?
(b) What is the probability of predicting heart disease for a man who will not develop it?
(c) What is the probability of failing to predict heart disease for a man who will develop it?
(3 marks)
6. The length of human pregnancies from conception to birth varies according to a distribution that is approximately normal with mean 266 days and standard deviation 16 days.
(a) What percent of pregnancies last less than 240 days (that's about 8 months)?
(b) What percent of pregnancies last between 240 and 270 days (roughly between 8 months and 9 months)?
(c) How long do the longest 20% of pregnancies last? (3 marks)
1. The probability of recovery for patients who are administered an established treatment for a stomach complaint is 0.8. A random sample of 100 patients with the complaint is monitored. Suppose X is the binomial random variable for the number of patients in this sample who recover when the established treatment is used.
(a) Specify the parameters of X.
(b) Find the mean and standard deviation of X.
(c) Find the probability that at least 90 of the patients administered the treatment recover. Here you should first verify that the normal approximation to the binomial distribution can be used.
(d) In a trial involving a new drug for the treatment of this stomach complaint, 90 out of 100 patients who are administered the new drug recover. What conclusion can you draw about the new drug? State your reason.
(7 marks)
SOLUTIONS
4. [Note to markers: Since students only have access to a table with z values to two decimal places, be prepared to accept calculations based on the nearest values in the table. Many students will, of course, interpolate between table values.]
(a) Area below −1.58 = 0.5 − Pr(0 < Z < 1.58)
 = 0.5 − 0.4429
 = 0.0571
(b) Pr(0 < Z < 1.44) = 0.425, so the central 85% lies between −1.44 and +1.44.
5. (a) For men who develop chd, Pr(X > 260) = Pr(Z > (260 − 244)/51)
 = Pr(Z > 0.314)
 = 0.5 − 0.1217
 = 0.3783
(b) For men who do not develop chd, Pr(X > 260) = Pr(Z > (260 − 219)/41)
 = Pr(Z > 1)
 = 0.5 − 0.3413
 = 0.1587
(c) The probability of failing to predict chd for a man who will develop it is 1 − 0.3783 = 0.6217.
6. X ∼ N(266, 16²), or X is normal with μ_X = 266 and σ_X² = 256.
(a) Pr(X < 240) = Pr(Z < (240 − 266)/16)
 = Pr(Z < −1.625)
 = 0.5 − Pr(0 < Z < 1.625)
 = 0.5 − 0.4479
 = 0.0521
That is, 5.2% of pregnancies last less than 8 months.
(b) Pr(240 < X < 270) = Pr(−1.625 < Z < (270 − 266)/16)
 = Pr(−1.625 < Z < 0.25)
 = 0.4479 + 0.0987
 = 0.5466, i.e. 54.7%
(c) z = 0.842 approximately, from the table (30% of the standard normal lies between 0 and z = 0.842, leaving 20% above z).
∴ (x − 266)/16 = 0.842
∴ x = 266 + 16(0.842) = 279.47 days
That is, approximately 280 days or more.
1. (a) n = 100; π = 0.8
(b) μ_X = nπ = 80; σ_X = √(80(0.2)) = 4.
(c) μ_X ± 2σ_X gives 80 ± 2(4), or 72 to 88. Both values lie in the range of possible values 0 to 100, hence the normal approximation can be used. (1.96 instead of 2 is also acceptable.)
Pr(X > 89.5) = Pr(Z > (89.5 − 80)/4)
 = Pr(Z > 2.375)
 = 0.5 − 0.4912
 = 0.0088
(d) There is evidence that the new drug produces a greater number who recover from the stomach complaint than expected from the established treatment; the probability 0.0088 is very small for a recovery rate of 80%.
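The solutions above can be verified against the exact normal CDF; a Python sketch, standard library only. Small differences from the printed answers reflect two-decimal table rounding.

```python
from math import erf, sqrt

def phi(z):
    # Standard normal CDF Pr(Z < z)
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

# Exercise 4(a): area below -1.58
print(round(phi(-1.58), 4))  # 0.0571

# Exercise 5: Framingham screening probabilities
p_correct = 1 - phi((260 - 244) / 51)  # develops chd, level over 260
p_false = 1 - phi((260 - 219) / 41)    # does not develop chd, level over 260
print(round(p_correct, 4))             # 0.3769 (table: 0.3783)
print(round(p_false, 4))               # 0.1587
print(round(1 - p_correct, 4))         # 0.6231 (table: 0.6217)

# Exercise 6: pregnancy lengths, X ~ N(266, 16^2)
print(round(phi((240 - 266) / 16), 4))                          # 6(a): 0.0521
print(round(phi((270 - 266) / 16) - phi((240 - 266) / 16), 4))  # 6(b): 0.5466

# Exercise 1(c): normal approximation with continuity correction
print(round(1 - phi((89.5 - 80) / 4), 4))  # 0.0088
```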
SECTION 5
This section defines sampling distributions, establishes the standard deviations of these distributions (called standard errors), and sets up confidence intervals for population means, differences between the means of two populations, proportions and differences between proportions, based on random samples drawn from the populations.
An Outline of the Research Process
The Distribution of Sample Means
The Standard Error of the Mean
Confidence Interval for a Mean
The t-distribution and Its Use
Comparison of Two Independent Groups
The Standard Error of the Difference Between Two Means
Pooled Estimate for the Common Variance
Comparison of Two Dependent Groups (Paired Data)
Confidence Interval for a Proportion
Confidence Interval for Difference Between Two Proportions
Summary of Distributions and Confidence Intervals
159
Section 5
The Research Process in Two Situations

Binomial
Underlying population: Bernoulli outcomes, success or failure (Y = 1 or 0).
A sample of size n gives the result of the study (the statistics): the number of successes, X, which is binomial.
Inference: use the probability of the outcome, or an estimate of the success proportion.
e.g. Prevalence (π) of asthma in women aged 20 to 40. This can be estimated as the proportion (p) in a sample chosen from the population.
Normal
Underlying population: continuous outcomes, X ∼ N(μ, σ²) say.
A sample of size n gives the result of the study (the statistics): How does the sample mean behave? What is the distribution of the sample mean, X̄?
Inference: use the probability of the outcome, or an estimate based on the sample mean.
e.g. What is the mean resting pulse rate (μ) in beats per minute for men in the age range 20 to 25 years? The mean x̄ from the sample is an estimate for the mean μ in the population of all men in this age range.
Sampling Distributions
Statistical inference is the process of using information from a sample to infer something about the population from which the sample was drawn, thus completing the research loops just described.
How reliable are these estimates for π and μ?
To answer these questions, focus first on the sample mean x̄ for a sample of size n, say. Proportions will be discussed later. The argument proceeds as follows:
Successive samples of size n can be drawn from the population. These produce means x̄₁, x̄₂, x̄₃, x̄₄, … and these form what is called a distribution of sample means, X̄, which is quite different to the original distribution, X, of values in the population.
The problem is now to find μ_X̄ and σ_X̄. [Here, σ_X̄ is the standard deviation of the distribution of means and hence is the "typical" variation in these means, i.e. the "typical" error.]
162<br />
Section 5
The Distribution of Sample Means
Suppose a population with distribution X has known mean μ_X and standard deviation σ_X.
Ex: Female adult heights. Suppose μ_X = 169 cm and σ_X = 3.20 cm.
A sample of size n = 4 drawn randomly from the population has values 163, 172, 166, 166, say, with mean x̄_1 = 667/4 = 166.8 cm.
[Figure: the distribution of individual heights (X), centred at μ_X = 169 cm with σ_X = 3.20 cm; axis marked 160 to 178 cm. The four sample values and their mean x̄_1 are plotted.]
The average x̄_1 is not as extreme as the individual values in the sample. x̄_1 is an estimate of μ_X (usually unknown in the real situation).
A second sample of n = 4 gives x̄_2 = 170.5 cm.
A third sample of n = 4 gives x̄_3 = 169.5 cm.
163<br />
Section 5
If this process is continued we can obtain a distribution of sample means. What are the properties of this distribution? These will allow us to decide how well a sample mean estimates μ_X.
[Figure: distributions of means for samples of size n = 10, n = 25 and n = 100, where the population from which the samples are taken is Normal.]
[Figure: distributions of means for samples of size n = 10, n = 25 and n = 100, where the population from which the samples are taken is not Normal. But the sampling distributions are normal.]
Derivation:
Suppose a random sample of size n is taken from a population with distribution X. The sample can be viewed as values from n random variables X_1, X_2, …, X_n, each with mean μ_X and variance σ²_X. X_1, X_2, …, X_n are independent (if the population is large), and are identically distributed.
A value, x̄, from one sample is one value of X̄, the distribution of sample means for samples of size n. Then
X̄ = (1/n)(X_1 + X_2 + … + X_n)
∴ μ_X̄ = (1/n)(μ_X1 + μ_X2 + … + μ_Xn)
      = (1/n)(n μ_X)    (X_1, X_2, etc. identical)
∴ μ_X̄ = μ_X
The addition rule for the variance of independent random variables gives
σ²_X̄ = (1/n)² σ²_X1 + (1/n)² σ²_X2 + … + (1/n)² σ²_Xn
     = (1/n)² (n σ²_X)
     = σ²_X / n
[i.e. if T = aX + bY, then σ²_T = a² σ²_X + b² σ²_Y]
Therefore, the standard deviation of the distribution of sample means is
σ_X̄ = σ_X / √n
The derivations of μ_X̄ = μ_X and σ_X̄ = σ_X / √n need not be known. These two formulae are in fact very important and you must know how to use them.
Note: 1. σ_X̄ is called the standard error of the mean. (It is the "typical" deviation in the mean, i.e. a measure of the precision of the mean.)
2. If μ_X = 169 and σ_X = 3.20 for heights of women, then for a sample of size n = 4, μ_X̄ = 169 and σ_X̄ = σ_X/√4 = 3.20/2 = 1.60.
3. If the sample size (n) is greater than 4, σ_X̄ is smaller, meaning the distribution X̄ is more compact about μ_X̄ = μ_X.
4. If X is normal, it can be shown that X̄ is normal no matter what the sample size.
5. If X is not normal but n is large, then X̄ is approximately normal. (This result is a consequence of the Central Limit Theorem in note 6.)
6. For random samples of size n, the sample means x̄_i fluctuate around the population mean μ_X with standard error σ_X̄ = σ_X/√n. As n increases, the means fluctuate less and less, and their distribution gets closer to a normal distribution.
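The behaviour in notes 1–6 is easy to check by simulation. Below is an illustrative Python sketch (not part of the course material, which uses R-cmdr): it draws repeated samples of size n = 4 from the female-heights population N(169, 3.20²) used earlier and compares the observed spread of the sample means with σ_X/√n = 1.60.

```python
import random
import statistics

# Illustrative simulation (not in the notes): sample means of size n
# should have standard error sigma_X / sqrt(n).
random.seed(1)
mu, sigma, n = 169.0, 3.20, 4      # female-heights example above

means = [
    statistics.mean(random.gauss(mu, sigma) for _ in range(n))
    for _ in range(20000)
]

se_observed = statistics.stdev(means)
se_theory = sigma / n ** 0.5       # 3.20 / 2 = 1.60
print(se_observed, se_theory)
```

With 20 000 simulated samples the observed standard error agrees with 1.60 to about two decimal places.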
Example: Suppose a population has values which are normally distributed (distribution X) with μ_X = 7.9 and σ_X = 0.60.
Find (i) Pr(X > 7.7);
(ii) Pr(X̄ > 7.7), where X̄ is the distribution of means for samples of size n = 9.
Solution:
(i) Pr(X > 7.7) = Pr(Z > (7.7 − 7.9)/0.60) = Pr(Z > −0.333) = 0.6304
(ii) Since μ_X̄ = 7.9 and σ_X̄ = σ_X/√n = 0.60/√9 = 0.2,
Pr(X̄ > 7.7) = Pr(Z > (7.7 − 7.9)/0.2) = Pr(Z > −1) = 0.8413
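As a check of this worked example, Python's standard-library NormalDist gives the same probabilities (a sketch only; the first value differs in the fourth decimal place from the table answer 0.6304 because the z-value −1/3 is not rounded to −0.333).

```python
from statistics import NormalDist

# X ~ N(7.9, 0.60^2); Xbar uses sigma / sqrt(9) = 0.2
p_single = 1 - NormalDist(7.9, 0.60).cdf(7.7)           # Pr(X > 7.7)
p_mean = 1 - NormalDist(7.9, 0.60 / 9 ** 0.5).cdf(7.7)  # Pr(Xbar > 7.7)
print(round(p_single, 4), round(p_mean, 4))
```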
Example: Serum values for a sample of n = 216 give x̄ = 34.46 and s = 5.84. What is the standard error of x̄?
Standard error = σ/√n, where σ is the (unknown) population standard deviation. In practice, we estimate σ by s.
∴ estimated standard error = s/√n = 5.84/√216 = 0.397.
Suppose the sample had been twice the size, n = 432, with the same mean and standard deviation. Estimated s.e. = 5.84/√432 = 0.281 (compare 0.397 for n = 216).
A Confidence Interval for the Mean
The problem here is to use sample data to find an estimate for an unknown population mean μ_X. This estimate reflects the random variation in the data collected by establishing an interval in which we are fairly certain that the mean μ_X lies.
As can be seen, this will complete the research loop concerning the unknown population.
To motivate the procedure we work with the distribution of sample means, X̄, which is N(μ_X, σ²_X̄), or alternatively N(μ_X, σ²_X/n).
First consider the standard Normal:
[Figure: the standard Normal density, with 0.95 of the area between Z = −1.96 and Z = +1.96 and area 0.025 in each tail.]
0.95 = Pr(−1.96 < Z < +1.96)
     = Pr(−1.96 < (X̄ − μ_X)/(σ_X/√n) < +1.96)
     = Pr(−1.96 σ_X/√n < X̄ − μ_X < +1.96 σ_X/√n)
     = Pr(μ_X − 1.96 σ_X/√n < X̄ < μ_X + 1.96 σ_X/√n)
This result is used to construct a 95% confidence interval as follows:
For a sample x_1, x_2, …, x_n of n values from a population, we are said to be 95% confident that the sample mean satisfies
μ_X − 1.96 σ_X/√n < x̄ < μ_X + 1.96 σ_X/√n
But x̄ < μ_X + 1.96 σ_X/√n implies x̄ − 1.96 σ_X/√n < μ_X,
while μ_X − 1.96 σ_X/√n < x̄ implies μ_X < x̄ + 1.96 σ_X/√n.
Therefore, we are 95% confident that the unknown population mean μ_X satisfies
x̄ − 1.96 σ_X/√n < μ_X < x̄ + 1.96 σ_X/√n
Alternatively, we are 95% confident that the true population mean lies in the interval
x̄ ± 1.96 σ_X/√n
Notes: 1. The sample has produced an interval estimate for the unknown population mean.
2. A 99% confidence interval replaces the value 1.96 by 2.58, since the tail areas beyond +2.58 and −2.58 are both 0.005.
[Figure: the standard Normal density, with 0.99 of the area between Z = −2.58 and Z = +2.58 and area 0.005 in each tail.]
3. The 99% confidence interval x̄ ± 2.58 σ_X/√n is wider, hence less precise, but we are now 99% certain μ_X is in this interval.
4. As n increases, σ_X/√n decreases and the confidence interval is narrower, meaning a more precise estimate; i.e. a large sample leads to greater accuracy.
Example: A pharmacologist is investigating the length of time that a sedative is effective. Eight patients are selected at random for a study, and the eight times for which the sedative is effective have mean x̄ = 8.4 hours. (It is also known that the standard deviation for such measures is σ_X = 1.5 hours.)
Find 95% and 99% confidence intervals for the true mean number of hours, μ_X.
Solution: Here, n = 8; x̄ = 8.4; σ_X̄ = 1.5/√8 = 0.53 (assuming that X is normal).
The 95% confidence interval is
8.4 ± 1.96 (0.53)
or 8.4 ± 1.04
That is, 7.36 < μ_X < 9.44 or (7.36, 9.44).
The 99% confidence interval is
8.4 ± 2.58 (0.53)
or 8.4 ± 1.37
That is, 7.03 < μ_X < 9.77 or (7.03, 9.77).
The second interval is much wider.
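A sketch of the same computation in Python (illustrative only; `inv_cdf` supplies the 1.96 and 2.58 multipliers rather than hard-coding them):

```python
from statistics import NormalDist

# Sedative example: n = 8, xbar = 8.4 h, known sigma_X = 1.5 h.
n, xbar, sigma = 8, 8.4, 1.5
se = sigma / n ** 0.5                      # sigma_Xbar = 0.53

z95 = NormalDist().inv_cdf(0.975)          # ~1.96
z99 = NormalDist().inv_cdf(0.995)          # ~2.58
ci95 = (xbar - z95 * se, xbar + z95 * se)
ci99 = (xbar - z99 * se, xbar + z99 * se)
print(ci95, ci99)
```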
Example: The pharmacologist is required to find the value of μ_X to within 15 minutes with 95% confidence. Assuming that the standard deviation is σ_X = 1.5 hours, find the size of the sample which must be taken in order to achieve this accuracy.
Solution: Since 15 minutes is 1/4 hour, for a sample of size n we need x̄ ± 1/4 to be an interval which is wider than
x̄ ± 1.96 σ_X/√n, or x̄ ± 1.96 (1.5)/√n
∴ 1.96 (1.5)/√n ≤ 1/4
Rearranging, 1.96 (1.5)(4) ≤ √n, or 11.76 ≤ √n
Squaring, n ≥ 138.3
Hence, 139 patients must be selected.
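The sample-size calculation can be sketched as (illustrative Python):

```python
import math

# Sample size so that the 95% half-width 1.96 * sigma / sqrt(n)
# is at most 0.25 hours (15 minutes).
sigma, half_width = 1.5, 0.25
n_min = (1.96 * sigma / half_width) ** 2   # (11.76)^2 = 138.2976
n = math.ceil(n_min)                       # round up to a whole patient
print(n_min, n)
```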
Use of the t-table when σ_X is unknown
In all practical contexts, σ_X is not known. In this case it is estimated in the best possible way by the sample standard deviation s_X. In this situation, the t-table provides alternative, larger values in place of 1.96 and 2.58. The confidence intervals are wider and hence there is less precision.
The 95% confidence interval is
x̄ − t_ν s_X/√n < μ_X < x̄ + t_ν s_X/√n
where ν = n − 1 is the "number of degrees of freedom" and t_ν is found in the appropriate column in the t-table for 95% confidence (see table at end of notes).
(Note: ν = n − 1 is also the divisor in the estimate s²_X for the variance.)
Exercise: Now suppose that the pharmacologist did not know the value of σ_X and was forced to take the sample standard deviation from the sample of size n = 8 as the best estimate of σ_X, namely s_X = 1.5 hours. Find 95% and 99% confidence intervals for μ_X.
Solution: x̄ = 8.4 and
estimated standard error = s_X/√n = 1.5/√8 = 0.53
The 95% confidence interval for the mean sedative time μ_X for all such patients is
8.4 ± t_7 (0.53), where t_7 = 2.365
That is, 8.4 ± 1.25
or 7.15 < μ_X < 9.65
The 99% interval is
8.4 ± t_7 (0.53), where t_7 = 3.500
That is, 8.4 ± 1.86
or 6.54 < μ_X < 10.26
Both are wider than before.
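A sketch of this exercise in Python; the t_7 values are copied from the t-table, since the standard library has no t quantile function:

```python
# t-based intervals: n = 8, xbar = 8.4, sigma estimated by s = 1.5.
n, xbar, s = 8, 8.4, 1.5
se = s / n ** 0.5                          # estimated standard error
t95, t99 = 2.365, 3.500                    # t_7 for 95% and 99%
ci95 = (xbar - t95 * se, xbar + t95 * se)
ci99 = (xbar - t99 * se, xbar + t99 * se)
print(ci95, ci99)
```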
Student’s t distribution
[Figure: the t density, with area p (or probability) in the tail beyond t_ν; 2p is the combined area of both tails.]
  ν     2p:  0.100   0.050   0.020   0.010
         p:  0.050   0.025   0.010   0.005
  ⋮
  7          1.895   2.365   2.998   3.500
  ⋮
  ∞          1.645   1.960   2.326   2.576
p refers to the area of one tail; 2p gives the combined area of both tails. (View the t-distribution as a slight modification of the normal distribution Z.)
Notes: 1. The interval is wide when samples are small; that is, there is less precision in the estimates.
2. This last example is the most common situation, where: the population is assumed to be normal; μ_X and σ_X are both unknown; σ_X is estimated by s_X from a random sample of size n.
3. Even for large n the t-table is used. The last row of the table (ν = ∞) gives the normal values 1.96 and 2.58.
4. From the point of view of exams we shall accept the normal distribution value for degrees of freedom greater than 30.
Example: Tablets must be produced which weigh 200 milligrams. A sample of n = 20 is chosen from the production line, giving x̄ = 201.7 mg and s_X = 5.13 mg. Does this sample confirm that μ_X = 200 mg?
Solution: There are ν = 19 degrees of freedom, and t_19 = 2.093 for a 95% confidence interval. Therefore,
201.7 − 2.093 (5.13/√20) < μ_X < 201.7 + 2.093 (5.13/√20)
or 199.3 < μ_X < 204.1
The weight of 200 milligrams lies in this interval. Hence, 200 milligrams is an acceptable value of the mean μ_X with 95% confidence.
The Meaning of a Confidence Interval
[Figure: the intervals from Samples 1 to 100 plotted against the true mean μ_X = 200, over the range 199.3 to 204.1; the interval from Sample 5 does not include μ_X = 200 mg.]
In general, if 100 different samples construct 100 intervals, then five of the 100 will miss μ_X if we are working at 95% confidence levels.
(This is the possible error which must be accepted. With 99% confidence intervals, which are wider, only one will miss μ_X.)
100 Confidence Intervals (95%)
[Figure: 100 individual 95% confidence intervals, from Samples 1 to 100, plotted side by side over the range 199.3 to 204.1.]
In the above, the position of the true mean μ_X is unknown. Also, in practice we only have one of the above intervals. We say we are 95% confident the true mean lies in this interval.
Example:<br />
It is claimed that males committed for trial for<br />
minor <strong>of</strong>fences are spending more time in prison on<br />
rem<strong>and</strong> than females committed for trial for similar<br />
<strong>of</strong>fences. A sample <strong>of</strong> 40 females <strong>and</strong> 49 males<br />
awaiting trial gave the following information. The<br />
outcome measure is time on rem<strong>and</strong> (X days).<br />
                                  Female    Male
Sample mean (x̄_i)                 16.3      29.5
Sample standard deviation (s_i)   14.6      17.2
Sample size (n_i)                 40        49
The difference between the sample means is
x̄_M − x̄_F = 29.5 − 16.3 = 13.2 days
Is this an important difference?
If μ_M and μ_F are the population mean times for males and females, a 95% confidence interval for μ_M − μ_F is
x̄_M − x̄_F ± 1.96 √(s²_M/n_M + s²_F/n_F)
= 13.2 ± 1.96 √((17.2)²/49 + (14.6)²/40)
= 13.2 ± 6.61
= (6.59, 19.81)
or 6.59 < μ_M − μ_F < 19.81
The population male remand time is likely to be between 6.59 and 19.81 days longer than that for females (alternatively, the true mean difference is between 6.59 and 19.81 days).
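A quick computational check of the interval (illustrative Python sketch):

```python
# Remand-time example: large-sample 95% CI for mu_M - mu_F.
xm, sm, nm = 29.5, 17.2, 49   # males
xf, sf, nf = 16.3, 14.6, 40   # females
diff = xm - xf
se = (sm**2 / nm + sf**2 / nf) ** 0.5      # standard error of difference
ci = (diff - 1.96 * se, diff + 1.96 * se)
print(diff, ci)
```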
Case 2: Comparing means when samples are small
In this situation the CLT no longer holds for the difference between the sample means. Instead we need to assume that the population from which the difference is drawn is normally distributed. This should be the case if the populations from which the two small samples are drawn are normal.
In addition to assuming normality, we assume the two populations have equal variances.
Suppose σ²_1 and σ²_2 are similar and equal to σ², say. Then the 95% confidence interval for μ_1 − μ_2 is
(x̄_1 − x̄_2) ± 1.96 σ √(1/n_1 + 1/n_2)
that is,
(x̄_1 − x̄_2) − 1.96 σ √(1/n_1 + 1/n_2) < μ_1 − μ_2 < (x̄_1 − x̄_2) + 1.96 σ √(1/n_1 + 1/n_2)
The common variance σ² needs to be estimated from sample data. If both populations have the same variance, the best estimate for σ² is found when the variation in both samples is averaged to give the pooled estimate s²_p, where
s²_p = [(n_1 − 1) s²_1 + (n_2 − 1) s²_2] / (n_1 + n_2 − 2)
with
s²_1 = Σ(x_1i − x̄_1)²/(n_1 − 1) and s²_2 = Σ(x_2i − x̄_2)²/(n_2 − 1)
When sample estimates for the variances are used, replace 1.96 by the t-value to get
(x̄_1 − x̄_2) ± t_ν s_p √(1/n_1 + 1/n_2)
with degrees of freedom ν = n_1 + n_2 − 2.
Example 3: The following data are 24-hour total energy expenditures (MJ/day) in groups of lean and obese patients (1986 study).
Lean (n = 13): 6.13, 7.05, 7.48, 7.48, 7.53, 7.58, 7.90, 8.08, 8.09, 8.11, 8.40, 10.15, 10.88
  Mean: 8.066; S.D.: 1.238
Obese (n = 9): 8.79, 9.19, 9.21, 9.68, 9.69, 9.97, 11.51, 11.85, 12.79
  Mean: 10.298; S.D.: 1.398
Question: Is there a difference in energy expenditure between lean and obese patients?
Possible explanations for the difference between<br />
samples in above situations:<br />
1. bias (need to r<strong>and</strong>omise)<br />
2. confounding (e.g. gender, age)<br />
3. chance (r<strong>and</strong>om variation)<br />
4. true difference<br />
The methods we discuss in next few lectures assume<br />
that bias <strong>and</strong> confounding are not the explanation.<br />
n_1 = 13; x̄_1 = 8.066; s_1 = 1.238 (lean)
n_2 = 9; x̄_2 = 10.298; s_2 = 1.398 (obese)
Solution: x̄_2 − x̄_1 = 2.232 (obese − lean)
s²_p = [(13 − 1)(1.238)² + (9 − 1)(1.398)²] / (13 + 9 − 2)
     = [12(1.533) + 8(1.954)] / 20
     = 1.7014
∴ s_p = √1.7014 = 1.304
ν = 20, giving t_20 = 2.086 for a 95% interval.
∴ the 95% confidence interval is
2.232 ± 2.086 (1.304) √(1/13 + 1/9)
or 2.232 ± 1.180
That is, 1.05 < μ_obese − μ_lean < 3.41 MJ/day
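The pooled calculation can be sketched as (illustrative Python; t_20 = 2.086 taken from the t-table):

```python
# Pooled two-sample t interval for the energy-expenditure example.
n1, x1, s1 = 13, 8.066, 1.238   # lean
n2, x2, s2 = 9, 10.298, 1.398   # obese
sp2 = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)
sp = sp2 ** 0.5                  # pooled standard deviation
t20 = 2.086                      # t-table, nu = n1 + n2 - 2 = 20
half = t20 * sp * (1 / n1 + 1 / n2) ** 0.5
ci = (x2 - x1 - half, x2 - x1 + half)
print(sp2, ci)
```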
Note: This confidence interval tells us that we<br />
can be 95% sure that the true difference in energy<br />
expenditure between obese <strong>and</strong> lean patients is<br />
between 1.05 <strong>and</strong> 3.41 MJ/day.<br />
Since this interval is entirely positive, it means<br />
that we can conclude that lean patients have lower<br />
energy expenditure than obese patients.<br />
Notes: 1. ν = n_1 + n_2 − 2 is the divisor in the formula for s²_p, the variance estimate. (The degrees of freedom are always the divisor in the variance estimate, e.g. n − 1 in the single-sample case.)
2. Both populations should have values which are normally distributed if the samples are small.
3. The two population variances, σ²_1 and σ²_2, should be approximately equal. (Otherwise we may need to transform the data or use another test.) R-cmdr has an option which confirms this.
4. The two samples from the two populations are random and independent of each other.
5. Testing whether μ_1 = μ_2 can be achieved by seeing if μ_1 − μ_2 = 0, i.e. checking whether 0 lies in the confidence interval for the difference.
6. It is possible to obtain the probability value associated with the study outcome value of 2.232 (see later).
Example: A nutrition scientist is assessing a weight-loss programme to evaluate its effectiveness. Ten people are randomly selected; their initial weight is recorded, together with their followup weight 20 weeks later.
Subject Initial Weight (x Ii ) Weight at followup (x Fi )<br />
1 180 165<br />
2 142 138<br />
3 126 128<br />
4 138 136<br />
5 175 170<br />
6 205 197<br />
7 116 115<br />
8 142 128<br />
9 157 144<br />
10 136 130<br />
Find a 95% confidence interval for the reduction in weight (assuming the two sets of values are independent).
x̄_I = 151.7    x̄_F = 145.1
s²_I = 750.76   s²_F = 620.01
s²_p = [9(750.76) + 9(620.01)] / 18 = 685.39
Since ν = 18, giving t_18 = 2.101, we get
(151.7 − 145.1) ± 2.101 √685.39 √(1/10 + 1/10)
or 6.6 ± 24.6
That is, −18.0 < μ_I − μ_F < 31.2
Note 1: Since the confidence interval includes 0,<br />
conclude there is no evidence to indicate that the<br />
weight loss programme has altered weights.<br />
Note 2: In this study the two sets <strong>of</strong> data are not<br />
independent. One person produces two values<br />
here.<br />
Case 3: Comparing means with matched data
It is natural to consider the differences d_i in the weights for each person rather than considering the two samples separately. The d_i are the data now, and a confidence interval is constructed for the average difference μ_d based on the single sample of differences. The 95% confidence interval is
d̄ ± t_ν s_d/√n
where d̄ is the average of the d_i, n is the number of data pairs, ν = n − 1, and s_d is the standard deviation of the differences. We have
s_d = √[Σ(d_i − d̄)²/(n − 1)]
with n − 1 degrees of freedom.
Example: Weight loss programme again
Subject   x_Ii   x_Fi   d_i = x_Ii − x_Fi   (d_i − d̄)²
1         180    165    15                  70.56
2         142    138    4                   6.76
3         126    128    −2                  73.96
4         138    136    2                   21.16
5         175    170    5                   2.56
6         205    197    8                   1.96
7         116    115    1                   31.36
8         142    128    14                  54.76
9         157    144    13                  40.96
10        136    130    6                   0.36
Total                   66                  304.40
d̄ = 66/10 = 6.6
s²_d = Σ(d_i − d̄)²/(n − 1) = 304.4/9 = 33.82
ν = n − 1 = 9, giving t_9 = 2.262 for a 95% interval.
The 95% confidence interval for the average difference is
6.6 ± 2.262 √33.82/√10
or 6.6 ± 4.2
That is, 2.4 < μ_d < 10.8
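The paired analysis can be sketched directly from the raw weights (illustrative Python; t_9 = 2.262 from the t-table):

```python
import statistics

# Paired analysis of the weight-loss data: CI for the mean difference.
initial = [180, 142, 126, 138, 175, 205, 116, 142, 157, 136]
followup = [165, 138, 128, 136, 170, 197, 115, 128, 144, 130]
d = [i - f for i, f in zip(initial, followup)]

dbar = statistics.mean(d)                  # 6.6
sd = statistics.stdev(d)                   # sqrt(304.4 / 9)
n = len(d)
t9 = 2.262                                 # t-table, nu = n - 1 = 9
half = t9 * sd / n ** 0.5
ci = (dbar - half, dbar + half)
print(dbar, sd**2, ci)
```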
There is evidence that the weight loss programme<br />
has reduced weights since the difference <strong>of</strong> 0 is<br />
not in this interval (we are 95% sure).<br />
Notes: (1) The “pr<strong>of</strong>ile” <strong>of</strong> each person is<br />
constant in this study because the same<br />
person has produced the two values.<br />
(2) A test involving paired data based on d̄ is called a paired t-test. The earlier test on μ_1 − μ_2 is called an unpaired t-test.
(3) Negative differences are possible in this<br />
analysis when subtracting. Be consistent with<br />
subtraction process.<br />
Confidence Intervals for a Proportion
Suppose X is a binomial distribution with parameters n and π (i.e. the number of "successes" lies between 0 and n). Then
μ_X = nπ    σ_X = √(nπ(1 − π))
Suppose one sample produces a proportion of successes p = x/n in n trials. Many such samples can be taken to get different values of p. The resulting distribution (P) of these proportions is normal (by the Central Limit Theorem). It follows that
P = X/n
where X is binomial. The mean and standard deviation of P are then
μ_P = (1/n) μ_X = (1/n) nπ = π
and, since σ²_P = (1/n)² σ²_X = (1/n²) nπ(1 − π),
σ_P = √(π(1 − π)/n)
The sample proportion (p) estimates the unknown true population proportion (π) (e.g. the prevalence of asthma in women is not known). Thus the estimated standard error is
√(p(1 − p)/n)
and the 95% confidence interval for π is
p ± 1.96 √(p(1 − p)/n)
Note: 1.96 (or the 99% equivalent 2.58) is always used for confidence intervals for proportions. (If the sample is small, the normal distribution is not a good approximation.)
Example: A r<strong>and</strong>om sample <strong>of</strong> 500 Auckl<strong>and</strong>ers<br />
taken in 1996 had 173 supporting aerial spraying<br />
to eradicate tussock moth. Estimate the<br />
proportion (π) <strong>of</strong> Auckl<strong>and</strong>ers who support this.<br />
Solution:
p = x/n = 173/500 = 0.346
and
√(p(1 − p)/n) = √(0.346(1 − 0.346)/500) = 0.021
The 95% confidence interval is<br />
0.346 ± 1.96(0.021)<br />
or 0.346 ± 0.041<br />
Therefore, 0.305 < π < 0.387<br />
We are 95% sure that between 30.5% <strong>and</strong> 38.7%<br />
<strong>of</strong> the Auckl<strong>and</strong> population support the spraying.<br />
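A computational sketch of this interval (illustrative Python; the end-points differ from the values above in the third decimal place because the standard error is not rounded to 0.021):

```python
# Aerial-spraying example: 95% CI for the proportion pi.
x, n = 173, 500
p = x / n                                  # 0.346
se = (p * (1 - p) / n) ** 0.5              # ~0.021
ci = (p - 1.96 * se, p + 1.96 * se)
print(p, ci)
```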
Note: Alternatively, we could say 34.6% of the population support spraying with a margin of error of 4.1%. But the 'margin of error' concept must be used with caution. It is reasonable if the value of p lies between 0.3 and 0.7, but the margin of error should be adjusted if p lies outside this range. (We omit this adjustment.)
Example: An epidemiologist estimates the proportion of women with asthma. Find the sample size (n) needed to give an estimate for this proportion with an error of no more than 0.03 with 95% confidence.
Solution: The largest possible value of p(1 − p) occurs when p = 1/2 (verify this by choosing several p values). The most conservative (or safest) sample size is obtained using this value p = 1/2. The requested accuracy requires the confidence interval p ± 0.03 to be the largest interval. But the actual interval is
p ± 1.96 √(0.5(1 − 0.5)/n)
for sample size n. Therefore
1.96 √(0.5(1 − 0.5)/n) < 0.03
∴ (1.96)²(0.5)(0.5)/n < (0.03)²
∴ n > (1.96)²(0.5)(0.5)/(0.03)² = 1067.11
It follows that 1068 women must be tested.
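The conservative sample-size rule can be sketched as (illustrative Python):

```python
import math

# Conservative sample size: worst case p = 1/2, margin 0.03 at 95%.
z, margin = 1.96, 0.03
n_min = z**2 * 0.5 * 0.5 / margin**2       # largest p(1 - p) is 0.25
n = math.ceil(n_min)                       # round up to a whole person
print(n_min, n)
```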
Now consider the Confidence Interval for the Difference Between Two Proportions (derivation not examined).
The difference π_1 − π_2 is estimated by p_1 − p_2, where p_1 = x_1/n_1 and p_2 = x_2/n_2 for the two samples.
The distribution P_1 − P_2 of sample proportion differences is a normal distribution with
μ_(P_1 − P_2) = π_1 − π_2
and standard deviation (standard error)
σ_(P_1 − P_2) = √(π_1(1 − π_1)/n_1 + π_2(1 − π_2)/n_2)
using the addition rule for the mean <strong>and</strong> variance<br />
<strong>of</strong> two independent r<strong>and</strong>om variables, P 1 <strong>and</strong> P 2 .<br />
If π 1 <strong>and</strong> π 2 are estimated from sample data, the<br />
95% confidence interval is<br />
(p_1 − p_2) ± 1.96 √(p_1(1 − p_1)/n_1 + p_2(1 − p_2)/n_2)
Exercise: To study the effectiveness of a drug for arthritis, two samples of patients were randomly selected. One sample of 100 was injected with the drug; the other sample of 60 received a placebo injection. After a period of time the patients were asked if their arthritic condition had improved. The results were:
EXPOSURE<br />
DRUG(1) PLACEBO(2)<br />
IMPROVED 59 22<br />
NOT IMPROVED 41 38<br />
TOTAL 100 60<br />
Solution: The proportions improved are
p_1 = 59/100 = 0.59 and p_2 = 22/60 = 0.37
p_1 − p_2 = 0.22, and the estimated standard error of the difference between the proportions is
√(0.59(1 − 0.59)/100 + 0.37(1 − 0.37)/60) = 0.0794
The 95% confidence interval is<br />
0.22 ± 1.96 (0.0794)<br />
or 0.22 ± 0.156<br />
or 0.064 < π₁ − π₂ < 0.376<br />
Since 0 is excluded from the interval and the interval is entirely positive, there is evidence that π₁ − π₂ > 0. That is, we conclude the proportion improved is higher when the drug is used.<br />
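The arithmetic above can be checked with a short script. This is a sketch in Python rather than the R-cmdr used in the course, and the function name is mine; the formula is the confidence interval given above.<br />

```python
import math

def two_proportion_ci(x1, n1, x2, n2, z=1.96):
    """95% CI for pi1 - pi2: (p1 - p2) +/- z*sqrt(p1(1-p1)/n1 + p2(1-p2)/n2)."""
    p1, p2 = x1 / n1, x2 / n2
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    diff = p1 - p2
    return diff, diff - z * se, diff + z * se

# Arthritis trial: 59/100 improved on the drug, 22/60 on placebo
diff, lo, hi = two_proportion_ci(59, 100, 22, 60)
# Unrounded p2 = 22/60 = 0.3667, so diff is about 0.223
# (the worked solution rounds p2 to 0.37, giving 0.22)
```

Because the lower limit is positive, 0 is excluded from the interval, matching the conclusion above.<br />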
REVIEW EXERCISES<br />
2. A population is known to be normally distributed with a mean µx = 60 <strong>and</strong> st<strong>and</strong>ard deviation σx =<br />
15. Let X be the distribution <strong>of</strong> means <strong>of</strong> samples <strong>of</strong> size 25 drawn from the population.<br />
(a) Define completely the probability distribution X.<br />
(b) What is the probability that a value in the population will lie between 55 and 65?<br />
(c) What is the probability that the mean of a sample of size 25 will lie between 55 and 65? (4 marks)<br />
3. Large studies indicate that the mean cholesterol level in children aged 2 – 14 is 175 mg%/mL <strong>and</strong><br />
the st<strong>and</strong>ard deviation is 30 mg%/mL.<br />
The problem here is to see if there is a familial aggregation <strong>of</strong> cholesterol levels. A group <strong>of</strong> fathers<br />
who have had a heart attack <strong>and</strong> have elevated cholesterol levels (≥ 250 mg%/mL) are identified.<br />
The cholesterol levels <strong>of</strong> their <strong>of</strong>fspring within the 2-14 age range are measured. The mean<br />
cholesterol level in a group <strong>of</strong> 100 such children is 207.3 mg%/mL. The problem is to decide if this<br />
value is sufficiently far from 175 mg%/mL for us to believe that the underlying mean cholesterol<br />
level μ in the population <strong>of</strong> all children selected in this way is greater than 175 mg%/mL.<br />
(a) Construct a 95% confidence interval for μ on the basis <strong>of</strong> the sample data. State your conclusion<br />
about familial aggregation <strong>of</strong> cholesterol levels.<br />
(2 marks)<br />
(b) Find the probability <strong>of</strong> obtaining the sample mean <strong>of</strong> 207.3 mg%/mL or a value which is greater<br />
under the assumption that there is no familial aggregation. State your conclusion from this<br />
probability.<br />
(2 marks)<br />
4. Patients with chronic kidney failure may be treated by dialysis, using a machine that removes toxic<br />
wastes from the blood, a function normally performed by the kidneys. Kidney failure <strong>and</strong> dialysis<br />
can cause other changes, such as retention <strong>of</strong> phosphorus, that must be corrected by changes in diet.<br />
A study <strong>of</strong> the nutrition <strong>of</strong> dialysis patients measured the level <strong>of</strong> phosphorus in the blood on six<br />
occasions. Here are the data for one patient (milligrams of phosphorus per decilitre of blood):<br />
5.5 6.1 4.8 5.8 6.2 4.6<br />
The measurements are separated in time <strong>and</strong> can be considered a r<strong>and</strong>om sample <strong>of</strong> the patient’s<br />
blood phosphorus level.<br />
(a) If the level varies normally with σ = 0.8 mg/dl, find a 95% confidence interval for the mean blood phosphorus level of this patient. (1 mark)<br />
(b) If the value of σ is unknown but estimated by the sample standard deviation s = 0.669, find a 95% confidence interval for the mean blood phosphorus level of this patient. (1 mark)<br />
(c) The normal range of phosphorus in the blood is considered to be 2.6 to 4.8 mg/dl. Is there evidence that the patient has a mean phosphorus level that exceeds 4.8? Explain. (1 mark)<br />
5. A salmon fishing company is monitoring the weight <strong>of</strong> salmon in its ponds prior to harvest. A pilot<br />
sample <strong>of</strong> ten fish, r<strong>and</strong>omly selected, shows a mean weight <strong>of</strong> 2.31 kilograms with a st<strong>and</strong>ard<br />
deviation <strong>of</strong> 0.17 kilogram.<br />
(a) Obtain a 95% confidence interval for the mean weight of all salmon in the ponds. (2 marks)<br />
(b) Using the standard deviation from the pilot survey as an estimate of the true variation of weights of salmon in the ponds, establish how many fish should be sampled to obtain an estimate of the mean weight of all the salmon in the ponds to within 0.03 kilogram with 95% confidence. (Take 2 as an approximation to the value of t.) (3 marks)<br />
SOLUTIONS<br />
2. (a) X̄ is a normal distribution with μ_X̄ = 60 and σ_X̄ = 15/√25 = 3,<br />
i.e. X̄ ~ N(60, 9)<br />
(b) Pr(55 < X < 65) = Pr((55 − 60)/15 < Z < (65 − 60)/15)<br />
= Pr(–0.33 < Z < 0.33)<br />
= 2(0.1293)<br />
= 0.2586 approx.<br />
(c) Pr(55 < X̄ < 65) = Pr((55 − 60)/3 < Z < (65 − 60)/3)<br />
= Pr(–1.67 < Z < 1.67)<br />
= 2(0.4525)<br />
= 0.9050 approx.<br />
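The table look-ups in (b) and (c) can be reproduced numerically. This is a sketch in Python (not the course's R-cmdr), using the standard normal CDF Φ built from the error function; the small differences from the worked answers come from rounding z to two decimal places before using the table.<br />

```python
import math

def phi(z):
    """Standard normal CDF, Phi(z), via the error function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def prob_between(lo, hi, mu, sigma):
    """Pr(lo < X < hi) for X ~ N(mu, sigma^2)."""
    return phi((hi - mu) / sigma) - phi((lo - mu) / sigma)

# (b) an individual value: X ~ N(60, 15^2)
p_individual = prob_between(55, 65, 60, 15)
# (c) a sample mean with n = 25, so the SD is 15/sqrt(25) = 3
p_mean = prob_between(55, 65, 60, 15 / math.sqrt(25))
# Unrounded z gives about 0.2611 and 0.9044; the table answers
# (0.2586 and 0.9050) use z rounded to 0.33 and 1.67
```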
REVIEW EXERCISES<br />
2. The extent to which X-rays can penetrate tooth enamel has been suggested as a suitable<br />
mechanism for differentiating between males <strong>and</strong> females in forensic medicine. Listed<br />
below in appropriate units are the ‘spectropenetration gradients’ for eight female teeth <strong>and</strong><br />
eight male teeth:<br />
Male (x₁): 4.9 5.4 5.0 5.5 5.4 6.6 6.3 4.3<br />
Female (x₂): 4.8 5.3 3.7 4.1 5.6 4.0 3.6 5.0<br />
The data give sample means x̄₁ = 5.4250, x̄₂ = 4.5125 and sample variances s₁² = 0.5536, s₂² = 0.5784.<br />
(a) Calculate the pooled estimate for the variance common to the male <strong>and</strong> female<br />
populations.<br />
(1 mark)<br />
(b) Estimate the st<strong>and</strong>ard error <strong>of</strong> the difference between the population means. (1 mark)<br />
(c) Construct a 95% confidence interval for the difference between the two population<br />
means.<br />
(1 mark)<br />
(d) What conclusion do you now draw about the procedure for differentiating between males and females?<br />
(1 mark)<br />
SOLUTIONS<br />
2. (a) s_p² = [(n₁ − 1)s₁² + (n₂ − 1)s₂²] / (n₁ + n₂ − 2)<br />
= [7(0.5536) + 7(0.5784)] / (8 + 8 − 2)<br />
= ½(1.132)<br />
= 0.566<br />
(b) Estimated standard error of difference = √[s_p²(1/n₁ + 1/n₂)] = √[0.566(1/8 + 1/8)] = 0.376<br />
(c) The 95% confidence interval is (x̄₁ − x̄₂) ± t₁₄(0.376)<br />
That is, (5.4250 − 4.5125) ± 2.145(0.376)<br />
or 0.9125 ± 0.8065<br />
giving 0.106 < μ₁ − μ₂ < 1.719<br />
(d) We are 95% sure that there is a difference in the mean tooth penetrations for males <strong>and</strong><br />
females since 0 does not lie in the confidence interval in (c). (Because the confidence<br />
interval is positive the male tooth penetration will be greater.)<br />
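Parts (a) to (c) follow a fixed recipe, so they are easy to verify in code. This is a sketch in Python (not the course's R-cmdr); the function name is mine, and the t critical value 2.145 for 14 degrees of freedom is taken from the worked solution rather than computed.<br />

```python
import math

def pooled_t_ci(x1bar, s1sq, n1, x2bar, s2sq, n2, t_crit):
    """CI for mu1 - mu2 using a pooled variance estimate (equal-variance model)."""
    sp2 = ((n1 - 1) * s1sq + (n2 - 1) * s2sq) / (n1 + n2 - 2)
    se = math.sqrt(sp2 * (1 / n1 + 1 / n2))
    diff = x1bar - x2bar
    return sp2, se, diff - t_crit * se, diff + t_crit * se

# Tooth data: n = 8 in each group, t_crit = 2.145 for 14 d.f. at 95%
sp2, se, lo, hi = pooled_t_ci(5.4250, 0.5536, 8, 4.5125, 0.5784, 8, t_crit=2.145)
```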
[A] DISTRIBUTION SUMMARY<br />
1. Binomial (X): n trials; π is the probability of success (discrete)<br />
μ_X = nπ and σ_X = √[nπ(1 − π)]<br />
2. Normal (X): (continuous)<br />
Parameters are μ_X and σ_X<br />
3. Standard Normal (Z = (X − μ_X)/σ_X)<br />
Parameters are μ_Z = 0 and σ_Z = 1<br />
4. Normal Approximation to Binomial<br />
Original binomial has parameters n and π. The normal approximation has parameters μ_X = nπ and σ_X = √[nπ(1 − π)]<br />
5. Distribution of Sample Means (X̄)<br />
Normal with μ_X̄ = μ_X and σ_X̄ = σ_X/√n. The standard deviation σ_X̄ is also called the standard error of the mean.<br />
6. Distribution of Differences between Sample Means (X̄₁ − X̄₂)<br />
μ = μ₁ − μ₂<br />
σ = √[σ₁²/n₁ + σ₂²/n₂] = σ√[1/n₁ + 1/n₂] if σ₁ = σ₂ = σ<br />
7. Distribution of Sample Proportions (P)<br />
μ_P = π and σ_P = √[π(1 − π)/n]<br />
8. Distribution of Differences between Sample Proportions (P₁ − P₂)<br />
μ = π₁ − π₂<br />
σ = √[π₁(1 − π₁)/n₁ + π₂(1 − π₂)/n₂]<br />
Estimates for π, μ and σ are found from sample data and are given by p, x̄ and s.<br />
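Item 5 of the summary (σ_X̄ = σ_X/√n) can be illustrated by simulation. This is a sketch in Python (not the course's R-cmdr): it draws many samples of size 25 from the N(60, 15²) population of review exercise 2 and checks that the sample means have mean near 60 and standard deviation near 15/√25 = 3.<br />

```python
import math
import random
import statistics

# Population: mu = 60, sigma = 15; samples of n = 25
random.seed(1)
mu, sigma, n = 60, 15, 25
means = [statistics.fmean(random.gauss(mu, sigma) for _ in range(n))
         for _ in range(20000)]
mean_of_means = statistics.fmean(means)  # close to mu = 60
sd_of_means = statistics.stdev(means)    # close to sigma/sqrt(n) = 3
```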
[B] SUMMARY: CONFIDENCE INTERVALS<br />
1. Mean: x̄ ± t_ν s/√n with ν = n − 1 D.F.<br />
2. Difference Between Means (small samples and independent, normal populations with equal variances):<br />
(x̄₁ − x̄₂) ± t_ν s_p √[1/n₁ + 1/n₂] with ν = n₁ + n₂ − 2.<br />
Here, s_p² = [(n₁ − 1)s₁² + (n₂ − 1)s₂²] / (n₁ + n₂ − 2)<br />
Note: If samples ≥ 30, (x̄₁ − x̄₂) ± 1.96 √[s₁²/n₁ + s₂²/n₂]<br />
3. Difference Between Means (paired populations): d̄ ± t_ν s_d/√n with ν = n − 1<br />
4. Proportion: p ± 1.96 √[p(1 − p)/n]<br />
5. Difference Between Two Proportions: (p₁ − p₂) ± 1.96 √[p₁(1 − p₁)/n₁ + p₂(1 − p₂)/n₂]<br />
SECTION 6<br />
This section reviews hypothesis testing, type 1 <strong>and</strong> type 2 errors, conclusive <strong>and</strong> inconclusive<br />
results <strong>and</strong> the power <strong>of</strong> a study.<br />
Null <strong>and</strong> Alternative Hypotheses<br />
Study Based <strong>and</strong> Data Driven Hypotheses<br />
One <strong>and</strong> Two Sided Tests<br />
Four Steps in the Hypothesis Testing Procedure<br />
Examples<br />
Pooled proportion estimate<br />
Clinical <strong>and</strong> Ecological Importance<br />
Conclusive <strong>and</strong> Inconclusive Results<br />
Errors in Hypothesis Testing<br />
Power <strong>of</strong> a Study<br />
Examples<br />
215<br />
Section 6
Hypothesis Testing<br />
In most scientific studies we set up hypotheses beforehand about the treatments (or populations) which are the focus of the study. A null hypothesis (H₀) is a claim about a treatment which is assumed to be true unless the data collected in our study show substantial evidence against H₀. At the same time we propose a research or alternative hypothesis (H_A) which will be adopted if there is sufficient evidence against the null hypothesis.<br />
There are two types of alternative hypotheses:<br />
(i) a study based hypothesis, which implies that we do not know at the outset whether a new treatment is beneficial or possibly harmful, and<br />
(ii) a data based hypothesis, which is suggested by the very nature of the collected data and which will usually suggest treatment benefit.<br />
If the data suggest harm we are likely to terminate<br />
the study quickly but if the data suggest benefit we<br />
need to know if the benefit is clinically important.<br />
The study based alternative will usually lead to a<br />
two sided test while the data based alternative<br />
will lead to a one sided test. In the literature, the<br />
two sided test is by far the most common.<br />
There are FOUR STEPS in the st<strong>and</strong>ard<br />
hypothesis testing procedure.<br />
Step (1) A null hypothesis (H 0 ) is assumed about<br />
a population parameter.<br />
Step (2) An alternative (research) hypothesis is<br />
proposed. This is accepted if H 0 is rejected.<br />
Step (3) A test statistic is computed from data.<br />
It is the st<strong>and</strong>ardised value <strong>of</strong> a sample<br />
mean, sample proportion or sample<br />
difference obtained from the data. It is either<br />
a z-score (large sample) or a t-score (for<br />
small samples) given by<br />
test statistic = (observed sample value − null value) / (estimated standard error)<br />
That is, the number <strong>of</strong> st<strong>and</strong>ard deviations<br />
from null value to the sample value. It is this<br />
test statistic which allows calculation <strong>of</strong> the<br />
p-value associated with the outcome <strong>of</strong> a<br />
particular study.<br />
Step (4) The probability <strong>of</strong> observing the value <strong>of</strong><br />
the test statistic in step (3), or a value which is<br />
even more extreme, is calculated under the<br />
assumption that the null hypothesis is true.<br />
This probability is the p-value for the test<br />
statistic. The test statistic has <strong>of</strong> course<br />
summarized the data in the study. We draw<br />
appropriate conclusions if the p-value is less<br />
than 0.05.<br />
Examples Hypothesis Testing<br />
Exercise: Suppose the resting pulse rates for<br />
young women are normally distributed with mean<br />
μ = 66 <strong>and</strong> st<strong>and</strong>ard deviation σ = 9.2 beats per<br />
minute. A drug for the treatment <strong>of</strong> a medical<br />
condition is administered to 100 young women<br />
<strong>and</strong> their average pulse rate is found to be x = 68<br />
beats per minute. Because the drug had for a long<br />
time been observed to increase pulse rates, test<br />
the claim that the drug does in fact increase the<br />
pulse rates. (i.e. H A is data based.)<br />
Solution:<br />
Step (1) H 0 : μ = 66 (the null hypothesis)<br />
Step (2) H A : μ > 66 (the research hypothesis)<br />
Step(3) x = 68 from sample data. Assuming H 0<br />
is true, <strong>and</strong> noting that population st<strong>and</strong>ard<br />
deviation is known, st<strong>and</strong>ardising x leads to<br />
z = (observed sample mean − null mean) / (standard error of the mean)<br />
= (x̄ − μ) / (σ/√n)<br />
= (68 − 66) / (9.2/√100)<br />
= 2.174<br />
Step (4): Calculate the p-value assuming μ = 66<br />
[sketch: normal curve centred at 66, with x̄ = 68 (z = 2.174) marked in the upper tail]<br />
p-value = Pr(X̄ > 68 given μ = 66)<br />
= Pr(Z > (68 − 66)/(9.2/√100))<br />
= Pr(Z > 2.174) = 0.015<br />
This means that if H 0 is true, there is only a<br />
probability <strong>of</strong> 0.015 <strong>of</strong> observing a sample mean<br />
as large or larger than 68. Hence there is little<br />
support for H 0 . Reject H 0 <strong>and</strong> conclude the mean<br />
pulse rate has been increased by the treatment.<br />
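The four steps for this example reduce to a few lines of code. This is a sketch in Python (not the course's R-cmdr); the function name is mine, and the p-value is the upper-tail area matching the one-sided, data based alternative.<br />

```python
import math

def phi(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def one_sample_z(xbar, mu0, sigma, n):
    """z statistic and upper-tail p-value for a one-sided test of mu > mu0."""
    z = (xbar - mu0) / (sigma / math.sqrt(n))
    return z, 1 - phi(z)

# Pulse rates: x-bar = 68, H0: mu = 66, sigma = 9.2, n = 100
z, p = one_sample_z(68, 66, 9.2, 100)  # z about 2.174, p about 0.015
```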
Notes: 1. R-cmdr <strong>and</strong> other statistical<br />
packages give a p-value directly beside the<br />
study result or the test statistic. If the p-value<br />
is less than 0.05 we have significance at the<br />
5% level <strong>and</strong> if p-value is less than 0.01 we<br />
have significance at the 1% level.<br />
2. If σ is unknown but estimated from the sample,<br />
the st<strong>and</strong>ardised statistic is t <strong>and</strong> the p-value is<br />
found from the t-table with appropriate degrees<br />
<strong>of</strong> freedom. (The exact p-value is not possible<br />
since only a few values are given at top <strong>of</strong><br />
columns in t-table.)<br />
e.g. Suppose s = 9.2 rather than σ = 9.2 <strong>and</strong><br />
sample size is n = 100.<br />
Then, p-value = Pr(t > (68 − 66)/(9.2/√100))<br />
= Pr(t > 2.174) with 99 DF<br />
t = 2.174 lies between the values in the columns headed p = 0.025 and p = 0.010. Hence the p-value lies between these two numbers. (R-cmdr gives the exact value.)<br />
Exercise: In a large overseas city it was<br />
estimated that 15% <strong>of</strong> girls between the ages <strong>of</strong><br />
14 <strong>and</strong> 18 became pregnant. Concerned parents<br />
<strong>and</strong> health workers introduced an educational<br />
programme in an effort to lower this percentage.<br />
After four years <strong>of</strong> the programme, a r<strong>and</strong>om<br />
sample <strong>of</strong> n = 293 18-year-old girls revealed that<br />
27 had become pregnant.<br />
(a) Define null and alternative hypotheses for investigating whether the proportion becoming pregnant after the educational programme has decreased. (Suppose the alternative hypothesis is one sided.)<br />
(b) Calculate the probability value.<br />
(c) State your conclusion.<br />
Step(1): H 0 : π = 0.15 (15% still become<br />
pregnant)<br />
Step(2): H A : π < 0.15 (less than 15% become<br />
pregnant)<br />
Step (3): Sample gives p = 27/293 = 0.092<br />
z = (observed proportion − null proportion) / (standard error of the proportion)<br />
= (p − π) / √[π(1 − π)/n] under H₀: π = 0.15<br />
= (0.092 − 0.15) / √[0.15(1 − 0.15)/293]<br />
= –2.78<br />
(use π = 0.15 and not 0.092 in the standard error)<br />
Step (4): p-value = Pr(Z < –2.78)<br />
= 0.5000 − 0.4973<br />
= 0.0027<br />
[sketch: standard normal curve with the lower tail below z = –2.78 shaded]<br />
There is evidence that after the education<br />
campaign the proportion becoming pregnant has<br />
reduced.<br />
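The same one-proportion test can be scripted. This is a sketch in Python (not the course's R-cmdr); the function name is mine. Note that, as in the worked steps, the standard error uses the null value π = 0.15, not the sample proportion.<br />

```python
import math

def phi(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def one_proportion_z(x, n, pi0):
    """z statistic and lower-tail p-value; the SE uses the null value pi0."""
    p_hat = x / n
    se = math.sqrt(pi0 * (1 - pi0) / n)
    z = (p_hat - pi0) / se
    return z, phi(z)

z, p = one_proportion_z(27, 293, 0.15)
# Unrounded p_hat = 27/293 gives z about -2.77; the worked steps round
# p_hat to 0.092 first, giving -2.78
```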
Exercise: The birthweight <strong>of</strong> a baby is thought to<br />
be associated with the smoking habits <strong>of</strong> the<br />
mother during pregnancy. The means <strong>and</strong><br />
variances <strong>of</strong> the INDIVIDUAL values in the two<br />
samples <strong>of</strong> birthweights, one for non-smoking<br />
<strong>and</strong> the other for smoking mothers, are in the<br />
following table.<br />
                        Mother non-smoker   Mother smoker<br />
Sample Size (n_i)              100               50<br />
Sample Mean (x̄_i)             3.45             3.30<br />
Sample Variance (s_i²)         0.36             0.32<br />
Investigate the claim that the mean birthweights<br />
are different in the two groups. In this case we<br />
shall suppose the alternative is study driven rather<br />
than data driven.<br />
Step(1): H 0 : μ NS – μ S = 0 (no difference in the<br />
mean birth weight)<br />
Step(2): H A : μ NS – μ S ≠ 0 (there is a difference<br />
in the mean birth weight)<br />
Step (3): Sample gives x̄_NS − x̄_S = 3.45 − 3.30 = 0.15<br />
Standardising gives the test statistic<br />
t = (observed difference of means − null difference) / (estimated standard error of the difference)<br />
= (0.15 − 0) / [s_p √(1/100 + 1/50)]<br />
where s_p² = [(n₁ − 1)s₁² + (n₂ − 1)s₂²] / (n₁ + n₂ − 2)<br />
= [99(0.36) + 49(0.32)] / 148<br />
= 0.3468<br />
so t = 0.15 / [√0.3468 √(1/100 + 1/50)]<br />
= 0.15 / 0.102<br />
= 1.47 (use of pooling optional)<br />
Since the sample is large we can use the st<strong>and</strong>ard<br />
normal z in place <strong>of</strong> t with 148 degrees <strong>of</strong><br />
freedom.<br />
Step (4): In this case (two sided H_A)<br />
p-value = Pr(|z| > 1.47)<br />
= Pr(z > 1.47 or z < –1.47)<br />
[sketch: standard normal curve with both tails beyond ±1.47 shaded]<br />
p-value = 2(0.5 − 0.4292) = 2(0.0708) = 0.1416<br />
There is no evidence that the mean birthweights<br />
for the smoking <strong>and</strong> non-smoking groups are<br />
different.<br />
Note: If the test had been one-sided [H_A: μ_NS – μ_S > 0],<br />
p-value = Pr(z > 1.47) = 0.0708<br />
[sketch: standard normal curve with the upper tail beyond z = 1.47 shaded]<br />
There is again no evidence the non-smoking<br />
group has a greater mean birthweight than the<br />
smoking group.<br />
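The pooled two-sample calculation can be checked in code. This is a sketch in Python (not the course's R-cmdr); the function name is mine, and it uses the normal approximation for the two-sided p-value, which the notes justify because 148 degrees of freedom is large.<br />

```python
import math

def phi(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def two_sample_pooled(x1bar, s1sq, n1, x2bar, s2sq, n2):
    """Pooled test statistic and a two-sided p-value from the
    normal approximation (reasonable for large samples)."""
    sp2 = ((n1 - 1) * s1sq + (n2 - 1) * s2sq) / (n1 + n2 - 2)
    se = math.sqrt(sp2 * (1 / n1 + 1 / n2))
    t = (x1bar - x2bar) / se
    return t, 2 * (1 - phi(abs(t)))

# Birthweights: non-smokers (n=100) vs smokers (n=50)
t, p = two_sample_pooled(3.45, 0.36, 100, 3.30, 0.32, 50)
# t about 1.47, p about 0.14: no evidence of a difference
```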
Notes on Hypothesis Testing<br />
1. There is some terminology for reporting the<br />
result <strong>of</strong> a test.<br />
(a) If the p-value < 0.05 the result is<br />
“significant at α = 0.05 level” (5% level)<br />
or “There is some evidence that …”<br />
(b) If the p-value < 0.01 the result is<br />
“significant at α = 0.01 level” (1% level)<br />
or “There is strong evidence that …”<br />
(c) If p-value > 0.05 the result is “not<br />
significant” or “There is no evidence that<br />
…”<br />
In the above α is generally a pre-selected cut<strong>of</strong>f<br />
value.<br />
2. Choosing a smaller level of significance requires the test statistic to be more extreme before H₀ is rejected.<br />
3. Whether the test is one or two sided<br />
depends on whether the alternative<br />
hypothesis is data based or study based.<br />
4. If H A is one sided, the p-value is the area<br />
in one tail <strong>of</strong> the distribution <strong>of</strong> the<br />
st<strong>and</strong>ardised test statistic.<br />
5. If H A is two-sided, the p-value is the area<br />
in the two tails <strong>of</strong> the distribution <strong>of</strong> the<br />
st<strong>and</strong>ardised test statistic.<br />
6. If using t-table choose column heading 2p<br />
for a two sided alternative hypothesis <strong>and</strong><br />
p for a one sided alternative hypothesis.<br />
7. When a test statistic leads to rejection <strong>of</strong><br />
H 0 , there are two possible explanations<br />
(a) H 0 is true but r<strong>and</strong>om variation has given<br />
an improbable test statistic.<br />
(b) H 0 is not true, <strong>and</strong> the observed statistic is<br />
consistent with H A .<br />
The second alternative (b) is taken, but there is a possible error. The probability of this error is α, the level of significance, which is usually 0.05 or 0.01. α is called the type one error rate (it is the chance of a false conviction in a court of law; i.e. we must operate beyond reasonable doubt, hence choose a small α).<br />
8. In published work a p-value is quoted<br />
beside the study result (indicating whether a<br />
new treatment, say, has an effect) <strong>and</strong> a<br />
confidence interval is reported (giving some<br />
idea <strong>of</strong> the magnitude <strong>of</strong> an effect).<br />
But one problem still remains when reporting<br />
conclusions from a scientific study. It is possible<br />
to obtain a result which is statistically<br />
significant (with a small p-value) yet from a<br />
clinical point <strong>of</strong> view the result is unimportant.<br />
That is, it is not clinically important.<br />
(Ecological importance is an equivalent concept.)<br />
Example: There are two treatments for raising<br />
iron levels in infants, a st<strong>and</strong>ard treatment A <strong>and</strong><br />
a new treatment B.<br />
A mean for treatment B that is 20 units greater<br />
than the mean for treatment A is recognised as a<br />
clinically important improvement which would<br />
lead to widespread introduction <strong>of</strong> treatment B.<br />
An experiment produces the following mean differences, x̄_B − x̄_A, with a 95% confidence<br />
interval. Decide in each case whether the p-value<br />
is less than or greater than 0.05. Report whether<br />
the scientific result is conclusive or inconclusive<br />
by considering clinical importance.<br />
(a) Mean Diff = 40. Confidence Interval is<br />
(33, 47)<br />
The confidence interval does not include the<br />
null hypothesis value so the p-value is less<br />
than 0.05 (a statistically significant result).<br />
The point estimate <strong>of</strong> 40 is in the direction<br />
indicating treatment benefit. The result is<br />
conclusive <strong>and</strong> there is evidence the benefit<br />
is enough to be important.<br />
(b) Mean Diff = 36. Confidence interval is (18, 54)<br />
p-value < 0.05. The result is conclusive.<br />
There is treatment benefit but it may not be<br />
as large as hoped.<br />
(c) Mean Diff = 27. Confidence interval is (–4, 58)<br />
p-value > 0.05 <strong>and</strong> inconclusive result. The<br />
confidence interval includes H 0 . The new<br />
treatment is probably better than treatment<br />
A but we cannot completely rule out the<br />
possibility that it is worse.<br />
(d) Mean Diff = –7. Confidence interval is (–55, 41)<br />
p-value > 0.05 and the result is inconclusive. The<br />
new treatment is likely to be harmful but we<br />
cannot rule out the possibility that there is a<br />
clinically important benefit.<br />
(e) Mean Diff = –12. Confidence Interval =<br />
(–34, 10)<br />
p-value > 0.05 <strong>and</strong> result is conclusive. Any<br />
benefit is not clinically important <strong>and</strong> it is<br />
more likely there will be treatment harm.<br />
Treatment B should not be pursued as a<br />
potential treatment.<br />
(f) Mean Diff = –13. Confidence interval =<br />
(–19, –7)<br />
p-value < 0.05 <strong>and</strong> result very conclusive.<br />
The new treatment is harmful.<br />
(g) Mean Diff = 11. Confidence interval =<br />
(4, 18)<br />
p-value < 0.05. The result is conclusive.<br />
There is treatment benefit but not enough to<br />
lead to the introduction <strong>of</strong> treatment B.<br />
Note: In practice you decide what is clinically<br />
important. This is difficult but as you gain<br />
experience with your own area <strong>of</strong> research it<br />
becomes easier <strong>and</strong> you are able to critique any<br />
published research.<br />
Summary <strong>of</strong> previous results<br />
0 = null value<br />
20 = clinically important improvement.<br />
p-value < 0.05 implies confidence interval<br />
excludes the null value <strong>of</strong> zero.<br />
p-value > 0.05 implies null value included<br />
The result can be conclusive or inconclusive.<br />
(a)–(g): [sketch: the seven confidence intervals drawn on a number line marked with the null value 0 and the clinically important improvement 20]<br />
(a) Conclusive p-value < 0.05<br />
(b) Conclusive p-value < 0.05<br />
(c) Inconclusive p-value > 0.05<br />
(d) Inconclusive p-value > 0.05<br />
(e) Conclusive p-value > 0.05<br />
(f) Conclusive p-value < 0.05<br />
(g) Conclusive p-value < 0.05<br />
Clearly, if the confidence interval is too wide, there is a greater chance of an inconclusive result.<br />
Example<br />
A clinical trial is set up to compare two drugs<br />
(pravastatin, A, <strong>and</strong> a control, B) for lowering<br />
cholesterol. The mean cholesterol reductions in<br />
the two groups are compared. The probability<br />
that such a study will correctly detect a clinically<br />
important difference between the effects <strong>of</strong> the<br />
drugs is called the power <strong>of</strong> the study. Power<br />
depends on the size <strong>of</strong> the difference, the<br />
variability <strong>of</strong> estimates, sample size, <strong>and</strong> the level<br />
<strong>of</strong> significance.<br />
Figure 12.4: 95% confidence intervals for different sample sizes (n = 10, 20, 50 and 200). [sketch: the vertical scale runs from "mean reduction greater in B" through 0 (no difference) to "mean reduction greater in A", with the target treatment difference (clinically important) marked; the intervals narrow as n increases]<br />
If the two samples are <strong>of</strong> size 5 (giving total<br />
n = 10), the three 95% confidence intervals<br />
include zero difference <strong>and</strong> the important<br />
difference. As n increases, the confidence<br />
intervals become smaller <strong>and</strong> it is possible to<br />
detect the difference.<br />
NB 1. It is helpful to aim for a confidence interval whose width (or range) is no greater than the clinically important treatment difference, as in this case the result obtained must be conclusive (rather than inconclusive).<br />
2. If the clinically important effect size is large,<br />
the confidence interval can be wider <strong>and</strong> hence<br />
a smaller sample taken.<br />
3. A larger sample gives a smaller confidence interval.<br />
4. Less random variation in the data gives a smaller confidence interval. (That is, the value of σ is smaller.)<br />
5. A smaller level <strong>of</strong> significance (α), say 0.01,<br />
gives a wider confidence interval <strong>and</strong> hence<br />
smaller power as there is less chance <strong>of</strong><br />
detecting a clinically important effect in a<br />
conclusive way.<br />
Errors in Hypothesis Testing<br />
The level <strong>of</strong> significance (α) is chosen by the<br />
researcher, usually 0.05, <strong>and</strong> is the chance that the<br />
null hypothesis (H 0 ) will be rejected when in<br />
actual fact it is true. It would seem sensible for α<br />
to be made as small as possible. Then the<br />
probability <strong>of</strong> correctly not rejecting H 0 when it is<br />
true will be large. But this is not the real issue in<br />
a scientific study involving hypothesis testing.<br />
The real issue is to have high probability <strong>of</strong><br />
rejecting H 0 when in fact H 0 is false or needing to<br />
be rejected. That is, a high probability that a test<br />
will correctly detect a real treatment effect <strong>of</strong> a<br />
given magnitude. This is known as the power <strong>of</strong><br />
the test, <strong>and</strong> involves detecting clinically<br />
worthwhile improvements as defined by<br />
researchers. Power is related to the level <strong>of</strong><br />
significance. A smaller value for the level <strong>of</strong><br />
significance results in a smaller power. A power<br />
between 80% <strong>and</strong> 90% is desirable.<br />
These ideas have a parallel in the courts of law in this country. To illustrate, suppose we are interested in testing a new treatment to see if it has an effect.

1. The treatment is "arrested".
2. The treatment is charged with having an effect (HA).
3. The treatment is assumed "innocent" (no effect, H0) until the evidence (the data) shows otherwise. The evidence is summarised in the test statistic.
4. The level of significance (α) is the probability that an innocent treatment will be convicted, that is, the probability of a false conviction. This error must be kept small.
5. The power is the probability that a guilty treatment will be convicted. This is the best outcome for a court case as it is a correct conviction. This probability should be large, since then we correctly convict the treatment and conclude there is an important treatment effect. Power should be at least 0.80 or 0.90.
Some computer packages (Minitab is one) have an excellent routine for analysing the power of a study and for showing how power, data variability, sample size, level of significance and clinically important effects are related.

EXAMPLE: The problem is to design a milk feeding trial in 5 year old children to see if a daily supplement of milk for a year leads to an increased gain in height compared with a control group (such a study would be both expensive and difficult for practical and ethical reasons). It is known that at this age children grow 6 cm in a year with a standard deviation of 2 cm (σ). The effect of milk on height gain is important if it results in an extra gain of at least 0.5 cm. We want a high probability of detecting such a difference, so we set the power at 0.9 (90%) and choose a 1% (α = 0.01) significance level.

Known: σ = 2 (data variability)
       α = 0.01 (chosen level of significance)
       Clinically important difference = 0.5 cm
       Target power = 0.90 (90%)
Find:  Sample size.
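Before turning to the Minitab output, the required sample size can be sketched with the standard normal-approximation formula n = 2σ²(z₁₋α/₂ + z_power)²/δ² per group. This is a check, not the course software; the exact t-based calculation that Minitab performs (iterating because the degrees of freedom depend on n) gives an answer one larger.

```python
from math import ceil
from statistics import NormalDist

def n_per_group(sigma, delta, alpha, power):
    """Normal-approximation sample size per group for detecting a mean
    difference delta with a two-sided two-sample test."""
    z = NormalDist()
    za = z.inv_cdf(1 - alpha / 2)   # two-sided critical value
    zb = z.inv_cdf(power)           # quantile for the target power
    return ceil(2 * (sigma / delta) ** 2 * (za + zb) ** 2)

print(n_per_group(sigma=2.0, delta=0.5, alpha=0.01, power=0.90))  # 477
```

The approximation gives 477 per group; Minitab's exact t-based routine reports 478.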
(a) Find the sample size required to meet these conditions (σ = 2.0 cm; clinically important difference = 0.5 cm; power = 0.9; α = 0.01).

Step 1. STAT > POWER AND SAMPLE SIZE > 2-SAMPLE t (i.e. choose an unpaired t-test).
Step 2. Specify a power value of 0.9, a clinically important difference of 0.5 and sigma of 2.0.
Step 3. Choose "Not equal" for a study based on a two sided alternative hypothesis, and a significance level alpha of 0.01.
A printout is as follows:

Power and Sample Size
2-Sample t Test
Testing mean 1 = mean 2 (versus not =)
Calculating power for mean 1 = mean 2 + difference
Alpha = 0.01  Sigma = 2

              Sample   Target   Actual
  Difference    Size    Power    Power
         0.5     478   0.9000   0.9001

There need to be 478 children in each sample, meaning 956 children in total (the printout gives the size of one sample).
[Note: the actual power differs slightly from the target because the sample size is rounded up to a whole number.]
(b) Now consider clinically important differences of 0.5, 0.6, 0.7, 0.8, 0.9, 1.0.

A printout gives:

Power and Sample Size
2-Sample t Test
Testing mean 1 = mean 2 (versus not =)
Calculating power for mean 1 = mean 2 + difference
Alpha = 0.01  Sigma = 2

              Sample   Target   Actual
  Difference    Size    Power    Power
         0.5     478   0.9000   0.9001
         0.6     333   0.9000   0.9007
         0.7     245   0.9000   0.9006
         0.8     188   0.9000   0.9006
         0.9     149   0.9000   0.9009
         1.0     121   0.9000   0.9008

Notice that smaller samples suffice to detect the larger clinically important differences. The necessary total sample size falls from 956 (for a difference of 0.5) to 242 (for a difference of 1.0) [similar to moving from a high resolution microscope to a pocket magnifying glass, which is all that is needed to detect the larger difference].
(c) Halve the value of sigma to 1.0 and repeat the analysis in (b).

Power and Sample Size
2-Sample t Test
Testing mean 1 = mean 2 (versus not =)
Calculating power for mean 1 = mean 2 + difference
Alpha = 0.01  Sigma = 1

              Sample   Target   Actual
  Difference    Size    Power    Power
         0.5     121   0.9000   0.9008
         0.6      85   0.9000   0.9027
         0.7      63   0.9000   0.9032
         0.8      49   0.9000   0.9058
         0.9      39   0.9000   0.9051
         1.0      32   0.9000   0.9060

Notice how greater precision (a smaller standard deviation) in the data reduces the sample size required to achieve the desired power: the total is now only 64 for a difference of 1.0.
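The pattern in (b) and (c) reflects the approximate scaling law n ∝ (σ/δ)²: halving σ, or doubling the clinically important difference δ, cuts the required sample size to roughly a quarter. The ratio is exactly 4 under the normal approximation; the t-based Minitab values drift from it at small n. A quick arithmetic check against the printed sample sizes:

```python
# Ratios predicted by n proportional to (sigma/delta)^2, checked against
# the Minitab per-group sample sizes quoted in parts (b) and (c).
pairs = [
    (478, 121),   # delta 0.5 -> 1.0 at sigma = 2: predicted ratio 4
    (121, 32),    # sigma 2 -> 1 at delta = 0.5:   predicted ratio 4
]
for big, small in pairs:
    # near 4 (3.95 and 3.78); the gap grows at small n, where the
    # t correction built into Minitab's exact calculation matters more
    print(big / small)
```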
(d) A doctor set up a study involving 100 children (50 in each group) and monitored the children for one year. The doctor wanted to detect a clinically important difference of 0.5, knew from historical information that sigma = 2.0, and set up a study based (two sided) test at the α = 0.05 (5%) level of significance. The printout obtained for the doctor after the study was carried out follows.

Power and Sample Size
2-Sample t Test
Testing mean 1 = mean 2 (versus not =)
Calculating power for mean 1 = mean 2 + difference
Alpha = 0.05  Sigma = 2

              Sample
  Difference    Size    Power
         0.5      50   0.2358

The power of this study is only 0.2358. The probability of detecting the clinically important difference of 0.5 is far too small. The study was a waste of effort in the sense that it is unlikely to detect a difference as small as 0.5, even though a difference of this size is important.
If α = 0.01, the power drops to 0.0891.
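The low power reported above can be checked with a normal-approximation sketch; the exact t-based calculation (as Minitab performs) gives the slightly smaller 0.2358 and 0.0891.

```python
from math import sqrt
from statistics import NormalDist

def power_two_sample(n, sigma, delta, alpha=0.05):
    """Approximate power of a two-sided two-sample z test with n
    observations per group (normal approximation to the t test)."""
    z = NormalDist()
    za = z.inv_cdf(1 - alpha / 2)
    lam = delta / (sigma * sqrt(2 / n))   # standardised true difference
    # probability the test statistic falls in either rejection region
    return z.cdf(lam - za) + z.cdf(-lam - za)

print(round(power_two_sample(50, 2.0, 0.5, alpha=0.05), 4))  # ~0.2395
print(round(power_two_sample(50, 2.0, 0.5, alpha=0.01), 4))  # ~0.0925
```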
Revision Examples

1. Exam 2006:
In a study to assess the impact of an industrial development on a nearby river, water temperature was measured. It has been suggested that the mean water temperature is higher in this river than in a similar river 30 km away that is not affected by the development. Daily temperatures in degrees Celsius were taken at midday for a fortnight in February from both rivers. Two readings from the "unaffected" river were spoiled. The data are summarised below:

                          Unaffected   Affected
                               river      river
Sample Size (n_i)                 12         14
Sample Mean (x̄_i)             15.41      16.49
Sample Variance (s_i²)         1.963      2.132

(a) (4 marks) Assuming that temperature has a common variability in both rivers and the values are approximately normal, calculate the pooled estimate for the common variance and an estimate for the standard error of the difference between the two means.
s_p² = [11(1.963) + 13(2.132)] / 24 = 2.055

Pooled variance = 2.055

standard error = √[2.055(1/12 + 1/14)] = 0.564

Estimated standard error = 0.564
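The two calculations above can be checked in a few lines (a sketch of the arithmetic, not part of the exam answer):

```python
from math import sqrt

n1, n2 = 12, 14            # sample sizes (unaffected, affected river)
s1sq, s2sq = 1.963, 2.132  # sample variances

# pooled estimate of the common variance, weighted by degrees of freedom
sp2 = ((n1 - 1) * s1sq + (n2 - 1) * s2sq) / (n1 + n2 - 2)

# standard error of the difference between the two sample means
se = sqrt(sp2 * (1 / n1 + 1 / n2))

print(round(sp2, 3), round(se, 3))  # 2.055 0.564
```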
(b) (2 marks) Using the appropriate value from the t-table, construct the 95% confidence interval for the difference in mean temperature between the affected and unaffected rivers.

1.08 ± t₂₄(0.564) where t₂₄ = 2.064
or 1.08 ± 1.164

Confidence interval: –0.084 < μ_A – μ_U < 2.244
(c) (2 marks) A mean temperature increase of 0.6 degrees Celsius is ecologically important. State your conclusion about the true mean temperature difference from the confidence interval in (b).

Conclusion: Result inconclusive. There is no evidence of a difference in mean temperature, but an important increase cannot be ruled out.

(d) (1 mark) State one way in which you might increase the power of this study.

Statement: Increase the sample size.

(e) (5 marks) A more powerful study is to be set up which has a 95% confidence interval for the difference between the mean river temperatures no greater than 0.6 degrees Celsius wide. Assuming the same number of measurements is taken from each river, and that the pooled estimate for the common variance from (a) is the best estimate for the variability, approximately how many readings should be taken from each river?
Taking 1.96 as the multiplier, the 95% C.I. is

(x̄₂ – x̄₁) ± 1.96 √[2.054(1/n + 1/n)]

But the required precision needs (x̄₂ – x̄₁) ± 0.3, so

1.96 √(2 × 2.054 / n) ≤ 0.3

∴ n ≥ (1.96)²(2.054)(2) / (0.3)²
∴ n ≥ 175.3

Number of readings from each river: 176
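A quick check of this sample-size calculation (using the pooled variance 2.054 carried over from (a), as in the working above):

```python
from math import ceil

sp2 = 2.054        # pooled variance estimate from part (a)
half_width = 0.3   # CI no wider than 0.6, so half-width 0.3
z = 1.96           # 95% normal multiplier, as in the working

# solve 1.96 * sqrt(2 * sp2 / n) <= 0.3 for n, rounding up
n = ceil(z**2 * 2 * sp2 / half_width**2)
print(n)  # 176
```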
(f) (2 marks) The 95% confidence interval from the study in (e) is (0.49, 1.12). What conclusion would you now reach about the true mean temperature difference?

Conclusion: Result conclusive. There is evidence of increased temperatures, but the increase may not be ecologically important.
2. Exam 2005
An ecologist must determine whether a cleanup project at a lake has been effective. This is to be done by recording dissolved oxygen content (in parts per million, ppm) in the lake, with higher values indicating less pollution. Prior to the cleanup project a random sample of 50 dissolved oxygen readings was recorded around the lake. Six months after the initiation of the cleanup a second random sample of 70 readings was recorded. Results are summarised in the following table.

                         Before Cleanup   After Cleanup
Sample Size (n_i)                    50              70
Sample Mean (x̄_i)                10.30           10.46
Sample Variance (s_i²)             0.32            0.36

(a) (1 mark) State null and alternative hypotheses for testing the data driven hypothesis that the cleanup has resulted in an increase in the dissolved oxygen content.

Null hypothesis, H0: μ_BC = μ_AC
Alternative hypothesis, HA: μ_BC < μ_AC
(b) (6 marks) Calculate the pooled estimate for the common variance of the two samples, an estimate for the standard error of the difference between the two means, and a standardised normal z statistic for testing the hypotheses.

s_p² = [49(0.32) + 69(0.36)] / 118 = 0.3434

Pooled variance = 0.3434

estimated standard error = √[0.3434(1/50 + 1/70)] = 0.1085

Standard error = 0.1085

z = (10.46 – 10.30) / 0.1085 = 1.475

Standardised z statistic = 1.475
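The same arithmetic in code, as a check. The one-sided p-value comes out near 0.070; the 0.0694 quoted in (c) reflects rounding z before using normal tables.

```python
from math import sqrt
from statistics import NormalDist

n1, n2 = 50, 70
xbar1, xbar2 = 10.30, 10.46
s1sq, s2sq = 0.32, 0.36

sp2 = ((n1 - 1) * s1sq + (n2 - 1) * s2sq) / (n1 + n2 - 2)  # pooled variance
se = sqrt(sp2 * (1 / n1 + 1 / n2))     # SE of the difference in means
z = (xbar2 - xbar1) / se               # standardised test statistic
p = 1 - NormalDist().cdf(z)            # one-sided p-value (HA: increase)

print(round(sp2, 4), round(se, 4), round(z, 3))  # 0.3434 0.1085 1.475
```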
(c) (2 marks) Find the probability value (p-value) for the z statistic in (b) and state your conclusion from the p-value (using a 5% level of significance).

p-value = 0.5 – 0.4306 = 0.0694

Conclusion: There is no evidence (at the 5% level) that the cleanup has raised the mean dissolved oxygen reading.

(d) (2 marks) Construct the 95% confidence interval for the difference in the dissolved oxygen means between the readings before cleanup and the readings after cleanup.

(10.46 – 10.30) ± t₁₁₈(0.1085) where t₁₁₈ = 1.98 (accept 1.96)
i.e. 0.160 ± 0.215

Confidence interval: –0.055 < μ_AC – μ_BC < 0.375
(e) (1 mark) The power of this study is small. Suggest one way in which you might increase the power of this study.

Answer: Select larger samples.

(f) (3 marks) A more powerful study produced the 95% confidence interval (0.04, 0.27). What conclusions would you reach about the p-value of this study result and the effect of the cleanup project, if an increase of 0.25 in the dissolved oxygen mean is ecologically important?

Conclusion: p-value < 0.05. There is evidence the oxygen mean has increased after the cleanup, but it may not be an important increase (or may not be as great as hoped).

[Question 4: 15 marks]
SECTION 7

One factor analysis of variance, post analysis of variance tests on means, and multiple comparison procedures.
ONE FACTOR ANALYSIS OF VARIANCE

This section of the course returns to the continuous outcome theme.

In the studies of this type considered so far there have been two treatments, usually with a new treatment compared against a control or placebo. In the first half of the semester we answered questions about the effect of the new treatment by using the two sample t-test to find p-values and confidence intervals for the comparison of means. These studies involved an outcome measured on a continuous scale, and the scores under the two treatments were compared.

Regression procedures were then developed which allowed us to introduce potential confounding variables and hence obtain adjusted or modified confidence intervals and different p-values.

We are now going to investigate how to analyse continuous data when there are more than two treatments of interest.
Example: A general surgeon believes that providing pain relief immediately following surgery improves the level of comfort post-surgery. Three pain killing drugs and a placebo are randomly administered to patients immediately following tonsillectomies. The times in hours until onset of pain are as follows. The study is double blind.

Placebo   Drug A   Drug B   Drug C
   1.6      2.6      1.2      3.6
   0.3     12.6      1.7      3.2
   1.1      2.8      0.9      3.4
   0.4      4.5      2.1      3.9
   1.4      5.3      1.3      4.9
   2.4                        4.4
                              3.9

Which drugs, if any, may be better than placebo?

Notice that there are now three comparisons with placebo. We can do better than just making the three comparisons using three unpaired t-tests.
Example: A comparison was made of protein intake among three groups of post-menopausal women: (1) women eating a standard American diet (STD), (2) women eating a lacto-ovo-vegetarian diet (LAC), and (3) women eating a strict vegetarian diet (VEG). It was hypothesized that protein intake was affected by diet. The protein intakes (mg) for 30 women are:

STD   LAC   VEG
 76    62    47
 63    76    75
 84    71    32
 72    61    40
 66    35    52
 83    56    37
 77    44    56
 79    58    35
 72    55    27
 69    49    66

What are the effects of diet on protein intake?

Notice that there are three comparisons which could be of interest.
We now investigate the problem of how to deal with multiple comparisons. The unpaired t test for comparing two sample means will be extended to situations involving more than two samples. As with simple linear regression, the idea is again to partition the total variability of a response or outcome measure into components due to different sources of variation.

Example: The effect of five drug treatments (A to E) on reduction of fever is investigated. Four children are assigned to each treatment and temperature reductions are measured in appropriate units, with high values showing greater reduction. The responses are as follows:

          A     B     C     D     E
          9     7     2     4     4
          8     4     3     8     9
          6     9     4     1     6
          9     6     3     3     3
Total    32    26    12    16    22   108
Mean    8.0   6.5   3.0   4.0   5.5   5.4
One source of variation is due to differences between the effects of the drugs; the other source of variation is the random variation between the individual children within each drug treatment. But which of these is most responsible for explaining the variation in the responses?

The Method

Each response can be divided into three components as follows:

Response = overall effect present in each value
         + a drug treatment (factor) effect
         + random error (or residual effect)

From the estimates for these components we find a number measuring treatment variation and a number measuring residual (random error) variation. These values are compared using an F statistic, as in regression.
Estimation of Components (for reference)

1. Overall mean = 5.4 (this is the estimate for the overall effect, with one degree of freedom).

2. The five treatment effects are estimated as follows:
   A: 8.0 – 5.4 =  2.6
   B: 6.5 – 5.4 =  1.1
   C: 3.0 – 5.4 = –2.4
   D: 4.0 – 5.4 = –1.4
   E: 5.5 – 5.4 =  0.1

   These add to zero (as they are deviations from their mean). There are 5 – 1 = 4 degrees of freedom.

   Note: The responses for A are, on average, 2.6 units above the overall mean, while the responses for D are, on average, 1.4 units below the overall mean.
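A short sketch of this step of the decomposition (plain Python, not part of the notes' hand calculation):

```python
data = {"A": [9, 8, 6, 9], "B": [7, 4, 9, 6], "C": [2, 3, 4, 3],
        "D": [4, 8, 1, 3], "E": [4, 9, 6, 3]}

grand = sum(sum(v) for v in data.values()) / 20   # overall mean, 5.4

# treatment effect = treatment mean minus the overall mean
effects = {k: sum(v) / len(v) - grand for k, v in data.items()}

print({k: round(e, 1) for k, e in effects.items()})
# {'A': 2.6, 'B': 1.1, 'C': -2.4, 'D': -1.4, 'E': 0.1}
print(sum(effects.values()))  # effects sum to zero (up to float error)
```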
3. The residuals (including random error) are estimated by subtracting the overall mean and the treatment effect from each response:

   A: 9 = 5.4 + 2.6 + 1.0
      8 = 5.4 + 2.6 + 0.0
      6 = 5.4 + 2.6 – 2.0
      9 = 5.4 + 2.6 + 1.0
   B: 7 = 5.4 + 1.1 + 0.5
      4 = 5.4 + 1.1 – 2.5
      9 = 5.4 + 1.1 + 2.5
      6 = 5.4 + 1.1 – 0.5
   C: 2 = 5.4 + (–2.4) – 1.0
      3 = 5.4 + (–2.4) + 0.0
      4 = 5.4 + (–2.4) + 1.0
      3 = 5.4 + (–2.4) + 0.0
   D: 4 = 5.4 + (–1.4) + 0.0
      8 = 5.4 + (–1.4) + 4.0
      1 = 5.4 + (–1.4) – 3.0
      3 = 5.4 + (–1.4) – 1.0
   E: 4 = 5.4 + 0.1 – 1.5
      9 = 5.4 + 0.1 + 3.5
      6 = 5.4 + 0.1 + 0.5
      3 = 5.4 + 0.1 – 2.5

   The residuals are the third values on the right.
There are 20 data values altogether and hence 20 degrees of freedom, but 5 degrees of freedom have been used up (1 for the overall mean, 4 for the treatment effects), leaving 15 for the residual effect.

Sums of Squares Computation

Σ(responses²) = 9² + 8² + 6² + 9² + … + 6² + 3²
              = 714 (with 20 DF)

Σ(overall means²) = 5.4² + … + 5.4²
                  = 20(5.4)²
                  = 583.2 (with 1 DF)

Σ(treatment effects²) = 2.6² + … + 2.6² + … + 0.1² + … + 0.1²
                      = 4[(2.6)² + (1.1)² + (–2.4)² + (–1.4)² + (0.1)²]
                      = 62.8 (with 5 – 1 = 4 DF)

Σ(residuals²) = (1.0)² + (0.0)² + (–2.0)² + … + (–2.5)²
              = 68.0 (with 15 DF)

From these, 714 = 583.2 + 62.8 + 68.0.

In general,
Total response Sum of Squares = overall mean SS + treatments SS + residuals (error) SS
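This whole partition, and the F statistic built from it in the table that follows, can be computed directly (plain Python rather than the R-cmdr output the notes use later):

```python
data = {"A": [9, 8, 6, 9], "B": [7, 4, 9, 6], "C": [2, 3, 4, 3],
        "D": [4, 8, 1, 3], "E": [4, 9, 6, 3]}

values = [x for v in data.values() for x in v]
n, k = len(values), len(data)                    # 20 values, 5 treatments
grand = sum(values) / n                          # overall mean 5.4

total_ss = sum(x**2 for x in values)             # 714.0
mean_ss = n * grand**2                           # 583.2
treat_ss = sum(len(v) * (sum(v) / len(v) - grand)**2 for v in data.values())
resid_ss = total_ss - mean_ss - treat_ss         # 68.0 by subtraction

F = (treat_ss / (k - 1)) / (resid_ss / (n - k))  # MS ratio, about 3.46
print(round(treat_ss, 1), round(resid_ss, 1), round(F, 2))
```

Unrounded, F comes out near 3.46; the notes' 3.47 arises from rounding the mean squares to 15.70 and 4.53 before dividing.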
Notes: 1. If there are no treatment differences, the treatment effects will all be close to zero, and hence the treatments SS will be small. But how does this compare with the random variation measured by the residuals?

2. We find the mean (or average) squares (MS) for the treatment and residual effects and compare these with an F statistic. The sums of squares are divided by their degrees of freedom.

The Analysis of Variance (ANOVA) Table

The calculations are summarised in a table similar to those arising in a regression analysis.

Source of             Sum of          Mean
Variation            Squares    DF   Square      F
Overall mean           583.2     1
Treatment effects       62.8     4    15.70   3.47
Residual (error)        68.0   (15)    4.53
Total                  714.0    20

F = 15.70/4.53 = 3.47, comparing the effect of the treatments on the responses with the chance (residual) effect on the responses.

Is this value large enough to be significant? The critical value is found from the F table (5%).
[Extract from the 5% F table: columns give the numerator DF ν₁ = 1, 2, 3, 4, …, 30 and rows the denominator DF ν₂ = 1, …, 15, …, 120; the entry for ν₁ = 4, ν₂ = 15 is 3.056.]

If ν₁ = 4 and ν₂ = 15 then the critical F = 3.056, meaning Pr(F₄,₁₅ > 3.056) = 0.05. Since 3.47 > 3.056 we have significance at the 5% level. This means that the treatment effects outweigh the chance (residual) effect.

Conclusion: There is evidence of a difference between the mean temperature reductions resulting from the five treatments.

Note:
Because the overall mean appears in each data value, it makes no impact on the variability between data values, and the ANOVA table becomes:

Source                  SS    DF      MS      F
Treatment effects     62.8     4   15.70   3.47
Residual (error)      68.0   (15)   4.53
Total (mean deleted) 130.8    19
SYSTEMATIC CALCULATIONS

The calculations for a one factor analysis of variance can be carried out easily using statistical software, or by the following computational method, which is quicker than the previous partitioning approach.

                   A      B      C      D      E
                   9      7      2      4      4
                   8      4      3      8      9
                   6      9      4      1      6
                   9      6      3      3      3
Col Total (C_j)   32     26     12     16     22     108
C_j²            1024    676    144    256    484    2584

The between treatments (or samples) sum of squares is

C₁²/n₁ + C₂²/n₂ + … + C_k²/n_k – (overall mean SS)

where n₁, n₂, etc. are the sample sizes, and k = 5 here.
If n₁ = n₂ = … = n_k = n (say), this becomes

(1/n)[C₁² + C₂² + … + C_k²] – (overall mean SS)

Total SS = 9² + … + 3² = 714.0, as before.
Overall mean SS = 20(108/20)² = 583.2, as before.
Treatment effects SS = (1/4)[1024 + 676 + 144 + 256 + 484] – 583.2
                     = 62.8, as n₁ = n₂ = … = 4

SOURCE                  SS    DF      MS      F
Overall mean         583.2     1
Treatment effects     62.8     4   15.70   3.47*
Residual (error)    (68.0)   (15)   4.53
Total                714.0    20

Brackets indicate numbers found by subtraction. If the effect of the overall mean is again deleted, the reduced table is produced:

SOURCE                  SS    DF      MS      F
Treatment effects     62.8     4   15.70   3.47*
Residual (error)    (68.0)   (15)   4.53
Total                130.8    19
A Note on the Residual Mean Square (s_p² or s_e²)

The four treatment A residuals were 1.0, 0.0, –2.0, 1.0. These are the values 9, 8, 6, 9 with the A mean of 8 subtracted, i.e. they are of the form x_Ai – x̄_A. An estimate of the variance for treatment A is therefore

s_A² = Σ(x_Ai – x̄_A)² / (n_A – 1)
     = (1.0² + 0.0² + [–2.0]² + 1.0²) / 3

For the other four treatments the variance estimates are

s_B² = Σ(x_Bi – x̄_B)² / (n_B – 1)
⋮
s_E² = Σ(x_Ei – x̄_E)² / (n_E – 1)

where in this case n_A = n_B = n_C = n_D = n_E = 4.

If it is assumed that the variance is the same in all five treatments, then the common or pooled variance estimate is

s_p² = (1/5)[s_A² + s_B² + s_C² + s_D² + s_E²]
     = (1/5)[(1/3)Σ(x_Ai – x̄_A)² + … + (1/3)Σ(x_Ei – x̄_E)²]
     = (1/15)[Σ(x_Ai – x̄_A)² + … + Σ(x_Ei – x̄_E)²]
     = (1/15)[1.0² + 0.0² + (–2.0)² + 1.0² + …]
     = Residual SS / Residual DF
     = Residual Mean Square (s_e²)

The residual mean square is just the pooled variance estimate for all five samples. (It is a direct extension of the pooled variance estimate in an unpaired t test.)

Notes: (1) For the F test to be valid, the variances in all the samples compared (here 5) should be approximately equal.
(2) The square root of the residual mean square, s_e, is the standard deviation of the residuals.
(3) In the R-cmdr printout for such an analysis the overall mean effect is deleted from the ANOVA table (as in the equivalent regression printout). The important section of the table remains:

SOURCE                     SS   DF      MS      F
Treatment effects        62.8    4   15.70   3.47*
Residual (error) effect  68.0   15    4.53
Total (less mean)       130.8   19
Example: 20 children were allocated randomly to four equal groups and subjected to different treatments. After 3 months, progress was measured by a test, with the responses below (one child in group 3 died). Test for treatment mean differences.

             TREATMENT
         1      2      3      4
         4     31     30     19
        12     49     41     66
        44     22     13     65
         9     56     26     46
        17     19            89
C_j     86    177    110    285     658
C_j²  7396  31329  12100  81225
Total SS = 4² + 12² + … + 89² = 32214
Overall mean SS = 19(658/19)² = 22787.58
Total SS (less mean SS) = 9426.42

Treatment effect SS = 7396/5 + 31329/5 + 12100/4 + 81225/5 – 22787.58
                    = 4227.43

The ANOVA table becomes

SOURCE                    SS    DF        MS       F
Treatment effect     4227.43     3   1409.14   4.066
Error (residual)   (5198.99)   (15)   346.60
Total (less mean)    9426.42    18

The critical value at the 5% level of significance is 3.287 < 4.066 (using 3 and 15 DF).

Conclusion: There is some evidence that the mean outcomes under the four treatments differ.
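The shortcut formula handles unequal group sizes naturally, and is easy to script; this sketch reproduces the table's key numbers (differences in the last decimal place are rounding in the notes):

```python
groups = [[4, 12, 44, 9, 17], [31, 49, 22, 56, 19],
          [30, 41, 13, 26], [19, 66, 65, 46, 89]]

values = [x for g in groups for x in g]
N, k = len(values), len(groups)                # 19 children, 4 groups

total_ss = sum(x**2 for x in values)           # 32214
mean_ss = sum(values)**2 / N                   # 658^2 / 19 = 22787.58
treat_ss = sum(sum(g)**2 / len(g) for g in groups) - mean_ss
resid_ss = total_ss - mean_ss - treat_ss       # found by subtraction

F = (treat_ss / (k - 1)) / (resid_ss / (N - k))
print(round(treat_ss, 2), round(F, 3))  # about 4227.42 and 4.066
```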
POST ANALYSIS OF VARIANCE RESULTS

For further interpretation it is important to set up confidence intervals for individual sample means, or for differences between pairs of sample means. The useful new development here is that the residual mean square is an excellent estimate of the data variance, meaning there is no need additionally to calculate the usual pooled variance estimate for each pair of samples. The advantage of using the residual mean square is that it involves all the data, not just the data in the individual samples.

Example: Set up a 95% confidence interval for the mean of treatment 2.

Solution: Here,

x̄₂ = 177/5 = 35.4

Estimated standard error = √(s_e²/n) = √(346.60/5) = 8.33

which has 15 degrees of freedom (the same as the residual).
The 95% C.I. is 35.4 ± t₁₅(8.33), where t₁₅ = 2.132.
That is, 35.4 ± 17.76, or 17.64 < μ₂ < 53.16.

N.B. (1) We use 15 DF rather than the 5 – 1 = 4 DF of the single second sample, and hence gain precision, as t₁₅ < t₄ (note t₄ = 2.776).
(2) R-cmdr gives confidence intervals for these treatment means automatically.
(3) As we have seen, use of the residual mean square requires the variances to be equal in each sample.

Example: Compare the mean scores for treatments 3 and 4 by setting up a 95% C.I. for the difference.
Solution: x̄₃ = 110/4 = 27.5, x̄₄ = 285/5 = 57.0

Estimated standard error of the difference
= √[s_p²(1/n₃ + 1/n₄)]
= √[s_e²(1/4 + 1/5)]
= √[346.60(1/4 + 1/5)]
= 12.49

with 15 DF again, rather than the n₃ + n₄ – 2 = 7 DF of the usual unpaired t-test.

The 95% C.I. for μ₄ – μ₃ is
(57.0 – 27.5) ± t₁₅(12.49), where t₁₅ = 2.132
That is, 29.5 ± 26.63, or 2.87 < μ₄ – μ₃ < 56.13.

Since zero is excluded, there is evidence that treatment 4 has a higher average score than treatment 3.
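Both post-ANOVA intervals can be checked in a few lines. The t quantile 2.132 is hard-coded from the t-table, since the Python standard library has no inverse t distribution; tiny differences from the worked answers come from rounding the standard errors.

```python
from math import sqrt

ms_resid = 346.60   # residual mean square (15 DF) from the ANOVA table
t15 = 2.132         # t-table value for 15 DF, 95% two-sided

# 95% CI for the treatment 2 mean
mean2, n2 = 177 / 5, 5
half2 = t15 * sqrt(ms_resid / n2)
print(round(mean2 - half2, 2), round(mean2 + half2, 2))  # about (17.65, 53.15)

# 95% CI for the difference between the treatment 4 and treatment 3 means
mean3, n3 = 110 / 4, 4
mean4, n4 = 285 / 5, 5
half_d = t15 * sqrt(ms_resid * (1 / n3 + 1 / n4))
lo, hi = mean4 - mean3 - half_d, mean4 - mean3 + half_d
print(round(lo, 2), round(hi, 2))  # about (2.87, 56.13); zero excluded
```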
A NOTE ON ASSUMPTIONS IN ANOVA

Residuals and residual plots can be used to check the required assumptions. As in a regression analysis, the residuals should be
(i) normally distributed,
(ii) randomly distributed about 0, and
(iii) of similar variation within each of the samples chosen.

The following graph shows the variability within each of the drugs in the temperature reduction (fever) data. There could be some concern about unequal variation within the five treatments (but the samples are very small in this case, so this is not too surprising).

The next two residual plots confirm that the variation is similar for each drug treatment and that the residuals are close to being normally distributed.
SECTION 8

This section covers the analysis of count data, including the chi-square test for contingency and the chi-square test for trend, as well as relative risks, attributable risks and odds ratios along with their confidence intervals. The analysis of a three way table and Simpson's paradox are investigated as a way of introducing the concept of a confounding variable in the lead up to regression analyses.

Categorical Data Examples
Relative Risk and its Confidence Interval
Attributable Risk and its Confidence Interval
Odds Ratio and its Confidence Interval
Chi-square Test for Contingency
Chi-square Test for Trend
Interpretation of Confidence Intervals
Simpson's Paradox and Confounder Control
Analysis <strong>of</strong> categorical data<br />
Categorical Data arise when individuals or<br />
experimental units are classified into one <strong>of</strong> two<br />
or more mutually exclusive groups. For example,<br />
• binary e.g. sex (M/F); dead/alive;<br />
diseased/disease free;<br />
treatment/placebo; smoker (yes/no)<br />
Tuatara present/absent<br />
herpes present/absent<br />
melanoma present/absent<br />
• nominal e.g. ethnicity<br />
• ordinal e.g. disease severity; socio economic<br />
status; smoking (never/ex/current)<br />
In a sample <strong>of</strong> units, the number falling into a<br />
particular group is the frequency. The analysis <strong>of</strong><br />
such data is sometimes referred to as the analysis<br />
<strong>of</strong> frequencies or counts.<br />
Examples <strong>of</strong> research questions that we shall<br />
look at.<br />
Estimation of one proportion:
Ex 1. What is the prevalence of asthma in a population?
Associations between two factors:
Ex 2. Is a vaccine effective in reducing the risk of catching influenza?
Ex 3. Is there an association between exposure to chlorinated water and dental enamel erosion?
Ex 4. Does infra-red stimulation (IRS) provide effective pain relief in patients with cervical osteoarthritis?
Ex 5. Is there an association between income level and severity of cardiovascular disease in a group of people presenting for treatment?
What tools do we need to answer these types of questions? Recall the research loop:
[Diagram: the research loop. A Sample is drawn from the Underlying Population (selection bias can enter here), information is collected from study participants (information bias), a statistical analysis is carried out (confounding must be dealt with), and inference is made back to the underlying population.]
Possible explanations for an association include
• bias (selection bias is controlled with study design when selecting the people for a study; information bias is systematic error arising from the way information was collected from study participants)
• confounding (must be allowed for)
• chance (or random error)
• a true association
We shall use proportions, relative <strong>and</strong> attributable<br />
risks, odds ratios, confidence intervals <strong>and</strong><br />
probability values.<br />
Example 1: What is the prevalence of asthma in a population?
Population: adult males on a general practice<br />
register.<br />
Study<br />
• r<strong>and</strong>om sample from population, n = 215<br />
• 39 have history <strong>of</strong> asthma<br />
Sample proportion p = 39/215 = 0.18<br />
Standard error of proportion = √[0.18(1 − 0.18)/215] = 0.026
95% confidence interval for the true proportion<br />
(0.13, 0.24)<br />
Conclusion<br />
We can be 95% sure that the true prevalence <strong>of</strong><br />
asthma among men attending this general practice<br />
is between 13% <strong>and</strong> 24%.<br />
Confidence intervals for very small proportions
• If the number of events is small, the distribution of sample proportions is not normal and the interval could include negative values.
• An ‘exact’ method based on the binomial distribution must be used instead.
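The prevalence calculation above can be sketched in a few lines of Python. This is an illustrative sketch, not part of the original notes; the variable names are my own, and keeping the unrounded proportion gives an upper limit of 0.23 rather than the 0.24 quoted.

```python
import math

n, cases = 215, 39
p = cases / n                      # sample proportion with asthma history
se = math.sqrt(p * (1 - p) / n)    # standard error of the proportion
lower = p - 1.96 * se              # normal-approximation 95% CI
upper = p + 1.96 * se
print(round(p, 2), round(se, 3))   # 0.18 0.026
print(round(lower, 2), round(upper, 2))
```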
Evaluating associations in 2 × 2 tables
Example 2: Is a vaccine effective in reducing the risk of catching influenza?
Study
169 people were randomly allocated to receive a flu vaccine or a placebo. At the end of winter they were asked if they had contracted flu.
Flu’ No Flu’ Total<br />
Vaccine 9 75 84<br />
Placebo 22 63 85<br />
Total 31 138 169<br />
This is what is called a prospective cohort study: the cohort of people is followed into the future. Such studies can be expensive as they may be of long duration. Also, if a disease is rare (say a cancer), many participants will be needed. The Dunedin Multidisciplinary Study is one of these. Recall the example on circumcision and sexually transmitted disease.
Example 3: Is there an association between exposure to chlorinated water and dental enamel erosion?
Study
Of 49 swimmers with enamel erosion (the cases), 32 reported swimming 6 or more hours per week, compared with 118 of 245 swimmers without enamel erosion (the controls).
Swim time Erosion <strong>of</strong> enamel Total<br />
per week Yes No<br />
(Cases) (Controls)<br />
≥ 6 hrs 32 118 150<br />
< 6 hrs 17 127 144<br />
Total 49 245 294<br />
This is what is called a retrospective case-control study. The advantage is that such a study is relatively quick and smaller than a cohort study, particularly for rare diseases. But there is greater potential for bias, as there may be inaccurate recall.
The analysis <strong>of</strong> this 2 × 2 table is not the same as<br />
the analysis in the 2 × 2 table in the previous<br />
cohort study. (We shall see that odds ratio rather<br />
than relative risk must be used.)<br />
Both these data summaries are in the form of a 2 × 2 table. Usually there is an exposure (or predictor) category and an outcome (or response) category.
Outcome (disease)<br />
Exposed Present Absent Total<br />
Yes a b a + b<br />
No c d c + d<br />
Total a + c b + d n<br />
We know how to summarize data from tables like<br />
these<br />
• the choice <strong>of</strong> measure depends on the study<br />
design<br />
• options include relative risk, attributable risk<br />
(difference in proportions), odds ratio<br />
The tools needed for statistical inference are<br />
• confidence intervals for relative risks<br />
attributable risks <strong>and</strong> odds ratios<br />
• hypothesis tests (p-values) for these<br />
associations<br />
Prospective Studies<br />
• groups are followed up to see if an outcome<br />
<strong>of</strong> interest occurs<br />
• the proportions in each group who develop<br />
the outcome are found (these are <strong>of</strong>ten called<br />
the incidence which defines numbers <strong>of</strong> new<br />
cases <strong>of</strong> a disease)<br />
• the ratio <strong>of</strong> these proportions is the relative<br />
risk<br />
• the difference in these proportions is the<br />
attributable risk<br />
General form <strong>of</strong> 2 × 2 table:<br />
Outcome (disease)<br />
Exposed Present Absent Total<br />
Yes a b a + b<br />
No c d c + d<br />
Total a + c b + d n<br />
Relative risk, RR = [a/(a + b)] ÷ [c/(c + d)]
Attributable risk, AR = a/(a + b) − c/(c + d)
Example 2: Is a vaccine effective in reducing the risk of catching influenza?
Flu’ No Flu’ Total<br />
Vaccine 9 75 84<br />
Placebo 22 63 85<br />
Total 31 138 169<br />
Risk in vaccine group = 9/84<br />
Risk in placebo group = 22/85<br />
Relative risk, RR = (9/84) ÷ (22/85) = 0.41
Those who were vaccinated were about 0.4 times as likely to develop the flu as those who were not vaccinated. So flu vaccine was associated with a 60% reduction in risk of flu.
Notes:<br />
• if a RR = 1.00, then rates are equal <strong>and</strong> there<br />
is no association between flu’ <strong>and</strong> vaccine<br />
• the convention is to calculate the relative risk<br />
this way round so that a ‘protective’ exposure<br />
gives a relative risk less than 1.<br />
Confidence interval for relative risk<br />
One method for finding confidence intervals for<br />
RR is as follows:<br />
The sampling distribution for ln(RR) is<br />
approximately normal with st<strong>and</strong>ard deviation (or<br />
st<strong>and</strong>ard error) given by<br />
s.e.[ln(RR)] = √[1/a − 1/(a + b) + 1/c − 1/(c + d)]
Then the 95% confidence interval for ln(RR) is<br />
ln(RR) ± 1.96 s.e.[ln(RR)]<br />
For example,
s.e.[ln(RR)] = √[1/9 − 1/84 + 1/22 − 1/85] = 0.364
Now RR = 0.414, giving ln(RR) = –0.882<br />
The confidence interval (95%) becomes<br />
–0.882 ± 1.96 (0.364)<br />
i.e. –0.882 ± 0.714<br />
Therefore, –1.596 < ln(RR) < –0.168<br />
Taking exponentials, 0.20 < RR < 0.85<br />
So the 95% confidence interval for the true<br />
relative risk is (0.20, 0.85)<br />
Since 1 is not contained in this confidence interval, we conclude that there is evidence of an association between vaccine use and a reduced risk of contracting flu.
Note:<br />
• this method will give a correct CI only if the<br />
numbers in each cell are not too small<br />
• in order to complete our evaluation <strong>of</strong> the<br />
effectiveness <strong>of</strong> the vaccine we need to also<br />
consider possible sources <strong>of</strong> bias <strong>and</strong><br />
confounding<br />
• regression procedures allow us to take<br />
account <strong>of</strong> confounding effects (see later).<br />
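The relative risk calculation and its log-scale confidence interval can be sketched as follows. This is an illustrative Python sketch under the formulas above, not part of the original notes; variable names are my own.

```python
import math

# 2x2 table from the flu vaccine trial: rows = vaccine / placebo
a, b = 9, 75    # vaccine: flu / no flu
c, d = 22, 63   # placebo: flu / no flu

rr = (a / (a + b)) / (c / (c + d))                    # relative risk
se = math.sqrt(1/a - 1/(a + b) + 1/c - 1/(c + d))     # s.e. of ln(RR)
ci = (math.exp(math.log(rr) - 1.96 * se),
      math.exp(math.log(rr) + 1.96 * se))
print(round(rr, 2))                       # 0.41
print(round(ci[0], 2), round(ci[1], 2))   # 0.2 0.85
```

Since the interval excludes 1, it agrees with the conclusion in the notes that the vaccine is associated with a reduced risk of flu.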
Confidence interval for attributable risk
Once we have determined the treatment is effective, we may also wish to consider how many cases of flu the vaccine is likely to prevent:
Attributable risk:
22/85 − 9/84 = 0.26 − 0.11 = 0.15
Use the normal approximation to get a confidence<br />
interval for this difference in proportions. The<br />
estimated st<strong>and</strong>ard error for the difference<br />
between the proportions is<br />
√[p1(1 − p1)/n1 + p2(1 − p2)/n2] = √[0.26(0.74)/85 + 0.11(0.89)/84] = 0.059
and the 95% confidence interval for the attributable risk (risk difference) is
0.15 ± 1.96(0.059)
giving (0.04, 0.27)
So, assuming the treatment is effective, in every 100 people vaccinated there will be between 4 and 27 fewer cases of flu than if they had not been vaccinated (i.e. vaccination prevents between 4 and 27 cases of flu in every 100 people).
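The risk-difference interval can be checked with a short sketch (illustrative only, not part of the original notes; note that keeping unrounded proportions gives a standard error of 0.058 rather than the 0.059 obtained from the rounded values 0.26 and 0.11).

```python
import math

p1, n1 = 22/85, 85   # risk in placebo group
p2, n2 = 9/84, 84    # risk in vaccine group

ar = p1 - p2                                        # attributable risk
se = math.sqrt(p1*(1 - p1)/n1 + p2*(1 - p2)/n2)     # s.e. of the difference
ci = (ar - 1.96 * se, ar + 1.96 * se)
print(round(ar, 2))                       # 0.15
print(round(ci[0], 2), round(ci[1], 2))   # 0.04 0.27
```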
Case control studies
• a group of individuals with a disease (called the cases) is compared to a control group who do not have the disease. In such a study we choose the number of people with the disease and the number without.
General form <strong>of</strong> 2 × 2 table<br />
Outcome (disease)<br />
Exposed Present Absent Total<br />
Yes a b a + b<br />
No c d c + d<br />
Total a + c b + d n<br />
The measure <strong>of</strong> association used in case-control<br />
studies is the odds ratio, not the relative risk<br />
• In terms of probabilities, the odds of an event A is defined as Pr(A)/Pr(Ā) = Pr(A)/[1 − Pr(A)]. With the notation in the table above, in the exposed group the odds of disease present equals
[a/(a + b)] ÷ [b/(a + b)], which simplifies to a/b.
For the unexposed group, odds = c/d.
Example: Is there an association between exposure to chlorinated water and dental enamel erosion?
Study<br />
Of 49 swimmers with enamel erosion (the cases)<br />
32 reported swimming 6 or more hours per week<br />
compared with 118 <strong>of</strong> 245 swimmers without<br />
enamel erosion (the controls).<br />
Swim time Erosion <strong>of</strong> enamel Total<br />
per week Yes No<br />
(Cases) (Controls)<br />
≥ 6 hrs 32 118 150<br />
< 6 hrs 17 127 144<br />
Total 49 245 294<br />
For ≥ 6 hrs, odds = a/b = 32/118
For < 6 hrs, odds = c/d = 17/127
The odds ratio, OR = (a/b) ÷ (c/d) = (32/118) ÷ (17/127) = 2.026 (≈ 2.0)
Note 1: why we use the odds ratio<br />
Compare the numbers in the previous table to a<br />
study which is identical except that we chose to<br />
have only 49 controls:<br />
Swim time Erosion <strong>of</strong> enamel Total<br />
per week Yes No<br />
(Cases) (Controls)<br />
≥ 6 hrs 32 24 56<br />
< 6 hrs 17 25 42<br />
Total 49 49 98<br />
The values 24 and 25 give the same proportions with slight rounding.
Odds ratio = (32/24) ÷ (17/25) = 2.0 with rounding, which is the same as the previous result.
But now suppose we were to try <strong>and</strong> calculate the<br />
relative risk in both cases:<br />
                    ‘Risk’    ‘RR’
Study 1   ≥ 6 hrs   32/150
          < 6 hrs   17/144    1.81
Study 2   ≥ 6 hrs   32/56
          < 6 hrs   17/42     1.41
Notice that there is disagreement. The<br />
consequence is that the relative risk can be made<br />
to take any value by choice <strong>of</strong> numbers <strong>of</strong> cases<br />
<strong>and</strong> controls. This is unacceptable.<br />
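The point can be demonstrated numerically: the odds ratio is essentially unchanged by the choice of how many controls to sample, while a naively computed ‘relative risk’ shifts. This is an illustrative Python sketch (not part of the original notes); the dictionary layout and function names are my own.

```python
# Two versions of the enamel-erosion table: full controls vs reduced controls
study1 = {"a": 32, "b": 118, "c": 17, "d": 127}   # 245 controls
study2 = {"a": 32, "b": 24,  "c": 17, "d": 25}    # only 49 controls

def odds_ratio(t):
    return (t["a"] / t["b"]) / (t["c"] / t["d"])

def naive_rr(t):
    # 'relative risk' from row totals: NOT valid in a case-control design
    return (t["a"] / (t["a"] + t["b"])) / (t["c"] / (t["c"] + t["d"]))

print(round(odds_ratio(study1), 1), round(odds_ratio(study2), 1))  # 2.0 2.0
print(round(naive_rr(study1), 2), round(naive_rr(study2), 2))      # 1.81 1.41
```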
Note 2: When are the odds ratio and relative risk close?
Consider a retrospective case-control study:<br />
If disease (the outcome <strong>of</strong> interest) is rare,<br />
then a <strong>and</strong> c will be small in the table.<br />
Disease No Disease<br />
Exposed (Case) (Control) Total<br />
Yes a b a + b<br />
No c d c + d<br />
so
a/(a + b) ≈ a/b and c/(c + d) ≈ c/d
Then relative risk = [a/(a + b)] ÷ [c/(c + d)] ≈ (a/b) ÷ (c/d)
Thus, in a case-control study investigating a rare disease the odds ratio gives a good estimate of the true, otherwise unestimable, relative risk.
Confidence interval for odds ratio
In repeated sampling, values of ln(OR) are approximately normal with standard deviation (or standard error) given by
s.e.[ln(OR)] = √[1/a + 1/b + 1/c + 1/d]
The 95% confidence interval for ln(OR) is
ln(OR) ± 1.96 s.e.[ln(OR)]
For the example,
s.e.[ln(OR)] = √[1/32 + 1/118 + 1/17 + 1/127] = 0.326
and ln(OR) = ln(2.026) = 0.706
The confidence interval becomes<br />
0.706 ± 1.96 (0.326)<br />
i.e. 0.706 ± 0.639<br />
Therefore, 0.067 < ln(OR) < 1.345<br />
∴ e^0.067 < OR < e^1.345
∴ 1.069 < OR < 3.838
We conclude the odds of erosion in dental enamel are raised among those swimming 6 or more hours per week. We would reject the null hypothesis as the p-value < 0.05.
Note: An odds ratio simply measures whether an association is present between outcome and exposure. With a relative risk we are interested in whether treatment improves outcome status. A protective exposure gives a relative risk less than 1.
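The odds ratio and its confidence interval can be sketched as follows (an illustrative sketch following the formulas above, not part of the original notes; variable names are my own):

```python
import math

a, b, c, d = 32, 118, 17, 127   # enamel erosion case-control table

or_hat = (a / b) / (c / d)                    # odds ratio
se = math.sqrt(1/a + 1/b + 1/c + 1/d)         # s.e. of ln(OR)
ci = (math.exp(math.log(or_hat) - 1.96 * se),
      math.exp(math.log(or_hat) + 1.96 * se))
print(round(or_hat, 2), round(se, 3))     # 2.03 0.326
print(round(ci[0], 2), round(ci[1], 2))   # 1.07 3.84
```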
Chi Square Test for Contingency Tables<br />
The above examples (2 × 2 tables) are very<br />
common in health research <strong>and</strong> other areas.<br />
However, we may want:<br />
• p-values to formally test for an association<br />
• to answer questions relating to larger<br />
contingency tables.<br />
Note:<br />
• as long as one <strong>of</strong> the variables is binary we<br />
can think <strong>of</strong> comparing proportions <strong>and</strong><br />
calculate RRs or ORs<br />
• if both variables have more than 2 categories<br />
the analysis is more complex<br />
Example 4
Does infra-red stimulation (IRS) provide effective pain relief in patients with cervical osteoarthritis?
A r<strong>and</strong>omised controlled trial was carried out<br />
with 100 patients: 20 were r<strong>and</strong>omly allocated to<br />
a double dose <strong>and</strong> 40 each to a single dose <strong>and</strong><br />
control (placebo) treatment. The patients were<br />
classified according to improvement levels over a<br />
period <strong>of</strong> one week as follows:<br />
(hypothetical data)
                      Pain score
IRS             Improve  No change  Worse  Total
Double dose        10        5        5    20 = r1
Single dose        15       20        5    40 = r2
Control             5       20       15    40 = r3
Total           30 = c1  45 = c2  25 = c3  100 = n
• we can look at the percentage improved, no<br />
better <strong>and</strong> worse for each treatment category<br />
We wish to know whether the data indicate that either IRS does provide effective pain relief (and in what dose) or it is no better than the control.
Calculating a p-value for the following<br />
hypotheses will tell us whether there is evidence<br />
that IRS is effective, or whether the differences<br />
we have observed between the treatment groups<br />
are consistent with r<strong>and</strong>om variation.<br />
Hypotheses:<br />
H 0 : The response <strong>and</strong> the type <strong>of</strong> treatment are<br />
independent (i.e. no association)<br />
H A : response <strong>and</strong> type <strong>of</strong> treatment are not<br />
independent (i.e. are associated in some way<br />
or one <strong>of</strong> the responses may occur more <strong>of</strong>ten<br />
with one <strong>of</strong> the treatments)<br />
If there were no association between treatment <strong>and</strong><br />
outcome (H 0 ), I would expect to have the same<br />
fraction <strong>of</strong> improved responses using the three<br />
treatments <strong>and</strong> this fraction should be<br />
c 1 /n = 30/100 (i.e. 30 <strong>of</strong> the 100 patients show<br />
improvement).<br />
Suppose E 11 , E 21 <strong>and</strong> E 31 are the numbers <strong>of</strong><br />
improvements expected if RESPONSE <strong>and</strong><br />
TREATMENT are independent. Then<br />
30/100 = E11/20 = E21/40 = E31/40
∴ E11 = 20(30)/100 = 6
E21 = 40(30)/100 = 12
E31 = 40(30)/100 = 12
In general,
Eij = ri cj / n
for each “cell” or “class” in the contingency table.
Using this formula, expected numbers can be<br />
calculated for each cell:<br />
RESPONSE<br />
TREATMENT Improve No change Worse Total<br />
Double dose 6 9 [5] 20<br />
Single Dose 12 18 [10] 40<br />
Control [12] [18] [10] 40<br />
Total 30 45 25 100<br />
Each row <strong>and</strong> column total has to be met by the<br />
entries in the table <strong>and</strong> for this reason the numbers<br />
in brackets can be found by subtraction.<br />
The observed frequencies (the data counts) are now<br />
compared with the expected counts calculated<br />
under H 0 .<br />
If H 0 is true, then the expected counts will agree<br />
closely with those observed. [But how closely must they agree?]
This is answered by calculating the chi-square (χ 2 )<br />
statistic<br />
χ² = Σ over all cells (Observed − Expected)² / Expected
i.e.
χ² = Σ over all cells (i, j) (Oij − Eij)² / Eij
Observed Counts (O ij )<br />
Treatment Response<br />
1 2 3<br />
1 10 5 5<br />
2 15 20 5<br />
3 5 20 15<br />
Expected Counts (E ij ) [Under H 0 : independent]<br />
Treatment<br />
Response<br />
1 2 3<br />
(Improved) (No change) (Worse)<br />
Double 1 6 9 5<br />
Single 2 12 18 10<br />
Control 3 12 18 10<br />
χ 2 is large if O ij <strong>and</strong> E ij seriously disagree – hence<br />
χ 2 being large will result in H 0 rejection.<br />
Example: For the drug responses,
χ² = (10 − 6)²/6 + (5 − 9)²/9 + (5 − 5)²/5
   + (15 − 12)²/12 + (20 − 18)²/18 + (5 − 10)²/10
   + (5 − 12)²/12 + (20 − 18)²/18 + (15 − 10)²/10
= 14.72 (χ² will always be positive)
In repeated sampling these χ 2 values are distributed<br />
as a chi-square distribution which has<br />
υ = (number <strong>of</strong> rows – 1) × (number <strong>of</strong> columns – 1)<br />
degrees <strong>of</strong> freedom.<br />
Here, υ = (3 – 1) × (3 – 1) = 4<br />
which is just the number <strong>of</strong> values that can be<br />
freely inserted in the table!! (the remaining values<br />
are fixed if the row <strong>and</strong> column totals are to be<br />
met.)<br />
The critical χ 2 value is found from the table at the<br />
end <strong>of</strong> the notes.<br />
[Sketch: the χ² density curve with υ degrees of freedom, the critical value cutting off an upper-tail area α (the level of significance).]
From the χ² table (υ = 4 row): the critical value at α = 0.05 is 9.488, and at α = 0.005 it is 14.86.
Since 14.72 > 9.488, the null hypothesis of no association is rejected.
Note: when we do this on the computer we get the exact p-value, p = 0.005
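The whole calculation can be reproduced from the observed table alone. This is an illustrative pure-Python sketch (not part of the original notes; in practice one would use a ready-made function such as scipy.stats.chi2_contingency). The tail probability uses the closed-form chi-square survival function that holds for even degrees of freedom.

```python
import math

observed = [[10, 5, 5],    # double dose: improve / no change / worse
            [15, 20, 5],   # single dose
            [5, 20, 15]]   # control

row = [sum(r) for r in observed]            # row totals r_i
col = [sum(c) for c in zip(*observed)]      # column totals c_j
n = sum(row)

# E_ij = r_i c_j / n, then chi-square = sum (O - E)^2 / E over all cells
chi2 = sum((observed[i][j] - row[i] * col[j] / n) ** 2 / (row[i] * col[j] / n)
           for i in range(3) for j in range(3))
df = (3 - 1) * (3 - 1)

# survival function for even df: exp(-x/2) * sum_{k<df/2} (x/2)^k / k!
half = chi2 / 2
p = math.exp(-half) * sum(half**k / math.factorial(k) for k in range(df // 2))
print(round(chi2, 2), df, round(p, 3))   # 14.72 4 0.005
```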
• the p-value gives the probability <strong>of</strong> observing a<br />
difference this large or larger between what we<br />
observed <strong>and</strong> what is expected under H 0 , if H 0<br />
is true.<br />
• since the p-value is small, it is unlikely we<br />
would observe a difference this big just by<br />
chance, it is more likely that the null hypothesis<br />
is false.<br />
• there is evidence that the pain levels depend on<br />
the treatment administered.<br />
• closer inspection of the observed frequencies indicates
• more patients improved on double dose than expected
• few patients experienced an improved response on the control
• fewer patients than expected were worse on single dose.
Notes<br />
1. Check the observed counts in order to interpret<br />
a significant association.<br />
2. Maximum power is achieved if there are equal<br />
numbers in each ‘exposure’ group. This is<br />
<strong>of</strong>ten not possible to achieve in observational<br />
studies.<br />
3. This chi-square procedure is unreliable if<br />
counts are small, in particular less than 5.<br />
• For larger contingency tables it is possible<br />
to combine classes in order to raise<br />
frequencies.<br />
• For 2 × 2 tables if expected frequencies<br />
are between 5 <strong>and</strong> 10, a correction called<br />
Yates correction will modify the χ 2<br />
statistic.<br />
• For 2 × 2 tables, if expected frequencies<br />
are less than 5, there is a test called<br />
Fisher’s Exact Test which can be used.<br />
Example 5
Is there an association between income level and severity of cardiovascular disease in a group of people presenting for treatment?
Study<br />
A group <strong>of</strong> people presenting to a hospital with<br />
acute myocardial infarction or unstable angina are<br />
enrolled in a study. Cross-sectional data are<br />
collected at baseline.<br />
                 Income level (Exposure)
Disease level
(Outcome)        1      2      3      4   Total
0              100    107    111    122    440
≥1 (Severe)    115    112    104     97    428
Total          215    219    215    219    868
% ≥1          53.5   51.1   48.4   44.3
RR            1.00   0.96   0.90   0.83
Each RR compares the risk of severe disease at that income level with income level 1, e.g. 0.96 = (112/219) ÷ (115/215).
To test whether or not there is an association between disease severity and income level:
H0: there is no association between disease severity and income (i.e. the proportion with severe disease is the same for all income levels)
HA: there is some association (i.e. the percentage with severe disease varies by income)
Expected frequencies:
                  Income level
Disease level     1        2        3        4    Total
0             108.99   111.01   108.99   111.01    440
≥1            106.01   107.99   106.01   107.99    428
Total            215      219      215      219    868

E11 = 215 × (440/868) = 108.99
E12 = 219 × (440/868) = 111.01
E13 = 215 × (440/868) = 108.99, and so on.
χ² = (100 − 108.99)²/108.99 + (107 − 111.01)²/111.01 + (111 − 108.99)²/108.99 + (122 − 111.01)²/111.01
   + (115 − 106.01)²/106.01 + (112 − 107.99)²/107.99 + (104 − 106.01)²/106.01 + (97 − 107.99)²/107.99
= 4.1
The appropriate sampling distribution is a χ 2 with<br />
3 d.f.<br />
From the χ 2 table<br />
Pr(χ 2 (3 d.f.) > 6.251) = 0.1<br />
so p-value > 0.1<br />
From the computer, p-value = 0.25<br />
Hence the observed differences in proportions we<br />
have seen are <strong>of</strong> the order we might expect to see<br />
by chance. There is no evidence supporting<br />
rejection <strong>of</strong> the null hypothesis.<br />
We conclude that there is no evidence <strong>of</strong> an<br />
association between disease severity <strong>and</strong> income.<br />
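This test can also be sketched in pure Python (illustrative only, not part of the original notes). For the 3 degrees of freedom here the chi-square tail probability has a closed form involving the complementary error function, which reproduces the computer p-value of 0.25.

```python
import math

observed = [[100, 107, 111, 122],   # disease level 0, by income level 1-4
            [115, 112, 104, 97]]    # disease level >= 1 (severe)

row = [sum(r) for r in observed]
col = [sum(c) for c in zip(*observed)]
n = sum(row)

chi2 = sum((observed[i][j] - row[i] * col[j] / n) ** 2 / (row[i] * col[j] / n)
           for i in range(2) for j in range(4))
df = (2 - 1) * (4 - 1)   # = 3

# survival function of chi-square with df = 3
p = math.erfc(math.sqrt(chi2 / 2)) + math.sqrt(2 * chi2 / math.pi) * math.exp(-chi2 / 2)
print(round(chi2, 1), round(p, 2))   # 4.1 0.25
```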
Contingency Tables (Continued)<br />
Tests for trend<br />
Example 5 (continued): Do people with lower incomes tend to present with more severe disease?
The chi-squared test <strong>of</strong> association may not<br />
provide the best answer to this question. It does<br />
not take account <strong>of</strong> the ordering in the income<br />
variable. Specifically, our prior hypothesis is<br />
that the percentage with severe disease decreases<br />
as income increases.<br />
We can test this hypothesis directly using a χ 2<br />
test for trend. The main difference is that this<br />
test has only one degree <strong>of</strong> freedom rather than<br />
the three for the test <strong>of</strong> association.<br />
Note: You will NOT be asked to calculate a test for trend in this course. You may be asked to interpret the p-value or a χ²trend value with one degree of freedom.
This page for reference only
                 Income level (xi)
Disease level    1     2     3     4    Total
0              100   107   111   122     440
≥1 (ri)        115   112   104    97   R = 428
Total (ni)     215   219   215   219   N = 868
ri xi          115   224   312   388   (sum = 1039)
ni xi          215   438   645   876   (sum = 2174)
ni xi²         215   876  1935  3504   (sum = 6530)

p = R/N = 428/868 = 0.49
x̄ = Σ ni xi / N = 2174/868 = 2.505

χ²trend = [Σ ri xi − R x̄]² ÷ { p(1 − p) [Σ ni xi² − N x̄²] }
        = [1039 − 428 × 2.505]² ÷ { 0.49(1 − 0.49) [6530 − 868 × 2.505²] }
        = 4.06
The trend statistic has only 1 degree of freedom.
From the χ² table, Pr(χ²(1 d.f.) > 3.841) = 0.05
Since 4.06 > 3.841, the p-value < 0.05, so we conclude there is evidence that the proportion with severe disease decreases as income increases.
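The reference formula can be sketched in Python (illustrative only, not part of the original notes). Carrying unrounded intermediates gives a statistic of about 4.01 rather than the hand-rounded 4.06; either way it exceeds the 3.841 critical value.

```python
# Chi-square test for trend, following the reference formula above
x = [1, 2, 3, 4]           # income scores
r = [115, 112, 104, 97]    # severe cases per income level
n = [215, 219, 215, 219]   # totals per income level

R, N = sum(r), sum(n)
p = R / N
xbar = sum(ni * xi for ni, xi in zip(n, x)) / N

num = (sum(ri * xi for ri, xi in zip(r, x)) - R * xbar) ** 2
den = p * (1 - p) * (sum(ni * xi * xi for ni, xi in zip(n, x)) - N * xbar ** 2)
chi2_trend = num / den
print(round(chi2_trend, 2))
```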
Overview
• interpretation of confidence intervals for RR and OR
• relationship between confidence intervals, p-values and sample size
Example: (Hypothetical Data)<br />
The following confidence intervals are from a<br />
study into the erosion <strong>of</strong> tooth enamel as a result<br />
<strong>of</strong> exposure to chlorinated water.<br />
They are the ratio <strong>of</strong> odds for those exposed<br />
(swim ≥ 6 hours per week) to those not exposed<br />
(swim < 6 hours per week).<br />
Suppose an odds ratio greater than 1.5 is<br />
considered clinically important.<br />
(a) OR = 1.90 with CI (1.23, 2.92)<br />
• p < 0.05 <strong>and</strong> conclusive.<br />
• 1 is not contained in the CI, so there is<br />
evidence <strong>of</strong> an association between<br />
exposure <strong>and</strong> outcome.<br />
• the CI is above 1 indicating harm.<br />
(Swimming bad for teeth.)<br />
• note we have not ruled out a non-clinically<br />
important association<br />
(b) OR = 1.69 with CI (0.83, 3.45)<br />
• p > 0.05 <strong>and</strong> inconclusive.<br />
• point estimate indicates possible clinically<br />
important association but “protection”!! <strong>of</strong><br />
tooth enamel (rather than “harm”) is also<br />
plausible.<br />
(c) OR = 0.81 with CI (0.39, 1.70)<br />
• p > 0.05, inconclusive.<br />
• conclude no evidence <strong>of</strong> an association<br />
even though CI includes clinically<br />
important effects.<br />
• the point estimate is in the “protection”<br />
range (harm is above 1).<br />
(d) OR = 0.85 with CI (0.53, 1.37)<br />
• p > 0.05, conclusive.<br />
• point estimate in protection range <strong>and</strong> CI<br />
excludes any clinically important harm.<br />
(e) OR = 0.81 with CI (0.67, 0.97)<br />
• p < 0.05 <strong>and</strong> conclusive<br />
• CI excludes 1<br />
• CI entirely less than 1, indicating benefit<br />
from swimming<br />
(f) OR = 1.23 with CI (1.03, 1.48)<br />
• p < 0.05 <strong>and</strong> conclusive<br />
• CI excludes 1<br />
• CI entirely above 1, but excludes the<br />
clinically important difference<br />
• there is evidence <strong>of</strong> an association between<br />
exposure to chlorinated water for more than<br />
6 hours per week but the increased odds are<br />
not clinically important.<br />
(g) OR = 1.15 with CI (0.73, 1.80)<br />
p > 0.05 <strong>and</strong> inconclusive. A clinically<br />
important association is not ruled out.<br />
Advice: Probably continue swimming.<br />
[Figure: the seven odds ratios (a)–(g) plotted with their confidence intervals on a scale from 0 to 3.5, with reference lines at 1 (no association) and 1.5 (clinically important).]
Notice that these confidence intervals are not<br />
symmetric.<br />
A Problem when Contingency Tables are<br />
combined<br />
Example: A <strong>University</strong> has a Law School <strong>and</strong> a<br />
Medical Sciences School with men <strong>and</strong> women<br />
being admitted or declined admission as follows:<br />
Admit Decline Total<br />
Male 490 210 700<br />
Female 280 220 500<br />
Total 770 430 1200<br />
Is there gender bias concerning admission (i.e. is there an association between gender and admission decision)?
Expected frequencies under H0: no association are
           Admit     Decline   Total
Male       449.2    [250.8]     700
Female   [320.8]    [179.2]     500
Total        770        430    1200
where E11 = 700(770)/1200 = 449.2 and the bracketed entries follow by subtraction.
χ² = (490 − 449.2)²/449.2 + … + … + … = 24.82
with υ = 1 degree <strong>of</strong> freedom. Since critical<br />
value at α = 0.01 level <strong>of</strong> significance is 6.635,<br />
there is strong evidence <strong>of</strong> an association.<br />
Inspection <strong>of</strong> the observed frequencies shows a<br />
tendency to admit a higher number <strong>of</strong> men than<br />
expected i.e. O 11 = 490 but E 11 = 449.2. This<br />
means fewer women are admitted than expected<br />
under equal opportunity. The admission patterns<br />
for the two schools are also known as follows:<br />
LAW SCHOOL<br />
Admit Decline Total<br />
M 480 120 600<br />
F 180 20 200<br />
Total 660 140 800<br />
MEDICAL SCIENCES<br />
Admit Decline Total<br />
M 10 90 100<br />
F 100 200 300<br />
Total 110 290 400<br />
The expected frequencies under H 0 are:<br />
LAW: Admit Decline<br />
M 495 105<br />
F 165 35<br />
MEDICAL: Admit Decline<br />
M 27.5 72.5<br />
F 82.5 217.5<br />
For Law School χ 2 = 10.38**<br />
For Medical Sciences School, χ 2 = 20.45**<br />
There is strong evidence <strong>of</strong> an association in both<br />
schools.<br />
HOWEVER, inspection <strong>of</strong> the observed counts<br />
indicates a higher number <strong>of</strong> women than<br />
expected are admitted to both schools.<br />
For LAW, O 21 = 180 with E 21 = 165<br />
For MEDICAL SCIENCES, O 21 = 100 with<br />
E 21 = 82.5<br />
This is the opposite conclusion to that when the<br />
schools are combined. Is there discrimination<br />
against men or women?<br />
This is known as Simpson’s Paradox.<br />
The reason for this discrepancy is that more<br />
women applied to the Medical Sciences school to<br />
which it was more difficult to be admitted. The<br />
final conclusion is therefore unclear.<br />
Notice that there are essentially three factors of<br />
classification here, and we have summed over<br />
one of these factors, namely the “TYPE OF<br />
SCHOOL”.<br />
COMBINED<br />
Admit Decline<br />
Male 490 (449.2) 210 (250.8)<br />
Female 280 (320.8) 220 (179.2)<br />
LAW<br />
Admit Decline<br />
M 480 (495) 120 (105)<br />
F 180 (165) 20 (35)<br />
MEDICAL<br />
Admit Decline<br />
M 10 (27.5) 90 (72.5)<br />
F 100 (82.5) 200 (217.5)<br />
(Expected numbers are in parentheses)<br />
“Variable” 1 = GENDER<br />
“Variable” 2 = ADMISSION DECISION<br />
“Variable” 3 = SCHOOL TYPE<br />
Note how careful we must be with such an<br />
observational study which fails to recognise an<br />
important “variable” (here school type).<br />
This phenomenon can occur whenever we sum<br />
over a classification in categorical data.<br />
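The paradox can be verified directly from the admission rates. A short Python sketch (illustrative only; the numbers are those in the tables above):<br />

```python
# Simpson's paradox: within each school women are admitted at a higher
# rate, yet in the combined table men are.
law     = {"M": (480, 600), "F": (180, 200)}   # (admitted, applicants)
medical = {"M": (10, 100),  "F": (100, 300)}

rates = {}
for sex in ("M", "F"):
    combined_admit = law[sex][0] + medical[sex][0]
    combined_total = law[sex][1] + medical[sex][1]
    rates[sex] = (law[sex][0] / law[sex][1],          # law admission rate
                  medical[sex][0] / medical[sex][1],  # medical admission rate
                  combined_admit / combined_total)    # combined rate

print(rates["M"])   # (0.8, 0.1, 0.7)
print(rates["F"])   # (0.9, 0.33..., 0.56)
```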
REVIEW EXERCISES<br />
1. A r<strong>and</strong>omized double blind study (prospective) was set up to test for an association between<br />
the use <strong>of</strong> aspirin <strong>and</strong> the incidence <strong>of</strong> fatal or nonfatal strokes in a five year period from the<br />
start <strong>of</strong> the study. The results (Journal <strong>of</strong> the American Medical Association, 243: 661-669)<br />
are summarised in the following contingency table:<br />
Stroke No stroke<br />
Placebo 45 2257<br />
Aspirin 29 2238<br />
(b) Calculate and interpret the risk of stroke for people in the placebo group relative to the<br />
aspirin group. Set up a 95% confidence interval for the relative risk. (3 marks)<br />
(c) The use of aspirin was felt to increase the occurrence of gastrointestinal irritation. In<br />
the study, 229 of 2267 patients in the aspirin treatment suffered irritation as opposed to<br />
22 of the 2302 in the placebo treatment. Calculate the relative risk of gastrointestinal<br />
irritation for people in the aspirin group compared with those in the control. Set up a<br />
95% confidence interval for the relative risk and interpret the result. (3 marks)<br />
(d) Calculate the attributable risk for aspirin compared with control. Set up a 95%<br />
confidence interval for the attributable risk and interpret the result. (3 marks)<br />
3. Long-term Mobile Phone Use <strong>and</strong> Brain Tumour Risk.<br />
Lonn et al (2005), American Journal <strong>of</strong> Epidemiology, 161: 526-535<br />
Human exposure to radi<strong>of</strong>requency has increased dramatically during recent years from<br />
widespread use <strong>of</strong> mobile phones. If radi<strong>of</strong>requency radiation has a carcinogenic effect, the<br />
exposure poses an important public health problem, <strong>and</strong> intracranial tumours would be <strong>of</strong><br />
primary interest. H<strong>and</strong>held mobile phones were introduced in Sweden during the late<br />
1980’s. This case-control study was carried out to test the hypothesis that long-term mobile<br />
phone use increases the risk <strong>of</strong> brain tumours.<br />
(a) This was a case-control study. Describe one advantage and one disadvantage of using a<br />
case-control study instead of a cohort study to investigate the association between long-term<br />
use of mobile phones and the risk of brain tumour.<br />
(b) The information is summarised below.<br />
Brain Tumour (Outcome)<br />
Mobile phone use Yes No Total<br />
Never/rarely 155 275 430<br />
Regularly 118 399 517<br />
Total 273 674 947<br />
(i) Calculate the odds ratio for the association between long-term mobile phone use<br />
<strong>and</strong> the risk <strong>of</strong> brain tumour.<br />
(ii) Interpret the odds ratio.<br />
(iii) Calculate the 95% confidence interval for the odds ratio.<br />
(iv) Interpret the confidence interval.<br />
SOLUTIONS<br />
1. (b) Risk (aspirin group) = 29/2267 and risk (placebo group) = 45/2302<br />
Relative risk, RR = (45/2302)/(29/2267) = 1.53<br />
The risk of stroke is 1.53 times greater for those in the placebo group.<br />
Also, s.e.(ln RR) = √( 1/45 − 1/2302 + 1/29 − 1/2267 ) = 0.236<br />
and since ln(RR) = 0.424 the 95% confidence interval is<br />
0.424 ± 1.96(0.236)<br />
or 0.424 ± 0.463<br />
or −0.039 < ln(RR) < 0.887<br />
Therefore 0.96 < RR < 2.43, taking exponentials<br />
(notice that the null value for the relative risk is 1, hence no evidence against the null hypothesis)<br />
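These confidence-interval calculations are easy to script. A minimal Python check of part (b) (illustrative; the course itself uses R-cmdr):<br />

```python
import math

# Relative risk and 95% CI on the log scale, as in part (b).
def rr_ci(a, n1, b, n2, z=1.96):
    """Risk a/n1 relative to risk b/n2, with the CI found via ln(RR)."""
    rr = (a / n1) / (b / n2)
    se = math.sqrt(1/a - 1/n1 + 1/b - 1/n2)   # s.e. of ln(RR)
    lo = math.exp(math.log(rr) - z * se)
    hi = math.exp(math.log(rr) + z * se)
    return rr, lo, hi

rr, lo, hi = rr_ci(45, 2302, 29, 2267)   # placebo risk relative to aspirin
print(round(rr, 2), round(lo, 2), round(hi, 2))   # 1.53 0.96 2.43
```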
(c) Irritation No irritation Total<br />
Placebo 22 2280 2302<br />
Aspirin 229 2038 2267<br />
RR = (229/2267)/(22/2302) = 10.57<br />
ln(RR) = 2.358<br />
s.e.(ln RR) = √( 1/229 − 1/2267 + 1/22 − 1/2302 ) = 0.221<br />
The 95% C.I. for ln(RR) is 2.358 ± 1.96(0.221)<br />
That is, 2.358 ± 0.433<br />
Giving 1.925 < ln(RR) < 2.791<br />
Taking exponentials, 6.86 < RR < 16.30<br />
The null value of equal risk is rejected.<br />
The true relative risk of irritation if aspirin is used is between 6.86 and 16.30.<br />
(d) Attributable risk = 229/2267 − 22/2302 = 0.10101 − 0.00956 = 0.09145<br />
Estimated standard error = √( 0.10101(0.89899)/2267 + 0.00956(0.99044)/2302 ) = 0.00665<br />
The 95% C.I. for attributable risk is 0.09145 ± 1.96(0.00665)<br />
or 0.091 ± 0.013<br />
or 0.078 < AR < 0.104<br />
Between 78 and 104 in every 1000 people have an increased occurrence of gastrointestinal irritation<br />
as a result of using aspirin.<br />
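Part (d) can be checked the same way (a sketch; the CI here is on the risk-difference scale, so no log transform is needed):<br />

```python
import math

# Attributable risk and 95% CI, as in part (d).
p1, p2 = 229/2267, 22/2302          # risk with aspirin, risk with placebo
ar = p1 - p2
se = math.sqrt(p1*(1 - p1)/2267 + p2*(1 - p2)/2302)
lo, hi = ar - 1.96*se, ar + 1.96*se
print(round(ar, 5), round(se, 5))   # 0.09146 0.00665
print(round(lo, 3), round(hi, 3))   # 0.078 0.104
```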
3. (a) Advantage: A case-control study is quicker and cheaper, since information on exposure and disease<br />
status is obtained at the same time. Brain tumours are also rare, so the number of participants needed for a<br />
cohort study would be large.<br />
Disadvantage: The information collected is likely to be affected by recall bias, since the events have already<br />
occurred.<br />
(b) (i) OR = (118/399)/(155/275) = 0.52<br />
(ii) Those who use mobile phones have 0.52 times the odds <strong>of</strong> a brain tumour compared with those<br />
who do not. [Protective effect from using mobile phones – the odds are 48% less for mobile<br />
phone users compared with those who do not use mobile phones.]<br />
(iii) ln(0.52) = −0.654<br />
The 95% C.I. for ln(OR) is<br />
−0.654 ± 1.96 √( 1/155 + 1/275 + 1/118 + 1/399 )<br />
or −0.654 ± 0.284<br />
or −0.938 < ln(OR) < −0.370<br />
Therefore, 0.39 < OR < 0.69<br />
(iv) 95% confident true OR between 0.39 <strong>and</strong> 0.69. The value (1) is excluded hence<br />
chance is an unlikely explanation.<br />
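A Python sketch of 3(b) (illustrative; carrying full precision gives an upper limit of 0.70, while the notes' 0.69 comes from rounding ln(OR) to −0.654 first):<br />

```python
import math

# Odds ratio and 95% CI for the mobile-phone data, as in solution 3(b).
a, b = 118, 399    # regular use:      tumour, no tumour
c, d = 155, 275    # never/rarely use: tumour, no tumour

or_ = (a/b) / (c/d)
se = math.sqrt(1/a + 1/b + 1/c + 1/d)   # s.e. of ln(OR)
lo = math.exp(math.log(or_) - 1.96*se)
hi = math.exp(math.log(or_) + 1.96*se)
print(round(or_, 2))                    # 0.52
print(round(lo, 2), round(hi, 2))       # 0.39 0.7
```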
SECTION 9<br />
This section introduces the topic of Simple Linear Regression, which sets out to fit a straight line<br />
through what is called a scatter diagram. One purpose of this analysis is to establish whether a<br />
predictor variable is influencing the outcomes of a response variable, and to measure the<br />
magnitude of the effect of this predictor variable on the outcome. It is possible to use the fitted<br />
straight line to make predictions.<br />
Simple linear regression is also the first step in controlling for a confounder variable. This occurs<br />
with the extension to multiple regression which will be considered in the next section.<br />
Scatter Diagrams <strong>and</strong> Examples<br />
Equation <strong>of</strong> Fitted Straight Line<br />
Analysis <strong>of</strong> Variance for Regression Model<br />
Confidence Interval for Slope<br />
Confidence Interval for Prediction<br />
Correlation as Measure <strong>of</strong> Linear Association<br />
Review Exercises<br />
325<br />
Section 9
Regression Procedures Introduction<br />
During the semester we have analysed data from<br />
1. studies which have measured outcomes on<br />
continuous scales [e.g. blood pressure; lung<br />
capacity; cholesterol] resulting from different<br />
treatments<br />
2. studies which have measured binary<br />
outcomes, establishing odds ratios <strong>and</strong><br />
relative risks as a result <strong>of</strong> exposure to<br />
certain conditions. [e.g. effect <strong>of</strong> chlorine on<br />
tooth enamel; effect <strong>of</strong> sun exposure on<br />
melanoma]<br />
In both cases there are potentially other variables<br />
which have an effect <strong>and</strong>/or possible confounding<br />
factors other than the treatments or exposures<br />
which influence the outcomes.<br />
We must allow for these confounders otherwise<br />
invalid conclusions will be drawn about the real<br />
effects <strong>of</strong> the treatments or exposures.<br />
Regression methods are used to introduce these<br />
controls. We now develop:<br />
1. Simple linear Regression (now)<br />
• to describe the relationship between two<br />
variables <strong>and</strong> test whether changes in an<br />
outcome measure may be linked to<br />
changes in the other variable.<br />
• to enable the prediction <strong>of</strong> the value <strong>of</strong><br />
the outcome measure from the other<br />
variable.<br />
2. Multiple Regression (later)<br />
• to identify the main factors influencing a<br />
continuous outcome<br />
• to adjust the means <strong>of</strong> outcomes for<br />
confounders or other factors.<br />
3. Logistic Regression (later)<br />
• to identify the main factors influencing<br />
binary outcomes <strong>and</strong> hence odds ratios<br />
<strong>and</strong> relative risks<br />
• to adjust odds ratios for confounding or<br />
other factors.<br />
Show Hans Rosling’s website gapminder.<br />
Example: Blood Alcohol Concentration in<br />
mg/100mL <strong>and</strong> Body Mass in kg for 8 adults after<br />
drinking 12 glasses <strong>of</strong> regular beer.<br />
MASS (kg) BAC (mg/100mL)<br />
55 0.140<br />
85 0.102<br />
69 0.120<br />
65 0.126<br />
80 0.106<br />
90 0.092<br />
67 0.128<br />
73 0.120<br />
[Figure: scatter diagram of BAC (mg/100mL, vertical axis) against MASS (kg, horizontal axis, 50–100); the points fall steadily as mass increases.]<br />
Does BAC drop as Body Mass increases?<br />
Other variables which could be important are:<br />
gender amount eaten alcohol level <strong>of</strong> the beer<br />
Eventually we shall see how to determine which<br />
<strong>of</strong> these may be important.<br />
[Figure: sketch of BAC against MASS with the two genders plotted with different symbols (x and •); one group lies consistently above the other on two roughly parallel downward lines.]<br />
• Women consistently above men<br />
• Lines could be parallel<br />
[Figure: a second sketch of BAC against MASS in which the two lines are not parallel: a large gap between the groups at low body mass closes at high body mass.]<br />
• Lines not parallel. (If low body mass, large<br />
difference, if high body mass there is no<br />
difference.)<br />
Example: Lung function in children as measured<br />
by a lung capacity variable called FEV.<br />
[Figure: scatter diagram of FEV against Age for children aged 3 to 19; FEV rises steadily with age.]<br />
FEV values are increasing as the children grow.<br />
But now see the next two graphs.<br />
[Figure: the FEV–Age scatter diagram with smokers and non-smokers distinguished; the smokers' FEV values sit below the non-smokers'.]<br />
• Once smoking starts, FEV is reduced for the smokers.<br />
[Figure: the FEV–Age scatter diagram again, with separate, non-parallel trend lines labelled Non-smoker and Smoker; the smokers' line begins near age 9 and rises more slowly.]<br />
• This is more accurate, as children may only begin<br />
smoking at about age 9, and the rate of increase is much<br />
smaller for smokers (the non-parallel lower line).<br />
• Multiple regression needed for this analysis.<br />
With a simple linear regression take one variable<br />
as response <strong>and</strong> one variable as a predictor.<br />
The response is plotted on the vertical Y axis.<br />
The predictor is plotted on the horizontal X axis.<br />
Equivalent terms for response <strong>and</strong> predictor:<br />
response = outcome = dependent variable = (Y-variable)<br />
predictor = explanatory variable = covariate = independent variable = (X-variable)<br />
Simple regression deals with the case where the<br />
relationship is approximately a straight line.<br />
Example: The values <strong>of</strong> a response variable (Y)<br />
<strong>and</strong> the values <strong>of</strong> a predictor variable (X) are as<br />
follows<br />
X Y<br />
100 39.7<br />
200 51.1<br />
300 49.9<br />
400 69.8<br />
500 65.2<br />
600 65.1<br />
700 80.7<br />
The scatter diagram below shows the relationship between Y and X.<br />
[Figure: scatter diagram of Y (about 40 to 80) against X (100 to 700); the points rise as X increases.]<br />
Y increases as X increases. The question is<br />
whether this apparent increase in Y is caused by<br />
changing X, or has it been caused by some other<br />
factor, or has it arisen by chance alone?<br />
The values <strong>of</strong> X, the independent variable, are<br />
known exactly (i.e. no error) whereas the values<br />
<strong>of</strong> Y, the dependent variable, have some r<strong>and</strong>om<br />
error associated with them.<br />
The relationship between Y and X could be linear<br />
so we attempt to “fit” a straight line through the<br />
data. This line gives the predicted values ŷ i for<br />
each value x i of X.<br />
[Figure: the fitted straight line drawn through the scatter diagram; at x = 400 the observed value y 4 , the predicted value ŷ 4 on the line, and the difference d 4 between them are marked.]<br />
An attempt is made to minimise the differences<br />
d i = y i − ŷ i between the observed values (y i ) and<br />
the predicted values (ŷ i ). The d i are positive for<br />
points above the fitted line and negative for<br />
points below the line. The expression ∑ d i ,<br />
summed over the n data points (i.e. the sample is<br />
of size n), does not measure “fit” due to cancellation<br />
of negative and positive values.<br />
Therefore, minimise ∑ d i ² = ∑ (y i − ŷ i )².<br />
i i<br />
Suppose the straight line which does this has<br />
slope “β 1 ” and intercept “β 0 ”. That is,<br />
y = β 0 + β 1 x<br />
The method of least squares finds the values of β 0<br />
and β 1 which minimise<br />
∑ (y i − ŷ i )² = ∑ (y i − [β 0 + β 1 x i ])²<br />
The estimates of β 0 and β 1 are β̂ 0 and β̂ 1 , which<br />
turn out to be<br />
β̂ 1 = ∑ (x i − x̄)(y i − ȳ) / ∑ (x i − x̄)²<br />
β̂ 0 = ȳ − β̂ 1 x̄<br />
The line which best “fits” the data is<br />
ŷ = (ȳ − β̂ 1 x̄) + β̂ 1 x<br />
= ȳ + β̂ 1 (x − x̄)<br />
= ȳ + [ ∑ (x i − x̄)(y i − ȳ) / ∑ (x i − x̄)² ] (x − x̄)<br />
Example:<br />
x i y i (x i − x̄) (x i − x̄)² (y i − ȳ) (y i − ȳ)(x i − x̄)<br />
100 39.7 –300 90000 –20.51 6153<br />
200 51.1 –200 40000 –9.11 1822<br />
300 49.9 –100 10000 –10.31 1031<br />
400 69.8 0 0 9.59 0<br />
500 65.2 100 10000 4.99 499<br />
600 65.1 200 40000 4.89 978<br />
700 80.7 300 90000 20.49 6147<br />
2800 421.5 280000 16630<br />
x = 400 y = 60.21<br />
Therefore, β̂ 1 = 16630/280000 = 0.059<br />
β̂ 0 = 60.21 − 0.059(400) = 36.61<br />
giving ŷ = 36.61 + 0.059x<br />
To draw this line on the scatter diagram two<br />
points are needed:<br />
e.g. if x = 400, ŷ = 36.61 + 0.059(400) = 60.21<br />
if x = 100, ŷ = 42.51<br />
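The same estimates can be obtained in a few lines of Python (a sketch; carrying full precision gives β̂ 1 = 0.0594 and β̂ 0 = 36.46, while the 36.61 above comes from rounding the slope to 0.059 first):<br />

```python
# Least-squares slope and intercept for the worked example.
xs = [100, 200, 300, 400, 500, 600, 700]
ys = [39.7, 51.1, 49.9, 69.8, 65.2, 65.1, 80.7]

n = len(xs)
xbar, ybar = sum(xs)/n, sum(ys)/n
sxy = sum((x - xbar)*(y - ybar) for x, y in zip(xs, ys))   # 16630
sxx = sum((x - xbar)**2 for x in xs)                       # 280000
b1 = sxy/sxx            # slope, 0.0594 to 4 d.p.
b0 = ybar - b1*xbar     # intercept, 36.46 to 2 d.p.
print(round(b1, 4), round(b0, 2))
```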
N.B. 1. In this situation we have regressed Y on<br />
X.<br />
This implies the X values are known without<br />
error but the Y values are influenced by<br />
r<strong>and</strong>om variation.<br />
2. Numerically, we could regress X on Y. But<br />
the “slope” <strong>of</strong> this regression is not the same<br />
as that for Y on X. The reason is that now the<br />
Y values are known exactly with the X values<br />
influenced by r<strong>and</strong>om variation.<br />
3. ŷ = ȳ + β̂ 1 (x − x̄)<br />
When x = x̄, ŷ = ȳ + β̂ 1 (0) = ȳ<br />
This means that the point (x̄, ȳ) always lies<br />
on the least squares straight line, i.e. the<br />
regression line always passes through the<br />
centre of the scatter diagram.<br />
4. We say the least squares line “fits” or<br />
“models” the relationship between Y <strong>and</strong> X.<br />
5. A straight line may give poor fit e.g.<br />
[Figure: a scatter diagram in which the points follow a curve, so a straight line fits poorly.]<br />
Here, it is not appropriate to use the line to<br />
predict values <strong>of</strong> Y for given values <strong>of</strong> X.<br />
The next step in our regression analysis is to<br />
establish how well this fitted line is able to model<br />
or explain the effect X has on Y; <strong>and</strong> also, if the<br />
fitted line is used to make forecasts <strong>of</strong> the values<br />
<strong>of</strong> Y, how accurate these forecasts turn out to be.<br />
(We set up confidence intervals for these<br />
forecasts.)<br />
Definition: The value d i = y i − ŷ i is called the<br />
residual at the value x i of X. These residuals are<br />
important as they represent the error made when<br />
using the line to make a forecast.<br />
Analysis <strong>of</strong> Variance for a Regression Model<br />
The diagram below shows that any numerical value y i<br />
can be partitioned into three components as<br />
follows:<br />
[Figure: the regression line through the scatter diagram; at x i the point (x i , y i ) is split into the overall mean ȳ, the explained piece β̂ 1 (x i − x̄) between ȳ and the line, and the residual d i = (y i − ŷ i ) between the line and the point.]<br />
That is, any value<br />
y i = ȳ + β̂ 1 (x i − x̄) + (y i − ŷ i )<br />
i.e. y i = an overall average<br />
+ an amount explained by a<br />
predictor variable X<br />
+ a residual (or random error)<br />
The amount explained by the independent<br />
variable X is called the regression effect. This is<br />
also known as the explained component <strong>of</strong> the<br />
outcomes y i .<br />
The magnitude <strong>of</strong> the regression effect is related<br />
to the slope <strong>of</strong> the line <strong>and</strong> the distance x i is away<br />
from the overall mean x <strong>of</strong> the values x i .<br />
The mean y is the overall average effect.<br />
The term (y i − ŷ i ) is the residual effect. This is<br />
also known as the unexplained component of the<br />
outcomes.<br />
Therefore,<br />
data value = overall average effect<br />
+ regression effect + residual (error)<br />
effect.<br />
= overall average effect<br />
+ explained amount + unexplained<br />
amount<br />
To illustrate, the example has x̄ = 400,<br />
ȳ = 60.21 and β̂ 1 = 0.059<br />
x i y i = ȳ + 0.059(x i − 400) + residual<br />
100 39.7 = 60.21 + (–17.82) + (–2.69)<br />
200 51.1 = 60.21 + (–11.88) + 2.77<br />
300 49.9 = 60.21 + (–5.94) + (–4.37)<br />
400 69.8 = 60.21 + 0.00 + 9.59<br />
500 65.2 = 60.21 + 5.95 + (–0.95)<br />
600 65.1 = 60.21 + 11.88 + (–6.99)<br />
700 80.7 = 60.21 + 17.82 + 2.67<br />
(overall mean, common to each data value) + (explained effect) + (unexplained effect, chosen to give equality)<br />
It is important to establish if the explained effect<br />
has a much greater impact on the values y i than<br />
the unexplained residual effect, i.e. does the<br />
regression effect explain more of the variation in<br />
the y i values? It turns out that the total variation<br />
in the y i values can be partitioned into an overall<br />
mean component, a regression component <strong>and</strong> a<br />
residual component as follows:<br />
[This page just for reference]<br />
Total sum <strong>of</strong> Squares (SS) <strong>of</strong> y i values<br />
= (39.7) 2 + (51.1) 2 + (49.9) 2 + (69.8) 2<br />
+ (65.2) 2 + (65.1) 2 + (80.7) 2<br />
= 26550.89<br />
The overall mean SS<br />
= (60.21) 2 + … + (60.21) 2 (7 times)<br />
= 7(60.21) 2<br />
= 25380.32<br />
The regression effect SS<br />
= (–17.82) 2 + (–11.88) 2 + … + (17.82) 2<br />
= 987.70<br />
The residual effect SS<br />
= (–2.69) 2 + (2.77) 2 + … + (2.67) 2<br />
= 182.87<br />
Now notice that<br />
26550.89 = 25380.32 + 987.70 + 182.87<br />
i.e. Total SS = overall mean SS + regression SS<br />
+ residual SS<br />
That is, the total variation is partitioned into these<br />
components which should now be compared. But<br />
the three component values cannot be compared<br />
directly. Note that:<br />
(i) There are seven data values y i hence seven<br />
degrees <strong>of</strong> freedom.<br />
(ii) One overall mean has one DF.<br />
(iii) The seven regression values depend on the<br />
one slope estimate β̂ 1 , hence one DF.<br />
(iv) The seven residuals have the remaining<br />
7 – 2 = 5 DF.<br />
The average or mean squares (MS) are then found<br />
by dividing the sums <strong>of</strong> squares by the degrees <strong>of</strong><br />
freedom. These mean squares can be compared.<br />
The procedure is summarised in the following<br />
analysis <strong>of</strong> variance table:<br />
SOURCE OF VARIATION SS DF MS<br />
Overall mean 25380.32 1<br />
Regression effect 987.70 1 987.70<br />
Residual effect 182.87 (5) 36.57<br />
Total 26550.89 7<br />
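The partition can be verified directly from the data. A Python sketch of the sums of squares (illustrative):<br />

```python
# Sum-of-squares partition and F statistic for the worked example.
xs = [100, 200, 300, 400, 500, 600, 700]
ys = [39.7, 51.1, 49.9, 69.8, 65.2, 65.1, 80.7]
n = len(xs)
xbar, ybar = sum(xs)/n, sum(ys)/n
sxx = sum((x - xbar)**2 for x in xs)
b1 = sum((x - xbar)*(y - ybar) for x, y in zip(xs, ys)) / sxx

total_ss = sum(y*y for y in ys)          # 26550.89
mean_ss = n*ybar**2                      # ~25380.32
reg_ss = b1**2 * sxx                     # ~987.70
res_ss = total_ss - mean_ss - reg_ss     # ~182.87
f = (reg_ss/1) / (res_ss/(n - 2))        # ~27.01
print(round(reg_ss, 2), round(res_ss, 2), round(f, 2))
```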
The average regression effect (or the average<br />
effect of X on the Y values) far exceeds the<br />
average residual effect (unexplained) since<br />
987.70 far exceeds 36.57. But is this difference<br />
large enough to be important? The question of<br />
whether the average regression effect is large<br />
enough is answered by defining F = 987.70/36.57<br />
= 27.01 and testing this F-statistic for<br />
significance by reference to the F-table as<br />
follows (note that the DF here are 1 and 5<br />
respectively for numerator and denominator).<br />
Since 27.01 > 6.608 there is evidence that the<br />
regression (or explained) effect dominates the<br />
residual (or unexplained) effect. Since the key<br />
part of the regression effect is the slope β̂ 1 , this<br />
effectively means β 1 ≠ 0; alternatively, there is<br />
evidence that changes in the values x i of X<br />
explain the variation in the values y i of Y (and this<br />
dominates any left-over residual or unexplained<br />
effects).<br />
The F-distribution (Table in Appendix)<br />
υ 1 = numerator DF, υ 2 = denominator DF, α = 0.05 (say)<br />
[Figure: an F density curve with the upper-tail area α to the right of the critical value F υ1,υ2 .]<br />
Extract from the F-table (α = 0.05), rows υ 2 , columns υ 1 = 1, 2, 3, …:<br />
υ 2 = 5: 6.608 5.786 5.409 …<br />
υ 2 = 120: 3.920 3.072 2.680 …<br />
Note: 1. The residual effect includes any<br />
r<strong>and</strong>om error plus the effects <strong>of</strong> other<br />
variables which may be affecting the<br />
outcome Y values.<br />
2. Computer s<strong>of</strong>tware produces the analysis <strong>of</strong><br />
variance table directly.<br />
3. It is a slightly modified form because the<br />
overall mean effect is never used. Therefore,<br />
this is subtracted (with appropriate changes<br />
to the total SS <strong>and</strong> the degrees <strong>of</strong> freedom)<br />
SOURCE OF VARIATION SS DF MS F<br />
Regression effect 987.70 1 987.70 27.01*<br />
Residual effect 182.87 (5) 36.57<br />
Total (overall mean removed) 1170.57 6<br />
4. The “fitted” straight line should pass through<br />
the middle <strong>of</strong> the scatter diagram, <strong>and</strong> hence<br />
the residuals should take positive <strong>and</strong><br />
negative values as X increases. (This can be<br />
checked by studying plots <strong>of</strong> the residuals<br />
produced by the program.)<br />
5. For the validity <strong>of</strong> the F-test, residuals should<br />
be approximately normally distributed. This<br />
can also be checked by obtaining the normal<br />
probability plot using the program.<br />
Analyse > Regression > Linear, with Y in<br />
the Dependent Variable box and X in the<br />
Independent Variable box, produces the<br />
corresponding printout. [Printout not reproduced.]<br />
A Confidence Interval for the Slope of the line.<br />
Our sample of n = 7 produced an estimate<br />
β̂ 1 = 0.059<br />
Repeated samples of size n = 7 give values β̂ 1<br />
which follow a normal distribution (just the<br />
Central Limit Theorem again).<br />
If β 1 is the true slope of the regression line then<br />
the standard error of β̂ 1 is<br />
σ β̂1 = σ e / √( ∑ (x i − x̄)² )<br />
where σ e ² is estimated from the data by the<br />
formula<br />
s e ² = ∑ (y i − ŷ i )² / (n − 2)<br />
Notes<br />
1. (y i − ŷ i ) is the residual (or error) at the value<br />
x i of X.<br />
2. The divisor is (n − 2) rather than the (n − 1)<br />
used in the calculation of an ordinary variance<br />
because here two values “β 0 ” and “β 1 ” are<br />
estimated from the data and used to find the<br />
ŷ i from which the deviations are measured.<br />
[For an ordinary variance, s² = ∑ (x i − x̄)²/(n − 1),<br />
only x̄ is estimated.]<br />
The estimated standard error of the slope of the<br />
regression line is<br />
s β̂1 = s e / √( ∑ (x i − x̄)² )<br />
Therefore, the 95% confidence interval for β 1 is<br />
β̂ 1 ± t n−2 s e / √( ∑ (x i − x̄)² )<br />
Notes.<br />
(1) There are υ = n – 2 degrees <strong>of</strong> freedom for<br />
use with the t-table.<br />
349<br />
Section 9
(2) If σ e were known exactly (which it never is)<br />
the 95% confidence interval would be<br />
β̂ 1 ± 1.96 σ e / √( ∑ (x i − x̄)² ).<br />
(3) In practice, σ e is always estimated by<br />
s e = √( ∑ (y i − ŷ i )² / (n − 2) )<br />
(4) s e ² is just the residual mean square and this<br />
can be read directly from the analysis of<br />
variance.<br />
Example<br />
Refer to the earlier data which gave<br />
∑ (x i − x̄)² = 280000, β̂ 1 = 0.059 and<br />
ŷ = 36.6 + 0.059x<br />
x i y i (y i − ŷ i ) (y i − ŷ i )²<br />
100 39.7 −2.69 7.24<br />
200 51.1 2.77 7.67<br />
300 49.9 −4.37 19.10<br />
400 69.8 9.59 91.97<br />
500 65.2 −0.95 0.90<br />
600 65.1 −6.99 48.86<br />
700 80.7 2.67 7.13<br />
Residual sum of squares = 182.87<br />
(the residuals are those found earlier)<br />
Therefore, s e ² = 182.87/(7 − 2) = 36.58 (the residual mean square)<br />
with n − 2 = 7 − 2 = 5 D.F. giving<br />
t 5 = 2.571 for 95% confidence.<br />
The standard error of the slope is estimated to be<br />
s e / √( ∑ (x i − x̄)² ) = √36.58 / √280000 = 0.0114<br />
The 95% confidence interval is<br />
0.059 ± 2.571(0.0114) or 0.059 ± 0.029<br />
Hence 0.030 < β 1 < 0.088<br />
[Figure: a scatter diagram with a rising trend line. As X changes, the values of Y tend to show an increasing trend with random variation about the trend line.]<br />
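The interval can be checked numerically (a sketch; the residual sum of squares and t-value are those quoted above):<br />

```python
import math

# 95% CI for the slope using the residual mean square from the ANOVA table.
res_ss, n, sxx = 182.87, 7, 280000
b1 = 0.0594                 # slope carried to 4 d.p.
se_slope = math.sqrt(res_ss/(n - 2) / sxx)    # ~0.0114
t5 = 2.571                  # t-table value, 5 d.f., 95%
lo, hi = b1 - t5*se_slope, b1 + t5*se_slope
print(round(se_slope, 4))          # 0.0114
print(round(lo, 3), round(hi, 3))  # 0.03 0.089
```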
Example<br />
A test has been designed to measure patient stress<br />
level (X). Blood pressure (Y) is recorded for<br />
different stress levels.<br />
Stress (X) 55 94 64 73 96 86<br />
Blood Pr. (Y) 72 91 76 78 94 81<br />
These data give x̄ = 78; ȳ = 82;<br />
∑ (x i − x̄)² = 1394 and ∑ (x i − x̄)(y i − ȳ) = 686.<br />
Find the least squares line and a 95% confidence<br />
interval for the slope, and test the research proposal<br />
that higher stress results in higher blood pressure<br />
levels.<br />
Solution:<br />
β̂₁ = ∑(x_i − x̄)(y_i − ȳ) / ∑(x_i − x̄)² = 686/1394 = 0.492<br />
∴ ŷ = ȳ + β̂₁(x − x̄) = 82 + 0.492(x − 78)<br />
Suppose a computer analysis gives the analysis <strong>of</strong><br />
variance as follows:<br />
SOURCE OF VARIATION SS DF MS F<br />
Regression effect 337.59 1 337.59 33.41<br />
Residual effect 40.41 4 10.10<br />
Then s_e² = ∑(y_i − ŷ_i)² / (n − 2) = 40.41/4 = 10.10<br />
giving s_e = 3.178 as the residual standard deviation.<br />
For 95% confidence, t_4 = 2.776 and the standard error<br />
of the slope = 3.178 / √1394 = 0.085.<br />
The 95% confidence interval is<br />
0.492 ± 2.776(0.085)<br />
It follows that 0.256 < β₁ < 0.728<br />
Since this interval excludes zero, the test of β₁ = 0 has p-value less than 0.05: there is evidence that higher stress results in higher blood pressure levels.<br />
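For readers checking by hand rather than in R-cmdr, the whole calculation follows from the summary quantities (a plain-Python sketch; the residual sum of squares 40.41 is taken from the ANOVA table above):

```python
import math

sxy = 686        # sum of (x_i - xbar)(y_i - ybar)
sxx = 1394       # sum of (x_i - xbar)^2
rss = 40.41      # residual sum of squares from the ANOVA table
n = 6            # six stress / blood pressure pairs
t4 = 2.776       # t value with n - 2 = 4 d.f. for 95% confidence

b1 = sxy / sxx                      # least squares slope
s_e = math.sqrt(rss / (n - 2))      # residual standard deviation
se_b1 = s_e / math.sqrt(sxx)        # standard error of the slope
ci = (b1 - t4 * se_b1, b1 + t4 * se_b1)

print(round(b1, 3))                      # 0.492
print(round(se_b1, 3))                   # 0.085
print(round(ci[0], 3), round(ci[1], 3))  # 0.256 0.728
```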
Confidence Interval for Prediction using a<br />
Regression Line<br />
The predicted value at a value x_i of X is found by<br />
substituting x_i in the regression equation.<br />
e.g. For our data, ŷ = 36.6 + 0.059x<br />
When x_i = 750, ŷ = 36.6 + 0.059(750) = 80.85<br />
But what error is associated with this prediction?<br />
At a value X = x_k, say, the estimated standard error<br />
of the prediction is<br />
s_ŷ = s_e √( 1 + 1/n + (x_k − x̄)² / ∑(x_i − x̄)² )<br />
where s_e is the residual standard deviation.<br />
But s_e = √36.58 = 6.05 (see ANOVA table)<br />
∴ s_ŷ = 6.05 √( 1 + 1/7 + (750 − 400)²/280000 ) = 7.604<br />
The 95% confidence interval is<br />
ŷ ± t_5 s_ŷ where t_5 = 2.571<br />
That is 80.85 ± 2.571(7.604)<br />
Therefore, 61.30 < ŷ₇₅₀ < 100.40<br />
where ŷ₇₅₀ is the prediction at x_k = 750.<br />
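A plain-Python sketch of the same prediction-interval arithmetic (R-cmdr produces this interval directly when requested):

```python
import math

s_e2 = 36.58    # residual mean square from the ANOVA table
n = 7
xbar = 400
sxx = 280000    # sum of (x_i - xbar)^2
t5 = 2.571      # t value with 5 d.f. for 95% confidence

def prediction_interval(xk):
    """95% interval for a new observation at X = xk, simple linear regression."""
    yhat = 36.6 + 0.059 * xk
    s_pred = math.sqrt(s_e2) * math.sqrt(1 + 1/n + (xk - xbar)**2 / sxx)
    half = t5 * s_pred
    return yhat - half, yhat + half

lo, hi = prediction_interval(750)
print(round(lo, 1), round(hi, 1))   # about 61.3 and 100.4
```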
Notes<br />
(1) R-cmdr (and other packages) give this<br />
interval when requested.<br />
(2) A graph showing the confidence bands<br />
around the regression line can also be<br />
produced, as sketched below.<br />
(3) Essentially, the confidence interval for the<br />
prediction involves both line error and natural<br />
variation when predicting a data point.<br />
[Figure: regression line with prediction-interval bands plotted against X.]<br />
EXAMPLE: 2003 EXAM<br />
The data for this question are a sample <strong>of</strong> 100 low<br />
birth weight infants. Measurements <strong>of</strong> systolic<br />
blood pressure (sbp) <strong>and</strong> values <strong>of</strong> gestational age<br />
(gestage) are recorded. The following table<br />
shows the layout <strong>of</strong> the data along with the results<br />
<strong>of</strong> some calculations using the 100 data values.<br />
sbp (Y mm Hg)    gestage (X weeks)<br />
43    29<br />
51    31<br />
42    33<br />
39    31<br />
⋮     ⋮<br />
40    33<br />
50    28<br />
Summary calculations from the 100 data values:<br />
ȳ = 47.31, x̄ = 28.89<br />
∑(x_i − x̄)² = 635.69<br />
∑(y_i − ȳ)² = 15222.24<br />
∑(x_i − x̄)(y_i − ȳ) = 806.31<br />
(a) (4 marks) Using systolic blood pressure as<br />
the response <strong>and</strong> gestational age as the<br />
predictor variable, compute the least squares<br />
regression line. Interpret the slope <strong>of</strong> this<br />
regression line.<br />
(b) (5 marks) The st<strong>and</strong>ard deviation <strong>of</strong> the<br />
sample points about the regression line in (a)<br />
is s e = 3.47. Obtain an estimate for the<br />
st<strong>and</strong>ard error <strong>of</strong> the slope <strong>of</strong> the regression<br />
<strong>and</strong> hence set up a 95% confidence interval<br />
for the slope <strong>of</strong> the regression line. State<br />
whether you would reject the null hypothesis<br />
that the true slope is equal to 0.<br />
(c) (3 marks) What is the predicted systolic<br />
blood pressure for a low birth weight infant<br />
whose gestational age is 31 weeks?<br />
Construct a 95% confidence interval for the<br />
prediction.<br />
(d) (1 mark) The value <strong>of</strong> the coefficient <strong>of</strong><br />
determination is R – Sq = 67%. Interpret this<br />
value. (discussed next lecture)<br />
(e) (3 marks) What conclusions would you draw<br />
from the two residual plots below arising<br />
from the fitted regression in (a)?<br />
SOLUTION<br />
(a) β̂₁ = 806.31/635.69 = 1.27<br />
β̂₀ = 47.31 − 1.27(28.89) = 10.62<br />
ŷ = 10.62 + 1.27x<br />
For infants with gestational age one week<br />
higher, the model predicts sbp increases by<br />
1.27 mmHg.<br />
(b) Estimated standard error<br />
= 3.47 / √635.69 = 0.138<br />
95% C.I. is 1.27 ± 1.98(0.138)<br />
giving 1.27 ± 0.273<br />
or 1.00 < β₁ < 1.54<br />
The confidence interval excludes zero (p-value<br />
< 0.05), hence reject the null hypothesis.<br />
(c) Prediction = 10.62 + 1.27(31) = 49.99<br />
95% C.I. is<br />
49.99 ± 1.98(3.47) √( 1 + 1/100 + (31 − 28.89)²/635.69 )<br />
giving 49.99 ± 6.92<br />
or 43.07 < ŷ₃₁ < 56.91<br />
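A plain-Python check of parts (a) to (c), using the rounded values as in the notes (the course itself would do this in R-cmdr):

```python
import math

sxx, sxy = 635.69, 806.31     # summaries from the data table
xbar, ybar = 28.89, 47.31
n, s_e, t = 100, 3.47, 1.98   # t with 98 d.f. is approximately 1.98

b1 = round(sxy / sxx, 2)      # (a) slope, rounded as in the notes
b0 = round(ybar - b1 * xbar, 2)
print(b0, b1)                 # 10.62 1.27

se_b1 = s_e / math.sqrt(sxx)  # (b) standard error of the slope
print(round(se_b1, 3))        # 0.138

pred = b0 + b1 * 31           # (c) predicted sbp at 31 weeks
half = t * s_e * math.sqrt(1 + 1/n + (31 - xbar)**2 / sxx)
print(round(pred, 2), round(half, 1))  # 49.99 and about 6.9
```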
(d) 67% <strong>of</strong> the total sum <strong>of</strong> squares <strong>of</strong> the sbp<br />
values is explained by changes in the number<br />
<strong>of</strong> weeks <strong>of</strong> gestation. (Alternatively, 67% <strong>of</strong><br />
the variation in the sbp values is explained.)<br />
(Discussed next lecture.)<br />
(e) Variation about the fitted line is constant for<br />
different gestation times. The residuals<br />
appear close to a normal distribution except<br />
for a possible outlier at x = 29.<br />
Correlation<br />
The correlation coefficient is a measure of linear<br />
association. The Pearson correlation coefficient r<br />
is defined as<br />
r = ∑(x_i − x̄)(y_i − ȳ) / √( ∑(x_i − x̄)² ∑(y_i − ȳ)² )<br />
with all sums running over i = 1, …, n.<br />
This measures the 'strength' of linear association<br />
between X and Y (as we shall now see). Recall<br />
that the regression line passes through the point (x̄, ȳ).<br />
[Figure: scatter plot of Y against X, divided into four quadrants, labelled 1 to 4, by a vertical line through x̄ and a horizontal line through ȳ.]<br />
The denominator in the formula for r is always<br />
positive. In quadrant 1, x_i − x̄ > 0 and<br />
y_i − ȳ > 0, meaning (x_i − x̄)(y_i − ȳ) > 0. In<br />
quadrant 3, x_i − x̄ < 0 and y_i − ȳ < 0, again giving<br />
(x_i − x̄)(y_i − ȳ) > 0. In quadrants 2 and 4,<br />
(x_i − x̄)(y_i − ȳ) < 0.<br />
Therefore, r is large and positive if the points lie mainly<br />
in quadrants 1 and 3; it is large and negative if the<br />
points lie mainly in quadrants 2 and 4.<br />
[Figures (i) and (ii): (i) a patternless scatter of points; (ii) a strong but non-linear (curved) pattern of points.]<br />
In case (i) the contributions from the four quadrants<br />
are equal and cancel, and therefore r = 0; there is no<br />
relationship between Y and X. In case (ii) there is<br />
again cancellation and r = 0, but here there is a strong<br />
relationship between Y and X; it is simply non-linear.<br />
r therefore measures the strength of the linear<br />
association between X and Y. But we must be<br />
careful, as r = 0 in the following case (iii) where<br />
β₁ = 0. In fact r is directly related to β̂₁ and is zero<br />
if β̂₁ is zero.<br />
[Figure (iii): a band of points with no overall slope, so the fitted slope β₁ = 0 and hence r = 0.]<br />
Example: A researcher investigates the<br />
relationship between reading and spelling tests<br />
administered to nine students.<br />
Student 1 2 3 4 5 6 7 8 9<br />
X (spelling) 52 90 63 81 93 51 48 99 85<br />
Y (reading) 56 81 75 72 50 45 39 87 59<br />
x_i    y_i    (x_i − x̄)²    (y_i − ȳ)²    (x_i − x̄)(y_i − ȳ)<br />
52     56     …             …             …<br />
90     81     …             …             …<br />
63     75     …             …             …<br />
81     72     …             …             …<br />
93     50     …             …             …<br />
51     45     …             …             …<br />
48     39     …             …             …<br />
99     87     …             …             …<br />
85     59     …             …             …<br />
Totals:       3220.2225     2258.0001     1718.6665<br />
x̄ = 73.55, ȳ = 62.67<br />
r = 1718.6665 / √( 3220.2225 × 2258.0001 ) = +0.6374<br />
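The elided table columns can be filled in by machine; a short plain-Python computation of r from the nine score pairs (the notes would use R-cmdr for this):

```python
import math

# Spelling (x) and reading (y) scores for the nine students
x = [52, 90, 63, 81, 93, 51, 48, 99, 85]
y = [56, 81, 75, 72, 50, 45, 39, 87, 59]

n = len(x)
xbar = sum(x) / n
ybar = sum(y) / n

sxx = sum((xi - xbar) ** 2 for xi in x)
syy = sum((yi - ybar) ** 2 for yi in y)
sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))

r = sxy / math.sqrt(sxx * syy)  # Pearson correlation coefficient
print(round(r, 4))              # 0.6374
```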
But what does this mean?<br />
Very strong correlation:<br />
[Figures: tightly clustered points along an upward-sloping line (r near +1) and along a downward-sloping line (r near −1).]<br />
[Figures: progressively looser scatters illustrating smaller correlations: r ≈ +0.7 and r ≈ −0.7 show clear but imperfect linear trends; r ≈ +0.2 and r ≈ −0.2 show very small, near-random scatter.]<br />
Notes:<br />
1. The largest value <strong>of</strong> r turns out to be +1. In<br />
this case all points lie on a straight line in<br />
quadrants 1 <strong>and</strong> 3 . This implies perfect<br />
positive linear association. i.e. as X increases,<br />
Y increases in the same ratio (if the increase<br />
<strong>of</strong> X is doubled, the increase in Y would also<br />
be doubled).<br />
2. r = −1 is the smallest value, which implies perfect<br />
negative linear association when all points lie<br />
in quadrants 2 and 4, i.e. as X increases, Y<br />
decreases in the same ratio.<br />
3. |r| > 0.7 implies strong linear relationship.<br />
|r| < 0.3 implies negligible linear relationship.<br />
4. The correlation coefficient is an index. It<br />
does not depend on the units <strong>of</strong> either X or Y.<br />
(numerator <strong>and</strong> denominator in same units)<br />
5. r is called the Pearson Correlation<br />
Coefficient.<br />
6. An important correlation does not imply a<br />
causal link between the two variables. (The<br />
correlation is <strong>of</strong>ten caused by the effect <strong>of</strong> a<br />
third variable influencing both X <strong>and</strong> Y).<br />
e.g. smoking and lung cancer incidence being<br />
correlated does not, by itself, establish that<br />
smoking causes lung cancer.<br />
7. If r is large, a regression line will fit the data<br />
well.<br />
8. r² gives the fraction of variability in the Y<br />
values associated with the predictor variable X.<br />
e.g. In the example, r = 0.6374, so r² = 0.406:<br />
40.6% of the variability in Y is explained<br />
by changes in X.<br />
That is,<br />
r² = SS(Regression) / SS(Total)<br />
where SS(Total) = SS(Regression) + SS(Residual),<br />
for a simple linear regression.<br />
Some examples on correlation <strong>and</strong> association discussed in lectures.<br />
Correlation measures association but association is not the same as causation.<br />
Example: For school children, shoe size is strongly correlated with reading skills.<br />
Learning new words does not make the feet get bigger.<br />
Instead, there is a third factor, age. As children get older, they learn to read better <strong>and</strong> they outgrow<br />
their shoes.<br />
Age is a confounder. Here, this confounder is easy to spot. Often this is not so easy. The<br />
arithmetic <strong>of</strong> the correlation coefficient does not give protection against third factors.<br />
Example: Education level <strong>and</strong> unemployment.<br />
In the Great Depression (1929 – 1933), better educated people had shorter spells <strong>of</strong> unemployment.<br />
(Education level and days unemployed were very highly<br />
correlated, negatively: more education was associated with<br />
fewer days unemployed.) Does education protect you against unemployment?<br />
Discussion:<br />
Perhaps, but the data were observational. Age is a confounding variable. Younger people were<br />
better educated as education level had been increasing over time. (It still is!!)<br />
Employers seemed to prefer younger job seekers.<br />
Controlling for age made the effect <strong>of</strong> education on unemployment much weaker.<br />
Example:<br />
In countries where people eat lots <strong>of</strong> fat, rates <strong>of</strong> breast <strong>and</strong> colon cancer are high. This correlation<br />
is often used to argue that fat in the diet causes cancer. How good is this evidence?<br />
[Figure: scatter plot of death rate (per 100 000) against fat intake per capita per day (grams), one point per country; Thailand, Sri Lanka and Japan lie at the low end and Denmark, NZ, UK, Holland, Spain and Finland at the high end, with the points rising steeply.]<br />
Discussion: There is a very high correlation as shown by the scatter diagram which is very<br />
elongated. If fat in diet causes cancer, then the points should slope up as shown. So the diagram is<br />
some evidence for the theory. But the evidence is weak.<br />
For example, countries with lots <strong>of</strong> fat in diet also have lots <strong>of</strong> sugar, <strong>and</strong> a similar plot for sugar<br />
would be found.<br />
As it turns out, fat <strong>and</strong> sugar are relatively expensive. In rich countries people can afford to eat fat<br />
<strong>and</strong> sugar rather than starchier grain products.<br />
Some aspects <strong>of</strong> diet in these countries or these life-style factors probably do cause certain kinds <strong>of</strong><br />
cancer. Epidemiologists can identify only a few <strong>of</strong> these factors with confidence. Fat is not among<br />
them.<br />
Example: Ultrasound <strong>and</strong> low birthweight.<br />
Babies can be examined in the womb using ultrasound. Several experiments on lab animals have<br />
shown ultrasound exams can cause low birthweight. If true for humans, there are grounds for<br />
concern. Scientists at Johns Hopkins Hospital in Baltimore ran an observational study to find out.<br />
Babies exposed to ultrasound differ from unexposed babies in many ways besides exposure; this<br />
investigation was only an observational study.<br />
The scientists found a number <strong>of</strong> confounding variables <strong>and</strong> adjusted for them. There was still an<br />
association. Babies exposed to ultrasound in the womb had lower birthweight, on average.<br />
Is this evidence that ultrasound causes lower birthweight?<br />
Discussion: Obstetricians suggest ultrasound examination when something seems wrong. The<br />
investigators concluded that the ultrasound exams <strong>and</strong> low birthweights had a common cause –<br />
problem pregnancies.<br />
Later, a r<strong>and</strong>omized controlled experiment was carried out to get more definite evidence. If<br />
anything, ultrasound was protective.<br />
Journal of Obstetrics and Gynaecology, 71 (1988), pp. 513–517.<br />
Also Lancet (1988), pp. 585–588.<br />
REVIEW EXERCISES<br />
1. Physical fitness testing is an important aspect of athletic training. A common measure of<br />
cardiovascular fitness is the maximum volume of oxygen uptake during strenuous exercise. A study was<br />
conducted on 18 middle-aged men to study the influence on oxygen uptake of the time taken to complete a 2-mile run.<br />
The oxygen uptake measure was obtained with standard laboratory methods as the subjects performed<br />
on a motor-driven treadmill. The data (Ribisl et al., Journal of Sports Medicine, 9: 17-22) are below:<br />
Maximum Volume of O₂ (Y)    Time in Seconds (X)<br />
42.33    918<br />
53.10    805<br />
42.08    892<br />
42.45    968<br />
42.46    907<br />
49.92    743<br />
36.23    1045<br />
49.66    810<br />
41.49    927<br />
46.16    813<br />
48.18    858<br />
51.81    760<br />
53.28    747<br />
53.29    743<br />
47.18    803<br />
56.91    683<br />
47.80    844<br />
53.69    700<br />
Data summary: x̄ = 831.40, ȳ = 47.67<br />
∑(x_i − x̄)² = 160613.28<br />
∑(x_i − x̄)(y_i − ȳ) = −8698.33<br />
∑(y_i − ŷ_i)² = 55.25<br />
(a) Use the data summary to find an estimate for the equation of the least squares regression line of Y on X.<br />
(2 marks)<br />
(b) Find an estimate for the standard error of the slope of the regression line and set up a 95% confidence<br />
interval for the slope of the regression line.<br />
(4 marks)<br />
(c) What does the confidence interval in (b) tell you about the effect of time (X) on maximum volume of<br />
oxygen uptake (Y)?<br />
(1 mark)<br />
(d) If a man in this age group takes 50 seconds longer to run the 2-mile distance, what is the change in his<br />
maximum volume of oxygen uptake? Write down the 95% confidence interval for this change using the<br />
result from (b).<br />
(2 marks)<br />
(e) Set up a 95% confidence interval for the maximum volume of oxygen uptake for a man who takes 11<br />
minutes (660 seconds) to complete a two-mile run.<br />
(3 marks)<br />
SOLUTIONS<br />
1. (a) b_YX = −8698.33/160613.28 = −0.054<br />
ŷ = 47.67 − 0.054(x − 831.4)<br />
= 92.566 − 0.054x<br />
(b) Estimated standard error = s_e / √( ∑(x_i − x̄)² )<br />
where s_e = √( ∑(y_i − ŷ_i)² / (n − 2) ) = √(55.25/16)<br />
That is, standard error = √(55.25/16) / √160613.28 = 0.004637<br />
A 95% confidence interval for the true slope is<br />
−0.054 ± t₁₆(0.004637) where t₁₆ = 2.120<br />
That is, −0.054 ± 0.0098<br />
giving −0.064 < β_YX < −0.044<br />
(c) The maximum volume of oxygen uptake is smaller for men who take longer to run 2<br />
miles.<br />
(d) Oxygen uptake reduces by 50(0.054) = 2.7 units.<br />
The 95% confidence interval for this change extends from 50(0.044) to 50(0.064), i.e. a reduction<br />
of between 2.2 and 3.2 units.<br />
(e) When x = 660 seconds, ŷ = 92.566 − 0.054(660) = 56.93<br />
The 95% confidence interval is<br />
56.93 ± t₁₆ s_e √( 1 + 1/n + (x_k − x̄)² / ∑(x_i − x̄)² )<br />
That is, 56.93 ± 2.120 √(55.25/16) √( 1 + 1/18 + (660 − 831.4)²/160613.28 )<br />
or 56.93 ± 4.38<br />
giving 52.55 < ŷ₆₆₀ < 61.31<br />
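All of these review-exercise numbers can be reproduced from the data summary alone; a plain-Python sketch (not part of the original solutions, which assume R-cmdr):

```python
import math

# Summary quantities from the review exercise
n = 18
xbar, ybar = 831.40, 47.67
sxx = 160613.28          # sum of (x_i - xbar)^2
sxy = -8698.33           # sum of (x_i - xbar)(y_i - ybar)
rss = 55.25              # sum of (y_i - yhat_i)^2
t16 = 2.120              # t value, 16 d.f., 95% confidence

b1 = sxy / sxx                         # (a) slope
s_e = math.sqrt(rss / (n - 2))         # residual standard deviation
se_b1 = s_e / math.sqrt(sxx)           # (b) standard error of the slope
print(round(b1, 3))                    # -0.054
print(round(se_b1, 6))                 # about 0.004637

lo = b1 - t16 * se_b1
hi = b1 + t16 * se_b1
print(round(lo, 3), round(hi, 3))      # -0.064 -0.044

# (e) 95% interval for a man taking 660 seconds
yhat = ybar + b1 * (660 - xbar)
half = t16 * s_e * math.sqrt(1 + 1/n + (660 - xbar)**2 / sxx)
print(round(yhat, 2), round(half, 2))  # about 56.95 and 4.38
# (the notes' 56.93 uses the slope rounded to -0.054)
```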
SECTION 10<br />
Multiple regression models <strong>and</strong> logistic regression models are introduced in this section. In the case<br />
of ordinary multiple regression the response or outcome variable is on a continuous scale, whereas<br />
in the case of a logistic regression the outcome measure is binary, taking only two possible<br />
values, interpreted as success versus failure.<br />
The models allow us to identify those variables which have an effect on the outcomes <strong>and</strong> those<br />
variables which do not.<br />
Adding additional variables leads to adjusted values for estimated parameters <strong>and</strong> it is this that<br />
allows us to control for confounding.<br />
The Multiple Regression Model<br />
R-cmdr Printout for Multiple Regression<br />
Dummy Variables<br />
Checking Model Fit<br />
Parallel Regression Lines <strong>and</strong> Analysis <strong>of</strong> Covariance<br />
Binary Outcomes <strong>and</strong> Logistic Regression<br />
373<br />
Section 10
Multiple regression<br />
• Simple linear regression (SLR) allowed us to<br />
assess the effect <strong>of</strong> a single independent<br />
variable (X) on a response variable (Y).<br />
• But what do we do if we think that the<br />
response may change according to more<br />
than one independent variable?<br />
• SLR can be extended to handle this.<br />
• Multiple regression allows us to assess the<br />
effects <strong>of</strong> several independent variables on<br />
the outcome variable <strong>and</strong> it allows the<br />
prediction <strong>of</strong> a response from the values <strong>of</strong><br />
several independent variables.<br />
• In multiple regression, there is a single<br />
dependent (outcome) variable <strong>and</strong> two or<br />
more independent (explanatory, predictor)<br />
variables or covariates.<br />
• The predictor variables can be:<br />
Continuous (e.g. blood pressure, height)<br />
Categorical – binary (e.g. sex)<br />
• The type <strong>of</strong> multiple regression that is<br />
performed depends on the data type <strong>of</strong> the<br />
outcome variable.<br />
• If the outcome variable is continuous, we use<br />
multiple linear regression.<br />
• If the outcome variable is binary, we use<br />
multiple logistic regression.<br />
The possible applications <strong>of</strong> multiple<br />
regression include:<br />
1. Adjusting for the effect <strong>of</strong> confounding<br />
variables.<br />
2. Establishing which variables are important in<br />
explaining the values <strong>of</strong> the outcome<br />
(response) variable.<br />
3. Predicting values <strong>of</strong> the outcome variable.<br />
4. Describing the strength <strong>of</strong> the association<br />
between the outcome variable <strong>and</strong> explanatory<br />
variables <strong>and</strong> reducing residual variation by<br />
introducing further effects as predictor<br />
variables.<br />
Multiple regression investigates <strong>and</strong> tests the joint<br />
effect <strong>of</strong> all predictors on the outcome variable as<br />
well as the measurement <strong>of</strong> individual effects <strong>of</strong><br />
each predictor.<br />
Example: Predict lung capacity from age, sex<br />
<strong>and</strong> height <strong>of</strong> patient.<br />
Lung capacity itself is difficult to measure. For<br />
heart-lung transplants to have the best chance of<br />
success it is desirable to have donor and recipient<br />
lungs of similar size.<br />
The multiple linear regression model:<br />
y = β₀ + β₁x₁ + β₂x₂ + β₃x₃ + … + ε (error)<br />
For simple linear regression the model is:<br />
y = β₀ + β₁x + ε<br />
The fitted straight line then becomes<br />
ŷ = β̂₀ + β̂₁x<br />
where β̂₀ and β̂₁ are chosen to minimise the sum<br />
of the squared errors (residuals).<br />
In the case of two explanatory variables, the<br />
multiple linear regression model can be written in<br />
the following form:<br />
y = β₀ + β₁x₁ + β₂x₂ + ε<br />
where ε is the residual (including random error)<br />
with mean of zero (for all data values i) and<br />
constant variance.<br />
The fitted regression equation is<br />
ŷ = β̂₀ + β̂₁x₁ + β̂₂x₂<br />
The estimates β̂₀, β̂₁ and β̂₂ are found from the<br />
data in such a way that the sum of the squared<br />
residuals (errors), that is<br />
∑ [ y_i − (β̂₀ + β̂₁x₁ᵢ + β̂₂x₂ᵢ) ]²,<br />
is minimised.<br />
The results are complicated <strong>and</strong> statistical<br />
s<strong>of</strong>tware is always used for calculations.<br />
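To make the least squares idea concrete, here is a small self-contained sketch (plain Python, not R-cmdr) that fits y = β₀ + β₁x₁ + β₂x₂ by solving the normal equations; the tiny data set is invented so that the exact coefficients are known in advance:

```python
def fit_two_predictor_ols(x1, x2, y):
    """Fit y = b0 + b1*x1 + b2*x2 by least squares via the normal equations."""
    n = len(y)
    # Build X'X and X'y for the design matrix with columns [1, x1, x2]
    cols = [[1.0] * n, x1, x2]
    xtx = [[sum(a * b for a, b in zip(c1, c2)) for c2 in cols] for c1 in cols]
    xty = [sum(c * yi for c, yi in zip(col, y)) for col in cols]
    # Solve the 3x3 system by Gauss-Jordan elimination
    m = [row[:] + [t] for row, t in zip(xtx, xty)]
    for i in range(3):
        p = m[i][i]
        m[i] = [v / p for v in m[i]]
        for j in range(3):
            if j != i:
                f = m[j][i]
                m[j] = [vj - f * vi for vj, vi in zip(m[j], m[i])]
    return [m[k][3] for k in range(3)]

# Invented data generated exactly from y = 1 + 2*x1 + 3*x2 (no error term),
# so least squares must recover b0 = 1, b1 = 2, b2 = 3
x1 = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [1.0, 0.0, 2.0, 1.0, 3.0, 2.0]
y = [1 + 2 * a + 3 * b for a, b in zip(x1, x2)]
print([round(c, 6) for c in fit_two_predictor_ols(x1, x2, y)])  # [1.0, 2.0, 3.0]
```

With real data (and an error term) the recovered coefficients are the least squares estimates rather than exact values; statistical software adds the standard errors and tests discussed below.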
Example<br />
For lung transplantation it is desirable for the<br />
donor’s lungs to be <strong>of</strong> a similar size as those <strong>of</strong><br />
the recipient. Total lung capacity (TLC) is<br />
difficult to measure, so it is useful to be able to<br />
predict TLC from other information. The<br />
following table shows the pre-transplant TLC <strong>of</strong><br />
32 recipients <strong>of</strong> heart-lung transplants, <strong>and</strong> their<br />
age, sex and height:<br />
Age Sex Height(cm) TLC(l) Age Sex Height(cm) TLC(l)<br />
1 35 F 149 3.40 17 30 F 172 6.30<br />
2 11 F 138 3.41 18 21 F 163 6.55<br />
3 12 M 148 3.80 19 21 F 164 6.60<br />
4 16 F 156 3.90 20 20 M 189 6.62<br />
5 32 F 152 4.00 21 34 M 182 6.89<br />
6 16 F 157 4.10 22 43 M 184 6.90<br />
7 14 F 165 4.46 23 35 M 174 7.00<br />
8 16 M 152 4.55 24 39 M 177 7.20<br />
9 35 F 177 4.83 25 43 M 183 7.30<br />
10 33 F 158 5.10 26 37 M 175 7.65<br />
11 40 F 166 5.44 27 32 M 173 7.80<br />
12 28 F 165 5.50 28 24 M 173 7.90<br />
13 23 F 160 5.73 29 20 F 162 8.05<br />
14 52 M 178 5.77 30 25 M 180 8.10<br />
15 46 F 169 5.80 31 22 M 173 8.70<br />
16 29 M 173 6.00 32 25 M 171 9.45<br />
Step 1: First look at some plots in order to gain an<br />
underst<strong>and</strong>ing <strong>of</strong> the data<br />
1. Plot each predictor variable against the<br />
outcome.<br />
Relationship between total lung capacity and age<br />
[Figure: scatter plot of total lung capacity (l) against age (yrs).]<br />
It appears that total lung capacity is not affected<br />
by age.<br />
It appears total lung capacity increases as height<br />
increases.<br />
The effect <strong>of</strong> sex is not clear.<br />
Step 2: Fit (in R-cmdr) Simple Linear<br />
Regression models for each predictor variable.<br />
1. Age alone:<br />
TLC = 5.07 + 0.036 × age<br />
If age increases by one year, TLC increases<br />
by 0.036 litre (which is not significant if<br />
tested).<br />
2. Height alone:<br />
TLC = −9.74 + 0.095 × height<br />
If height increases by 1 cm, TLC increases<br />
by 0.095 litre (which is significant if tested).<br />
Step 3: Fit (in R-cmdr) Multiple Linear<br />
Regression Model.<br />
3. Age <strong>and</strong> height<br />
From the regression equation for the model including<br />
age and height, the predicted TLC for someone<br />
aged 25 and with a height of 160 cm is:<br />
TLC = -11.218 – 0.030 × 25 + 0.108 × 160<br />
= 5.322 litres<br />
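A quick plain-Python version of this prediction; note that with the rounded coefficients shown above the arithmetic gives about 5.31 litres, while the 5.322 in the notes reflects the unrounded R-cmdr coefficients:

```python
def predict_tlc(age, height):
    """Predicted TLC (litres) from the age + height model, rounded coefficients."""
    return -11.218 - 0.030 * age + 0.108 * height

print(round(predict_tlc(25, 160), 3))  # 5.312 with the rounded coefficients
```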
Regressions which include binary (e.g. sex)<br />
predictor variables<br />
The predictor variable, SEX, has two categories<br />
only, female <strong>and</strong> male. We need a technique for<br />
including such binary variables in the regression<br />
models.<br />
Define a dummy variable (D) as follows:<br />
D = 0 if female, 1 if male<br />
If there are two other predictors X₁ and X₂ then<br />
we fit the model<br />
y = β₀ + β₁x₁ + β₂x₂ + β₃d + ε<br />
The fitted equation is therefore<br />
ŷ = β̂₀ + β̂₁x₁ + β̂₂x₂ + β̂₃d<br />
We find estimates β̂₀, β̂₁, β̂₂ and β̂₃ by minimising<br />
the sum of squared residuals as before (using the<br />
computer).<br />
4. Model with age, height <strong>and</strong> sex.<br />
Model interpretation:<br />
* TLC decreases with increasing age.<br />
For a person 10 years older, the predicted TLC will<br />
be 0.25 litres lower.<br />
* TLC increases with increasing height.<br />
For a person 10 cm taller, the predicted TLC will<br />
be 0.9 litres higher.<br />
* Males have higher TLC than females:<br />
For males, the predicted TLC is 0.697 litres higher<br />
than for females with the same age and height.<br />
For females, sex = 0,<br />
so TLC = −8.54 − 0.025 age + 0.0895 height + 0.697 × 0<br />
For males, sex = 1,<br />
so TLC = −8.54 − 0.025 age + 0.0895 height + 0.697 × 1<br />
Therefore, the difference in average TLC between<br />
males and females is 0.697.<br />
Note: compare this to the crude difference in mean<br />
TLC between males <strong>and</strong> females<br />
It is 6.98 − 5.20 = 1.78 litres,<br />
where 6.98 and 5.20 are the male and female<br />
averages.<br />
Some <strong>of</strong> this difference between males <strong>and</strong><br />
females can be explained by differences in age<br />
<strong>and</strong> height.<br />
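The crude difference can be checked directly from the data table above (plain Python; the adjusted difference of 0.697 litres is the sex coefficient from the fitted model):

```python
# TLC values (litres) read from the table of 32 heart-lung transplant recipients
female_tlc = [3.40, 3.41, 3.90, 4.00, 4.10, 4.46, 4.83, 5.10,
              5.44, 5.50, 5.73, 5.80, 6.30, 6.55, 6.60, 8.05]
male_tlc = [3.80, 4.55, 5.77, 6.00, 6.62, 6.89, 6.90, 7.00,
            7.20, 7.30, 7.65, 7.80, 7.90, 8.10, 8.70, 9.45]

crude_diff = sum(male_tlc) / len(male_tlc) - sum(female_tlc) / len(female_tlc)
print(round(crude_diff, 2))  # 1.78, versus the adjusted difference of 0.697
```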
Overall, how well does the model fit?<br />
The analysis of variance is:<br />
1. The regression effect has 3 degrees <strong>of</strong><br />
freedom since there are 3 predictor variables<br />
in the model.<br />
2. The ANOVA table shows the ‘usefulness’ <strong>of</strong><br />
the linear regression model – we want the p-<br />
value to be < 0.05.<br />
Here, p-value = 0.000, implying that at least<br />
one <strong>of</strong> the explanatory variables has a<br />
significant linear relationship with the<br />
outcome variable.<br />
3. The strength <strong>of</strong> the relationship between<br />
TLC <strong>and</strong> the three predictors can be<br />
expressed as the proportion <strong>of</strong> the total SS<br />
explained by the regression equation.<br />
The coefficient <strong>of</strong> determination is:<br />
R 2 = 44.305/81.712 = 54.2%<br />
Thus, 54.2% <strong>of</strong> the total sum <strong>of</strong> squares<br />
(variation) is explained by age, height <strong>and</strong> sex<br />
together.<br />
Notice how the value <strong>of</strong> R 2 has increased from<br />
0.510 or 51.0% to the value <strong>of</strong> 0.542 or 54.2%<br />
when all three predictor variables are included.<br />
Are all three variables needed in the model?<br />
There are 3 ways <strong>of</strong> evaluating the importance <strong>of</strong><br />
a variable in the model:<br />
1. Construct a test <strong>of</strong> the null hypothesis that<br />
the regression coefficient = 0.<br />
2. Calculate a 95% confidence interval for the<br />
regression coefficient.<br />
Note: Regardless of whether an additional<br />
variable is significant or not, the real point<br />
at issue is that the other regression<br />
parameters are adjusted for the influence<br />
of these new confounding variables to<br />
produce adjusted tests or confidence<br />
intervals.<br />
Model is<br />
TLC = β₀ + β₁ age + β₂ height + β₃ sex + ε<br />
giving R-cmdr printout as follows:<br />
Std Error is the standard error of the corresponding regression coefficient. (See how the coefficients of age and height change when allowance is made for sex.)
1. Test of the hypothesis H0: β3 = 0
Is the variable sex an important predictor in the model?
T = (β̂3 − 0)/s.e.(β̂3) = (0.697 − 0)/0.499 = 1.396
p-value = 0.174. There is no evidence sex is important in predicting TLC: the coefficient is not significantly different from 0.
(Note: the t-test has 28 degrees of freedom, the DF of the residual (error) effect.)
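The t statistic is just the estimate divided by its standard error; a quick recomputation from the printout values (the notes' 1.396 reflects rounding of the unrounded estimates):

```python
# t statistic for H0: beta3 = 0, using the rounded printout values
beta3_hat = 0.697   # estimated coefficient of sex
se_beta3 = 0.499    # its standard error

t = (beta3_hat - 0) / se_beta3
print(round(t, 3))  # 1.397 (printout: 1.396, from unrounded values)
```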
Test of the hypothesis H0: β1 = 0
Age: t = −0.025/0.024 = −1.063
with 28 degrees of freedom (residual DF)
p-value = 0.297
No evidence age affects TLC.
Test of the hypothesis H0: β2 = 0
Height: t = 3.647 (p-value = 0.001)
Strong evidence height is important in predicting TLC.
2. Calculating a confidence interval for a regression parameter
A true parameter βi is estimated by β̂i. For sex, the parameter estimates the difference in average TLC between males and females after taking into account age and height.
The C.I. for βi is: β̂i ± t28 × s.e.(β̂i)
For sex, this becomes
0.697 ± t28(0.499)
where t28 = 2.048 for a 95% confidence interval.
That is 0.697 ± 1.022, i.e. (−0.326, 1.720).
This includes zero, so there is no evidence of a difference in average TLC between men and women.
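This adjusted interval can be reproduced from the printout values (a sketch; small differences from the quoted (−0.326, 1.720) come from rounding in the coefficient and standard error):

```python
beta3_hat = 0.697   # coefficient of sex
se_beta3 = 0.499    # its standard error
t28 = 2.048         # 95% two-sided critical value, 28 df

half_width = t28 * se_beta3
ci = (beta3_hat - half_width, beta3_hat + half_width)
print(tuple(round(v, 3) for v in ci))  # approximately (-0.325, 1.719)
```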
Note:
The above interval is called an adjusted confidence interval. Recall the unadjusted difference in means was −1.78. The unadjusted 95% confidence interval for the true difference in mean TLC between males and females is (−2.77, −0.79).
Adjusting for age and height has removed the statistically significant association between sex and TLC.
95% confidence interval for the coefficient of age:
−0.025 ± t28(0.024), or (−0.073, 0.023)
95% confidence interval for the coefficient of height:
0.0895 ± t28(0.025), or (0.039, 0.140)
Note the correspondence between the 95% confidence interval and the t-test carried out at the 0.05 (2-sided) significance level.
Note:
(i) The effect of sex was contained in the residual when TLC was expressed in terms of age and height only. The effect of the residual was therefore greater.
(ii) The real effect of interest can be hidden by residual variability; reducing this residual variability by including more predictors in the model can improve the analysis (and therefore the study). The p-values associated with hypothesis tests for the parameters of interest will generally be smaller.
(iii) Confounders can affect the parameter estimates of the predictor variables of interest as well as the residual variability. Therefore, including confounders in the model is important for obtaining valid estimates of the coefficients of interest, regardless of the reduction in residual variability.
Checking the fit of the model
We do not expect our model to be correct. We want it to capture the important aspects of the process under investigation, but also to simplify things enough to aid understanding. Choosing an appropriate model is a complex art which is covered more fully in higher-level courses on regression. Here we consider some basic principles.
Rule of thumb:
We should not perform a multiple linear regression analysis if the number of variables in the model is greater than the number of individuals divided by 10.
Residual plots
1. The residuals associated with each data value should be normally distributed with mean = 0 and constant variance. (In R-cmdr we can save the residuals for subsequent plotting, e.g. a normal probability plot.)
2. The printouts also identify any unusual data point which has a very large residual. The residuals can be standardised to have mean zero and standard deviation one, so that unusual cases can be seen clearly. (One of the options in R-cmdr is to save the standardised residuals.)
(1) Checking the normality assumption for the residuals
The matching histogram will present the usual bell-shaped pattern for the 32 residuals.
The points in the normal P-P plot lie along a straight line, confirming the distribution of the residuals is close to normal.
Two extreme points correspond to:
i) female, aged 20, height 162 cm: predicted value from the model is 5.46 and actual TLC is 8.05;
ii) male, aged 25, height 171 cm: predicted TLC from the model is 6.84, actual TLC is 9.45.
(2) Plot of residuals vs independent variables
Residuals versus age plot
This plot identifies the negative residuals for the people under 20 years and also shows the two large outliers. Otherwise the plot is reasonably random about zero.
Residuals versus height plot
Again the plot has negative residuals for the shorter people and identifies the two large outliers. These plots indicate special thought should be given to whether the young people should be retained in the analysis.
Analysis of Covariance
This analysis uses a multiple regression to compare simple regressions corresponding to the categories of a qualitative explanatory variable.
Example: A study investigates the effect of a treatment for hypertension on systolic blood pressure (BP) compared with a control treatment. Age was also known for all patients, and it was thought that age might confound the differences in BP between the groups.

TREATMENT          CONTROL
BP(Y)   AGE(X)     BP(Y)   AGE(X)
120     26         109     33
114     37         145     62
132     31         131     54
130     48         129     44
146     55         101     31
122     35         115     39
136     40         133     60
118     29         105     38

Control mean = 121.00 mm (of mercury)
Treatment mean = 127.25 mm (of mercury)
But note:
average age of control group = 45.13 years
average age of treated group = 37.63 years
[A] First, an ordinary unpaired t-test is performed on the BP(Y) values using the pooled variance of the Y values.
Analyze > Compare Means > Independent-Samples t-test
y is the test variable and d is the grouping variable; d is 0 for control and 1 for treatment.
There is no evidence of a difference between the two means as t = −0.932. The 95% confidence interval for μT − μC is (−8.1, 20.6), which includes 0, confirming no evidence of a difference between the means. Also p-value = 0.367.
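The unpaired t-test in [A] can be reproduced from the raw BP values in the table (a plain-Python sketch; the critical value t14 = 2.145 comes from tables, and the sign of t depends on which group mean is subtracted):

```python
import math
from statistics import mean

# BP values from the table above
treatment = [120, 114, 132, 130, 146, 122, 136, 118]
control = [109, 145, 131, 129, 101, 115, 133, 105]

m_t, m_c = mean(treatment), mean(control)   # 127.25 and 121.00

# pooled variance from the within-group sums of squares (14 df)
ss = sum((x - m_t) ** 2 for x in treatment) + sum((x - m_c) ** 2 for x in control)
df = len(treatment) + len(control) - 2
se_diff = math.sqrt((ss / df) * (1 / len(treatment) + 1 / len(control)))

t = (m_t - m_c) / se_diff                   # about 0.93 in magnitude
t14 = 2.145                                 # 95% two-sided critical value
ci = (m_t - m_c - t14 * se_diff, m_t - m_c + t14 * se_diff)   # about (-8.14, 20.64)
```

Note that `ss` comes out as 2519.5, the residual sum of squares quoted later for the regression of Y on d alone.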
At this stage the ages have been ignored. Age could be increasing the residual variation, hiding the true treatment difference, i.e. age could be a confounder.
[B] Second, a regression analysis of Y on d is performed, where d = 0 for the control and d = 1 for the treatment.
Analyze > Regression > Linear
(Again, the 16 Y values are in one column and the values of d in a second column.)
The estimated regression equation is
ŷ = 121 + 6.25d
The estimated coefficient for d is 6.25 with a standard error of 6.708. Note that when d = 0, ŷ = 121.00 and when d = 1, ŷ = 127.25, so the coefficient of d is the difference between the two means. The 95% confidence interval for the treatment difference is
6.25 ± t14(6.708), where t14 = 2.145
giving 6.25 ± 14.39, or (−8.14, 20.64)
as before. This regression is equivalent to the unpaired t-test. The age variable effect remains hidden in the residual.
Note: the confidence interval can also be obtained on the printout if requested.
[C] Third, a regression analysis of Y on X and d together is performed, where d = 0 for control, otherwise 1.
Analyze > Regression > Linear
(Values of X are now in a third column.)
The estimated regression equation is
ŷ = 73.9 + 1.04x + 14.1d
The estimated coefficient of d is now 14.082 with a standard error of 3.818. The coefficient of d represents the difference between patients of the same age, one in the control and one in the treated group.
e.g. Let X = xk be the age of two such patients. Then
ŷT − ŷC = (73.9 + 1.04xk + 14.1) − (73.9 + 1.04xk + 0) = 14.1
The 95% confidence interval for the difference is
14.082 ± t13(3.818), where t13 = 2.160
giving 14.082 ± 8.247, or (5.84, 22.33)
Now there is evidence that the treatment raises blood pressure, as 0 is excluded from the confidence interval. The 13 DF are n − 3, namely those of the residual.
Also note that the t-test value associated with d is 3.69 with a p-value of 0.003.
Also note how the effect of age has effectively been removed from the residual, whose sum of squares is substantially reduced from 2519.5 to 669.4.
The value of R² has risen from 0.058 or 5.8% to 0.750 or 75% when X is added to the model involving d only.
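The adjusted coefficients in [C] can be recovered from the raw data by solving the normal equations of the least-squares fit (a self-contained sketch; in practice the linear-model routine in R-cmdr does this internally):

```python
# Least-squares fit of y = b0 + b1*x + b2*d via the normal equations
bp  = [120, 114, 132, 130, 146, 122, 136, 118,   # treatment (d = 1)
       109, 145, 131, 129, 101, 115, 133, 105]   # control   (d = 0)
age = [26, 37, 31, 48, 55, 35, 40, 29,
       33, 62, 54, 44, 31, 39, 60, 38]
d   = [1] * 8 + [0] * 8

X = [[1.0, x, g] for x, g in zip(age, d)]        # design matrix rows

# Normal equations: (X'X) b = X'y
xtx = [[sum(r[i] * r[j] for r in X) for j in range(3)] for i in range(3)]
xty = [sum(r[i] * y for r, y in zip(X, bp)) for i in range(3)]

# Solve the 3x3 system by Gauss-Jordan elimination (pivots are nonzero here)
m = [row + [rhs] for row, rhs in zip(xtx, xty)]
for i in range(3):
    piv = m[i][i]
    m[i] = [v / piv for v in m[i]]
    for j in range(3):
        if j != i:
            factor = m[j][i]
            m[j] = [vj - factor * vi for vj, vi in zip(m[j], m[i])]

b0, b1, b2 = (m[k][3] for k in range(3))
# b0 ≈ 73.9, b1 ≈ 1.04, b2 ≈ 14.08, matching the printout
```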
The confidence interval here is the ADJUSTED CONFIDENCE INTERVAL after allowing for the effect of age.
Unadjusted interval: (−8.14, 20.64)
Adjusted interval: (5.84, 22.33)
It is helpful to put a geometrical interpretation on this analysis. The scatter diagram of Y(BP) against X(age) follows for all 16 patients.
[Scatter plot of Y(BP), 100–150, against X(Age), 25–65, for all 16 patients.]
Notice the difference between the treated group (dots) and the control group (crosses).
Suppose we fit the equation (by least squares)
ŷ = β̂0 + β̂1x + β̂2d
where d = 0 for control, d = 1 for treatment.
If d = 0, ŷ = β̂0 + β̂1x
If d = 1, ŷ = β̂0 + β̂1x + β̂2 = (β̂0 + β̂2) + β̂1x
These two lines are PARALLEL (same slope β̂1) but the intercepts are β̂0 and (β̂0 + β̂2). Thus, β̂2 is the vertical distance between the two parallel straight lines.
[Scatter plot of Y(BP) against X(Age) with the two parallel fitted lines; the vertical gap between the lines is β̂2.]
β̂2 is the effect of the treated group relative to the control. If β̂2 is significant, then there is evidence of different blood pressure values in the two groups. We see how to test β̂2 for significance shortly. The next printout gives Y regressed on X only, and Y regressed on X and d together.
Notes:
1. β̂2 (the coefficient of d) = 14.082 is the increase in blood pressure level due to administering the treatment (regardless of the age of a patient, since the two lines, being parallel, have a constant difference).
2. The 95% confidence interval for the coefficient of d (namely β2) is
14.082 ± t13(3.818), where t13 = 2.160
It follows that 5.84 < β2 < 22.33.
3. Without taking age into account, the treatment raised blood pressure by only 6.25 mm of mercury. Taking age into account, the treatment raised blood pressure by 14.082 mm.
4. The ordinary unpaired t-test originally suggested for this problem is equivalent to regressing Y on d alone. In this case, the variable x (or age) remains as part of the residual, which is therefore inflated, hiding the true treatment effect. In addition, correlation between age and treatment group distorts the estimate of the treatment effect on blood pressure.
Binary outcomes: Logistic Regression
Recall: for simple and multiple linear regression the outcome variable was continuous. What do we do if the outcome variable Y is binary?
e.g. disease present: yes/no
e.g. tuatara: present/absent
e.g. claim to ACC goes to litigation: yes/no
e.g. depression (yes/no) in 18-year-olds if bullied at school earlier
We use logistic regression (LR).
In a logistic regression the explanatory or predictor X variables can be either continuous or categorical (e.g. binary).
Like multiple regression, we can use logistic regression to:
(1) control for confounding;
(2) investigate the effect of several variables on the outcome variable at one time.
We can use the method of LR with data from any study type as long as we have a binary outcome.
The logistic regression model is:
ln(p/(1 − p)) = β0 + β1X1 + β2X2 + … + βkXk + ε
where
Y is the binary outcome variable (values 0 or 1)
p is the probability that a particular event will occur, i.e. Pr(Y = 1)
X1, X2, ..., Xk are the explanatory variables
β0 is the intercept
β1, β2, ..., βk are the regression coefficients
ε is the random error
Interpreting the model:
p/(1 − p) is the ‘odds’ of the event occurring
ln(p/(1 − p)) is the ‘log odds’
The regression coefficient βi represents the change in the log odds for a 1-unit change in Xi.
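The probability–odds–log-odds relationship can be illustrated numerically (a small sketch with a hypothetical probability of 0.25):

```python
import math

p = 0.25                       # hypothetical probability of the event
odds = p / (1 - p)             # 0.25/0.75 = 1/3
log_odds = math.log(odds)      # this is what the linear predictor models

# inverting the log odds recovers the probability
p_back = math.exp(log_odds) / (1 + math.exp(log_odds))
```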
Fitted logistic model:
The formulae to estimate the values of β0, β1, etc. are computationally complex. We shall not worry about the details here; instead we shall focus on understanding the results from a logistic regression R-cmdr printout.
Example:
A study was conducted to investigate the relationship between physical inactivity and myocardial infarction (MI). It was found that people who were physically inactive had an increased risk of MI. Age was considered to be a potential confounder.
Compared to younger people, older people:
• are more likely to be physically inactive;
• have a higher risk of MI.
Hence, we would expect that age can explain some of the association between physical inactivity and MI.
Outcome: whether a person has an MI (Y), where Y = 0 or 1
Exposure of interest: whether a person was physically inactive (exposure variable, X1)
Possible confounder: age (X2) of the person
(1) Investigating the relationship between physical inactivity and MI
Option 1: Calculate the odds ratio as shown earlier in the semester.
The 2 × 2 contingency table for outcome and exposure is constructed from the 924 people.

                       Outcome – MI
Exposure (X1)          Yes    No
Physically inactive    136    98
Physically active      343    347

Odds ratio of MI in exposed to unexposed:
OR = (136/98)/(343/347) = 1.40
with 95% confidence interval 1.04 < OR < 1.89.
Interpretation: the odds of having an MI are 40% higher for a person who is physically inactive compared to a physically active person. The result is significant.
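The odds ratio and its interval can be recomputed from the 2 × 2 table (a sketch; the interval assumes the usual log-odds-ratio standard error, which reproduces the quoted (1.04, 1.89)):

```python
import math

# 2x2 table: MI yes/no by physically inactive/active
a, b = 136, 98     # inactive: MI yes, MI no
c, d = 343, 347    # active:   MI yes, MI no

odds_ratio = (a / b) / (c / d)                 # about 1.40

# 95% CI on the log scale, then back-transformed
se_log_or = math.sqrt(1/a + 1/b + 1/c + 1/d)
lo = math.exp(math.log(odds_ratio) - 1.96 * se_log_or)
hi = math.exp(math.log(odds_ratio) + 1.96 * se_log_or)
# (lo, hi) is about (1.04, 1.89)
```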
Option 2: Alternatively, we can fit a logistic regression model using R-cmdr.
Y = MI                     1 = Yes, 0 = No
X1 = Physically inactive   1 = Yes, 0 = No
Fitted regression model:
ln(p̂/(1 − p̂)) = β̂0 + β̂1X1
where p̂ is the probability that a person has an MI.
R-cmdr commands:
Analyze > Regression > Binary Logistic
Dependent: enter MI
Covariate: enter Physical Inactivity. OK.
Results from R-cmdr:
Fitted equation is ln(p̂/(1 − p̂)) = −0.01 + 0.34X1
Odds ratio = 1.40 as before
95% confidence interval for OR is (1.04, 1.89)
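The odds ratio in the printout is just the exponentiated coefficient of X1 (a one-line check using the rounded coefficient):

```python
import math

b1 = 0.34                    # coefficient of physical inactivity from the printout
odds_ratio = math.exp(b1)
print(round(odds_ratio, 2))  # 1.4
```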
BUT what about the potential confounding effect of age? How can we control for that?
Note: the odds ratio calculated previously is a crude odds ratio; it (and its corresponding 95% confidence interval) is not adjusted for the potential confounder age.
To control for age, we include age as a second explanatory variable in our logistic regression.
(2) Investigating the relationship between physical inactivity and MI, adjusting (controlling) for age
Now add age (X2) to the regression in order to obtain the adjusted OR and its 95% confidence interval.
Y = MI                     1 = Yes, 0 = No
X1 = Physically inactive   1 = Yes, 0 = No
X2 = age
Results from R-cmdr:<br />
The fitted regression is
ln(p̂/(1 − p̂)) = −0.41 + 0.17X1 + 0.68X2
This leads to the age-adjusted odds ratio of 1.19, which has 95% confidence interval (0.87, 1.62). These values are read from the printout and compare with the crude ratio of 1.40 with confidence interval (1.04, 1.89).
Conclusion: after adjusting for age, the OR decreased from 1.40 to 1.19. Therefore, age was making the association between physical inactivity and MI appear more extreme than it actually was.
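Again the adjusted OR is the exponentiated coefficient of X1, now from the larger model (a quick check; 1.19 in the notes is this value rounded):

```python
import math

b1_adj = 0.17               # coefficient of inactivity after adding age
or_adj = math.exp(b1_adj)
print(round(or_adj, 2))     # 1.19
```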
SECTION 11
Study design principles, critical appraisal, sources of bias and confounding.
415<br />
Section 11
Study Design and Critical Appraisal
Research process:
1. Development of research question
2. Design of study
3. Collection of information
4. Description of data
5. Interpretation of results
Study design
• Study design refers to the methods used to select the study participants, control any experimental conditions, and collect the information.
• Interpretation of results depends on the study design.
• The study design should be tailored to the research question.
• Methods of statistical analysis and information produced will depend on the study design.
“The data from a good study can be analysed in many ways, but no amount of clever analysis can compensate for problems with the design of the study.” Altman.
Critical appraisal
Critical appraisal is the process of reviewing a study with the goal of identifying its strengths and weaknesses, the major results, and its broader implications.
Why teach study design and critical appraisal?
• It is not possible to sensibly interpret the results of statistical analysis without understanding the context in which, and methods with which, the data were collected.
• Health sciences practice and policy need to be based on sound evidence (as far as possible).
• Poorly conducted research should not influence policy or practice.
• Because even well-conducted research is not perfect, it is necessary to understand the nature of evidence so that you can begin to learn to interpret research findings for yourselves.
• For this you need to gain an understanding of the scientific method as used in the health sciences.
• This understanding is enhanced by learning to critique research.
Outline of next four lectures
1. Introduction to critical appraisal (lecture 1)
• process for critical appraisal
• structure of a research paper
2. Design and appraisal of surveys (lecture 1)
• review of surveys
• internal validity
  – bias
  – chance
• external validity
• example
3. Design and appraisal of analytic studies (lectures 2 – 4)
• review of analytic study designs
• internal validity
  – bias
  – confounding
  – chance
• external validity
• causation
• examples: randomised controlled trials, cohort studies, case-control studies
1. Introduction to critical appraisal
Guideline for critical appraisal
Study summary
What were the study objectives?
Why was the study necessary?
What type of study design was used?
How were the participants selected?
What information was collected?
What were the key results?
Internal validity
What do the findings of the study tell us about the population studied?
External validity / Generalisability
Can the findings of the study be applied to other populations?
Causation (for analytic studies only)
Implications
What are the implications of the study?
Structure of a scientific paper
Abstract or summary
• usually contains the key results of the study.
Introduction
• gives the background, necessity and objectives.
Methods
• summarises the study design, including the source of participants and the methods used to collect data.
Results
• description of the study participants, including response rates.
• summary of analyses.
Discussion
• provides the authors’ views of the internal and external validity of the study, and their conclusions about the implications of the study.
2. Design and appraisal of descriptive studies
Aim: to describe characteristics of a group or groups of people at a given point in time.
Generally, a sample is taken from the population and the distribution of variables within that sample is described.
Examples: a descriptive study can be used to
• describe characteristics of a group of people, e.g. prevalence of asthma, prevalence of smoking, average cholesterol level.
• find out people’s opinions and attitudes, e.g. attitudes to alternative health care; satisfaction with health care delivery.
• find out the extent of people’s knowledge, e.g. knowledge of risk factors for melanoma, risk factors for coronary heart disease.
• compare subgroups, e.g. comparison of attitudes of men and women to alternative health care; comparison of prevalence of smoking among different ethnic groups in NZ.
A descriptive study is concerned with, and designed only to describe, the existing distribution of variables, without regard to causal or other hypotheses.
Descriptive studies can generate hypotheses.
Descriptive studies are often called surveys or cross-sectional studies.
Descriptive studies generally use a sample from a population.
[Diagram: a sample is drawn from the underlying population (parameters e.g. μ, π); statistics (e.g. x̄, p) calculated from the sample are used for inference back to that population (internal validity); whether the findings carry over to other populations is a question of external validity.]
Recall
Suppose we want to estimate the mean cholesterol in the population:
sample mean = population mean + “error”
where the “error” comprises systematic error (bias) and random variation.
Random error (chance):
• due to natural biological variability.
• increasing the sample size will reduce the random fluctuations in the sample mean.
Systematic error (= bias):
• due to aspects of the design or conduct of the study which systematically distort the results.
• occurs if a sample is not representative of the population.
• cannot be reduced by increasing the sample size.
Internal validity for descriptive studies
• bias
• chance (random error)
Bias
Selection bias
• systematic error arising from the way people are selected for the study.
• includes biases from sample selection and from non-response to the study.
Information bias
• systematic error arising from the way information was collected from the study participants.
Chance
• Confidence intervals around estimates indicate the degree of precision with which the sample value estimates the population value.
Selection bias
• systematic error arising from the way people are selected for the study.
• includes biases from sample selection and from non-response to the study.
Questions to ask:
• Is the sample representative of the population?
• What was the response rate?
Example: a study was conducted to estimate the prevalence of smoking among males and females in NZ.
Design:
A random sample of households was selected using random digit dialling. If a call was not answered, the machine automatically went on to the next number. All interviews were conducted from 8 am – 5 pm (weekdays only).
63% of people agreed to participate in the study.
Information bias
• systematic error arising from the way information was collected from the study participants.
Question to ask:
Is the information gathered correct?
Example: suppose an investigator wished to estimate the prevalence of depression in NZ. To do this, he carried out face-to-face interviews around the country with a random sample of adults. Can you think of how information bias might enter into his study?
Example
Life in New Zealand Survey, Hillary Commission for Recreation and Sport, 1990, David Russell and Noela Wilson.
Objectives
• to provide a snapshot of New Zealanders from a health perspective.
• included questions on physical activity, leisure patterns, dietary habits and other risk factors for disease.
Necessity for the study
• the study provides a benchmark for comparison in future years.
• the information is useful for generating hypotheses and for designing interventions to improve health.
Type of study design
• survey of New Zealanders 15 years and over.
• carried out April 1989 – May 1990.
Selection of participants
• over 18 years:
  – selected from electoral rolls.
  – each month 10 people were selected at random from each of the 97 electoral rolls, plus 19 from each of the 4 Maori rolls.
• 15 – 18 years:
  – a snowball sample was used.
  – people already selected were asked to identify up to 5 people aged 15 – 18.
• total number selected: 12,463.
427<br />
Section 11
Results
Physical activity
Activity level (% of respondents)

             low   moderate   high
Male
  15 – 18     17      20       64
  19 – 24     23      27       51
  25 – 44     34      31       35
  45 – 64     50      34       16
  64+         58      39        3
  All         37      31       32
Female
  15 – 18     24      22       54
  19 – 24     30      30       40
  25 – 44     20      53       26
  45 – 64     25      64       11
  64+         34      63        3
  All         25      51       24

Can you summarise these results?
Internal validity
Bias
Selection bias:
• random sampling was used for those 18 and over.
• bias from the snowball sample (note multiple starting points based on random sampling).
• response rate.
Information bias:
• questionnaire.
• accuracy of recall.
• tendency to report what people think the researchers will want to see.
Chance
• the study is large, so the confidence intervals for overall proportions will be fairly narrow, but for smaller subgroups the proportions may not be so well estimated.
e.g.: women aged 64+, n = 814
proportion with low activity level = 34%, CI = (30.8 to 37.3)
proportion with high activity level = 3%, CI = (2.0 to 4.5)
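The intervals quoted above can be reproduced approximately. Below is a minimal sketch (function names are mine, not from the course): the simple normal-approximation (Wald) interval comes close for the 34% estimate, while for the 3% estimate, where the Wald interval is known to behave poorly, the Wilson score interval lands much nearer the printed (2.0 to 4.5) — the slide's intervals were presumably computed with a method of that kind.

```python
import math

def wald_ci(p, n, z=1.96):
    """Normal-approximation (Wald) 95% CI for a proportion."""
    se = math.sqrt(p * (1 - p) / n)
    return p - z * se, p + z * se

def wilson_ci(p, n, z=1.96):
    """Wilson score 95% CI; better behaved when p is near 0 or 1."""
    centre = (p + z**2 / (2 * n)) / (1 + z**2 / n)
    half = (z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
            / (1 + z**2 / n))
    return centre - half, centre + half

# Women aged 64+, n = 814 (from the slide above)
print(wald_ci(0.34, 814))    # ~ (0.307, 0.373), close to the printed (30.8 to 37.3)
print(wilson_ci(0.03, 814))  # ~ (0.020, 0.044), close to the printed (2.0 to 4.5)
```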
External validity
Are the results applicable to other populations?
• this calls for a judgement as to whether the other populations are likely to be similar to New Zealand in terms of their exercise patterns.
Implications
• high activity levels are the levels recommended to maintain cardio-respiratory fitness.
• programmes to increase activity levels may be useful in preventing cardiovascular disease.
• efforts to increase the activity levels of men over the age of 45 may be particularly useful.
Study design and critical appraisal sessions: 2
1. Introduction to critical appraisal (lecture 1)
• process for critical appraisal
• structure of a research paper
2. Design and appraisal of surveys (lecture 1)
• review of surveys
• internal validity
- bias
- chance
• external validity
• example
3. Design and appraisal of analytic studies (lectures 2 – 4)
• review of analytic study designs
• internal validity
- bias
- confounding
- chance
• external validity
• causation
• examples: randomised controlled trials, cohort studies, case-control studies
3. Design and appraisal of analytic studies
Review of analytic study designs
Purpose
To test hypotheses regarding
• causes of disease
• disease prevention strategies
• effectiveness of treatments
Examples:
• Is a statin drug more effective than a diet high in plant sterols, soy proteins and almonds in reducing serum cholesterol levels?
• Do people who are physically inactive have an increased risk of developing colon cancer?
When we are conducting an analytical study we are studying associations among two or more variables. We will have:
• an outcome variable (eg …)
• exposure variables (eg …)
• confounding variables – these are variables which distort the association of interest (eg age)
Types of design:
• experimental (intervention)
- e.g. randomised controlled trials.
• observational
- e.g. case-control studies, cohort studies.
Key features of common designs
Randomised controlled trials
• people are assigned to an intervention or control group using random allocation, then followed up over a period of time.
Cohort studies
• participants are selected before they develop disease.
• exposure status is measured, and they are followed up over a period of time.
Case-control studies
• two groups of people are chosen: a group with disease (cases) and a group without disease (controls).
• information is collected from both groups about exposures that occurred in the past.
Key ideas:
• control (or comparison) groups are essential.
• experimental studies provide much stronger tests of hypotheses than observational studies.
• experimental studies allow testing of causal relationships.
• with observational studies it is much harder to isolate the effects of the exposure of interest, so much harder to determine whether an association is causal.
Example
Does smoking cause coronary heart disease?
1. Estimate the association between smoking and coronary heart disease (eg relative risk).
2. Does this relative risk represent the true association between smoking and CHD in the population studied (internal validity)?
If yes:
3. Can this result be generalised to other populations (external validity)?
4. Is the association causal?
Internal validity
Does the observed association represent the true association?
Specifically: what are the possible explanations for the observed results?
• bias
• confounding
• chance
• true relationship
Assessing internal validity:
Bias
Selection bias – systematic error arising from the way participants are selected for inclusion in the study.
In an analytic study, selection bias occurs if the selection processes cause a systematic difference between the groups of people selected for the study.
It includes bias from non-response.
Information bias – systematic error arising from the way study information is obtained, interpreted and recorded.
In an analytic study, information bias is a particular problem if there are systematic differences in the information obtained from the different groups of people in the study.
Information bias may be introduced by the:
• observer
• study individual (respondent)
• instruments used to collect the data (e.g. a badly-designed questionnaire)
Example
Case-control study to examine the relationship between stress and coronary heart disease:
cases: people with coronary heart disease identified through opportunistic screening by GPs
controls: random sample from the population
Information on stress was collected through a structured interview.
Selection bias:
Information bias:
Evaluation and control of bias
• Statistical methods cannot control for bias in the selection of subjects or in the measurement of the variables of interest. Control of bias can only be done during the design and data collection phases of the study.
• General inaccuracy which is the same in both groups generally results in an underestimate of the true association.
• If inaccuracy is different in the two groups, the association can be over- or under-estimated.
• It is important to identify sources of bias and estimate the magnitude and direction of their effect on the association.
Confounding
A distortion of the association between exposure and disease caused by the presence of a third factor.
• A confounder is a variable which causes this distortion.
• To be a confounder a variable must be:
- associated with the exposure (independent of disease);
- associated with the disease (independent of exposure);
- not just an intermediate link in the causal chain.
Example of confounding:
A study was conducted to investigate the relationship between coffee consumption and oral cancer. It was found that coffee drinkers had an increased risk of oral cancer. Smoking is a potential confounder in this study.
Compared to non-smokers:
• smokers are more likely to drink coffee;
• smoking is an independent risk factor for oral cancer.
Hence, the observed association may be due to smoking habits rather than coffee drinking.
Can you think of any other potential confounders?
Example of non-confounding:
diet → cholesterol level → coronary heart disease
In this case, the raised cholesterol levels are likely to be due in part to diet, so are part of the causal pathway. Therefore in studies of diet and coronary heart disease raised cholesterol would not be considered a confounder.
Example of a confounder in a cohort study:
Results from a cohort study investigating the relationship between myocardial infarction and exercise.

Table A: all subjects (n=8000 person-years)
                Myocardial    Person-    Incidence
                infarctions   years      /1000
Low exercise        105        4000       26.25
High exercise        25        4000        6.25
Relative risk = 26.25/6.25 = 4.2

Subgroup analysis
Obese subjects (n=4000 person-years):
Low exercise         90        3000       30.0
High exercise        10        1000       10.0
Relative risk = 3.0
Non-obese subjects (n=4000 person-years):
Low exercise         15        1000       15.0
High exercise        15        3000        5.0
Relative risk = 3.0
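As a check, the crude and stratum-specific rate ratios in the table can be computed directly, and the two strata can be pooled with a Mantel-Haenszel rate-ratio estimator (a standard method, though not named on the slide; this sketch and its variable names are mine):

```python
# Rates per 1000 person-years and rate ratios, from the table above
def rate(events, person_years):
    return 1000 * events / person_years

# Crude (all subjects): 26.25 / 6.25 = 4.2
rr_crude = rate(105, 4000) / rate(25, 4000)

# Stratified by obesity: both strata give 3.0
rr_obese = rate(90, 3000) / rate(10, 1000)
rr_non_obese = rate(15, 1000) / rate(15, 3000)

# Mantel-Haenszel pooled rate ratio across the two strata
# Each stratum: (a, PT1, b, PT0) = (exposed cases, exposed person-time,
#                                   unexposed cases, unexposed person-time)
strata = [(90, 3000, 10, 1000), (15, 1000, 15, 3000)]
num = sum(a * pt0 / (pt1 + pt0) for a, pt1, b, pt0 in strata)
den = sum(b * pt1 / (pt1 + pt0) for a, pt1, b, pt0 in strata)
rr_mh = num / den
print(rr_crude, rr_obese, rr_non_obese, rr_mh)  # 4.2 3.0 3.0 3.0
```

The obesity-adjusted estimate (3.0) differs markedly from the crude 4.2, which is exactly the signature of confounding by obesity.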
Positive and Negative Confounding
Positive confounder – a confounding variable which makes an association look more extreme, or creates a spurious association.
Example: A study was conducted to investigate the relationship between physical inactivity and MI. It was found that people who were physically inactive had an increased risk of MI. Age was considered to be a potential confounder.
Physical inactivity → Myocardial infarction (with age associated with both)
Crude odds ratio = 2.5
But compared to younger people, older people:
• are more likely to be physically inactive.
• have a higher risk of MI.
Hence, age can explain some of the association between physical inactivity and MI.
After "adjusting" for the confounding effect of age the OR decreases to 1.4. So confounding by age is making the association between physical inactivity and MI seem more extreme than it should be, i.e. it is a positive confounder.
Negative confounder – a confounding variable which makes an association look less extreme, or even in the opposite direction. It can mask a real difference.
Example: A study was conducted to investigate the relationship between physical inactivity and MI. It was found that people who were physically inactive had an increased risk of MI. Sex was considered to be a potential confounder.
Physical inactivity → Myocardial infarction (with sex associated with both)
Crude OR = 2.5
But compared to females, males:
• are less likely to be physically inactive.
• have a higher risk of MI.
Hence, sex masks some of the association between physical inactivity and MI.
After "adjusting" for the confounding effect of sex, the OR becomes 3.9.
So confounding by sex makes the association between physical inactivity and MI seem less extreme than it should be, i.e. it is a negative confounder.
Some comments on confounding:
AGE and SEX are the most common confounding variables. This is because these two variables are not only associated with most exposures we are interested in, such as diet, smoking habits etc., but they are also independent risk factors for most diseases.
Control of confounding
Confounders can be controlled for during the study design, during the analysis, or both. The aim is to make the groups being compared as similar as possible with respect to the confounders.
(1) Identify potential confounders. A review of previous literature in the area should give you an idea of potential confounders.
Also ask: What are the known risk factors for the outcome of interest? What factors are associated with the exposure?
Data should be collected on all potential confounders, since if you do not obtain the information you cannot control for it.
(2) Control of confounding during the study design.
Restriction:
• limits participation in a study to specific groups that are similar to each other with respect to the confounder.
e.g. include only non-smokers in a study of exercise and risk of CHD.
• Disadvantages:
- residual confounding if restriction criteria are too wide.
- lack of generalisability.
- smaller number of available participants.
Matching:
• particular subjects are selected in such a way that the potential confounders are distributed in an identical manner among each of the study groups.
Case-control study: matching cases and controls.
Cohort study: matching exposed and unexposed.
• matching needs to be accounted for in the analysis.
Randomisation (see randomised controlled trials).
(3) Control of confounding during the analysis.
Multivariate analysis – multiple regression.
Evaluating confounding
• check for associations between the suspected confounder and both exposure and disease.
• see whether controlling for the confounder affects the association.
Chance
• study design: ensure the study has sufficient power.
• confidence intervals and p-values for the association indicate the role of chance in the study.
• when multiple statistical tests are carried out in a study, there is an increased chance of "false positive" results.
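The last point can be quantified. If each test uses significance level α = 0.05 and, as an idealising assumption, the tests are independent with all null hypotheses true, the chance of at least one false positive grows quickly with the number of tests:

```python
# Chance of at least one "false positive" across k independent tests,
# each at significance level alpha, when every null hypothesis is true.
alpha = 0.05
for k in (1, 5, 10, 20):
    p_any = 1 - (1 - alpha) ** k
    print(f"{k:2d} tests: P(at least one false positive) = {p_any:.2f}")
```

With 10 tests the probability is already about 0.40, and with 20 tests about 0.64.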
Study design and critical appraisal sessions: 3
Randomised controlled trials (RCTs)
Aim: to evaluate the effects of an intervention.
• considered the "gold standard" for evaluation of interventions.
Why?
• allows isolation of the effects of the intervention through controlling the experimental conditions:
- experiment ("trial")
- comparison/control group ("controlled")
- randomisation ("randomised")
Randomisation
• the process for deciding who will get the experimental intervention and who will be the control.
Basic structure of an RCT
• population to be studied
• choice of comparison group
• allocation of subjects to intervention or control group
• choice of outcome measure
Population to be studied:
Usually not a representative sample from the population.
• eg in trials of treatments they will be patients coming to see the doctors who have agreed to take part in the study.
Chosen to maximise internal validity, with some cost in terms of generalisability.
• eg we may choose participants who are likely to be able to complete the requirements of the trial.
Choice of comparison group:
• the control group should provide information on what would have happened without the experimental intervention.
• in trials of disease treatment or prevention the control group should in general receive the best available "standard" treatment.
• sometimes there is no standard treatment or practice, in which case a "placebo" control group may be used.
• "placebos" are substances with no biological effect on the disease process.
• placebos are used to isolate the particular effect of interest from effects that may occur because of people's belief that they are getting a particular intervention.
• use of a placebo allows "blinding" of intervention and control groups, so that the results are not biased through knowledge of who got the new intervention.
Allocation of subjects to treatment groups:
Example: Is the new treatment more effective than the standard treatment?
How would we test this?
(1) We could compare the results of the new treatment on patients with records of previous results from other patients using the old treatment (historical controls).
Do you think this is a good idea?
(2) Ask people to volunteer for the new treatment and give the standard treatment to those who do not volunteer.
Do you think this is a good idea?
(3) Allocate patients to the new treatment or the old treatment using an "objective" method and observe the outcome.
The way in which patients are allocated to treatments can influence the results enormously.
We need a method of allocation to treatments in which the characteristics of subjects will not affect their chance of being put into any particular group – RANDOM ALLOCATION.
Volunteers are assigned to intervention groups using randomisation, then followed up over a period of time.
Randomisation:
• is the best way to control for both known and unknown confounders.
• but does not guarantee control of confounding.
• is ethical when there is genuine uncertainty about whether the new intervention or the comparison strategy is better ("equipoise").
Choice of outcome measure:
• needs to be sensitive to the effects of the intervention.
• early in the process of evaluation, short-term outcomes are used to screen for promising interventions.
• ultimately, we need to demonstrate that the intervention has tangible benefits for the individual and society.
Example: Zidovudine in treatment of people with asymptomatic HIV infection.
Studies found:
• a statistically significant improvement in immune function (measured by CD4 count),
but
• no difference in survival at 3 years.
Randomised controlled trials: example
Nichol et al. "The effectiveness of vaccination against influenza in healthy working adults." New England J. Med (1995).
Objectives
• to clarify the benefits of immunisation for influenza in a population not at high risk for complications.
Background
• most deaths from influenza occur among elderly people, but all age groups are affected.
• influenza accounts for millions of days lost from work each year.
• current recommendations of the US Advisory Committee on Immunisation Practices target persons at increased risk for complications of influenza, although all people who wish to avoid illness are encouraged to consider vaccination.
Type of study
Randomised controlled trial
Selection of participants
• recruited in Minneapolis-St Paul through newspaper advertisements, advertisements at work sites and recruitment sessions at shopping malls.
• aged 18 – 64 years.
• employed full time.
• no medical conditions which would place them at high risk for complications of influenza.
• not allergic to eggs.
• not pregnant or planning pregnancy within 3 months.
• had not had a previous vaccination for influenza.
Information collected
"Exposure" (= treatment):
• influenza group: active vaccine
• placebo group: vaccine diluent
Outcome measures (structured telephone interviews):
• week 1: side effects
• monthly for 4 months:
- occurrence of upper respiratory illness
- use of sick leave
- visits to the doctor
Key results
849 randomised:
placebo n=425, vaccine n=424
Complete follow-up:
placebo n=416 (98%), vaccine n=409 (96%)
Internal validity
Chance
• 95% confidence intervals around the differences exclude zero.
• p-values are small, indicating that differences this large (or larger) are very unlikely to occur by chance if the vaccine is not effective.
• several outcome measures were used, increasing the chance of false positive results, but since the p-values are very small this is not likely to affect the conclusions.
Confounding
Randomisation + intention-to-treat analysis
Intention-to-treat analysis
"once randomised, always analysed"
• the outcome is compared in the group randomised to placebo and the group randomised to vaccine.
• this preserves the control of confounding achieved by randomisation.
Bias
Selection bias is not a problem in randomised controlled trials (see generalisability, though).
Information bias in randomised trials arises from:
• incomplete follow-up of participants
• error in measurement of outcome
Information bias in the vaccine trial:
Completeness of follow-up:
• placebo: 98% (416/425)
• vaccine: 96% (409/424)
Measurement of illness:
• definition of influenza
• recall of symptoms
Blinding
• means participants' experience or recall of symptoms is not affected by knowledge of whether they had the vaccine (single blind).
• people collecting the information from the participants cannot introduce bias through their knowledge of whether or not participants had the vaccine (double blind).
Generalisability
• broad group of working adults
• risk of influenza
• strain of influenza
Implications
• the trial demonstrates that vaccination against influenza can be effective in reducing symptoms, sick leave and visits to the doctor.
Study design and critical appraisal sessions: 4
1. Introduction to critical appraisal (lecture 1)
• process for critical appraisal
• structure of a research paper
2. Design and appraisal of surveys (lecture 1)
• review of surveys
• internal validity
- bias
- chance
• external validity
• example
3. Design and appraisal of analytic studies (lectures 2 – 4)
• review of analytic study designs
• internal validity
- bias
- confounding
- chance
• external validity
• causation
• examples: randomised controlled trials, cohort studies, case-control studies
Cohort study
Ref: "Cohort studies: marching towards outcomes", Lancet 2002; 359: 341-45.
Prospective cohort study (concurrent): the cohort is defined and characterised at the start of the study and followed up into the future.
• Assemble the cohort.
• Measure predictor variables and potential confounders.
• Follow up the cohort and measure outcomes.
Retrospective (historical) cohort: the cohort is defined and characterised in the past, based on data already recorded, and followed up toward the present to some cut-off time.
• Identify a suitable cohort.
• Collect data about predictor variables from past records.
• Collect data about subsequent outcomes that occurred at a later time.
Cohort studies: example
Hart C, Davey Smith G. "Coffee consumption and coronary heart disease mortality in Scottish men: a 21 year follow-up study." J Epidemiol Commun Health (1997); 51: 461-2.
Objective
• to examine the effects of coffee on coronary heart disease mortality.
Background / Necessity
• recent studies of this hypothesis have produced conflicting results.
• data on confounding factors have often been limited in those studies.
Type of study design
• cohort (prospective)
Selection of participants
• 5,766 men aged 35 – 64 from workplaces in an area in the west of Scotland.
• enrolled between 1970 and 1973.
Information collected
• at enrolment:
- how many cups of coffee they usually drank per day;
- information on confounders such as smoking and social class.
• followed up for 20 years.
• information about deaths from coronary heart disease was obtained from the national registry.
Key results

No. of cups of     CHD
coffee per day    deaths    RR      95% CI
0                  308      1.0
1                   94      0.89    (0.70, 1.12)
2                  104      0.98    (0.78, 1.23)
3-4                 82      0.90    (0.70, 1.16)
5+                  37      0.96    (0.67, 1.37)
p-value from trend test = 0.71

Chance
• all confidence intervals include the null value, 1.
• the upper limits of the confidence intervals for < 5 cups per day are fairly close to 1.
• for 5+ cups per day we cannot exclude a true RR as big as 1.37 (a 37% increase in risk).
• the test for trend gave a p-value >> 0.05.
Bias
Selection bias
• because there is only one selection process, selection bias is minimised.
• the study sample may not be representative of the population in west Scotland, but in analytic studies that issue is addressed under generalisability.
Information bias
• information bias could come from:
- inaccuracy in exposure information;
- loss to follow-up;
- inaccuracy in determining death from CHD.
• the crude measure of coffee consumption used may bias the RR towards the null.
• follow-up will be nearly complete using the national registry.
• there may be some misclassification of cause of death.
Confounding
• the RRs presented were adjusted for a number of confounding factors including: age, diastolic blood pressure, cholesterol, smoking, social class and body mass index.
Generalisability
• type of coffee drunk (instant vs ground).
Implications
• found no clear evidence of an association between instant coffee use and risk of CHD.
• cannot rule out an increase in those drinking 5+ cups per day (small numbers).
• other types of coffee may have detrimental effects on CHD risk.
Case-control studies
Ref: "Case-control studies: research in reverse", Lancet 2002; 359: 431-34.
• Subjects are ascertained based on whether they have experienced the outcome of interest (cases) or not (controls).
• Information is collected from cases and controls about their past exposures.
Case-control studies: example
Shinton R and Sagar G. “Lifelong exercise and stroke.” BMJ (1993); 307: 231–4.
Objective
• to examine the potential of lifelong patterns of increased physical activity to prevent stroke.
Background / Necessity
• there is growing evidence that exercise can protect against stroke.
• the importance of exercise in early adult life in protection from stroke has received little attention.
• previous studies had not adequately controlled for confounding.
Type of study design
Case-control study
Selection of participants
Study population: people registered with a GP in west Birmingham, England.
Cases:
• men and women aged 35–74 who had just had their first stroke.
• obtained by phoning GPs weekly, and by checking admissions at the local hospital.
Controls:
• randomly selected from the general practice population.
• no history of stroke.
Information collected
• structured questionnaire.
• one interviewer for all cases and controls.
• when disability prevented an adequate response, the closest friend or relative was interviewed.
• people were classified by their responses into those who did or did not engage in vigorous exercise during:
youth (15–25)
early middle age (25–40)
late middle age (40–55)
• information on confounders (e.g. age, sex, smoking)
Key results
Response rates:
Cases:
• 125 patients were eligible for inclusion.
• no patient or relative declined to participate.
(100% response rate)
Controls:
• 220 controls were selected and contacted.
• 13 excluded.
• 198 of the remainder (207) agreed to participate.
(95.7% response rate)

Table I. Odds ratios* (95% confidence interval) of stroke according to when exercise undertaken.

Age undertaken    Exercise: no    Exercise: yes
15–25             1.0             0.33 (0.2 to 0.6)
25–40             1.0             0.43 (0.2 to 0.8)
40–55             1.0             0.63 (0.3 to 1.5)

* Odds ratios are adjusted for age and sex
Now, let’s consider possible explanations for an association: Internal validity
Chance
• confidence intervals show the range of plausible values of the true odds ratio which are consistent with the study results.
• if the confidence interval for an odds ratio excludes 1, then the study provides evidence of an association in the population studied.
• if the confidence interval for the odds ratio includes 1, then the study results are consistent with the possibility that there is no true association.
• to conclude definitely that there is no association, the confidence interval must include 1 and be narrow, so that important differences in the risk of disease can be excluded.
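The calculation behind such an interval can be sketched in a few lines of Python. The 2×2 counts below are invented for illustration (they are not the study's data); the interval uses the usual normal approximation on the log odds ratio scale.

```python
import math

# Hypothetical 2x2 table (invented counts, NOT the study's data):
#                exposed   unexposed
# cases             20         80
# controls          50         50
a, b, c, d = 20, 80, 50, 50

odds_ratio = (a * d) / (b * c)      # (a/b) divided by (c/d)

# 95% CI from the standard error of the log odds ratio
se = math.sqrt(1/a + 1/b + 1/c + 1/d)
lower = math.exp(math.log(odds_ratio) - 1.96 * se)
upper = math.exp(math.log(odds_ratio) + 1.96 * se)

# Here the interval works out to about (0.13, 0.47); it excludes 1,
# so this hypothetical table would be evidence of an association.
```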
In this study:
• the odds ratios increase with increasing age at which the exercise was undertaken.
• the confidence intervals for ages 15–25 and 25–40 exclude 1, so there is some evidence of an association between exercise at those ages and a reduction in the risk of stroke.
• the odds ratio for exercise undertaken at age 40–55 is less than 1, but the confidence interval contains 1, indicating that this apparent beneficial effect could just be due to random variation or chance.
Bias
Case-control studies are particularly susceptible to bias because at the time the study is done both exposure and disease have already occurred.
Selection bias
cases: all non-fatal cases which arose from the GP population were included.
controls: randomly selected from the population the cases arose from.
Therefore, the controls are representative of the population the cases arose from, and selection bias is minimised.
Response rates were high.
(100% for cases and 95.7% for controls)
Information bias
Things done to minimise bias:
• cases and controls were all interviewed by the same interviewer.
• a structured questionnaire was used.
Possible sources of information bias:
recall bias:
• cases and controls may both have trouble accurately recalling exercise patterns from when they were young.
• similar patterns of poor recall in cases and controls will bias an odds ratio towards 1, so this could not explain the observed association.
• cases have had a stroke, so they may be less likely to remember than the controls.
• if cases were less likely than controls to report exercise, an apparent protective association between exercise and stroke would be created.
bias from surrogate interviewee:
• information on exercise for cases unable to respond was obtained from a friend or relative.
Interviewer bias:
• the interviewer will have known whether or not people were cases or controls.
• if he/she prodded the controls harder for information on exercise, an apparent protective effect would be created.
Confounding
• risk factors for stroke include age, sex, and smoking.
• since all 3 of these are likely to be associated with exercise, they may be confounding the relationship between exercise and stroke.
• analyses were adjusted to remove the effects of confounding variables including age, sex and smoking.
Generalisability
Could we apply the results of this study to the New Zealand population?
• need to think about whether or not New Zealanders would be likely to experience the same apparent benefit from exercise.
• depends on the nature of the exercise and the biological mechanism by which exercise reduces the risk of stroke.
Causation
• it is difficult to show causation conclusively with a single observational study, primarily because of the susceptibility to bias and confounding.
• an association is more likely to be causal if:
• the observed association is very strong;
• a dose-response effect can be demonstrated;
• the results from several different studies are consistent;
• there is a known biological mechanism.
Appendix One: The Basics
This appendix contains some background material to help you prepare for the course.
1. Basic Mathematical Rules
1. BEDMAS – how to work things out in the right order
2. Rounding
3. Dealing with Negatives
4. Fractions
5. Solving Equations
6. Powers and Logarithms
7. Sigma means Add Up
2. Basic Statistical Concepts
1. Mean
2. Median
3. Range
4. Variance and Standard Deviation
5. Quartiles and Interquartile Range
6. Scatterplot
3. Sample Exercises
MATHERCIZE
Practice examples for many of the topics covered in this booklet are available on the computer package MATHERCIZE. This program is available at: http://mathercize.otago.ac.nz, and the login password is line.
Appendix 1 – Basic rules and concepts
Section 1: Basic Mathematical Rules
1. BEDMAS – how to work things out in the right order
Brackets
Exponents (also known as Powers)
Division and Multiplication
Addition and Subtraction
When Division and Multiplication occur together, work from the left. Similarly, when Addition and Subtraction occur together, work from the left. Otherwise follow the order suggested by the word BEDMAS.
Note that a scientific calculator will maintain this order, provided care is taken, but other calculators do not.
Example 1
Evaluate (3 + 2) × 6 + 9² ÷ (2 + 7 − 6)
• First evaluate both brackets: (3 + 2) = 5 and (2 + 7 − 6) = 3
• Then the exponent: 9² = 81
• Then the division and multiplication: 5 × 6 = 30 and 81 ÷ 3 = 27
• Finally the addition: 30 + 27 = 57
Setting this out on paper:
(3 + 2) × 6 + 9² ÷ (2 + 7 − 6) = 5 × 6 + 9² ÷ 3
                               = 5 × 6 + 81 ÷ 3
                               = 30 + 27
                               = 57
Example 2
Evaluate 5 + (9 − 5 ÷ 5 × 2²) − 9
• First evaluate the bracket. The exponent inside it is evaluated first, then the division and multiplication from the left:
(9 − 5 ÷ 5 × 2²) = (9 − 5 ÷ 5 × 4)
                 = (9 − 1 × 4)
                 = 9 − 4 = 5
• Finally the addition and subtraction: 5 + 5 − 9 = 1.
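Both examples can be checked in Python, which follows the same order of operations (note that / does true division, so the results come back as decimals):

```python
# Example 1: brackets first, then the exponent, then x and / from
# the left, then the addition
result1 = (3 + 2) * 6 + 9**2 / (2 + 7 - 6)   # 5*6 + 81/3

# Example 2: inside the bracket the exponent is done first, then
# the division and multiplication from the left
result2 = 5 + (9 - 5 / 5 * 2**2) - 9         # 5 + (9 - 4) - 9
```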
Example 3
If Z = (X − μ)/σ, calculate Z if X = 15, μ = 8, and σ = 2.75
• First carry out a “clean” substitution. This means that each variable whose value is known is replaced by that value without any calculation being done:
Z = (15 − 8)/2.75
• The division implies brackets around 15 − 8, although when the expression is written as a fraction the brackets are seldom shown. Nevertheless the expression 15 − 8 is evaluated first.
• Finally the division: 7 ÷ 2.75 = 2.55 (to two decimal places).
• Note that using brackets on a standard calculator should let you evaluate the expression directly. Try ( 15 − 8 ) ÷ 2.75 = (Missing out the brackets will almost certainly lead to an incorrect answer.)
Example 4
If t = 2.086, s = 3.44, and n = 21, evaluate the expression t × s/√n
• Clean substitution: 2.086 × 3.44/√21
• Note the implied multiplication: t × s/√n means t multiplied by the fraction s/√n
• A square root is an exponent, so evaluate √21 = 4.583 (to three d.p.)
• There is no addition or subtraction involved, so work from the left:
2.086 × 3.44 ÷ 4.583 = 1.57 (to two d.p.) (Rounding is discussed below.)
• Again this may be calculated directly on a calculator. Press the buttons:
2.086 × 3.44 ÷ √21 =
Example 5
Evaluate the expression (x̄ − μ)/(s/√n) if x̄ = 215.8, μ = 246, s = 64.5, and n = 10.
• For this example, only the calculator working is shown. Press the buttons:
( 215.8 − 246 ) ÷ ( 64.5 ÷ √10 ) =
The answer is −1.48 (to two d.p.)
• Try to obtain the same answer using the rules of BEDMAS.
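The three substitution examples above translate directly into Python; math.sqrt plays the role of the √ key:

```python
import math

# Example 3: Z = (X - mu) / sigma
z3 = (15 - 8) / 2.75

# Example 4: t * s / sqrt(n), worked from the left
z4 = 2.086 * 3.44 / math.sqrt(21)

# Example 5: (x-bar - mu) / (s / sqrt(n))
z5 = (215.8 - 246) / (64.5 / math.sqrt(10))
```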
Example 6
Evaluate 1.96 × √(4.5²/18 + 3.6²/22)
• The square root sign implies brackets around the expression 4.5²/18 + 3.6²/22, i.e. we have to evaluate 1.96 × √(4.5²/18 + 3.6²/22)
• All the exponents inside the brackets are calculated first, followed by the divisions:
4.5²/18 = 20.25/18 = 1.125 and 3.6²/22 = 12.96/22 = 0.589
• Next the addition, followed by the remaining exponent (the square root):
1.125 + 0.589 = 1.714 and √1.714 = 1.309
• Finally the multiplication: 1.96 × 1.309 = 2.57 (to two d.p.)
• Again note that this could be calculated directly on a calculator (although a single small mistake will make everything wrong). Try
1.96 × √( 4.5 x² ÷ 18 + 3.6 x² ÷ 22 ) =
The result should be 2.566, which also rounds to 2.57. Note that x² refers to the squaring button on a Casio calculator. Other brands may have different notations for squaring, although they should be similar.
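The same expression in Python, evaluated without rounding any intermediate step:

```python
import math

# 1.96 x sqrt(4.5^2/18 + 3.6^2/22): exponents, then divisions,
# then the addition, then the square root, then the multiplication
result = 1.96 * math.sqrt(4.5**2 / 18 + 3.6**2 / 22)
```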
2. Rounding
When you have decided how many digits you want to round to, look at the next digit. If this value is 0, 1, 2, 3, or 4, the previous digit is rounded down. Otherwise (if the value is 5, 6, 7, 8, or 9), the previous digit is rounded up.
Example:
By calculator, 8/√30 = 1.460593487
• To three d.p. (decimal places), 8/√30 = 1.461 because the next digit (5) causes the third decimal value (0) to be rounded up.
• To four d.p., 8/√30 = 1.4606
• To five d.p., 8/√30 = 1.46059
• To six d.p., 8/√30 = 1.460593
There are no hard and fast rules concerning how many digits you should round a value to, although a few general principles should be noted:
• When you are calculating an expression, do not round too soon. For example, consider the expression 150/√10. To eight decimal places, √10 = 3.16227766.
• If you use a calculator to evaluate 150/√10 and round your final answer to three decimal places, the result is 47.434.
• However, if you first round √10 to 3.16 and then calculate 150/3.16, the result is 47.468 (to three d.p.). This may not appear to be much different to the value 47.434, but it could make a substantial difference if you have to use the value in further calculations.
• Do not round your working to fewer figures than your final answer. In the previous example, the value 3.16 has three significant figures, while the (slightly incorrect) answer 47.468 has five figures. Having rounded to three figures in the working, three figures (or fewer) should be used for the final answer. You should not give an answer “more” accurate than the data or working.
• As a rule of thumb, round probabilities to four decimal places.
• Historically, Z-scores have been rounded to two decimal places. The reason for this is that normal distribution tables use two decimal place Z-scores.
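The round-too-soon example can be reproduced in Python. One caveat: Python's built-in round() uses round-half-to-even at exact halves, which can occasionally differ from the “5 rounds up” rule described above.

```python
import math

kept_precise = 150 / math.sqrt(10)   # keep full precision while working
rounded_too_soon = 150 / 3.16        # sqrt(10) rounded to 3.16 first

final_late = round(kept_precise, 3)      # 47.434
final_soon = round(rounded_too_soon, 3)  # 47.468
```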
3. Dealing with Negatives
Adding a negative number is the same as subtracting the corresponding positive number:
• Example: 5 + (−4) = 5 − 4 = 1
Subtracting a negative number is like adding a positive number:
• Example: 5 − (−4) = 5 + 4 = 9
Multiplying two negative numbers gives a positive number:
• Example: (−5) × (−4) = 20
Multiplying a negative number by a positive number gives a negative number:
• Example: (−5) × 4 = −20
4. Fractions
Many people have difficulty with fractions. Sometimes the difficulty is in the interpretation rather than with the actual calculations.
Example
Imagine that you have attended a course and you are trying to work out your final mark. You have been told that you scored:
• 8.5 out of 10 for the assignments
• 20 out of 40 for the test
• 32 out of 50 for the exam
If you add these three values up as if they were fractions, you would get
8.5/10 + 20/40 + 32/50 = 1.99 (Check this using a calculator.)
This is clearly a silly answer because the values were not actually fractions as such, but marks from different sections of the assessment scheme for the course.
If you just add up the marks you get 60.5. This is a more reasonable answer, because it gives a total out of 100.
But suppose that in the course mentioned in this example, the assessment scheme states that if the internal mark is higher than the exam mark, your final mark is the average. Otherwise the final mark is the exam mark. For this example, the internal total is 28.5 out of 50, or 57%, while the exam mark translates to 64%. As the exam mark is higher than the internal mark, the final mark in this case would be 64.
Using Calculators for Fractions
When probabilities are involved, dealing with fractions is important. This section aims to show how to use a calculator to handle problems involving fractions.
As long as you estimate whether the final answer is sensible, practically all fraction work can be carried out using a calculator. The key button to use is [a b/c] on a Casio. Other calculators should have equivalent buttons.
Simplifying Fractions
Example 1: 12/20
On your calculator type 12 [a b/c] 20 =
The answer is given as 3⌟5, i.e. 12/20 = 3/5
Example 2: 21/105
Type 21 [a b/c] 105 =
The answer is 1⌟5, i.e. 21/105 = 1/5
Converting Fractions to Decimals
The [a b/c] button will often do this, although not always!
Example 1: Convert 11/15 into decimal form.
On the calculator type 11 [a b/c] 15 =
The screen shows 11⌟15. Now press the [a b/c] button and the fraction is converted to the decimal 0.733333... Press [a b/c] again, and the fraction version reappears.
Example 2: Convert 0.6875 to a fraction.
Type .6875 = Now press the [a b/c] button. The screen shows 11⌟16, i.e. 0.6875 = 11/16
Example 3: Convert 0.1234567 to a fraction.
Type .1234567 = Now press the [a b/c] button. Nothing happens. The calculator leaves the decimal alone. If you want to convert this one to a fraction you will have to carry out the working yourself:
0.1234567 = 1234567/10000000
Adding and Subtracting Fractions
Example: 3/5 + 2/3
On your calculator type 3 [a b/c] 5 + 2 [a b/c] 3 =
The screen shows 1⌟4⌟15, i.e. 3/5 + 2/3 = 1 4/15
(Incidentally, if you now press the [a b/c] button, the decimal equivalent to this fraction appears on screen: 1.266666...)
Remember that if these two fractions represent probabilities that you are adding together, and the final answer was also meant to represent a probability, then there has to be an error somewhere because a probability cannot be larger than 1.
Multiplying and Dividing Fractions
Example 1: 5/8 × 5/3
Type 5 [a b/c] 8 × 5 [a b/c] 3 =
The result is 1⌟1⌟24, i.e. 5/8 × 5/3 = 25/24
Example 2: 5/7 ÷ 10/11
Type 5 [a b/c] 7 ÷ 10 [a b/c] 11 =
The result is 11⌟14, i.e. 5/7 ÷ 10/11 = 11/14
More Complicated Calculations
As soon as you have a problem involving both addition and multiplication, brackets become very useful.
Example: 3/4 × (1/8 + 3/7)
Note that the fraction in front of the brackets implies multiplication.
Type 3 [a b/c] 4 × ( 1 [a b/c] 8 + 3 [a b/c] 7 ) =
The answer is 93/224 or 0.4152 (to four d.p.)
Note that as an alternative approach you could use BEDMAS and work out the brackets first:
1 [a b/c] 8 + 3 [a b/c] 7 = gives 31/56
Now type × 3 [a b/c] 4 = to reach 93/224 as before.
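If you do not have a calculator with an [a b/c] button, Python's fractions module does the same exact arithmetic. This is offered as an alternative tool, not part of the calculator instructions above:

```python
from fractions import Fraction

simplified = Fraction(12, 20)                    # reduces to 3/5
total = Fraction(3, 5) + Fraction(2, 3)          # 19/15, i.e. 1 4/15
product = Fraction(5, 8) * Fraction(5, 3)        # 25/24
quotient = Fraction(5, 7) / Fraction(10, 11)     # 11/14
mixed = Fraction(3, 4) * (Fraction(1, 8) + Fraction(3, 7))  # 93/224
```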
5. Solving Equations
Solving equations involves more than evaluating expressions, which was covered earlier. To solve an equation you should make a clean substitution, then rearrange the expression so that the required variable is on its own.
Loosely speaking, solving equations involves “undoing BEDMAS”. For example, anything inside brackets is dealt with last.
In STAT 115 one particular type of equation will need to be solved:
Example 1: If Z = (X − μ)/σ, calculate X if Z = 1.96, μ = 8.5, and σ = 1.8
• First make a “clean substitution”, i.e. substitute each of the known variables into the equation without trying to simplify at all:
1.96 = (X − 8.5)/1.8
• The division sign implies brackets around X − 8.5. We are “undoing” the equation, so this part will be left to last.
• This means we “undo” the value 1.8 first. Because the right hand side of the equation reads “(X − 8.5) divided by 1.8”, we multiply by 1.8, since multiplication is the inverse operation to division:
1.96 × 1.8 = (X − 8.5)
• Because the brackets arose from the original division sign, and we have dealt with the division, the brackets are no longer needed:
3.528 = X − 8.5
• To undo subtraction we perform the opposite operation, addition:
3.528 + 8.5 = X
• We have now rearranged the equation so that X is on its own:
X = 12.0 (one decimal place)
Example 2: If Z = (X − μ)/(σ/√n), calculate X if Z = 2.58, μ = −2.5, σ = 0.85, and n = 60.
• Clean substitution:
2.58 = (X − (−2.5))/(0.85/√60)
• Simplify a little:
2.58 = (X + 2.5)/0.1097
• Solve the equation:
2.58 × 0.1097 = X + 2.5
0.2830 = X + 2.5
0.2830 − 2.5 = X
X = −2.22 (to two d.p.)
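Once an equation has been rearranged, the rest is just substitution. Both examples above, using the rearranged form X = Z × (scale) + μ:

```python
import math

# Example 1: Z = (X - mu)/sigma  rearranges to  X = Z*sigma + mu
x1 = 1.96 * 1.8 + 8.5

# Example 2: Z = (X - mu)/(sigma/sqrt(n))  rearranges to
#            X = Z*(sigma/sqrt(n)) + mu
x2 = 2.58 * (0.85 / math.sqrt(60)) + (-2.5)
```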
6. Powers and Logarithms
The following power rules may be needed occasionally, and examples will be given where necessary.
x^a × x^b = x^(a+b)
(x^a)^b = x^(ab)
x^a ÷ x^b = x^(a−b)
x^(−a) = 1/x^a
x^(1/2) = √x
The following log rules may also be needed. Note that in this paper, log means log_e (or natural log, i.e. ln).
log_e x = ln x
ln x = y ⟷ e^y = x (where e = 2.71828 (five d.p.))
ln(x^y) = y ln(x)
ln(x) + ln(y) = ln(xy)
ln(x) − ln(y) = ln(x/y)
Example:
If log(π̂/(1 − π̂)) = 3.1305 − 1.1499 − 0.027729 × 45, find the value of the expression π̂/(1 − π̂).
• First use BEDMAS to evaluate the RHS (Right Hand Side) of the expression:
3.1305 − 1.1499 − 0.027729 × 45 = 3.1305 − 1.1499 − 1.247805
                                = 0.732795
• We now have log(π̂/(1 − π̂)) = 0.732795. Remembering that log here means ln, we are able to rewrite this in exponential form using the formula
ln x = y ⟷ e^y = x
Therefore
π̂/(1 − π̂) = e^0.732795 = 2.08 (two d.p.)
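The same working in Python; math.exp undoes the natural log:

```python
import math

# RHS first, following BEDMAS (multiplication before subtraction)
rhs = 3.1305 - 1.1499 - 0.027729 * 45   # 0.732795

# ln(odds) = rhs, so odds = e**rhs
odds = math.exp(rhs)
```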
7. Sigma means Add Up
The Greek letter Σ (capital sigma) means “add up what follows”.
Example 1: Evaluate Σ_{i=1}^{3} 3^i
Each of the values 1, 2, and 3 is substituted into the expression one by one in place of the variable i. Then the three values are added:
Σ_{i=1}^{3} 3^i = 3¹ + 3² + 3³ = 3 + 9 + 27 = 39
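Sigma notation maps directly onto Python's sum() over a range (range(1, 4) produces i = 1, 2, 3):

```python
# Sum of 3**i for i = 1 to 3: 3 + 9 + 27
total = sum(3**i for i in range(1, 4))
```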
Example 2: Expand the expression Σ_{i=1}^{n} x_i, where x_1 is the first observation, x_2 the second observation, etc. in a data set.
There are n observations. Write out the sum of the first two or three observations, use three dots to indicate the other values, and add on the final observation:
Σ_{i=1}^{n} x_i = x_1 + x_2 + x_3 + ... + x_n
Notation
x_i is the i-th term from the data set x_1, x_2, x_3, ..., x_n.
x_ij is the (i, j)-th term from the data set
x_11, x_21, ..., x_n1
x_12, x_22, ..., x_n2
...
x_1k, x_2k, ..., x_nk
Example 3: If we select 50 female and 50 male Stat 115 students and measure their heights, we obtain the data set
x_ij, i = 1, 2; j = 1, 2, ..., 50
Here i represents sex (1 for female and 2 for male), and j the individual. For example, x_29 is the height of the 9th male in the sample.
Example 4: Evaluate the expression x̄ = (1/4) Σ_{i=1}^{4} x_i, where x_i is the i-th observation in the set {4, 7.5, 3.5, 8}.
• Substitute each of the x_i values into the expression and follow BEDMAS:
x̄ = (1/4)(4 + 7.5 + 3.5 + 8)
  = (1/4)(23)
  = 5.75
Example 5: Evaluate the expression v = (1/3) Σ_{i=1}^{4} (x_i − x̄)², where x_i is the i-th observation in the set {4, 7.5, 3.5, 8} and x̄ = 5.75 (calculated in Example 4).
• Substitute each of the x_i values into the expression, along with x̄ = 5.75:
v = (1/3)((4 − 5.75)² + (7.5 − 5.75)² + (3.5 − 5.75)² + (8 − 5.75)²)
• Follow BEDMAS and evaluate each one of the four inner brackets:
v = (1/3)((−1.75)² + (1.75)² + (−2.25)² + (2.25)²)
• The exponents (squares) are calculated and then the four terms are added:
v = (1/3)(3.0625 + 3.0625 + 5.0625 + 5.0625)
  = (1/3)(16.25)
• The multiplication by 1/3 is outside the brackets so it is calculated last:
v = 5.417 (to three d.p.)
Example 6: Evaluate the expression χ² = Σ_{all cells} (observed − expected)²/expected for the table below, where the expected values are given in brackets and the observed values are not in brackets:
15 (26)   50 (39)
33 (22)   22 (33)
• Note that for this type of sigma expression, the notation means we have to add up the result from each of the four cells.
• Substitute each value into the expression:
χ² = (15 − 26)²/26 + (50 − 39)²/39 + (33 − 22)²/22 + (22 − 33)²/33
• Evaluate each bracket, and then square the result:
χ² = (−11)²/26 + (11)²/39 + (11)²/22 + (−11)²/33
   = 121/26 + 121/39 + 121/22 + 121/33
• Use the [a b/c] (or equivalent) button to calculate the sum:
χ² = 16.923 (to three d.p.)
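The chi-squared sum is a one-liner once the observed and expected counts are paired up:

```python
# Observed and expected counts from the four cells of the table
observed = [15, 50, 33, 22]
expected = [26, 39, 22, 33]

# Sum of (O - E)^2 / E over all cells
chi_sq = sum((o - e)**2 / e for o, e in zip(observed, expected))
```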
Section 2: Basic Statistical Concepts
1. Mean
The mean x̄ is commonly referred to as the “average”. It is used as a measure of the “centre” of a data set. To find the mean, simply add up all your data values (observations) and divide by the number of values (sample size):
x̄ = (x_1 + x_2 + ... + x_n)/n  or  x̄ = (1/n) Σ_{i=1}^{n} x_i
Example:
Calculate the mean of the data set 2, 4, 6, 8, 10, 12.
There are six values in the data set, i.e. n = 6.
x̄ = (2 + 4 + 6 + 8 + 10 + 12)/6 = 42/6 = 7
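In Python, the mean is the sum of the data divided by the sample size:

```python
data = [2, 4, 6, 8, 10, 12]
mean = sum(data) / len(data)   # 42 / 6
```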
2. Median
The median is defined as the middle observation in the data set, and is another measure of the centre of the data. Note that the data must be in order before you calculate the median!
• In general, the median is the ((n + 1)/2)-th observation, where n is the sample size.
• If there is an odd number of observations, the median will be the middle observation.
• If there is an even number of observations, the median will be the mean of the two middle observations.
Example 1: Calculate the median of the data set 10, 1, 3, 8, 9.
• First sort the data into order: 1, 3, 8, 9, 10
• There are n = 5 observations, so the median is the (5 + 1)/2 = 3rd observation, i.e. 8.
Example 2: Calculate the median of the data set 32, 2, 36, 14, 6, 33.
• First sort the data into order: 2, 6, 14, 32, 33, 36
• There are n = 6 observations, so the median is the (6 + 1)/2 = 3.5th observation.
• Take the mean of the 3rd and 4th observations, i.e. (14 + 32)/2 = 23.
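A small function capturing the rule (sort first, then take the middle observation, or the mean of the two middle observations):

```python
def median(values):
    ordered = sorted(values)   # the data must be in order first
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:             # odd n: the middle observation
        return ordered[mid]
    # even n: the mean of the two middle observations
    return (ordered[mid - 1] + ordered[mid]) / 2

m1 = median([10, 1, 3, 8, 9])        # Example 1: 8
m2 = median([32, 2, 36, 14, 6, 33])  # Example 2: (14 + 32) / 2 = 23
```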
3. Range
The range is the difference between the largest and smallest observations in the data set. It is a measure of the variation in the data.
Example:
The range of the data set 2, 5, 6, 9, 16, 2, 13 is 16 – 2 = 14.
4. Variance and Standard Deviation
• The variance (s²) is calculated as follows:
s² = (1/(n − 1)) Σ_{i=1}^{n} (x_i − x̄)²
The standard deviation (s) is the most commonly used measure of variation in a set of data. It is the square root of the variance,
i.e. s = √( (1/(n − 1)) Σ_{i=1}^{n} (x_i − x̄)² )
Usually we calculate the variance first, then we take the square root to give the standard deviation. (This follows the order of operation indicated by BEDMAS.)
Example:
The mean for the data set 9, 5, 6, 4, 16, 2 is 7.0. Calculate the standard deviation:
• First calculate the variance. Substitute in each value, including x̄ = 7 and n = 6:
s² = (1/5)((9 − 7)² + (5 − 7)² + (6 − 7)² + (4 − 7)² + (16 − 7)² + (2 − 7)²)
• Evaluate the expression, following BEDMAS:
s² = (1/5)((2)² + (−2)² + (−1)² + (−3)² + (9)² + (−5)²)
   = (1/5)(4 + 4 + 1 + 9 + 81 + 25)
   = (1/5)(124) = 24.8
• Take the square root of the variance to give the standard deviation:
s = √24.8 = 4.98 (to two decimal places)
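The same calculation in Python (dividing by n − 1, as in the formula above):

```python
import math

data = [9, 5, 6, 4, 16, 2]
n = len(data)
xbar = sum(data) / n                                   # 7.0
variance = sum((x - xbar)**2 for x in data) / (n - 1)  # 124 / 5 = 24.8
sd = math.sqrt(variance)                               # about 4.98
```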
5. Quartiles and Interquartile Range
There are two quartiles: a lower quartile (Q1) and an upper quartile (Q3). The lower quartile has 25% of the data below it, and the upper quartile has 25% of the data above it.
To find a quartile, first find the median of the data set. Then treat the data above the median (upper set) and the data below the median (lower set) as separate sets. The lower quartile is the median of the lower set, while the upper quartile is the median of the upper set.
The interquartile range is the upper quartile minus the lower quartile; the interval between the two quartiles contains the middle 50% of the data. It is a measure of the variation in the data.
Example 1:
The data set 1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21 has a median of 11.
• Therefore the lower set is 1, 3, 5, 7, 9, which has a median of 5. So the lower quartile is 5.
• The upper set is 13, 15, 17, 19, 21, and has a median of 17. So the upper quartile is 17.
• The interquartile range is 17 – 5 = 12.
Example 2:
The data set 1, 5, 6, 8, 12, 16, 19, 22, 29, 31, 36, 40 has a median of 17.5.
• The lower set is 1, 5, 6, 8, 12, 16, which has a median of 7, so the lower quartile is 7.
• The upper set is 19, 22, 29, 31, 36, 40, which has a median of 30, so the upper quartile is 30.
• The interquartile range is 30 – 7 = 23.
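The median-split method described above can be sketched in Python. The helper names are ours, and note that statistical software often uses other quartile conventions, so its answers may differ slightly from this booklet's method:

```python
# Quartiles by the booklet's median-split method.
def median(xs):
    xs = sorted(xs)
    n = len(xs)
    mid = n // 2
    return xs[mid] if n % 2 else (xs[mid - 1] + xs[mid]) / 2

def quartiles(xs):
    xs = sorted(xs)
    half = len(xs) // 2
    lower = xs[:half]     # the values below the median position
    upper = xs[-half:]    # the values above the median position
    return median(lower), median(upper)

q1, q3 = quartiles([1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21])
print(q1, q3, q3 - q1)    # Example 1: quartiles 5 and 17, IQR 12
```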
6. Scatterplot
A scatterplot shows the relationship between two variables. Each observation consists of two measurements. Often we are interested in the "response" of one measurement to the value of the other. We try to distinguish between the "response" variable and the "explanatory" variable. The response variable is plotted on the y-axis (vertical axis) and the explanatory variable on the x-axis (horizontal axis).
Example:
The weight of 13 students and the amount of time it took them to drink a particular beverage are plotted below: the explanatory variable is the student's weight (x-axis) and the response variable is the time taken to drink the beverage (y-axis).
[Scatterplot: Weight of Student (kg) on the x-axis, 50–120, and Time taken to drink beverage on the y-axis, 0–9.]
Section 3: Sample Exercise
This sample exercise contains questions based on the Basics Booklet, plus a few questions from material taught during the first week of the course.
1. In a recent study looking at rainbow trout, researchers measured the lengths of juvenile fish. The lengths (in cm) for five randomly selected fish were:
18.6, 15.4, 13.4, 17.0, 12.9
Calculate to one decimal place the mean for these data.
2. For a second random sample of six juvenile fish the lengths (in cm) were:
15.5, 12.6, 17.5, 17.4, 13.8, 12.2
Calculate the median for these data.
3. Calculate the range for the data in Question 2.
4. For a third random sample of five juvenile fish the lengths (in cm) were:
14.5, 14.8, 16.5, 18.4, 13.8
The mean for these data is 15.6 (cm). Calculate to one decimal place the standard deviation for these data.
5. The mean value of 15.6 (cm) in Question 4 is a:
A. Parameter
B. Statistic
C. Distribution
D. Population value
E. Measure of Spread
6. The following list contains five values:
3.2%
0.096
0.048
0.32
0.58%
Beside each value select "True" if the value is less than 0.05 or "False" if the value is greater than 0.05.
7. Calculate the value of the expression $\sum_{i=1}^{5} 3i$.
8. If $Z = \dfrac{X - \mu}{\sigma/\sqrt{n}}$, with $X = 43.6$, $\mu = 48$, $\sigma = 8.6$ and $n = 50$, then calculate the value of $Z$.
9. If $1.96 = \dfrac{X - 2.8}{5}$, calculate the value of $X$.
10. In a previous STAT110 class at Otago University, 64% of students sitting the paper were known to be first year students. In a study of students sitting the paper, a random sample of 40 students was taken, and 60% of the students in this sample were found to be first year students.
To earn the mark for this question you must answer both questions below correctly. For each question select your answer from these five options:
A. 64%
B. 40 students
C. 60%
D. all first year students at Otago University
E. students sitting the paper
Question 1: The statistic in the paragraph above is: . . . . .
Question 2: What is the population? . . . . .
Answers
Answers without working are provided. For the working, look through the Basics Booklet above, or consult your notes for the first week of the course. If you need help, go to one of the help sessions. Details of these sessions are provided in the Course Outline at the start of this book.
1. 15.5 cm (1 d.p.)
2. 14.65 cm
3. 5.3 cm
4. 1.9 cm
5. B
6. True, False, True, False, True
7. 45
8. –3.62 (2 d.p.)
9. 12.6
10. B, E
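If you want to check your working for the numerical questions, the calculations can be sketched in Python (the helper names are ours; the data are the exercise's):

```python
# Check the numerical answers to Questions 1-4 and 7-9.
from math import sqrt

def mean(xs): return sum(xs) / len(xs)

def median(xs):
    xs = sorted(xs); n = len(xs); m = n // 2
    return xs[m] if n % 2 else (xs[m - 1] + xs[m]) / 2

def sd(xs):
    m = mean(xs)
    return sqrt(sum((x - m) ** 2 for x in xs) / (len(xs) - 1))

print(round(mean([18.6, 15.4, 13.4, 17.0, 12.9]), 1))   # Q1: 15.5
q2 = [15.5, 12.6, 17.5, 17.4, 13.8, 12.2]
print(median(q2))                                        # Q2: 14.65
print(round(max(q2) - min(q2), 1))                       # Q3: 5.3
print(round(sd([14.5, 14.8, 16.5, 18.4, 13.8]), 1))      # Q4: 1.9
print(sum(3 * i for i in range(1, 6)))                   # Q7: 45
print(round((43.6 - 48) / (8.6 / sqrt(50)), 2))          # Q8: -3.62
print(round(1.96 * 5 + 2.8, 1))                          # Q9: 12.6
```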
Appendix Two: Some Summaries
1. Some Useful Rules of Probability
2. Random Variables
3. Binomial Distribution
4. Normal Distribution
Basic Probability Rules and Distributions
1. Some Useful Rules of Probability
• Pr(A or B) = Pr(A) + Pr(B) – Pr(A and B)
If we use set notation for this rule, it can be rewritten as
Pr(A ∪ B) = Pr(A) + Pr(B) – Pr(A ∩ B)
[Venn diagrams illustrating A ∪ B and A ∩ B]
• If A and B are mutually exclusive (disjoint) then:
Pr(A and B) = 0, or Pr(A ∩ B) = 0
[Venn diagram: two non-overlapping circles A and B]
• If Ā represents the complement of A (every event not in A) then
Pr(A) + Pr(Ā) = 1
[Venn diagram: event A and its complement Ā]
Appendix 2 – Some summaries
• Probability of B given A: Pr(A ∩ B) = Pr(A) × Pr(B | A)
This may be rewritten as

$$\Pr(B \mid A) = \frac{\Pr(A \cap B)}{\Pr(A)}$$

[Tree diagram: the first branches lead to A, with probability Pr(A), or Ā; the second branches lead to B or B̄, with conditional probabilities Pr(B | A), Pr(B̄ | A), Pr(B | Ā) and Pr(B̄ | Ā); multiplying along each path gives the joint probabilities Pr(A ∩ B), Pr(A ∩ B̄), Pr(Ā ∩ B) and Pr(Ā ∩ B̄) at the leaves.]
• If A and B are independent then: (i) Pr(B | A) = Pr(B) and (ii) Pr(A ∩ B) = Pr(A) × Pr(B)
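These rules can be verified by direct enumeration on a small sample space. The sketch below uses two fair dice, an illustrative example of our own rather than one from the booklet:

```python
# Verify the probability rules by enumerating all 36 outcomes of two fair dice.
from fractions import Fraction
from itertools import product

space = list(product(range(1, 7), repeat=2))

def pr(event):
    # exact probability of an event over the equally likely sample space
    return Fraction(sum(1 for w in space if event(w)), len(space))

def A(w): return w[0] % 2 == 0      # first die shows an even number
def B(w): return w[0] + w[1] == 7   # the two dice total 7

# Addition rule: Pr(A or B) = Pr(A) + Pr(B) - Pr(A and B)
assert pr(lambda w: A(w) or B(w)) == pr(A) + pr(B) - pr(lambda w: A(w) and B(w))

# Complement rule: Pr(A) + Pr(not A) = 1
assert pr(A) + pr(lambda w: not A(w)) == 1

# Conditional probability: Pr(B | A) = Pr(A and B) / Pr(A)
pr_b_given_a = pr(lambda w: A(w) and B(w)) / pr(A)
print(pr_b_given_a)   # 1/6, the same as Pr(B): these events are independent

# Independence: Pr(A and B) = Pr(A) x Pr(B)
assert pr(lambda w: A(w) and B(w)) == pr(A) * pr(B)
```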
2. Random Variables
• A random variable is one whose value is determined by a random mechanism.
• A continuous random variable can take any value in an interval.
• A discrete random variable can take one of a countable number of values.
3. Binomial Distribution
Suppose
1. We have a fixed number of trials (n)
2. Trials are independent
3. Each trial has only two outcomes ("success" or "failure")
4. The probability of success (π) is the same for each trial
The total number of successes (X) is a discrete random variable and has a Binomial distribution, with
$$\Pr(X = x) = \binom{n}{x}\pi^x(1-\pi)^{n-x}$$

The mean and variance of the distribution are $\mu = n\pi$ and $\sigma^2 = n\pi(1-\pi)$.
Example:
If n = 30 and π = 0.6 then
• μ = nπ = 30 × 0.6 = 18
• σ² = nπ(1 − π) = 30 × 0.6 × 0.4 = 7.2
• The standard deviation is σ = √7.2 = 2.68 (to two d.p.)
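The Binomial calculations above can be sketched in Python; the function name is ours:

```python
# Binomial probability, mean, variance and standard deviation for n = 30, pi = 0.6.
from math import comb, sqrt

def binom_pmf(x, n, p):
    # Pr(X = x) = C(n, x) * p^x * (1 - p)^(n - x)
    return comb(n, x) * p ** x * (1 - p) ** (n - x)

n, p = 30, 0.6
mean = n * p                   # 18
variance = n * p * (1 - p)     # 7.2
print(mean, round(variance, 1), round(sqrt(variance), 2))

# Sanity check: the probabilities over x = 0..n sum to 1.
assert abs(sum(binom_pmf(x, n, p) for x in range(n + 1)) - 1) < 1e-12
```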
4. Normal Distribution
A distribution that is commonly used to describe the behaviour of continuous random variables is the normal distribution.
• X ~ N(μ, σ²) means "X has a normal distribution with mean μ and variance σ²"
• X ~ N(0, 1) means X has a standard normal distribution
• If X ~ N(μ, σ²), then the standardised random variable $Z = \dfrac{X - \mu}{\sigma} \sim N(0, 1)$
For any Normal distribution, approximately:
• 68% of the observations are between μ − σ and μ + σ.
• 95% of the observations are between μ − 2σ and μ + 2σ.
• 99.7% of the observations are between μ − 3σ and μ + 3σ.
Example:
If X ~ N(45, 30) then
• μ = 45
• the standard deviation σ = √30 = 5.477 (to three d.p.)
• Approximately 68% of the observations are expected to be between μ − σ = 39.5 and μ + σ = 50.5.
• Approximately 95% of the observations are expected to be between 34 and 56.
• Over 99% (i.e. almost all) of the observations are expected to be between 28.5 and 61.
• $\Pr(X < 40) = \Pr\left(Z < \dfrac{40 - 45}{5.477}\right) = \Pr(Z < -0.913) = 0.1806$
[Sketch: normal curve with the area below X = 40 shaded, i.e. below Z = –0.913 on the standard normal scale.]
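The probability Pr(X < 40) can be computed without normal tables using the error function; a Python sketch (the function name is ours):

```python
# Normal probabilities via the identity Phi(z) = (1 + erf(z / sqrt(2))) / 2.
from math import erf, sqrt

def normal_cdf(x, mu=0.0, sigma=1.0):
    # Pr(X < x) for X ~ N(mu, sigma^2)
    return 0.5 * (1 + erf((x - mu) / (sigma * sqrt(2))))

mu, var = 45, 30
sigma = sqrt(var)          # about 5.477
z = (40 - mu) / sigma      # about -0.913
print(round(z, 3), round(normal_cdf(40, mu, sigma), 4))
```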
Summary of Formulae
1. Normal Distribution
If X is a normal random variable with parameters $\mu_X$ (mean) and $\sigma^2_X$ (variance):
• Mean: $\mu_X$
• Standard deviation: $\sigma_X = \sqrt{\sigma^2_X}$
A standard normal random variable Z has mean $\mu_Z = 0$ and variance $\sigma^2_Z = 1$. To transform a normal random variable X into a standard normal (and vice versa):

$$Z = \frac{X - \mu_X}{\sigma_X} \quad \text{and} \quad X = Z\sigma_X + \mu_X.$$
2. Binomial Distribution<br />
If X is a binomial random variable with n trials and probability π then
• Mean: $\mu_X = n\pi$
• Standard deviation: $\sigma_X = \sqrt{n\pi(1-\pi)}$
• If $n\pi$ and $n(1-\pi)$ are both greater than 5, then X is approximately normally distributed with mean $\mu_X$ and variance $\sigma^2_X$.
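The rule of thumb above can be checked numerically. The sketch below compares the exact Binomial probability Pr(X ≤ 15) with its normal approximation for n = 30, π = 0.6; it uses a continuity correction of 0.5, which is an addition of ours rather than something covered in this booklet:

```python
# Compare an exact Binomial probability with its normal approximation.
from math import comb, erf, sqrt

def binom_cdf(k, n, p):
    # exact Pr(X <= k) for X ~ Binomial(n, p)
    return sum(comb(n, x) * p ** x * (1 - p) ** (n - x) for x in range(k + 1))

def normal_cdf(x, mu, sigma):
    # Pr(X < x) for X ~ N(mu, sigma^2)
    return 0.5 * (1 + erf((x - mu) / (sigma * sqrt(2))))

n, p = 30, 0.6                         # n*pi = 18 and n*(1 - pi) = 12, both > 5
mu, sigma = n * p, sqrt(n * p * (1 - p))
exact = binom_cdf(15, n, p)
approx = normal_cdf(15.5, mu, sigma)   # 15.5 rather than 15: continuity correction
print(round(exact, 3), round(approx, 3))
```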
3. Distributions of Statistics
• The mean $\bar{X}$ of a random sample of size n has mean $\mu_{\bar{X}} = \mu_X$ and standard deviation $\sigma_{\bar{X}} = \sigma_X/\sqrt{n}$.
• The sample proportion P computed from a binomial distribution with parameters n and π has a mean of $\mu_P = \pi$ and standard deviation $\sigma_P = \sqrt{\pi(1-\pi)/n}$. If $n\pi$ and $n(1-\pi)$ are both greater than 5, then P will be approximately normally distributed.
• The distribution of the difference between two sample means $\bar{X}_1 - \bar{X}_2$ has a mean of $\mu_{\bar{X}_1 - \bar{X}_2} = \mu_1 - \mu_2$ and a standard deviation of $\sigma_{\bar{X}_1 - \bar{X}_2} = \sqrt{\dfrac{\sigma^2_1}{n_1} + \dfrac{\sigma^2_2}{n_2}}$.
- In large random samples ($n_1$ and $n_2 \geq 30$), $\sigma_{\bar{X}_1 - \bar{X}_2}$ can be estimated by $\hat{\sigma}_{\bar{X}_1 - \bar{X}_2} = \sqrt{\dfrac{s^2_1}{n_1} + \dfrac{s^2_2}{n_2}}$.
- If $\sigma^2_1 = \sigma^2_2$ then we can estimate $\sigma_{\bar{X}_1 - \bar{X}_2}$ by $\hat{\sigma}_{\bar{X}_1 - \bar{X}_2} = \sqrt{\dfrac{(n_1-1)s^2_1 + (n_2-1)s^2_2}{n_1+n_2-2}}\sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}$.
4. Contingency tables
                 Factor 2
Factor 1     Level 1       Level 2       Total
Level 1      w             x             r1 = w + x
Level 2      y             z             r2 = y + z
Total        c1 = w + y    c2 = x + z    n = w + x + y + z
$$\chi^2 = \sum_{i=1}^{2}\sum_{j=1}^{2}\frac{(o_{ij} - e_{ij})^2}{e_{ij}}, \quad \text{where } e_{ij} = \frac{r_i c_j}{n}$$

and $o_{ij}$ is the observed value in row i, column j.
Odds ratio: OR = (w/x)/(y/z) = (w × z)/(x × y)
Relative risk: RR = (w/(w + x))/(y/(y + z))
Attributable risk: AR = w/(w + x) − y/(y + z)
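A Python sketch of these 2 × 2 table measures, using the booklet's w, x, y, z labels on made-up illustration counts:

```python
# Chi-square statistic, odds ratio, relative risk and attributable risk
# for a 2x2 contingency table. The counts are made-up illustration data.
w, x, y, z = 20, 80, 10, 90
n = w + x + y + z
rows = (w + x, y + z)                 # row totals r1, r2
cols = (w + y, x + z)                 # column totals c1, c2
observed = [[w, x], [y, z]]

# chi2 = sum over cells of (o - e)^2 / e, with e = (row total * column total) / n
chi2 = sum((observed[i][j] - rows[i] * cols[j] / n) ** 2 / (rows[i] * cols[j] / n)
           for i in range(2) for j in range(2))

OR = (w * z) / (x * y)
RR = (w / (w + x)) / (y / (y + z))
AR = w / (w + x) - y / (y + z)
print(round(chi2, 2), OR, RR, round(AR, 2))
```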
Appendix 3 - Formulae
5. Confidence Intervals<br />
All of the 100(1 − α)% confidence intervals calculated in this course are of the form:
Estimate ± multiplier × standard error.
In the following, $\bar{x}$, p, etc. are the values calculated from the samples.
Population mean
• Random sample, $\sigma_X$ known: estimate $\bar{x}$; df NA; multiplier $z_{\alpha/2}$; standard error $\dfrac{\sigma_X}{\sqrt{n}}$
• Random normal sample, $\sigma_X$ unknown and estimated by s: estimate $\bar{x}$; df $\nu = n - 1$; multiplier $t_{\alpha/2,\nu}$; standard error $\dfrac{s}{\sqrt{n}}$
Difference between population means
• Small random samples, normal population, $\sigma_1 = \sigma_2 = \sigma$ unknown: estimate $\bar{x}_1 - \bar{x}_2$; df $\nu = n_1 + n_2 - 2$; multiplier $t_{\alpha/2,\nu}$; standard error $\sqrt{\dfrac{(n_1-1)s^2_1 + (n_2-1)s^2_2}{n_1+n_2-2}}\sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}$
• Large random samples (both ≥ 30): estimate $\bar{x}_1 - \bar{x}_2$; df NA; multiplier $z_{\alpha/2}$; standard error $\sqrt{\dfrac{s^2_1}{n_1} + \dfrac{s^2_2}{n_2}}$
• Paired difference in small random samples from a normal population: estimate $\bar{d}$; df $\nu = n - 1$; multiplier $t_{\alpha/2,\nu}$; standard error $\dfrac{s_d}{\sqrt{n}}$
After ANOVA and Regression
• Estimate, multiplier and standard errors determined from output
Population proportions
• Population proportion: estimate p; df NA; multiplier $z_{\alpha/2}$; standard error $\sqrt{\dfrac{p(1-p)}{n}}$
• Difference between 2 population proportions: estimate $p_1 - p_2$; df NA; multiplier $z_{\alpha/2}$; standard error $\sqrt{\dfrac{p_1(1-p_1)}{n_1} + \dfrac{p_2(1-p_2)}{n_2}}$
Odds ratio, relative risk, attributable risk (see contingency tables above for w, x, y and z)
• Log (natural) odds ratio: estimate ln(OR); df NA; multiplier $z_{\alpha/2}$; standard error $\sqrt{\dfrac{1}{w} + \dfrac{1}{x} + \dfrac{1}{y} + \dfrac{1}{z}}$
• Log (natural) relative risk: estimate ln(RR); df NA; multiplier $z_{\alpha/2}$; standard error $\sqrt{\dfrac{1}{w} - \dfrac{1}{w+x} + \dfrac{1}{y} - \dfrac{1}{y+z}}$
• Attributable risk: as for the difference of two population proportions, with $p_1 = w/(w+x)$ and $p_2 = y/(y+z)$
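The estimate ± multiplier × standard error pattern can be sketched in Python for the simplest case, a population mean with a z multiplier; the function name and the sample values are made-up illustrations, not course data:

```python
# 95% confidence interval for a population mean:
# estimate +/- multiplier * standard error.
from math import sqrt

def ci_mean_large(xbar, s, n, z=1.96):
    # large-sample / sigma-known form with standard error s / sqrt(n)
    se = s / sqrt(n)
    return xbar - z * se, xbar + z * se

lo, hi = ci_mean_large(xbar=15.6, s=1.9, n=50)
print(round(lo, 2), round(hi, 2))
```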
6. Regression<br />
$$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x \quad \text{where} \quad \hat{\beta}_1 = \frac{\sum(x_i - \bar{x})(y_i - \bar{y})}{\sum(x_i - \bar{x})^2} \quad \text{and} \quad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1\bar{x}.$$

Standard error of the slope: $SE(\hat{\beta}_1) = \dfrac{s_e}{\sqrt{\sum(x_i - \bar{x})^2}}$, where $s_e = \sqrt{\dfrac{\sum(y_i - \hat{y}_i)^2}{n-2}} = \sqrt{\text{MS Residual}}$.

Standard error of a forecast at $x_k$: $s_e\sqrt{1 + \dfrac{1}{n} + \dfrac{(x_k - \bar{x})^2}{\sum(x_i - \bar{x})^2}}$.

7. ANOVA
1. Total SS = Treatment SS + Error SS
2. Total df = Treatment df + Error df
3. MS Treatment = Treatment SS/Treatment df and MS Error = Error SS/Error df
4. Overall mean SS = $n\bar{y}^2$ where $n = n_1 + \ldots + n_k$ and $\bar{y} = \frac{1}{n}(n_1\bar{y}_1 + \ldots + n_k\bar{y}_k)$.
5. Treatment SS = $\dfrac{C^2_1}{n_1} + \dfrac{C^2_2}{n_2} + \ldots + \dfrac{C^2_k}{n_k} - n\bar{y}^2$ where $C_j$ is the jth column total.
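The least-squares formulae in Section 6 can be sketched in Python on a small made-up data set:

```python
# Least-squares slope and intercept from the regression formulae:
# b1 = sum((x - xbar)(y - ybar)) / sum((x - xbar)^2), b0 = ybar - b1 * xbar.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.1, 8.0, 9.9]   # made-up, roughly linear in xs

n = len(xs)
xbar = sum(xs) / n
ybar = sum(ys) / n

sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
sxx = sum((x - xbar) ** 2 for x in xs)

b1 = sxy / sxx           # fitted slope
b0 = ybar - b1 * xbar    # fitted intercept
print(round(b1, 3), round(b0, 3))
```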