Columbia Mountain Institute for Applied Ecology
A short course on regression methods.

These notes are a subset of a more complete set of notes
available at
http://www.stat.sfu.ca/~cschwarz/CourseNotes

C. J. Schwarz
Department of Statistics and Actuarial Science, Simon Fraser University
8888 University Drive
Burnaby, BC V5A 1S6
cschwarz@stat.sfu.ca

November 23, 2012
Contents

1 Correlation and simple linear regression
  1.1 Introduction
  1.2 Graphical displays
    1.2.1 Scatterplots
    1.2.2 Smoothers
  1.3 Correlation
    1.3.1 Scatter-plot matrix
    1.3.2 Correlation coefficient
    1.3.3 Cautions
    1.3.4 Principles of Causation
  1.4 Single-variable regression
    1.4.1 Introduction
    1.4.2 Equation for a line - getting notation straight (no pun intended)
    1.4.3 Populations and samples
    1.4.4 Assumptions
      Linearity
      Correct scale of predictor and response
      Correct sampling scheme
      No outliers or influential points
      Equal variation along the line
      Independence
      Normality of errors
      X measured without error
    1.4.5 Obtaining Estimates
    1.4.6 Obtaining Predictions
    1.4.7 Residual Plots
    1.4.8 Example - Yield and fertilizer
    1.4.9 Example - Mercury pollution
    1.4.10 Example - The Anscombe Data Set
    1.4.11 Transformations
    1.4.12 Example: Monitoring Dioxins - transformation
    1.4.13 Example: Weight-length relationships - transformation
      Using the Fit Special
      Using derived variables
      A non-linear fit
    1.4.14 Power/Sample Size
    1.4.15 The perils of R²
  1.5 A no-intercept model: Fulton's Condition Factor K
  1.6 Frequently Asked Questions - FAQ
    1.6.1 Do I need a random sample; power analysis
2 Detecting trends over time
  2.1 Introduction
  2.2 Simple Linear Regression
    2.2.1 Populations and samples
    2.2.2 Assumptions
      Linearity
      Scale of Y and X
      Correct sampling scheme
      No outliers or influential points
      Equal variation along the line
      Independence
      Normality of errors
      X measured without error
    2.2.3 Obtaining Estimates
    2.2.4 Obtaining Predictions
    2.2.5 Inverse predictions
    2.2.6 Residual Plots
    2.2.7 Example: The Grass is Greener (for longer)
  2.3 Transformations
    2.3.1 Example: Monitoring Dioxins - transformation
    2.3.2 Final Words
  2.4 Power/Sample Size
    2.4.1 Introduction
    2.4.2 Getting the necessary information
    2.4.3 How does power vary as information changes?
    2.4.4 Finally - how many years do I need to monitor?
    2.4.5 Summary of plans
  2.5 Testing for common trend - ANCOVA
    2.5.1 Assumptions
    2.5.2 Statistical model
    2.5.3 Example: Degradation of dioxin - pooling locations
    2.5.4 Change in yearly average temperature with regime shifts
  2.6 Dealing with Autocorrelation
    2.6.1 Example: Mink pelts from Saskatchewan
  2.7 Dealing with seasonality
    2.7.1 Empirical adjustment for seasonality
      General idea
      Example: Total phosphorus from Klamath River
    2.7.2 Using the ANCOVA approach
© 2012 Carl James Schwarz
      General idea
      Example: Total phosphorus levels on the Klamath River - revisited
    2.7.3 Fitting cyclical patterns
      General approach
      Example: Total phosphorus from Klamath River
      Example: Comparing air quality measurements using two different methods
    2.7.4 Further comments
  2.8 Seasonality and Autocorrelation
  2.9 Non-parametric detection of trend
    2.9.1 Cox and Stuart test for trend
    2.9.2 Non-parametric regression - Spearman, Kendall, Theil, Sen estimates
      Non-parametric does NOT mean no assumptions
      Example: The Grass is Greener (for longer) revisited
      Final Remarks
    2.9.3 Dealing with seasonality - Seasonal Kendall's τ
      Basic principles
      Example: Total phosphorus on the Klamath River revisited
      Final notes
    2.9.4 Seasonality with Autocorrelation
      General ideas
  2.10 Summary
3 Estimating power/sample size using Program Monitor
  3.1 Mechanics of MONITOR
  3.2 How does MONITOR work?
  3.3 Incorporating process and sampling error
  3.4 Presence/Absence Data
  3.5 WARNING about using testing for temporal trends
4 Regression - hockey sticks, broken sticks, piecewise, change points
  4.1 Hockey-stick, piecewise, or broken-stick regression
    4.1.1 Example: Nenana River Ice Breakup Dates
  4.2 Searching for the change point
    4.2.1 Change point model for the Nenana River Ice Breakup
  4.3 How NOT to search for a change point!
5 Analysis of Covariance - ANCOVA
  5.1 Introduction
  5.2 Assumptions
  5.3 Comparing individual regression lines
  5.4 Comparing Means after covariate adjustments
  5.5 Power and sample size
  5.6 Example - Degradation of dioxin
  5.7 Change in yearly average temperature with regime shifts
  5.8 Example - More refined analysis of stream-slope example
  5.9 Comparing Fulton's Condition Factor K
  5.10 Final Notes
6 Multiple linear regression
  6.1 Introduction
    6.1.1 Data format and missing values
    6.1.2 The statistical model
    6.1.3 Assumptions
      Linearity
      Correct sampling scheme
      No outliers or influential points
      Equal variation along the line
      Independence
      Normality of errors
      X variables measured without error
    6.1.4 Obtaining Estimates
    6.1.5 Predictions
    6.1.6 Example: blood pressure
  6.2 Regression problems and diagnostics
    6.2.1 Introduction
    6.2.2 Preliminary characteristics
    6.2.3 Residual plots
    6.2.4 Actual vs. Predicted Plot
    6.2.5 Detecting influential observations
      Cook's D
      Hats
      Caution
    6.2.6 Leverage plots
    6.2.7 Collinearity
  6.3 Polynomial, product, and interaction terms
    6.3.1 Introduction
    6.3.2 Example: Tomato growth as a function of water
    6.3.3 Polynomial models with several variables
    6.3.4 Cross-product and interaction terms
  6.4 The general linear test
    6.4.1 Introduction
    6.4.2 Example: Predicting body fat from measurements
    6.4.3 Summary
  6.5 Indicator variables
    6.5.1 Introduction
    6.5.2 Defining indicator variables
    6.5.3 The ANCOVA model
    6.5.4 Assumptions
    6.5.5 Comparing individual regression lines
    6.5.6 Example: Degradation of dioxin
    6.5.7 Example: More refined analysis of stream-slope example
  6.6 Example: Predicting PM10 levels
  6.7 Variable selection methods
    6.7.1 Introduction
    6.7.2 Maximum model
    6.7.3 Selecting a model criterion
      R²
      F_p
      MSE_p
      C_p and AIC
    6.7.4 Which subsets should be examined
      All possible subsets
      Backward elimination
      Forward addition
      Stepwise selection
      Closing words
    6.7.5 Goodness-of-fit
    6.7.6 Example: Calories of candy bars
    6.7.7 Example: Fitness dataset
    6.7.8 Example: Predicting zooplankton biomass
7 Logistic Regression
  7.1 Introduction
    7.1.1 Difference between standard and logistic regression
    7.1.2 The Binomial Distribution
    7.1.3 Odds, risk, odds-ratio, and probability
    7.1.4 Modeling the probability of success
    7.1.5 Logistic regression
  7.2 Data Structures
  7.3 Assumptions made in logistic regression
  7.4 Example: Space Shuttle - Single continuous predictor
  7.5 Example: Predicting Sex from physical measurements - Multiple continuous predictors
  7.6 Examples: Lung Cancer vs. Smoking; Marijuana use of students based on parental usage - Single categorical predictor
    7.6.1 Retrospective and Prospective odds-ratio
    7.6.2 Example: Parental and student usage of recreational drugs
    7.6.3 Example: Effect of selenium on tadpole deformities
  7.7 Example: Pet fish survival as function of covariates - Multiple categorical predictors
  7.8 Example: Horseshoe crabs - Continuous and categorical predictors
  7.9 Assessing goodness of fit
  7.10 Variable selection methods
    7.10.1 Introduction
    7.10.2 Example: Predicting credit worthiness
  7.11 Model comparison using AIC
  7.12 Final Words
    7.12.1 Two common problems
      Zero counts
      Complete separation
    7.12.2 Extensions
      Choice of link function
      More than two response categories
      Exact logistic regression with very small datasets
      More complex experimental designs
    7.12.3 Yet to do
8 Poisson Regression
  8.1 Introduction
  8.2 Experimental design
  8.3 Data structure
  8.4 Single continuous X variable
  8.5 Single continuous X variable - dealing with overdispersion
  8.6 Single Continuous X variable with an OFFSET
  8.7 ANCOVA models
  8.8 Categorical X variables - a designed experiment
  8.9 Log-linear models for multi-dimensional contingency tables
  8.10 Variable selection methods
  8.11 Summary
Chapter 1

Correlation and simple linear regression
1.1 Introduction
A nice book explaining how to use JMP to perform regression analysis is: Freund, R., Littell, R., and Creighton, L. (2003) Regression using JMP. Wiley Interscience.

Much of statistics is concerned with relationships among variables and whether observed relationships are real or simply due to chance. The simplest case deals with the relationship between two variables.

Quantifying the relationship between two variables depends upon the scale of measurement of each of the two variables. The following table summarizes some of the important analyses that are often performed to investigate the relationship between two variables.
7
CHAPTER 1. CORRELATION AND SIMPLE LINEAR REGRESSION<br />
Type of variables

                              X is Interval or Ratio, or        X is Nominal or Ordinal
                              what JMP calls Continuous

  Y is Interval or Ratio,     - Scatterplots                    - Side-by-side dot plot
  or what JMP calls           - Running median/spline fit       - Side-by-side box plot
  Continuous                  - Regression                      - ANOVA or t-tests
                              - Correlation

  Y is Nominal or Ordinal     - Logistic regression             - Mosaic chart
                                                                - Contingency tables
                                                                - Chi-square tests
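The table is essentially a lookup keyed on the measurement scales of Y and X. As a rough sketch of that lookup in code (the suggest_analyses helper and its scale labels are illustrative inventions for this note, not a JMP API):

```python
def suggest_analyses(y_scale, x_scale):
    """Suggest analyses for the relationship between a Y and an X variable.

    Scales are "continuous" (interval/ratio) or "categorical"
    (nominal/ordinal).  The mapping mirrors the two-way table above.
    """
    table = {
        ("continuous", "continuous"): ["scatterplot",
                                       "running median/spline fit",
                                       "regression", "correlation"],
        ("continuous", "categorical"): ["side-by-side dot plot",
                                        "side-by-side box plot",
                                        "ANOVA or t-tests"],
        ("categorical", "continuous"): ["logistic regression"],
        ("categorical", "categorical"): ["mosaic chart",
                                         "contingency tables",
                                         "chi-square tests"],
    }
    if (y_scale, x_scale) not in table:
        raise ValueError("scales must be 'continuous' or 'categorical'")
    return table[(y_scale, x_scale)]

print(suggest_analyses("continuous", "continuous"))
```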
In JMP these combinations of two variables are handled by the Analyze->Fit Y-by-X platform, the Analyze->Correlation-of-Ys platform, or the Analyze->Fit Model platform.
When analyzing two variables, one question becomes important because it determines the type of analysis that will be done: is the purpose to explore the nature of the relationship, or to use one variable to explain variation in another? For example, there is a difference between examining height and weight to see if there is a strong relationship, as opposed to using height to predict weight.

Consequently, you need to distinguish between a correlational analysis, in which only the strength of the relationship is described, and regression, in which one variable is used to predict the values of a second variable.
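The distinction shows up numerically as well: the correlation coefficient is symmetric in the two variables, while a least-squares slope depends on which variable is treated as the response. A small sketch (the height/weight numbers are invented for illustration, not taken from these notes):

```python
def pearson_r(xs, ys):
    """Sample correlation coefficient; symmetric in xs and ys."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

def ls_slope(xs, ys):
    """Least-squares slope when ys is regressed on xs; NOT symmetric."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx

# Invented data: heights (cm) and weights (kg) of five people.
height = [150.0, 160.0, 165.0, 170.0, 180.0]
weight = [52.0, 58.0, 63.0, 68.0, 80.0]

print(pearson_r(height, weight))   # same value either way round
print(ls_slope(height, weight))    # kg per cm: weight regressed on height
print(ls_slope(weight, height))    # cm per kg: a different line entirely
```

One way to see that "predict Y from X" and "predict X from Y" are different questions: the product of the two slopes equals r², so the two fitted lines coincide only when the correlation is perfect.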
The two variables are often called either a response variable or an explanatory variable. A response variable (also known as a dependent or Y variable) measures the outcome of a study. An explanatory variable (also known as an independent or X variable) is the variable that attempts to explain the observed outcomes.
1.2 Graphical displays

1.2.1 Scatterplots
The scatter-plot is the primary graphical tool used when exploring the relati<strong>on</strong>ship between two interval or<br />
ratio scale variables. This is obtained in JMP using the Analyze->Fit Y-by-X plat<str<strong>on</strong>g>for</str<strong>on</strong>g>m – be sure that both<br />
variables have a c<strong>on</strong>tinuous scale.<br />
In graphing the relationship, the response variable is usually plotted along the vertical axis (the Y axis) and the explanatory variable is plotted along the horizontal axis (the X axis). It is not always perfectly clear which is the response and which is the explanatory variable. If there is no distinction between the two variables, then it doesn't matter which variable is plotted on which axis – this usually happens only when finding the correlation between variables is the primary purpose.
For example, look at the relationship between calories/serving and fat from the cereal dataset using JMP. [We will create the graph in class at this point.]
What to look for in a scatter-plot
Overall pattern. What is the direction of association? A positive association occurs when above-average values of one variable tend to be associated with above-average values of another; the plot will have an upward slope. A negative association occurs when above-average values of one variable are associated with below-average values of another variable; the plot will have a downward slope. What happens when there is "no association" between the two variables?
Form of the relationship. Does a straight line seem to fit through the 'middle' of the points? Is the relationship linear (the points seem to cluster around a straight line) or curvilinear (the points seem to form a curve)?
Strength of association. Are the points clustered tightly around the curve? If the points have a lot of scatter above and below the trend line, then the association is not very strong. On the other hand, if the amount of scatter about the trend line is very small, then there is a strong association.
Outliers. Are there any points that seem unusual? Outliers are values that are unusually far from the trend curve, i.e., further away from it than you would expect from the usual level of scatter. There is no formal rule for detecting outliers – use common sense. [If you set the role of a variable to be a label and click on points in a linked graph, the label for the point will be displayed, making it easy to identify such points.]
One's usual initial suspicion about any outlier is that it is a mistake, e.g., a transcription error. Every effort should be made to trace the data back to its original source and correct the value if possible. If the data value appears to be correct, then you have a bit of a quandary: do you keep the data point even though it doesn't follow the trend line, or do you drop it because it appears to be anomalous? Fortunately, with computers it is relatively easy to repeat an analysis with and without an outlier – if there is very little difference in the final outcome, don't worry about it.
In some cases, the outliers are the most interesting part of the data. For example, for many years the ozone hole over the Antarctic was missed because the computers were programmed to ignore readings that were so low that 'they must be in error'!
Lurking variables. A lurking variable is a third variable that is related to both variables and may confound the association.
For example, the amount of chocolate consumed in Canada and the number of automobile accidents are positively related, but most people would agree that this is coincidental: each variable is independently driven by population growth.
Sometimes the lurking variable is a 'grouping' variable of sorts. This is often examined by using a different plotting symbol to distinguish between the values of the third variable. For example, consider the following plot of the relationship between salary and years of experience for nurses. The individual lines show a positive relationship, but the overall pattern, when the data are pooled, shows a negative relationship.
It is easy in JMP to assign different plotting symbols (what JMP calls markers) to different points. From the Row menu, use Where to select rows, then assign markers to those rows using the Rows->Markers menu.
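The advice above about repeating an analysis with and without an outlier is easy to sketch outside JMP as well. A minimal Python/NumPy illustration (the numbers are invented, with the last point playing the suspect outlier):

```python
import numpy as np

# Invented data: a clean linear trend with one suspect final point.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([1.1, 2.0, 2.9, 4.2, 5.1, 5.9, 7.2, 30.0])  # y = 30 looks anomalous

# Compare a summary statistic (here the sample correlation)
# with and without the suspect point.
r_with = np.corrcoef(x, y)[0, 1]
r_without = np.corrcoef(x[:-1], y[:-1])[0, 1]
```

Here dropping the point changes r substantially, so the point matters and its provenance should be checked; had the two answers been close, the advice above says not to worry about it.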
1.2.2 Smoothers
Once the scatter-plot is plotted, it is natural to try to summarize the underlying trend line. For example, consider the following data:
There are several common methods available to fit a line through these data.
By eye The eye has remarkable power for providing a reasonable approximation to an underlying trend, but it needs a little education. A trend curve is a good summary of a scatter-plot if the differences between the individual data points and the underlying trend line (technically called residuals) are small. As well, a good trend curve tries to minimize the total of the residuals, and the trend line should go through the middle of most of the data.
Although the eye often gives a good fit, different people will draw slightly different trend curves. Several automated ways to derive trend curves are in common use – bear in mind that the best ways of estimating trend curves try to mimic what the eye does so well.
Median or mean trace The idea is very simple. We choose a "window" width of size w, say. For each point along the bottom (X) axis, the smoothed value is the median or average of the Y-values for all data points with X-values lying within the window centred on this point. The trend curve is then the trace of these medians or means over the entire plot. The result is not exactly smooth. Generally, the wider the window chosen, the smoother the result; however, wider windows make the smoother react more slowly to changes in trend. Smoothing techniques are too computationally intensive to be performed by hand. Unfortunately, JMP is unable to compute the trace of data, but splines are a very good alternative (see below).
The mean or median trace is too unsophisticated to be a generally useful smoother. For example, the simple averaging causes it to under-estimate the heights of peaks and over-estimate the heights of troughs. (Can you see why this is so? Draw a picture with a peak.) However, it is a useful way of trying to summarize a pattern in a weak relationship for a moderately large data set. In a very weak relationship it can even help you to see the trend.
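Although JMP cannot compute the trace, the idea is easy to express directly. A minimal sketch in Python/NumPy (the function name and the choice of evaluation grid are my own):

```python
import numpy as np

def trace_smoother(x, y, window, grid=None, stat=np.median):
    """Median (or mean) trace: for each grid point, return stat() of the
    Y-values whose X-values lie within the window centred on that point."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if grid is None:
        grid = np.linspace(x.min(), x.max(), 50)
    half = window / 2.0
    smoothed = np.array([stat(y[(x >= g - half) & (x <= g + half)])
                         for g in grid])
    return grid, smoothed
```

Widening `window` smooths the trace but, as noted above, makes it react more slowly to changes in trend (and flattens peaks and troughs).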
Box plots for strips The following gives a conceptually simple method which is useful for exploring a weak relationship in a large data set. The X-axis is divided into equal-sized intervals, and separate box plots of the Y-values are found for each strip. The box-plots are plotted side-by-side and the means or medians are joined. Again, we are able to see what is happening to the variability as well as the trend, and the box plots carry even more detailed information about the shape of the Y-distribution. Again, this is too tedious to do by hand. It is possible to make this plot in JMP by creating a new variable that groups the values of the X variable into classes and then using the Analyze->Fit Y-by-X platform with these groupings. This is illustrated below:
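The grouping step can also be sketched outside JMP. A tentative pandas version (the function and column names are my own invention, and the example data are simulated):

```python
import numpy as np
import pandas as pd

def strip_summary(x, y, n_strips=5):
    """Divide the X-axis into equal-width strips and summarise the
    Y-values within each strip (the numbers a box plot would draw)."""
    df = pd.DataFrame({"x": x, "y": y})
    df["strip"] = pd.cut(df["x"], bins=n_strips)
    return df.groupby("strip", observed=True)["y"].describe()

# Simulated example: a weak positive trend with noise.
rng = np.random.default_rng(1)
x = rng.uniform(0.0, 10.0, 200)
y = x + rng.normal(0.0, 1.0, 200)
summary = strip_summary(x, y, n_strips=4)
```

The `describe()` output per strip (count, quartiles, min, max) is exactly the information a side-by-side box plot displays.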
Spline methods A spline is a series of short smooth curves that are joined together to create a larger smooth curve. The computational details are complex, but can be done in JMP. The stiffness of the spline controls how straight the resulting curve will be. The following shows two spline fits to the same data with different stiffness measures:
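Outside JMP, the same idea can be sketched with SciPy's smoothing splines, where the smoothing factor `s` plays roughly the role of JMP's stiffness (larger `s`, straighter curve); the data here are simulated:

```python
import numpy as np
from scipy.interpolate import UnivariateSpline

# Simulated noisy sine data.
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0.0, 10.0, 40))
y = np.sin(x) + rng.normal(0.0, 0.2, 40)

stiff = UnivariateSpline(x, y, s=40.0)    # stiff: close to a straight curve
flexible = UnivariateSpline(x, y, s=0.4)  # flexible: follows the points closely
```

The flexible fit hugs the data (small residuals); the stiff fit trades residual size for straightness, exactly the trade-off shown in the two JMP plots.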
1.3 Correlation
WARNING! Correlation is probably the most abused concept in statistics. Many people use the word 'correlation' to mean any type of association between two variables, but it has a very strict technical meaning: the strength of the apparent linear relationship between two interval- or ratio-scaled variables.
The correlation measure does not distinguish between explanatory and response variables, and it treats the two variables symmetrically. This means that the correlation between Y and X is the same as the correlation between X and Y.
Correlations are computed in JMP using the Analyze->Correlation of Y's platform. If there are several variables, the results are organized into a table; each cell in the table shows the correlation of the two corresponding variables. Because of symmetry (the correlation between variable 1 and variable 2 is the same as between variable 2 and variable 1), only part of the complete matrix may be shown. As well, the correlation between any variable and itself is always 1.
1.3.1 Scatter-plot matrix
To illustrate the ideas of correlation, look at the FITNESS dataset in the DATAMORE directory of JMP. This is a dataset on 31 people at a fitness centre; the following variables were measured on each subject:
• name
• gender
• age
• weight
• oxygen consumption (high values typically indicate more fit people)
• time to run one mile (1.6 km)
• average pulse rate during the run
• resting pulse rate
• maximum pulse rate during the run
We are interested in examining the relationships among the variables. For the moment, ignore the fact that the data contain both genders. [It would be interesting to assign different plotting symbols to the two genders to see if gender is a lurking variable.]
One of the first things to do is to create a scatter-plot matrix of all the variables. Use the Analyze->Correlation of Ys platform to get the following scatter-plot:
Interpreting the scatter plot matrix
The entries in the matrix are scatter-plots for all the pairs of variables. For example, the entry in row 1, column 3 represents the scatter-plot between age and oxygen consumption with age along the vertical axis and oxygen consumption along the horizontal axis, while the entry in row 3, column 1 has age along the horizontal axis and oxygen consumption along the vertical axis.
There is clearly a difference in the 'strength' of the relationships. Compare the scatter plot for average running pulse rate and maximum pulse rate (row 5, column 7) to that of running pulse rate and resting pulse rate (row 5, column 6), and to that of running pulse rate and weight (row 5, column 2).
Similarly, there is a difference in the direction of association. Compare the scatter plot for average running pulse rate and maximum pulse rate (row 5, column 7) with that for oxygen consumption and running time (row 3, column 4).
1.3.2 Correlation coefficient
It is possible to quantify the strength of association between two variables. As with all statistics, the way the data are collected influences the meaning of the statistics.
The population correlation coefficient between two variables is denoted by the Greek letter rho (ρ) and is computed as:

ρ = (1/N) Σ_{i=1..N} [(X_i − μ_X)/σ_X] [(Y_i − μ_Y)/σ_Y]

The corresponding sample correlation coefficient, denoted r, has a similar form:¹

r = (1/(n−1)) Σ_{i=1..n} [(X_i − X̄)/s_x] [(Y_i − Ȳ)/s_y]
If the sampling scheme is a simple random sample from the corresponding population, then r is an estimate of ρ. This is a crucial assumption: if the sampling is not a simple random sample, the above definition of the sample correlation coefficient should not be used! It is possible to find a confidence interval for ρ and to perform statistical tests that ρ is zero. However, for the most part, these are rarely done in ecological research and so will not be pursued further in this course.
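As a concrete sketch of the formula in Python/NumPy (as the footnote warns, this textbook form is for illustration; it is not the numerically best computing formula):

```python
import numpy as np

def sample_r(x, y):
    """Sample correlation via the textbook formula:
    r = (1/(n-1)) * sum of standardized X times standardized Y."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    zx = (x - x.mean()) / x.std(ddof=1)  # standardize with the sample sd
    zy = (y - y.mean()) / y.std(ddof=1)
    return float(np.sum(zx * zy) / (n - 1))
```

In practice one would use np.corrcoef(x, y) (or JMP's platform), which agrees with this definition but computes it more stably.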
The form of the formula does provide some insight into interpreting its value.
• ρ and r (unlike other population parameters and their estimates) are unitless measures.
• The sign of ρ and r is largely determined by where each (X, Y) pair falls relative to the two means: if both X and Y are above their means, or both are below, the pair contributes a positive value towards ρ or r; if one is above its mean and the other below, the pair contributes a negative value.
• ρ and r range from −1 to 1. A value of ρ or r equal to −1 implies a perfect negative correlation; a value of 1 implies a perfect positive correlation; a value of 0 implies no correlation. A perfect correlation (ρ or r equal to 1 or −1) implies that all points lie exactly on a straight line, but the slope of the line has NO effect on the correlation coefficient. This latter point is IMPORTANT and is often wrongly interpreted - give some examples.
¹ Note that this formula SHOULD NOT be used for the actual computation of r; it is numerically unstable and there are better computing formulae available.
• ρ and r are unaffected by linear transformations of the individual variables, e.g. unit changes such as converting from imperial to metric units.
• ρ and r measure only the linear association; they are not affected by the slope of the line, but only by the scatter about the line.
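Two of the points above (invariance under linear changes of units, and the irrelevance of the slope for a perfect line) are easy to check numerically; the height/weight numbers here are invented for illustration:

```python
import numpy as np

# Hypothetical height/weight data, invented purely for illustration.
inches = np.array([60.0, 67.0, 70.0, 74.0, 63.0])
pounds = np.array([110.0, 150.0, 172.0, 200.0, 135.0])

# A unit change (a linear transformation) leaves r untouched.
r_imperial = np.corrcoef(inches, pounds)[0, 1]
r_metric = np.corrcoef(inches * 2.54, pounds * 0.4536)[0, 1]  # cm, kg

# The slope of a perfect line has no effect on r: both give r = 1.
t = np.linspace(0.0, 1.0, 20)
r_shallow = np.corrcoef(t, 0.1 * t)[0, 1]
r_steep = np.corrcoef(t, 100.0 * t)[0, 1]
```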
Because correlation assumes both variables have an interval or ratio scale, it makes no sense to compute the correlation
• between gender and oxygen consumption (gender is nominal scale data);
• between non-linearly related variables (not shown on graph);
• for data collected without a known probability scheme. If a sampling scheme other than simple random sampling is used, it is possible to modify the estimation formula; if a non-probability sampling scheme was used, the patient is dead on arrival, and no amount of statistical wizardry will revive the corpse.
The data collection scheme for the fitness data set is unknown; we will have to assume that some sort of random sample from the relevant population was taken before we can make much sense of the numbers computed.
Before looking at the details of its computation, look at the sample correlation coefficients for each scatter plot above. These can be arranged into a matrix:
Variable    Age   Weight   Oxy   Runtime  RunPulse  RstPulse  MaxPulse
Age        1.00   -0.24   -0.31    0.19     -0.31     -0.15     -0.41
Weight    -0.24    1.00   -0.16    0.14      0.18      0.04      0.24
Oxy       -0.31   -0.16    1.00   -0.86     -0.39     -0.39     -0.23
Runtime    0.19    0.14   -0.86    1.00      0.31      0.45      0.22
RunPulse  -0.31    0.18   -0.39    0.31      1.00      0.35      0.92
RstPulse  -0.15    0.04   -0.39    0.45      0.35      1.00      0.30
MaxPulse  -0.41    0.24   -0.23    0.22      0.92      0.30      1.00
Notice that the sample correlation between any two variables is the same regardless of the ordering of the variables – this explains the symmetry of the matrix between the above- and below-diagonal elements. As well, each variable has a perfect sample correlation with itself – this explains the values of 1 along the main diagonal.
Compare the sample correlations between the average running pulse rate and the other variables with the corresponding scatter-plots above.
1.3.3 Cautions
• Random sampling required. Sample correlation coefficients are only valid under simple random samples. If the data were collected in a haphazard fashion, or if certain data points were oversampled, then the correlation coefficient may be severely biased.
• There are examples of high correlation but no practical use, and of low correlation but great practical use. These will be presented in class. This illustrates why I almost never talk about correlation.
• Correlation measures the 'strength' of a linear relationship; a curvilinear relationship may have a correlation of 0 even though there is still a strong relationship.
• The effect of outliers and high-leverage points will be presented in class.
• Effects of lurking variables. For example, suppose there is a positive association between the wages of male nurses and years of experience, and between the wages of female nurses and years of experience, but males are generally paid more than females. There is a positive correlation within each group, but an overall negative correlation when the data are pooled together.
• Ecological fallacy - the problem of correlation applied to averages. Even if there is a high correlation between the averages of two variables, it does not imply that there is a correlation between the individual data values.
For example, if you look at the average consumption of alcohol and the average consumption of cigarettes, there is a high correlation among the averages when the 12 values from the provinces and territories are plotted on a graph. However, the individual relationships within provinces can be reversed or non-existent, as shown below:
The relationship between cigarette consumption and alcohol consumption shows no relationship within each province, yet there is a strong correlation among the per-capita averages. This is an example of the ecological fallacy.
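A simulated sketch of the fallacy (all numbers invented): within each "province" the two variables are independent, yet the per-province averages are almost perfectly correlated.

```python
import numpy as np

rng = np.random.default_rng(42)
levels = np.arange(1.0, 13.0)  # 12 "provinces" with increasing overall level
groups = [(m + rng.normal(0.0, 0.3, 50),   # individual cigarette consumption
           m + rng.normal(0.0, 0.3, 50))   # individual alcohol, drawn independently
          for m in levels]

# Within any one province the two variables are unrelated by construction...
r_within = np.corrcoef(groups[0][0], groups[0][1])[0, 1]
# ...but the per-province averages are strongly correlated.
r_averages = np.corrcoef([g[0].mean() for g in groups],
                         [g[1].mean() for g in groups])[0, 1]
```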
• Correlation does not imply causation. This is the most frequent mistake people make. There is a set of principles of causal inference that need to be satisfied in order to imply cause and effect.
1.3.4 Principles of Causation
Types of association
An association may be found between two variables for several reasons (show causal modeling figures):
• There may be direct causation, e.g. smoking causes lung cancer.
• There may be a common cause, e.g. ice cream sales and the number of drownings both increase with temperature.
• There may be a confounding factor, e.g. highway fatalities decreased when speed limits were reduced to 55 mph at the same time that the oil crisis caused fuel supplies to be reduced and people drove fewer miles.
• There may be a coincidence, e.g. the population of Canada has increased at the same time as the moon has gotten closer by a few miles.
Establishing cause-and-effect
How do we establish a cause-and-effect relationship? Bradford Hill (Hill, A. B. 1971. Principles of Medical Statistics, 9th ed. New York: Oxford University Press) outlined seven criteria that have been adopted by many epidemiological researchers. It is generally agreed that most or all of the following must be considered before causation can be declared.
Strength of the association. The stronger an observed association appears over a series of different studies, the less likely the association is spurious because of bias.
Dose-response effect. The value of the response variable changes in a meaningful way with the dose (or level) of the suspected causal agent.
Lack of temporal ambiguity. The hypothesized cause precedes the occurrence of the effect. The ability to establish this time pattern will depend upon the study design used.
Consistency of the findings. Most, or all, studies concerned with a given causal hypothesis produce similar findings. Of course, studies dealing with a given question may all have serious bias problems that can diminish the importance of observed associations.
Biological or theoretical plausibility. The hypothesized causal relationship is consistent with current biological or theoretical knowledge. Note that the current state of knowledge may be insufficient to explain certain findings.
Coherence of the evidence. The findings do not seriously conflict with accepted facts about the outcome variable being studied.
Specificity of the association. The observed effect is associated with only the suspected cause (or with few other causes that can be ruled out).
IMPORTANT: NO CAUSATION WITHOUT MANIPULATION!
Examples:
Discuss the above in relation to:
• amount of studying vs. grades in a course.
• amount of clear-cutting and sediments in water.
• fossil-fuel burning and the greenhouse effect.
1.4 Single-variable regression
1.4.1 Introduction
Along with the Analysis of Variance, this is likely the most commonly used statistical methodology in ecological research. In virtually every issue of an ecological journal, you will find papers that use a regression analysis.
There are HUNDREDS of books written on regression analysis. Some of the better ones (IMHO) are:
Draper and Smith. Applied Regression Analysis. Wiley.
Neter, Wasserman, and Kutner. Applied Linear Statistical Models. Irwin.
Kleinbaum, Kupper, and Miller. Applied Regression Analysis. Duxbury.
Zar. Biostatistics. Prentice Hall.
Consequently, this set of notes is VERY brief and makes no pretense of being a thorough review of regression analysis. Please consult the above references for all the gory details.
It turns out that both Analysis of Variance and Regression are special cases of a more general statistical methodology called General Linear Models, which in turn are special cases of Generalized Linear Models (covered in Stat 402/602), which in turn are special cases of Generalized Additive Models, which in turn are special cases of .....
The key difference between a regression analysis and an ANOVA is that the X variable is nominal scaled in ANOVA, while in regression analysis the X variable is continuous scaled. This implies that in ANOVA the shape of the response profile is unspecified (the null hypothesis is that all means are equal, while the alternative is that at least one mean differs), whereas in regression the response profile must be a straight line.
Because both ANOVA and regression belong to the same class of statistical models, many of the assumptions are similar, the fitting methods are similar, and hypothesis testing and inference are similar as well.
1.4.2 Equation for a line - getting notation straight (no pun intended)

In order to use regression analysis effectively, it is important that you understand the concepts of slopes and intercepts and how to determine these from data values. This will be QUICKLY reviewed here in class.

In previous courses at high school or in linear algebra, the equation of a straight line was often written y = mx + b, where m is the slope and b is the intercept. In some popular spreadsheet programs, the authors decided to write the equation of a line as y = a + bx; now a is the intercept and b is the slope. Statisticians, for good reasons, have rationalized this notation and usually write the equation of a line as y = β₀ + β₁x or as Y = b₀ + b₁X (the distinction between β₀ and b₀ will be made clearer in a few minutes). The use of the subscript 0 to represent the intercept and the subscript 1 to represent the coefficient for the X variable then readily extends to more complex cases.

© 2012 Carl James Schwarz, November 23, 2012
CHAPTER 1. CORRELATION AND SIMPLE LINEAR REGRESSION

Recall the definition of the intercept as the value of Y when X = 0, and of the slope as the change in Y per unit change in X.
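As a quick illustration of these two definitions (the numbers below are invented for the example), the slope and intercept of the line through two points can be computed directly:

```python
# Two hypothetical data points assumed to lie on a straight line.
x1, y1 = 2.0, 7.0
x2, y2 = 5.0, 13.0

slope = (y2 - y1) / (x2 - x1)   # change in Y per unit change in X
intercept = y1 - slope * x1     # value of Y when X = 0

print(slope, intercept)  # -> 2.0 3.0
```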
1.4.3 Populations and samples

All of statistics is about detecting signals in the face of noise and about estimating population parameters from samples. Regression is no different.

First consider the population. As in previous chapters, the correct definition of the population is an important part of any study. Conceptually, we can think of the large set of all units of interest. On each unit there are, conceptually, both an X and a Y variable present. We wish to summarize the relationship between Y and X, and furthermore wish to make predictions of the Y value for future X values that may be observed from this population. [This is analogous to having different treatment groups corresponding to different values of X in ANOVA.]
If this were physics, we might conceive of a physical law between X and Y, e.g. F = ma or PV = nRT. However, in ecology the relationship between Y and X is much more tenuous. If you could draw a scatter-plot of Y against X for ALL elements of the population, the points would NOT fall exactly on a straight line. Rather, the value of Y would fluctuate above or below a straight line at any given X value. [This is analogous to saying that Y varies randomly around the treatment group mean in ANOVA.]
We denote this relationship as

    Y = β₀ + β₁X + ε

where β₀ and β₁ are now the POPULATION intercept and slope respectively. We say that

    E[Y] = β₀ + β₁X

is the expected or average value of Y at X. [In ANOVA, we let each treatment group have its own mean; here in regression we assume that the means must fall on a straight line.]

The term ε represents random variation of individual units in the population above and below the expected value. It is assumed to have constant standard deviation over the entire regression line (i.e. the spread of data points in the population is constant over the entire regression line). [This is analogous to the assumption of equal treatment population standard deviations in ANOVA.]
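This population model is easy to simulate. In the sketch below (all parameter values are invented for illustration, and numpy is assumed available), a large "population" is generated from Y = β₀ + β₁X + ε, and the average Y near one X value is checked against the line:

```python
import numpy as np

rng = np.random.default_rng(1)

beta0, beta1, sigma = 10.0, 2.5, 3.0       # hypothetical population values

x = rng.uniform(0, 20, size=100_000)       # X values across the population
eps = rng.normal(0, sigma, size=x.size)    # same SD at every X (key assumption)
y = beta0 + beta1 * x + eps                # Y = beta0 + beta1*X + epsilon

# E[Y] at X = 10 should be beta0 + beta1*10 = 35;
# individual Y values scatter above and below that point on the line.
near_10 = y[(x > 9.9) & (x < 10.1)]
print(near_10.mean())                      # close to 35
```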
Of course, we can never measure all units of the population, so a sample must be taken in order to estimate the population slope, population intercept, and population standard deviation. Unlike a correlation analysis, it is NOT necessary to select a simple random sample from the entire population, and more elaborate schemes can be used. The bare minimum that must be achieved is that for any individual X value found in the sample, the units in the population that share this X value must have been selected at random.
This is quite a relaxed assumption! For example, it allows us to deliberately choose values of X from the extremes and then, only at those X values, randomly select from the relevant subset of the population, rather than having to select at random from the population as a whole. [This is analogous to the assumptions made in an analytical survey, where we assumed that even though we can't randomly assign a treatment to a unit (e.g. we can't assign sex to an animal), we must ensure that animals are randomly selected from each group.]

Once the data points are selected, the estimation process can proceed, but not before assessing the assumptions!
1.4.4 Assumptions

The assumptions for a regression analysis are very similar to those found in ANOVA.

Linearity
Regression analysis assumes that the relationship between Y and X is linear. Make a scatter-plot of Y against X to assess this assumption. Perhaps a transformation is required (e.g. log(Y) vs. log(X)); some caution is required with transformations in dealing with the error structure, as you will see in later examples.

There are several checks. First, plot the residuals against the X values: if the scatter is not random around 0 but shows some pattern (e.g. a quadratic curve), this usually indicates that the relationship between Y and X is not linear. Second, fit a model that includes X and X² and test if the coefficient associated with X² is zero; unfortunately, this test could fail to detect a higher-order relationship. Third, if there are multiple readings at some X values, then a test of goodness-of-fit can be performed where the variation of the responses at the same X value is compared to the variation around the regression line.
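The second check can be sketched as an extra-sum-of-squares F test. The data below are simulated with deliberate curvature (all numbers invented; numpy and scipy assumed available):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
x = np.linspace(0, 10, 40)
y = 1 + 2 * x + 0.3 * x**2 + rng.normal(0, 1, x.size)  # deliberately curved

def sse(degree):
    """Residual sum of squares from a least-squares polynomial fit."""
    fit = np.polyfit(x, y, degree)
    return np.sum((y - np.polyval(fit, x)) ** 2)

sse_linear, sse_quadratic = sse(1), sse(2)

# F test: does adding the X^2 term reduce the residual SS more than chance?
n = x.size
F = (sse_linear - sse_quadratic) / (sse_quadratic / (n - 3))
p = stats.f.sf(F, 1, n - 3)
print(F, p)   # a tiny p-value says the straight line is inadequate
```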
Correct scale of predictor and response

The response and predictor variables must both have interval or ratio scale. In particular, using a numerical value to represent a category and then using this numerical value in a regression is not valid. For example, suppose that you code hair color as (1 = red, 2 = brown, and 3 = black). Then using these values in a regression, either as a predictor variable or as a response variable, is not sensible.
Correct sampling scheme

The Y values must be a random sample from the population of Y values for every X value in the sample. Fortunately, it is not necessary to have a completely random sample from the population, as the regression line is valid even if the X values are deliberately chosen. However, for a given X, the values from the population must be a simple random sample.
No outliers or influential points

All the points must belong to the relationship; there should be no unusual points. The scatter-plot of Y vs. X should be examined. If in doubt, fit the model with the points in and out of the fit and see if this makes a difference in the fit.

Outliers can have a dramatic effect on the fitted line. For example, in the following graph, the single point is both an outlier and an influential point:
Equal variation along the line

The variability about the regression line is similar for all values of X, i.e. the scatter of the points above and below the fitted line should be roughly constant over the entire line. This is assessed by looking at plots of the residuals against X to see if the scatter is roughly uniform around zero, with no increase and no decrease in spread over the entire line.
Independence

Each value of Y is independent of any other value of Y. The most common case where this fails is time-series data, where X is a time measurement. In these cases, time-series analysis should be used.

This assumption can be assessed by again looking at residual plots against time or other variables.
Normality of errors

The difference between the value of Y and the expected value of Y is assumed to be normally distributed. This is one of the most misunderstood assumptions. Many people erroneously assume that the distribution of Y over all X values must be normally distributed, i.e. they look simply at the distribution of the Y's, ignoring the X's. The assumption only states that the residuals, the differences between the values of Y and the points on the line, must be normally distributed.

This can be assessed by looking at normal probability plots of the residuals. As in ANOVA, for small sample sizes you have little power to detect non-normality, and for large sample sizes it is not that important.
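One way to check this on a fitted line is to apply a normality test to the residuals rather than to the raw Y values. A sketch with simulated data (numbers invented), using scipy's Shapiro-Wilk test:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
x = rng.uniform(0, 10, 50)
y = 4 + 1.5 * x + rng.normal(0, 2, x.size)   # straight line + normal errors

b1, b0 = np.polyfit(x, y, 1)
residuals = y - (b0 + b1 * x)                # these, not Y, should look normal

# Shapiro-Wilk test applied to the residuals; a large p-value is
# consistent with (though does not prove) normally distributed errors.
stat, p = stats.shapiro(residuals)
print(stat, p)
```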
X measured without error

This is a new assumption for regression as compared to ANOVA. In ANOVA, the group membership was always "exact", i.e. the treatment applied to an experimental unit was known without ambiguity. However, in regression, it can turn out that the X value may not be known exactly.

This general problem is called the "error in variables" problem and has a long history in statistics.

It turns out that there are two important cases. If the value reported for X is a nominal value and the actual value of X varies randomly around this nominal value, then there is no bias in the estimates. This is called the Berkson case, after Berkson, who first examined this situation. The most common cases are those where the recorded X is a target value (e.g. temperature as set by a thermostat) while the actual X that occurs varies randomly around this target value.
However, if the value used for X is an actual measurement of the true underlying X, then there is uncertainty in both the X and Y directions. In this case, estimates of the slope are attenuated towards zero (i.e. positive slopes are biased downwards, negative slopes biased upwards). More alarmingly, the estimates are no longer consistent, i.e. as the sample size increases, the estimates no longer tend to the true population values! For example, suppose that the yield of a crop is related to the amount of rainfall. A rain gauge may not be located exactly at the plot where the crop is grown; rainfall may instead be recorded at a nearby weather station a fair distance away. The reading at the weather station is NOT a true reflection of the rainfall at the test plot.

This latter case of "error in variables" is very difficult to analyze properly, and there are no universally accepted solutions. Refer to the reference books listed at the start of this chapter for more details.
The problem is set up as follows. Let

    Yᵢ = ηᵢ + εᵢ
    Xᵢ = ξᵢ + δᵢ
with the straight-line relationship between the true (but unobserved) values:

    ηᵢ = β₀ + β₁ξᵢ

Note that the (true, but unknown) regression equation uses ξᵢ rather than the observed (with error) values Xᵢ.

Now if the regression is done on the observed X (i.e. the error-prone measurement), the regression equation reduces to:

    Yᵢ = β₀ + β₁Xᵢ + (εᵢ − β₁δᵢ)

Now this violates the independence assumption of ordinary least squares, because the new "error" term is not independent of the Xᵢ variable.
If an ordinary least squares model is fit, the estimated slope is biased (Draper and Smith, 1998, p. 90) with

    E[β̂₁] = β₁ − β₁ r(ρ + r) / (1 + 2ρr + r²)

where ρ is the correlation between ξ and δ, and r is the ratio of the variance of the error in X to the error in Y.

The bias is negative, i.e. the estimated slope is too small, in most practical cases (ρ + r > 0). This is known as attenuation of the estimate and, in general, pulls the estimate towards zero.
The bias will be small in the following cases:

• the error variance of X is small relative to the error variance in Y. This means that r is small (i.e. close to zero), and so the bias is also small. In the case where X is measured without error, r = 0 and the bias vanishes, as expected.

• if the X are fixed (the Berkson case) and actually used², then ρ + r = 0 and the bias also vanishes.

The proper analysis of the error-in-variables case is quite complex; see Draper and Smith (1998, p. 91) for more details.
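The two cases are easy to see in a simulation. In the sketch below (all numbers invented; numpy assumed available), classical measurement error in X attenuates the estimated slope toward zero, while the Berkson case leaves it essentially unbiased:

```python
import numpy as np

rng = np.random.default_rng(4)
beta0, beta1, n = 1.0, 2.0, 200_000         # true line; values made up

# Classical error-in-variables: regress Y on X = xi + delta.
xi = rng.uniform(0, 10, n)                  # true (unobserved) X values
y = beta0 + beta1 * xi + rng.normal(0, 1, n)
x_obs = xi + rng.normal(0, 2, n)            # error-prone measurement of X
b1_classical = np.polyfit(x_obs, y, 1)[0]

# Berkson case: X is a fixed target; the true value varies around it.
x_target = rng.uniform(0, 10, n)
xi_b = x_target + rng.normal(0, 2, n)
y_b = beta0 + beta1 * xi_b + rng.normal(0, 1, n)
b1_berkson = np.polyfit(x_target, y_b, 1)[0]

# Expected attenuation factor here: Var(xi)/(Var(xi)+Var(delta)) ~ 8.33/12.33
print(b1_classical, b1_berkson)   # roughly 1.35 vs. roughly 2.0
```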
1.4.5 Obtaining Estimates

To distinguish between population parameters and sample estimates, we denote the sample intercept by b₀ and the sample slope by b₁. The equation fitted to a particular sample of points is expressed as Ŷᵢ = b₀ + b₁Xᵢ, where b₀ is the estimated intercept and b₁ is the estimated slope. The symbol Ŷ indicates that we are referring to the estimated line and not to a line in the entire population.
² For example, a thermostat measures (with error) the actual temperature of a room. But if the experiment is based on the thermostat readings rather than the (true) unknown temperature, this corresponds to the Berkson case.
How is the best-fitting line found when the points are scattered? We typically use the principle of least squares. The least-squares line is the line that makes the sum of the squares of the deviations of the data points from the line, in the vertical direction, as small as possible.

Mathematically, the least-squares line is the line that minimizes

    Σᵢ₌₁ⁿ (Yᵢ − Ŷᵢ)²

where Ŷᵢ is the point on the line corresponding to each X value. This is also known as the predicted value of Y for a given value of X. This formal definition of least squares is not that important; the concept as expressed in the previous paragraph is more important, in particular that it is the SQUARED deviation in the VERTICAL direction that is used.

It is possible to write out a formula for the estimated intercept and slope, but who cares; let the computer do the dirty work.
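For the record, the "dirty work" amounts to two short formulas. A sketch on made-up numbers (numpy assumed available); np.polyfit performs the same minimization:

```python
import numpy as np

# Toy data; any paired (x, y) arrays would do.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Closed-form least-squares estimates:
#   b1 = sum((x - xbar)(y - ybar)) / sum((x - xbar)^2),  b0 = ybar - b1*xbar
xbar, ybar = x.mean(), y.mean()
b1 = np.sum((x - xbar) * (y - ybar)) / np.sum((x - xbar) ** 2)
b0 = ybar - b1 * xbar

# Same answer from the library routine (highest-degree coefficient first).
b1_np, b0_np = np.polyfit(x, y, 1)
print(b0, b1)
```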
The estimated intercept (b₀) is the estimated value of Y when X = 0. In some cases it is meaningless to talk about values of Y when X = 0 because X = 0 is nonsensical. For example, in a plot of income vs. year, it seems kind of silly to investigate income in year 0. In these cases, there is no clear interpretation of the intercept, and it merely serves as a placeholder for the line.

The estimated slope (b₁) is the estimated change in Y per unit change in X. For every unit change in the horizontal direction, the fitted line rises by b₁ units. If b₁ is negative, the fitted line points downwards, and the "increase" in the line is negative, i.e. actually a decrease.
As with all estimates, a measure of precision can be obtained. As before, this is the standard error of each of the estimates. Again, there are computational formulae but, in this age of computers, these are not important. As before, approximate 95% confidence intervals for the corresponding population parameters are found as estimate ± 2 × se.
Formal tests of hypotheses can also be done. Usually these are only done on the slope parameter, as this is typically of most interest. The null hypothesis is that the population slope is 0, i.e. there is no relationship between Y and X (can you draw a scatter-plot showing such a relationship?). More formally, the null hypothesis is

    H: β₁ = 0

Again notice that the null hypothesis is ALWAYS in terms of a population parameter and not in terms of a sample statistic.

The alternate hypothesis is typically chosen as

    A: β₁ ≠ 0

although one-sided tests looking for either a positive or a negative slope are possible.

The test statistic is found as

    T = (b₁ − 0) / se(b₁)

and is compared to a t-distribution with the appropriate degrees of freedom to obtain the p-value. This is usually done automatically by most computer packages. The p-value is interpreted in exactly the same way as in
ANOVA, i.e. it measures the probability of observing these data if the hypothesis of no relationship were true.

As before, the p-value does not tell the whole story; statistical vs. biological (non)significance must be determined and assessed.
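Numerically, the test takes one line once b₁ and its standard error are known. The numbers below are illustrative stand-ins (close to those in the fertilizer example later in this section); scipy supplies the t-distribution:

```python
from scipy import stats

b1, se_b1, n = 1.10, 0.132, 11          # illustrative estimate, se, sample size

T = (b1 - 0) / se_b1                    # test statistic for H: beta1 = 0
p = 2 * stats.t.sf(abs(T), df=n - 2)    # two-sided p-value on n - 2 df

ci = (b1 - 2 * se_b1, b1 + 2 * se_b1)   # approximate 95% CI: estimate +/- 2 se
print(T, p, ci)
```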
1.4.6 Obtaining Predictions

Once the best-fitting line is found, it can be used to make predictions for new values of X.

There are two types of predictions that are commonly made. It is important to distinguish between them, as these two intervals are the source of much confusion in regression problems.

First, the experimenter may be interested in predicting a SINGLE future individual value for a particular X. Second, the experimenter may be interested in predicting the AVERAGE of ALL future responses at a particular X.³ The prediction interval for an individual response is sometimes called a confidence interval for an individual response, but this is an unfortunate (and incorrect) use of the term confidence interval. Strictly speaking, confidence intervals are computed for fixed unknown parameter values; prediction intervals are computed for future random variables.

Both of the above intervals should be distinguished from the confidence interval for the slope.
In both cases, the estimate is found in the same manner: substitute the new value of X into the equation and compute the predicted value Ŷ. In most computer packages this is accomplished by inserting a new "dummy" observation in the dataset with the value of Y missing but the value of X present. The missing Y value prevents this new observation from being used in the fitting process, but the X value allows the package to compute an estimate for this observation.
What differs between the two predictions are the estimates of uncertainty.

In the first case, there are two sources of uncertainty involved in the prediction. First, there is the uncertainty caused by the fact that the estimated line is based upon a sample. Then there is the additional uncertainty that the value could be above or below the predicted line. This interval is often called a prediction interval at a new X.

In the second case, only the uncertainty caused by estimating the line based on a sample is relevant. This interval is often called a confidence interval for the mean at a new X.

The prediction interval for an individual response is typically MUCH wider than the confidence interval for the mean of all future responses, because it must account for the uncertainty from the fitted line plus individual variation around the fitted line.
Many textbooks give the formulae for the se of the two types of predictions but, again, there is little to be gained by examining them. What is important is that you read the documentation carefully to ensure that you understand exactly what interval is being given to you.

³ There is actually a third interval, the mean of the next "m" individual values, but this is rarely encountered in practice.
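As a sketch of how the two intervals differ (the standard simple-regression formulas are assumed here; your package's documentation remains the authority for what it reports), on simulated data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
x = rng.uniform(0, 10, 30)
y = 3 + 2 * x + rng.normal(0, 1.5, x.size)    # simulated sample

n = x.size
b1, b0 = np.polyfit(x, y, 1)
s = np.sqrt(np.sum((y - (b0 + b1 * x)) ** 2) / (n - 2))   # residual SD

x0 = 7.0                                      # the new X of interest
yhat = b0 + b1 * x0                           # same point estimate for both
h = 1 / n + (x0 - x.mean()) ** 2 / np.sum((x - x.mean()) ** 2)
se_mean = s * np.sqrt(h)                      # line uncertainty only
se_pred = s * np.sqrt(1 + h)                  # plus individual scatter

t = stats.t.ppf(0.975, n - 2)
ci = (yhat - t * se_mean, yhat + t * se_mean) # CI for the MEAN at x0
pi = (yhat - t * se_pred, yhat + t * se_pred) # PI for a SINGLE new Y at x0
print(ci, pi)   # the prediction interval is the wider of the two
```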
1.4.7 Residual Plots

After the curve is fit, it is important to examine whether the fitted curve is reasonable. This is done using residuals. The residual for a point is the difference between the observed value and the predicted value, i.e. the residual from fitting a straight line is found as: residualᵢ = Yᵢ − (b₀ + b₁Xᵢ) = (Yᵢ − Ŷᵢ).

There are several standard residual plots:

• plot of residuals vs. predicted values (Ŷ);
• plot of residuals vs. X;
• plot of residuals vs. time ordering.

In all cases, the residual plots should show random scatter around zero with no obvious pattern. Don't plot residuals vs. Y; this will lead to odd-looking plots which are an artifact of the plotting and don't mean anything.
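The residual computation itself is one line; a sketch on simulated data (the plotting calls, e.g. matplotlib's plt.scatter, are left as comments since only the pattern in the plots matters):

```python
import numpy as np

rng = np.random.default_rng(6)
x = rng.uniform(0, 10, 40)
y = 5 + 1.2 * x + rng.normal(0, 1, x.size)   # simulated sample

b1, b0 = np.polyfit(x, y, 1)
predicted = b0 + b1 * x
residuals = y - predicted                     # observed minus predicted

# The three standard plots would be, e.g. with matplotlib:
#   plt.scatter(predicted, residuals)   # residuals vs. predicted
#   plt.scatter(x, residuals)           # residuals vs. X
#   plt.plot(residuals)                 # residuals vs. time ordering
# Each should show patternless scatter around zero.
print(residuals.mean())   # essentially zero by construction of least squares
```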
1.4.8 Example - Yield and fertilizer

We wish to investigate the relationship between yield (Liters) and fertilizer (kg/ha) for tomato plants. An experiment was conducted in the Schwarz household one summer on 11 plots of land where the amount of fertilizer was varied and the yield measured at the end of the season.

The amount of fertilizer (randomly) applied to each plot was chosen between 5 and 18 kg/ha. While the levels were not systematically chosen (e.g. they were not evenly spaced between the highest and lowest values), they represent commonly used amounts based on a preliminary survey of producers. At the end of the experiment, the yields were measured and the following data were obtained.

Interest also lies in predicting the yield when 16 kg/ha are assigned.
Fertilizer (kg/ha)   Yield (Liters)
        12                 24
         5                 18
        15                 31
        17                 33
        14                 30
         6                 20
        11                 25
        13                 27
        15                 31
         8                 21
        18                 29
The raw data are also available in a JMP datasheet called fertilizer.jmp, available from the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms.

In this study, it is quite clear that the fertilizer is the predictor (X) variable, while the response variable (Y) is the yield.
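As a cross-check of the JMP results reported later in this section, the same fit can be sketched with numpy (the data are typed in from the table above):

```python
import numpy as np

fert = np.array([12, 5, 15, 17, 14, 6, 11, 13, 15, 8, 18], dtype=float)
yield_l = np.array([24, 18, 31, 33, 30, 20, 25, 27, 31, 21, 29], dtype=float)

b1, b0 = np.polyfit(fert, yield_l, 1)
print(b0, b1)             # about 12.856 and 1.10137

# Standard error of the slope: s / sqrt(Sxx), with s from the residuals.
n = fert.size
resid = yield_l - (b0 + b1 * fert)
s = np.sqrt(np.sum(resid ** 2) / (n - 2))
se_b1 = s / np.sqrt(np.sum((fert - fert.mean()) ** 2))
print(se_b1)              # about 0.132

print(b0 + b1 * 16)       # predicted yield at 16 kg/ha, about 30.5 L
```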
The population consists of all possible field plots with all possible tomato plants of this type grown under all possible fertilizer levels between about 5 and 18 kg/ha.

If all of the population could be measured (which it can't), you could find a relationship between the yield and the amount of fertilizer applied. This relationship would have the form

    Y = β₀ + β₁ × (amount of fertilizer) + ε

where β₀ and β₁ represent the true population intercept and slope respectively. The term ε represents random variation that is always present, i.e. even if the same plot were grown twice in a row with the same amount of fertilizer, the yield would not be identical (why?).
The population parameters to be estimated are β₀, the true average yield when the amount of fertilizer is 0, and β₁, the true average change in yield per unit change in the amount of fertilizer. These are taken over all plants in all possible field plots of this type. The values of β₀ and β₁ are impossible to obtain, as the entire population could never be measured.

Here is the data entered into a JMP data sheet. Note the scale of both variables (continuous) and that an extra row was added to the data table with the value of 16 for the fertilizer and the yield left missing.
The ordering of the rows in the data table is NOT important; however, it is often easier to find individual data points if the data are sorted by the X value and the rows for future predictions are placed at the end of the dataset. Notice how missing values are represented.

Use the Analyze->Fit Y-by-X platform to start the analysis. Specify the Y and X variables as needed.
Notice that JMP "reminds" you of the analysis that you will obtain based on the scale of the X and Y variables, as shown in the bottom left of the menu. In this case, both X and Y have a continuous scale, so JMP will perform a bivariate fitting procedure. It starts by showing the scatter-plot between yield (Y) and fertilizer (X).
The relationship looks approximately linear; there don't appear to be any outlier or influential points; and the scatter appears to be roughly equal across the entire regression line. Residual plots will be used later to check these assumptions in more detail.

The drop-down menu item (from the red triangle beside the Bivariate Fit...) allows you to fit the least-squares line. This produces much output, but the three important parts of the output are discussed below.

First, the actual fitted line is drawn on the scatter-plot, and the equation of the fitted line is printed below the plot.
The estimated regression line is

    Ŷ = b₀ + b₁(fertilizer) = 12.856 + 1.10137 × (amount of fertilizer)

In terms of estimates, b₀ = 12.856 is the estimated intercept, and b₁ = 1.101 is the estimated slope.

The estimated slope is the estimated change in yield when the amount of fertilizer is increased by 1 unit. In this case, the yield is expected to increase (why?) by 1.10137 L when the fertilizer amount is increased by 1 kg/ha. NOTE that the slope is the CHANGE in Y when X increases by 1 unit, not the value of Y when X = 1.

The estimated intercept is the estimated yield when the amount of fertilizer is 0. In this case, the estimated yield when no fertilizer is added is 12.856 L. In this particular case the intercept has a meaningful interpretation, but I'd be worried about extrapolating outside the range of the observed X values. If the intercept is 12.85, why does the line intersect the left part of the graph at about 15 rather than closer to 13?
Once again, these are the results from a single experiment. If the experiment were repeated, you would obtain different estimates (b0 and b1 would change). The sampling distribution over all possible experiments would describe the variation in b0 and b1. The standard deviation of b0 and b1 over all possible experiments is again referred to as the standard error of b0 and b1.
The formulae for the standard errors of b0 and b1 are messy, and hopeless to compute by hand. Just like inference for a mean or a proportion, the program automatically computes the se of the regression estimates.

The estimated standard error for b1 (the estimated slope) is 0.132 L/kg. This is an estimate of the standard deviation of b1 over all possible experiments. Normally, the intercept is of limited interest, but a standard error can also be found for it as shown in the above table.

Using exactly the same logic as when we found a confidence interval for the population mean, or for the population proportion, a confidence interval for the population slope (β1) is found (approximately) as

b1 ± 2(estimated se)

In the above example, an approximate confidence interval for β1 is found as

1.101 ± 2 × (.132) = 1.101 ± .264 = (.837 → 1.365) L/kg

of fertilizer applied.
An "exact" confidence interval can be computed by JMP as shown above.⁴ The "exact" confidence interval is based on the t-distribution and is slightly wider than our approximate confidence interval because the total sample size (11 pairs of points) is rather small. We interpret this interval as 'being 95% confident that the true increase in yield when the amount of fertilizer is increased by one unit is somewhere between .837 and 1.365 L/kg.'
Be sure to carefully distinguish between β1 and b1. Note that the confidence interval is computed using b1, but is a confidence interval for β1 - the population parameter that is unknown.
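The arithmetic of the approximate interval above can be checked in a few lines, using the estimates reported in the output:

```python
# Approximate 95% CI for the slope: b1 +/- 2*se, using the reported
# estimates b1 = 1.101 L/kg and se = 0.132 L/kg.
b1, se = 1.101, 0.132
lower, upper = b1 - 2 * se, b1 + 2 * se
print(round(lower, 3), round(upper, 3))  # 0.837 1.365
```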
In linear regression problems, one hypothesis of interest is whether the true slope is zero. This would correspond to no linear relationship between the response and predictor variable (why?). Again, this is a good time to read the papers by Cherry and Johnson about the dangers of uncritical use of hypothesis testing. In many cases, a confidence interval tells the entire story.
JMP produces a test of the hypothesis that each of the parameters (the slope and the intercept in the population) is zero. The output is reproduced again below:

⁴ If your table doesn't show the confidence interval, use a Control-Click or Right-Click in the table and select the columns to be displayed.
The test of hypothesis about the intercept is not of interest (why?).

Let

• β1 be the true (unknown) slope.
• b1 be the estimated slope. In this case b1 = 1.1014.

The hypothesis testing proceeds as follows. Again note that we are interested in the population parameters and not the sample statistics.
1. Specify the null and alternate hypotheses:

   H: β1 = 0
   A: β1 ≠ 0.

   Notice that the null hypothesis is in terms of the population parameter β1. This is a two-sided test as we are interested in detecting differences from zero in either direction.
2. Find the test statistic and the p-value. The test statistic is computed as:

   T = (estimate − hypothesized value) / (estimated se) = (1.1014 − 0) / .132 = 8.36

   In other words, the estimate is over 8 standard errors away from the hypothesized value! This will be compared to a t-distribution with n − 2 = 9 degrees of freedom. The p-value is found to be very small (less than 0.0001).
3. Conclusion. There is strong evidence that the true slope is not zero. This is not too surprising given that the 95% confidence interval shows that plausible values for the true slope are from about .8 to about 1.4.
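The steps above can be sketched numerically (assuming scipy is available). Because the rounded values b1 = 1.1014 and se = 0.132 are used, the statistic comes out near, but not exactly at, the 8.36 shown in the JMP output:

```python
from scipy import stats

# Test H: beta1 = 0 against A: beta1 != 0, using the reported estimates.
b1, se, hypothesized, df = 1.1014, 0.132, 0.0, 9  # df = n - 2 with n = 11
T = (b1 - hypothesized) / se
p_value = 2 * stats.t.sf(abs(T), df)  # two-sided p-value
print(T, p_value)
```

The p-value is far below 0.0001, matching the conclusion in the notes.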
It is possible to construct tests of the slope equal to some value other than 0. Most packages can't do this directly; you would compute the T value as shown above, replacing the value 0 with the hypothesized value. It is also possible to construct one-sided tests. Most computer packages only do two-sided tests. Proceed as above, but the one-sided p-value is the two-sided p-value reported by the package divided by 2.
If sufficient evidence is found against the hypothesis, a natural question to ask is 'well, what values of the parameter are plausible given this data?' This is exactly what a confidence interval tells you. Consequently, I usually prefer to find confidence intervals rather than doing formal hypothesis testing.
What about making predictions for future yields when certain amounts of fertilizer are applied? For example, what would be the future yield when 16 kg/ha of fertilizer is applied?

The predicted value is found by substituting the new X into the estimated regression line:

Ŷ = b0 + b1(fertilizer) = 12.856 + 1.10137(16) = 30.48 L
This can also be found by using the cross-hairs tool on the actual graph (to be demonstrated in class). JMP can compute the predicted value by selecting the appropriate option under the drop-down menu in the Linear Fit item, and then going back to look at the new column in the data table.
As noted earlier, there are two types of estimates of precision associated with predictions using the regression line. It is important to distinguish between them, as these two intervals are the source of much confusion in regression problems.

First, the experimenter may be interested in predicting a single FUTURE individual value for a particular X. This would correspond to the predicted yield for a single future plot with 16 kg/ha of fertilizer added.

Second, the experimenter may be interested in predicting the average of ALL FUTURE responses at a particular X. This would correspond to the average yield for all future plots when 16 kg/ha of fertilizer is added.

The prediction interval for an individual response is sometimes called a confidence interval for an individual response, but this is an unfortunate (and incorrect) use of the term confidence interval. Strictly speaking, confidence intervals are computed for fixed unknown parameter values; prediction intervals are computed for future random variables.
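The distinction shows up directly in the standard simple-regression formulas: the prediction interval carries an extra "+1" under the square root, for the variability of the single future observation. A sketch follows, with purely illustrative values for s, n, x̄, Sxx and the t critical value (these are not the fertilizer results):

```python
import math

def interval_half_widths(s, n, xbar, Sxx, x0, tcrit):
    """Half-widths at x = x0 of the confidence interval for the MEAN
    response and the prediction interval for a SINGLE future response,
    using the standard simple linear regression formulas."""
    se_mean = s * math.sqrt(1.0 / n + (x0 - xbar) ** 2 / Sxx)
    se_pred = s * math.sqrt(1.0 + 1.0 / n + (x0 - xbar) ** 2 / Sxx)
    return tcrit * se_mean, tcrit * se_pred

# Illustrative values only.
ci_half, pi_half = interval_half_widths(s=2.0, n=11, xbar=20.0,
                                        Sxx=250.0, x0=16.0, tcrit=2.262)
print(ci_half, pi_half)
```

Because of the extra "+1", the prediction interval is always wider than the confidence interval for the mean, matching the plots below.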
Both intervals can be computed and plotted by JMP by again using the pop-down menu beside the Linear Fit box:
In this menu, the Confid Curves Fit correspond to confidence intervals for the MEAN response, while the Confid Curves Indiv correspond to prediction intervals for a future single response. Both can be plotted on the graph. Unfortunately, there does not appear to be a way to save the prediction limits into a data table from this platform - the cross-hairs tool must be used, or the Analyze->Fit Model platform should be used.
The innermost set of lines represents the confidence bands for the mean response. The outermost band of lines represents the prediction intervals for a single future response. As noted earlier, the latter must be wider than the former to account for an additional source of variation.

The numerical values from the Analyze->Fit Model platform are shown below:

Here the predicted yield for a single future trial at 16 kg/ha is 30.5 L, but the 95% prediction interval is between 26.1 and 34.9 L. The predicted AVERAGE yield for ALL future plots when 16 kg/ha of fertilizer is applied is also 30.5 L, but the 95% confidence interval for the MEAN yield is between 28.8 and 32.1 L.
Finally, residual plots can be made using the pop-down menu:

The residuals are simply the difference between the actual data point and the corresponding spot on the line, measured in the vertical direction. The residual plot shows no trend in the scatter around the value of zero.
The same items are available from the Analyze->Fit Model platform. Here you would specify Yield as the Y variable and Fertilizer as the X variable, in much the same way as in the Analyze->Fit Y-by-X platform. Much of the same output is produced. Additionally, you can save the actual confidence bounds for predictions into the data table (as shown above). This will be demonstrated in class.
1.4.9 Example - Mercury pollution
Mercury pollution is a serious problem in some waterways. Mercury levels often increase after a lake is flooded due to leaching of naturally occurring mercury by the higher levels of the water. Excessive consumption of mercury is well known to be deleterious to human health. It is difficult and time consuming to measure every person's mercury level. It would be nice to have a quick procedure that could be used to estimate the mercury level of a person based upon the average mercury level found in fish and estimates of the person's consumption of fish in their diet. The following data were collected on the methyl mercury intake of subjects and the actual mercury levels recorded in the blood stream from a random sample of people around recently flooded lakes.

Here are the raw data:
Methyl Mercury Intake    Mercury in whole blood
(ug Hg/day)              (ng/g)
180                       90
200                      120
230                      125
410                      290
600                      310
550                      290
275                      170
580                      375
600                      150
105                       70
250                      105
 60                      205
650                      480
The data are available in a JMP datasheet called mercury.jmp available from the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms.

The ordering of the rows in the data table is NOT important; however, it is often easier to find individual data points if the data are sorted by the X value and the rows for future predictions are placed at the end of the dataset. Notice how missing values are represented.
The population of interest is the people around recently flooded lakes.

This experiment is an analytical survey, as it is quite impossible to randomly assign people different amounts of mercury in their food intake. Consequently, the key assumption is that the subjects chosen to be measured are random samples from those with similar mercury intakes. Note it is NOT necessary for this to be a random sample from the ENTIRE population (why?).
The explanatory variable is the amount of mercury ingested by a person. The response variable is the amount of mercury in the blood stream.

We start by producing the scatter-plot.
There appear to be two outliers (identified by an X). To illustrate the effects of these outliers upon the estimates and the residual plots, the line was first fit using all of the data.
The residual plot shows the clear presence of the two outliers, but also identifies a third potential outlier not evident from the original scatter-plot (can you find it?).

The data were rechecked, and it appears that there was an error in the blood work used in determining the readings. Consequently, these points were removed for the subsequent fit.
The estimated regression line (after removing outliers) is

Blood = −1.951691 + 0.581218 × Intake

The estimated slope of 0.58 indicates that the mercury level in the blood increases by 0.58 ng/g when the intake level in the food is increased by 1 ug/day. The intercept has no real meaning in the context of this experiment; the negative value is merely a placeholder for the line. Also notice that the estimated intercept is not very precise in any case (how do I know this, and what implications does this have for worrying that it is not zero?).⁵
What would have been the impact upon the estimated slope and intercept if the outliers had been retained?

The estimated slope has been determined relatively well (relative standard error of about 10% - how is the relative standard error computed?). There is clear evidence that the hypothesis of no relationship between blood mercury levels and food mercury levels is not tenable.
The two types of predictions would also be of interest in this study. First, an individual would like to know the impact upon personal health. Second, the average level would be of interest to public health authorities.

JMP was used to plot both intervals on the scatter-plot:
1.4.10 Example - The Anscombe Data Set

Anscombe (1973, American Statistician 27, 17-21) created a set of 4 data sets that were quite remarkable. All four datasets give exactly the same results when a regression line is fit, yet are quite different in their interpretation.
The Anscombe data is available at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms. Fitting of regression lines to this data will be demonstrated in class.

⁵ It is possible to fit a regression line that is constrained to go through Y = 0 when X = 0. These must be fit carefully and are not covered in this course.
1.4.11 Transformations
In some cases, the plot of Y vs. X is obviously non-linear, and a transformation of X or Y may be used to establish linearity. For example, many dose-response curves are linear in log(X). Or the equation may be intrinsically non-linear, e.g. a weight-length relationship is of the form weight = β0 × length^β1. Or, some variables may be recorded in an arbitrary scale, e.g. should the fuel efficiency of a car be measured in L/100 km or km/L? You are already familiar with some variables measured on the log-scale - pH is a common example. Often a visual inspection of a plot may identify the appropriate transformation.
There is no theoretical difficulty in fitting a linear regression using transformed variables other than an understanding of the implicit assumption about the error structure. The model for a fit on transformed data is of the form

trans(Y) = β0 + β1 × trans(X) + error

Note that the error is assumed to act additively on the transformed scale. All of the assumptions of linear regression are assumed to hold on the transformed scale - in particular that the population standard deviation around the regression line is constant on the transformed scale.
The most common transformation is the logarithmic transform. It doesn't matter if the natural logarithm (often called the ln function) or the common logarithm (often called the log10 transformation) is used. There is a 1-1 relationship between the two transformations, and linearity on one transform is preserved on the other. The only change is that values on the ln scale are 2.302 = ln(10) times those on the log10 scale, which implies that the estimated slope and intercept both differ by a factor of 2.302. There is some confusion in scientific papers about the meaning of log - some papers use it to refer to the ln transformation, while others use it to refer to the log10 transformation.
After the regression model is fit, remember to interpret the estimates of slope and intercept on the transformed scale. For example, suppose that a ln(Y) transformation is used. Then we have

ln(Y_{t+1}) = b0 + b1 × (t + 1)

and

ln(Y_t) = b0 + b1 × t

Subtracting the two equations gives

ln(Y_{t+1}) − ln(Y_t) = ln(Y_{t+1}/Y_t) = b1 × (t + 1 − t) = b1

and exponentiating both sides gives

exp(ln(Y_{t+1}/Y_t)) = Y_{t+1}/Y_t = exp(b1) = e^b1
Hence a one unit increase in X causes Y to be MULTIPLIED by e^b1. As an example, suppose that on the log-scale the estimated slope was −.07. Then every unit change in X causes Y to change by a multiplicative factor of e^−.07 = .93, i.e. roughly a 7% decline per year.⁶
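The back-transformation is a one-liner; a quick check of the −.07 example:

```python
import math

# A slope of -0.07 on the (natural) log scale corresponds to a
# multiplicative change of exp(-0.07) per unit increase in X.
factor = math.exp(-0.07)
pct_change = 100 * (factor - 1)
print(round(factor, 3), round(pct_change, 1))
```

The exact change is about −6.8% per unit, confirming that for smallish slopes the log-scale slope is close to the percentage change.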
Similarly, predictions on the transformed scale must be back-transformed to the untransformed scale.
In some problems, scientists search for the 'best' transform. This is not an easy task, and using simple statistics such as R² to search for the best transformation should be avoided. Seek help if you need to find the best transformation for a particular dataset.
JMP makes it particularly easy to fit regressions to transformed data, as shown below. SAS and R have an extensive array of functions so that you can create new variables based on the transformation of an existing variable.
1.4.12 Example: Monitoring Dioxins - transformation
An unfortunate byproduct of pulp-and-paper production used to be dioxins - a very hazardous material. This material was discharged into waterways with the pulp-and-paper effluent, where it bioaccumulated in living organisms such as crabs. Newer processes have eliminated this byproduct, but the dioxins in the organisms take a long time to degrade.

Government environmental protection agencies take samples of crabs from affected areas each year and measure the amount of dioxins in the tissue. The following example is based on a real study.

Each year, four crabs are captured from a monitoring station. The liver is excised, and the livers from all four crabs are composited together into a single sample.⁷ The dioxin levels in this composite sample are measured. As there are many different forms of dioxins with different toxicities, a summary measure, called the Total Equivalent Dose (TEQ), is computed from the sample.
Here is the raw data.

⁶ It can be shown that on the log scale, for smallish values of the slope, the change is almost the same on the untransformed scale, i.e. if the slope is −.07 on the log scale, this implies roughly a 7% decline per year; a slope of +.07 implies roughly a 7% increase per year.

⁷ Compositing is a common analytical tool. There is little loss of useful information induced by the compositing process - the only loss of information is the among individual-sample variability, which can be used to determine the optimal allocation between samples within years and the number of years to monitor.
Site Year TEQ<br />
a 1990 179.05<br />
a 1991 82.39<br />
a 1992 130.18<br />
a 1993 97.06<br />
a 1994 49.34<br />
a 1995 57.05<br />
a 1996 57.41<br />
a 1997 29.94<br />
a 1998 48.48<br />
a 1999 49.67<br />
a 2000 34.25<br />
a 2001 59.28<br />
a 2002 34.92<br />
a 2003 28.16<br />
The data is available in a JMP data file dioxinTEQ.jmp in the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms.
As with all analyses, start with a preliminary plot of the data. Use the Analyze->Fit Y-by-X platform.
The preliminary plot of the data shows a decline in levels over time, but it is clearly non-linear. Why is this so? In many cases, a fixed fraction of dioxins degrades per year, e.g. a 10% decline per year. This can be expressed as a non-linear relationship:

TEQ = C × r^t

where C is the initial concentration, r is the rate of reduction per year, and t is the elapsed time. If this is plotted over time, this leads to the non-linear pattern seen above.
If logarithms are taken, this leads to the relationship:

log(TEQ) = log(C) + t × log(r)

which can be expressed as:

log(TEQ) = β0 + β1 × t

which is the equation of a straight line with β0 = log(C) and β1 = log(r).
JMP can easily be used to compute log(TEQ) by using the Formula Editor in the usual fashion. A plot of log(TEQ) vs. year gives the following:
The relationship looks approximately linear; there don't appear to be any outlier or influential points; and the scatter appears to be roughly equal across the entire regression line. Residual plots will be used later to check these assumptions in more detail.

A line can be fit as before by selecting the Fit Line option from the red triangle in the upper left side of the plot:
This gives the following output:
The fitted line is:

log(TEQ) = 218.9 − .11 × (year)
The intercept (218.9) would be the log(TEQ) in the year 0, which is clearly nonsensical. The slope (−.11) is the estimated log(ratio) from one year to the next. For example, exp(−.11) = .898 would mean that the TEQ in one year is only 89.8% of the TEQ in the previous year, or roughly an 11% decline per year. The standard error of the estimated slope is .02.
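This fit can be reproduced from the TEQ table above. A sketch using numpy and the natural log; the estimates should agree with the reported 218.9 and −.11 up to rounding:

```python
import numpy as np

# TEQ readings for site a, 1990-2003, from the table above.
year = np.arange(1990, 2004)
teq = np.array([179.05, 82.39, 130.18, 97.06, 49.34, 57.05, 57.41,
                29.94, 48.48, 49.67, 34.25, 59.28, 34.92, 28.16])

# Regress ln(TEQ) on year: the slope estimates log(r) and the
# intercept estimates log(C).
b1, b0 = np.polyfit(year, np.log(teq), 1)
print(b0, b1, np.exp(b1))  # intercept ~ 219, slope ~ -0.11, ratio ~ 0.90
```

Back-transforming the slope, exp(b1), gives the estimated fraction of TEQ remaining from one year to the next, as interpreted above.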
A 95% confidence interval for the slope can be obtained by pressing a Right-Click (for Windoze machines) or a Ctrl-Click (for Macintosh machines) in the Parameter Estimates summary table and selecting the confidence intervals to display in the table.
The 95% confidence interval for the slope is (−0.154, −0.061). If you take the anti-logs of the endpoints, this gives a 95% confidence interval for the fraction of TEQ that remains from year to year: between 0.86 and 0.94 of the TEQ in one year remains to the next year.
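These back-transformations are easy to check. A minimal sketch using the rounded estimates quoted above (with the rounded slope, exp(−0.11) comes out near 0.896 rather than the 0.898 computed from the unrounded estimate):

```python
import math

# Rounded estimates quoted in the text
slope = -0.11                      # estimated slope on the log scale
ci_low, ci_high = -0.154, -0.061   # 95% CI for the slope

# exp(slope) is the estimated year-to-year ratio of TEQ
ratio = math.exp(slope)
print(f"yearly ratio: {ratio:.3f}")  # about 0.896, i.e. roughly an 11% decline per year

# Anti-logs of the CI endpoints give a CI for the fraction remaining each year
print(f"95% CI for the ratio: ({math.exp(ci_low):.2f}, {math.exp(ci_high):.2f})")  # (0.86, 0.94)
```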
Several types of predictions can be made. For example, what would be the estimated mean TEQ in 2010? The computations could be done by hand, or by using the cross-hairs on the plot from the Analyze->Fit Y-by-X platform. Confidence intervals for the mean response, or prediction intervals for an individual response, can be added to the plot from the pop-down menu.

However, a more powerful tool is available from the Analyze->Fit Model platform.

Start by adding rows to the original data table corresponding to the years for which a prediction is required. In this case, the additional row would have the value 2010 in the Year column, with the remainder of the row unspecified. Missing values will be automatically inserted for the other variables.
Then invoke the Analyze->Fit Model platform:
This gives much the same output as the Analyze->Fit Y-by-X platform, with a few new (useful) features, some of which we will explore in the remainder of this section.

Next, save the prediction formula, the confidence interval for the mean, and the interval for an individual prediction to the data table (this will take three successive saves):
Now the data table has been augmented with additional columns and, more importantly, predictions for 2010 are now available:
The estimated mean log(TEQ) is 2.60 (corresponding to an estimated MEDIAN TEQ of exp(2.60) = 13.46). A 95% confidence interval for the mean log(TEQ) is (1.94, 3.26), corresponding to a 95% confidence interval for the actual MEDIAN TEQ of between 6.96 and 26.05. 8 Note that the confidence interval after taking anti-logs is no longer symmetrical.
Why does a mean of a logarithm transform back to the median on the untransformed scale? Basically, because the transformation is non-linear, properties such as means and standard errors cannot simply be anti-transformed without introducing some bias. However, measures of location (such as a median) are unaffected. On the transformed scale, the sampling distribution about the estimate is assumed to be symmetrical, which makes the mean and median take the same value. So what really happens is that the median on the transformed scale is back-transformed to the median on the untransformed scale.
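A quick numerical check of the back-transformation, using the estimates quoted above:

```python
import math

mean_log_teq = 2.60     # estimated mean log(TEQ) for 2010
ci_log = (1.94, 3.26)   # 95% CI for the mean log(TEQ)

# The anti-log of the mean log estimates the MEDIAN TEQ, not the mean
median_teq = math.exp(mean_log_teq)
ci_median = tuple(math.exp(v) for v in ci_log)

print(f"estimated median TEQ: {median_teq:.2f}")                           # 13.46
print(f"95% CI for the median: ({ci_median[0]:.2f}, {ci_median[1]:.2f})")  # (6.96, 26.05)

# The interval is not symmetric about the estimate after back-transforming
print(median_teq - ci_median[0], ci_median[1] - median_teq)
```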
Similarly, a 95% prediction interval for the log(TEQ) of an INDIVIDUAL composite sample can be found. Be sure to understand the difference between the two intervals.

Finally, an inverse prediction is sometimes of interest, i.e. in what year will the TEQ be equal to some particular value? For example, health regulations may require that the TEQ of the composite sample be below 10 units.
The Analyze->Fit Model platform has an inverse prediction function:
8 A minor correction can be applied to estimate the mean if required.
Specify the required value for Y – in this case log(10) = 2.302 – and then press the RUN button to get the following output:
The predicted year is found by solving

2.302 = 218.9 − 0.11 × year

which gives an estimated year of 2012.7. A confidence interval for the time when the mean log(TEQ) is equal to log(10) is somewhere between 2007 and 2026!
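The inverse prediction is just the fitted line solved for year. Note that the calculation is very sensitive to rounding of the coefficients: the rounded values 218.9 and −0.11 quoted above do not reproduce 2012.7, so the sketch below also uses less-rounded coefficients (hypothetical values, chosen only for illustration) to show how the reported answer arises:

```python
import math

def inverse_prediction(y_target, intercept, slope):
    """Solve y = intercept + slope * x for x."""
    return (y_target - intercept) / slope

y_target = math.log(10)  # 2.302...

# With the heavily rounded coefficients from the text, the answer is far off:
year_rounded = inverse_prediction(y_target, 218.9, -0.11)
print(year_rounded)  # about 1969 -- rounding the slope badly distorts the result

# Hypothetical less-rounded coefficients (illustration only) recover
# something close to the reported 2012.7:
year_full = inverse_prediction(y_target, 218.67, -0.1075)
print(year_full)     # about 2012.7
```

In practice the full-precision estimates from the software should be used for any inverse prediction.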
The residual plot looks fine with no apparent problems, but the dip in the middle years could require further exploration if this pattern were apparent at other sites as well.

The application of regression to non-linear problems is fairly straightforward once the transformation is made. The most error-prone step of the process is the interpretation of the estimates on the TRANSFORMED scale and how these relate to the untransformed scale.
1.4.13 Example: Weight-length relationships - transformation
A common technique in fisheries management is to investigate the relationship between the weights and lengths of fish.

This is expected to be a non-linear relationship, because as fish get longer they also get wider and thicker. If a fish grew "equally" in all directions, then the weight of a fish should be proportional to length³ (why?). However, fish do not grow equally in all directions, i.e. a doubling of length is not necessarily associated with a doubling of width or thickness. The pattern of association of weight with length may reveal information on how fish grow.
The traditional model relating weight to length is often postulated to be of the form:

weight = a × length^b

where a and b are unknown constants to be estimated from data.

If the estimated value of b is much less than 3, this indicates that as fish get longer, they do not get wider and thicker at the same rates.
How are such models fit? If logarithms are taken on each side, the above equation is transformed to:

log(weight) = log(a) + b × log(length)

or

log(weight) = β0 + β1 × log(length)

where the usual linear relationship on the log-scale is now apparent.
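The linearization can be sketched numerically. The data below are simulated (a = 0.01, b = 3, and the noise level are all made-up values, not the fish data of this example), with numpy standing in for the regression platform:

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated data following weight = a * length^b with multiplicative error;
# a = 0.01, b = 3, and the noise level are all made-up values
length = rng.uniform(29, 46, size=50)
weight = 0.01 * length**3 * np.exp(rng.normal(0.0, 0.05, size=50))

# Ordinary least squares on the log-log scale recovers the parameters:
b1, log_b0 = np.polyfit(np.log(length), np.log(weight), deg=1)
print(f"estimated b = {b1:.2f}, estimated a = {np.exp(log_b0):.4f}")
```

The fitted slope estimates b directly; the intercept must be anti-logged to recover a.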
The following example was provided by Randy Zemlak of the British Columbia Ministry of Water, Land, and Air Protection.
Length (mm)   Weight (g)
34            585
46            1941
33            462
36            511
32            428
33            396
34            527
34            485
33            453
44            1426
35            488
34            511
32            403
31            379
30            319
33            483
36            600
35            532
29            326
34            507
32            414
33            432
33            462
35            566
34            454
35            600
29            336
31            451
33            474
32            480
35            474
30            330
30            376
34            523
31            353
32            412
32            407
A sample of fish was measured at a lake in British Columbia. The data are shown above and are available in a JMP datasheet called wtlen.jmp in the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms.
The following is an initial plot with a spline fit (lambda = 10) to the data.

The fit appears to be non-linear, but this may simply be an artifact of the influence of the two largest fish. The plot appears to be linear in the range of 30–35 mm in length. If you look at the plot carefully, the variance appears to increase with length, with the spread noticeably wider at 35 mm than at 30 mm.
There are several (equivalent) ways to fit the growth model to such data in JMP:
• Use Analyze->Fit Y-by-X directly with the Fit Special feature.
• Create two new variables, log(weight) and log(length), and then use Analyze->Fit Y-by-X on these derived variables.
• Use Analyze->Fit Model on these derived variables.
We will fit a model on the log-log scale. Note that there is some confusion in scientific papers about a "log" transform. In general, a log-transformation refers to taking natural logarithms (base e), and NOT the base-10 logarithm. This mathematical convention is often broken in scientific papers where authors use ln to represent natural logarithms, etc. Which transformation is used does not affect the analysis in any way, other than that values on the natural-log scale are approximately 2.3 times larger than values on the log10 scale. Of course, the appropriate back-transformation is required.
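The factor of 2.3 is just the change-of-base identity ln x = ln(10) × log10(x), with ln 10 ≈ 2.3026:

```python
import math

x = 500.0
print(math.log(x))                    # natural log of x
print(math.log(10) * math.log10(x))   # identical value via base-10 logs
print(math.log(10))                   # the conversion factor, 2.3026...
```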
Using the Fit Special

The Fit Special option is available from the drop-down menu:

It presents a dialogue box where a transformation on both the Y and X axes may be specified:
The following output is obtained:
The fit is not very satisfactory. The curve doesn't seem to fit the two "outlier" points very well. At smaller lengths, the curve seems to be under-fitting the weight. The residual plot appears to show the two definite outliers, and also shows some evidence of a poor fit, with positive residuals at lengths near 30 mm and negative residuals near 35 mm.
The fit was repeated dropping the two largest fish, with the following output:
Now the fit appears to be much better. The relationship (on the log-scale) is linear, and the residual plot looks OK.
The estimated power coefficient is 2.76 (SE 0.21). We find the 95% confidence interval for the slope (the power coefficient):

The 95% confidence interval for the power coefficient is (2.33, 3.20), which includes the value of 3 – hence the growth could be isometric, i.e. a fish that is twice the length is also twice the width and twice the thickness. Of course, with this small sample size, it is difficult to say much more.
The actual model in the population is:

log(weight) = β0 + β1 × log(length) + ε

This implies that the "errors" in growth act on the LOG-scale. This seems reasonable.
For example, a regression on the original scale would make the assumption that a 20 g error in predicting weight is equally severe for a fish that (on average) weighs 200 or 400 grams, even though the "error" is 20/200 = 10% of the predicted value in the first case but only 5% of the predicted value in the second case. On the log-scale, it is implicitly assumed that the "errors" operate multiplicatively, i.e. a 10% error in a 200 g fish is equally severe as a 10% error in a 400 g fish, even though the absolute errors of 20 g and 40 g are quite different.
Another assumption of regression analysis is that the population error variance is constant over the entire regression line, but the original plot shows that the standard deviation is increasing with length. On the log-scale, the standard deviation is roughly constant over the entire regression line.
Using derived variables

The same analysis was repeated using the derived variables log(weight) and log(length), again using the Analyze->Fit Y-by-X platform, but this time without Fit Special. [Fit Special is not needed because the derived variables have already been transformed.]

The following are the outputs using the derived variables, again with and without the two largest fish.
Because derived variables are used, the fitting plot uses the derived variables and is on the log-scale. This has the advantage that the fit at the lower lengths is easier to see, but the lack of fit for the two largest fish is not as clear. However, it is now easier to see on the residual plot the apparent lack of fit, with the downward-sloping portion of the residual plot between 3.4 and 3.6 in log(length).
The two largest fish were removed and the fit repeated using the derived variables:
The results are identical to those of the previous section.
A non-linear fit

It is also possible to do a direct non-linear least-squares fit. Here the objective is to find values of β0 and β1 that minimize

Σ (weight − β0 × length^β1)²

directly.
This can also be done in JMP using the Fit NonLinear platform and won't be explored in much detail here.

First, here are the results from using all of the fish:
Note that the fit is apparently better than the fit on the log-scale, as the fitted curve goes through the middle of the points for the two largest fish. Note, however, that there still appear to be problems with the fit at the lower lengths.
The same fit, dropping the two largest fish, gives the following output:
The estimated power coefficient from the non-linear fit is 2.73, with a standard error of 0.24. The estimated intercept is 0.0323, with an estimated standard error of 0.027. Both estimates are similar to the previous fit.

Which is the better method to fit these data? The non-linear fit assumes that errors are additive on the original scale. The consequences of this were discussed earlier, i.e. a 20 g error is equally serious for a 200 g fish as for a 400 g fish.
For this problem, both the non-linear fit and the fit on the log-scale gave the same results, but this will not always be true. In particular, look at the large difference in estimates when the models were fit to all of the fish. The non-linear fit was more influenced by the two large fish – this is a consequence of minimizing the square of the absolute deviation (as opposed to the relative deviation) between the observed and predicted weights.
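The difference between the two fitting criteria can be sketched with simulated data (again made-up parameter values, not the fish data of this example; scipy's curve_fit plays the role of JMP's Fit NonLinear platform):

```python
import numpy as np
from scipy.optimize import curve_fit

rng = np.random.default_rng(2)

# Simulated weight-length data (made-up parameters, not the real fish data)
length = rng.uniform(29, 46, size=50)
weight = 0.01 * length**3 * np.exp(rng.normal(0.0, 0.05, size=50))

def power_law(L, b0, b1):
    return b0 * L**b1

# Direct non-linear least squares: minimizes sum((weight - b0 * length**b1)**2)
(b0_nls, b1_nls), _ = curve_fit(power_law, length, weight, p0=(0.01, 3.0))

# Log-log linear fit: minimizes squared errors on the log scale instead
b1_log, log_b0 = np.polyfit(np.log(length), np.log(weight), deg=1)

# The two criteria weight large fish differently, so estimates agree only roughly
print(b1_nls, b1_log)
```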
1.4.14 Power/Sample Size
A power analysis and sample size determination can also be done for regression problems, but this is (unfortunately) rarely done. There are a number of reasons:
• The power depends not only on the total number of points collected, but also on the actual distribution of the X values. For example, a regression analysis is most powerful for detecting a trend if half the observations are collected at a small X value and half at a large X value. However, this type of data gives no information on the linearity (or lack thereof) between the two X values and is not recommended in practice. A less powerful design would collect a range of X values, but this is often of more interest because lack-of-fit and non-linearity can be detected.
• Data collected for regression analysis are often opportunistic, with little chance of choosing the X values. Unless you have some prior information on the distribution of the X values, it is difficult to determine the power.
• The formulae are clumsy to compute by hand, and most power packages tend not to have modules for power analysis of regression.
For a power analysis, the information required is similar to that requested for ANOVA designs:
• α level. As in power analyses for ANOVA, this is traditionally set to α = 0.05.
• Effect size. In ANOVA, power deals with the detection of differences among means. In regression analysis, power deals with the detection of slopes that are different from zero. Hence, the effect size is measured by the slope of the line, i.e. the rate of change in the mean of Y per unit change in X.
• Sample size. Recall that in ANOVA with more than two groups, the power depended not only on the sample size per group, but also on how the means were separated. In regression analysis, the power will depend upon the number of observations taken at each value of X and the spread of the X values. For example, the greatest power is obtained when half the sample is taken at each of the two extremes of the X space – but at the cost of not being able to detect non-linearity.
• Standard deviation. As in ANOVA, the power will depend upon the variation of the individual objects around the regression line.
This problem of power and sample size for regression is beyond what we can cover in this chapter. JMP and R currently do not include a power computation module for regression analysis; however, SAS (Version 9+) includes a power analysis module (GLMPOWER). Please consult suitable help for details.
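For the simple special case of a straight-line trend with one observation per year and a known residual standard deviation, the power of the slope test can be computed from the noncentral t distribution. The sketch below is an illustration under those stated assumptions, not a substitute for a package such as GLMPOWER:

```python
import numpy as np
from scipy import stats

def slope_power(n_years, slope, sd_resid, alpha=0.05):
    """Approximate power of the two-sided t-test of zero slope, assuming one
    observation per year at x = 0, 1, ..., n_years - 1."""
    x = np.arange(n_years)
    sxx = np.sum((x - x.mean()) ** 2)
    se_slope = sd_resid / np.sqrt(sxx)  # standard error of the estimated slope
    ncp = slope / se_slope              # noncentrality parameter
    df = n_years - 2
    tcrit = stats.t.ppf(1 - alpha / 2, df)
    return stats.nct.sf(tcrit, df, ncp) + stats.nct.cdf(-tcrit, df, ncp)

# More years of monitoring give more power to detect the same trend
print(slope_power(10, slope=0.05, sd_resid=0.2))
print(slope_power(20, slope=0.05, sd_resid=0.2))
```

Extending the design (more years) increases the spread of the X values, which is why power grows quickly with the length of the monitoring series.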
However, the problem simplifies considerably when the X variable is time and interest lies in detecting a trend (increasing or decreasing) over time. A linear regression of the quantity of interest against time is commonly used to evaluate such a trend. For many monitoring designs, observations are taken on a yearly basis, so the question reduces to the number of years of monitoring required.

The analysis of trend data and power/sample size computations is treated in a following chapter.
1.4.15 The perils of R²
R² is a "popular" measure of the fit of a regression model and is often quoted in research papers as evidence of a good fit, etc. However, there are several fundamental problems with R² which, in my opinion, make it less desirable. A nice summary of these issues is presented in Draper and Smith (1998, Applied Regression Analysis, pp. 245–246).

Before exploring this, how is R² computed, and how is it interpreted?

While I haven't discussed the decomposition of the Error SS into Lack-of-Fit and Pure Error, this can be done when there are replicated X values. A prototype ANOVA table would look something like:
Source            df            SS
Regression        p − 1         A
Lack-of-fit       n − p − ne    B
Pure error        ne            C
Corrected Total   n − 1         D

where there are n observations and a regression model is fit with p − 1 X variables over and above the intercept.
R² is computed as

R² = SS(regression) / SS(total) = A / D = 1 − (B + C) / D

where SS(·) represents the sum of squares for that term in the ANOVA table. At this point, rerun the three examples presented earlier to find the value of R².
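As a small sketch of the computation, using the letters from the prototype table with hypothetical sums of squares:

```python
def r_squared(ss_reg, ss_lack_of_fit, ss_pure_error):
    """R^2 = A / D = 1 - (B + C) / D from the prototype ANOVA table."""
    ss_total = ss_reg + ss_lack_of_fit + ss_pure_error  # D = A + B + C
    return ss_reg / ss_total

# Hypothetical sums of squares: A = 80, B = 12, C = 8, so D = 100
print(r_squared(80.0, 12.0, 8.0))  # 0.8
```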
For example, in the fertilizer example, the ANOVA table is:

Analysis of Variance
Source   DF   Sum of Squares   Mean Square   F Ratio   p-value
Model    1    225.18035        225.180       69.8800   <.0001
… informative. In particular, the estimate of the slope and the SE of the slope are much more informative.

Here are some reasons why I decline to use R² very much:
• Overfitting. If there are no replicate X points, then ne = 0, C = 0, and R² = 1 − B/D. B has n − p degrees of freedom. As more and more X variables are added to the model, n − p and B become smaller, and R² must increase even if the additional variables are useless.
• Outliers distort. Outliers produce Y values that are extreme relative to the fit. This can inflate the value of C (if the outlier occurs among a set of replicate X values) or B (if the outlier occurs at a singleton X value). In either case, they reduce R², so R² is not resistant to outliers.
• People misinterpret a high R² as implying the regression line is useful. It is tempting to believe that a higher value of R² implies that a regression line is more useful. But consider the pair of plots below: the graph on the left has a very high R², but the change in Y as X varies is negligible. The graph on the right has a lower R², but the average change in Y per unit change in X is considerable. R² measures the "tightness" of the points about the line – the higher value of R² on the left indicates that the points fit the line very well. The value of R² does NOT measure how much actual change occurs.
• The upper bound is not always 1. People often assume that a low R² implies a poorly fitting line. But if you have replicate X values, then C > 0, and the maximum value of R² for the problem can be much less than 100% – it is mathematically impossible for R² to reach 100% with replicated X values. In the extreme case where the model "fits perfectly" (i.e. the lack-of-fit term is zero), R² can never exceed 1 − C/D.
• No-intercept models. If there is no intercept, then D = Σ(Yi − Ȳ)² does not exist, and R² is not really defined.
• R² gives no additional information. In actual fact, R² is a 1-1 transformation of the slope and its standard error, as is the p-value. So there is no new information in R².
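For simple linear regression the 1-1 relationship is explicit: R² = t²/(t² + n − 2), where t is the slope's t statistic. A quick numerical check on simulated data:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 30
x = rng.uniform(0, 10, size=n)
y = 1.0 + 0.5 * x + rng.normal(0.0, 1.0, size=n)  # simulated data

slope, intercept = np.polyfit(x, y, deg=1)
resid = y - (intercept + slope * x)
sse = np.sum(resid**2)
sst = np.sum((y - y.mean())**2)

# t statistic for the slope
se_slope = np.sqrt(sse / (n - 2) / np.sum((x - x.mean())**2))
t = slope / se_slope

r2_direct = 1.0 - sse / sst
r2_from_t = t**2 / (t**2 + n - 2)
print(r2_direct, r2_from_t)  # identical up to floating-point rounding
```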
• R² is not useful for non-linear fits. R² is really only useful for linear fits in which the estimated regression line is free to have a non-zero intercept. The reason is that R² is really a comparison between two models. For example, refer back to the length-weight relationship examined earlier.
In the linear-fit case, the two models being compared are

log(weight) = log(b0) + error

vs.

log(weight) = log(b0) + b1 × log(length) + error

and so R² is a measure of the improvement with the regression line. [In actual fact, it is a 1-1 transform of the test that β1 = 0, so why not use that statistic directly?] In the non-linear-fit case, the two models being compared are:

weight = 0 + error

vs.

weight = b0 × length^b1 + error

The model weight = 0 is silly, and so R² is silly.

Hence, the R² values reported are really all for linear fits – it is just that sometimes the actual linear fit is hidden.
• Not defined in generalized least squares. There are more complex fits that don't assume equal variance around the regression line. In these cases, R² is again not defined.
• Cannot be used with different transformations of Y. R² cannot be used to compare models fit to different transformations of the Y variable. For example, many people try fitting a model to Y and to log(Y) and choose the model with the higher R². This is not appropriate, as the D terms are no longer comparable between the two models.
• Cannot be used for non-nested models. R² cannot be used to compare models with different sets of X variables unless one model is nested within the other (i.e. all of the X variables in the smaller model also appear in the larger model). So using R² to compare a model with X1, X3, and X5 to a model with X1, X2, and X4 is not appropriate, as these two models are not nested. In these cases, AIC should be used to select among models.
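As a rough illustration of using AIC rather than R² to compare non-nested models, the sketch below (Python, with simulated data; the helper `aic_linear` is invented here for illustration and is not part of any package) computes AIC = n·log(RSS/n) + 2k for two non-nested straight-line models:

```python
import numpy as np

def aic_linear(y, X):
    """AIC for an ordinary least-squares fit of y on the columns of X
    (the caller must include an intercept column in X)."""
    n = len(y)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = np.sum((y - X @ beta) ** 2)
    k = X.shape[1] + 1          # regression coefficients plus the error variance
    return n * np.log(rss / n) + 2 * k

rng = np.random.default_rng(1)
n = 50
x1, x2 = rng.normal(size=(2, n))
y = 2 + 3 * x1 + rng.normal(scale=0.5, size=n)   # truth depends on x1 only

ones = np.ones(n)
aic_m1 = aic_linear(y, np.column_stack([ones, x1]))  # model using X1
aic_m2 = aic_linear(y, np.column_stack([ones, x2]))  # non-nested model using X2
print(aic_m1 < aic_m2)   # the X1 model should have the smaller AIC
```

The model with the smaller AIC is preferred; unlike R², this comparison is legitimate even though neither model is nested in the other.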
1.5 A no-intercept model: Fulton's Condition Factor K

It is possible to fit a regression line that has an intercept of 0, i.e., goes through the origin. Most computer packages have an option to suppress the fitting of the intercept.

The biggest 'problem' lies in interpreting some of the output - some of the statistics produced are misleading for these models. As this varies from package to package, please seek advice when fitting such models.

The following is an example of where such a model may be sensible.

Not all fish within a lake are identical. How can a single summary measure be developed to represent the condition of fish within a lake?
©2012 Carl James Schwarz. November 23, 2012
In general, the relationship between fish weight and length follows a power law:

   W = a L^b

where W is the observed weight; L is the observed length; and a and b are coefficients relating length to weight. The usual assumption is that heavier fish of a given length are in better condition than lighter fish. Condition indices are a popular summary measure of the condition of the population.

There are at least eight different measures of condition which can be found by a simple literature search. Cone (1989) raises some important questions about the use of a single index to represent the two-dimensional weight-length relationship.
One common measure is Fulton's⁹ K:

   K = Weight / (Length/100)³
This index makes an implicit assumption of isometric growth, i.e. as the fish grows, its body proportions and specific gravity do not change.

How can K be computed from a sample of fish, and how can K be compared among different subsets of fish from the same lake or across lakes?
The B.C. Ministry of Environment takes regular samples of rainbow trout using a floating and a sinking net. For each fish captured, the weight (g), length (mm), sex, and maturity of the fish were recorded. The data are available in the rainbow-condition.jmp data file in the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms. A portion of the raw data appears below:

9 There is some doubt about the first authorship of this condition factor. See Nash, R. D. M., Valencia, A. H., and Geffen, A. J. (2005). The Origin of Fulton's Condition Factor - Setting the Record Straight. Fisheries, 31, 236-238.
K was computed for each individual fish, and the resulting histogram is displayed below:

There is a range of condition numbers among the individual fish, with an average (among the fish caught) K of about 13.6.
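The calculation of K is just the formula above applied fish by fish. A minimal sketch in Python (the weights and lengths below are made-up numbers for illustration, not the Ministry's rainbow trout data):

```python
import numpy as np

# Hypothetical weight (g) and length (mm) measurements -- NOT the
# actual rainbow-condition.jmp data.
weight = np.array([ 510.,  820., 1010.,  300., 1450.])
length = np.array([ 340.,  390.,  420.,  280.,  480.])

# Fulton's condition factor for each fish: K = W / (L/100)^3
K = weight / (length / 100.0) ** 3

print(np.round(K, 2))           # per-fish condition factors
print(round(K.mean(), 2))       # average condition among the fish caught
```

The average of the per-fish K values plays the role of the "average among the fish caught" reported above; how well it represents the whole lake depends on the sampling design, as discussed next.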
Deriving a single summary measure to represent the entire population of fish in the lake depends heavily on the sampling design used to capture fish.

Some care must be taken to ensure that the fish collected are a simple random sample from the fish in the population. If a net of a single mesh size is used, its selectivity curve makes it more likely to capture fish of a certain size. In this experiment, several different mesh sizes were used to try to ensure that fish of all sizes had an equal chance of being selected.
As well, regression methods have an advantage in that a simple random sample from the population is no longer required to estimate the regression coefficients. As an analogy, suppose you are interested in the relationship between the yield of plants and soil fertility. Such a study could be conducted by finding a random sample of soil plots, but this may lead to many plots with similar fertility and only a few plots with fertility at the tails of the relationship. An alternate scheme is to deliberately seek out soil plots with a range of fertilities, or to purposely modify the fertility of soil plots by adding fertilizer, and then fit a regression curve to these selected data points.
Fulton's index is often re-expressed for regression purposes as:

   W = K (L/100)³

This looks like a simple regression between W and (L/100)³ but with no intercept.
A plot of these two variables:
shows a tight relationship among fish, but with possibly increasing variance with length.

There is some debate about the proper way to estimate the regression coefficient K. Classical regression methods (least squares) implicitly assume that all of the "error" in the regression is in the vertical direction, i.e. the fit conditions on the observed lengths. However, the structural relationship between weight and length likely has error in both variables. This leads to the error-in-variables problem in regression, which has a long history. Fortunately, the relationship between the two variables is often sufficiently tight that it really doesn't matter which method is used to find the estimates.
JMP can be used to fit the regression line constraining the intercept to be zero by using the Fit Special option under the red-triangle:
This gives rise to the fitted line and statistics about the fit:
Note that R² really doesn't make sense in cases where the regression is forced through the origin, because the null model to which it is being compared is the line Y = 0, which is silly. 10 For this reason, JMP does not report a value of R².

The estimated value of K is 13.72 (SE 0.099).
The residual plot:

shows clear evidence of increasing variation with length. This usually implies that a weighted regression is needed, with weights proportional to 1/length². In this case, such a regression gives essentially the same estimate of the condition factor (K̂ = 13.67, SE = 0.11).
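A through-the-origin fit has the closed form K̂ = Σxy/Σx² (and Σwxy/Σwx² in the weighted case, with weights w). A sketch with simulated fish, assuming a true K of 13.7 and noise that grows with length, as in the residual plot:

```python
import numpy as np

rng = np.random.default_rng(7)

# Simulated fish: lengths in mm; weights generated from W = K*(L/100)^3
# with K = 13.7 and error standard deviation growing with length.
K_true = 13.7
length = rng.uniform(250, 500, size=200)
x = (length / 100.0) ** 3
weight = K_true * x + rng.normal(scale=0.02 * length, size=200)

# Through-the-origin least squares: K_hat = sum(x*y) / sum(x^2)
K_ols = np.sum(x * weight) / np.sum(x * x)

# Weighted version with weights proportional to 1/length^2
w = 1.0 / length ** 2
K_wls = np.sum(w * x * weight) / np.sum(w * x * w ** 0 * x) if False else \
        np.sum(w * x * weight) / np.sum(w * x * x)

print(round(K_ols, 2), round(K_wls, 2))
```

As in the notes, the weighted and unweighted estimates agree closely when the relationship is tight; the weighting mainly affects the standard error.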
Comparing condition factors

This dataset has a number of sub-groups - do all of the subgroups have the same condition factor? For example, suppose we wish to compare the K value for immature and mature fish. This is covered in more detail in the chapter on the Analysis of Covariance (ANCOVA).
1.6 Frequently Asked Questions - FAQ

1.6.1 Do I need a random sample; power analysis
A student wrote:

I am studying the hydraulic geometry of small, steep streams in Southwest BC (abstract attached). I would like to define a regional hydraulic geometry for a fairly small hydrologically/geologically homogeneous area in the coast mountains close to SFU. Hydraulic geometry is the study of how the primary flow variables (width, depth, and velocity) change with discharge in a stream. Typically, a straight regression line is fitted to data plotted on a log-log plot. The equation is of the form w = aQ^b, where a is the intercept, b is the slope, w is the water surface width, and Q is the stream discharge.

10 Consult any of the standard references on regression, such as Draper and Smith, for more details.
I am struggling with the last part of my research proposal, which is how do I select (randomly) my field sites and how many sites are required. My supervisor suggests that I select stream segments for study based on a priori knowledge of my field area and select streams from across it. My argument is that to define a regionally applicable relationship (not just one that characterizes my chosen sites) I must randomly select the sites.

I think that GIS will help me select my sites, but I have the usual questions of how many sites are required to give me a certain level of confidence and whether or not I'm on the right track. As well, the primary controlling variables that I am looking at are discharge and stream slope. I will be plotting the flow variables against discharge directly, but will deal with slope by breaking my stream segments into slope classes - I guess that the null hypothesis would be that there is no difference in the exponents and intercepts between slope classes.
You are both correct!

If you were doing a simple survey, then you are correct in that a random sample from the entire population must be selected - you can't deliberately choose streams.

However, because you are interested in a regression approach, the assumption can be relaxed a bit. You can deliberately choose the values of the X variables, but must randomly select from streams with similar X values.
As an analogy, suppose you wanted to estimate the average length of male adult arms. You would need a random sample from the entire population. However, suppose that you were interested in the relationship between body height (X) and arm length (Y). You could deliberately choose which X values to measure - indeed, it would be a good idea to get a good contrast among the X values, i.e. find people who are 4 ft, 5 ft, 6 ft, and 7 ft tall, measure their height and arm length, and then fit the regression curve. However, at each height level, you must now choose randomly among those people that meet that criterion. Hence you could deliberately choose to have 1/4 of people who are 4 ft tall, 1/4 who are 5 ft tall, 1/4 who are 6 ft tall, and 1/4 who are 7 ft tall - quite different from the proportions in the population - but at each height level you must choose people randomly, i.e. don't always choose skinny 4 ft people and over-weight 7 ft people.
Now sample size is a bit more difficult, as the required sample size depends both on the number of streams selected and on how they are scattered along the X axis. For example, the highest power occurs when observations are evenly divided between the very smallest and the very largest X values. However, without intermediate points, you can't assess linearity very well. So you will want points scattered across the range of X values.
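The effect of the design on power can be seen from the standard-error formula SE(b̂1) = σ/√Σ(x − x̄)²: pushing points toward the extremes inflates Σ(x − x̄)² and shrinks the SE. A small sketch comparing two hypothetical ten-site designs (σ is an assumed residual standard deviation):

```python
import numpy as np

def slope_se(x, sigma=1.0):
    """Standard error of the estimated slope for a given design of X values,
    assuming residual standard deviation sigma: SE = sigma / sqrt(sum((x - xbar)^2))."""
    x = np.asarray(x, dtype=float)
    return sigma / np.sqrt(np.sum((x - x.mean()) ** 2))

# Ten sites spread evenly vs. ten sites split between the two extremes.
even     = np.linspace(0, 10, 10)
extremes = np.array([0.0] * 5 + [10.0] * 5)

print(slope_se(even), slope_se(extremes))
# The extreme design gives the smaller SE (higher power), but it leaves no
# intermediate points with which to check linearity.
```

This is why a compromise design - most points near the extremes, a few in the middle - is often recommended.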
If you have some preliminary data, a power/sample-size analysis can be done using JMP, SAS, and other packages. If you do a Google search for "power analysis regression", there are several direct links to examples. Refer to the earlier section of the notes.
Chapter 2

Detecting trends over time

2.1 Introduction

As the following graphs show, tests for trend are one of the most common statistical tools used. 1

1 The astute reader may note the discrepancy between the headline and the apparent trend in the graph. Why?
Trend analysis is often used as the endpoint for many monitoring designs, i.e. is the monitored variable increasing or decreasing? Some nice references for planning monitoring studies are:

• USGS Patuxent Wildlife Research Centre's Manager's Monitoring Manual, available at http://www.pwrc.usgs.gov/monmanual/.

• US National Parks Service guidelines on designing a monitoring study, available at http://science.nature.nps.gov/im/monitor/index.htm.

• Elzinga, C.L. et al. (2001). Monitoring Plant and Animal Populations. Blackwell Science, Inc.
There are many types of trends that can exist. For example, a simple step function is an example of a trend where the measured quantity Y increases after some intervention. These types of trends are commonly analyzed using the t-test or Analysis of Variance (ANOVA) methods covered in other parts of these notes.
The trend may be a gradual linear increase over time:
For example, as the amount of trees cleared increases over time, the turbidity of water in a stream may increase. In many cases, a regression analysis is used to test for trends in time. In these cases, the X variable is time and the Y variable is some response variable of interest. This is the main focus of this chapter.

In some cases the trend is monotonic but non-linear:
In the case of non-linear trends, a transformation is often used to try to linearize the trend (e.g. a log transform). This is often successful, in which case methods for linear regression can be used, but in some cases there is no obvious transformation. The trend can then be modeled by a function of arbitrary shape. A very general methodology called Generalized Additive Models can be used to fit very general functions. These are beyond the scope of this course.
Sometimes the linear trend changes at some point (called the break point):
If the break point is known in advance, the model is easily fit using multiple regression methods, but this is beyond the scope of these notes. If the break point is unknown, this is a difficult statistical problem; refer to Toms and Lesperance (2003) 2 for help.
Helsel and Hirsch (2002) 3 summarize a number of methods used to detect trends. The following table is adapted from their manual:
2 Toms, J.D. and Lesperance, M.L. (2003). Piecewise regression: a tool for identifying ecological thresholds. Ecology, 84, 2034-2041.

3 Helsel, D.R. and Hirsch, R.M. (2002). Statistical Methods in Water Resources. Chapter 12. Available at http://pubs.usgs.gov/twri/twri4a3/
Trends with NO seasonality

• Nonparametric.
Not adjusted for X: Kendall trend test on Y vs. T.
Adjusted for X: Kendall trend test on residuals R from a smoothing fit (e.g. LOWESS) of Y on X.

• Mixed.
Not adjusted for X: none.
Adjusted for X: Kendall trend test on residuals R from a regression of Y on X. 4

• Parametric.
Not adjusted for X: Regression of Y on T.
Adjusted for X: Regression of Y on X and T.

Trends with seasonality

• Nonparametric.
Not adjusted for X: Seasonal Kendall test of Y on T.
Adjusted for X: Seasonal Kendall test on residuals R from a smoothing fit (e.g. LOWESS) of Y on X.

• Mixed.
Not adjusted for X: Regression of deseasonalized Y on T, e.g. after subtracting the seasonal means.
Adjusted for X: Seasonal Kendall trend test on residuals from a regression of Y on X.

• Parametric.
Not adjusted for X: Regression of Y on T and seasonal terms, e.g. ANCOVA or sin/cos regression.
Adjusted for X: Regression of Y on X, T, and seasonal terms.

Notation: Y response variable; T time variable; X exogenous variable; R residuals.
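For the non-seasonal, unadjusted case, the Kendall trend test amounts to counting concordant minus discordant pairs in the series (the S statistic) and comparing it to its null distribution. A bare-bones sketch, with no tie or autocorrelation corrections (production work should use a vetted implementation such as `scipy.stats.kendalltau`):

```python
import math
import numpy as np

def mann_kendall(y):
    """Mann-Kendall test for a monotonic trend in the series y.
    Returns (S, z, two-sided p); ignores ties and autocorrelation."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    # S = number of increasing pairs minus number of decreasing pairs
    s = sum(np.sign(y[j] - y[i]) for i in range(n - 1) for j in range(i + 1, n))
    var_s = n * (n - 1) * (2 * n + 5) / 18.0
    # Continuity-corrected normal approximation
    z = (s - np.sign(s)) / math.sqrt(var_s) if s != 0 else 0.0
    p = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return s, z, p

# A short, made-up yearly series with an upward drift
s, z, p = mann_kendall([2.1, 2.4, 2.3, 2.9, 3.2, 3.1, 3.6, 3.8])
print(s, round(z, 2), round(p, 4))
```

Being based only on the signs of pairwise differences, the test is insensitive to outliers and to monotone transformations of Y, which is why it occupies the nonparametric cells of the table.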
In these notes we will look at linear trends fit using regression models and non-parametric methods. We will also look at how to pool two or more sites to see if they have a common linear trend. In trends over time, there are often problems of autocorrelation or seasonality; methods to deal with these problems will be discussed.

At this time, however, adjusting for other exogenous variables (X) will not be discussed. Methods to deal with step-trends are covered in other chapters.
4 Alley (1988) shows that increased power is obtained by doing the Kendall test on the residuals of Y vs. X and of T vs. X. This removes any drift in X over time as well.
2.2 Simple Linear Regression

We will begin by using the methods of linear regression (covered in an earlier part of these notes) as applied to trends over time.

Trend analysis is a special case of linear regression analysis, but it also has some features, fairly common when dealing with trends, that don't have exact counterparts in regular regression:

• Testing for a common trend (a special case of ANCOVA)

• Dealing with process vs. sampling variation

• Dealing with autocorrelation of residuals

For most of this chapter, we will assume that X is measured in years (e.g. calendar year).

The same sampling model, assumptions, estimation, and hypothesis testing methods are used as in the regular regression case, with appropriate modifications to deal with X as time. These will be reviewed again below.
2.2.1 Populations and samples

The population of interest is the set of Y variables as measured over time (X). In most cases in trend analysis, random sampling from some larger population of time points really doesn't make sense. Rather, the time values (the X values) are pre-specified. For example, measurements could be taken every year, or every two years, etc.

We wish to summarize the relationship between Y and time (X), and furthermore wish to make predictions of the Y value for future time (X) values that may be observed from this population. We may also wish to do inverse regression, i.e. predict at what time Y will reach a certain value.
If this were physics, we might conceive of a physical law between Y and time (e.g. distance = velocity × time). However, in ecology, the relationship between Y and time is much more tenuous. If you draw a scatter-plot of Y against time, the points will NOT fall exactly on a straight line. Rather, the value of Y will fluctuate above or below a straight line at any given time value.

We denote this relationship as

   Y = β0 + β1 X + ε

where we remember that X is now time rather than some other predictor variable. Now β0 and β1 are the POPULATION intercept and slope, respectively. We say that

   E[Y] = β0 + β1 X
is the expected or average value of Y at X. 5

The term ε represents the random variation of individual units in the population above and below the expected value. It is assumed to have constant standard deviation over the entire regression line (i.e. the spread of data points is constant over time).

Of course, we can never measure all units of the population. So a sample must be taken in order to estimate the population slope, population intercept, and standard deviation. In most trend analyses, the values of X are chosen to be equally spaced in time, e.g. measurements taken every year.

Once the data points are selected, the estimation process can proceed, but not before assessing the assumptions!
2.2.2 Assumptions

The assumptions for a trend analysis are virtually the same as for a standard regression analysis. This is not surprising, as trend analysis is really a special case of regression analysis.

Linearity

Regression analysis assumes that the relationship between Y and X is linear, i.e. a constant rate of change over time. This can be assessed quite simply by plotting Y vs. time. Perhaps a transformation is required (e.g. log(Y) vs. log(X)). Some caution is required when transformations are done, as it is the error structure on the transformed scale that is most important. As well, you need to be a little careful about the back-transformation after doing regression on transformed values.

You should also plot the residuals vs. the X (time) values. If the scatter is not random around 0 but shows some pattern (e.g. a quadratic curve), this usually indicates that the relationship between Y and X (time) is not linear. Alternatively, you can fit a model that includes X and X² and test if the coefficient associated with X² is zero. Unfortunately, this test could fail to detect a higher-order relationship. Third, if there are multiple readings at some X values, then a test of goodness-of-fit can be performed, where the variation of the responses at the same X value is compared to the variation around the regression line.
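The "add X² and test its coefficient" check can be sketched as follows (simulated yearly data; the helper `quad_term_t` is invented here for illustration, with the time origin shifted to the start of monitoring to keep the design matrix well-conditioned):

```python
import numpy as np

def quad_term_t(x, y):
    """t statistic for the X^2 coefficient in the fit Y = b0 + b1*X + b2*X^2.
    A large |t| suggests the straight-line model is inadequate."""
    X = np.column_stack([np.ones_like(x), x, x ** 2])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    n, k = X.shape
    mse = np.sum(resid ** 2) / (n - k)          # residual variance estimate
    cov = mse * np.linalg.inv(X.T @ X)          # covariance of the coefficients
    return beta[2] / np.sqrt(cov[2, 2])

rng = np.random.default_rng(3)
t_yrs = np.arange(20, dtype=float)              # years since start of monitoring
linear = 5 + 0.30 * t_yrs + rng.normal(scale=0.5, size=20)
curved = 5 + 0.05 * t_yrs ** 2 + rng.normal(scale=0.5, size=20)

print(round(quad_term_t(t_yrs, linear), 2), round(quad_term_t(t_yrs, curved), 2))
```

The curved series produces a much larger |t| for the X² term than the genuinely linear one; as noted above, though, a small |t| does not rule out other non-linear shapes.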
Scale of Y and X

As X is time, it has an interval or ratio scale. It is further assumed that Y has an interval or ratio scale as well. This can be violated in a number of ways. For example, a numerical value is often used to represent a

5 In ANOVA, we let each treatment group have its own mean; here in regression we assume that the means must fall on a straight line. In some cases, even in the absence of sampling error, the true value of Y does NOT lie on the straight line. This is known as process variation, and will be discussed later.
category, and this numerical value is used in a regression. This is not valid. Suppose that you code hair color as (1=red, 2=brown, and 3=black). Then using these values as the response variable (Y) is not sensible.
Correct sampling scheme

The Y values must be a random sample from the population of Y values at every time point.

No outliers or influential points

All the points must belong to the relationship - there should be no unusual points. The scatter-plot of Y vs. X should be examined. If in doubt, fit the model with the outlying points in and out of the model and see if this makes a difference in the fit.

Outliers can have a dramatic effect on the fitted line, as you saw in a previous chapter.
Equal variation along the line

The variability about the regression line is similar for all values of X, i.e. the scatter of the points above and below the fitted line should be roughly constant over time. This is assessed by looking at plots of the residuals against X to see if the scatter is roughly uniform around zero, with no increase and no decrease in spread over the entire line.
Independence

Each value of Y is independent of any other value of Y. This is a common failing in trend analysis, where the measurement in a particular year influences the measurement in subsequent years.

This assumption can be assessed by again looking at residual plots against time or other variables.
Normality of errors

The difference between the value of Y and the expected value of Y is assumed to be normally distributed. This is one of the most misunderstood assumptions. Many people erroneously assume that the distribution of Y over all X values must be normally distributed, i.e. they look simply at the distribution of the Y's, ignoring the X's. The assumption only states that the residuals, the differences between the values of Y and the corresponding points on the line, must be normally distributed.
This can be assessed by looking at normal probability plots of the residuals. As in ANOVA, for small sample sizes you have little power to detect non-normality, and for large sample sizes it is not that important.
X measured without error<br />
This is a new assumpti<strong>on</strong> <str<strong>on</strong>g>for</str<strong>on</strong>g> regressi<strong>on</strong> as compared to ANOVA. In ANOVA, the group membership was<br />
always “exact”, i.e. the treatment applied to an experimental unit was known without ambiguity. However,<br />
in regressi<strong>on</strong>, it can turn out that that the X value may not be known exactly.<br />
This may seem a bit puzzling in a trend analysis – after all, how can the calendar year not be known exactly? An example of the problem is when Y is an estimate of the population size which is measured over time. This is often obtained from a mark-recapture study when animals are marked in one month, and recaptured in the next month. In this case, does the population size refer to the population size at the start of the study, in the middle of the study, or the end of the study? If the same protocol was performed in all years, then it really doesn't matter, but the start and end of sampling likely varies over years (e.g. in some years sampling starts in March, in other years it starts in April) so that the interval between sampling occasions is not constant.
This general problem is called the “error in variables” problem and has a long history in statistics. More details are available in the chapter on regression analysis.
2.2.3 Obtaining Estimates
As before, we distinguish between population parameters and sample estimates. We denote the sample intercept by b0 and the sample slope by b1. The equation fit to a particular sample of points is expressed as Ŷi = b0 + b1 Xi, where b0 is the estimated intercept, and b1 is the estimated slope. The symbol Ŷ indicates that we are referring to the estimated line and not to the population line.
As in regression analysis, the best fitting line is typically found using least squares. However, in more complex situations (e.g. when accounting for autocorrelation over time), maximum likelihood methods can also be used. The least-squares line is the line that makes the sum of the squares of the deviations of the data points from the line in the vertical direction as small as possible.
The estimated intercept (b0) is the estimated value of Y when X = 0. In many cases of trend analysis, it is meaningless to talk about values of Y when X = 0 because X = 0 is nonsensical. For example, in a plot of income vs. year, it seems kind of silly to investigate income in year 0. In these cases, there is no clear interpretation of the intercept, and it merely serves as a place holder for the line.
The estimated slope (b1) is the estimated change in Y per unit change in X. In many cases X is measured in years, so this would be the change in Y per year.
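The least-squares computation itself is straightforward. The following sketch (with made-up data, not from these notes) shows the textbook formulas b1 = Sxy/Sxx and b0 = ȳ − b1·x̄, cross-checked against numpy's built-in fit:

```python
import numpy as np

# Hypothetical trend data (not from these notes), purely to show the formulas.
x = np.array([2000.0, 2001, 2002, 2003, 2004, 2005])
y = np.array([10.0, 12, 11, 14, 15, 17])

# Least squares minimizes the sum of squared vertical deviations:
#   b1 = S_xy / S_xx,   b0 = ybar - b1 * xbar
xbar, ybar = x.mean(), y.mean()
b1 = np.sum((x - xbar) * (y - ybar)) / np.sum((x - xbar) ** 2)
b0 = ybar - b1 * xbar

# Fitted values on the estimated line
yhat = b0 + b1 * x

# Cross-check against numpy's built-in least-squares polynomial fit
b1_np, b0_np = np.polyfit(x, y, 1)
```

Note that the fitted line always passes through the point (x̄, ȳ).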
As with all estimates, a measure of precision can be obtained. As before, this is the standard error of each of the estimates. Confidence intervals for the true slope and intercept can also be found.
Formal tests of hypotheses can also be done. Usually, these are only done on the slope parameter as this is typically of most interest. The null hypothesis is that the population slope is 0, i.e. there is no relationship between Y and X, i.e. no trend over time. More formally, the null hypothesis is:

H: β1 = 0

Again notice that the null hypothesis is ALWAYS in terms of a population parameter and not in terms of a sample statistic.
The alternate hypothesis is typically chosen as:

A: β1 ≠ 0

although one-sided tests looking for either a positive or negative trend are possible.
The p-value is interpreted in exactly the same way as in ANOVA, i.e. it measures the probability of observing these data if the hypothesis of no relationship were true.
As before, the p-value does not tell the whole story, i.e. statistical vs. biological (non)significance must be determined and assessed.
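The test itself uses the statistic t = b1 / se(b1) with n − 2 degrees of freedom. A minimal sketch with hypothetical data (the particular numbers are made up for illustration):

```python
import numpy as np
from scipy import stats

# Hypothetical data used only to illustrate the mechanics of the test.
x = np.array([1.0, 2, 3, 4, 5, 6, 7, 8])
y = np.array([2.1, 2.9, 3.8, 5.2, 5.9, 7.1, 7.8, 9.0])
n = len(x)

b1, b0 = np.polyfit(x, y, 1)
resid = y - (b0 + b1 * x)
rmse = np.sqrt(np.sum(resid ** 2) / (n - 2))          # residual std deviation
se_b1 = rmse / np.sqrt(np.sum((x - x.mean()) ** 2))   # standard error of slope

t_stat = b1 / se_b1                                   # test of H: beta1 = 0
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)       # two-sided p-value
```

For these strongly linear data the p-value is tiny, i.e. there is strong evidence against a zero slope.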
2.2.4 Obtaining Predictions
Once the best fitting line is found it can be used to make predictions for new values of X, e.g. what is the predicted value of Y for new time points?

There are two types of predictions that are commonly made. It is important to distinguish between them as these two intervals are the source of much confusion in regression problems.

First, the experimenter may be interested in predicting a SINGLE future individual value for a particular X. Second, the experimenter may be interested in predicting the AVERAGE of ALL future responses at a particular X. 6 The prediction interval for an individual response is sometimes called a confidence interval for an individual response, but this is an unfortunate (and incorrect) use of the term confidence interval. Strictly speaking, confidence intervals are computed for fixed unknown parameter values; prediction intervals are computed for future random variables.

Both of the above intervals should be distinguished from the confidence interval for the slope.
In both cases, the estimate is found in the same manner – substitute the new value of X into the equation and compute the predicted value Ŷ. In most computer packages this is accomplished by inserting a new “dummy” observation in the dataset with the value of Y missing, but the value of X present. The missing Y value prevents this new observation from being used in the fitting process, but the X value allows the package to compute an estimate for this observation.

6 There is actually a third interval, the mean of the next “m” individual values, but this is rarely encountered in practice.
What differs between the two predictions is the estimate of uncertainty.
In the first case, where predictions for INDIVIDUALS are wanted, there are two sources of uncertainty involved in the prediction. First, there is the uncertainty caused by the fact that this estimated line is based upon a sample. Then there is the additional uncertainty that the value could be above or below the predicted line. This interval is often called a prediction interval at a new X.
In the second case, where predictions for the mean response are wanted, only the uncertainty caused by estimating the line based on a sample is relevant. This interval is often called a confidence interval for the mean at a new X.
The prediction interval for an individual response is typically MUCH wider than the confidence interval for the mean of all future responses because it must account for the uncertainty from the fitted line plus individual variation around the fitted line.
Many textbooks have the formulae for the standard errors for the two types of predictions, but again, there is little to be gained by examining them. What is important is that you read the documentation carefully to ensure that you understand exactly what interval is being given to you.
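The difference between the two intervals can be seen directly in a sketch. The data below are hypothetical, and the formulas are the standard textbook ones (the prediction interval carries an extra “1” under the square root for the individual scatter about the line):

```python
import numpy as np
from scipy import stats

# Hypothetical data; the point is the comparison of the two interval widths.
x = np.array([1.0, 2, 3, 4, 5, 6, 7, 8, 9, 10])
y = np.array([5.2, 6.1, 6.8, 8.4, 8.9, 10.2, 10.8, 12.1, 12.7, 14.1])
n = len(x)

b1, b0 = np.polyfit(x, y, 1)
resid = y - (b0 + b1 * x)
rmse = np.sqrt(np.sum(resid ** 2) / (n - 2))
sxx = np.sum((x - x.mean()) ** 2)

x0 = 11.0                        # new X at which to predict
yhat0 = b0 + b1 * x0
tcrit = stats.t.ppf(0.975, df=n - 2)

# Confidence interval for the MEAN response at x0
se_mean = rmse * np.sqrt(1 / n + (x0 - x.mean()) ** 2 / sxx)
ci = (yhat0 - tcrit * se_mean, yhat0 + tcrit * se_mean)

# Prediction interval for a SINGLE future response at x0: the extra "1"
# under the square root is the individual scatter about the line.
se_pred = rmse * np.sqrt(1 + 1 / n + (x0 - x.mean()) ** 2 / sxx)
pi = (yhat0 - tcrit * se_pred, yhat0 + tcrit * se_pred)
```

The prediction interval `pi` is always wider than the confidence interval `ci` at the same X.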
2.2.5 Inverse predictions
A related question is “how long before E[Y] reaches a certain point?” These inverse predictions are obtained by drawing a line across from the Y axis until it reaches the fitted line, and then following the line down until it reaches the X (time) axis. Confidence intervals for the inverse prediction are found by following the same procedure but now following the line horizontally across until it reaches one of the confidence intervals (either for the mean response or the individual response). 7

7 It is possible that the confidence intervals are one-sided (i.e. one side is either plus or minus infinity), or even that the confidence interval comes in two sections. Please consult a reference such as Draper and Smith for details.
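The point estimate for an inverse prediction simply inverts the fitted line. The intercept, slope, and target below are illustrative values only; the interval for this X would come from intersecting the target with the confidence bands, as described above:

```python
# Inverse prediction point estimate: invert yhat = b0 + b1 * x for x.
# The intercept, slope, and target below are illustrative values only.
b0, b1 = -2702.0, 1.46      # illustrative fitted line (days vs. calendar year)
y_target = 230.0            # e.g. when does the mean response reach 230 days?

x_at_target = (y_target - b0) / b1
```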
2.2.6 Residual Plots
After the curve is fit, it is important to examine if the fitted curve is reasonable. This is done using residuals. The residual for a point is the difference between the observed value and the predicted value, i.e. the residual from fitting a straight line is found as: residuali = Yi − (b0 + b1 Xi) = Yi − Ŷi.
There are several standard residual plots:

• plot of residuals vs. predicted (Ŷ);
• plot of residuals vs. X.

In all cases, the residual plots should show random scatter around zero with no obvious pattern. Don't plot residuals vs. Y – this will lead to odd-looking plots which are an artifact of the plot and don't mean anything.
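The residuals themselves are easy to compute by hand (hypothetical data below). For a least-squares fit with an intercept they always sum to zero and are uncorrelated with X, so a good fit shows only random scatter about the zero line:

```python
import numpy as np

# Hypothetical data; residual_i = Y_i - (b0 + b1 * X_i) = Y_i - Yhat_i.
x = np.array([1.0, 2, 3, 4, 5, 6])
y = np.array([3.1, 4.8, 7.2, 8.9, 11.1, 12.9])

b1, b0 = np.polyfit(x, y, 1)
yhat = b0 + b1 * x
residuals = y - yhat

# For a least-squares fit with an intercept the residuals sum to zero and
# are uncorrelated with X, so the standard plots should show only random
# scatter about zero.  (Plot residuals vs. yhat and vs. x to inspect.)
```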
2.2.7 Example: The Grass is Greener (for longer)
D. G. Grisenthwaite, a pensioner who has spent 20 years keeping detailed records of how often he cuts his grass, has been included in a climate change study. David Grisenthwaite, 77, and a self-confessed “creature of habit”, has kept a note of cutting grass in his Kirkcaldy garden since 1984. The grandfather's data was so valuable it was used by the Royal Meteorological Society in a paper on global warming.

The retired paper-maker, who moved to Scotland from Cockermouth in West Cumbria in 1960, said he began making a note of the time and date of every occasion he cut the grass simply “for the fun of it”.
The data are presented in:

Sparks, T.H., Croxton, J.P.J., Collinson, N., and Grisenthwaite, D.A. (2005). The Grass is Greener (for longer). Weather 60, 121-123.

from which the data on the duration of the cutting season was extracted:
Year  Duration (days)
1984  200
1985  215
1986  195
1987  212
1988  225
1989  240
1990  203
1991  208
1992  203
1993  202
1994  210
1995  225
1996  204
1997  245
1998  238
1999  226
2000  227
2001  236
2002  215
2003  242
The question of interest is whether there is evidence that the lawn cutting season has increased over time.
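Before turning to JMP, the straight-line fit can be reproduced with a few lines of numpy. This is only a sketch, but its results agree with the JMP output quoted later in this example (slope about 1.46 days/year with standard error about 0.52, RMSE about 13.5 days):

```python
import numpy as np

# The cutting-season data from the table above.
year = np.arange(1984, 2004)
duration = np.array([200, 215, 195, 212, 225, 240, 203, 208, 203, 202,
                     210, 225, 204, 245, 238, 226, 227, 236, 215, 242],
                    dtype=float)
n = len(year)

b1, b0 = np.polyfit(year, duration, 1)        # slope: about 1.46 days/year
resid = duration - (b0 + b1 * year)
rmse = np.sqrt(np.sum(resid ** 2) / (n - 2))  # about 13.5 days
se_b1 = rmse / np.sqrt(np.sum((year - year.mean()) ** 2))  # about 0.52
```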
JMP analysis

The data and JMP scripts are available in the grass.jmp file in the Sample Program Library available at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms.

The data are entered into JMP in the usual way. Notice that two extra lines were added to the end of the data representing two years for which predictions will be made. Both variables should be continuous scale.
Use the Analyze->Fit Y-by-X platform to create a preliminary plot of the number of days between the first and last cut (Y) versus year (X):
The plot shows some evidence that the duration of the cutting season has increased over time.

We can check some of the assumptions:

• the Y and X variables are both on the proper scale.
• the relationship appears to be approximately linear.
• there are no obvious outliers.
• the variance (scatter) of points around the line appears to be approximately equal. We will check this again from the residual plot.
• there may be some evidence of autocorrelation as the line joining the raw data points seems to dip above and below the line for several years in a row. This could correspond to slowly changing effects such as a multi-year dry or wet spell. However, with only 20 data points, it is difficult to tell. We will check more formally for non-independence by looking at the residual plot and the Durbin-Watson test statistic later.
Use the red-triangle drop-down menu on the plot to select the Fit Line option. This gives:
The estimated intercept (-2702) would represent the estimated duration of the growing season in year 0 – clearly a nonsensical result. It really doesn't matter, as the intercept is just a place holder for the equation of the line. What really is of interest is the estimated slope.
The estimated slope is 1.46 (se 0.52) days/year. This means that the duration of the growing season is estimated to have increased by 1.46 days per year over the span of this study. The 95% confidence interval for the slope 8 (0.36 to 2.56) does not include the value of 0, so there is evidence against the slope actually being 0 (i.e. no change over the years).

Finally, the p-value for testing if the true slope is zero is 0.012 which again provides evidence against the hypothesis of no change in mean duration over the span of the experiment.

The estimated value of RMSE (not shown here but available in the Summary of Fit section of the output) is 13.52 days which is the estimated standard deviation of the data points around the regression line.

The confidence intervals for the mean response and the prediction intervals for the individual response are available from the red-triangle on the linear-fit box. Selecting both confidence intervals gives:

8 If the 95% confidence interval doesn't show in your output, do a right-click (Windoze) or ctrl-click (Macintosh) in the table of parameter estimates and select the 95% lower and upper limits to be displayed.
Notice how much wider the prediction intervals for individual responses are compared to the confidence interval for the mean response. You can use the cross hairs tool to select points on each of the lines to read off the values.

Unfortunately, the Analyze->Fit Y-by-X platform doesn't allow you to save the confidence intervals directly to the data table. In order to do this you need to use the Analyze->Fit Model platform:
The Y variable is the duration of the lawn cutting season, while the only effect to be entered into the effect box is that of year. After the model is fit (with identical results to what we had earlier), the predicted values, confidence intervals, and prediction intervals along with residuals and other good stuff can be saved by clicking on the drop-down red-triangle near the upper plot:
The data table will also include predictions for 2004 and 2005.
Notice the difference in width for the confidence interval for the mean response and the prediction interval for the individual response. These two intervals are often confused and it is important to keep their two uses in mind.
The residual plot is automatically given by the Analyze->Fit Model platform, but is also easily obtained from the Analyze->Fit Y-by-X platform:
It does not show any evidence of problems.
Finally, the Durbin-Watson statistic for testing the presence of autocorrelation is found in the Analyze->Fit Model platform, to give: 9

9 You may have to use the pop-down menu from the red-triangle to get the p-value.
The DW statistic should be close to 2 if there is no autocorrelation present in the data. The p-value does not indicate any evidence of a problem with autocorrelation. The estimated autocorrelation is very small (−.004) so that it is essentially zero.
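The Durbin-Watson statistic can also be computed by hand from the time-ordered residuals (a sketch only; JMP reports the statistic and its p-value directly):

```python
import numpy as np

# The cutting-season data again, with residuals kept in time order.
year = np.arange(1984, 2004)
duration = np.array([200, 215, 195, 212, 225, 240, 203, 208, 203, 202,
                     210, 225, 204, 245, 238, 226, 227, 236, 215, 242],
                    dtype=float)

b1, b0 = np.polyfit(year, duration, 1)
e = duration - (b0 + b1 * year)               # time-ordered residuals

# DW = sum of squared successive differences / sum of squared residuals;
# approximately 2 * (1 - r), where r is the lag-1 autocorrelation.
dw = np.sum(np.diff(e) ** 2) / np.sum(e ** 2)
```

For these data DW is close to 2, consistent with the essentially zero autocorrelation reported above.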
Postscript

A more formal analysis of the data presented in the article looked at the date of first cutting, the date of last cutting, and the number of cuts as well. The authors conclude:
Despite having a relatively short span of 20 years, the data from Kirkcaldy provide biological evidence of an increase in the length of the growing season and some suggestions of what meteorological factors affect lawn growth. Strictly, we are dealing with the cutting season which is likely to underestimate the growing season.

This was quite an interesting analysis of an unusual data set!
2.3 Transformations
In some cases, the plot of Y vs. X is obviously non-linear and a transformation of X or Y may be used to establish linearity. For example, many dose-response curves are linear in log(X). Or the equation may be intrinsically non-linear, e.g. a weight-length relationship is of the form weight = β0 length^β1. Or, some variables may be recorded in an arbitrary scale, e.g. should the fuel efficiency of a car be measured in L/100 km or km/L? You are already familiar with some variables measured on the log scale – pH is a common example. Often a visual inspection of a plot may identify the appropriate transformation.

There is no theoretical difficulty in fitting a linear regression using transformed variables other than an understanding of the implicit assumption about the error structure. The model for a fit on transformed data is of the form

trans(Y) = β0 + β1 × trans(X) + error
Note that the error is assumed to act additively on the transformed scale. All of the assumptions of linear regression are assumed to act on the transformed scale – in particular that the standard deviation around the regression line is constant on the transformed scale.
The most common transformation is the logarithmic transform. It doesn't matter if the natural logarithm (often called the ln function) or the common logarithm (often called the log10 transformation) is used. There is a 1-1 relationship between the two transformations, and linearity on one transform is preserved on the other. The only change is that values on the ln scale are 2.302 = ln(10) times those on the log10 scale, which implies that the estimated slope and intercept both differ by a factor of 2.302. There is some confusion in scientific papers about the meaning of log – some papers use this to refer to the ln transformation, while others use this to refer to the log10 transformation.
After the regression model is fit, remember to interpret the estimates of slope and intercept on the transformed scale. For example, suppose that a ln(Y) transformation is used. Then we have

ln(Y_{t+1}) = b0 + b1 × (t + 1)

and

ln(Y_t) = b0 + b1 × t.

Subtracting gives

ln(Y_{t+1}) − ln(Y_t) = ln(Y_{t+1}/Y_t) = b1 × (t + 1 − t) = b1

so that

exp(ln(Y_{t+1}/Y_t)) = Y_{t+1}/Y_t = exp(b1) = e^{b1}
Hence a one unit increase in X causes Y to be MULTIPLIED by e^{b1}. As an example, suppose that on the log scale the estimated slope was −.07. Then every unit change in X causes Y to change by a multiplicative factor of e^{−.07} = .93, i.e. roughly a 7% decline per year. 10
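A tiny numeric check of this interpretation (the slope −.07 is the illustrative value from the text):

```python
import math

# The slope -0.07 is the illustrative value from the text.
b1 = -0.07
factor = math.exp(b1)            # multiplicative change in Y per unit X
pct_change = (factor - 1) * 100  # roughly a 7% decline per year
```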
Predictions on the transformed scale must be back-transformed to the untransformed scale.
In some problems, scientists search for the ‘best’ transform. This is not an easy task and using simple statistics such as R² to search for the best transformation should be avoided. Seek help if you need to find the best transformation for a particular dataset.
2.3.1 Example: Monitoring Dioxins - transformation
An unfortunate byproduct of pulp-and-paper production used to be dioxins - a very hazardous material. This material was discharged into waterways with the pulp-and-paper effluent where it bioaccumulated in living organisms such as crabs. Newer processes have eliminated this byproduct, but the dioxins in the organisms take a long time to degrade.

10 It can be shown that on the log scale, for smallish values of the slope, the change is almost the same on the untransformed scale, i.e. if the slope is −.07 on the log scale, this implies roughly a 7% decline per year; a slope of .07 implies roughly a 7% increase per year.
Government environmental protection agencies take samples of crabs from affected areas each year and measure the amount of dioxins in the tissue. The following example is based on a real study.

Each year, four crabs are captured from a monitoring station. The liver is excised and the livers from all four crabs are composited together into a single sample. 11 The dioxin levels in this composite sample are measured. As there are many different forms of dioxins with different toxicities, a summary measure, called the Total Equivalent Dose (TEQ), is computed from the sample.
Here are the raw data:
Site  Year  TEQ
a     1990  179.05
a     1991   82.39
a     1992  130.18
a     1993   97.06
a     1994   49.34
a     1995   57.05
a     1996   57.41
a     1997   29.94
a     1998   48.48
a     1999   49.67
a     2000   34.25
a     2001   59.28
a     2002   34.92
a     2003   28.16
JMP analysis

The data is available in a JMP data file dioxinTEQ.jmp in the Sample Program Library available at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms.

As with all analyses, start with a preliminary plot of the data obtained using the Analyze->Fit Y-by-X platform.

11 Compositing is a common analytical tool. There is little loss of useful information induced by the compositing process - the only loss of information is the among individual-sample variability, which can be used to determine the optimal allocation between samples within years and the number of years to monitor.
The preliminary plot of the data shows a decline in levels over time, but it is clearly non-linear. Why is this so? In many cases, a fixed fraction of dioxins degrades per year, e.g. a 10% decline per year. This can be expressed in a non-linear relationship:

TEQ = C r^t

where C is the initial concentration, r is the fraction remaining from one year to the next (e.g. r = .90 for a 10% decline per year), and t is the elapsed time. If this is plotted over time, this leads to the non-linear pattern seen above.
If logarithms are taken, this leads to the relationship:

log(TEQ) = log(C) + t × log(r)

which can be expressed as:

log(TEQ) = β0 + β1 × t

which is the equation of a straight line with β0 = log(C) and β1 = log(r).
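This log-linear fit is easy to reproduce with numpy as a cross-check of the JMP analysis. Natural logs are used here, matching the fitted line reported later in this example (intercept about 218.9, slope about −.11 per year):

```python
import numpy as np

# The TEQ data from the table above; natural logs are used here.
year = np.arange(1990, 2004)
teq = np.array([179.05, 82.39, 130.18, 97.06, 49.34, 57.05, 57.41,
                29.94, 48.48, 49.67, 34.25, 59.28, 34.92, 28.16])

b1, b0 = np.polyfit(year, np.log(teq), 1)  # slope: about -0.11 per year
annual_factor = np.exp(b1)                 # about 0.90: roughly a 10% decline/year
```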
JMP can easily be used to compute log(TEQ) by using the Formula Editor in the usual fashion. A plot of log(TEQ) vs. year using the Analyze->Fit Y-by-X platform gives the following:
This looks linear over time with a steady decline. A line can be fit as before by selecting the Fit Line option from the red triangle in the upper left side of the plot:
This gives the following output:
The residual plot looks fine with no apparent problems, but the dip in the middle years could require
further exploration if this pattern were apparent at other sites as well:
The fitted line is:
log(TEQ) = 218.9 − .11(year)
The intercept (218.9) would be the log(TEQ) in the year 0, which is clearly nonsensical. The slope (−.11) is the estimated log(ratio) from one year to the next. For example, exp(−.11) = .896 would mean that the TEQ in one year is only 89.6% of the TEQ in the previous year, or about an 11% decline per year. 12
The standard error of the estimated slope is .02. A 95% confidence interval for the slope can be obtained by pressing a Right-Click (for Windoze machines) or a Ctrl-Click (for Macintosh machines) in the Parameter Estimates summary table and selecting the confidence intervals to display in the table.
The 95% confidence interval for the slope is (−.154 → −.061). If you take the anti-logs of the endpoints, this gives a 95% confidence interval for the fraction of TEQ that remains from year to year, i.e. between 0.86 and 0.94 of the TEQ in one year remains to the next year.
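The back-transformation of the slope and its confidence interval can be verified directly. The slope and interval below are the rounded values quoted above, so the results are approximate.

```python
import math

slope = -0.11            # estimated slope on the log scale (rounded)
lo, hi = -0.154, -0.061  # 95% confidence interval for the slope

# Anti-log the slope: fraction of TEQ remaining from one year to the next
retained = math.exp(slope)  # about 0.90, i.e. roughly a 10% decline per year

# Anti-log the CI endpoints: interval for the retained fraction
ci_retained = (math.exp(lo), math.exp(hi))  # about (0.86, 0.94)
```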
Several types of predictions can be made. For example, what would be the estimated mean TEQ in 2010?
12 It can be shown that in regressions of log(Y) vs. time, the estimated slope on the logarithmic scale is the approximate fractional decline per time interval. For example, in the above, the estimated slope of −.11 corresponds to an approximate 11% decline per year. This approximation only works well when the slopes are small, i.e. close to zero.
This can be accomplished in several ways.
The computations could be done by hand, or by using the cross-hairs on the plot from the Analyze->Fit Y-by-X platform. Confidence intervals for the mean response, or prediction intervals for an individual response, can be added to the plot from the pop-down menu.
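For readers without JMP, the confidence interval for the mean response and the prediction interval for an individual response at a new year can be computed from the standard simple-regression formulas. The sketch below uses hypothetical log(TEQ) values, not the actual composite data.

```python
import numpy as np
from scipy import stats

# Hypothetical yearly log(TEQ) values (illustrative only)
year = np.array([1990, 1992, 1994, 1996, 1998, 2000, 2002])
logteq = np.array([4.8, 4.6, 4.5, 4.1, 4.0, 3.7, 3.6])

n = len(year)
b1, b0 = np.polyfit(year, logteq, 1)
fitted = b0 + b1 * year
s = np.sqrt(np.sum((logteq - fitted) ** 2) / (n - 2))  # residual std. dev.
sxx = np.sum((year - year.mean()) ** 2)
tcrit = stats.t.ppf(0.975, df=n - 2)

x0 = 2010                     # year for which a prediction is wanted
pred = b0 + b1 * x0           # predicted mean log(TEQ) in 2010
se_mean = s * np.sqrt(1 / n + (x0 - year.mean()) ** 2 / sxx)
se_indiv = s * np.sqrt(1 + 1 / n + (x0 - year.mean()) ** 2 / sxx)

ci_mean = (pred - tcrit * se_mean, pred + tcrit * se_mean)    # CI for the mean
pi_indiv = (pred - tcrit * se_indiv, pred + tcrit * se_indiv) # PI for an individual
```

Note that the prediction interval is always wider than the confidence interval for the mean, because it must also account for the variation of an individual observation about the line.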
However, a more powerful tool is available from the Analyze->Fit Model platform.
Start first by adding rows to the original data table corresponding to the years for which a prediction is required. In this case, the additional row would have the value of 2010 in the Year column with the remainder of the row unspecified. Missing values will be automatically inserted for the other variables.
Then invoke the Analyze->Fit Model platform:
This gives much the same output as the Analyze->Fit Y-by-X platform with a few new (useful) features, a few of which we will explore in the remainder of this section.
Next, save the prediction formula, the confidence interval for the mean, and the interval for an individual prediction to the data table (this will take three successive saves):
Now the data table has been augmented with additional columns and, more importantly, predictions for 2010 are now available:
The estimated mean log(TEQ) is 2.60 (corresponding to an estimated MEDIAN TEQ of exp(2.60) = 13.46). A 95% confidence interval for the mean log(TEQ) is (1.94 to 3.26), corresponding to a 95% confidence interval for the actual MEDIAN TEQ of between 6.96 and 26.05. 13 Note that the confidence interval after taking anti-logs is no longer symmetrical.
Why does a mean of a logarithm transform back to the median on the untransformed scale? Basically, because the transformation is non-linear, properties such as means and standard errors cannot simply be anti-transformed without introducing some bias. However, measures of location (such as a median) are unaffected. On the transformed scale, it is assumed that the sampling distribution about the estimate is symmetrical, which makes the mean and median take the same value. So what really is happening is that the median on the transformed scale is back-transformed to the median on the untransformed scale.
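A small simulation illustrates the point: for right-skewed (here log-normal) data, anti-transforming the mean of the logs recovers the median of the original scale, not the mean.

```python
import numpy as np

rng = np.random.default_rng(1)
# Log-normal responses: log(Y) is normal, so Y is right-skewed
y = rng.lognormal(mean=2.0, sigma=0.8, size=100_000)

# Back-transform the mean of log(Y)
back = np.exp(np.log(y).mean())

# 'back' tracks the median of Y (about exp(2) = 7.4);
# the mean of Y is noticeably larger because of the skew
```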
Similarly, a 95% prediction interval for the log(TEQ) for an INDIVIDUAL composite sample can be found.
Finally, an inverse prediction is sometimes of interest, i.e. in what year will the TEQ be equal to some particular value? For example, health regulations may require that the TEQ of the composite sample be below 10 units.
The Analyze->Fit Model platform has an inverse prediction function:
13 A minor correction can be applied to estimate the mean if required.
Specify the required value for Y - in this case log(10) = 2.302 - and then press the RUN button to get the following output:
The predicted year is found by solving
2.302 = 218.9 − .11(year)
and gives an estimated year of 2012.7 (JMP solves using the full-precision coefficients; the rounded values shown here will not reproduce this exactly). A confidence interval for the time when the mean log(TEQ) is equal to log(10) is somewhere between 2007 and 2026!
2.3.2 Final Words
The application of regression to non-linear problems is fairly straightforward after the transformation is made. The most error-prone step of the process is the interpretation of the estimates on the TRANSFORMED scale and how these relate to the untransformed scale.
2.4 Power/Sample Size
2.4.1 Introduction
A common goal in ecological research is to determine if some quantity (e.g. abundance, water quality) is tending to increase or decrease. A linear regression of this quantity against time is commonly used to evaluate such a trend. The methods presented earlier can be used in these situations without much difficulty, except for problems of autocorrelation over time (for example, if the same monitoring plots were measured repeatedly over time), and making sure that the experimental and observational units are not confused (this is similar to the problem of sub-sampling discussed earlier). 14
When designing programs to detect trends, several related questions arise. For how many years does the study have to run? What influence does the precision of the individual yearly measurements have on the length of the monitoring study? What is the power to detect a certain sized trend given a proposed study design?
As in ANOVA, these questions are answered through a power analysis. The information needed to conduct a power analysis for linear regression is similar to that required for a power analysis in ANOVA - however, the computations are more complex.
The information needed is:
• α level. As in power analyses for ANOVA, this is traditionally set to α = 0.05.
14 An example of such confusion would be an investigation of the fecundity of a bird over time. Several sites covering the range of the bird are measured and many nests within each site are also measured. This study continues for a number of years. The average fecundity (over all sites and nests) is the response variable, i.e. one single number per year rather than the individual nest measurements. The reason for this is that factors that operate on the yearly scale (e.g. environmental variables) affect all nests simultaneously rather than operating on a single nest at a time independently of other nests. For example, a poor summer will depress fecundity for all nests simultaneously.
• effect size. In ANOVA, power deals with detection of differences among means. In regression analysis, power deals with detection of slopes that are different from zero. Hence, the effect size is measured by the slope of the line, i.e. the rate of change in the mean of Y per unit change in X.
• sample size. Recall in ANOVA with more than two groups that the power depended not only on the sample size per group, but also on how the means are separated. In regression analysis, the power will depend upon the number of observations taken at each value of X and the spread of the X values. For example, the greatest power is obtained when half the sample is taken at each of the two extremes of the X space - but at the cost of not being able to detect non-linearity. For many monitoring designs, observations are taken on a yearly basis, so the question reduces to the number of years of monitoring required.
• standard deviation. As in ANOVA, the power will depend upon the variation of the individual objects around the regression line.
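A simulation-based power computation in the spirit of MONITOR can be sketched with these four ingredients. This is a minimal version assuming one observation per year, independent errors, and zero process variation; the slope, standard deviation, and number of years below are arbitrary illustrative values.

```python
import numpy as np
from scipy import stats

def trend_power(slope, sd, years, alpha=0.05, nsim=2000, seed=42):
    """Estimate, by simulation, the power to detect a linear trend
    with one observation per year and independent normal errors."""
    rng = np.random.default_rng(seed)
    t = np.arange(years)
    hits = 0
    for _ in range(nsim):
        y = slope * t + rng.normal(0, sd, size=years)
        res = stats.linregress(t, y)
        if res.pvalue < alpha:   # slope significantly different from 0
            hits += 1
    return hits / nsim

# e.g. a decline of 2 units/year against a residual sd of 5,
# monitored yearly for 10 years
power_10yr = trend_power(-2.0, 5.0, 10)
```

As the text suggests, power rises quickly with the number of years of monitoring, because adding years both adds observations and widens the spread of the X values.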
A very nice series of papers on detecting trends in ecological studies is available:
• Gerrodette, T. 1987. A power analysis for detecting trends. Ecology 68: 1364-1372. http://dx.doi.org/10.2307/1939220.
• Link, W. A. and Hatfield, J. S. 1990. Power calculations and model selection for trend analysis: a comment. Ecology 71: 1217-1220. http://dx.doi.org/10.2307/1937393.
• Gerrodette, T. 1991. Models for power of detecting trends - a reply to Link and Hatfield. Ecology 72: 1889-1892. http://dx.doi.org/10.2307/1940986.
• Gerrodette, T. 1993. TRENDS: software for a power analysis of linear regression. Wildlife Society Bulletin 21: 515-516.
JMP does not include a power computation module for regression analysis. However, SAS v.9+ includes a power analysis module (PROC GLMPOWER) for the power analysis of regression models, but it is a bit complex to use.
Perhaps the most common aspect of a power analysis for linear regression is the planning of a monitoring study to detect trends over time. This considerably simplifies the computations of the power, as usually the time points are equally spaced with the same number of measurements taken at each time point. There are two readily available software packages to help plan such studies. The first, TRENDS, available at http://swfsc.noaa.gov/textblock.aspx?Division=PRD&ParentMenuId=228&id=4740, is a Windoze-based program that does the computations as outlined in the above papers. Because of concerns raised by Link and Hatfield, a second program, MONITOR, available at http://www.mbr-pwrc.usgs.gov/software/monitor.html, was developed that does power computations based on simulation rather than simple formulae. It uses a web-based interface rather than running on individual machines. 15 This second program also has additional flexibility to handle situations where the monitoring points are not equally spaced in time, or there are multiple measurements taken at each time point.
15 The original author of MONITOR, James Gibbs, indicates that a Windoze version will be available in early 2005 at http://www.esf.edu/efb/gibbs/
CAUTION: Power analysis for trend can be very complex. The authors of Program Monitor have some sage advice that is applicable to both TRENDS and MONITOR:
Users should be aware (and wary) of the complexity of power analysis in general, and also acknowledge some specific limitations of MONITOR for many real-world applications. Our chief, immediate concern is that many users of MONITOR may be unaware of these limitations and may be using the program inappropriately. Below are comments from one of our statisticians on some of the aspects of MONITOR that users should be cognizant of: "There are numerous issues with how Program Monitor calculates statistical power and sample size. One issue concerns the default option whereby the user assumes independence of plots or sites from one time period to the next. If you are randomly sampling new sites or plots each time period, then it is correct to assume independence (assuming that the finite population correction factor is not an issue, which depends on how many plots or sites you are sampling relative to the total population size of potential plots or sites). If you are sampling the same plots or sites repeatedly over time, however, then the default option in Program Monitor is unlikely to give a correct calculation of statistical power or sample size. If plots or sites are positively autocorrelated over time, as is usually the case in biological surveys, then Program Monitor will underestimate sample size, or conversely, it will overestimate the statistical power. The correct sample size estimate is likely to be greater, and depending upon the amount of autocorrelation, the correct sample size could be vastly greater to achieve a stated power objective. A more fundamental issue concerns the null model one chooses for the trend in population growth. Program Monitor assumes a relatively simple linear trend in population growth, but this is a controversial issue, because there are potentially an infinite number of models one could use. If pilot data are available, then it may be possible to estimate autocorrelation and try to make some choices concerning the type of model to use as the null model for a power calculation, but regardless of how you decide to proceed, it would be a good idea to consult a statistician to determine an approach that fits your needs and data. No matter what additional flexibility is built into the modeling, however, it will always be possible to posit the existence of further structure which, if overlooked, will produce misleading results. For a pertinent discussion of some of these issues, please see Elzinga et al. (1998). Although this reference deals specifically with plant populations, the fundamental statistical issues are similar whether you are sampling plant or animal populations. Literature Citation: Elzinga, C.L., D.W. Salzer, and J.W. Willoughby. 1998. Measuring and monitoring plant populations. BLM Technical Reference 1730-1, Denver, CO. 477 pages."
Some care must also be taken to distinguish between sampling variation and process variation.
[Figure: Process vs Sampling Variation - a population trajectory over time, with braces marking the two variance components.] Sampling variation refers to the uncertainty of each measurement in each year; this can be reduced by increasing the sampling effort in each year. Process variation refers to the fact that even if the data values were known exactly, the points would not lie on the straight line; process variation is unaffected by the sampling effort in each year.
Sampling variation is the size of the standard error when estimates are made at each sampling occasion. Sampling variation can be reduced by increasing sampling effort (e.g. more measurements per occasion). Process variation refers to the variation around the perfect linear regression even if there were no uncertainty in each individual observation. Process variation cannot be reduced by increasing sampling effort. At the moment, Program Monitor assumes that process variation is 0, i.e. if you knew each data point exactly, they would all fit exactly on the linear trend. There are a number of web pages that discuss this issue in more detail - do a simple search using a search engine.
2.4.2 Getting the necessary information
As noted earlier, the information required to do a power analysis is similar to that for ANOVA. We will concentrate on relevant quantities for a trend analysis over time rather than a general regression situation. I will use population size as my response variable, but any other ecological quantity could be used.
α level. As in power analyses for ANOVA, this is traditionally set to α = 0.05.
Effect size. In trend analysis, this is traditionally specified as the rate of change per unit time and denoted by r. For example, a value of r = .02 = 2% corresponds to an (increasing) change of 2% per year. Both TRENDS and MONITOR allow for both linear and exponential trends. In linear trends, the population size changes by the same fixed percentage of the initial population each year. So if the initial population was 1000 animals, a 2% decline per year would correspond to a fixed change of .02 × 1000 = 20 animals per year, i.e. 1000, 980, 960, 940, 920, 900, . . .
In exponential trends, the change is multiplicative each year. So if the initial population was 1000 animals, a 2% (multiplicative) decline corresponds to 1000 × .98 = 980 animals in the next year, 980 × .98 = 1000 × .98² = 960.4 in the next year, followed by 941.2, 922.4, 904, 885, etc. in subsequent years.
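The two kinds of trend can be generated side by side, using the 1000-animal, 2%-decline example above:

```python
import numpy as np

n0, rate = 1000, 0.02
t = np.arange(6)

# Linear trend: lose the same 20 animals (2% of the INITIAL size) each year
linear = n0 - rate * n0 * t

# Exponential trend: lose 2% of the CURRENT size each year
exponential = n0 * (1 - rate) ** t
```

For these settings the linear trajectory is 1000, 980, 960, 940, 920, 900, while the exponential trajectory is 1000, 980, 960.4, 941.2, 922.4, 903.9; the two diverge slowly because the rate is small.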
If the rate is small, then an exponential and a linear trend will be very similar for short time series - they can be quite different if the rate is large and/or the time series is very long.
Individuals monitoring populations often think of long-term trends in populations, such as: how many plots do I need to monitor to detect a 10% reduction in this population over a 10 year period? This overall change must be converted to a rate per unit time. The MONITOR home page has a trend converter, but the computations are relatively simple.
For linear trends, the rate is found as:
r = R / (n − 1)
where R is the overall fractional change in abundance over the n years. For example, a 10% reduction over 10 years has R = −.1 and n = 10, leading to:
r = −.1 / (10 − 1) = −.011
or just over a 1% reduction per year.
For exponential trends, the rate is found as:
r = (R + 1)^(1/(n−1)) − 1
where R is the overall fractional change in abundance over the n years. For example, a 10% reduction over 10 years has R = −.1 and n = 10, leading to:
r = (.9)^(1/9) − 1 = −.0116
or just over a 1% reduction per year. Again note that for small reductions and a small number of years, both a linear and an exponential trend have similar rates.
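Both conversions can be checked numerically for the 10%-over-10-years example:

```python
# Per-year rate implied by a 10% overall reduction (R = -0.10)
# over n = 10 years, using the two conversion formulas above
R, n = -0.10, 10

r_linear = R / (n - 1)                        # about -0.0111 per year
r_exponential = (R + 1) ** (1 / (n - 1)) - 1  # about -0.0116 per year
```

As the text notes, the two rates are nearly identical for a change this small over this few years.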
Sample size. For many monitoring designs, observations are taken on a yearly basis, so the question reduces to the number of years of monitoring required. TRENDS requires fixed sampling intervals while MONITOR allows for some flexibility in the timing of the monitoring.
Standard deviation. As in ANOVA, the power will depend upon the variation of the individual objects around the regression line.
In many cases, the standard deviation is not directly available; rather, the variability of the estimates of the individual observations is reported as the relative standard error (cv = stddev/mean). TRENDS uses the cv while MONITOR uses the actual standard deviation.
Gibbs (2000) 16 summarizes typical cvs for measuring a number of types of populations:
16 Gibbs, J. P. (2000). Monitoring Populations. Pages 213-252 in Research Techniques in Animal Ecology, Boitani, L. and Fuller, T. K., eds., Columbia University Press.
Group                     cv
Large mammals             15%
Grasses and sedges        20%
Herbs, compositae         20%
Herbs, non-compositae     20%
Turtles                   35%
Salamanders               35%
Large bodied birds        35%
Lizards                   40%
Fishes, salmonids         50%
Caddis flies              50%
Snakes                    55%
Dragonflies               55%
Small bodied birds        55%
Beetles                   60%
Small mammals             60%
Spiders                   65%
Medium sized mammals      65%
Fishes, non-salmonids     70%
Salamanders (aquatic)     85%
Moths                     90%
Frogs and toads           95%
Bats                      95%
Butterflies              110%
Flies                    130%
If necessary, these can be converted to a standard deviation if the initial density is approximately known, by multiplying the cv by the initial density. For example, if the initial density is 25 mice/hectare, then the approximate standard deviation (for small mammals) would be found as 25 mice/hectare × 60% = 15 mice/hectare.
Finally, even if all else is equal, the variation often changes with the change in abundance over time. Gerrodette (1987) examines three cases:
• the cv is constant over time.
• the cv is proportional to √abundance.
• the cv is proportional to 1/√abundance.
Many sampling methods give cvs that are proportional to 1/√abundance. The TRENDS program allows you to select an appropriate relationship. Again, for small time scales, there isn't much of a difference in results among the different relationships of cv and abundance.
The cv may be improved if multiple, independent samples are taken each year. If m independent samples are taken each year, then the corresponding cv value is:

cv_average = cv_individual / √m

Both programs do this computation automatically if you specify that the effort is increased at each sampling occasion. Note that you are implicitly assuming that process variation is 0 in these cases, i.e. if perfect information were known, the abundance would lie exactly on the trend line. This may not be a suitable assumption, and some care is needed if a large amount of sampling is to be done in each year to try to get the cv of the estimates down to a reasonable level - the payoff may not be as great as expected.
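The effect of averaging m independent samples follows directly from the formula above. A short sketch (the 60% cv is the small-mammal value quoted earlier; the function names are my own):

```python
import math

def cv_of_average(cv_individual, m):
    """cv of the mean of m independent samples taken on one occasion."""
    return cv_individual / math.sqrt(m)

def samples_needed(cv_individual, cv_target):
    """Smallest m with cv_individual / sqrt(m) <= cv_target."""
    return math.ceil((cv_individual / cv_target) ** 2)

# A small-mammal survey whose individual samples have cv = 60%:
print(cv_of_average(0.60, 4))      # 4 samples/year halve the cv to 0.30
print(samples_needed(0.60, 0.20))  # 9 samples/year to reach cv = 20%
```

The square root is the source of the "payoff may not be as great as expected" caveat: halving the cv always costs four times the sampling effort.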
2.4.3 How does power vary as information changes?
A nice discussion of some of the issues in sample size for trend analysis is found at http://www.pwrc.usgs.gov/monmanual/samplesize.htm which is reproduced here for convenience:
Managers' Monitoring Manual | Setting Sample Size
Patuxent Wildlife Research Center
Figuring out how many samples you need
The number of samples you need is affected by the following factors:
• Project goals
• How you plan to analyze your data
• How variable your data are or are likely to be
• How precisely you want to measure change or trend
• The number of years over which you want to detect a trend
• How many times a year you will sample each point
• How much money and manpower you have
Here are some graphs that illustrate some of these trade-offs. These graphs were made using the assumption that you would be analyzing your data using simple linear regression. Each graph isolates one factor and looks at how altering that factor affects sample size. Those factors are explained in greater detail below.
In general, you can lower your sample size requirements by adopting the following approaches:
• Aim to detect only long-term changes
• Set your analytical tests to P
• Must be located randomly or uniformly throughout the study area
• Must detect a constant proportion of the individuals (or estimate the differences)
• Be precise enough to detect the types of changes you want to detect
Issues of bias, sample placement, and choosing your counting technique have been discussed elsewhere in this web site. Here we will help you determine whether your monitoring program has a sufficient number of sampling locations (sample size) to detect the types of changes you have set forth as your goal.
So, what is a sufficient sample size?
To answer that you need to address three things:
1. What is the inherent variability of your counts?
2. What magnitude of trend do you want to detect, and how precisely would you like to measure it?
3. How are you going to statistically test for population change?
Count Variation
Count variation is simply a measure of how your counts fluctuate from year to year. Variation affects your ability to detect trends: obviously, if the data fluctuate greatly you will not have the resolution to find an increasing or decreasing trajectory in the population you are monitoring.
Basic rule of thumb: The more variable your counts, the more samples you will need to detect a change or trend of a given magnitude. Conversely, for any given sample size, the more your counts vary, the lower your ability to detect trends.
Sample size calculations need an estimate of count variation. You can get such an estimate from your own pilot data (the mean and standard deviation) or from estimates taken from other, similar situations. We provide some of those estimates for amphibians and birds (point counts and territory mapping). Note that these are calculations of fluctuation over time, not over space, meaning that you calculate a mean and standard deviation of the counts across several years at one point, and not a mean and standard deviation among several points.
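The distinction between variation over time and variation over space is easy to get wrong when you sit down with a spreadsheet. A minimal sketch (the counts below are invented for illustration):

```python
import statistics as stats

# Hypothetical yearly counts at two points (keys = points, lists = years).
counts = {
    "point A": [12, 15, 9, 14, 11],
    "point B": [31, 26, 35, 28, 30],
}

# RIGHT for these sample-size calculations: cv of the counts ACROSS YEARS
# at a single point.
for point, yearly in counts.items():
    cv = stats.stdev(yearly) / stats.mean(yearly)
    print(f"{point}: cv over time = {100 * cv:.0f}%")

# WRONG for this purpose: cv AMONG POINTS within a single year.
year_one = [yearly[0] for yearly in counts.values()]  # [12, 31]
cv_space = stats.stdev(year_one) / stats.mean(year_one)
print(f"cv over space (year 1) = {100 * cv_space:.0f}%")
```

In this toy example the spatial cv is several times the temporal cv, which shows how badly a sample-size calculation can be thrown off by using the wrong one.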
Be aware, when using counts from other studies, that count variances are specific to the counting technique and to how the original study pooled their samples. Additionally, as you can see from these collections of count variances, even when using the same counting technique, on the same species, in the same region, the degree of variability in the resulting counts usually differs (often greatly) from study to study or even from site to site. The good news is that reviews of long-term studies have shown that, at any individual site, the variability in counts remains about the same. This is another strong, strong reason to review your monitoring program after 5 years, to see if you have been adequately sampling your populations.
Basic rule of thumb: Use the estimates of variability of counts from other studies as a general guide to what you might expect in yours, but it is wise to check the variability of your own counts after your program has been established for 5 years to see if your sampling strategy needs to be revised.
Helpful hints on how to decrease variability
Philosophical Considerations
Trend. Trend can be defined as change over time. More apropos to a monitoring program would be to define trend as some specific rate of change over a specific period of time. Most calculations for determining sample size require that you specify a minimum rate of change you would like to detect and a minimum time period over which you would detect those changes. Those minimums now become the targets our monitoring program will aim to achieve or beat. In other words, by appropriately setting our sample sizes we hope to be able to detect a trend as small as or smaller than those minimums which we have targeted.
Basic rule of thumb: The smaller the population change you would like to detect, the greater the number of samples you will need to detect it.
Another basic rule of thumb: The fewer the number of years over which you would like to detect a trend, the greater the number of samples you will need.
Grand rule of thumb: Any monitoring program whose goal is to detect small population changes over just a few years will be expensive to create.
Precision. Calculators of sample size also need to know how precisely you want to measure these changes. An imprecisely measured trend is a very unsatisfactory trend in that you are unsure of how well it really reflects the REAL changes in the animal populations on your land. On the other hand, a very precisely measured trend can be very costly to obtain, because you will have to spend a great deal of your budget to achieve that level of precision. So, think of your precision goal as your willingness to risk being wrong about the population change you are trying to measure. You need to determine the amount of risk you are willing to take in your monitoring program and understand the consequences of that decision, both as a cost to your budget and in the probability of being wrong.
Basic rule of thumb: The lower the precision, the lower the number of samples you will need. Conversely, the higher the precision, the larger the number of samples needed.
You control precision and risk using two statistical settings: alpha and power. Because most basic statistical books and quite a number of web sites cover these parameters well and are very accessible, we will not cover them here, but we do want to highlight a few considerations relevant to the estimation of sample sizes for monitoring programs.
Alpha Level. Because animal populations and their counts will vary for a number of reasons, the data from your monitoring program just cannot be expected to produce nice straight lines when you finally plot them out. Because of this imprecision we must specify some level of uncertainty in our measures of change that we are willing to tolerate. This level represents our willingness to risk being wrong, for example, to claim that a trend exists when it does not. Traditionally, this is known as setting the alpha level.
Setting your alpha level is a balance between not wanting to 'cry wolf' (saying a trend exists when it really doesn't) and missing an important trend by being too conservative. If you are using your monitoring program as an early warning of negative population change, then you may want to increase your alpha level above the traditional level of 0.05. To do so may mean that you 'cry wolf' more of the time, but because the goal of most monitoring programs is to alert managers to potential problems, a higher alpha is justified in light of the possibility of missing a problem while waiting for it to become "statistically significant" at a lower alpha level.
Basic rule of thumb: The less willing you are to be caught 'crying wolf' (or the smaller you set the alpha level), the more samples you will need to detect a given level of population trend.
Power. Power can be defined as your ability to detect (or the odds of detecting) a trend given that there really is a change going on in your animal populations. In general, a power of 90-95% is reasonable for most monitoring programs.
Statistical Testing
You now know that count variability affects the number of samples you need, as does your requirement for what magnitude of change you want your program to detect. The last issue that needs to be resolved is what statistical model you will use to test your data.
Basic rule of thumb. The specific formula (or simulations) used to calculate sample size is unique to the statistical test or model that you will use.
Now ... some practical guidance on how to calculate the sample sizes for your monitoring program.
Note: Throughout this document we often use the terms variance, variation, and variability as a shorthand expression for the variability of counts. However, understand that the actual mathematical calculation of variability could be any one of several measures (standard deviation, standard error, variance, or coefficient of variation), each of which has a specific statistical meaning.
Basic rules of thumb. You must have determined the following to set sample sizes:
• A mean and standard deviation (i.e., the coefficient of variation or the variation of your counts)
• The smallest number of years over which you would like to detect a change
• The smallest percentage change you would like to detect over those years
• An alpha level (how often you will cry wolf)
• A power level (the proportion of the time you would like to detect a trend if one were occurring)
• A statistical test (your analytical model)
Calculating the mean and standard deviation requires some additional explanation. While the other factors that affect sample sizes are set based on your desired need for precision and the smallest degree of change you want to detect, the mean and variance are factors that are set by the animals being sampled. If you have several years of pilot data you will want to calculate the mean and standard deviation from your own data. If you don't, then you can use someone else's data from as similar a situation as you can find to estimate means and standard deviations. In a pinch, you can estimate some of the variation to be expected in a set of yearly counts by calculating a mean and standard deviation from one year's data if you have several replicates OF THE SAME points or plots. However, this approach fails to account for any between-year variation in the animal populations. Finally, you can use data published in the literature or from one of our databases on count variation (e.g. amphibians, bird point counts, bird territory mapping).
Pilot data is far and away the most preferable source of information for determining count variation. We have found that the variation in the counts of animals is very consistent within a site (figure). However, there are often wide differences in the variation of counts among sites, even those close by that use the same technique.
Ways to avoid problems when you calculate variances:
1. Use data collected over time, not over space.
2. Use data that match the counting technique and sampling units you plan to use (e.g., don't use the variances that come from counts from a 50-stop point count system when you are planning to use only a 20-stop system).
3. Use means that come from the same data you used to calculate the variance.
Basic rule of thumb: If you have no access to pilot data and are not aware of examples from the literature that you trust, a conservative estimate of the amount of variation that you could use in sample size calculations would be a CV of 100%, with a moderately conservative alternative of 75%.
Choosing an analytical technique also requires some further explanation. The specific calculation of sample sizes is different for every statistical test. In complicated analyses, formulas often don't exist and simulation must be used to calculate sample size. Below are listed some text and web resources for setting sample sizes for various simple statistical models. For complicated situations you can either run the simulations yourself, have a statistician do them for you, or use a conservative model to estimate sample sizes.
James Gibbs has created a software program that estimates sample sizes for those who will use linear or exponential regression to analyze their
data. As this type of regression is the most basic, it is also likely to be the most conservative. Currently his software only runs on Windows XP or 2000. It is available by contacting Sam Droege at the address below. A new version is expected out soon that will run on more platforms.
Desperation rule of thumb: If, for whatever reason, you cannot calculate a reasonable estimate for the number of samples to take, then put in 60 plots/points; under many circumstances that may be sufficient. Obviously the more the better, but be sure to review your data after 3 years to re-evaluate this weak choice.
Texts on estimating sample size
Web sites and online calculators for the calculation of sample sizes
U.S. Department of the Interior
U.S. Geological Survey
Patuxent Wildlife Research Center
Laurel, MD USA 20708-4038
http://www.pwrc.usgs.gov/monmanual
Contact Sam Droege, email sam_droege@usgs.gov
Gerrodette (1987) also looks at the effect of various factors upon the number of years of monitoring required.
For example, Figure 1 of his report:
shows the dependence of power upon the rate and type of cv relationship when the initial cv was 20% and α = .05. Note that for n = 5, you have very little power to detect anything but huge changes (large values of r). For example, even with r = .2, corresponding to a 20% change/year in abundance, power barely exceeds 50% even after 5 years. Power is highest (and hence a trend is easier to detect) when the cv is proportional to 1/√abundance (but this is reversed for declining trends).
Figure 2 of his report:
shows the relationship of power to the type of trend (linear or exponential) and whether the trend is increasing or decreasing. Regardless of whether the trend is linear or exponential, decreasing trends are easier to detect than increasing trends. Furthermore, it is easier to detect a declining trend with a constant absolute decline than one with a relative decline, and hardest to detect an increasing trend that changes by an absolute amount each year. This is related to the "compounding" effect in exponential changes.
Finally, Figure 3 of his report:
shows the effect of different amounts of variation upon trend detection. As expected, a trend is easier to detect with lower amounts of variation (smaller cvs).
2.4.4 Finally - how many years do I need to monitor?
Gerrodette (1987) gives a quick-and-dirty approximation that will help guide sample size determination. For α = .05 and power = 80%, the following is an approximate rule:

r²n³ ≥ 94(cv)²

For example, to detect a 5% decline/year in a population whose cv is 20% and constant over time would require:

(−.05)²n³ ≥ 94(.2)²

or n ≥ 11 years of monitoring. 17
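The rule can be rearranged to solve directly for n. A short sketch (the function name is my own, and, as the footnote warns, the rule itself is only an approximation to the exact power computation):

```python
# Solve Gerrodette's approximate rule  r^2 * n^3 >= 94 * cv^2  for n,
# the number of years of monitoring (alpha = .05, power = 80%).

def approx_years(r, cv):
    """Smallest (real-valued) n satisfying r^2 * n^3 >= 94 * cv^2."""
    return (94 * cv ** 2 / r ** 2) ** (1.0 / 3.0)

# 5% decline/year with a constant cv of 20%:
print(approx_years(-0.05, 0.20))  # about 11.5 years
```

Rounding gives the n ≥ 11 of the worked example; the exact computation in TRENDS can give a smaller answer (here, 9 years).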
Suppose we wish to investigate the power of a monitoring design that will run for 5 years. At each survey occasion (i.e. every year), we have 1 monitoring station, and we make 2 estimates of the population at the
17 If you try this actual power computation using TRENDS, you find that actually 9 years may be sufficient. This formula is ONLY an approximation!
monitoring station in each year. The population is expected to start with 1000 animals, and we expect that the measurement error in each estimate is about 200, i.e. the coefficient of variation of each measurement is about 20% and is constant over time. We are interested in detecting increasing or decreasing trends and, to start, a 5% decline per year will be of interest.
The input/output for TRENDS is shown below:
Most of the fields are self-explanatory. The effort multiplier, i.e. 2 surveys/year, is located at the bottom right of the screen. We find that a five-year study only has a 14% chance of detecting a 5% decline per year - hardly worth doing the study!
The input for the MONITOR Program follows:
Most of the fields are self-explanatory, but additional help can be obtained by clicking on the active links behind each term.
The output from this proposed program follows:
Program MONITOR
Tue May 4 00:53:34 2004 p=2705
This is an example of a power analysis to detect a trend
SIMULATION OVERVIEW
Number of plots monitored : 1
Plot Counts :
1000.000
Plot Standard Deviations :
200.000
Plot weights :
1.000
Number counts/plot/survey occasion : 2
CV in trends : 0.000
Total Surveys : 5
Survey occasions:
0.000
1.000
2.000
3.000
4.000
Trend Type = Linear
Counts analyzed as decimals
Projection set = Complete
Significance Level : 0.050
Significance Test : 2-tailed t-test
Iterations : 500
Power to Detect Population Trends:
10% Increase = 0.68200
9% Increase = 0.58800
8% Increase = 0.45000
7% Increase = 0.36200
6% Increase = 0.26400
5% Increase = 0.17600
4% Increase = 0.13800
3% Increase = 0.11800
2% Increase = 0.05400
1% Increase = 0.05600
0% Increase = 0.04200
10% Decrease = 0.35600
9% Decrease = 0.30600
8% Decrease = 0.24800
7% Decrease = 0.21200
6% Decrease = 0.19600
5% Decrease = 0.15600
4% Decrease = 0.09600
3% Decrease = 0.08800
2% Decrease = 0.06600
1% Decrease = 0.05000
0% Decrease = 0.04200
END OF OUTPUT FILE
This design is estimated to have a power of 16% to detect a 5% decrease PER YEAR.
The difference in reported powers is an artifact of the different ways the two programs compute power. TRENDS uses analytical formulae based on normal approximations, while MONITOR conducts a simulation study and reports the proportion of trials (in this case out of 500) that detected the trend. In any event, don't get hung up over these differences - the key point is that this proposed study has virtually no power to detect a 5% decline/year.
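To make the simulation idea concrete, here is a stripped-down sketch of a MONITOR-style power computation under the settings above. This is my own toy re-implementation, not MONITOR's actual code: it assumes the linear 5% decline means a drop of 5% of the initial count each year, and the estimate it prints will wobble a little from run to run, landing near the 14-16% the two programs report.

```python
import math
import random

def sim_power(n_years=5, counts_per_year=2, N0=1000.0, sd=200.0,
              yearly_decline=0.05, n_iter=2000, seed=42):
    """Toy MONITOR-style power estimate: simulate noisy yearly counts
    around a linear trend, fit a least-squares line each time, and count
    how often a 2-tailed t-test on the slope rejects at alpha = .05.
    t_crit is hard-coded for df = n_years * counts_per_year - 2 = 8."""
    random.seed(seed)
    t_crit = 2.306                                 # t(.975, df = 8)
    detected = 0
    for _ in range(n_iter):
        xs, ys = [], []
        for t in range(n_years):
            mu = N0 * (1.0 - yearly_decline * t)   # linear decline
            for _ in range(counts_per_year):
                xs.append(float(t))
                ys.append(random.gauss(mu, sd))    # measurement error only
        n = len(xs)
        xbar, ybar = sum(xs) / n, sum(ys) / n
        sxx = sum((x - xbar) ** 2 for x in xs)
        slope = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / sxx
        intercept = ybar - slope * xbar
        sse = sum((y - intercept - slope * x) ** 2 for x, y in zip(xs, ys))
        se_slope = math.sqrt(sse / (n - 2) / sxx)
        if abs(slope / se_slope) > t_crit:
            detected += 1
    return detected / n_iter

print(sim_power())   # lands in the same low-power ballpark as the text
```

Note that, like the MONITOR run above, this sketch sets process variation to 0: all the scatter comes from measurement error around the trend line.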
Program MONITOR also reports power for a range of trends; Program TRENDS reports power for a single TREND at a time, but you can quickly vary the sliding window to investigate different design options.
2.4.5 Summary of plans
Here is a summary of some power computations to detect an average decrease over time. In all cases, the cv was assumed to be proportional to 1/√abundance.
The results are sobering. For many animal species, many years of concentrated effort will be needed to detect small effects with decent power.
Approximate power to detect (decreasing) linear trend when monitoring x years and n obs/year
CV of initial obs (%): 10

N years    Obs/           Average % decrease/year
monitored  year     0      2      4      6      8     10
                  Power  Power  Power  Power  Power  Power
    2       1       .      .      .      .      .      .
            3       5      5      7      9     11     15
            5       5      6      9     13     19     26
    3       1       5      5      6      7      7      9
            3       5      7     13     22     35     48
            5       5      9     20     38     58     75
    4       1       5      6      8     12     16     21
            3       5     10     26     48     70     85
            5       5     15     43     74     92     98
    5       1       5      7     13     22     32     43
            3       5     16     47     77     93     99
            5       5     26     71     95    100    100
    6       1       5     10     22     38     55     69
            3       5     25     69     94     99    100
            5       5     40     90    100    100    100
    7       1       5     13     33     57     76     88
            3       5     37     87     99    100    100
            5       5     58     98    100    100    100
    8       1       5     18     47     75     90     97
            3       5     52     96    100    100    100
            5       5     75    100    100    100    100
    9       1       5     24     62     88     97     99
            3       5     66     99    100    100    100
            5       5     88    100    100    100    100
   10       1       5     31     76     95     99    100
            3       5     79    100    100    100    100
            5       5     95    100    100    100    100
Refer to http://www.mbr-pwrc.usgs.gov/cgi-bin/monitor.pl for a web-based interface.
Approximate power to detect (decreasing) linear trend when monitoring x years and n obs/year
CV of initial obs (%): 20

N years    Obs/           Average % decrease/year
monitored  year     0      2      4      6      8     10
                  Power  Power  Power  Power  Power  Power
    2       1       .      .      .      .      .      .
            3       5      5      5      6      7      7
            5       5      5      6      7      8     10
    3       1       5      5      5      5      6      6
            3       5      6      7      9     12     16
            5       5      6      9     13     19     26
    4       1       5      5      6      7      8      9
            3       5      6     10     16     24     33
            5       5      7     14     25     39     53
    5       1       5      6      7      9     12     15
            3       5      8     16     27     41     55
            5       5     10     24     44     64     79
    6       1       5      6      9     13     19     24
            3       5     10     23     42     61     76
            5       5     14     37     65     84     94
    7       1       5      7     12     19     28     36
            3       5     13     34     59     78     90
            5       5     19     53     82     95     99
    8       1       5      8     16     27     38     49
            3       5     17     46     74     90     97
            5       5     26     69     93     99    100
    9       1       5     10     21     36     50     62
            3       5     22     60     86     97     99
            5       5     35     82     98    100    100
   10       1       5     11     27     45     62     74
            3       5     28     72     94     99    100
            5       5     44     92    100    100    100
Approximate power to detect (decreasing) linear trend when monitoring x years and n obs/year
CV of initial obs (%): 30

N years    Obs/           Average % decrease/year
monitored  year     0      2      4      6      8     10
                  Power  Power  Power  Power  Power  Power
    2       1       .      .      .      .      .      .
            3       5      5      5      5      6      6
            5       5      5      5      6      6      7
    3       1       5      5      5      5      5      5
            3       5      5      6      7      8     10
            5       5      5      7      9     11     14
    4       1       5      5      5      6      6      7
            3       5      6      7     10     13     17
            5       5      6      9     14     20     27
    5       1       5      5      6      7      8     10
            3       5      6     10     15     21     28
            5       5      7     13     23     34     46
    6       1       5      6      7      9     11     14
            3       5      7     13     22     32     43
            5       5      9     19     34     51     65
    7       1       5      6      8     11     15     19
            3       5      8     18     31     45     58
            5       5     11     27     49     68     81
    8       1       5      6     10     15     20     25
            3       5     10     24     42     59     72
            5       5     14     37     63     82     92
    9       1       5      7     12     19     26     33
            3       5     12     31     53     71     83
            5       5     18     49     76     91     97
   10       1       5      8     15     23     33     41
            3       5     15     40     65     82     91
            5       5     23     60     86     96     99
Approximate power to detect (decreasing) linear trend when monitoring x years and n obs/year
CV of initial obs (%): 40

N years    Obs/           Average % decrease/year
monitored  year     0      2      4      6      8     10
                  Power  Power  Power  Power  Power  Power
    2       1       .      .      .      .      .      .
            3       5      5      5      5      5      6
            5       5      5      5      5      6      6
    3       1       5      5      5      5      5      5
            3       5      5      5      6      7      8
            5       5      5      6      7      8     10
    4       1       5      5      5      5      6      6
            3       5      5      6      8     10     12
            5       5      6      7     10     13     17
    5       1       5      5      6      6      7      8
            3       5      6      8     10     14     18
            5       5      6     10     15     21     28
    6       1       5      5      6      7      8     10
            3       5      6      9     14     20     26
            5       5      7     13     21     32     42
    7       1       5      5      7      8     11     13
            3       5      7     12     19     28     37
            5       5      8     18     30     44     57
    8       1       5      6      8     10     13     16
            3       5      8     15     26     37     48
            5       5     10     23     41     57     71
    9       1       5      6      9     13     17     21
            3       5      9     20     33     47     59
            5       5     12     30     52     70     82
   10       1       5      7     10     15     21     26
            3       5     11     25     42     57     70
            5       5     15     39     63     80     90
Approximate power to detect (decreasing) linear trend when monitoring x years and n obs/year
CV of initial obs (%): 50

N years    Obs/      Average % decrease/year (entries are power, %)
monitored  year       0     2     4     6     8    10
    2        1        .     .     .     .     .     .
             3        5     5     5     5     5     5
             5        5     5     5     5     6     6
    3        1        5     5     5     5     5     5
             3        5     5     5     6     6     7
             5        5     5     6     6     7     8
    4        1        5     5     5     5     5     6
             3        5     5     6     7     8     9
             5        5     5     6     8    10    13
    5        1        5     5     5     6     6     7
             3        5     5     7     8    11    13
             5        5     6     8    11    15    20
    6        1        5     5     6     6     7     8
             3        5     6     8    11    15    19
             5        5     6    10    15    22    29
    7        1        5     5     6     7     9    10
             3        5     6     9    14    20    25
             5        5     7    13    21    31    40
    8        1        5     5     7     8    10    12
             3        5     7    12    18    26    33
             5        5     8    17    28    40    52
    9        1        5     6     7    10    12    15
             3        5     8    14    23    33    42
             5        5    10    21    36    51    63
   10        1        5     6     8    11    15    18
             3        5     9    17    29    40    51
             5        5    11    27    45    61    74
Approximate power to detect (decreasing) linear trend when monitoring x years and n obs/year
CV of initial obs (%): 60

N years    Obs/      Average % decrease/year (entries are power, %)
monitored  year       0     2     4     6     8    10
    2        1        .     .     .     .     .     .
             3        5     5     5     5     5     5
             5        5     5     5     5     5     6
    3        1        5     5     5     5     5     5
             3        5     5     5     5     6     6
             5        5     5     5     6     7     7
    4        1        5     5     5     5     5     5
             3        5     5     6     6     7     8
             5        5     5     6     7     9    10
    5        1        5     5     5     5     6     6
             3        5     5     6     7     9    11
             5        5     6     7     9    12    15
    6        1        5     5     5     6     6     7
             3        5     6     7     9    12    14
             5        5     6     8    12    17    22
    7        1        5     5     6     7     7     8
             3        5     6     8    11    15    19
             5        5     7    10    16    23    30
    8        1        5     5     6     7     9    10
             3        5     6    10    14    19    25
             5        5     7    13    21    30    39
    9        1        5     5     7     8    10    12
             3        5     7    11    18    24    31
             5        5     8    16    27    38    48
   10        1        5     6     7     9    12    14
             3        5     7    14    22    30    38
             5        5     9    20    33    47    58
Approximate power to detect (decreasing) linear trend when monitoring x years and n obs/year
CV of initial obs (%): 70

N years    Obs/      Average % decrease/year (entries are power, %)
monitored  year       0     2     4     6     8    10
    2        1        .     .     .     .     .     .
             3        5     5     5     5     5     5
             5        5     5     5     5     5     5
    3        1        5     5     5     5     5     5
             3        5     5     5     5     6     6
             5        5     5     5     6     6     7
    4        1        5     5     5     5     5     5
             3        5     5     5     6     6     7
             5        5     5     6     7     8     9
    5        1        5     5     5     5     6     6
             3        5     5     6     7     8     9
             5        5     5     6     8    10    12
    6        1        5     5     5     6     6     7
             3        5     5     6     8    10    12
             5        5     6     8    10    14    17
    7        1        5     5     6     6     7     8
             3        5     6     7    10    12    15
             5        5     6     9    13    18    23
    8        1        5     5     6     7     8     9
             3        5     6     8    12    15    19
             5        5     7    11    17    23    30
    9        1        5     5     6     7     9    10
             3        5     6    10    14    19    24
             5        5     7    13    21    29    38
   10        1        5     6     7     8    10    12
             3        5     7    11    17    23    29
             5        5     8    16    26    36    46
Approximate power to detect (decreasing) linear trend when monitoring x years and n obs/year
CV of initial obs (%): 80

N years    Obs/      Average % decrease/year (entries are power, %)
monitored  year       0     2     4     6     8    10
    2        1        .     .     .     .     .     .
             3        5     5     5     5     5     5
             5        5     5     5     5     5     5
    3        1        5     5     5     5     5     5
             3        5     5     5     5     5     6
             5        5     5     5     5     6     6
    4        1        5     5     5     5     5     5
             3        5     5     5     6     6     7
             5        5     5     6     6     7     8
    5        1        5     5     5     5     5     6
             3        5     5     6     6     7     8
             5        5     5     6     7     9    11
    6        1        5     5     5     6     6     6
             3        5     5     6     7     9    10
             5        5     6     7     9    11    14
    7        1        5     5     5     6     6     7
             3        5     5     7     8    11    13
             5        5     6     8    11    15    19
    8        1        5     5     6     6     7     8
             3        5     6     8    10    13    16
             5        5     6     9    14    19    24
    9        1        5     5     6     7     8     9
             3        5     6     9    12    16    20
             5        5     7    11    17    24    30
   10        1        5     5     6     8     9    10
             3        5     6    10    14    19    24
             5        5     7    13    21    29    37
Approximate power to detect (decreasing) linear trend when monitoring x years and n obs/year
CV of initial obs (%): 90

N years    Obs/      Average % decrease/year (entries are power, %)
monitored  year       0     2     4     6     8    10
    2        1        .     .     .     .     .     .
             3        5     5     5     5     5     5
             5        5     5     5     5     5     5
    3        1        5     5     5     5     5     5
             3        5     5     5     5     5     6
             5        5     5     5     5     6     6
    4        1        5     5     5     5     5     5
             3        5     5     5     6     6     6
             5        5     5     5     6     7     7
    5        1        5     5     5     5     5     6
             3        5     5     6     6     7     7
             5        5     5     6     7     8     9
    6        1        5     5     5     5     6     6
             3        5     5     6     7     8     9
             5        5     5     7     8    10    12
    7        1        5     5     5     6     6     7
             3        5     5     6     8     9    11
             5        5     6     7    10    13    16
    8        1        5     5     6     6     7     7
             3        5     6     7     9    11    14
             5        5     6     8    12    16    20
    9        1        5     5     6     6     7     8
             3        5     6     8    10    13    16
             5        5     6    10    15    20    25
   10        1        5     5     6     7     8     9
             3        5     6     9    12    16    20
             5        5     7    11    18    24    30
Approximate power to detect (decreasing) linear trend when monitoring x years and n obs/year
CV of initial obs (%): 100

N years    Obs/      Average % decrease/year (entries are power, %)
monitored  year       0     2     4     6     8    10
    2        1        .     .     .     .     .     .
             3        5     5     5     5     5     5
             5        5     5     5     5     5     5
    3        1        5     5     5     5     5     5
             3        5     5     5     5     5     5
             5        5     5     5     5     6     6
    4        1        5     5     5     5     5     5
             3        5     5     5     5     6     6
             5        5     5     5     6     6     7
    5        1        5     5     5     5     5     5
             3        5     5     5     6     6     7
             5        5     5     6     7     7     9
    6        1        5     5     5     5     6     6
             3        5     5     6     6     7     8
             5        5     5     6     8     9    11
    7        1        5     5     5     6     6     6
             3        5     5     6     7     9    10
             5        5     6     7     9    11    14
    8        1        5     5     5     6     6     7
             3        5     5     7     8    10    12
             5        5     6     8    11    14    17
    9        1        5     5     6     6     7     7
             3        5     6     7     9    12    14
             5        5     6     9    13    17    21
   10        1        5     5     6     7     7     8
             3        5     6     8    11    14    17
             5        5     7    10    15    20    25
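Power values of this kind can also be approximated by Monte Carlo simulation: generate observations that decline at a fixed percentage per year with lognormal noise at the stated CV, regress log(observation) on year, and count how often the slope test rejects at α = .05. The sketch below is a Python translation of that idea, not the program behind the web tool; the helper name trend_power, the starting abundance of 1000, and the two-sided test are illustrative assumptions, so its results will be in the right ballpark but need not exactly match the tabled values.

```python
import numpy as np
from scipy import stats

def trend_power(n_years, obs_per_year, cv, pct_decline,
                n_sim=2000, alpha=0.05, seed=1):
    """Monte Carlo power for detecting a log-linear decline.

    cv is the coefficient of variation of the observations (0.30 = 30%);
    pct_decline is the average % decrease per year.
    """
    rng = np.random.default_rng(seed)
    years = np.repeat(np.arange(n_years), obs_per_year).astype(float)
    true_mean = 1000.0 * (1.0 - pct_decline / 100.0) ** years  # arbitrary start
    sigma = np.sqrt(np.log(1.0 + cv**2))  # lognormal sigma giving the stated CV
    reject = 0
    for _ in range(n_sim):
        obs = true_mean * rng.lognormal(mean=0.0, sigma=sigma, size=years.size)
        fit = stats.linregress(years, np.log(obs))
        reject += fit.pvalue < alpha  # two-sided test on the slope
    return reject / n_sim

# e.g. 10 years, 5 obs/year, CV 30%, 10%/year decline: power near 1
print(trend_power(10, 5, 0.30, 10))
```

With no decline the rejection rate should sit near the α level of 5%, mirroring the columns of 5's in the tables above.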
CHAPTER 2. DETECTING TRENDS OVER TIME
2.5 Testing for common trend - ANCOVA
In some cases, it is of interest to test if the same trend is occurring in a number of locations. Or the data from a single site may be so poor that trends cannot be detected, but by pooling the sites, a common trend over sites can be detected because of the increased sample size. This technique can also be used for adjusting for seasonality, as will be seen later.
The Analysis of Covariance (ANCOVA) does both. Groups of data (e.g. from the same location) are identified by a nominal or ordinal scale variable, and time is measured for each group.
Typically, ANCOVA is used to check if the regression lines for the groups are parallel. If there is evidence that the individual regression lines are not parallel, then a separate regression line (trend line) must be fit for each group for prediction purposes. If there is no evidence of non-parallelism, then the next task is to see if the lines are coincident, i.e. have both the same intercept and the same slope. If there is evidence that the lines are not coincident, then a series of parallel lines is fit to the data, with all of the data used to estimate the common slope. If there is no evidence that the lines are not coincident, then all of the data can simply be pooled together and a single regression line fit to all of the data.
The three possibilities are shown below for the case of two groups; the extension to many groups is obvious:
[Three figures (not reproduced): non-parallel lines; parallel but not coincident lines; coincident lines.]
©2012 Carl James Schwarz. November 23, 2012.
2.5.1 Assumptions
As before, it is important to verify the assumptions underlying the analysis before it is started. As ANCOVA is a combination of ANOVA and regression, the assumptions are similar.
• The response variable Y is continuous (interval or ratio scaled).
• The Y are a random sample from the various time points measured.
• There must be no outliers. Plot Y vs. X for each group separately to see if there are any points that don't appear to follow the straight line.
• The relationship between Y and X must be linear for each group. 18 Check this assumption by looking at the individual plots of Y vs. X for each group.
• The variance must be equal for both groups around their respective regression lines. Check that the spread of the points is equal across the range of X and that the spread is comparable between the two groups. This can be formally checked by looking at the MSE from a separate regression line for each group, as the MSE estimates the variance of the data around the regression line.
• The residuals must be normally distributed around the regression line for each group. This assumption can be checked by examining the residual plots from the fitted model for evidence of non-normality. For large samples, this is not too crucial; for small sample sizes, you will likely have inadequate power to detect anything but gross departures.
18 It is possible to relax this assumption as well, but that is again beyond the scope of this course.
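The equal-variance check above can be scripted directly: fit a separate least-squares line to each group and compare the residual standard deviations (the square roots of the MSEs). A minimal numpy sketch (the helper name group_residual_sd and the toy data are illustrative):

```python
import numpy as np

def group_residual_sd(x, y):
    """Residual standard deviation (sqrt of MSE) from a simple linear fit."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    b0 = y.mean() - b1 * x.mean()
    resid = y - (b0 + b1 * x)
    return float(np.sqrt(np.sum(resid**2) / (len(x) - 2)))  # MSE uses n - 2 df

# Compute this separately for each group's (x, y) data; roughly similar
# values support the equal-variance assumption.
print(group_residual_sd([0, 1, 2, 3], [3.1, 4.9, 7.2, 8.8]))
```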
2.5.2 Statistical model
You saw in earlier chapters that a statistical model is a powerful shorthand to describe what analysis is fit to a set of data. The model must describe the treatment structure, the experimental unit structure, and the randomization structure. Let Y be the response variable, X be the continuous predictor variable, and Group be the group factor.
As ANCOVA is a combination of ANOVA and regression, it will not be surprising that the models will have terms corresponding to both Group and X. Again, there are three cases.
If the lines for each group are not parallel,
the appropriate model is
Y = Group X Group*X
The terms can be in any order. This is read as: variation in Y can be explained by a common intercept (never specified), followed by group effects (different intercepts), a common slope (trend) on X, and an "interaction" between Group and X, which is interpreted as different slopes (different trends) for each group. This model is almost equivalent to fitting a separate regression line for each group. The only advantage to using this joint model for all groups is similar to that enjoyed by using ANOVA: all of the groups contribute to a better estimate of the residual error. If the number of data points per group is small, this can lead to improvements in precision compared to fitting each group individually, and an improved power to detect trends.
If the lines are parallel across groups, but not coincident, the appropriate model is
Y = Group X
The terms can be in any order. The only difference between this and the previous model is that this simpler model lacks the Group*X "interaction" term. It would not be surprising, then, that a statistical test to see if this simpler model is tenable corresponds to examining the p-value of the test on the Group*X term from the complex model. This is exactly analogous to testing for interaction effects between factors in a two-factor ANOVA.
Lastly, if the lines are coincident, the appropriate model is
Y = X
Now the difference between this model and the previous model is the Group term that has been dropped. Again, it would not be surprising that this corresponds to the test of the Group effect in the formal statistical test. The test for coincident lines should only be done if there is insufficient evidence against parallelism. While it is possible to test for a non-zero slope, this is rarely done.
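In a scripting environment the same three-model sequence can be fit with R-style formulas. Here is a sketch using Python's statsmodels rather than JMP (the column names Y, X, and Group and the helper name ancova_sequence are placeholders):

```python
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

def ancova_sequence(df):
    """Fit the three ANCOVA models and test each simplification.

    df must have columns Y (response), X (continuous), Group (categorical).
    """
    full     = smf.ols("Y ~ C(Group) * X", data=df).fit()  # separate slopes
    parallel = smf.ols("Y ~ C(Group) + X", data=df).fit()  # parallel lines
    single   = smf.ols("Y ~ X", data=df).fit()             # coincident lines
    # Test for non-parallelism: does dropping the Group*X term hurt the fit?
    p_interaction = anova_lm(parallel, full).iloc[1]["Pr(>F)"]
    # Test for coincidence (only sensible if the lines look parallel)
    p_group = anova_lm(single, parallel).iloc[1]["Pr(>F)"]
    return p_interaction, p_group
```

A large p_interaction supports the parallel-lines model; a large p_group as well would support pooling everything into a single line.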
2.5.3 Example: Degradation of dioxin - pooling locations
An unfortunate byproduct of pulp-and-paper production used to be dioxins, a very hazardous material. This material was discharged into waterways with the pulp-and-paper effluent, where it bioaccumulated in living organisms such as crabs. Newer processes have eliminated this byproduct, but the dioxins in the organisms take a long time to degrade.
Government environmental protection agencies take samples of crabs from affected areas each year and measure the amount of dioxins in the tissue. The following example is based on a real study.
Each year, four crabs are captured from two monitoring stations which are situated quite a distance apart on the same inlet where the pulp mill was located. The livers are excised, and the livers from all four crabs are composited together into a single sample. 19 The dioxin levels in this composite sample are measured. As there are many different forms of dioxins with different toxicities, a summary measure, called the Total Equivalent Dose (TEQ), is computed from the sample.
19 Compositing is a common analytical tool. There is little loss of useful information induced by the compositing process - the only loss of information is the among-individual-sample variability, which can be used to determine the optimal allocation between samples within years and the number of years to monitor.
As seen earlier, the appropriate response variable is log(TEQ).
Is the rate of decline the same for both sites? Did the sites have the same initial concentration?
Here are the raw data:
Site  Year     TEQ  log(TEQ)
  a   1990  179.05      5.19
  a   1991   82.39      4.41
  a   1992  130.18      4.87
  a   1993   97.06      4.58
  a   1994   49.34      3.90
  a   1995   57.05      4.04
  a   1996   57.41      4.05
  a   1997   29.94      3.40
  a   1998   48.48      3.88
  a   1999   49.67      3.91
  a   2000   34.25      3.53
  a   2001   59.28      4.08
  a   2002   34.92      3.55
  a   2003   28.16      3.34
  b   1990   93.07      4.53
  b   1991  105.23      4.66
  b   1992  188.13      5.24
  b   1993  133.81      4.90
  b   1994   69.17      4.24
  b   1995  150.52      5.01
  b   1996   95.47      4.56
  b   1997  146.80      4.99
  b   1998   85.83      4.45
  b   1999   67.72      4.22
  b   2000   42.44      3.75
  b   2001   53.88      3.99
  b   2002   81.11      4.40
  b   2003   70.88      4.26
JMP analysis
The raw data are available in Dioxin2.JMP from the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms.
The data can be entered into JMP in the usual fashion. Make sure that Site is a nominal scale variable, and that Year is a continuous variable.
In cases with multiple groups, it is often helpful to use a different plotting symbol for each group. This is easily accomplished in JMP by selecting the rows (say, for site a) and using Rows→Markers to set the plotting symbol for the selected rows.
The final data sheet has two different plotting symbols for the two sites.
Before fitting the various models, begin with an exploratory examination of the data, looking for outliers and checking the assumptions.
Each year's data are independent of other years' data, as a different set of crabs was selected each year. Similarly, the data from one site are independent of the data from the other site. This is an observational study, so the question arises of how exactly the crabs were selected. In this study, crab pots were placed on the sea floor to capture the available crabs in the area.
Whenever multiple sets of data are collected over time, there is always the worry about common year effects (also known as process error). For example, if the response variable were the body mass of small fish, then poor growing conditions in a single year could depress the growth of fish at all locations. This would violate the assumption of independence, as the residual at one site in a year would be related to the residual at another site in the same year: you tend to see the residuals "paired", with negative residuals from the fitted line at one site matched (by year) with negative residuals at the other site. In this case, this is unlikely to have occurred. Degradation of dioxin is relatively independent of external environmental factors, and the variation that we see about the two regression lines is related solely to sampling error based on the particular set of crabs that were sampled. It seems unlikely that the residuals are related. 20
20 If you actually try to fit a process-error term to this model, you find that the estimated process error is zero.
Use the Analyze->Fit Y-by-X platform and specify log(TEQ) as the Y variable and Year as the X variable:
Then specify a grouping variable by clicking on the pop-down menu near the Bivariate Fit window-title line, and selecting Site as the grouping variable.
Now select Fit Line from the same pop-down menu
to get separate lines fit for each group.
The relationships for each site appear to be linear. The actual estimates are also presented.
The scatter plot doesn't show any obvious outliers. The estimated slope for the a site is −.107 (se .02), while the estimated slope for the b site is −.06 (se .02). The 95% confidence intervals (not shown on the output, but available by right-clicking/ctrl-clicking on the Parameter Estimates table) overlap considerably, so the slopes could be the same for the two groups.
The MSE from site a is .10 and the MSE from site b is .12. These correspond to standard deviations of √.10 = .32 and √.12 = .35, which are very similar, so the assumption of equal standard deviations seems reasonable.
The residual plots (not shown) also look reasonable.
The assumptions appear to be satisfied, so let us now fit the various models.
First, fit the model allowing for separate lines for each group. The Analyze->Fit Model platform is used. The terms can be in any order and correspond to the model described earlier. This gives the following output:
The regression plot is just the same as the plot of the two individual lines seen earlier. What is of interest is the Effect test for the Site*Year interaction. Here the p-value is not very small, so there is no evidence that the lines are not parallel.
We need to refit the model, dropping the interaction term:
which gives the following regression plot:
This shows the fitted parallel lines. The effect tests now have a small p-value for the Site effect, indicating that the lines are not coincident, i.e. they are parallel with different intercepts. This would mean that the rate of decay of the dioxin appears to be equal at both sites, but the initial concentrations appear to be different.
The estimated (common) slope is found in the Parameter Estimates portion of the output
and has a value of −.083 (se .016). Because the analysis was done on the log scale, this implies that the dioxin levels changed by a factor of exp(−.083) = .92 from year to year, i.e. about an 8% decline each year. The 95% confidence interval for the slope on the log scale runs from −.12 to −.05, which corresponds to a factor of between exp(−.12) ≈ .88 and exp(−.05) ≈ .95 per year, i.e. between a 12% and a 5% decline per year. 21
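The back-transformation from a slope on the log scale to a percentage change per year can be checked directly; a small Python helper (the name yearly_change is chosen for illustration; note the notes round the confidence-limit factors):

```python
import math

def yearly_change(slope):
    """Convert a slope on the log scale to a % change per year."""
    return (math.exp(slope) - 1.0) * 100.0

print(yearly_change(-0.083))  # point estimate: about an 8% decline per year
print(yearly_change(-0.12))   # lower confidence limit
print(yearly_change(-0.05))   # upper confidence limit
```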
While it is possible to estimate the difference between the parallel lines from the Parameter Estimates table, it is easier to look at the section of the output corresponding to the Site effects. Here the estimated LSMeans correspond to log(TEQ) at the average value of Year, which is not really of interest. As in previous chapters, the difference in means is often of more interest than the raw means themselves. This is found by using the pop-down menu and selecting an LSMeans Contrast or Multiple Comparison procedure to give:
21 The confidence intervals are found by right-clicking/ctrl-clicking in the Parameter Estimates table.
The estimated difference between the lines (on the log scale) is 0.46 (se .13). Because the analysis was done on the log scale, this corresponds to a ratio of exp(.46) = 1.58 in dioxin levels between the two sites, i.e. site b has 1.58 times the dioxin level of site a. Because the lines are parallel and declining, the dioxin levels are falling at both sites, but the 1.58 ratio remains constant.
Finally, the Actual-by-Predicted plot (not shown here), the leverage plots (not shown here), and the residual plot don't show any evidence of a problem with the fit.
2.5.4 Change in yearly average temperature with regime shifts
The ANCOVA technique can also be used for trends when there are KNOWN regime shifts in the series. The case when the timing of the shift is unknown is more difficult and not covered in this course.
For example, consider a time series of annual average temperatures measured at Tuscaloosa, Alabama from 1901 to 2001. It is well known that shifts in temperature readings can occur whenever the instrument, location, observer, or other characteristics of the station change.
The data are available in the JMP datafile tuscaloosa-avg-temp.jmp in the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms.
A portion of the raw data is shown below:
and a time series plot of the data shows a shift in the readings in 1939 (thermometer changed), 1957 (station moved), and possibly in 1987 (location and thermometer changed).
It turns out that the case where the number of epochs tends to increase with the number of data points has some serious technical issues with the properties of the estimators. See
Lu, Q. and Lund, R.B. (2007). Simple linear regression with multiple level shifts. Canadian Journal of Statistics, 35, 447-458.
for details. Basically, if the number of parameters tends to increase with the sample size, this violates one of the assumptions for maximum likelihood estimation. This could lead to estimates which may not even be consistent! For example, suppose that the recording conditions changed every two years. Each pair of data points should still be able to estimate the common slope, but this corresponds to the well-known problem with case-control studies, where the number of pairs increases with the total sample size. Fortunately, Lu and Lund (2007) showed that this violation is not serious.
The analysis proceeds as in the dioxin example with two sites, except that now the series is broken into different epochs corresponding to the sets of years when conditions remained stable at the recording site. In this case, this corresponds to the years 1901-1938 (inclusive); 1940-1956 (inclusive); 1958-1986 (inclusive); and 1989-2000 (inclusive). Note that the years 1939, 1957, and 1987 are NOT used, because the average temperature in each of these years is an amalgam of two different recording conditions 22.
For example, the data file (around the first regime change) may look like:
Note that Year and Avg Temp are both set to have a continuous scale, but Epoch should have a nominal or ordinal scale.
Model filling proceeds as be<str<strong>on</strong>g>for</str<strong>on</strong>g>e by first the model:<br />
AvgT emp = Y ear Epoch Y ear ∗ Epoch<br />
to see if the change in AvgTemp is c<strong>on</strong>sistent am<strong>on</strong>g Epochs and then fitting the model:<br />
AvgT emp = Y ear Epoch<br />
to estimate the comm<strong>on</strong> trend (after adjusting <str<strong>on</strong>g>for</str<strong>on</strong>g> shifts am<strong>on</strong>g the Epochs).<br />
The Analyze->Fit Model plat<str<strong>on</strong>g>for</str<strong>on</strong>g>m is used:<br />
22 If the exact day of the change were known, it is possible to weight the two epochs in these years and include the data points.<br />
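The epoch bookkeeping above is easy to get wrong. A minimal sketch of it in Python (the function name and the integer epoch coding are mine, not part of the notes) that labels each year with its epoch and drops the three transition years might look like:

```python
def epoch_label(year):
    """Assign an epoch to a year of the temperature series.

    The transition years (1939, 1957, 1987) mix two recording conditions,
    so they are excluded from the analysis (returned as None)."""
    if year in (1939, 1957, 1987):
        return None
    if year <= 1938:
        return 1  # 1901-1938: original station and thermometer
    if year <= 1956:
        return 2  # 1940-1956: after thermometer change
    if year <= 1986:
        return 3  # 1958-1986: after station move
    return 4      # 1989-2000: after location and thermometer change
```

In a real analysis the epoch label would then be treated as a nominal factor, exactly as the notes do in JMP.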
There is no strong evidence that the slopes differ among the epochs (p=.10), despite the plot showing
a potentially different slope in the 3rd epoch:
The simpler model with common slopes is then fit:
with fitted (common slope) lines:
No further model simplification is possible, and there is evidence that the common slope is different from zero:
The estimated change in average temperature is:
i.e. an estimated increase of .033 (SE .006) per year. The 95% confidence interval does not cover 0.
The residual plots (against the predicted values and against the order in which the data were collected):
show no obvious problems.
Whenever time series data are used, autocorrelation should be investigated. The Durbin-Watson test is
applied to the residuals:
with no obvious problem detected.
The leverage plot (against year)
also reveals nothing amiss.
A more sophisticated analysis can be fit using SAS, but isn't needed here. The sample program and output are
available in the Sample Program Library.
2.6 Dealing with Autocorrelation
Short time series (10-50 observations) are common in environmental and ecological studies. It is well known
that when data are collected over time, the usual assumption that the errors (deviations above and below the
regression line) are independent may not be true.
This is a key assumption of regression analysis. What it implies is that if the data point for a particular
year happens to be above the line, it has no influence on whether the data point for the next year is also above the
line. In many cases this is not true, because of long-term trends that affect data points for several years in
a row. For example, precipitation often follows multi-year patterns where a drought year is more
often followed by another drought year than by a return to normal rainfall. If the level of precipitation affects
the response, you may see an induced autocorrelation (also known as serial correlation). The uncritical
application of regression to these types of data without accounting for the autocorrelation over time is known
as pseudo-replication over time (Hurlbert, 1984).
This problem and how to deal with it are well known in economics and related disciplines, but less well
known in ecology.
Some articles that discuss the problem and solutions are:
• Bence, J. R. (1995). Analysis of short time series: Correction for autocorrelation. Ecology 76, 628-639.
A nice non-technical review of the subject and how to deal with it in ecology.
• Roy A., Falk B. and Fuller W.A. (2004). Testing for Trend in the Presence of Autoregressive Error.
Journal of the American Statistical Association, 99, 1082-1091. This article is VERY technical, but
the reference list provides a nice summary of relevant papers about this problem.
In some previous examples, we looked at the Durbin-Watson statistic to examine if there was evidence
of autocorrelation. What is the Durbin-Watson test? What is autocorrelation? Why is it a problem? How
do we fit models accounting for autocorrelation?
In order to understand autocorrelation, we need to step back and look at the model for regression analysis
in a little more detail. Recall that we often used a shorthand notation to represent a linear regression
problem:
Y = Time
where Y is the response variable, and Time is the effect of time. Mathematically, the model is written as:
Y_i = β_0 + β_1 t_i + ε_i
where β_0 is the intercept, β_1 is the slope, and ε_i is the deviation of the i-th data point from the actual
underlying line.
The usual assumption made in regression analysis is that the ε_i are independent of each other. In
autocorrelated models, this is not true. Mathematically, the simplest autocorrelation process (known as an AR(1)
process) has:
ε_{i+1} = ρ ε_i + a_i
where the a_i are now independent and ρ is the autocorrelation coefficient.
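The AR(1) process above is easy to simulate. The sketch below (in Python; the function name, default parameter values, and the variance-stabilizing scaling of the innovations are my choices, not part of the notes) generates a straight-line trend with AR(1) deviations of the kind shown in the plots that follow:

```python
import math
import random

def simulate_ar1_trend(n=30, beta0=50.0, beta1=2.0, rho=0.8, sd=10.0, seed=1):
    """Simulate Y_i = beta0 + beta1*t_i + eps_i, where the deviations follow
    the AR(1) process eps_{i+1} = rho*eps_i + a_i with independent normal a_i."""
    rng = random.Random(seed)
    # Scale the innovations so the eps_i keep a constant variance of sd**2.
    innov_sd = sd * math.sqrt(1.0 - rho ** 2)
    eps = rng.gauss(0.0, sd)
    series = []
    for t in range(n):
        series.append(beta0 + beta1 * t + eps)
        eps = rho * eps + rng.gauss(0.0, innov_sd)
    return series
```

Re-running this with rho near +1 produces the long runs above and below the trend line discussed below; rho near -1 produces the alternating pattern.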
In the same way that regular correlation between two variables ranges from -1 to +1, so too does
autocorrelation. An autocorrelation of 0 would indicate no correlation between successive deviations about the
regression line, as ε_i would have no effect on ε_{i+1}; an autocorrelation close to 1 would indicate very
high correlation between successive deviations; an autocorrelation close to -1 (very rare in ecological studies)
would indicate a negative influence, i.e. large positive deviations in one year are typically followed by large
negative deviations in subsequent years. 23
The following plots are some examples of autocorrelated data about the same underlying linear trend,
with the associated residual plots:
23 A negative autocorrelation can be induced if there is a cost to breeding, so that a successful breeding season is followed
by a year of not breeding, etc.
[Figure: pairs of panels showing simulated "baseline" series (left, plotted against time, 0 to 30) and the
corresponding residual plots (right), one pair for each autocorrelation rho = -0.95, -0.90, -0.80, -0.60,
-0.40, -0.20, 0.00, 0.20, 0.40, 0.60, 0.80, 0.90, and 0.95.]
If the autocorrelation is close to -1, then points above the underlying trend are usually followed immediately
by points below the underlying trend. The fitted line will be close to the underlying trend. The residual plot
will show the same pattern.
If the autocorrelation is close to 1, then you will see long runs of points above the underlying trend line
and long runs of points below the underlying line. DANGER! In cases of very high autocorrelation
with short time series, you can be drastically misled by the data! If you examine the plots above, you see that
in the case of high positive autocorrelation, the points tended to stay above or below the underlying trend
line for long periods of time. If the time series is short, you may never see the series dip below the
real trend line, and the fitted line (shown in the above plots) may be completely misleading with no
way to detect this! Ironically, with short time series (e.g. fewer than 30 data points), it will be very difficult
to detect high positive autocorrelation, and this is exactly the time when it can cause the most damage because
the data give misleading results!
If the autocorrelation is close to 0, the points will be randomly scattered about the underlying trend
line, the fitted line will be close to the underlying trend line, and the residuals should appear to be randomly
scattered about 0.
In many cases, if you have fewer than 30 data points, it will be very difficult to observe or detect any
autocorrelation unless it is extreme!
What are the effects of autocorrelation? In most cases in ecology the autocorrelation tends to be positive.
This has the following effects:
• Estimates of the slope and intercept are still unbiased, but they are less efficient (i.e. the true standard
error is larger) than estimates of the same process in the absence of autocorrelation. This may seem to
be contradicted by my statement above that in the presence of high positive autocorrelation and short
time series the data may be very misleading, but that is an artifact of having a very short time
series. With a long time series you will see that the data run over and under the trend line in long
waves, and the fitted line will once again be close to the actual underlying trend.
• The reported variance around the regression line (MSE) may seriously underestimate the true variance.
• Unfortunately, while the estimates of the slope and intercept are usually not affected greatly, the
reported standard errors can be misleading. In the case of positive autocorrelation, the
reported standard errors obtained when a line is fit assuming no autocorrelation are typically too
small, i.e. the estimates look more precise than they really are.
• Reported confidence intervals ignoring autocorrelation tend to be too narrow.
• The p-values from hypothesis testing tend to be too small, i.e. you tend to detect differences that are
not real too often.
The autocorrelation can be estimated from the data in many ways. In one method, a regression line is fit
to the data, the residuals are found, and then the autocorrelation is estimated as:
ρ̂ = ( Σ_{i=2}^{T} e_i e_{i-1} ) / ( Σ_{i=2}^{T} e_{i-1}² )
where e_i is the residual for the i-th observation. Bence (1995) points out that this often underestimates the
autocorrelation and provides some corrected estimates. More modern methods estimate the autocorrelation
using a technique called maximum likelihood, and these often perform better than such two-step methods.
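The two-step estimate above is only a few lines of code. A minimal sketch (pure Python; the function name is mine, and the summation limits follow the formula as reconstructed above):

```python
def lag1_autocorr(residuals):
    """Estimate rho from regression residuals e_1..e_T as
    sum(e_i * e_{i-1}, i = 2..T) / sum(e_{i-1}**2, i = 2..T)."""
    num = sum(e1 * e0 for e0, e1 in zip(residuals, residuals[1:]))
    den = sum(e * e for e in residuals[:-1])
    return num / den
```

Perfectly alternating residuals give an estimate of -1, while residuals that never change sign or size give +1, matching the interpretation of autocorrelation above.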
As a rule of thumb, the reported standard errors obtained from fitting a regression ignoring autocorrelation
should be inflated by a factor of √((1+ρ)/(1−ρ)). For example, if the actual autocorrelation is 0.6, then the
standard errors (from an analysis ignoring autocorrelation) should be inflated by a factor of
√((1+.6)/(1−.6)) = 2,
i.e. multiply the reported standard errors ignoring autocorrelation by a factor of 2. Consequently, unless the
autocorrelation is very close to 1 or -1, the inflation factor is usually pretty small. 24
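The rule of thumb is a one-liner; a sketch (function name mine) that reproduces the worked example:

```python
import math

def se_inflation_factor(rho):
    """Rule-of-thumb factor by which naive standard errors should be
    multiplied when the residuals have lag-1 autocorrelation rho."""
    return math.sqrt((1.0 + rho) / (1.0 - rho))
```

For rho = 0.6 this returns exactly 2, as in the example above; for rho = 0 it returns 1, i.e. no correction.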
A slightly simpler formula that also seems to work well in practice is that the effective sample size in the
presence of autocorrelation is found as:
n_effective = 1 + (n − 1)(1 − ρ)
This is based on the observation that the first observation counts as a full data point, but each additional data
point only counts as (1 − ρ) of a data point. Then use the fact that for most statistical problems the standard
errors decrease by a factor of √n to estimate the effect upon the precision of the estimates. For example, if
n/n_effective = 2, then the reported standard errors (computed ignoring autocorrelation) should be inflated by a
factor of about √2.
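The effective-sample-size rule can be sketched the same way (function names mine):

```python
import math

def effective_sample_size(n, rho):
    """Approximate effective sample size for n equally spaced observations
    with lag-1 autocorrelation rho: the first point counts fully, each
    additional point counts as (1 - rho) of a point."""
    return 1.0 + (n - 1) * (1.0 - rho)

def se_inflation_from_n(n, rho):
    """Inflate naive standard errors by sqrt(n / n_effective)."""
    return math.sqrt(n / effective_sample_size(n, rho))
```

With rho = 0 the effective sample size equals n and no inflation is needed, consistent with the independent-errors case.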
The Durbin-Watson test statistic is a popular measure of autocorrelation. It is computed as:
d = Σ_{i=2}^{N} (e_i − e_{i-1})² / Σ_{i=1}^{N} e_i²
  ≈ ( 2 Σ_{i=1}^{N} e_i² − 2 Σ_{i=2}^{N} e_i e_{i-1} ) / Σ_{i=1}^{N} e_i²
  ≈ 2 (1 − ρ)
Consequently, if the autocorrelation is close to 0, the Durbin-Watson statistic should be close to 2. The
p-value for the statistic is found from tables, but most modern software can compute it automatically.
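The statistic itself (not its p-value) is straightforward to compute from the residuals; a sketch (function name mine):

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic: the sum of squared successive differences of
    the residuals divided by the sum of squared residuals; approximately
    2*(1 - rho) for lag-1 autocorrelation rho."""
    num = sum((e1 - e0) ** 2 for e0, e1 in zip(residuals, residuals[1:]))
    den = sum(e * e for e in residuals)
    return num / den
```

Values near 2 suggest no autocorrelation; values near 0 suggest strong positive autocorrelation, and values near 4 suggest strong negative autocorrelation.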
Remedial measures If there is strong evidence for autocorrelation, there are a number of remedial measures
that can be taken:
• Use ordinary regression and inflate the reported standard errors by the inflation factor mentioned
above. This is a very approximate solution and is not often used now that modern software is available.
24 As Bence (1995) points out, the correction factor assumes that you know the value of ρ. Often ρ is difficult to estimate, and
typically the estimates are too close to 0, resulting in a correction factor that is also too small. He provides a bias-adjusted correction
factor.
• A major cause of autocorrelation is the omission of an important explanatory variable. The example
of precipitation that tends to occur in cycles was noted earlier. In this case, a more complex regression
model (multiple regression) that looks at the simultaneous effect of two or more variables would be
appropriate. Unfortunately this is beyond the scope of these notes.
• Transform the variables before using simple regression methods that ignore autocorrelation. There are
two popular transformations, the Cochrane-Orcutt and Hildreth-Lu procedures. Both procedures start
by estimating the autocorrelation ρ by fitting the ordinary regression line, obtaining the residuals, and
then using the residuals to estimate the autocorrelation. Then the data are transformed by subtracting
the estimated portion due to autocorrelation. Finally, the transformed data are refit using ordinary
regression (again ignoring autocorrelation). These approaches are falling out of favor because of the
availability of the integrated procedures below.
• Use a more sophisticated fitting procedure that explicitly estimates the autocorrelation and accounts
for it. This can be done using maximum likelihood or extensions of the previous methods, e.g. the
Yule-Walker methods, which fit generalized least squares. Many statistical packages offer such procedures;
e.g. SAS's PROC AUTOREG is specially designed to deal with autocorrelation and uses the
Yule-Walker methods, while SAS's Proc MIXED uses maximum likelihood methods.
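The transformation step shared by the Cochrane-Orcutt and Hildreth-Lu procedures is simple to sketch (function name mine; this shows only the subtraction step, not the full iterative procedure):

```python
def cochrane_orcutt_transform(y, x, rho):
    """One pass of the Cochrane-Orcutt style transformation: subtract the
    estimated autocorrelated portion of each observation so that ordinary
    regression can be applied to the transformed series (which loses its
    first point)."""
    y_star = [y[i] - rho * y[i - 1] for i in range(1, len(y))]
    x_star = [x[i] - rho * x[i - 1] for i in range(1, len(x))]
    return y_star, x_star
```

Ordinary least squares on (x_star, y_star) then yields a slope estimate whose standard error is approximately valid despite the original AR(1) errors.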
2.6.1 Example: Mink pelts from Saskatchewan
L.B. Keith (1963) collected information on the number of mink pelts from Saskatchewan, Canada over
a 30 year period. This is data series 3707 in the NERC Centre for Population Biology, Imperial College
(1999) The Global Population Dynamics Database available at http://www.sw.ic.ac.uk/cpb/
cpb/gpdd.html.
We are interested in seeing if there is a linear trend in the series.
Here is the raw data:
Year Pelts
1914 15585
1915 9696
1916 6757
1917 6443
1918 6744
1919 10637
1920 11206
1921 8937
1922 13977
1923 11430
1924 13955
1925 6635
1926 7855
1927 5485
1928 5605
1929 5016
1930 6028
1931 6287
1932 11978
1933 15730
1934 14850
1935 9766
1936 6577
1937 3871
1938 4659
1939 6749
1940 12469
1941 8579
1942 6839
1943 9990
1944 6561
1945 5831
1946 8088
1947 9579
1948 10672
1949 16195
1950 12596
1951 12833
1952 18853
1953 11493
1954 14613
1955 18514
The raw data are available in a JMP file called mink.jmp in the Sample Program Library available
at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms.
It is common, when dealing with population trends, to analyze the data on the log-scale. The reason for
this is that many processes operate multiplicatively on the original scale, and this translates into a straight
line on the log-scale. For example, if the number of pelts harvested increased by x% per year, the forecasted
number of pelts harvested would be fit by the equation:
Pelts = B(1 + x)^(Years from baseline)
where B is the baseline number of pelts. When this is transformed to the log-scale, the resulting equation is:
log(Pelts) = log(B) + (Years from baseline) log(1 + x)
or
Y′ = β_0 + β_1 (Years from baseline)
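This linearization is easy to verify numerically. A quick check (pure Python; the baseline B and growth rate x are illustrative values, not from the mink data):

```python
import math

# Hypothetical baseline of 10000 pelts growing by 4% per year.
B, x = 10000.0, 0.04
pelts = [B * (1.0 + x) ** t for t in range(10)]
log_pelts = [math.log(p) for p in pelts]

# On the log scale the series is exactly linear: every successive
# difference equals the slope log(1 + x).
diffs = [b - a for a, b in zip(log_pelts, log_pelts[1:])]
slope = math.log(1.0 + x)
```

Back-transforming the slope, exp(slope) - 1, recovers the 4% annual growth rate exactly, which motivates the natural-log recommendation below.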
This equation can be further modified by using the raw year as the X variable (rather than years-from-baseline).
All that happens is that the baseline value refers back to year 0 (which is pretty meaningless),
but the value of the slope is still OK.
It is recommended that you take natural logarithms (base e) rather than common logarithms (base 10)
because then the estimated slope has a nice interpretation. For small slopes on the natural log scale, the value
of β̂_1 corresponds closely to the percentage increase per year. For example, if β̂_1 = .04, then the
population is increasing at a rate of
exp(β̂_1) − 1 = exp(.04) − 1 = 1.041 − 1 = .041 ≈ β̂_1 = .04
or 4% per year.
JMP Analysis
JMP deals with autocorrelated data through the Analyze →Modelling →Time Series platform, which is
beyond the scope of this course. This time series platform allows you to fit the Box-Jenkins ARIMA(p,q)
series of models, but does not allow for missing data.
log_mink was constructed using a formula variable in the usual fashion. Here is a portion of the raw
data:
Begin by using the Analyze->Fit Y-by-X platform to fit a simple linear fit and to fit a line joining all of
the points: 25
25 Use the Fit Each Value option under the red-triangle pop-down menu to get the individual points joined up.
There appears to be a generally increasing trend, but the points seem to show an irregular cyclical
pattern where several years of high takes of pelts are followed by several years of low takes of pelts.
This is often a sign of autocorrelated residuals. Indeed, the residual plot shows this pattern: 26
26 This residual plot was obtained by saving the residuals to the data sheet, and then using the Analyze->Fit Y-by-X platform to plot
the saved residuals against year. The joined line was obtained by using the Fit Each Value option from the red-triangle pop-down menu.
The horizontal line at zero was obtained by using the Fit Special option from the red-triangle menu and selecting an intercept of 0 and a slope of 0.
In order to estimate the autocorrelation, the Analyze->Fit Model platform must be used to fit a linear
model to log_mink
and obtain the fitted line and residual plots in the usual way. The Durbin-Watson statistic is obtained from the red-triangle pop-down menu:
The Durbin-Watson statistic indicates that there is strong evidence of autocorrelation, with an estimated autocorrelation of approximately 0.56.
The estimated intercept and slope (without adjusting for autocorrelation) are:
The number of pelts is estimated to increase at about 0.8% per year. As noted before, the estimates are still unbiased, but the reported standard errors are too small. Using the rule-of-thumb, the inflation factor for the standard errors is approximately:

InfFactor = sqrt((1 + ρ̂)/(1 − ρ̂)) = sqrt((1 + .56)/(1 − .56)) = 1.9

Hence a more realistic standard error would be 1.9 × .005 = .009.
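The rule-of-thumb adjustment is easy to check directly; the naive se of .005 and ρ̂ = .56 below are taken from the output above:

```python
import math

def se_inflation(rho):
    """Rule-of-thumb inflation factor for standard errors when the
    residuals have lag-1 autocorrelation rho: sqrt((1+rho)/(1-rho))."""
    return math.sqrt((1 + rho) / (1 - rho))

rho_hat = 0.56      # estimated autocorrelation from the Durbin-Watson output
naive_se = 0.005    # reported (unadjusted) standard error of the slope
adjusted_se = se_inflation(rho_hat) * naive_se   # roughly 1.9 x .005
```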
A more formal analysis would proceed as follows. First launch the Analyze->Modelling->Time Series platform:
and specify the Y variable.
The Time variable is only used for graphing. JMP assumes that the data are equally spaced without any missing values.
This gives the initial output:
The estimated lag-1 autocorrelation is about 0.6 - quite high - but the lag-2 and higher autocorrelations don't appear to be statistically significant as they don't fall outside the blue lines drawn on the graph of the autocorrelations.
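The sample autocorrelations and the reference lines can also be computed outside JMP. A minimal sketch - since the mink series itself is not reproduced in these notes, a simulated AR(1) series with true lag-1 autocorrelation 0.6 stands in for log_mink:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
y = np.zeros(n)
for t in range(1, n):                  # AR(1): y_t = 0.6*y_{t-1} + noise
    y[t] = 0.6 * y[t - 1] + rng.normal()

def acf(x, k):
    """Sample lag-k autocorrelation."""
    x = x - x.mean()
    return float((x[k:] * x[:-k]).sum() / (x * x).sum())

r1 = acf(y, 1)                         # should be near 0.6
band = 2 / np.sqrt(n)                  # approximate 95% limits (the "blue lines")
```

Lags whose sample autocorrelation falls inside ±band are consistent with no autocorrelation at that lag.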
A simple autoregression model with NO TREND (called the ARIMA(1,0,0) model) is fit using the ARIMA drop-down menu and completing the various boxes:
A key assumption of the Box-Jenkins approach is that the series is stationary, i.e. has a constant mean. If there is a linear trend in the log_mink numbers, this MUST first be removed before a subsequent model is fit. A simple linear trend is removed by differencing. For example, if a simple linear trend model is correct,

Y_t = β_0 + β_1 t

then:

Y_{t+1} = β_0 + β_1 (t + 1)
Y_{t+1} − Y_t = β_1

and the FIRST differences are constant.
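The algebra above is easy to verify numerically: first differences of an exact linear trend are constant and equal to the slope. A quick sketch with an arbitrary intercept and slope:

```python
import numpy as np

t = np.arange(10)
b0, b1 = 2.0, 0.8        # arbitrary intercept and slope
y = b0 + b1 * t          # Y_t = b0 + b1*t, no noise
d = np.diff(y)           # Y_{t+1} - Y_t
# every element of d equals b1, so differencing removes the linear trend
```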
A first difference model is fit by specifying the second term in the ARIMA model specification:
Finally, a model with differencing but no autocorrelation may also be useful:
A comparison of the three models is given by JMP:
The AIC criterion indicates that the model with the lowest value of AIC is preferred; models with an AIC within 2 or 3 of the best-fitting model could also be candidates. According to this output, the AR(1) model is the best-fitting model with an AIC of −94, almost 8 units lower than the next best-fitting model, i.e. it wasn't necessary to remove the trend from the model before fitting it to the data.
Indeed, if you look at the output from the AR(1,1) or AR(0,1) model:
the estimated average difference in the log_mink is only .00042 with a se of .05, clearly not statistically different from zero.
It is interesting to note that there is no evidence of further autocorrelation in the residuals after the first differences were taken, as the AR(1,1) model is a worse fit (but only by about 2 units) when compared on the AIC scale. You can also estimate the average first difference by computing a derived variable using the Formula editor and using the Analyze->Fit Model platform to estimate the overall mean and to see if there is residual autocorrelation.
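That last suggestion - difference the series yourself and test whether the mean difference is zero - takes only a few lines. A sketch using a simulated stand-in series, since the actual log_mink values are not reproduced here:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 60
e = np.zeros(n)
for t in range(1, n):                      # autocorrelated noise, no trend
    e[t] = 0.6 * e[t - 1] + rng.normal(0, 0.2)
log_y = 5.0 + e                            # hypothetical stand-in for log_mink

d = np.diff(log_y)                         # derived first-difference variable
mean_d = d.mean()                          # estimated average yearly change
se_d = d.std(ddof=1) / np.sqrt(d.size)     # its (naive) standard error
# |mean_d| well under 2*se_d: no evidence of a trend in the differences
```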
Final Notes
It is interesting to note that there is no evidence of further autocorrelation in the residuals after the first differences were taken. If you hadn't examined the autocorrelation plots you would not have known this. It is quite common that a first difference will remove much of the autocorrelation in the data, and this is often a good first step.
2.7 Dealing with seasonality
In many cases, the "cause" of autocorrelation over time is some sort of seasonality. For example, stream flow may follow a cyclical pattern with high flows in the winter months (at least in Vancouver) and low flows in the summer months. A way to deal with this type of autocorrelation is either to first adjust the data for seasonal effects and then use the usual regression methods on the adjusted data, or to fit a cyclic pattern over and above the simple trend line.
2.7.1 Empirical adjustment for seasonality
General idea
The intuitive idea behind this method is quite simple. Arrange the data into seasonal groups (e.g. months) and subtract the seasonal group mean or median 27 from every point in the seasonal series. This will subtract the cyclic pattern and leave adjusted data that are "free" of seasonal effects.
The adjustment process can either be done within the computer package or, in many cases, is easily done on a spreadsheet.
This adjustment is a bit ad hoc, but seems to work well in practice. The reported standard errors from the regression line are a bit too small as they have not accounted for the adjustment process.
Example: Total phosphorus from Klamath River
Consider, for example, values of total phosphorus taken from the Klamath River near Klamath, California, as analyzed by Hirsch et al. (1982). 28
27 The median would be preferred to avoid contamination of the mean by outliers.
28 This was monitoring station 11530500 from the NASQAN network in the US. Data are available from http://waterdata.usgs.gov/nwis/qwdata/?site_no=11530500. The data were analyzed by Hirsch, R.M., Slack, J.R., and Smith, R.A. (1982). Techniques of trend analysis for monthly water quality data. Water Resources Research 18, 107-121.
Total phosphorus (mg/L) in Klamath River near Klamath, CA

Month   1972  1973  1974  1975  1976  1977  1978  1979
  1     0.07  0.33  0.70  0.08  0.04  0.05  0.14  0.08
  2     0.11  0.24  0.17   .     .     .    0.11  0.04
  3     0.60  0.12  0.16   .    0.14  0.03  0.02  0.02
  4     0.10  0.08  1.20  0.11  0.05  0.04  0.06  0.01
  5     0.04  0.03  0.12  0.09  0.02  0.04  0.03  0.03
  6     0.05
shows an obvious seasonality to the data, with peak levels occurring in the winter months. There are also some missing values, as seen in the raw data table. Finally, notice the presence of several very large values (above 0.20 mg/L) that would normally be classified as outliers. Consequently, we will use the median from each month for the adjustment. The sorted values for the January readings are:
.04, .05, .07, .08, .08, .14, .33, .70
The median value for the January readings is the average of the 4th and 5th observations 29, or

median_January = (.08 + .08)/2 = .08.
The value of .08 is subtracted from each of the January readings to give

−.01, .25, .62, .00, −.04, −.03, .06, .00
29 If the number of observations is odd, as for February, the median is the middle value.
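The January adjustment can be checked in a few lines using the readings from the raw data table:

```python
from statistics import median

# January total phosphorus readings, 1972-1979, from the table above
january = [0.07, 0.33, 0.70, 0.08, 0.04, 0.05, 0.14, 0.08]

m = median(january)                 # average of the two middle values = .08
adjusted = [round(x - m, 2) for x in january]
# adjusted reproduces the seasonally adjusted January row: -.01, .25, .62, ...
```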
This process is repeated for each month. These computations are illustrated in the Klamath tab in the ALLofDATA.xls workbook available in the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms to give:
Seasonally Adjusted Total phosphorus (mg/L) in Klamath River near Klamath, CA

Month    1972   1973   1974   1975   1976   1977   1978   1979
  1     -0.01   0.25   0.62   0.00  -0.04  -0.03   0.06   0.00
  2      0.00   0.13   0.06    .      .      .     0.00  -0.07
  3      0.48   0.00   0.04    .     0.02  -0.09  -0.10  -0.10
  4      0.03   0.01   1.13   0.04  -0.02  -0.03  -0.01  -0.06
  5      0.01  -0.01   0.09   0.06  -0.02   0.01  -0.01  -0.01
  6      0.00  -0.04   0.00   0.00    .      .    -0.02    .
  7      0.00   0.00  -0.01  -0.02    .     0.02  -0.02   0.00
  8     -0.01   0.01  -0.03  -0.01   0.02   0.03   0.01  -0.04
  9      0.02   0.01  -0.03   0.02    .     0.00  -0.04    .
 10      0.00   0.00  -0.01    .     0.00  -0.04  -0.03   0.20
 11      0.00   0.28    .    -0.01    .     0.33   0.00    .
 12      0.01   0.03  -0.01  -0.08    .     0.18  -0.06    .
A plot of the seasonally adjusted values:
shows that most of the seasonal effects have been removed, but there may still be evidence of autocorrelation. There are certainly still some outliers.
JMP analysis
The seasonally adjusted values were imported into JMP and stacked in the usual way. A new variable year-month was created using a formula variable, year-month = year + (month − 1)/12, to represent time:
The Analyze->Fit Y-by-X platform was used to draw the scatter plot and fit the preliminary line:
It is a bit worrisome that the outliers seem to be all in the early years. All seasonally adjusted values greater than 0.2 were excluded from the analysis 30 and the line was refit:
30 Use the Rows→Select command to select these rows.
There appears to be evidence of a trend of −.0032 mg/L/year. The p-value and se of the slope are likely too small by some small factor because the seasonal adjustment was not taken into account. The residual plot seems to show some evidence of remaining autocorrelation.
The Analyze->Fit Model platform was reused to fit the data and obtain the Durbin-Watson statistic:
This indicates a low amount of residual autocorrelation (estimated value of .04), but it is statistically significant because the large sample size allows you to detect very small autocorrelations.
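The Durbin-Watson statistic itself is simple to compute from saved residuals: d = Σ(e_t − e_{t−1})² / Σe_t², and d ≈ 2(1 − ρ̂), so a value near 2 indicates little autocorrelation. A sketch on simulated independent residuals (the actual saved residuals are not reproduced here):

```python
import numpy as np

def durbin_watson(resid):
    """d = sum((e_t - e_{t-1})^2) / sum(e_t^2); d is near 2 when the
    residuals are uncorrelated, and d is approximately 2*(1 - rho_hat)."""
    resid = np.asarray(resid, dtype=float)
    return float(np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2))

rng = np.random.default_rng(4)
e = rng.normal(size=500)        # independent residuals
d = durbin_watson(e)            # close to 2
rho_hat = 1 - d / 2             # implied autocorrelation estimate, close to 0
```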
Further comments It is a bit worrisome that all of the outliers appear to happen early in the time series and that, once these are removed, there is no evidence of a trend. However, one could argue that the disappearance of the outliers is, in fact, the most interesting point of this dataset and that the fact that the outliers disappeared indicates evidence of a downward trend.
It also turns out that the results are VERY sensitive to which outliers are removed. For example, in late 1977 there is a seasonally adjusted value of .17, and in late 1979 there was a seasonally adjusted value of 0.20, that were not excluded. If these points are also removed, the final regression line is not statistically significant with an estimated trend of −.0063 mg/L/year.
As you will see later, a non-parametric analysis that includes these outlier points did detect a downward trend with an estimated slope of about −.006 mg/L/year! The moral of the story is that statistics must be used carefully!
2.7.2 Using the ANCOVA approach
General idea
Rather than relying on an ad hoc approach to doing a seasonal adjustment, the ANCOVA method can also be used. The advantage of the ANCOVA method over the ad hoc approach is that not only can you fit an overall trend line, you can also test whether the trend is the same for all seasons. Outliers will have to be removed in the usual fashion.
The general model will start with the non-parallel slope model of the form:

Y = Season Time Season*Time

Then examine whether the Season*Time interaction term indicates that the slopes may not be parallel over seasons.
If there is insufficient evidence against the hypothesis of parallelism, then fit the final model with a common slope over the seasons, but differences among the seasons:

Y = Season Time
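The parallel-slope model can be sketched without a statistics package by building the design matrix directly: one indicator (dummy) column per season acting as a separate intercept, plus a single shared slope on time. The data below are simulated with a known common trend of −0.005 per year, loosely mimicking the phosphorus setup; they are not the real data:

```python
import numpy as np

rng = np.random.default_rng(1)
# Simulated monthly data, 1972-1979: month-specific levels plus a common trend
years = np.tile(np.arange(1972, 1980), 12).astype(float)
months = np.repeat(np.arange(1, 13), 8)
level = 0.10 + 0.02 * np.cos(2 * np.pi * months / 12)
y = level - 0.005 * (years - 1972) + rng.normal(0, 0.005, size=years.size)

# Design matrix: 12 month indicators (separate intercepts) + shared Year slope
X = np.column_stack([(months == m).astype(float) for m in range(1, 13)]
                    + [years - 1972])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
common_slope = beta[-1]         # overall trend after removing seasonality
```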
Example: Total phosphorus levels on the Klamath River - revisited
JMP Analysis
The raw data are available in the file klamath.jmp in the Sample Program Library available at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms.
An earlier plot shows that there are some outliers. Remove all data points greater than 0.20 mg/L.
The Analyze->Fit Model platform is used to fit the non-parallel slope model. CAUTION: Be sure that month is nominally scaled and that year is continuously scaled!
The graph of the lines by season appears to show that some seasons (months) have a different slope than the other months:
and the effect test for non-parallel slopes:
also shows some evidence of non-parallel slopes. However, we will fit the parallel slope model to continue the demonstration.
The Analyze->Fit Model platform is again used to fit the parallel slope model. CAUTION: Be sure that month is nominally scaled and that year is continuously scaled!
The fitted lines and the model fit graph appear to be acceptable:
The effect tests show a strong effect of year, with estimated coefficients of:
The estimated trend is −.0056 (se .0016) mg/L/year, which is comparable to the previous estimates. Note that the estimates for the month effects are not directly interpretable from this output - the LSMEANS table should be consulted - seek help on this point.
The residual plots (not shown) don't indicate any major problems. The Durbin-Watson test for autocorrelation detects a small autocorrelation, but with this large sample size it is not practically important.
2.7.3 Fitting cyclical patterns
General approach
In some cases, the seasonal pattern is quite regular, with regular peaks during one part of the year and regular lows during another part of the year. Another approach is to try to account for this cyclical pattern, and then see if there is still evidence of a decline over time.
The basic building blocks for the seasonality are sine and cosine functions used to represent the
seasonal patterns. The general model will take the form:

Y_i = β_0 + β_1 t_i + β_2 cos(2π t_i / ν) + β_3 sin(2π t_i / ν) + ε_i
Here the coefficients β_0 and β_1 represent the intercept and linear change over time. The coefficients β_2 and β_3 represent the seasonal components.
The term ν represents the period of the cycle. It is assumed to be known in advance. For example, if the cycles are one year in duration and the time axis is measured in years, then ν = 1. If the cycles are one year in duration but the time axis is measured in months, then ν = 12. This is often coded incorrectly, so be careful!
The reason both a sine and a cosine function are included is that these two functions have the same period but are shifted in phase relative to each other. For example, the cosine function has a maximum at the start of each cycle and a minimum half-way through each cycle, while the sine function has a maximum at the 1/4 point of a cycle and a minimum at the 3/4 point of the cycle. A weighted sum of the two can therefore place the peak of the fitted cycle anywhere within the period.
The analysis starts by creating two new variables in the data table corresponding to the sine and cosine functions. Then multiple regression is used to fit a model incorporating all three explanatory variables. In the short-hand notation for models, the model fit is:

Y = Time Cos Sin

After the model is fit, the coefficient of the Time variable represents the overall trend. The usual tests of hypothesis for no trend, and confidence intervals for the slope, can be found. The slope is interpreted as the change in Y per unit change in X = Time after adjusting for seasonality. The coefficients for the sine and cosine functions are usually not of interest.
The computation should NOT be attempted by hand or in a spreadsheet program. Most statistical packages have facilities for creating the relevant variables and fitting these models.
The usual assumptions still hold, so they should be checked via residual plots, estimation of the autocorrelation that remains, etc.
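A minimal sketch of the whole procedure, using simulated monthly data with a known yearly cycle (ν = 1 since time is in years) and a true trend of −0.005 per year; the series is hypothetical, not the Klamath data:

```python
import numpy as np

rng = np.random.default_rng(2)
t = 1972 + np.arange(96) / 12.0         # time in years, monthly observations
nu = 1.0                                # one-year cycle, time axis in years
y = (0.10 - 0.005 * (t - 1972)
     + 0.03 * np.cos(2 * np.pi * t / nu)
     + 0.01 * np.sin(2 * np.pi * t / nu)
     + rng.normal(0, 0.01, t.size))

# Create the two cyclic regressors, then fit Y = Time Cos Sin by
# ordinary multiple regression.
X = np.column_stack([np.ones_like(t), t - 1972,
                     np.cos(2 * np.pi * t / nu),
                     np.sin(2 * np.pi * t / nu)])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
trend = beta[1]                          # slope after adjusting for seasonality
```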
Example: Total phosphorus from Klamath River
JMP Analysis
The data and model fits are available in a JMP file klamath3.jmp available in the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms.
The data must be stacked in the usual fashion and a variable year-month created to represent the time variable, year-month = year + (month − 1)/12.
As the time variable is measured in years and the preliminary plot shows a yearly cycle, ν = 1, so the formulae for the cosine and sine variables are:
respectively. This gives the final data table, looking somewhat like:
There are no problems with the fact that some of the phosphorus data are missing, as the package will simply ignore any row that is not complete.
The Analyze->Fit Model platform is used to fit the model:
The output is voluminous and a full discussion is beyond the scope of these notes. 31 The key things to look at are the estimated coefficients, the residual plots, and the model fit plots:
31 See Freund, R., Littell, R. and Creighton, L. (2003). Regression Using JMP. Wiley, for more details on the output from this platform.
These all indicate the presence of several outliers.
The model was refit omitting the outliers with phosphorus values greater than 0.20. 32 The residual and model fit plots are much better:
32 Use the Rows→Select command to select the rows and the Rows→Exclude command to remove them from the analysis.
but the residual plot still shows something strange happening about half-way through the time series. It appears that the cycles are shifting, so you get a long wave of residuals.
The estimated coefficients are:
The coefficients for both the cosine and sine terms are statistically significant but not of much interest. The estimated trend is −.0056 mg/L/year (se .0017) with a p-value for the trend line of .0013. The results are statistically significant.
The Durbin-Watson test for autocorrelation shows some residual serial correlation:
which likely reflects the behavior in the tail end of the series.
Example: Comparing air quality measurements using two different methods
The air that we breathe often has many contaminants. One contaminant of interest is Particulate Matter (PM). Particulate matter is the general term used for a mixture of solid particles and liquid droplets in the air. It includes aerosols, smoke, fumes, dust, ash and pollen. The composition of particulate matter varies with place, season and weather conditions. Particulate matter is characterized according to size - mainly because of the different health effects associated with particles of different diameters. Fine particulate matter is particulate matter that is 2.5 microns in diameter or less. [A human hair is approximately 30 times larger
than these particles!] The smaller particles are so small that several thousand of them could fit on the period at the end of this sentence. Fine particulate matter is also known as PM2.5, or respirable particles, because it penetrates the respiratory system further than larger particles.
PM2.5 material is primarily formed from chemical reactions in the atmosphere and through fuel combustion (e.g., motor vehicles, power generation, industrial facilities, residential fireplaces, wood stoves and agricultural burning). Significant amounts of PM2.5 are carried into Ontario from the U.S. During periods of widespread elevated levels of fine particulate matter, it is estimated that more than 50 per cent of Ontario's PM2.5 comes from the U.S.
Adverse health effects from breathing air with a high PM2.5 concentration include premature death, increased respiratory symptoms and disease, chronic bronchitis, and decreased lung function, particularly for individuals with asthma.
Further information about fine particulates is available at many websites, such as http://www.health.state.ny.us/nysdoh/indoor/pmq_a.htm, http://www.airqualityontario.com/science/pollutants/particulates.cfm, and http://www.epa.gov/pmdesignations/faq.htm.
The PM2.5 concentrations in air can be measured in many ways. A well known method is a filter-based method whereby one 24-hour sample is collected every third day. The sampler draws air through a pre-weighed filter for a specified period (usually 24 hours) at a known flowrate. The filter is then removed and sent to a laboratory to determine the gain in filter mass due to particle collection. The ambient PM concentration is calculated as the gain in filter mass divided by the product of the sampling period and the sampling flowrate. Additional analysis can also be performed on the filter to determine the chemical composition of the sample.
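The mass-to-concentration calculation described above is just the mass gain divided by the total volume of air sampled; the numbers in the example call below are hypothetical:

```python
def pm_concentration(mass_gain_ug, hours, flow_m3_per_h):
    """Ambient PM concentration (ug/m^3): filter mass gain divided by the
    total air volume sampled (sampling period x flow rate)."""
    return mass_gain_ug / (hours * flow_m3_per_h)

# e.g. a hypothetical 24-hour sample at 1.0 m^3/h gaining 240 ug of mass
c = pm_concentration(240.0, 24.0, 1.0)   # 10 ug/m^3
```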
In recent years, a program of continuous sampling using automatic samplers has been introduced. An instrument widely adopted for this use is the Tapered Element Oscillating Microbalance (TEOM). The TEOM operates under the following principles. Ambient air is drawn in through a heated inlet. It is then drawn through a filter cartridge on the end of a hollow, tapered tube. The tube is clamped at one end and oscillates freely like a tuning fork. As particulate matter gathers on the filter cartridge, the natural frequency of oscillation of the tube decreases. The mass accumulation of particulate matter is then determined from the corresponding change in frequency.
Because of the different ways in which these instruments work, a calibration experiment was performed. The hourly TEOM readings were accumulated to a daily value and compared to those obtained from the air filter method. Here are the data:
Date TEOM Ref
2003.06.05 8.1 10.6
2003.06.08 6.5 9.0
2003.06.11 3.2 4.6
2003.06.14 2.2 3.7
2003.06.17 5.8 7.9
2003.06.20 1.4 4.4
2003.06.23 1.8 2.8
2003.06.26 4.5 6.5
2003.06.29 4.6 5.8
2003.07.02 3.3 3.6
2003.07.05 1.6 3.7
2003.07.08 7.1 7.2
2003.07.11 7.7 8.6
2003.07.14 4.3 4.4
2003.07.17 4.6 6.4
2003.07.20 7.2 8.5
2003.07.23 8.8 10.5
2003.07.26 8.1 9.0
2003.07.29 11.2 10.4
2003.08.01 19.4 21.0
2003.08.07 5.9 5.2
2003.08.10 11.9 12.6
2003.08.13 7.2 8.4
2003.08.16 48.2 46.2
2003.08.19 49.3 51.2
2003.08.22 53.3 54.5
2003.08.25 56.8 57.2
2003.08.28 4.5 7.4
2003.08.31 27.8 26.1
2003.09.03 34.3 33.0
2003.09.06 41.5 42.1
2003.09.24 5.8 9.5
2003.09.27 5.7 8.0
2003.09.30 9.1 9.8
2003.10.03 10.5 13.9
2003.10.06 10.9 15.6
2003.10.09 3.5 5.6
2003.10.12 4.1 6.3
2003.10.15 5.7 10.1
2003.10.18 15.5 20.2
2003.10.21 5.4 8.9
2003.10.24 11.7 19.0
2003.10.27 14.9 23.3
2003.10.30 3.9 7.5
2003.11.02 12.9 21.2
2003.11.05 18.9 33.4
2003.11.08 23.6 35.9
2003.11.11 19.0 30.2
2003.11.14 18.5 28.2
2003.11.17 11.1 18.4
2003.11.20 11.6 20.1
2003.11.23 9.4 17.9
2003.11.26 25.6 42.8
2003.11.29 6.9 11.2
2003.12.02 13.2 25.6
2003.12.05 10.2 19.9
2003.12.08 17.6 31.6
2003.12.11 6.7 14.1
2003.12.14 16.2 26.5
2003.12.17 8.3 13.5
2004.01.13 6.8 13.8
2004.01.16 9.2 17.3
2004.01.19 16.5 32.6
2004.01.22 4.3 11.6
2004.01.25 6.1 10.0
2004.01.28 10.1 14.4
2004.01.31 14.0 28.1
2004.02.06 19.4 35.0
2004.02.09 15.1 25.2
2004.02.12 16.8 32.9
2004.02.15 15.9 28.5
2004.02.18 9.8 18.5
2004.02.21 9.1 17.2
2004.02.24 17.1 31.9
2004.02.27 12.1 21.7
2004.03.01 8.8 14.1
2004.03.07 3.2 5.6
2004.03.10 10.9 15.3
2004.03.13 7.1 10.8
2004.03.16 7.4 13.8
2004.03.19 10.4 14.0
2004.03.22 10.6 16.1
2004.03.25 5.0 8.4
2004.03.28 6.4 10.3
2004.03.31 5.3 6.6
2004.04.03 6.5 9.7
2004.04.09 6.4 9.7
2004.04.12 7.0 8.8
2004.04.15 2.3 4.6
2004.04.18 4.2 5.7
2004.04.21 4.7 5.7
2004.04.24 3.7 4.1
2004.04.27 4.1 5.0
2004.04.30 7.3 7.3
2004.05.03 3.5 5.0
2004.05.06 2.5 2.8
2004.05.09 2.3 2.7
2004.07.02 6.0 4.3
2004.07.05 3.3 2.4
2004.07.08 1.6 2.0
2004.07.11 1.2 5.7
2004.07.14 5.4 8.3
2004.07.17 8.8 3.5
2004.07.20 2.2 10.0
2004.07.23 8.3 12.5
2004.07.26 10.5 17.0
2004.08.01 25.3 24.7
2004.08.04 14.7 10.5
2004.08.07 2.7 3.1
2004.08.10 6.5 7.2
2004.08.19 20.1 13.6
2004.08.25 4.1 4.2
2004.08.28 2.5 1.5
2004.08.31 4.7 6.3
2004.09.03 3.2 4.0
2004.09.15 1.8 2.6
2004.09.18 2.6 4.7
2004.09.21 4.7 6.2
2004.09.24 5.6 8.0
2004.09.27 7.1 10.0
2004.09.30 4.8 7.7
2004.10.03 9.5 13.3
2004.10.06 10.1 13.0<br />
2004.10.09 3.8 5.0<br />
2004.10.12 5.0 7.3<br />
2004.10.15 2.3 5.4<br />
2004.10.18 7.5 10.1<br />
2004.10.21 8.1 11.0<br />
2004.10.24 6.6 13.6<br />
2004.10.27 14.0 18.2<br />
2004.10.30 15.9 24.8<br />
2004.11.02 8.4 14.1<br />
2004.11.08 10.8 17.6<br />
2004.11.11 1.4 4.7<br />
2004.11.14 6.5 10.0<br />
2004.11.17 11.0 18.8<br />
2004.11.20 7.7 14.4<br />
2004.11.26 15.4 23.4<br />
2004.11.29 8.9 17.1<br />
2004.12.02 18.3 30.8<br />
2004.12.05 6.2 13.5<br />
2004.12.08 8.3 16.5<br />
2004.12.11 9.6 15.9<br />
2004.12.14 9.8 17.6<br />
2004.12.17 11.5 21.5<br />
2004.12.20 14.0 26.1<br />
2004.12.23 9.8 20.0<br />
2004.12.26 4.9 9.4<br />
2004.12.29 3.7 7.6<br />
2005.01.01 10.2 18.5<br />
2005.01.04 18.6 38.3<br />
2005.01.22 11.1 24.7<br />
2005.01.25 11.8 22.7<br />
2005.01.28 13.1 20.9<br />
2005.01.31 5.1 10.9<br />
2005.02.03 6.2 11.1<br />
2005.02.06 6.5 10.0<br />
2005.02.09 10.6 20.8<br />
2005.02.12 11.4 23.3<br />
2005.02.15 12.9 18.8<br />
2005.02.18 14.0 23.4<br />
2005.02.21 21.9 31.7<br />
2005.02.24 17.1 26.4<br />
2005.02.26 8.3 16.3<br />
2005.02.27 11.8 20.1<br />
2005.03.02 16.7 28.9<br />
2005.03.05 12.0 18.9<br />
2005.03.08 5.3 9.8<br />
2005.03.11 10.9 18.8<br />
2005.03.14 11.3 18.1<br />
2005.03.17 8.5 11.0<br />
2005.04.04 12.0 10.9<br />
2005.04.07 7.8 7.1<br />
2005.04.16 2.3 4.8<br />
2005.04.19 5.5 3.9<br />
2005.04.22 8.0 6.7<br />
2005.04.25 7.3 10.0<br />
2005.04.28 3.5 9.0<br />
2005.05.01 4.5 4.5<br />
2005.05.04 5.1 1.8<br />
2005.05.07 2.5 5.4<br />
2005.05.28 6.1 6.7<br />
2005.05.31 9.7 12.0<br />
2005.06.03 5.2 5.0<br />
2005.06.06 0.9 2.1<br />
2005.06.09 4.4 6.2<br />
2005.06.12 2.3 2.7<br />
2005.06.15 2.3 2.2<br />
2005.06.18 1.7 2.6<br />
2005.06.21 6.7 6.9<br />
2005.06.24 3.4 3.8<br />
2005.06.27 4.2 4.6<br />
2005.06.30 4.3 5.5<br />
2005.07.03 2.7 5.2<br />
2005.07.06 3.6 4.2<br />
2005.07.09 1.3 1.9<br />
2005.07.12 2.8 6.3<br />
Do both meters give similar readings over time?

It is quite common when comparing two instruments to make the comparison on the log-ratio scale, i.e. either log(TEOM/reference) or log(reference/TEOM). There are two reasons why this is commonly done. First, the logarithmic scale makes ratios greater than 1 and less than 1 symmetric. For example, the ratios 1/2 and 2 on the regular scale are not symmetric about the value of 1, but log(1/2) = −.693 and log(2) = .693 are symmetric about zero. Second, it is often the case that the variation tends to increase with the base size of the reading. The use of logarithms makes the variances more similar over the spread of the data values.
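The symmetry of the log scale is easy to verify numerically. The sketch below (Python, not part of the original JMP workflow) also computes the log-ratio for the first calibration pair from the table:

```python
import math

# Ratios r and 1/r are symmetric about 0 on the log scale.
print(round(math.log(2), 3), round(math.log(1 / 2), 3))

# log(TEOM/reference) for the first calibration day, 2003.06.05
teom, ref = 8.1, 10.6
print(round(math.log(teom / ref), 3))
```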
JMP Analysis

A JMP data file is available in the teom.jmp file in the Sample Program Library available at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms.

Two variables need to be created in the JMP table. First, the log(TEOM/reference) variable as noted above. This is created using the formula editor.

Second, a variable representing the decimal year is required so that plotting and regression happen on the year scale rather than the internal date and time format in JMP. JMP uses the number of seconds since a reference date as the internal value for a date. Consequently, you need to divide by 86,400 seconds/day to convert to days, and then by 365 to convert to years. [This ignores the effect of leap years and leap seconds.] This year variable is also created using the formula editor.
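The same conversion can be sketched outside JMP. The reference date below is assumed to be 1904-01-01 (the classic JMP/Macintosh epoch); as in the notes, leap years and leap seconds are ignored, so the result is only approximate:

```python
from datetime import date

def decimal_year(d, epoch=date(1904, 1, 1)):
    """Convert a date to a decimal year the way the notes describe:
    internal seconds / 86,400 seconds-per-day / 365 days-per-year."""
    seconds = (d - epoch).days * 86_400
    return epoch.year + seconds / 86_400 / 365

print(round(decimal_year(date(2003, 6, 5)), 3))
```

Because the accumulated leap days are divided by 365 rather than 365.25, the decimal part drifts slightly from the true day-of-year fraction, exactly as the bracketed caveat warns.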
Here are the first few lines of data, including the two new derived variables.
A plot of log(TEOM/reference) by the year variable, obtained using the Analyze->Fit Y-by-X platform, shows a clear cyclical pattern. The peaks of the cycles are almost exactly one year apart. Consequently, we then create two new variables to represent the sine and cosine terms for a cyclical fit. Because the time units are in years, the period is also in years and is equal to ν = 1. The following formula variables were created.
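The two formula columns correspond to sin(2πt/ν) and cos(2πt/ν), with t the decimal year and period ν = 1. A sketch (Python; the notes build these in JMP's formula editor):

```python
import math

def cyclic_terms(year, period=1.0):
    """Sine and cosine predictors for a cyclical fit with period nu (in years)."""
    return (math.sin(2 * math.pi * year / period),
            math.cos(2 * math.pi * year / period))

s, c = cyclic_terms(2003.25)  # one quarter of the way into a year
print(round(s, 3))            # the sine term peaks here
```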
The first few lines of the data table now include the sine and cosine columns.
Now use the Analyze->Fit Model platform to fit a multiple regression using the year, sine, and cosine variables.
The effects test indicates the presence of a cyclical pattern (not unexpectedly), but no evidence of a year effect. Save the predicted values and the residuals to the data table using the Red Triangle→Save Columns pop-down menus.

The residual plot, found using the Analyze->Fit Y-by-X platform, shows no severe lack of fit. There are several outliers, and perhaps something unusual is happening in mid-2003. An overlay plot of the actual and predicted values 33 shows a generally good fit, with some outlier points and, again, further investigation required in about mid-2003. 34

33 Use the Graph→Overlay platform; select the observed and predicted values as the Y variables and year as the X variable; click on the legend for the predicted values and join with a line and hide the points.
The log(TEOM/reference) hardly goes above the value of 0 (which is the reference line indicating no difference between the two instruments). In order to estimate the average log-ratio, we refit the model DROPPING the year term (why?) and examine the parameter estimates of this simpler model.

The average log-ratio is −.39 (se .02). This corresponds to a ratio of .68 on the anti-log scale, i.e. the TEOM meter is reading, on average across the entire year, only 68% of the reference meter.
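The back-transform from the log-ratio scale can be checked directly; the ±2 se interval below is a rough large-sample approximation added here for illustration, not a result from the notes:

```python
import math

mean_log_ratio, se = -0.39, 0.02   # estimates from the refitted model
ratio = math.exp(mean_log_ratio)
print(round(ratio, 2))             # TEOM reads about 68% of the reference

lo, hi = math.exp(mean_log_ratio - 2 * se), math.exp(mean_log_ratio + 2 * se)
print(round(lo, 2), round(hi, 2))  # rough 95% interval for the ratio
```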
2.7.4 Further comments

An implicit assumption of this method is that the amplitude of the seasonal trend is constant in time, i.e. the β2 and β3 terms do not depend on time. It could happen that the amplitude is also decreasing in time. In this case, you may consider a log-transform of the Y variable so that the relative ratio between the top and bottom of the cycle may be fixed. Alternatively, more complex non-linear regression models where the amplitude also depends upon time may be fit. This is beyond the scope of these notes.

The key requirement for this method to work well is the regularity of the seasonal effects; the shape of the seasonal effects must be that of a sine or cosine curve. Consequently, a pattern that is relatively flat with a single sharp peak in a consistent month cannot be well fit by these models. In this case, you could create indicator variables for the peak time and then fit a multiple regression model as above – this is again beyond the scope of these notes.
2.8 Seasonality and Autocorrelation

Whew! This is a tough issue to deal with! Fortunately, there have been great advances in software, and in some packages (e.g. SAS) this is fairly easy to deal with. Unfortunately, this is beyond simple packages such as JMP or SYSTAT.

34 It turns out that these points were collected when a large amount of smoke from a nearby forest fire was present.

This section will be brief, with very little explanation of the underlying statistical concepts, and will refer to output from SAS. Please seek further help if you are dealing with this type of data.
Again, refer back to the Klamath River data. It may turn out that even after adjusting for seasonality, there is residual autocorrelation within a year. For example, a particular year may have generally low phosphorus levels for some reason, and so observations in months close together are more highly related than observations in months far apart.

A common model for dealing with this type of autocorrelation is the familiar AR(1) process with a single autocorrelation parameter. In general, the covariance of two observations is modeled as

cov(Y_t1, Y_t2) = σ² ρ^Δt

where Δt is the difference in time between the two observations. For example, observations that are 1 time unit apart will have covariance σ²ρ¹; observations that are two time units apart will have covariance σ²ρ²; etc.

The advantage of using this power notation is that missing values are easily accommodated – it is not necessary to have every observation in time, so interpolation to 'fill in' missing values is not necessary.
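The power form of the covariance is simple to compute. In the sketch below, the values σ² = 1 and ρ = 0.5 are illustrative defaults, not estimates from the Klamath fit:

```python
def ar1_cov(dt, sigma2=1.0, rho=0.5):
    """Covariance of two observations dt time units apart under the
    spatial-power AR(1) model: sigma^2 * rho**dt.  Because dt need not
    be an integer, unequally spaced or missing observations are handled
    without any interpolation."""
    return sigma2 * rho ** dt

print(ar1_cov(1), ar1_cov(2), ar1_cov(2.5))
```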
Let us revisit the Klamath phosphorus data. A model that allows for seasonal variation (by months) and autocorrelation can be fit using Proc Mixed with both the ANCOVA and autocorrelation models. The code fragment looks like:

proc mixed data=klamath maxiter=200 maxfunc=1000;
where phosphorus
The estimated common slope from this model (mg/L/year) is:

                         Standard
Label        Estimate       Error     DF   t Value   Pr > |t|
avg slope    -0.00578    0.002515   10.9     -2.30     0.0430

which is similar to the estimates found earlier.
A model was also fit assuming independence among the observations (see the ANCOVA approach to seasonal adjustment earlier in this chapter). Is there support for the independence model?

The AIC criterion is used to compare these different models. The two AIC values (corrected for small sample sizes) are:

AICC (smaller is better) -216.2 for the spatial power model
AICC (smaller is better) -208.7 for the independence model

A usual rule of thumb is that a difference of more than 2 in AIC indicates that there is evidence for the model with the smaller AIC. In this case, the AIC for the spatial power model is almost 8 units smaller than that of the independence model. There is strong evidence for residual autocorrelation.
The estimated trend (ignoring autocorrelation) is:

                         Standard
Label        Estimate       Error   DF   t Value   Pr > |t|
avg slope    -0.00562    0.001621   58     -3.47     0.0010

As expected, the estimated slopes are similar, but the reported se from the model ignoring autocorrelation was too small by a factor of about sqrt((1+ρ)/(1−ρ)) = sqrt(1.5/.5) = sqrt(3) = 1.7.
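The inflation factor for the naive se is a one-liner; ρ = 0.5 matches the 1.5/.5 ratio used above:

```python
import math

def se_inflation(rho):
    """Factor by which the independence-model se understates the true se
    for an AR(1) series: sqrt((1 + rho) / (1 - rho))."""
    return math.sqrt((1 + rho) / (1 - rho))

print(round(se_inflation(0.5), 1))  # about 1.7, i.e. sqrt(3)
```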
2.9 Non-parametric detection of trend

The methods so far in this chapter all rely on several assumptions that may not be satisfied in all contexts. For example, all the methods (including the methods for autocorrelation) assume that deviations from the regression line are normally distributed with equal variance. In practice, they are fairly robust to non-normality and heterogeneous variances if the sample sizes are fairly large.

But how is it possible to deal with truncated or censored observations? For example, it is quite common for measurement tools to have upper and lower limits of detectability, and you often get measurements that are below or above detection limits. How can a monotonic, but not linear, relationship be examined? 35 For example, cases of asthma seem to increase with the concentration of particulates in the atmosphere, but the relationship is not linear.
A nice review of the basic methods applicable to many situations is given by:

Berryman, D., B. Bobee, D. Cluis, and J. Haemmerli (1988). Non-parametric approaches for trend detection in water quality time series. Water Resources Bulletin 24(3), 545-556.
2.9.1 Cox and Stuart test for trend

This is a very simple test to perform and can be used in many different situations, as illustrated in Conover (1999, Section 3.5). 36 The idea behind the test is to first divide the dataset into two parts. Match the first observation in the first part with the first observation in the second part; match the second observation in the first part with the second observation in the second part; etc. Then, for each pair of values, determine if the value from the second part is greater than the matched value from the first part. If there is a generally upwards trend in the data, then you should see lots of pairs where the data value for the second part is larger than that of the first part. The number of pairs where the data from the second part exceeds its counterpart in the first part has a binomial distribution with p = .5, and this can be used to determine the p-value of the test. This will be illustrated with an example.

In an earlier section, we examined the records of the grass cutting season over time. We will apply the Cox and Stuart procedure to this data as well.

Here is the raw data again:
35 If a transformation will linearize the line, then an ordinary regression can be used on the transformed data.
36 Conover, W.J. (1999). Applied non-parametric statistics, 2nd edition. Wiley.
Year  Duration (days)
1984 200<br />
1985 215<br />
1986 195<br />
1987 212<br />
1988 225<br />
1989 240<br />
1990 203<br />
1991 208<br />
1992 203<br />
1993 202<br />
1994 210<br />
1995 225<br />
1996 204<br />
1997 245<br />
1998 238<br />
1999 226<br />
2000 227<br />
2001 236<br />
2002 215<br />
2003 242<br />
There are exactly 20 observations, so the data is divided into two parts corresponding to the first 10 years and the last 10 years. 37 This gives the pairing:

37 If the number of observations is odd, then the middle observation is discarded.
        Part I            Part II
Year  Duration    Year  Duration    Part II > Part I
1984 200 1994 210 1<br />
1985 215 1995 225 1<br />
1986 195 1996 204 1<br />
1987 212 1997 245 1<br />
1988 225 1998 238 1<br />
1989 240 1999 226 0<br />
1990 203 2000 227 1<br />
1991 208 2001 236 1<br />
1992 203 2002 215 1<br />
1993 202 2003 242 1<br />
If there are any ties in the pairs, these are also discarded. In this case, there were no ties, and the data from the second part was greater than the corresponding data from the first part in 9 of the 10 years.

A two-sided p-value (allowing for either an increasing or a decreasing trend) is found by computing the probability

P(X ≥ 9) + P(X ≤ 1)

where X comes from a Binomial distribution with n = 10 and p = 0.5.
This can be computed or found from tables such as those at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/PDF/Tables.pdf. A portion of the Binomial table with n = 10 is presented below:
Individual binomial probabilities <str<strong>on</strong>g>for</str<strong>on</strong>g> n=10 and selected values of p<br />
n x 0.1 0.2 0.3 0.4 0.5<br />
------------------------------------------<br />
10 0 0.3487 0.1074 0.0282 0.0060 0.0010<br />
10 1 0.3874 0.2684 0.1211 0.0403 0.0098<br />
10 2 0.1937 0.3020 0.2335 0.1209 0.0439<br />
10 3 0.0574 0.2013 0.2668 0.2150 0.1172<br />
10 4 0.0112 0.0881 0.2001 0.2508 0.2051<br />
10 5 0.0015 0.0264 0.1029 0.2007 0.2461<br />
10 6 0.0001 0.0055 0.0368 0.1115 0.2051<br />
10 7 0.0000 0.0008 0.0090 0.0425 0.1172<br />
10 8 0.0000 0.0001 0.0014 0.0106 0.0439<br />
10 9 0.0000 0.0000 0.0001 0.0016 0.0098<br />
10 10 0.0000 0.0000 0.0000 0.0001 0.0010<br />
From the table above we find that the p-value is

p-value = .0010 + .0098 + .0098 + .0010 = .0216

which is comparable to the value of .012 found from a direct application of linear regression.

Unfortunately, it is not possible to estimate the slope or any confidence interval using this method. The test is available in some computer packages but, because of its simplicity, is often easiest to do by hand.
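Because the test is so simple, it is also easy to script. The sketch below (Python) reproduces the grass-cutting calculation; the exact p-value, .0215, matches the table-based .0216 above up to rounding of the tabulated probabilities:

```python
from math import comb

def cox_stuart(y):
    """Two-sided Cox and Stuart test for trend.

    Splits the series in half (dropping the middle value when n is odd),
    pairs the halves, drops ties, and refers the count of increases to a
    Binomial(n_pairs, 0.5) distribution.
    """
    half = len(y) // 2
    first, second = y[:half], y[-half:]      # middle value dropped if n is odd
    pairs = [(a, b) for a, b in zip(first, second) if a != b]
    n = len(pairs)
    t = sum(b > a for a, b in pairs)         # number of increases
    # exact two-sided p-value: P(X >= max(t, n-t)) + P(X <= min(t, n-t))
    k = max(t, n - t)
    p = (sum(comb(n, x) for x in range(k, n + 1))
         + sum(comb(n, x) for x in range(0, n - k + 1))) / 2 ** n
    return t, n, min(p, 1.0)

durations = [200, 215, 195, 212, 225, 240, 203, 208, 203, 202,
             210, 225, 204, 245, 238, 226, 227, 236, 215, 242]
t, n, p = cox_stuart(durations)
print(t, n, round(p, 4))  # 9 increases out of 10 pairs
```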
Surprisingly, this very simple test does not perform badly when compared to a real regression. For example, the asymptotic relative efficiency of this test, compared to a normal regression situation when all assumptions are satisfied, is almost 80%. This implies that the Cox and Stuart test would give the same power to detect a trend as a regular regression if used with 1/.80 = 1.25 times the sample size.
However, if the data are straightforward, as in this case, there are better non-parametric methods, as will be illustrated in later sections.
2.9.2 Non-parametric regression - Spearman, Kendall, Theil, Sen estimates

Non-parametric does NOT mean no assumptions

While the Cox and Stuart test may indicate that there is evidence of a trend, it cannot provide estimates of the slope etc. Consequently, non-parametric methods have been developed for these situations.

CAUTION: Non-parametric does not mean NO assumptions! Many people view non-parametric methods as a panacea that solves all ills. On the contrary, non-parametric tests also make assumptions about the data that need to be carefully verified in order that the results are sensible. In the context of non-parametric regression, the following assumptions are usually made, and non-parametric tests may relax some of them:

• Linearity. Parametric regression analysis assumes that the relationship between Y and X is linear. Non-parametric regression analysis makes the same assumption.

• Scale of Y and X. Parametric regression analysis assumes that X is time, so that it has an interval or ratio scale. It is further assumed that Y has an interval or ratio scale as well. Non-parametric regression analysis makes the same assumption, except that some methods allow the Y variable to be ordinal. This allows non-parametric methods to be used when values are above detection limits, as they can still often be ordered sensibly.

• Correct sampling scheme. Parametric regression analysis assumes that Y must be a random sample from the population of Y values at every time point. Non-parametric regression analysis makes the same assumption.
• No outliers or influential points. Parametric regression analysis assumes that all the points must belong to the relationship – there should be no unusual points. Non-parametric regression analysis is more robust to failures of this assumption, as the actual distances between the observed point and the fitted line are not used directly. However, many outliers can mask the true relationship. A very nice feature of non-parametric methods is that they are invariant to transforms that preserve order. For example, you will get the same p-value if you use non-parametric analyses on Y or log(Y). But the estimated slope may be different, as it is measured on a different scale.
• Equal variation along the line on some scale. Parametric regression analysis assumes that the variability about the regression line is similar for all values of X, i.e. the scatter of the points above and below the fitted line should be roughly constant over time. Surprisingly to many people, non-parametric regression analysis assumes that the distribution of Y at each X is the same on some measuring scale, and therefore must also have the same variation. However, because the assumption is about equal variance on some scale, and because non-parametric methods are invariant to simple transformations, this is often satisfied. For example, if a log-transform would stabilize the variance, then it is not necessary to transform before doing the Kendall test. This is one advantage of the non-parametric tests over parametric tests, which require homogeneous variation about the regression line.
• Independence. Parametric regression assumes that each value of Y is independent of any other value of Y. Non-parametric regression analysis also makes this assumption. Consequently, non-parametric regression analysis does not deal with autocorrelation.

• Normality of errors. Parametric regression assumes that the difference between the value of Y and the expected value of Y is normally distributed. Non-parametric regression analysis assumes that the distribution of Y at each value of X is the same, but does not require that it be normally distributed. Consequently, heavy-tailed distributions such as log-normal distributions can be handled with non-parametric regression.

• X measured without error. Parametric regression analysis assumes that the error in measurement of X is small or non-existent relative to the error variation about the regression line. Non-parametric regression makes the same assumption.
As you can see, data to be used in non-parametric analysis cannot be just arbitrarily collected – thought must be given to assessing the appropriateness of the regression model.

Surprising to many, least-squares regression is actually a non-parametric method! The principle of choosing the regression line to minimize the sum of squared deviations from the regression line makes no distributional assumptions about Y at each X. The assumption of normality comes into play when you compute F- or t-tests to test whether the slope is zero, and construct confidence intervals for the slope or prediction intervals for individual means or predictions.
A simple non-parametric test for zero slope is Spearman's ρ, which is simply a correlation coefficient computed on the RANKS of the data. 38 The standard Pearson correlation coefficient (discussed in earlier sections) is then applied to the ranked data, and the p-value is found by referring to tables or from a large-sample formula. Fortunately, most computer packages compute Spearman's ρ and provide p-values.
38 For each variable, find the smallest value and replace it by the value 1. Find the second-smallest value and replace it by the value 2, etc. If there are tied values, replace the tied ranks by the average of the ranks. This is easily done in Excel by repeatedly sorting the (X, Y) pairs, first by X and then by Y.
©2012 Carl James Schwarz 259 November 23, 2012
CHAPTER 2. DETECTING TRENDS OVER TIME
Example: The Grass is Greener (for longer) revisited
For example, the grass-cutting example data is ranked as follows:
Year   Duration (days)   Year Rank   Duration Rank
1984   200                1           2.0
1985   215                2          10.5
1986   195                3           1.0
1987   212                4           9.0
1988   225                5          12.5
1989   240                6          18.0
1990   203                7           4.5
1991   208                8           7.0
1992   203                9           4.5
1993   202               10           3.0
1994   210               11           8.0
1995   225               12          12.5
1996   204               13           6.0
1997   245               14          20.0
1998   238               15          17.0
1999   226               16          14.0
2000   227               17          15.0
2001   236               18          16.0
2002   215               19          10.5
2003   242               20          19.0
The correlation computed on the ranks is found to be .5766.
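The rank-then-correlate recipe can be sketched directly in code. This is an illustrative sketch (not the package computation used in these notes) that reproduces the value .5766 from the tabulated grass-cutting data; the helper names are mine:

```python
from math import sqrt

# Grass-cutting durations (days) for 1984-2003, from the table above.
durations = [200, 215, 195, 212, 225, 240, 203, 208, 203, 202,
             210, 225, 204, 245, 238, 226, 227, 236, 215, 242]
years = list(range(1984, 2004))

def ranks(values):
    """Rank values 1..n, assigning tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Find the run of tied values starting at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1          # average of ranks i+1 .. j+1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def pearson(x, y):
    """Ordinary Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Spearman's rho = Pearson correlation computed on the ranks.
rho = pearson(ranks(years), ranks(durations))
print(round(rho, 4))  # 0.5766
```

Note how the tied durations (203, 215, 225) receive the averaged ranks shown in the table (4.5, 10.5, 12.5).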
JMP analysis Parametric and non-parametric correlations between variables are found using the Analyze->MultiVariateMethods->Multivariate platform:
Specify both the X and Y variables in the dialogue box:
Finally, request non-parametric correlations from the drop-down menu:
which gives the following output:
The Spearman ρ is found to be .5766 with a p-value of .0078. This compares to the p-value from the parametric regression of .012.
Unfortunately, Spearman's ρ does not provide an easy way to estimate the slope or to find confidence intervals for the slope, etc. 39
Because Spearman's ρ does not provide a convenient way to estimate the slope or to find confidence intervals for the slope, variants on Kendall's τ are often used instead. This estimator of the slope has many
39 However, refer to Conover (1995), Section 5.5 for details on using Spearman's ρ to estimate a confidence interval for the slope.
names: Sen's (1968) estimator 40, Theil's (1950) estimator 41, and Kendall's τ 42 estimator are all common names. The idea behind these estimators is to look at concordant and discordant pairs of data points. A pair of data points (X1, Y1) and (X2, Y2) is called concordant if (Y2 − Y1)/(X2 − X1) is greater than zero, discordant if the ratio is less than zero, and both if the ratio is 0. For the grass-cutting duration data, the pair of data points (1985, 215) is concordant with the data point (1988, 225), but discordant with the data point (1986, 195). As you can imagine, it is far easier to let the computer do the computations!
The test for non-zero slope using Kendall's tau can be computed by finding ALL possible pairs of data points (!) and using the rule:
• if (Yj − Yi)/(Xj − Xi) > 0 then add 1 to Nc (concordant);
• if (Yj − Yi)/(Xj − Xi) < 0 then add 1 to Nd (discordant);
• if (Yj − Yi)/(Xj − Xi) = 0 then add 1/2 to both Nc and Nd;
• if Xi = Xj, no comparison is made.
Kendall's τ is found as:
τ = (Nc − Nd) / (Nc + Nd)
The p-value is found from tables or by the computer.
The computation of τ is simplified by sorting the pairs of (X, Y) by the value of X and creating a spreadsheet to help with the computations. Each value of Y needs only to be compared to those "below" it in the sorted list.
Estimation of the slope and confidence intervals for the slope are found by computing all the pairwise slopes:
S_ij = (Yj − Yi) / (Xj − Xi)
The estimate of the slope is simply the median of these values.
A confidence interval for the slope is found by using tables to find the lower and upper quantiles to use as the bounds of the interval. A close approximation to the values to use is found using the following procedure:
• Let n be the number of data points, and N be the number of pairwise slopes from above.
• Compute w = z √( n(n − 1)(2n + 5) / 18 ), where z is the appropriate quantile from a standard normal distribution. For example, for a 95% confidence interval, z = 1.96.
40 Sen, P.K. (1968). Estimates of the regression coefficient based on Kendall's τ. Journal of the American Statistical Association 63, 1379-1389.
41 Theil, H. (1950). A rank-invariant method of linear and polynomial regression analysis, 1, 2, and 3. Nederl. Akad. Wetensch. Proc. 53, 386-392, 521-525, and 1397-1412.
42 Kendall, M.G. (1970). Rank Correlation Methods, Fourth Edition. Charles Griffin and Co., London.
• Compute r = .5(N − w).
• Use the r-th and (N − r)-th values of the sorted pairwise slopes as the bounds of the confidence interval.
For the mowing duration data, n = 20 and there are N = 190 possible slopes! The estimated slope is the median value. The approximate value of w is 60, so the 65th and 125th sorted values of the pairwise slopes are the lower and upper bounds of the 95% confidence interval. This gives an estimated slope of 1.389 with a 95% confidence interval of (0.20 → 2.8). This can be compared to the estimated slope of 1.46 and confidence interval for the slope from the ordinary regression analysis of (.4 → 2.6).
This is rarely found in most computer packages, but the computation of the possible slopes can be programmed (sometimes clumsily) and can actually be done in a spreadsheet.
JMP analysis Kendall's τ is also computed using the Analyze->MultiVariateMethods->Multivariate platform, in the same way as Spearman's ρ was found:
This gives:
The p-value is .0123, very similar to that from the ordinary regression.
It is very clumsy to compute the Sen-Theil-Kendall estimate of the slope in JMP, and it is not done here. Refer to the SAS program for more help.
Final Remarks
Berryman (1988) recommends that Kendall's τ or Spearman's ρ be used for non-parametric testing for trend, as these have the greatest efficiency relative to ordinary parametric regression. They also recommend (their Table 4) that a minimum of 9-11 observations be collected before testing for trend using these methods.
It turns out that the asymptotic relative efficiency of both Kendall's τ and Spearman's ρ is very high (90%+), so the planning tools for ordinary regression can be used to estimate the sample sizes required under various scenarios with a fair amount of confidence.
2.9.3 Dealing with seasonality - Seasonal Kendall's τ
Basic principles
In some cases, series of data have an obvious periodicity or seasonal effects.
Consider, for example, values of total phosphorus taken from the Klamath River near Klamath, California, as analyzed by Hirsch et al. (1982). 43
43 This was monitoring station 11530500 from the NASQAN network in the US. Data are available from http://waterdata.usgs.gov/nwis/qwdata/?site_no=11530500. The data were analyzed by Hirsch, R.M., Slack, J.R., and Smith, R.A. (1982). Techniques of trend analysis for monthly water quality data. Water Resources Research 18, 107-121.
Total phosphorus (mg/L) in Klamath River near Klamath, CA

                         Year
Month   1972  1973  1974  1975  1976  1977  1978  1979
  1     0.07  0.33  0.70  0.08  0.04  0.05  0.14  0.08
  2     0.11  0.24  0.17   .     .     .    0.11  0.04
  3     0.60  0.12  0.16   .    0.14  0.03  0.02  0.02
  4     0.10  0.08  1.20  0.11  0.05  0.04  0.06  0.01
  5     0.04  0.03  0.12  0.09  0.02  0.04  0.03  0.03
  6     0.05  ...
A plot of the data shows an obvious seasonality, with peak levels occurring in the winter months. There are also some missing values, as seen in the raw data table. Finally, notice the presence of several very large values (above 0.20 mg/L) that would normally be classified as outliers.
How can a test for trend be fit in the presence of this seasonality?
Hirsch et al. (1982) modified Kendall's τ to deal with seasonality. The method is very simple to describe, but is difficult to implement.
The basic principle is to divide the series into (in this case) 12 separate series, one for each month. These month-based series range from 8 years of data down to 5 years of data. For each month-based series, compute Kendall's τ. Combine the 12 estimates of τ into a single omnibus test to compute the overall p-value. The estimated slope is found by pooling the pairwise slopes from within each month-based series and then taking the overall median of the pooled set. Unfortunately, there are no simple procedures available to compute confidence intervals for the slope.
Example: Total phosphorus on the Klamath River revisited
JMP Analysis
JMP can be used to compute a test statistic, but it is difficult (!) to estimate the slope.
A JMP dataset with scripts is located in klamath.jmp and klamath2.jmp in the Sample Program Library available at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms.
The JMP dataset has 12 rows and 9 columns corresponding to the various years:
The years must be stacked using the Tables->Stack command to create three columns: Year, Month, Phosphorus. A portion of the stacked data is illustrated below:
Missing values are indicated by a period. The Analyze->Fit Y-by-X platform can be used to create a data plot to illustrate the seasonal nature of the data (not shown).
To compute Kendall's τ for each month, use the Analyze->MultiVariateMethods->Multivariate platform and specify Month in the BY area:
This will give the estimates of correlation for each month. In order to request Kendall's τ for EVERY plot, hold down the Option key before clicking on the red triangle to request the non-parametric Kendall τ statistic. Unfortunately, there is no way to have JMP automatically save all the Kendall τ's to a new data sheet for subsequent processing. You will have to manually (groan) type in each estimate of τ and the p-value to give the following table:
Unfortunately, JMP does not provide the raw value underlying Kendall's τ (what Hirsch et al. call S), so we can't use the direct method outlined in Hirsch et al. of simply adding the values of S. A somewhat indirect method must be used to combine the reported values of τ and their p-values over the 12 months.
This indirect method converts each p-value back to a z-score. As the z-scores are distributed as Normal distributions (with mean 0 and variance 1) and are assumed to be independent across the months, their sum has a normal distribution with mean 0 and variance equal to the sum of the variances (in this case 12). This resulting sum can then be converted to an actual p-value.
To convert a p-value back to a z-score, use the relationship
z = Φ⁻¹(1 − pvalue/2) × sign(τ_b)
where Φ⁻¹ is the inverse normal probability function; the 1 − pvalue/2 converts the two-sided p-value to the upper tail of the normal curve; and the sign function makes sure that the z value also has the correct sign (i.e. positive or negative). This is done by creating a new column in JMP and creating a formula for this column:
The Normal Quantile function is the inverse normal function, and the IF clause serves as the sign function. The column Var is simply the variance of the z-score.
This gives the table:
We add together the z-scores and the variances; this gives an overall z-score. The Tables->Summary command can be used to get this total:
to give:
Finally, we use a final formula to compute the probability of exceeding this total z-score:
and the final overall p-value:
The overall p-value is .0049. This can be compared to the paper by Hirsch et al. (1982), who obtained an overall z-value of -2.69 with a p-value of .0072.
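The same spreadsheet arithmetic can be sketched in code. The monthly (τ, p-value) pairs below are placeholders (the actual Klamath values appear only in the JMP screenshots), but the mechanics follow the formula z = Φ⁻¹(1 − pvalue/2) × sign(τ_b):

```python
from math import copysign, sqrt
from statistics import NormalDist

nd = NormalDist()   # standard normal: mean 0, sd 1

def z_from_p(pvalue, tau):
    """Convert a two-sided p-value back to a z-score with the sign of tau."""
    return copysign(nd.inv_cdf(1 - pvalue / 2), tau)

# Hypothetical monthly (tau, p-value) pairs -- placeholders, NOT the Klamath values.
monthly = [(-0.40, 0.20), (-0.55, 0.06), (0.10, 0.75), (-0.30, 0.35)]

z_scores = [z_from_p(p, tau) for tau, p in monthly]
z_total = sum(z_scores)
var_total = float(len(z_scores))       # each z-score has variance 1
z_overall = z_total / sqrt(var_total)  # standardize the sum
p_overall = 2 * (1 - nd.cdf(abs(z_overall)))
print(round(z_overall, 2), round(p_overall, 4))
```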
Unfortunately, there is no simple way to estimate the slope using JMP.
Final notes
As pointed out earlier, non-parametric analyses are not assumption-free - they merely have different assumptions than parametric analyses. In this method, the key assumption of independence is still important. Because the data are broken into month-based series, this is likely true - it seems reasonable that the value in January 1971 has no influence on the value in January 1972. However, it is likely not true that January 1971 is independent of February 1971, which would likely invalidate a simple use of Kendall's method on the entire series.
As Hirsch et al. (1982) point out, it is possible that some sub-series exhibit strong evidence of an upward trend and some sub-series exhibit strong evidence of a downward trend, while the overall omnibus test fails to detect evidence of a trend. This is not unexpected, and if one is interested in the individual sub-series, then these should be examined individually.
The original paper by Hirsch et al. (1982) did not allow for multiple observations in each time period. This actually poses no problem with computer implementations, which handle ties appropriately.
Lastly, you may have noticed in the original data some values that were marked as below the detection limit. These censored observations pose no problem for most non-parametric tests. Clearly a value that is below a detection limit (e.g. < .01) is also less than .05. The only problem arises in making sure that, if there are multiple, different detection limits, comparisons are handled appropriately. Usually, this implies using the largest detection limit in place of any lower detection limits.
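A minimal sketch of this "largest detection limit" rule: censor every value below that limit to a common floor, so all such values become ties before ranking. The detection limits and data here are hypothetical:

```python
def censor_to_common_limit(values, detection_limits):
    """Replace any value below the largest detection limit by that limit,
    so all censored (and near-censored) values become tied observations."""
    dl_max = max(detection_limits)
    return [dl_max if v < dl_max else v for v in values]

# Suppose two labs used detection limits of 0.01 and 0.05 mg/L (made-up numbers).
vals = [0.005, 0.02, 0.04, 0.08, 0.30]
print(censor_to_common_limit(vals, [0.01, 0.05]))
# [0.05, 0.05, 0.05, 0.08, 0.3]
```

The tied values then receive averaged ranks in Spearman's ρ, or contribute 1/2 to both Nc and Nd in Kendall's τ.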
Hirsch et al. (1982) did several simulation studies of the seasonal Kendall test, and found that it had high power to detect changes.
The Seasonal Kendall estimator has been implemented in many packages specially designed for environmental studies. Unfortunately, there are no packages that I am aware of that report confidence intervals for the slope.
Berryman (1988) recommends that at least 60 observations spanning at least 5 cycles be obtained before using the Seasonal Kendall method.
2.9.4 Seasonality with Autocorrelation
General ideas
As noted earlier, the Seasonal Kendall method still assumes that observations in different series are independent, i.e. that the January 1972 reading is not related to the February 1972 reading. In some cases this is untrue; for example, in a wet year, the stream flow may be higher than average for all months, leading to positive correlation across series.
Hirsch and Slack (1984) 44 considered this problem. As in the Seasonal Kendall test, the data are first divided into sub-series, e.g. monthly series across several years. The Kendall statistic for trend across years is computed for each sub-series, e.g. for each month. These sub-series statistics are added together to give an omnibus test statistic. The Seasonal Kendall method could simply sum the variances of each test statistic to give the omnibus variance from which a z-score could be computed and a p-value obtained. However, because the sub-series are autocorrelated, the new test must also add together estimates of the covariances among the test statistics from the individual sub-series to get the omnibus variance prior to computing a z-score and p-value.
Unfortunately, this procedure is implemented in only a handful of specialized software packages for the analysis of water quality and hydrologic data. These packages can be located with a quick search on the WWW. It is not feasible to do the computations in JMP, nor in SYSTAT; the computations could likely be done in SAS, but are complex and well beyond the scope of these notes.
Consequently, this method will not be discussed further in these notes; interested readers are referred to Hirsch and Slack (1984).
44 Hirsch, R.M. and Slack, J.R. (1984). A non-parametric trend test for seasonal data with serial dependence. Water Resources Research 20, 727-732.
Note that because parametric methods are now readily available (refer to earlier chapters of these notes), there is less need for these non-parametric procedures.
Berryman (1988) and Hirsch and Slack (1984) recommend that at least 120 observations spanning at least 10 cycles be obtained before using the Seasonal Kendall method adjusted for autocorrelation.
2.10 Summary
This chapter is concerned mainly with detecting monotonic trends over time, i.e. a gradual increase or decrease over time. Some methods were introduced to deal with seasonal effects, but these effects are nuisance effects and should be eliminated prior to analysis.
It is possible for these trends over time to be masked by exogenous variables, i.e. variables other than Y and X. For example, many ground-water variables are influenced by flow, over and above seasonal effects. It was beyond the scope of these notes, but the effects of these exogenous variables should first be removed before the trend analysis is done. This can be done using multiple regression or other curve-fitting techniques such as LOWESS.
Measurements taken in close proximity over time are likely to be related to each other. This is known as serial correlation or autocorrelation. It is often induced by some environmental variable that is slowly changing over time and also affects the monitored variable. Again, these exogenous effects should be removed first. Some residual autocorrelation may still be present. The most common test statistic to detect autocorrelation is the Durbin-Watson statistic, where values near 2 indicate a lack of autocorrelation.
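The Durbin-Watson statistic is simple to compute from the regression residuals as DW = Σ(e_t − e_{t−1})² / Σ e_t². A minimal sketch with made-up residuals:

```python
def durbin_watson(residuals):
    """DW statistic: near 2 suggests no autocorrelation; near 0 strong
    positive autocorrelation; near 4 strong negative autocorrelation."""
    num = sum((residuals[t] - residuals[t - 1]) ** 2
              for t in range(1, len(residuals)))
    den = sum(e ** 2 for e in residuals)
    return num / den

# Alternating residuals: successive values move in opposite directions,
# i.e. negative autocorrelation, so DW is near 4.
print(durbin_watson([1, -1, 1, -1, 1, -1]))   # 3.333...
# Runs of same-signed residuals: positive autocorrelation, so DW is near 0.
print(durbin_watson([1, 1, 1, -1, -1, -1]))   # 0.666...
```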
Trend analyses can be done using either parametric or non-parametric methods. BOTH types of analyses make certain assumptions about the data - non-parametric methods are NOT assumption-free! It turns out that modern non-parametric methods are relatively powerful for detecting trends even when all the assumptions of the parametric methods are satisfied. Hence there is little loss in power in using these methods. In addition, because they use the relative ranking of observations, they are relatively insensitive to outliers, moderate levels of non-detected values, and missing values.
If so, why not always use non-parametric methods? The basic impediments to the use of non-parametric methods are a lack of suitable computer software, the difficulty in computing point estimates and confidence intervals for the trend line, and the difficulty in making predictions for future observations. However, non-parametric tests are often ideally suited for mass screening. These procedures can be automated, and it is not necessary to examine the possibly hundreds of individual datasets to see which need to be transformed before parametric procedures can be used.
Finally, what to do about outliers? Blindly including outliers in non-parametric methods without investigating their cause can be very dangerous. Trends may be detected that are not real. An outlier, by definition, is a point that doesn't appear to fit the same pattern as the other data values. An assumption of most non-parametric tests is that the distribution of Y values at each X is the same (it need not be normal) - this would also require you to exclude outliers. Even parametric methods can deal with outliers nicely -
a whole area of statistics deals with robust regression methods, where outliers are iteratively reweighted and given a low weight if they appear to be anomalous. For example, SAS provides Proc RobustReg to do robust regression.
A summary table of the various methods considered in this section of the notes appears below: 45
45 This table is based on Trend Analysis of Food Processor Land Application Sites in the LUBGWMA, available at: http://www.deq.state.or.us/wq/groundwa/LUBGroundwater/LUBGTrendAnalysisApp1.pdf
Summary of trend analysis methods

Simple Linear Regression (parametric; does not account for seasonality)
  Advantages: Most powerful if assumptions hold, especially normality, non-seasonality, and independence. Familiar technique to many scientists. Simple to compute the best-fit line. Available in most computer packages.
  Disadvantages: Environmental data rarely conform to the test assumptions. Sensitive to outliers. Difficult to handle non-detect values. Serial correlation gives unbiased but inefficient estimates; consider methods to account for autocorrelation. Does not account for seasonality.
  Recommended sample size: 10. Good power programs are available.

Kendall's τ (non-parametric; does not account for seasonality)
  Advantages: Non-detects and outliers are easily handled. Same p-value regardless of the transform used on Y.
  Disadvantages: Does not account for seasonality. Not robust against autocorrelation. Difficult to make predictions.
  Recommended sample size: 10.

Seasonal Regression (parametric; accounts for seasonality by subtracting the monthly mean or median over years from the original data, then regressing the residuals over time or using ANCOVA methods)
  Advantages: Accounts for seasonality. Produces a description of the seasonality pattern.
  Disadvantages: Assumes normality of the adjusted values about the regression line. Not robust against serial correlation. Requires near-complete records for each set of monthly data; if the pattern of missing years varies among the months, the monthly mean used to adjust for seasonal effects may be misleading. Reported se are too small because the adjustment for seasonality is not incorporated unless the ANCOVA method is used.
  Recommended sample size: 30, with at least 5 cycles.

Sine/Cosine Regression (parametric; accounts for seasonality - deseasonalized values are obtained by fitting a ...)
  Advantages: Accounts for seasonality.
  Disadvantages: With few exceptions, there is little reason to believe that the form of the seasonality ...
  Recommended sample size: 30, with at least 5 cycles.

Regression adjusted for autocorrelation (parametric; does not account for seasonality)
  Advantages: Accounts for autocorrelation in the data. Can also be adjusted for seasonality.

Seasonal Kendall without correction for serial correlation (non-parametric; accounts for seasonality, but only by comparing data from the same season, e.g. months)
  Advantages: Accounts for seasonality. Robust against non-detects and outliers.

Seasonal Kendall adjusted for autocorrelation (non-parametric; accounts for seasonality, as above)
  Advantages: Accounts for seasonality.
· Robust against n<strong>on</strong>-detects<br />
and outliers.<br />
· Robust against serial correlati<strong>on</strong>.<br />
· Requires sophisticated software.<br />
· Extremely high autocorrelati<strong>on</strong><br />
may be invisible.<br />
· When applied to data that is<br />
not seas<strong>on</strong>al, has a slight loss<br />
of power.<br />
· Not robust against serial correlati<strong>on</strong>.<br />
· Difficult to estimate c<strong>on</strong>fidence<br />
intervals.<br />
· Not all computer packages<br />
have this method. May require<br />
further programming.<br />
· Significant loss of power when<br />
applied to data that is not<br />
seas<strong>on</strong>al or lacks autocorrelati<strong>on</strong>.<br />
· Specialized software required.<br />
20<br />
60 with<br />
at least 5<br />
cycles<br />
120<br />
with at<br />
least 10<br />
cycles<br />
CHAPTER 2. DETECTING TRENDS OVER TIME
Chapter 3

Estimating power/sample size using Program Monitor

J. Gibbs has written a Windows program to estimate the power and sample size requirements for many common monitoring programs.

Gibbs, J. P., and Eduard Ene. 2010. Program MONITOR: Estimating the statistical power of ecological monitoring programs. Version 11.0.0. http://www.esf.edu/efb/gibbs/monitor/

CAUTION: Version 11.0 of MONITOR appears to have some "features" that result in incorrect power computations in certain cases. Please contact me in advance of using the results from MONITOR in a critical planning situation to ensure that you have not stumbled on some of the "features".
Program MONITOR uses simulation procedures to evaluate how each component of a monitoring program influences its power to detect a linear (regression) change. The program has been cited in numerous peer-reviewed publications since it first became available in 1995.

Before using Program MONITOR, you will need to gather some basic information about the proposed study:

• What is the initial value of your population? This could be the initial population size, the initial density, etc.

• How precisely can you measure the population at a given sampling occasion? This can be given as the standard error you expect to see at any occasion, the relative standard error (standard error/estimate), etc.

• What is the process variation? Do you really expect that the measurements would fall precisely on the trend line in the absence of measurement error?

• What are the significance level and target power? Traditional values are α = 0.05 with a power of 80%, or α = 0.10 with a target power of 90%.
3.1 Mechanics of MONITOR

Let us first demonstrate the mechanics of MONITOR before looking at some real examples of how to use it for monitoring designs.

Suppose we wish to investigate the power of a monitoring design that will run for 5 years. At each survey occasion (i.e. every year), we have 1 monitoring station, and we make 2 estimates of the population size at the monitoring station in each year. The population is expected to start with 1000 animals, and we expect that the measurement error (standard error) in each estimate is about 200, i.e. the coefficient of variation of each measurement is about 20% and is constant over time. We are interested in detecting increasing or decreasing trends; to start, a 5% decline per year will be of interest. We will assume an UNREALISTIC process error of zero, so that the sampling error is equal to the total variation in measurements over time.

Launch Program MONITOR:
The screen starts with default values. We make some changes:

• Change the sampling occasions to the values 0, 1, 2, 3, 4.

• Change the number of survey plots/year to 2.

• Check that the significance level is set to 0.05.

• Check that the desired power is set to 0.80.

• Check that the range of desired trends encompasses −5%. You might want to increase the number of trend powers computed to 21 to get power computations for every value rather than every second value.

• Check that the two-sided test is selected.
Then click on the Plots tab and enter the initial population size (1000) and a variation (the STANDARD DEVIATION) in measurements of 200 under Total Variation.
Press the Run icon and the following results are shown. [Because the power computations are based on a simulation, your results may vary slightly.]
Notice that the net change over the five-year period with a 5% decline/year is only an 18.5% total decline over the five-year period. This is obtained as:

Year   Mean Abundance          % Total Decline
0      1000                      0.0%
1      950.0 = 1000(.95)        −5.0%
2      902.5 = 1000(.95)^2      −9.7%
3      857.4 = 1000(.95)^3     −14.3%
4      814.5 = 1000(.95)^4     −18.5%
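The multiplicative arithmetic in this table is easy to verify by hand. A minimal sketch in plain Python (my own check, not part of MONITOR):

```python
# Verify the cumulative effect of a 5% decline per year on an
# initial abundance of 1000 animals (multiplicative, not additive).
initial = 1000.0
rate = -0.05  # 5% decline per year

for year in range(5):
    abundance = initial * (1 + rate) ** year
    total_change = (abundance / initial - 1) * 100  # percent change from year 0
    # year 4 ends at about 814.5 animals, an 18.5% total decline
    print(f"Year {year}: abundance = {abundance:7.1f}, total change = {total_change:6.1f}%")
```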
By clicking on the Trend vs. Power Chart tab, you see a graph of the power by the size of the trend:
This design has a power of around 15% for detecting this trend – hardly worthwhile doing the study!

How many years would be needed to detect this trend with an 80% power? Try modifying the number of sampling years until you get the approximate power needed:
So about 10 years of monitoring will be needed to detect a 5% decline PER YEAR with about an 80% power.

The differences in reported powers between the MONITOR and TRENDS programs are artifacts of the different ways the two programs compute power (and potentially because of some 'features' of the MONITOR program). TRENDS uses analytical formulae based on normal approximations, while MONITOR conducts a simulation study and reports the number of trials (in this case out of 500) that detected the trend. In any event, don't get hung up over these differences – the key point is that this proposed study has virtually no power to detect a 5% decline/year.

Program MONITOR also has a handy calculator to convert between the trend per year and the total trend over the course of the experiment.
For example, a 5% decline per year for 5 ADDITIONAL years translates into an overall decline of 22.6% over the six years of the study (the one initial year + 5 ADDITIONAL years). It is not a straight arithmetic conversion because the changes are actually multiplicative rather than additive, as shown earlier.
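The two-way conversion is simple compounding. A plain-Python equivalent of MONITOR's calculator (a sketch under that assumption, not MONITOR's own code):

```python
# Convert between a constant per-year (multiplicative) trend and the
# total trend over a study of n_years.

def total_trend(per_year: float, n_years: int) -> float:
    """Total proportional change after n_years of a constant per-year trend."""
    return (1 + per_year) ** n_years - 1

def per_year_trend(total: float, n_years: int) -> float:
    """Constant per-year trend producing 'total' proportional change over n_years."""
    return (1 + total) ** (1 / n_years) - 1

# A 5% decline/year sustained over 5 additional years:
print(f"{total_trend(-0.05, 5):.1%}")                     # about -22.6%
print(f"{per_year_trend(total_trend(-0.05, 5), 5):.1%}")  # back to -5.0% per year
```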
3.2 How does MONITOR work?

Program MONITOR estimates power using a simulation-based approach, as outlined in the help file. For example, consider the situation outlined in the previous section. Again set up the control parameters in the same way, except change the trend lines to look only at a single value for the decline (−5% per year).
Then press the Step icon. The following display is obtained:
First the underlying deterministic trend is generated (the black line in the middle of the plot). Then, based on the variation expected in the measurements, actual "data" are generated (shown by the circles; note that at time 1, the values are "off the plot") and presented in the Survey count details tab:
Then it gets a bit odd, and the output is potentially misleading. A regression line is fit through the points (the red line in the first graph; estimates at the bottom of the data window). But this curve is not the one used to estimate the power. Rather, a regression line is fit through the log(data), and the results from the regression on the log(data) are used to determine if the trend was detected. The analysis is done on the log-scale because of the multiplicative way in which the deterministic trend is generated. Refer to the analyses from JMP below to see which statistics are used:
In this case, the estimated trend line (on the log-scale) was not statistically different from zero, and the trend was NOT detected.

The simulation is repeated many hundreds of times; the proportion of trials in which a statistically significant trend was detected is then the estimated power for this design.
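The simulation loop just described is easy to sketch outside MONITOR. The following is my own minimal re-implementation in plain Python, not MONITOR's code: it generates data for the 5-year, 2-surveys/year design above, fits a regression of log(count) on year, and counts how often the slope is significantly different from zero.

```python
import math
import random

def simulate_power(n_reps=2000, years=range(5), surveys_per_year=2,
                   n0=1000.0, trend=-0.05, sd=200.0, seed=42):
    """Estimated power of a two-sided 5% test of the log-scale slope."""
    rng = random.Random(seed)
    t_crit = 2.306  # two-sided 5% t critical value, df = 10 - 2
    detected = 0
    for _ in range(n_reps):
        xs, ys = [], []
        for t in years:
            mean = n0 * (1 + trend) ** t                 # deterministic trend
            for _ in range(surveys_per_year):
                count = max(rng.gauss(mean, sd), 1e-6)   # guard against log(<=0)
                xs.append(t)
                ys.append(math.log(count))               # analysis on the log scale
        # ordinary least squares: slope and its standard error
        n = len(xs)
        xbar, ybar = sum(xs) / n, sum(ys) / n
        sxx = sum((x - xbar) ** 2 for x in xs)
        sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
        slope = sxy / sxx
        resid_ss = sum((y - ybar - slope * (x - xbar)) ** 2
                       for x, y in zip(xs, ys))
        se_slope = math.sqrt(resid_ss / (n - 2) / sxx)
        if abs(slope / se_slope) > t_crit:
            detected += 1
    return detected / n_reps

print(f"Estimated power: {simulate_power():.2f}")
```

With these settings the estimated power comes out low, on the order of the roughly 15% MONITOR reports for this design (exact values vary by simulation).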
3.3 Incorporating process and sampling error

As noted in the chapter on Trend Analysis, there are often two sources of variation in any monitoring study.

First is sampling variation. This occurs because it is impossible to measure the population parameter exactly in any one year. For example, if we are measuring the mean DDT level in birds, we must take a sample (say of 10 birds), sacrifice them, and find the mean DDT in those 10 birds. If a different sample of 10 birds were selected, then the sample mean DDT would vary in the second sample. This is called sampling error (or the standard error) and can be estimated from the data taken in a single year. Or, the parameter of interest may be the number of smolts leaving a stream, estimated using capture-recapture methods. Again we would have a measure of uncertainty (the standard error) for each measurement in each year. Sampling error (the standard error) can be reduced by increasing the effort in each year.
However, consider what happens when measurements are taken in different years. It is unlikely that the population values would fall exactly on the trend line even if the sampling error were zero. This is known as process error and is caused by random "year" effects (e.g. an El Niño). Process error CANNOT be reduced by increasing the sampling effort in a year.

The two sources of variation are diagrammed below:

Unfortunately, process error is often the limiting factor in a monitoring study!

In order to estimate the process and sampling variation, you will need at least two years of data or some educated guesses from previous years. The Program MONITOR website has a spreadsheet tool to help you in the decomposition of process and sampling error.
For example, consider a study to monitor the density of white-tailed deer obtained by distance sampling on Fire Island National Seashore (Underwood et al., 1998), presented as the example on the spreadsheet to separate process and sampling variation.

The estimated density (and se) are:

Year   Density    SE
1995    79.6     23.47
1996    90.1     11.67
1997   107.1     12.09
1998    74.1     10.45
1999    64.2     13.90
2000    40.8     12.38
2001    41.2      7.40

Consider the plot of density over time (with approximate 95% confidence intervals):
Assuming that the deer density is in steady state over the seven years of the study, you can see that there is considerable process error, as many of the 95% confidence intervals for the deer density do not cover the mean density over the seven years. So even if the sampling error (the se) were driven to zero by adding more effort, the data points would not all lie exactly on the mean line over time.

There are many ways to separate process and sampling variation – the chapter on the analysis of BACI designs presents some additional ways. The following is an approximate analysis that should be sufficient for most planning purposes.
First, examine a plot of the estimated se versus the density estimates:

In many cases, there is a relationship between the se and the estimate, with larger estimates tending to have a higher se than smaller estimates. The previous plot shows that, except for one year, the se is relatively constant. If the se had a positive relationship to the estimate, a weighted procedure could be used (this is the procedure used in Underwood's spreadsheet).

We begin by finding the mean density and the total variation from the mean. [If the preliminary study had an obvious trend, you could fit the trend line and then find the total variation from the trend line in a similar fashion.]

We start by finding the total variation in the density estimates over time:

    VarTotal = var(79.6, 90.1, . . . , 41.2) = 599.6

The total variation is equal to the process + sampling variation. An estimate of the average sampling variation is found by averaging the se^2:

    VarSampling = (23.47^2 + 11.67^2 + · · · + 7.40^2) / 7 = 191.9

Finally, the process variance is found by subtraction:

    VarProcess = VarTotal − VarSampling = 599.6 − 191.9 = 407.7
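The decomposition is easy to reproduce with the deer densities above. A plain-Python sketch (the guard against a negative subtraction is my addition; in small samples the estimated process variance can come out negative):

```python
# Decompose total variation in yearly density estimates into
# process variation and (average) sampling variation.
density = [79.6, 90.1, 107.1, 74.1, 64.2, 40.8, 41.2]
se      = [23.47, 11.67, 12.09, 10.45, 13.90, 12.38, 7.40]

n = len(density)
mean = sum(density) / n
var_total = sum((d - mean) ** 2 for d in density) / (n - 1)  # sample variance
var_sampling = sum(s ** 2 for s in se) / n                   # average se^2
var_process = max(var_total - var_sampling, 0.0)             # cannot be negative

print(f"VarTotal    = {var_total:6.1f}")    # 599.7 (the notes round to 599.6)
print(f"VarSampling = {var_sampling:6.1f}")  # 191.9
print(f"VarProcess  = {var_process:6.1f}")   # 407.7
```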
We now launch Program MONITOR; we are interested in a 10-year study to look at changes in the population density following some management action. Notice that we now specify a partitioning of the variation into process and sampling error:
We use the sqrt() of the two variances estimated above when specifying the two sources of variation:
and then press the Run button as before to get:
The power to detect a 5% decline PER YEAR is not very good.

It is instructive to see what would happen if you believed that there was NO process variation and simply used the average sampling variation as the sole source of variation:
Now the (incorrect) estimated power is much higher.
3.4 Presence/Absence Data

Sometimes, only presence/absence data can be collected on each plot, rather than a measure of density. In cases like this, you may wish to consider occupancy modelling, but that is a topic for another course.

Despite not having an absolute measure of abundance, presence/absence data can be used to monitor the density of species with relatively low abundances. This makes use of the Poisson distribution to predict presence/absence as a function of density.

For example, according to the Poisson distribution, if the average density per plot is µ, then the probability that a sampled plot will be labelled as a presence is 1 − exp(−µ), and the probability that a sampled plot will be labelled as an absence is exp(−µ). So a change in the overall proportion of sites that are occupied corresponds to a change in the overall average density.

Note that we are implicitly assuming that all absences are true absences, i.e. not false negatives. If false negatives are possible, you really should be using an occupancy design rather than a simple presence/absence design.
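The density-to-occupancy conversion, and its inverse, can be sketched in a few lines (a plain-Python illustration of the Poisson relationship above, using the 0.20/visit rate from the least bittern example that follows):

```python
import math

# Poisson link between average density per plot (mu) and the probability
# a plot is recorded as a "presence", assuming perfect detection
# (i.e. no false negatives).

def presence_prob(mu: float) -> float:
    """P(at least one individual on the plot) = 1 - exp(-mu)."""
    return 1.0 - math.exp(-mu)

def density_from_presence(p: float) -> float:
    """Invert the relationship: mu = -log(1 - p)."""
    return -math.log(1.0 - p)

p = presence_prob(0.20)
print(f"P(presence) = {p:.3f}")                         # 0.181, roughly 1 in 5 visits
print(f"density back out = {density_from_presence(p):.2f}")  # 0.20
```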
We will use the example that ships with Program MONITOR. This example focuses on the least bittern (Ixobrychus exilis), a secretive marsh bird. Least bittern populations are hard to monitor given their quirky habits, that is, their unpredictable calling behavior. Calls are the only way to detect the species' presence within the dense vegetation of the marshes where it lives. Consider that baseline surveys of least bitterns between May 15 and June 15 indicate that an average of about 0.20 calling least bitterns were heard on any given visit. A water control structure on the marsh is being altered to generate a more stable water level that should improve the situation for bitterns at the site. How much of a trend can be detected with 10 years of monitoring and 10 visits to the marsh each year?

Here the average of 0.20 calls/visit implies that a "presence" was detected in about 1 in 5 visits to the marsh.

Start by entering the data on the main page and then on the plots page.
With presence/absence data, the plot "mean" should have the approximate base rate of presences, and there is no need for a standard deviation estimator. On the main page, tests for trend in presence/absence data are equivalent to "chi-square tests" (covered in another section of the notes). The Custom/ANOVA area indicates a doubling of the presence frequency in the second through tenth year of monitoring.

Before computing the power, press the Step button to get a feel for the data that are generated (not shown). I think this is where Program MONITOR has a "feature", as the data in the 3rd and subsequent visits never have any non-detects.

Consequently, I won't continue with this example until I understand what MONITOR is doing! I have SAS programs that can help in the planning of presence/absence studies – please contact me for assistance.
3.5 WARNING about testing for temporal trends

The Patuxent Wildlife Research Center has some sage advice about power analysis for temporal trends:

Users should be aware (and wary) of the complexity of power analysis in general, and also acknowledge some specific limitations of MONITOR for many real-world applications. Our chief, immediate concern is that many users of MONITOR may be unaware of these limitations and may be using the program inappropriately. Below are comments from one of our statisticians on some of the aspects of MONITOR that users should be cognizant of: "There are numerous issues with how Program Monitor calculates statistical power and sample size. One issue concerns the default option whereby the user assumes independence of plots or sites from one time period to the next. If you are randomly sampling new sites or plots each time period, then it is correct to assume independence (assuming that the finite population correction factor is not an issue, which depends on how many plots or sites you are sampling relative to the total population size of potential plots or sites). If you are sampling the same plots or sites repeatedly over time, however, then the default option in Program Monitor is unlikely to give a correct calculation of statistical power or sample size. If plots or sites are positively autocorrelated over time, as is usually the case in biological surveys, then Program Monitor will underestimate the sample size or, conversely, overestimate the statistical power. The correct sample size estimate is likely to be greater, and depending upon the amount of autocorrelation, the correct sample size could be vastly greater to achieve a stated power objective."

We deal with some of these issues when we discuss the design and analysis of BACI surveys later in this course.
Chapter 4

Regression - hockey sticks, broken sticks, piecewise, change points

A simple regression analysis assumes that the change in response is the same across the range of X values. In some cases, a model where the slope changes in different parts of the X space may be biologically more realistic.

This chapter examines two cases of fitting regression lines with breaks in the slope. In the first case, the location of the change in slope is known in advance; the second case also estimates the location of the change, also known as the change-point problem.

The examples in this chapter look at cases with a single change point – the extension to multiple change points (both known and unknown) is straightforward. Similarly, the change from linear to quadratic lines is also straightforward.

A related method, a spline fit, where a flexible curve is fit between (evenly) spaced knot points, giving something like a non-parametric curve fit, is explored in a different chapter.
4.1 Hockey-stick, piecewise, or broken-stick regression

In this section, the location of the change point is known. The statistical model is:

    Y = β0 + β1 X + β2 (X − C)+ + ε

where β0 is the intercept, β1 is the slope before the change point C, and β2 is the DIFFERENCE in slope after the change point. The slope after the change point is β1 + β2. The variable (X − C)+ is a derived variable which takes the value 0 for values of X less than C and the value X − C for values of X greater than C. This is usually created using a Formula Editor based on the actual data.

The hypothesis of interest is H: β2 = 0, which indicates no change in slope between X < C and X > C.

Because the value of C is specified in advance, ordinary least-squares can be used to fit the model. Most computer packages can easily fit this model.
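The notes fit this model in JMP via the Formula Editor; as a language-agnostic sketch, the same fit on made-up illustrative data (not from the notes) looks like this:

```python
import numpy as np

# Hockey-stick regression with a KNOWN change point C, fit by ordinary
# least squares. The derived variable (X - C)+ is zero below C and
# (X - C) above it, so beta2 estimates the CHANGE in slope at C.
# Simulated data: intercept 2, slope 1.0 below C = 5, slope 1.0 + 2.0 above.

rng = np.random.default_rng(1)
C = 5.0
x = np.linspace(0, 10, 50)
xplus = np.clip(x - C, 0.0, None)            # the (X - C)+ derived variable
y = 2.0 + 1.0 * x + 2.0 * xplus + rng.normal(0, 0.3, x.size)

# Design matrix [1, X, (X - C)+] and the OLS fit
X = np.column_stack([np.ones_like(x), x, xplus])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
b0, b1, b2 = beta
print(f"intercept = {b0:.2f}, slope before C = {b1:.2f}, "
      f"change in slope = {b2:.2f}, slope after C = {b1 + b2:.2f}")
```

The estimates should land close to the generating values (2, 1, 2); a t-test of β2 = 0 would then test for a change in slope, exactly as described above.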
4.1.1 Example: Nenana River Ice Breakup Dates

The Nenana River in the Interior of Alaska usually freezes over during October and November. The ice continues to grow throughout the winter, accumulating an average maximum thickness of about 110 cm, depending upon winter weather conditions. The Nenana River Ice Classic competition began in 1917 when railroad engineers bet a total of 800 dollars, winner takes all, guessing the exact time (month, day, hour, minute) ice on the Nenana River would break up. Each year since then, Alaska residents have guessed at the timing of the river breakup. A tripod, connected to an on-shore clock with a string, is planted in two feet of river ice during river freeze-up in October or November. The following spring, the clock automatically stops when the tripod moves as the ice breaks up. The time on the clock is used as the river ice breakup time. Many factors influence the river ice breakup, such as air temperature, ice thickness, snow cover, wind, water temperature, and depth of water below the ice. Generally, the Nenana River ice breaks up in late April or early May (historically, April 20 to May 20). The time series of the Nenana River ice breakup dates can be used to investigate the effects of climate change in the region.
In 2010, the jackpot was almost $300,000 and the ice went out at 9:06 on 2010-04-29. In 2012, the jackpot was over $350,000 and the ice went out at 19:39 on 2012-04-23, as reported at http://www.cbc.ca/news/offbeat/story/2012/05/02/alaska-ice-contest.html. The latest winner, Tommy Lee Waters, has also won twice before, but has never been a solo winner. Waters spent time drilling holes in the area to measure the thickness of the ice. Altogether he spent $5,000 on tickets for submitting guesses (he purchased every minute of the afternoon of 23 April) and spent an estimated 1,200 hours working out the math by hand. And it was also his birthday! (What are the odds?) You too can use statistical methods to gain fame and fortune!

More details about the Ice Classic are available at http://www.nenanaakiceclassic.com. The data is available in the nenana.jmp data file in the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms.
A simple regression line fit to the time of breakup with year as the predictor shows evidence of a decline over time (i.e., the time of breakup is tending to occur earlier), and there is no evidence of auto-correlation.
©2012 Carl James Schwarz, November 23, 2012
A closer inspection of the top graph gives the impression that until about 1970, the regression line was “flat” and only after 1970 did the time of breakup seem to decrease.
A broken-stick model (separate slopes in the pre-1970 and the post-1970 eras) can be easily fit. We need to create a new variable that is zero for the pre-1970 period and equal to (year − 1970) in the post-1970 period. This is easily created in JMP using the Formula Editor.

This is then fit using the Analyze->Fit Model platform:
which gives the estimates:

The formal statistical model is:

Date = β0 + β1·(year) + β2·(year − 1970)+ + ε

In years prior to 1970, the slope is β1. In years after 1970, the slope is β1 + β2. A test for differential slopes in the two eras is then equivalent to a test of whether β2 = 0.
In this case the p-value for the β2 coefficient (associated with the (year − 1970)+ variable) is just under 0.05, providing some evidence of a different slope in the two eras.

A plot of the fitted line is obtained by saving the predicted values to the data table and then plotting the actual data and the fitted points on the same graph using the Graph->Overlay platform.

Confidence intervals for the MEAN response in a particular year (not likely of interest in this example)
and for the individual responses in a particular year are generated in the usual way.

Note that the estimated slope for the pre-1970 era is not statistically different from 0. If you wanted to fit a model where the line was flat (i.e., the slope was 0) in the pre-1970 era, this is done by using only the (year − 1970)+ variable. Many of the automatically generated plots look odd (e.g., all of the points appear to be replotted at 1970), and the intercept has a different interpretation in the two models because year = 0 has a different definition in the two models, but if the fitted model is plotted against the original year variable everything works out properly. In this particular case, the two latter models give predicted lines that are almost identical. In practice, it is quite RARE that you would fit a line whose slope is known to be zero.
4.2 Searching for the change point

In the previous section on segmented regression (also known as hockey-stick regression or broken-stick regression), the locations of the break are assumed to be known. In many cases, the location of the break is not known, and it is of interest to estimate the break point as well.

The problem of identifying changes at unknown times and of estimating the location of changes is known as “the change-point problem”. Numerous methodological approaches have been implemented for change-point models. Maximum-likelihood estimation, Bayesian estimation, isotonic regression, piecewise regression, quasi-likelihood, and non-parametric regression are among the methods which have been applied to change-point problems. Grid-searching approaches have also been used. A review of the literature, especially as it applies to regression problems (as of 2008), is available at: http://biostats.bepress.com/cgi/viewcontent.cgi?article=1075&context=cobra.
The standard change-point problem in regression models consists of

• testing the null hypothesis that no change in regimes has taken place against the alternative that observations were generated by two (or possibly more) distinct regression equations, and

• estimating the two regimes that gave rise to the data.

There are two common models: models where the regression line is continuous at the break point, and models where the regression line can be discontinuous. In these notes, we only consider the continuous case.
This problem has a long history. A nice summary and treatment of the problem is available in

Toms, J. D. and Lesperance, M. L. (2003). Piecewise regression: A tool for identifying ecological thresholds. Ecology, 84, 2034-2041. http://dx.doi.org/10.1890/02-0472
The change-point model starts with the broken-stick model seen earlier, i.e.

Y = β0 + β1·X + β2·(X − C)+ + ε

where Y is the response variable, X is the covariate, and C is the change point, i.e. where the break occurs. This model is appropriate where there is an abrupt transition at the break point, but a smooth transition may be more realistic for some data. One drawback of this model is that convergence problems can occur in locating C when the data are sparse in the neighborhood of C.
Toms and Lesperance (2003) review the use of models with gentler transitions, e.g. the hyperbolic tangent model or the bent-cable model. The bent-cable regression model was developed by Chiu, Lockhart and Routledge (2006, Bent-cable regression theory and application, Journal of the American Statistical Association, 101, 542-553). The bent-cable regression model fits a smooth transition between the two linear parts of the model. The latter is also applicable to regression models where the X variable is time and auto-correlation may be present.¹

The simple piece-wise linear model can be fit using the Analyze->Modelling->NonLinear platform of JMP.
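With C unknown, the problem becomes non-linear. One simple approach, consistent with the grid-searching methods mentioned above, is to profile out the linear parameters by ordinary least squares at each candidate break point and keep the candidate with the smallest residual sum of squares. A sketch in Python follows; the data are simulated with a true break at 1967, and all names are illustrative:

```python
import numpy as np

def rss_at_break(x, y, C):
    """Residual sum of squares of the broken-stick OLS fit with break at C."""
    X = np.column_stack([np.ones_like(x), x, np.clip(x - C, 0, None)])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return np.sum((y - X @ beta) ** 2)

def grid_search_change_point(x, y, grid):
    """Profile C over a grid of candidates: at each candidate the linear
    parameters are fit by OLS, and the C with the smallest RSS is the
    (non-linear) least-squares estimate of the change point."""
    rss = np.array([rss_at_break(x, y, C) for C in grid])
    return grid[np.argmin(rss)]

# Simulated breakup dates: flat until 1967, then 0.5 days/year earlier.
rng = np.random.default_rng(2012)
year = np.arange(1917, 2013).astype(float)
julian = 125 - 0.5 * np.clip(year - 1967, 0, None) + rng.normal(0, 1, year.size)

grid = np.arange(1930.0, 2000.0, 1.0)
C_hat = grid_search_change_point(year, julian, grid)
```

A grid search like this avoids the convergence problems that derivative-based non-linear least squares can have when data are sparse near C, at the cost of restricting the estimate to the grid.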
4.2.1 Change point model for the Nenana River Ice Breakup

Refer to the previous section for details on the Nenana River Ice Breakup contest. Rather than specifying a break point at 1970, we will fit the change point model to estimate the change point.

The data are available in the Nenana.jmp data table in the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms.
The statistical model is:

JulianDate = β0 + β1·(Year) + β2·(Year − C)+ + ε

where JulianDate is the date of breakup and Year is the calendar year. The parameters to be estimated are β0 the intercept, β1 the slope prior to the change point, β2 the change in slope after the change point, and C the change point.

We first need to define the parameters of the model (β0, β1, β2, C) and the predicted value in terms of the parameters of the model. We start by creating a new column in the data table, ChangePointPredictor, and start the Formula Editor.
¹ Chiu, G. S. and Lockhart, R. L. (2010). Bent-cable regression with auto-regressive noise. Canadian Journal of Statistics, 38, 386-407. http://dx.doi.org/10.1002/cjs.10070
New parameters are defined (along with initial starting guesses) by using the drop-down menu in the top left of the Formula Editor:

Click on the New Parameters item and create the four parameters and their initial values (based on the results from the previous example). The choice of initial values is not that crucial. Then create the predicted value in terms of the parameters and the columns in the data table:
Notice the use of the If function to adjust for the break point. You can switch back and forth between the parameters, data table columns, etc. using the drop-down menu in the top right of the Formula Editor. When you are finished, close the Formula Editor, and the data table will be updated with initial predictions based on the initial values specified.

Select the Analyze->Modelling->NonLinear platform:
Specify the predicted value and Y variables appropriately:
Notice that the formula for the predictions is displayed.

This brings up the Analyze->Modelling->NonLinear platform control panel. The initial fit is displayed. Press the Go button to find the non-linear least-squares fit.
The non-linear least-squares algorithm appears to have converged at the estimates listed in the table. The estimated change point of 1967 is close to the value of 1970 “guess-timated” earlier. Approximate standard errors are also presented at the bottom of the output.

These standard errors are based on large-sample theory. In order to compute a 95% confidence interval for the break point, you could use the standard estimate ± 2(se), but in small samples the resulting confidence intervals may not perform well. Toms and Lesperance (2003) recommend that a likelihood-ratio confidence interval be computed. JMP attempts to compute profile-likelihood confidence intervals when you press the
Confidence Interval button, which gives:

In this case, the profile intervals fail to give upper and lower bounds because the slope after the change point is just on the boundary of statistical significance at α = 0.05. If you change the confidence coefficient from 95% to 90%, the procedure is able to find confidence bounds on the C parameter. Consequently, there may or may not be a change point. Notice that the lower boundary of the confidence interval for C is quite far below the point estimate!
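The profile-likelihood interval recommended by Toms and Lesperance (2003) can also be sketched directly. Under normal errors, minus twice the profile log-likelihood equals n·log RSS(C) up to a constant, so an approximate 95% interval for C collects every grid candidate whose profiled deviance lies within 3.84 (the chi-square critical value on 1 df) of the minimum. The simulated data and all names below are illustrative:

```python
import numpy as np

def profile_deviance(x, y, grid):
    """n * log(RSS(C)) for each candidate C, with the linear parameters
    (intercept and two slopes) profiled out by ordinary least squares."""
    n = len(y)
    dev = []
    for C in grid:
        X = np.column_stack([np.ones_like(x), x, np.clip(x - C, 0, None)])
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        dev.append(n * np.log(np.sum((y - X @ beta) ** 2)))
    return np.array(dev)

def profile_ci(x, y, grid, crit=3.84):
    """Approximate profile-likelihood interval for C: all candidates whose
    profiled deviance is within `crit` of the minimum (3.84 = 95%, 1 df)."""
    dev = profile_deviance(x, y, grid)
    keep = grid[dev - dev.min() <= crit]
    return keep.min(), keep.max()

# Simulated series with a true break at 1967.
rng = np.random.default_rng(7)
year = np.arange(1917, 2013).astype(float)
julian = 125 - 0.5 * np.clip(year - 1967, 0, None) + rng.normal(0, 1, year.size)

grid = np.arange(1930.0, 2000.0, 0.5)
lo, hi = profile_ci(year, julian, grid)
```

When the slope change is weak, the set of candidates within the threshold can run into the edge of the grid, which is the grid-search analogue of JMP failing to find a bound.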
Confidence intervals for the mean response and prediction intervals for a future response are obtained by clicking on the red triangle:

These are interpreted in the same way as in ordinary regression.
The Analyze->Modelling->NonLinear platform also allows you to “play” with the estimates to investigate the sensitivity of the fit to the parameters. The Profiler option under the red triangle is also useful in these cases.
4.3 How NOT to search for a change point!

A fairly common “request” in our Statistical Consulting Service is for help in finding the time at which some treatment gives a difference in response from a control. For example, a group of animals may be fed a control diet and measured over time, while another group of animals is fed an experimental diet and measured over time. At which point do the responses between the two groups start to differ?

Let us assume, for simplicity, that separate animals are measured at each time point so that the problems of longitudinal data can be ignored. For example, suppose that animals must be sacrificed at each time point to measure the response. A naive analysis starts by plotting the means of the two groups over time and searching for the first time point at which the two means are statistically different:
This is NOT A VALID ANALYSIS! The problem is that the estimate of the change point from this analysis will depend on the sample size. If the sample size in each group is small, then the standard error bars are larger, and the estimated change point tends to be larger than if the sample size is large and the standard errors are smaller. The actual change point does NOT depend on sample size! All that should happen is that the estimated precision of the change point should be worse for smaller sample sizes than for larger sample sizes.
The proper way to search for a change point is to compute the DIFFERENCE in means at each time point and then apply the analysis of the previous sections to the differences. A model where the difference in means is forced to be zero prior to the unknown change point may be a suitable alternate model.
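A sketch of the valid approach in Python (all numbers and names are invented): take the difference in group means at each time point, then fit a flat-then-sloped broken-stick model to the differences, estimating the change point by the same kind of grid search used earlier.

```python
import numpy as np

def rss_flat_then_slope(t, d, C):
    """RSS of the model d = b0 + b1*(t - C)+ + error: the difference is
    flat (at b0, ideally near zero) before C and changes linearly after."""
    X = np.column_stack([np.ones_like(t), np.clip(t - C, 0, None)])
    beta, *_ = np.linalg.lstsq(X, d, rcond=None)
    return np.sum((d - X @ beta) ** 2)

# Group means at each time (separate animals, so observations independent);
# the treated group departs from the control after time 12.
rng = np.random.default_rng(11)
t = np.arange(1.0, 25.0)
control = 5.0 + rng.normal(0, 0.3, t.size)
treated = 5.0 + 0.8 * np.clip(t - 12, 0, None) + rng.normal(0, 0.3, t.size)
diff = treated - control            # analyse the DIFFERENCE, not the first
                                    # time point that happens to be "significant"

grid = np.arange(3.0, 22.0, 0.25)
C_hat = grid[np.argmin([rss_flat_then_slope(t, diff, C) for C in grid])]
```

Unlike the naive first-significant-time approach, the estimate here targets the actual change point; larger samples only shrink its standard error rather than shifting the estimate itself.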
Chapter 5

Analysis of Covariance - ANCOVA

5.1 Introduction
In previous chapters, we looked at comparing group means from data collected from a single-factor completely randomized design and analyzed using ANOVA. We also looked at estimating the slope of a straight line between two variables. In both cases the response variable, Y, was continuous (interval or ratio scale). In the case of ANOVA, the X variable was nominal or ordinal in scale and served to identify the treatment groups. In the regression setting, the X variable was also continuous.

The Analysis of Covariance (ANCOVA) is a combination of both analyses. Groups are identified by a nominal or ordinal scale variable, and a continuous covariate is also measured.
There are two uses of ANCOVA which, on the surface, appear to be separate analyses. In fact, both analyses are identical.

The first use is to check if the regression lines for the groups are parallel. If there is evidence that the individual regression lines are not parallel, then a separate regression line must be fit for each group for prediction purposes. If there is no evidence of non-parallelism, then the next task is to see if the lines are coincident, i.e. have both the same intercept and the same slope. If there is evidence that the lines are not coincident, then a series of parallel lines is fit to the data. All of the data are used to estimate the common slope. If there is no evidence that the lines are not coincident, then all of the data can simply be pooled together and a single regression line fit for all of the data.

The three possibilities are shown below for the case of two groups; the extension to many groups is obvious:
Second, ANCOVA has been used to test for differences in means among the groups when some of the variation in the response variable can be “explained” by a covariate. For example, the effectiveness of two different diets can be compared by randomizing people to the two diets and measuring the weight change during the experiment. However, some of the variation in weight change may be related to initial weight. Perhaps by “standardizing” everyone to some common weight, we can more easily detect differences among the groups.
Insert graphs here

A very nice book on the Analysis of Covariance is Analysis of Messy Data, Volume III: Analysis of Covariance by G. A. Milliken and D. E. Johnson. Details are available at http://www.statsnetbase.com/ejournals/books/book_summary/summary.asp?id=869.
5.2 Assumptions

As before, it is important to verify the assumptions underlying the analysis before the analysis is started. As ANCOVA is a combination of ANOVA and Regression, the assumptions are similar. Both goals of ANCOVA have similar assumptions:

• The response variable Y is continuous (interval or ratio scaled).

• The data are collected under a completely randomized design.¹ This implies that the treatment must be randomized completely over the entire set of experimental units in an experimental study, or units must be selected at random from the relevant populations in an observational study.

• There must be no outliers. Plot Y vs. X for each group separately to see if there are any points that don’t appear to follow the straight line.

• The relationship between Y and X must be linear for each group.² Check this assumption by looking at the individual plots of Y vs. X for each group.

• The variance must be equal for both groups around their respective regression lines. Check that the spread of the points is equal across the range of X and that the spread is comparable between the two groups. This can be formally checked by looking at the MSE from a separate regression line for each group, as the MSE estimates the variance of the data around the regression line.

• The residuals must be normally distributed around the regression line for each group. This assumption can be checked by examining the residual plots from the fitted model for evidence of non-normality. For large samples, this is not too crucial; for small sample sizes, you will likely have inadequate power to detect anything but gross departures.
5.3 Comparing individual regression lines

You saw in earlier chapters that a statistical model is a powerful shorthand to describe what analysis is fit to a set of data. The model must describe the treatment structure, the experimental unit structure, and the randomization structure. Let Y be the response variable, X be the continuous X-variable, and Group be the group factor.

In all cases that follow, we assume that a completely randomized design was used for the randomization structure. This implies that there are no explicit terms for the randomization structure in the model.

Similarly, there is a single size of experimental unit, with no blocking or sub-sampling occurring. This also implies there will be no terms in the model for the experimental unit structure. In more advanced courses, the analyses in this chapter can be extended to more complex designs.
¹ It is possible to relax this assumption - this is beyond the scope of this course.
² It is possible to relax this assumption as well, but this is again beyond the scope of this course.
In earlier chapters, we saw that the model for a single-factor completely randomized design is

Y = Group

This is read as saying that variation in Y can be partially explained by an overall grand mean (never specified) with differences in the mean caused by Groups, plus implicit random noise (which is never specified).

Again from an earlier chapter, the model for a regression of Y on X is

Y = X

This is read as saying that the variation in Y can be partially explained by an intercept (never specified) plus changes in X, plus implicit random noise (which is never specified).

As ANCOVA is a combination of the above two analyses, it will not be surprising that the models will have terms corresponding to both Group and X. Again, there are three cases.

If the lines for each group are not parallel:
the appropriate model is

Y = Group X Group*X

The terms can be in any order. This is read as: variation in Y can be explained by a common intercept (never specified), followed by group effects (different intercepts), a common slope on X, and an “interaction” between Group and X which is interpreted as different slopes for each group. This model is almost equivalent to fitting a separate regression line for each group. The only advantage to using this joint model for all groups is similar to that enjoyed by using ANOVA - all of the groups contribute to a better estimate of the residual error. If the number of data points per group is small, this can lead to improvements in precision compared to fitting each group individually.
If the lines are parallel across groups, but not coincident:

the appropriate model is

Y = Group X

The terms can be in any order. The only difference between this and the previous model is that this simpler model lacks the Group*X “interaction” term. It would not be surprising, then, that a statistical test to see if
this simpler model is tenable would correspond to examining the p-value of the test on the Group*X term from the complex model. This is exactly analogous to testing for interaction effects between factors in a two-factor ANOVA.

Lastly, if the lines are coincident:

the appropriate model is

Y = X

Now the difference between this model and the previous model is the Group term that has been dropped. Again, it would not be surprising that this corresponds to the test of the Group effect in the formal statistical test. The test for coincident lines should only be done if there is insufficient evidence against the hypothesis of parallelism.

While it is possible to test for a non-zero slope, this is rarely done.
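The whole testing sequence, interaction first and then the group effect, can be sketched with extra-sum-of-squares F-tests. The simulated data below (parallel, non-coincident lines) and all names are illustrative; in practice the same tests are read directly off the JMP output:

```python
import numpy as np

def ols_rss(X, y):
    """Residual sum of squares of an ordinary least-squares fit."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return np.sum((y - X @ beta) ** 2)

rng = np.random.default_rng(4)
n = 40
x = np.tile(np.linspace(0.0, 10.0, n // 2), 2)   # covariate, both groups
g = np.repeat([0.0, 1.0], n // 2)                # 0/1 group indicator
# Simulate parallel but non-coincident lines: common slope 2, intercepts 1 and 4.
y = 1.0 + 3.0 * g + 2.0 * x + rng.normal(0, 0.5, n)

one = np.ones(n)
X_sep  = np.column_stack([one, g, x, g * x])  # Y = Group X Group*X (separate lines)
X_par  = np.column_stack([one, g, x])         # Y = Group X        (parallel lines)
X_coin = np.column_stack([one, x])            # Y = X              (coincident lines)

# Extra-sum-of-squares F-test for the Group*X interaction (test of parallelism)
rss_sep, rss_par = ols_rss(X_sep, y), ols_rss(X_par, y)
F_inter = (rss_par - rss_sep) / (rss_sep / (n - 4))
# F-test for the Group term (test of coincidence), done only if parallelism holds
rss_coin = ols_rss(X_coin, y)
F_group = (rss_coin - rss_par) / (rss_par / (n - 3))
```

For these data, F_inter should be small (no evidence against parallelism) while F_group should be large (strong evidence against coincidence), matching the parallel-but-not-coincident case described above.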
5.4 Comparing Means after covariate adjustments

to be added later

5.5 Power and sample size

to be added later

- use the MSE as the estimate of variance for testing MEANS and for testing the slope.
5.6 Example - Degradation of dioxin
An unfortunate byproduct of pulp-and-paper production used to be dioxins - a very hazardous material. This material was discharged into waterways with the pulp-and-paper effluent, where it bioaccumulated in living organisms such as crabs. Newer processes have eliminated this byproduct, but the dioxins in the organisms take a long time to degrade.
Government environmental protection agencies take samples of crabs from affected areas each year and measure the amount of dioxins in the tissue. The following example is based on a real study.
Each year, four crabs are captured from two monitoring stations which are situated quite a distance apart on the same inlet where the pulp mill was located. The liver is excised and the livers from all four crabs are composited together into a single sample.³ The dioxin level in this composite sample is measured. As there are many different forms of dioxins with different toxicities, a summary measure, called the Total Equivalent Dose (TEQ), is computed from the sample.
As seen in the chapter on regression, the appropriate response variable is log(TEQ).
Is the rate of decline the same for both sites? Did the sites have the same initial concentration?
Here are the raw data, which are also available in the dataset dioxin2.jmp in the Sample Program Library at SampleProgramLibrary.
³ Compositing is a common analytical tool. There is little loss of useful information induced by the compositing process - the only loss of information is the among individual-sample variability, which can be used to determine the optimal allocation between samples within years and the number of years to monitor.
Site Year TEQ    log(TEQ)
a    1990 179.05 5.19
a    1991  82.39 4.41
a    1992 130.18 4.87
a    1993  97.06 4.58
a    1994  49.34 3.90
a    1995  57.05 4.04
a    1996  57.41 4.05
a    1997  29.94 3.40
a    1998  48.48 3.88
a    1999  49.67 3.91
a    2000  34.25 3.53
a    2001  59.28 4.08
a    2002  34.92 3.55
a    2003  28.16 3.34
b    1990  93.07 4.53
b    1991 105.23 4.66
b    1992 188.13 5.24
b    1993 133.81 4.90
b    1994  69.17 4.24
b    1995 150.52 5.01
b    1996  95.47 4.56
b    1997 146.80 4.99
b    1998  85.83 4.45
b    1999  67.72 4.22
b    2000  42.44 3.75
b    2001  53.88 3.99
b    2002  81.11 4.40
b    2003  70.88 4.26
The data can be entered into JMP in the usual fashion. Make sure that Site is a nominal scale variable, and that Year is a continuous variable.
In cases with multiple groups, it is often helpful to use a different plotting symbol for each group. This is easily accomplished in JMP by selecting the rows (say, for site a) and using Rows->Markers to set the plotting symbol for the selected rows:
The final data sheet has two different plotting symbols for the two sites:
Before fitting the various models, begin with an exploratory examination of the data, looking for outliers and checking the assumptions.
Each year's data is independent of other years' data, as a different set of crabs was selected. Similarly, the data from one site are independent of the data from the other site. This is an observational study, so the question arises of how exactly the crabs were selected. In this study, crab pots were placed on the sea floor to capture the available crabs in the area.
Whenever multiple sets of data are collected over time, there is always the worry about common year effects (also known as process error). For example, if the response variable were body mass of small fish, then poor growing conditions in a single year could depress the growth of fish in all locations. This would violate the assumption of independence, as the residual at one site in a year would be related to the residual at the other site in the same year. You tend to see the residuals "paired", with negative residuals from the fitted line at one site matched (by year) with negative residuals at the other site. In this case, this is unlikely to have occurred. Degradation of dioxin is relatively independent of external environmental factors, and the variation that we see about the two regression lines is related solely to sampling error based on the particular set of crabs that were sampled. It seems unlikely that the residuals are related.⁴
Use the Analyze->Fit Y-by-X platform and specify log(TEQ) as the Y variable and Year as the X variable:
⁴ If you actually try to fit a process-error term to this model, you find that the estimated process error is zero.
Then specify a grouping variable by clicking on the pop-down menu near the Bivariate Fit title line and selecting Site as the grouping variable:
Now select Fit Line from the same pop-down menu:
to get separate lines fit for each group:
The relationship for each site appears to be linear. The actual estimates are also presented:
The scatterplot doesn't show any obvious outliers. The estimated slope for the a site is -.107 (se .02) while the estimated slope for the b site is -.06 (se .02). The 95% confidence intervals (not shown on the output, but available by right-clicking/ctrl-clicking on the parameter estimates table) overlap considerably, so the slopes could be the same for the two groups.
The MSE from site a is .10 and the MSE from site b is .12. These correspond to standard deviations of √.10 = .32 and √.12 = .35, which are very similar, so the assumption of equal standard deviations seems reasonable.
The residual plots (not shown) also look reasonable.
The assumptions appear to be satisfied, so let us now fit the various models.
First, fit the model allowing for separate lines for each group. The Analyze->Fit Model platform is used. The terms can be in any order and correspond to the model described earlier. This gives the following output:
The regression plot is just the same as the plot of the two individual lines seen earlier. What is of interest is the Effect test for the Site*Year interaction. Here the p-value is not very small, so there is no evidence that the lines are not parallel.
We need to refit the model, dropping the interaction term:
which gives the following regression plot:
This shows the fitted parallel lines. The effect tests now have a small p-value for the Site effect, indicating that the lines are not coincident, i.e. they are parallel with different intercepts. This would mean that the rate of decay of the dioxin appears to be equal at both sites, but the initial concentration appears to be different.
The estimated (common) slope is found in the Parameter Estimates portion of the output:
and has a value of -.083 (se .016). Because the analysis was done on the log-scale, this implies that the dioxin levels changed by a factor of exp(−.083) = .92 from year to year, i.e. about an 8% decline each year. The 95% confidence interval for the slope on the log-scale is (−.12, −.05), which corresponds to a potential factor between exp(−.12) = .88 and exp(−.05) = .95 per year, i.e. between a 12% and a 5% decline per year.⁵
While it is possible to estimate the difference between the parallel lines from the Parameter Estimates table, it is easier to look at the section of the output corresponding to the Site effects. Here the estimated LSMeans correspond to the log(TEQ) at the average value of Year - not really of interest. As in previous chapters, the difference in means is often of more interest than the raw means themselves. This is found by using the pop-down menu and selecting an LSMeans Contrast or Multiple Comparison procedure to give:
⁵ The confidence intervals are found by right-clicking/ctrl-clicking in the Parameter Estimates table.
The estimated difference between the lines (on the log-scale) is 0.46 (se .13). Because the analysis was done on the log-scale, this corresponds to a ratio of exp(.46) = 1.58 in dioxin levels between the two sites, i.e. site b has 1.58 times the dioxin level of site a. Because the lines are parallel and declining, the dioxin levels are falling at both sites, but the 1.58 times ratio remains consistent.
Finally, the Actual by Predicted plot (not shown here), the leverage plots (not shown here), and the residual plot don't show any evidence of a problem in the fit.
5.7 Change in yearly average temperature with regime shifts
The ANCOVA technique can also be used for trends when there are KNOWN regime shifts in the series. The case when the timing of the shift is unknown is more difficult and not covered in this course.
For example, consider a time series of annual average temperatures measured at Tuscaloosa, Alabama from 1901 to 2001. It is well known that shifts in temperature readings can occur whenever the instrument, location, observer, or other characteristics of the station change.
The data are available in the JMP datafile tuscaloosa-avg-temp.jmp in the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms.
A portion of the raw data is shown below:
and a time series plot of the data shows a shift in the readings in 1939 (thermometer changed), 1957 (station moved), and possibly in 1987 (location and thermometer changed).
It turns out that cases where the number of epochs tends to increase with the number of data points pose some serious technical issues with the properties of the estimators. See
Lu, Q. and Lund, R.B. (2007).
Simple linear regression with multiple level shifts. Canadian Journal of Statistics, 35, 447-458,
for details. Basically, if the number of parameters tends to increase with sample size, this violates one of the assumptions of maximum likelihood estimation. This would lead to estimates which may not even be consistent! For example, suppose that the recording conditions changed every two years. Then each pair of data points should still be able to estimate the common slope, but this corresponds to the well-known problem with case-control studies where the number of pairs increases with total sample size. Fortunately, Lu and Lund (2007) showed that this violation is not serious here.
The analysis proceeds as in the dioxin example with two sites, except that now the series is broken into different epochs corresponding to the sets of years when conditions remained stable at the recording site. In this case, this corresponds to the years 1901-1938 (inclusive); 1940-1956 (inclusive); 1958-1986 (inclusive); and 1989-2000 (inclusive). Note that the years 1939, 1957, and 1987 are NOT used because the average temperature in these years is an amalgam of two different recording conditions.⁶
For example, the data file (around the first regime change) may look like:
Note that Year and Avg Temp are both set to have a continuous scale, but Epoch should have a nominal or ordinal scale.
Model fitting proceeds as before by first fitting the model

AvgTemp = Year Epoch Year*Epoch

to see if the change in AvgTemp is consistent among epochs, and then fitting the model

AvgTemp = Year Epoch

to estimate the common trend (after adjusting for shifts among the epochs).
The Analyze->Fit Model platform is used:
⁶ If the exact day of the change were known, it would be possible to weight the two epochs in these years and include the data points.
There is no strong evidence that the slopes differ among the epochs (p=.10), despite the plot showing a potentially different slope in the 3rd epoch:
The simpler model with common slopes is then fit:
with fitted (common slope) lines:
No further model simplification is possible, and there is evidence that the common slope is different from zero.
The estimated change in average temperature is:
i.e. an estimated increase of .033 (SE .006) per year. The 95% confidence interval does not cover 0.
The residual plots (against predicted values and against the order in which the data were collected):
show no obvious problems.
Whenever time series data are used, autocorrelation should be investigated. The Durbin-Watson test is applied to the residuals, with no obvious problem detected.
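For reference, the Durbin-Watson statistic is easy to compute by hand from the residuals; a sketch with made-up residuals (values near 2 suggest no first-order autocorrelation, values near 0 or 4 suggest positive or negative autocorrelation):

```python
def durbin_watson(resid):
    """d = sum of squared successive differences of the residuals over the residual sum of squares."""
    num = sum((resid[t] - resid[t - 1]) ** 2 for t in range(1, len(resid)))
    den = sum(e * e for e in resid)
    return num / den

e = [0.3, -0.2, 0.1, -0.4, 0.2, 0.0, -0.1, 0.3, -0.3, 0.1]  # illustrative residuals only
print(round(durbin_watson(e), 2))  # prints: 3.11
```

These alternating illustrative residuals give d above 2, hinting at negative autocorrelation; residuals from a well-behaved fit like the one in the text would sit closer to 2.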
The leverage plot (against year)
also reveals nothing amiss.
A more sophisticated analysis can be fit using SAS, but isn't needed here. The sample program and output are available in the Sample Program Library.
5.8 Example - More refined analysis of stream-slope example
In the chapter on paired comparisons, the example of the effect of stream slope was examined based on:
Isaak, D.J. and Hubert, W.A. (2000). Are trout populations affected by reach-scale stream slope? Canadian Journal of Fisheries and Aquatic Sciences, 57, 468-477.
In that paper, stream slope was (roughly) categorized into high- or low-slope classes and a paired analysis was performed. In this section, we will use the actual stream slopes to examine the relationship between fish density and stream slope.
Recall that a stream reach is a portion of a stream, from 10 to several hundred metres in length, that exhibits consistent slope. The slope influences the general speed of the water, which exerts a dominant influence on the structure of physical habitat in streams. If fish populations are influenced by the structure of physical habitat, then the abundance of fish populations may be related to the slope of the stream.
Reach-scale stream slope and the structure of associated physical habitats are thought to affect trout populations, yet previous studies confound the effect of stream slope with other factors that influence trout populations.
Past studies addressing this issue have used sampling designs wherein data were collected either using repeated samples along a single stream or by measuring many streams distributed across space and time. Reaches on the same stream will likely have correlated measurements, making the use of simple statistical tools problematic. [Indeed, if only a single stream is measured at multiple locations, then this is an example of pseudo-replication and inference is limited to that particular stream.]
Inference from streams spread over time and space is made more difficult by inter-stream differences and by temporal variation in trout populations if samples are collected over extended periods of time. This extra variation reduces the power of any survey to detect effects.
For this reason, a paired approach was taken. A total of twenty-three streams were sampled from a large watershed. Within each stream, two reaches were identified and the actual slope gradient was measured.
In each reach, fish abundance was determined using electro-fishing methods and the numbers converted to a density per 100 m² of stream surface.
Table 6.1 presents the (fictitious, but based on the above paper) raw data.

Estimates of fish density from a paired experiment

Stream  slope (%)  slope class  density (per 100 m²)
1   0.7  low   15.0
1   4.0  high  21.0
2   2.4  low   11.0
2   6.0  high   3.1
3   0.7  low    5.9
3   2.6  high   6.4
4   1.3  low   12.2
4   4.0  high  17.6
5   0.6  low    6.2
5   4.4  high   7.0
6   1.3  low   39.8
6   3.2  high  25.0
7   2.0  low    6.5
7   4.2  high  11.2
8   1.3  low    9.6
8   4.2  high  17.5
9   2.0  low    7.3
9   3.6  high  10.0
10  0.7  low   11.3
10  3.5  high  21.0
11  2.3  low   12.1
11  6.0  high  12.1
12  2.5  low   13.2
12  4.2  high  15.0
13  2.3  low    5.0
13  6.0  high   5.0
14  1.2  low   10.2
14  2.9  high   6.0
15  0.7  low    8.5
15  2.9  high   7.0
16  1.1  low    5.8
16  3.0  high   5.0
17  2.2  low    5.1
17  5.0  high   5.0
18  0.7  low   65.4
18  3.2  high  55.0
19  0.7  low   13.2
19  3.0  high  15.0
20  0.3  low    7.1
20  3.2  high  12.0
21  2.3  low   44.8
21  7.0  high  48.0
22  1.8  low   16.0
22  6.0  high  20.0
23  2.2  low    7.2
23  6.0  high  10.1
Notice that the density varies considerably among streams but appears to be fairly consistent within each stream.
The raw data are available in a JMP datafile called paired-stream.jmp in the Sample Programs Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms.
As noted earlier, this is an example of an Analytical Survey. The treatments (low or high slope) cannot be randomized within a stream - the randomization occurs by selecting streams at random from some larger population of potential streams. As noted in the earlier chapter on Observational Studies, causal inference is limited whenever a randomization of experimental units to treatments cannot be performed.
Unlike the example presented in other chapters, where the slope is divided (arbitrarily) into two classes (low and high slope), we will now use the actual slope. A simple regression CANNOT be used because of the non-independence introduced by measuring two reaches on the same stream. However, an ANCOVA will prove to be useful here.
First, it seems sensible that the response to stream slope will be multiplicative rather than additive, i.e. an increase in the stream slope will change the fish density by a common fraction, rather than simply changing the density by a fixed amount. For example, it may turn out that a 1-unit change in the slope reduces density by 10% - if the density before the change was 100 fish/m², then after the change the new density will be 90 fish/m². Similarly, if the original density was only 10 fish/m², then the final density will be 9 fish/m². In both cases, the reduction is a fixed fraction, and NOT the same fixed amount (a change of 10 vs. 1).
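This multiplicative-versus-additive distinction is exactly what the log transform captures: a fixed percentage change becomes a fixed additive change on the log scale, whatever the starting density. A quick numerical check:

```python
import math

# a 10% reduction shifts log(density) by log(0.9), regardless of the starting density
drop_from_100 = math.log(90) - math.log(100)   # density 100 -> 90
drop_from_10 = math.log(9) - math.log(10)      # density 10 -> 9
print(round(drop_from_100, 6) == round(drop_from_10, 6))  # prints: True
```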
Create the log(density) column in the usual fashion (not illustrated here). In cases like this, the natural logarithm is preferred because the resulting estimates have a very nice, simple interpretation.⁷
An appropriate model will be one where each stream has a separate intercept (corresponding to the different productivities of each stream - acting like a block), with a common slope for all streams. The simplified model syntax would look like

log(density) = stream slope

where the term stream represents a nominal scaled variable and gives the different intercepts, and the term slope is the effect of the stream slope on the log(density).
This is fit using the Analyze->Fit Model platform as:
⁷ The JMP dataset also created a different plotting symbol for each stream using the Rows->Color or Mark by Column menu.
Note that stream must have a nominal scale and that slope must have a continuous scale. The order of the terms in the effects box is not important.
The output from the Analyze->Fit Model platform is voluminous, but a careful reading reveals several interesting features.
First is a plot of the common slope fit to each stream:
This shows a gradual increase as slope increases. This plot is hard to interpret, but a plot of observed vs. predicted values is clearer:
Generally, the observed values are close to the predicted values, except for two potential outliers. By clicking on these points, it is seen that both points belong to stream 2, where it appears that the increase in slope causes a large decrease in density, contrary to the general pattern seen in the other streams.
The effect tests fail to detect any influence of slope. Indeed, the estimated coefficient associated with a change in slope
is estimated to be .025 (se .0299), which is not statistically significant.⁸
Residual plots also show the odd behavior of stream 2:
If this rogue stream is "eliminated" from the analysis, the resulting plots do not show any problems (try it), but now the results are statistically significant (p=.035):
⁸ Because the natural log transform was used, "smallish" slope coefficients have an approximate percent-change interpretation. In this example, a slope of .025 on the (natural) log scale implies that the estimated fish density INCREASES by 2.5% every time the stream slope increases by one percentage point.
The estimated change in log-density per percentage-point change in the stream slope is found to be .05 (se .02), which is interpreted as a percentage-point increase in stream slope increasing fish density by about 5%.⁹
The remaining residual and leverage plots show no problems.
Yet another alternate analysis!
Because the treatment only has two levels, the same answers can also be obtained by estimating the ratio of the change in log(density) to the change in slope.¹⁰ To begin, we need to split the data table so that both the log(density) and the slope are in separate columns:
⁹ This easy interpretation occurs because the natural log transform was used. If the common (base-10) log transform were used, there would no longer be such a simple interpretation.
¹⁰ If the slope class had three or more levels, this analysis could not be done, and the previous analysis would be the preferred route.
This creates a data table with separate columns for the log(density) and the stream slope for both the high- and low-slope categories:
Now create two new variables (create new columns and write a formula for each column) representing the differences in log(density) and slope between the high- and low-slope classes:
Finally, we wish to fit a line through the origin to these data points. We use the Analyze->Fit Y-by-X platform, then Fit Special from the red-triangle drop-down menu:
and then check the Constrain intercept option:
This gives the following output:
We obtain the same estimated effect and se. The outlier from stream 2 is readily evident. When this outlier is excluded and the analysis is repeated, a statistically significant result is again obtained that matches the previous analysis.
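Outside of JMP, the same through-origin fit to the paired differences can be sketched in a few lines. This is an illustration only: the difference values below are hypothetical, not the actual stream values from the data file.

```python
import numpy as np

# Hypothetical per-stream differences between the high- and low-slope classes
# (the real values come from the stream data table; these are made up).
d_slope = np.array([0.9, 1.1, 4.0, 1.3, 0.8])        # change in slope
d_logden = np.array([-0.7, -0.9, -2.5, -1.0, -0.6])  # change in log(density)

# Least-squares slope for a line through the origin: b = sum(x*y) / sum(x^2)
b_hat = np.sum(d_slope * d_logden) / np.sum(d_slope ** 2)

# Standard error of the through-origin slope (n - 1 df: one parameter fitted)
resid = d_logden - b_hat * d_slope
s2 = np.sum(resid ** 2) / (len(d_slope) - 1)
se_b = np.sqrt(s2 / np.sum(d_slope ** 2))
```

Deleting the pair with the large change in slope and refitting shows, as in the JMP analysis, how sensitive a through-origin slope can be to a single influential pair.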
5.9 Comparing Fulton's Condition Factor K
Not all fish within a lake are identical. How can a single summary measure be developed to represent the condition of fish within a lake?
In general, the relationship between fish weight and length follows a power law:
W = a L^b
where W is the observed weight; L is the observed length; and a and b are coefficients relating length to weight. The usual assumption is that heavier fish of a given length are in better condition than lighter fish. Condition indices are a popular summary measure of the condition of the population.
There are at least eight different measures of condition which can be found by a simple literature search. Cone (1989) raises some important questions about the use of a single index to represent the two-dimensional weight-length relationship.
One common measure is Fulton's[11] K:
K = Weight / (Length/100)^3
This index makes an implicit assumption of isometric growth, i.e. as the fish grows, its body proportions and specific gravity do not change.
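With weight in grams and length in millimetres, K follows directly from the formula above. A minimal sketch (the fish below are invented; the real data are in the rainbow-condition.jmp file):

```python
import numpy as np

# Hypothetical fish: weight (g) and length (mm)
weight = np.array([480.0, 510.0, 350.0, 620.0])
length = np.array([325.0, 330.0, 295.0, 355.0])

# Fulton's condition factor: K = weight / (length/100)^3
K = weight / (length / 100.0) ** 3
mean_K = K.mean()
```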
How can K be computed from a sample of fish, and how can K be compared among different subsets of fish from the same lake or across lakes?
The B.C. Ministry of Environment takes regular samples of rainbow trout using a floating and a sinking net. For each fish captured, the weight (g), length (mm), sex, and maturity of the fish were recorded. The data are available in the rainbow-condition.jmp data file in the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms.
A portion of the raw data appears below:
[11] There is some doubt about the first authorship of this condition factor. See Nash, R. D. M., Valencia, A. H., and Geffen, A. J. (2005). The Origin of Fulton's Condition Factor – Setting the Record Straight. Fisheries, 31, 236-238.
K was computed for each individual fish, and the resulting histogram is displayed below:
There is a range of condition numbers among the individual fish, with an average (among the fish caught) K of about 13.6.
Deriving a single summary measure to represent the entire population of fish in the lake depends heavily on the sampling design used to capture fish.
Some care must be taken to ensure that the fish collected are a simple random sample from the fish in the population. If a net of a single mesh size is used, it has a selectivity curve and is typically more selective for fish of a certain size. In this experiment, several different mesh sizes were used to try to ensure that fish of all sizes have an equal chance of being selected.
As well, regression methods have an advantage in that a simple random sample from the population is
no longer required to estimate the regression coefficients. As an analogy, suppose you are interested in the relationship between yield of plants and soil fertility. Such a study could be conducted by finding a random sample of soil plots, but this may lead to many plots with similar fertility and only a few plots with fertility at the tails of the relationship. An alternate scheme is to deliberately seek out soil plots with a range of fertilities, or to purposely modify the fertility of soil plots by adding fertilizer, and then fit a regression curve to these selected data points.
Fulton's index is often re-expressed for regression purposes as:
W = K (L/100)^3
This looks like a simple regression between W and (L/100)^3, but with no intercept.
A plot of these two variables (below) shows a tight relationship among fish, but with possibly increasing variance with length.
There is some debate about the proper way to estimate the regression coefficient K. Classical regression methods (least squares) implicitly assume that all of the "error" in the regression is in the vertical direction, i.e. the analysis conditions on the observed lengths. However, the structural relationship between weight and length likely has error in both variables. This leads to the error-in-variables problem in regression, which
has a long history. Fortunately, the relationship between the two variables is often sufficiently tight that it really doesn't matter which method is used to find the estimates.
JMP can be used to fit the regression line constraining the intercept to be zero by using the Fit Special option under the red-triangle menu:
This gives rise to the fitted line and statistics about the fit:
Note that R^2 really doesn't make sense in cases where the regression is forced through the origin, because the null model to which it is being compared is the line Y = 0, which is silly.[12] For this reason, JMP does not report a value of R^2.
The estimated value of K is 13.72 (SE 0.099).
The residual plot shows clear evidence of increasing variation with the length variable. This usually implies that a weighted regression is needed, with weights proportional to 1/length^2. In this case, such a regression gives essentially the same estimate of the condition factor (K̂ = 13.67, SE = 0.11).
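Both the unweighted and the weighted through-origin fits have simple closed forms that can be sketched with numpy. The data here are simulated around K = 13.7 with noise that grows with length, mimicking the fan shape in the residual plot; this is an illustration, not the rainbow-trout analysis itself.

```python
import numpy as np

rng = np.random.default_rng(1)
length = rng.uniform(250.0, 400.0, 200)   # mm, simulated
x = (length / 100.0) ** 3

# Simulate weights (g) with standard deviation proportional to length
weight = 13.7 * x + rng.normal(0.0, 0.08 * length)

# Ordinary least squares through the origin
K_ols = np.sum(x * weight) / np.sum(x ** 2)

# Weighted least squares through the origin, weights proportional to 1/length^2
w = 1.0 / length ** 2
K_wls = np.sum(w * x * weight) / np.sum(w * x ** 2)
```

As in the actual data (13.72 vs. 13.67), the two estimates agree closely; this is the usual outcome when the relationship is tight.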
Comparing condition factors
This dataset has a number of sub-groups – do all of the subgroups have the same condition factor? For example, suppose we wish to compare the K value for immature and mature fish. As noted by Garcia-Berthou (2001)[13], this is best done through a technique called Analysis of Covariance (ANCOVA). Some details on ANCOVA are presented in a separate chapter of these notes.
As outlined in the ANCOVA chapter, we start with a model that has a separate K for each maturity class. The simplified syntax for this model is:
W = (Len/100)^3 (Len/100)^3*Maturity
Note that, unlike traditional ANCOVA models, this model is lacking the simple effect of maturity. The reason for this is that, unlike traditional ANCOVA models, the intermediate model with parallel slopes really
[12] Consult any of the standard references on regression, such as Draper and Smith, for more details.
[13] Garcia-Berthou E. (2001). On the misuse of residuals in ecology: testing regression residuals vs. the analysis of covariance. Journal of Animal Ecology 70, 708-711. http://dx.doi.org/10.1046/j.1365-2656.2001.00524.x
doesn't make sense when the regression lines are forced through the origin. This syntax specifies that variation in weight is attributable to variation in length and an interaction between length and maturity. This latter term represents the differential K between the maturity classes.
Here is where some care must be taken. By default, JMP "centers" (i.e. subtracts the mean of) continuous X variables when they participate in an interaction or similar term:
Hence, if you just try to implement the above model directly in JMP, you will actually fit the model:
W = (Len/100)^3 ((Len/100)^3 - mean[(Len/100)^3])*Maturity
which, when expanded, actually adds an intercept term to the model. Ordinarily, in regression models with intercepts, this would NOT be a problem – it is because the model is being forced through the origin that this causes a problem.
In order to prevent JMP from "centering" the length variable when fitting these ANCOVA models, turn off the centering option (by unchecking the option) when the model is fit using the Analyze->Fit Model platform of JMP:
Note the use of the No Intercept option to again force the line through the origin. JMP will 'complain' about the odd form of the model because it is missing the simple maturity-class effect, but just ignore the complaints. This gives the summary output for the effect test:
The p-value for the last term in the table, 0.027, indicates that there is strong evidence of a different K between the two maturity classes.
The estimates for the separate maturity classes are obtained from the Custom Test option (some knowledge of the design-matrix coding for categorical variables in JMP is needed to know that JMP uses a (1, -1) coding for indicator variables with 2 classes):
which gives the estimated K for each maturity class.
If you fit a separate regression for the two maturity classes (use the By option in the fit model box), you will get the same two estimates. The respective standard errors will be slightly different because the single model is able to pool over all of the data to estimate the standard errors, but separate fits cannot do any pooling.
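The pooled no-intercept model can be written down explicitly: with a (1, -1) coding for the two maturity classes (as JMP uses), the design matrix has one column for (Len/100)^3 and one for its product with the ±1 code. A sketch on simulated fish; the class K values of 14.0 and 13.2 are invented, not the estimates from the real data.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 60
length = rng.uniform(250.0, 400.0, n)
code = np.where(np.arange(n) < n // 2, 1.0, -1.0)  # (1, -1) maturity coding
x = (length / 100.0) ** 3

K_true = np.where(code == 1.0, 14.0, 13.2)         # invented class values
weight = K_true * x + rng.normal(0.0, 0.08 * length)

# No-intercept design: common term plus length-by-maturity interaction
X = np.column_stack([x, x * code])
beta, *_ = np.linalg.lstsq(X, weight, rcond=None)

K_class1 = beta[0] + beta[1]   # class coded +1
K_class2 = beta[0] - beta[1]   # class coded -1
```

Fitting the two classes separately recovers the same two estimates; only the standard errors change, because the pooled model estimates a single residual variance from all of the data.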
The separate fitted lines are shown below:
Similarly, a comparison of K can be made among the three sex classes (M, F, and U), where immature fish cannot be sexed and are given the code U, while mature fish are further subdivided into the M and F classes (don't forget to uncheck the centering option in the triangle in the upper left corner of the Analyze->Fit Model dialogue). This comparison also shows evidence (p = .025) of a differential K among the three sex classes (this is not unexpected), and a contrast can be done to see if there is further evidence of a difference between the males and females:
As the p-value is .0074, there is also strong evidence of a differential K between the males and females.
A final plot of the three lines is:
Finally, because you have replicate fish at the same body length, it is possible to do a formal lack-of-fit test. The idea behind this test is to compare the variation in data points at the same replicated lengths (pure error) with the deviations around the line from the model (model error). If the model fits well, these two estimates of residual variance should be comparable:
The p-value for the lack-of-fit test is quite large, indicating no evidence of a lack of fit.
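The pure-error/lack-of-fit decomposition can be sketched by hand: pool the within-group variation at each replicated length (pure error), subtract it from the total residual sum of squares, and form an F ratio. The numbers below are invented; JMP's Lack of Fit report does the same arithmetic on the real data.

```python
import numpy as np

# Hypothetical replicate fish at repeated lengths (mm) with weights (g)
length = np.array([300.0, 300.0, 300.0, 320.0, 320.0, 340.0, 340.0, 340.0])
weight = np.array([370.0, 360.0, 380.0, 450.0, 445.0, 540.0, 560.0, 550.0])

x = (length / 100.0) ** 3
K_hat = np.sum(x * weight) / np.sum(x ** 2)     # through-origin fit
sse_model = np.sum((weight - K_hat * x) ** 2)   # total residual SS

# Pure error: variation among replicates sharing the same length
sse_pure, df_pure = 0.0, 0
for lv in np.unique(length):
    grp = weight[length == lv]
    sse_pure += np.sum((grp - grp.mean()) ** 2)
    df_pure += grp.size - 1

df_model = weight.size - 1                      # one parameter fitted
sse_lof = sse_model - sse_pure                  # lack-of-fit SS
df_lof = df_model - df_pure

F = (sse_lof / df_lof) / (sse_pure / df_pure)   # compare to F(df_lof, df_pure)
```

A p-value comes from the upper tail of an F(df_lof, df_pure) distribution (e.g. scipy.stats.f.sf(F, df_lof, df_pure)); a large p-value indicates no evidence of lack of fit.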
This same ANCOVA method can be used to compare the K values across lakes or across time within the same lake. If you have a large number of lakes, each measured multiple times, some very interesting models can be fit that are beyond the scope of these notes – please contact me. Similarly, interest may lie in modeling K as a function of other lake-specific covariates such as lake size, productivity, etc. Again, please contact me, as this is beyond the scope of these notes.
Statistical significance is not the same as biological significance! While there was evidence of a differential K in this data set, this statistical significance does not imply biological importance. I have no idea if the observed differences in K among these three groups have any biological meaning.
5.10 Final Notes
Some sections need to be added here on the following topics:
• danger of ANCOVA if there is no overlap in the covariate
• choice between a paired t-test, a multivariate test, or ANCOVA in the case of two time points
Chapter 6
Multiple linear regression
6.1 Introduction
In previous chapters, the relationship between a single, continuous variable (Y, a.k.a. the response variable) and a single continuous variable (X, a.k.a. the predictor or explanatory variable) was explored using simple linear regression. In this chapter, this will be generalized to the case of more than one explanatory (X) variable.[1]
There are many good books covering this topic – refer to the list in previous chapters.
Fortunately, many of the techniques learned in the previous chapter on simple linear regression carry over directly to the more general multiple regression. There are a few subtle differences in interpretation, and additional problems (such as variable selection) must be solved.
It turns out that multiple regression methods are very general methods covering a wide range of statistical problems under the rubric of general linear models. Surprisingly, multiple regression is a general solution for two-sample t-tests, for ANOVA models, for simple linear regression models, etc. The exact theory is beyond the scope of these notes, but intuitive explanations will be provided as needed.
6.1.1 Data format and missing values
The data are collected and stored in a tabular format with rows representing observations and columns representing different variables. One of the variables will be the response (Y) variable; there can be several predictor (X) variables. Virtually all computer packages require variables to be stored in columns and observations stored in rows.
[1] It is also possible to have more than one Y variable – this is known as multivariate multiple regression but is not covered in this chapter.
The response variable (Y) must be continuous. It is NOT appropriate to do multiple regression when the Y variable represents categories – the appropriate methodology in this case is logistic regression. If the Y variable represents counts, a technique known as Poisson regression may be more appropriate – consult the chapter on generalized linear models for more details. Finally, in some cases, the value of Y may be censored, i.e. the exact value is not known, but it is known to be beyond certain threshold values (e.g. above or below detection limits). The analysis of such data is beyond the scope of these notes – consult the chapter on Tobit analysis for details.
Surprisingly, there is much more flexibility in the type of the X variables. They may be continuous, as seen previously in simple linear regression, or they may be dichotomous variables taking only the values of 0 or 1 (known as indicator variables).[2] These indicator variables are used to represent different groups (e.g. male and female) in the data.
The dataset is assumed to be complete, with NO missing values in any of the X variables. If an observation (row) has some missing X values, most computer packages practice what is known as case-wise deletion, i.e. the entire observation will be dropped from the analysis. Consequently, it is always important to check the computer output to see exactly how many observations have been used in the analysis.
A missing Y value also implies that the observation (row) will be deleted from the analysis. However, if the set of X variables is complete, it is still possible to obtain a prediction of Y for the observed set of X values.
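Case-wise deletion is easy to verify directly. A sketch with pandas, using a hypothetical four-row table in which each of the last three rows is missing one value, so only a single complete case survives:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "y":  [10.0, 12.0, np.nan, 15.0],
    "x1": [1.0,  2.0,  3.0,    np.nan],
    "x2": [5.0,  np.nan, 7.0,  8.0],
})

# Case-wise deletion: any row with a missing value is dropped entirely
complete = df.dropna()
n_used = len(complete)   # only the first row is complete
```

This is why checking the reported n in the regression output matters: three-quarters of this table silently disappears.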
As in previous chapters, missing data should be examined to see if they are missing completely at random (MCAR), in which case there is usually no problem in the analysis other than reduced sample size; missing at random (MAR), which is again handled relatively easily; or informatively missing (IM), which poses serious problems in the analysis. Seek help in the latter case.
6.1.2 The statistical model
The statistical model for multiple regression is an extension of that for simple linear regression.
The response variable, denoted by Y, is measured along with a set of predictor variables, denoted by X_1, X_2, ..., X_p, where p is the number of predictor variables.
The formal statistical model is:
Y_i = β_0 + β_1 X_i1 + β_2 X_i2 + ... + β_p X_ip + ε_i
where the unknown parameters are the set of β's. The deviation between the observed value of Y and the predicted value from the regression equation, ε_i, is distributed as a Normal distribution with a mean of 0 and an (unknown) variance of σ^2.
[2] In actual fact, any set of two distinct values may be used, but traditional usage is to use 0 and 1.
This is often written using a shorthand notation in many statistical packages as:
Y = X_1 X_2 ... X_p
where the intercept (β_0) and the residual variation (ε) are implicit.
This can also be written using matrices as:
Y = Xβ + ε
where Y is an n × 1 column vector, X is an n × (p + 1) matrix [don't forget the intercept column] of the predictors, β is a (p + 1) × 1 column vector (the intercept β_0, plus the p "slopes" β_1, ..., β_p), and ε is an n × 1 vector of residuals that has a multivariate normal distribution with a mean of 0 and a covariance matrix of Iσ^2, where I is the identity matrix.
Note that this format for multiple regression is very flexible. By appropriate definition of the X variables, many different problems can be cast into a multiple-regression framework. In future courses you will see that ANOVA (a technique to compare means among multiple groups) is actually nothing but regression in disguise!
6.1.3 Assumptions
Not surprisingly, the assumptions for a multiple regression analysis are very similar to those required for a simple linear regression.
Linearity
Because of the multiple X variables, the assumption of linearity is not as straightforward as for simple linear regression.
Multiple regression analysis assumes that the MARGINAL relationship between Y and each X is linear. This means that if all other X variables are held constant, then changes in the particular X variable lead to a linear change in the Y variable. Because this is a MARGINAL relationship, simple plots of Y vs. each X variable may not be linear, since the simple pairwise plots cannot hold the other variables fixed.
To assess this relationship, residuals from the fit should be plotted against each X variable in turn. If the scatter of the residuals is not random around 0 but shows some pattern (e.g. a quadratic curve), this usually indicates that the marginal relationship between Y and that particular X is not linear. Alternatively, fit a model that includes both X and X^2 and test if the coefficient associated with X^2 is zero. Unfortunately, this test could fail to detect a higher-order relationship. Third, if there are multiple readings at some X values, then a test of goodness-of-fit (what JMP calls the Lack of Fit test) can be performed, where the variation of the responses at the same X value is compared to the variation around the regression line.
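The X-plus-X^2 check can be sketched as follows. The data are simulated with deliberate curvature so the quadratic coefficient shows up clearly; this is an illustration of the mechanics, not a recipe that detects every kind of nonlinearity.

```python
import numpy as np

rng = np.random.default_rng(3)
x = np.linspace(0.0, 10.0, 60)
y = 2.0 + 1.5 * x + 0.4 * x ** 2 + rng.normal(0.0, 1.0, 60)  # curved truth

# Fit Y = b0 + b1*X + b2*X^2 and examine the t-ratio of b2
X = np.column_stack([np.ones_like(x), x, x ** 2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

resid = y - X @ beta
s2 = resid @ resid / (len(y) - X.shape[1])   # residual variance
cov = s2 * np.linalg.inv(X.T @ X)
t_b2 = beta[2] / np.sqrt(cov[2, 2])          # large |t| flags curvature
```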
Correct sampling scheme
The Y values must be a random sample from the population of Y values for every set of X values in the sample. Fortunately, it is not necessary to have a completely random sample from the population, as the regression line is valid even if the X values are deliberately chosen. However, for a given set of X values, the Y values from the population must be a simple random sample.
This latitude gives considerable freedom in selecting points to investigate the relationship between Y and X. This will be discussed more in class.
No outliers or influential points
All the points must belong to the relationship – there should be no unusual points.
The plot of the residuals against the row number or against the predicted values should be investigated to see if there are unusual points.
The marginal scatterplot of the residuals from the fit vs. each X should be examined. As well, leverage plots (Section 6.2.6) are useful for detecting influential points.
Outliers can have a dramatic effect on the fitted line.
Equal variation along the line
The variability about the regression plane must be similar for all sets of X, i.e. the scatter of the points above and below the fitted surface should be roughly constant over the entire surface. This is assessed by looking at the plots of the residuals against each X variable to see if the scatter is roughly uniform around zero, with no increase and no decrease in spread over the entire line.
Independence
Each value of Y is independent of any other value of Y. The most common cases where this fails involve time-series data.
This assumption can be assessed by again looking at residual plots against time or other variables.
Normality of errors
The difference between the value of Y and the expected value of Y is assumed to be normally distributed. This is one of the most misunderstood assumptions. Many people erroneously assume that the distribution of Y over all X values must be normally distributed, i.e. they look simply at the distribution of the Y's, ignoring the X's. The assumption of normality only states that the residuals, the differences between the values of Y and the corresponding points on the line, must be normally distributed.
This can be assessed by looking at normal probability plots of the residuals. As in ANOVA, for small sample sizes you have little power to detect non-normality, and for large sample sizes it is not that important.
X variables measured without error
It sometimes turns out that the X variables are not known precisely. For example, if you wish to investigate the relationship of illness to second-hand cigarette smoke, it is surprisingly difficult to get an estimate of the "dose" of cigarettes that a worker has been exposed to.
This general problem is called the "error in variables" problem and has a long history in statistics. A detailed discussion of this issue is beyond the scope of these notes.
The uncertainty in each X variable should be assessed.
6.1.4 Obtaining Estimates<br />
The same principle of least squares as in simple linear regressi<strong>on</strong> is used to obtain estimates. In general, the<br />
sum of deviati<strong>on</strong>s between the predicted and observed values is computed, and the regressi<strong>on</strong> surface that<br />
minimizes this value is the final relati<strong>on</strong>ship.<br />
The estimated intercept and slopes can be compactly expressed using matrix notati<strong>on</strong><br />
̂β = (X ′ X) −1 X ′ Y<br />
but details are bey<strong>on</strong>d the scope of these notes. Hand <str<strong>on</strong>g>for</str<strong>on</strong>g>mulae are all but impossible except <str<strong>on</strong>g>for</str<strong>on</strong>g> trivially<br />
small examples - let the computer do the work. Of <str<strong>on</strong>g>course</str<strong>on</strong>g> this implies that the scientist has the resp<strong>on</strong>sibility<br />
to ensure that the brain in engaged be<str<strong>on</strong>g>for</str<strong>on</strong>g>e putting the package in gear!<br />
As with all estimates, a measure of precision can be obtained. As before, this is the standard error of each of the estimates. Again, there are computational formulae but, in this age of computers, these are not important. As before, approximate 95% confidence intervals for the corresponding population parameters are found as estimate ± 2 × se. Most packages will compute the 95% confidence intervals for the slopes as well.
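A rough sketch of where the standard errors and the estimate ± 2 × se intervals come from, using the standard least-squares result Var(β̂) = MSE · (X′X)⁻¹ (the data are again illustrative, not the blood-pressure table):

```python
import numpy as np

# Illustrative data: intercept column plus two predictors.
X = np.array([
    [1.0, 50, 55], [1.0, 20, 47], [1.0, 30, 65],
    [1.0, 30, 47], [1.0, 50, 58], [1.0, 60, 46], [1.0, 40, 70],
])
Y = np.array([120.0, 141, 126, 117, 129, 123, 132])
n, p = X.shape

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ Y

resid = Y - X @ beta_hat
mse = resid @ resid / (n - p)          # estimate of sigma^2 (the MSE)
se = np.sqrt(mse * np.diag(XtX_inv))   # standard error of each estimate

# Approximate 95% confidence interval: estimate +/- 2 * se
lower, upper = beta_hat - 2 * se, beta_hat + 2 * se
for b, lo, hi in zip(beta_hat, lower, upper):
    print(f"{b:9.3f}  [{lo:9.3f}, {hi:9.3f}]")
```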
©2012 Carl James Schwarz, November 23, 2012
CHAPTER 6. MULTIPLE LINEAR REGRESSION
Once the fit has been obtained, the fit of the model can be assessed in various ways, as outlined below.

The overall fit of the model is assessed using a Whole Model Test that is traditionally placed in an ANOVA table. This test examines if there is at least one X variable that seems to be marginally related to the Y values. Usually, it is of little interest.

The individual marginal contribution of each X variable (how each X variable affects the response holding all the other X variables constant) can be assessed directly either from the reported estimates and standard errors or from an Effect Test – these are exactly equivalent.
Formal tests of hypotheses about the marginal contribution of each variable can also be done. Usually, these are only done on the slope parameters, as these are typically of most interest. The null hypothesis is that the population marginal slope of a particular X variable is 0, i.e. there is no marginal relationship between Y and that particular X. More formally, the null hypothesis for the Xᵢ variable is:

H: βᵢ = 0

Again notice that the null hypothesis is ALWAYS in terms of a population parameter and not in terms of a sample statistic.
The alternate hypothesis is typically chosen as:

A: βᵢ ≠ 0

although one-sided tests looking for either a positive or negative slope are possible.

The test statistic is found as

T = (bᵢ − 0) / se(bᵢ)

and is compared to a t-distribution with the appropriate degrees of freedom to obtain the p-value. This is usually done automatically by most computer packages. The p-value is interpreted in exactly the same way as in ANOVA, i.e. it measures the probability of observing this data if the hypothesis of no relationship were true.
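The test statistic and its two-sided p-value can be sketched as follows. The estimate, its standard error, and the degrees of freedom below are made-up numbers; SciPy's t distribution supplies the tail probability.

```python
import numpy as np
from scipy import stats

# Hypothetical values from a fitted multiple regression:
b_i, se_b_i = 0.45, 0.18   # estimated slope and its standard error
df = 9                     # error degrees of freedom, n - p

T = (b_i - 0) / se_b_i                 # test statistic for H: beta_i = 0
p_value = 2 * stats.t.sf(abs(T), df)   # two-sided p-value

print(T, p_value)
```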
It is also possible to obtain tests for sets of predictors (e.g. can several X variables be simultaneously dropped from the model?), as will be seen later in the notes.

Finally, if there are a large number of X variables, is there an objective way to decide which subset of the X variables is useful in predicting Y? Again, this is deferred until later in this chapter.
6.1.5 Predictions

Once the best-fitting model is found, it can be used to make predictions for new sets of X.

There are two types of predictions that are commonly made. It is important to distinguish between them, as these two intervals are the source of much confusion in regression problems.
First, the experimenter may be interested in predicting a SINGLE future individual value for a particular set of X. Second, the experimenter may be interested in predicting the AVERAGE of ALL future responses at a particular set of X.³ The prediction interval for an individual response is sometimes called a confidence interval for an individual response, but this is an unfortunate (and incorrect) use of the term confidence interval. Strictly speaking, confidence intervals are computed for fixed unknown parameter values; prediction intervals are computed for future random variables.

Both of the above intervals should be distinguished from the confidence intervals for the slopes.

In both cases, the estimate is found in the same manner – substitute the new set of X into the equation and compute the predicted value Ŷ. In most computer packages this is accomplished by inserting a new “dummy” observation in the dataset with the value of Y missing, but the values of X present. The missing Y value prevents this new observation from being used in the fitting process, but the X values allow the package to compute an estimate for this observation.
What differs between the two predictions are the estimates of uncertainty.

In the first case (predicting a single value), there are two sources of uncertainty involved in the prediction. First, there is the uncertainty caused by the fact that the estimated line is based upon a sample. Then there is the additional uncertainty that the value could be above or below the predicted line. This interval is often called a prediction interval at a new X.

In the second case (predicting the mean of future responses), only the uncertainty caused by estimating the line based on a sample is relevant. This interval is often called a confidence interval for the mean at a new X.

The prediction interval for an individual response is typically MUCH wider than the confidence interval for the mean of all future responses because it must account for the uncertainty from the fitted line plus the individual variation around the fitted line.
Many textbooks have the formulae for the se for the two types of predictions but, again, there is little to be gained by examining them. What is important is that you read the documentation carefully to ensure that you understand exactly what interval is being given to you.
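A sketch of how the two intervals differ, assuming the usual formulas se(mean) = √(MSE·h₀) and se(individual) = √(MSE·(1+h₀)), where h₀ = x₀′(X′X)⁻¹x₀ is the leverage of the new point. The data and the new point x₀ are illustrative, not the blood-pressure example.

```python
import numpy as np

# Illustrative fit: intercept plus two predictors.
X = np.array([
    [1.0, 50, 55], [1.0, 20, 47], [1.0, 30, 65], [1.0, 30, 47],
    [1.0, 50, 58], [1.0, 60, 46], [1.0, 40, 70], [1.0, 55, 42],
])
Y = np.array([120.0, 141, 126, 117, 129, 123, 132, 123])
n, p = X.shape

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ Y
mse = np.sum((Y - X @ beta_hat) ** 2) / (n - p)

x0 = np.array([1.0, 45, 60])      # new set of X (with the intercept term)
y0_hat = x0 @ beta_hat            # same point estimate for both intervals

h0 = x0 @ XtX_inv @ x0            # leverage of the new point
se_mean = np.sqrt(mse * h0)       # for the MEAN of all future responses
se_pred = np.sqrt(mse * (1 + h0)) # for a SINGLE future response

print("CI for mean:      ", y0_hat - 2 * se_mean, y0_hat + 2 * se_mean)
print("PI for individual:", y0_hat - 2 * se_pred, y0_hat + 2 * se_pred)
```

The extra "1 +" inside se_pred is the individual variation around the fitted surface, which is why the prediction interval is always the wider of the two.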
6.1.6 Example: blood pressure

Blood pressure tends to increase with age, body mass, and stress. To investigate the relationship of blood pressure to these variables, a sample of men in a large corporation was selected. For each subject, their age (years), body mass (kg), and a stress index (ranging from 0 to 100) were recorded along with their blood pressure.

The raw data are presented in the following table:

³ There is actually a third interval, the mean of the next “m” individual values, but this is rarely encountered in practice.
Age       Blood Pressure   Body Mass   Stress Index
(years)   (mm Hg)          (kg)        (no units)
 50           120              55           69
 20           141              47           83
 20           124              33           77
 30           126              65           75
 30           117              47           71
 50           129              58           73
 60           123              46           67
 50           125              68           71
 40           132              70           77
 55           123              42           69
 40           132              33           74
 40           155              55           86
 20           147              48           84
 31             .              53           86
 32           146              59            .
JMP Analysis

The raw data are also available in a JMP data sheet called bloodpress.jmp, available from the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms.

The data have been entered with rows corresponding to the different subjects and columns corresponding to the different variables:
Notice that the response variable is continuous, as are the other variables.⁴ Also notice that the blood pressure value is missing for one subject – it cannot be used in the analysis, but predictions can be made for this subject as all the X values are present. One subject is missing one of the X variables – this subject cannot be used in the fitting process nor for making predictions. The remaining sample size is only 13 subjects.

As usual, the researcher needs to think about why certain values are missing.
It is also interesting to note that measurement error in the X variables could be a concern. For example, it is highly unlikely that the first subject is exactly 20.000000 years old! People usually truncate their age when asked, e.g. even on the day before their 21st birthday, a person will still respond that their age is 20 years old. Here the error in aging ranges from about 5% of the value (when age is around 20 years old) to about 2% (when the age is around 50 years old). How was weight collected? If the subjects were actually weighed, the actual number may not be in dispute (i.e. it is unlikely that the scale is wrong), but then the weight includes shoes, clothes, and ???? If the weight is a recalled measurement, many people under-report their actual weight, often by quite a margin. And how is stress measured? It is likely an index based on a survey, but it is not even clear how to numerically measure stress – after all, stress can’t simply be measured like temperature.

Begin by plotting the variables against each other – a simple way is a scatter plot matrix, available under the Analyze->MultiVariateMethods->Multivariate platform:

⁴ In actual fact, these variables have been discretized but, as the discretization interval is small relative to typical values, they can be treated as being continuous.
The scatter plot matrix shows no strong simple relationships between pairs of variables. Rather surprisingly, weight seems to decrease with age, and there appears to be a general increase of blood pressure with weight.

These pairwise scatter plots are primarily useful for checking for outliers and other problems in the data – often a multivariate relationship is too complex to be seen in simple pairwise plots.
We will fit the model where the response variable (blood pressure) is modeled as a function of the three predictor variables (age, weight, and stress index). Using the shorthand notation discussed earlier, the model is

BloodPressure = Age Weight Stress

This model is fit using the Analyze->Fit Model platform:
The X variables can be listed in any order.

The output from the Analyze->Fit Model platform is voluminous and cannot be displayed in one panel, so it is necessary to look at several parts in more detail.

Because of the missing values, only 13 subjects could be used in the model fit:
The number of cases actually used in the fit should always be ascertained because, in large datasets, the missing-value pattern may not be easily discerned.

First, assess the overall fit of the model by examining the plot of the actual blood pressure vs. the predicted blood pressure. If the model made exact predictions, then the points on the plot would all lie perfectly on the 45° line. The plot from this fit:

shows that most points lie fairly close to the 45° line. As well, there are no points that appear to have undue leverage on the fit, as there is a general scatter around the 45° line.

The residual plot:
also shows a random scatter of residuals around the value of 0, with no apparent pattern.

The whole model test, i.e. whether any of the X variables provide information on predicting Y, is found in the Analysis of Variance table:

The p-value is very small, and so there is good evidence that at least one X variable appears to predict the blood pressure. Of course, at this point, it is unclear which X variables are good predictors and which X variables may be poor predictors.

The fitted regression equation is found by looking at the Parameter Estimates area:
and is:

Predicted BloodPressure = −61.3 + 0.45(Age) − 0.087(Stress) + 2.37(Weight)

These coefficients are interpreted as the MARGINAL increase in blood pressure when each variable changes by 1 unit AND ALL OTHER VARIABLES REMAIN FIXED. For example, the coefficient of 0.45 for age indicates that the estimated blood pressure increases by 0.45 units for each year of increase in age, assuming that the stress index and weight remain constant. The concept of marginality, i.e. the marginal increase in Y when a single X variable is changed but all other X variables are held fixed, is the crucial concept in multiple regression. In some cases, for example polynomial regression, it is impossible to hold all other X variables fixed, as you will see later in this chapter.

The sign of the coefficient for stress is somewhat surprising but, as you will see in a few minutes, is nothing to worry about.
Are there any X variables that don’t appear to be useful in predicting blood pressure? The Effect Tests or the Parameter Estimates table provide some clues:

The p-values from the Effect Tests table and the Parameter Estimates table are identical; the F-statistic is simply the t-ratio squared. These are MARGINAL tests, i.e. is a particular X variable useful in predicting the blood pressure given that all other variables remain in the model? For example, the test for age examines if blood pressure changes with age after adjusting for stress and weight. The test for stress examines if blood pressure changes with stress after adjusting for age and weight.
In this example, the p-value for stress appears to be not statistically significant. This would imply that blood pressure does not seem to increase with stress after adjusting for age and weight. This would indicate that perhaps stress could be dropped from the model, and a final model using only age and weight may be suitable. Consequently, the negative sign on the coefficient is not really worrisome.
Again, this concept of marginality is crucial for the proper interpretation of the statistical tests. If two X variables are related, it is possible that both of the statistical tests could be non-significant, but this does not imply that both variables can be dropped from the model. Later in this chapter (Section 6.4), it will be shown how to test if multiple variables can be simultaneously dropped from the model.

The leverage plots should also be examined to see that any relationship between the predictor and response variables is not highly dependent upon a single (high-leverage) point:

Leverage plots, in general, examine the new information in each X variable for predicting Y after adjusting for all the other variables in the model. The general theory is presented in Section 6.2.6. Two features of the plot should be examined. The general statistical significance of the X variable is found by considering the slope of the line and whether the confidence curves contain the horizontal line:
We see that the confidence curves in the leverage plots for age and weight both do not contain the horizontal line. However, the confidence curve on the leverage plot for stress includes the horizontal line, indicating that this variable’s contribution to predicting blood pressure is not statistically useful.

The second feature of leverage plots that should be examined is the distribution of points along the X axis of the leverage plot. There should be a fairly even distribution along the bottom axis, and the fitted line in the leverage plot should not be heavily influenced by a few points with high leverage.

By clicking on the red triangle associated with the fit:

it is possible to save various predictions to the data table. For example, save the predicted values and the two types of confidence intervals (for the mean and for individuals):
Notice that for observation 14, only the blood pressure was missing, and so predictions of the blood pressure for that individual can be made. However, for individual 15, at least one of the X variables had a missing value, and so no predictions can be made.

The predictions are found simply by substituting the X values into the prediction equation. As in simple linear regression, there are two different confidence intervals. The confidence interval for the MEAN response would be useful for predicting the average blood pressure over many people with the same values of X as recorded. The confidence interval for the INDIVIDUAL response would be useful for predicting the blood pressure for a single future individual with those particular X values. A common error is to confuse these two types of intervals.
As in simple linear regression, a common way to make predictions is to add rows to the end of the data table with the Y variable deliberately set to missing and the X values set to those of interest. These rows are NOT used in the model fitting but, because the X set is complete, predictions can be made.
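The same dummy-row trick can be sketched with pandas and NumPy. The column names and values below are hypothetical, not the JMP table: rows with Y missing are excluded from the fit but still receive predictions when all X values are present.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one row has bp missing (gets a prediction),
# one row has weight missing (gets no prediction).
df = pd.DataFrame({
    "age":    [50, 20, 30, 50, 40, 31, 32],
    "weight": [55, 47, 65, 68, 70, 53, np.nan],
    "bp":     [120, 141, 126, 125, 132, np.nan, 146],
})

fit_rows = df.dropna()  # case-wise deletion: only complete rows are fit
X_fit = np.column_stack([np.ones(len(fit_rows)), fit_rows["age"], fit_rows["weight"]])
beta, *_ = np.linalg.lstsq(X_fit, fit_rows["bp"].to_numpy(), rcond=None)

# Predict wherever all X values are present (Y itself may be missing).
have_x = df[["age", "weight"]].notna().all(axis=1)
X_all = np.column_stack([np.ones(have_x.sum()), df.loc[have_x, "age"], df.loc[have_x, "weight"]])
df.loc[have_x, "predicted_bp"] = X_all @ beta

print(df)  # the row missing bp gets a prediction; the row missing weight does not
```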
If the residuals are saved to the data table, a normal probability plot of the residuals can be made using the Analyze->Distribution platform on the saved residuals.

Similarly, the residuals can be plotted against each X variable in turn to assess if there is a linear marginal relationship between Y and each X variable. Each of these residual plots should show a random scatter around zero.

It is also possible to do inverse predictions, but this is beyond the scope of these notes.
There are lots of other interesting features of the Analyze->Fit Model platform that are beyond the scope of these notes.
6.2 Regression problems and diagnostics

6.2.1 Introduction

“All models are wrong, but some are useful.” – G.E.P. Box, on page 424 of Empirical Model-Building and Response Surfaces (1987), co-authored with Norman R. Draper.

This famous quote implies that no study ever satisfies the assumptions made when modeling the data. However, unless the violations are extreme, perhaps the model can still be useful for making predictions.

In this section, we will take a detailed look at a number of diagnostic measures to assess the fit of our model to the data.
6.2.2 Preliminary characteristics

Before building complex models, the analyst should become familiar with the basic properties of their data. This is accomplished by:

• Examine the RRR’s of experimental and survey design as they relate to this study.
• What is the scale (nominal, ordinal, interval, ratio) of each variable?
• Which are the predictor and which are the response variables?
• What is the type (discrete, continuous, discretized continuous) of each variable?
Then do some basic plots and tabulations to spot potential problems in the data:

• Missing values. Examine the pattern of missing values. Most regression packages practice case-wise deletion, i.e. any observation (row) that is missing any of the X variables or the Y variable is not used in the analysis. If you have a large dataset with many X variables, even a small percentage of missing values can lead to many rows being deleted from the analysis. Think about how the missing values came about – are they MCAR, MAR, or IM? JMP has a nice feature to tabulate the pattern of missing values under the Tables menu.

• Single-variable descriptive statistics. For each variable in the dataset, do some basic descriptive statistics and plots (e.g. histograms, dot-plots, box-plots) to identify potentially extreme observations. Check that all values are plausible, e.g. if one variable records the sex of the subject, only two possible values should be recorded; it is unlikely that a woman has 20 natural children; it is unlikely that a human male is more than 3 m tall; etc.
• Pairwise plots. Create bivariate plots of all the variables. Check for unusual-looking observations. These may be perfectly valid observations, but they should be examined in more detail to make sure. A casement plot (a matrix of pairwise scatter plots) can be created easily in JMP using the Analyze->MultiVariateMethods->Multivariate platform.
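The case-wise deletion point above is easy to underestimate, so here is a small simulation of the effect (purely illustrative; the 5% missing rate and 10 variables are arbitrary). With missing values scattered independently across columns, a row survives only if every one of its values is present.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n, k = 200, 10
data = pd.DataFrame(rng.normal(size=(n, k)), columns=[f"x{i}" for i in range(k)])

# Make each cell missing independently with probability 0.05 (5% of values).
miss = rng.random((n, k)) < 0.05
data = data.mask(miss)

# Case-wise deletion: a row is dropped if ANY variable is missing.
complete = data.dropna()
print(f"{len(complete)} of {n} rows survive case-wise deletion")
```

With only 5% of individual values missing, roughly 0.95¹⁰ ≈ 60% of rows survive, so about 40% of the dataset is silently discarded.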
6.2.3 Residual plots

After the model is fit, compute the residuals, which are simply the VERTICAL differences between the observed and predicted values, ε̂ᵢ = Yᵢ − Ŷᵢ. Most computer packages will compute and plot residuals easily.

The basic assumption about the VERTICAL discrepancies was that they have a mean of zero and a CONSTANT variance σ². We estimated the variance by the MSE in the ANOVA table.

There are several different types of residuals that can be computed and plotted:

• Standardized residual. This is simply computed as zᵢ = ε̂ᵢ / √MSE and is an attempt to create residuals with a mean of 0 and a variance of 1, i.e. like a standard normal distribution. Because all the residuals are divided by the same value, the pattern seen in the standardized residuals will be the same as that seen in the ordinary residuals.
• Studentized residual. The precision of the predictions changes at different parts of the regression line. You saw earlier that the confidence band for the mean response gets wider as the prediction point moves further away from the center of the data. The studentized residual (see the book for computational details) attempts to standardize each residual by its approximate precision. Because each residual is adjusted individually, plots of the studentized residuals will look slightly different from those of the regular or standardized residuals, but they will be similar.

• Jackknifed residual. Less commonly computed, jackknifed residuals are computed by fitting a regression line after dropping each point in turn, and then finding the residual. For example, if there were 4 data points, the jackknifed residual for the first point would be the difference between the observed value and the predicted value based on a regression line fit to points 2, 3, and 4 only. The jackknifed residual for the second observation would be the difference between the observed value and the predicted value based on the 1st, 3rd, and 4th observations. Plots based on these residuals will appear similar, but not exactly the same as, plots based on the other residuals.
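The three residual types above can be sketched for a tiny simple-regression dataset (the numbers are illustrative). For the studentized version, the internally studentized form rᵢ = ε̂ᵢ / √(MSE(1 − hᵢᵢ)), with hᵢᵢ taken from the hat matrix, is one common way of "standardizing each residual by its approximate precision".

```python
import numpy as np

# Tiny illustrative dataset: intercept plus one predictor.
X = np.array([[1.0, 1], [1.0, 2], [1.0, 3], [1.0, 4], [1.0, 5], [1.0, 6]])
Y = np.array([1.1, 1.9, 3.2, 3.8, 5.3, 5.7])
n, p = X.shape

XtX_inv = np.linalg.inv(X.T @ X)
H = X @ XtX_inv @ X.T          # hat matrix; h_ii is the leverage of point i
beta = XtX_inv @ X.T @ Y
resid = Y - X @ beta
mse = resid @ resid / (n - p)

standardized = resid / np.sqrt(mse)                    # same pattern as raw residuals
studentized = resid / np.sqrt(mse * (1 - np.diag(H)))  # each scaled by its own precision

# Jackknifed residuals: refit with point i removed, then predict point i.
jackknifed = np.empty(n)
for i in range(n):
    keep = np.arange(n) != i
    b_i, *_ = np.linalg.lstsq(X[keep], Y[keep], rcond=None)
    jackknifed[i] = Y[i] - X[i] @ b_i

print(standardized)
print(studentized)
print(jackknifed)
```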
Several plots can be constructed. First, look at the univariate distribution of the residuals. Which observations correspond to the largest negative and positive residuals?
Second, plot the residuals against each predictor variable, against the PREDICTED Y values, and against the order in which the data were collected (this may be, but is not necessarily, the order of the observations in the dataset). Don’t plot the residuals against the observed Y values, because you will see strange patterns that are artifacts of the plot.⁵ A good residual plot will show random scatter around zero; bad residual plots will show a definite pattern. Typical residual plots are illustrated below – with small datasets, the patterns will not be as clear cut.

⁵ Basically, negative residuals will be associated with smaller Y values, and these will increase as Y increases, and then crash and rise and then crash and rise again.
With small datasets, don’t over-analyze the plots – only gross deviations from the ideal plots are of interest.

Modern alternatives to residual plots are to plot the absolute value of the residuals and fit LOWESS curves through them. Consult our Stat 400 course (Data Analysis) for details.

Many books present formal tests for residuals – I find these not particularly useful, and prefer the simple residual plots. However, one useful diagnostic is the Durbin-Watson test for autocorrelation – consult the chapter on trend analysis in this collection for details.
Finally, many books also present what are known a normal probability plots to assess the normality of<br />
the residuals. Again, I have found these to be less than useful.<br />
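The Durbin-Watson statistic mentioned above is easy to compute directly from the residuals. The following sketch uses simulated data (not one of the course datasets) purely to illustrate the calculation:

```python
import numpy as np

def durbin_watson(resid):
    """Durbin-Watson statistic: the sum of squared successive differences
    of the residuals divided by the residual sum of squares.  Values near 2
    suggest no lag-1 autocorrelation; values near 0 suggest positive
    autocorrelation, and values near 4 negative autocorrelation."""
    resid = np.asarray(resid, dtype=float)
    return np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2)

# Residuals from a correctly specified model with independent errors
# should give a statistic near 2.
rng = np.random.default_rng(1)
x = np.linspace(0.0, 10.0, 200)
y = 3.0 + 0.5 * x + rng.normal(size=200)
X = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta
print(round(durbin_watson(resid), 2))   # close to 2
```

A formal significance assessment still requires the Durbin-Watson critical values; this sketch only shows where the number comes from.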
6.2.4 Actual vs. Predicted Plot

In multiple regression, it is very difficult to look at plots of Y vs. each X variable and come to anything very useful. In general, you are trying to view a multi-dimensional space in two dimensions.

A plot of the actual Y vs. the predicted Y's is useful to assess how well the model does in predicting each observation. This plot is produced automatically by JMP and many other packages. In some packages, you will have to save the predicted values and do the plot yourself.
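A small numerical aside (simulated data; variable names are illustrative): when the model contains an intercept, the squared correlation between the actual and predicted values equals R², so the actual-vs-predicted plot is a direct visual display of overall fit:

```python
import numpy as np

# Simulated data standing in for a package's saved predicted values.
rng = np.random.default_rng(42)
n = 50
X = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
y = X @ np.array([10.0, 2.0, -1.5]) + rng.normal(scale=2.0, size=n)

beta, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ beta                         # the predicted Y values

# Tight clustering of (y_hat, y) around the 45-degree line means a good
# fit; numerically, corr(y, y_hat)^2 equals R^2 with an intercept present.
r2 = 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)
corr = np.corrcoef(y, y_hat)[0, 1]
print(np.isclose(r2, corr ** 2))         # True
```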
6.2.5 Detecting influential observations

An influential observation is defined as an observation whose deletion greatly changes the results of the regression. There are many techniques available for spotting individual influential points; however, many of these methods will fail to detect pairs of influential points in close proximity to each other.
Cook's D

One popular measure of an observation's influence is Cook's distance. This statistic measures the extent to which the regression coefficients change when each individual observation is deleted. It is a summary measure of the impact of the observation's deletion and is a weighted sum 6 of (β̂_0 − β̂_{0(−i)})², (β̂_1 − β̂_{1(−i)})², ..., (β̂_k − β̂_{k(−i)})², where β̂_{k(−i)} is the regression coefficient for the k-th variable after dropping the i-th observation.

If a point has no effect on the fit, then D_i will be zero. Large values of D_i indicate points that have a large influence on the fit. There is no easy rule for determining which values of D_i are extreme. 7 A general rule of thumb is to look at the distribution of the D's and examine those observations corresponding to extreme values.

6 Refer to the original paper for the exact formula.

7 An often-quoted rule is to look at values of D_i that are greater than 1, but recent work has shown that this rule does not perform effectively.
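The notes defer to Cook's original paper for the exact formula; the standard textbook version (an assumption here) weights the coefficient changes by X′X and scales by p·s², which reduces to a closed form involving only the residual and hat value. The sketch below, on simulated data, checks the closed form against the literal delete-one-and-refit definition:

```python
import numpy as np

def cooks_distance(X, y):
    """Cook's D via the closed form D_i = e_i^2 h_i / (p s^2 (1 - h_i)^2),
    where h_i is the i-th hat (leverage) value, p the number of estimated
    coefficients, and s^2 the residual mean square."""
    n, p = X.shape
    h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)
    beta = np.linalg.solve(X.T @ X, X.T @ y)
    e = y - X @ beta
    s2 = e @ e / (n - p)
    return e ** 2 * h / (p * s2 * (1 - h) ** 2)

def cooks_distance_by_deletion(X, y):
    """The definition: refit without each observation and form the weighted
    sum of squared changes in the regression coefficients."""
    n, p = X.shape
    beta = np.linalg.solve(X.T @ X, X.T @ y)
    e = y - X @ beta
    s2 = e @ e / (n - p)
    D = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        b_i = np.linalg.solve(X[keep].T @ X[keep], X[keep].T @ y[keep])
        d = beta - b_i
        D[i] = d @ (X.T @ X) @ d / (p * s2)
    return D

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(30), rng.normal(size=(30, 2))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=30)
print(np.allclose(cooks_distance(X, y), cooks_distance_by_deletion(X, y)))  # True
```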
Hats

An oddly named set of statistics are the hats, or leverage values. These are computed under the idea that if a point has extreme influence, the regression should predict it exactly. Consequently, the hats are computed from what is known (for historical reasons) as the hat matrix, which is defined as X(X′X)⁻¹X′ and should not be attempted by hand! If a hat value is larger than about twice the average hat value, this is usually taken to indicate an influential point. There are more formal rules for checking the hat values, but these are seldom worthwhile.
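A brief sketch (simulated data) of why "twice the average hat value" is a convenient screen: the hat values always sum to p, the number of coefficients, so their average is p/n regardless of the data:

```python
import numpy as np

# Hat (leverage) values are the diagonal of H = X (X'X)^{-1} X'.
# Their sum is always p, so the average is p/n and a common screen
# flags observations with h_i > 2p/n.
rng = np.random.default_rng(7)
n = 40
X = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])
X[0, 1:] += 6.0                      # make one row deliberately extreme

h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)
p = X.shape[1]
flagged = np.where(h > 2 * p / n)[0]
print(0 in flagged)                  # True: the extreme row is flagged
```

(The explicit matrix inverse is fine for a toy illustration; production code would use a QR decomposition instead.)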
Caution

It is clear that some observations must be the most extreme in every sample, and so it would be silly to automatically delete these extreme observations without careful consideration of the underlying data! The purpose of Cook's D and other similar statistics is to warn the analyst that certain observations require additional scrutiny. Don't data snoop simply to polish the fit!
6.2.6 Leverage plots

These are likely the most useful of the diagnostic tools for spotting influential observations and are produced by many computer packages.

The leverage plots produced by JMP are examples of what are also called partial regression plots or adjusted variable plots. They are constructed for each individual variable. Suppose that we are regressing Y on four predictors X_1, ..., X_4. The leverage plot for X_1 is constructed as follows:

1. Find the residuals when Y is regressed against all the other variables except X_1, i.e. fit the model Y = X_2 X_3 X_4. Denote this residual as ε̂_{Y|X(−1)}, where the −1 indicates that the first variable was dropped from the set of X's.

2. Find the residuals when X_1 is regressed against the other X variables, i.e. fit the model X_1 = X_2 X_3 X_4. Denote this residual as ε̂_{X_1|X(−1)}, where the −1 indicates that the first variable was dropped from the set of X's.

3. Plot the first residual against the second residual for each observation. 8

Now if X_1 has no further information about Y (after accounting for the other X's), then the X_1 variable really isn't needed, and so all the first residuals should be centered around zero with random scatter.

8 JMP actually adds the mean of Y and X_1 to the residuals before plotting, but this does not change the shape of the plot.
But suppose that X_1 is important in predicting Y. Then the residuals from the regression of Y on the other X variables will be missing the contribution of X_1, and the residual plot will show an upward (or downward) trend relative to the other residuals. In fact, if you fit a regression line to the leverage plot, the slope will equal the slope in the full regression model. If the contribution of X_1 is not linear, then the plot will show a non-linear relationship.

Why is X_1 regressed against the other X variables? Recall that the interpretation of a slope in multiple regression is the MARGINAL contribution after adjusting for all other variables in the model. In other words, the slope reflects the NEW information in X_1 after adjusting for the other X's. How is the new information in X_1 found? Yes, by regressing X_1 against the other variables. For example, suppose that X_1 was an exact copy of another variable in the dataset. Then the second residuals would all be zero, indicating no new information (why?). So, if the leverage plot shows a very thin vertical band of points, this may be an indication that a certain variable does NOT have useful marginal information, i.e. is redundant given the other variables. This condition is known as multi-collinearity and is discussed later in this chapter.
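The claim that the slope fitted to a leverage plot equals the corresponding slope in the full regression is the Frisch-Waugh-Lovell result, and it can be verified numerically. A sketch on simulated data (variable names illustrative):

```python
import numpy as np

def ols(X, y):
    """Least-squares coefficients (X includes an intercept column)."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

rng = np.random.default_rng(3)
n = 60
x1, x2, x3 = rng.normal(size=(3, n))
y = 5 + 2.0 * x1 - 1.0 * x2 + 0.5 * x3 + rng.normal(size=n)

ones = np.ones(n)
full = np.column_stack([ones, x1, x2, x3])
b_full = ols(full, y)                  # b_full[1] is the slope for x1

others = np.column_stack([ones, x2, x3])
e_y = y - others @ ols(others, y)      # step 1: residuals of y on the rest
e_x1 = x1 - others @ ols(others, x1)   # step 2: residuals of x1 on the rest

slope = (e_x1 @ e_y) / (e_x1 @ e_x1)   # step 3: slope of e_y on e_x1
print(np.isclose(slope, b_full[1]))    # True: leverage-plot slope = full slope
```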
If a single observation has high leverage, the leverage plot will show the observation as an outlier. The diagram below demonstrates some of the important cases for leverage plots:
In JMP, and many other packages, the points on these plots are hot-linked to the data sheet. By clicking on these points, you can identify the observation in the data sheet.

The concept of leverage plots is sufficiently important and non-obvious that a numerical example will be examined. In JMP, open the Fitness.jmp dataset from the JMP sample dataset library. This dataset consists of measurements taken on subjects: their age, weight, oxygen consumption, time to run a mile, and three measurements of their pulse rate.

The first few lines of the data file are:

Fit a model to predict oxygen consumption as the Y variable with age, weight, runtime, and the three pulse measurements as the X variables. The estimated slopes are:

and the leverage plot for Runtime is:
To reproduce this leverage plot, first fit the model for oxygen consumption, dropping the run-time variable, and save the residuals to the data sheet.

Next, regress run-time against the other X variables and save the residuals to the data sheet:
This will give the data sheet with two new columns added:

Finally, plot the Residual of Oxygen on all but runtime vs. the Residual of runtime on other X variables and fit a line through that plot using the Analyze->Fit Y-by-X platform:
You will see that this plot looks the same as the leverage plot (but the Y and X axes are scaled slightly differently) and that the slope on this plot (-2.639) matches the estimated slope seen earlier.

Leverage plots should be used with some caution. They will show the nature of the functional relationship with the variable, but not its exact form. Because these plots are constructed after adjusting for the other variables, a variety of curvature models should be investigated. Also, if the functional form of the other variables is incorrect (e.g. age² is needed but has not been added to the model), then the true nature of the relationship may be missed.

You can get JMP to save all the leverage pairs under the Save Columns pop-down menu.
6.2.7 Collinearity

It is often the case that many of the X variables are related to each other. For example, if you wanted to predict blood pressure as a function of several variables including height and weight, there is a strong relationship between these two latter variables. When the relationship among the predictor variables is strong, they are said to be collinear. This can lead to problems in fitting the model and in interpreting the results of a model fit. In this example, it is conceivable that you could increase the weight of a subject while holding height constant, but suppose the two variables were total hours of sunshine and total hours of clouds in a year. If one increases, the other must decrease.
Because the regression coefficients are interpreted as the MARGINAL contribution of each predictor, collinearity among the predictors can mask the contribution of a variable. For example, if both height and weight are fit in a model, then the marginal contribution of height (given weight is already in the model) is small; similarly, the marginal contribution of weight (given height is in the model) is also small. However, it would not be valid to say that the marginal contribution of both height and weight (together) is small. In Section 6.4, methods for testing if several variables can be deleted simultaneously from the model are presented.
If the predictor variables were perfectly collinear, the whole model-fitting procedure breaks down. It turns out that a certain matrix used in the model fitting cannot be numerically inverted (similar to trying to divide by zero) and no estimates are possible. If the variables are not perfectly collinear, many different sets of estimates can be found that give very nearly the same predictions!

Not all the story is bad: multicollinearity does not imply that the whole regression model is useless. Even if predictor variables are highly related, good predictions are still possible provided that you make predictions at values of X that are similar to those used in model fitting.
The basic tool for diagnosing potential collinearity is the variance inflation factor (VIF) for each regression coefficient. In JMP this is obtained by right-clicking on the table of parameter estimates after the Analyze->Fit Model platform is run. For example, the VIFs for the fitness dataset are:
The VIF is interpreted as the increase in the variance (se²) of the estimate compared to what would be expected if the variable were completely independent of all other predictor variables. The VIF equals 1 when a predictor is not collinear with the other predictors. VIFs that are very large, typically around 10 or higher, are usually taken as an indication of potential collinearity.

In the fitness dataset, there is evidence of collinearity in the average pulse rate during the run (Run Pulse) and the maximum pulse rate during the run (Max Pulse) variables. This is not unexpected.
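The VIF can also be computed directly from its definition, VIF_j = 1/(1 − R_j²), where R_j² is obtained by regressing the j-th predictor on all the others. A sketch on simulated data, with one predictor deliberately made a near-copy of another:

```python
import numpy as np

def vif(X):
    """Variance inflation factors for the columns of X (no intercept column):
    VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing column j on
    all the other columns plus an intercept."""
    n, k = X.shape
    out = np.empty(k)
    for j in range(k):
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        xj = X[:, j]
        fit = others @ np.linalg.lstsq(others, xj, rcond=None)[0]
        r2 = 1 - np.sum((xj - fit) ** 2) / np.sum((xj - xj.mean()) ** 2)
        out[j] = 1.0 / (1.0 - r2)
    return out

rng = np.random.default_rng(5)
n = 100
a = rng.normal(size=n)
b = rng.normal(size=n)
c = a + 0.1 * rng.normal(size=n)       # c is nearly a copy of a
X = np.column_stack([a, b, c])
v = vif(X)
print(v[1] < 2 and v[0] > 10 and v[2] > 10)   # True: only the a/c pair is collinear
```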
If collinearity is detected, remedial measures include dropping some of the redundant predictor variables, 9 or more sophisticated fitting methods such as ridge or robust regression (which are beyond the scope of this course).

9 An obvious question is how do you tell which variables are redundant? Common methods are principal component analysis of the X variables, or examining the correlation among the predictors. Seek help if you run into a problem of extreme multicollinearity.
6.3 Polynomial, product, and interaction terms

6.3.1 Introduction

The assumption of a marginal linear relationship between the response variable and the X variable is sometimes not true, and quadratic and (rarely) cubic or higher polynomial terms in X are often fit in order to approximate this non-linear relationship.

The basic way to deal with polynomial regression (i.e. quadratic and higher terms) is to create new predictor variables involving X², X³, .... Although not necessary with modern software, it is often a good idea to center variables that will be used in quadratic and higher relationships to avoid a high degree of collinearity among the terms. For example, replace X and X² by (X − X̄) and (X − X̄)², respectively. While the actual coefficients may change, the p-values for testing the linear and quadratic slopes are unaffected, and predictions are also unaffected; this is exactly analogous to what happens in regression when there is a unit change between imperial and metric units for some variable.
The model fit is

Y_i = β_0 + β_1 X_i1 + β_2 X²_i1 + ε_i

If the square term is called X_2, the model is:

Y_i = β_0 + β_1 X_i1 + β_2 X_i2 + ε_i

which now looks exactly like an ordinary multiple regression model.
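This reduction to ordinary multiple regression, and the effect of centering, can be sketched numerically (simulated data):

```python
import numpy as np

rng = np.random.default_rng(11)
x = np.linspace(1.0, 10.0, 40)
y = 2 + 1.5 * x - 0.1 * x ** 2 + rng.normal(scale=0.5, size=40)

# Quadratic regression is ordinary multiple regression on X and a new X^2
# column; a second parameterization centers X before squaring.
raw = np.column_stack([np.ones_like(x), x, x ** 2])
xc = x - x.mean()
cen = np.column_stack([np.ones_like(x), xc, xc ** 2])

b_raw = np.linalg.lstsq(raw, y, rcond=None)[0]
b_cen = np.linalg.lstsq(cen, y, rcond=None)[0]

# Different coefficients, identical predictions -- and centering removes
# most of the collinearity between the linear and quadratic columns.
print(np.allclose(raw @ b_raw, cen @ b_cen))                    # True
print(abs(np.corrcoef(x, x ** 2)[0, 1]),
      abs(np.corrcoef(xc, xc ** 2)[0, 1]))   # near 1 vs. essentially 0
```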
The rest of the model fitting, testing, etc. proceeds exactly as outlined in previous sections. However, there are two potential problems with polynomial models.

• Models should be hierarchical. This means that if you include a term involving X² in the model, you must include a term involving X. If you include the quadratic but not the linear term, you are restricting the quadratic curve to be a very special shape, which is not usually wanted in practice. This will be outlined in class.

• The interpretation of the estimates must be done with care. Normally, the estimated slopes are the MARGINAL contribution of this variable to the response, i.e. after holding all other variables constant. However, if the regression equation includes both X and X² terms, it is impossible to hold X fixed while changing X² alone.
What degree of polynomial is suitable? This is usually determined by fitting successively higher polynomial terms until the added term is no longer statistically significant, and then using the previous model. While polynomial models allow for some degree of curvature in the response, it is very rare to fit terms involving cubic and higher powers. The reason for this is that such curves seldom have biological plausibility, and they have wide oscillations in their predicted values.
The researcher should also investigate if a transform of the Y or X variable may linearize the relationship. For example, a plot of log(Y) vs. X may show a linear fit. Similarly, 1/X may be a more suitable predictor. 10 It is possible to use least squares to actually fit non-linear models where no transformation or polynomial terms provide a good fit. This is beyond the scope of this course.
6.3.2 Example: Tomato growth as a function of water

An experiment was run to investigate the yield of tomato plants as a function of the amount of water provided over the season. A series of plots were randomized to different watering levels, and at the end of the season the yield of the plants was determined.

The raw data follow:
Water  Yield
    6   49.2
    6   48.1
    6   48.0
    6   49.6
    6   47.0
    8   51.5
    8   51.7
    8   50.4
    8   51.2
    8   48.4
   10   51.1
   10   51.5
   10   50.3
   10   48.9
   10   48.7
   12   48.6
   12   48.0
   12   46.4
   12   46.2
   14   43.2
   14   42.6
   14   42.1
   14   43.9
   14   40.5
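The quadratic fit can also be reproduced outside JMP by ordinary least squares on the rows transcribed above. Two caveats: the water = 12 group shows only four rows in this transcription, and JMP centers the squared term at the sample mean, so coefficients may differ slightly from the JMP output quoted later. The qualitative conclusion (a concave response with a strong fit) is unaffected:

```python
import numpy as np

# Tomato data as transcribed above (water = 12 group has four rows here).
water = np.array([6]*5 + [8]*5 + [10]*5 + [12]*4 + [14]*5, dtype=float)
yld = np.array([49.2, 48.1, 48.0, 49.6, 47.0,
                51.5, 51.7, 50.4, 51.2, 48.4,
                51.1, 51.5, 50.3, 48.9, 48.7,
                48.6, 48.0, 46.4, 46.2,
                43.2, 42.6, 42.1, 43.9, 40.5])

# Quadratic model with the squared term centered at 10, mirroring the
# JMP parameterization: Yield = b0 + b1*Water + b2*(Water - 10)^2
X = np.column_stack([np.ones_like(water), water, (water - 10.0) ** 2])
b0, b1, b2 = np.linalg.lstsq(X, yld, rcond=None)[0]

fitted = X @ np.array([b0, b1, b2])
r2 = 1 - np.sum((yld - fitted) ** 2) / np.sum((yld - yld.mean()) ** 2)
w_best = 10.0 - b1 / (2.0 * b2)       # vertex of the fitted parabola
print(b2 < 0, round(r2, 2))           # concave response, R^2 near 0.9
```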
10 For example, should the fuel economy of a car be measured as miles/gallon (distance/consumption) or L/100 km (consumption/distance)?
JMP Analysis:

The raw data are also available in a JMP data sheet called tomatowater.jmp available from the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms.

The data are entered into JMP in the usual fashion: columns represent variables and rows represent observations. The scale of both variables should be continuous.

As usual, begin with a plot of the data:
The relationship is clearly non-linear and looks as if a quadratic may be suitable.

Before fitting the model, think about the assumptions required for the fit and assess if these are suitable for the data at hand.

There are two ways to fit simple polynomial models (i.e. those only involving polynomial terms in X) in JMP. If your regression model is a mixture of polynomial and other X variables, then the second method must be used.

In the first method, the Analyze->Fit Y-by-X platform can be used directly. For example, select the platform:
and choose Polynomial Fit:

which gives a plot of the fitted line:
and statistics about the fit:
The fitted curve is:

Yield = 57.726857 − 0.762·Water − 0.2928571·(Water − 10)²

Notice that JMP has automatically centered the quadratic term by subtracting the mean X of 10 from each value prior to squaring. As you will see in a few minutes, this has no effect upon the test of significance of the quadratic term, nor on the actual predicted values.
The ANOVA table can be used to examine if the linear and/or quadratic terms provide any predictive power. The table of estimates shows that the quadratic term is clearly statistically significant. Confidence intervals for the regression coefficients can be found in the usual fashion by right-clicking in the table and requesting the appropriate columns (not shown).

A residual plot is obtained in the usual fashion:
which shows no evidence of a problem.
If a cubic polynomial is fit (in the same fashion as the quadratic polynomial), you will see that the cubic term is not statistically significant, indicating that a quadratic model is sufficient.
Confidence bands for the mean response at each X and for an individual response at each X can also be obtained in the usual way:
Again, the scientist must understand the difference between the confidence bounds for each type of prediction, as outlined in earlier chapters.

The second way to fit polynomial models (and the only way when polynomial terms are intermixed with other variables) is to use the Analyze->Fit Model platform. First, variables corresponding to X² and X³ (if needed) must be created using the formula editor of JMP: 11

11 It is preferable to use JMP's formula editor rather than creating these variables outside of the data sheet because these columns will be hot-linked to the original column. If, for example, a value of X is updated, then the values of the squared and cubic terms will also be updated automatically.
and a portion of the resulting data table is shown below:
Note that the X variable was centered before squaring and cubing.

Now use the Analyze->Fit Model platform to fit using the water and water-squared terms:
The plot of actual vs. predicted shows a good fit:

The ANOVA table (not shown) can be used to assess the overall fit of the model as seen in earlier sections. The estimates match those seen earlier, as do the p-values:

Confidence intervals for the regression coefficients can be found in the usual fashion by right-clicking in the table and requesting the appropriate columns (not shown).
The leverage plot for the X² term shows that this polynomial term is required and is not influenced by any unusual values: 12
Confidence intervals for the mean response or individual responses are saved to the data table in the usual fashion (but are not shown in these notes):

12 Because of the hierarchical restriction, the leverage plot for the linear term is not of interest.
Finally, getting a plot of the actual fitted line takes a bit of work when using the Analyze->Fit Model platform. First, save the predicted values to the data table:
Then use the Overlay Plot under the Graph menu to plot the individual points and the predicted values:
and then join up the predicted values (and remove the fitted points)
to finally give the plot that we saw earlier (whew!). Unfortunately, there does not appear to be any way to draw a smooth curve short of getting predictions at many points between the observed values of X and drawing the curve through these smaller increments.
6.3.3 Polynomial models with several variables

The methods of the previous section can be extended to cases where several variables have quadratic or higher powers. It is also possible to include cross-products of these variables as well.

There are no conceptual difficulties in having multiple polynomial variables. However, the analyst must ensure that models are hierarchical (i.e. if higher powers or cross-products are included, then lower-order terms must also be included). Consequently, leverage plots of the lower-order terms are likely not very useful when higher-order terms are included in the model.

In practice, polynomial models are commonly restricted to quadratic terms or lower. The goal is not so much to elucidate the underlying mechanism of the response, but rather to get a good approximation to the response surface. Indeed, there is a whole suite of techniques (commonly called response surface methodology) used to fit and explore polynomial models in this context. Often, predictions of where the maximum or minimum response is found are important.
There are many excellent books available. JMP also has specialized tools in the Analyze->Fit Model platform to assist in the fitting of response surfaces. These are beyond the scope of these notes.
6.3.4 Cross-product and interaction terms
Recall that the interpretation of the regression coefficient associated with the i-th predictor variable is the
marginal (i.e. after keeping all other variables in the model fixed) increase in Y per unit change in X_i.
This marginal increase is the same regardless of the values of the other X variables.
But sometimes the contribution of the i-th variable depends upon the value of another, the j-th, predictor.
For example, suppose blood pressure tends to increase by .5 units for every kg increase in body mass for
people under 1.5 m in height, but tends to increase by .6 units for every kg increase in body mass for people
over 1.5 m in height. We would say that body mass interacts with the height variable. This concept is
very similar to the analogous interaction of factors in ANOVA models. 13
Consider a model where blood pressure depends upon age and height via the model:

BP = AGE HEIGHT

This corresponds to the formal statistical model of:

Y_i = β_0 + β_1 AGE_i + β_2 HEIGHT_i + ε_i

You can see that if age increases by 1 unit, then the value of Y increases by β_1 units regardless of the value
of height. Similarly, every time height increases by 1 unit, Y increases by β_2 regardless of the value of age.
Now consider the model written as:

BP = AGE HEIGHT AGE*HEIGHT

which corresponds to the formal statistical model of:

Y_i = β_0 + β_1 AGE_i + β_2 HEIGHT_i + β_3 AGE_i × HEIGHT_i + ε_i
The cross-product of age and height enters into the model as a new predictor variable. 14 Now look what happens
when age is increased by 1 unit. The value of Y increases not simply by β_1 but by β_1 + β_3 HEIGHT_i.
When height is small, the increase in Y per unit change in age is smaller than when height is large.
Similarly, an increase of 1 unit in the value of height will lead to an increase of β_2 + β_3 AGE_i. The effect
of height will be less for younger subjects than for older subjects.
The use of product terms in multiple regression can be easily extended to products involving more than
two variables, and, more importantly as discussed in Section 6.5.3, to products with indicator variables.
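As a numerical illustration of the interaction idea (the coefficient values below are invented for illustration, not estimates from any data set):

```python
def bp_with_interaction(age, height, b0, b1, b2, b3):
    """Predicted response for Y = b0 + b1*AGE + b2*HEIGHT + b3*AGE*HEIGHT."""
    return b0 + b1 * age + b2 * height + b3 * age * height

# Hypothetical coefficients, chosen only for illustration.
b = dict(b0=100.0, b1=0.5, b2=2.0, b3=0.1)

# The change in Y per 1-unit increase in AGE is b1 + b3*HEIGHT,
# so it depends on the value of HEIGHT:
slope_short = bp_with_interaction(41, 1.4, **b) - bp_with_interaction(40, 1.4, **b)
slope_tall = bp_with_interaction(41, 1.8, **b) - bp_with_interaction(40, 1.8, **b)
print(slope_short, slope_tall)  # b1 + b3*1.4 versus b1 + b3*1.8
```

Without the b3 cross-product term the two differences would be identical; with it, the marginal effect of age shifts with height, which is exactly the interaction described above.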
13 Indeed, this is not surprising as ANOVA is actually a special case of regression.
14 The actual X matrix would then have four columns. Column 1 would consist of all 1's; column 2 would consist of the values
of age; column 3 would consist of the values of height; and column 4 would contain the actual products of age and height for each
individual.
There is no real problem in fitting these models other than that the model must conform to the hierarchical
principle. This principle states that if terms like X_i X_j are in the model, so must be all lower order terms –
in this case, both X_i and X_j as separate terms must remain in the model. This is the same principle as you
saw for polynomial models.
6.4 The general linear test<br />
6.4.1 Introduction
In previous sections, you saw how to test if a specific regression coefficient in the population was zero using
the t-test provided by most computer packages. It is tempting, then, to try to test if multiple X variables
can be dropped simultaneously when their individual p-values are all not statistically significant.
Unfortunately this strategy often fails. The basic reason for its failure is that very often regression
coefficients are highly interrelated because their corresponding X variables are not orthogonal to each other.
For example, suppose that both height and weight were X variables in a model that was trying to predict
blood pressure. The tests of the hypotheses for the slopes for weight and height are MARGINAL tests, i.e. is
the slope associated with weight in the population zero assuming that all other variables (including height)
are retained in the model. Because of the high interdependency between height and weight, the p-value
for the test of marginal zero slope for weight may not be statistically significant. Similarly, the p-value for
the test of marginal zero slope for height (assuming that weight is in the model) may also be statistically
non-significant. However, both height and weight cannot be simultaneously removed from the model.
In order to test if a set of predictor variables can be simultaneously removed from the model, a General
Linear Test is performed. The mechanics of the test are:
1. Fit the full model, i.e. with all variables present. Find SSE_full from the full model.
2. Fit the reduced model, i.e. dropping the variables of interest. Find SSE_reduced from the reduced
model.
3. If the reduced model is still an adequate fit, then SSE_reduced should be very close to SSE_full – after
all, if the dropped variables were not important, then the increase in prediction error should be small.
Construct a test statistic as:

F_general = [ (SSE_reduced − SSE_full) / (df_SSE_reduced − df_SSE_full) ] / [ SSE_full / df_SSE_full ]
This is compared to an F-distribution with the appropriate degrees of freedom. Large values of the
F-statistic indicate evidence that not all variables can be simultaneously dropped.
Of course, this procedure has been automated in most statistical packages, as will be illustrated by an
example.
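The test statistic translates directly into code. A minimal sketch (the SSE values in the example call are made up for illustration, not taken from the body-fat example that follows):

```python
def general_linear_test(sse_full, df_full, sse_reduced, df_reduced):
    """F statistic for testing whether the variables dropped from
    the full model can all be removed simultaneously:

        F = [(SSE_reduced - SSE_full) / (df_reduced - df_full)]
            / [SSE_full / df_full]

    Large values are evidence against dropping all the variables."""
    num = (sse_reduced - sse_full) / (df_reduced - df_full)
    den = sse_full / df_full
    return num / den

# Hypothetical SSE values, for illustration only:
F = general_linear_test(sse_full=100.0, df_full=16,
                        sse_reduced=500.0, df_reduced=18)
print(F)  # (400/2) / (100/16) = 200 / 6.25 = 32.0
```

The resulting F would be compared to an F-distribution with (df_reduced − df_full, df_full) degrees of freedom.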
6.4.2 Example: Predicting body fat from measurements<br />
The percentage of body fat in humans is a good indicator of future problems with cardiovascular and other<br />
diseases.<br />
The following was taken from Wikipedia: 15<br />
Body fat percentage is the fraction of the total body mass that is adipose tissue. This index
is often used as a means to monitor progress during a diet or as a measure of physical fitness
for certain sports, such as body building. It is more accurate as a measure of health than body
mass index (BMI) since it directly measures body composition and there are separate body fat
guidelines for men and women. However, its popularity is less than BMI because most of the
techniques used to measure body fat percentage require equipment and skills that are not readily
available.
The most accurate method has been to weigh a person underwater in order to obtain the average
density (mass per unit volume). Since fat tissue has a lower density than muscles and bones,
it is possible to estimate the fat content. This estimate is distorted by the fact that muscles and
bones have different densities: for a person with a more-than-average amount of bone tissue, the
estimate will be too low. However, this method gives highly reproducible results for individual
persons (±1%). The body fat percentage is commonly calculated from one of two formulas:
Brozek formula: BF = (4.57/p − 4.142) × 100
Siri formula: BF = (4.95/p − 4.50) × 100

In these formulas, p is the body density in kg/L obtained by weighing the person out of water
and then dividing by the volume obtained by dunking the person underwater.
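The two formulas are easy to check numerically; a minimal sketch (the density value 1.05 kg/L is just an illustrative input):

```python
def brozek_bf(p):
    """Brozek formula: percent body fat from body density p in kg/L."""
    return (4.57 / p - 4.142) * 100

def siri_bf(p):
    """Siri formula: percent body fat from body density p in kg/L."""
    return (4.95 / p - 4.50) * 100

# A body density of 1.05 kg/L gives roughly 21% body fat under
# either formula; the two formulas agree closely in this range.
print(round(brozek_bf(1.05), 1))  # 21.0
print(round(siri_bf(1.05), 1))    # 21.4
```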
BTW, the American Council on Exercise has associated categories with ranges of body fat.
Women generally have less muscle mass than men and therefore they have a higher body fat
percentage range for each category.
Descripti<strong>on</strong> Women Men<br />
Essential fat 10-13% 2-5%<br />
Athletes 14-20% 6-13%<br />
Fitness 21-24% 14-17%<br />
Acceptable 25-31% 18-24%<br />
Obesity 32%+ 25%+<br />
Many studies have been done to see if predictions of body fat can be made based on simple measurements
such as circumferences of various body parts.
A study of middle-aged men measured the percentage of body fat using the difficult methods explained
above and also took measurements of the circumference of their thigh, triceps, and mid-arm.
15 2006-05-15, at http://en.wikipedia.org/wiki/Body_fat_percentage<br />
Here are the raw data:<br />
Triceps Thigh Mid-arm PerBodyFat<br />
19 43 29 11.9<br />
24 49 28 22.8<br />
30 51 37 18.7<br />
29 54 31 20.1<br />
19 42 30 12.9<br />
25 53 23 21.7<br />
31 58 27 27.1<br />
27 52 30 25.4<br />
22 49 23 21.3<br />
25 53 24 19.3<br />
31 56 30 25.4<br />
30 56 28 27.2<br />
18 46 23 11.7<br />
19 44 28 17.8<br />
14 45 21 12.8<br />
29 54 30 23.9<br />
27 55 25 22.6<br />
30 58 24 25.4<br />
22 48 27 14.8<br />
25 51 27 21.1<br />
JMP Analysis<br />
The raw data is also available in a JMP data sheet called bodyfat.jmp available from the Sample Program
Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms.
Fit the multiple-regression model using the Analyze->Fit Model platform:
The resulting estimates all have tests for the marginal population slope statistically non-significant:
But at the same time, the whole model test:
shows that there is predictive ability in these X variables because the overall p-value is statistically significant.
The problem is that the X variables are all highly related. Indeed, a scatter-plot matrix of the X variables
shows a high degree of relationship among them:
A general linear test for dropping, say, both the triceps and thigh X variables is constructed using the
Custom Tests pop-down menu item:
and then specifying which X variables are to be tested together. You need a separate column in the Custom
Test for each variable to be tested – if you specify multiple variables in a single column, you will get a test
for a crazy hypothesis:
The final result:<br />
has a p-value of .000003, which is very strong evidence that both variables cannot be dropped simultaneously.
If you look at the ANOVA table from the full model:
the SSE_full = 100.1 with 16 df.
The reduced model is fit using the Analyze->Fit Model platform with just the Mid-arm variable, and
the reduced model ANOVA table is:
with the SSE_reduced = 487.4 with 18 df.
The general linear test is found as:

F_general = [ (SSE_reduced − SSE_full) / (df_SSE_reduced − df_SSE_full) ] / [ SSE_full / df_SSE_full ]
          = [ (487.4 − 100.1) / (18 − 16) ] / [ 100.1 / 16 ]
          = 193.65 / 6.26
          = 30.94
which is the value reported above.<br />
6.4.3 Summary<br />
The general linear test is often used to test if a “chunk” of X variables can be removed from the model.
Often this chunk will be a set of variables that has something in common.
For example, often all quadratic terms are tested simultaneously, or a variable and all its higher order
terms (e.g. X, X^2, X^3, etc.).
6.5 Indicator variables<br />
6.5.1 Introduction
Indicator variables (also known as dummy variables) are a device to incorporate nominal-scaled variables
into regression contexts. For example, suppose you looked at the relationship between blood pressure and
weight. In general, the blood pressure of an individual increases with weight. But in general, males are larger than
females, so a body weight of 90 kg may have a different effect for males than for females. So how can sex
(a nominally scaled variable) be incorporated into the regression equation?
It turns out that using indicator variables makes ordinary regression a general tool for many more applications
than simple regression. Indeed, it is possible to show that two-sample t-tests, single factor completely
randomized design ANOVAs, and even more complex experimental designs can be analyzed using regression
methods. This is why many computer packages refer to their analysis tools for comparing means and fitting
regressions as variants of general linear models.
6.5.2 Defining indicator variables<br />
Unfortunately, there is no standard way to define an indicator variable in a regression setting, but fortunately,
it turns out that it doesn't matter which formulation is used – it is always possible to get an appropriate
answer.
In general, if a nominally scaled variable has k categories, you will require k − 1 indicator variables. In
many cases, computer packages will generate these automatically if the package knows that the variable is to be
treated as a nominally scaled variable. 16
For example, as sex only has two levels, only one indicator variable is required. It could be coded as:

X_1 = 1 if male, 0 if female

or

X_1 = 1 if male, −1 if female

Many other codings are possible.
For a nominally scaled variable with three levels, two indicator variables will be needed. For example,
suppose that the size of a person is classified as small, medium, or large. Then the indicator variables could
be defined as:

X_1 = 1 if small, 0 if medium or large
X_2 = 1 if medium, 0 if small or large

Now the pair of variables defines the three classes as: (X_1, X_2) = (1, 0) = small, (X_1, X_2) = (0, 1) =
medium, and (X_1, X_2) = (0, 0) = large.
Many packages use what are known as reference coding rules for indicator variables, where the i-th indicator
variable takes the value 1 to indicate the i-th value of the variable for the first k − 1 values of the
variable, and all the indicator variables take the value 0 to refer to the last value of the variable. 17
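Reference coding can be sketched in a few lines; this mirrors the small/medium/large example above (the function name is invented for illustration):

```python
def reference_code(value, levels):
    """Reference coding: for a nominal variable with k levels, return
    k-1 indicator values.  The i-th indicator is 1 when value equals
    the i-th level; the last level is the reference level and is
    coded as all zeros."""
    return [1 if value == lev else 0 for lev in levels[:-1]]

levels = ["small", "medium", "large"]
print(reference_code("small", levels))   # [1, 0]
print(reference_code("medium", levels))  # [0, 1]
print(reference_code("large", levels))   # [0, 0]  <- the reference level
```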
So, how do indicator variables help incorporate the effects of a nominally scaled variable? Consider
the variable sex (taking two levels labeled f and m in that order). A single indicator variable, say Sex,
16 That is why it is good practice to code nominally scaled variables using alphanumeric codes (e.g. m and f for sex), rather than
numeric codes such as 3 or 7.
17 Always check the package documentation carefully to see if the package is using this rule. If it uses a different coding scheme,
you will have to interpret the estimates carefully.
is defined that takes the value of 1 for females and 0 for males. Now consider the following estimated
regression equation:

BloodPressure = 110 − 10 × Sex + 0.10 × Weight
The estimated blood pressure for a female who weighs 100 kg would be:

110 = 110 − 10(1) + 0.10(100)

while the estimated blood pressure for a male who weighs 100 kg would be:

120 = 110 − 10(0) + 0.10(100)

Hence, the coefficient associated with sex (with a value of −10) would be interpreted as the difference in
blood pressure between females and males for all weight classes, i.e. the relationship consists of two parallel
lines (with a slope against weight of 0.10) with a separation of 10 units.
On the other hand, consider the regression equation:

BloodPressure = 110 − 10 × Sex + 0.10 × Weight − 0.05 × Sex × Weight

Notice that two variables (the Sex indicator variable and the weight variable) are multiplied together. Now,
the estimated blood pressure for a female who weighs 100 kg would be:

105 = 110 − 10(1) + 0.10(100) − 0.05(1)(100)

while the estimated blood pressure for a male who weighs 100 kg would be:

120 = 110 − 10(0) + 0.10(100) − 0.05(0)(100)

Hence, the coefficient associated with the product of sex and weight would be interpreted as the differential
response to weight between males and females, i.e. the relationship consists of two non-parallel lines. The
slope for males against weight is 0.10 while the slope for females against weight is 0.10 − 0.05 = 0.05.
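The estimated equations above can be verified directly; a minimal sketch (coding Sex = 1 for females, 0 for males, as in the text):

```python
def bp_parallel(sex, weight):
    """Parallel-lines model: 110 - 10*Sex + 0.10*Weight,
    with Sex = 1 for females and 0 for males."""
    return 110 - 10 * sex + 0.10 * weight

def bp_interaction(sex, weight):
    """Non-parallel model with the Sex*Weight cross-product added."""
    return 110 - 10 * sex + 0.10 * weight - 0.05 * sex * weight

# Predictions for a 100 kg person under each model:
print(round(bp_parallel(1, 100)))     # female, parallel model: 110
print(round(bp_parallel(0, 100)))     # male, parallel model: 120
print(round(bp_interaction(1, 100)))  # female, interaction model: 105
print(round(bp_interaction(0, 100)))  # male, interaction model: 120
```

In the parallel model the two lines differ by a constant 10 units at every weight; in the interaction model the gap between the sexes changes with weight.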
This idea can be extended to nominally scaled variables with more than two levels in a straightforward
way. Fortunately, most packages will do the coding automatically for you and all that is necessary is to
specify the model appropriately and understand what the various model formulations imply.
6.5.3 The ANCOVA model<br />
The use of indicator variables has, for historical reasons, been referred to as the Analysis of Covariance
(ANCOVA) approach. It actually has two separate, but functionally identical, uses.
The first use is to incorporate nominally scaled variables into regression situations. The modeling starts
off with individual regression lines, one for each value of the nominal variable (e.g. a separate line for males
and females). A statistical test is used to see if the lines are parallel. If there is evidence that the individual
regression lines are not parallel, then a separate regression line must be used for each group for prediction
purposes. If there is no evidence of non-parallelism, then the next task is to see if the lines are coincident,
i.e. have both the same intercept and the same slope. If there is evidence that the lines are not coincident,
then a series of parallel lines is used to make predictions. All of the data are used to estimate the common
slope. If there is no evidence that the lines are not coincident, then all of the data can simply be pooled
together and a single regression line fit to all of the data.
The three possibilities are shown below for the case of two groups – the extension to many groups is
obvious:
Second, ANCOVA has been used to test for differences in means among the groups when some of the
variation in the response variable can be “explained” by a covariate. For example, the effectiveness of two
different diets can be compared by randomizing people to the two diets and measuring the weight change
during the experiment. However, some of the variation in weight change may be related to initial weight.
Perhaps by “standardizing” everyone to some common weight, we can more easily detect differences among
the groups. This will be discussed in a later chapter.
A very nice book on the Analysis of Covariance is Analysis of Messy Data, Volume III: Analysis of
Covariance by G. A. Milliken and D. E. Johnson. Details are available at
http://www.statsnetbase.com/ejournals/books/book_summary/summary.asp?id=869.
6.5.4 Assumptions
As before, it is important, before the analysis is started, to verify the assumptions underlying the analysis. As
ANCOVA is a combination of ANOVA and Regression, the assumptions are similar. Both goals of ANCOVA
have similar assumptions:
• The response variable Y is continuous (interval or ratio scaled).
• The data are collected under a completely randomized design. 18 This implies that the treatment must
be randomized completely over the entire set of experimental units in an experimental study, or units
must be selected at random from the relevant populations in an observational study.
• There must be no outliers. Plot Y vs. X for each group separately to see if there are any points that
don't appear to follow the straight line.
• The relationship between Y and X must be linear for each group. 19 Check this assumption by looking
at the individual plots of Y vs. X for each group.
• The variance must be equal for both groups around their respective regression lines. Check that the
spread of the points is equal across the range of X and that the spread is comparable between the two
groups. This can be formally checked by looking at the MSE from a separate regression line for each
group, as the MSE estimates the variance of the data around the regression line.
• The residuals must be normally distributed around the regression line for each group. This assumption
can be checked by examining the residual plots from the fitted model for evidence of non-normality. For
large samples, this is not too crucial; for small sample sizes, you will likely have inadequate power to
detect anything but gross departures.
6.5.5 Comparing individual regression lines
You saw in earlier chapters that a statistical model is a powerful shorthand to describe what analysis is fit
to a set of data. The model must describe the treatment structure, the experimental unit structure, and the
randomization structure. Let Y be the response variable; X be the continuous X-variable; and Group be the
nominally scaled group variable with TWO levels, i.e. only one indicator variable will be generated, called
I.
In this and the previous chapter, we use a shorthand model notation. For example, the model notation

Y = X

would refer to a regression of Y on X with the underlying statistical model:

Y = β_0 + β_1 X + ε
18 It is possible to relax this assumption - this is beyond the scope of this course.
19 It is possible to relax this assumption as well, but this is again beyond the scope of this course.
where the subscript corresponding to individual subjects has been dropped for clarity.
We now use an extension of model notation. The model notation:

Y = X Group Group*X

refers to the model:

Y = β_0 + β_1 X + β_2 I + β_3 I × X + ε

Lastly, the model notation:

Y = X Group

refers to the model:

Y = β_0 + β_1 X + β_2 I + ε
These models can be diagrammed in graphs. If the lines for each group are not parallel:
the appropriate model is

Y1 = X Group Group*X

The terms can be in any order. This is read as: variation in Y can be explained by a common intercept (never
specified), group effects (different intercepts), a common slope on X, and an “interaction” between
Group and X, which is interpreted as different slopes for each group. This model is almost equivalent to
fitting a separate regression line for each group. The only advantage to using this joint model compared to
fitting separate slopes is that all of the groups contribute to a better estimate of residual error. If the number
of data points per group is small, this can lead to improvements in precision compared to fitting each group
individually.
If the lines are parallel across groups, but not coincident:
the appropriate model is

Y2 = Group X

The terms can be in any order. The only difference between this and the previous model is that this simpler
model lacks the Group*X “interaction” term. It would not be surprising then that a statistical test to see if
this simpler model is tenable would correspond to examining the p-value of the test on the Group*X term
from the complex model. This is exactly analogous to testing for interaction effects between factors in a
two-factor ANOVA.
Lastly, if the lines are coincident:
the appropriate model is

Y3 = X

The difference between this model and the previous model is the Group term that has been dropped.
Again, it would not be surprising that this corresponds to the test of the Group effect in the formal statistical
test. The test for coincident lines should only be done if there is insufficient evidence against parallelism.
While it is possible to test for a non-zero slope, this is rarely done.
6.5.6 Example: Degradation of dioxin

An unfortunate byproduct of pulp-and-paper production used to be dioxins - a very hazardous material. This material was discharged into waterways with the pulp-and-paper effluent, where it bioaccumulated in living organisms such as crabs. Newer processes have eliminated this byproduct, but the dioxins in the organisms take a long time to degrade.

Government environmental protection agencies take samples of crabs from affected areas each year and measure the amount of dioxins in the tissue. The following example is based on a real study.

Each year, four crabs are captured from two monitoring stations which are situated quite a distance apart on the same inlet where the pulp mill was located. The liver is excised and the livers from all four crabs are composited together into a single sample. 20 The dioxin level in this composite sample is measured. As there are many different forms of dioxins with different toxicities, a summary measure, called the Total Equivalent Dose (TEQ), is computed from the sample.

As seen in the chapter on regression, the appropriate response variable is log(TEQ).

Is the rate of decline the same for both sites? Did the sites have the same initial concentration?

Here are the raw data, which are also available on the web in the SampleProgramLibrary available at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms.

20 Compositing is a common analytical tool. There is little loss of useful information induced by the compositing process - the only loss of information is the among-individual-sample variability, which can be used to determine the optimal allocation between samples within years and the number of years to monitor.
Site  Year     TEQ  log(TEQ)
a     1990  179.05      5.19
a     1991   82.39      4.41
a     1992  130.18      4.87
a     1993   97.06      4.58
a     1994   49.34      3.90
a     1995   57.05      4.04
a     1996   57.41      4.05
a     1997   29.94      3.40
a     1998   48.48      3.88
a     1999   49.67      3.91
a     2000   34.25      3.53
a     2001   59.28      4.08
a     2002   34.92      3.55
a     2003   28.16      3.34
b     1990   93.07      4.53
b     1991  105.23      4.66
b     1992  188.13      5.24
b     1993  133.81      4.90
b     1994   69.17      4.24
b     1995  150.52      5.01
b     1996   95.47      4.56
b     1997  146.80      4.99
b     1998   85.83      4.45
b     1999   67.72      4.22
b     2000   42.44      3.75
b     2001   53.88      3.99
b     2002   81.11      4.40
b     2003   70.88      4.26
The data is entered into JMP in the usual fashion. Make sure that Site is a nominal scale variable, and that Year is a continuous variable.

In cases with multiple groups, it is often helpful to use a different plotting symbol for each group. This is easily accomplished in JMP by selecting the rows (say for site a) and using the Rows->Markers menu to set the plotting symbol for the selected rows:
The final data sheet has two different plotting symbols for the two sites:
Before fitting the various models, begin with an exploratory examination of the data, looking for outliers and checking the assumptions.

Each year’s data are independent of other years’ data, as a different set of crabs was selected. Similarly, the data from one site are independent of the other site. This is an observational study, so the question arises of how exactly the crabs were selected. In this study, crab pots were placed on the sea floor to capture the available crabs in the area.

Whenever multiple sets of data are collected over time, there is always the worry about common year effects (also known as process error). For example, if the response variable were body mass of small fish, then poor growing conditions in a single year could depress the growth of fish in all locations. This would violate the assumption of independence, as the residual at one site in a year would be related to the residual at the other site in the same year. You tend to see the residuals “paired”, with negative residuals from the fitted
line at one site matched (by year) with negative residuals at the other site. In this case, this is unlikely to have occurred. Degradation of dioxin is relatively independent of external environmental factors, and the variation that we see about the two regression lines is related solely to sampling error based on the particular set of crabs that were sampled. It seems unlikely that the residuals are related. 21

Use the Analyze->Fit Y-by-X platform and specify the log(TEQ) as the Y variable, and Year as the X variable:

Then specify a grouping variable by clicking on the pop-down menu near the Bivariate Fit title line:

21 If you actually try to fit a process error term to this model, you find that the estimated process error is zero.
and selecting Site as the grouping variable:
Now select the Fit Line option from the same pop-down menu:
to get separate lines fit for each group:
The relationships for each site appear to be linear. The actual estimates are also presented:

The scatter plot doesn’t show any obvious outliers. The estimated slope for the a site is −0.107 (se .02) while the estimated slope for the b site is −0.06 (se .02). The 95% confidence intervals (not shown on the output, but available by right-clicking/ctrl-clicking on the parameter estimates table) overlap considerably, so
the slopes could be the same for the two groups.

The MSE from site a is 0.10 and the MSE from site b is 0.12. These correspond to standard deviations of √0.10 = 0.32 and √0.12 = 0.35, which are very similar, so the assumption of equal standard deviations seems reasonable.

The residual plots (not shown) also look reasonable.

The assumptions appear to be satisfied, so let us now fit the various models.

First, fit the model allowing for separate lines for each group. The Analyze->Fit Model platform is used:

The terms can be in any order and correspond to the model described earlier. This gives the following output:
The regression plot is just the same as the plot of the two individual lines seen earlier. What is of interest is the Effect test for the Site*Year interaction. Here the p-value is not very small, so there is no evidence that the lines are not parallel.

We need to refit the model, dropping the interaction term:
which gives the following regression plot:
This shows the fitted parallel lines. The effect tests:

now have a small p-value for the Site effect, indicating that the lines are not coincident, i.e. they are parallel with different intercepts. This would mean that the rate of decay of the dioxin appears to be equal at both sites, but the initial concentrations appear to be different.

The estimated (common) slope is found in the Parameter Estimates portion of the output:
and has a value of −0.083 (se 0.016). Because the analysis was done on the log-scale, this implies that the dioxin levels changed by a factor of exp(−0.083) = 0.92 from year to year, i.e. about an 8% decline each year. The 95% confidence interval for the slope on the log-scale is (−0.12 → −0.05), which corresponds to a factor between exp(−0.12) = 0.89 and exp(−0.05) = 0.95 per year, i.e. between roughly an 11% and a 5% decline per year. 22
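The back-transformation from a log-scale slope to a yearly percent change can be checked directly (the numbers are taken from the output above):

```python
import math

slope, lo, hi = -0.083, -0.12, -0.05   # estimate and 95% CI from the output

yearly_factor = math.exp(slope)        # each year has about 92% of last year's TEQ
decline = 1 - yearly_factor            # about 0.08, i.e. roughly an 8% decline per year
ci_factors = (math.exp(lo), math.exp(hi))   # roughly (0.89, 0.95)
```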
While it is possible to estimate the difference between the parallel lines from the Parameter Estimates table, it is easier to look at the section of the output corresponding to the Site effects. Here the estimated LSMeans correspond to the log(TEQ) at the average value of Year - not really of interest. As in previous chapters, the difference in means is often of more interest than the raw means themselves. This is found by using the pop-down menu and selecting an LSMeans Contrast or Multiple Comparison procedure to give:

22 The confidence intervals are found by right-clicking/ctrl-clicking in the Parameter Estimates table.
The estimated difference between the lines (on the log-scale) is estimated to be 0.46 (se .13). Because the analysis was done on the log-scale, this corresponds to a ratio of exp(.46) = 1.58 in dioxin levels between the two sites, i.e. site b has 1.58 times the dioxin level of site a. Because the slopes are parallel and declining, the dioxin levels are falling at both sites, but the 1.58 ratio remains constant over time.
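A constant difference on the log scale is a constant ratio on the original scale, which is why the 1.58 ratio persists even as both sites decline. A small check using the parallel-lines structure (the intercept here is hypothetical; the slope and difference are from the output above):

```python
import math

slope = -0.083          # common slope from the output
diff = 0.46             # estimated site difference on the log scale
a0 = 5.0                # hypothetical log-scale intercept for site a

for year in range(5):
    teq_a = math.exp(a0 + slope * year)
    teq_b = math.exp(a0 + diff + slope * year)
    # the ratio b/a equals exp(diff), about 1.58, in every year
    assert round(teq_b / teq_a, 2) == 1.58
```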
Finally, the Actual by Predicted plot (not shown here), the leverage plots (not shown here), and the residual plot don’t show any evidence of a problem in the fit.

6.5.7 Example: More refined analysis of stream-slope example

In the chapter on paired comparisons, the example of the effect of stream slope was examined based on:

Isaak, D.J. and Hubert, W.A. (2000). Are trout populations affected by reach-scale stream slope? Canadian Journal of Fisheries and Aquatic Sciences, 57, 468-477.

In that paper, stream slope was (roughly) categorized into high or low slope classes and a paired analysis was performed. In this section, we will use the actual stream slopes to examine the relationship between fish density and stream slope.

Recall that a stream reach is a portion of a stream, from 10 to several hundred meters in length, that exhibits a consistent slope. The slope influences the general speed of the water, which exerts a dominant influence on the structure of physical habitat in streams. If fish populations are influenced by the structure of physical habitat, then the abundance of fish populations may be related to the slope of the stream.
Reach-scale stream slope and the structure of associated physical habitats are thought to affect trout populations, yet previous studies confound the effect of stream slope with other factors that influence trout populations.

Past studies addressing this issue have used sampling designs wherein data were collected either by taking repeated samples along a single stream or by measuring many streams distributed across space and time. Reaches on the same stream will likely have correlated measurements, making the use of simple statistical tools problematic. [Indeed, if only a single stream is measured at multiple locations, then this is an example of pseudo-replication and inference is limited to that particular stream.]

Inference from streams spread over time and space is made more difficult by inter-stream differences and by temporal variation in trout populations if samples are collected over extended periods of time. This extra variation reduces the power of any survey to detect effects.

For this reason, a paired approach was taken. A total of twenty-three streams were sampled from a large watershed. Within each stream, two reaches were identified and the actual slope gradient was measured.

In each reach, fish abundance was determined using electro-fishing methods and the numbers converted to a density per 100 m² of stream surface.

The following table presents the (fictitious, but based on the above paper) raw data.

Estimates of fish density from a paired experiment

Stream  slope (%)  slope class  density (per 100 m²)
 1         0.7        low            15.0
 1         4.0        high           21.0
 2         2.4        low            11.0
 2         6.0        high            3.1
 3         0.7        low             5.9
 3         2.6        high            6.4
 4         1.3        low            12.2
 4         4.0        high           17.6
 5         0.6        low             6.2
 5         4.4        high            7.0
 6         1.3        low            39.8
 6         3.2        high           25.0
 7         2.0        low             6.5
 7         4.2        high           11.2
 8         1.3        low             9.6
 8         4.2        high           17.5
 9         2.0        low             7.3
 9         3.6        high           10.0
10         0.7        low            11.3
10         3.5        high           21.0
11         2.3        low            12.1
11         6.0        high           12.1
12         2.5        low            13.2
12         4.2        high           15.0
13         2.3        low             5.0
13         6.0        high            5.0
14         1.2        low            10.2
14         2.9        high            6.0
15         0.7        low             8.5
15         2.9        high            7.0
16         1.1        low             5.8
16         3.0        high            5.0
17         2.2        low             5.1
17         5.0        high            5.0
18         0.7        low            65.4
18         3.2        high           55.0
19         0.7        low            13.2
19         3.0        high           15.0
20         0.3        low             7.1
20         3.2        high           12.0
21         2.3        low            44.8
21         7.0        high           48.0
22         1.8        low            16.0
22         6.0        high           20.0
23         2.2        low             7.2
23         6.0        high           10.1

Notice that the density varies considerably among streams but appears to be fairly consistent within each stream.
The raw data are available in a JMP datafile called paired-stream.jmp in the Sample Programs Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms.

As noted earlier, this is an example of an Analytical Survey. The treatments (low or high slope) cannot be randomized within a stream – the randomization occurs by selecting streams at random from some larger population of potential streams. As noted in the earlier chapter on Observational Studies, causal inference is limited whenever a randomization of experimental units to treatments cannot be performed.

Unlike the example presented in other chapters, where the slope was divided (arbitrarily) into two classes (low and high slope), we will now use the actual slope. A simple regression CANNOT be used because of the non-independence introduced by measuring two reaches on the same stream. However, an ANCOVA will prove to be useful here.

First, it seems sensible that the response to stream slope will be multiplicative rather than additive, i.e. an increase in the stream slope will change the fish density by a common fraction, rather than simply changing the density by a fixed amount. For example, it may turn out that a 1 unit change in the slope reduces density by 10% - if the density before the change was 100 fish/m², then after the change the new density will be 90 fish/m². Similarly, if the original density was only 10 fish/m², then the final density will be 9 fish/m². In both cases, the reduction is a fixed fraction, and NOT the same fixed amount (a change of 10 vs. 1).

Create the log(density) column in the usual fashion (not illustrated here). In cases like this, the natural logarithm is preferred because the resulting estimates have a very nice simple interpretation. 23
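On the log scale, that fixed fraction becomes a fixed additive shift, which is exactly what a linear model can represent. A two-line check:

```python
import math

# A 10% reduction moves 100 -> 90 and 10 -> 9: different absolute changes,
# but the same shift of log(0.9) on the natural-log scale.
assert math.isclose(math.log(90) - math.log(100), math.log(0.9))
assert math.isclose(math.log(9) - math.log(10), math.log(0.9))
```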
An appropriate model will be one where each stream has a separate intercept (corresponding to the different productivities of the streams - acting like a block), with a common slope for all streams. The simplified model syntax would look like

log(density) = Stream Slope

where the term Stream represents a nominal scaled variable and gives the different intercepts, and Slope is the effect of the common slope on the log(density).

This is fit using the Analyze->Fit Model platform as:

23 The JMP dataset also created a different plotting symbol for each stream using the Rows -> Color or Mark by Column menu.
Note that Stream must have a nominal scale and that Slope must have a continuous scale. The order of the terms in the effects box is not important.

The output from the Analyze->Fit Model platform is voluminous, but a careful reading reveals several interesting features.

First is a plot of the common slope fit to each stream:
This shows a gradual increase as slope increases. This plot is hard to interpret, but a plot of observed vs. predicted values is clearer:
Generally, the observed values are close to the predicted values, except for two potential outliers. By clicking on these points, it is seen that both points belong to stream 2, where it appears that an increase in the slope causes a large decrease in density, contrary to the general pattern seen in the other streams.

The effect tests:

fail to detect any influence of slope. Indeed, the estimated coefficient associated with a change in slope is found to be:
.025 (se .0299), which is not statistically significant. 24

Residual plots also show the odd behavior of stream 2:

If this rogue stream is “eliminated” from the analysis, the resulting plots do not show any problems (try it), but now the results are statistically significant (p = 0.035):

24 Because the natural log transform was used for the data, “smallish” slope coefficients have an approximate interpretation. In this example, a slope of .025 on the (natural) log scale implies that the estimated fish density INCREASES by 2.5% every time the slope increases by one percentage point.
The estimated change in log-density per percentage point change in the slope is found to be:

i.e. the slope is .05 (se .02), which is interpreted as: a percentage point increase in stream slope increases fish density by about 5%. 25
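For small coefficients on the natural-log scale, exp(b) − 1 ≈ b, which is where the “5% per percentage point” reading comes from:

```python
import math

b = 0.05                   # estimated slope on the natural-log scale
exact = math.exp(b) - 1    # exact multiplicative change, about 0.051, i.e. ~5.1%
# for small b, the approximation exp(b) - 1 ~ b is very close
assert abs(exact - b) < 0.002
```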
The remaining residual plot and leverage plots show no problems.
6.6 Example: Predicting PM10 levels

Small particulates are known to have adverse health effects. Here is some background information from Wikipedia: 26

The effects of inhaling particulate matter have been widely studied in humans and animals and include asthma, lung cancer, cardiovascular issues, and premature death. The size of the particle determines where in the body the particle will come to rest if inhaled. Larger particles are generally filtered by small hairs in the nose and throat and do not cause problems, but particulate matter smaller than about 10 micrometers, referred to as PM10, can settle in the bronchial tubes and lungs and cause health problems. Particles smaller than 2.5 micrometers, PM2.5, can penetrate directly into the lung, whereas particles smaller than 1 micrometer, PM1, can penetrate into the alveolar region of the lung and tend to be the most hazardous when inhaled.

The large number of deaths and other health problems associated with particulate pollution was first demonstrated in the early 1970s (Lave et al., 1973) and has been reproduced many times

25 This easy interpretation occurs because the natural log transform was used. If the common (base 10) log transform had been used, there would no longer be such a simple interpretation.

26 Downloaded from http://en.wikipedia.org/wiki/Particulate on 2006-05-22.
since. PM pollution is estimated to cause 20,000-50,000 deaths per year in the United States (Mokdad et al., 2004) and 200,000 deaths per year in Europe. For this reason, the US Environmental Protection Agency (EPA) sets standards for PM10 and PM2.5 concentrations in urban air. The EPA regulates primary particulate emissions and precursors to secondary emissions (NOx, sulfur, and ammonia). Many urban areas in the US and Europe still frequently violate the particulate standards, though urban air has gotten cleaner, on average, with respect to particulates over the last quarter of the 20th century.

The data are a subsample of 500 observations from a data set, collected by the Norwegian Public Roads Administration, originating in a study in which air pollution at a road is related to traffic volume and meteorological variables.

The response variable consists of hourly values of the logarithm of the concentration (why?) of PM10 (particles), measured at Alnabru in Oslo, Norway, between October 2001 and August 2003. The predictor variables are the logarithm of the number of cars per hour, temperature 2 meters above ground (degrees C), wind speed (meters/second), the temperature difference between 25 and 2 meters above ground (degrees C), wind direction (degrees between 0 and 360), hour of day, and day number from October 1, 2001.

The data were extracted from http://lib.stat.cmu.edu/datasets/ and are available in the file pm10.jmp in the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms.

Wind direction is an interesting variable, as it ranges from 0 to 360 around a circle and cannot be used directly in a regression setting – after all, directions of 1 degree and 359 degrees are very similar, yet have vastly “different” measured values.

Examine the histogram of the wind directions (obtained from the Analyze->Distribution platform):
This seems to indicate that there are two major wind directions. The “E” winds correspond to wind directions from about 320 to 360 degrees and from 0 to 150 degrees, while the “W” winds correspond to directions between 150 and 320 degrees.

Convert these measurements into a nominal scaled variable using JMP’s formula editor:
This classifies the wind direction into the two categories. A character coding is used to prevent computer packages from interpreting a numeric code as an interval or ratio scaled variable. An indicator variable could be created for this variable as seen in earlier chapters.
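The same recoding can be sketched outside JMP. The function below is a hypothetical equivalent of the formula-editor rule, using the cut points read off the histogram:

```python
def wind_sector(degrees: float) -> str:
    """Collapse a 0-360 wind direction into the two dominant sectors.

    "W" covers 150-320 degrees; everything else (320-360 and 0-150,
    which wrap around through north) is coded "E".
    """
    degrees = degrees % 360          # guard against values like 360 or -10
    return "W" if 150 <= degrees < 320 else "E"
```

Note that wind_sector(1) and wind_sector(359) both return "E" even though the raw values are numerically far apart, which is exactly the problem with using raw degrees in a regression.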
An initial scatterplot matrix of the data is obtained by using the Analyze->MultiVariateMethods->Multivariate platform:
There is no obvious relationship among the variables. The plot of the day variable shows a large gap. Inspection of the data shows that recording was stopped for about 100 days in the middle of the data set – the reasons for this are unknown. The number of cars/hour varies over the hour of the day in a predictable fashion. The wind direction variable shows that most of the data points have wind blowing in the two major directions corresponding to E and W, as broken into categories earlier.

A plot of the log(PM10) concentration by the condensed wind direction:
shows no obvious relationship between the PM10 and the wind direction.

The Analyze->Fit Model platform was used to fit a model to the continuous and indicator variables.
The leverage plots (not shown) don’t reveal any problems in the fit. The actual vs. predicted plot:
appears to show some evidence that the fitted line tends to under-predict at high log(PM10) concentrations and over-predict at lower log(PM10) concentrations, but the visual impression may be an artifact of the density of points. The residual plot:
doesn’t show any problems with the fit. In any case, the R² is not large, indicating plenty of residual variation not explained by the regressor variables.

The estimates table:

doesn’t show any problems with variance inflation, but perhaps some variables can be deleted. Use the Custom Test option:
to see if the day, wind direction, and hour can be removed. [I suspect that any hour effect has been taken up by the log(cars) effect and so is redundant (why?). Similarly, any trend over time (the day effect) may also be included in the log(cars) effect (why?)]:

[Why are three columns needed to test the three variables?] The results of the “chunk” test are:
showing that these variables can be safely deleted. The Analyze->Fit Model platform is again used, but now dropping these apparently redundant variables.

The revised estimates from this reduced model again show no problems in the leverage plots, no problems in the residual plots, and no problems in the VIF. The estimates are:
This time, it appears that both temperature variables are also redundant. This is somewhat surprising but, on sober second thought, perhaps not. The temperature wouldn't affect the creation of particles; after all, if the cars are the driving force behind the levels, the cars will produce the same particulate levels regardless of temperature. Perhaps temperature only affects how the PM10 levels affect human health, i.e. on hot days, perhaps people feel more affected by pollution.
A “chunk” test using the Custom Test procedure shows that the temperature variables can also be dropped (not shown).

The final model includes only two variables, the log(cars/hour) and the wind speed. The final estimates are:
As the number of cars/hour increases, the pollution level increases. As both the pollution level and the number of cars have been measured on the log scale, the coefficient must be interpreted carefully. A doubling of the number of cars corresponds to an increase of .7 on the natural logarithm scale (log(2) = .7). Hence, log(PM10) increases by .7(.32) = .22, which corresponds to an exp(.22) = 1.25-fold increase on the anti-log scale. In other words, a doubling of cars/hour corresponds to a 25% increase in the PM10 levels.

As wind speed increases, the concentration of PM10 decreases. A similar exercise shows that an increase in wind speed of 1 m/second causes the PM10 concentration to decrease by about 10%.
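The back-of-the-envelope interpretation above can be checked numerically. A minimal Python sketch, assuming the fitted slope for log(cars/hour) is 0.32 as quoted in the text; the wind-speed slope of −0.11 is an illustrative assumption (the actual value is in the estimates table, not reproduced here):

```python
import math

beta_log_cars = 0.32    # slope for log(cars/hour), quoted in the text
beta_wind = -0.11       # assumed wind-speed slope (illustrative value only)

# Doubling cars/hour adds log(2) on the log scale, so log(PM10) rises by
# log(2) * 0.32 ~= 0.22, i.e. a multiplicative effect of exp(0.22):
doubling_effect = math.exp(math.log(2) * beta_log_cars)
print(round(doubling_effect, 2))   # 1.25, i.e. about a 25% increase

# A 1 m/s increase in wind speed multiplies PM10 by exp(beta_wind):
wind_effect = math.exp(beta_wind)
print(round(wind_effect, 2))       # 0.9, i.e. roughly a 10% decrease
```

The same multiply-the-slopes logic applies to any log-log regression coefficient.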
The leverage plots and residual plots show no problems in the data.

How well does the model perform in practice? One way to assess this is to save the Std Err of predictions of the mean and of individual predictions to the data table:
(similar actions are done to save the std error for individual predictions and the actual predicted values). Then compute the ratio of each of the standard errors to the predicted values:
(again, only one formula is shown) and use the Analyze->Distribution platform to see the histograms of the relative prediction errors:
Predictions of the MEAN response are fairly good – the relative standard errors are under 5%, so the 95% confidence intervals for the predicted response will be fairly tight. However, as expected, the prediction intervals for individual responses are fairly poor – the relative prediction standard errors are around 25%, which means that the 95% prediction intervals will be ±50%! It is unclear how useful this is for advising individuals to take preventive actions under certain conditions of traffic volume and wind speed.
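The jump from a 25% relative standard error to ±50% intervals is just the usual normal-theory half-width of roughly two standard errors; a quick sketch:

```python
rel_se_mean = 0.05     # relative SE for the mean response (under 5%)
rel_se_indiv = 0.25    # relative SE for an individual prediction

# A 95% interval has a half-width of roughly 2 standard errors:
print(round(2 * rel_se_mean, 2))    # 0.1  -> mean response known to about +/-10%
print(round(2 * rel_se_indiv, 2))   # 0.5  -> individual predictions only to +/-50%
```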
6.7 Variable selection methods

6.7.1 Introduction

Up to now, it has been assumed that the variables to be used in the regression equation are basically known, and all that matters is perhaps deleting some variables as being unimportant, or deciding upon the degree of the polynomial needed for a variable.
In some cases, researchers are faced with several tens (sometimes hundreds or thousands) of predictors, and help is needed in even selecting a reasonable subset of variables to describe the relationship. The techniques in this section are called variable selection methods. CAUTION: Variable selection methods, despite their apparent objectivity, are no substitute for intelligent thought. As you will see in the remainder of this section, there are numerous caveats that must be kept in mind when using these methods.
There are two philosophies underlying variable selection methods. The first philosophy is that there is a unique correct model that explains the data. This MAY be true in physical systems where the goal of the project is to understand mechanisms of action. The role of variable selection is to try to come up with the variables that describe the mechanism of action. The second philosophy (and one that I personally find more appealing) is that reality is hopelessly complex and all our models are wrong. We hope via regression methods to come up with a prediction function that works satisfactorily. There is NO unique set of predictors which is “correct” – there may be several sets of predictors that all give reasonable answers, and the choice among these sets is not obvious.
In both cases, model selection follows five general steps:

1. Specify the maximum model (i.e. the largest set of predictors).

2. Specify a criterion for selecting a model.

3. Specify a strategy for selecting variables.

4. Specify a mechanism for fitting the models – usually least squares.

5. Assess the goodness-of-fit of the models and the predictions.
6.7.2 Maximum model

The maximum model is the set of predictors that contains all potential predictors of interest. Often researchers will add polynomial terms (e.g. X1²), cross-product terms (e.g. X1X2), or transformations of variables (e.g. ln(X1)).

If the first philosophy is correct, this maximal model must contain the correct model as a subset of the potential predictor variables. As the maximum model, this model has the highest predictive power, but some predictors may be redundant. Under the second philosophy, we know that this (and all models) are wrong, but we hope that this maximal model is a reasonable prediction function. Again, some predictors may be redundant.
Some caution must be used in specifying a maximum model. First, try to avoid including many variables that are collinear. For example, height and weight are highly collinear – are both variables really needed? If including polynomial or cross-product terms, center (i.e. subtract the mean) before squaring the variables or taking cross-products. Use scientific knowledge to select the potential predictors and the shape of the prediction function. Classification variables (i.e. nominal or ordinal scaled variables) will generate a separate indicator variable for each level of the variable. Some computer programs (e.g. JMP) may generate contrasts among these indicator variables as well.
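The advice to center before squaring can be seen in a small simulation. This is a sketch with invented data (a uniform(10, 20) predictor, chosen so its values sit far from zero): the raw variable is almost perfectly correlated with its own square, and centering removes most of that collinearity.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(10, 20, size=200)    # predictor whose values sit far from zero

r_raw = np.corrcoef(x, x ** 2)[0, 1]           # x and x^2: nearly collinear
xc = x - x.mean()                              # center first...
r_centered = np.corrcoef(xc, xc ** 2)[0, 1]    # ...then square: correlation near zero

print(round(r_raw, 3), round(abs(r_centered), 3))
```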
Second, there are various rules of thumb for the maximum number of predictors that should be entertained for a dataset. Generally, you want about 10 observations for each potential predictor variable. Hence, if your maximum model has 30 potential predictor variables, this rule of thumb would require you to have at least 300 observations! Remember that a nominal scaled variable with k values will require k − 1 indicator variables!
Third, examine the contrast within variables. If a variable is essentially constant (e.g. every subject had essentially the same weight), then this is a useless predictor variable, as no “effect” of weight will be apparent. If an indicator variable only points to a single case (e.g. only a single female in the dataset), then the results may be highly specific to the dataset analyzed. Low-contrast variables should not be included in the maximum model.
6.7.3 Selecting a model criterion

The model criterion is an “index” that is computed for each candidate model and used to compare the various models. Given a particular criterion, one can order the models from “best” to “worst”.

The criterion used should be related to the goal of the analysis. If the goal is prediction, the selection criterion should be related to errors in predictions. If the goal is variable subset selection, then the criterion should be related to the quality of the subset.

There is NO single best criterion. A literature search will reveal at least 10 criteria that have been proposed. In this chapter, five of the criteria will be discussed – this is not to say that these five are the optimal criteria, but rather the most frequently chosen. These criteria are R², F_p, MSE_p, C_p, and AIC.
R²

The R² criterion is the simplest criterion in use. The value of R² measures, in some sense, the proportion of total variation in the data that is explained by the predictors. Consequently, higher values of R² are “better”.

However, this criterion has a number of defects. First, R² will never decrease as you add variables (regardless of usefulness) to models. But in many cases, a plot of R² by the number of variables shows a rapid increase as variables are added, then a leveling off where new variables essentially add very little new information. Models near the bend of the curve seem to offer a reasonable description of the data. Some packages attempt to adjust the value of R² for the number of variables (called the adjusted R²), and so the value of the adjusted R² near the bend of the curve would again be the target.
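The adjusted R² mentioned above penalizes the raw R² for the number of predictors. A sketch of the standard formula (the numerical values are invented to illustrate the behaviour):

```python
def adjusted_r2(r2, n, p):
    """Adjusted R^2 with n observations and p predictors (excluding the intercept)."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# A nearly useless extra predictor nudges R^2 up but pulls adjusted R^2 down:
print(round(adjusted_r2(0.800, 50, 3), 3))   # 0.787
print(round(adjusted_r2(0.801, 50, 4), 3))   # 0.783
```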
F_p

The F_p criterion is essentially a series of hypothesis tests to see which set of p variables is not statistically different from the full model. If the test statistic for a set of p predictors is not statistically significant, then the other variables can be dropped.
The danger with this criterion is that every test has an α probability of a Type I (false positive) error. So if you do 50 tests, each at α = .05, there is a very good chance that at least one of the tests will show a statistically significant result when in fact it is not. If you decide to use this criterion, you likely want to do the tests at a more stringent level, i.e. use α = .01 or α = .001.
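The “very good chance” is easy to quantify: with 50 independent tests at α = .05, the probability of at least one false positive is 1 − (1 − α)^50.

```python
alpha, m = 0.05, 50
p_at_least_one = 1 - (1 - alpha) ** m
print(round(p_at_least_one, 2))   # 0.92 -- a false positive is almost guaranteed
```

At α = .01 the same calculation gives about 0.39, which is why a more stringent level helps.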
MSE_p

This criterion uses the estimated residual variance about the regression line. This residual variance is a combination of unexplainable variation and excess variation caused by unknown predictors. In many cases, there is a subset that has the minimal residual variation.
C_p and AIC

These are two related (and in linear regression equivalent) criteria.

Mallow's C_p is computed as:

C_p = SSE(p)/MSE(k) − [n − 2(p + 1)]

where SSE(p) is the error sum of squares from the subset with p predictors EXCLUDING the intercept 27 ; MSE(k) is the MSE from the maximum model; and n is the number of observations.

If the maximum model does contain the “truth”, then Mallows showed that C_p should be close to p + 1 28 for a subset model that is closest to the “truth”.
The Akaike Information Criterion (AIC) is a 1-1 transformation of C_p and can be thought of as

AIC = fit + penalty for predictors.

In the case of multiple regression, AIC has a simple form:

AIC = n log(SSE/n) + 2p

where now p is the number of predictors INCLUDING the intercept. The model with the smallest AIC is usually preferred, as this model has the best fit after accounting for a penalty for adding too many predictors.
However, AIC goes further. Under the philosophy that all models are wrong, but some are useful, it is possible to obtain model weights for several potential models, and to “average” the results of several competing models. This avoids the entire discussion of which is the best wrong model, but rather works on the philosophy that if several models that all seem to fit the data similarly give wildly different answers, then this uncertainty in the response must be incorporated. Burnham and Anderson (2002) have a very nice book on the use of AIC and its philosophy. Unfortunately, the use of model weights is beyond the scope of this course.

27 Some textbooks define p to INCLUDE the intercept, and so the last term may look like n − 2p rather than n − 2(p + 1). Both are equivalent.

28 Again, if p is defined to include the intercept, then C_p should be close to p rather than p + 1.
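The AIC formula above is simple to compute directly. A sketch with invented SSE values, showing how a large enough drop in SSE justifies an extra parameter:

```python
import math

def aic(sse, n, p):
    """AIC for a least-squares fit: n*log(SSE/n) + 2p, with p counting the intercept."""
    return n * math.log(sse / n) + 2 * p

# Hypothetical fits on n = 50 observations:
print(round(aic(120.0, 50, 3), 1))   # 49.8 for the 3-parameter model
print(round(aic(80.0, 50, 4), 1))    # 31.5 -- smaller AIC, so this model is preferred
```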
6.7.4 Which subsets should be examined

When we start with k potential predictors, there are many, many potential models that involve subsets of the k predictors. How are these subsets chosen?
All possible subsets

If there are k predictor variables in the maximum model, there are around 2^k possible subsets. This number can be enormous – for example, with 10 potential predictors, there are around 2^10 = 1024 subsets; with 20 predictors, there are around 2^20 = 1,048,576 possible models, etc.
With modern computers and good algorithms, it is actually possible to search all subsets for up to about 15 predictors (and this number gets higher each year). 29 Don't use Excel!

The all-possible-subsets strategy is preferred for reasonably sized problems. Because it looks at all possible models, it is unlikely that you would miss the “correct” model among the subsets. However, there may be several different models that are all essentially the same, and being forced to select one of these models is a bit arbitrary – hence one of the driving forces behind the AIC.
Backward elimination

If you have many predictors, then all possible subsets may not be feasible. The backward elimination procedure starts with the maximum model and successively “deletes” variables until no further variables can be deleted.
The algorithm proceeds as follows:

1. Fit the maximum model.

2. Decide which variable to delete. Look at each of the individual p-values for variables still in the model. If all of the p-values are less than some α (say .05, but this varies among packages), then stop. Else, find the variable with the largest (why?) p-value and drop this variable.

3. Refit the model. Refit the model after dropping this variable, and repeat step 2 until no further variables can be deleted.

29 It turns out that by cleverly computing various statistics, you can actually predict the results from many subsets without actually having to fit all the subsets.
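The three steps above can be sketched in code. Rather than p-value tests, this toy version drops the variable whose removal most improves the AIC and stops when no removal helps – an AIC-based variant of the same idea, not the exact p-value rule described above. The simulated data and variable names are invented for illustration:

```python
import numpy as np

def aic_ls(y, X):
    """AIC for a least-squares fit: n*log(SSE/n) + 2p, p = X.shape[1] (intercept included)."""
    n, p = X.shape
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    sse = np.sum((y - X @ beta) ** 2)
    return n * np.log(sse / n) + 2 * p

def backward_eliminate(y, X, names):
    """Repeatedly drop the predictor whose removal most lowers AIC; keep the intercept (col 0)."""
    keep = list(range(1, X.shape[1]))
    best = aic_ls(y, X)
    while keep:
        trials = [(aic_ls(y, X[:, [0] + [k for k in keep if k != j]]), j) for j in keep]
        cand_aic, j = min(trials)
        if cand_aic >= best:          # no deletion improves AIC: stop
            break
        best, keep = cand_aic, [k for k in keep if k != j]
    return [names[k] for k in keep]

rng = np.random.default_rng(0)
n = 200
x1, x2, x3 = rng.normal(size=(3, n))
y = 2.0 * x1 + rng.normal(scale=0.5, size=n)      # only x1 actually matters
X = np.column_stack([np.ones(n), x1, x2, x3])
selected = backward_eliminate(y, X, ["intercept", "x1", "x2", "x3"])
print(selected)
```

The truly useful predictor x1 always survives; the noise predictors are usually, though not guaranteed to be, dropped.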
One must be careful to ensure that models are hierarchical, i.e. if an X² term remains in the model, then the corresponding X term must also remain. Many computer packages will violate this restriction if left to their own devices.
Forward additi<strong>on</strong><br />
This is the reverse of the backward eliminati<strong>on</strong> procedure. Start with a null model, and keep adding variables<br />
until no more can be added. The variable at each step with the smallest increment p-value is the variable that<br />
is added.<br />
Again, you must ensure that if X 2 terms are entered, that the corresp<strong>on</strong>ding X term is also entered.<br />
Stepwise selection

It may turn out that adding a variable during a forward process makes an existing variable redundant. The forward addition process has no mechanism for deleting variables once they've entered the model.

In a stepwise selection procedure, after a variable is entered, a backward elimination procedure is attempted to see if any variable can be removed.
Closing words

In all of these automated selection procedures, there is no guarantee that the chosen model will be “optimal” in any sense. As well, because of the many, many statistical tests performed, none of the p-values at the final step should be interpreted literally. It is also well known that if data generated completely at random is used with stepwise methods, they will often select a model for prediction that is just noise.

Consequently, the results that you obtain may be highly specific to the dataset collected and may not be reproducible with other datasets. Refer to Section 6.7.5 for ideas on evaluating the reliability of the analysis.
6.7.5 Goodness-of-fit

Even with automated variable selection methods, there is no guarantee that the fitted models actually fit the data well. Consequently, the usual residual diagnostics must be performed as outlined in earlier sections.

At the same time, the analyst should avoid becoming fixated on the results from a single dataset. There is no guarantee that the results from this particular dataset translate into other datasets. There are several ways to try and assess how well the chosen relationship will work in the future:
• Try on a new dataset. In some cases, the study can be repeated, and a comparison of the model selected from the existing and new study is instructive.

• Split-sample. If there are many observations, the sample can be split into two. Model selection is done on each half independently, and the two analyses compared. If a variable is selected in one half, but not the other, this is an indication of instability in the analysis.

How well does the model do in predictions? Recall that R² measures the percentage of variation explained by the model. Use the first half of the data, fit a model, and find the R² for the first half. Use the model from the first sample to predict the data points for the second sample and compute the squared correlation between the observed and predicted values. This second R² will typically be smaller than the R² based on the first sample. If the shrinkage in R² is large, this is bad news – it implies that the results from the first sample did not do well in predicting the values in the second sample.
• Cross-validation. In some cases, you do not have sufficient data to split into two halves. In these cases, single-case cross-validation is often attempted. In this method, you fit a model excluding each case in turn, and then use the fitted model to predict the held-out case. A comparison of the fitted vs. actual values is a measure of predictive ability.
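In linear regression, single-case cross-validation need not refit n separate models: the standard hat-matrix identity e_i/(1 − h_ii) gives every leave-one-out prediction error from one fit. A numpy sketch with simulated data for illustration:

```python
import numpy as np

def loo_errors(y, X):
    """Leave-one-out prediction errors for least squares: e_i / (1 - h_ii)."""
    H = X @ np.linalg.pinv(X)          # hat matrix H = X (X'X)^{-1} X'
    e = y - H @ y                      # ordinary in-sample residuals
    return e / (1 - np.diag(H))

rng = np.random.default_rng(2)
n = 60
x = rng.normal(size=n)
y = 1.5 * x + rng.normal(size=n)
X = np.column_stack([np.ones(n), x])

e_loo = loo_errors(y, X)
e_ols = y - X @ (np.linalg.pinv(X) @ y)
press = np.sum(e_loo ** 2)             # PRESS: sum of squared hold-out errors
print(press > np.sum(e_ols ** 2))      # True: hold-out errors always exceed in-sample ones
```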
6.7.6 Example: Calories of candy bars

The JMP installation includes a dataset on the composition of popular candy bars. This is available under the Help → Sample Data Library → Food and Nutrition section, or in the candybar.jmp file in the Sample Program Library in the http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms directory.

For each of about 50 brands of candy bars, the total calories and the composition (grams of fat, grams of fiber, etc.) were measured. Can the total calories be predicted from the various constituents?
A preliminary scatter plot of the data:
shows a strong relationship between calories and total grams of fat and/or grams of saturated fat, but a weaker relationship between calories and grams of protein and grams of carbohydrates.

There are no obvious outliers, except for a few candy bars which appear to have unusual levels of vitamins (?).
The Analyze->Fit Model platform is used to request a stepwise regression analysis to try and predict the number of calories in the candy bars:
In this case, the philosophy that the correct model must be a subset of these variables is likely correct. The mechanism by which calories “appear” in food is well understood – likely a combination of fat, protein, and carbohydrates. It is unlikely that fiber or vitamins contribute anything substantial to the total calories.

The stepwise dialogue box has a number (!) of options and statistics available:
Detailed explanation of these features is available in the JMP help, but a summary is below:

• The direction of the stepwise procedure can be changed from forward, to backwards, or to mixed. If you wish to do backwards elimination, you will have to Enter All variables first before selecting this option. All possible regressions is available from the red-triangle pop-down menu.

• The probability to enter and to leave are set fairly liberally. A probability to enter of 0.25 indicates that variables that have any chance of being useful are added; the probability to leave indicates that as long as some marginal predictive ability is available, the variable should be retained.
• If the Go button is pressed, the procedure is completely automatic. If the Step button is pressed, the procedure goes step-by-step through the algorithm. The Make Model button is used at the end to fit the final selected model and obtain the usual diagnostic features.
• The package reports the MSE, R², the adjusted R², C_p, and AIC for each model. These can be used to assess the progress of the procedure.

• The actual model under consideration consists of those variables with check marks inside the Entered boxes. If you wish to force a variable to be always present, this is possible by entering the variable and locking it in.
Change the direction to Mixed and then repeatedly press the Step button.

For the first step, the program computes the p-values for each new variable to enter the model. The variable with the smallest p-value below the Prob to Enter will be selected to enter; here this is the Total Fat variable.
The model now consists of the intercept and the total fat variable, for a total of p = 2 predictors. The C_p is extremely large; the R² has increased from the previous model; the MSE has decreased.

None of the variables has a p-value greater than the Prob to Leave, so nothing happens in the “leaving step” and the Step button must be pressed again.

Based on the previous output, the carbohydrate variable will be entered (why?):
and then the protein variable (why?):
At this point we are now getting models with enormous R² values (close to 100%), which is practically unheard of in ecological contexts. Note that C_p is becoming close to p.

At this point, which variable would be entered next? Surprisingly, sodium is entered next, followed by saturated fat, and finally the procedure halts:
Both backward elimination and forward selection also pick this final model (try it).

The Make Model button will take these selected variables and create the Analyze->Fit Model dialogue box to fit this final model:
None of the leverage plots shows anything amiss; the residual plots look good. The final estimates are:

The VIF for total fat is a bit worrisome – notice that both the total fat and saturated fat variables are in the model. Presumably, saturated fat is included in the total fat and is redundant. Try refitting this model dropping the saturated fat variable and re-examine the estimates:
Again all the leverage plots look fine, and the VIFs are all small. In our final model, each additional gram of total fat increases calories by 8.9 calories;³⁰ each additional gram of protein increases calories by 4.7 calories;³¹ each additional gram of carbohydrate increases calories by 4.1 calories;³² and each mg of sodium decreases calories by a minuscule amount. The biological relevance of the sodium contribution is unknown. Perhaps this is an artifact of this particular data set?

This particular example was "easy", as the true model is known and the response is almost exactly predicted by the predictors. As noted earlier, most ecological contexts are not so nearly perfect.
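JMP reports the VIF directly, but the idea is easy to check by hand: when a model has exactly two predictors, the VIF of either one is 1/(1 − r²), where r is the correlation between them. The sketch below uses made-up fat values (the real cereal dataset is not reproduced here), so the numbers are illustrative only:

```python
import math

def vif_two_predictors(x1, x2):
    """VIF for either predictor when a model has exactly two predictors:
    VIF = 1 / (1 - r^2), where r is the correlation between the predictors."""
    n = len(x1)
    m1, m2 = sum(x1) / n, sum(x2) / n
    sxy = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    sx = math.sqrt(sum((a - m1) ** 2 for a in x1))
    sy = math.sqrt(sum((b - m2) ** 2 for b in x2))
    r = sxy / (sx * sy)
    return 1.0 / (1.0 - r ** 2)

# Made-up grams of total fat and saturated fat for a handful of cereals;
# saturated fat is close to a fixed fraction of total fat, so the two are
# highly correlated and the VIF is large.
total_fat = [1.0, 2.0, 3.0, 1.5, 5.0, 0.5, 4.0]
sat_fat = [0.3, 0.7, 1.1, 0.5, 1.8, 0.1, 1.5]

print(round(vif_two_predictors(total_fat, sat_fat)))
```

A VIF far above 10, as here, is the numerical signature of the redundancy between total fat and saturated fat that prompted dropping one of them.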
6.7.7 Example: Fitness dataset

- this will be demonstrated in class

6.7.8 Example: Predicting zooplankton biomass
What drives the biomass of zooplankton on reefs? The zooplankton was broken into two size classes (190–600 µm and >600 µm), and environmental variables were sampled at 51 irregularly spaced sites (sampling interval: 156–37 m) arranged along a straight-line cross-shelf transect 8.4 km in length.
The raw data are available at http://www.esapubs.org/archive/ecol/E085/050/suppl-1.htm#anchorFilelist in the Guadeloupe.txt file, and in the guadeloupe.jmp file in the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms.

The response variable is the log-transformed zooplankton biomass of the two size classes (original units: mg/m³ ash-free dry mass).³³ The predictor variables include
• coordinate (km) of the sampling site along the transect.
³⁰ The accepted value for fat is 9 calories/gram.
³¹ The accepted value for protein is 4 calories/gram.
³² The accepted value for carbohydrates is 4 calories/gram.
³³ Why was a log-transform used?
• environmental variables such as dissolved oxygen (mg/L), salinity (psu), wind speed (m/s), phytoplankton biomass (log-transformed, original units: µg/L), turbidity (NTU), swell height (m)

• habitat variables coded as 14 indicator variables indicating various habitat classes.
We will try to develop a prediction equation for the larger zooplankton category.

It is always good practice to do some preliminary plots of the data to search for outliers and general trends before beginning a more sophisticated analysis.

Start with a scatterplot matrix of the continuous variables, obtained from the Analyze->MultiVariateMethods->Multivariate platform:
There appears to be a strong bivariate relationship of biomass with distance along the transect line and with phytoplankton biomass. At the same time, several of the predictors appear to be highly related. For example, the distance along the transect line and phytoplankton biomass are very strongly related, as are wind speed and swell height. A quadratic relationship between some of the predictor variables is also apparent (e.g. wind speed vs. distance). A few unusual points appear; e.g. look at the plot of salinity vs. log(zooplankton), where two points seem at odds with the rest of the data. By clicking on these points, we see that they correspond to site 5 (whose marker I subsequently changed to an X to see where it fit in the rest of the plot) and site 1 (whose marker I subsequently changed to a triangle for the remainder of the analysis).
A common problem with indicator variables is insufficient contrast, i.e. there are only a few sampling sites with a particular habitat variable. You can see how many of each habitat type are present by simply counting the number of 1's in each indicator-variable column or finding the "sum" of each column.
These indicate that there is only 1 site with under 25% coverage of sea-grass on muddy sand, and most of the indicator variables occur on less than 10% of the sites. I would be hesitant to read too much into any regression equation that includes most of these indicator variables, as I suspect they will be specific to this particular dataset and not generalizable to other datasets.
So, based on this preliminary analysis, I would expect that distance and/or phytoplankton and/or turbidity would be the primary predictors for zooplankton biomass in this category. With only 51 data points, I would be reluctant to include more than about 5 predictor variables, using the rule of thumb of 10 observations per predictor.
The Analyze->Fit Model platform is used to request a stepwise regression analysis:
A stepwise analysis is requested.
The step history:
shows that R² increases fairly rapidly until it hits around 80% and then tends to level off; the Cp approaches p³⁴ also around step 9 or 10.
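For reference, one common form of Mallows' criterion is Cp = SSE_p/MSE_full − (n − 2p), with JMP's convention that p counts the intercept. The sketch below uses hypothetical numbers (the real SSE values are in the JMP output, not reproduced here) to show why Cp lands near p for an adequate submodel:

```python
def mallows_cp(sse_p, mse_full, n, p):
    """Mallows' Cp = SSE_p / MSE_full - (n - 2p), where p counts the
    estimated coefficients INCLUDING the intercept (JMP's convention)."""
    return sse_p / mse_full - (n - 2 * p)

# Hypothetical numbers: 51 sites, a candidate model with p = 5 coefficients.
# If the candidate model is unbiased, E[SSE_p] is about (n - p) * sigma^2,
# so Cp lands close to p.
n, p, mse_full = 51, 5, 2.0
sse_p = (n - p) * mse_full
print(mallows_cp(sse_p, mse_full, n, p))   # -> 5.0, equal to p
```

A submodel that omits important predictors inflates SSE_p and pushes Cp well above p, which is why watching Cp fall toward p along the step history is a useful stopping signal.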
The summary of the steps shows that the transect location is the first variable in, followed, surprisingly, by several indicator variables, followed by phytoplankton biomass. It is somewhat surprising that both the transect location and the phytoplankton biomass are entered into the model, as they are highly related.

Rerun the stepwise procedure, a step at a time for the first 9 steps, and then press the Make Model button:
³⁴ Note that JMP uses the convention that the count p INCLUDES the intercept.
to actually fit this model. The plot of actual vs. predicted:
shows a reasonable fit. Some of the leverage plots for the indicator variables show that the fit is determined by a single site or a pair of sites:

The VIFs for the transect location and phytoplankton biomass variables:
are large – a consequence of the strong relationship between these two variables.

I would subsequently remove one of the transect location or phytoplankton biomass variables, and would likely remove any indicator variable that is entered but depends on a single site, as this is surely an artifact of this particular dataset.

All-possible-subsets regression is barely feasible with this size of problem. It took less than three minutes to fit on my Macintosh G4 at home, but the output file was enormous! I suspect that unless some way is found to condense the output to something more user-friendly, this would not be a feasible way to proceed.
Chapter 7

Logistic Regression

7.1 Introduction

7.1.1 Difference between standard and logistic regression
In regular multiple-regression problems, the Y variable is assumed to have a continuous distribution, with the vertical deviations around the regression line being independently normally distributed with a mean of 0 and a constant variance σ². The X variables are either continuous or indicator variables.

In some cases, the Y variable is a categorical variable, often with two distinct classes. The X variables can be either continuous or indicator variables. The object is now to predict the CATEGORY in which a particular observation will lie.
For example:

• The Y variable is over-winter survival of a deer (yes or no) as a function of body mass, condition factor, and winter severity index.

• The Y variable is fledging (yes or no) of birds as a function of distance from the edge of a field, food availability, and a predation index.

• The Y variable is breeding (yes or no) of birds as a function of nest density, predators, and temperature.
Consequently, the linear regression model with normally distributed vertical deviations really doesn't make much sense – the response variable is a category and does NOT follow a normal distribution. In these cases, a popular methodology is logistic regression.

There are a number of good books on the use of logistic regression:
• Agresti, A. (2002). Categorical Data Analysis. Wiley: New York.

• Hosmer, D.W. and Lemeshow, S. (2000). Applied Logistic Regression. Wiley: New York.

These should be consulted for all the gory details on the use of logistic regression.
7.1.2 The Binomial Distribution
A common probability model for outcomes that come in only two states (e.g. alive or dead, success or failure, breeding or not breeding) is the Binomial distribution. The Binomial distribution counts the number of times that a particular event will occur in a sequence of observations.¹ The binomial distribution is used when a researcher is interested in the occurrence of an event, not in its magnitude. For instance, in a clinical trial, a patient may survive or die. The researcher studies the number of survivors, not how long the patient survives after treatment. In a study of bird nests, the number in the clutch that hatch is measured, not the length of time to hatch.

In general, the binomial distribution counts the number of events in a set of trials, e.g. the number of deaths in a cohort of patients, the number of broken eggs in a box of eggs, or the number of eggs that hatch from a clutch. Other situations in which binomial distributions arise are quality control, public opinion surveys, medical research, and insurance problems.
It is important to examine the assumptions being made before a Binomial distribution is used. The conditions for a Binomial distribution are:

• n identical trials (n could be 1);
• all trials are independent of each other;
• each trial has only one outcome, success or failure;
• the probability of success is constant for the set of n trials. Some books use p to represent the probability of success; other books use π;²
• the response variable Y is the number of successes³ in the set of n trials.
However, not all experiments that on the surface look like binomial experiments satisfy all the required assumptions. Typical failures of the assumptions include non-independence (e.g. the first bird that hatches destroys the remaining eggs in the nest) or a changing p within a set of trials (e.g. measuring genetic abnormalities for a particular mother as a function of her age; for many species, older mothers have a higher probability of genetic defects in their offspring as they age).
¹ The Poisson distribution is a close cousin of the Binomial distribution and is discussed in other chapters.
² Following the convention that Greek letters refer to population parameters, just like µ refers to the population mean.
³ There is great flexibility in defining what is a success. For example, you could count either the number of eggs that hatch or the number of eggs that fail to hatch in a clutch. You will get the same answers from the analysis after making the appropriate substitutions.
The probability of observing Y successes in n trials, if each success has probability p of occurring, can be computed using:

\[ P(Y = y \mid n, p) = \binom{n}{y} p^y (1-p)^{n-y} \]

where the binomial coefficient is computed as

\[ \binom{n}{y} = \frac{n!}{y!\,(n-y)!} \]

and where n! = n(n − 1)(n − 2) · · · (2)(1).
For example, the probability of observing Y = 3 eggs hatching from a nest with n = 5 eggs in the clutch, if the probability of success is p = .2, is

\[ P(Y = 3 \mid n = 5, p = .2) = \binom{5}{3} (.2)^3 (1 - .2)^{5-3} = .0512 \]
Fortunately, we will have little need for these probability computations. There are many tables that tabulate the probabilities for various combinations of n and p – check the web.
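Although tables (or the web) suffice, the computation is a one-liner in most languages. A short sketch in plain Python, reproducing the hatching example above (`binom_pmf` is our own helper name, not from any package):

```python
from math import comb

def binom_pmf(y, n, p):
    """P(Y = y | n, p) for the Binomial distribution."""
    return comb(n, y) * p**y * (1 - p)**(n - y)

# Probability that 3 of 5 eggs hatch when each hatches with probability 0.2:
print(round(binom_pmf(3, 5, 0.2), 4))   # -> 0.0512
```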
There are two important properties of a binomial distribution that will serve us in the future. If Y is Binomial(n, p), then:

• E[Y] = np
• V[Y] = np(1 − p), and the standard deviation of Y is √(np(1 − p))

For example, if n = 20 and p = .4, then the average number of successes in these 20 trials is E[Y] = np = 20(.4) = 8.
If an experiment is observed, and a certain number of successes is observed, then the estimator for the success probability is found as:

\[ \hat{p} = \frac{Y}{n} \]

For example, if a clutch of 5 eggs is observed (the set of trials) and 3 successfully hatch, then the estimated proportion of eggs that hatch is p̂ = 3/5 = .60. This is exactly analogous to the case where a sample is drawn from a population and the sample average Ȳ is used to estimate the population mean µ.
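The two moment properties and the estimator p̂ can be checked with a few lines of Python (the function names are ours, chosen for illustration):

```python
import math

def binom_mean(n, p):
    """E[Y] = np for Y ~ Binomial(n, p)."""
    return n * p

def binom_sd(n, p):
    """Standard deviation sqrt(np(1-p)) for Y ~ Binomial(n, p)."""
    return math.sqrt(n * p * (1.0 - p))

def p_hat(successes, trials):
    """Estimated success probability from an observed set of trials."""
    return successes / trials

print(binom_mean(20, 0.4))          # -> 8.0, as in the text
print(round(binom_sd(20, 0.4), 2))  # sqrt(4.8), about 2.19
print(p_hat(3, 5))                  # -> 0.6, the clutch example
```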
7.1.3 Odds, risk, odds-ratio, and probability
The odds of an event and the odds ratio of two events are very common terms in logistic contexts. Consequently, it is important to understand exactly what they do and do not say.
The odds of an event are defined as:

\[ \text{Odds(event)} = \frac{P(\text{event})}{P(\text{not event})} = \frac{P(\text{event})}{1 - P(\text{event})} \]

The notation used is often a colon separating the odds values. Some sample values are tabulated below:

Probability   Odds
   .01        1:99
   .1         1:9
   .5         1:1
   .6         6:4 or 3:2 or 1.5
   .9         9:1
   .99        99:1
For very small odds, the probability of the event is approximately equal to the odds. For example, if the odds are 1:99, then the probability of the event is 1/100, which is roughly equal to 1/99.
The odds ratio (OR) is, by definition, the ratio of two odds:

\[ \text{OR}_{A \text{ vs. } B} = \frac{\text{odds}(A)}{\text{odds}(B)} = \frac{P(A)/(1-P(A))}{P(B)/(1-P(B))} \]
For example, if the probability of an egg hatching under condition A is 1/10 and the probability of an egg hatching under condition B is 1/20, then the odds ratio is OR = (1:9)/(1:19) = 2.1:1. Again, for very small odds, the odds ratio is approximately equal to the ratio of the probabilities.

An odds ratio of 1 would indicate that the probabilities of the two events are equal.
In many studies, you will hear reports that the odds of an event have doubled. This gives NO information about the base rate. For example, did the odds increase from 1:million to 2:million or from 1:10 to 2:10?
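A short sketch makes the odds and odds-ratio computations concrete; `odds` and `odds_ratio` are hypothetical helper names, and the call reproduces the egg-hatching example above:

```python
def odds(p):
    """Odds of an event with probability p."""
    return p / (1.0 - p)

def odds_ratio(p_a, p_b):
    """Odds ratio comparing two events with probabilities p_a and p_b."""
    return odds(p_a) / odds(p_b)

print(round(odds(0.6), 2))                   # -> 1.5, i.e. 3:2
print(round(odds_ratio(1 / 10, 1 / 20), 1))  # -> 2.1, the hatching example
```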
It turns out that it is convenient to model probabilities on the log-odds scale. The log-odds (LO), also known as the logit, is defined as:

\[ \text{logit}(A) = \log_e(\text{odds}(A)) = \log_e\left(\frac{P(A)}{1 - P(A)}\right) \]
We can extend the previous table to compute the log-odds:
Probability   Odds                 Logit
   .01        1:99                 −4.59
   .1         1:9                  −2.20
   .5         1:1                   0
   .6         6:4 or 3:2 or 1.5     .41
   .9         9:1                   2.20
   .99        99:1                  4.59
Notice that the log-odds is zero when the probability is .5, and that the log-odds of .01 and of .99 are equal in magnitude but opposite in sign.
It is also easy to go back from the log-odds scale to the regular probability scale in two equivalent ways:

\[ p = \frac{e^{LO}}{1 + e^{LO}} = \frac{1}{1 + e^{-LO}} \]

Notice the minus sign in the second back-translation. For example, LO = 10 translates to p = .9999; LO = 4 translates to p = .98; LO = 1 translates to p = .73; etc.
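The logit and its inverse are easy to code directly. The sketch below (plain Python, with hypothetical helper names) reproduces the back-translations and the symmetry noted above:

```python
import math

def logit(p):
    """Log-odds (logit) of a probability p."""
    return math.log(p / (1.0 - p))

def inv_logit(lo):
    """Back-transform log-odds to a probability; note the minus sign."""
    return 1.0 / (1.0 + math.exp(-lo))

print(round(inv_logit(4), 2))   # -> 0.98, as quoted in the text
print(round(inv_logit(1), 2))   # -> 0.73
print(round(logit(0.01), 2), round(logit(0.99), 2))  # symmetric: -4.6 and 4.6
```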
7.1.4 Modeling the probability of success
Now if the probability of success were the same for all sets of trials, the analysis would be trivial: simply tabulate the total number of successes and divide by the total number of trials to estimate the probability of success. However, what we are really interested in is the relationship of the probability of success to some covariate X, such as temperature or condition factor.

For example, consider the following (hypothetical) example of an experiment where various clutches of bird eggs were found, and the number of eggs that hatched and fledged was measured along with the height of the nest above the ground:
Height   Clutch Size   Fledged    p̂
 2.0         4            0      0.00
 3.0         3            0      0.00
 2.5         5            0      0.00
 3.3         3            2      0.67
 4.7         4            1      0.25
 3.9         2            0      0.00
 5.2         4            2      0.50
10.5         5            5      1.00
 4.7         4            2      0.50
 6.8         5            3      0.60
 7.3         3            3      1.00
 8.4         4            3      0.75
 9.2         3            2      0.67
 8.5         4            4      1.00
10.0         3            3      1.00
12.0         6            6      1.00
15.0         4            4      1.00
12.2         3            3      1.00
13.0         5            5      1.00
12.9         4            4      1.00
Notice that the probability of fledging seems to increase with height above the ground (potentially reflecting distance from predators?).

We would like to model the probability of success as a function of height. As a first attempt, suppose that we plot the estimated probability of success (p̂) against height and try to fit a straight line to the plotted points.

The Analyze->Fit Y-by-X platform was used, with p̂ treated as the Y variable and Height as the X variable:
This procedure is not entirely satisfactory for a number of reasons:

• The data points seem to follow an S-shaped relationship, with probabilities of success near 0 at lower heights and near 1 at greater heights.

• The fitted line gives predictions for the probability of success that are greater than 1 or less than 0, which is impossible.

• The fitted line cannot deal properly with the fact that the probability of success is likely close to 0% for a wide range of small heights and essentially close to 100% for a wide range of taller heights.

• The assumption of a normal distribution for the deviations from the fitted line is not tenable, as the p̂ are essentially discrete for the small clutch sizes found in this experiment.

• While not apparent from this graph, the variability of the response changes over the different parts of the regression line. For example, when the true probability of success is very low (say 0.1), the standard deviation of the number fledged for a clutch with 5 eggs is √(5(.1)(.9)) = .67, while the standard deviation of the number fledged for a clutch with 5 eggs and a probability of success of 0.5 is √(5(.5)(.5)) = 1.1, which is almost twice as large as the previous standard deviation.
For these (and other) reasons, the analysis of this type of data is commonly done on the log-odds (also called the logit) scale. The odds of an event are computed as:

\[ \text{ODDS} = \frac{p}{1 - p} \]

and the log-odds is found as the (natural) logarithm of the odds:

\[ LO = \log\left(\frac{p}{1 - p}\right) \]
This transformation converts the 0–1 scale of probability to a −∞ to ∞ scale, as illustrated below:
p       LO
0.001   -6.91
0.01    -4.60
0.05    -2.94
0.1     -2.20
0.2     -1.39
0.3     -0.85
0.4     -0.41
0.5      0.00
0.6      0.41
0.7      0.85
0.8      1.39
0.9      2.20
0.95     2.94
0.99     4.60
0.999    6.91
Notice that the log-odds scale is symmetric about 0 and that, for moderate values of p, equal changes on the p-scale correspond to nearly constant changes on the log-odds scale. For example, going from .5 → .6 → .7 on the p-scale corresponds to moving from 0 → .41 → .85 on the log-odds scale.
It is also easy to go back from the log-odds scale to the regular probability scale:

\[ p = \frac{e^{LO}}{1 + e^{LO}} = \frac{1}{1 + e^{-LO}} \]
For example, LO = 10 translates to p = .9999; LO = 4 translates to p = .98; LO = 1 translates to p = .73; etc.
We can now return to the previous data. At first glance, it would seem that the log-odds could simply be estimated as:

\[ \widehat{LO} = \log\left(\frac{\hat{p}}{1 - \hat{p}}\right) \]

but this doesn't work well with small sample sizes (it can be shown that the simple logit function is biased) or when values of p̂ are close to 0 or 1 (the simple logit function hits ±∞). Consequently, in small samples or when the observed probability of success is close to 0 or 1, the empirical log-odds is often computed as:

\[ \widehat{LO}_{\text{empirical}} = \log\left(\frac{n\hat{p} + .5}{n(1 - \hat{p}) + .5}\right) = \log\left(\frac{\hat{p} + .5/n}{1 - \hat{p} + .5/n}\right) \]
We compute the empirical log-odds for the hatching data:
Height   Clutch   Fledged    p̂     LÔ(emp)
 2.0        4        0      0.00    -2.20
 3.0        3        0      0.00    -1.95
 2.5        5        0      0.00    -2.40
 3.3        3        2      0.67     0.51
 4.7        4        1      0.25    -0.85
 3.9        2        0      0.00    -1.61
 5.2        4        2      0.50     0.00
10.5        5        5      1.00     2.40
 4.7        4        2      0.50     0.00
 6.8        5        3      0.60     0.34
 7.3        3        3      1.00     1.95
 8.4        4        3      0.75     0.85
 9.2        3        2      0.67     0.51
 8.5        4        4      1.00     2.20
10.0        3        3      1.00     1.95
12.0        6        6      1.00     2.56
15.0        4        4      1.00     2.20
12.2        3        3      1.00     1.95
13.0        5        5      1.00     2.40
12.9        4        4      1.00     2.20
and now plot the empirical log-odds against height:
The fit is much nicer: the relationship has been linearized, and now, no matter what the prediction is, it can always be translated back to a probability between 0 and 1 using the inverse transform seen earlier.
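The empirical log-odds values can be reproduced with a short script; the 0.5 correction is what keeps the logit finite even when p̂ is exactly 0 or 1 (the helper name is ours, and only the first few rows of the hypothetical data are shown):

```python
import math

def empirical_log_odds(fledged, clutch):
    """Empirical log-odds with the 0.5 correction: stays finite even when
    the observed proportion is exactly 0 or 1."""
    return math.log((fledged + 0.5) / (clutch - fledged + 0.5))

# First few rows of the hypothetical hatching data: (height, clutch, fledged)
rows = [(2.0, 4, 0), (3.0, 3, 0), (2.5, 5, 0), (3.3, 3, 2), (4.7, 4, 1)]
for height, clutch, fledged in rows:
    print(height, round(empirical_log_odds(fledged, clutch), 2))
```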
7.1.5 Logistic regression
But this is still not enough. Even on the log-odds scale, the data points are not normally distributed around the regression line. Consequently, rather than using ordinary least squares to fit the line, a technique called generalized linear modeling is used.

In generalized linear models, a method called maximum likelihood is used to find the parameters of the model (in this case, the intercept and the regression coefficient of height) that give the best fit to the data. While the details of maximum likelihood estimation are beyond the scope of this course, it is closely related to weighted least squares in this class of problems. Maximum likelihood estimators (often abbreviated as MLEs) are, under fairly general conditions, guaranteed to be the "best" (in the sense of having the smallest standard errors) in large samples. In small samples there is no guarantee that MLEs are optimal, but in practice MLEs seem to work well. In most cases, the calculations must be done numerically – there are no simple formulae as in simple linear regression.⁴
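To make the numerical idea concrete, here is a minimal sketch of maximum likelihood for the two-parameter model logit(p) = β₀ + β₁·height, fit by Newton-Raphson to the clutch data from the previous section. This is an illustration only, not JMP's implementation; the fixed iteration count, zero starting values, and the clamp on the linear predictor are simplistic choices made for the sketch:

```python
import math

# (height, clutch size, number fledged) -- the hypothetical data from the text
data = [(2.0, 4, 0), (3.0, 3, 0), (2.5, 5, 0), (3.3, 3, 2), (4.7, 4, 1),
        (3.9, 2, 0), (5.2, 4, 2), (10.5, 5, 5), (4.7, 4, 2), (6.8, 5, 3),
        (7.3, 3, 3), (8.4, 4, 3), (9.2, 3, 2), (8.5, 4, 4), (10.0, 3, 3),
        (12.0, 6, 6), (15.0, 4, 4), (12.2, 3, 3), (13.0, 5, 5), (12.9, 4, 4)]

def fit_logistic(data, iters=25):
    """Fit logit(p) = b0 + b1*x to binomial data by Newton-Raphson
    (maximum likelihood); returns the estimated intercept and slope."""
    b0 = b1 = 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0   # score vector and information matrix
        for x, n, y in data:
            z = max(-30.0, min(30.0, b0 + b1 * x))   # guard against overflow
            p = 1.0 / (1.0 + math.exp(-z))
            w = n * p * (1.0 - p)
            g0 += y - n * p
            g1 += (y - n * p) * x
            h00 += w
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det   # Newton step: beta += I^(-1) g
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

b0, b1 = fit_logistic(data)
print(round(b0, 2), round(b1, 2))
```

With a positive estimated slope, the fitted curve rises from near 0 at low nests toward 1 at tall ones, matching the empirical log-odds plot.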
In order to fit a logistic regression using maximum likelihood estimation, the data must be in a standard format. In particular, both successes and failures must be recorded, along with a classification variable that is nominally scaled. For example, the first clutch (at 2.0 m) will generate two lines of data – one for the successful fledges and one for the unsuccessful fledges. If the count for a particular outcome is zero, it can be omitted from the data table, but I prefer to record a value of 0 so that there is no doubt that all eggs were examined and none of this outcome were observed.
A new column was created in JMP for the number of eggs that failed to fledge and, after stacking the revised dataset, the dataset in JMP that can be used for logistic regression looks like:⁵
⁴ Other methods that are quite popular are non-iterative weighted least squares and discriminant function analysis. These are beyond the scope of this course.
⁵ This stacked data is available in the eggsfledge2.jmp dataset available from the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms.
The Analyze->Fit Y-by-X platform is used to launch simple logistic regression:
Note that the Outcome is the actual Y variable (and is nominally scaled) while the Count column simply indicates how many of this outcome were observed. The X variable is Height as before. JMP knows this is a logistic regression by the combination of a nominally or ordinally scaled Y variable and a continuously scaled X variable, as seen by the reminder at the left of the platform dialogue box.
This gives the output:
The first point to note is that most computer packages make arbitrary decisions on what is a “success” and what is a “failure” when fitting the logistic regression. It is important to always look at the output carefully to see what has been defined as a success. In this case, at the bottom of the output, JMP has indicated that fledged is considered a “success” and not fledged a “failure”. If it had reversed the roles of these two
categories, everything would be “identical” except reversed appropriately.
Second, rather bizarrely, the actual data points plotted by JMP really don’t have any meaning! According to the JMP help screens:
Markers for the data are drawn at their x-coordinate, with the y position jittered randomly within the range corresponding to the response category for that row.
So if you do the analysis on the exact same data, the data points are jittered and will look different even though the fit is the same. The explanation on the JMP support pages on the web states: 6
The exact vertical placement of points in the logistic regression plots (for instance, on pages 308 and 309 of the JMP User’s Guide, Version 2, and pages 114 and 115 of the JMP Statistics and Graphics Guide, Version 3) has no particular interpretation. The points are placed midway between curves so as to assure their visibility. However, the location of a point between a particular set of curves is important. All points between a particular set of curves have the same observed value for the dependent variable. Of course, the horizontal placement of each point is meaningful with respect to the horizontal axis.
This is rather unfortunate, to say the least! It means that the user must create a nice plot by hand. This plot should show the estimated proportions as a function of height, with the fitted curve then overdrawn.
Fortunately, the fitted curves are correct (whew). The curves presented don’t look linear only because JMP has transformed back from the log-odds scale to the regular probability scale. A linear curve on the log-odds scale has a characteristic “S” shape on the regular probability scale with the ends of the curve flattening out at 0 and 1. Using the Cross Hairs tool, you can see that a height of 5 m gives a predicted probability of success (fledged) of about .39; by 7 m the estimated probability of success has risen to about .73.
The table of parameter estimates gives the estimated fit on the log-odds scale:
LO = −4.03 + .72(Height)
Substituting in the value Height = 5 gives an estimated log-odds of −.43, which on the regular probability scale corresponds to .394, as seen before from using the cross hairs.
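The back-transformation from the log-odds scale to the probability scale can be sketched in a few lines, using the fitted intercept and slope from the table above (the helper names are my own):

```python
import math

def inv_logit(lo):
    """Convert a log-odds value to a probability."""
    return 1.0 / (1.0 + math.exp(-lo))

# Fitted model from the table above: log-odds of fledging = -4.03 + .72 * height
def predicted_p_fledge(height):
    return inv_logit(-4.03 + 0.72 * height)

print(round(predicted_p_fledge(5), 3))  # log-odds -0.43 -> probability about 0.394
```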
The coefficient associated with height is interpreted as the increase in the log-odds of fledging when height is increased by 1 m.
As in simple regression, the precision of the estimates is given by the standard error. An approximate 95% confidence interval for the coefficient associated with height is found in the usual fashion, i.e. estimate ± 2se. 7 This confidence interval does NOT include 0; therefore there is good evidence that the probability of fledging is not constant over the various heights.
6 http://www.jmp.com/support/techsup/notes/001897.html
7 It is not possible to display the 95% confidence intervals in the Analyze->Fit Y-by-X platform output by right-clicking in the table (don’t ask me why not). However, if the Analyze->Fit Model platform is used to fit the model, then right-clicking in the Estimates table does make the 95% confidence intervals available.
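The estimate ± 2se recipe can be sketched as follows; the standard error below is a made-up value, since the notes do not quote it for this fit:

```python
# Approximate 95% Wald interval for a logistic-regression coefficient:
# estimate +/- 2 * se.  The se here is hypothetical, for illustration only.
def wald_ci(estimate, se):
    return (estimate - 2 * se, estimate + 2 * se)

low, high = wald_ci(0.72, 0.20)   # slope for height; hypothetical se of 0.20
print(round(low, 2), round(high, 2))
print("excludes zero:", low > 0 or high < 0)
```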
Similarly, the p-value is interpreted in the same way – how consistent are the data with the hypothesis of NO effect of height upon the survival rate. Rather than the t-test seen in linear regression, maximum likelihood methods often construct the test statistics in a different fashion (called χ2 likelihood ratio tests). The test statistic is not particularly of interest – only the final p-value matters. In this case, it is well below α = .05, so there is good evidence that the probability of success is not constant across heights. As in all cases, statistical significance is no guarantee of biological relevance.
In theory, it is possible to obtain prediction intervals and confidence intervals for the MEAN probability of success at new values of X – JMP does not provide these in the Analyze->Fit Y-by-X platform with logistic regression. It does do Inverse Predictions and can give confidence bounds on the inverse prediction, which require the confidence bounds to be computed, so it is a mystery to me why the confidence intervals for the mean probability of success at future X values are not provided.
The Analyze->Fit Model platform can also be used to fit a logistic regression in the same way:
Be sure to specify the Y variable as a nominally or ordinally scaled variable; the count as the frequency variable; and the X variables in the usual fashion. The Analyze->Fit Model platform automatically switches to indicate that a logistic regression will be run.
The same information as previously seen is shown again. But you can now obtain 95% confidence
intervals for the parameter estimates, and there are additional options under the red-triangle pop-down menu. These features will be explored in more detail in further examples.
Lastly, the Analyze->Fit Model platform using the Generalized Linear Model option in the personality box in the upper right corner can also be used to fit this model. Specify a binomial distribution with the logit link. You get similar results with more goodies under the red-triangles, such as confidence intervals for the MEAN probability of success that can be saved to the data table, residual plots, and more. Again, these will be explored in more detail in the examples.
7.2 Data Structures
There are two common ways in which data can be entered for logistic regression: either as individual observations or as grouped counts.
If individual data points are entered, each line of the data file corresponds to a single individual. The columns correspond to the predictors (X), which can be continuous (interval or ratio scales) or classification variables (nominal or ordinal). The response (Y) must be a classification variable with any two possible outcomes 8 . Most packages will arbitrarily choose one of these classes to be the success – often this is the first category when sorted alphabetically. I would recommend that you do NOT code the response variable as 0/1 – it is far too easy to forget that the 0/1 correspond to nominally or ordinally scaled variables and not to continuous variables.
As an example, suppose you wish to predict if an egg will hatch given the height in a tree. The data structure for individuals would look something like:
Egg   Height   Outcome
  1       10   hatch
  2       15   not hatch
  3        5   hatch
  4       10   hatch
  5       10   not hatch
. . .
Notice that even though three eggs were all at 10 m height, separate data lines for each of the three eggs appear in the data file.
In grouped counts, each line in the data file corresponds to a group of events with the same predictor (X) variables. Often researchers record the number of events and the number of successes in two separate columns, or the number of successes and the number of failures in two separate columns. These data must be converted to two rows per group – one for the successes and one for the failures – with one variable representing
8 In more advanced classes this restriction can be relaxed.
the outcome and a second variable representing the frequency of this event. The outcome will be the Y variable, while the count will be the frequency variable. 9
For example, the above data could be originally entered as:
Height   Hatch   Not Hatch
    10       2           1
    15       0           1
     5       1           0
. . .
but must be translated (e.g. using the Tables → Stack command) to:
Height   Outcome     Count
    10   Hatch           2
    10   Not Hatch       1
    15   Hatch           0
    15   Not Hatch       1
. . .
     5   Hatch           1
     5   Not Hatch       0
While it is not required that counts of zero have data lines present, it is good statistical practice to remind yourself that you did look for failures but failed to find any.
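The Tables → Stack translation above can be mimicked in plain Python; a minimal sketch (the column names are illustrative):

```python
# Reshape "wide" counts (one row per height, hatch / not-hatch columns)
# into the stacked format logistic-regression routines expect:
# one row per (height, outcome) pair with a frequency column.
wide = [
    {"height": 10, "hatch": 2, "not_hatch": 1},
    {"height": 15, "hatch": 0, "not_hatch": 1},
    {"height": 5,  "hatch": 1, "not_hatch": 0},
]

stacked = []
for row in wide:
    # Keep zero counts so there is no doubt that all eggs were examined.
    stacked.append({"height": row["height"], "outcome": "Hatch", "count": row["hatch"]})
    stacked.append({"height": row["height"], "outcome": "Not Hatch", "count": row["not_hatch"]})

for r in stacked:
    print(r)
```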
7.3 Assumptions made in logistic regression
Many of the assumptions made for logistic regression parallel those made for ordinary regression with obvious modifications.
1. Check sampling design. In these course notes it is implicitly assumed that the data are collected either as a simple random sample or under a completely randomized design experiment. This implies that the units selected must be a random sample (with equal probability) from the relevant populations, or complete randomization during the assignment of treatments to experimental units. The experimental unit must equal the observational unit (no pseudo-replication), and there must be no pairing, blocking, or stratification.
It is possible to generalize logistic regression to cases where pairing, blocking, or stratification took place (for example, in case-control studies), but these are not covered during this course.
9 Refer to the section on Poisson regression for an alternate way to analyze this type of data where the count is the response variable.
Common ways in which these assumptions are violated include:
• Collecting data under a cluster design. For example, classrooms are selected at random from a school district and individuals within a classroom are then measured. Or herds or schools of animals are selected and all individuals within the herd or school are measured.
• Quota samples are used to select individuals with certain classifications. For example, exactly 100 males and 100 females are sampled and you are trying to predict sex as the outcome measure.
2. No outliers. This is usually pretty easy to check. A logistic regression only allows two categories within the response variable. If there are more than two categories of responses, this may represent a typographical error and should be corrected. Or, categories should be combined into larger categories. It is possible to generalize logistic regression to the case of more than two possible outcomes. Please contact a statistician for assistance.
3. Missing values are MCAR. The usual assumption as listed in earlier chapters.
4. Binomial distribution. This is a crucial assumption. A binomial distribution is appropriate when there is a fixed number of trials at a given set of covariates (could be 1 trial); there is a constant probability of “success” within that set of trials; each trial is independent; and the number of successes in the n trials is measured.
Common ways in which this assumption is violated are:
• Items within a set of trials do not operate independently of each other. For example, subjects could be litter mates, twins, or share environmental variables. This can lead to over- or under-dispersion.
• The probability of success within the set of trials is not constant. For example, suppose a set of trials is defined by weight class. Not everyone in the weight class is exactly the same weight and so their probability of “success” could vary. Animals don’t all have exactly the same survival rates.
• The number of trials is not fixed. For example, sampling could occur until a certain number of successes occur. In this case, a negative binomial distribution would be more appropriate.
5. Independence among subjects. See above.
7.4 Example: Space Shuttle - Single continuous predictor
In January 1986, the space shuttle Challenger was destroyed on launch. Subsequent investigations showed that an O-ring, a piece of rubber used to seal two segments of the booster rocket, failed, allowing highly flammable fuel to leak, ignite, and destroy the ship. 10
As part of the investigation, the following chart of previous launches and the temperature at which each shuttle was launched was presented:
10 Refer to http://en.wikipedia.org/wiki/Space_Shuttle_Challenger_disaster.
The raw data is available in the JMP file spaceshuttleoring.jmp available from the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms.
Notice that the raw data has a single line for each previous launch even though there are multiple launches at some temperatures. The X variable is temperature and the Y variable is the outcome – either f for failure of the O-ring, or OK for a launch where the O-ring did not fail.
With the data in single-observation form, it is impossible to make a simple plot of the empirical logistic function. If some of the temperatures were pooled, you might be able to do a simple plot.
The Analyze->Fit Y-by-X platform was used and gave the following results:
First notice that JMP treats a failure f as a “success”, and will model the probability of failure as a function
of temperature. This is why it is important that you examine computer output carefully to see exactly what a package is doing.
The graph showing the fitted logistic curve must be interpreted carefully. While the plotted curve is correct, the actual data points are randomly placed – groan – see the notes in the previous section.
The estimated model is:
logit(failure) = 10.875 − .17(temperature)
So, the log-odds of failure decrease by .17 (se .083) units for every degree (°F) increase in launch temperature. Conversely, the log-odds of failure increase by .17 for every degree (°F) decrease in temperature.
The p-value for no effect of temperature is just below α = .05.
Using the same reasoning as was done for ordinary regression, the odds of failure increase by a factor of e^.17 = 1.18, i.e. almost an 18% increase per degree drop.
To predict the failure rate at a given temperature, a two-stage process is required. First, estimate the log-odds by substituting in the X values of interest. Second, convert the estimated log-odds to a probability using
p(x) = e^LO(x) / (1 + e^LO(x)) = 1 / (1 + e^−LO(x)).
The actual launch was at 32 °F. While it is extremely dangerous to try and predict outside the range of observed data, the estimated log-odds of failure of the O-ring are 10.875 − .17(32) = 5.43, and then p(failure) = e^5.43 / (1 + e^5.43) = .99+, i.e. well over 99%!
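The two-stage prediction, and the odds-ratio interpretation above, can be checked numerically; a sketch using the fitted coefficients from the text:

```python
import math

# Fitted model from the notes: log-odds of O-ring failure
# LO = 10.875 - 0.17 * temperature (temperature in degrees F).
def p_failure(temp_f):
    lo = 10.875 - 0.17 * temp_f
    return math.exp(lo) / (1.0 + math.exp(lo))

# Odds of failure multiply by e^0.17 for each one-degree DROP in temperature.
print(round(math.exp(0.17), 3))   # about 1.18, i.e. almost an 18% increase
print(round(p_failure(32), 3))    # about 0.996 -- well over 99%
```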
It is possible to find confidence bounds for these predictions – the easiest way is to create some “dummy” rows in the data table corresponding to the future predictions with the response variable left blank. Use JMP’s Exclude Rows feature to exclude these rows from the model fit. Then use the red-triangle to save predictions and confidence bounds back to the data table.
The Analyze->Fit Model platform gives the same results with additional analysis options that we will examine in future examples.
The Analyze->Fit Model platform using the Generalized Linear Model option also gives the same results with additional analysis options. For example, it is possible to compute confidence intervals for the predicted probability of success at the new X. Use the pop-down menu beside the red-triangle:
The predicted values and 95% confidence intervals for the predicted probability are stored in the data table:
These are found by finding the predicted log-odds and a 95% confidence interval for the predicted log-odds, and then inverting the confidence interval endpoints in the same way as the predicted probabilities are obtained from the predicted log-odds.
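A sketch of that endpoint inversion, using hypothetical values for the predicted log-odds and its standard error (the notes do not quote them directly):

```python
import math

def inv_logit(lo):
    """Convert a log-odds value to a probability."""
    return 1.0 / (1.0 + math.exp(-lo))

# Build the approximate 95% interval on the log-odds scale, then
# back-transform each endpoint.  Both numbers below are hypothetical.
lo_hat, se = 5.435, 3.0
lower, upper = lo_hat - 2 * se, lo_hat + 2 * se
print(round(inv_logit(lower), 3), round(inv_logit(upper), 3))
```

Because the transformation is monotone, the endpoints of the interval on the log-odds scale map directly to the endpoints on the probability scale.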
While the predicted value and the 95% confidence interval are available, for some odd reason the se of the predicted probability is not presented – this is odd as it is easily computed. The confidence intervals are quite wide given that there were only 24 data values and only a few failures.
It should be noted that only predictions of the probability of success and confidence intervals for the
probability of success are computed. These intervals apply to all future subjects that have the particular value of the covariates. Unlike the case of linear regression, it really doesn’t make sense to predict individual outcomes as these are categories. It is sensible to look at which category is most probable and then use this as a “guess” for the individual response, but that is about it. This area of predicting categories for individuals is called discriminant analysis and has a long history in statistics. There are many excellent books on this topic.
7.5 Example: Predicting Sex from physical measurements - Multiple continuous predictors
The extension to multiple continuous X variables is immediate. As before, there are now several predictors. It is usually highly unlikely to have multiple observations with exactly the same set of X values, so the data sets usually consist of individual observations.
Let us proceed by example using the Fitness data set available in the JMP sample data library. This dataset has variables on age, weight, and measurements of performance taken during a fitness assessment. In this case we will try and predict the sex of the subject given the various attributes.
As usual, before doing any computations, examine the data for unusual points. Look at pairwise plots, the pattern of missing values, etc.
It is important that the data be collected under a completely randomized design or simple random sample. If your data are collected under a different design, e.g. a cluster design, please seek suitable assistance.
Use the Analyze->Fit Model platform to fit a logistic regression trying to predict sex from the age, weight, oxygen consumption and run time:
This gives the summary output:
First determine which category is being predicted. In this case, the sex = f category will be predicted.
The Whole Model Test examines if there is evidence of any predictive ability in the 4 predictor variables. The p-value is very small, indicating that there is predictive ability.
Because we have NO categorical predictors, the Effect Tests can be ignored for now. The Parameter Estimates look for the marginal contribution of each predictor to predicting the probability of being a Female. Just like in regular regression, these are MARGINAL contributions, i.e. how much would the log-odds for the probability of being female change if this variable changed by one unit and all other variables remained in the model and did not change. In this case, there is good evidence that weight is a good predictor (not surprisingly), but also some evidence that oxygen consumption may be useful. 11 If you look at the dot plots for the weight for the two sexes and for the oxygen consumption for the two sexes, the two groups seem to be separated on these variables:
11 The output above actually appears to be a bit contradictory. The chi-square value for the effect of weight is 17 with a p-value < .0001. Yet the 95% confidence interval for the coefficient associated with weight ranges from (−1.57 → .105), which INCLUDES zero, and so would not be statistically significant! It turns out that JMP has mixed two (asymptotically) equivalent methods in this one output. The chi-square value and p-value are computed using a likelihood ratio test (a model with and without this variable is fit and the difference in fit is measured), while the confidence intervals are computed using a Wald approximation (estimate ± 2(se)). In small samples, the sampling distribution for an estimate may not be very symmetric or close to normally shaped, and so the Wald intervals may not perform well.
The estimated coefficient for weight is −.73. This indicates that the log-odds of being female decrease by .73 for every additional unit of weight, all other variables held fixed. This often appears in scientific reports as the adjusted effect of weight – the adjusted term implies that it is the marginal contribution. Confidence intervals for the individual coefficients (for predicting the log-odds of being female) are interpreted in the same way.
Just like in regular regression, collinearity can be a problem in the X values. There is no easy test for collinearity in logistic regression in JMP, but diagnostics similar to those in ordinary regression are becoming available.
Before dropping more than one variable, it is possible to test if two or more variables can be dropped. Use the Custom Test options from the drop-down menu:
Complete the boxes in a similar way as in ordinary linear regression. For example, to test if both age and runtime can be dropped:
which gives:
It appears safe to drop both variables.
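The likelihood-ratio computation behind such a test can be sketched as follows; the two log-likelihood values are hypothetical, and the one-line p-value formula holds only for the 2-degree-of-freedom case (two dropped terms), where the chi-square survival function is exp(-x/2):

```python
import math

# Likelihood-ratio test for dropping two predictors at once.
# The fitted log-likelihoods below are hypothetical, for illustration only.
ll_full, ll_reduced = -10.0, -10.9

lr_stat = 2 * (ll_full - ll_reduced)        # 2 * difference in log-likelihood
p_value = math.exp(-lr_stat / 2)            # chi-square survival, df = 2 ONLY
print(round(lr_stat, 2), round(p_value, 3)) # a large p-value -> safe to drop both
```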
Just as in regular regression, you can fit quadratic and product terms to try and capture some non-linearity in the log-odds. This affects the interpretation of the estimated coefficients in the same way as in ordinary regression. The simpler model involving weight and oxygen consumption, their quadratic terms, and their cross-product term was fit using the Analyze->Fit Model platform:
Surprisingly, the model has problems:
Ironically, it is because the model is too good a fit. It appears that you can discriminate perfectly between men and women by fitting this model. Why does a perfect fit cause problems? The reason is that if p(sex = f) = 1, the log-odds is +∞, and it is hard to get a predicted value of ∞ from an equation without some terms also being infinite.
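To see numerically why a perfect fit breaks the estimation, here is a small Python sketch (not part of the JMP analysis itself) showing that the log-odds diverge as the fitted probability approaches 1, so no finite coefficients can reproduce a perfectly separated fit:

```python
import math

def logit(p):
    """Log-odds corresponding to a probability p."""
    return math.log(p / (1 - p))

# As the fitted probability approaches 1, the log-odds grow without bound.
for p in (0.9, 0.99, 0.999999):
    print(f"p = {p}: log-odds = {logit(p):.2f}")
```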
If you plot the weight against oxygen consumption using different symbols for males and females, you can see the near-complete separation based on simply looking at oxygen consumption and weight, without the need for quadratic and cross-product terms:
I'll continue by fitting just a model with linear effects of weight and oxygen consumption as an illustration. Use the Analyze->Fit Model platform to fit this model with just the two covariates:
Both covariates are now statistically significant and cannot be dropped.
The goodness-of-fit statistic is computed in two ways (which are asymptotically equivalent), but both are tedious to compute by hand. The deviance of a model is a measure of how well the model performs. As there are 31 data points, you could get a perfect fit by fitting a model with 31 parameters; this is exactly what happens if you try to fit a line through 2 points, where 2 parameters (the slope and intercept) will fit exactly two data points. A measure of goodness of fit is then found for the model in question based on the fitted parameters of this model. In both cases, the measure of fit is called the deviance, which is simply twice the negative of the log-likelihood, which in turn is related to the probability of observing this data given the parameter values. The difference in deviances is the deviance goodness-of-fit statistic. If the current model is a good model, the difference in deviances should be small (this is the column labeled chi-square). There is no simple calibration of deviances 12, so a p-value must be found which says how large this difference is. The p-value of .96 indicates that the difference is actually quite small; almost 96% of the time you would get a larger difference in deviances.
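The calibration of a deviance difference against a chi-square distribution can be sketched in Python. The deviance value of 17.0 below is only a placeholder for illustration (the actual statistic is read off the JMP output); the 28 df come from footnote 12:

```python
from scipy.stats import chi2

deviance_stat = 17.0   # difference in deviances (assumed value for illustration)
df = 31 - 3            # data points minus fitted parameters

# Upper-tail probability: chance of a larger difference if the model fits.
p_value = chi2.sf(deviance_stat, df)
print(f"p-value = {p_value:.2f}")
```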
Similarly, the row labeled the Pearson goodness-of-fit is based on the same idea. A perfect fit is obtained with a model of 31 parameters. A comparison of the observed and predicted values is found for the model with 3 parameters. How big is the difference in fit? How unusual is it?
NOTE that for goodness-of-fit tests, you DO NOT WANT TO REJECT the null hypothesis. Hence p-values for a goodness-of-fit test that are small (e.g. less than α = .05) are NOT good!
12 The df = 31 − 3 = 28.
So for this model, there is no reason to be upset with the fit.
The residual plots look strange, but this is an artifact of the data:
Along the bottom axis is the predicted probability of being female. Now consider a male subject. If the predicted probability of being female is small (e.g. close to 0 because the subject is quite heavy), then there is an almost perfect agreement of the observed response with the predicted probability. If you compute a residual by defining male=0 and female=1, then the residual here would be computed as (obs − predicted)/se(predicted) = (0 − 0)/blah = 0. This corresponds to points near the (0,0) area of the plots.
What about males whose predicted probability of being female is almost .7 (which corresponds to observation 15)? This is a poor prediction, and the residual is computed as (0 − .7)/se(predicted), which is approximately equal to (0 − .7)/√(.7 × .3) ≈ −1.53, with some further adjustment to compute the se of the predicted value. This corresponds to the point near (.7, −1.5).
On the other hand, a female with a predicted probability of being female of .7 will have a residual equal to approximately (1 − .7)/√(.7 × .3) ≈ .65.
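The residual arithmetic above can be checked with a few lines of Python; this is a rough sketch that omits JMP's further adjustment for the se of the predicted value:

```python
import math

def crude_residual(obs, p):
    """(observed - predicted) / sqrt(p(1-p)); JMP applies a further
    adjustment when it uses the se of the predicted value."""
    return (obs - p) / math.sqrt(p * (1 - p))

# Male (coded 0) with predicted probability of being female of .7:
print(round(crude_residual(0, 0.7), 2))   # -1.53

# Female (coded 1) with the same predicted probability:
print(round(crude_residual(1, 0.7), 2))   # 0.65
```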
Hence the two lines on the graph correspond to males and females respectively. What you want to see is
this two-parallel-line system, particularly with few males near a probability of being female close to 1, and few females with a probability of being female close to 0.
There are four possible residual plots available in JMP; they are all based on a similar procedure with minor adjustments in the way they compute a standard error. Usually, all four plots are virtually the same; anomalies among the plots should be investigated carefully.
7.6 Examples: Lung Cancer vs. Smoking; Marijuana use of students based on parental usage - Single categorical predictor
7.6.1 Retrospective and Prospective odds-ratios
In this section, the case where the predictor (X) variable is also a categorical variable will be examined. As seen in multiple linear regression, categorical X variables are handled by the creation of indicator variables. A categorical variable with k classes will generate k − 1 indicator variables. As before, there are many ways to define these indicator variables, and the user must examine the computer software carefully before using any of the raw estimated coefficients associated with a particular indicator variable.
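As a sketch of one common scheme (reference-cell coding; JMP's own internal coding differs, which is exactly why the raw coefficients need care), a categorical variable with k = 4 classes generates k − 1 = 3 indicators:

```python
levels = ["control", "low", "medium", "high"]   # k = 4 classes
reference = "control"

def indicators(value):
    """Return the k-1 = 3 indicator variables for one observation,
    using the 'control' class as the reference cell."""
    return [1 if value == lev else 0 for lev in levels if lev != reference]

print(indicators("medium"))   # [0, 1, 0]
print(indicators("control"))  # [0, 0, 0] -- the reference level
```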
It turns out that there are multiple ways to analyze such data, all of which are asymptotically equivalent. Also, this particular topic is usually divided into two sub-categories: problems where there are only two levels of the predictor variable, and cases where there are three or more levels of the predictor variable. This division actually has a good reason: it turns out that in the case of 2 levels for the predictor and 2 levels for the response variable (the classic 2 × 2 contingency table), it is possible to use a retrospective study and actually get valid estimates of the prospective odds ratio.
For example, suppose you were interested in looking at the relationship between smoking and lung cancer. In a prospective study, you could randomly select 1000 smokers and 1000 non-smokers from their respective populations and follow them over time to see how many developed lung cancer. Suppose you obtained the following results:
Cohort        Lung Cancer   No lung cancer
Smokers               100              900
Non-smokers            10              990
Because this is a prospective study, it is quite valid to say that the probability of developing lung cancer if you are a smoker is 100/1000 and the probability of developing lung cancer if you are not a smoker is 10/1000. The odds of developing cancer if you are a smoker are 100:900 and the odds of developing cancer if you are a non-smoker are 10:990. The odds ratio of developing cancer of a smoker vs. a non-smoker is then

OR(LC)_{S vs. NS} = (100 : 900) / (10 : 990) = 11 : 1
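The prospective odds ratio can be verified directly from the counts:

```python
# Prospective study: among 1000 smokers, 100 developed lung cancer;
# among 1000 non-smokers, 10 did (counts from the table above).
odds_smoker = 100 / 900
odds_nonsmoker = 10 / 990

odds_ratio = odds_smoker / odds_nonsmoker
print(round(odds_ratio, 1))   # 11.0
```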
But a prospective study takes too long, so an alternate way of studying the problem is to do a retrospective study. Here samples of 1000 people with lung cancer and 1000 people without lung cancer are selected at random from their respective populations. For each subject, you determine if they smoked in the past. Suppose you get the following results:
Lung Cancer   Smoker   Non-smoker
yes              810          190
no               280          720
Now you can't directly find the probability of lung cancer if you are a smoker. It is NOT simply 810/(810 + 280), because you selected equal numbers of people with and without lung cancer rather than sampling smokers at random from the population, where generally less than 30% of people smoke. Unless that proportion is known, it is impossible to compute the probability of getting lung cancer if you are a smoker or non-smoker directly, and so it would seem that finding the odds of lung cancer would be impossible.
However, not all is lost. Let P(smoker) represent the probability that a randomly chosen person is a smoker; then P(non-smoker) = 1 − P(smoker). Bayes' Rule 13 gives:

P(lung cancer | smoker) = P(smoker | lung cancer) P(lung cancer) / P(smoker)
P(no lung cancer | smoker) = P(smoker | no lung cancer) P(no lung cancer) / P(smoker)
P(lung cancer | non-smoker) = P(non-smoker | lung cancer) P(lung cancer) / P(non-smoker)
P(no lung cancer | non-smoker) = P(non-smoker | no lung cancer) P(no lung cancer) / P(non-smoker)
This doesn't appear to be helpful, as P(smoker) and P(non-smoker) are unknown. But look at the odds-ratio of getting lung cancer of a smoker vs. a non-smoker:

OR(LC)_{S vs. NS} = ODDS(lung cancer if smoker) / ODDS(lung cancer if non-smoker)
                  = [P(lung cancer | smoker) / P(no lung cancer | smoker)] / [P(lung cancer | non-smoker) / P(no lung cancer | non-smoker)]

If you substitute in the above expressions, you find that:

OR(LC)_{S vs. NS} = [P(smoker | lung cancer) / P(smoker | no lung cancer)] / [P(non-smoker | lung cancer) / P(non-smoker | no lung cancer)]
which can be computed from the retrospective study. Based on the above table, we obtain

OR(LC)_{S vs. NS} = (.810 / .280) / (.190 / .720) ≈ 11 : 1
This symmetry in odds-ratios between prospective and retrospective studies only works in the 2×2 case for simple random sampling.
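The retrospective calculation can be checked the same way; since the P(smoker) and P(lung cancer) terms cancel, only the exposure proportions within each disease group are needed:

```python
# Retrospective study: rows are lung cancer yes/no, columns smoker/non-smoker.
lc_smoker, lc_non = 810, 190
no_smoker, no_non = 280, 720

# OR from the conditional distributions of exposure given disease status.
or_retro = (lc_smoker / no_smoker) / (lc_non / no_non)
print(round(or_retro, 2))   # 10.96, i.e. roughly 11:1
```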
13 See http://en.wikipedia.org/wiki/Bayes_rule
7.6.2 Example: Parental and student usage of recreational drugs
A study was conducted where students at a college were asked about their personal use of marijuana and if their parents used alcohol and/or marijuana. 14 The following data is a collapsed version of the table that appears in the report:
                 Student Usage
Parental Usage    Yes     No
Yes               125     85
No                 94    141
This is a retrospective analysis, as the students are interviewed and the past behavior of parents is recorded. The data are entered in JMP in the usual format. There will be four lines, and three variables corresponding to parental usage, student usage, and the count.
Start using the Analyze->Fit Y-by-X platform:
14 "Marijuana Use in College," Youth and Society, 1979, 323-334.
but don't forget to specify the Count as the frequency variable. It doesn't matter which variable is entered as the X or Y variable. Note that JMP actually will switch from the logistic platform to the contingency platform 15, as noted by the diagram at the lower left of the dialogue box.
The mosaic plot shows the relative percentages in each of the student usage groups:
15 Refer to the chapter on Chi-square tests.
The contingency table (after selecting the appropriate percentages for display from the red-triangle pop-down menu) 16
16 In my opinion, I would never display percentages to more than integer values. Displays such as 42.92% are just silly as they imply a precision of 1 part in 10,000, but you only have 219 subjects in the first row.
The contingency table approach tests the hypothesis of independence between the X and Y variables, i.e. is the proportion of parents who use marijuana the same for the two groups of students:
As explained in the chapter on chi-square tests, there are two (asymptotically) equivalent ways to test this hypothesis: the Pearson chi-square statistic and the likelihood-ratio statistic. In this case, you would come to the same conclusion with either.
The odds-ratio is obtained from the red-triangle at the top of the display:
and gives:
It is estimated that the odds of children using marijuana if their parents use marijuana or alcohol are about 2.2 times the odds of children using marijuana if their parents don't use marijuana or alcohol. The 95% confidence interval for the odds-ratio is between 1.51 and 3.22. In this case, you would examine if the confidence interval for the odds-ratio includes the value of 1 (why?) to see if anything interesting is happening.
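The odds-ratio and its 95% confidence interval can be reproduced from the 2 × 2 counts with the usual large-sample (Wald) interval on the log scale; this is a sketch of the standard textbook formula, not necessarily the exact computation JMP performs:

```python
import math

# Counts from the parental/student usage table.
a, b = 125, 85    # parental usage yes: student yes, student no
c, d = 94, 141    # parental usage no:  student yes, student no

odds_ratio = (a / b) / (c / d)

# Wald interval: se of the log odds-ratio is sqrt(1/a + 1/b + 1/c + 1/d).
se_log_or = math.sqrt(1/a + 1/b + 1/c + 1/d)
lo = math.exp(math.log(odds_ratio) - 1.96 * se_log_or)
hi = math.exp(math.log(odds_ratio) + 1.96 * se_log_or)

print(f"OR = {odds_ratio:.2f}, 95% CI = ({lo:.2f}, {hi:.2f})")
```

This reproduces the reported interval of roughly (1.51, 3.22).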
If the Analyze->Fit Model platform is used and a logistic regression is fit:
This gives the output:
The coefficient of interest is the effect of student usage on the no/yes log-odds for parental usage. The test for the effect of student usage has a chi-square test value of 17.02 with a small p-value, which matches the likelihood-ratio test from the contingency table approach. Many packages use different codings for categorical X variables (as seen in the section on multiple regression), so you need to check the computer manual carefully to understand exactly what the coefficient measures.
However, the odds-ratio can be found from the red-triangle pop-down menu:
and matches what was seen earlier.
Finally, the Analyze->Fit Model platform can be used with the Generalized Linear Model option:
This gives:
The test for a student effect has the same results as seen previously. But, ironically, there is no easy way to compute the odds ratio. It turns out that, given the parameterization used by JMP, the log-odds ratio is twice the coefficient of the student usage, i.e. twice −.3955. The odds-ratio would be found as the anti-log of this value, i.e. e^{2×−.3955} = .4534, and the confidence interval for the odds-ratio can be found by anti-logging twice the confidence interval limits for this coefficient, i.e. ranging from e^{2×−.5866} = .31 to e^{2×−.2068} = .66. 17 These values are the inverses of the values seen earlier, but this is an artefact of which category is modelled. For example, the odds ratios satisfy

OR_{Parents Y vs. N}(Student Y vs. N) = 1 / OR_{Parents N vs. Y}(Student Y vs. N) = 1 / OR_{Parents Y vs. N}(Student N vs. Y) = OR_{Parents N vs. Y}(Student N vs. Y)
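This back-transformation can be sketched in Python; the coefficient and its confidence limits are read off the JMP output:

```python
import math

# Estimated coefficient of student usage and its 95% confidence limits
# from the JMP Generalized Linear Model output.
coef, lo_coef, hi_coef = -0.3955, -0.5866, -0.2068

# With JMP's coding, the log-odds ratio is twice the coefficient.
odds_ratio = math.exp(2 * coef)
ci = (math.exp(2 * lo_coef), math.exp(2 * hi_coef))
print(f"OR = {odds_ratio:.4f}, 95% CI = ({ci[0]:.2f}, {ci[1]:.2f})")

# The reciprocal recovers the odds-ratio seen from the contingency table:
print(round(1 / odds_ratio, 2))   # 2.21
```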
7.6.3 Example: Effect of selenium on tadpole deformities
The generalization of the above to more than two levels of the X variable is straightforward and parallels the analysis of a single-factor CRD ANOVA. Again, we will assume that the experimental design is a completely randomized design or a simple random sample.
17 This simple relationship may not be true with other computer packages. YMMV.
Selenium (Se) is an essential element required for the health of humans, animals, and plants, but becomes a toxicant at elevated concentrations. The most sensitive species to selenium toxicity are oviparous (egg-laying) animals. Ecological impacts in aquatic systems are usually associated with teratogenic effects (deformities) in early life stages of oviparous biota as a result of maternal sequestering of selenium in eggs. In aquatic environments, inorganic selenium, found in water or in sediments, is converted to organic selenium at the base of the food chain (e.g., bacteria and algae) and then transferred through dietary pathways to other aquatic organisms (invertebrates, fish). Selenium also tends to biomagnify up the food chain, meaning that it accumulates to higher tissue concentrations among organisms higher in the food web.
Selenium often occurs naturally in ores and can leach from mine tailings. This leached selenium can make its way to waterways and potentially contaminate organisms.
As a preliminary survey, samples of tadpoles were selected from a control site and from three sites identified as having low, medium, and high concentrations of selenium based on hydrologic maps and expert opinion. These tadpoles were examined, and the number that had deformities was counted.
Here is the raw data:
Site      Tadpoles   Deformed   % deformed
Control        208         56          27%
low            687        243          35%
medium         832        329          40%
high           597        283          47%
The data are entered in JMP in the usual fashion:
Notice that the status of the tadpoles as deformed or not deformed is entered along with the count of each status.
As the selenium level has an ordering, it should be declared as an ordinal scale, and the ordering of the values for the selenium levels should be specified using the Column Information → Column Properties → Value Ordering dialogue box.
The hypothesis to be tested can be written in a number of equivalent ways:
• H: p(deformity) is the same for all levels of selenium.
• H: odds(deformity) is the same for all levels of selenium.
• H: log-odds(deformity) is the same for all levels of selenium.
• H: p(deformity) is independent of the level of selenium. 18
• H: odds(deformity) is independent of the level of selenium.
• H: log-odds(deformity) is independent of the level of selenium.
18 The use of independent in the hypothesis is a bit old-fashioned and not the same as statistical independence.
• H: p_C(D) = p_L(D) = p_M(D) = p_H(D), where p_L(D) is the probability of deformities at low doses, etc.
There are again several ways in which this data can be analyzed.
Start with the Analyze->Fit Y-by-X platform:
This will give a standard contingency table analysis (see chapter on chi-square tests).
The mosaic plot:
seems to show an increasing trend in deformities with increasing selenium levels. It is a pity that JMP doesn't display any measure of precision (such as se bars or confidence intervals) on this plot.
The contingency table (with suitable percentages shown 19)
19 I would display percentages to the nearest integer. Unfortunately, there doesn't appear to be an easy way to control this in JMP.
also gives the same impression.
A formal test for equality of the proportions of deformities across all levels of the factor gives the following test statistics and p-values:
There are two common test statistics: the Pearson chi-square test statistic, which examines the difference between observed and expected counts (see chapter on chi-square tests), and the likelihood-ratio test, which compares the model when the hypothesis is true vs. the model when the hypothesis is false. Both are asymptotically equivalent. There is strong evidence against the hypothesis of equal proportions of deformities.
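The Pearson test can be reproduced from the deformed/not-deformed counts; this is a sketch using scipy, not the JMP computation itself:

```python
from scipy.stats import chi2_contingency

# Deformed / not-deformed counts for the four selenium sites
# (control, low, medium, high) from the tadpole table.
table = [
    [56, 208 - 56],
    [243, 687 - 243],
    [329, 832 - 329],
    [283, 597 - 283],
]

stat, p_value, dof, expected = chi2_contingency(table)
print(f"Pearson X2 = {stat:.1f} on {dof} df, p = {p_value:.2g}")
```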
Unfortunately, most contingency table analyses stop here. A naked p-value, which indicates that there is evidence of a difference but does not tell you where the differences might lie, is not very informative! In the same way that an ANOVA must be followed by a comparison of the means among the treatment levels, this test should be followed by a comparison of the proportions of deformities among the factor levels.
Logistic regression methods will enable us to estimate the relative odds of deformities among the various
classes.
Start with the Analyze->Fit Model platform:
This gives the output:
First, the Effect Tests section tests the hypothesis of equality of the proportions of deformities among the four levels of selenium. The test statistic and p-value match those seen earlier, so there is good evidence of a difference among the deformity proportions at the various levels.
At this point in an ANOVA, a multiple comparison procedure (such as Tukey's HSD) would be used to examine which levels may have different means from the other levels. There is no simple equivalent for logistic regression implemented in JMP. 20 It would be possible to use a simple Bonferroni correction if the number of groups is small.
JMP provides some information on comparisons among the levels. In the Parameter Estimates section, it presents comparisons of the proportions of deformities among the successive levels of selenium. 21 The estimated difference in the log-odds of deformed for the low vs. control group is .39 (se .18). The associated p-value for no difference in the proportion of deformed is .02, which is less than the α = .05 level, so there is evidence of a difference in the proportion of deformed between these two levels.
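The low vs. control comparison can be checked directly from the counts; this is a sketch of the standard large-sample calculation, which closely matches the reported .39 (se .18) and p-value of .02:

```python
import math

# Deformed / not-deformed counts at the control and low-selenium sites.
control_def, control_ok = 56, 208 - 56     # 56 / 152
low_def, low_ok = 243, 687 - 243           # 243 / 444

# Difference in log-odds and its large-sample standard error.
diff = math.log(low_def / low_ok) - math.log(control_def / control_ok)
se = math.sqrt(1/low_def + 1/low_ok + 1/control_def + 1/control_ok)

# Two-sided p-value from the normal approximation.
z = diff / se
p_value = math.erfc(abs(z) / math.sqrt(2))
print(f"diff = {diff:.3f}, se = {se:.2f}, p = {p_value:.3f}")
```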
By requesting the confidence interval and the odds-ratio, these can be transformed to the odds scale (rather than the log-odds scale).
20 This is somewhat puzzling as the theory should be straightforward.
21 This is purely a function of the internal coding used by JMP. Other packages may use different codings. YMMV.
Unfortunately, there is no simple mechanism to do more general contrasts in this variant of the Analyze->Fit Model platform.
The Generalized Linear Model platform in the Analyze->Fit Model platform gives more options:
The output you get is very similar to what was seen previously. Suppose that a comparison between the proportions of deformities at the high and control levels of selenium is wanted.
Use the red-triangle pop-down menu to select the Contrast option:
Then select the radio buttons for comparisons among selenium levels:
Click on the + and − to form the contrast. Here you are interested in LO_high − LO_control, where the LO are the log-odds for a deformity.
This gives:
The estimated log-odds ratio is .89 (se .18). This implies that the odds ratio for deformity is e^.89 = 2.43, i.e. the odds of deformity are 2.43 times greater at the high selenium site than at the control site. The p-value is well below α = .05, so there is strong evidence that this effect is real. It is possible to compute the se of the odds ratio using the Delta method – pity that JMP doesn’t do this directly. 22 An approximate 95% confidence interval for the log-odds ratio could be found using the usual rule of estimate ± 2se. The 95% confidence interval for the odds ratio would be found by taking anti-logs of the end points.
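These back-transformations are easy to sketch numerically. A minimal check using only the reported estimate (.89) and se (.18), with the delta-method se from the footnote included for comparison:

```python
import math

# Reported estimate and standard error of the log-odds ratio (from the output above).
log_or = 0.89
se_log_or = 0.18

# Back-transform to the odds-ratio scale: about 2.43.
odds_ratio = math.exp(log_or)

# Approximate 95% CI on the log-odds scale (estimate plus or minus 2 se),
# then anti-log the endpoints to get a CI for the odds ratio.
lo, hi = log_or - 2 * se_log_or, log_or + 2 * se_log_or
ci_odds_ratio = (math.exp(lo), math.exp(hi))   # roughly (1.7, 3.5)

# Delta-method se of the odds ratio: se(theta-hat) * exp(theta-hat), about .44.
se_odds_ratio = se_log_or * odds_ratio
```

The same two lines of arithmetic apply to any estimate reported on the log-odds scale.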
This procedure could then be repeated for any contrast of interest.
7.7 Example: Pet fish survival as function of covariates - Multiple categorical predictors
There is no conceptual problem in having multiple categorical X variables. Unlike the case of a single categorical X variable, there is no simple contingency table approach. However, in more advanced classes, you will learn about a technique called log-linear modeling that can often be used for these types of tables.
Again, before analyzing any dataset, ensure that you understand the experimental design. In these notes, it is assumed that the design is a completely randomized design or a simple random sample. If your design is more complex, please seek suitable help.
A fish is a popular pet for young children – yet the survival rate of many of these fish is likely poor. What factors seem to influence the survival probabilities of pet fish?
A large pet store conducted a customer follow-up survey of purchasers of pet fish. A number of customers were called and asked about the hardness of the water used for the fish (soft, medium, or hard), where the fish was kept (which was then classified into cool or hot locations within the living dwelling), if they had previous experience with pet fish (yes or no), and if the pet fish was alive six months after purchase (yes or no).
Here is the raw data: 23
22 For those so inclined, if θ̂ is the estimator with associated se, then the se of e^θ̂ is found as se(e^θ̂) = se(θ̂) × e^θ̂. In this case, the se of the odds ratio would be .18 × e^.89 = .44.
23 Taken from Cox and Snell, Analysis of Binary Data.
Softness Temp PrevPet N Alive
h c n 89 37
h h n 67 24
m c n 102 47
m h n 70 23
s c n 106 57
s h n 48 19
h c y 110 68
h h y 72 42
m c y 116 66
m h y 56 33
s c y 116 63
s h y 56 29
There are three factors in this study:
• Softness with three levels (h, m or s);
• Temperature with two levels (c or h);
• Previous ownership with two levels (y or n).
This is a factorial experiment because all 12 treatment combinations appear in the experiment.
The experimental unit is the household. The observational unit is also the household. There is no pseudo-replication.
The randomization structure is likely complete. It seems unlikely that people would pick particular individual fish depending on their water hardness, temperature, or previous history of pet ownership.
The response variable is the Alive/Dead status at the end of six months. This is a discrete binary outcome. For example, in the first row of the data table, there were 37 households where the fish was still alive after 6 months and therefore 89 − 37 = 52 households where the fish had died somewhere in the 6 month interval.
One way to analyze this data would be to compute the proportion of households that had fish alive after six months, and then use a three-factor CRD ANOVA on the estimated proportions. However, because each treatment combination is based on a different number of trials (ranging from 48 to 116), the variance of the estimated proportion is not constant. This violates (but likely not too badly) one of the assumptions of ANOVA – that of constant variance in each treatment combination. Also, this seems to throw away data, as these 1000 observations are basically collapsed into 12 cells.
Because the outcome is a discrete binary response and each trial within each treatment is independent, a logistic regression (or generalized linear model) approach can be used.
The data is available in the JMP data file fishsurvive.jmp available in the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms. Here is the data file:
To begin with, construct some profile plots to get a feel for what is happening. Create new variables corresponding to the proportion of fish alive and its logit 24 . These are created using the formula editor of JMP in the usual fashion. Also, for reasons which will become apparent in a few minutes, create a variable which is the concatenation of the Temperature and Previous Ownership factor levels. This gives:
24 Recall that logit(p) = log(p/(1 − p)).
Now use the Analyze->Fit Y-by-X platform and specify that the p(alive) or logit(alive) is the response variable, with the WaterSoftness as the factor.
Then specify a matching column for the plot (do this on both plots) using the concatenated variable defined above.
This creates the two profile plots 25 :
The profile plots seem to indicate that p(alive) tends to increase with water softness if this is a first-time pet owner, and (ironically) tends to decrease for a previous pet owner. Of course, without standard error bars, it is difficult to tell if these trends are real or not. The sample sizes in each group are around 100 households. If p(alive) = .5, then the approximate size of a standard error is se = sqrt(.5(.5)/100) = .05, so the approximate 95% confidence intervals are ±.1. It looks as if any trends will be hard to detect with the sample sizes used in this experiment.
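The back-of-the-envelope se above follows the usual binomial formula; a one-line sketch:

```python
import math

def se_proportion(p, n):
    """Approximate standard error of an estimated proportion from n trials."""
    return math.sqrt(p * (1 - p) / n)

# Worst case p(alive) = .5 with roughly 100 households per group, as in the text:
se = se_proportion(0.5, 100)   # = .05
half_width_95 = 2 * se         # approximate 95% interval is plus or minus .1
```

Plugging in the smallest cell (n = 48) instead of 100 gives a noticeably larger se, which is why the unequal cell sizes matter for the ANOVA approach mentioned earlier.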
25 To get the labels on the graph, set the concatenated variable to be a label variable and the rows corresponding to the h softness level to be labeled rows.
In order to fit a logistic-regression model, you must first create a new variable representing the number Dead in each trial 26 , and then stack 27 the Alive and Dead variables, labeling the columns as Status and the Count of each Status to give the final table:
Whew! Now we can finally fit a model to the data and test for various effects. In JMP 6.0 and later, there are two ways to proceed (both give the same answers, but the generalized linear model platform gives a richer set of outputs). Use the Analyze->Fit Model platform:
26 Use a formula to subtract the number alive from the number of trials.
27 Use the Tables->Stack command.
Notice that the response variable is Status and that the frequency variable is the Count of the number of times each status occurs. The model effects box is filled with each factor's main effect, and the second- and third-order interactions.
This gives the following output:
Check to see exactly what is being modeled. In this case, it is the probability of the first level of the responses, logit(alive).
Then examine the effect tests. Just as in ordinary ANOVA modeling, start with the most complex term and work backwards, successively eliminating terms until nothing more can be eliminated. The third-order interaction is not statistically significant. Eliminate this term from the Analyze->Fit Model dialog box, and refit using only main effects and two-factor interactions. 28
Successive terms were dropped to give the final model:
28 Just like regular ANOVA, you can’t examine the p-values of lower-order interaction terms if a higher-order interaction is present. In this case, you can’t look at the p-values for the second-order interactions when the third-order interaction is present in the model. You must first refit the model after the third-order interaction is dropped.
It appears that there is good evidence of an effect of Previous Ownership, marginal evidence of an effect of Temperature, and an interaction between water softness and previous ownership. [Because the two-factor interaction was retained, the main effects of softness and previous ownership must be retained in the model even though it looks as if there is no main effect of softness. Refer to the previous notes on two-factor ANOVA for details.]
Save the predicted p(alive) to the data table 29
29 CAUTION: the predicted p(alive) is saved to the data line even if the actual status is dead.
and plot the observed proportions against the predicted values as seen in regression examples earlier. 30
30 Use the Analyze->Fit Y-by-X platform, and then the Fit Special option to draw a line with slope=1 on the plot.
The plot isn’t bad and seems to have captured most of what is happening. Use the Analyze->Fit Y-by-X platform, with the Matching Column as before, to create the profile plot of the predicted values:
It is a pity that JMP gives you no easy way to annotate the standard error or confidence intervals for the predicted mean p(alive), but the confidence bounds can be saved to the data table.
Unlike regular regression, it makes no sense to make predictions for individual fish.
By using the Contrast pop-down menu, you can estimate the difference in survival rates (but, unfortunately, on the logit scale) as needed. For example, suppose that you wished to estimate the difference in survival rates between fish raised in hard water with no previous experience and fish raised in hard water with previous experience. Use the Contrast pop-down menu:
The contrast is specified by pressing the - and + boxes as needed:
This gives:
Again this is on the logit scale and implies that logit(p(alive))_hn − logit(p(alive))_hy = −.86 (se .22). This is highly statistically significant. But what does this mean? Working backwards, we get:
logit(p(alive)_hn) − logit(p(alive)_hy) = −.86
log[ p(alive)_hn / (1 − p(alive)_hn) ] − log[ p(alive)_hy / (1 − p(alive)_hy) ] = −.86
log[ odds(alive)_hn / odds(alive)_hy ] = −.86
odds(alive)_hn / odds(alive)_hy = e^−.86 = .423
Or, the odds of a fish being alive from a non-owner in hard water are about 1/2 of the odds of a fish being alive from a previous owner in hard water. If you look at the previous graphs, this indeed does match. It is possible to compute a se for this odds ratio, but that is beyond the scope of this course.
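As a rough cross-check, a similar contrast can be computed directly from the raw counts by pooling the two hard-water rows within each ownership level. This ignores how the fitted model adjusts for temperature and shares information across cells, so it should only roughly match the model-based −.86:

```python
import math

# Hard-water rows from the data table, pooled over temperature: (N, Alive).
hard_no_prev = (89 + 67, 37 + 24)    # h,n rows
hard_prev = (110 + 72, 68 + 42)      # h,y rows

def log_odds(n, alive):
    # log of (alive : dead) odds
    return math.log(alive / (n - alive))

diff = log_odds(*hard_no_prev) - log_odds(*hard_prev)   # close to -.86
odds_ratio = math.exp(diff)                             # close to .423
```

Here the pooled empirical contrast lands very near the fitted one, which is reassuring but not guaranteed in general.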
7.8 Example: Horseshoe crabs - Continuous and categorical predictors
As might be expected, combinations of continuous and categorical X variables can also be fit using similar reasoning as the ANCOVA models discussed in the chapter on multiple regression.
If the categorical X variable has k categories, k − 1 indicator variables will be created using an appropriate coding. Different computer packages use different codings, so you must read the package documentation carefully in order to interpret the estimated coefficients. However, the different codings must, in the end, arrive at the same final estimates of effects.
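To make the k − 1 indicator idea concrete, here is a sketch of one common scheme, reference-cell (treatment) coding. This is only one possibility; JMP's default coding differs, so individual coefficients will not transfer directly between packages even though the fitted effects agree:

```python
def reference_coding(levels, value):
    """Build the k-1 indicator variables for one observation, using the
    first level as the reference cell (all indicators zero)."""
    return [1 if value == lev else 0 for lev in levels[1:]]

# Hypothetical 4-level factor (codes as in the crab color variable below):
levels = ["2", "3", "4", "5"]
print(reference_coding(levels, "2"))   # reference level -> [0, 0, 0]
print(reference_coding(levels, "4"))   # -> [0, 1, 0]
```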
Unlike the ANCOVA model with continuous responses, there are no simple plots in logistic regression to examine visually the parallelism of the response or the equality of intercepts. 31 Preliminary plots where data are pooled into various classes so that empirical logistic plots can be made seem to be the best that can be done.
As in the ANCOVA model, there are three models that are usually fit. Let X represent the continuous predictor, let Cat represent the categorical predictor, and p the probability of success. The three models are:
• logit(p) = X Cat X ∗ Cat - different intercepts and slopes for each group;
• logit(p) = X Cat - different intercepts but common slope (on the logit scale);
• logit(p) = X - same slope and intercept for all groups - coincident lines.
The choice among these models is made by examining the Effect Tests for the various terms. For example, to select between the first and second model, look at the p-value of the X ∗ Cat term; to select between the second and third model, examine the p-value for the Cat term.
31 This is a general problem in logistic regression because the responses are one of two discrete categories.
These concepts will be illustrated using a dataset on nesting horseshoe crabs 32 that is analyzed in Agresti’s book. 33
The design of the study is given in Brockmann H.J. (1996). Satellite male groups in horseshoe crabs, Limulus polyphemus. Ethology, 102, 1-21. Again it is important to check that the design is a completely randomized design or a simple random sample. As in regression models, you do have some flexibility in the choice of the X settings, but for a particular weight and color, the data must be selected at random from that relevant population.
Each female horseshoe crab had a male resident in her nest. The study investigated other factors affecting whether the female had any other males, called satellites, residing nearby. These other factors include:
• crab color, where 2=light medium, 3=medium, 4=dark medium, 5=dark.
• spine condition, where 1=both good, 2=one worn or broken, or 3=both worn or broken.
• weight
• carapace width
The number of satellites was measured; for this example we will convert the number of satellite males into a presence (number at least 1) or absence (no satellites) value.
A JMP dataset crabsatellites.jmp is available from the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms. A portion of the datafile is shown below:
32 See http://en.wikipedia.org/wiki/Horseshoe_crab.
33 These are available from Agresti’s web site at http://www.stat.ufl.edu/~aa/cda/sas/sas.html.
Note that the color and spine condition variables should be declared with an ordinal scale despite having numerical codes. The number of satellite males was converted to a presence/absence value using the JMP formula editor.
A preliminary scatter plot of the variables shows some interesting features.
There is a very high positive relationship between carapace width and weight, but there are a few anomalous crabs that should be investigated further, as shown in this magnified plot:
There are three points with weights in the 1200-1300 g range whose carapace widths suggest that the weights should be in the 2200-2300 g range, i.e. a typographical error in the first digit. There is a single crab whose weight suggests a width of 24 cm rather than 21 cm – perhaps a typo in the last digit. Finally, there is one crab which is extremely large compared to the rest of the group. In the analysis that follows, I’ve excluded these five crabs.
The final point also appears to have an unusual number of satellite males compared to the other crabs in the dataset.
The Analyze->Fit Y-by-X platform was then used to examine the differences in means or proportions in the other variables when grouped by the presence/absence score. These are not shown in these notes, but generally demonstrate some separation in the means or proportions between the two groups; however, there is considerable overlap in the individual values between the two groups. The group with no satellite males tends to have darker colors than the presence group, while the distinction between the spine conditions is not clear cut.
Because of the high correlation between carapace size and weight, the weight variable was used as the continuous covariate and the color variable was used as the discrete covariate.
A preliminary analysis divided weight into four classes (up to 2000 g; 2000-2500 g; 2500-3000 g; and over 3000 g). 34 Similarly, a new variable (PA) was created to be 0 (for absence) or 1 (for presence) for the presence/absence of satellite males. The Tables->Summary command was used to compute the mean PA (which then corresponds to the estimated probability of presence) for each combination of weight class and color:
34 The formula commands of JMP were used.
Finally, the Analyze->Fit Y-by-X platform was used to plot the probability of presence by weight class, using the Matching Column to join lines of the same color:
Note that despite the appearance of non-parallelism for the bottom line, the point in the 2500-3000 gram category is based on only 4 crabs and so has very poor precision. Similarly, the point near 100% in the 0-2000 g category is based on 1 data point! The parallelism hypothesis may be appropriate.
A generalized linear model using the Analyze->Fit Y-by-X platform was used to fit the most general model using the raw data:
This gives the results:
The p-value for non-parallelism (refer to the line corresponding to the Color*Weight term) is just over α = .05, so there is some evidence that perhaps the lines are not parallel. The parameter estimates are not interpretable without understanding the coding scheme used for the indicator variables. The goodness-of-fit test does not indicate any problems.
Let us continue with the parallel-slopes model by dropping the interaction term. This gives the following results:
There is good evidence that the log-odds of NO males present decreases as weight increases (i.e. the log-odds of a male being present increases as weight increases), with an estimated change of .0016 in the log-odds per gram increase in weight. There is very weak evidence that the intercepts are different, as the p-value is just under 10%.
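A slope on the log-odds scale is easier to read after exponentiating it over a meaningful change in the covariate. Taking the reported .0016 per gram at face value, a sketch of the conversion:

```python
import math

slope_per_gram = 0.0016   # estimated change in log-odds per gram (reported above)

# Odds multiplier for a 100 g increase in weight:
multiplier_100g = math.exp(slope_per_gram * 100)   # about 1.17
```

That is, each additional 100 g of weight multiplies the odds by roughly 1.17, holding color fixed.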
The goodness-of-fit test seems to indicate no problem. The residual plot must be interpreted carefully, but its appearance was explained in a previous section.
The different intercepts will be retained to illustrate how to graph the final model. Use the red-triangle to save the predicted probabilities to the data table. Note that you may wish to rename the predicted column to remind yourself that it is the probability of NO male that is being predicted.
Use the Analyze->Fit Y-by-X platform to plot the predicted probability of absence against weight, use the group-by option to separate by color, and then fit a spline (a smooth flexible curve) to draw the four curves:
to give the final plot:
Notice that while the models are linear on the log-odds scale, the plots will show a non-linear shape on the regular probability scale.
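This curvature comes directly from the inverse-logit back-transformation. With hypothetical intercept and slope values (illustration only, not the fitted estimates), equal steps in weight give equal steps in log-odds but unequal steps in probability:

```python
import math

def inverse_logit(x):
    return 1 / (1 + math.exp(-x))

# Hypothetical coefficients on the log-odds scale (illustration only).
b0, b1 = 3.0, -0.0016
weights = [1000, 2000, 3000, 4000]
probs = [inverse_logit(b0 + b1 * w) for w in weights]

# The probability steps shrink as the curve flattens near 0 and 1.
steps = [probs[i + 1] - probs[i] for i in range(len(probs) - 1)]
```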
It appears that the color=5 group is different from the rest. If you do a contrast among the intercepts (not really a good idea, as this could be considered data dredging), you indeed find evidence that the intercept (on the log-odds scale) for color 5 may be different from the average of the intercepts for the other three colors:
7.9 Assessing goodness of fit
As is the case in all model fitting in Statistics, it is important that the model provides an adequate fit to the data at hand. Without such an analysis, the inferences drawn from the model may be misleading or even totally wrong!
One of the “flaws” of many published papers is a lack of detail on how the fit of the model to the data was assessed. The logistic regression model is a powerful statistical tool, but it must be used with caution.
Goodness-of-fit methods for logistic regression models are more difficult than similar methods for multiple regression because of the binary (success/failure) nature of the response variable. Nevertheless, many of the methods used in multiple regression have been extended to the logistic regression case.
A nice review paper of the methods of assessing fit is given by
Hosmer, D. W., Taber, S., and Lemeshow, S. (1991). The importance of assessing the fit of logistic regression models: a case study. American Journal of Public Health, 81, 1630–1635. http://dx.doi.org/10.2105/AJPH.81.12.1630
In any statistical model, there are two components – the structural portion (e.g. the fitted curve) and the residual (or noise) portion (e.g. the deviation of the actual values from the fitted curve). The process of building a model focuses on the structural portion. Which variables are important in predicting the response? Is the correct scale used (e.g. should x or x² be used)? After the structural model is fit, the analyst should assess the degree of fit.
Assessing goodness-of-fit (GOF) usually entails two stages. First, computing a statistic that summarizes the general fit of the model to the data. Second, computing statistics for individual observations that assess the (lack of) fit of the model to individual observations and their leverage in the fit. This may identify particular observations that are outliers or have undue influence or leverage on the fit. These points need to be inspected carefully, but it is important to remember that data should not be arbitrarily deleted based solely on a statistical measure.
Let π̂_i represent the predicted probability for case i whose response y_i is either 0 (for failure) or 1 (for success). The deviance of a point is defined as
d_i = sqrt( 2 | ln( π̂_i^y_i (1 − π̂_i)^(1−y_i) ) | )
and is basically a function of the log-likelihood for that observation.
The total deviance is defined as:
D = Σ d_i²
Another statistic, the Pearson residual, is defined as:
r_i = (y_i − π̂_i) / sqrt( π̂_i (1 − π̂_i) )
and the Pearson chi-square statistic is defined as
χ² = Σ r_i²
The summary statistics D and χ² each have degrees of freedom approximately equal to n − (p + 1), where p is the number of predictor variables, but they don’t have any nice distributional forms (i.e. you can’t assume that they follow a chi-square distribution). This is because the individual components are essentially formed from an n × 2 contingency table with all counts 1 or 0, so the problem of small expected counts found in chi-square tests is quite serious. So any p-value reported for these overall goodness-of-fit measures is not very reliable, and about the only thing that is useful is to compare these statistics to their degrees of freedom to compute an approximate variance inflation factor as seen earlier in the Fitness example.
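The residual definitions above can be sketched directly. The fitted probabilities here are made up purely for illustration, since d_i and r_i require a fitted model:

```python
import math

def deviance_residual(y, pi):
    """d_i = sqrt(2 |ln(pi^y (1-pi)^(1-y))|) for a 0/1 response y."""
    loglik = y * math.log(pi) + (1 - y) * math.log(1 - pi)
    return math.sqrt(2 * abs(loglik))

def pearson_residual(y, pi):
    """r_i = (y - pi) / sqrt(pi (1 - pi))."""
    return (y - pi) / math.sqrt(pi * (1 - pi))

# Made-up (response, fitted probability) pairs for illustration:
cases = [(1, 0.8), (0, 0.8), (1, 0.3)]
D = sum(deviance_residual(y, p) ** 2 for y, p in cases)     # total deviance
chi2 = sum(pearson_residual(y, p) ** 2 for y, p in cases)   # Pearson chi-square
```

Notice how the observation (0, 0.8), a failure with a high fitted success probability, dominates both statistics.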
One strategy for sparse tables is to pool. The Hosmer–Lemeshow test divides the data into 10 groups of equal sizes based on the deciles of the fitted values. The observed and expected counts are computed by summing the estimated probabilities and the observed values in the usual fashion, and then computing a standard chi-square goodness-of-fit statistic. It is compared to a chi-square distribution with 8 df.
Any assessment of goodness of fit should then start with the examination of the D, χ² and Hosmer–Lemeshow statistics. Then do a careful evaluation of the individual terms d_i and r_i.
To start with, examine the residual plots. Suppose we wish to predict membership in a category as a function of a continuous covariate. For example, can we predict the sex of an individual based on their weight? This is known as logistic regression and is discussed in another chapter in this series of notes.

Again refer to the Fitness dataset. The (Generalized Linear) model is:

Y_i distributed as Binomial(p_i)
φ_i = logit(p_i)
φ_i = Weight

The residual plot is produced automatically from the Generalized Linear Model option of the Analyze->Fit Model platform and looks like: [35]

[35] I added reference lines at zero, 2, and −2 by clicking on the Y axis of the plot.
This plot looks a bit strange!

Along the bottom of the plot is the predicted probability of being female. [36] This is found by substituting the weight of each person into the estimated linear part, and then back-transforming from the logit scale to the ordinary probability scale. The first point on the plot, identified by a square box, is from a male who weighs over 90 kg. The predicted probability of being female is very small, about 5%.

The first question is exactly how a residual is defined when the Y variable is a category. For example, how would the residual for this point be computed? It makes no sense to simply take the observed (male) minus the predicted probability (.05).

Many computer packages redefine the categories using 0 and 1 labels. Because JMP was modeling the probability of being female, all males are assigned the value of 0, and all females are assigned the value of 1. Hence the residual for this point is 0 − .05 = −.05, which, after studentization, plots as shown.
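A tiny sketch of this recoding and the resulting raw residuals (the subjects and fitted probabilities below are hypothetical):

```python
import numpy as np

# Hypothetical subjects; JMP was modeling the probability of being female,
# so "F" is recoded as 1 and "M" as 0
sex = np.array(["M", "F", "F", "M"])
pi_female = np.array([0.05, 0.80, 0.40, 0.30])  # assumed fitted P(female)

y = (sex == "F").astype(float)   # 0/1 recoding of the category
resid = y - pi_female            # the heavy male: 0 - 0.05 = -0.05
```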
The bottom line in the residual plot corresponds to the male subjects; the top line corresponds to the female subjects. Where are areas of concern? You would be concerned about females who have a very small predicted probability of being female, and males who have a large predicted probability of being female. These are located in the plot in the circled areas.

[36] The first part of the output from the platform states that the probability of being female is being modeled.

The residual plot's strange appearance is an artifact of the modeling process.
What happens if the predictors in a logistic regression are also categorical? Based on what was seen for the ordinary regression case, you might expect to see a set of vertical lines. But there are only two possible responses, so the plot reduces to a (non-informative) set of lattice points.

For example, consider predicting survival rates of Titanic passengers as a function of their sex. This model is:

Y_i distributed as Binomial(p_i)
φ_i = logit(p_i)
φ_i = Sex

The residual plot is produced automatically from the Generalized Linear Model option of the Analyze->Fit Model platform and looks like: [37]

[37] I added reference lines at zero, 2, and −2 by clicking on the Y axis of the plot.
The same logic applies as in the previous sections. Because Sex is a discrete predictor with two possible values, there are only two possible predicted probabilities of survival, corresponding to the two vertical lines in the plot. Because the response variable is categorical, it is converted to 0 or 1 values, and the residuals computed, which then correspond to the two dots in each vertical line. Note that each dot represents several hundred data values!

This residual plot is rarely informative – after all, if there are only two outcomes and only two categories for the predictors, some people have to lie in the two outcomes for each of the two categories of predictors.
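To see why only a lattice of points can appear, consider a sketch with made-up Titanic-style counts (the real data differ); with a single binary predictor, the fitted probabilities are just the two group means, so only four residual values exist:

```python
import numpy as np

# Hypothetical counts: sex (predictor) and survival (0/1 response)
sex = np.array(["F"] * 300 + ["M"] * 700)
survived = np.array([1] * 220 + [0] * 80 + [1] * 140 + [0] * 560)

# With one binary predictor, the fitted probability is the group survival rate
pi_hat = np.where(sex == "F",
                  survived[sex == "F"].mean(),
                  survived[sex == "M"].mean())
resid = survived - pi_hat

# Only four distinct residual values: one per (sex, outcome) combination
n_distinct = len(np.unique(np.round(resid, 10)))
```

Every passenger in the same (sex, outcome) cell lands on exactly the same dot, which is why each dot can represent hundreds of observations.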
The leverage of a point measures how extreme the set of predictors is relative to the rest of the predictors in the study. Leverage in logistic regression depends not only on this distance, but also on the weight in prediction, which is a function of π(1 − π). Consequently, points with very small predicted probabilities (i.e. π̂_i < 0.15) or very large predicted probabilities (i.e. π̂_i > 0.85) actually have little weight on the fit, and the maximum leverage occurs with points where the predicted probability is close to 0.15 or 0.85.
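The dependence of leverage on the weight π̂(1 − π̂) can be made explicit; here is a sketch of the hat-matrix diagonal using the standard GLM formula (a made-up design, not package output):

```python
import numpy as np

def logistic_leverage(X, pi_hat):
    """Diagonal of the logistic hat matrix
    H = W^(1/2) X (X'WX)^(-1) X' W^(1/2), with W = diag(pi_hat*(1-pi_hat))."""
    w = pi_hat * (1 - pi_hat)
    Xw = X * np.sqrt(w)[:, None]                 # W^(1/2) X
    M = np.linalg.inv(X.T @ (X * w[:, None]))    # (X'WX)^(-1)
    return np.einsum("ij,jk,ik->i", Xw, M, Xw)   # row-wise Xw_i M Xw_i'

# Small made-up design: intercept plus one covariate
X = np.column_stack([np.ones(5), np.array([-2.0, -1.0, 0.0, 1.0, 2.0])])
pi_hat = np.array([0.10, 0.30, 0.50, 0.70, 0.90])
lev = logistic_leverage(X, pi_hat)
```

In this small example the most extreme covariate values (x = ±2, with π̂ = 0.10 and 0.90) end up with slightly less leverage than the x = ±1 points, illustrating how small weights π̂(1 − π̂) pull down the leverage of extreme points.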
Hosmer et al. (1991) suggest plotting the leverage of each point vs. π̂_i to determine the regions where the leverage is highest. These values may not be available in your package of choice.

Hosmer et al. (1991) also suggest computing Cook's distance – how much do the regression coefficients change if a case is dropped from the model. These values may not be available in your package of choice.
7.10 Variable selection methods

7.10.1 Introduction

In the previous examples, there were only a few predictor variables and, generally, only one model really of interest. In many cases, the form of the model is unknown, and some sort of variable selection method is required to build a realistic model.

As in ordinary regression, these variable selection methods are NO substitute for intelligent thought, experience, and common sense.

As always, before starting any analysis, check the sample or experimental design. This chapter only deals with data collected under a simple random sample or completely randomized design. If the sample or experimental design is more complex, please consult with a friendly statistician.

Epidemiologists often advise that all clinically relevant variables should be included regardless of whether or not they are statistically significant. The rationale for this approach is to provide as complete control of confounding as possible – we saw in regular regression that collinearity among variables can mask statistical significance. The major problem with this approach is over-fitting. Over-fitted models have too many variables relative to the number of observations, leading to numerically unstable estimates with large standard errors.
I prefer a more subdued approach rather than this shotgun approach, and would follow these steps to find a reasonable model:

• Start with a multi-variate scatter-plot matrix to investigate pairwise relationships among variables. Are there pairs of variables that appear to be highly correlated? Are there any points that don't seem to follow the pattern seen with the other points?

• Examine each variable separately using the Analyze->Distribution platform to check for anomalous values, etc.

• Start with a simple univariate logistic regression with each variable in turn.

For continuous variables, there are three suggested analyses. First, use the binary variable as the X variable and do a simple two-sample t-test to look for differences among the means of the potential predictors. The dot plots should show some separation of the two groups. Second, try a simple univariate logistic regression using the binary variable as the Y variable with each individual predictor. Third, although it seems odd to do so, convert the binary response variable to a 0/1 continuous response and try some of the standard smoothing methods, such as a spline fit, to investigate the general form of the response. Does it look logistic? Are quadratic terms needed?

For nominal or ordinal variables, the above analyses often start with a contingency table. Particular attention should be paid to problem cases – cells in a contingency table which have a zero count. For example, suppose an experiment was testing different doses of a drug for the LD50 [38] and no deaths occurred at a particular dose. In these situations, the log-odds of success are ±∞, which is impossible to model properly using virtually any standard statistical package. [39] If there are cells with 0 counts, some pooling is often required.

Looking at all the variables, which variables appear to be statistically significant? Approximately how large are these simple effects – can the predictor variables be ranked in approximate order of univariate importance?

• Based upon the above results, start with a model that includes what appear to be the most important variables. As a rule of thumb, [40] include variables that have a p-value under .25 rather than relying on a stricter criterion. At this stage of the game, building a good starting model is of primary importance.

• Use standard variable selection methods, such as stepwise selection (forward, backward, combined) or all-subsets regression, to investigate potential models. These mechanical methods are not to be used as a substitute for thinking! Remember that highly collinear variables can mask the importance of each other.

If categorical variables are to be included, then some care must be used in how the various indicator variables are included. The reason for this is that the coding of the indicator variables is arbitrary, and the selection of a particular indicator variable may be an artifact of the coding used. One strategy is that all the indicator variables should be included or excluded as a set, rather than individually selecting separate indicator variables. As you will see in the example, JMP has four different rules that could be used.

[38] LD50 = Lethal Dose, 50th percentile – that dose which kills 50% of the subjects.
[39] However, refer to Hosmer and Lemeshow (2000) for details on alternate approaches.
[40] Hosmer and Lemeshow (2000), p. 95.
• Once the main effects have been identified, look at quadratic, interaction, and crossproduct terms.

• Verify the final model. Look for collinearity, high leverage, etc. Check if the response to the selected variables is linear on the logistic scale. For example, break a continuous variable into 4 classes, and refit the same model with these discretized classes. The estimates of the effects for each class should then follow an approximate linear pattern.

• Cross-validate the model so that artifacts of that particular dataset are not highlighted.
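The entry rule (p-value under .25) and the refit-after-entry idea can be sketched end to end. The following is my own self-contained illustration (a Newton-Raphson logistic fit with Wald p-values), not JMP's algorithm, and the data are simulated:

```python
import math
import numpy as np

def fit_logit(X, y, iters=25):
    """Logistic regression by Newton-Raphson (IRLS); returns (beta, cov)."""
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        eta = np.clip(X @ beta, -30, 30)       # guard against overflow in exp
        pi = 1 / (1 + np.exp(-eta))
        W = pi * (1 - pi)
        H = X.T @ (X * W[:, None])             # observed information
        beta = beta + np.linalg.solve(H, X.T @ (y - pi))
    return beta, np.linalg.inv(H)              # cov from the last Newton step

def wald_p(b, se):
    """Two-sided p-value for the Wald z-statistic b/se."""
    return math.erfc(abs(b / se) / math.sqrt(2))

def forward_select(y, candidates, alpha=0.25):
    """At each step add the candidate with the smallest Wald p-value,
    stopping once the best candidate exceeds the 0.25 entry threshold."""
    selected, remaining = [], dict(candidates)
    n = len(y)
    while remaining:
        best, best_p = None, 1.0
        for name, x in remaining.items():
            X = np.column_stack([np.ones(n)]
                                + [candidates[s] for s in selected] + [x])
            beta, cov = fit_logit(X, y)
            p = wald_p(beta[-1], math.sqrt(cov[-1, -1]))
            if p < best_p:
                best, best_p = name, p
        if best is None or best_p > alpha:
            break
        selected.append(best)
        remaining.pop(best)
    return selected
```

With one strong predictor and one pure-noise predictor, the strong predictor enters first; the noise variable may or may not clear the deliberately loose .25 threshold, which is exactly why this threshold is only for building a starting model.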
7.10.2 Example: Predicting credit worthiness

In the credit business, banks are interested in whether prospective consumers will pay back their credit or not. The aim of credit-scoring is to model or predict the probability that a consumer with certain covariates is to be considered a potential risk.

If you visit http://www.stat.uni-muenchen.de/service/datenarchiv/welcome_e.html you will find a dataset consisting of 1000 consumer credits from a German bank. For each consumer the binary response variable "creditability" is available. In addition, 20 covariates that are assumed to influence creditability were recorded. The dataset is available in the creditcheck.jmp datafile from the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms. The variable descriptions are available at http://www.stat.uni-muenchen.de/service/datenarchiv/kredit/kreditvar_e.html and in the Sample Program Library.

I will assume that the initial steps in variable selection have been done, such as scatter-plots, looking for outliers, etc.

This dataset has a mixture of continuous variables (such as length of time an account has been paid in full), nominal scaled variables (such as sex, or the purpose of the credit request), and ordinal scaled variables (such as length of employment). Some of the ordinal variables may even be close enough to interval or ratio scaled to be usable as continuous variables (such as length of employment). Both approaches should be tried, particularly if the estimates for the individual categories appear to be increasing in a linear fashion.

The Analyze->Fit Model platform was used to specify the response variable, the potential covariates, and that a variable selection method will be used:
This brings up the standard dialogue box for stepwise and other variable selection methods.
In the stepwise paradigm, the usual forward, backward, and mixed (i.e. a forward step followed by a backward step at each iteration) methods are available:

In cases where variables are nominally or ordinally scaled (and discrete), JMP provides a number of ways to include/exclude the individual indicator variables.

For example, consider the variable Repayment, which had levels 0 to 4, corresponding from 0 = repayment problems in the past, to 4 = completely satisfactory repayment of past credit. JMP will create 4 indicator variables to represent these 5 categories. These indicator variables are derived in a hierarchical fashion:
The first indicator variable splits the classes in such a way as to maximize the difference in the proportion of credit worthiness between the two parts of the split. This corresponds to grouping levels 0 and 1 vs. levels 2, 3, and 4. The next indicator variables then split the splits, again, if possible, to maximize the difference in the credit worthiness between the two parts of the split. [If the split is of a pair of categories, there is no choice in the split.] This corresponds to splitting the 0&1 class into another indicator variable that distinguishes category 0 from 1. The 2&3&4 class is split into two sub-splits corresponding to categories 2&3 vs. category 4. Finally, the 2&3 class is split into an indicator variable differentiating categories 2 and 3.
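The hierarchical coding just described can be written out directly; a sketch (the split points below are the ones reported for this credit example, and the sample values are made up):

```python
import numpy as np

# Hypothetical Repayment codes 0..4 for a few consumers
repay = np.array([0, 1, 2, 3, 4, 4, 2])

# Hierarchical indicators mirroring the splits described in the text
ind = np.column_stack([
    np.isin(repay, [0, 1]),   # split 1: {0,1} vs {2,3,4}
    repay == 0,               # split 2: 0 vs 1 (within the 0&1 branch)
    np.isin(repay, [2, 3]),   # split 3: {2,3} vs {4} (within 2&3&4)
    repay == 2,               # split 4: 2 vs 3 (within 2&3)
]).astype(int)
```

Four indicator columns are enough to give each of the 5 categories its own pattern, which is why JMP creates exactly 4 indicators for this variable.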
Now the rules for entering effects correspond to:

• Combined. When terms enter the model, they are combined with all higher terms in the hierarchy and tested as a group to enter or leave.

• Restrict. Terms cannot be entered into the model unless terms higher in the hierarchy are already entered. Hence the indicator variable that distinguishes categories 0 and 1 in the repayment variable cannot enter before the indicator variable that contrasts 0&1 and 2&3&4.

• No Rules. Each indicator variable is free to enter or leave the model regardless of the presence or absence of other variables in the set.

• Whole Effects. All indicator variables in a set must enter or leave together as a set.

Combined and Whole Effects are the two most common choices.

This platform also supports all-possible-subsets regression:
This should be used cautiously with a large number of variables.

Because it is computationally difficult to fit thousands of models using maximum likelihood methods for each of the potential new variables that enter the model, a computationally simpler (but asymptotically equivalent) test procedure (called the Wald or score test) is used in the table of variables to enter or leave. In a forward selection, the variable with the smallest p-value or the largest Wald test statistic is chosen.

Once this variable is chosen, the current model is refit using maximum likelihood, so the report in the Step History may show a slightly different test statistic (the L-R ChiSquare) than the score statistic, and the p-value may be different.

The stepwise selection continues.

In a few steps, the next variable to enter is the indicator variable that distinguishes categories 2&3 and 4. Because of the restriction on entering terms, if this indicator variable is entered, the first cut must also be entered. Hence, this step actually enters 2 variables and the number of predictors jumps from 3 to 5:
In a few more steps, some of the credit purpose variables are entered, again as a pair.

The stepwise selection continues for a total of 18 steps.

As before, once you have identified a candidate model, it must be fit and examined in more detail. Use the Make Model button to fit the final model. Note that JMP must add new columns to the data tables corresponding to the indicator variables created during the stepwise report. These can be confusing to the novice, but just keep in mind that any set of indicator variables is somewhat arbitrary.
The model fit then has separate variables used for each indicator variable created:
The log-odds of NOT repaying the loan is computed (see the bottom of the estimates table). Do the coefficients make sense?

Can some variables be dropped?

Pay attention to how the indicator variables have been split. For example, do you understand what terms are used if the borrower intends to use the credit to do repairs (CreditPurpose value = 6)?

Models that are similar to this one should also be explored.

Again, just as in the case of ordinary regression, model validation using other data sets or hold-out samples should be explored.
7.11 Model comparison using AIC

Sorry, to be added later.
7.12 Final Words

7.12.1 Two common problems

Two common problems can be encountered with logistic regression.

Zero counts

As noted earlier, zero counts for one category of a nominal or ordinal predictor (X) variable are problematic, as the log-odds of that category then approach ±∞, which is somewhat difficult to model.

One simplistic approach is similar to the computation of the empirical logistic estimate – add a small constant (e.g. 1/2) to each cell so that the counts are no longer integers; most packages will deal with non-integer counts without problems.
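A sketch of this continuity correction; I use the common add-1/2 form of the empirical logit here:

```python
import math

def empirical_logit(successes, n):
    """Empirical logit with the usual 1/2 continuity correction; stays
    finite even when a cell has zero successes (or zero failures)."""
    return math.log((successes + 0.5) / (n - successes + 0.5))

# A dose cell with zero deaths out of 20 no longer gives log-odds of -infinity
lo = empirical_logit(0, 20)
```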
If the zero counts arise from spreading the data over too many cells, perhaps some pooling of adjacent cells is warranted. If the data are sufficiently dense that pooling is not needed, perhaps this level of the variable can be dropped.
Complete separation

Ironically, this is a problem because the logistic model is performing too well! We saw an example of this earlier, when the fitness data could predict perfectly the sex of the subject.

This is a problem because now the predicted log-odds for the groups must again be ±∞. This can only happen if some of the estimated coefficients are also infinite, which is difficult to deal with numerically. Theoretical considerations show that in the case of complete separation, maximum likelihood estimates do not exist!

Sometimes this complete separation is an artifact of too many variables and not enough observations. Furthermore, it is not so much a problem of the total number of observations, but also of the division of observations between the two binary outcomes. If you have 1000 observations, but only 1 "success", then any model with more than a few variables will be 100% efficient in capturing the single success – however, it is almost certain to be an artifact of the particular dataset.
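The non-existence of the MLE under complete separation can be seen numerically: with made-up perfectly separated data, the log-likelihood keeps increasing as the slope grows, so there is no finite maximum:

```python
import numpy as np

# Hypothetical completely separated data: every x above 0 is a "success"
x = np.array([-2.0, -1.0, -0.5, 0.5, 1.0, 2.0])
y = np.array([0, 0, 0, 1, 1, 1])

def loglik(beta):
    """Binomial log-likelihood for the no-intercept model logit(p) = beta*x."""
    eta = beta * x
    return float(np.sum(y * eta - np.log1p(np.exp(eta))))

# The likelihood climbs toward (but never reaches) 0 as beta grows
lls = [loglik(b) for b in (1.0, 5.0, 25.0, 125.0)]
```

A software fit on such data will report exploding coefficients and standard errors, which is the numerical symptom of this theoretical problem.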
7.12.2 Extensions

Choice of link function

The logit link function is the most common choice for the link function between the probability of an outcome and the scale on which the predictors operate in a linear fashion.

However, other link functions have been used in different situations. For example, the log link (log(p)), the log-log link (log(−log(p))), the complementary log-log link (log(−log(1 − p))), the probit function (the inverse normal distribution), and the identity link (p) have all been proposed for various special cases. Please consult a statistician for details.
More than two response categories

Logistic regression traditionally has two response categories that are classified as "success" or "failure". It is possible to extend this modelling framework to cases where the response variable has more than two categories.

This is known as multinomial logistic regression, discrete choice, polychotomous logistic, or polytomous logistic modelling, depending upon your field of expertise.

There is a difference in the analysis if the responses can be ordered (i.e. the response variable takes an ordinal scale), or remain unordered (i.e. the response variable takes a nominal scale).

The basic idea is to compute a logistic regression of each category against a reference category. So a response variable with three categories is translated into two logistic regressions where, for example, the first regression is category 1 vs. category 0 and the second regression is category 2 vs. category 0. These can be used to derive the results of category 2 vs. category 1. What is of particular interest is the role of the predictor variables in each of the possible comparisons, e.g. does weight have the same effect upon mortality for three different disease outcomes?
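A sketch of how the two baseline-category logits determine all three probabilities (the linear-predictor values are hypothetical):

```python
import numpy as np

def multinomial_probs(eta):
    """Category probabilities from baseline-category logits:
    eta[k] is the linear predictor of category k+1 vs. reference category 0."""
    expo = np.concatenate([[1.0], np.exp(eta)])   # reference gets exp(0) = 1
    return expo / expo.sum()

# Two logits (category 1 vs 0, category 2 vs 0) imply all three probabilities,
# and the category-2-vs-1 log-odds is just their difference
eta = np.array([0.7, -0.2])
p = multinomial_probs(eta)
logodds_2v1 = np.log(p[2] / p[1])   # equals eta[1] - eta[0]
```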
Consult one of the many books on logistic regression for details.

Exact logistic regression with very small datasets

The methods presented in this chapter rely upon maximum likelihood methods and asymptotic arguments. In very small datasets, these large-sample approximations may not perform well.

There are several statistical packages which perform exact logistic regression and do not rely upon asymptotic arguments. A simple search of the web brings up several such packages.
More complex experimental designs

The results of this chapter have all assumed that the sampling design was a simple random sample or that the experimental design was a completely randomized design.

Logistic regression can be extended to many more complex designs.

In matched-pair designs, each "success" in the outcome is matched with a randomly chosen "failure" along as many covariates as possible. For example, lung cancer patients could be matched with healthy patients with common age, weight, occupation, and other covariates. These designs are very common in health studies. There are many good books on the analysis of such designs.

Clustered designs are also very common, where groups of subjects all receive a common treatment. For example, classrooms may be randomly assigned to different reading programs, and the success or failure of individual students within the classrooms in obtaining reading goals is assessed. Here the experimental unit is the classroom, not the individual student, and the methods of this chapter are not directly applicable. Several extensions have been proposed for this type of "correlated" binary data (students within the same classroom are all exposed to exactly the same set of experimental and non-experimental factors). The most common is known as Generalized Estimating Equations and is described in many books.

More complex experimental designs (e.g. split-plot designs) can also be run with binary outcomes. These complex designs require high-powered computational machinery to analyze.
7.12.3 Yet to do

- examples - Dov's example used in a comprehensive exam in previous years
Chapter 8<br />
Poiss<strong>on</strong> Regressi<strong>on</strong><br />
8.1 Introducti<strong>on</strong><br />
In past chapters, multiple-regressi<strong>on</strong> methods were used to predict a c<strong>on</strong>tinuous Y variable given a set of<br />
predictors, and logistic regressi<strong>on</strong> methods were used to predict a dichotomous categorical variable given a<br />
set of predictors.<br />
In this chapter, we will explore the use of Poiss<strong>on</strong>-regressi<strong>on</strong> methods that are typically used to predict<br />
counts of (rare) events given a set of predictors.<br />
Just as multiple-regressi<strong>on</strong> implicitly assumed that the Y variable had a normal distributi<strong>on</strong> and logisticregressi<strong>on</strong><br />
assumed that the choice of categories in Y was based <strong>on</strong> binomial distributi<strong>on</strong>, Poiss<strong>on</strong> regressi<strong>on</strong><br />
assumes that the observed counts are generated from a Poiss<strong>on</strong> distributi<strong>on</strong>.<br />
The Poisson distribution is often used to model count data when the events being counted are somewhat rare, e.g. cancer cases, the number of accidents, the number of satellite males around a female bird, etc. It is characterized by the expected number of events µ, with probability mass function:

P(Y = y | µ) = e^(−µ) µ^y / y!

where y! = y(y − 1)(y − 2) · · · (2)(1), and y ≥ 0. The probability mass function is available in tabular form, or can be computed by many statistical packages. While the values of Y are restricted to being non-negative integers, it is not necessary for µ to be an integer.
In the following graph, 1000 observations were each generated from a Poisson distribution with differing means.
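The pmf is easy to evaluate numerically. As a quick sketch (my own illustration using Python's scipy, which the notes themselves do not use):

```python
import math

from scipy.stats import poisson

mu = 3.0  # expected number of events; need not be an integer

# P(Y = y | mu) = exp(-mu) * mu**y / y!, checked here against scipy
y = 2
by_hand = math.exp(-mu) * mu**y / math.factorial(y)
assert abs(poisson.pmf(y, mu) - by_hand) < 1e-12

# The probabilities over y = 0, 1, 2, ... sum to one (up to a tiny tail)
total = sum(poisson.pmf(k, mu) for k in range(25))
print(total)  # very close to 1
```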
For very small values of µ, virtually all the counts are zero, with only a few counts that are positive. As µ increases, the shape of the distribution looks more and more like a normal distribution – indeed, for large µ, a normal distribution can be used as an approximation to the distribution of Y.
Sometimes µ is further parameterized by a rate parameter and a group size, i.e. µ = Nλ where λ is the rate per unit and N is the group size. For example, the number of cancers in a group of 100,000 people could be modeled using λ as the rate per 1000 people, and N = 100.
Two important properties of the Poisson distribution are:

E[Y] = µ
V[Y] = µ
Unlike the normal distribution, which has separate parameters for the mean and variance, the Poisson distribution's variance is equal to its mean. This means that once you estimate the mean, you have also estimated the variance, and so it is not necessary to have replicate counts to estimate the sample variance from data. As will be seen later, this can be quite limiting because for many populations the data are over-dispersed, i.e. the variance is greater than you would expect from a simple Poisson distribution.
Another important property is that the Poisson distribution is additive. If Y_1 is Poisson(µ_1) and Y_2 is Poisson(µ_2), and the two are independent, then Y = Y_1 + Y_2 is also Poisson(µ = µ_1 + µ_2).
Lastly, the Poisson distribution is a limiting distribution of a binomial distribution as n becomes large and p becomes very small, with np held constant.
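Both properties are easy to verify by simulation; a small sketch (Python/numpy, my own illustration rather than anything from the notes):

```python
import numpy as np

rng = np.random.default_rng(2012)

# Mean and variance of a Poisson sample are both (approximately) mu
y = rng.poisson(lam=4.0, size=100_000)
print(y.mean(), y.var())  # both near 4

# Additivity: independent Poisson(1.5) + Poisson(2.5) behaves like Poisson(4)
s = rng.poisson(1.5, 100_000) + rng.poisson(2.5, 100_000)
print(s.mean(), s.var())  # again both near 4
```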
Poisson regression is another example of a Generalized Linear Model (GLIM). 1 As in all GLIMs, the modeling process is a three-step affair:

Y_i is assumed Poisson(µ_i)
φ_i = log(µ_i)
φ_i = β_0 + β_1 X_i1 + β_2 X_i2 + . . .
Here the link function is the natural logarithm, log. In many cases, the mean changes in a multiplicative fashion. For example, if population size doubled, then the expected number of cancer cases should also double. As populations age, the rate of cancer increases linearly on a log-scale. Additionally, by modeling log(µ_i), it is impossible to get negative estimates of the mean.
The linear part of the GLIM can consist of continuous X or categorical X or mixtures of both types of predictors. Categorical variables will be converted to indicator variables in exactly the same way as in multiple- and logistic-regression.
Unlike multiple-regression, there are no closed-form solutions for the parameter estimates. Standard maximum likelihood estimation (MLE) methods are used. 2 MLEs are guaranteed to be the “best” estimators
1 Logistic regression is another GLIM.
2 A discussion of the theory of MLE is beyond the scope of this course, but is covered in Stat-330 and Stat-402.
(smallest standard errors) as the sample size increases, and seem to work well even if the sample sizes are not large. Standard methods are used to estimate the standard errors of the estimates. Model comparisons are done using likelihood-ratio tests whose test statistics follow a chi-square distribution, which is used to give a p-value that is interpreted in the standard fashion. Predictions are done in the usual fashion – these initially appear on the log-scale and must be anti-logged to provide estimates on the ordinary scale.
8.2 Experimental design
In this chapter, we will again assume that the data are collected under a completely randomized design. In some of the examples that follow, blocked designs will be analyzed, but we will not explore how to analyze split-plot or repeated-measures designs, or designs with pseudo-replication.
The analysis of such designs in a generalized linear models framework is possible – please consult with a statistician if you have a complex experimental design.
8.3 Data structure
The data structure is straightforward. Columns represent variables and rows represent observations. The response variable, Y, will be a count of the number of events and will be set to a continuous scale. The predictor variables, X, can be either continuous or categorical – in the latter case, indicator variables will be created.
As usual, the coding that a package uses for indicator variables is important if you want to interpret directly the estimates of the effect of the indicator variable. Consult the documentation for the package for details.
8.4 Single continuous X variable
The JMP file salamanders-burn.jmp, available in the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms, contains data on the number of salamanders in a fixed-size quadrat at various locations in a large forest. The locations of the quadrats were chosen to represent a range of years since a forest fire burned the understory.
A simple plot of the data:
shows an increasing relationship between the number of salamanders and the time since the forest understory burned.
Why can’t a simple regression analysis using standard normal theory be used to fit the curve?
First, the assumption of normality is suspect. The counts of the number of salamanders are discrete, with most under 10. It is impossible to get a negative number of salamanders, so the bottom left part of the graph would require the normal distribution to be truncated at Y = 0.
Second, it appears that the variance of the counts at any particular age increases with age since burned. This violates the assumption of equal variance for all X values made in standard regression models.
Third, the fitted line from ordinary regression could go negative. It is impossible to have a negative number of salamanders.
It seems reasonable that a Poisson distribution could be used to model the number of salamanders. They are relatively rare and seem to forage independently of each other. These conditions are the underpinnings of a Poisson distribution.
The process of fitting the model and interpreting the output is analogous to that used in logistic regression.
The basic model is then:

Y_i ∼ Poisson(µ_i)
θ_i = log(µ_i)
θ_i = β_0 + β_1 Years_i
As in the logistic model, the distribution of the data about the mean (line 1) has a link function (line 2) between the mean for each Y and the linear structural part of the model (line 3). In logistic regression, the logit link was used to ensure that all values of p were between 0 and 1. In Poisson regression, the log (natural logarithm) is traditionally used to ensure that the mean is always positive.
The model must be fit using maximum likelihood methods, just like in logistic regression.
This model is fit in JMP using the Analyze->Fit Model platform:
Be sure to specify the proper distribution and link function.
This gives the output:<br />
Most of the output parallels that seen in logistic regressi<strong>on</strong>. At the top of the output is a summary of variable<br />
being analyzed, the distributi<strong>on</strong> <str<strong>on</strong>g>for</str<strong>on</strong>g> the raw data, the link used, and the total number of observati<strong>on</strong> (rows in<br />
the dataset).<br />
The Whole Model Test is analogous to that in multiple-regressi<strong>on</strong> - is there evidence that the set of<br />
predictors (in this case there is <strong>on</strong>ly <strong>on</strong>e predictor) have any predictive ability over that seen by random<br />
chance. The test statistic is computed using a likelihood-ratio test comparing this model to a model with<br />
<strong>on</strong>ly the intercept. The p-value is very small, indicating that the model has some predictive ability. [Because<br />
there is <strong>on</strong>ly 1 predictor, this test is equivalent to the Effect Test discussed below.]<br />
The goodness-of-fit statistic compares the model with the intercept and the single predictor to a model where every observation is predicted individually. If the model fits well, the chi-square test statistic should be approximately equal to the degrees of freedom, and the p-value should be LARGE, i.e. much larger than .05. 3 There is no evidence of a problem in the fit. Later in this section, we will examine how to adjust for slight lack of fit.
The Effect tests examine whether each predictor (or, in the case of a categorical variable, the entire set of indicator variables) makes a statistically significant marginal contribution to the fit. As in the multiple-regression model, these are MARGINAL contributions, i.e. assuming that all other variables remain in the model and are fixed at their current values. There is only one predictor, and there is strong evidence against the hypothesis of no marginal contribution.
3 Remember that in goodness-of-fit tests, you DON’T want to find evidence against the null hypothesis.
Finally, the Parameter Estimates section reports the estimated β’s. So our fitted model is:

Y_i ∼ Poisson(µ_i)
θ_i = log(µ_i)
θ_i = 0.59 + 0.045 Years_i
Each line also tests if the corresponding population coefficient is zero. Because each of the X variables in the model is a single variable (i.e. not a set of categories), the results of the parameter-estimate tests match the effect tests.
We can obtain predictions by following the drop down menu:
For example, consider the first row of the data. At 12 years since the last burn, we estimate the mean response by starting at the bottom of the model and working upwards:

θ_1 = 0.59 + 0.045(12) = 1.12
µ_1 = exp(1.12) = 3.08

which is the predicted value in the table.
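With the rounded coefficients 0.59 and 0.045, the back-transform can be checked in two lines (the 1.12 and 3.08 reported in the text come from the unrounded estimates, so the rounded arithmetic lands slightly higher):

```python
import math

theta = 0.59 + 0.045 * 12  # linear predictor at Years = 12
mu_hat = math.exp(theta)   # anti-log to return to the count scale
print(round(theta, 2), round(mu_hat, 2))  # 1.13 and 3.1 with these rounded inputs
```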
As in ordinary normal-theory regression, confidence limits for the mean response and for an individual response may be found. The above table shows the confidence interval for the mean response.
Finally, a residual plot may also be constructed:
There is no evidence of a lack-of-fit.
8.5 Single continuous X variable - dealing with overdispersion
One of the weaknesses of Poisson regression is the very restrictive assumption that the variance of a Poisson distribution is equal to its mean. In some cases, data are over-dispersed, i.e. the variance is greater than predicted by a simple Poisson distribution. In this section, we will illustrate how to detect overdispersion and how to adjust the analysis to account for overdispersion.
In the section on Logistic Regression, a dataset was examined on nesting horseshoe crabs 4 that is analyzed in Agresti’s book. 5
The design of the study is given in Brockmann H.J. (1996). Satellite male groups in horseshoe crabs, Limulus polyphemus. Ethology, 102, 1-21. Again it is important to check that the design is a completely randomized design or a simple random sample. As in regression models, you do have some flexibility in the choice of the X settings, but for a particular weight and color, the data must be selected at random from the relevant population.
Each female horseshoe crab had a male resident in her nest. The study investigated other factors affecting whether the female had any other males, called satellites, residing nearby. These other factors include:
• crab color where 2=light medium, 3=medium, 4=dark medium, 5=dark.
• spine condition where 1=both good, 2=one worn or broken, or 3=both worn or broken.
• weight
• carapace width
In the section on Logistic Regression, a derived variable on the presence or absence of satellite males was examined. In this section, we will examine the actual number of satellite males.
A JMP dataset crabsatellites.jmp is available from the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms. A portion of the datafile is shown below:
4 See http://en.wikipedia.org/wiki/Horseshoe_crab.
5 These are available from Agresti’s web site at http://www.stat.ufl.edu/~aa/cda/sas/sas.html.
Note that the color and spine condition variables should be declared with an ordinal scale despite having numerical codes. In this analysis we will use the actual number of satellite males.
As noted in the section on Logistic Regression, a preliminary scatter plot of the variables shows some interesting features.
There is a very high positive relationship between carapace width and weight, but there are a few anomalous crabs that should be investigated further, as shown in this magnified plot:
There are three points with weights in the 1200-1300 g range whose carapace widths suggest that the weights should be in the 2200-2300 g range, i.e. a typographical error in the first digit. There is a single crab whose weight suggests a width of 24 cm rather than 21 cm – perhaps a typo in the last digit. Finally, there is one crab which is extremely large compared to the rest of the group. In the analysis that follows, I’ve excluded these five data values.
To begin with, fit a model that attempts to predict the mean number of satellite crabs as a function of the weight of the female crab, i.e.
Y_i distributed Poisson(µ_i)
λ_i = log(µ_i)
λ_i = β_0 + β_1 Weight_i
The Generalized Linear Model platform of JMP is used:
This gives selected output:
There are two parts of the output which show that the fit is not very satisfactory. First, while the studentized residual plot does not show any structural defects (the residuals are scattered around zero) 6 , it does show substantial numbers of points outside of the (−2, 2) range. This suggests that the data are too variable relative to the Poisson assumption. Second, the goodness-of-fit statistic has a very small p-value, indicating that the data are not well fit by the model.
This is an example of overdispersion. To see this overdispersion, divide the weight classes into categories, e.g. 0–2500 g, 2500–3000 g, etc. [This has already been done in the dataset.] 7 Now find the mean and variance of the number of satellite males for each weight class using the Tables->Summary platform:
6 The “lines” in the plot are artifacts of the discrete nature of the response. See the chapter on residual plots for more details.
7 The choice of 4 weight classes is somewhat arbitrary. I would usually try to subdivide the data into between 4 and 10 classes, ensuring that at least 20-30 observations are in each class.
If the Poisson assumption were true, then the variance of the number of satellite males should be roughly equal to the mean in each class. In fact, the variance in the number of satellite males appears to be roughly 3× that of the mean.
With generalized linear models, there are two ways to adjust for over-dispersion.
A different distribution can be used that is more flexible in the mean-to-variance ratio. A common distribution used in these cases is the negative binomial distribution. In more advanced classes, you will learn that the negative binomial distribution can arise from a Poisson distribution with extra variation in the mean rates. JMP does not allow the fitting of a negative binomial distribution, but this option is available in SAS.
An “ad hoc” method, that nevertheless has theoretical justification, is to allow some flexibility in the variance. For example, rather than restricting V[Y] = E[Y] = µ, perhaps V[Y] = cµ, where c is called the over-dispersion factor. Note that if this formulation is used, the data are no longer distributed as a Poisson distribution; in fact, there is NO actual probability function that has this property. Nevertheless, this quasi-distribution still has nice properties, and the over-dispersion factor can be estimated using quasi-likelihood methods that are analogous to regular likelihood methods.
The end result is that the over-dispersion factor is used to adjust the se and the test-statistics. The adjusted se are obtained by multiplying the se from the Poisson model by √ĉ. The adjusted chi-square test statistics are found by dividing the test statistics from the Poisson model by ĉ, and the p-value is adjusted by looking up the adjusted test-statistic in the appropriate table.
How is the over-dispersion factor c estimated? There are two methods, both of which are asymptotically equivalent. Both involve taking a goodness-of-fit statistic and dividing by its degrees of freedom:

ĉ = goodness-of-fit statistic / df

Usually, ĉ’s of less than 10 (corresponding to a potential inflation in the se by a factor of about 3) are acceptable – if the inflation factor is more than about 10, the lack-of-fit is so large that alternate methods should be used.
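The adjustment itself is simple arithmetic. A sketch using the chi-square and df values reported for the crab fit later in this section, with made-up (hypothetical) Poisson-fit values for the se and test statistic:

```python
import math

c_hat = 519.7857 / 166  # goodness-of-fit statistic / df
print(round(c_hat, 2))  # 3.13

se_poisson = 0.00038    # hypothetical se of a slope from the unadjusted Poisson fit
se_adjusted = se_poisson * math.sqrt(c_hat)  # se inflated by sqrt(c-hat)

chi2_poisson = 31.0     # hypothetical chi-square statistic from the Poisson fit
chi2_adjusted = chi2_poisson / c_hat         # test statistic deflated by c-hat
print(se_adjusted, chi2_adjusted)
```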
In JMP, the adjustment for over-dispersion occurs in the Analyze->Fit Model dialogue box:
The revised output is now:
Notice that the overdispersion factor has been estimated as

ĉ = chi-square / df = 519.7857 / 166 = 3.13

This is very close to the “guess” that we made based on looking at the variance-to-mean ratio among weight classes.
The estimated intercept and slope are unchanged and their interpretation is as before. For example, the
estimated slope of .000668 is the estimated increase in the log number of male satellite crabs when the female crab’s weight increases by 1 g. A 1000 g increase in body-weight corresponds to a 1000 × .000668 = .668 increase in the log(number of satellite males), which corresponds to an increase by a factor of e^.668 = 1.95, i.e. the mean number of male satellite crabs almost doubles. The estimated se has been “inflated” by √ĉ = √3.13 = 1.77. The confidence intervals for the slope and intercept are now wider.
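The arithmetic in this paragraph can be verified in a line each (my own check, using the reported estimates):

```python
import math

# 1000 g more body-weight multiplies the expected satellite count by exp(0.668)
factor = math.exp(1000 * 0.000668)
print(round(factor, 2))  # 1.95, i.e. the mean count nearly doubles

# se inflation implied by the estimated over-dispersion factor
inflation = math.sqrt(3.13)
print(round(inflation, 2))  # 1.77
```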
The chi-square test statistics have been “deflated” by ĉ and the p-values have been adjusted accordingly.
Finally, the residual plot has been rescaled by the factor of √ĉ and now most residuals lie between −2 and 2. Note that the pattern of the residual plot doesn’t change; all that the over-dispersion adjustment does is to change the residual variance so that the standardization brings them closer to 0.
Predictions of the mean response at levels of X are obtained in the usual fashion:
giving (partial output):
The se of the predicted mean will also have been adjusted for overdispersion, as will the confidence intervals for the mean number of male satellite crabs at each weight value.
However, notice that the menu item for a prediction interval for the INDIVIDUAL response is “grayed out”, and it is now impossible to obtain prediction intervals for the ACTUAL number of events. By using the overdispersion factor, you are no longer assuming that the counts are distributed as a Poisson distribution – in fact, there is NO REAL DISTRIBUTION that has the mean-to-variance ratio that is implicitly assumed when using the overdispersion factor. Without an actual distribution, it is impossible to make predictions for individual events.
We save the predicted values to the dataset and do a plot of the final results on both the ordinary scale:
and on the log-scale (the scale where the model is “linear”):
8.6 Single Continuous X variable with an OFFSET
In the previous examples, the sampling units (where the counts were obtained) were all the same size (e.g. the number of satellite males around a single female). In some cases, the sampling units are of different sizes.
For example, if the number of weeds is counted in a quadrat plot, then hopefully the size of the plot is constant. However, it is conceivable that the size of the plot varies because different people collected different parts of the data. Or if the number of events is counted in a time interval (e.g. the number of fish captured in a fishing trip), the time intervals could be of different sizes.
Often these types of data are pre-standardized, i.e. converted to a per-m² or per-hour basis, and then an
analysis is attempted on this standardized variable. However, standardization destroys the Poisson shape of the data and turns out to be unnecessary if the size of the sampling unit is also collected.
The incidence of non-melanoma skin cancer among women in the early 1970’s in Minneapolis-St Paul, Minnesota, and Dallas-Fort Worth, Texas is summarized below:
City  Age Class  Age Mid  Count  Pop Size
msp   15-24        20        1    172,675
msp   25-34        30       16    123,065
msp   35-44        40       30     96,216
msp   45-54        50       71     92,051
msp   55-64        60      102     72,159
msp   65-74        70      130     54,722
msp   75-84        80      133     32,185
msp   85+          90       40      8,328
dfw   15-24        20        4    181,343
dfw   25-34        30       38    146,207
dfw   35-44        40      119    121,374
dfw   45-54        50      221    111,353
dfw   55-64        60      259     83,004
dfw   65-74        70      310     55,932
dfw   75-84        80      226     29,007
dfw   85+          90       65      7,538
We will first examine the relationship of cancer incidence to age by using the age midpoint as our continuous X variable and only using the Minneapolis data (for now).
The data set is available in the JMP data file skincancer.jmp from the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms.
Is there a relationship between the age of a cohort and the cancer incidence rate? Notice that a comparison of the raw counts is not very sensible because of the different sizes of the age cohorts. Most people would first STANDARDIZE the incidence rate, e.g. find the incidence per person by dividing the number of cancers by the number of people in each cohort:
A plot of the standardized incidence rate by the mid-age of each cohort:
shows a curved relationship between the incidence rate and the mid-point of the age-cohort. This suggests a
theoretical model of the form:

Incidence = C e^age

i.e. an exponential increase in the cancer rates with age.
This suggests that a log-transform be applied to BOTH sides. However, a plot of the logarithm of the incidence rate against log(age midpoint) is still not linear, with a dip for the youngest cohorts. There appears to be a strong relationship between the log(cancer rate) and log(age) that may not be linear, but a quadratic looks as if it could fit quite nicely, i.e. a model of the form:

    log(incidence) = β0 + β1 log(age) + β2 log(age)² + residual
Is it possible to include the population size directly? Expand the above model:

    log(incidence) = β0 + β1 log(age) + β2 log(age)² + residual
    log(count / pop size) = β0 + β1 log(age) + β2 log(age)² + residual
    log(count) − log(pop size) = β0 + β1 log(age) + β2 log(age)² + residual
    log(count) = log(pop size) + β0 + β1 log(age) + β2 log(age)² + residual

Notice that log(pop size) has a known coefficient of 1 associated with it, i.e. there is NO β coefficient associated with log(pop size).
Also notice that log(POP SIZE) is known in advance and is NOT a parameter to be estimated. Variables such as population size are often called offset variables, and most packages expect to see the offset variable pre-transformed depending upon the link function used. In this case, the log link was used, so the offset is log(POP SIZE_age), as you will see in a minute.
Our GLIM model will then be:

    Y_age distributed Poisson(µ_age)
    φ_age = log(µ_age) = log(POP SIZE_age) + log(λ_age)
    log(λ_age) = β0 + β1 log(AGE) + β2 log(AGE)²

This can be rewritten slightly as:

    φ_age = log(POP SIZE_age) + β0 + β1 log(AGE) + β2 log(AGE)²

or

    log(λ_age) = φ_age − log(POP SIZE_age) = β0 + β1 log(AGE) + β2 log(AGE)²

So the modeling can be done in terms of estimating the effect of log(age) upon the incidence rate, rather than the raw counts, as long as the offset variable log(POP SIZE_age) is known.
To perform a Poisson regression, first create the offset variable log(POP SIZE_age) using the formula editor of JMP.
The Analyze->Fit Model platform launches the analysis:
Note that the raw count is the Y variable, and that the offset variable is specified separately from the X variables.
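These notes use JMP's point-and-click platform; for readers who want a scriptable cross-check, the same Poisson regression with an offset can be sketched in Python via iteratively reweighted least squares (IRLS). Everything below — the variable names and the IRLS loop — is my own sketch, not JMP output. It fits the linear model log(λ) = β0 + β1 log(age) to the Minneapolis data and should land near the estimates reported later for the final model (−21.32 and 3.60):

```python
import numpy as np

# Minneapolis data from the table at the start of this example.
age = np.array([20, 30, 40, 50, 60, 70, 80, 90], dtype=float)
count = np.array([1, 16, 30, 71, 102, 130, 133, 40], dtype=float)
pop = np.array([172675, 123065, 96216, 92051, 72159, 54722, 32185, 8328],
               dtype=float)

X = np.column_stack([np.ones_like(age), np.log(age)])  # intercept, log(age)
offset = np.log(pop)                                   # known coefficient of 1

# Start at the overall log-rate so the Newton/IRLS steps stay well behaved.
beta = np.array([np.log(count.sum() / pop.sum()), 0.0])
for _ in range(50):
    eta = offset + X @ beta          # linear predictor on the log scale
    mu = np.exp(eta)                 # fitted mean counts
    # Log link: the working weights are mu, and the working response
    # (with the offset removed) is the current fit plus (count - mu)/mu.
    z = (eta - offset) + (count - mu) / mu
    beta = np.linalg.solve(X.T @ (mu[:, None] * X), X.T @ (mu * z))
```

The fit is weighted toward the older cohorts with large counts, which is why the single case in the 15-24 cohort has little influence on the slope.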
The output is:
The goodness-of-fit statistic indicates no evidence of lack-of-fit, i.e. no need to adjust for over-dispersion.
Based on the results of the Effect Test for the quadratic term, it appears that a linear fit may actually be sufficient, as the p-value for the quadratic term is almost 10%. The reason for this apparent non-need for the quadratic term is that the smaller age-cohorts have very few counts and so the actual incidence rate is very imprecisely estimated.
Finally, the Parameter Estimates section reports the estimated β's (remember these are on the log-scale). Each line also tests if the corresponding population coefficient is zero. Because each of the X variables in the model is a single variable (i.e. not a category), the results of the parameter estimate tests match the effect tests.
Based on the output so far, it appears that we can drop the quadratic term. This term was dropped, and the model refit:
The final model is

    log(λ̂_age) = −21.32 + 3.60 log(age)

The predicted log(λ) for age 40 is found as:

    log(λ̂_40) = −21.32 + 3.60 log(40) = −8.04

This incidence rate is on the log-scale, so the predicted incidence rate is found by taking anti-logs: e^−8.04 = .000322, or .322/thousand people, or 322/million people.
In order to make predictions about the expected number of cancers in each age cohort that would be seen under this model, you would need to add back the log(POP SIZE) for the appropriate age class:

    log(µ̂_40) = log(λ̂_40) + log(POP SIZE_40) = −8.04 + 11.47 = 3.42

Finally, the predicted number of cases is simply the anti-log of this value:

    Ŷ_40 = e^{log(µ̂_40)} = e^{3.42} = 30.96
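Because the estimates are rounded to two decimals, the chain above can be re-checked with a few lines (a sketch; small discrepancies from the JMP output are rounding):

```python
import math

b0, b1 = -21.32, 3.60                         # estimates from the fitted model
log_lambda_40 = b0 + b1 * math.log(40)        # predicted log incidence rate
rate_40 = math.exp(log_lambda_40)             # incidence per person
log_mu_40 = log_lambda_40 + math.log(96216)   # add back log(POP SIZE) for 35-44
pred_count_40 = math.exp(log_mu_40)           # expected number of cases
```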
Of course, this can be done automatically by the platform by requesting:
This also allows you to save the confidence limits for the average number of skin cancers expected for this age class (the mean confidence bounds, assuming the same population size) and the confidence limits for the actual number of cases (the individual confidence bounds).
In this case, the expected number of skin cancer cases for the 35-44 age group is 30.69, with a 95% confidence interval for the mean number of cases ranging from (26.0 → 36.8). The confidence bound for the actual number of cases (assuming the model is correct) is somewhere between 19 and 43 cases.
By adding new data lines to the data table (before the model fit) with the Y variable missing, but the age and offset variable present, you can make forecasts for any set of new X values.
The residual plot:
isn’t too bad – the large negative residual for the first age class (near where 0 skin cancers are predicted) is a bit worrisome; I suspect this is where the quadratic curve may provide a better fit.
A plot of actual vs. predicted values can be obtained directly:
or by saving the predicted values to the data sheet, and using the Analyze->Fit Y-by-X platform with Fit Special to add the reference line:
These plots show excellent agreement with the data.
Finally, it is nice to construct an overlay plot of the empirical log(rates) (the first plot constructed) with the estimated log(rate) and confidence bounds as a function of log(age). Create the predicted log(rate) using the formula editor from the predicted skin cancer numbers by subtracting the log(POP SIZE) (why?):
Repeat the same formula for the lower and upper bounds of the 95% confidence interval for the mean number of cases:
Finally, use the Graph → Overlay Plot to plot the empirical estimates, the predicted values of λ and the 95% confidence interval for λ on the same plot:
and fiddle⁸ with the plot to join up predictions and confidence bounds but leave the actual empirical points as is to give the final plot:
8 I had to turn on the connect-through-missing option under the red triangle.
Remember that the point with the smallest log(rate) is based on a single skin cancer case and is not very reliable. That is why the quadratic fit was likely not selected.
8.7 ANCOVA models
Just like in regular multiple-regression, it is possible to mix continuous and categorical variables and test for parallelism of the effects. Of course this parallelism is assessed on the link scale (in most cases for Poisson data, on the log scale).
There is nothing new compared to what was seen with ordinary regression and logistic regression. The three appropriate models are:

    log(λ) = X
    log(λ) = X Cat
    log(λ) = X Cat X*Cat

where X is the continuous predictor, and Cat is the categorical predictor. The first model assumes a common line for all categories of the Cat variable. The second model assumes parallel slopes, but differing intercepts. The third model assumes separate lines for each category.
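To make the shorthand concrete, here is one common way the three models expand into design matrices, sketched in Python with a 0/1 dummy for a two-level Cat (JMP's own ±1 effect coding differs, as discussed later in the chapter; the data values here are purely illustrative):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 1.0, 2.0, 3.0])   # continuous predictor X
cat = np.array([0, 0, 0, 1, 1, 1])             # 0/1 dummy for a two-level Cat

ones = np.ones_like(x)
X1 = np.column_stack([ones, x])                # common line for all categories
X2 = np.column_stack([ones, x, cat])           # parallel slopes, shifted intercepts
X3 = np.column_stack([ones, x, cat, x * cat])  # separate slopes and intercepts
```

The X*Cat column is literally the elementwise product of the continuous predictor and the dummy, which is why a nonzero coefficient on it means the slopes differ.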
Fitting would start with the most complex model (the third model) and test if there is evidence of non-parallelism. If none were found, the second model would be examined, and a test would be made for common intercepts. Finally, the simplest model may be an adequate fit.
Let us return to the skin cancer data examined earlier in this chapter. It is of interest to see if there is a consistent difference in skin cancer rates between the two cities. Presumably, Dallas, which receives more intense sun, would have a higher skin cancer rate.
The data is available in the skincancer.jmp data set in the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms. Use all of the data. As before, the log(population size) will be the offset variable.
A preliminary data plot of the empirical cancer rate for the two cities:
shows roughly parallel responses, but now the curvature is much more pronounced in Dallas.
Perhaps a quadratic model should be fit first, with separate response curves for the two cities. In short-hand model notation, this is:

    log(λ) = City log(Age) log(Age)² City*log(Age) City*log(Age)²

where City is the effect of the two cities, log(Age) is the continuous X variable, and the interaction terms represent the non-parallelism of the responses. This is specified as:
As before, use the Generalized Linear Model option of the Analyze->Fit Model platform and don't forget to specify the log(popsize) as the offset variable. This gives the output:
The Whole Model Test shows evidence that the model has predictive ability. The Goodness-of-fit Test shows that this model is a reasonable fit (p-values around .30). The Effect Test shows that perhaps both of the interaction terms can be dropped, but some care must be taken as these are marginal tests and cannot simply be combined.
A “Chunk Test” similar to that seen in logistic regression can be done to see if both interaction terms can be dropped simultaneously:
The p-value is just above α = .05 so I would be a little hesitant to drop both interaction terms. On the other hand, some of the larger age classes have such large sample sizes and large count values that very minor differences in fit can likely be detected.
The simpler model with two parallel quadratic curves was then fit:
This simpler model also has no strong evidence of lack-of-fit. Now, however, the quadratic term cannot be dropped.
The parameter estimates must be interpreted carefully for categorical data. Every package codes indicator variables in different ways, and so the interpretation of the estimates associated with the indicator variables differs among packages. JMP codes indicator variables so that estimates are the difference in response between the specified level and the AVERAGE of all other levels. So in this case, the estimate associated with City[dfw], .401, represents 1/2 the distance between the two parallel curves. Consequently, the difference in log(λ) between Minneapolis and Dallas is 2 × .401 = .802 (SE 2 × .026 = .052). This is a consistent difference for all age groups.
This can also be estimated, without having to worry too much about the coding details, by doing a contrast between the estimates for the city effects:
This gives the same results as above.
This is a difference on the log-scale. As seen in an earlier chapter, this can be converted to an estimate of the ratio of incidence by taking anti-logs. In this case, Dallas is estimated to have e^.802 = 2.22 TIMES the skin cancer rate of Minneapolis. This is consistent with what is seen in the raw data. The SE of this ratio is found using an application of the Delta method⁹. The delta-method indicates that the SE of an exponentiated estimate is found as

    SE(e^θ̂) = SE(θ̂) e^θ̂

In this case

    SE(ratio) = .052 × 2.22 = .11
Confidence bounds are found by finding the usual confidence bounds on the log-scale and then taking anti-logs of the end points. In this case, the 95% confidence interval for the difference in log(λ) is (.802 − 2(.052) → .802 + 2(.052)) or (.698 → .906). Taking anti-logs gives a 95% confidence interval for the ratio of skin cancer rates of (2.01 → 2.47).
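Both the delta-method se and the back-transformed interval are easy to reproduce (a sketch using the rounded estimates above):

```python
import math

diff, se = 0.802, 0.052    # difference in log(lambda) and its se
ratio = math.exp(diff)     # estimated ratio of skin cancer rates, about 2.2
se_ratio = se * ratio      # delta method: SE(e^theta) = SE(theta) * e^theta

lo, hi = diff - 2 * se, diff + 2 * se            # 95% interval on the log scale
ratio_lo, ratio_hi = math.exp(lo), math.exp(hi)  # anti-log the end points
```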
The residual plot (not shown) looks reasonable.
8.8 Categorical X variables - a designed experiment
Just as ANOVA is used to analyze data from designed experiments, generalized linear models can also be used to analyze count data from designed experiments. However, JMP is limited to designs without random effects, e.g. no GLIMs that involve split-plot designs.
Consider an experiment to investigate 10 treatments (a control vs. a 3x3 factorial structure for two factors A and B) on controlling insect numbers. The experiment was run in a randomized block design (see earlier chapters). In each block, the 10 treatments were randomized to 10 different trees. On each tree, a trap was mounted, and the number of insects caught in each trap was recorded.
Here is the raw data.¹⁰
9 A form of a Taylor Series Expansion. Consult many books on statistics for details.
10 This is example 10.4.1 from SAS for Linear Models, 4th Edition. Data extracted from http://ftp.sas.com/samples/A56655 on 2006-07-19.
    Block  Treatment  A  B  Count
      1        1      1  1     6
      1        2      1  2     2
      1        5      2  2     3
      1        8      3  2     3
      1        7      3  1     1
      1        0      0  0    16
      1        3      1  3     4
      1        6      2  3     1
      1        9      3  3     1
      1        4      2  1     5
      2        1      1  1     9
      2        2      1  2     6
      2        5      2  2     4
      2        8      3  2     2
      2        7      3  1     2
      2        0      0  0    25
      2        3      1  3     3
      2        6      2  3     5
      2        9      3  3     0
      2        4      2  1     3
      3        1      1  1     2
      3        2      1  2    14
      3        5      2  2     6
      3        8      3  2     3
      3        7      3  1     2
      3        0      0  0     5
      3        3      1  3     5
      3        6      2  3    17
      3        9      3  3     2
      3        4      2  1     3
      4        1      1  1    22
      4        2      1  2     4
      4        5      2  2     3
      4        8      3  2     4
      4        7      3  1     3
      4        0      0  0     9
      4        3      1  3     5
      4        6      2  3     1
      4        9      3  3     9
      4        4      2  1     2
The data are available in the JMP data file insectcount.jmp in the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms.
The RCB model was fit using a generalized linear model with a log link:

    Count_i distributed Poisson(µ_i)
    φ_i = log(µ_i)
    φ_i = Block Treatment

where the simplified syntax Block and Treatment refers to block and treatment effects. Both Block and Treatment are categorical, and will be translated to sets of indicator variables in the usual way.
This model is fit in JMP using the Analyze->Fit Model platform:
Note that the block and treatment variables must be nominally scaled. There is NO offset variable as the insect traps were all of equal size.
This produces the output:
The Goodness-of-fit test shows strong evidence that the model doesn't fit, as the p-values are very small. Lack-of-fit can be caused by inadequacies of the actual model (perhaps a more complex model with block and treatment interactions is needed?), failure of the Poisson assumption, or using the wrong link-function.
The residual plot:
shows that the data are more variable than expected under a Poisson distribution (about 95% of the residuals should be within ± 2). The base model and link function seem reasonable as there is no pattern to the residuals, merely an over-dispersion relative to a Poisson distribution.
The adjustment for over-dispersion is made as seen earlier in the Analyze->Fit Model dialogue box:
which gives the revised output:
Note that the over-dispersion factor ĉ = 3.5. The test-statistics for the Effect Tests are adjusted by this factor (compare the chi-square of 76.37 for the treatment effects in the absence of adjusting for over-dispersion with the chi-square of 21.79 after adjusting for over-dispersion), and the p-values have been adjusted as well.
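The adjustment itself is just a rescaling; with the reported (rounded) ĉ the chi-square and the standard errors scale as (a sketch; JMP's 21.79 comes from the unrounded ĉ):

```python
import math

chat = 3.5                      # estimated over-dispersion factor
chisq_raw = 76.37               # treatment chi-square before adjustment
chisq_adj = chisq_raw / chat    # scaled test statistic, about 21.8
se_inflation = math.sqrt(chat)  # reported se's grow by sqrt(c-hat)
```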
The residuals have been adjusted by √ĉ and now look more acceptable:
Note that the pattern of the residual plot doesn't change; all that the over-dispersion adjustment does is change the residual variance so that the standardization brings the residuals closer to 0.
If you compare the parameter estimates between the two models, you will find that the estimates are unchanged, but the reported se are increased by √ĉ to account for over-dispersion. As is the case with all categorical X variables, the interpretation of the estimates for the indicator variables depends upon the coding used by the package. JMP uses a coding where each indicator variable is compared to the mean response over all indicator variables.
Predictions of the mean response at levels of X are obtained in the usual fashion. The se will also be adjusted for overdispersion. However, it is now impossible to obtain prediction intervals for the ACTUAL number of events. By using the overdispersion factor, you are no longer assuming that the counts are distributed as a Poisson distribution – in fact, there is NO REAL DISTRIBUTION that has the mean-to-variance ratio that is implicitly assumed when using the overdispersion factor. Without an actual distribution, it is impossible to make predictions for individual events.
If comparisons are of interest among the treatment levels, it is better to use the built-in Contrast facilities of the package to compute the estimates and standard errors rather than trying to do this by hand. For example, suppose we are interested in comparing treatment 0 (the control) to the treatment with factor A at level 1 and factor B at level 1 (corresponding to treatment 1). The contrast is estimated as:
The estimated difference in the log(mean) is −.34 (se .39) which corresponds to a ratio of e^−.34 = .71 of treatment 1 to control, i.e. on average, the number of insects in the treatment 1 traps is 71% of the number of insects in the control traps. An application of the delta-method shows that the se of the ratio is computed as se(e^θ̂) = se(θ̂) e^θ̂ = .39(.71) = .28. However, there was no evidence of a difference in trap counts as the standard error was sufficiently large. A 95% confidence interval for the difference in log(mean) is found as −.34 ± 2(.39) which gives (−1.12 → .44). Because the p-value was larger than α = .05, this confidence interval includes zero. When this interval is anti-logged, the 95% confidence interval for the ratio of mean counts is (.32 → 1.55), i.e. the true ratio of treatment counts to control counts is between .32 and 1.55. Because the p-value was greater than α = .05, this interval contains the value of 1 (indicating that the ratio of counts was 1:1). It is also correct to compute the 95% confidence interval for the ratio using the estimated ratio ± 2(its se). This gives (.71 ± 2(.28)) or (.15 → 1.27). In large samples, these confidence intervals are equivalent. In smaller samples, there is no real objective way to choose between them.
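The contrast arithmetic above can likewise be checked in a few lines (a sketch using the rounded output values):

```python
import math

diff, se = -0.34, 0.39    # difference in log(mean) and its se
ratio = math.exp(diff)    # about .71: treatment 1 traps vs control traps

# Interval on the log scale, then anti-logged.
lo, hi = diff - 2 * se, diff + 2 * se
ratio_ci = (math.exp(lo), math.exp(hi))

# Delta-method se of the ratio, and the (large-sample) direct interval.
se_ratio = se * ratio
direct_ci = (ratio - 2 * se_ratio, ratio + 2 * se_ratio)
```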
8.9 Log-linear models for multi-dimensional contingency tables
In the chapter on logistic regression, k × 2 contingency tables were analyzed to see if the proportions of responses in the population that fell in the two categories (e.g. survived or died) were the same across the k levels of the factor (e.g. sex, or passenger class, or dose of a drug).
The use of logistic regression is a special case of the general r × c contingency table where observations are classified by r levels of a factor and c levels of a response. In a separate chapter, χ² tests were used to test the hypothesis of equal population proportions in the c levels of the response across all levels of the factor. This is also known as the test of independence of the response to levels of the factor.
This can be generalized to the analysis of multi-dimensional tables using Poisson-regression. In more advanced courses, you can learn how the two previous cases are simple cases of this more general modelling approach. Consult Agresti's book for a fuller account of this topic.
8.10 Variable selection methods
To be added later
8.11 Summary
Poiss<strong>on</strong>-regressi<strong>on</strong> is the standard tool <str<strong>on</strong>g>for</str<strong>on</strong>g> the analysis of “smallish” count data. If the counts are large (say<br />
in the orders of hundreds), you could likely use ordinary or weighted regressi<strong>on</strong> methods without difficulty.<br />
This chapter <strong>on</strong>ly c<strong>on</strong>cerns itself with data collected under a simple random sample or a completely<br />
randomized design. If the data are collected under other designs, please c<strong>on</strong>sult with a statistician <str<strong>on</strong>g>for</str<strong>on</strong>g> the<br />
proper analysis.<br />
A comm<strong>on</strong> problem that have encountered are data that have been prestandardized. For example, data<br />
may recorded <strong>on</strong> the number of tree stems in a 100 m 2 test plots. This data could likely be modeled using<br />
poiss<strong>on</strong> regressi<strong>on</strong>. But, then the data are standardized to a “per hectare” basis. These standardized data<br />
c○2012 Carl James Schwarz 695 November 23, 2012
CHAPTER 8. POISSON REGRESSION<br />
are NO LONGER distributed as a Poiss<strong>on</strong> distributi<strong>on</strong>. It would be preferable to analyze the data using the<br />
sampling units that were used to collect the data with an offset variable being used to adjust <str<strong>on</strong>g>for</str<strong>on</strong>g> differing<br />
sizes of survey units.<br />
A common cause of overdispersion is non-independence in the data. For example, data may be collected using a cluster design rather than by a simple random sample. Overdispersion can be accounted for using quasi-likelihood methods. As a rule of thumb, overdispersion factors ĉ of 10 or less are acceptable. Very large overdispersion factors indicate other serious problems in the model. An alternative to the use of the correction factor is a different distribution such as the negative binomial distribution.
Related models for this chapter are the Zero-inflated Poisson (ZIP) models. In these models there is an excess number of zeroes relative to what would be expected under a Poisson model. The ZIP model has two parts – the probability that an observation will be zero, and then the distribution of the non-zero counts. There is a substantial base in the literature on this model.
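A minimal sketch of the ZIP probability function (the values of pi and lam below are illustrative only, not estimates from any data in this chapter):

```python
import math

def zip_pmf(k, pi, lam):
    """P(Y = k) under a zero-inflated Poisson: a point mass at zero
    with probability pi, mixed with a Poisson(lam) count."""
    poisson = math.exp(-lam) * lam ** k / math.factorial(k)
    return pi * (k == 0) + (1 - pi) * poisson

# The probabilities still sum to 1, but zeroes are more common
# than under a plain Poisson(lam).
pi, lam = 0.3, 2.0
total = sum(zip_pmf(k, pi, lam) for k in range(50))
p_zero = zip_pmf(0, pi, lam)
```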