
Columbia Mountain Institute for Applied Ecology

A short course on regression methods.

These notes are a subset of a more complete set of notes available at
http://www.stat.sfu.ca/~cschwarz/CourseNotes

C. J. Schwarz
Department of Statistics and Actuarial Science, Simon Fraser University
8888 University Drive
Burnaby, BC V5A 1S6
cschwarz@stat.sfu.ca

November 23, 2012


Contents

1 Correlation and simple linear regression 7
1.1 Introduction . . . . . . . . . . . . . . 7
1.2 Graphical displays . . . . . . . . . . . . . . 9
1.2.1 Scatterplots . . . . . . . . . . . . . . 9
1.2.2 Smoothers . . . . . . . . . . . . . . 10
1.3 Correlation . . . . . . . . . . . . . . 14
1.3.1 Scatter-plot matrix . . . . . . . . . . . . . . 15
1.3.2 Correlation coefficient . . . . . . . . . . . . . . 17
1.3.3 Cautions . . . . . . . . . . . . . . 19
1.3.4 Principles of Causation . . . . . . . . . . . . . . 21
1.4 Single-variable regression . . . . . . . . . . . . . . 23
1.4.1 Introduction . . . . . . . . . . . . . . 23
1.4.2 Equation for a line - getting notation straight (no pun intended) . . . . . . 23
1.4.3 Populations and samples . . . . . . . . . . . . . . 24
1.4.4 Assumptions . . . . . . . . . . . . . . 25
    Linearity . . . . . . . . . . . . . . 25
    Correct scale of predictor and response . . . . . . . . . . . . . . 25
    Correct sampling scheme . . . . . . . . . . . . . . 25
    No outliers or influential points . . . . . . . . . . . . . . 26
    Equal variation along the line . . . . . . . . . . . . . . 26
    Independence . . . . . . . . . . . . . . 26
    Normality of errors . . . . . . . . . . . . . . 27
    X measured without error . . . . . . . . . . . . . . 27
1.4.5 Obtaining Estimates . . . . . . . . . . . . . . 28
1.4.6 Obtaining Predictions . . . . . . . . . . . . . . 30
1.4.7 Residual Plots . . . . . . . . . . . . . . 31
1.4.8 Example - Yield and fertilizer . . . . . . . . . . . . . . 31
1.4.9 Example - Mercury pollution . . . . . . . . . . . . . . 43
1.4.10 Example - The Anscombe Data Set . . . . . . . . . . . . . . 48
1.4.11 Transformations . . . . . . . . . . . . . . 49
1.4.12 Example: Monitoring Dioxins - transformation . . . . . . . . . . . . . . 50
1.4.13 Example: Weight-length relationships - transformation . . . . . . . . . . 62
    Using the Fit Special . . . . . . . . . . . . . . 65
    Using derived variables . . . . . . . . . . . . . . 70


    A non-linear fit . . . . . . . . . . . . . . 74
1.4.14 Power/Sample Size . . . . . . . . . . . . . . 77
1.4.15 The perils of R² . . . . . . . . . . . . . . 78
1.5 A no-intercept model: Fulton's Condition Factor K . . . . . . . . . . . . . . 80
1.6 Frequently Asked Questions - FAQ . . . . . . . . . . . . . . 86
1.6.1 Do I need a random sample; power analysis . . . . . . . . . . . . . . 86

2 Detecting trends over time 89
2.1 Introduction . . . . . . . . . . . . . . 89
2.2 Simple Linear Regression . . . . . . . . . . . . . . 96
2.2.1 Populations and samples . . . . . . . . . . . . . . 96
2.2.2 Assumptions . . . . . . . . . . . . . . 97
    Linearity . . . . . . . . . . . . . . 97
    Scale of Y and X . . . . . . . . . . . . . . 97
    Correct sampling scheme . . . . . . . . . . . . . . 98
    No outliers or influential points . . . . . . . . . . . . . . 98
    Equal variation along the line . . . . . . . . . . . . . . 98
    Independence . . . . . . . . . . . . . . 98
    Normality of errors . . . . . . . . . . . . . . 98
    X measured without error . . . . . . . . . . . . . . 99
2.2.3 Obtaining Estimates . . . . . . . . . . . . . . 99
2.2.4 Obtaining Predictions . . . . . . . . . . . . . . 100
2.2.5 Inverse predictions . . . . . . . . . . . . . . 101
2.2.6 Residual Plots . . . . . . . . . . . . . . 102
2.2.7 Example: The Grass is Greener (for longer) . . . . . . . . . . . . . . 103
2.3 Transformations . . . . . . . . . . . . . . 114
2.3.1 Example: Monitoring Dioxins - transformation . . . . . . . . . . . . . . 115
2.3.2 Final Words . . . . . . . . . . . . . . 127
2.4 Power/Sample Size . . . . . . . . . . . . . . 127
2.4.1 Introduction . . . . . . . . . . . . . . 127
2.4.2 Getting the necessary information . . . . . . . . . . . . . . 131
2.4.3 How does power vary as information changes? . . . . . . . . . . . . . . 134
2.4.4 Finally - how many years do I need to monitor? . . . . . . . . . . . . . . 145
2.4.5 Summary of plans . . . . . . . . . . . . . . 151
2.5 Testing for common trend - ANCOVA . . . . . . . . . . . . . . 162
2.5.1 Assumptions . . . . . . . . . . . . . . 165
2.5.2 Statistical model . . . . . . . . . . . . . . 166
2.5.3 Example: Degradation of dioxin - pooling locations . . . . . . . . . . . . 169
2.5.4 Change in yearly average temperature with regime shifts . . . . . . . . . 186
2.6 Dealing with Autocorrelation . . . . . . . . . . . . . . 195
2.6.1 Example: Mink pelts from Saskatchewan . . . . . . . . . . . . . . 204
2.7 Dealing with seasonality . . . . . . . . . . . . . . 219
2.7.1 Empirical adjustment for seasonality . . . . . . . . . . . . . . 219
    General idea . . . . . . . . . . . . . . 219
    Example: Total phosphorus from Klamath River . . . . . . . . . . . . . . 219
2.7.2 Using the ANCOVA approach . . . . . . . . . . . . . . 227

© 2012 Carl James Schwarz, November 23, 2012



    General idea . . . . . . . . . . . . . . 227
    Example: Total phosphorus levels on the Klamath River - revisited . . . . . . 228
2.7.3 Fitting cyclical patterns . . . . . . . . . . . . . . 232
    General approach . . . . . . . . . . . . . . 232
    Example: Total phosphorus from Klamath River . . . . . . . . . . . . . . 233
    Example: Comparing air quality measurements using two different methods . . 240
2.7.4 Further comments . . . . . . . . . . . . . . 252
2.8 Seasonality and Autocorrelation . . . . . . . . . . . . . . 252
2.9 Non-parametric detection of trend . . . . . . . . . . . . . . 254
2.9.1 Cox and Stuart test for trend . . . . . . . . . . . . . . 255
2.9.2 Non-parametric regression - Spearman, Kendall, Theil, Sen estimates . . . 258
    Non-parametric does NOT mean no assumptions . . . . . . . . . . . . . . 258
    Example: The Grass is Greener (for longer) revisited . . . . . . . . . . . . 260
    Final Remarks . . . . . . . . . . . . . . 265
2.9.3 Dealing with seasonality - Seasonal Kendall's τ . . . . . . . . . . . . . . 265
    Basic principles . . . . . . . . . . . . . . 265
    Example: Total phosphorus on the Klamath River revisited . . . . . . . . . 268
    Final notes . . . . . . . . . . . . . . 275
2.9.4 Seasonality with Autocorrelation . . . . . . . . . . . . . . 276
    General ideas . . . . . . . . . . . . . . 276
2.10 Summary . . . . . . . . . . . . . . 277

3 Estimating power/sample size using Program Monitor 282
3.1 Mechanics of MONITOR . . . . . . . . . . . . . . 283
3.2 How does MONITOR work? . . . . . . . . . . . . . . 292
3.3 Incorporating process and sampling error . . . . . . . . . . . . . . 297
3.4 Presence/Absence Data . . . . . . . . . . . . . . 306
3.5 WARNING about using testing for temporal trends . . . . . . . . . . . . . . 309

4 Regression - hockey sticks, broken sticks, piecewise, change points 311
4.1 Hockey-stick, piecewise, or broken-stick regression . . . . . . . . . . . . . . 311
4.1.1 Example: Nenana River Ice Breakup Dates . . . . . . . . . . . . . . 312
4.2 Searching for the change point . . . . . . . . . . . . . . 317
4.2.1 Change point model for the Nenana River Ice Breakup . . . . . . . . . . 318
4.3 How NOT to search for a change point! . . . . . . . . . . . . . . 325

5 Analysis of Covariance - ANCOVA 327
5.1 Introduction . . . . . . . . . . . . . . 327
5.2 Assumptions . . . . . . . . . . . . . . 331
5.3 Comparing individual regression lines . . . . . . . . . . . . . . 331
5.4 Comparing Means after covariate adjustments . . . . . . . . . . . . . . 335
5.5 Power and sample size . . . . . . . . . . . . . . 335
5.6 Example - Degradation of dioxin . . . . . . . . . . . . . . 335
5.7 Change in yearly average temperature with regime shifts . . . . . . . . . . . 351
5.8 Example - More refined analysis of stream-slope example . . . . . . . . . . . 360
5.9 Comparing Fulton's Condition Factor K . . . . . . . . . . . . . . 374
5.10 Final Notes . . . . . . . . . . . . . . 388


6 Multiple linear regression 389
6.1 Introduction . . . . . . . . . . . . . . 389
6.1.1 Data format and missing values . . . . . . . . . . . . . . 389
6.1.2 The statistical model . . . . . . . . . . . . . . 390
6.1.3 Assumptions . . . . . . . . . . . . . . 391
    Linearity . . . . . . . . . . . . . . 391
    Correct sampling scheme . . . . . . . . . . . . . . 392
    No outliers or influential points . . . . . . . . . . . . . . 392
    Equal variation along the line . . . . . . . . . . . . . . 392
    Independence . . . . . . . . . . . . . . 392
    Normality of errors . . . . . . . . . . . . . . 393
    X variables measured without error . . . . . . . . . . . . . . 393
6.1.4 Obtaining Estimates . . . . . . . . . . . . . . 393
6.1.5 Predictions . . . . . . . . . . . . . . 394
6.1.6 Example: blood pressure . . . . . . . . . . . . . . 395
6.2 Regression problems and diagnostics . . . . . . . . . . . . . . 408
6.2.1 Introduction . . . . . . . . . . . . . . 408
6.2.2 Preliminary characteristics . . . . . . . . . . . . . . 408
6.2.3 Residual plots . . . . . . . . . . . . . . 409
6.2.4 Actual vs. Predicted Plot . . . . . . . . . . . . . . 411
6.2.5 Detecting influential observations . . . . . . . . . . . . . . 411
    Cook's D . . . . . . . . . . . . . . 411
    Hats . . . . . . . . . . . . . . 412
    Caution . . . . . . . . . . . . . . 412
6.2.6 Leverage plots . . . . . . . . . . . . . . 412
6.2.7 Collinearity . . . . . . . . . . . . . . 420
6.3 Polynomial, product, and interaction terms . . . . . . . . . . . . . . 422
6.3.1 Introduction . . . . . . . . . . . . . . 422
6.3.2 Example: Tomato growth as a function of water . . . . . . . . . . . . . . 423
6.3.3 Polynomial models with several variables . . . . . . . . . . . . . . 440
6.3.4 Cross-product and interaction terms . . . . . . . . . . . . . . 441
6.4 The general linear test . . . . . . . . . . . . . . 442
6.4.1 Introduction . . . . . . . . . . . . . . 442
6.4.2 Example: Predicting body fat from measurements . . . . . . . . . . . . . 443
6.4.3 Summary . . . . . . . . . . . . . . 449
6.5 Indicator variables . . . . . . . . . . . . . . 449
6.5.1 Introduction . . . . . . . . . . . . . . 449
6.5.2 Defining indicator variables . . . . . . . . . . . . . . 450
6.5.3 The ANCOVA model . . . . . . . . . . . . . . 451
6.5.4 Assumptions . . . . . . . . . . . . . . 455
6.5.5 Comparing individual regression lines . . . . . . . . . . . . . . 455
6.5.6 Example: Degradation of dioxin . . . . . . . . . . . . . . 459
6.5.7 Example: More refined analysis of stream-slope example . . . . . . . . . 474
6.6 Example: Predicting PM10 levels . . . . . . . . . . . . . . 482
6.7 Variable selection methods . . . . . . . . . . . . . . 497
6.7.1 Introduction . . . . . . . . . . . . . . 497


6.7.2 Maximum model . . . . . . . . . . . . . . 498
6.7.3 Selecting a model criterion . . . . . . . . . . . . . . 499
    R² . . . . . . . . . . . . . . 499
    Fp . . . . . . . . . . . . . . 499
    MSEp . . . . . . . . . . . . . . 500
    Cp and AIC . . . . . . . . . . . . . . 500
6.7.4 Which subsets should be examined . . . . . . . . . . . . . . 501
    All possible subsets . . . . . . . . . . . . . . 501
    Backward elimination . . . . . . . . . . . . . . 501
    Forward addition . . . . . . . . . . . . . . 502
    Stepwise selection . . . . . . . . . . . . . . 502
    Closing words . . . . . . . . . . . . . . 502
6.7.5 Goodness-of-fit . . . . . . . . . . . . . . 502
6.7.6 Example: Calories of candy bars . . . . . . . . . . . . . . 503
6.7.7 Example: Fitness dataset . . . . . . . . . . . . . . 514
6.7.8 Example: Predicting zooplankton biomass . . . . . . . . . . . . . . 514

7 Logistic Regression 528
7.1 Introduction . . . . . . . . . . . . . . 528
7.1.1 Difference between standard and logistic regression . . . . . . . . . . . 528
7.1.2 The Binomial Distribution . . . . . . . . . . . . . . 529
7.1.3 Odds, risk, odds-ratio, and probability . . . . . . . . . . . . . . 530
7.1.4 Modeling the probability of success . . . . . . . . . . . . . . 532
7.1.5 Logistic regression . . . . . . . . . . . . . . 537
7.2 Data Structures . . . . . . . . . . . . . . 544
7.3 Assumptions made in logistic regression . . . . . . . . . . . . . . 545
7.4 Example: Space Shuttle - Single continuous predictor . . . . . . . . . . . . . 546
7.5 Example: Predicting Sex from physical measurements - Multiple continuous predictors . . 552
7.6 Examples: Lung Cancer vs. Smoking; Marijuana use of students based on parental usage - Single categorical predictor . . 563
7.6.1 Retrospective and Prospective odds-ratios . . . . . . . . . . . . . . 563
7.6.2 Example: Parental and student usage of recreational drugs . . . . . . . . 565
7.6.3 Example: Effect of selenium on tadpole deformities . . . . . . . . . . . . 574
7.7 Example: Pet fish survival as function of covariates - Multiple categorical predictors . . 586
7.8 Example: Horseshoe crabs - Continuous and categorical predictors . . . . . . 601
7.9 Assessing goodness of fit . . . . . . . . . . . . . . 617
7.10 Variable selection methods . . . . . . . . . . . . . . 622
7.10.1 Introduction . . . . . . . . . . . . . . 622
7.10.2 Example: Predicting credit worthiness . . . . . . . . . . . . . . 624
7.11 Model comparison using AIC . . . . . . . . . . . . . . 631
7.12 Final Words . . . . . . . . . . . . . . 632
7.12.1 Two common problems . . . . . . . . . . . . . . 632
    Zero counts . . . . . . . . . . . . . . 632
    Complete separation . . . . . . . . . . . . . . 632
7.12.2 Extensions . . . . . . . . . . . . . . 633
    Choice of link function . . . . . . . . . . . . . . 633


    More than two response categories . . . . . . . . . . . . . . 633
    Exact logistic regression with very small datasets . . . . . . . . . . . . . . 633
    More complex experimental designs . . . . . . . . . . . . . . 634
7.12.3 Yet to do . . . . . . . . . . . . . . 634

8 Poisson Regression 635
8.1 Introduction . . . . . . . . . . . . . . 635
8.2 Experimental design . . . . . . . . . . . . . . 638
8.3 Data structure . . . . . . . . . . . . . . 638
8.4 Single continuous X variable . . . . . . . . . . . . . . 638
8.5 Single continuous X variable - dealing with overdispersion . . . . . . . . . . 644
8.6 Single Continuous X variable with an OFFSET . . . . . . . . . . . . . . 661
8.7 ANCOVA models . . . . . . . . . . . . . . 675
8.8 Categorical X variables - a designed experiment . . . . . . . . . . . . . . 685
8.9 Log-linear models for multi-dimensional contingency tables . . . . . . . . . . 695
8.10 Variable selection methods . . . . . . . . . . . . . . 695
8.11 Summary . . . . . . . . . . . . . . 695



Chapter 1

Correlation and simple linear regression

1.1 Introduction

A nice book explaining how to use JMP to perform regression analysis is: Freund, R., Littell, R., and Creighton, L. (2003) Regression using JMP. Wiley Interscience.

Much of statistics is concerned with relationships among variables and whether observed relationships are real or simply due to chance. In particular, the simplest case deals with the relationship between two variables.

Quantifying the relationship between two variables depends upon the scale of measurement of each of the two variables. The following table summarizes some of the important analyses that are often performed to investigate the relationship between two variables.




Type of variables and common analyses:

Y is Interval or Ratio (what JMP calls Continuous):
- X is Interval or Ratio (Continuous): scatterplots; running median/spline fit; regression; correlation
- X is Nominal or Ordinal: side-by-side dot plots; side-by-side box plots; ANOVA or t-tests

Y is Nominal or Ordinal (either type of X):
- logistic regression; mosaic charts; contingency tables; chi-square tests

In JMP these combinations of two variables are analyzed with the Analyze->Fit Y-by-X platform, the Analyze->Correlation-of-Ys platform, or the Analyze->Fit Model platform.

When analyzing two variables, one question becomes important as it determines the type of analysis that will be done. Is the purpose to explore the nature of the relationship, or is the purpose to use one variable to explain variation in another variable? For example, there is a difference between examining height and weight to see if there is a strong relationship, as opposed to using height to predict weight.

Consequently, you need to distinguish between a correlational analysis, in which only the strength of the relationship will be described, and regression, where one variable will be used to predict the values of a second variable.

The two variables are often called either a response variable or an explanatory variable. A response variable (also known as a dependent or Y variable) measures the outcome of a study. An explanatory variable (also known as an independent or X variable) is the variable that attempts to explain the observed outcomes.

© 2012 Carl James Schwarz. November 23, 2012.

CHAPTER 1. CORRELATION AND SIMPLE LINEAR REGRESSION

1.2 Graphical displays

1.2.1 Scatterplots

The scatter-plot is the primary graphical tool used when exploring the relationship between two interval or ratio scale variables. This is obtained in JMP using the Analyze->Fit Y-by-X platform; be sure that both variables have a continuous scale.

In graphing the relationship, the response variable is usually plotted along the vertical axis (the Y axis) and the explanatory variable is plotted along the horizontal axis (the X axis). It is not always perfectly clear which is the response and which is the explanatory variable. If there is no distinction between the two variables, then it doesn't matter which variable is plotted on which axis; this usually only happens when finding the correlation between variables is the primary purpose.

For example, look at the relationship between calories/serving and fat from the cereal dataset using JMP. [We will create the graph in class at this point.]

What to look for in a scatter-plot

Overall pattern. What is the direction of association? A positive association occurs when above-average values of one variable tend to be associated with above-average values of another. The plot will have an upward slope. A negative association occurs when above-average values of one variable are associated with below-average values of another variable. The plot will have a downward slope. What happens when there is “no association” between the two variables?

Form of the relationship. Does a straight line seem to fit through the ‘middle’ of the points? Is the relationship linear (the points seem to cluster around a straight line) or is it curvi-linear (the points seem to form a curve)?

Strength of association. Are the points clustered tightly around the curve? If the points have a lot of scatter above and below the trend line, then the association is not very strong. On the other hand, if the amount of scatter above and below the trend line is very small, then there is a strong association.

Outliers. Are there any points that seem to be unusual? Outliers are values that are unusually far from the trend curve, i.e. they are further away from the trend curve than you would expect from the usual level of scatter. There is no formal rule for detecting outliers; use common sense. [If you set the role of a variable to be a label, and click on points in a linked graph, the label for the point will be displayed, making it easy to identify such points.]

One's usual initial suspicion about any outlier is that it is a mistake, e.g. a transcription error. Every effort should be made to trace the data back to its original source and correct the value if possible. If the data value appears to be correct, then you have a bit of a quandary. Do you keep the data point in even though it doesn't follow the trend line, or do you drop the data point because it appears to be anomalous? Fortunately, with computers it is relatively easy to repeat an analysis with and without an outlier; if there is very little difference in the final outcome, don't worry about it.


In some cases, the outliers are the most interesting part of the data. For example, for many years the ozone hole in the Antarctic was missed because the computers were programmed to ignore readings that were so low that ‘they must be in error’!

Lurking variables. A lurking variable is a third variable that is related to both variables and may confound the association.

For example, the amount of chocolate consumed in Canada and the number of automobile accidents are positively related, but most people would agree that this is coincidental and each variable is independently driven by population growth.

Sometimes the lurking variable is a ‘grouping’ variable of sorts. This is often examined by using a different plotting symbol to distinguish between the values of the third variable. For example, consider the following plot of the relationship between salary and years of experience for nurses. The individual lines show a positive relationship, but the overall pattern, when the data are pooled, shows a negative relationship.

It is easy in JMP to assign different plotting symbols (what JMP calls markers) to different points. From the Row menu, use Where to select rows. Then assign markers to those rows using the Rows->Markers menu.

1.2.2 Smoothers

Once the scatter-plot is plotted, it is natural to try to summarize the underlying trend line. For example, consider the following data:


There are several common methods available to fit a line through this data.

By eye. The eye has remarkable power for providing a reasonable approximation to an underlying trend, but it needs a little education. A trend curve is a good summary of a scatter-plot if the differences between the individual data points and the underlying trend line (technically called residuals) are small. As well, a good trend curve tries to minimize the total of the residuals, and the trend line should try to go through the middle of most of the data.

Although the eye often gives a good fit, different people will draw slightly different trend curves. Several automated ways to derive trend curves are in common use; bear in mind that the best ways of estimating trend curves try to mimic what the eye does so well.

Median or mean trace. The idea is very simple. We choose a “window” of width w, say. For each point along the bottom (X) axis, the smoothed value is the median or average of the Y-values for all data points with X-values lying within the window centered on this point. The trend curve is then the trace of these medians or means over the entire plot. The result is not exactly smooth. Generally, the wider the window chosen, the smoother the result. However, wider windows make the smoother react more slowly to changes in trend. Smoothing techniques are too computationally intensive to be performed by hand. Unfortunately, JMP is unable to compute the trace of data, but splines are a very good alternative (see below).

The mean or median trace is too unsophisticated to be a generally useful smoother. For example, the simple averaging causes it to under-estimate the heights of peaks and over-estimate the heights of troughs. (Can you see why this is so? Draw a picture with a peak.) However, it is a useful way of trying to summarize a pattern in a weak relationship for a moderately large data set. In a very weak relationship it can even help you to see the trend.
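The window calculation just described can be sketched in a few lines of code. This is only an illustration with made-up (x, y) values, not anything JMP produces; the window width w is the tuning constant discussed above.

```python
from statistics import mean, median

def trace(x, y, w, stat=median):
    """Running median (or mean) trace: for each x-value, smooth y using
    all points whose x lies within the window [x - w/2, x + w/2]."""
    smoothed = []
    for xi in x:
        in_window = [yj for xj, yj in zip(x, y) if abs(xj - xi) <= w / 2]
        smoothed.append(stat(in_window))
    return smoothed

# A small made-up example: a noisy upward trend.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2, 1, 4, 3, 6, 9, 7, 10]
print(trace(x, y, w=4))             # median trace
print(trace(x, y, w=4, stat=mean))  # mean trace; a wider w gives a smoother result
```

Passing `stat=mean` gives the mean trace instead of the median trace; near the ends of the plot the window contains fewer points, which is one reason the trace is rough there.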


Box plots for strips. The following gives a conceptually simple method which is useful for exploring a weak relationship in a large data set. The X-axis is divided into equal-sized intervals. Then separate box plots of the values of Y are found for each strip. The box-plots are plotted side-by-side and the means or medians are joined. Again, we are able to see what is happening to the variability as well as the trend. There is even more detailed information available in the box plots about the shape of the Y-distribution, etc. Again, this is too tedious to do by hand. It is possible to make this plot in JMP by creating a new variable that groups the values of the X variable into classes and then using the Analyze->Fit Y-by-X platform with these groupings. This is illustrated below:
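The grouping step behind this plot can also be sketched in code (an illustration only, with made-up numbers): bin X into equal-width strips, then summarize Y within each strip. The per-strip medians are the values that would be joined across the side-by-side box plots.

```python
from statistics import median

def strip_medians(x, y, n_strips):
    """Divide the range of x into n_strips equal-width intervals and
    return the median of y within each strip (None for empty strips)."""
    lo, hi = min(x), max(x)
    width = (hi - lo) / n_strips
    strips = [[] for _ in range(n_strips)]
    for xi, yi in zip(x, y):
        k = min(int((xi - lo) / width), n_strips - 1)  # clamp max(x) into the last strip
        strips[k].append(yi)
    return [median(s) if s else None for s in strips]

# Made-up data with a weak upward trend, summarized in 3 strips.
x = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
y = [1, 2, 2, 3, 5, 4, 6, 7, 7, 9]
print(strip_medians(x, y, 3))
```

In JMP the analogous step is creating the grouping variable by hand and letting Fit Y-by-X draw the box plot for each group.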

Spline methods. A spline is a series of short smooth curves that are joined together to create a larger smooth curve. The computational details are complex, but can be done in JMP. The stiffness of the spline indicates how straight the resulting curve will be. The following shows two spline fits to the same data with different stiffness measures:


1.3 Correlation

WARNING!: Correlation is probably the most abused concept in statistics. Many people use the word ‘correlation’ to mean any type of association between two variables, but it has a very strict technical meaning, i.e. the strength of an apparent linear relationship between two interval or ratio scaled variables.

The correlation measure does not distinguish between explanatory and response variables and it treats the two variables symmetrically. This means that the correlation between Y and X is the same as the correlation between X and Y.

Correlations are computed in JMP using the Analyze->Correlation of Y's platform. If there are several variables, then the results will be organized into a table. Each cell in the table shows the correlation of the two corresponding variables. Because of symmetry (the correlation between variable 1 and variable 2 is the same as between variable 2 and variable 1), only part of the complete matrix will be shown. As well, the correlation between any variable and itself is always 1.


1.3.1 Scatter-plot matrix

To illustrate the ideas of correlation, look at the FITNESS dataset in the DATAMORE directory of JMP. This is a dataset on 31 people at a fitness centre; the following variables were measured on each subject:

• name
• gender
• age
• weight
• oxygen consumption (high values are typically more fit people)
• time to run one mile (1.6 km)
• average pulse rate during the run
• the resting pulse rate
• maximum pulse rate during the run.

We are interested in examining the relationship among the variables. At the moment, ignore the fact that the data contains both genders. [It would be interesting to assign different plotting symbols to the two genders to see if gender is a lurking variable.]

One of the first things to do is to create a scatter-plot matrix of all the variables. Use the Analyze->Correlation of Ys platform to get the following scatter-plot:


Interpreting the scatter plot matrix

The entries in the matrix are scatter-plots for all the pairs of variables. For example, the entry in row 1 column 3 represents the scatter-plot between age and oxygen consumption with age along the vertical axis and oxygen consumption along the horizontal axis, while the entry in row 3 column 1 has age along the horizontal axis and oxygen consumption along the vertical axis.


There is clearly a difference in the ‘strength’ of relationships. Compare the scatter plot for average running pulse rate and maximum pulse rate (row 5, column 7) to that of running pulse rate and resting pulse rate (row 5, column 6) to that of running pulse rate and weight (row 5, column 2).

Similarly, there is a difference in the direction of association. Compare the scatter plot for the average running pulse rate and maximum pulse rate (row 5, column 7) and that for oxygen consumption and running time (row 3, column 4).

1.3.2 Correlation coefficient

It is possible to quantify the strength of association between two variables. As with all statistics, the way the data are collected influences the meaning of the statistics.

The population correlation coefficient between two variables is denoted by the Greek letter rho (ρ) and is computed as:

    ρ = (1/N) Σ_{i=1}^{N} [(X_i - μ_X)/σ_X] [(Y_i - μ_Y)/σ_Y]

The corresponding sample correlation coefficient, denoted r, has a similar form:¹

    r = (1/(n-1)) Σ_{i=1}^{n} [(X_i - X̄)/s_x] [(Y_i - Ȳ)/s_y]

If the sampling scheme is a simple random sample from the corresponding population, then r is an estimate of ρ. This is a crucial assumption. If the sampling is not a simple random sample, the above definition of the sample correlation coefficient should not be used! It is possible to find a confidence interval for ρ and to perform statistical tests that ρ is zero. However, for the most part, these are rarely done in ecological research and so will not be pursued further in this course.
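For readers who like to see the definition in action, here is a direct transcription of the formula for r with made-up numbers. This is a sketch of the definition only; as the footnote to the formula warns, real software uses more numerically stable computing formulae.

```python
from math import sqrt

def sample_corr(x, y):
    """Sample correlation r, straight from the definition:
    r = 1/(n-1) * sum of (standardized x) * (standardized y)."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sx = sqrt(sum((xi - xbar) ** 2 for xi in x) / (n - 1))
    sy = sqrt(sum((yi - ybar) ** 2 for yi in y) / (n - 1))
    return sum((xi - xbar) / sx * (yi - ybar) / sy
               for xi, yi in zip(x, y)) / (n - 1)

x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
print(round(sample_corr(x, y), 4))
# r treats the two variables symmetrically: sample_corr(y, x) gives the
# same value. It is also unchanged by linear transformations of either
# variable, e.g. converting x from inches to centimetres.
```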

The form of the formula does provide some insight into interpreting its value.

• ρ and r are unitless measures (unlike many other parameters and statistics).

• The sign of ρ and r is largely determined by the relationship of each (X, Y) pair to the respective means: if both X and Y are above their means, or both are below their means, the pair contributes a positive value towards ρ or r, while if X is above and Y is below (or X is below and Y is above) their respective means, the pair contributes a negative value towards ρ or r.

• ρ and r range from -1 to 1. A value of -1 implies a perfect negative correlation; a value of 1 implies a perfect positive correlation; a value of 0 implies no correlation. A perfect correlation (ρ or r equal to 1 or -1) implies that all points lie exactly on a straight line, but the slope of the line has NO effect on the correlation coefficient. This latter point is IMPORTANT and often is wrongly interpreted (examples will be given in class).

• ρ and r are unaffected by linear transformations of the individual variables, e.g. unit changes such as converting from imperial to metric units.

• ρ and r measure only the linear association; they are not affected by the slope of the line, but only by the scatter about the line.

¹ Note that this formula SHOULD NOT be used for the actual computation of r; it is numerically unstable and there are better computing formulae available.

Because correlation assumes both variables have an interval or ratio scale, it makes no sense to compute the correlation

• between gender and oxygen consumption (gender is nominal scale data);

• between variables that are non-linearly related (not shown on graph);

• for data collected without a known probability scheme. If a sampling scheme other than simple random sampling is used, it is possible to modify the estimation formula; if a non-probability sampling scheme was used, the patient is dead on arrival, and no amount of statistical wizardry will revive the corpse.

The data collection scheme for the fitness data set is unknown; we will have to assume that some sort of random sample from the relevant population was taken before we can make much sense of the numbers computed.

Before looking at the details of its computation, look at the sample correlation coefficients for each scatter plot above. These can be arranged into a matrix:

Variable    Age    Weight   Oxy   Runtime  RunPulse  RstPulse  MaxPulse
Age        1.00    -0.24  -0.31     0.19     -0.31     -0.15     -0.41
Weight    -0.24     1.00  -0.16     0.14      0.18      0.04      0.24
Oxy       -0.31    -0.16   1.00    -0.86     -0.39     -0.39     -0.23
Runtime    0.19     0.14  -0.86     1.00      0.31      0.45      0.22
RunPulse  -0.31     0.18  -0.39     0.31      1.00      0.35      0.92
RstPulse  -0.15     0.04  -0.39     0.45      0.35      1.00      0.30
MaxPulse  -0.41     0.24  -0.23     0.22      0.92      0.30      1.00

Notice that the sample correlation between any two variables is the same regardless of the ordering of the variables; this explains the symmetry in the matrix between the above- and below-diagonal elements. As well, each variable has a perfect sample correlation with itself; this explains the value of 1 along the main diagonal.

Compare the sample correlations between the average running pulse rate and the other variables to the corresponding scatter-plots above.


1.3.3 Cautions

• Random sampling required. Sample correlation coefficients are only valid under simple random samples. If the data were collected in a haphazard fashion, or if certain data points were oversampled, then the correlation coefficient may be severely biased.

• There are examples of high correlation but no practical use, and of low correlation but great practical use. These will be presented in class. This illustrates why I almost never talk about correlation.

• Correlation measures the ‘strength’ of a linear relationship; a curvilinear relationship may have a correlation of 0, yet the two variables may still be strongly related.

• The effect of outliers and high leverage points will be presented in class.

• Effects of lurking variables. For example, suppose there is a positive association between wages of male nurses and years of experience, and between wages of female nurses and years of experience, but males are generally paid more than females. There can be a positive correlation within each group, but an overall negative correlation when the data are pooled together.
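This pooling effect is easy to verify numerically. The sketch below uses made-up wage and experience numbers, chosen so that each group is perfectly positively correlated while the pooled data are negatively correlated:

```python
from math import sqrt

def corr(x, y):
    """Sample correlation coefficient, computed directly from the definition."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    syy = sum((yi - ybar) ** 2 for yi in y)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    return sxy / sqrt(sxx * syy)

# Hypothetical wages (thousands) vs. years of experience.
years_m, wage_m = [1, 2, 3], [50, 52, 54]   # male nurses: higher pay, less experience
years_f, wage_f = [6, 7, 8], [30, 32, 34]   # female nurses: lower pay, more experience

print(corr(years_m, wage_m))                     # positive within group
print(corr(years_f, wage_f))                     # positive within group
print(corr(years_m + years_f, wage_m + wage_f))  # negative when pooled
```

Within each group wages rise with experience, but because one group is paid more overall, pooling the groups reverses the sign of the correlation.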


• Ecological fallacy: the problem of correlation applied to averages. Even if there is a high correlation between the averages of two variables, it does not imply that there is a correlation between individual data values.

For example, if you look at the average consumption of alcohol and the consumption of cigarettes, there is a high correlation among the averages when the 12 values from the provinces and territories are plotted on a graph. However, the individual relationships within provinces can be reversed or non-existent, as shown below:


The relationship between cigarette consumption and alcohol consumption shows no relationship within each province, yet there is a strong correlation among the per-capita averages. This is an example of the ecological fallacy.

• Correlation does not imply causation. This is the most frequent mistake made by people. There is a set of principles of causal inference that need to be satisfied in order to imply cause and effect.

1.3.4 Principles of Causation

Types of association

An association may be found between two variables for several reasons (show causal modeling figures):

• There may be direct causation, e.g. smoking causes lung cancer.

• There may be a common cause, e.g. ice cream sales and the number of drownings both increase with temperature.

• There may be a confounding factor, e.g. highway fatalities decreased when the speed limits were reduced to 55 mph at the same time that the oil crisis caused supplies to be reduced and people drove fewer miles.


• There may be a coincidence, e.g. the population of Canada has increased at the same time as the moon has gotten closer by a few miles.

Establishing cause-and-effect

How do we establish a cause and effect relationship? Bradford Hill (Hill, A. B. 1971. Principles of Medical Statistics, 9th ed. New York: Oxford University Press) outlined 7 criteria that have been adopted by many epidemiological researchers. It is generally agreed that most or all of the following must be considered before causation can be declared.

Strength of the association. The stronger an observed association appears over a series of different studies, the less likely this association is spurious because of bias.

Dose-response effect. The value of the response variable changes in a meaningful way with the dose (or level) of the suspected causal agent.

Lack of temporal ambiguity. The hypothesized cause precedes the occurrence of the effect. The ability to establish this time pattern will depend upon the study design used.

Consistency of the findings. Most, or all, studies concerned with a given causal hypothesis produce similar findings. Of course, studies dealing with a given question may all have serious bias problems that can diminish the importance of observed associations.

Biological or theoretical plausibility. The hypothesized causal relationship is consistent with current biological or theoretical knowledge. Note that the current state of knowledge may be insufficient to explain certain findings.

Coherence of the evidence. The findings do not seriously conflict with accepted facts about the outcome variable being studied.

Specificity of the association. The observed effect is associated with only the suspected cause (or few other causes that can be ruled out).

IMPORTANT: NO CAUSATION WITHOUT MANIPULATION!

Examples: Discuss the above in relation to:

• amount of studying vs. grades in a course;
• amount of clear cutting and sediments in water;
• fossil fuel burning and the greenhouse effect.


1.4 Single-variable regression

1.4.1 Introduction

Along with the Analysis of Variance, this is likely the most commonly used statistical methodology in ecological research. In virtually every issue of an ecological journal, you will find papers that use a regression analysis.

There are HUNDREDS of books written on regression analysis. Some of the better ones (IMHO) are:

Draper and Smith. Applied Regression Analysis. Wiley.
Neter, Wasserman, and Kutner. Applied Linear Statistical Models. Irwin.
Kleinbaum, Kupper, Miller. Applied Regression Analysis. Duxbury.
Zar. Biostatistics. Prentice Hall.

Consequently, this set of notes is VERY brief and makes no pretense of being a thorough review of regression analysis. Please consult the above references for all the gory details.

It turns out that both Analysis of Variance and Regression are special cases of a more general statistical methodology called General Linear Models, which in turn are special cases of Generalized Linear Models (covered in Stat 402/602), which in turn are special cases of Generalized Additive Models, which in turn are special cases of .....

The key difference between a regression analysis and an ANOVA is that the X variable is nominal scaled in ANOVA, while in regression analysis the X variable is continuous scaled. This implies that in ANOVA the shape of the response profile is unspecified (the null hypothesis is that all means are equal, while the alternative is that at least one mean differs), while in regression the response profile must be a straight line.

Because both ANOVA and regression are from the same class of statistical models, many of the assumptions are similar, the fitting methods are similar, and hypothesis testing and inference are similar as well.

1.4.2 Equation for a line - getting notation straight (no pun intended)

In order to use regression analysis effectively, it is important that you understand the concepts of slopes and intercepts and how to determine these from data values. This will be QUICKLY reviewed here in class.

In previous courses at high school or in linear algebra, the equation of a straight line was often written y = mx + b, where m is the slope and b is the intercept. In some popular spreadsheet programs, the authors decided to write the equation of a line as y = a + bx; now a is the intercept, and b is the slope. Statisticians, for good reasons, have rationalized this notation and usually write the equation of a line as y = β0 + β1x or as Y = b0 + b1X (the distinction between β0 and b0 will be made clearer in a few minutes). The use of the subscript 0 to represent the intercept and the subscript 1 to represent the coefficient for the X variable then readily extends to more complex cases.

© 2012 Carl James Schwarz. November 23, 2012

CHAPTER 1. CORRELATION AND SIMPLE LINEAR REGRESSION

Review the definition of the intercept as the value of Y when X = 0, and of the slope as the change in Y per unit change in X.

1.4.3 Populations and samples

All of statistics is about detecting signals in the face of noise and about estimating population parameters from samples. Regression is no different.

First consider the population. As in previous chapters, the correct definition of the population is important as part of any study. Conceptually, we can think of the large set of all units of interest. On each unit, there is conceptually both an X and a Y variable present. We wish to summarize the relationship between Y and X, and furthermore wish to make predictions of the Y value for future X values that may be observed from this population. [This is analogous to having different treatment groups corresponding to different values of X in ANOVA.]

If this were physics, we might conceive of a physical law between X and Y, e.g. F = ma or PV = nRT. However, in ecology, the relationship between Y and X is much more tenuous. If you could draw a scatter-plot of Y against X for ALL elements of the population, the points would NOT fall exactly on a straight line. Rather, the value of Y would fluctuate above or below a straight line at any given X value. [This is analogous to saying that Y varies randomly around the treatment group mean in ANOVA.]

We denote this relationship as

Y = β0 + β1X + ε

where β0 and β1 are now the POPULATION intercept and slope respectively. We say that

E[Y] = β0 + β1X

is the expected or average value of Y at X. [In ANOVA, we let each treatment group have its own mean; here in regression we assume that the means must fall on a straight line.]

The term ε represents random variation of individual units in the population above and below the expected value. It is assumed to have constant standard deviation over the entire regression line (i.e. the spread of data points in the population is constant over the entire regression line). [This is analogous to the assumption of equal treatment population standard deviations in ANOVA.]

Of course, we can never measure all units of the population, so a sample must be taken in order to estimate the population slope, population intercept, and population standard deviation. Unlike a correlation analysis, it is NOT necessary to select a simple random sample from the entire population, and more elaborate schemes can be used. The bare minimum that must be achieved is that, for any individual X value found in the sample, the units in the population that share this X value must have been selected at random.


This is quite a relaxed assumption! For example, it allows us to deliberately choose values of X from the extremes and then, only at those X values, randomly select from the relevant subset of the population, rather than having to select at random from the population as a whole. [This is analogous to the assumptions made in an analytical survey, where we assumed that even though we can't randomly assign a treatment to a unit (e.g. we can't assign sex to an animal), we must ensure that animals are randomly selected from each group.]

Once the data points are selected, the estimation process can proceed - but not before assessing the assumptions!
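The population model and the relaxed sampling scheme can be illustrated with a short simulation (this sketch is not part of the original notes; the population parameters β0 = 3, β1 = 0.5 and σ = 2 are invented for illustration). X values are deliberately chosen at two extremes, and Y is then drawn at random at each chosen X:

```python
import random

# Hypothetical population parameters (invented for illustration).
beta0, beta1, sigma = 3.0, 0.5, 2.0

random.seed(1)

# Deliberately choose X values from the extremes (allowed by the
# relaxed sampling scheme), then draw Y at random at each chosen X.
xs = [5.0] * 50 + [20.0] * 50
ys = [beta0 + beta1 * x + random.gauss(0.0, sigma) for x in xs]

# Least-squares estimates of the slope and intercept.
n = len(xs)
xbar, ybar = sum(xs) / n, sum(ys) / n
sxx = sum((x - xbar) ** 2 for x in xs)
sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
b1 = sxy / sxx
b0 = ybar - b1 * xbar

print(b0, b1)  # estimates land near the population values 3.0 and 0.5
```

Even though the X values were hand-picked rather than sampled, the estimates recover the population line, because at each chosen X the Y values were selected at random.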

1.4.4 Assumptions

The assumptions for a regression analysis are very similar to those found in ANOVA.

Linearity

Regression analysis assumes that the relationship between Y and X is linear. Make a scatter-plot of Y against X to assess this assumption. Perhaps a transformation is required (e.g. log(Y) vs. log(X)). Some caution is required with transformations in dealing with the error structure, as you will see in later examples.

First, plot residuals vs. the X values. If the scatter is not random around 0 but shows some pattern (e.g. a quadratic curve), this usually indicates that the relationship between Y and X is not linear. Second, fit a model that includes X and X² and test if the coefficient associated with X² is zero; unfortunately, this test could fail to detect a higher-order relationship. Third, if there are multiple readings at some X values, then a test of goodness-of-fit can be performed, where the variation of the responses at the same X value is compared to the variation around the regression line.
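The residual-pattern check can be sketched in a few lines of Python (an illustration, not part of the original notes; the data Y = X² are made up so the curvature is obvious). A straight line is fit to curved data and the residuals show the tell-tale pattern:

```python
# Made-up data with a curved (quadratic) relationship: Y = X^2.
xs = list(range(1, 11))
ys = [x ** 2 for x in xs]

# Fit a straight line by least squares.
n = len(xs)
xbar, ybar = sum(xs) / n, sum(ys) / n
sxx = sum((x - xbar) ** 2 for x in xs)
sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
b1 = sxy / sxx
b0 = ybar - b1 * xbar

# Residuals from the straight-line fit.
res = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]

# The residuals are positive at both extremes and negative in the
# middle -- the tell-tale quadratic pattern in a residual-vs-X plot.
print(res)
```

In a residual-vs-X plot these residuals would trace a U shape around zero rather than random scatter, which is exactly the signal that the linearity assumption has failed.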

Correct scale of predictor and response

The response and predictor variables must both have interval or ratio scale. In particular, using a numerical value to represent a category and then using this numerical value in a regression is not valid. For example, suppose that you code hair color as (1 = red, 2 = brown, and 3 = black). Then using these values in a regression, either as a predictor variable or as a response variable, is not sensible.

Correct sampling scheme

The Y values must be a random sample from the population of Y values for every X value in the sample. Fortunately, it is not necessary to have a completely random sample from the population, as the regression line is valid even if the X values are deliberately chosen. However, for a given X, the values from the population must be a simple random sample.


No outliers or influential points

All the points must belong to the relationship – there should be no unusual points. The scatter-plot of Y vs. X should be examined. If in doubt, fit the model with the points in and out of the fit and see if this makes a difference in the fit.

Outliers can have a dramatic effect on the fitted line. For example, in the following graph, the single point is an outlier and an influential point:

Equal variation along the line

The variability about the regression line is similar for all values of X, i.e. the scatter of the points above and below the fitted line should be roughly constant over the entire line. This is assessed by looking at plots of the residuals against X to see if the scatter is roughly uniform around zero, with no increase and no decrease in spread over the entire line.

Independence

Each value of Y is independent of any other value of Y. The most common case where this fails is time-series data, where X is a time measurement. In these cases, time-series analysis should be used.

This assumption can be assessed by again looking at residual plots against time or other variables.


Normality of errors

The difference between the value of Y and the expected value of Y is assumed to be normally distributed. This is one of the most misunderstood assumptions. Many people erroneously assume that the distribution of Y over all X values must be normally distributed, i.e. they look simply at the distribution of the Y's, ignoring the X's. The assumption only states that the residuals, the differences between the values of Y and the points on the line, must be normally distributed.

This can be assessed by looking at normal probability plots of the residuals. As in ANOVA, for small sample sizes you have little power to detect non-normality, and for large sample sizes it is not that important.

X measured without error

This is a new assumption for regression as compared to ANOVA. In ANOVA, the group membership was always "exact", i.e. the treatment applied to an experimental unit was known without ambiguity. However, in regression, it can turn out that the X value may not be known exactly.

This general problem is called the "error in variables" problem and has a long history in statistics.

It turns out that there are two important cases. If the value reported for X is a nominal value and the actual value of X varies randomly around this nominal value, then there is no bias in the estimates. This is called the Berkson case, after Berkson, who first examined this situation. The most common cases are those where the recorded X is a target value (e.g. temperature as set by a thermostat) while the actual X that occurs varies randomly around this target value.

However, if the value used for X is an actual measurement of the true underlying X, then there is uncertainty in both the X and Y directions. In this case, estimates of the slope are attenuated towards zero (i.e. positive slopes are biased downwards, negative slopes biased upwards). More alarmingly, the estimates are no longer consistent, i.e. as the sample size increases, the estimates no longer tend to the true population values! For example, suppose that the yield of a crop is related to the amount of rainfall. A rain gauge may not be located exactly at the plot where the crop is grown; rainfall may instead be recorded at a nearby weather station a fair distance away. The reading at the weather station is NOT a true reflection of the rainfall at the test plot.

This latter case of "error in variables" is very difficult to analyze properly and there are no universally accepted solutions. Refer to the reference books listed at the start of this chapter for more details.

The problem is set up as follows. Let

Yi = ηi + εi
Xi = ξi + δi

with the straight-line relationship between the true (but unobserved) values:

ηi = β0 + β1ξi

Note that the (true, but unknown) regression equation uses ξi rather than the observed (error-prone) values Xi.

Now if the regression is done on the observed X (i.e. the error-prone measurement), the regression equation reduces to:

Yi = β0 + β1Xi + (εi − β1δi)

This violates the independence assumption of ordinary least squares, because the new "error" term is not independent of the Xi variable.

If an ordinary least squares model is fit, the estimated slope is biased (Draper and Smith, 1998, p. 90), with

E[b1] = β1 − β1 r(ρ + r) / (1 + 2ρr + r²)

where ρ is the correlation between ξ and δ, and r is the ratio of the variance of the error in X to that of the error in Y.

The bias is negative, i.e. the estimated slope is too small, in most practical cases (ρ + r > 0). This is known as attenuation of the estimate and, in general, pulls the estimate towards zero.

The bias will be small in the following cases:

• the error variance of X is small relative to the error variance in Y. This means that r is small (i.e. close to zero), and so the bias is also small. In the case where X is measured without error, r = 0 and the bias vanishes, as expected.

• the X are fixed (the Berkson case) and actually used²; then ρ + r = 0 and the bias also vanishes.

The proper analysis of the error-in-variables case is quite complex – see Draper and Smith (1998, p. 91) for more details.
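The attenuation effect is easy to reproduce by simulation (a sketch, not part of the original notes; the true slope of 2 and the error standard deviations are invented). The slope fitted against error-prone X is pulled towards zero relative to the slope fitted against the true X:

```python
import random

random.seed(2)

beta0, beta1 = 1.0, 2.0   # hypothetical true intercept and slope
n = 5000

xi = [random.uniform(0, 10) for _ in range(n)]           # true X values
y = [beta0 + beta1 * t + random.gauss(0, 1) for t in xi]  # Y with noise
x_obs = [t + random.gauss(0, 2) for t in xi]              # X measured with error

def slope(xs, ys):
    """Least-squares slope of ys on xs."""
    m = len(xs)
    xbar, ybar = sum(xs) / m, sum(ys) / m
    sxy = sum((a - xbar) * (b - ybar) for a, b in zip(xs, ys))
    sxx = sum((a - xbar) ** 2 for a in xs)
    return sxy / sxx

b_true = slope(xi, y)     # close to the true slope of 2
b_err = slope(x_obs, y)   # attenuated towards zero
print(b_true, b_err)
```

Note that increasing n does not cure the problem: the attenuated slope converges to a value smaller than 2, which is the inconsistency described above.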

1.4.5 Obtaining Estimates

To distinguish between population parameters and sample estimates, we denote the sample intercept by b0 and the sample slope by b1. The equation of the line fitted to a particular sample of points is expressed as Ŷi = b0 + b1Xi, where b0 is the estimated intercept and b1 is the estimated slope. The symbol Ŷ indicates that we are referring to the estimated line and not to the line in the entire population.

² For example, a thermostat measures (with error) the actual temperature of a room. But if the experiment is based on the thermostat readings rather than the (true) unknown temperature, this corresponds to the Berkson case.


How is the best-fitting line found when the points are scattered? We typically use the principle of least squares. The least-squares line is the line that makes the sum of the squares of the deviations of the data points from the line, in the vertical direction, as small as possible.

Mathematically, the least-squares line is the line that minimizes

Σ (Yi − Ŷi)²   (summed over all n data points)

where Ŷi is the point on the line corresponding to each X value. This is also known as the predicted value of Y for a given value of X. This formal definition of least squares is not that important – the concept as expressed in the previous paragraph is more important – in particular, it is the SQUARED deviation in the VERTICAL direction that is used.
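The minimization above has a well-known closed-form solution, b1 = Σ(Xi − x̄)(Yi − ȳ) / Σ(Xi − x̄)² and b0 = ȳ − b1·x̄. As a sketch of what the computer does behind the scenes (the four-point dataset is made up, chosen to lie exactly on Y = 2 + 3X):

```python
def least_squares(xs, ys):
    """Closed-form least-squares estimates: the (b0, b1) that minimize
    the sum of squared vertical deviations sum (Yi - (b0 + b1*Xi))^2."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    b1 = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / \
         sum((x - xbar) ** 2 for x in xs)
    b0 = ybar - b1 * xbar
    return b0, b1

# Tiny made-up dataset: points that lie exactly on Y = 2 + 3X.
b0, b1 = least_squares([0, 1, 2, 3], [2, 5, 8, 11])
print(b0, b1)  # -> 2.0 3.0
```

Because the points fall exactly on a line, the least-squares solution recovers that line with zero residual; with scattered points, the same two formulae give the line of best fit.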

It is possible to write out formulae for the estimated intercept and slope, but who cares - let the computer do the dirty work.

The estimated intercept (b0) is the estimated value of Y when X = 0. In some cases, it is meaningless to talk about values of Y when X = 0 because X = 0 is nonsensical. For example, in a plot of income vs. year, it seems kind of silly to investigate income in year 0. In these cases, there is no clear interpretation of the intercept, and it merely serves as a placeholder for the line.

The estimated slope (b1) is the estimated change in Y per unit change in X. For every unit change in the horizontal direction, the fitted line increases by b1 units. If b1 is negative, the fitted line points downwards, and the increase in the line is negative, i.e., actually a decrease.

As with all estimates, a measure of precision can be obtained. As before, this is the standard error of each of the estimates. Again, there are computational formulae but, in this age of computers, these are not important. As before, approximate 95% confidence intervals for the corresponding population parameters are found as estimate ± 2 × se.

Formal tests of hypotheses can also be done. Usually these are only done on the slope parameter, as this is typically of most interest. The null hypothesis is that the population slope is 0, i.e. there is no relationship between Y and X (can you draw a scatter-plot showing such a relationship?). More formally, the null hypothesis is:

H: β1 = 0

Again notice that the null hypothesis is ALWAYS in terms of a population parameter and not in terms of a sample statistic.

The alternate hypothesis is typically chosen as:

A: β1 ≠ 0

although one-sided tests looking for either a positive or a negative slope are possible.

The test statistic is found as

T = (b1 − 0) / se(b1)

and is compared to a t-distribution with the appropriate degrees of freedom to obtain the p-value. This is usually done automatically by most computer packages. The p-value is interpreted in exactly the same way as in ANOVA, i.e. it measures the probability of observing this data if the hypothesis of no relationship were true.

As before, the p-value does not tell the whole story, i.e. statistical vs. biological (non)significance must be determined and assessed.

1.4.6 Obtaining Predictions

Once the best-fitting line is found, it can be used to make predictions for new values of X.

There are two types of predictions that are commonly made. It is important to distinguish between them, as these two intervals are the source of much confusion in regression problems.

First, the experimenter may be interested in predicting a SINGLE future individual value for a particular X. Second, the experimenter may be interested in predicting the AVERAGE of ALL future responses at a particular X.³ The prediction interval for an individual response is sometimes called a confidence interval for an individual response, but this is an unfortunate (and incorrect) use of the term confidence interval. Strictly speaking, confidence intervals are computed for fixed unknown parameter values; prediction intervals are computed for future random variables.

Both of the above intervals should be distinguished from the confidence interval for the slope.

In both cases, the estimate is found in the same manner – substitute the new value of X into the equation and compute the predicted value Ŷ. In most computer packages this is accomplished by inserting a new "dummy" observation in the dataset with the value of Y missing but the value of X present. The missing Y value prevents this new observation from being used in the fitting process, but the X value allows the package to compute an estimate for this observation.

What differs between the two predictions are the estimates of uncertainty.

In the first case, there are two sources of uncertainty involved in the prediction. First, there is the uncertainty caused by the fact that the estimated line is based upon a sample. Then there is the additional uncertainty that the value could be above or below the predicted line. This interval is often called a prediction interval at a new X.

In the second case, only the uncertainty caused by estimating the line based on a sample is relevant. This interval is often called a confidence interval for the mean at a new X.

The prediction interval for an individual response is typically MUCH wider than the confidence interval for the mean of all future responses, because it must account for the uncertainty from the fitted line plus individual variation around the fitted line.
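The relationship between the two intervals can be seen from the usual textbook standard errors, se(mean) = s·sqrt(1/n + (x − x̄)²/Sxx) for the mean and se(individual) = s·sqrt(1 + 1/n + (x − x̄)²/Sxx) for a single response. A sketch (the values of s, n, x̄ and Sxx below are hypothetical, picked only for illustration):

```python
import math

# Hypothetical summary statistics for illustration only.
s, n, xbar, sxx = 1.8, 11, 12.2, 185.6
x_new = 16.0

leverage = 1.0 / n + (x_new - xbar) ** 2 / sxx
se_mean = s * math.sqrt(leverage)           # confidence interval for the mean
se_indiv = s * math.sqrt(1.0 + leverage)    # prediction interval for an individual

print(se_mean, se_indiv)
# The prediction interval is always wider: it adds the individual
# variation (the leading 1 under the square root) on top of the
# uncertainty in the fitted line.
```

Note the identity se(individual)² = se(mean)² + s²: the extra width of the prediction interval is exactly the individual scatter about the line.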

Many textbooks have the formulae for the se of the two types of predictions but, again, there is little to be gained by examining them. What is important is that you read the documentation carefully to ensure that you understand exactly what interval is being given to you.

³ There is actually a third interval, for the mean of the next "m" individual values, but this is rarely encountered in practice.

1.4.7 Residual Plots

After the curve is fit, it is important to examine whether the fitted curve is reasonable. This is done using residuals. The residual for a point is the difference between the observed value and the predicted value, i.e., the residual from fitting a straight line is found as: residuali = Yi − (b0 + b1Xi) = (Yi − Ŷi).

There are several standard residual plots:

• plot of residuals vs. predicted values (Ŷ);
• plot of residuals vs. X;
• plot of residuals vs. time ordering.

In all cases, the residual plots should show random scatter around zero with no obvious pattern. Don't plot residuals vs. Y – this will lead to odd-looking plots which are an artifact of the plot and don't mean anything.

1.4.8 Example - Yield and fertilizer

We wish to investigate the relationship between yield (Liters) and fertilizer (kg/ha) for tomato plants. An experiment was conducted in the Schwarz household one summer on 11 plots of land where the amount of fertilizer was varied and the yield measured at the end of the season.

The amount of fertilizer (randomly) applied to each plot was chosen between 5 and 18 kg/ha. While the levels were not systematically chosen (e.g. they were not evenly spaced between the highest and lowest values), they represent commonly used amounts based on a preliminary survey of producers. At the end of the experiment, the yields were measured and the following data were obtained.

Interest also lies in predicting the yield when 16 kg/ha are assigned.


Fertilizer (kg/ha)   Yield (Liters)
        12                24
         5                18
        15                31
        17                33
        14                30
         6                20
        11                25
        13                27
        15                31
         8                21
        18                29

The raw data are also available in a JMP datasheet called fertilizer.jmp, available from the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms.

In this study, it is quite clear that fertilizer is the predictor (X) variable, while the response variable (Y) is the yield.

The population consists of all possible field plots with all possible tomato plants of this type grown under all possible fertilizer levels between about 5 and 18 kg/ha.

If all of the population could be measured (which it can't), you could find a relationship between the yield and the amount of fertilizer applied. This relationship would have the form: Y = β0 + β1 × (amount of fertilizer) + ε, where β0 and β1 represent the true population intercept and slope respectively. The term ε represents random variation that is always present, i.e. even if the same plot was grown twice in a row with the same amount of fertilizer, the yield would not be identical (why?).

The population parameters to be estimated are β0, the true average yield when the amount of fertilizer is 0, and β1, the true average change in yield per unit change in the amount of fertilizer. These are taken over all plants in all possible field plots of this type. The values of β0 and β1 are impossible to obtain, as the entire population could never be measured.

Here is the data entered into a JMP data sheet. Note the scale of both variables (continuous) and that an extra row was added to the data table with the value of 16 for the fertilizer and the yield left missing.


The ordering of the rows in the data table is NOT important; however, it is often easier to find individual data points if the data are sorted by the X value and the rows for future predictions are placed at the end of the dataset. Notice how missing values are represented.

Use the Analyze->Fit Y-by-X platform to start the analysis. Specify the Y and X variables as needed.


Notice that JMP "reminds" you of the analysis that you will obtain based on the scale of the X and Y variables, as shown in the bottom left of the menu. In this case, both X and Y have a continuous scale, so JMP will perform a bi-variate fitting procedure. It starts by showing the scatter-plot between yield (Y) and fertilizer (X).


The relationship looks approximately linear; there don't appear to be any outliers or influential points; and the scatter appears to be roughly equal across the entire regression line. Residual plots will be used later to check these assumptions in more detail.

The drop-down menu item (from the red triangle beside the Bivariate Fit...) allows you to fit the least-squares line. This produces much output, but the three important parts of the output are discussed below.

First, the actual fitted line is drawn on the scatter-plot, and the equation of the fitted line is printed below the plot.

The estimated regression line is

Ŷ = b₀ + b₁(fertilizer) = 12.856 + 1.10137(amount of fertilizer)

In terms of estimates, b₀ = 12.856 is the estimated intercept, and b₁ = 1.101 is the estimated slope.

The estimated slope is the estimated change in yield when the amount of fertilizer is increased by 1 unit. In this case, the yield is expected to increase (why?) by 1.10137 L when the fertilizer amount is increased by 1 kg/ha. NOTE that the slope is the CHANGE in Y when X increases by 1 unit, not the value of Y when X = 1.

The estimated intercept is the estimated yield when the amount of fertilizer is 0. In this case, the estimated yield when no fertilizer is added is 12.856 L. In this particular case the intercept has a meaningful interpretation, but I'd be worried about extrapolating outside the range of the observed X values. If the intercept is 12.85, why does the line intersect the left part of the graph at about 15 rather than closer to 13?

Once again, these are the results from a single experiment. If the experiment were repeated, you would obtain different estimates (b₀ and b₁ would change). The sampling distribution over all possible experiments would describe the variation in b₀ and b₁ over all possible experiments. The standard deviation of b₀ and b₁ over all possible experiments is again referred to as the standard error of b₀ and b₁.

The formulae for the standard errors of b₀ and b₁ are messy, and hopeless to compute by hand. And just like inference for a mean or a proportion, the program automatically computes the se of the regression estimates.

The estimated standard error for b₁ (the estimated slope) is 0.132 L/kg. This is an estimate of the standard deviation of b₁ over all possible experiments. Normally, the intercept is of limited interest, but a standard error can also be found for it as shown in the above table.

Using exactly the same logic as when we found a confidence interval for the population mean, or for the population proportion, a confidence interval for the population slope (β₁) is found (approximately) as b₁ ± 2(estimated se). In the above example, an approximate confidence interval for β₁ is found as

1.101 ± 2 × (.132) = 1.101 ± .264 = (.837 → 1.365) L/kg

of fertilizer applied.
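The arithmetic behind this approximate interval is easy to check with a few lines of Python (this is just a cross-check of the hand computation, not part of the JMP output):

```python
# Approximate 95% CI for the slope: estimate ± 2 × (estimated se)
b1 = 1.101       # estimated slope (L of yield per kg/ha of fertilizer)
se_b1 = 0.132    # estimated standard error of the slope
lower, upper = b1 - 2 * se_b1, b1 + 2 * se_b1
print(round(lower, 3), round(upper, 3))  # 0.837 1.365
```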

An “exact” confidence interval can be computed by JMP as shown above. 4 The “exact” confidence interval is based on the t-distribution and is slightly wider than our approximate confidence interval because the total sample size (11 pairs of points) is rather small. We interpret this interval as ‘being 95% confident that the true increase in yield when the amount of fertilizer is increased by one unit is somewhere between .837 and 1.365 L/kg.’

Be sure to carefully distinguish between β₁ and b₁. Note that the confidence interval is computed using b₁, but is a confidence interval for β₁, the population parameter that is unknown.

In linear regression problems, one hypothesis of interest is whether the true slope is zero. This would correspond to no linear relationship between the response and predictor variable (why?). Again, this is a good time to read the papers by Cherry and Johnson about the dangers of uncritical use of hypothesis testing. In many cases, a confidence interval tells the entire story.

JMP produces a test of the hypothesis that each of the parameters (the slope and the intercept in the population) is zero. The output is reproduced again below:

4 If your table doesn't show the confidence interval, use a Control-Click or Right-Click in the table and select the columns to be displayed.

The test of hypothesis about the intercept is not of interest (why?).

Let

• β₁ be the true (unknown) slope.

• b₁ be the estimated slope. In this case b₁ = 1.1014.

The hypothesis testing proceeds as follows. Again note that we are interested in the population parameters and not the sample statistics.

1. Specify the null and alternate hypothesis:

H: β₁ = 0

A: β₁ ≠ 0.

Notice that the null hypothesis is in terms of the population parameter β₁. This is a two-sided test as we are interested in detecting differences from zero in either direction.

2. Find the test statistic and the p-value. The test statistic is computed as:

T = (estimate − hypothesized value) / (estimated se) = (1.1014 − 0) / .132 = 8.36

In other words, the estimate is over 8 standard errors away from the hypothesized value!
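The same computation in Python (note that using the se rounded to 0.132 gives T ≈ 8.34; the 8.36 reported by JMP carries more decimal places of the se):

```python
# T-statistic for H: slope = 0, using the rounded summary numbers
b1, se_b1, hypothesized = 1.1014, 0.132, 0.0
T = (b1 - hypothesized) / se_b1
print(round(T, 2))  # 8.34 with the rounded se; JMP reports 8.36
```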

This will be compared to a t-distribution with n − 2 = 9 degrees of freedom. The p-value is found to be very small (less than 0.0001).

3. Conclusion. There is strong evidence that the true slope is not zero. This is not too surprising given that the 95% confidence intervals show that plausible values for the true slope are from about .8 to about 1.4.

It is possible to construct tests of the slope equal to some value other than 0, but most packages can't do this directly. You would compute the T value as shown above, replacing the value 0 with the hypothesized value.

It is also possible to construct one-sided tests. Most computer packages only do two-sided tests. Proceed as above, but the one-sided p-value is the two-sided p-value reported by the package divided by 2.

If sufficient evidence is found against the hypothesis, a natural question to ask is ‘well, what values of the parameter are plausible given this data?’ This is exactly what a confidence interval tells you. Consequently, I usually prefer to find confidence intervals, rather than doing formal hypothesis testing.

What about making predictions for future yields when certain amounts of fertilizer are applied? For example, what would be the future yield when 16 kg/ha of fertilizer are applied?

The predicted value is found by substituting the new X into the estimated regression line:

Ŷ = b₀ + b₁(fertilizer) = 12.856 + 1.10137(16) = 30.48 L
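This substitution is easy to verify by hand or in a few lines of Python:

```python
# Predicted yield at a new amount of fertilizer
b0, b1 = 12.856, 1.10137   # estimated intercept and slope
x_new = 16                 # kg/ha of fertilizer
y_hat = b0 + b1 * x_new
print(round(y_hat, 2))  # 30.48 (L)
```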

This can also be found by using the cross-hairs tool on the actual graph (to be demonstrated in class). JMP can compute the predicted value by selecting the appropriate option under the drop-down menu item in the Linear Fit item:

and then going back to look at the new column in the data table:

As noted earlier, there are two types of estimates of precision associated with predictions using the regression line. It is important to distinguish between them as these two intervals are the source of much confusion in regression problems.

First, the experimenter may be interested in predicting a single FUTURE individual value for a particular X. This would correspond to the predicted yield for a single future plot with 16 kg/ha of fertilizer added.

Second, the experimenter may be interested in predicting the average of ALL FUTURE responses at a particular X. This would correspond to the average yield for all future plots when 16 kg/ha of fertilizer is added. The prediction interval for an individual response is sometimes called a confidence interval for an individual response, but this is an unfortunate (and incorrect) use of the term confidence interval. Strictly speaking, confidence intervals are computed for fixed unknown parameter values; prediction intervals are computed for future random variables.
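For reference, the standard textbook formulas behind these two intervals (not printed in the JMP output) are, for a new value x₀, with s the root mean square error and Sxx = Σ(xᵢ − x̄)²:

```latex
% Confidence interval for the MEAN response at a new value x_0:
\hat{Y}_0 \;\pm\; t_{n-2}\, s \sqrt{\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}}

% Prediction interval for a SINGLE future response at x_0:
\hat{Y}_0 \;\pm\; t_{n-2}\, s \sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}}
```

The extra 1 under the square root in the second formula is the variance of a single new observation about its mean; this is why the prediction interval must always be wider than the confidence interval for the mean.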

Both intervals can be computed and plotted byJMP by again using the pop-down menu beside the Linear<br />

Fit box:<br />

In this menu, the Confid Curves Fit correspond to confidence intervals for the MEAN response, while the Confid Curves Indiv correspond to prediction intervals for the future single response. Both can be plotted on the graph. Unfortunately, there does not appear to be a way to save the prediction limits into a data table from this platform; the cross-hairs tool must be used, or the Analyze->Fit Model platform should be used.

The innermost set of lines represents the confidence bands for the mean response. The outermost band of lines represents the prediction intervals for a single future response. As noted earlier, the latter must be wider than the former to account for an additional source of variation.

The numerical values from the Analyze->Fit Model platform are shown below:

Here the predicted yield for a single future trial at 16 kg/ha is 30.5 L, but the 95% prediction interval is between 26.1 and 34.9 L. The predicted AVERAGE yield for ALL future plots when 16 kg/ha of fertilizer is applied is also 30.5 L, but the 95% confidence interval for the MEAN yield is between 28.8 and 32.1 L.

Finally, residual plots can be made using the pop-down menu:

The residuals are simply the difference between the actual data point and the corresponding spot on the line measured in the vertical direction. The residual plot shows no trend in the scatter around the value of zero.

The same items are available from the Analyze->Fit Model platform. Here you would specify Yield as the Y variable and Fertilizer as the X variable in much the same way as in the Analyze->Fit Y-by-X platform. Much of the same output is produced. Additionally, you can save the actual confidence bounds for predictions into the data table (as shown above). This will be demonstrated in class.

1.4.9 Example - Mercury pollution

Mercury pollution is a serious problem in some waterways. Mercury levels often increase after a lake is flooded due to leaching of naturally occurring mercury by the higher levels of the water. Excessive consumption of mercury is well known to be deleterious to human health. It is difficult and time consuming to measure every person's mercury level. It would be nice to have a quick procedure that could be used to estimate the mercury level of a person based upon the average mercury level found in fish and estimates of the person's consumption of fish in their diet. The following data were collected on the methyl mercury intake of subjects and the actual mercury levels recorded in the blood stream from a random sample of people around recently flooded lakes.

Here are the raw data:

Methyl Mercury Intake    Mercury in whole blood
(ug Hg/day)              (ng/g)
180                       90
200                      120
230                      125
410                      290
600                      310
550                      290
275                      170
580                      375
600                      150
105                       70
250                      105
 60                      205
650                      480

The data are available in a JMP datasheet called mercury.jmp available from the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms.

The ordering of the rows in the data table is NOT important; however, it is often easier to find individual data points if the data is sorted by the X value and the rows for future predictions are placed at the end of the dataset. Notice how missing values are represented.

The population of interest is the people around recently flooded lakes.

This experiment is an analytical survey as it is quite impossible to randomly assign people different amounts of mercury in their food intake. Consequently, the key assumption is that the subjects chosen to be measured are random samples from those with similar mercury intakes. Note it is NOT necessary for this to be a random sample from the ENTIRE population (why?).

The explanatory variable is the amount of mercury ingested by a person. The response variable is the amount of mercury in the blood stream.

We start by producing the scatter-plot.

There appear to be two outliers (identified by an X). To illustrate the effects of these outliers upon the estimates and the residual plots, the line was fit using all of the data.
The residual plot shows the clear presence of the two outliers, but also identifies a third potential outlier not evident from the original scatter-plot (can you find it?).

The data were rechecked and it appears that there was an error in the blood work used in determining the readings. Consequently, these points were removed for the subsequent fit.

The estimated regression line (after removing outliers) is

Blood = −1.951691 + 0.581218 × Intake

The estimated slope of 0.58 indicates that the mercury level in the blood increases by 0.58 ng/g when the intake level in the food is increased by 1 ug/day. The intercept has no real meaning in the context of this experiment. The negative value is merely a placeholder for the line. Also notice that the estimated intercept is not very precise in any case (how do I know this, and what implications does this have for worrying that it is not zero?) 5

What would have been the impact upon the estimated slope and intercept if the outliers had been retained?

The estimated slope has been determined relatively well (relative standard error of about 10% – how is the relative standard error computed?). There is clear evidence that the hypothesis of no relationship between blood mercury levels and food mercury levels is not tenable.
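As a sketch of the relative standard error computation: it is simply the se of the estimate divided by the estimate itself. The se of the slope is not printed in this excerpt, so the value used below is a hypothetical one inferred from the quoted "about 10%":

```python
# Relative standard error = se(estimate) / |estimate|
b1 = 0.581218      # estimated slope from the fit with outliers removed
se_b1 = 0.058      # hypothetical; not shown in this excerpt, inferred from the ~10% quoted
relative_se = se_b1 / abs(b1)
print(round(100 * relative_se, 1))  # roughly 10 (percent)
```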

The two types of predictions would also be of interest in this study. First, an individual would like to know the impact upon personal health. Second, the average level would be of interest to public health authorities.

JMP was used to plot both intervals on the scatter-plot:

1.4.10 Example - The Anscombe Data Set

Anscombe (1973, American Statistician 27, 17-21) created a set of four data sets that are quite remarkable. All four datasets give exactly the same results when a regression line is fit, yet are quite different in their interpretation.

The Anscombe data is available at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms. Fitting of regression lines to this data will be demonstrated in class.

5 It is possible to fit a regression line that is constrained to go through Y = 0 when X = 0. These must be fit carefully and are not covered in this course.

1.4.11 Transformations

In some cases, the plot of Y vs. X is obviously non-linear and a transformation of X or Y may be used to establish linearity. For example, many dose-response curves are linear in log(X). Or the equation may be intrinsically non-linear, e.g. a weight-length relationship is of the form weight = β₀ × length^β₁. Or, some variables may be recorded in an arbitrary scale, e.g. should the fuel efficiency of a car be measured in L/100 km or km/L? You are already familiar with some variables measured on the log-scale - pH is a common example. Often a visual inspection of a plot may identify the appropriate transformation.

There is no theoretical difficulty in fitting a linear regression using transformed variables other than an understanding of the implicit assumption about the error structure. The model for a fit on transformed data is of the form

trans(Y) = β₀ + β₁ × trans(X) + error

Note that the error is assumed to act additively on the transformed scale. All of the assumptions of linear regression are assumed to act on the transformed scale – in particular that the population standard deviation around the regression line is constant on the transformed scale.

The most common transformation is the logarithmic transform. It doesn't matter if the natural logarithm (often called the ln function) or the common logarithm transformation (often called the log₁₀ transformation) is used. There is a 1-1 relationship between the two transformations, and linearity on one transform is preserved on the other transform. The only change is that values on the ln scale are 2.302 = ln(10) times those on the log₁₀ scale, which implies that the estimated slope and intercept both differ by a factor of 2.302. There is some confusion in scientific papers about the meaning of log - some papers use this to refer to the ln transformation, while others use this to refer to the log₁₀ transformation.

After the regression model is fit, remember to interpret the estimates of slope and intercept on the transformed scale. For example, suppose that a ln(Y) transformation is used. Then we have

ln(Y_{t+1}) = b₀ + b₁ × (t + 1)

and

ln(Y_t) = b₀ + b₁ × t

Subtracting the two equations gives

ln(Y_{t+1}) − ln(Y_t) = ln(Y_{t+1} / Y_t) = b₁ × (t + 1 − t) = b₁

and so

Y_{t+1} / Y_t = exp(ln(Y_{t+1} / Y_t)) = exp(b₁) = e^{b₁}

Hence a one unit increase in X causes Y to be MULTIPLIED by e^{b₁}. As an example, suppose that on the log-scale the estimated slope was −.07. Then every unit change in X causes Y to change by a multiplicative factor of e^{−.07} = .93, i.e. roughly a 7% decline per year. 6
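In Python, for example:

```python
import math

# A slope of -0.07 on the log scale is a multiplicative change per unit of X
b1 = -0.07
factor = math.exp(b1)
print(round(factor, 2))  # 0.93, i.e. roughly a 7% decline per unit increase in X
```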

Similarly, predictions on the transformed scale must be back-transformed to the untransformed scale.

In some problems, scientists search for the ‘best’ transform. This is not an easy task, and using simple statistics such as R² to search for the best transformation should be avoided. Seek help if you need to find the best transformation for a particular dataset.

JMP makes it particularly easy to fit regressions to transformed data as shown below. SAS and R have an extensive array of functions so that you can create new variables based on the transformation of an existing variable.

1.4.12 Example: Monitoring Dioxins - transformation

An unfortunate byproduct of pulp-and-paper production used to be dioxins - a very hazardous material. This material was discharged into waterways with the pulp-and-paper effluent, where it bioaccumulated in living organisms such as crabs. Newer processes have eliminated this byproduct, but the dioxins in the organisms take a long time to degrade.

Government environmental protection agencies take samples of crabs from affected areas each year and measure the amount of dioxins in the tissue. The following example is based on a real study.

Each year, four crabs are captured from a monitoring station. The liver is excised and the livers from all four crabs are composited together into a single sample. 7 The dioxin level in this composite sample is measured. As there are many different forms of dioxins with different toxicities, a summary measure, called the Total Equivalent Dose (TEQ), is computed from the sample.

Here is the raw data.

6 It can be shown that on the log scale, for smallish values of the slope, the change is almost the same on the untransformed scale; i.e. if the slope is −.07 on the log scale, this implies roughly a 7% decline per year; a slope of +.07 implies roughly a 7% increase per year.

7 Compositing is a common analytical tool. There is little loss of useful information induced by the compositing process - the only loss of information is the among individual-sample variability, which can be used to determine the optimal allocation between samples within years and the number of years to monitor.

Site  Year    TEQ
a     1990  179.05
a     1991   82.39
a     1992  130.18
a     1993   97.06
a     1994   49.34
a     1995   57.05
a     1996   57.41
a     1997   29.94
a     1998   48.48
a     1999   49.67
a     2000   34.25
a     2001   59.28
a     2002   34.92
a     2003   28.16

The data is available in a JMP data file dioxinTEQ.jmp at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms.

As with all analyses, start with a preliminary plot of the data. Use the Analyze->Fit Y-by-X platform.

The preliminary plot of the data shows a decline in levels over time, but it is clearly non-linear. Why is this so? In many cases, a fixed fraction of dioxins degrades per year, e.g. a 10% decline per year. This can be expressed in a non-linear relationship:

TEQ = C r^t

where C is the initial concentration, r is the fraction remaining after each year (e.g. r = 0.9 for a 10% decline per year), and t is the elapsed time. If this is plotted over time, this leads to the non-linear pattern seen above.

If logarithms are taken, this leads to the relationship

log(TEQ) = log(C) + t × log(r)

which can be expressed as

log(TEQ) = β₀ + β₁ × t

which is the equation of a straight line with β₀ = log(C) and β₁ = log(r).

JMP can easily be used to compute log(TEQ) by using the Formula Editor in the usual fashion. A plot of log(TEQ) vs. year gives the following:

The relationship looks approximately linear; there don't appear to be any outliers or influential points; the scatter appears to be roughly equal across the entire regression line. Residual plots will be used later to check these assumptions in more detail.

A line can be fit as before by selecting the Fit Line option from the red triangle in the upper left side of the plot:

This gives the following output:

The fitted line is:

log(TEQ) = 218.9 − .11 (year)


The intercept (218.9) would be the log(TEQ) in the year 0, which is clearly nonsensical. The slope (−.11) is the estimated log(ratio) from one year to the next. For example, exp(−.11) = .898 would mean that the TEQ in one year is only 89.8% of the TEQ in the previous year, or roughly an 11% decline per year. The standard error of the estimated slope is .02.

A 95% confidence interval for the slope can be obtained by Right-Clicking (on Windows machines) or Ctrl-Clicking (on Macintosh machines) in the Parameter Estimates summary table and selecting the confidence intervals to display in the table.

The 95% confidence interval for the slope is (−.154 to −.061). If you take the anti-logs of the endpoints, this gives a 95% confidence interval for the fraction of TEQ that remains from year to year, i.e. between (0.86 to 0.94) of the TEQ in one year remains to the next year.
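The back-transformation of the interval endpoints can be sketched in a couple of lines of Python (the endpoints are the values reported above):

```python
# Anti-logs of the slope's 95% CI give a 95% CI for the year-to-year
# retention fraction of TEQ.
import math

lower, upper = -0.154, -0.061               # 95% CI for the slope
retained = (math.exp(lower), math.exp(upper))
print(tuple(round(v, 2) for v in retained))  # -> (0.86, 0.94)
```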

Several types of predictions can be made. For example, what would be the estimated mean TEQ in 2010? The computations could be done by hand, or by using the cross-hairs on the plot from the Analyze->Fit Y-by-X platform. Confidence intervals for the mean response, or prediction intervals for an individual response, can be added to the plot from the pop-down menu.

However, a more powerful tool is available from the Analyze->Fit Model platform.

Start first by adding rows to the original data table corresponding to the years for which a prediction is required. In this case, the additional row would have the value of 2010 in the Year column with the remainder of the row unspecified. Missing values will be automatically inserted for the other variables.


Then invoke the Analyze->Fit Model platform:


This gives much the same output as from the Analyze->Fit Y-by-X platform with a few new (useful) features, a few of which we will explore in the remainder of this section.

Next, save the prediction formula, the confidence interval for the mean, and the interval for an individual prediction to the data table (this will take three successive saves):


Now the data table has been augmented with additional columns and, more importantly, predictions for 2010 are now available:


The estimated mean log(TEQ) is 2.60 (corresponding to an estimated MEDIAN TEQ of exp(2.60) = 13.46). A 95% confidence interval for the mean log(TEQ) is (1.94 to 3.26), corresponding to a 95% confidence interval for the actual MEDIAN TEQ of between (6.96 and 26.05). [8] Note that the confidence interval after taking anti-logs is no longer symmetrical.
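The anti-logs, and the resulting asymmetry, can be checked numerically (values from the text):

```python
# Back-transform the estimate and CI for the mean log(TEQ) into an
# estimate and CI for the MEDIAN TEQ; the interval is asymmetric.
import math

est, lo, hi = 2.60, 1.94, 3.26
print(round(math.exp(est), 2))                    # -> 13.46
print(round(math.exp(lo), 2), round(math.exp(hi), 2))  # -> 6.96 26.05

# distances from the estimate to each endpoint are unequal:
print(round(math.exp(est) - math.exp(lo), 2))     # -> 6.5
print(round(math.exp(hi) - math.exp(est), 2))     # -> 12.59
```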

Why does a mean of a logarithm transform back to the median on the untransformed scale? Basically, because the transformation is non-linear, properties such as means and standard errors cannot be simply anti-transformed without introducing some bias. However, measures of location (such as a median) are unaffected. On the transformed scale, it is assumed that the sampling distribution about the estimate is symmetrical, which makes the mean and median take the same value. So what really is happening is that the median on the transformed scale is back-transformed to the median on the untransformed scale.
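The mean-to-median back-transformation can be illustrated with a small simulation (a sketch with hypothetical parameters, not part of the original notes):

```python
# If log(Y) is normal with mean mu, then exp(mu) matches the MEDIAN of
# Y, while the mean of Y is larger (by roughly exp(sigma^2 / 2)).
# mu = 2.0 and sigma = 0.8 are hypothetical illustration values.
import math
import random
import statistics

random.seed(1)
mu, sigma = 2.0, 0.8
y = [math.exp(random.gauss(mu, sigma)) for _ in range(200_000)]

back_transformed = math.exp(mu)          # exp of the mean of the logs
print(round(back_transformed, 2))        # -> 7.39
print(round(statistics.median(y), 2))    # close to 7.39
print(round(statistics.fmean(y), 2))     # noticeably larger
```

The back-transformed mean of the logs tracks the median, not the mean, of the untransformed values.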

Similarly, a 95% prediction interval for the log(TEQ) for an INDIVIDUAL composite sample can be found. Be sure to understand the difference between the two intervals.

Finally, an inverse prediction is sometimes of interest, i.e. in what year will the TEQ be equal to some particular value? For example, health regulations may require that the TEQ of the composite sample be below 10 units.

The Analyze->Fit Model platform has an inverse prediction function:

[8] A minor correction can be applied to estimate the mean if required.


Specify the required value for Y (in this case log(10) = 2.302) and then press the RUN button to get the following output:


The predicted year is found by solving

2.302 = 218.9 − .11 (year)

(JMP uses the full-precision estimates rather than the rounded values shown here) and gives an estimated year of 2012.7. A confidence interval for the time when the mean log(TEQ) is equal to log(10) is somewhere between 2007 and 2026!

The residual plot looks fine with no apparent problems, but the dip in the middle years could require further exploration if this pattern were apparent at other sites as well:

The application of regression to non-linear problems is fairly straightforward after the transformation is made. The most error-prone step of the process is the interpretation of the estimates on the TRANSFORMED scale and how these relate to the untransformed scale.

1.4.13 Example: Weight-length relationships - transformation

A common technique in fisheries management is to investigate the relationship between weights and lengths of fish.

This is expected to be a non-linear relationship because as fish get longer, they also get wider and thicker. If a fish grew "equally" in all directions, then the weight of a fish should be proportional to length^3 (why?). However, fish do not grow equally in all directions, i.e. a doubling of length is not necessarily associated with a doubling of width or thickness. The pattern of association of weight with length may reveal information on how fish grow.

The traditional model between weight and length is often postulated to be of the form:

weight = a × length^b

where a and b are unknown constants to be estimated from data.

If the estimated value of b is much less than 3, this indicates that as fish get longer, they do not get wider and thicker at the same rates.

How are such models fit? If logarithms are taken on each side, the above equation is transformed to:

log(weight) = log(a) + b × log(length)

or

log(weight) = β0 + β1 × log(length)

where the usual linear relationship on the log-scale is now apparent.

The following example was provided by Randy Zemlak of the British Columbia Ministry of Water, Land, and Air Protection.


Length (mm)   Weight (g)
34            585
46            1941
33            462
36            511
32            428
33            396
34            527
34            485
33            453
44            1426
35            488
34            511
32            403
31            379
30            319
33            483
36            600
35            532
29            326
34            507
32            414
33            432
33            462
35            566
34            454
35            600
29            336
31            451
33            474
32            480
35            474
30            330
30            376
34            523
31            353
32            412
32            407

A sample of fish was measured at a lake in British Columbia. The data is as given above and is available in a JMP datasheet called wtlen.jmp at the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms.

The following is an initial plot with a spline fit (lambda=10) to the data.

The fit appears to be non-linear, but this may simply be an artifact of the influence of the two largest fish. The plot appears to be linear in the area of 30-35 mm in length. If you look at the plot carefully, the variance appears to be increasing with length, with the spread noticeably wider at 35 mm than at 30 mm.

There are several (equivalent) ways to fit the growth model to such data in JMP:

• Use Analyze->Fit Y-by-X directly with the Fit Special feature.

• Create two new variables log(weight) and log(length) and then use Analyze->Fit Y-by-X on these derived variables.

• Use Analyze->Fit Model on these derived variables.

We will fit a model on the log-log scale. Note that there is some confusion in scientific papers about a "log" transform. In general, a log-transformation refers to taking natural logarithms (base e), and NOT the base-10 logarithm. This mathematical convention is often broken in scientific papers, where authors use ln to represent natural logarithms, etc. It does not affect the analysis in any way which transformation is used, other than that values on the natural-log scale are approximately 2.3 times larger than values on the log10 scale. Of course, the appropriate back-transformation is required.
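The factor of approximately 2.3 is just ln(10), which can be verified directly:

```python
# ln(x) = ln(10) * log10(x), so natural-log values are about
# 2.303 times larger than log10 values, for any positive x.
import math

x = 123.4  # an arbitrary positive value
print(round(math.log(x) / math.log10(x), 3))  # -> 2.303
```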

Using the Fit Special

The Fit Special is available from the drop-down menu item:

It presents a dialogue box where a transformation on both the Y and X axes may be specified:


The following output is obtained:


The fit is not very satisfactory. The curve doesn't seem to fit the two "outlier" points very well. At smaller lengths, the curve seems to be underfitting the weight. The residual plot appears to show the two definite outliers and also shows some evidence of a poor fit, with positive residuals at lengths around 30 mm and negative residuals at 35 mm.

The fit was repeated dropping the two largest fish with the following output:


Now the fit appears to be much better. The relationship (on the log-scale) is linear, and the residual plot looks OK.

The estimated power coefficient is 2.76 (SE .21). We find the 95% confidence interval for the slope (the power coefficient):

The 95% confidence interval for the power coefficient is from (2.33 to 3.2), which includes the value of 3 – hence the growth could be isometric, i.e. a fish that is twice the length is also twice the width and twice the thickness. Of course, with this small sample size, it is difficult to say much more.
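As a cross-check outside JMP, the log-log fit can be sketched in plain Python using the data table above, dropping the two largest fish as in the text. This is an illustration, not the JMP output:

```python
# Ordinary least squares of log(weight) on log(length), using the
# weight-length data table from the text with the two largest fish
# (44 and 46 mm) dropped.
import math

data = [(34, 585), (46, 1941), (33, 462), (36, 511), (32, 428),
        (33, 396), (34, 527), (34, 485), (33, 453), (44, 1426),
        (35, 488), (34, 511), (32, 403), (31, 379), (30, 319),
        (33, 483), (36, 600), (35, 532), (29, 326), (34, 507),
        (32, 414), (33, 432), (33, 462), (35, 566), (34, 454),
        (35, 600), (29, 336), (31, 451), (33, 474), (32, 480),
        (35, 474), (30, 330), (30, 376), (34, 523), (31, 353),
        (32, 412), (32, 407)]
subset = [(l, w) for l, w in data if l < 44]   # drop the two largest fish

x = [math.log(l) for l, _ in subset]
y = [math.log(w) for _, w in subset]
xbar, ybar = sum(x) / len(x), sum(y) / len(y)
num = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
den = sum((xi - xbar) ** 2 for xi in x)
b1 = num / den
print(round(b1, 2))   # the text reports a power coefficient of about 2.76
```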

The actual model in the population is:

log(weight) = β0 + β1 log(length) + ε

This implies that the "errors" in growth act on the LOG-scale. This seems reasonable. For example, a regression on the original scale would make the assumption that a 20 g error in predicting weight is equally severe for a fish that (on average) weighs 200 or 400 grams, even though the "error" is 20/200 = 10% of the predicted value in the first case, and only 5% of the predicted value in the second case. On the log-scale, it is implicitly assumed that the "errors" operate multiplicatively, i.e. a 10% error in a 200 g fish is equally severe as a 10% error in a 400 g fish, even though the absolute errors of 20 g and 40 g are quite different.

Another assumption of regression analysis is that the population error variance is constant over the entire regression line, but the original plot shows that the standard deviation is increasing with length. On the log-scale, the standard deviation is roughly constant over the entire regression line.

Using derived variables

The same analysis was repeated using the derived variables log(weight) and log(length), again using the Analyze->Fit Y-by-X platform, but this time without the Fit Special. [The Fit Special is not needed because the derived variables have already been transformed.]

The following are the outputs using the derived variables, again with and without the two largest fish.


Because derived variables are used, the fitting plot uses the derived variables and is on the log-scale. This has the advantage that the fit at the lower lengths is easier to see, but the lack of fit for the two largest fish is not as clear. However, it is now easier to see on the residual plot the apparent lack of fit, with the downward-sloping part of the residual plot in the 3.4 to 3.6 log(length) range.

The two largest fish were removed and the fit repeated using the derived variables:


The results are identical to the previous section.

A non-linear fit

It is also possible to do a direct non-linear least-squares fit. Here the objective is to find values of β0 and β1 to minimize:

∑ (weight − β0 × length^β1)²

directly.
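One way to sketch this minimization without JMP's Fit Nonlinear platform: for a fixed power β1, the least-squares multiplier β0 has a closed form, so β1 can be found by a simple one-dimensional grid scan. This is an illustration only (not JMP's algorithm); the data are the table above with the two largest fish dropped:

```python
# Minimize sum((weight - b0 * length**b1)**2) on the ORIGINAL scale by
# profiling out b0 (closed form for fixed b1) and scanning b1 on a grid.
data = [(34, 585), (33, 462), (36, 511), (32, 428), (33, 396),
        (34, 527), (34, 485), (33, 453), (35, 488), (34, 511),
        (32, 403), (31, 379), (30, 319), (33, 483), (36, 600),
        (35, 532), (29, 326), (34, 507), (32, 414), (33, 432),
        (33, 462), (35, 566), (34, 454), (35, 600), (29, 336),
        (31, 451), (33, 474), (32, 480), (35, 474), (30, 330),
        (30, 376), (34, 523), (31, 353), (32, 412), (32, 407)]

def sse(b0, b1):
    # sum of squared errors on the original weight scale
    return sum((w - b0 * l ** b1) ** 2 for l, w in data)

def best_b0(b1):
    # closed-form least-squares multiplier for a fixed power b1
    return (sum(w * l ** b1 for l, w in data) /
            sum(l ** (2 * b1) for l, _ in data))

grid = [b / 1000 for b in range(2000, 3501)]   # b1 from 2.000 to 3.500
b1_hat = min(grid, key=lambda b: sse(best_b0(b), b))
b0_hat = best_b0(b1_hat)
print(round(b1_hat, 2), round(b0_hat, 4))
```

The text reports estimates of about 2.73 for the power and 0.0323 for the multiplier from JMP's non-linear fit; this sketch should land in the same neighbourhood.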

This can also be done in JMP using the Fit NonLinear platform and won't be explored in much detail here.

First, here are the results from using all of the fish:


Note that the fit apparently is better than the fit on the log-scale, as the fitted curve goes through the middle of the points from the two largest fish. Note that there still appear to be problems with the fit at the lower lengths.

The same fit, dropping the two largest fish, gives the following output:


The estimated power coefficient from the non-linear fit is 2.73 with a standard error of .24. The estimated intercept is 0.0323 with an estimated standard error of .027. Both estimates are similar to the previous fit.

Which is a better method to fit this data? The non-linear fit assumes that errors are additive on the original scale. The consequences of this were discussed earlier, i.e. a 20 g error is equally serious for a 200 g fish as for a 400 g fish.

For this problem, both the non-linear fit and the fit on the log-scale gave similar results, but this will not always be true. In particular, look at the large difference in estimates when the models were fit to all of the fish. The non-linear fit was more influenced by the two large fish – this is a consequence of minimizing the square of the absolute deviation (as opposed to the relative deviation) between the observed weight and predicted weight.


1.4.14 Power/Sample Size

A power analysis and sample size determination can also be done for regression problems, but is (unfortunately) rarely done in regression. This is for a number of reasons:

• The power depends not only on the total number of points collected, but also on the actual distribution of the X values. For example, a regression analysis is most powerful to detect a trend if half the observations are collected at a small X value and half of the observations are collected at a large X value. However, this type of data gives no information on the linearity (or lack thereof) between the two X values and is not recommended in practice. A less powerful design would have a range of X values collected, but this is often of more interest as lack-of-fit and non-linearity can be detected.

• Data collected for regression analysis is often opportunistic, with little chance of choosing the X values. Unless you have some prior information on the distribution of the X values, it is difficult to determine the power.

• The formulas are clumsy to compute by hand, and most power packages tend not to have modules for power analysis of regression.

For a power analysis, the information required is similar to that requested for ANOVA designs:

• α level. As in power analyses for ANOVA, this is traditionally set to α = 0.05.

• effect size. In ANOVA, power deals with detection of differences among means. In regression analysis, power deals with detection of slopes that are different from zero. Hence, the effect size is measured by the slope of the line, i.e. the rate of change in the mean of Y per unit change in X.

• sample size. Recall in ANOVA with more than two groups that the power depended not only on the sample size per group, but also on how the means are separated. In regression analysis, the power will depend upon the number of observations taken at each value of X and the spread of the X values. For example, the greatest power is obtained when half the sample is taken at the two extremes of the X space – but at a cost of not being able to detect non-linearity.

• standard deviation. As in ANOVA, the power will depend upon the variation of the individual objects around the regression line.

This problem of power and sample size for regression is beyond what we can cover in this chapter. JMP and R do not currently include a power computation module for regression analysis. However, SAS (Version 9+) includes a power analysis module (GLMPOWER) for this purpose. Please consult suitable help for details.
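For intuition, power for detecting a non-zero slope can also be approximated by simulation. This sketch is not from the notes; all settings (10 yearly observations, true slope 0.5, residual SD 1, α = 0.05) are hypothetical, and the two-sided critical value t(0.975, 8 df) = 2.306 is hard-coded rather than computed:

```python
# Monte Carlo estimate of the power of the slope t-test in simple
# linear regression. Settings are hypothetical illustration values.
import math
import random

random.seed(42)
slope, sd, n_sim = 0.5, 1.0, 2000
xs = list(range(10))           # 10 yearly observations, years 0..9
t_crit = 2.306                 # t(0.975, df = 10 - 2 = 8), hard-coded

xbar = sum(xs) / len(xs)
sxx = sum((x - xbar) ** 2 for x in xs)

rejections = 0
for _ in range(n_sim):
    ys = [slope * x + random.gauss(0, sd) for x in xs]
    ybar = sum(ys) / len(ys)
    b1 = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / sxx
    b0 = ybar - b1 * xbar
    resid_ss = sum((y - b0 - b1 * x) ** 2 for x, y in zip(xs, ys))
    se_b1 = math.sqrt(resid_ss / (len(xs) - 2) / sxx)
    if abs(b1 / se_b1) > t_crit:
        rejections += 1

print(round(rejections / n_sim, 2))   # estimated power
```

Changing the slope, residual SD, or the spread of the X values in the sketch shows how each drives the power, which is the point of the bullets above.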

However, the problem simplifies considerably when the X variable is time, and interest lies in detecting a trend (increasing or decreasing) over time. A linear regression of the quantity of interest against time is commonly used to evaluate such a trend. For many monitoring designs, observations are taken on a yearly basis, so the question reduces to the number of years of monitoring required.

The analysis of trend data and power/sample size computations is treated in a following chapter.


1.4.15 The perils of R²

R² is a "popular" measure of the fit of a regression model and is often quoted in research papers as evidence of a good fit, etc. However, there are several fundamental problems with R² which, in my opinion, make it less desirable. A nice summary of these issues is presented in Draper and Smith (1998, Applied Regression Analysis, p. 245-246).

Before exploring this, how is R² computed and how is it interpreted?

While I haven't discussed the decomposition of the Error SS into Lack-of-Fit and Pure error, this can be done when there are replicated X values. A prototype ANOVA table would look something like:

Source           df           SS
Regression       p − 1        A
Lack-of-fit      n − p − n_e  B
Pure error       n_e          C
Corrected Total  n − 1        D

where there are n observations and a regression model is fit with p additional X values over and above the intercept.

R² is computed as

R² = SS(regression)/SS(total) = A/D = 1 − (B + C)/D

where SS(·) represents the sum of squares for that term in the ANOVA table. At this point, rerun the three examples presented earlier to find the value of R².
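The identity between the two forms of R² can be verified with a toy decomposition (the values of A, B, and C here are hypothetical, chosen only so that A + B + C = D):

```python
# A, B, C, D are the sums of squares from the prototype ANOVA table:
# regression, lack-of-fit, pure error, and corrected total.
A, B, C = 150.0, 30.0, 20.0
D = A + B + C                    # corrected total SS

r2_direct = A / D                # R^2 = SS(regression) / SS(total)
r2_residual = 1 - (B + C) / D    # equivalent residual-based form
print(r2_direct)                 # -> 0.75
print(r2_direct == r2_residual)  # -> True
```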

For example, in the fertilizer example, the ANOVA table is:

Analysis of Variance
Source  DF  Sum of Squares  Mean Square  F Ratio  p-value
Model   1   225.18035       225.180      69.8800



in<str<strong>on</strong>g>for</str<strong>on</strong>g>mative. In particular, the estimate of the slope and the se of the slope are much more in<str<strong>on</strong>g>for</str<strong>on</strong>g>mative.<br />

Here are some reas<strong>on</strong>s, why I decline to use R 2 very much:<br />

• Overfitting. If there are no replicate X points, then n_e = 0, C = 0, and R² = 1 − B/D. B has n − p degrees of freedom. As more and more X variables are added to the model, n − p and B become smaller, and R² must increase even if the additional variables are useless.

• Outliers distort. Outliers produce Y values that are extreme relative to the fit. This can inflate the value of C (if the outlier occurs among the set of replicate X values), or B if the outlier occurs at a singleton X value. In either case, they reduce R², so R² is not resistant to outliers.

• People misinterpret a high R² as implying the regression line is useful. It is tempting to believe that a higher value of R² implies that a regression line is more useful. But consider the pair of plots below: the graph on the left has a very high R², but the change in Y as X varies is negligible. The graph on the right has a lower R², but the average change in Y per unit change in X is considerable. R² measures the "tightness" of the points about the line – the higher value of R² on the left indicates that the points fit the line very well. The value of R² does NOT measure how much actual change occurs.

• Upper bound is not always 1. People often assume that a low R² implies a poorly fitting line. If you have replicate X values, then C > 0. The maximum value of R² for this problem can be much less than 100% – it is mathematically impossible for R² to reach 100% with replicated X values. In the extreme case where the model "fits perfectly" (i.e. the lack-of-fit term is zero), R² can never exceed 1 − C/D.

• No-intercept models. If there is no intercept, then D = ∑(Y_i − Ȳ)² does not exist, and R² is not really defined.

• R² gives no additional information. In actual fact, R² is a 1-1 transformation of the slope and its standard error, as is the p-value. So there is no new information in R².

• R² is not useful for non-linear fits. R² is really only useful for linear fits with the estimated regression line free to have a non-zero intercept. The reason is that R² is really a comparison between two types of models. For example, refer back to the length-weight relationship examined earlier.


In the linear fit case, the two models being compared are

log(weight) = log(b0) + error

vs.

log(weight) = log(b0) + b1 × log(length) + error

and so R² is a measure of the improvement with the regression line. [In actual fact, it is a 1-1 transform of the test that β1 = 0, so why not use that statistic directly?] In the non-linear fit case, the two models being compared are:

weight = 0 + error

vs.

weight = b0 × length^b1 + error

The model weight = 0 is silly, and so R² is silly.

Hence, the R² values reported are really all for linear fits – it is just that sometimes the actual linear fit is hidden.

• Not defined in generalized least squares. There are more complex fits that don't assume equal variance around the regression line. In these cases, R² is again not defined.

• Cannot be used with different transformations of Y. R² cannot be used to compare models that are fit to different transformations of the Y variable. For example, many people try fitting a model to Y and to log(Y) and choose the model with the higher R². This is not appropriate, as the D terms are no longer comparable between the two models.

• Cannot be used for non-nested models. R² cannot be used to compare models with different sets of X variables unless one model is nested within the other (i.e. all of the X variables in the smaller model also appear in the larger model). So using R² to compare a model with X₁, X₃, and X₅ to a model with X₁, X₂, and X₄ is not appropriate, as these two models are not nested. In these cases, AIC should be used to select among models.
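As a minimal sketch of the AIC comparison for non-nested models (not part of the original notes; the data and the two candidate models are invented for illustration): fit each candidate by least squares and compare Gaussian AICs, AIC = n·log(σ̂²) + 2k up to an additive constant.

```python
import numpy as np

def gaussian_aic(y, X):
    """Fit y = X b by least squares; return AIC (up to a constant) assuming normal errors."""
    n = len(y)
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ b
    sigma2 = np.sum(resid**2) / n          # ML estimate of the error variance
    k = X.shape[1] + 1                     # regression coefficients + variance
    return n * np.log(sigma2) + 2 * k

rng = np.random.default_rng(1)
n = 50
x1, x2, x3 = rng.normal(size=(3, n))
y = 2 + 1.5 * x1 + rng.normal(scale=0.5, size=n)  # truth uses x1 only

ones = np.ones(n)
aic_a = gaussian_aic(y, np.column_stack([ones, x1, x3]))  # model A: x1, x3
aic_b = gaussian_aic(y, np.column_stack([ones, x2, x3]))  # model B: x2, x3

print(aic_a < aic_b)  # model A contains the true predictor, so it should win
```

The two models share no nesting relationship, so R² could not be used here, but the AICs are directly comparable because both models are fit to the same (untransformed) response.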

1.5 A no-intercept model: Fulton's Condition Factor K

It is possible to fit a regression line that has an intercept of 0, i.e., goes through the origin. Most computer packages have an option to suppress the fitting of the intercept.

The biggest "problem" lies in interpreting some of the output: some of the statistics produced are misleading for these models. As this varies from package to package, please seek advice when fitting such models.

The following is an example of where such a model may be sensible.

Not all fish within a lake are identical. How can a single summary measure be developed to represent the condition of fish within a lake?


In general, the relationship between fish weight and length follows a power law:

W = a·L^b

where W is the observed weight, L is the observed length, and a and b are coefficients relating length to weight. The usual assumption is that heavier fish of a given length are in better condition than lighter fish. Condition indices are a popular summary measure of the condition of the population.

There are at least eight different measures of condition which can be found by a simple literature search. Cone (1989) raises some important questions about the use of a single index to represent the two-dimensional weight-length relationship.

One common measure is Fulton's⁹ K:

K = Weight / (Length/100)³

This index makes an implicit assumption of isometric growth, i.e. that as the fish grows, its body proportions and specific gravity do not change.
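Computing K for each fish is then a one-liner. As a sketch (the fish below are invented, not taken from the B.C. data set; weight is in grams and length in millimetres):

```python
def fulton_k(weight_g: float, length_mm: float) -> float:
    """Fulton's condition factor K = weight / (length/100)^3."""
    return weight_g / (length_mm / 100) ** 3

# Hypothetical fish (weight g, length mm):
fish = [(200.0, 250.0), (450.0, 320.0), (90.0, 195.0)]
ks = [fulton_k(w, l) for w, l in fish]
mean_k = sum(ks) / len(ks)

print(fulton_k(200.0, 250.0))  # 200 / 2.5**3 = 12.8 exactly
```

The simple average of the per-fish K values is only meaningful if the fish are (close to) a simple random sample, which is the sampling-design point discussed below.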

How can K be computed from a sample of fish, and how can K be compared among different subsets of fish from the same lake or across lakes?

The B.C. Ministry of Environment takes regular samples of rainbow trout using a floating and a sinking net. For each fish captured, the weight (g), length (mm), sex, and maturity of the fish were recorded. The data are available in the rainbow-condition.jmp data file in the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms. A portion of the raw data appears below:

9 There is some doubt about the first authorship of this condition factor. See Nash, R. D. M., Valencia, A. H., and Geffen, A. J. (2005). The Origin of Fulton's Condition Factor: Setting the Record Straight. Fisheries, 31, 236-238.


K was computed for each individual fish, and the resulting histogram is displayed below:

There is a range of condition numbers among the individual fish, with an average (among the fish caught) K of about 13.6.

Deriving a single summary measure to represent the entire population of fish in the lake depends heavily on the sampling design used to capture fish.

Some care must be taken to ensure that the fish collected are a simple random sample from the fish in the population. If a net of a single mesh size is used, then its selectivity curve makes the net more selective for fish of a certain size. In this experiment, several different mesh sizes were used to try and ensure that fish of all sizes have an equal chance of being selected.

As well, regression methods have an advantage in that a simple random sample from the population is no longer required to estimate the regression coefficients. As an analogy, suppose you are interested in the relationship between yield of plants and soil fertility. Such a study could be conducted by finding a random sample of soil plots, but this may lead to many plots with similar fertility and only a few plots with fertility at the tails of the relationship. An alternate scheme is to deliberately seek out soil plots with a range of fertilities, or to purposely modify the fertility of soil plots by adding fertilizer, and then fit a regression curve to these selected data points.

Fulton's index is often re-expressed for regression purposes as:

W = K (L/100)³

This looks like a simple regression between W and (L/100)³, but with no intercept.

A plot of these two variables:

shows a tight relationship among fish, but with possibly increasing variance with length.

There is some debate about the proper way to estimate the regression coefficient K. Classical regression methods (least squares) implicitly assume that all of the "error" in the regression is in the vertical direction, i.e. the fit conditions on the observed lengths. However, the structural relationship between weight and length likely has error in both variables. This leads to the error-in-variables problem in regression, which has a long history. Fortunately, the relationship between the two variables is often sufficiently tight that it really doesn't matter which method is used to find the estimates.

JMP can be used to fit the regression line constraining the intercept to be zero by using the Fit Special option under the red triangle:

This gives rise to the fitted line and statistics about the fit:


Note that R² really doesn't make sense in cases where the regression is forced through the origin, because the null model to which it is being compared is the line Y = 0, which is silly.¹⁰ For this reason, JMP does not report a value of R².

The estimated value of K is 13.72 (SE 0.099).

The residual plot:

shows clear evidence of increasing variation with length. This usually implies that a weighted regression is needed, with weights proportional to 1/length². In this case, such a regression gives essentially the same estimate of the condition factor (K̂ = 13.67, SE = 0.11).
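For the no-intercept model W = K·x with x = (L/100)³, least squares has the closed form K̂ = ∑(xW)/∑(x²), and a weighted fit (e.g. weights 1/L², as the residual plot suggests) simply inserts the weights into both sums. A sketch with invented fish (the real data live in the rainbow-condition.jmp file):

```python
import numpy as np

def fit_k(weight, length, weights=None):
    """No-intercept (weighted) least squares for W = K * (L/100)^3.

    With weights w_i, the estimate is K = sum(w*x*W) / sum(w*x^2),
    the usual weighted least-squares solution through the origin.
    """
    x = (np.asarray(length) / 100.0) ** 3
    w = np.ones_like(x) if weights is None else np.asarray(weights)
    W = np.asarray(weight)
    return np.sum(w * x * W) / np.sum(w * x * x)

# Invented fish (weight in g, length in mm), built around a true K of 13:
length = np.array([195.0, 250.0, 280.0, 320.0, 350.0])
weight = 13.0 * (length / 100.0) ** 3 + np.array([5.0, -10.0, 8.0, -12.0, 6.0])

k_ols = fit_k(weight, length)                          # unweighted
k_wls = fit_k(weight, length, weights=1 / length**2)   # downweights long fish
```

As in the real data, the weighted and unweighted estimates are very close here; weighting mainly changes the standard error, not the point estimate.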

Comparing condition factors

This dataset has a number of sub-groups: do all of the subgroups have the same condition factor? For example, suppose we wish to compare the K value for immature and mature fish. This is covered in more detail in the chapter on the Analysis of Covariance (ANCOVA).

1.6 Frequently Asked Questions - FAQ

1.6.1 Do I need a random sample? Power analysis

A student wrote:

I am studying the hydraulic geometry of small, steep streams in Southwest BC (abstract attached). I would like to define a regional hydraulic geometry for a fairly small hydrologically/geologically homogeneous area in the Coast Mountains close to SFU. Hydraulic geometry is the study of how the primary flow variables (width, depth, and velocity) change with discharge in a stream. Typically, a straight regression line is fitted to data plotted on a log-log plot. The equation is of the form w = aQ^b, where a is the intercept, b is the slope, w is the water surface width, and Q is the stream discharge.

10 Consult any of the standard references on regression, such as Draper and Smith, for more details.

I am struggling with the last part of my research proposal, which is how do I select (randomly) my field sites and how many sites are required. My supervisor suggests that I select stream segments for study based on a priori knowledge of my field area and select streams from across it. My argument is that to define a regionally applicable relationship (not just one that characterizes my chosen sites) I must randomly select the sites.

I think that GIS will help me select my sites, but I have the usual questions of how many sites are required to give me a certain level of confidence and whether or not I'm on the right track. As well, the primary controlling variables that I am looking at are discharge and stream slope. I will be plotting the flow variables against discharge directly, but will deal with slope by breaking my stream segments into slope classes. I guess that the null hypothesis would be that there is no difference in the exponents and intercepts between slope classes.

You are both correct!

If you were doing a simple survey, then you are correct in that a random sample from the entire population must be selected: you can't deliberately choose streams.

However, because you are interested in a regression approach, the assumption can be relaxed a bit. You can deliberately choose values of the X variables, but must randomly select from streams with similar X values.

As an analogy, suppose you wanted to estimate the average length of male adult arms. You would need a random sample from the entire population. However, suppose that you were interested in the relationship between body height (X) and arm length (Y). You could deliberately choose which X values to measure; indeed, it would be a good idea to get a good contrast among the X values, i.e. find people who are 4 ft tall, 5 ft tall, 6 ft tall, and 7 ft tall, measure their heights and arm lengths, and then fit the regression curve. However, at each height level, you must now choose randomly among those people that meet that criterion. Hence you could deliberately choose to have 1/4 of people who are 4 ft tall, 1/4 who are 5 ft tall, 1/4 who are 6 ft tall, and 1/4 who are 7 ft tall, which is quite different from the proportions in the population, but at each height level you must choose people randomly, i.e. don't always choose skinny 4 ft people and overweight 7 ft people.

Now sample size is a bit more difficult, as the required sample size depends both on the number of streams selected and on how they are scattered along the X axis. For example, the highest power occurs when observations are evenly divided between the very smallest and the very largest X values. However, without intermediate points, you can't assess linearity very well. So you will want points scattered across the range of X values.
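The reason extreme designs are most powerful is that the standard error of the estimated slope is σ/√∑(x_i − x̄)², so pushing observations to the ends of the X range maximizes ∑(x_i − x̄)². A small sketch (the two designs and the numbers are invented for illustration):

```python
import math

def slope_se(xs, sigma=1.0):
    """Standard error of the least-squares slope: sigma / sqrt(sum((x - xbar)^2))."""
    xbar = sum(xs) / len(xs)
    sxx = sum((x - xbar) ** 2 for x in xs)
    return sigma / math.sqrt(sxx)

n = 10
extremes = [0.0] * 5 + [10.0] * 5        # half at each end: smallest SE, most power
spread = [i * 10 / 9 for i in range(n)]  # evenly spaced: slightly larger SE,
                                         # but allows a check of linearity

print(slope_se(extremes), slope_se(spread))
```

The all-extremes design gives the smaller standard error, but only the spread design lets you see curvature, which is the trade-off described above.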


If you have some preliminary data, a power/sample-size analysis can be done using JMP, SAS, and other packages. If you do a Google search for "power analysis regression", there are several direct links to examples. Refer to the earlier section of the notes.



Chapter 2

Detecting trends over time

2.1 Introduction

As the following graphs show, tests for trend are among the most common statistical tools used.¹

1 The astute reader may note the discrepancy between the headline and the apparent trend in the graph. Why?



Trend analysis is often used as the endpoint for many monitoring designs, i.e. is the monitored variable increasing or decreasing? Some nice references for planning monitoring studies are:

• USGS Patuxent Wildlife Research Centre's Manager's Monitoring Manual, available at http://www.pwrc.usgs.gov/monmanual/.

• US National Parks Service guidelines on designing a monitoring study, available at http://science.nature.nps.gov/im/monitor/index.htm.

• Elzinga, C.L. et al. (2001). Monitoring Plant and Animal Populations. Blackwell Science, Inc.

There are many types of trends that can exist. For example, a simple step function

is an example of a trend where the measured quantity Y increases after some intervention. These types of trends are commonly analyzed using t-test or Analysis of Variance (ANOVA) methods covered in other parts of these notes.

The trend may be a gradual linear increase over time:


For example, as the amount of trees cleared increases over time, the turbidity of water in a stream may increase. In many cases, a regression analysis is used to test for trends in time. In these cases, the X variable is time and the Y variable is some response variable of interest. This is the main focus of this chapter.

In some cases the trend is monotonic but non-linear:


In the case of non-linear trends, a transformation is often used to try and linearize the trend (e.g. a log transform). This is often successful, in which case methods for linear regression can be used, but in some cases there is no obvious transformation. The trend can then be modeled by an arbitrary function of arbitrary shape. A very general methodology called Generalized Additive Models can be used to fit very general functions. These are beyond the scope of this course.

Sometimes the linear trend changes at some point (called the break point):

If the break point is known in advance, this is easily fit using multiple regression methods, but that is beyond the scope of these notes. If the break point is unknown, this is a difficult statistical problem, but refer to Toms and Lesperance (2003)² for help.

Helsel and Hirsch (2002)³ summarize a number of methods used to detect trends. The following table is adapted from their manual:

2 Toms, J.D. and Lesperance, M.L. (2003). Piecewise regression: a tool for identifying ecological thresholds. Ecology, 84, 2034-2041.

3 Helsel, D.R. and Hirsch, R.M. (2002). Statistical Methods in Water Resources, Chapter 12. Available at http://pubs.usgs.gov/twri/twri4a3/


Trends with NO seasonality

Nonparametric:
  Not adjusted for X: Kendall trend test on Y vs. T.
  Adjusted for X: Kendall trend test on residuals R from a smoothing fit (e.g. LOWESS) of Y on X.

Mixed:
  Not adjusted for X: none.
  Adjusted for X: Kendall trend test on residuals R from a regression of Y on X.⁴

Parametric:
  Not adjusted for X: Regression of Y on T.
  Adjusted for X: Regression of Y on X and T.

Trends WITH seasonality

Nonparametric:
  Not adjusted for X: Seasonal Kendall test of Y on T.
  Adjusted for X: Seasonal Kendall test on residuals R from a smoothing fit (e.g. LOWESS) of Y on X.

Mixed:
  Not adjusted for X: Regression of deseasonalized Y on T, e.g. after subtracting the seasonal means.
  Adjusted for X: Seasonal Kendall trend test on residuals from a regression of Y on X.

Parametric:
  Not adjusted for X: Regression of Y on T and seasonal terms, e.g. ANCOVA or sin/cos regression.
  Adjusted for X: Regression of Y on X, T, and seasonal terms.

Notation: Y = response variable; T = time variable; X = exogenous variable; R = residuals.
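The Kendall trend tests in the table are built on the statistic S = ∑_{i<j} sign(y_j − y_i): S near zero suggests no monotonic trend. A bare-bones sketch of the non-seasonal version, ignoring ties and using the large-sample normal approximation (the series below is invented):

```python
import math

def mann_kendall(y):
    """Mann-Kendall S statistic and its large-sample z score (no tie correction)."""
    n = len(y)
    s = sum(
        (y[j] > y[i]) - (y[j] < y[i])       # +1 concordant, -1 discordant
        for i in range(n - 1)
        for j in range(i + 1, n)
    )
    var_s = n * (n - 1) * (2 * n + 5) / 18  # variance of S under no trend
    # Continuity correction: shift S one unit toward zero before standardizing.
    z = 0.0 if s == 0 else (s - math.copysign(1, s)) / math.sqrt(var_s)
    return s, z

rising = [1, 2, 3, 5, 4, 6, 8, 7, 9, 10]   # mostly increasing series
s, z = mann_kendall(rising)
```

Because only the signs of the pairwise differences enter S, the test is insensitive to outliers and to monotone transformations of Y, which is exactly why it appears in the nonparametric rows of the table.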

In these notes we will look at linear trends fit using regression models and non-parametric methods. We will also look at how to pool two or more sites to see if they have a common linear trend. In trends over time, there are often problems of autocorrelation or seasonality; methods to deal with these problems will be discussed.

At this time, however, adjusting for other exogenous variables (X) will not be discussed. Methods to deal with step-trends are covered in other chapters.

4 Alley (1988) shows that increased power is obtained by doing the Kendall test on the residuals of Y vs. X and of T vs. X. This also removes any drift in X over time.


2.2 Simple Linear Regression

We will begin by applying the methods of linear regression (covered in an earlier part of these notes) to trends over time.

Trend analysis is a special case of linear regression analysis, but it also has some fairly common features that don't have exact counterparts in regular regression:

• Testing for a common trend (a special case of ANCOVA)

• Dealing with process vs. sampling variation

• Dealing with autocorrelation of residuals

For most of this chapter, we will assume that X is measured in years (e.g. calendar year).

The same sampling model, assumptions, estimation, and hypothesis-testing methods are used as in the regular regression case, with appropriate modifications to deal with X as time. These will be reviewed again below.

2.2.1 Populations and samples

The population of interest is the set of Y variables as measured over time (X). In most cases in trend analysis, random sampling from some larger population of time points really doesn't make sense. Rather, the time values (the X values) are pre-specified. For example, measurements could be taken every year, or every two years, etc.

We wish to summarize the relationship between Y and time (X), and furthermore wish to make predictions of the Y value for future time (X) values that may be observed from this population. We may also wish to do inverse regression, i.e. predict at what time Y will reach a certain value.

If this were physics, we might conceive of a physical law between Y and time (e.g. distance = velocity × time). However, in ecology, the relationship between Y and time is much more tenuous. If you drew a scatterplot of Y against time, the points would NOT fall exactly on a straight line. Rather, the value of Y would fluctuate above or below a straight line at any given time value.

We denote this relationship as

Y = β₀ + β₁X + ε

where we remember that X is now time rather than some other predictor variable. Here β₀ and β₁ are the POPULATION intercept and slope respectively. We say that

E[Y] = β₀ + β₁X


is the expected or average value of Y at X.⁵

The term ε represents random variation of individual units in the population above and below the expected value. It is assumed to have constant standard deviation over the entire regression line (i.e. the spread of data points is constant over time).

Of course, we can never measure all units of the population. So a sample must be taken in order to estimate the population slope, population intercept, and standard deviation. In most trend analyses, the values of X are chosen to be equally spaced in time, e.g. measurements taken every year.

Once the data points are selected, the estimation process can proceed, but not before assessing the assumptions!

2.2.2 Assumptions

The assumptions for a trend analysis are virtually the same as for a standard regression analysis. This is not surprising, as trend analysis is really a special case of regression analysis.

Linearity

Regression analysis assumes that the relationship between Y and X is linear, i.e. a constant rate of change over time. This can be assessed quite simply by plotting Y vs. time. Perhaps a transformation is required (e.g. log(Y) vs. log(X)). Some caution is required when transformations are done, as it is the error structure on the transformed scale that matters. As well, you need to be a little careful about the back-transformation after doing regression on transformed values.

You should also plot the residuals vs. the X (time) values. If the scatter is not random around 0 but shows some pattern (e.g. a quadratic curve), this usually indicates that the relationship between Y and X (time) is not linear. Alternatively, you can fit a model that includes X and X² and test if the coefficient associated with X² is zero. Unfortunately, this test could fail to detect a higher-order relationship. Third, if there are multiple readings at some X values, then a test of goodness-of-fit can be performed where the variation of the responses at the same X value is compared to the variation around the regression line.
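The X² check described above can be sketched as follows (the data are invented; a real analysis would rely on a statistics package for the standard errors): fit Y on (1, X, X²) by least squares and examine the t statistic of the X² coefficient.

```python
import numpy as np

def quadratic_t(x, y):
    """t statistic for the X^2 coefficient in the fit y ~ 1 + x + x^2."""
    X = np.column_stack([np.ones_like(x), x, x**2])
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ b
    df = len(y) - X.shape[1]
    s2 = resid @ resid / df                   # residual variance
    cov = s2 * np.linalg.inv(X.T @ X)         # covariance matrix of coefficients
    return b[2] / np.sqrt(cov[2, 2])

rng = np.random.default_rng(7)
year = np.arange(2000.0, 2020.0)
t = year - year.mean()                        # center time to reduce collinearity
linear = 5 + 0.3 * t + rng.normal(0.0, 0.5, t.size)
curved = 5 + 0.3 * t + 0.05 * t**2 + rng.normal(0.0, 0.5, t.size)

print(abs(quadratic_t(t, linear)), abs(quadratic_t(t, curved)))
```

A large |t| for the X² term flags curvature, but as noted above, a small |t| does not rule out a higher-order relationship.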

Scale of Y and X

As X is time, it has an interval or ratio scale. It is further assumed that Y has an interval or ratio scale as well. This can be violated in a number of ways. For example, a numerical value is often used to represent a category, and this numerical value is then used in a regression. This is not valid. Suppose that you code hair color as (1=red, 2=brown, and 3=black). Then using these values as the response variable (Y) is not sensible.

5 In ANOVA, we let each treatment group have its own mean; here in regression we assume that the means must fall on a straight line. In some cases, even in the absence of sampling error, the true value of Y does NOT lie on the straight line. This is known as process variation, and will be discussed later.

Correct sampling scheme

The Y values must be a random sample from the population of Y values at every time point.

No outliers or influential points

All the points must belong to the relationship; there should be no unusual points. The scatterplot of Y vs. X should be examined. If in doubt, fit the model with the outlying points in and out of the model and see if this makes a difference in the fit.

Outliers can have a dramatic effect on the fitted line, as you saw in a previous chapter.

Equal variation along the line

The variability about the regression line is similar for all values of X, i.e. the scatter of the points above and below the fitted line should be roughly constant over time. This is assessed by looking at plots of the residuals against X to see if the scatter is roughly uniform around zero, with no increase or decrease in spread over the entire line.

Independence

Each value of Y is independent of any other value of Y. This is a common failing in trend analysis, where the measurement in a particular year influences the measurement in subsequent years.

This assumpti<strong>on</strong> can be assessed by again looking at residual plots against time or other variables.<br />

Normality of errors

The difference between the value of Y and the expected value of Y is assumed to be normally distributed. This is one of the most misunderstood assumptions. Many people erroneously assume that the distribution of Y over all X values must be normally distributed, i.e. they look simply at the distribution of the Y's, ignoring the Xs. The assumption only states that the residuals, the differences between the values of Y and the corresponding points on the line, must be normally distributed.

This can be assessed by looking at normal probability plots of the residuals. As in ANOVA, for small sample sizes you have little power to detect non-normality, and for large sample sizes it is not that important.

X measured without error

This is a new assumption for regression as compared to ANOVA. In ANOVA, the group membership was always “exact”, i.e. the treatment applied to an experimental unit was known without ambiguity. However, in regression, it can turn out that the X value may not be known exactly.

This may seem a bit puzzling in a trend analysis – after all, how can the calendar year not be known exactly? An example of the problem is when Y is an estimate of the population size which is measured over time. This is often obtained from a mark-recapture study in which animals are marked in one month and recaptured in the next month. In this case, does the population size refer to the population size at the start of the study, in the middle of the study, or at the end of the study? If the same protocol was performed in all years, then it really doesn't matter, but the start and end of sampling likely vary over years (e.g. in some years sampling starts in March, in other years in April) so that the interval between sampling occasions is not constant.

This general problem is called the “error in variables” problem and has a long history in statistics. More details are available in the chapter on regression analysis.

2.2.3 Obtaining Estimates

As before, we distinguish between population parameters and sample estimates. We denote the sample intercept by b0 and the sample slope by b1. The equation fitted to a particular sample of points is expressed as Ŷi = b0 + b1 Xi, where b0 is the estimated intercept and b1 is the estimated slope. The symbol Ŷ indicates that we are referring to the estimated line and not to the population line.

As in regression analysis, the best-fitting line is typically found using least squares. However, in more complex situations (e.g. when accounting for autocorrelation over time), maximum likelihood methods can also be used. The least-squares line is the line that makes the sum of the squares of the deviations of the data points from the line, in the vertical direction, as small as possible.

The estimated intercept (b0) is the estimated value of Y when X = 0. In many cases of trend analysis, it is meaningless to talk about values of Y when X = 0 because X = 0 is nonsensical. For example, in a plot of income vs. year, it seems kind of silly to investigate income in year 0. In these cases, there is no clear interpretation of the intercept, and it merely serves as a place holder for the line.

The estimated slope (b1) is the estimated change in Y per unit change in X. In many cases X is measured in years, so this would be the change in Y per year.
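The least-squares computation described above can be sketched in a few lines of Python. This is my own illustration, not part of the original notes (which use JMP), and the function name fit_line is mine:

```python
# Minimal least-squares fit of Y = b0 + b1*X: the line that minimizes the
# sum of squared vertical deviations of the points from the line.

def fit_line(x, y):
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    # Corrected sums of cross-products and squares.
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b1 = sxy / sxx                 # estimated slope
    b0 = ybar - b1 * xbar          # estimated intercept
    return b0, b1

# Toy data lying exactly on the line y = 2 + 3x:
b0, b1 = fit_line([0, 1, 2, 3], [2, 5, 8, 11])
```

Any statistical package (JMP included) does this same arithmetic internally; the sketch is only to make the "smallest sum of squared vertical deviations" idea concrete.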


As with all estimates, a measure of precision can be obtained. As before, this is the standard error of each of the estimates. Confidence intervals for the true slope and intercept can also be found.

Formal tests of hypotheses can also be done. Usually, these are only done on the slope parameter, as this is typically of most interest. The null hypothesis is that the population slope is 0, i.e. there is no relationship between Y and X, i.e. no trend over time. More formally, the null hypothesis is:

H: β1 = 0

Again notice that the null hypothesis is ALWAYS in terms of a population parameter and not in terms of a sample statistic.

The alternate hypothesis is typically chosen as:

A: β1 ≠ 0

although one-sided tests looking for either a positive or negative trend are possible.

The p-value is interpreted in exactly the same way as in ANOVA, i.e. it measures the probability of observing this data if the hypothesis of no relationship were true.

As before, the p-value does not tell the whole story, i.e. statistical vs. biological (non)significance must be determined and assessed.
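The test of H: β1 = 0 rests on the t-statistic b1/se(b1), where se(b1) = RMSE/√Sxx. A sketch of that calculation (my own illustration; the helper name is mine):

```python
import math

# Compute the t-statistic for testing the null hypothesis that the
# population slope is zero.  The statistic is compared against a
# t distribution with n - 2 degrees of freedom to get the p-value.

def slope_t_stat(x, y):
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = ybar - b1 * xbar
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    rmse = math.sqrt(sse / (n - 2))     # residual standard deviation
    se_b1 = rmse / math.sqrt(sxx)       # standard error of the slope
    return b1 / se_b1

t = slope_t_stat([1, 2, 3, 4], [1, 2, 2, 3])
```

The p-value itself requires the t distribution's tail area, which a package (or e.g. scipy.stats.t.sf) supplies; the arithmetic above is the part that matters conceptually.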

2.2.4 Obtaining Predictions

Once the best-fitting line is found, it can be used to make predictions for new values of X, e.g. what is the predicted value of Y for new time points?

There are two types of predictions that are commonly made. It is important to distinguish between them, as these two intervals are the source of much confusion in regression problems.

First, the experimenter may be interested in predicting a SINGLE future individual value for a particular X. Second, the experimenter may be interested in predicting the AVERAGE of ALL future responses at a particular X. 6 The prediction interval for an individual response is sometimes called a confidence interval for an individual response, but this is an unfortunate (and incorrect) use of the term confidence interval. Strictly speaking, confidence intervals are computed for fixed unknown parameter values; prediction intervals are computed for future random variables.

Both of the above intervals should be distinguished from the confidence interval for the slope.

In both cases, the estimate is found in the same manner – substitute the new value of X into the equation and compute the predicted value Ŷ. In most computer packages this is accomplished by inserting a new “dummy” observation in the dataset with the value of Y missing, but the value of X present. The missing Y value prevents this new observation from being used in the fitting process, but the X value allows the package to compute an estimate for this observation.

6 There is actually a third interval, the mean of the next “m” individual values, but this is rarely encountered in practice.

What differs between the two predictions are the estimates of uncertainty.

In the first case, where predictions for INDIVIDUALs are wanted, there are two sources of uncertainty involved in the prediction. First, there is the uncertainty caused by the fact that the estimated line is based upon a sample. Then there is the additional uncertainty that the value could be above or below the predicted line. This interval is often called a prediction interval at a new X.

In the second case, where predictions for the mean response are wanted, only the uncertainty caused by estimating the line based on a sample is relevant. This interval is often called a confidence interval for the mean at a new X.

The prediction interval for an individual response is typically MUCH wider than the confidence interval for the mean of all future responses because it must account for the uncertainty from the fitted line plus individual variation around the fitted line.

Many textbooks give the formulae for the standard errors of the two types of predictions, but again, there is little to be gained by examining them. What is important is that you read the documentation carefully to ensure that you understand exactly what interval is being given to you.
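The two standard errors differ only by the extra "individual scatter" term, which is why the prediction interval is always the wider one. A sketch under the standard textbook formulas (my own illustration; function and variable names are mine):

```python
import math

# At a new value x0, se_mean feeds the confidence interval for the MEAN
# response; se_indiv feeds the (always wider) prediction interval for a
# SINGLE response.  The extra "1 +" term is the individual scatter.

def interval_ses(x, y, x0):
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    b0 = ybar - b1 * xbar
    rmse = math.sqrt(sum((yi - (b0 + b1 * xi)) ** 2
                         for xi, yi in zip(x, y)) / (n - 2))
    leverage = 1 / n + (x0 - xbar) ** 2 / sxx
    se_mean = rmse * math.sqrt(leverage)        # uncertainty in the line only
    se_indiv = rmse * math.sqrt(1 + leverage)   # line + scatter about the line
    return b0 + b1 * x0, se_mean, se_indiv

yhat, se_m, se_i = interval_ses([1, 2, 3, 4], [1, 2, 2, 3], 5.0)
```

Multiplying each standard error by the appropriate t critical value gives the corresponding interval; note also that both standard errors grow as x0 moves away from the mean of the observed X's.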

2.2.5 Inverse predictions

A related question is “how long before E[Y] reaches a certain point?” These predictions are obtained by drawing a line horizontally from the Y axis until it reaches the fitted line, and then following the line down until it reaches the X (time) axis. Confidence intervals for the inverse prediction are found by following the same procedure, but now following the horizontal line across until it reaches one of the confidence bands (either for the mean response or the individual response). 7

7 It is possible that the confidence intervals are one-sided (i.e. one side is either plus or minus infinity), or even that the confidence interval comes in two sections. Please consult a reference such as Draper and Smith for details.
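The point estimate in the graphical procedure above is just the fitted line solved for X. A minimal sketch (my own illustration; the confidence limits require intersecting the interval bands as described, which is not shown here):

```python
# Inverse prediction: given the fitted line Yhat = b0 + b1*X, the X at
# which the mean response reaches a target level y_star solves
# y_star = b0 + b1*X.

def inverse_predict(b0, b1, y_star):
    if b1 == 0:
        raise ValueError("a flat line never reaches the target")
    return (y_star - b0) / b1

# e.g. with the hypothetical line Yhat = 10 + 2*X, the mean response
# reaches 30 at X = 10:
x_star = inverse_predict(10.0, 2.0, 30.0)
```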


2.2.6 Residual Plots

After the curve is fit, it is important to examine if the fitted curve is reasonable. This is done using residuals. The residual for a point is the difference between the observed value and the predicted value, i.e. the residual from fitting a straight line is found as: residual_i = Yi − (b0 + b1 Xi) = Yi − Ŷi.

There are several standard residual plots:

• plot of residuals vs. predicted (Ŷ);

• plot of residuals vs. X.

In all cases, the residual plots should show random scatter around zero with no obvious pattern. Don't plot residuals vs. Y - this will lead to odd-looking plots which are an artifact of the plotting and don't mean anything.
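Residuals are simple arithmetic, and one handy check falls out of the definition: with an intercept in the model, least-squares residuals always sum to zero. A sketch (my own illustration; the data and estimates are the toy values used earlier, not from the notes):

```python
# Residuals as defined above: observed minus fitted.

def residuals(x, y, b0, b1):
    return [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

x = [1.0, 2.0, 3.0, 4.0]
y = [1.0, 2.0, 2.0, 3.0]
# b0 = 0.5, b1 = 0.6 are the least-squares estimates for these toy data.
res = residuals(x, y, 0.5, 0.6)
```

Because the residuals must sum to (numerically) zero, their average carries no information; it is their pattern against Ŷ or X that matters.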


2.2.7 Example: The Grass is Greener (for longer)

David Grisenthwaite, a pensioner who has spent 20 years keeping detailed records of how often he cuts his grass, has been included in a climate change study. Grisenthwaite, 77, and a self-confessed “creature of habit”, has kept a note of cutting grass in his Kirkcaldy garden since 1984. The grandfather's data was so valuable it was used by the Royal Meteorological Society in a paper on global warming.

The retired paper-maker, who moved to Scotland from Cockermouth in West Cumbria in 1960, said he began making a note of the time and date of every occasion he cut the grass simply “for the fun of it”.

The data are presented in:

Sparks, T.H., Croxton, J.P.J., Collinson, N., and Grisenthwaite, D.A. (2005). The Grass is Greener (for longer). Weather 60, 121-123.

from which the data on the duration of the cutting season were extracted:


Year  Duration (days)
1984  200
1985  215
1986  195
1987  212
1988  225
1989  240
1990  203
1991  208
1992  203
1993  202
1994  210
1995  225
1996  204
1997  245
1998  238
1999  226
2000  227
2001  236
2002  215
2003  242
The question of interest is whether there is evidence that the lawn cutting season has increased over time.
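Before walking through the JMP analysis below, the key estimates can be reproduced directly from the table with the least-squares formulas. This is my own check in Python, not part of the original notes:

```python
import math

# Fit duration (days) vs. year by least squares for the grass-cutting data.
years = list(range(1984, 2004))
days = [200, 215, 195, 212, 225, 240, 203, 208, 203, 202,
        210, 225, 204, 245, 238, 226, 227, 236, 215, 242]

n = len(years)
xbar = sum(years) / n
ybar = sum(days) / n
sxx = sum((x - xbar) ** 2 for x in years)
sxy = sum((x - xbar) * (y - ybar) for x, y in zip(years, days))
b1 = sxy / sxx                     # slope: about 1.465 days/year
b0 = ybar - b1 * xbar              # intercept: about -2702 (placeholder only)
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(years, days))
rmse = math.sqrt(sse / (n - 2))    # about 13.5 days
se_b1 = rmse / math.sqrt(sxx)      # about 0.52
```

These hand-computed values agree with the JMP output discussed below (slope ≈ 1.46, se ≈ 0.52, RMSE ≈ 13.5).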

JMP analysis

The data and JMP scripts are available in the grass.jmp file in the Sample Program Library available at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms.

The data are entered into JMP in the usual way. Notice that two extra lines were added to the end of the data representing two years for which predictions will be made. Both variables should be continuous scale.

Use the Analyze->Fit Y-by-X platform to create a preliminary plot of the number of days between the first and last cut (Y) versus year (X):


The plot shows some evidence that the duration of the cutting season has increased over time.

We can check some of the assumptions:

• the Y and X variables are both on the proper scale.

• the relationship appears to be approximately linear.

• there are no obvious outliers.

• the variance (scatter) of points around the line appears to be approximately equal. We will check this again from the residual plot.

• there may be some evidence of autocorrelation, as the line joining the raw data points seems to dip above and below the fitted line for several years in a row. This could correspond to slowly changing effects such as a multi-year dry or wet spell. However, with only 20 data points, it is difficult to tell. We will check more formally for non-independence by looking at the residual plot and the Durbin-Watson test statistic later.

Use the red-triangle drop-down menu on the plot to select the Fit Line option. This gives:

The estimated intercept (−2702) would represent the estimated duration of the growing season in year 0 – clearly a nonsensical result. It really doesn't matter, as the intercept is just a place holder for the equation of the line. What really is of interest is the estimated slope.

The estimated slope is 1.46 (se 0.52) days/year. This means that the duration of the growing season is estimated to have increased by 1.46 days per year over the span of this study. The 95% confidence interval for the slope 8 (0.36 to 2.56) does not include the value of 0, so there is evidence against the slope actually being 0 (i.e. no change over the years).

Finally, the p-value for testing if the true slope is zero is 0.012, which again provides evidence against the hypothesis of no change in mean duration over the span of the experiment.

The estimated value of RMSE (not shown here but available in the Summary of Fit section of the output) is 13.52 days, which is the estimated standard deviation of the data points around the regression line.

The confidence intervals for the mean response and the prediction intervals for the individual response are available from the red-triangle on the linear-fit box:

Selecting both confidence intervals gives:

8 If the 95% confidence interval doesn't show in your output, do a right-click (Windoze) or ctrl-click (Macintosh) in the table of estimates and select the 95% lower and upper limits to be displayed.


Notice how much wider the prediction intervals for individual responses are compared to the confidence interval for the mean response. You can use the cross-hairs tool to select points on each of the lines to read off the values.

Unfortunately, the Analyze->Fit Y-by-X platform doesn't allow you to save the confidence intervals directly to the data table. In order to do this you need to use the Analyze->Fit Model platform:


The Y variable is the duration of the lawn cutting season, while the only effect to be entered into the effects box is that of year. After the model is fit (with identical results to what we had earlier), the predicted values, confidence intervals, and prediction intervals, along with residuals and other good stuff, can be saved by clicking on the drop-down red-triangle near the upper plot:


The data table will also include predictions for 2004 and 2005.


Notice the difference in width between the confidence interval for the mean response and the prediction interval for the individual response. These two intervals are often confused, and it is important to keep their two uses in mind.

The residual plot is automatically given by the Analyze->Fit Model platform, but is also easily obtained from the Analyze->Fit Y-by-X platform:


It does not show any evidence of problems.

Finally, the Durbin-Watson statistic for testing for the presence of autocorrelation is found in the Analyze->Fit Model platform:

to give 9

9 You may have to use the pop-down menu from the red-triangle to get the p-value.


The DW statistic should be close to 2 if there is no autocorrelation present in the data. The p-value does not indicate any evidence of a problem with autocorrelation. The estimated autocorrelation is very small (−0.004), so it is essentially zero.
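The Durbin-Watson statistic that JMP reports is easy to compute from the residuals in time order. A sketch (my own illustration; the alternating residual sequence is a made-up extreme case, not the grass data):

```python
# Durbin-Watson statistic: DW = sum (e_t - e_{t-1})^2 / sum e_t^2,
# computed over residuals e_t taken in time order.  Values near 2 indicate
# no lag-1 autocorrelation; DW is roughly 2*(1 - r1), where r1 is the
# lag-1 autocorrelation of the residuals.

def durbin_watson(resid):
    num = sum((resid[t] - resid[t - 1]) ** 2 for t in range(1, len(resid)))
    den = sum(e ** 2 for e in resid)
    return num / den

# A perfectly alternating residual sequence is strongly NEGATIVELY
# autocorrelated, pushing DW well above 2 (toward its limit of 4):
dw = durbin_watson([1.0, -1.0, 1.0, -1.0, 1.0, -1.0])
```

Positive autocorrelation (runs of residuals on the same side of the line, the usual worry in trend data) pushes DW below 2 instead.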

Postscript

A more formal analysis of the data presented in the article looked at the date of first cutting, the date of last cutting, and the number of cuts as well. The authors conclude:

Despite having a relatively short span of 20 years, the data from Kirkcaldy provide biological evidence of an increase in the length of the growing season and some suggestions of what meteorological factors affect lawn growth. Strictly, we are dealing with the cutting season, which is likely to underestimate the growing season.

This was quite an interesting analysis of an unusual data set!

2.3 Transformations

In some cases, the plot of Y vs. X is obviously non-linear, and a transformation of X or Y may be used to establish linearity. For example, many dose-response curves are linear in log(X). Or the equation may be intrinsically non-linear, e.g. a weight-length relationship is of the form weight = β0 length^β1. Or, some variables may be recorded in an arbitrary scale, e.g. should the fuel efficiency of a car be measured in L/100 km or km/L? You are already familiar with some variables measured on the log-scale - pH is a common example. Often a visual inspection of a plot may identify the appropriate transformation.

There is no theoretical difficulty in fitting a linear regression using transformed variables other than an understanding of the implicit assumption about the error structure. The model for a fit on transformed data is of the form

trans(Y) = β0 + β1 × trans(X) + error


Note that the error is assumed to act additively on the transformed scale. All of the assumptions of linear regression are assumed to act on the transformed scale – in particular, that the standard deviation around the regression line is constant on the transformed scale.

The most common transformation is the logarithmic transform. It doesn't matter if the natural logarithm (often called the ln function) or the common logarithm transformation (often called the log10 transformation) is used. There is a 1-1 relationship between the two transformations, and linearity on one transform is preserved on the other transform. The only change is that values on the ln scale are 2.302 = ln(10) times those on the log10 scale, which implies that the estimated slope and intercept both differ by a factor of 2.302. There is some confusion in scientific papers about the meaning of log - some papers use this to refer to the ln transformation, while others use this to refer to the log10 transformation.
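The ln(10) conversion factor is easy to verify numerically. A quick check of the claim above (my own illustration; the value 37 is an arbitrary example):

```python
import math

# Every ln value is ln(10) ~ 2.302585 times the corresponding log10 value,
# so a slope fitted on the ln scale is ln(10) times the slope fitted on
# the log10 scale (and likewise for the intercept).

factor = math.log(10)          # ~ 2.302585
y = 37.0
ln_y = math.log(y)             # natural log
log10_y = math.log10(y)        # common log
```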

After the regression model is fit, remember to interpret the estimates of slope and intercept on the transformed scale. For example, suppose that a ln(Y) transformation is used. Then we have

ln(Y_{t+1}) = b0 + b1 × (t + 1)

and

ln(Y_t) = b0 + b1 × t.

Subtracting gives

ln(Y_{t+1}) − ln(Y_t) = ln(Y_{t+1}/Y_t) = b1 × (t + 1 − t) = b1

and so

exp(ln(Y_{t+1}/Y_t)) = Y_{t+1}/Y_t = exp(b1) = e^{b1}.

Hence a one-unit increase in X causes Y to be MULTIPLIED by e^{b1}. As an example, suppose that on the log-scale the estimated slope was −0.07. Then every unit change in X causes Y to change by a multiplicative factor of e^{−0.07} = 0.93, i.e. roughly a 7% decline per year. 10
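The multiplicative interpretation can be checked directly; this uses the −0.07 slope from the example above:

```python
import math

# A fitted slope b1 on the ln(Y) scale means a one-unit increase in X
# multiplies Y by exp(b1).
b1 = -0.07
factor = math.exp(b1)             # ~ 0.932, i.e. roughly a 7% decline
pct_change = 100 * (factor - 1)   # ~ -6.8% per unit increase in X
```

Note the exact change is 6.8%, not 7.0%; the "slope ≈ percent change" shortcut in the footnote is an approximation that works well only for small slopes.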

Predictions on the transformed scale must be back-transformed to the untransformed scale.

In some problems, scientists search for the ‘best’ transform. This is not an easy task, and using simple statistics such as R² to search for the best transformation should be avoided. Seek help if you need to find the best transformation for a particular dataset.

2.3.1 Example: Monitoring Dioxins - transformation

An unfortunate byproduct of pulp-and-paper production used to be dioxins - a very hazardous material. This material was discharged into waterways with the pulp-and-paper effluent, where it bioaccumulated in living organisms such as crabs. Newer processes have eliminated this byproduct, but the dioxins in the organisms take a long time to degrade.

10 It can be shown that on the log scale, for smallish values of the slope, the change is almost the same on the untransformed scale, i.e. if the slope is −0.07 on the log scale, this implies roughly a 7% decline per year; a slope of 0.07 implies roughly a 7% increase per year.


Government environmental protection agencies take samples of crabs from affected areas each year and measure the amount of dioxins in the tissue. The following example is based on a real study.

Each year, four crabs are captured from a monitoring station. The liver is excised and the livers from all four crabs are composited together into a single sample. 11 The dioxin levels in this composite sample are measured. As there are many different forms of dioxins with different toxicities, a summary measure, called the Total Equivalent Dose (TEQ), is computed from the sample.

Here are the raw data:

Site Year TEQ<br />

a 1990 179.05<br />

a 1991 82.39<br />

a 1992 130.18<br />

a 1993 97.06<br />

a 1994 49.34<br />

a 1995 57.05<br />

a 1996 57.41<br />

a 1997 29.94<br />

a 1998 48.48<br />

a 1999 49.67<br />

a 2000 34.25<br />

a 2001 59.28<br />

a 2002 34.92<br />

a 2003 28.16<br />

JMP analysis

The data is available in a JMP data file dioxinTEQ.jmp in the Sample Program Library available at
http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms.

As with all analyses, start with a preliminary plot of the data obtained using the Analyze->Fit Y-by-X
platform.

11 Compositing is a common analytical tool. There is little loss of useful information induced by the compositing process - the only
loss of information is the among-individual-sample variability, which can be used to determine the optimal allocation between samples
within years and the number of years to monitor.


The preliminary plot of the data shows a decline in levels over time, but it is clearly non-linear. Why is
this so? In many cases, a fixed fraction of the dioxins degrades per year, e.g. a 10% decline per year. This can
be expressed as a non-linear relationship:

TEQ = C × r^t

where C is the initial concentration, r is the rate reduction per year, and t is the elapsed time. If this is
plotted over time, this leads to the non-linear pattern seen above.

If logarithms are taken, this leads to the relationship:

log(TEQ) = log(C) + t × log(r)

which can be expressed as:

log(TEQ) = β0 + β1 × t

which is the equation of a straight line with β0 = log(C) and β1 = log(r).
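The straight-line fit on the log scale can be reproduced directly from the data table above. Here is a minimal Python sketch (not part of the original JMP workflow; it simply re-derives the least-squares estimates by hand):

```python
import math

# TEQ data for site "a" from the table above
years = list(range(1990, 2004))
teq = [179.05, 82.39, 130.18, 97.06, 49.34, 57.05, 57.41,
       29.94, 48.48, 49.67, 34.25, 59.28, 34.92, 28.16]

log_teq = [math.log(t) for t in teq]

# ordinary least squares: slope = Sxy/Sxx, intercept = ybar - slope*xbar
n = len(years)
xbar = sum(years) / n
ybar = sum(log_teq) / n
sxx = sum((x - xbar) ** 2 for x in years)
sxy = sum((x - xbar) * (y - ybar) for x, y in zip(years, log_teq))
slope = sxy / sxx                 # about -0.108 on the log scale
intercept = ybar - slope * xbar   # about 218.9

print(round(intercept, 1), round(slope, 3))
```

The back-transform exp(slope) ≈ 0.898 reproduces the roughly 11% per-year decline discussed in the text.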

JMP can easily be used to compute log(TEQ) by using the Formula Editor in the usual fashion.
A plot of log(TEQ) vs. year using the Analyze->Fit Y-by-X platform gives the following:


This looks linear over time with a steady decline. A line can be fit as before by selecting the Fit Line
option from the red triangle in the upper left side of the plot:


This gives the following output:


The residual plot looks fine with no apparent problems, but the dip in the middle years could require
further exploration if this pattern were apparent at other sites as well:

The fitted line is:

log(TEQ) = 218.9 − .11 × year

The intercept (218.9) would be the log(TEQ) in the year 0, which is clearly nonsensical. The slope
(−.11) is the estimated log(ratio) from one year to the next. For example, exp(−.1076) = .898 (using the
unrounded slope) would mean that the TEQ in one year is only 89.8% of the TEQ in the previous year, or
about an 11% decline per year. 12

The standard error of the estimated slope is .02. A 95% confidence interval for the slope can be obtained
by pressing a Right-Click (for Windoze machines) or a Ctrl-Click (for Macintosh machines) in the Parameter
Estimates summary table and selecting the confidence intervals to display in the table.

The 95% confidence interval for the slope is (−.154 → −.061). If you take the anti-logs of the endpoints,
this gives a 95% confidence interval for the fraction of TEQ that remains from year to year, i.e. between
0.86 and 0.94 of the TEQ in one year remains to the next year.
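Back-transforming the endpoints is mechanical; a quick Python check of the interval quoted above:

```python
import math

# 95% CI for the slope on the log scale, from the output above
lo, hi = -0.154, -0.061

# anti-log the endpoints: yearly fraction of TEQ retained
frac_lo, frac_hi = math.exp(lo), math.exp(hi)
print(round(frac_lo, 2), round(frac_hi, 2))  # -> 0.86 0.94
```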

Several types of predictions can be made. For example, what would be the estimated mean TEQ in 2010?

12 It can be shown that, in regressions using log(Y) vs. time, the estimated slope on the logarithmic scale is the approximate
fractional decline per time interval. For example, in the above, the estimated slope of −.11 corresponds to an approximate 11% decline
per year. This approximation only works well when the slopes are small, i.e. close to zero.
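The quality of this approximation can be checked numerically: the exact fractional change per year is exp(slope) − 1, which stays close to the slope only when the slope is near zero. A small illustrative sketch:

```python
import math

# exact fractional change per year implied by a slope on the log scale
for slope in (-0.02, -0.07, -0.11, -0.50):
    exact = math.exp(slope) - 1
    print(slope, round(exact, 3))
# the approximation "slope = fractional change" degrades as |slope| grows:
# -0.02 -> -0.02, -0.07 -> -0.068, -0.11 -> -0.104, -0.5 -> -0.393
```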


This can be accomplished in several ways.

The computations could be done by hand, or by using the cross-hairs on the plot from the Analyze->Fit
Y-by-X platform. Confidence intervals for the mean response, or prediction intervals for an individual
response, can be added to the plot from the pop-down menu.

However, a more powerful tool is available from the Analyze->Fit Model platform.

Start first by adding rows to the original data table corresponding to the years for which a prediction is
required. In this case, the additional row would have the value 2010 in the Year column with the remainder
of the row unspecified. Missing values will be automatically inserted for the other variables.

Then invoke the Analyze->Fit Model platform:


This gives much the same output as the Analyze->Fit Y-by-X platform with a few new (useful)
features, a few of which we will explore in the remainder of this section.

Next, save the prediction formula, the confidence interval for the mean, and the prediction interval for an
individual response to the data table (this will take three successive saves):


Now the data table has been augmented with additional columns and, more importantly, predictions for
2010 are now available:


The estimated mean log(TEQ) is 2.60 (corresponding to an estimated MEDIAN TEQ of exp(2.60) =
13.46). A 95% confidence interval for the mean log(TEQ) is (1.94 to 3.26), corresponding to a 95% confidence
interval for the actual MEDIAN TEQ of between (6.96 and 26.05). 13 Note that the confidence
interval after taking anti-logs is no longer symmetrical.

Why does a mean of a logarithm transform back to the median on the untransformed scale? Basically,
because the transformation is non-linear, properties such as means and standard errors cannot be simply
anti-transformed without introducing some bias. However, measures of location (such as a median) are
unaffected. On the transformed scale, it is assumed that the sampling distribution about the estimate is symmetrical,
which makes the mean and median take the same value. So what really is happening is that the
median on the transformed scale is back-transformed to the median on the untransformed scale.
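The back-transformation of these numbers is easy to verify; a quick Python check of the values quoted above:

```python
import math

# estimated mean log(TEQ) in 2010 and its 95% confidence interval,
# as read from the JMP output above
mean_log = 2.60
ci_log = (1.94, 3.26)

# back-transforming the mean on the log scale gives the MEDIAN TEQ
median_teq = math.exp(mean_log)
ci_median = tuple(math.exp(v) for v in ci_log)

print(round(median_teq, 2))              # -> 13.46
print([round(v, 2) for v in ci_median])  # -> [6.96, 26.05]
```

Note how the back-transformed interval (6.96, 26.05) is no longer symmetric about 13.46.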

Similarly, a 95% prediction interval for the log(TEQ) for an INDIVIDUAL composite sample can be
found.

Finally, an inverse prediction is sometimes of interest, i.e. in what year will the TEQ be equal to some
particular value? For example, health regulations may require that the TEQ of the composite sample be
below 10 units.

The Analyze->Fit Model platform has an inverse prediction function:

13 A minor correction can be applied to estimate the mean if required.


Specify the required value for Y - in this case log(10) = 2.302 - and then press the RUN button to get
the following output:


The predicted year is found by solving

2.302 = 218.9 − .11 × year

and gives an estimated year of 2012.7 (the solution uses the unrounded slope and intercept, so the rounded
coefficients shown above will not reproduce it exactly). A confidence interval for the time when the mean
log(TEQ) is equal to log(10) is somewhere between 2007 and 2026!
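The inverse prediction can be verified by solving the fitted line for year. A sketch (the unrounded slope and intercept used here, −.10762 and 218.91, are reconstructed from the data rather than copied from the JMP output):

```python
import math

# fitted line on the log scale: log(TEQ) = b0 + b1*year
# (assumed unrounded least-squares estimates; JMP keeps more digits)
b0, b1 = 218.91, -0.10762

target = math.log(10)          # health limit: TEQ of 10 units
year = (target - b0) / b1      # solve target = b0 + b1*year
print(round(year, 1))          # -> 2012.7
```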

2.3.2 Final Words

The application of regression to non-linear problems is fairly straightforward after the transformation is
made. The most error-prone step of the process is the interpretation of the estimates on the TRANSFORMED
scale and how these relate to the untransformed scale.

2.4 Power/Sample Size

2.4.1 Introduction

A common goal in ecological research is to determine if some quantity (e.g. abundance, water quality)
is tending to increase or decrease. A linear regression of this quantity against time is commonly used to
evaluate such a trend. The methods presented earlier can be used in these situations without much difficulty,
except for problems of autocorrelation over time (for example, if the same monitoring plots were measured
repeatedly over time) and making sure that the experimental and observational units are not confused (this is
similar to the problem of sub-sampling discussed earlier). 14

When designing programs to detect trends, several related questions arise. For how many years does the
study have to run? What influence does the precision of the individual yearly measurements have on the
length of the monitoring study? What is the power to detect a trend of a certain size given a proposed study
design?

As in ANOVA, these questions are answered through a power analysis. The information needed to
conduct a power analysis for linear regression is similar to that required for a power analysis in ANOVA;
however, the computations are more complex.

The information needed is:

• α level. As in power analyses for ANOVA, this is traditionally set to α = 0.05.

14 An example of such confusion would be an investigation of the fecundity of a bird over time. Several sites covering the range of
the bird are measured, and many nests within each site are also measured. This study continues for a number of years. The average
fecundity (over all sites and nests) is the response variable, i.e. one single number per year rather than the individual nest measurements.
The reason for this is that factors that operate on the yearly scale (e.g. environmental variables) affect all nests simultaneously rather
than operating on a single nest at a time independently of other nests. For example, a poor summer will depress fecundity for all nests
simultaneously.


• effect size. In ANOVA, power deals with the detection of differences among means. In regression analysis,
power deals with the detection of slopes that are different from zero. Hence, the effect size is measured
by the slope of the line, i.e. the rate of change in the mean of Y per unit change in X.

• sample size. Recall that in ANOVA with more than two groups, the power depended not only on
the sample size per group, but also on how the means are separated. In regression analysis, the power
will depend upon the number of observations taken at each value of X and the spread of the X values.
For example, the greatest power is obtained when half the sample is taken at each of the two extremes of
the X space - but at the cost of not being able to detect non-linearity. For many monitoring designs,
observations are taken on a yearly basis, so the question reduces to the number of years of monitoring
required.

• standard deviation. As in ANOVA, the power will depend upon the variation of the individual objects
around the regression line.

A very nice series of papers on detecting trends in ecological studies is available:

• Gerrodette, T. 1987. A power analysis for detecting trends. Ecology 68: 1364-1372. http://dx.doi.org/10.2307/1939220.

• Link, W. A. and Hatfield, J. S. 1990. Power calculations and model selection for trend analysis: a
comment. Ecology 71: 1217-1220. http://dx.doi.org/10.2307/1937393.

• Gerrodette, T. 1991. Models for power of detecting trends - a reply to Link and Hatfield. Ecology 72:
1889-1892. http://dx.doi.org/10.2307/1940986.

• Gerrodette, T. 1993. Trends: software for a power analysis of linear regression. Wildlife Society
Bulletin 21: 515-516.

JMP does not include a power computation module for regression analysis. However, SAS v.9+ includes
a power analysis module (GLMPOWER) for the power analysis of regression models, but this is a bit complex
to use.

Perhaps the most common aspect of a power analysis for linear regression is the planning of a monitoring
study to detect trends over time. This considerably simplifies the computations of the power, as usually the
time points are equally spaced with the same number of measurements taken at each time point. There are
two readily available software packages to help plan such studies. The first, TRENDS, available at
http://swfsc.noaa.gov/textblock.aspx?Division=PRD&ParentMenuId=228&id=4740, is a
Windoze-based program that does the computations as outlined in the above papers. Because of concerns
raised by Link and Hatfield, a second program, MONITOR, available at http://www.mbr-pwrc.usgs.gov/software/monitor.html,
was developed that does power computations based on simulation
rather than simple formulae. This uses a web-based interface rather than running on individual
machines. 15 This second program also has additional flexibility to handle situations where the monitoring
points are not equally spaced in time, or there are multiple measurements taken at each time point.

15 The original author of MONITOR, James Gibbs, indicates that a Windoze version will be available in early 2005 at http://www.esf.edu/efb/gibbs/


CAUTION: Power analysis for trend can be very complex. The authors of Program Monitor have some
sage advice that is applicable to both TRENDS and MONITOR:

Users should be aware (and wary) of the complexity of power analysis in general, and also acknowledge
some specific limitations of MONITOR for many real-world applications. Our chief,
immediate concern is that many users of MONITOR may be unaware of these limitations and
may be using the program inappropriately. Below are comments from one of our statisticians
on some of the aspects of MONITOR that users should be cognizant of: "There are numerous
issues with how Program Monitor calculates statistical power and sample size. One issue concerns
the default option whereby the user assumes independence of plots or sites from one time
period to the next. If you are randomly sampling new sites or plots each time period, then it is
correct to assume independence (assuming that the finite population correction factor is not an issue,
which depends on how many plots or sites you are sampling relative to the total population
size of potential plots or sites). If you are sampling the same plots or sites repeatedly over time,
however, then the default option in Program Monitor is unlikely to give a correct calculation of
statistical power or sample size. If plots or sites are positively autocorrelated over time, as is
usually the case in biological surveys, then Program Monitor will underestimate sample size, or,
conversely, it will overestimate the statistical power. The correct sample size estimate is likely
to be greater, and depending upon the amount of autocorrelation, the correct sample size could
be vastly greater to achieve a stated power objective. A more fundamental issue concerns the
null model one chooses for the trend in population growth. Program Monitor assumes a relatively
simple linear trend in population growth, but this is a controversial issue, because there
are potentially an infinite number of models one could use. If pilot data are available, then it
may be possible to estimate autocorrelation and try to make some choices concerning the type
of model to use as the null model for a power calculation, but regardless of how you decide to
proceed, it would be a good idea to consult a statistician to determine an approach that fits your
needs and data. No matter what additional flexibility is built into the modeling, however, it will
always be possible to posit the existence of further structure which, if overlooked, will produce
misleading results. For a pertinent discussion of some of these issues, please see Elzinga et
al. (1998). Although this reference deals specifically with plant populations, the fundamental
statistical issues are similar whether you are sampling plant or animal populations. Literature
Citation: Elzinga, C.L., D.W. Salzer, and J.W. Willoughby. 1998. Measuring and monitoring
plant populations. BLM Technical Reference 1730-1, Denver, CO. 477 pages."

Some care must also be taken to distinguish between sampling variation and process variation.


[Figure: Process vs Sampling Variation - a population trajectory over time, with braces marking process
variation between years and sampling variation within each year.]

Sampling variation refers to the uncertainty of each measurement in each year. This can be reduced by
increasing the sampling effort in each year. Process variation refers to the fact that even if the data values
were known exactly, the points would not lie on the straight line. Process variation is unaffected by the
sampling effort in each year.


Sampling variation is the size of the standard error when estimates are made at each sampling occasion.
Sampling variation can be reduced by increasing sampling effort (e.g. more measurements per occasion).
Process variation refers to the variation around the perfect linear regression even if there were no uncertainty
in each individual observation. Process variation cannot be reduced by increasing sampling effort. At the
moment, Program Monitor assumes that process variation is 0, i.e. if you knew each data point exactly, they
would all fit exactly on the linear trend. There are a number of web pages that discuss this issue in more
detail - do a simple search using a search engine.

2.4.2 Getting the necessary information

As noted earlier, the information required to do a power analysis is similar to that for ANOVA. We will
concentrate on the relevant quantities for a trend analysis over time rather than a general regression situation. I
will use population size as my response variable, but any other ecological quantity could be used.

α level. As in power analyses for ANOVA, this is traditionally set to α = 0.05.

Effect size. In trend analysis, this is traditionally specified as the rate of change per unit time and denoted
by r. For example, a value of r = .02 = 2% corresponds to an (increasing) change of 2% per year. Both
TRENDS and MONITOR allow for both linear and exponential trends. In linear trends, the population size
changes by the same fixed percentage of the initial population each year. So if the initial population was
1000 animals, a 2% decline per year would correspond to a fixed change of .02 × 1000 = 20 animals per
year, i.e. 1000, 980, 960, 940, 920, 900, . . ..

In exponential trends, the change is multiplicative each year. So if the initial population was 1000
animals, a 2% (multiplicative) decline corresponds to 1000 × .98 = 980 animals in the next year, 980 × .98 =
1000 × .98^2 = 960.4 in the next year, followed by 941.2, 922.4, 904, 885, etc. in subsequent years.

If the rate is small, both an exponential and a linear trend will be very similar over short time spans -
they can be quite different if the rate is large and/or the time series is very long.
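The two kinds of decline above can be sketched side by side (a minimal Python sketch of the 2% example; variable names are my own):

```python
# compare a 2% linear decline with a 2% exponential decline,
# starting from 1000 animals
n0, rate = 1000.0, 0.02

linear = [n0 - rate * n0 * t for t in range(6)]         # fixed 20/year
exponential = [n0 * (1 - rate) ** t for t in range(6)]  # multiplicative

print([round(v) for v in linear])  # -> [1000, 980, 960, 940, 920, 900]
print([round(v, 1) for v in exponential])
# -> [1000.0, 980.0, 960.4, 941.2, 922.4, 903.9]
```

Over six years the two trajectories differ by only a few animals; the gap widens as the horizon grows.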

Individuals monitoring populations often think of long-term trends in populations, such as: how many
plots do I need to monitor to detect a 10% reduction in this population over a 10-year period? This overall
change must be converted to a rate per unit time. The MONITOR home page has a trend converter, but the
computations are relatively simple.

For linear trends, the rate is found as:

r = R / (n − 1)

where R is the overall fractional change in abundance over the n years. For example, a 10% reduction over
10 years has R = −.1 and n = 10, leading to:

r = −.1 / (10 − 1) = −.011

or just over a 1% reduction per year.


For exponential trends, the rate is found as:

r = (R + 1)^(1/(n−1)) − 1

where R is the overall fractional change in abundance over the n years. For example, a 10% reduction over
10 years has R = −.1 and n = 10, leading to:

r = (.9)^(1/9) − 1 = −.0116

or just over a 1% reduction per year. Again note that for small reductions and a small number of years, both
a linear and an exponential trend have similar rates.
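Both conversions above can be checked in a couple of lines (a quick Python sketch of the worked example):

```python
# convert an overall 10% decline over 10 years (R = -0.1, n = 10)
# into a per-year rate, for both trend types
R, n = -0.1, 10

r_linear = R / (n - 1)                         # linear trend
r_exponential = (R + 1) ** (1 / (n - 1)) - 1   # exponential trend

print(round(r_linear, 4), round(r_exponential, 4))  # -> -0.0111 -0.0116
```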

Sample size. For many monitoring designs, observations are taken on a yearly basis, so the question
reduces to the number of years of monitoring required. TRENDS requires fixed sampling intervals, while
MONITOR allows for some flexibility in the timing of the monitoring.

Standard deviation. As in ANOVA, the power will depend upon the variation of the individual objects
around the regression line.

In many cases, the standard deviation is not directly given; rather, the variability of the estimates of the
individual observations is reported as the relative standard error (cv = stddev/mean). TRENDS
uses the cv while MONITOR uses the actual standard deviation.

Gibbs (2000) 16 summarizes typical cvs for measuring a number of types of populations:

16 Gibbs, J. P. (2000). Monitoring Populations. Pages 213-252 in Research Techniques in Animal Ecology, Boitani, L. and Fuller, T.
K., eds. Columbia University Press.


Group cv
Large mammals 15%
Grasses and sedges 20%
Herbs, compositae 20%
Herbs, non-compositae 20%
Turtles 35%
Salamanders 35%
Large bodied birds 35%
Lizards 40%
Fishes, salmonids 50%
Caddis flies 50%
Snakes 55%
Dragonflies 55%
Small bodied birds 55%
Beetles 60%
Small mammals 60%
Spiders 65%
Medium sized mammals 65%
Fishes, non-salmonids 70%
Salamander (aquatic) 85%
Moths 90%
Frogs and toads 95%
Bats 95%
Butterflies 110%
Flies 130%

If necessary, these can be converted to a standard deviation if the initial density is approximately known by multiplying the cv by the initial density. For example, if the initial density is 25 mice/hectare, then the approximate standard deviation (for small mammals) would be found as 25 mice/hectare × 60%, or 15 mice/hectare.

Finally, even if all else is equal, the variation often changes with the change in abundance over time. Gerrodette (1987) examines three cases:

• the cv is constant over time;
• the cv is proportional to √abundance;
• the cv is proportional to 1/√abundance.


Many sampling methods give cvs that are proportional to 1/√abundance. The TRENDS program allows you to select an appropriate relationship. Again, for short time scales, there isn't much of a difference in results among the different relationships of cv and abundance.
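These relationships can be written as simple scalings of a baseline cv. A minimal sketch, assuming the cv is anchored at a reference abundance (the function name and constants below are illustrative, not taken from TRENDS or MONITOR):

```python
import math

def cv_at(abundance: float, cv0: float, n0: float, model: str) -> float:
    """cv at a given abundance, anchored so that cv = cv0 when abundance = n0.

    model: 'constant'  -> cv stays at cv0 regardless of abundance
           'sqrt'      -> cv proportional to sqrt(abundance)
           'inv_sqrt'  -> cv proportional to 1/sqrt(abundance)
    """
    if model == "constant":
        return cv0
    if model == "sqrt":
        return cv0 * math.sqrt(abundance / n0)
    if model == "inv_sqrt":
        return cv0 * math.sqrt(n0 / abundance)
    raise ValueError(f"unknown model: {model}")

# Anchored at cv = 20% when abundance = 1000; under the inv_sqrt model the
# cv doubles to 40% when abundance falls to a quarter of its starting value.
print(cv_at(250.0, 0.20, 1000.0, "inv_sqrt"))
```

Under the inv_sqrt model a declining population is measured with increasing relative error, which is one reason power behaves differently for increasing versus decreasing trends.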

The cv may be improved if multiple, independent samples are taken each year. If m independent samples are taken each year, then the corresponding cv value is:

cv_average = cv_individual / √m

Both programs do this computation automatically if you specify that the effort is increased at each sampling occasion. Note that you are implicitly assuming that the process variation is 0 in these cases, i.e. if perfect information were known, the abundance would lie exactly on the trend line. This may not be a suitable assumption, and some care is needed if a large amount of sampling is to be done in each year to try to get the cv of the estimates down to a reasonable level - the payoff may not be as great as expected.
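Both conversions above (a cv to a standard deviation at a known density, and the √m improvement from m independent samples per year) are one-line computations; a minimal sketch using the small-mammal example from the text:

```python
import math

def cv_to_sd(cv: float, density: float) -> float:
    """Approximate standard deviation from a cv and the current density."""
    return cv * density

def pooled_cv(cv_individual: float, m: int) -> float:
    """cv of the yearly average when m independent samples are taken each
    year: cv_average = cv_individual / sqrt(m)."""
    return cv_individual / math.sqrt(m)

print(cv_to_sd(0.60, 25))   # about 15 mice/hectare for small mammals
print(pooled_cv(0.60, 4))   # four replicates halve the cv to about 0.30
```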

2.4.3 How does power vary as information changes?

A nice discussion of some of the issues in sample size for trend analysis is found at http://www.pwrc.usgs.gov/monmanual/samplesize.htm, which is reproduced here for convenience:



Managers' Monitoring Manual | Setting Sample Size
10/06/2005 06:52 PM
Patuxent Wildlife Research Center

Figuring out how many samples you need

The number of samples you need is affected by the following factors:

Project goals
How you plan to analyze your data
How variable your data are or are likely to be
How precisely you want to measure change or trend
The number of years over which you want to detect a trend
How many times a year you will sample each point
How much money and manpower you have

Here are some graphs that illustrate some of these trade-offs. These graphs were made using the assumption that you would be analyzing your data using simple linear regression. Each graph isolates one factor and looks at how altering that factor affects sample size. Those factors are explained in greater detail below.

[Four pages of graphs from the Managers' Monitoring Manual web page are not reproduced here.]

In general, you can lower your sample size requirements by adopting the following approaches:

Aim to detect only long-term changes
Set your analytical tests to P



Must be located randomly or uniformly throughout the study area
Must detect a constant proportion of the individuals (or estimate the differences)
Be precise enough to detect the types of changes you want to detect

Issues of bias, sample placement, and choosing your counting technique have been discussed elsewhere in this web site. Here we will help you determine whether your monitoring program has a sufficient number of sampling locations (sample size) to detect the types of changes you have set forth as your goal.

So, what is a sufficient sample size?

To answer that you need to address three things:

1. What is the inherent variability of your counts?
2. What magnitude of trend do you want to detect, and how precisely would you like to measure it?
3. How are you going to statistically test for population change?

Count Variation

Count variation is simply a measure of how your counts fluctuate from year to year. Variation affects your ability to detect trends because, obviously, if the data fluctuate greatly you will not have the resolution to find an increasing or decreasing trajectory in the population you are monitoring.

Basic rule of thumb: The more variable your counts, the more samples you will need to detect a change or trend of a given magnitude. Conversely, for any given sample size, the more your counts vary, the lower your ability to detect trends.

Sample size calculations need an estimate of count variation to estimate sample sizes. You can get such an estimate from your own pilot data (the mean and standard deviation) or from estimates taken from other, similar situations. We provide some of those estimates for amphibians and birds (point counts and territory mapping). Note that these are calculations of fluctuation over time, not over space, meaning that you calculate a mean and standard deviation of the counts across several years at one point and not a mean and standard deviation among several points.
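The within-site calculation described above (mean, standard deviation, and cv of the counts across several years at one point) can be sketched as follows; the counts are hypothetical:

```python
import statistics

# Hypothetical yearly counts at ONE monitoring point (variation over time,
# not variation among points).
counts_by_year = [34, 41, 29, 38, 33, 44, 31]

mean = statistics.mean(counts_by_year)
sd = statistics.stdev(counts_by_year)  # sample standard deviation
cv = sd / mean                         # coefficient of variation

print(f"mean = {mean:.1f}, sd = {sd:.1f}, cv = {cv:.0%}")
```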

Be aware, when using counts from other studies, that count variances are specific to the counting technique and to how the original study pooled their samples. Additionally, as you can see from these collections of count variances, even when using the same counting technique, on the same species, in the same region, the degree of variability in the resulting counts usually differs (often greatly) from study to study or even from site to site. The good news is that reviews of long-term studies have shown that, at any individual site, the variability in counts remains about the same. This is another strong reason to review your monitoring program after 5 years, to see if you have been adequately sampling your populations.

Basic rule of thumb: Use the estimates of variability of counts from other studies as a general guide to what you might expect in yours, but it is wise to check the variability of your own counts after your program has been established for 5 years to see if your sampling strategy needs to be revised.

Helpful hints on how to decrease variability

Philosophical Considerations

Trend. Trend can be defined as change over time. More apropos to a monitoring program would be to define trend as some specific rate of change over a specific period of time. Most calculations for determining sample size require that you specify a minimum rate of change you would like to detect and a minimum time period over which you would detect those changes. Those minimums now become the targets your monitoring program will aim to achieve or beat. In other words, by appropriately setting your sample sizes you hope to be able to detect a trend as small as or smaller than those minimums which you have targeted.

Basic rule of thumb: The smaller the population change you would like to detect, the greater the number of samples you will need to detect it.

Another basic rule of thumb: The fewer the number of years over which you would like to detect a trend, the greater the number of samples you will need.

Grand rule of thumb: Any monitoring program whose goal is to detect small population changes over just a few years will be expensive to create.

Precision. Calculators of sample size also need to know how precisely you want to measure these changes. An imprecisely measured trend is a very unsatisfactory trend in that you are unsure of how well it really reflects the REAL changes in the animal populations on your land. On the other hand, a very precisely measured trend can be very costly to obtain because you will have to spend a great deal of your budget to achieve that level of precision. So, think of your precision goal as your willingness to risk being wrong about the population change you are trying to measure. You need to determine the amount of risk you are willing to take in your monitoring program and understand the consequences of that decision, both as a cost to your budget and in the probability of being wrong.


Basic rule of thumb: The lower the precision, the lower the number of samples you will need. Conversely, the higher the precision, the larger the number of samples needed.

You control precision and risk using two statistical settings: alpha and power. Because most basic statistical books and quite a number of web sites cover these parameters well and are very accessible, we will not cover them here, but we do want to highlight a few considerations relevant to the estimation of sample sizes for monitoring programs.


Alpha Level. Because animal populations and their counts will vary for a number of reasons, the data from your monitoring program just cannot be expected to produce nice straight lines when you finally plot them out. Because of this imprecision we must specify some level of uncertainty in our measures of change that we are willing to tolerate. This level represents our willingness to risk being wrong, for example, to claim that a trend exists when it does not. Traditionally, this is known as setting the alpha level.

Setting your alpha level is a balance between not wanting to 'cry wolf' (saying a trend exists when it really doesn't) and missing an important trend by being too conservative. If you are using your monitoring program as an early warning of negative population change, then you may want to increase your alpha level above the traditional level of 0.05. To do so may mean that you 'cry wolf' more of the time, but because the goal of most monitoring programs is to alert managers to potential problems, a higher alpha is justified in light of the possibility of missing a problem while waiting for it to become "statistically significant" at a lower alpha level.

Basic rule of thumb: The less willing you are to be caught 'crying wolf' (or the smaller you set the alpha level), the more samples you will need to detect a given level of population trend.

Power. Power can be defined as your ability to (or the odds of) detecting a trend given that there really is a change going on in your animal populations. In general, a power of 90-95% is reasonable for most monitoring programs.

Statistical Testing

You now know that count variability affects the number of samples you need, as do your requirements for what magnitude of change you want your program to detect. The last issue that needs to be resolved is what statistical model you will use to test your data.

Basic rule of thumb. The specific formula (or simulations) used to calculate sample size is unique to the statistical test or model that you will use.

Now ... some practical guidance on how to calculate the sample sizes for your monitoring program.

Note: Throughout this document we often use the terms variance, variation, and variability as a shorthand expression for the variability of counts. However, understand that the actual mathematical calculation of variability could be any one of several measures (standard deviation, standard error, variance, or coefficient of variation), each of which has a specific statistical meaning.

Basic rules of thumb. You must have determined the following to set sample sizes:

A mean and standard deviation (i.e., the coefficient of variation or the variation of your counts)
The smallest number of years over which you would like to detect a change
The smallest percentage change you would like to detect over those years
An alpha level (how often you will cry wolf)
A power level (the proportion of the time you would like to detect a trend if one were occurring)
A statistical test (your analytical model)

Calculating the mean and standard deviation requires some additional explanation. While the other factors that affect sample sizes are set based on your desired need for precision and the smallest degree of change you want to detect, the mean and variance are factors that are set by the animals being sampled. If you have several years of pilot data you will want to calculate the mean and standard deviation from your own data. If you don't, then you can use someone else's data from as similar a situation as you can find to estimate means and standard deviations. In a pinch, you can estimate some of the variation to be expected in a set of yearly counts by calculating a mean and standard deviation from one year's data if you have several replicates OF THE SAME points or plots. However, this approach fails to account for any between-year variation in the animal populations. Finally, you can use data published in the literature or from one of our databases on count variation (e.g. amphibians, bird point counts, bird territory mapping).

Pilot data is far and away the most preferable source of information for determining count variation. We have found that the variation in the counts of animals is very consistent within a site (figure). However, there are often wide differences in the variation of counts among sites, even among sites close by that use the same counting technique.

Ways to avoid problems when you calculate variances:

1. Use data collected over time, not over space.
2. Use data that match the counting technique and sampling units you plan to use (e.g., don't use the variances that come from counts from a 50-stop point count system when you are planning to use only a 20-stop system).
3. Use means that come from the same data you used to calculate the variance.

Basic rule of thumb: If you have no access to pilot data and are not aware of examples from the literature that you trust, a conservative estimate of the amount of variation that you could use in sample size calculations would be a CV of 100%, with a moderately conservative alternative of 75%.

Choosing an analytical technique also requires some further explanation. The specific calculation of sample sizes is different for every statistical test. In complicated analyses, formulas often don't exist and simulation must be used to calculate sample size. Below are listed some text and web resources for setting sample sizes for various simple statistical models. For complicated situations you can either run the simulations yourself, have a statistician do them for you, or use a conservative model to estimate sample sizes from.

James Gibbs has created a software program that estimates sample sizes for those who will use linear or exponential regression to analyze their data. As this type of regression is the most basic, it is also likely to be the most conservative. Currently his software only runs on Windows XP or 2000. It is available by contacting Sam Droege at the address below. A new version is expected out soon that will run on more platforms.

Desperation rule of thumb: If, for whatever reason, you cannot calculate a reasonable estimate for the number of samples to take, then put in 60 plots/points; under many circumstances that may be sufficient. Obviously the more the better, but be sure to review your data after 3 years to re-evaluate this weak choice.

Texts on estimating sample size

Web sites and online calculators for the calculation of sample sizes

U.S. Department of the Interior
U.S. Geological Survey
Patuxent Wildlife Research Center
Laurel, MD USA 20708-4038
http://www.pwrc.usgs.gov/monmanual
Contact Sam Droege, email sam_droege@usgs.gov



Gerrodette (1987) also looks at the effect of various factors upon the number of years of monitoring required.

For example, Figure 1 of his report shows the dependence of power upon the rate and the type of cv relationship when the initial cv was 20% and α = .05. Note that for n = 5, you have very little power to detect anything but huge changes (large values of r). For example, even with r = .2, corresponding to a 20% change/year in abundance, power barely exceeds 50% even after 5 years. Power is highest (and hence trend is easier to detect) when the cv is proportional to 1/√abundance (but this is reversed for declining trends).

Figure 2 of his report shows the relationship of power to the type of trend (linear or exponential) and to whether the trend is increasing or decreasing. Regardless of whether the trend is linear or exponential, decreasing trends are easier to detect than increasing trends. Furthermore, it is easier to detect a declining trend with a constant absolute decline than one with a relative decline, and hardest to detect an increasing trend that changes by an absolute amount each year. This is related to the "compounding" effect in exponential changes.

Finally, Figure 3 of his report shows the effect of different amounts of variation upon trend detection. As expected, trend is easier to detect with lower amounts of variation (smaller cvs).

2.4.4 Finally - how many years do I need to monitor?

Gerrodette (1987) gives a quick-and-dirty approximation that will help guide sample size determination. For α = .05 and power = 80%, the following is an approximate rule:

r²n³ ≥ 94(cv)²

For example, to detect a 5% decline/year in a population whose cv is 20% and constant over time would require:

(−.05)²n³ ≥ 94(.2)²

or n ≥ 11 years of monitoring. 17
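Rearranging the rule gives a lower bound on the number of years directly, n ≥ (94 cv² / r²)^(1/3). A minimal sketch (the function name is ours; this is only Gerrodette's approximation, not the exact TRENDS computation):

```python
def min_years(r: float, cv: float) -> float:
    """Real-valued lower bound on n from Gerrodette's rule of thumb
    r^2 * n^3 >= 94 * cv^2  (for alpha = .05 and power = 80%)."""
    return (94.0 * cv ** 2 / r ** 2) ** (1.0 / 3.0)

# 5% decline/year with a constant cv of 20%, as in the example above:
print(round(min_years(-0.05, 0.20), 1))  # about 11.5 years of monitoring
```

This is consistent with the roughly 11 years of monitoring found above.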

Suppose we wish to investigate the power of a monitoring design that will run for 5 years. At each survey occasion (i.e. every year), we have 1 monitoring station, and we make 2 estimates of the population at the monitoring station in each year. The population is expected to start with 1000 animals, and we expect that the measurement error in each estimate is about 200, i.e. the coefficient of variation of each measurement is about 20% and is constant over time. We are interested in detecting increasing or decreasing trends and, to start, a 5% decline per year will be of interest.

17 If you try this actual power computation using TRENDS, you find that 9 years may actually be sufficient. This formula is ONLY an approximation!

The input/output for TRENDS is shown below:

Most of the fields are self-explanatory. The effort multiplier, i.e. 2 surveys/year, is located at the bottom right of the screen. We find that a five-year study only has a 14% chance of detecting a 5% decline per year - hardly worthwhile doing the study!

The input for the MONITOR program follows:


Most of the fields are self-explanatory, but additional help can be obtained by clicking on the active links behind each term.

The output from this proposed program follows:



Program MONITOR
Tue May 4 00:53:34 2004 p=2705
This is an example of a power analysis to detect a trend

SIMULATION OVERVIEW
Number of plots monitored : 1
Plot Counts : 1000.000
Plot Standard Deviations : 200.000
Plot weights : 1.000
Number counts/plot/survey occasion : 2
CV in trends : 0.000
Total Surveys : 5
Survey occasions: 0.000 1.000 2.000 3.000 4.000
Trend Type = Linear
Counts analyzed as decimals
Projection set = Complete
Significance Level : 0.050
Significance Test : 2-tailed t-test
Iterations : 500

Power to Detect Population Trends:
10% Increase = 0.68200
9% Increase = 0.58800
8% Increase = 0.45000
7% Increase = 0.36200
6% Increase = 0.26400
5% Increase = 0.17600
4% Increase = 0.13800
3% Increase = 0.11800
2% Increase = 0.05400
1% Increase = 0.05600
0% Increase = 0.04200

10% Decrease = 0.35600
9% Decrease = 0.30600
8% Decrease = 0.24800
7% Decrease = 0.21200
6% Decrease = 0.19600
5% Decrease = 0.15600
4% Decrease = 0.09600
3% Decrease = 0.08800
2% Decrease = 0.06600
1% Decrease = 0.05000
0% Decrease = 0.04200

END OF OUTPUT FILE



This design is estimated to have a power of 16% to detect a 5% decrease PER YEAR.

The difference in reported powers is an artifact of the different ways the two programs compute power. TRENDS uses analytical formulae based on normal approximations, while MONITOR conducts a simulation study and reports the number of trials (in this case out of 500) that detected the trend. In any event, don't get hung up over these differences - the key point is that this proposed study has virtually no power to detect a 5% decline/year.

Program MONITOR also reports power for a range of trends; Program TRENDS reports power for a single TREND at a time, but you can quickly vary the sliding window to investigate different design options.
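MONITOR's simulation approach can be mimicked in a few lines: generate noisy counts along a known trend, fit a least-squares line, and record how often the slope is declared significant. The sketch below uses the design from the example above (start at 1000, sd 200, 5 yearly surveys, 2 counts per survey); it illustrates the idea only and is not MONITOR's actual algorithm:

```python
import random
import statistics

def trend_power(n_years=5, counts_per_year=2, start=1000.0, sd=200.0,
                yearly_decline=0.05, t_crit=3.182, n_sims=500, seed=1):
    """Approximate power to detect a linear decline by simulation.

    t_crit is the two-sided 5% critical value of a t distribution with
    n_years - 2 degrees of freedom (3.182 for the default 5-year design);
    change it if you change n_years or the alpha level.
    """
    rng = random.Random(seed)
    years = [float(t) for t in range(n_years)]
    xbar = statistics.mean(years)
    sxx = sum((x - xbar) ** 2 for x in years)
    rejections = 0
    for _ in range(n_sims):
        # Yearly observation = mean of the replicate counts for that year.
        ys = []
        for t in years:
            mu = start * (1 - yearly_decline * t)
            reps = [rng.gauss(mu, sd) for _ in range(counts_per_year)]
            ys.append(statistics.mean(reps))
        # Least-squares slope, its standard error, and the t statistic.
        ybar = statistics.mean(ys)
        slope = sum((x - xbar) * (y - ybar) for x, y in zip(years, ys)) / sxx
        intercept = ybar - slope * xbar
        resid = [y - (intercept + slope * x) for x, y in zip(years, ys)]
        mse = sum(e * e for e in resid) / (n_years - 2)
        se = (mse / sxx) ** 0.5
        if abs(slope / se) > t_crit:
            rejections += 1
    return rejections / n_sims

print(trend_power())  # low power, broadly in line with the figures above
```

Note that, like the example in the text, this assumes zero process variation: the true abundances lie exactly on the trend line and all noise is measurement error.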

2.4.5 Summary of plans

Here is a summary of some power computations to detect an average decrease over time. In all cases, the cv was assumed to be proportional to 1/√abundance.

The results are sobering. For many animal species, many years of concentrated effort will be needed to detect small effects with decent power.



Approximate power to detect a (decreasing) linear trend when monitoring x years and n obs/year. CV of initial obs (%): 10. Entries are power (%).

N years    Obs/       Average % decrease/year
monitored  year      0    2    4    6    8   10
    2        1       .    .    .    .    .    .
             3       5    5    7    9   11   15
             5       5    6    9   13   19   26
    3        1       5    5    6    7    7    9
             3       5    7   13   22   35   48
             5       5    9   20   38   58   75
    4        1       5    6    8   12   16   21
             3       5   10   26   48   70   85
             5       5   15   43   74   92   98
    5        1       5    7   13   22   32   43
             3       5   16   47   77   93   99
             5       5   26   71   95  100  100
    6        1       5   10   22   38   55   69
             3       5   25   69   94   99  100
             5       5   40   90  100  100  100
    7        1       5   13   33   57   76   88
             3       5   37   87   99  100  100
             5       5   58   98  100  100  100
    8        1       5   18   47   75   90   97
             3       5   52   96  100  100  100
             5       5   75  100  100  100  100
    9        1       5   24   62   88   97   99
             3       5   66   99  100  100  100
             5       5   88  100  100  100  100
   10        1       5   31   76   95   99  100
             3       5   79  100  100  100  100
             5       5   95  100  100  100  100

Refer to http://www.mbr-pwrc.usgs.gov/cgi-bin/monitor.pl for a web-based interface.


Approximate power to detect a (decreasing) linear trend when monitoring x years and n obs/year
CV of initial obs (%): 20

N years    Obs/    Power (%) by average % decrease/year
monitored  year       0     2     4     6     8    10
    2         1       .     .     .     .     .     .
              3       5     5     5     6     7     7
              5       5     5     6     7     8    10
    3         1       5     5     5     5     6     6
              3       5     6     7     9    12    16
              5       5     6     9    13    19    26
    4         1       5     5     6     7     8     9
              3       5     6    10    16    24    33
              5       5     7    14    25    39    53
    5         1       5     6     7     9    12    15
              3       5     8    16    27    41    55
              5       5    10    24    44    64    79
    6         1       5     6     9    13    19    24
              3       5    10    23    42    61    76
              5       5    14    37    65    84    94
    7         1       5     7    12    19    28    36
              3       5    13    34    59    78    90
              5       5    19    53    82    95    99
    8         1       5     8    16    27    38    49
              3       5    17    46    74    90    97
              5       5    26    69    93    99   100
    9         1       5    10    21    36    50    62
              3       5    22    60    86    97    99
              5       5    35    82    98   100   100
   10         1       5    11    27    45    62    74
              3       5    28    72    94    99   100
              5       5    44    92   100   100   100


Approximate power to detect a (decreasing) linear trend when monitoring x years and n obs/year
CV of initial obs (%): 30

N years    Obs/    Power (%) by average % decrease/year
monitored  year       0     2     4     6     8    10
    2         1       .     .     .     .     .     .
              3       5     5     5     5     6     6
              5       5     5     5     6     6     7
    3         1       5     5     5     5     5     5
              3       5     5     6     7     8    10
              5       5     5     7     9    11    14
    4         1       5     5     5     6     6     7
              3       5     6     7    10    13    17
              5       5     6     9    14    20    27
    5         1       5     5     6     7     8    10
              3       5     6    10    15    21    28
              5       5     7    13    23    34    46
    6         1       5     6     7     9    11    14
              3       5     7    13    22    32    43
              5       5     9    19    34    51    65
    7         1       5     6     8    11    15    19
              3       5     8    18    31    45    58
              5       5    11    27    49    68    81
    8         1       5     6    10    15    20    25
              3       5    10    24    42    59    72
              5       5    14    37    63    82    92
    9         1       5     7    12    19    26    33
              3       5    12    31    53    71    83
              5       5    18    49    76    91    97
   10         1       5     8    15    23    33    41
              3       5    15    40    65    82    91
              5       5    23    60    86    96    99


Approximate power to detect a (decreasing) linear trend when monitoring x years and n obs/year
CV of initial obs (%): 40

N years    Obs/    Power (%) by average % decrease/year
monitored  year       0     2     4     6     8    10
    2         1       .     .     .     .     .     .
              3       5     5     5     5     5     6
              5       5     5     5     5     6     6
    3         1       5     5     5     5     5     5
              3       5     5     5     6     7     8
              5       5     5     6     7     8    10
    4         1       5     5     5     5     6     6
              3       5     5     6     8    10    12
              5       5     6     7    10    13    17
    5         1       5     5     6     6     7     8
              3       5     6     8    10    14    18
              5       5     6    10    15    21    28
    6         1       5     5     6     7     8    10
              3       5     6     9    14    20    26
              5       5     7    13    21    32    42
    7         1       5     5     7     8    11    13
              3       5     7    12    19    28    37
              5       5     8    18    30    44    57
    8         1       5     6     8    10    13    16
              3       5     8    15    26    37    48
              5       5    10    23    41    57    71
    9         1       5     6     9    13    17    21
              3       5     9    20    33    47    59
              5       5    12    30    52    70    82
   10         1       5     7    10    15    21    26
              3       5    11    25    42    57    70
              5       5    15    39    63    80    90


Approximate power to detect a (decreasing) linear trend when monitoring x years and n obs/year
CV of initial obs (%): 50

N years    Obs/    Power (%) by average % decrease/year
monitored  year       0     2     4     6     8    10
    2         1       .     .     .     .     .     .
              3       5     5     5     5     5     5
              5       5     5     5     5     6     6
    3         1       5     5     5     5     5     5
              3       5     5     5     6     6     7
              5       5     5     6     6     7     8
    4         1       5     5     5     5     5     6
              3       5     5     6     7     8     9
              5       5     5     6     8    10    13
    5         1       5     5     5     6     6     7
              3       5     5     7     8    11    13
              5       5     6     8    11    15    20
    6         1       5     5     6     6     7     8
              3       5     6     8    11    15    19
              5       5     6    10    15    22    29
    7         1       5     5     6     7     9    10
              3       5     6     9    14    20    25
              5       5     7    13    21    31    40
    8         1       5     5     7     8    10    12
              3       5     7    12    18    26    33
              5       5     8    17    28    40    52
    9         1       5     6     7    10    12    15
              3       5     8    14    23    33    42
              5       5    10    21    36    51    63
   10         1       5     6     8    11    15    18
              3       5     9    17    29    40    51
              5       5    11    27    45    61    74


Approximate power to detect a (decreasing) linear trend when monitoring x years and n obs/year
CV of initial obs (%): 60

N years    Obs/    Power (%) by average % decrease/year
monitored  year       0     2     4     6     8    10
    2         1       .     .     .     .     .     .
              3       5     5     5     5     5     5
              5       5     5     5     5     5     6
    3         1       5     5     5     5     5     5
              3       5     5     5     5     6     6
              5       5     5     5     6     7     7
    4         1       5     5     5     5     5     5
              3       5     5     6     6     7     8
              5       5     5     6     7     9    10
    5         1       5     5     5     5     6     6
              3       5     5     6     7     9    11
              5       5     6     7     9    12    15
    6         1       5     5     5     6     6     7
              3       5     6     7     9    12    14
              5       5     6     8    12    17    22
    7         1       5     5     6     7     7     8
              3       5     6     8    11    15    19
              5       5     7    10    16    23    30
    8         1       5     5     6     7     9    10
              3       5     6    10    14    19    25
              5       5     7    13    21    30    39
    9         1       5     5     7     8    10    12
              3       5     7    11    18    24    31
              5       5     8    16    27    38    48
   10         1       5     6     7     9    12    14
              3       5     7    14    22    30    38
              5       5     9    20    33    47    58


Approximate power to detect a (decreasing) linear trend when monitoring x years and n obs/year
CV of initial obs (%): 70

N years    Obs/    Power (%) by average % decrease/year
monitored  year       0     2     4     6     8    10
    2         1       .     .     .     .     .     .
              3       5     5     5     5     5     5
              5       5     5     5     5     5     5
    3         1       5     5     5     5     5     5
              3       5     5     5     5     6     6
              5       5     5     5     6     6     7
    4         1       5     5     5     5     5     5
              3       5     5     5     6     6     7
              5       5     5     6     7     8     9
    5         1       5     5     5     5     6     6
              3       5     5     6     7     8     9
              5       5     5     6     8    10    12
    6         1       5     5     5     6     6     7
              3       5     5     6     8    10    12
              5       5     6     8    10    14    17
    7         1       5     5     6     6     7     8
              3       5     6     7    10    12    15
              5       5     6     9    13    18    23
    8         1       5     5     6     7     8     9
              3       5     6     8    12    15    19
              5       5     7    11    17    23    30
    9         1       5     5     6     7     9    10
              3       5     6    10    14    19    24
              5       5     7    13    21    29    38
   10         1       5     6     7     8    10    12
              3       5     7    11    17    23    29
              5       5     8    16    26    36    46


Approximate power to detect a (decreasing) linear trend when monitoring x years and n obs/year
CV of initial obs (%): 80

N years    Obs/    Power (%) by average % decrease/year
monitored  year       0     2     4     6     8    10
    2         1       .     .     .     .     .     .
              3       5     5     5     5     5     5
              5       5     5     5     5     5     5
    3         1       5     5     5     5     5     5
              3       5     5     5     5     5     6
              5       5     5     5     5     6     6
    4         1       5     5     5     5     5     5
              3       5     5     5     6     6     7
              5       5     5     6     6     7     8
    5         1       5     5     5     5     5     6
              3       5     5     6     6     7     8
              5       5     5     6     7     9    11
    6         1       5     5     5     6     6     6
              3       5     5     6     7     9    10
              5       5     6     7     9    11    14
    7         1       5     5     5     6     6     7
              3       5     5     7     8    11    13
              5       5     6     8    11    15    19
    8         1       5     5     6     6     7     8
              3       5     6     8    10    13    16
              5       5     6     9    14    19    24
    9         1       5     5     6     7     8     9
              3       5     6     9    12    16    20
              5       5     7    11    17    24    30
   10         1       5     5     6     8     9    10
              3       5     6    10    14    19    24
              5       5     7    13    21    29    37


Approximate power to detect a (decreasing) linear trend when monitoring x years and n obs/year
CV of initial obs (%): 90

N years    Obs/    Power (%) by average % decrease/year
monitored  year       0     2     4     6     8    10
    2         1       .     .     .     .     .     .
              3       5     5     5     5     5     5
              5       5     5     5     5     5     5
    3         1       5     5     5     5     5     5
              3       5     5     5     5     5     6
              5       5     5     5     5     6     6
    4         1       5     5     5     5     5     5
              3       5     5     5     6     6     6
              5       5     5     5     6     7     7
    5         1       5     5     5     5     5     6
              3       5     5     6     6     7     7
              5       5     5     6     7     8     9
    6         1       5     5     5     5     6     6
              3       5     5     6     7     8     9
              5       5     5     7     8    10    12
    7         1       5     5     5     6     6     7
              3       5     5     6     8     9    11
              5       5     6     7    10    13    16
    8         1       5     5     6     6     7     7
              3       5     6     7     9    11    14
              5       5     6     8    12    16    20
    9         1       5     5     6     6     7     8
              3       5     6     8    10    13    16
              5       5     6    10    15    20    25
   10         1       5     5     6     7     8     9
              3       5     6     9    12    16    20
              5       5     7    11    18    24    30


Approximate power to detect a (decreasing) linear trend when monitoring x years and n obs/year
CV of initial obs (%): 100

N years    Obs/    Power (%) by average % decrease/year
monitored  year       0     2     4     6     8    10
    2         1       .     .     .     .     .     .
              3       5     5     5     5     5     5
              5       5     5     5     5     5     5
    3         1       5     5     5     5     5     5
              3       5     5     5     5     5     5
              5       5     5     5     5     6     6
    4         1       5     5     5     5     5     5
              3       5     5     5     5     6     6
              5       5     5     5     6     6     7
    5         1       5     5     5     5     5     5
              3       5     5     5     6     6     7
              5       5     5     6     7     7     9
    6         1       5     5     5     5     6     6
              3       5     5     6     6     7     8
              5       5     5     6     8     9    11
    7         1       5     5     5     6     6     6
              3       5     5     6     7     9    10
              5       5     6     7     9    11    14
    8         1       5     5     5     6     6     7
              3       5     5     7     8    10    12
              5       5     6     8    11    14    17
    9         1       5     5     6     6     7     7
              3       5     6     7     9    12    14
              5       5     6     9    13    17    21
   10         1       5     5     6     7     7     8
              3       5     6     8    11    14    17
              5       5     7    10    15    20    25


CHAPTER 2. DETECTING TRENDS OVER TIME

2.5 Testing for common trend - ANCOVA

In some cases, it is of interest to test if the same trend is occurring in a number of locations. Or, the data from a single site may be so poor that trends cannot be detected, but by pooling the sites, a common trend over sites can be detected because of the increased sample size. This technique can also be used for adjusting for seasonality, as will be seen later.

The Analysis of Covariance (ANCOVA) does both. Groups of data (e.g. from the same location) are identified by a nominal or ordinal scale variable, and time is also measured for each group.

Typically, ANCOVA is used to check if the regression lines for the groups are parallel. If there is evidence that the individual regression lines are not parallel, then a separate regression line (trend line) must be fit for each group for prediction purposes. If there is no evidence of non-parallelism, then the next task is to see if the lines are coincident, i.e. have both the same intercept and the same slope. If there is evidence that the lines are not coincident, then a series of parallel lines is fit to the data. All of the data are used to estimate the common slope. If there is no evidence that the lines are not coincident, then all of the data can simply be pooled together and a single regression line fit to all of the data.

The three possibilities for the case of two groups (the extension to many groups is obvious) are: non-parallel lines, parallel but not coincident lines, and coincident lines.


2.5.1 Assumptions

As before, it is important to verify the assumptions underlying the analysis before it is started. As ANCOVA is a combination of ANOVA and Regression, the assumptions are similar.

• The response variable Y is continuous (interval or ratio scaled).

• The Y are a random sample from the various time points measured.

• There must be no outliers. Plot Y vs. X for each group separately to see if there are any points that don't appear to follow the straight line.

• The relationship between Y and X must be linear for each group. 18 Check this assumption by looking at the individual plots of Y vs. X for each group.

• The variance must be equal for both groups around their respective regression lines. Check that the spread of the points is equal across the range of X and that the spread is comparable between the two groups. This can be formally checked by looking at the MSE from a separate regression line for each group, as the MSE estimates the variance of the data around the regression line.

• The residuals must be normally distributed around the regression line for each group. This assumption can be checked by examining the residual plots from the fitted model for evidence of non-normality. For large samples, this is not too crucial; for small sample sizes, you will likely have inadequate power to detect anything but gross departures.

18 It is possible to relax this assumption as well, but this is, again, beyond the scope of this course.

2.5.2 Statistical model

You saw in earlier chapters that a statistical model is a powerful shorthand to describe what analysis is fit to a set of data. The model must describe the treatment structure, the experimental unit structure, and the randomization structure. Let Y be the response variable, X be the continuous predictor variable, and Group be the group factor.

As ANCOVA is a combination of ANOVA and regression, it will not be surprising that the models will have terms corresponding to both Group and X. Again, there are three cases.

If the lines for each group are not parallel:


the appropriate model is

Y = Group X Group*X

The terms can be in any order. This is read as: variation in Y can be explained by a common intercept (never specified), followed by group effects (different intercepts), a common slope (trend) on X, and an "interaction" between Group and X, which is interpreted as different slopes (different trends) for each group. This model is almost equivalent to fitting a separate regression line for each group. The only advantage to using this joint model for all groups is similar to that enjoyed by using ANOVA: all of the groups contribute to a better estimate of the residual error. If the number of data points per group is small, this can lead to improvements in precision compared to fitting each group individually, and improved power to detect trends.

If the lines are parallel across groups, but not coincident:


the appropriate model is

Y = Group X

The terms can be in any order. The only difference between this and the previous model is that this simpler model lacks the Group*X "interaction" term. It is not surprising, then, that a statistical test to see if this simpler model is tenable corresponds to examining the p-value of the test on the Group*X term from the complex model. This is exactly analogous to testing for interaction effects between factors in a two-factor ANOVA.

Lastly, if the lines are coincident:


the appropriate model is

Y = X

Now the difference between this model and the previous one is the Group term, which has been dropped. Again, it is not surprising that this corresponds to the test of the Group effect in the formal statistical test. The test for coincident lines should only be done if there is insufficient evidence against parallelism. While it is possible to test for a non-zero slope, this is rarely done.
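The sequence of comparisons just described (test Group*X first, then Group) is the extra-sum-of-squares F-test. These notes carry it out in JMP; as a minimal sketch of the same arithmetic, the Python fragment below builds the three design matrices by hand for hypothetical two-group data (made-up numbers, not the dioxin data of the next section) and computes the two F statistics.

```python
import numpy as np

def rss(X, y):
    """Residual sum of squares from a least-squares fit of y on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ beta
    return float(r @ r)

def extra_ss_f(X_full, X_reduced, y):
    """Extra-sum-of-squares F statistic for the terms dropped from the full model."""
    rss_full, rss_red = rss(X_full, y), rss(X_reduced, y)
    df_extra = X_full.shape[1] - X_reduced.shape[1]
    df_resid = len(y) - X_full.shape[1]
    return ((rss_red - rss_full) / df_extra) / (rss_full / df_resid)

# hypothetical data: two sites, 14 years each, truly parallel declining lines
rng = np.random.default_rng(7)
year = np.tile(np.arange(14, dtype=float), 2)
site_b = np.repeat([0.0, 1.0], 14)           # 0/1 indicator for the second site
y = 5.0 - 0.08 * year - 0.45 * site_b + rng.normal(0.0, 0.1, 28)

ones = np.ones(28)
X_sep = np.column_stack([ones, site_b, year, site_b * year])  # Y = Group X Group*X
X_par = np.column_stack([ones, site_b, year])                 # Y = Group X
X_one = np.column_stack([ones, year])                         # Y = X

f_inter = extra_ss_f(X_sep, X_par, y)  # test of Group*X: parallelism
f_group = extra_ss_f(X_par, X_one, y)  # test of Group: coincidence, given parallel lines
```

Here f_inter should be small (no evidence against parallelism, since the simulated slopes really are equal) while f_group should be large (the lines are not coincident); a p-value would come from the F distribution with the corresponding degrees of freedom.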

2.5.3 Example: Degradation of dioxin - pooling locations

An unfortunate byproduct of pulp-and-paper production used to be dioxins, a class of very hazardous materials. This material was discharged into waterways with the pulp-and-paper effluent, where it bioaccumulated in living organisms such as crabs. Newer processes have eliminated this byproduct, but the dioxins already in the organisms take a long time to degrade.

Government environmental protection agencies take samples of crabs from affected areas each year and measure the amount of dioxins in the tissue. The following example is based on a real study.

Each year, four crabs are captured from two monitoring stations which are situated quite a distance apart on the same inlet where the pulp mill was located. The liver is excised, and the livers from all four crabs are composited together into a single sample. 19 The dioxin levels in this composite sample are measured. As there are many different forms of dioxins with different toxicities, a summary measure, called the Total Equivalent Dose (TEQ), is computed from the sample.

19 Compositing is a common analytical tool. There is little loss of useful information induced by the compositing process - the only loss of information is the among-individual-sample variability, which can be used to determine the optimal allocation between samples within years and the number of years to monitor.

As seen earlier, the appropriate response variable is log(TEQ).

Is the rate of decline the same for both sites? Did the sites have the same initial concentration?

Here are the raw data:


Site  Year    TEQ      log(TEQ)
a     1990    179.05   5.19
a     1991     82.39   4.41
a     1992    130.18   4.87
a     1993     97.06   4.58
a     1994     49.34   3.90
a     1995     57.05   4.04
a     1996     57.41   4.05
a     1997     29.94   3.40
a     1998     48.48   3.88
a     1999     49.67   3.91
a     2000     34.25   3.53
a     2001     59.28   4.08
a     2002     34.92   3.55
a     2003     28.16   3.34
b     1990     93.07   4.53
b     1991    105.23   4.66
b     1992    188.13   5.24
b     1993    133.81   4.90
b     1994     69.17   4.24
b     1995    150.52   5.01
b     1996     95.47   4.56
b     1997    146.80   4.99
b     1998     85.83   4.45
b     1999     67.72   4.22
b     2000     42.44   3.75
b     2001     53.88   3.99
b     2002     81.11   4.40
b     2003     70.88   4.26

JMP analysis

The raw data are available in Dioxin2.JMP in the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms.

The data can be entered into JMP in the usual fashion. Make sure that Site is a nominal scale variable, and that Year is a continuous variable.


In cases with multiple groups, it is often helpful to use a different plotting symbol for each group. This is easily accomplished in JMP by selecting the rows (say, for site a) and using Rows→Markers to set the plotting symbol for the selected rows.

The final data sheet has two different plotting symbols for the two sites:


Before fitting the various models, begin with an exploratory examination of the data, looking for outliers and checking the assumptions.

Each year's data are independent of other years' data, as a different set of crabs was selected. Similarly, the data from one site are independent of the data from the other site. This is an observational study, so the question arises of how exactly the crabs were selected. In this study, crab pots were placed on the sea floor to capture the available crabs in the area.

Whenever multiple sets of data are collected over time, there is always the worry about common year effects (also known as process error). For example, if the response variable were body mass of small fish, then poor growing conditions in a single year could depress the growth of fish in all locations. This would violate the assumption of independence, as the residual at one site in a year would be related to the residual at the other site in the same year. You tend to see the residuals "paired", with negative residuals from the fitted line at one site matched (by year) with negative residuals at the other site. In this case, this is unlikely to have occurred. Degradation of dioxin is relatively independent of external environmental factors, and the variation that we see about the two regression lines is related solely to sampling error based on the particular set of crabs that were sampled. It seems unlikely that the residuals are related. 20

Use the Analyze->Fit Y-by-X platform and specify log(TEQ) as the Y variable and Year as the X variable:

20 If you actually try to fit a process error term to this model, you find that the estimated process error is zero.


Then specify a grouping variable by clicking on the pop-down menu near the Bivariate Fit window-title line, and selecting Site as the grouping variable:

Now select Fit Line from the same pop-down menu:


to get separate lines fit for each group.

The relationships for each site appear to be linear. The actual estimates are also presented:

The scatter plot doesn't show any obvious outliers. The estimated slope for site a is −.107 (se .02), while the estimated slope for site b is −.06 (se .02). The 95% confidence intervals (not shown on the output, but available by right-clicking/ctrl-clicking on the parameter estimates table) overlap considerably, so the slopes could be the same for the two groups.

The MSE from site a is .10 and the MSE from site b is .12. These correspond to standard deviations of √.10 = .32 and √.12 = .35, which are very similar, so the assumption of equal standard deviations seems reasonable.

The residual plots (not shown) also look reasonable.

The assumptions appear to be satisfied, so let us now fit the various models.

First, fit the model allowing for separate lines for each group. The Analyze->Fit Model platform is used. The terms can be in any order and correspond to the model described earlier. This gives the following output:


The regression plot is just the same as the plot of the two individual lines seen earlier. What is of interest is the Effect test for the Site*Year interaction. Here the p-value is not very small, so there is no evidence that the lines are not parallel.

We need to refit the model, dropping the interaction term:


which gives the following regression plot:


This shows the fitted parallel lines. The effect tests now have a small p-value for the Site effect, indicating that the lines are not coincident, i.e. they are parallel with different intercepts. This means that the rate of decay of the dioxin appears to be equal at both sites, but the initial concentrations appear to be different.

The estimated (common) slope is found in the Parameter Estimates portion of the output:


and has a value of −.083 (se .016). Because the analysis was done on the log-scale, this implies that the dioxin levels changed by a factor of exp(−.083) = .92 from year to year, i.e. about an 8% decline each year. The 95% confidence interval for the slope on the log-scale is (−.12, −.05), which corresponds to a factor between exp(−.12) = .88 and exp(−.05) = .95 per year, i.e. between a 12% and a 5% decline per year. 21

21 The confidence intervals are found by right-clicking/ctrl-clicking in the Parameter Estimates table.

While it is possible to estimate the difference between the parallel lines from the Parameter Estimates table, it is easier to look at the section of the output corresponding to the Site effects. Here the estimated LSMeans correspond to the log(TEQ) at the average value of Year - not really of interest. As in previous chapters, the difference in means is often of more interest than the raw means themselves. This is found by using the pop-down menu and selecting an LSMeans Contrast or Multiple Comparison procedure, to give:

The estimated difference between the lines (<strong>on</strong> the log-scale) is estimated to be 0.46 (se .13). Because the<br />

analysis was d<strong>on</strong>e <strong>on</strong> the log-scale, this corresp<strong>on</strong>ds to a ratio of exp(.46) = 1.58 in dioxin levels between<br />

the two sites, i.e. site b has 1.58 times the dioxin level as site a. Because the slopes are parallel and declining,<br />

the dioxin levels are falling in both sites, but the 1.58 times ratio remains c<strong>on</strong>sistent.<br />


Finally, the actual by Predicted plot (not shown here), the leverage plots (not shown here) and the residual plot don't show any evidence of a problem in the fit.

2.5.4 Change in yearly average temperature with regime shifts

The ANCOVA technique can also be used for trends when there are KNOWN regime shifts in the series. The case when the timing of the shift is unknown is more difficult and not covered in this course.

For example, consider a time series of annual average temperatures measured at Tuscaloosa, Alabama from 1901 to 2001. It is well known that shifts in temperature can occur whenever the instrument or location or observer or other characteristics of the station change.

The data are available in the JMP datafile tuscaloosa-avg-temp.jmp in the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms.

A portion of the raw data is shown below:


and a time series plot of the data:

shows a shift in the readings in 1939 (thermometer changed), 1957 (station moved), and possibly in 1987 (location and thermometer changed).

It turns out that cases where the number of epochs tends to increase with the number of data points raise some serious technical issues with the properties of the estimators. See

Lu, Q. and Lund, R.B. (2007).


Simple linear regression with multiple level shifts. Canadian Journal of Statistics, 35, 447-458,

for details. Basically, if the number of parameters tends to increase with sample size, this violates one of the assumptions for maximum likelihood estimation. This would lead to estimates which may not even be consistent! For example, suppose that the recording conditions changed every two years. Then each pair of data points should still be able to estimate the common slope, but this corresponds to the well known problem with case-control studies where the number of pairs increases with total sample size. Fortunately, Lu and Lund (2007) showed that this violation is not serious.

The analysis proceeds as in the dioxin example with two sites, except that now the series is broken into different epochs corresponding to the sets of years when conditions remained stable at the recording site. In this case, this corresponds to the years 1901-1938 (inclusive); 1940-1956 (inclusive); 1958-1986 (inclusive); and 1989-2000 (inclusive). Note that the years 1939, 1957, and 1987 are NOT used because the average temperature in these three years is an amalgam of two different recording conditions 22.

For example, the data file (around the first regime change) may look like:

Note that Year and Avg Temp are both set to have a continuous scale, but Epoch should have a nominal or ordinal scale.

Model fitting proceeds as before by first fitting the model:

AvgTemp = Year Epoch Year*Epoch

to see if the change in AvgTemp is consistent among Epochs and then fitting the model:

AvgTemp = Year Epoch

to estimate the common trend (after adjusting for shifts among the Epochs).
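JMP builds the design matrix for these models behind the scenes. As a rough illustration of what the common-slope (parallel lines) model amounts to as least squares, here is a sketch on made-up data with two hypothetical epochs; all variable names and values are invented for illustration:

```python
import numpy as np

# Hypothetical data: two epochs sharing a slope but with a level shift.
rng = np.random.default_rng(1)
year = np.arange(1901, 1921, dtype=float)
epoch = np.array([0] * 10 + [1] * 10)       # 0/1 epoch indicator
temp = 15.0 + 0.03 * (year - 1901) - 1.5 * epoch + rng.normal(0, 0.2, year.size)

# Design matrix for AvgTemp = Year + Epoch (parallel lines):
# an intercept, a common Year slope, and a shift for epoch 1.
X = np.column_stack([np.ones_like(year), year - 1901, epoch])
beta, *_ = np.linalg.lstsq(X, temp, rcond=None)
print(beta)   # [intercept, common slope, epoch shift]
```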

The Analyze->Fit Model platform is used:

22 If the exact day of the change were known, it would be possible to weight the two epochs in these years and include the data points.


There is no strong evidence that the slopes are different among the epochs (p=.10) despite the plot showing a potentially differential slope in the 3rd epoch:


The simpler model with common slopes is then fit:


with fitted (common slope) lines:


No further model simplification is possible and there is evidence that the common slope is different from zero:

The estimated change in average temperature is:


i.e. an estimated increase of .033 (SE .006) per year. The 95% confidence interval does not cover 0.

The residual plots (against predicted and the order in which the data were collected):


show no obvious problems.

Whenever time series data are used, autocorrelation should be investigated. The Durbin-Watson test is applied to the residuals:

with no obvious problem detected.

The leverage plot (against year)


also reveals nothing amiss.

A more sophisticated analysis can be fit using SAS, but isn't needed. The sample program and output are available in the Sample Program Library.

2.6 Dealing with Autocorrelation

Short time series (10-50 observations) are common in environmental and ecological studies. It is well known that when data are collected over time, the usual assumption of errors (deviations above and below the regression line) being independent may not be true.

This is a key assumption of regression analysis. What it implies is that if the data point for a particular


year happens to be above the line, it has no influence on whether the data point for the next year is also above the line. In many cases this is not true, because of long-term trends that affect data points for several years in a row. For example, precipitation often follows multi-year cycles where a drought year is more often followed by another drought year than by a return to normal rainfall. If the level of precipitation affects the response, you may see an induced autocorrelation (also known as serial correlation). The uncritical application of regression to these types of data without accounting for the autocorrelation over time is known as pseudo-replication over time (Hurlbert, 1984).

This problem and how to deal with it are well known in economics and related disciplines, but less well known in ecology.

Some articles that discuss the problem and solutions are:

• Bence, J. R. (1995). Analysis of short time series: Correction for autocorrelation. Ecology 76, 628-639. A nice non-technical review of the subject and how to deal with it in ecology.

• Roy A., Falk B. and Fuller W.A. (2004). Testing for Trend in the Presence of Autoregressive Error. Journal of the American Statistical Association, 99, 1082-1091. This article is VERY technical, but the reference list provides a nice summary of relevant papers about this problem.

In some previous examples, we looked at the Durbin-Watson statistic to examine if there was evidence of autocorrelation. What is the Durbin-Watson test? What is autocorrelation? Why is this a problem? How do we fit models accounting for autocorrelation?

In order to understand autocorrelation, we need to step back and look at the model for regression analysis in a little more detail. Recall that we often used a shorthand notation to represent a linear regression problem:

Y = Time

where Y is the response variable, and Time is the effect of time. Mathematically, the model is written as:

Y_i = β_0 + β_1 t_i + ε_i

where β_0 is the intercept, β_1 is the slope, and ε_i is the deviation of the i-th data point from the actual underlying line.

The usual assumption made in regression analysis is that the ε_i are independent of each other. In autocorrelated models, this is not true. Mathematically, the simplest autocorrelation process (known as an AR(1) process) has:

ε_{i+1} = ρ ε_i + a_i

where the a_i are now independent and ρ is the autocorrelation coefficient.
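A short simulation makes the AR(1) definition concrete. This is a generic sketch (the function name, seed, and parameter values are arbitrary):

```python
import numpy as np

# Simulate AR(1) deviations: eps[i] = rho * eps[i-1] + a[i],
# where the a[i] are independent normal "innovations".
def ar1_noise(n, rho, sigma=1.0, seed=0):
    rng = np.random.default_rng(seed)
    eps = np.zeros(n)
    for i in range(1, n):
        eps[i] = rho * eps[i - 1] + rng.normal(0, sigma)
    return eps

eps = ar1_noise(500, rho=0.8)
# Successive deviations are strongly correlated when rho is near 1.
lag1 = np.corrcoef(eps[:-1], eps[1:])[0, 1]
print(round(lag1, 2))   # close to 0.8 for a long series
```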

In the same way that regular correlation between two variables ranges from -1 to +1, so too does autocorrelation. An autocorrelation of 0 would indicate no correlation between successive deviations about the regression line as ε_i would have no effect on ε_{i+1}; an autocorrelation close to 1 would indicate very


high correlation between successive deviations; an autocorrelation close to -1 (very rare in ecological studies) would indicate a negative influence, i.e. high deviations in one year are typically followed by high but negative deviations in subsequent years. 23

The following plots are some examples of autocorrelated data about the same underlying linear trend with the associated residual plots:

23 Negative autocorrelation can be induced, for example, if there is a cost to breeding, so that a successful season of breeding is followed by a year of not breeding, etc.

[Figure: simulated time series ("baseline" response vs. time) with AR(1) autocorrelated deviations about the same underlying linear trend, for rho = -0.95, -0.90, -0.80, -0.60, -0.40, -0.20, 0.00, 0.20, 0.40, 0.60, 0.80, 0.90, and 0.95. Each panel shows the series with its fitted line, paired with the corresponding residual plot against time.]



If the autocorrelation is close to -1, then points above the underlying trend are usually followed immediately by points below the underlying trend. The fitted line will be close to the underlying trend. The residual plot will show the same pattern.

If the autocorrelation is close to 1, then you will see long runs of points above the underlying trend line and long runs of points below the underlying line. DANGER! In cases of very high autocorrelation with short time series, you can be drastically misled by the data! If you examine the plots above, you see that in the case of high positive autocorrelation, the points tended to stay above or below the underlying trend line for long periods of time. If the time series is short, you may never see the series dip below the real trend line, the fitted line (shown in the above plots) may be completely misleading, and there is no way to detect this! Ironically, with short time series (e.g. fewer than 30 data points), it will be very difficult to detect high positive autocorrelation, and this is exactly the time when it can cause the most damage by giving misleading results!

If the autocorrelation is close to 0, the points will be randomly scattered about the underlying trend line, the fitted line will be close to the underlying trend line, and the residuals should appear to be randomly scattered about 0.

In many cases, if you have fewer than 30 data points, it will be very difficult to observe or detect any autocorrelation unless it is extreme!

What are the effects of autocorrelation? In most cases in ecology the autocorrelation tends to be positive. This has the following effects:

• Estimates of the slope and intercept are still unbiased, but they are less efficient (i.e. the true standard error is larger) than estimates of the same process in the absence of autocorrelation. This may seem to be contradicted by my statement above that in the presence of high positive autocorrelation and short time series the data may be very misleading, but that is an artifact of having a very short time series. With a long time series you will see that the data run over and under the trend line in long waves, and the fitted line will once again be close to the actual underlying trend.

• The reported variance around the regression line (MSE) may seriously underestimate the true variance.

• Unfortunately, while the estimates of the slope and intercept are usually not affected greatly, the reported standard errors can be misleading. In the case of positive or negative autocorrelation, the reported standard errors obtained when a line is fit assuming no autocorrelation are typically too small, i.e. the estimates look more precise than they really are.

• Reported confidence intervals ignoring autocorrelation tend to be too narrow.

• The p-values from hypothesis testing tend to be too small, i.e. you tend to detect differences that are not real too often.

The autocorrelation can be estimated from the data in many ways. In one method, a regression line is fit


to the data, the residuals are found, and then the autocorrelation is estimated as:

ρ̂ = ( Σ_{i=2}^{T} e_i e_{i−1} ) / ( Σ_{i=1}^{T−1} e_i² )

where e_i is the residual for the i-th observation. Bence (1995) points out that this often underestimates the autocorrelation and provides some corrected estimates. More modern methods estimate the autocorrelation using a technique called maximum likelihood, and these often perform better than these two-step methods.
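The two-step estimate can be sketched directly: fit the ordinary regression line, take the residuals, and apply the lag-1 formula above. The data here are simulated for illustration; the helper names are invented:

```python
import numpy as np

def lag1_autocorr(e):
    # rho-hat = sum_{i=2}^T e_i * e_{i-1}  /  sum_{i=1}^{T-1} e_i^2
    return np.sum(e[1:] * e[:-1]) / np.sum(e[:-1] ** 2)

# Toy series: linear trend plus AR(1) deviations with true rho = 0.6.
rng = np.random.default_rng(42)
t = np.arange(100, dtype=float)
eps = np.zeros(t.size)
for i in range(1, t.size):
    eps[i] = 0.6 * eps[i - 1] + rng.normal()
y = 10 + 0.5 * t + eps

# Step 1: fit the ordinary regression line and take the residuals.
slope, intercept = np.polyfit(t, y, 1)
resid = y - (intercept + slope * t)

# Step 2: estimate the autocorrelation from the residuals.
# As Bence (1995) notes, this tends to come out somewhat below the true rho.
print(round(lag1_autocorr(resid), 2))
```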

As a rule of thumb, the reported standard errors obtained from fitting a regression ignoring autocorrelation should be inflated by a factor of sqrt((1 + ρ)/(1 − ρ)). For example, if the actual autocorrelation is 0.6, then the standard errors (from an analysis ignoring autocorrelation) should be inflated by a factor of sqrt((1 + .6)/(1 − .6)) = 2, i.e. multiply the reported standard errors ignoring autocorrelation by a factor of 2. Consequently, unless the autocorrelation is very close to 1 or -1, the inflation factor is usually pretty small. 24
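The inflation factor is a one-line computation. A sketch reproducing the worked example (the function name is invented):

```python
import math

def se_inflation(rho):
    # Rule-of-thumb multiplier for standard errors that were
    # computed ignoring AR(1) autocorrelation.
    return math.sqrt((1 + rho) / (1 - rho))

print(se_inflation(0.6))                 # 2.0, matching the worked example
for rho in (0.2, 0.5, 0.9):
    print(rho, round(se_inflation(rho), 2))
```

Note how slowly the factor grows until rho gets large: it is about 1.22 at rho = 0.2 but over 4 at rho = 0.9.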

A slightly simpler formula that also seems to work well in practice is that the effective sample size in the presence of autocorrelation is found as:

n_effective = 1 + (n − 1)(1 − ρ)

This is based on the observation that the first observation counts as a full data point, but each additional data point only counts as (1 − ρ) of a data point. Then use the fact that for most statistical problems the standard errors decrease by a factor of sqrt(n) to estimate the effect upon the precision of the estimates. For example, if n / n_effective = 2, then the reported standard errors (computed ignoring autocorrelation) should be inflated by a factor of about sqrt(2).
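The effective-sample-size rule can be checked the same way. A sketch, using an invented n of 31 and rho of 0.5 purely for illustration:

```python
import math

def n_effective(n, rho):
    # First point counts fully; each later point counts as (1 - rho).
    return 1 + (n - 1) * (1 - rho)

n, rho = 31, 0.5
ne = n_effective(n, rho)             # 1 + 30 * 0.5 = 16
print(ne)
# SEs scale like 1/sqrt(n), so the implied SE inflation factor is:
print(round(math.sqrt(n / ne), 2))   # sqrt(31/16), about 1.39
```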

The Durbin-Watson test statistic is a popular measure of autocorrelation. It is computed as:

d = ( Σ_{i=2}^{N} (e_i − e_{i−1})² ) / ( Σ_{i=1}^{N} e_i² ) ≈ ( 2 Σ_{i=1}^{N} e_i² − 2 Σ_{i=2}^{N} e_i e_{i−1} ) / ( Σ_{i=1}^{N} e_i² ) ≈ 2(1 − ρ)

Consequently, if the autocorrelation is close to 0, the Durbin-Watson statistic should be close to 2. The p-value for the statistic is found from tables, but most modern software can compute it automatically.
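The statistic itself is easy to compute from residuals. A sketch on simulated residuals, illustrating that independent residuals give d near 2 while positively autocorrelated residuals pull d well below 2 (via d ≈ 2(1 − ρ)):

```python
import numpy as np

def durbin_watson(e):
    # d = sum_{i=2}^N (e_i - e_{i-1})^2  /  sum_{i=1}^N e_i^2
    e = np.asarray(e, dtype=float)
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

rng = np.random.default_rng(0)

# Independent residuals: d should be near 2.
e_indep = rng.normal(size=1000)
print(round(durbin_watson(e_indep), 1))

# AR(1) residuals with rho = 0.8: d should be far below 2.
e_pos = np.zeros(1000)
for i in range(1, 1000):
    e_pos[i] = 0.8 * e_pos[i - 1] + rng.normal()
print(round(durbin_watson(e_pos), 1))
```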

Remedial measures: If there is strong evidence for autocorrelation, there are a number of remedial measures that can be taken:

• Use ordinary regression and inflate the reported standard errors by the inflation factor mentioned above. This is a very approximate solution and is not often used now that modern software is available.

24 As Bence (1995) points out, the correction factor assumes that you know the value of ρ. Often this is difficult to estimate, and typically estimates are too close to 0, resulting in a correction factor that is too small as well. He provides a bias-adjusted correction factor.


• A major cause of autocorrelation is the omission of an important explanatory variable. The example of precipitation that tends to occur in cycles was noted earlier. In this case, a more complex regression model (multiple regression) that looks at the simultaneous effect of more than two variables would be appropriate. Unfortunately this is beyond the scope of these notes.

• Transform the variables before using simple regression methods that ignore autocorrelation. There are two popular transformations, the Cochrane-Orcutt and Hildreth-Lu procedures. Both procedures start by estimating the autocorrelation ρ by fitting the ordinary regression line, obtaining the residuals, and then using the residuals to estimate the autocorrelation. Then the data are transformed by subtracting the estimated portion due to autocorrelation. Finally, the transformed data are refit using ordinary regression (again ignoring autocorrelation). These approaches are falling out of favor because of the availability of the integrated procedures below.

• Use a more sophisticated fitting procedure that explicitly estimates the autocorrelation and accounts for it. This can be done using maximum likelihood or extensions of the previous methods, e.g. the Yule-Walker methods which fit generalized least squares. Many statistical packages offer such procedures, e.g. SAS's PROC AUTOREG is specially designed to deal with autocorrelation and uses the Yule-Walker methods; SAS's Proc MIXED uses maximum likelihood methods.
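The Cochrane-Orcutt transformation in the second bullet can be sketched in a few lines. This is a single-pass illustration on simulated data (in practice the procedure is usually iterated, and software such as PROC AUTOREG handles the details); the function names and data are invented:

```python
import numpy as np

def ols(x, y):
    slope, intercept = np.polyfit(x, y, 1)
    return intercept, slope

def cochrane_orcutt(x, y):
    # Step 1: ordinary regression and residuals.
    b0, b1 = ols(x, y)
    e = y - (b0 + b1 * x)
    # Step 2: estimate rho from the lag-1 residuals.
    rho = np.sum(e[1:] * e[:-1]) / np.sum(e[:-1] ** 2)
    # Step 3: quasi-difference the data to subtract the AR(1) part,
    # then refit by ordinary least squares.
    y_star = y[1:] - rho * y[:-1]
    x_star = x[1:] - rho * x[:-1]
    a0, b1_new = ols(x_star, y_star)
    b0_new = a0 / (1 - rho)   # recover the original-scale intercept
    return b0_new, b1_new, rho

# Toy data: trend of 0.5 per time step plus AR(1) errors (rho = 0.7).
rng = np.random.default_rng(7)
x = np.arange(80, dtype=float)
e = np.zeros(x.size)
for i in range(1, x.size):
    e[i] = 0.7 * e[i - 1] + rng.normal()
y = 5 + 0.5 * x + e
print(cochrane_orcutt(x, y))
```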

2.6.1 Example: Mink pelts from Saskatchewan

L.B. Keith (1963) collected information on the number of mink pelts from Saskatchewan, Canada over a 30 year period. This is data series 3707 in the NERC Centre for Population Biology, Imperial College (1999) The Global Population Dynamics Database available at http://www.sw.ic.ac.uk/cpb/cpb/gpdd.html.

We are interested to see if there is a linear trend in the series.

Here is the raw data:


Year Pelts
1914 15585
1915 9696
1916 6757
1917 6443
1918 6744
1919 10637
1920 11206
1921 8937
1922 13977
1923 11430
1924 13955
1925 6635
1926 7855
1927 5485
1928 5605
1929 5016
1930 6028
1931 6287
1932 11978
1933 15730
1934 14850
1935 9766
1936 6577
1937 3871
1938 4659
1939 6749
1940 12469
1941 8579
1942 6839
1943 9990
1944 6561
1945 5831
1946 8088
1947 9579
1948 10672
1949 16195
1950 12596
1951 12833
1952 18853
1953 11493
1954 14613
1955 18514



The raw data are available in a JMP file called mink.jmp in the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms.

It is common when dealing with population trends to analyze the data on the log-scale. The reason for this is that many processes operate multiplicatively on the original scale, and this translates into a straight line on the log-scale. For example, if the number of pelts harvested increased by x% per year, the forecasted number of pelts harvested would be fit by the equation:

Pelts = B(1 + x)^(Years from baseline)

where B is the baseline number of pelts. When this is transformed to the log-scale, the resulting equation is:

log(Pelts) = log(B) + (Years from baseline) log(1 + x)

or

Y′ = β0 + β1 (Years from baseline)

This equation can be further modified by using the raw year as the X variable (rather than years-from-baseline). All that happens is that the value of the baseline refers back to year 0 (which is pretty meaningless), but the value of the slope is still OK.
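This log-scale trend fit can be sketched directly in Python (numpy assumed available), using the mink data from the table above:

```python
import numpy as np

# mink pelt counts, Saskatchewan, 1914-1955 (table above)
pelts = np.array([
    15585, 9696, 6757, 6443, 6744, 10637, 11206, 8937, 13977, 11430,
    13955, 6635, 7855, 5485, 5605, 5016, 6028, 6287, 11978, 15730,
    14850, 9766, 6577, 3871, 4659, 6749, 12469, 8579, 6839, 9990,
    6561, 5831, 8088, 9579, 10672, 16195, 12596, 12833, 18853,
    11493, 14613, 18514])
year = np.arange(1914, 1956)

# fit log(pelts) = b0 + b1 * year by ordinary least squares
b1, b0 = np.polyfit(year, np.log(pelts), 1)

# on the natural-log scale, exp(b1) - 1 is the proportional change
# per year; for small slopes it is close to b1 itself
pct_per_year = np.exp(b1) - 1
```

The OLS standard error for this slope is not trustworthy because of the autocorrelation in the residuals, which is exactly the issue taken up in the rest of this example.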

It is recommended that you take natural logarithms (base e) rather than common logarithms (base 10) because then the estimated slope has a nice interpretation. For small slopes on the natural log scale, the value of β̂1 corresponds closely to the percentage increase per year. For example, if β̂1 = .04, then the population is increasing at a rate of

exp(β̂1) − 1 = exp(.04) − 1 = 1.041 − 1 = .041 ≈ β̂1

or about 4% per year.

JMP Analysis

JMP deals with autocorrelated data through the Analyze→Modelling→Time Series platform, which is beyond the scope of this course. This time series platform allows you to fit the Box-Jenkins ARIMA(p,d,q) series of models, but does not allow for missing data.

log_mink was constructed using a formula variable in the usual fashion. Here is a portion of the raw data:


Begin by using the Analyze->Fit Y-by-X platform to fit a simple linear regression and a line joining all of the points: 25

25 Use the Fit Each Value under the red-triangle pop-down menu to get the individual points joined up


There appears to be a generally increasing trend, but the points seem to show an irregular cyclical pattern where several years of high takes of pelts are followed by several years of low takes of pelts. This is often a sign of autocorrelated residuals. Indeed the residual plot shows this pattern: 26

26 This residual plot was obtained by saving the residuals to the data sheet, and then using the Analyze->Fit Y-by-X platform to plot the saved residuals against year. The joined line was obtained by using the Fit Each Value from the red-triangle pop-down menu. The horizontal line at zero was obtained by using the Fit Special from the red-triangle and selecting an intercept of 0 and a slope of 0.


In order to estimate the autocorrelation, the Analyze->Fit Model platform must be used to fit a linear model to log_mink


and obtain the fitted line and residual plots in the usual way. The Durbin-Watson statistic is obtained from the red-triangle pop-down menu:


The Durbin-Watson statistic indicates that there is strong evidence of autocorrelation with an estimated autocorrelation of approximately 0.56.

The estimated intercept and slope (without adjusting for autocorrelation) are:


The number of pelts is estimated to increase at about 0.8% per year. As noted before, the estimates are still unbiased, but the reported standard errors are too small. Using the rule-of-thumb, the inflation factor for the standard errors is approximately:

InfFactor = √((1 + ρ̂)/(1 − ρ̂)) = √((1 + .56)/(1 − .56)) = 1.9

Hence a more realistic standard error would be 1.9 × .005 ≈ .009.
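The rule-of-thumb is easy to compute directly. The sketch below also shows the usual connection between the Durbin-Watson statistic and the estimated lag-1 autocorrelation, ρ̂ ≈ 1 − DW/2 (an illustration, not the JMP calculation):

```python
import math

def dw_statistic(resid):
    """Durbin-Watson statistic: sum of squared successive differences
    of the residuals divided by their sum of squares.  Values near 2
    suggest little lag-1 autocorrelation; near 0, strong positive."""
    num = sum((resid[i] - resid[i - 1]) ** 2 for i in range(1, len(resid)))
    return num / sum(r * r for r in resid)

def se_inflation(rho):
    """Rule-of-thumb inflation factor for OLS standard errors when
    the residuals have lag-1 autocorrelation rho."""
    return math.sqrt((1 + rho) / (1 - rho))

factor = se_inflation(0.56)        # about 1.9, as in the text
adjusted_se = factor * 0.005       # more realistic se for the slope
```

Identical residuals give DW = 0 (perfect positive autocorrelation), while alternating residuals give DW near 4.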

A more formal analysis would proceed as follows. First launch the Analyze→Modelling→Time Series platform:

and specify the Y variable.


The Time variable is only used for graphing. JMP assumes that the data are equally spaced without any missing values.

This gives the initial output:


The estimated lag-1 autocorrelation is about 0.6 (quite high), but the lag 2 and higher autocorrelations don't appear to be statistically significant as they don't fall outside the blue lines drawn on the graph of the autocorrelations.

A simple autoregressive model with NO TREND (the ARIMA(1,0,0) model) is fit using the ARIMA drop-down menu and completing the various boxes:


A key assumption of the Box-Jenkins approach is that the series is stationary, i.e. has a constant mean. If there is a linear trend in the log_mink numbers, this MUST first be removed before a subsequent model is fit. A simple linear trend is removed by differencing. For example, if a simple linear trend model is correct

Yt = β0 + β1 t

then:

Yt+1 = β0 + β1 (t + 1)

Yt+1 − Yt = β1

and the FIRST differences are constant.
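A quick numpy check of this fact: differencing a noise-free linear trend leaves a constant equal to the slope, so the differenced series has a constant mean.

```python
import numpy as np

beta0, beta1 = 2.0, 0.3      # assumed values, for illustration only
t = np.arange(10)
y = beta0 + beta1 * t        # exact linear trend, no noise

first_diff = np.diff(y)      # y[t+1] - y[t]
# every first difference equals the slope beta1, so the
# differenced series is stationary (constant mean)
```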

A first difference model is fit by specifying the second term in the ARIMA model specification:


Finally, a model with differencing but no autocorrelation may also be useful:

A comparison of the three models is given by JMP:


The AIC criterion indicates that the model with the lowest value of AIC is preferred, and models with AIC within 2 or 3 of the best fitting model could also be candidates. According to this output, the AR(1) model is the best fitting model with an AIC of −94, almost 8 units lower than the next best fitting model, i.e. it wasn't necessary to remove the trend from the series before fitting the model to the data.

Indeed, if you look at the output from the AR(1,1) or AR(0,1) models:

the estimated average difference in the log_mink is only .00042 with a se of .05, clearly not statistically different from zero.

It is interesting to note that there is no evidence of further autocorrelation in the residuals after the first differences were taken, as the AR(1,1) model is a worse fit (but only by about 2 units) when compared on the AIC scale. You can also estimate the average first difference by computing a derived variable using the Formula editor and using the Analyze->Fit Model platform to estimate the overall mean and to see if there is residual autocorrelation.
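The average first difference and its standard error can also be computed directly from the series. The sketch below (numpy, mink data from the table above) reproduces the qualitative conclusion that the mean difference is not statistically different from zero:

```python
import numpy as np

# mink pelt counts, Saskatchewan, 1914-1955 (table above)
pelts = np.array([
    15585, 9696, 6757, 6443, 6744, 10637, 11206, 8937, 13977, 11430,
    13955, 6635, 7855, 5485, 5605, 5016, 6028, 6287, 11978, 15730,
    14850, 9766, 6577, 3871, 4659, 6749, 12469, 8579, 6839, 9990,
    6561, 5831, 8088, 9579, 10672, 16195, 12596, 12833, 18853,
    11493, 14613, 18514])

diffs = np.diff(np.log(pelts))     # derived variable: first differences
mean_diff = diffs.mean()
se_diff = diffs.std(ddof=1) / np.sqrt(len(diffs))
# mean_diff is small relative to its se, i.e. no evidence of a trend
# in the differenced series
```

This simple se ignores any remaining autocorrelation in the differences, which (as noted above) happens to be negligible here.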

Final Notes

It is interesting to note that there is no evidence of further autocorrelation in the residuals after the first differences were taken. If you hadn't examined the autocorrelation plots you would not have known this. It is quite common that a first difference will remove much of the autocorrelation in the data, and this is often a good first step.


2.7 Dealing with seasonality

In many cases, the "cause" of autocorrelation over time is some sort of seasonality. For example, stream flow may follow a cyclical pattern with high flows in the winter months (at least in Vancouver) and low flows in the summer months. A way to deal with this type of autocorrelation is either to first adjust the data for seasonal effects, and then use the usual regression methods on the adjusted data, or to fit a cyclic pattern over and above the simple trend line.

2.7.1 Empirical adjustment for seasonality

General idea

The intuitive idea behind this method is quite simple. Arrange the data into seasonal groups (e.g. months) and subtract the seasonal group mean or median 27 from every point in the seasonal series. This will subtract the cyclic pattern and leave adjusted data that is "free" of seasonal effects.

The adjustment process can either be done within the computer package, or in many cases, is easily done on a spreadsheet.

This adjustment is a bit ad hoc, but seems to work well in practice. The reported standard errors from the regression line are a bit too small as they have not accounted for the adjustment process.

Example: Total phosphorus from Klamath River

Consider, for example, values of total phosphorus taken from the Klamath River near Klamath, California as analyzed by Hirsch et al. (1982). 28

27 The median would be preferred to avoid contamination of the mean by outliers

28 This was monitoring station 11530500 from the NASQAN network in the US. Data are available from http://waterdata.usgs.gov/nwis/qwdata/?site_no=11530500. The data was analyzed by Hirsch, R.M., Slack, J.R., and Smith, R.A. (1982). Techniques of trend analysis for monthly water quality data. Water Resources Research 18, 107-121.


Total phosphorus (mg/L) in Klamath River near Klamath, CA

Month 1972 1973 1974 1975 1976 1977 1978 1979
1 0.07 0.33 0.70 0.08 0.04 0.05 0.14 0.08
2 0.11 0.24 0.17 . . . 0.11 0.04
3 0.60 0.12 0.16 . 0.14 0.03 0.02 0.02
4 0.10 0.08 1.20 0.11 0.05 0.04 0.06 0.01
5 0.04 0.03 0.12 0.09 0.02 0.04 0.03 0.03
6 0.05



A plot of the raw data shows an obvious seasonality with peak levels occurring in the winter months. There are also some missing values as seen in the raw data table. Finally, notice the presence of several very large values (above 0.20 mg/L) that would normally be classified as outliers. Consequently, we will use the median from each month for the adjustment. The sorted values for the January readings are:

.04, .05, .07, .08, .08, .14, .33, .70

The median value for the January readings is the average of the 4th and 5th observations 29 or

median January = (.08 + .08)/2 = .08.

The value of .08 is subtracted from each of the January readings to give

−.01, .25, .62, .00, −.04, −.03, .06, .00

29 If the number of observations is odd, as for February, the median is the middle value
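The adjustment for a single month can be verified with a few lines of Python, using the eight January readings:

```python
from statistics import median

# January total phosphorus readings, 1972-1979 (mg/L)
january = [0.07, 0.33, 0.70, 0.08, 0.04, 0.05, 0.14, 0.08]

m = median(january)                         # (.08 + .08) / 2 = .08
adjusted = [round(x - m, 2) for x in january]
# adjusted reproduces the January row of the seasonally
# adjusted table below
```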


This process is repeated for each month. These computations are illustrated in the Klamath tab in the ALLofDATA.xls workbook in the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms to give:

Seasonally Adjusted Total phosphorus (mg/L) in Klamath River near Klamath, CA

Month 1972 1973 1974 1975 1976 1977 1978 1979
1 -0.01 0.25 0.62 0.00 -0.04 -0.03 0.06 0.00
2 0.00 0.13 0.06 . . . 0.00 -0.07
3 0.48 0.00 0.04 . 0.02 -0.09 -0.10 -0.10
4 0.03 0.01 1.13 0.04 -0.02 -0.03 -0.01 -0.06
5 0.01 -0.01 0.09 0.06 -0.02 0.01 -0.01 -0.01
6 0.00 -0.04 0.00 0.00 . . -0.02 .
7 0.00 0.00 -0.01 -0.02 . 0.02 -0.02 0.00
8 -0.01 0.01 -0.03 -0.01 0.02 0.03 0.01 -0.04
9 0.02 0.01 -0.03 0.02 . 0.00 -0.04 .
10 0.00 0.00 -0.01 . 0.00 -0.04 -0.03 0.20
11 0.00 0.28 . -0.01 . 0.33 0.00 .
12 0.01 0.03 -0.01 -0.08 . 0.18 -0.06 .

A plot of the seasonally adjusted values:


shows that most of the seasonal effects have been removed, but there may still be evidence of autocorrelation. There are certainly still some outliers.

JMP analysis

The seasonally adjusted values were imported into JMP and stacked in the usual way. A new variable year-month was created using a formula variable, year-month = year + (month − 1)/12, to represent time:
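The same time variable is easy to build outside JMP; `year_month` below is a hypothetical helper name used only for illustration:

```python
def year_month(year, month):
    """Decimal time axis: January 1972 -> 1972.0, July 1972 -> 1972.5."""
    return year + (month - 1) / 12

times = [year_month(1972, m) for m in range(1, 13)]
```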


The Analyze->Fit Y-by-X platform was used to draw the scatter plot and fit the preliminary line:


It is a bit worrisome that the outliers seem to all be in the early years. All seasonally adjusted values greater than 0.2 were excluded from the analysis 30 and the line was refit:

30 Use the Rows→Select to select these rows


There appears to be evidence of a trend of −.0032 mg/L/year. The p-value and se of the slope are likely too small by some small factor because the seasonal adjustment was not taken into account. The residual plot seems to show some evidence of remaining autocorrelation.

The Analyze->Fit Model platform was reused to fit the data and obtain the Durbin-Watson statistic:

This indicates a low amount of residual autocorrelation (estimated value of .04), but it is statistically significant because the large sample size allows you to detect even very small autocorrelations.

Further comments It is a bit worrisome that all of the outliers appear to happen early in the time series and, once these are removed, that there is no evidence of a trend. However, one could argue that the disappearance of the outliers is, in fact, the most interesting point of this dataset and that the fact that the outliers disappeared indicates evidence of a downward trend.

It also turns out that the results are VERY sensitive to which outliers are removed. For example, in late 1977 there is a seasonally adjusted value of .17, and in late 1979 a seasonally adjusted value of 0.20, neither of which was excluded. If these points are also removed, the final regression line is not statistically significant with an estimated trend of −.0063 mg/L/year.

As you will see later, a non-parametric analysis that includes these outlier points did detect a downward trend with an estimated slope of about −.006 mg/L/year! The moral of the story is that statistics must be used carefully!

2.7.2 Using the ANCOVA approach

General idea

Rather than relying on an ad hoc approach to doing a seasonal adjustment, the ANCOVA method can also be used. The advantage of the ANCOVA method over the ad hoc approach is that not only can you fit an overall trend line, but you can also test if the trend is the same for all seasons. Outliers will have to be removed in the usual fashion.

The general model will start with the non-parallel slope model of the form:

Y = Season Time Season*Time

Then examine whether the Season*Time interaction term indicates that the slopes may not be parallel over seasons.


If there is insufficient evidence against the hypothesis of parallelism, then fit the final model with a common slope over the seasons, but differences among the seasons:

Y = Season Time
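A sketch of the parallel-slope (Y = Season + Time) fit using an explicit design matrix. The data here are simulated, with made-up seasonal levels and a common slope of −0.006, purely to show the construction:

```python
import numpy as np

rng = np.random.default_rng(7)

# four "seasons" observed over ten years, parallel-slope model:
# season-specific levels plus a common downward trend (assumed values)
years = np.repeat(np.arange(10.0), 4)
season = np.tile(np.arange(4), 10)
levels = np.array([0.10, 0.05, 0.02, 0.04])
y = levels[season] - 0.006 * years + rng.normal(scale=0.005, size=40)

# design matrix for Y = Season + Time: one dummy column per season
# (no separate intercept) plus a single common-slope column
X = np.column_stack(
    [(season == s).astype(float) for s in range(4)] + [years])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
slope = coef[4]          # common trend shared by all seasons
# the non-parallel model would add Season*Time columns, i.e.
# one dummy-times-years column per season
```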

Example: Total phosphorus levels on the Klamath River - revisited

JMP Analysis

The raw data is available in the file klamath.jmp in the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms.

An earlier plot shows that there are some outliers. Remove all data points greater than 0.20 mg/L.

The Analyze->Fit Model platform is used to fit the non-parallel slope model. CAUTION: Be sure that month is nominally scaled and that year is continuously scaled!


The graph of the lines by season appears to show that some seasons (months) have a different slope than the other months:

and the effect test for non-parallel slopes:

also shows some evidence of non-parallel slopes. However, we will fit the parallel slope model to continue the demonstration.

The Analyze->Fit Model platform is again used to fit the parallel slope model. CAUTION: Be sure that month is nominally scaled and that year is continuously scaled!


The fitted lines and the model fit graph appear to be acceptable:


The effect tests show a strong effect of year with estimated coefficients of:

The estimated trend is −.0056 (se .0016) mg/L/year, which is comparable to the previous estimates. Note that the estimates for the month effects are not directly interpretable from this output - the LSMEANS table should be consulted - seek help on this point.

The residual plots (not shown) don't indicate any major problems. The Durbin-Watson test for autocorrelation detects a small autocorrelation, but with this large sample size it is not important.

2.7.3 Fitting cyclical patterns

General approach

In some cases, the seasonal pattern is quite regular, with regular peaks during part of the year and regular lows during another part of the year. Another approach is to try to account for this cyclical pattern, and then see if there is still evidence of a decline over time.

The basic building blocks for the seasonality are sine and cosine functions used to represent the


seasonal patterns. The general model will take the form:

Y_i = β0 + β1 t_i + β2 cos(2π t_i / ν) + β3 sin(2π t_i / ν) + ɛ_i

Here the coefficients β0 and β1 represent the intercept and linear change over time. The coefficients β2 and β3 represent the seasonal components.

The term ν represents the period of the cycle. It is assumed to be known in advance. For example, if the cycles are one year in duration and the time axis is measured in years, then ν = 1. If the cycles are one year in duration, but the time axis is measured in months, then ν = 12. This is often coded incorrectly so be careful!

The reason there are both sine and cosine functions is that these two functions have the same period but different amplitudes at different parts of the cycle. For example, the cosine function has a maximum at the start of each cycle and a minimum half-way through each cycle, while the sine function has a maximum at the 1/4 point of a cycle and a minimum at the 3/4 point of the cycle.

The analysis starts by creating two new variables in the data table corresponding to the sine and cosine functions. Then multiple regression is used to fit a model incorporating all three explanatory variables. In the short hand notation for models, the model fit is:

Y = Time Cos Sin

After the model is fit, the coefficient of the Time variable represents the overall trend. The usual tests of hypothesis for no trend, and confidence intervals for the slope, can be found. The slope is interpreted as the change in Y per unit change in X = TIME after adjusting for seasonality. The coefficients for the sine and cosine functions are usually not of interest.

The computation should NOT be attempted by hand or in a spreadsheet program. Most statistical packages have facilities for creating the relevant variables and fitting these models.
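The whole procedure can be sketched with numpy on simulated monthly data (time measured in years, so ν = 1; the trend and cycle coefficients below are made up): create the cosine and sine regressors, then fit the multiple regression by least squares.

```python
import numpy as np

rng = np.random.default_rng(3)

nu = 1.0                          # period: one year, time in years
t = np.arange(96) / 12.0          # eight years of monthly data

# simulated series: trend plus a yearly cycle plus noise
y = (0.10 - 0.006 * t
     + 0.05 * np.cos(2 * np.pi * t / nu)
     + 0.02 * np.sin(2 * np.pi * t / nu)
     + rng.normal(scale=0.01, size=96))

# create the cosine and sine variables, then fit Y = Time Cos Sin
X = np.column_stack([np.ones_like(t), t,
                     np.cos(2 * np.pi * t / nu),
                     np.sin(2 * np.pi * t / nu)])
b0, b1, b2, b3 = np.linalg.lstsq(X, y, rcond=None)[0]
# b1 estimates the overall trend after adjusting for seasonality
```

Because the simulated trend and cycle amplitudes are known, the fitted coefficients land close to them; with real data the same fit gives the trend estimate and its usual OLS inference.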

The usual assumptions still hold, so they should be checked via residual plots, estimation of any autocorrelation that remains, etc.

Example: Total phosphorus from Klamath River

JMP Analysis

The data and model fits are available in a JMP file klamath3.jmp in the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms.

The data must be stacked in the usual fashion and a variable year-month created to represent the time variable, year-month = year + (month − 1)/12.


As the time variable is measured in years and the preliminary plot shows a yearly cycle, ν = 1, so the formulae for the cosine and sine variables are:


respectively. This gives the final data table looking somewhat like:

There are no problems with the fact that some of the phosphorus data is missing, as the package will simply ignore any row that is not complete.

The Analyze->Fit Model platform is used to fit the model:


The output is voluminous and a full discussion is beyond the scope of these notes. 31 The key things to look at are the estimated coefficients and the residual plots and model fit plots:

31 See Freund, R., Little, R. and Creighton, L. (2003). Regression using JMP. Wiley, for more details on the output from this platform.


These all indicate the presence of several outliers.

The model was refit omitting the outliers with phosphorus values greater than 0.20. 32 The residual and model fit plots are much better:

32 Use the Rows→Select to select rows and the Rows→Exclude to remove these from the analysis


but the residual plot still shows something strange happening about half-way through the time series. It appears that the cycles are shifting, so you get this long wave of residuals.

The estimated coefficients are:

The coefficients for both the cosine and sine terms are statistically significant but not of much interest. The estimated trend is −.0056 mg/L/year (se .0017) with a p-value for the trend line of .0013. The results are statistically significant.

The Durbin-Watson test for autocorrelation shows some residual serial correlation:

which likely reflects the behavior in the tail end of the series.

Example: Comparing air quality measurements using two different methods

The air that we breathe often has many contaminants. One contaminant of interest is Particulate Matter (PM). Particulate matter is the general term used for a mixture of solid particles and liquid droplets in the air. It includes aerosols, smoke, fumes, dust, ash and pollen. The composition of particulate matter varies with place, season and weather conditions. Particulate matter is characterized according to size - mainly because of the different health effects associated with particles of different diameters. Fine particulate matter is particulate matter that is 2.5 microns in diameter or less. [A human hair is approximately 30 times larger


than these particles!] The smaller particles are so small that several thousand of them could fit <strong>on</strong> the period<br />

at the end of this sentence. It is also known as PM2.5 or respirable particles because it penetrates the<br />

respiratory system further than larger particles.<br />

PM2.5 material is primarily <str<strong>on</strong>g>for</str<strong>on</strong>g>med from chemical reacti<strong>on</strong>s in the atmosphere and through fuel combusti<strong>on</strong><br />

(e.g., motor vehicles, power generati<strong>on</strong>, industrial facilities residential fire places, wood stoves and<br />

agricultural burning). Significant amounts of PM2.5 are carried into Ontario from the U.S. During periods<br />

of widespread elevated levels of fine particulate matter, it is estimated that more than 50 per cent of Ontario’s<br />

PM2.5 comes from the U.S.<br />

Adverse health effects from breathing air with a high PM 2.5 c<strong>on</strong>centrati<strong>on</strong> include: premature death,<br />

increased respiratory symptoms and disease, chr<strong>on</strong>ic br<strong>on</strong>chitis, and decreased lung functi<strong>on</strong> particularly <str<strong>on</strong>g>for</str<strong>on</strong>g><br />

individuals with asthma.<br />

Further in<str<strong>on</strong>g>for</str<strong>on</strong>g>mati<strong>on</strong> about fine particulates is available at many websites as http://www.health.<br />

state.ny.us/nysdoh/indoor/pmq_a.htm and http://www.airquality<strong>on</strong>tario.com/<br />

science/pollutants/particulates.cfm, and http://www.epa.gov/pmdesignati<strong>on</strong>s/<br />

faq.htm.<br />

The PM2.5 concentrations in air can be measured in many ways. A well-known method is a filter-based method whereby one 24-hour sample is collected every third day. The sampler draws air through a pre-weighed filter for a specified period (usually 24 hours) at a known flowrate. The filter is then removed and sent to a laboratory to determine the gain in filter mass due to particle collection. Ambient PM concentration is calculated as the gain in filter mass divided by the product of sampling period and sampling flowrate. Additional analysis can also be performed on the filter to determine the chemical composition of the sample.
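The filter calculation is simple arithmetic; a quick illustration (all of the numbers here are hypothetical, not from this study):

```python
# Hypothetical filter sample: mass gain of 240 micrograms over a
# 24-hour sampling period at a flowrate of 16.7 litres per minute.
mass_gain_ug = 240.0
flow_L_per_min = 16.7
minutes = 24 * 60

# Total air volume sampled, in cubic metres (1000 L per cubic metre):
volume_m3 = flow_L_per_min * minutes / 1000.0

# Concentration = mass gain / (sampling period x flowrate):
concentration = mass_gain_ug / volume_m3  # micrograms per cubic metre
print(round(volume_m3, 2), round(concentration, 2))
```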

In recent years, a program of continuous sampling using automatic samplers has been introduced. An instrument widely adopted for this use is the Tapered Element Oscillating Microbalance (TEOM). The TEOM operates under the following principles. Ambient air is drawn in through a heated inlet. It is then drawn through a filter cartridge on the end of a hollow, tapered tube. The tube is clamped at one end and oscillates freely like a tuning fork. As particulate matter gathers on the filter cartridge, the natural frequency of oscillation of the tube decreases. The mass accumulation of particulate matter is then determined from the corresponding change in frequency.

Because of the different ways in which these instruments work, a calibration experiment was performed. The hourly TEOM readings were accumulated to a daily value and compared to those obtained from an air filter method. Here are the data:

Date TEOM Ref<br />

2003.06.05 8.1 10.6<br />

2003.06.08 6.5 9.0<br />

2003.06.11 3.2 4.6<br />

2003.06.14 2.2 3.7<br />

2003.06.17 5.8 7.9<br />

2003.06.20 1.4 4.4<br />

2003.06.23 1.8 2.8<br />

2003.06.26 4.5 6.5<br />

2003.06.29 4.6 5.8<br />

2003.07.02 3.3 3.6<br />

2003.07.05 1.6 3.7<br />

2003.07.08 7.1 7.2<br />

2003.07.11 7.7 8.6<br />

2003.07.14 4.3 4.4<br />

2003.07.17 4.6 6.4<br />

2003.07.20 7.2 8.5<br />

2003.07.23 8.8 10.5<br />

2003.07.26 8.1 9.0<br />

2003.07.29 11.2 10.4<br />


2003.08.01 19.4 21.0<br />

2003.08.07 5.9 5.2<br />

2003.08.10 11.9 12.6<br />

2003.08.13 7.2 8.4<br />

2003.08.16 48.2 46.2<br />

2003.08.19 49.3 51.2<br />

2003.08.22 53.3 54.5<br />

2003.08.25 56.8 57.2<br />

2003.08.28 4.5 7.4<br />

2003.08.31 27.8 26.1<br />

2003.09.03 34.3 33.0<br />

2003.09.06 41.5 42.1<br />

2003.09.24 5.8 9.5<br />

2003.09.27 5.7 8.0<br />

2003.09.30 9.1 9.8<br />

2003.10.03 10.5 13.9<br />

2003.10.06 10.9 15.6<br />

2003.10.09 3.5 5.6<br />

2003.10.12 4.1 6.3<br />

2003.10.15 5.7 10.1<br />

2003.10.18 15.5 20.2<br />

2003.10.21 5.4 8.9<br />

2003.10.24 11.7 19.0<br />

2003.10.27 14.9 23.3<br />

2003.10.30 3.9 7.5<br />

2003.11.02 12.9 21.2<br />

2003.11.05 18.9 33.4<br />

2003.11.08 23.6 35.9<br />

2003.11.11 19.0 30.2<br />

2003.11.14 18.5 28.2<br />

2003.11.17 11.1 18.4<br />

2003.11.20 11.6 20.1<br />

2003.11.23 9.4 17.9<br />

2003.11.26 25.6 42.8<br />

2003.11.29 6.9 11.2<br />

2003.12.02 13.2 25.6<br />

2003.12.05 10.2 19.9<br />

2003.12.08 17.6 31.6<br />

2003.12.11 6.7 14.1<br />

2003.12.14 16.2 26.5<br />

2003.12.17 8.3 13.5<br />

2004.01.13 6.8 13.8<br />

2004.01.16 9.2 17.3<br />

2004.01.19 16.5 32.6<br />

2004.01.22 4.3 11.6<br />

2004.01.25 6.1 10.0<br />

2004.01.28 10.1 14.4<br />

2004.01.31 14.0 28.1<br />

2004.02.06 19.4 35.0<br />

2004.02.09 15.1 25.2<br />

2004.02.12 16.8 32.9<br />

2004.02.15 15.9 28.5<br />

2004.02.18 9.8 18.5<br />

2004.02.21 9.1 17.2<br />

2004.02.24 17.1 31.9<br />

2004.02.27 12.1 21.7<br />

2004.03.01 8.8 14.1<br />

2004.03.07 3.2 5.6<br />

2004.03.10 10.9 15.3<br />

2004.03.13 7.1 10.8<br />

2004.03.16 7.4 13.8<br />

2004.03.19 10.4 14.0<br />

2004.03.22 10.6 16.1<br />

2004.03.25 5.0 8.4<br />

2004.03.28 6.4 10.3<br />

2004.03.31 5.3 6.6<br />

2004.04.03 6.5 9.7<br />

2004.04.09 6.4 9.7<br />

2004.04.12 7.0 8.8<br />

2004.04.15 2.3 4.6<br />

2004.04.18 4.2 5.7<br />

2004.04.21 4.7 5.7<br />

2004.04.24 3.7 4.1<br />

2004.04.27 4.1 5.0<br />

2004.04.30 7.3 7.3<br />

2004.05.03 3.5 5.0<br />

2004.05.06 2.5 2.8<br />

2004.05.09 2.3 2.7<br />

2004.07.02 6.0 4.3<br />

2004.07.05 3.3 2.4<br />

2004.07.08 1.6 2.0<br />

2004.07.11 1.2 5.7<br />

2004.07.14 5.4 8.3<br />

2004.07.17 8.8 3.5<br />

2004.07.20 2.2 10.0<br />

2004.07.23 8.3 12.5<br />

2004.07.26 10.5 17.0<br />

2004.08.01 25.3 24.7<br />

2004.08.04 14.7 10.5<br />

2004.08.07 2.7 3.1<br />

2004.08.10 6.5 7.2<br />

2004.08.19 20.1 13.6<br />

2004.08.25 4.1 4.2<br />

2004.08.28 2.5 1.5<br />

2004.08.31 4.7 6.3<br />

2004.09.03 3.2 4.0<br />

2004.09.15 1.8 2.6<br />

2004.09.18 2.6 4.7<br />

2004.09.21 4.7 6.2<br />

2004.09.24 5.6 8.0<br />

2004.09.27 7.1 10.0<br />

2004.09.30 4.8 7.7<br />

2004.10.03 9.5 13.3<br />

2004.10.06 10.1 13.0<br />

2004.10.09 3.8 5.0<br />

2004.10.12 5.0 7.3<br />

2004.10.15 2.3 5.4<br />

2004.10.18 7.5 10.1<br />

2004.10.21 8.1 11.0<br />

2004.10.24 6.6 13.6<br />

2004.10.27 14.0 18.2<br />

2004.10.30 15.9 24.8<br />

2004.11.02 8.4 14.1<br />

2004.11.08 10.8 17.6<br />

2004.11.11 1.4 4.7<br />

2004.11.14 6.5 10.0<br />

2004.11.17 11.0 18.8<br />

2004.11.20 7.7 14.4<br />

2004.11.26 15.4 23.4<br />

2004.11.29 8.9 17.1<br />

2004.12.02 18.3 30.8<br />

2004.12.05 6.2 13.5<br />

2004.12.08 8.3 16.5<br />

2004.12.11 9.6 15.9<br />

2004.12.14 9.8 17.6<br />

2004.12.17 11.5 21.5<br />

2004.12.20 14.0 26.1<br />

2004.12.23 9.8 20.0<br />

2004.12.26 4.9 9.4<br />

2004.12.29 3.7 7.6<br />

2005.01.01 10.2 18.5<br />

2005.01.04 18.6 38.3<br />

2005.01.22 11.1 24.7<br />

2005.01.25 11.8 22.7<br />

2005.01.28 13.1 20.9<br />

2005.01.31 5.1 10.9<br />

2005.02.03 6.2 11.1<br />

2005.02.06 6.5 10.0<br />


2005.02.09 10.6 20.8<br />

2005.02.12 11.4 23.3<br />

2005.02.15 12.9 18.8<br />

2005.02.18 14.0 23.4<br />

2005.02.21 21.9 31.7<br />

2005.02.24 17.1 26.4<br />

2005.02.26 8.3 16.3<br />

2005.02.27 11.8 20.1<br />

2005.03.02 16.7 28.9<br />

2005.03.05 12.0 18.9<br />

2005.03.08 5.3 9.8<br />

2005.03.11 10.9 18.8<br />

2005.03.14 11.3 18.1<br />

2005.03.17 8.5 11.0<br />

2005.04.04 12.0 10.9<br />

2005.04.07 7.8 7.1<br />

2005.04.16 2.3 4.8<br />

2005.04.19 5.5 3.9<br />

2005.04.22 8.0 6.7<br />

2005.04.25 7.3 10.0<br />

2005.04.28 3.5 9.0<br />

2005.05.01 4.5 4.5<br />

2005.05.04 5.1 1.8<br />

2005.05.07 2.5 5.4<br />

2005.05.28 6.1 6.7<br />

2005.05.31 9.7 12.0<br />

2005.06.03 5.2 5.0<br />

2005.06.06 0.9 2.1<br />

2005.06.09 4.4 6.2<br />

2005.06.12 2.3 2.7<br />

2005.06.15 2.3 2.2<br />

2005.06.18 1.7 2.6<br />

2005.06.21 6.7 6.9<br />

2005.06.24 3.4 3.8<br />

2005.06.27 4.2 4.6<br />

2005.06.30 4.3 5.5<br />

2005.07.03 2.7 5.2<br />

2005.07.06 3.6 4.2<br />

2005.07.09 1.3 1.9<br />

2005.07.12 2.8 6.3<br />

Do both meters give similar readings over time?

It is quite common when comparing two instruments to do the comparison on the log-ratio scale, i.e. either log(TEOM/reference) or log(reference/TEOM). There are two reasons why this is commonly done. First, the logarithmic scale makes ratios greater than 1 and less than 1 symmetric. For example, the ratios 1/2 and 2 on the regular scale are not symmetric about the value of 1, but log(1/2) = −.693 and log(2) = .693 are symmetric about zero. Second, it is often the case that the variation tends to increase with the base size of the reading. The use of logarithms makes the variances more similar over the spread of the data values.
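Both points are easy to check numerically. A short sketch using the first few (TEOM, reference) pairs from the data table above:

```python
import math

# Ratios of 1/2 and 2 are symmetric about 0 on the log scale:
print(round(math.log(0.5), 3), round(math.log(2), 3))  # -0.693 0.693

# The log-ratio response used in the calibration analysis,
# for the first few (TEOM, reference) pairs in the table:
pairs = [(8.1, 10.6), (6.5, 9.0), (3.2, 4.6), (2.2, 3.7), (5.8, 7.9)]
log_ratios = [math.log(t / r) for t, r in pairs]
print([round(v, 3) for v in log_ratios])
```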

JMP Analysis

A JMP data file is available in the teom.jmp file in the Sample Program Library available at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms.

Two variables need to be created in the JMP table. First, the log(TEOM/reference) variable as noted above. This is created using the formula editor:


Second, a variable representing the decimal year is required so that plotting and regression happen on the year scale rather than the internal date and time format in JMP. JMP uses the number of seconds since a reference date as the internal value for a date. Consequently, to convert to dates, you need to divide by 86,400 seconds/day to convert to days, and then by 365 to convert to years. [This ignores the effect of leap years and leap seconds.] This year variable is created using the formula:
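The same conversion can be sketched outside JMP. The reference date assumed below (1904-01-01) is an assumption for illustration; the arithmetic (divide seconds by 86,400, then by 365) is exactly as described above:

```python
from datetime import datetime

# Assumed reference date for the internal "seconds" clock:
REFERENCE = datetime(1904, 1, 1)

def decimal_year(date):
    """Convert a date to a decimal year: seconds since the reference
    date, divided by 86,400 seconds/day and then by 365 days/year.
    (Ignores leap years and leap seconds, as the notes do.)"""
    seconds = (date - REFERENCE).total_seconds()
    return 1904 + seconds / 86_400 / 365

print(round(decimal_year(datetime(2003, 6, 5)), 3))
```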


Here are the first few lines of data, including the two new derived variables:<br />


A plot of the log(TEOM/reference) by the year variable is obtained using the Analyze->Fit Y-by-X platform:


shows a clear cyclical pattern. The peaks of the cycles are almost exactly one year apart. Consequently, we then create two new variables to represent the sine and cosine terms for a cyclical fit. Because the time units are in years, the period is also in years and is equal to ν = 1. The following formula variables were created:
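The two derived columns are straightforward to construct in any package. A sketch in Python with hypothetical decimal-year values; with the period equal to 1 year, the argument of the sine and cosine is 2πt, so the pair completes exactly one full cycle per year:

```python
import math

# Hypothetical decimal-year values at quarter-year steps:
years = [2004.0, 2004.25, 2004.5, 2004.75]

# Sine and cosine terms for a cyclical fit with a one-year period:
sin_term = [math.sin(2 * math.pi * t) for t in years]
cos_term = [math.cos(2 * math.pi * t) for t in years]

# Quarter-year steps land at the quarter points of the cycle:
print([round(s, 6) for s in sin_term])  # ~[0, 1, 0, -1]
print([round(c, 6) for c in cos_term])  # ~[1, 0, -1, 0]
```

These two columns, together with the year column, are the predictors for the multiple regression fit below.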


with the first few lines of the data table now looking like:<br />


Now use the Analyze->Fit Model platform to fit a multiple regression using the year, sine, and cosine variables:


The effects test

indicates the presence of a cyclical pattern (not unexpectedly), but no evidence of a year effect. Save the predicted values and the residuals to the data table using the Red Triangle→Save Columns pop-down menus:

The residual plot is then found using the Analyze->Fit Y-by-X platform:


and shows no severe lack of fit. There are several outliers that appear, and perhaps something unusual is happening in mid-2003. An overlay plot of the actual and predicted values: 33

33 Use the Graph→Overlay platform; select the observed and predicted values as the Y variables and year as the X variable; click on the legend for the predicted values and join with a line and hide the points.


shows a generally good fit, with some outlier points, and again further investigation required in about mid-2003. 34

The log(TEOM/reference) hardly goes above the value of 0 (which is the reference line indicating no difference between the two instruments). In order to estimate the average log-ratio, we refit the model DROPPING the year term (why?) and examine the parameter estimates of this simpler model:

The average log-ratio is −.39 (se .02). This corresponds to a ratio of .68 on the anti-log scale, i.e. the TEOM meter is reading, on average across the entire year, only 68% of the reference meter.
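The back-transformation from the log-ratio scale is just exponentiation:

```python
import math

avg_log_ratio = -0.39           # estimated mean of log(TEOM/reference)
ratio = math.exp(avg_log_ratio)
print(round(ratio, 2))          # TEOM reads about 68% of the reference

# An approximate 95% confidence interval on the ratio scale is found by
# back-transforming the interval on the log scale (se = 0.02):
lo, hi = (math.exp(avg_log_ratio + k * 1.96 * 0.02) for k in (-1, 1))
print(round(lo, 2), round(hi, 2))
```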

2.7.4 Further comments

An implicit assumption of this method is that the amplitude of the seasonal trend is constant in time, i.e. the β₂ and β₃ terms do not depend on time. It could happen that the amplitude is also decreasing in time. In this case, you may consider a log-transform of the Y variable so that the relative ratio between the top and bottom of the cycle may be fixed. Alternatively, more complex non-linear regression models where the amplitude also depends upon time may be fit. This is beyond the scope of these notes.

The key feature required for this method to work well is the regularity of the seasonal effects: the shape of the seasonal effects must be that of a sine or cosine curve. Consequently, a pattern that is relatively flat with a single sharp peak in a consistent month cannot be well fit by these models. In this case, you could create indicator variables for the peak time and then fit a multiple regression model as above – this is again beyond the scope of these notes.

2.8 Seasonality and Autocorrelation

Whew! This is a tough issue to deal with! Fortunately, there have been great advances in software, and in some packages (e.g. SAS) this is fairly easy to deal with. Unfortunately, this is beyond simple packages such as JMP or SYSTAT.

34 It turns out that these points were collected when a large amount of smoke from a nearby forest fire was present.


This section will be brief, with very little explanation of the underlying statistical concepts, and reference to output from SAS. Please seek further help if you are dealing with this type of data.

Again, refer back to the Klamath River data. It may turn out that even after adjusting for seasonality, there is residual autocorrelation within a year. For example, a particular year may have generally low phosphorus levels for some reason, and so observations in months close together are more highly related than observations in months far apart.

A common model for dealing with this type of autocorrelation is the familiar AR(1) process with a single autocorrelation parameter. In general, the covariance of two observations is modeled as:

cov(Y_t1, Y_t2) = σ² ρ^∆t

where ∆t is the difference in time between the two observations. For example, observations that are 1 time unit apart will have covariance σ²ρ¹; observations that are two time units apart will have covariance σ²ρ²; etc.

The advantage of using this power notation is that missing values are easily accommodated – it is not necessary to have every observation in time, so interpolation to ‘fill in’ missing values is not necessary.
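The spatial power structure is easy to write down directly. A sketch building this covariance matrix for irregularly spaced observation times (the σ² and ρ values here are hypothetical):

```python
def spatial_power_cov(times, sigma2, rho):
    """Covariance matrix with cov(Y_s, Y_t) = sigma2 * rho**|s - t|.
    Irregular and missing time points are handled automatically,
    since only the time differences enter the formula."""
    return [[sigma2 * rho ** abs(s - t) for t in times] for s in times]

# Monthly observations with a gap (month 3 is missing):
times = [1, 2, 4, 5]
cov = spatial_power_cov(times, sigma2=1.0, rho=0.5)
print(cov[0][1])  # 1 month apart  -> 0.5
print(cov[1][2])  # 2 months apart -> 0.25
```

Note that no interpolation was needed for the missing month: the entry for observations 2 months apart is simply σ²ρ².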

Let us revisit the Klamath phosphorus data. A model that allows for seasonal variation (by months) and autocorrelation can be fit using Proc Mixed, using both the ANCOVA and autocorrelation models. The code fragment looks like:

proc mixed data=klamath maxiter=200 maxfunc=1000;
where phosphorus

The estimated common slope from this model (mg/L/year) is:

                         Standard
Label       Estimate     Error       DF     t Value   Pr > |t|
avg slope   -0.00578     0.002515    10.9   -2.30     0.0430

which is similar to the estimates found earlier.

A model was also fit assuming independence among the observations (see the ANCOVA approach to seasonal adjustment earlier in this chapter). Is there support for the independence model?

The AIC criterion is used to compare these different models. The two AICc values (corrected for small sample sizes) are:

AICC (smaller is better)   -216.2   for the spatial power model
AICC (smaller is better)   -208.7   for the independence model

A usual rule of thumb is that a difference of more than 2 in the AIC indicates evidence for the model with the smaller AIC. In this case, the AIC for the spatial power model is almost 8 units smaller than that for the independence model. There is strong evidence for residual autocorrelation.

The estimated trend (ignoring autocorrelation) is:

                         Standard
Label       Estimate     Error       DF     t Value   Pr > |t|
avg slope   -0.00562     0.001621    58     -3.47     0.0010

As expected, the estimated slopes are similar, but the reported se from the model ignoring autocorrelation was too small by a factor of about √((1+ρ)/(1−ρ)) = √(1.5/.5) = √3 ≈ 1.7.

2.9 Non-parametric detection of trend

The methods so far in this chapter all rely on several assumptions that may not be satisfied in all contexts. For example, all the methods (including the methods for autocorrelation) assume that deviations from the regression line are normally distributed with equal variance. In practice, they are fairly robust to non-normality and heterogeneous variances if the sample sizes are fairly large.

But how is it possible to deal with truncated or censored observations? For example, it is quite common for measurement tools to have upper and lower limits of detectability, and you often get measurements that

are below or above detection limits. How can a monotonic, but not linear, relationship be examined? 35 For example, cases of asthma seem to increase with the concentration of particulates in the atmosphere, but the relationship is not linear.

A nice review of the basic methods applicable to many situations is given by:

Berryman, D., B. Bobee, D. Cluis, and J. Haemmerli (1988). Non-parametric approaches for trend detection in water quality time series. Water Resources Bulletin 24(3), 545-556.

2.9.1 Cox and Stuart test for trend

This is a very simple test to perform and can be used in many different situations, as illustrated in Conover (1999, Section 3.5). 36 The idea behind the test is to first divide the dataset into two parts. Match the first observation in the first part with the first observation in the second part; match the second observation in the first part with the second observation in the second part; etc. Then for each pair of values, determine if the value from the second part is greater than the matched value from the first part. If there is a generally upwards trend in the data, then you should see lots of pairs where the data value for the second part is larger than that of the first part. The number of pairs where the data from the second part exceeds its counterpart in the first part has a binomial distribution with p = .5, and this can be used to determine the p-value of the test. This will be illustrated with an example.

In an earlier section, we examined the records of the grass cutting season over time. We will apply the Cox and Stuart procedure to this data as well.

Here is the raw data again:

35 If a transformation will linearize the line, then an ordinary regression can be used on the transformed data.

36 Conover, W.J. (1999). Applied non-parametric statistics, 2nd edition. Wiley.


Year   Duration (days)

1984 200<br />

1985 215<br />

1986 195<br />

1987 212<br />

1988 225<br />

1989 240<br />

1990 203<br />

1991 208<br />

1992 203<br />

1993 202<br />

1994 210<br />

1995 225<br />

1996 204<br />

1997 245<br />

1998 238<br />

1999 226<br />

2000 227<br />

2001 236<br />

2002 215<br />

2003 242<br />

There are exactly 20 observations, so the data is divided into two parts corresponding to the first 10 years and the last 10 years. 37 This gives the pairing:

37 If the number of observations is odd, then the middle observation is discarded.


Part I             Part II
Year  Duration     Year  Duration     Part II > Part I

1984 200 1994 210 1<br />

1985 215 1995 225 1<br />

1986 195 1996 204 1<br />

1987 212 1997 245 1<br />

1988 225 1998 238 1<br />

1989 240 1999 226 0<br />

1990 203 2000 227 1<br />

1991 208 2001 236 1<br />

1992 203 2002 215 1<br />

1993 202 2003 242 1<br />

If there are any ties in the pairs, these are also discarded. In this case, there were no ties, and the data from the second part was greater than the corresponding data from the first part in 9 of the 10 years.

A two-sided p-value (allowing for either an increasing or decreasing trend) is found by finding the probability

P(X ≥ 9) + P(X ≤ 1)

when X comes from a binomial distribution with n = 10 and p = 0.5.

This can be computed, or found from tables such as at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/PDF/Tables.pdf. A portion of the binomial table with n = 10 is presented below:

Individual binomial probabilities for n=10 and selected values of p<br />

n x 0.1 0.2 0.3 0.4 0.5<br />

------------------------------------------<br />

10 0 0.3487 0.1074 0.0282 0.0060 0.0010<br />

10 1 0.3874 0.2684 0.1211 0.0403 0.0098<br />

10 2 0.1937 0.3020 0.2335 0.1209 0.0439<br />

10 3 0.0574 0.2013 0.2668 0.2150 0.1172<br />

10 4 0.0112 0.0881 0.2001 0.2508 0.2051<br />

10 5 0.0015 0.0264 0.1029 0.2007 0.2461<br />

10 6 0.0001 0.0055 0.0368 0.1115 0.2051<br />

10 7 0.0000 0.0008 0.0090 0.0425 0.1172<br />

10 8 0.0000 0.0001 0.0014 0.0106 0.0439<br />

10 9 0.0000 0.0000 0.0001 0.0016 0.0098<br />

10 10 0.0000 0.0000 0.0000 0.0001 0.0010<br />


From the table above we find that the p-value is

p-value = .0010 + .0098 + .0098 + .0010 = .0216

which is comparable to the p-value of .012 found from a direct application of linear regression.

Unfortunately, it is not possible to estimate the slope or any confidence interval using this method. The test is available in some computer packages but, because of its simplicity, is often easiest to do by hand.
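The whole Cox and Stuart calculation for the grass-cutting data can be sketched in a few lines; the exact binomial p-value differs from the table-based value only by rounding:

```python
from math import comb

# Grass-cutting season duration (days), 1984-2003, from the table above.
duration = [200, 215, 195, 212, 225, 240, 203, 208, 203, 202,
            210, 225, 204, 245, 238, 226, 227, 236, 215, 242]

half = len(duration) // 2
first, second = duration[:half], duration[half:]

# Pair the halves, dropping any tied pairs, and count exceedances:
pairs = [(a, b) for a, b in zip(first, second) if a != b]
n = len(pairs)
exceed = sum(b > a for a, b in pairs)
print(exceed, "of", n)  # 9 of 10

# Two-sided p-value: P(X >= 9) + P(X <= 1) for X ~ Binomial(10, 0.5)
pmf = lambda k: comb(n, k) * 0.5 ** n
p_value = sum(pmf(k) for k in (0, 1, n - 1, n))
print(round(p_value, 4))  # 0.0215 (rounded table values give .0216)
```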

Surprisingly, this very simple test does not perform badly when compared to a real regression. For example, the asymptotic relative efficiency of this test compared to a normal regression situation, when all assumptions are satisfied, is almost 80%. This implies that, using the Cox and Stuart test with 1/.80 = 1.25 times the sample size, you would get the same power to detect a trend as a regular regression.

However, if the data are straightforward, as in this case, there are better non-parametric methods, as will be illustrated in later sections.

2.9.2 Non-parametric regression - Spearman, Kendall, Theil, Sen estimates

Non-parametric does NOT mean no assumptions

While the Cox and Stuart test may indicate that there is evidence of a trend, it cannot provide estimates of the slope etc. Consequently, non-parametric methods have been developed for these situations.

CAUTION: Non-parametric does not mean NO assumptions! Many people view non-parametric methods as a panacea that solves all ills. On the contrary, non-parametric tests also make assumptions about the data that need to be carefully verified in order that the results are sensible.

In the context of non-parametric regression, the following assumptions are usually made, and non-parametric tests may relax some of them:

• Linearity. Parametric regression analysis assumes that the relationship between Y and X is linear. Non-parametric regression analysis makes the same assumption.

• Scale of Y and X. Parametric regression analysis assumes that X is time, so that it has an interval or ratio scale. It is further assumed that Y has an interval or ratio scale as well. Non-parametric regression analysis makes the same assumption except that some methods allow the Y variable to be ordinal. This allows non-parametric methods to be used when values are above detection limits, as they can still often be ordered sensibly.

• Correct sampling scheme. Parametric regression analysis assumes that Y must be a random sample from the population of Y values at every time point. Non-parametric regression analysis makes the same assumption.

• No outliers or influential points. Parametric regression analysis assumes that all the points must belong to the relationship – there should be no unusual points. Non-parametric regression analysis is more
©2012 Carl James Schwarz 258 November 23, 2012


CHAPTER 2. DETECTING TRENDS OVER TIME

robust to failures of this assumption because the actual distances between the observed points and the fitted line are not used directly. However, many outliers can mask the true relationship. A very nice feature of non-parametric methods is that they are invariant to transforms that preserve order. For example, you will get the same p-values whether you use non-parametric analyses on Y or on log(Y). But the estimated slope may be different, as it is measured on a different scale.

• Equal variation along the line on some scale. Parametric regression analysis assumes that the variability about the regression line is similar for all values of X, i.e. the scatter of the points above and below the fitted line should be roughly constant over time. Surprisingly to many people, non-parametric regression analysis assumes that the distribution of Y at each X is the same on some measuring scale, and therefore must also have the same variation. However, because the assumption is about equal variance on some scale, and because non-parametric methods are invariant to simple transformations, this is often satisfied. For example, if a log-transform would stabilize the variance, then it is not necessary to transform before doing the Kendall test. This is one advantage of non-parametric tests over parametric tests, which require homogeneous variation about the regression line.

• Independence. Parametric regression assumes that each value of Y is independent of any other value of Y. Non-parametric regression analysis also makes this assumption. Consequently, non-parametric regression analysis does not deal with autocorrelation.

• Normality of errors. Parametric regression assumes that the difference between the value of Y and the expected value of Y is normally distributed. Non-parametric regression analysis assumes that the distribution of Y at each value of X is the same, but does not require that it be normally distributed. Consequently, heavy-tailed distributions such as the log-normal distribution can be handled with non-parametric regression.

• X measured without error. Parametric regression analysis assumes that the error in the measurement of X is small or non-existent relative to the error variation about the regression line. Non-parametric regression makes the same assumption.
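The invariance to order-preserving transforms mentioned in the outliers bullet is easy to check numerically: taking logs changes the values but not their ranks, so any rank-based statistic is unchanged. A minimal pure-Python sketch (the `average_ranks` helper is written here for illustration; it is not from any particular package):

```python
import math

def average_ranks(values):
    """Rank values from 1..n; tied values share the average of their ranks."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1  # extend j over a run of tied values
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1  # average of the 1-based ranks i+1..j+1
        i = j + 1
    return ranks

y = [200, 215, 195, 212, 225, 240, 203, 208, 203, 202]
# log() preserves order, so the ranks -- and any statistic built on them -- are identical
assert average_ranks(y) == average_ranks([math.log(v) for v in y])
```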

As you can see, data to be used in a non-parametric analysis cannot just be arbitrarily collected – thought must be given to assessing the appropriateness of the regression model.

Surprisingly to many, least-squares regression is actually a non-parametric method! The principle of choosing the regression line to minimize the sum of squared deviations from the regression line makes no distributional assumptions about Y at each X. The assumption of normality comes into play when you compute F- or t-tests to test if the slope is zero, and when you construct confidence intervals for the slope, confidence intervals for the mean response, or prediction intervals for individual observations.

A simple non-parametric test for zero slope is Spearman's ρ, which is simply a correlation coefficient computed on the RANKS of the data.38 The standard Pearson correlation coefficient (discussed in earlier sections) is applied to the ranked data, and the p-value is found by referring to tables or from a large-sample formula. Fortunately, most computer packages compute Spearman's ρ and provide p-values.

38 For each variable, find the smallest value and replace it by the value 1. Find the second smallest value and replace it by the value 2, etc. If there are tied values, replace the tied ranks by the average of the ranks. This is easily done in Excel by repeatedly sorting the (X, Y) pairs, first by X and then by Y.
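Following the recipe in note 38, Spearman's ρ can be sketched in a few lines of pure Python (the helper functions are illustrative, not from any package). Applied to the grass-cutting data of the next example, it reproduces the reported value of .5766:

```python
def average_ranks(values):
    # note 38: smallest value gets rank 1, next gets 2, ...; ties share the average rank
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def spearman(x, y):
    # Spearman's rho is just the Pearson correlation of the ranks
    return pearson(average_ranks(x), average_ranks(y))

years = list(range(1984, 2004))
duration = [200, 215, 195, 212, 225, 240, 203, 208, 203, 202,
            210, 225, 204, 245, 238, 226, 227, 236, 215, 242]
print(round(spearman(years, duration), 4))  # 0.5766
```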


Example: The Grass is Greener (for longer) revisited

For example, the grass cutting example data is ranked as follows:

Year   Duration (days)   Year Rank   Duration Rank
1984        200               1            2.0
1985        215               2           10.5
1986        195               3            1.0
1987        212               4            9.0
1988        225               5           12.5
1989        240               6           18.0
1990        203               7            4.5
1991        208               8            7.0
1992        203               9            4.5
1993        202              10            3.0
1994        210              11            8.0
1995        225              12           12.5
1996        204              13            6.0
1997        245              14           20.0
1998        238              15           17.0
1999        226              16           14.0
2000        227              17           15.0
2001        236              18           16.0
2002        215              19           10.5
2003        242              20           19.0

The correlation computed on the ranks is found to be .5766.

JMP analysis. Parametric and non-parametric correlations between variables are found using the Analyze->MultiVariateMethods->Multivariate platform:


Specify both the X and Y variables in the dialogue box:


Finally, request non-parametric correlations from the drop-down menu:

which gives the following output:

The Spearman ρ is found to be .5766 with a p-value of .0078. This compares to the p-value of .012 from the parametric regression.

Unfortunately, Spearman's ρ does not provide an easy way to estimate the slope or to find confidence intervals for the slope.39

Because Spearman's ρ does not provide a convenient way to estimate the slope or to find confidence intervals for the slope, variants on Kendall's τ are often used instead. This estimator of the slope has many

39 However, refer to Conover (1995), Section 5.5 for details on using Spearman's ρ to estimate a confidence interval for the slope.


names: Sen's (1968) estimator40, Theil's (1950) estimator41, and Kendall's τ42 estimator are all common names. The idea behind these estimators is to look at concordant and discordant pairs of data points. A pair of data points (X1, Y1) and (X2, Y2) is called concordant if (Y2 − Y1)/(X2 − X1) is greater than zero, discordant if the ratio is less than zero, and both if the ratio is 0. For the grass cutting duration data, the data point (1985, 215) is concordant with the data point (1988, 225), but discordant with the data point (1986, 195). As you can imagine, it is far easier to let the computer do the computations!

The test for non-zero slope using Kendall's τ can be computed by finding ALL possible pairs of data points (!) and using the rule:

• if (Yj − Yi)/(Xj − Xi) > 0 then add 1 to Nc (concordant);

• if (Yj − Yi)/(Xj − Xi) < 0 then add 1 to Nd (discordant);

• if (Yj − Yi)/(Xj − Xi) = 0 then add 1/2 to both Nc and Nd;

• if Xi = Xj, no comparison is made.

Kendall's τ is found as:

τ = (Nc − Nd) / (Nc + Nd)

The p-value is found from tables or by the computer.

The computation of τ is simplified by sorting the pairs of (X, Y) by the value of X and creating a spreadsheet to help with the computations. Each value of Y needs only to be compared to those “below” it.
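The counting rules above translate directly into a brute-force script. This is an illustrative pure-Python sketch over all O(n²) pairs; statistical packages add tie corrections and compute the p-value for you:

```python
def kendall_tau(x, y):
    """Kendall's tau via the concordant/discordant counting rules above."""
    nc = nd = 0.0
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            if x[i] == x[j]:
                continue  # tied X values: no comparison is made
            ratio = (y[j] - y[i]) / (x[j] - x[i])
            if ratio > 0:
                nc += 1       # concordant pair
            elif ratio < 0:
                nd += 1       # discordant pair
            else:
                nc += 0.5     # tied Y values count half to each
                nd += 0.5
    return (nc - nd) / (nc + nd)

# a perfectly increasing series has tau = 1
assert kendall_tau([1, 2, 3, 4], [10, 20, 30, 40]) == 1.0
```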

Estimation of the slope and confidence intervals for the slope are found by computing all the pairwise slopes:

Sij = (Yj − Yi) / (Xj − Xi)

The estimate of the slope is simply the median of these values.

A confidence interval for the slope is found by using tables to find the lower and upper quantiles to use as the bounds of the interval. A close approximation to the values to use is found using the following procedure:

• Let n be the number of pairs of points, and N be the number of paired slopes from above.

• Compute w = z √( n(n − 1)(2n + 5) / 18 ), where z is the appropriate quantile from a standard normal distribution. For example, for a 95% confidence interval, z = 1.96.

40 Sen, P.K. (1968). Estimates of the regression coefficient based on Kendall's τ. Journal of the American Statistical Association 63, 1379-1389.

41 Theil, H. (1950). A rank-invariant method of linear and polynomial regression analysis 1, 2, and 3. Nederl. Akad. Wetensch. Proc. 53, 386-392, 521-525, and 1397-1412.

42 Kendall, M.G. (1970). Rank Correlation Methods, Fourth Edition. Charles Griffin and Co., London.


• Compute r = .5(N − w).

• Use the r-th and (N − r)-th values of the sorted paired slopes as the bounds of the confidence interval.

For the mowing duration data, n = 20 and there are N = 190 possible slopes! The estimated slope is the median value. The approximate value of w is 60, so the 65th and 125th sorted values of the paired slopes are the lower and upper bounds of the 95% confidence interval. This gives an estimated slope of 1.389 with a 95% confidence interval of (0.20 → 2.8). This can be compared to the estimated slope of 1.46 and the confidence interval for the slope of (.4 → 2.6) from the ordinary regression analysis.
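The whole Sen-Theil calculation for this example fits in a short script. This is a pure-Python sketch that ignores tie corrections; it reproduces N = 190, w ≈ 60, and r = 65 from above:

```python
import math
import statistics

years = list(range(1984, 2004))
duration = [200, 215, 195, 212, 225, 240, 203, 208, 203, 202,
            210, 225, 204, 245, 238, 226, 227, 236, 215, 242]

# all N = n(n-1)/2 pairwise slopes (the years are distinct, so no tied X values)
slopes = sorted((duration[j] - duration[i]) / (years[j] - years[i])
                for i in range(len(years)) for j in range(i + 1, len(years)))
n, N = len(years), len(slopes)

sen_slope = statistics.median(slopes)

z = 1.96  # for a 95% confidence interval
w = z * math.sqrt(n * (n - 1) * (2 * n + 5) / 18)
r = round(0.5 * (N - w))
lower, upper = slopes[r - 1], slopes[N - r - 1]  # r-th and (N-r)-th sorted slopes, 1-based

print(N, round(w), r)  # 190 60 65
print(round(sen_slope, 3), round(lower, 2), round(upper, 2))
```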

This estimator is rarely found in computer packages, but the computation of the possible slopes can be programmed (sometimes clumsily) and can actually be done in a spreadsheet.

JMP analysis. Kendall's τ is also computed using the Analyze->MultiVariateMethods->Multivariate platform, in the same way as Spearman's ρ was found:

This gives:


The p-value is .0123, very similar to that from the ordinary regression.

It is very clumsy to compute the Sen-Theil-Kendall estimate of the slope in JMP, and it is not done here. Refer to the SAS program for more help.

Final Remarks

Berryman (1988) recommends that Kendall's τ or Spearman's ρ be used for non-parametric testing for trend, as these have the greatest efficiency relative to ordinary parametric regression. It is also recommended (Table 4 of that paper) that a minimum of 9-11 observations be collected before testing for trend using these methods.

It turns out that the asymptotic relative efficiency of both Kendall's τ and Spearman's ρ is very high (90%+), so the planning tools for ordinary regression can be used to estimate the sample sizes required under various scenarios with a fair amount of confidence.

2.9.3 Dealing with seasonality - Seasonal Kendall's τ

Basic principles

In some cases, series of data have an obvious periodicity or seasonal effects.

Consider, for example, values of total phosphorus taken from the Klamath River near Klamath, California, as analyzed by Hirsch et al. (1982).43

43 This was monitoring station 11530500 from the NASQAN network in the US. Data are available from http://waterdata.usgs.gov/nwis/qwdata/?site_no=11530500. The data were analyzed by Hirsch, R.M., Slack, J.R., and Smith, R.A. (1982). Techniques of trend analysis for monthly water quality data. Water Resources Research 18, 107-121.


Total phosphorus (mg/L) in Klamath River near Klamath, CA

                              Year
Month   1972   1973   1974   1975   1976   1977   1978   1979
  1     0.07   0.33   0.70   0.08   0.04   0.05   0.14   0.08
  2     0.11   0.24   0.17    .      .      .     0.11   0.04
  3     0.60   0.12   0.16    .     0.14   0.03   0.02   0.02
  4     0.10   0.08   1.20   0.11   0.05   0.04   0.06   0.01
  5     0.04   0.03   0.12   0.09   0.02   0.04   0.03   0.03
  6     0.05   (the rows for the remaining months did not survive extraction)



A plot of the data shows an obvious seasonality, with peak levels occurring in the winter months. There are also some missing values, as seen in the raw data table. Finally, notice the presence of several very large values (above 0.20 mg/L) that would normally be classified as outliers.

How can a test for trend be fit in the presence of this seasonality?

Hirsch et al. (1982) modified Kendall's τ to deal with seasonality. The method is very simple to describe, but is difficult to implement.

The basic principle is to divide the series into (in this case) 12 separate series, one for each month. These month-based series range from 8 years of data down to 5 years of data. For each month-based series, compute Kendall's τ. Combine the 12 estimates of τ into a single omnibus test to compute the overall p-value. The estimated slope is found from all the pairwise comparisons within each month-based series; these are pooled, and the overall median of the pooled set is used. Unfortunately, there are no simple procedures available to compute confidence intervals for the slope.
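The pooling step described above can be sketched in a few lines (pure Python; the function name and the tiny two-month dataset are hypothetical, for illustration only):

```python
import statistics

def seasonal_sen_slope(series_by_month):
    """Median of the within-month pairwise slopes, pooled over all months.
    series_by_month maps month -> [(year, value), ...]; missing years are simply absent."""
    pooled = []
    for obs in series_by_month.values():
        for i in range(len(obs)):
            for j in range(i + 1, len(obs)):
                (x1, y1), (x2, y2) = obs[i], obs[j]
                if x1 != x2:  # only compare observations from different years
                    pooled.append((y2 - y1) / (x2 - x1))
    return statistics.median(pooled)

# hypothetical two-month series with a common downward drift of 0.01 mg/L per year
data = {1: [(1972, 0.07), (1973, 0.06), (1974, 0.05)],
        2: [(1972, 0.11), (1974, 0.07)]}
print(round(seasonal_sen_slope(data), 3))  # -0.01
```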


Example: Total phosphorus on the Klamath River revisited

JMP Analysis

JMP can be used to compute a test statistic, but it is difficult (!) to estimate the slope.

A JMP dataset with scripts is located in klamath.jmp and klamath2.jmp in the Sample Program Library available at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms.

The JMP dataset has 12 rows and 9 columns corresponding to the various years:

The years must be stacked using the Tables->Stack command to create three columns: Year, Month, and Phosphorus. A portion of the stacked data is illustrated below:


Missing values are indicated by a period. The Analyze->Fit Y-by-X platform can be used to create a data plot to illustrate the seasonal nature of the data (not shown).

To compute Kendall's τ for each month, use the Analyze->MultiVariateMethods->Multivariate platform and specify Month in the BY area:


This will give the estimates of correlation for each month. In order to request Kendall's τ for EVERY plot, hold down the option key before clicking on the red triangle to request the non-parametric Kendall τ statistic. Unfortunately, there is no way to have JMP automatically save all the Kendall τ's to a new data sheet for subsequent processing. You will have to manually (groan) type in each estimate of τ and its p-value to give the following table:


Unfortunately, JMP does not provide the raw value underlying Kendall's τ (what Hirsch et al. call S), so we can't use the direct method outlined in Hirsch et al. of simply adding the values of S. A somewhat indirect method must be used to combine the reported values of τ and their p-values over the 12 months.

This indirect method converts each p-value back to a z-score. As the z-scores are distributed as Normal distributions (with mean 0 and variance 1) and are assumed to be independent across the months, their sum is normally distributed with mean 0 and variance equal to the sum of the variances (in this case 12). This resulting sum can then be converted to an actual p-value.

To convert a p-value back to a z-score, use the relationship

z = Φ−1(1 − pvalue/2) × sign(τb)

where Φ−1 is the inverse normal probability function; the 1 − pvalue/2 converts the two-sided p-value to the upper tail of the normal curve, and the sign function makes sure that the z value also has the correct sign (i.e. positive or negative). This is done by creating a new column in JMP and creating a formula for this column:


The Normal Quantile function is the inverse normal function, and the IF clause serves as the sign function. The column Var is simply the variance of the z-score.

This gives the table:


We then add together the z-scores and the variances to give an overall z-score. The Tables->Summary command can be used to get this total:


to give:

Finally, we use a final formula to compute the probability of exceeding this total z-score:


and the final overall p-value:

The overall p-value is .0049. This can be compared to the paper by Hirsch et al. (1982), who obtained an overall z-value of -2.69 with a p-value of .0072.
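The same indirect method can be scripted outside JMP with the standard library. The monthly τ's and two-sided p-values would be typed in from the JMP output; the three (τ, p) pairs below are made-up placeholders, not the Klamath values:

```python
import math
from statistics import NormalDist

def combine_monthly(monthly):
    """monthly: list of (tau, two_sided_p) pairs, one per sub-series."""
    nd = NormalDist()
    # p-value -> z-score, signed by the direction of the trend (the sign of tau)
    zs = [math.copysign(nd.inv_cdf(1 - p / 2), tau) for tau, p in monthly]
    # a sum of k independent N(0,1) scores is N(0, k): standardize, then convert back
    z_total = sum(zs) / math.sqrt(len(zs))
    p_overall = 2 * (1 - nd.cdf(abs(z_total)))
    return z_total, p_overall

# hypothetical example: three months, each with a modest downward trend
z, p = combine_monthly([(-0.4, 0.10), (-0.3, 0.20), (-0.5, 0.06)])
print(round(z, 2), round(p, 4))
```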

Unfortunately, there is no simple way to estimate the slope using JMP.

Final notes

As pointed out earlier, non-parametric analyses are not assumption free – they merely have different assumptions than parametric analyses. In this method, the key assumption of independence is still important. Because the data are broken into month-based series, this is likely satisfied – it seems reasonable that the value in January 1971 has no influence on the value in January 1972. However, it is likely not true that January 1971 is independent of February 1971, which would likely invalidate a simple use of Kendall's method on the entire series.

As Hirsch et al. (1982) point out, it is possible that some sub-series exhibit strong evidence of an upward trend, some sub-series exhibit strong evidence of a downward trend, but the overall omnibus test fails to detect evidence of a trend. This is not unexpected, and if one is interested in the individual sub-series, then these should be examined individually.

In the original paper by Hirsch et al. (1982), they did not allow for multiple observations in each time period. This actually poses no problem with computer implementations, which handle ties appropriately.


Lastly, you may have noticed in the original data some values that were marked as below the detection limit. These censored observations pose no problem for most non-parametric tests. Clearly a value that is below a detection limit (e.g. < .01) is also less than .05. The only problems arise in making sure that, if there are multiple different detection limits, comparisons are handled appropriately. Usually this implies using the largest detection limit in place of any lower detection limits.

Hirsch et al. (1982) did several simulation studies of the seasonal Kendall test and found that it had high power to detect changes.

The Seasonal Kendall estimator has been implemented in many packages specially designed for environmental studies. Unfortunately, there are no packages that I am aware of that report confidence intervals for the slope.

Berryman (1988) recommends that at least 60 observations spanning at least 5 cycles be obtained before using the Seasonal Kendall method.

2.9.4 Seasonality with Autocorrelation

General ideas

As noted earlier, the Seasonal Kendall method still assumes that observations in different series are independent, i.e. that the January 1972 reading is not related to the February 1972 reading. In some cases this is untrue; for example, in a wet year, the stream flow may be higher than average for all months, leading to positive correlation across series.

Hirsch and Slack (1984)44 considered this problem. As in the Seasonal Kendall test, the data are first divided into sub-series, e.g. monthly series across several years. The Kendall statistic for trend across years is computed for each sub-series, e.g. for each month. These sub-series statistics are added together to give an omnibus test statistic. The Seasonal Kendall method could simply sum the variances of each test statistic to give the omnibus variance, from which a z-score could be computed and a p-value obtained. However, because the sub-series are autocorrelated, the new test must also add together estimates of the covariances among the test statistics from the individual sub-series to get the omnibus variance prior to computing a z-score and p-value.

Unfortunately, this procedure is implemented in only a handful of specialized software packages for the analysis of water quality and hydrologic data. These packages can be located with a quick search on the WWW. It is not feasible to do the computations in JMP, nor in SYSTAT; the computations could likely be done in SAS, but are complex and well beyond the scope of these notes.

Consequently, this method will not be discussed further in these notes – interested readers are referred to Hirsch and Slack (1984).

44 Hirsch, R.M. and Slack, J.R. (1984). A non-parametric trend test for seasonal data with serial dependence. Water Resources Research 20, 727-732.


Note that because parametric methods are now readily available – refer to earlier chapters of these notes – there is less need for these non-parametric procedures.

Berryman (1988) and Hirsch and Slack (1984) recommend that at least 120 observations spanning at least 10 cycles be obtained before using the Seasonal Kendall method adjusted for autocorrelation.

2.10 Summary

This chapter is concerned mainly with detecting monotonic trends over time, i.e. a gradual increase or decrease over time. Some methods were introduced to deal with seasonal effects, but these effects are nuisance effects and should be eliminated prior to analysis.

It is possible for these trends over time to be masked by exogenous variables, i.e. variables other than Y and X. For example, many ground water variables are influenced by flow, over and above seasonal effects. It was beyond the scope of these notes, but the effects of these exogenous variables should be removed before the trend analysis is done. This can be done using multiple regression or other curve-fitting techniques such as LOWESS.

Measurements taken in close proximity over time are likely to be related to each other. This is known as serial correlation or autocorrelation. It is often induced by some environmental variable that is slowly changing over time and also affects the monitored variable. Again, these exogenous effects should be removed first. Some residual autocorrelation may still be present. The most common test statistic to detect autocorrelation is the Durbin-Watson statistic, where values near 2 indicate a lack of autocorrelation.

Trend analyses can be done using either parametric or non-parametric methods. BOTH types of analyses make certain assumptions about the data – non-parametric methods are NOT assumption-free! It turns out that modern non-parametric methods are relatively powerful for detecting trends even when all the assumptions of the parametric methods are satisfied. Hence there is little loss in power in using these methods. In addition, because they use the relative ranking of observations, they are relatively insensitive to outliers, moderate levels of non-detected values, and missing values.

If so, why not always use non-parametric methods? The basic impediments to the use of non-parametric methods are a lack of suitable computer software, the difficulty of computing point estimates and confidence intervals for the trend line, and the difficulty of making predictions for future observations. However, non-parametric tests are often ideally suited for mass screening. These procedures can be automated, and it is not necessary to examine the possibly hundreds of individual datasets to see which need to be transformed before parametric procedures can be used.

Finally, what should be done about outliers? Blindly including outliers in non-parametric methods without investigating their cause can be very dangerous: trends may be detected that are not real. An outlier, by definition, is a point that doesn't appear to fit the same pattern as the other data values. An assumption of most non-parametric tests is that the distribution of Y values at each X is the same (it need not be normal) – this would also require you to exclude outliers. Even parametric methods can deal with outliers nicely –

©2012 Carl James Schwarz 277 November 23, 2012


CHAPTER 2. DETECTING TRENDS OVER TIME<br />

a whole area of statistics deals with robust regression methods in which outliers are iteratively reweighted and given a low weight if they appear to be anomalous. For example, SAS provides Proc RobustReg to do robust regression.
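The iterative-reweighting idea can be sketched in a few lines. This is a generic Huber-weight illustration, not the algorithm of SAS Proc RobustReg; the function name `robust_slope` and the tuning constant k = 1.345 are the author of this sketch's choices:

```python
def robust_slope(xs, ys, k=1.345, n_iter=20):
    """Fit y = a + b*x by iteratively reweighted least squares,
    downweighting points with large residuals (Huber weights)."""
    w = [1.0] * len(xs)
    a = b = 0.0
    for _ in range(n_iter):
        # weighted least squares for intercept a and slope b
        sw = sum(w)
        xbar = sum(wi * x for wi, x in zip(w, xs)) / sw
        ybar = sum(wi * y for wi, y in zip(w, ys)) / sw
        sxx = sum(wi * (x - xbar) ** 2 for wi, x in zip(w, xs))
        sxy = sum(wi * (x - xbar) * (y - ybar)
                  for wi, x, y in zip(w, xs, ys))
        b = sxy / sxx
        a = ybar - b * xbar
        resid = [y - a - b * x for x, y in zip(xs, ys)]
        # robust scale: median absolute residual, rescaled for normality
        s = sorted(abs(r) for r in resid)[len(resid) // 2] / 0.6745 or 1.0
        # Huber weights: full weight inside k*s, declining outside
        w = [1.0 if abs(r) <= k * s else k * s / abs(r) for r in resid]
    return a, b

# With a gross outlier in the last observation, the outlier is
# progressively downweighted rather than discarded outright.
a, b = robust_slope([0, 1, 2, 3, 4, 5], [0.0, 1.0, 2.0, 3.0, 4.0, 100.0])
```

On clean data the weights stay at 1 and the fit reduces to ordinary least squares; with the outlier above, the fitted slope is pulled back toward the bulk of the data.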

A summary table of the various methods considered in this section of the notes appears below: 45

45 This table is based on Trend Analysis of Food Processor Land Application Sites in the LUBGWMA, available at: http://www.deq.state.or.us/wq/groundwa/LUBGroundwater/LUBGTrendAnalysisApp1.pdf

Simple Linear Regression (Parametric; does not account for seasonality)
Advantages:
· Most powerful if assumptions hold, especially normality, non-seasonality, and independence.
· Familiar technique to many scientists.
· Simple to compute the best-fit line.
· Available in most computer packages.
Disadvantages:
· Environmental data rarely conform to the test assumptions.
· Sensitive to outliers.
· Difficult to handle non-detect values.
· Serial correlation gives unbiased estimates, but they are not efficient. Consider methods to account for autocorrelation.
· Does not account for seasonality.
Recommended sample size: 10. Good power programs are available.

Kendall's τ (Non-parametric; does not account for seasonality)
Advantages:
· Non-detects and outliers are easily handled.
· Same p-value regardless of the transform used on Y.
Disadvantages:
· Does not account for seasonality.
· Not robust against autocorrelation.
· Difficult to make predictions.
Recommended sample size: 10.

Seasonal Regression (Parametric; accounts for seasonality by subtracting the monthly mean or median over years from the original data, then regressing the residuals over time or using ANCOVA methods)
Advantages:
· Accounts for seasonality.
· Produces a description of the seasonality pattern.
Disadvantages:
· Assumes normality of the adjusted values about the regression line.
· Not robust against serial correlation.
· Requires near-complete records for each set of monthly data. If the pattern of missing years varies among the months, the monthly mean used to adjust for seasonal effects may be misleading.
· Reported se are too small because the adjustment for seasonality is not incorporated unless the ANCOVA method is used.
Recommended sample size: 30, with at least 5 cycles.

Sine/Cosine Regression (Parametric; accounts for seasonality – deseasonalized values are obtained by fitting a sine/cosine curve)
Advantages:
· Accounts for seasonality.
Disadvantages:
· With few exceptions, there is little reason to believe that the form of the seasonality follows a sine/cosine curve.
Recommended sample size: 30, with at least 5 cycles.

Regression adjusted for autocorrelation (Parametric; does not account for seasonality)
Advantages:
· Accounts for autocorrelation in the data.
· Can also be adjusted for seasonality.
Disadvantages:
· Requires sophisticated software.
· Extremely high autocorrelation may be invisible.
Recommended sample size: 20.

Seasonal Kendall without correction for serial correlation (Non-parametric; accounts for seasonality, but only by comparing the data from the same season, e.g. months)
Advantages:
· Accounts for seasonality.
· Robust against non-detects and outliers.
Disadvantages:
· When applied to data that is not seasonal, has a slight loss of power.
· Not robust against serial correlation.
· Difficult to estimate confidence intervals.
· Not all computer packages have this method; may require further programming.
Recommended sample size: 60, with at least 5 cycles.

Seasonal Kendall adjusted for autocorrelation (Non-parametric; accounts for seasonality as above)
Advantages:
· Accounts for seasonality.
· Robust against non-detects and outliers.
· Robust against serial correlation.
Disadvantages:
· Significant loss of power when applied to data that is not seasonal or lacks autocorrelation.
· Specialized software required.
Recommended sample size: 120, with at least 10 cycles.


Chapter 3

Estimating power/sample size using Program Monitor

J. Gibbs has written a Windows program to estimate the power and sample size requirements for many common monitoring programs.

Gibbs, J. P., and Eduard Ene. 2010. Program MONITOR: Estimating the statistical power of ecological monitoring programs. Version 11.0.0. http://www.esf.edu/efb/gibbs/monitor/

CAUTION: Version 11.0 of MONITOR appears to have some “features” that result in incorrect power computations in certain cases. Please contact me in advance of using the results from MONITOR in a critical planning situation to ensure that you have not stumbled on some of the “features”.

Program MONITOR uses simulation procedures to evaluate how each component of a monitoring program influences its power to detect a linear (regression) change. The program has been cited in numerous peer-reviewed publications since it first became available in 1995.

Before using Program MONITOR, you will need to gather some basic information about the proposed study.

• What is the initial value of your population? This could be the initial population size, the initial density, etc.

• How precisely can you measure the population at a given sampling occasion? This can be given as the standard error you expect to see at any occasion, or the relative standard error (standard error/estimate), etc.

• What is the process variation? Do you really expect that the measurements would fall precisely on the trend line in the absence of measurement error?

• What is the significance level and target power? Traditional values are α = 0.05 with a power of 80%, or α = 0.10 with a target power of 90%.

3.1 Mechanics of MONITOR<br />

Let us first demonstrate the mechanics of MONITOR before looking at some real examples of how to use it for monitoring designs.

Suppose we wish to investigate the power of a monitoring design that will run for 5 years. At each survey occasion (i.e. every year), we have 1 monitoring station, and we make 2 estimates of the population size at the monitoring station in each year. The population is expected to start with 1000 animals, and we expect that the measurement error (standard error) in each estimate is about 200, i.e. the coefficient of variation of each measurement is about 20% and is constant over time. We are interested in detecting increasing or decreasing trends; to start, a 5% decline per year will be of interest. We will assume an UNREALISTIC process error of zero, so that the sampling error is equal to the total variation in measurements over time.

Launch Program MONITOR:<br />


The screen starts with default values. We make some changes:

• Change the sampling occasions to the values 0, 1, 2, 3, 4.


• Change the number of survey plots/year to 2.

• Check that the significance level is set to 0.05.

• Check that the desired power is set to 0.80.

• Check that the range of desired trends encompasses −5%. You might want to increase the number of trend powers computed to 21 to get power computations for every value rather than every second value.

• Check that the two-sided test is selected.


Then click on the Plots tab and enter the initial population size (1000) and a variation (the STANDARD DEVIATION) in measurements of 200 under Total Variation.


Press the Run icon and the following results are shown. [Because the power computations are based on a simulation, your results may vary slightly.]


Notice that the net change with a 5% decline/year is only an 18.5% total decline over the five-year period. This is obtained as:

Year   Mean Abundance           % Total Decline
 0     1000                        0.0%
 1     950.0 = 1000(0.95)         −5.0%
 2     902.5 = 1000(0.95)^2       −9.7%
 3     857.4 = 1000(0.95)^3      −14.3%
 4     814.5 = 1000(0.95)^4      −18.5%
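The compounding in the table can be verified with a few lines of code (a standalone sketch, not part of MONITOR):

```python
# A 5% decline per year compounds multiplicatively from an initial
# abundance of 1000, reproducing the table above.
initial, rate = 1000.0, 0.95

for year in range(5):
    abundance = initial * rate ** year
    total_decline = (1 - rate ** year) * 100
    print(f"Year {year}: abundance {abundance:.1f}, "
          f"total decline {total_decline:.1f}%")
```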

By clicking on the Trend vs. Power Chart tab, you see a graph of the power by the size of the trend:


This design has a power of around 15% for detecting this trend – hardly worth doing the study!

How many years would be needed to detect this trend with 80% power? Try modifying the number of sampling years until you get the approximate power needed:


So about 10 years of monitoring will be needed to detect a 5% decline PER YEAR with about 80% power.

The differences in reported power between the MONITOR and TRENDS programs are artifacts of the different ways the two programs compute power (and potentially because of some ‘features’ of the MONITOR program). TRENDS uses analytical formulae based on normal approximations, while MONITOR conducts a simulation study and reports the number of trials (in this case out of 500) that detected the trend. In any event, don’t get hung up over these differences – the key point is that this proposed study has virtually no power to detect a 5% decline/year.

Program MONITOR also has a hand calculator to convert between the trend per year and the total trend over the course of the experiment.


For example, a 5% decline per year for 5 ADDITIONAL years translates into an overall decline of 22.6% over the six years of the study (the one initial year + 5 ADDITIONAL years). It is not a straight arithmetic conversion because the changes are multiplicative rather than additive, as shown earlier.

3.2 How does MONITOR work?

Program MONITOR estimates power using a simulation-based approach, as outlined in the help file. For example, consider the situation outlined in the previous section. Again set up the control parameters in the same way, except change the trend lines to look only at a single value for the decline (−5% per year).


Then press the Step icon. The following display is obtained:


First the underlying deterministic trend is generated (the black line in the middle of the plot). Then, based on the variation expected in the measurements, actual “data” are generated (shown by circles; note that at time 1, the values are “off the plot”) and presented in the Survey count details tab:


Then it gets a bit odd, and the output is potentially misleading. A regression line is fit through the points (the red line in the first graph; estimates at the bottom of the data window). But this curve is not the one used to estimate the power. Rather, a regression line is fit through the log(data), and the results from the regression on the log(data) are used to determine if the trend was detected. The analysis is done on the log scale because of the multiplicative way in which the deterministic trend is generated. Refer to the analyses from JMP below to see which statistics are used:


In this case, the estimated trend line (on the log scale) was not statistically different from zero, and the trend was NOT detected.

The simulation is repeated many hundreds of times; the proportion of simulations in which a statistically significant trend was detected is then the estimated power for this design.
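The procedure can be sketched in a few lines. This is a simplified illustration, not MONITOR's exact algorithm: it assumes the design from the walkthrough above (5 years, 2 plots/year, initial abundance 1000, 20% CV, no process error) and uses a normal 1.96 cutoff rather than the exact t quantile:

```python
import math
import random

def simulate_power(n_years=5, plots_per_year=2, n0=1000.0, cv=0.20,
                   decline=0.05, cutoff=1.96, n_sims=500, seed=1):
    """Estimate power to detect a log-linear trend by simulation."""
    random.seed(seed)
    detected = 0
    for _ in range(n_sims):
        xs, ys = [], []
        for t in range(n_years):
            mean = n0 * (1 - decline) ** t
            for _ in range(plots_per_year):
                obs = random.gauss(mean, cv * mean)
                if obs > 0:                    # log() needs positive "data"
                    xs.append(t)
                    ys.append(math.log(obs))
        # ordinary least squares of log(count) on year
        n = len(xs)
        xbar, ybar = sum(xs) / n, sum(ys) / n
        sxx = sum((x - xbar) ** 2 for x in xs)
        slope = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / sxx
        resid_ss = sum((y - ybar - slope * (x - xbar)) ** 2
                       for x, y in zip(xs, ys))
        se_slope = math.sqrt(resid_ss / (n - 2) / sxx)
        if abs(slope / se_slope) > cutoff:     # was the trend "detected"?
            detected += 1
    return detected / n_sims

print(simulate_power())  # well below the usual 0.80 target
```

As in MONITOR, the estimated power is simply the fraction of simulated datasets in which the fitted log-scale slope is statistically different from zero.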

3.3 Incorporating process and sampling error

As noted in the chapter on Trend Analysis, there are often two sources of variation in any monitoring study.

First is sampling variation. This occurs because it is impossible to measure the population parameter exactly in any one year. For example, if we are measuring the mean DDT level in birds, we must take a sample (say of 10 birds), sacrifice them, and find the mean DDT in those 10 birds. If a different sample of 10 birds were selected, the sample mean DDT would vary in the second sample. This is called sampling error (or the standard error) and can be estimated from the data taken in a single year. Or, the parameter of interest may be the number of smolt leaving a stream, estimated using capture-recapture methods. Again we would have a measure of uncertainty (the standard error) for each measurement in each year. Sampling error (the standard error) can be reduced by increasing the effort in each year.

However, consider what happens when measurements are taken in different years. It is unlikely that the population values would fall exactly on the trend line even if the sampling error were zero. This is known as process error and is caused by random “year” effects (e.g. an El Niño). Process error CANNOT be reduced by increasing the sampling effort in a year.

The two sources of variation are diagrammed below:

Unfortunately, process error is often the limiting factor in a monitoring study!

In order to estimate the process and sampling variation, you will need at least two years of data or some educated guesses from previous years. The Program MONITOR website has a spreadsheet tool to help you in the decomposition of process and sampling error.

For example, consider a study to monitor the density of white-tailed deer obtained by distance sampling on Fire Island National Seashore (Underwood et al., 1998), presented as the example on the spreadsheet to separate process and sampling variation.

The estimated density (and se) are:

Year Density SE<br />

1995 79.6 23.47<br />

1996 90.1 11.67<br />

1997 107.1 12.09<br />

1998 74.1 10.45<br />

1999 64.2 13.90<br />

2000 40.8 12.38<br />

2001 41.2 7.40<br />

Consider the plot of density over time (with approximate 95% confidence intervals):


Assuming that the deer density is in steady state over the seven years of the study, you can see that there is considerable process error: many of the 95% confidence intervals for the deer density do not cover the mean density over the study period. So even if the sampling error (the se) were driven to zero by adding more effort, the data points would not all lie exactly on the mean line over time.

There are many ways to separate process and sampling variation – the chapter on the analysis of BACI designs presents some additional ones. The following is an approximate analysis that should be sufficient for most planning purposes.


First, examine a plot of the estimated se versus the density estimates:

In many cases, there is a relationship between the se and the estimate, with larger estimates tending to have a higher se than smaller estimates. The previous plot shows that, except for one year, the se is relatively constant. If the se had a positive relationship to the estimate, a weighted procedure could be used (this is the procedure used in Underwood’s spreadsheet).

We begin by finding the mean density and the total variation from the mean. [If the preliminary study had an obvious trend, you could fit the trend line and then find the total variation from the trend line in a similar fashion.]

We start by finding the total variation in the density estimates over time:

    Var_Total = var(79.6, 90.1, ..., 41.2) = 599.6

The total variation is equal to the process + sampling variation. An estimate of the average sampling variation is found by averaging the se²:

    Var_Sampling = (23.47² + 11.67² + ... + 7.40²) / 7 = 191.9

Finally, the process variance is found by subtraction:

    Var_Process = Var_Total − Var_Sampling = 599.6 − 191.9 = 407.7
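The arithmetic above can be checked directly. A sketch using the deer data from the table; the results match the values in the text up to rounding:

```python
# Decompose total variation in the yearly density estimates into
# process and (average) sampling components.
densities = [79.6, 90.1, 107.1, 74.1, 64.2, 40.8, 41.2]
ses       = [23.47, 11.67, 12.09, 10.45, 13.90, 12.38, 7.40]

n = len(densities)
mean = sum(densities) / n
var_total = sum((d - mean) ** 2 for d in densities) / (n - 1)  # sample variance
var_sampling = sum(se ** 2 for se in ses) / n                  # average of the se^2
var_process = var_total - var_sampling                         # by subtraction

print(f"total {var_total:.1f}, sampling {var_sampling:.1f}, "
      f"process {var_process:.1f}")
```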

We now launch Program Monitor; we are interested in a 10-year study to look at changes in the population density following some management action. Notice that we now specify a partitioning of the variation into process and sampling error:


We use the sqrt() of the two variances estimated above when specifying the two sources of variation:


and then press the Run button as before to get:


The power to detect a 5% decline PER YEAR is not very good.

It is instructive to see what would happen if you believed that there was NO process variation and simply used the average sampling variation as the sole source of variation:


Now the (incorrect) estimated power is much higher.<br />

3.4 Presence/Absence Data

Sometimes only presence/absence data can be collected on each plot, rather than a measure of density. In cases like this, you may wish to consider occupancy modelling, but that is a topic for another course.

Despite not having an absolute measure of abundance, presence/absence data can be used to monitor the density of species with relatively low abundances. This makes use of the Poisson distribution to predict presence/absence as a function of density.

For example, according to the Poisson distribution, if the average density per plot is µ, then the probability that a sampled plot will be labelled as a presence is 1 − exp(−µ), and the probability that a sampled plot will be labelled as an absence is exp(−µ). So a change in the overall proportion of sites that are occupied corresponds to a change in the overall average density.
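As a quick sketch of this relationship: under a Poisson model, the chance a plot registers at least one individual is 1 − exp(−µ). With the least bittern base rate used later in this section (about 0.20 calling birds per visit), a presence is expected on roughly 18% of visits:

```python
import math

def presence_probability(mu: float) -> float:
    """P(count >= 1) for a Poisson(mu) plot count: 1 - exp(-mu)."""
    return 1.0 - math.exp(-mu)

print(round(presence_probability(0.20), 3))  # 0.181
```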

Note that we are implicitly assuming that all absences are true absences, i.e. not false negatives. If false negatives are possible, you really should be using an occupancy design rather than a simple presence/absence design.

We will use the example that ships with Program MONITOR. This example focuses on the least bittern (Ixobrychus exilis), a secretive marsh bird. Least bittern populations are hard to monitor given the species’ unpredictable calling behavior; calls are the only way to detect its presence within the dense vegetation of the marshes where it lives. Consider that baseline surveys of least bitterns between May 15 and June 15 indicate that an average of about 0.20 calling least bitterns were heard on any given visit. A water control structure on the marsh is being altered to generate a more stable water level that should improve the situation for bitterns at the site. How much of a trend can be detected with 10 years of monitoring and 10 visits to the marsh each year?

Here the average of 0.20 calls/visit implies that a “presence” was detected on about 1 in 5 visits to the marsh.

Start by entering the data on the main page and then on the plots page.


With presence/absence data, the plot “mean” should be the approximate base rate of presences, and there is no need for a standard deviation estimate. On the main page, tests for trend in presence/absence data are equivalent to “chi-square tests” (covered in another section of the notes). The Custom/ANOVA area indicates a doubling of the presence frequency in the second through tenth years of monitoring.

Before computing the power, press the Step button to get a feel for the data that are generated (not shown). I think this is where Program MONITOR has a “feature”, as the data in the 3rd and subsequent visits never have any non-detects.

C<strong>on</strong>sequently, I w<strong>on</strong>’t c<strong>on</strong>tinue with this example until I understand what MONITOR is doing! I have<br />

SAS programs that can help in planning of presence/absence studies – please c<strong>on</strong>tact me <str<strong>on</strong>g>for</str<strong>on</strong>g> assistance.<br />
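As a rough illustration of the same idea, the power question posed above can be sketched by Monte Carlo simulation. This is not MONITOR's actual algorithm (nor the author's SAS programs); it assumes the detection probability rises linearly from 0.20 to 0.40 (a doubling) over the 10 years and uses a simple linear trend test on the yearly detection proportions:

```python
# Monte Carlo sketch of power for detecting a trend in presence/absence
# data: 10 years of monitoring, 10 visits per year, detection probability
# assumed to double linearly from 0.20 to 0.40 over the study.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2012)

years = np.arange(1, 11)           # 10 years of monitoring
visits = 10                        # visits to the marsh per year
p_detect = np.linspace(0.20, 0.40, len(years))  # assumed doubling

n_sim = 2000
rejections = 0
for _ in range(n_sim):
    # yearly detection counts, converted to proportions
    props = rng.binomial(visits, p_detect) / visits
    # crude trend test: OLS slope of proportion on year
    result = stats.linregress(years, props)
    if result.pvalue < 0.05:
        rejections += 1

power = rejections / n_sim
print(f"approximate power: {power:.2f}")
```

A more faithful analysis would use a test designed for binomial data (e.g. a chi-square test for trend or logistic regression), but the simulation structure is the same.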

3.5 WARNING about testing for temporal trends

The Patuxent Wildlife Research Center has some sage advice about power analysis for temporal trends:

Users should be aware (and wary) of the complexity of power analysis in general, and also acknowledge some specific limitations of MONITOR for many real-world applications. Our chief, immediate concern is that many users of MONITOR may be unaware of these limitations and may be using the program inappropriately. Below are comments from one of our statisticians on some of the aspects of MONITOR that users should be cognizant of: "There are numerous issues with how Program Monitor calculates statistical power and sample size. One issue concerns the default option whereby the user assumes independence of plots or sites from one time period to the next. If you are randomly sampling new sites or plots each time period, then it is correct to assume independence (assuming that the finite population correction factor is not an issue, which depends on how many plots or sites you are sampling relative to the total population size of potential plots or sites). If you are sampling the same plots or sites repeatedly over time, however, then the default option in Program Monitor is unlikely to give a correct calculation of statistical power or sample size. If plots or sites are positively autocorrelated over time, as is usually the case in biological surveys, then Program Monitor will underestimate the sample size or, conversely, overestimate the statistical power. The correct sample size estimate is likely to be greater, and depending upon the amount of autocorrelation, the correct sample size could be vastly greater to achieve a stated power objective."

We deal with some of these issues when we discuss the design and analysis of BACI surveys later in this course.



Chapter 4

Regression - hockey sticks, broken sticks, piecewise, change points

A simple regression analysis assumes that the change in response is the same across the range of X values. In some cases, a model where the slope changes in different parts of the X space may be biologically more realistic.

This chapter examines two cases of fitting regression lines with breaks in the slope. In the first case, the location of the change in slope is known in advance; in the second case, the location of the change is also estimated, which is known as the change point problem.

The examples in this chapter look at cases with a single change point; the extension to multiple change points (both known and unknown) is straightforward. Similarly, the change from linear to quadratic segments is also straightforward.

A related method, a spline fit, where a flexible curve is fit between (evenly) spaced knot points to give something like a non-parametric curve fit, is explored in a different chapter.

4.1 Hockey-stick, piecewise, or broken-stick regression

In this section, the location of the change point is known. The statistical model is:

Y = β0 + β1(X) + β2(X − C)+ + ε

where β0 is the intercept, β1 is the slope before the change point C, and β2 is the DIFFERENCE in slope after the change point. The slope after the change point is β1 + β2. The variable (X − C)+ is a derived variable that takes the value 0 for values of X less than C and the value X − C for values of X greater than C. It is usually created using a Formula Editor based on the actual data.

The hypothesis of interest is H: β2 = 0, which indicates no change in slope between X < C and X > C.

Because the value of C is specified in advance, ordinary least squares can be used to fit the model. Most computer packages can easily fit this model.
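The text fits this model in JMP; as a sketch of the same ordinary least-squares fit in Python, with a synthetic data set invented for illustration (the change point C = 5 and all values below are made up):

```python
# Sketch of fitting the hockey-stick model with a KNOWN change point C
# by ordinary least squares, using the derived variable (X - C)+ that
# the text describes.  Synthetic data, for illustration only.
import numpy as np

rng = np.random.default_rng(1)
C = 5.0                                    # known change point
x = np.linspace(0, 10, 50)
# true model: intercept 2, slope 1 before C, slope 1 + 2 = 3 after C
y = 2 + 1 * x + 2 * np.clip(x - C, 0, None) + rng.normal(0, 0.5, x.size)

# design matrix: intercept, X, and the derived variable (X - C)+
xc = np.clip(x - C, 0, None)
X = np.column_stack([np.ones_like(x), x, xc])

beta, *_ = np.linalg.lstsq(X, y, rcond=None)
b0, b1, b2 = beta
print(f"intercept={b0:.2f}, slope before C={b1:.2f}, "
      f"slope change={b2:.2f}, slope after C={b1 + b2:.2f}")
```

Because C is fixed, this is an ordinary linear model: the only work is constructing the derived column, exactly as with JMP's Formula Editor.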

4.1.1 Example: Nenana River Ice Breakup Dates

The Nenana River in the Interior of Alaska usually freezes over during October and November. The ice continues to grow throughout the winter, accumulating an average maximum thickness of about 110 cm, depending upon winter weather conditions. The Nenana River Ice Classic competition began in 1917 when railroad engineers bet a total of 800 dollars, winner takes all, guessing the exact time (month, day, hour, minute) the ice on the Nenana River would break up. Each year since then, Alaska residents have guessed at the timing of the river breakup. A tripod, connected to an on-shore clock with a string, is planted in two feet of river ice during river freeze-up in October or November. The following spring, the clock automatically stops when the tripod moves as the ice breaks up. The time on the clock is used as the river ice breakup time. Many factors influence the river ice breakup, such as air temperature, ice thickness, snow cover, wind, water temperature, and depth of water below the ice. Generally, the Nenana River ice breaks up in late April or early May (historically, April 20 to May 20). The time series of Nenana River ice breakup dates can be used to investigate the effects of climate change in the region.

In 2010, the jackpot was almost $300,000 and the ice went out at 9:06 on 2010-04-29. In 2012, the jackpot was over $350,000 and the ice went out at 19:39 on 2012-04-23, as reported at http://www.cbc.ca/news/offbeat/story/2012/05/02/alaska-ice-contest.html. The latest winner, Tommy Lee Waters, had also won twice before, but had never been a solo winner. Waters spent time drilling holes in the area to measure the thickness of the ice. Altogether he spent $5,000 on tickets for submitting guesses (he purchased every minute of the afternoon of 23 April) and spent an estimated 1,200 hours working out the math by hand. And, it was also his birthday! (What are the odds?) You too can use statistical methods to gain fame and fortune!

More details about the Ice Classic are available at http://www.nenanaakiceclassic.com.

The data are available in the nenana.jmp data file in the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms.

A simple regression line fit to the time of breakup with year as the predictor shows evidence of a decline over time (i.e. the time of breakup is tending to occur earlier), and there is no evidence of auto-correlation.

A closer inspection of the top graph gives the impression that until about 1970 the regression line was "flat", and only after 1970 did the time of breakup seem to decrease.

A broken-stick model (separate slopes in the pre-1970 and post-1970 eras) can easily be fit. We need to create a new variable that is zero for the pre-1970 period and equal to (year − 1970) in the post-1970 period. This is easily created in JMP using the Formula Editor.

This is then fit using the Analyze->Fit Model platform, which gives the parameter estimates.

The formal statistical model is:

Date = β0 + β1(year) + β2(year − 1970)+ + ε

In years prior to 1970, the slope is β1. In years after 1970, the slope is β1 + β2. A test for differential slopes in the two eras is then equivalent to a test of whether β2 = 0.

In this case the p-value for the β2 coefficient (associated with the (year − 1970)+ variable) is just under 0.05, providing some evidence of a different slope in the two eras.

A plot of the fitted line is obtained by saving the predicted values to the data table and then plotting the actual data and the fitted points on the same graph using the Graph->Overlay platform.

Confidence intervals for the MEAN response in a particular year (not likely of interest in this example) and for individual responses in a particular year are generated in the usual way.

Note that the estimated slope for the pre-1970 era is not statistically different from 0. If you wanted to fit a model where the line was flat (i.e. the slope was 0) in the pre-1970 era, this is done by using only the (year − 1970)+ variable. Many of the automatically generated plots look odd (e.g. all of the points appear to be replotted at 1970), and the intercept has a different interpretation in the two models because year = 0 has a different definition in each, but if the fitted model is plotted against the original year variable everything works out properly. In this particular case, the latter two models give predicted lines that are almost identical. In practice it is quite RARE to fit a line where the slope is known to be zero.

4.2 Searching for the change point

In the previous section on segmented regression (also known as hockey-stick or broken-stick regression), the location of the break was assumed to be known. In many cases the location of the break is not known, and it is of interest to estimate the break point as well.

The problems of identifying changes at unknown times and of estimating the locations of changes are known as "the change-point problem". Numerous methodological approaches have been implemented for change-point models. Maximum-likelihood estimation, Bayesian estimation, isotonic regression, piecewise regression, quasi-likelihood, and non-parametric regression are among the methods that have been applied, and grid-searching approaches have also been used. A review of the literature, especially as it applies to regression problems (as of 2008), is available at http://biostats.bepress.com/cgi/viewcontent.cgi?article=1075&context=cobra.

The standard change-point problem in regression models consists of

• testing the null hypothesis that no change in regimes has taken place against the alternative that observations were generated by two (or possibly more) distinct regression equations, and

• estimating the two regimes that gave rise to the data.

There are two common classes of models: those where the regression line is continuous at the break point, and those where the regression line can be discontinuous. In these notes, we only consider the continuous case.

This problem has a long history. A nice summary and treatment of the problem is available in

Toms, J. D. and Lesperance, M. L. (2003). Piecewise regression: a tool for identifying ecological thresholds. Ecology, 84, 2034-2041. http://dx.doi.org/10.1890/02-0472

The change-point model starts with the broken-stick model seen earlier, i.e.

Y = β0 + β1(X) + β2(X − C)+ + ε

where Y is the response variable, X is the covariate, and C is the change point, i.e. where the break occurs. This model is appropriate where there is an abrupt transition at the break point, but a smooth transition may be more realistic for some data. One drawback of this model is that convergence problems can occur in locating C when the data are sparse in the neighborhood of C.

Toms and Lesperance (2003) review models with gentler transitions, e.g. the hyperbolic tangent model and the bent-cable model. The bent-cable regression model was developed by Chiu, Lockhart and Routledge (2006, Bent-cable regression theory and application, Journal of the American Statistical Association, 101, 542-553); it fits a smooth transition between the two linear parts of the model. The bent-cable model is also applicable to regression models where the X variable is time and auto-correlation may be present.¹

The simple piecewise linear model can be fit using the Analyze->Modelling->NonLinear platform of JMP.
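As a sketch of what such a non-linear platform is doing, here is the same fit by non-linear least squares in Python. The data are synthetic, loosely patterned on the Nenana series (not the real data file), and the parameter values are invented; as the text notes, the starting values matter:

```python
# Sketch of the change-point fit by non-linear least squares,
# estimating the change point C along with the regression coefficients.
import numpy as np
from scipy.optimize import curve_fit

def broken_stick(x, b0, b1, b2, c):
    # Y = b0 + b1*X + b2*(X - C)+ ; continuous at the change point
    return b0 + b1 * x + b2 * np.clip(x - c, 0, None)

# synthetic series: flat until 1967, then declining, with noise
rng = np.random.default_rng(42)
year = np.arange(1917, 2013, dtype=float)
date = broken_stick(year, 127.0, 0.0, -0.26, 1967.0) \
       + rng.normal(0, 3.0, year.size)

# initial guesses (b0, b1, b2, C), e.g. from an earlier known-C fit
p0 = [127.0, 0.0, -0.2, 1970.0]
est, cov = curve_fit(broken_stick, year, date, p0=p0)
b0, b1, b2, c_hat = est
print(f"estimated change point: {c_hat:.1f}, "
      f"slope after: {b1 + b2:.3f}")
```

The square roots of the diagonal of `cov` give the large-sample standard errors that the text discusses below.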

4.2.1 Change point model for the Nenana River Ice Breakup

Refer to the previous section for details on the Nenana River Ice Breakup contest. Rather than specifying a break point at 1970, we will fit the change-point model to estimate the change point.

The data are available in the Nenana.jmp data table in the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms.

The statistical model is:

JulianDate = β0 + β1(Year) + β2(Year − C)+ + ε

where JulianDate is the date of breakup and Year is the calendar year. The parameters to be estimated are β0, the intercept; β1, the slope prior to the change point; β2, the change in slope after the change point; and C, the change point.

We first need to define the parameters of the model (β0, β1, β2, C) and the predicted value in terms of those parameters. We start by creating a new column in the data table, ChangePointPredictor, and start the Formula Editor.

¹ Chiu, G. S. and Lockhart, R. L. (2010). Bent-cable regression with auto-regressive noise. Canadian Journal of Statistics, 38, 386-407. http://dx.doi.org/10.1002/cjs.10070

New parameters are defined (along with initial starting guesses) using the drop-down menu in the top left of the Formula Editor. Click on the New Parameters item and create the four parameters and their initial values (based on the results from the previous example); the choice of initial values is not that crucial. Then create the predicted value in terms of the parameters and the columns in the data table.

Notice the use of the If function to adjust for the break point. You can switch back and forth between the parameters, data table columns, etc. using the drop-down menu in the top right of the Formula Editor. When you are finished, close the Formula Editor, and the data table will be updated with initial predictions based on the initial values specified.

Select the Analyze->Modelling->NonLinear platform.

Specify the predicted value and Y variables appropriately. Notice that the formula for the predictions is displayed.

This brings up the Analyze->Modelling->NonLinear platform control panel, where the initial fit is displayed. Press the Go button to find the non-linear least squares fit.

The non-linear least squares algorithm appears to have converged at the estimates listed in the table. The estimated change point of 1967 is close to the value of 1970 "guess-timated" earlier. Approximate standard errors are also presented at the bottom of the output.

These standard errors are based on large-sample theory. To compute a 95% confidence interval for the break point, you could use the standard estimate ± 2(se), but in small samples the resulting confidence intervals may not perform well. Toms and Lesperance (2003) recommend that a likelihood-ratio confidence interval be computed instead. JMP attempts to compute profile-likelihood confidence intervals when you press the Confidence Interval button.

In this case, the profile intervals fail to give upper and lower bounds because the slope after the change point is just on the boundary of statistical significance at α = 0.05. If you change the confidence coefficient from 95% to 90%, the procedure is able to find confidence bounds on the C parameter. Consequently, there may or may not be a change point. Notice that the lower boundary of the confidence interval for C is quite far below the point estimate!
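One way to sketch the likelihood-ratio interval that Toms and Lesperance recommend is to profile over a grid of candidate change points: hold C fixed, fit the remaining coefficients by OLS, and keep every C whose profile log-likelihood (for Gaussian errors) is within the chi-square cutoff of the best fit. The data below are synthetic; the approach, not the numbers, is the point, and this need not match JMP's exact computation:

```python
# Profile-likelihood (likelihood-ratio) confidence interval for the
# change point C, by grid search.  For Gaussian errors with sigma
# profiled out, -2*log(LR) = n * log(SSE_c / SSE_min).
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(7)
year = np.arange(1917, 2013, dtype=float)
date = 127.0 - 0.26 * np.clip(year - 1967.0, 0, None) \
       + rng.normal(0, 3.0, year.size)
n = year.size

def sse_given_c(c):
    # OLS fit of b0, b1, b2 with the change point held fixed at c
    X = np.column_stack([np.ones(n), year, np.clip(year - c, 0, None)])
    resid = date - X @ np.linalg.lstsq(X, date, rcond=None)[0]
    return float(resid @ resid)

grid = np.arange(1925.0, 2005.0, 0.5)
sse = np.array([sse_given_c(c) for c in grid])
c_hat = grid[sse.argmin()]

# keep every c whose likelihood ratio is inside the 95% cutoff
lr = n * np.log(sse / sse.min())
inside = grid[lr <= chi2.ppf(0.95, df=1)]
print(f"C-hat = {c_hat:.1f}, "
      f"95% profile CI roughly ({inside.min():.1f}, {inside.max():.1f})")
```

The profile interval for C is typically asymmetric, which is consistent with the lower bound falling well below the point estimate as noted above.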

Confidence intervals for the mean response and prediction intervals for a future response are obtained by clicking on the red triangle. These are interpreted in the same way as in ordinary regression.

The Analyze->Modelling->NonLinear platform also allows you to "play" with the estimates to investigate the sensitivity of the fit to the parameters. The Profiler option under the red triangle is also useful in these cases.

4.3 How NOT to search for a change point!

A fairly common request in our Statistical Consulting Service is for help in finding the time at which some treatment gives a difference in response from a control. For example, one group of animals may be fed a control diet and measured over time, while another group of animals is fed an experimental diet and measured over time. At which point do the responses of the two groups start to differ?

Let us assume, for simplicity, that separate animals are measured at each time point, so that the problem of longitudinal data can be ignored; for example, suppose that animals must be sacrificed at each time point to measure the response. A naive analysis starts by plotting the means of the two groups over time and searching for the first time point at which the two means are statistically different.

This is NOT A VALID ANALYSIS! The problem is that the estimate of the change point from this analysis depends on the sample size. If the sample size in each group is small, then the standard error bars are larger, and the estimated change point tends to be later than if the sample size is large and the standard errors are smaller. The actual change point does NOT depend on sample size! All that should happen is that the estimated precision of the change point should be worse for smaller sample sizes than for larger sample sizes.

The proper way to search for a change point is to find the DIFFERENCE in means at each time point and then apply the analysis of the previous sections to the differences. A model where the difference in means is forced to be zero prior to the unknown change point may be a suitable alternative model.
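This recommended approach can be sketched with synthetic data (the change point at time 12, group means, and all other values below are invented for illustration): take the difference in group means at each time point, then fit a model where the difference is zero before an unknown change point C and grows linearly after it, estimating C by least squares over a grid.

```python
# Sketch of the valid approach: analyze the DIFFERENCE in group means
# over time, with the difference forced to be zero before the change
# point: diff(t) = b * (t - C)+ + error.  Synthetic data.
import numpy as np

rng = np.random.default_rng(3)
t = np.arange(1, 21, dtype=float)        # 20 time points
n_per_group = 8                          # animals sacrificed per time
# control mean flat; treatment departs after time 12 (invented truth)
ctl = rng.normal(10.0, 1.0, (t.size, n_per_group))
trt_mu = 10.0 + 0.8 * np.clip(t - 12.0, 0, None)
trt = rng.normal(trt_mu[:, None], 1.0, (t.size, n_per_group))

diff = trt.mean(axis=1) - ctl.mean(axis=1)

def sse_given_c(c):
    # one-parameter least squares for the post-change slope b
    z = np.clip(t - c, 0, None)
    b = (z @ diff) / (z @ z)
    r = diff - b * z
    return float(r @ r)

grid = np.arange(2.0, 19.0, 0.25)
c_hat = grid[int(np.argmin([sse_given_c(c) for c in grid]))]
print(f"estimated change point: {c_hat:.2f}")
```

Unlike the naive "first significant time point" approach, the estimate here targets the actual change point, and only its precision (not its location) degrades as the per-group sample size shrinks.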

Chapter 5

Analysis of Covariance - ANCOVA

5.1 Introduction

In previous chapters, we looked at comparing group means from data collected under a single-factor completely randomized design and analyzed using ANOVA. We also looked at estimating the slope of a straight line relating two variables. In both cases the response variable, Y, was continuous (interval or ratio scale). In the case of ANOVA, the X variable was nominal or ordinal in scale and served to identify the treatment groups. In the regression setting, the X variable was also continuous.

The Analysis of Covariance (ANCOVA) is a combination of both analyses. Groups are identified by a nominal or ordinal scale variable, and a continuous covariate is also measured.

There are two uses of ANCOVA which, on the surface, appear to be separate analyses. In fact, both analyses are identical.

The first use is to check if the regression lines for the groups are parallel. If there is evidence that the individual regression lines are not parallel, then a separate regression line must be fit for each group for prediction purposes. If there is no evidence of non-parallelism, then the next task is to see if the lines are coincident, i.e. have both the same intercept and the same slope. If there is evidence that the lines are not coincident, then a series of parallel lines is fit to the data, with all of the data used to estimate the common slope. If there is no evidence that the lines are not coincident, then all of the data can simply be pooled together and a single regression line fit to all of the data.

The three possibilities are shown below for the case of two groups; the extension to many groups is obvious:

Second, ANCOVA has been used to test for differences in means among the groups when some of the variation in the response variable can be "explained" by a covariate. For example, the effectiveness of two different diets can be compared by randomizing people to the two diets and measuring the weight change during the experiment. However, some of the variation in weight change may be related to initial weight. Perhaps by "standardizing" everyone to some common weight, we can more easily detect differences among the groups.

A very nice book on the Analysis of Covariance is Analysis of Messy Data, Volume III: Analysis of Covariance by G. A. Milliken and D. E. Johnson. Details are available at http://www.statsnetbase.com/ejournals/books/book_summary/summary.asp?id=869.

c○2012 Carl James Schwarz 330 November 23, 2012


CHAPTER 5. ANALYSIS OF COVARIANCE - ANCOVA<br />

5.2 Assumpti<strong>on</strong>s<br />

As before, it is important to verify the assumptions underlying the analysis before it is started. As ANCOVA is a combination of ANOVA and regression, the assumptions are similar. Both goals of ANCOVA have similar assumptions:

• The response variable Y is continuous (interval or ratio scaled).

• The data are collected under a completely randomized design.¹ This implies that the treatment must be randomized completely over the entire set of experimental units in an experimental study, or that units must be selected at random from the relevant populations in an observational study.

• There must be no outliers. Plot Y vs. X for each group separately to see if there are any points that don't appear to follow the straight line.

• The relationship between Y and X must be linear for each group.² Check this assumption by looking at the individual plots of Y vs. X for each group.

• The variance must be equal for both groups around their respective regression lines. Check that the spread of the points is equal over the range of X and that the spread is comparable between the two groups. This can be formally checked by looking at the MSE from a separate regression line for each group, as the MSE estimates the variance of the data around the regression line.

• The residuals must be normally distributed around the regression line for each group. This assumption can be checked by examining the residual plots from the fitted model for evidence of non-normality. For large samples, this is not too crucial; for small sample sizes, you will likely have inadequate power to detect anything but gross departures.
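Several of the checks above amount to fitting a separate regression line within each group and comparing the resulting MSEs (and their square roots, the residual standard deviations). A minimal sketch in Python - the course itself uses JMP, and the two small data sets here are invented purely for illustration:

```python
import math

def fit_line(xs, ys):
    """Least-squares fit of y = a + b*x; returns (intercept, slope, mse).
    The MSE uses n - 2 degrees of freedom, matching standard regression output."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - xbar) ** 2 for x in xs)
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = ybar - b * xbar
    sse = sum((y - a - b * x) ** 2 for x, y in zip(xs, ys))
    return a, b, sse / (n - 2)

# Invented data for two groups, measured at the same X values
group1 = ([1, 2, 3, 4, 5], [2.1, 3.9, 6.2, 7.8, 10.1])
group2 = ([1, 2, 3, 4, 5], [5.0, 6.8, 9.1, 11.2, 12.9])

_, _, mse1 = fit_line(*group1)
_, _, mse2 = fit_line(*group2)

# Roughly equal residual standard deviations support the equal-variance assumption.
print(round(math.sqrt(mse1), 3), round(math.sqrt(mse2), 3))  # prints: 0.189 0.179
```

Plotting the residuals from each per-group fit (not done here) covers the outlier, linearity, and normality checks in the same pass.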

5.3 Comparing individual regression lines

You saw in earlier chapters that a statistical model is a powerful shorthand to describe what analysis is fit to a set of data. The model must describe the treatment structure, the experimental unit structure, and the randomization structure. Let Y be the response variable, X be the continuous X-variable, and Group be the group factor.

In all cases that follow, we are assuming that a completely randomized design was used for the randomization structure. This implies that there are no explicit terms for the randomization structure in the model.

Similarly, there is a single size of experimental unit, with no blocking or sub-sampling occurring. This also implies there will be no terms in the model for the experimental unit structure. In more advanced courses, the analyses in this chapter can be extended to more complex designs.

¹ It is possible to relax this assumption, but this is beyond the scope of this course.
² It is possible to relax this assumption as well, but it is again beyond the scope of this course.

©2012 Carl James Schwarz 331 November 23, 2012


CHAPTER 5. ANALYSIS OF COVARIANCE - ANCOVA

In earlier chapters, we saw that the model for a single-factor completely randomized design is

Y = Group

This is read as saying that variation in Y can be partially explained by an overall grand mean (never specified), with differences in the means caused by Groups, plus implicit random noise (which is never specified).

Again, from an earlier chapter, we saw that the model for a regression of Y on X is

Y = X

This is read as saying that the variation in Y can be partially explained by an intercept (never specified), plus changes in X, plus implicit random noise (which is never specified).

As ANCOVA is a combination of the above two analyses, it will not be surprising that the models will have terms corresponding to both Group and X. Again, there are three cases:

If the lines for each group are not parallel:


the appropriate model is

Y = Group X Group*X

The terms can be in any order. This is read as: variation in Y can be explained by a common intercept (never specified), followed by group effects (different intercepts), a common slope on X, and an "interaction" between Group and X, which is interpreted as different slopes for each group. This model is almost equivalent to fitting a separate regression line for each group. The only advantage to using this joint model for all groups is similar to that enjoyed by using ANOVA - all of the groups contribute to a better estimate of the residual error. If the number of data points per group is small, this can lead to improvements in precision compared to fitting each group individually.

If the lines are parallel across groups, but not coincident:

the appropriate model is

Y = Group X

The terms can be in any order. The only difference between this and the previous model is that this simpler model lacks the Group*X "interaction" term. It would not be surprising, then, that a statistical test to see if this simpler model is tenable would correspond to examining the p-value of the test on the Group*X term from the complex model. This is exactly analogous to testing for interaction effects between factors in a two-factor ANOVA.

Lastly, if the lines are coincident:

the appropriate model is

Y = X

Now the difference between this model and the previous model is the Group term that has been dropped. Again, it would not be surprising that this corresponds to the test of the Group effect in the formal statistical test. The test for coincident lines should only be done if there is insufficient evidence against the hypothesis of parallelism.

While it is possible to test for a non-zero slope, this is rarely done.
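The sequence of tests just described (first Group*X, then Group) is an extra-sum-of-squares comparison of nested models. A sketch in Python of the F statistic for the parallelism test, using invented data; JMP computes the equivalent test internally, and the statistic would be compared to an F table with g-1 and n-2g degrees of freedom:

```python
def _stats(xs, ys):
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - xbar) ** 2 for x in xs)
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    return xbar, ybar, sxx, sxy

def sse_separate(groups):
    """SSE when every group gets its own slope and intercept (the full model)."""
    total = 0.0
    for xs, ys in groups:
        xbar, ybar, sxx, sxy = _stats(xs, ys)
        b = sxy / sxx
        a = ybar - b * xbar
        total += sum((y - a - b * x) ** 2 for x, y in zip(xs, ys))
    return total

def sse_parallel(groups):
    """SSE for the parallel-lines model: one pooled slope, separate intercepts."""
    stats = [_stats(xs, ys) for xs, ys in groups]
    b = sum(s[3] for s in stats) / sum(s[2] for s in stats)
    total = 0.0
    for (xs, ys), (xbar, ybar, _, _) in zip(groups, stats):
        a = ybar - b * xbar
        total += sum((y - a - b * x) ** 2 for x, y in zip(xs, ys))
    return total

def f_parallelism(groups):
    """Extra-sum-of-squares F statistic for H0: common slope
    (g-1 and n-2g degrees of freedom)."""
    g = len(groups)
    n = sum(len(xs) for xs, _ in groups)
    full, reduced = sse_separate(groups), sse_parallel(groups)
    return ((reduced - full) / (g - 1)) / (full / (n - 2 * g))

x = [0.0, 1.0, 2.0, 3.0, 4.0]
near_parallel = [(x, [0.0, 2.1, 3.9, 6.1, 7.9]), (x, [5.1, 6.9, 9.0, 11.1, 12.9])]
non_parallel = [(x, [0.0, 2.1, 3.9, 6.1, 7.9]), (x, [0.0, 1.0, 2.0, 3.0, 4.0])]
print(f_parallelism(near_parallel))  # tiny: no evidence against parallelism
print(f_parallelism(non_parallel))   # huge: the slopes clearly differ
```

The same machinery, dropping the Group intercepts instead of the slopes, gives the test for coincident lines.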


5.4 Comparing Means after covariate adjustments

to be added later

5.5 Power and sample size

to be added later

- use the MSE as the estimate of variance for testing MEANS and for testing the slope.

5.6 Example - Degradation of dioxin

An unfortunate byproduct of pulp-and-paper production used to be dioxins - a very hazardous material. This material was discharged into waterways with the pulp-and-paper effluent, where it bioaccumulated in living organisms such as crabs. Newer processes have eliminated this byproduct, but the dioxins in the organisms take a long time to degrade.

Government environmental protection agencies take samples of crabs from affected areas each year and measure the amount of dioxins in the tissue. The following example is based on a real study.

Each year, four crabs are captured from two monitoring stations which are situated quite a distance apart on the same inlet where the pulp mill was located. The liver is excised from each crab, and the livers from all four crabs are composited together into a single sample.³ The dioxin level in this composite sample is measured. As there are many different forms of dioxins with different toxicities, a summary measure, called the Total Equivalent Dose (TEQ), is computed from the sample.

As seen in the chapter on regression, the appropriate response variable is log(TEQ).

Is the rate of decline the same for both sites? Did the sites have the same initial concentration?

Here are the raw data, which are also available in the dataset dioxin2.jmp in the Sample Program Library.

³ Compositing is a common analytical tool. There is little loss of useful information induced by the compositing process - the only loss of information is the among-individual-sample variability, which can be used to determine the optimal allocation between samples within years and the number of years to monitor.


Site  Year     TEQ  log(TEQ)
a     1990  179.05      5.19
a     1991   82.39      4.41
a     1992  130.18      4.87
a     1993   97.06      4.58
a     1994   49.34      3.90
a     1995   57.05      4.04
a     1996   57.41      4.05
a     1997   29.94      3.40
a     1998   48.48      3.88
a     1999   49.67      3.91
a     2000   34.25      3.53
a     2001   59.28      4.08
a     2002   34.92      3.55
a     2003   28.16      3.34
b     1990   93.07      4.53
b     1991  105.23      4.66
b     1992  188.13      5.24
b     1993  133.81      4.90
b     1994   69.17      4.24
b     1995  150.52      5.01
b     1996   95.47      4.56
b     1997  146.80      4.99
b     1998   85.83      4.45
b     1999   67.72      4.22
b     2000   42.44      3.75
b     2001   53.88      3.99
b     2002   81.11      4.40
b     2003   70.88      4.26

The data can be entered into JMP in the usual fashion. Make sure that Site is a nominal scale variable and that Year is a continuous variable.

In cases with multiple groups, it is often helpful to use a different plotting symbol for each group. This is easily accomplished in JMP by selecting the rows (say, for site a) and using Rows->Markers to set the plotting symbol for the selected rows:


The final data sheet has two different plotting symbols for the two sites:


Before fitting the various models, begin with an exploratory examination of the data, looking for outliers and checking the assumptions.

Each year's data are independent of other years' data, as a different set of crabs was selected. Similarly, the data from one site are independent of the data from the other site. This is an observational study, so the question arises of how exactly the crabs were selected. In this study, crab pots were placed on the sea floor to capture the available crabs in the area.

Whenever multiple sets of data are collected over time, there is always the worry about common year effects (also known as process error). For example, if the response variable were the body mass of small fish, then poor growing conditions in a single year could depress the growth of fish in all locations. This would then violate the assumption of independence, as the residual at one site in a year would be related to the residual at the other site in the same year. You tend to see the residuals "paired", with negative residuals from the fitted line at one site matched (by year) with negative residuals at the other site. In this case, this is unlikely to have occurred. Degradation of dioxin is relatively independent of external environmental factors, and the variation that we see about the two regression lines is related solely to sampling error based on the particular set of crabs that were sampled. It seems unlikely that the residuals are related.⁴

Use the Analyze->Fit Y-by-X platform and specify log(TEQ) as the Y variable and Year as the X variable:

⁴ If you actually try to fit a process error term to this model, you find that the estimated process error is zero.


Then specify a grouping variable by clicking on the pop-down menu near the Bivariate Fit title line:

and selecting Site as the grouping variable:


Now select Fit Line from the same pop-down menu:


to get separate lines fit for each group:


The relationships for each site appear to be linear. The actual estimates are also presented:


The scatterplot doesn't show any obvious outliers. The estimated slope for site a is -.107 (se .02), while the estimated slope for site b is -.06 (se .02). The 95% confidence intervals (not shown on the output, but available by right-clicking/ctrl-clicking on the parameter estimates table) overlap considerably, so the slopes could be the same for the two groups.
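These per-site estimates can be cross-checked outside JMP. A Python sketch using the table above (the logs here are the rounded log(TEQ) values, so the third decimal can differ slightly from the JMP output):

```python
years = list(range(1990, 2004))
log_teq_a = [5.19, 4.41, 4.87, 4.58, 3.90, 4.04, 4.05,
             3.40, 3.88, 3.91, 3.53, 4.08, 3.55, 3.34]
log_teq_b = [4.53, 4.66, 5.24, 4.90, 4.24, 5.01, 4.56,
             4.99, 4.45, 4.22, 3.75, 3.99, 4.40, 4.26]

def slope_and_mse(xs, ys):
    """Least-squares slope and MSE (n - 2 df) of y regressed on x."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - xbar) ** 2 for x in xs)
    b = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / sxx
    a = ybar - b * xbar
    sse = sum((y - a - b * x) ** 2 for x, y in zip(xs, ys))
    return b, sse / (n - 2)

slope_a, mse_a = slope_and_mse(years, log_teq_a)
slope_b, mse_b = slope_and_mse(years, log_teq_b)
print(round(slope_a, 3), round(slope_b, 3))  # prints: -0.108 -0.059
print(round(mse_a, 2), round(mse_b, 2))      # prints: 0.1 0.12
```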

The MSE from site a is .10 and the MSE from site b is .12. These correspond to standard deviations of √.10 = .32 and √.12 = .35, which are very similar, so the assumption of equal standard deviations seems reasonable.

The residual plots (not shown) also look reasonable.

The assumptions appear to be satisfied, so let us now fit the various models.

First, fit the model allowing for separate lines for each group. The Analyze->Fit Model platform is used:

The terms can be in any order and correspond to the model described earlier. This gives the following output:


The regression plot is just the same as the plot of the two individual lines seen earlier. What is of interest is the effect test for the Site*Year interaction. Here the p-value is not very small, so there is no evidence that the lines are not parallel.

We need to refit the model, dropping the interaction term:


which gives the following regression plot:


This shows the fitted parallel lines. The effect tests:

now have a small p-value for the Site effect, indicating that the lines are not coincident, i.e. they are parallel with different intercepts. This means that the rate of decay of the dioxin appears to be equal at both sites, but the initial concentrations appear to be different.

The estimated (common) slope is found in the Parameter Estimates portion of the output:


and has a value of -.083 (se .016). Because the analysis was done on the log-scale, this implies that the dioxin levels changed by a factor of exp(-.083) = .92 from year to year, i.e. about an 8% decline each year. The 95% confidence interval for the slope on the log-scale is (-.12, -.05), which corresponds to a potential factor between exp(-.12) = .88 and exp(-.05) = .95 per year, i.e. between a 12% and a 5% decline per year.⁵
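The common slope of the parallel-lines model is simply the pooled within-site estimate, sum of the Sxy over sum of the Sxx, so it can be verified directly from the table of raw data. A sketch (computed from the rounded log(TEQ) values, hence -0.084 here against -0.083 in the JMP output):

```python
import math

years = list(range(1990, 2004))
log_teq_a = [5.19, 4.41, 4.87, 4.58, 3.90, 4.04, 4.05,
             3.40, 3.88, 3.91, 3.53, 4.08, 3.55, 3.34]
log_teq_b = [4.53, 4.66, 5.24, 4.90, 4.24, 5.01, 4.56,
             4.99, 4.45, 4.22, 3.75, 3.99, 4.40, 4.26]

def sums(xs, ys):
    """Within-group corrected sums Sxy and Sxx."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    sxx = sum((x - xbar) ** 2 for x in xs)
    return sxy, sxx

sxy_a, sxx_a = sums(years, log_teq_a)
sxy_b, sxx_b = sums(years, log_teq_b)
common_slope = (sxy_a + sxy_b) / (sxx_a + sxx_b)  # about -0.084 from rounded data
yearly_factor = math.exp(common_slope)
print(round(yearly_factor, 2))  # prints: 0.92 -- i.e. roughly an 8% decline per year
```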

While it is possible to estimate the difference between the parallel lines from the Parameter Estimates table, it is easier to look at the section of the output corresponding to the Site effects. Here the estimated LSMeans correspond to the log(TEQ) at the average value of Year - not really of interest. As in previous chapters, the difference in means is often of more interest than the raw means themselves. This is found by using the pop-down menu and selecting an LSMeans Contrast or Multiple Comparison procedure to give:

⁵ The confidence intervals are found by right-clicking/ctrl-clicking in the Parameter Estimates table.


The estimated difference between the lines (on the log-scale) is 0.46 (se .13). Because the analysis was done on the log-scale, this corresponds to a ratio of exp(.46) = 1.58 in dioxin levels between the two sites, i.e. site b has 1.58 times the dioxin level of site a. Because the slopes are parallel and declining, the dioxin levels are falling at both sites, but the 1.58 ratio remains constant.
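Because both sites are measured in exactly the same years, the design is balanced and the estimated gap between the parallel lines reduces to the difference in mean log(TEQ). A quick check from the raw data (the ratio is 1.59 here only because the text exponentiates the rounded 0.46):

```python
import math

log_teq_a = [5.19, 4.41, 4.87, 4.58, 3.90, 4.04, 4.05,
             3.40, 3.88, 3.91, 3.53, 4.08, 3.55, 3.34]
log_teq_b = [4.53, 4.66, 5.24, 4.90, 4.24, 5.01, 4.56,
             4.99, 4.45, 4.22, 3.75, 3.99, 4.40, 4.26]

# Both sites share the same Year values, so the common-slope term cancels and the
# estimated gap between the parallel lines is just the difference in mean log(TEQ).
diff = sum(log_teq_b) / len(log_teq_b) - sum(log_teq_a) / len(log_teq_a)
ratio = math.exp(diff)
print(round(diff, 2), round(ratio, 2))  # prints: 0.46 1.59
```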


Finally, the actual-by-predicted plot (not shown here), the leverage plots (not shown here), and the residual plot don't show any evidence of a problem in the fit.

5.7 Change in yearly average temperature with regime shifts

The ANCOVA technique can also be used for trends when there are KNOWN regime shifts in the series. The case when the timing of the shift is unknown is more difficult and not covered in this course.

For example, consider a time series of annual average temperatures measured at Tuscaloosa, Alabama from 1901 to 2001. It is well known that shifts in temperature readings can occur whenever the instrument, location, observer, or other characteristics of the station change.

The data are available in the JMP datafile tuscaloosa-avg-temp.jmp in the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms.

A portion of the raw data is shown below:


and a time series plot of the data:

shows a shift in the readings in 1939 (thermometer changed), 1957 (station moved), and possibly in 1987 (location and thermometer changed).

It turns out that cases where the number of epochs tends to increase with the number of data points have some serious technical issues with the properties of the estimators. See

Lu, Q. and Lund, R.B. (2007). Simple linear regression with multiple level shifts. Canadian Journal of Statistics, 35, 447-458.

for details. Basically, if the number of parameters tends to increase with the sample size, this violates one of the assumptions of maximum likelihood estimation. This could lead to estimates which may not even be consistent! For example, suppose that the recording conditions changed every two years. Then each pair of data points should still be able to estimate the common slope, but this corresponds to the well-known problem with case-control studies where the number of pairs increases with the total sample size. Fortunately, Lu and Lund (2007) showed that this violation is not serious.

The analysis proceeds as in the dioxin example with two sites, except that now the series is broken into different epochs corresponding to the sets of years when conditions remained stable at the recording site. In this case, this corresponds to the years 1901-1938 (inclusive); 1940-1956 (inclusive); 1958-1986 (inclusive); and 1989-2000 (inclusive). Note that the years 1939, 1957, and 1987 are NOT used, because the average temperature in each of these years is an amalgam of two different recording conditions.⁶

For example, the data file (around the first regime change) may look like:

Note that Year and Avg Temp are both set to have a continuous scale, but Epoch should have a nominal or ordinal scale.

Model fitting proceeds as before by first fitting the model

AvgTemp = Year Epoch Year*Epoch

to see if the change in AvgTemp is consistent among epochs, and then fitting the model

AvgTemp = Year Epoch

to estimate the common trend (after adjusting for shifts among the epochs).
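The same pooled-slope logic used for the two dioxin sites extends to any number of epochs. A Python sketch on invented data (the Tuscaloosa values are not reproduced in the text), showing that the epoch terms recover the common trend while a naive regression that ignores the level shifts does not:

```python
# Invented series: common trend of 0.02 degrees/year plus a level shift at each
# regime change (the real values live in tuscaloosa-avg-temp.jmp).
TRUE_SLOPE = 0.02
epochs = [(range(1901, 1939), 10.0),  # 1901-1938
          (range(1940, 1957), 10.5),  # 1940-1956
          (range(1958, 1987), 11.0),  # 1958-1986
          (range(1989, 2001), 11.5)]  # 1989-2000
groups = [([float(y) for y in yrs], [off + TRUE_SLOPE * y for y in yrs])
          for yrs, off in epochs]

def pooled_slope(groups):
    """Common slope of the parallel-lines (Year + Epoch) model:
    sum of within-epoch Sxy over sum of within-epoch Sxx."""
    sxy = sxx = 0.0
    for xs, ys in groups:
        xbar, ybar = sum(xs) / len(xs), sum(ys) / len(ys)
        sxy += sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
        sxx += sum((x - xbar) ** 2 for x in xs)
    return sxy / sxx

# A naive regression on the whole series ignores the shifts and distorts the trend.
all_x = [x for xs, _ in groups for x in xs]
all_y = [y for _, ys in groups for y in ys]
xbar, ybar = sum(all_x) / len(all_x), sum(all_y) / len(all_y)
naive_slope = (sum((x - xbar) * (y - ybar) for x, y in zip(all_x, all_y))
               / sum((x - xbar) ** 2 for x in all_x))

print(round(pooled_slope(groups), 4))  # prints: 0.02 -- epoch terms recover the trend
print(naive_slope > TRUE_SLOPE)        # prints: True -- upward shifts inflate the naive slope
```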

The Analyze->Fit Model platform is used:

⁶ If the exact day of the change were known, it would be possible to weight the two epochs in these years and include the data points.


There is no strong evidence that the slopes are different among the epochs (p=.10), despite the plot showing a potentially different slope in the 3rd epoch:


The simpler model with common slopes is then fit:


with fitted (common slope) lines:


No further model simplification is possible, and there is evidence that the common slope is different from zero:

The estimated change in average temperature is:


i.e. an estimated increase of .033 (SE .006) per year. The 95% confidence interval does not cover 0.

The residual plots (against the predicted values and against the order in which the data were collected):


show no obvious problems.

Whenever time series data are used, autocorrelation should be investigated. The Durbin-Watson test is applied to the residuals:

with no obvious problem detected.
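The Durbin-Watson statistic itself is simple to compute: it compares successive residuals, with values near 2 indicating no lag-1 autocorrelation, values near 0 positive autocorrelation, and values near 4 negative autocorrelation. A sketch (the residual sequences here are invented for illustration):

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic: sum of squared successive differences of the
    residuals divided by the sum of squared residuals."""
    num = sum((residuals[t] - residuals[t - 1]) ** 2
              for t in range(1, len(residuals)))
    return num / sum(e * e for e in residuals)

print(durbin_watson([1.0, -1.0, 1.0, -1.0]))             # prints: 3.0 (alternating signs)
print(durbin_watson([1.0, 0.9, 0.7, -0.2, -0.8, -1.1]))  # well below 2 (a slow drift)
```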

The leverage plot (against year) also reveals nothing amiss.

A more sophisticated analysis can be fit using SAS, but it isn't needed. The sample program and output are available in the Sample Program Library.

5.8 Example - More refined analysis of stream-slope example

In the chapter on paired comparisons, the example of the effect of stream slope was examined based on:

Isaak, D.J. and Hubert, W.A. (2000). Are trout populations affected by reach-scale stream slope. Canadian Journal of Fisheries and Aquatic Sciences, 57, 468-477.
c○2012 Carl James Schwarz 360 November 23, 2012


CHAPTER 5. ANALYSIS OF COVARIANCE - ANCOVA<br />

In that paper, stream slope was (roughly) categorized into high or low slope classes and a paired analysis was performed. In this section, we will use the actual stream slopes to examine the relationship between fish density and stream slope.

Recall that a stream reach is a portion of a stream, from 10 to several hundred metres in length, that exhibits a consistent slope. The slope influences the general speed of the water, which exerts a dominant influence on the structure of physical habitat in streams. If fish populations are influenced by the structure of physical habitat, then the abundance of fish populations may be related to the slope of the stream.

Reach-scale stream slope and the structure of associated physical habitats are thought to affect trout populations, yet previous studies confound the effect of stream slope with other factors that influence trout populations.

Past studies addressing this issue have used sampling designs wherein data were collected either by taking repeated samples along a single stream or by measuring many streams distributed across space and time. Reaches on the same stream will likely have correlated measurements, making the use of simple statistical tools problematic. [Indeed, if only a single stream is measured at multiple locations, then this is an example of pseudo-replication, and inference is limited to that particular stream.]

Inference from streams spread over time and space is made more difficult by inter-stream differences and by temporal variation in trout populations if samples are collected over extended periods of time. This extra variation reduces the power of any survey to detect effects.

For this reason, a paired approach was taken. A total of twenty-three streams were sampled from a large watershed. Within each stream, two reaches were identified, and the actual slope gradient was measured.

In each reach, fish abundance was determined using electro-fishing methods, and the numbers were converted to a density per 100 m² of stream surface.

Table 6.1 presents the (fictitious, but based on the above paper) raw data.

Estimates of fish density from a paired experiment

Stream  slope (%)  slope class  density (per 100 m²)
     1        0.7  low                          15.0
     1        4.0  high                         21.0
     2        2.4  low                          11.0
     2        6.0  high                          3.1
     3        0.7  low                           5.9
     3        2.6  high                          6.4
     4        1.3  low                          12.2
     4        4.0  high                         17.6
     5        0.6  low                           6.2
     5        4.4  high                          7.0
     6        1.3  low                          39.8
     6        3.2  high                         25.0
     7        2.0  low                           6.5
     7        4.2  high                         11.2
     8        1.3  low                           9.6
     8        4.2  high                         17.5
     9        2.0  low                           7.3
     9        3.6  high                         10.0
    10        0.7  low                          11.3
    10        3.5  high                         21.0
    11        2.3  low                          12.1
    11        6.0  high                         12.1
    12        2.5  low                          13.2
    12        4.2  high                         15.0
    13        2.3  low                           5.0
    13        6.0  high                          5.0
    14        1.2  low                          10.2
    14        2.9  high                          6.0
    15        0.7  low                           8.5
    15        2.9  high                          7.0
    16        1.1  low                           5.8
    16        3.0  high                          5.0
    17        2.2  low                           5.1
    17        5.0  high                          5.0
    18        0.7  low                          65.4
    18        3.2  high                         55.0
    19        0.7  low                          13.2
    19        3.0  high                         15.0
    20        0.3  low                           7.1
    20        3.2  high                         12.0
    21        2.3  low                          44.8
    21        7.0  high                         48.0
    22        1.8  low                          16.0
    22        6.0  high                         20.0
    23        2.2  low                           7.2
    23        6.0  high                         10.1

Notice that the density varies considerably among streams but appears to be fairly consistent within each stream.

The raw data are available in a JMP datafile called paired-stream.jmp in the Sample Programs Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms.

As noted earlier, this is an example of an Analytical Survey. The treatments (low or high slope) cannot<br />

be randomized within stream – the randomizati<strong>on</strong> occurs by selecting streams at random from some larger<br />

populati<strong>on</strong> of potential streams. As noted in the early chapter <strong>on</strong> Observati<strong>on</strong>al Studies, causal inference is<br />

limited whenever a randomizati<strong>on</strong> of experimental units to treatments cannot be per<str<strong>on</strong>g>for</str<strong>on</strong>g>med.<br />

Unlike the example presented in other chapters where the slope is divided (arbitrarily) into two class<br />

(low and high slope), we will now use the actual slope. A simple regressi<strong>on</strong> CANNOT be used because of<br />

the n<strong>on</strong>-independence introduced by measuring two reaches <strong>on</strong> the same stream. However, an ANOCOVA<br />

will prove to be useful here.<br />

First, it seems sensible that the response to stream slope will be multiplicative rather than additive, i.e. an increase in the stream slope will change the fish density by a common fraction, rather than simply changing the density by a fixed amount. For example, it may turn out that a 1 unit change in the slope reduces density by 10% - if the density before the change was 100 fish/m^2, then after the change, the new density will be 90 fish/m^2. Similarly, if the original density was only 10 fish/m^2, then the final density will be 9 fish/m^2. In both cases, the reduction is a fixed fraction, and NOT the same fixed amount (a change of 10 vs. 1).

Create the log(density) column in the usual fashion (not illustrated here). In cases like this, the natural logarithm is preferred because the resulting estimates have a very nice simple interpretation. 7
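A quick numeric check of this multiplicative-vs-additive point (a sketch in Python, using only the standard library): a 10% reduction is the same shift on the natural-log scale whether the starting density is 100 or 10 fish/m^2.

```python
import math

# A 10% reduction in density, starting from two very different densities.
drop_from_100 = math.log(90) - math.log(100)   # change on the log scale
drop_from_10 = math.log(9) - math.log(10)      # same fractional change

# On the raw scale the changes differ (10 vs. 1 fish/m^2),
# but on the log scale both equal log(0.9).
print(drop_from_100, drop_from_10)
```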

An appropriate model will be one where each stream has a separate intercept (corresponding to the different productivities of each stream - acting like a block), with a common slope for all streams. The simplified model syntax would look like

log(density) = stream slope

where the term stream represents a nominally scaled (categorical) variable and gives the different intercepts, and the term slope is the effect of the common stream slope on the log(density).
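The same model can be sketched outside JMP by building the design matrix by hand. The fragment below is an illustrative sketch (assuming numpy is available) using only the three complete streams 15-17 from the table above: one 0/1 indicator column per stream gives the separate intercepts, and one continuous column carries the common slope. Note that this tiny subset happens to give a negative estimate; it is only meant to show the mechanics, not to reproduce the full-data fit.

```python
import numpy as np

# Streams 15, 16, 17 from the table: (stream, slope, density)
data = [(15, 0.7, 8.5), (15, 2.9, 7.0),
        (16, 1.1, 5.8), (16, 3.0, 5.0),
        (17, 2.2, 5.1), (17, 5.0, 5.0)]
streams = sorted({d[0] for d in data})

# Design matrix: one indicator column per stream (separate intercepts)
# plus one continuous column for slope (the common slope).
X = np.array([[1.0 if d[0] == s else 0.0 for s in streams] + [d[1]]
              for d in data])
y = np.log([d[2] for d in data])           # log(density)

beta, *_ = np.linalg.lstsq(X, y, rcond=None)
common_slope = beta[-1]
print(common_slope)   # about -0.047 for this three-stream subset
```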

This is fit using the Analyze->Fit Model platform as:

7 The JMP dataset also created a different plotting symbol for each stream using the Rows -> Color or Mark by Column menu.


Note that stream must have a nominal scale and that slope must have a continuous scale. The order of the terms in the effects box is not important.

The output from the Analyze->Fit Model platform is voluminous, but a careful reading reveals several interesting features.

First is a plot of the common slope fit to each stream:


This shows a gradual increase as slope increases. This plot is hard to interpret, but a plot of observed vs. predicted values is clearer:


Generally, the observed values are close to the predicted values except for two potential outliers. By clicking on these points, it is seen that both points belong to stream 2, where it appears that an increase in slope causes a large decrease in density, contrary to the general pattern seen in the other streams.

The effect tests:

fail to detect any influence of slope. Indeed, the estimated coefficient associated with a change in slope is found to be:


is estimated to be .025 (se .0299), which is not statistically significant. 8

Residual plots also show the odd behavior of stream 2:

If this rogue stream is "eliminated" from the analysis, the resulting plots do not show any problems (try it), but now the results are statistically significant (p=.035):

8 Because the natural log transform was used and the data were analyzed on the log scale, "smallish" slope coefficients have an approximate percentage interpretation. In this example, a slope of .025 on the (natural) log scale implies that the estimated fish density INCREASES by 2.5% every time the slope increases by one percentage point.
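The percentage reading of the footnote is just the exponentiated coefficient: for small coefficients on the natural-log scale, exp(b) - 1 is approximately b. A quick check (Python, standard library only):

```python
import math

b = 0.025                      # fitted slope on the natural-log scale
pct_change = math.exp(b) - 1   # exact multiplicative change per unit of slope
print(round(pct_change, 4))    # 0.0253, i.e. about a 2.5% increase
```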


The estimated change in log-density per percentage point change in the slope is found to be:

i.e. the slope is .05 (se .02), which is interpreted as: a one percentage point increase in stream slope increases fish density by 5%. 9

The remaining residual plots and leverage plots show no problems.

Yet another alternate analysis!

Because the treatment only has two levels, the same answers can also be obtained by estimating the ratio of the change in the log(density) to the change in slope. 10 To begin, we need to split the data table so that both the log(density) and the slope are in separate columns:

9 This easy interpretation occurs because the natural log transform was used. If the common (base 10) log transform were used, there would no longer be such a simple interpretation.

10 If the slope-class had three or more levels, this analysis could not be done, and the previous analysis would be the preferred route.


This creates a data table with separate columns for the log(density) and the stream slope for both the high and low slope categories:


Now create two new variables (create new columns and write a formula for each column) representing the differences in the log(density) and slope between the high and low slope classes:

Finally, we wish to fit a line through the origin to these data points. We use the Analyze->Fit Y-by-X platform, then choose Fit Special from the red-triangle drop-down menu:


and then check the Constrain Intercept option:


This gives the following output:


We obtain the same estimated effect and se. The outlier from stream 2 is readily evident. When this outlier is excluded and the analysis is repeated, again a statistically significant result is obtained that matches the previous analysis.
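Because each stream contributes one (low, high) pair, the ratio analysis reduces to fitting a line through the origin to the within-stream differences, and a through-origin slope has the closed form sum(x*y)/sum(x*x). A sketch of the arithmetic (Python, standard library), using streams 15-17 from the table earlier as illustration; by algebra it reproduces the block-intercept ANCOVA estimate for the same subset:

```python
import math

# (low-slope, high-slope) reaches per stream: each entry is (slope, density)
pairs = {15: ((0.7, 8.5), (2.9, 7.0)),
         16: ((1.1, 5.8), (3.0, 5.0)),
         17: ((2.2, 5.1), (5.0, 5.0))}

# Differences (high minus low) in slope and in log(density).
dx = [hi[0] - lo[0] for lo, hi in pairs.values()]
dy = [math.log(hi[1]) - math.log(lo[1]) for lo, hi in pairs.values()]

# Line through the origin: slope = sum(x*y) / sum(x*x).
ratio = sum(x * y for x, y in zip(dx, dy)) / sum(x * x for x in dx)
print(ratio)   # same value as the ANCOVA common slope for this subset
```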


5.9 Comparing Fulton's Condition Factor K

Not all fish within a lake are identical. How can a single summary measure be developed to represent the condition of fish within a lake?

In general, the relationship between fish weight and length follows a power law:

W = aL^b

where W is the observed weight, L is the observed length, and a and b are coefficients relating length to weight. The usual assumption is that heavier fish of a given length are in better condition than lighter fish. Condition indices are a popular summary measure of the condition of the population.

There are at least eight different measures of condition which can be found by a simple literature search. Cone (1989) raises some important questions about the use of a single index to represent the two-dimensional weight-length relationship.

One common measure is Fulton's 11 K:

K = Weight / (Length/100)^3

This index makes an implicit assumption of isometric growth, i.e. as the fish grows, its body proportions and specific gravity do not change.
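Computing K for a single fish is a one-liner; the numbers below are hypothetical, chosen only to show the scale of the index (Python, standard library):

```python
def fulton_k(weight_g, length_mm):
    """Fulton's condition factor K = weight / (length/100)^3,
    with weight in grams and length in millimetres."""
    return weight_g / (length_mm / 100.0) ** 3

# Hypothetical fish: 500 g at 330 mm.
k = fulton_k(500.0, 330.0)
print(round(k, 2))   # 13.91
```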

How can K be computed from a sample of fish, and how can K be compared among different subsets of fish from the same lake or across lakes?

The B.C. Ministry of Environment takes regular samples of rainbow trout using a floating and a sinking net. For each fish captured, the weight (g), length (mm), sex, and maturity of the fish were recorded.

The data are available in the rainbow-condition.jmp data file in the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms.

A portion of the raw data appears below:

11 There is some doubt about the first authorship of this condition factor. See Nash, R. D. M., Valencia, A. H., and Geffen, A. J. (2005). The Origin of Fulton's Condition Factor - Setting the Record Straight. Fisheries, 31, 236-238.


K was computed for each individual fish, and the resulting histogram is displayed below:

There is a range of condition numbers among the individual fish, with an average (among the fish caught) K of about 13.6.

Deriving a single summary measure to represent the entire population of fish in the lake depends heavily on the sampling design used to capture fish.

Some care must be taken to ensure that the fish collected are a simple random sample from the fish in the population. If a net of a single mesh size is used, it has a selectivity curve and is typically more selective for fish of a certain size. In this experiment, several different mesh sizes were used to try to ensure that fish of all sizes have an equal chance of being selected.

As well, regression methods have an advantage in that a simple random sample from the population is no longer required to estimate the regression coefficients. As an analogy, suppose you are interested in the relationship between the yield of plants and soil fertility. Such a study could be conducted by finding a random sample of soil plots, but this may lead to many plots with similar fertility and only a few plots with fertility at the tails of the relationship. An alternate scheme is to deliberately seek out soil plots with a range of fertilities, or to purposely modify the fertility of soil plots by adding fertilizer, and then fit a regression curve to these selected data points.

Fulton's index is often re-expressed for regression purposes as:

W = K (L/100)^3

This looks like a simple regression between W and (L/100)^3, but with no intercept.

A plot of these two variables:

shows a tight relationship among fish, but with possibly increasing variance with length.

There is some debate about the proper way to estimate the regression coefficient K. Classical regression methods (least squares) implicitly assume that all of the "error" in the regression is in the vertical direction, i.e. the analysis conditions on the observed lengths. However, the structural relationship between weight and length likely involves variation in both variables. This leads to the error-in-variables problem in regression, which


has a long history. Fortunately, the relationship between the two variables is often sufficiently tight that it really doesn't matter which method is used to find the estimates.

JMP can be used to fit the regression line constraining the intercept to be zero by using the Fit Special option under the red-triangle:


This gives rise to the fitted line and statistics about the fit:


Note that R^2 really doesn't make sense in cases where the regression is forced through the origin, because the null model to which it is being compared is the line Y = 0, which is silly. 12 For this reason, JMP does not report a value of R^2.

The estimated value of K is 13.72 (SE 0.099).
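The through-origin estimate itself is easy to reproduce by hand: for the model W = K (L/100)^3 with no intercept, least squares gives K-hat = sum(x*W)/sum(x*x) with x = (L/100)^3. A sketch (Python, standard library) on made-up fish generated around K = 13.7; the data are hypothetical, not the rainbow-trout file:

```python
lengths = [250.0, 300.0, 350.0, 400.0]            # mm, hypothetical fish
x = [(L / 100.0) ** 3 for L in lengths]           # (L/100)^3
true_k = 13.7
noise = [5.0, -5.0, 8.0, -8.0]                    # fixed offsets in grams
weights = [true_k * xi + e for xi, e in zip(x, noise)]

# No-intercept least squares: K_hat = sum(x*w) / sum(x*x)
k_hat = sum(xi * wi for xi, wi in zip(x, weights)) / sum(xi * xi for xi in x)
print(round(k_hat, 2))   # 13.67, close to the generating value
```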

The residual plot:

shows clear evidence of increasing variation with the length variable. This usually implies that a weighted regression is needed, with weights proportional to 1/length^2. In this case, such a regression gives essentially the same estimate of the condition factor (K̂ = 13.67, SE = .11).
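The weighted fit changes only the normal equations: with weights w_i proportional to 1/L_i^2, the through-origin estimate becomes K-hat = sum(w*x*y)/sum(w*x*x). A self-contained sketch on the same kind of hypothetical fish data (Python, standard library):

```python
lengths = [250.0, 300.0, 350.0, 400.0]            # mm, hypothetical fish
x = [(L / 100.0) ** 3 for L in lengths]
weights_g = [13.7 * xi + e for xi, e in zip(x, [5.0, -5.0, 8.0, -8.0])]

# Weighted no-intercept least squares, weights proportional to 1/length^2.
w = [1.0 / L ** 2 for L in lengths]
num = sum(wi * xi * yi for wi, xi, yi in zip(w, x, weights_g))
den = sum(wi * xi * xi for wi, xi in zip(w, x))
k_weighted = num / den
print(round(k_weighted, 2))   # 13.69, close to the unweighted estimate
```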

Comparing condition factors

This dataset has a number of sub-groups - do all of the subgroups have the same condition factor? For example, suppose we wish to compare the K value for immature and mature fish. As noted by Garcia-Berthou (2001) 13, this is best done through a technique called Analysis of Covariance (ANCOVA). Some details on ANCOVA are presented in a separate chapter of these notes.

As outlined in the ANCOVA chapter, we start with a model that has a separate K for each maturity class. The simplified syntax for this model is:

W = (Len/100)^3 (Len/100)^3 * Maturity

Note that unlike traditional ANCOVA models, this model lacks the simple effect of maturity. The reason is that, unlike traditional ANCOVA models, the intermediate model with parallel slopes really doesn't make sense when the regression lines are forced through the origin. This syntax specifies that variation in weight is attributable to variation in length and to an interaction between length and maturity. This latter term represents the differential K between the maturity classes.

Here is where some care must be taken. By default, JMP "centers" (i.e. subtracts the mean of) continuous X variables when they participate in an interaction or similar term:

Hence, if you just try to implement the above model directly in JMP, you will actually fit the model:

W = (Len/100)^3 ((Len/100)^3 - mean[(Len/100)^3]) * Maturity

which, when expanded, actually adds an intercept term to the model. Ordinarily, in regression models with intercepts, this would NOT be a problem - it is because the model is being forced through the origin that this causes a problem.

In order to prevent JMP from "centering" the length variable when fitting these ANCOVA models, turn off the centering option (by unchecking it) when the model is fit using the Analyze->Fit Model platform of JMP:

12 Consult any of the standard references on regression, such as Draper and Smith, for more details.

13 Garcia-Berthou, E. (2001). On the misuse of residuals in ecology: testing regression residuals vs. the analysis of covariance. Journal of Animal Ecology 70, 708-711. http://dx.doi.org/10.1046/j.1365-2656.2001.00524.x


Note the use of the No Intercept option to again force the line through the origin. JMP will 'complain' about the odd form of the model because it is missing the simple maturity class effect, but just ignore the complaints. This gives the summary output for the effect test of:

The p-value for the last term in the table, 0.027, indicates that there is strong evidence of a different K between the two maturity classes.

The estimates for the separate maturity classes are obtained from the Custom Test option (some knowledge of the design matrix coding for categorical variables in JMP is needed to know that JMP uses a (1, -1) coding for indicator variables with 2 classes):


which gives the estimated K for each maturity class.
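The uncentered two-group model can be sketched outside JMP with numpy: two columns, the covariate x = (Len/100)^3 and the UNcentered product x * I(mature), with no intercept. The data here are hypothetical, noise-free fish built with K = 13.0 (immature) and K = 14.0 (mature), so the fit recovers them exactly; note the sketch uses a 0/1 group coding for clarity, not JMP's (1, -1) coding.

```python
import numpy as np

# Hypothetical fish: (length mm, mature?) with exact K of 13 or 14.
fish = [(200.0, 0), (250.0, 0), (300.0, 0),
        (300.0, 1), (350.0, 1), (400.0, 1)]
x = np.array([(L / 100.0) ** 3 for L, _ in fish])
mature = np.array([m for _, m in fish], dtype=float)
w = np.where(mature == 1, 14.0, 13.0) * x        # weights in grams, no noise

# Columns: x and the UNcentered interaction x * I(mature); no intercept.
X = np.column_stack([x, x * mature])
beta, *_ = np.linalg.lstsq(X, w, rcond=None)
k_immature = beta[0]
k_mature = beta[0] + beta[1]                      # add back the differential K
print(round(k_immature, 2), round(k_mature, 2))  # 13.0 14.0
```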


If you fit a separate regression for the two maturity classes (use the By option on the Fit Model box), you will get the same two estimates. The respective standard errors will be slightly different because the single model is able to pool over all of the data to estimate the standard errors, but separate fits cannot do any pooling.

The separate fitted lines are shown below:


Similarly, a comparison of K can be made among the three sex classes (M, F, and U), where immature fish cannot be sexed and are given the code U, while mature fish are further subdivided into the M and F classes (don't forget to uncheck the centering option in the triangle in the upper left corner of the Analyze->Fit Model dialogue):

also shows evidence (p=.025) of a differential K among the three sex classes (this is not unexpected), and a contrast can be done to see if there is further evidence of a difference between the males and the females:


As the p-value is .0074, there is also strong evidence of a differential K between the males and females.

A final plot of the three lines is:


Finally, because you have replicate fish at the same body length, it is possible to perform a formal lack-of-fit test. The idea behind this test is to compare the variation in data points at the same replicated lengths (pure error) with the deviations around the line from the model (model error). If the model fits well, these two estimates of residual variance should be comparable:

The p-value for the lack-of-fit test is quite large, indicating no evidence of a lack of fit.
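The pure-error / lack-of-fit decomposition can be sketched on a tiny hypothetical data set (Python, standard library): replicate y readings at each x give the pure-error sum of squares, group means compared against the fitted through-origin line give the lack-of-fit sum of squares, and their mean squares form the F ratio.

```python
# Hypothetical replicated data: two y readings at each of three x values.
groups = {1.0: [2.0, 2.2], 2.0: [4.1, 3.9], 3.0: [6.0, 6.2]}

# Through-origin fit over all points: K = sum(x*y) / sum(x*x).
sxy = sum(x * y for x, ys in groups.items() for y in ys)
sxx = sum(x * x * len(ys) for x, ys in groups.items())
k = sxy / sxx

n = sum(len(ys) for ys in groups.values())
# Pure error: variation of replicates around their own group mean.
pure_err = sum((y - sum(ys) / len(ys)) ** 2
               for ys in groups.values() for y in ys)
# Lack of fit: group means vs. the fitted line, weighted by replicates.
lack_fit = sum(len(ys) * (sum(ys) / len(ys) - k * x) ** 2
               for x, ys in groups.items())

df_pe = n - len(groups)          # 6 points - 3 distinct x values = 3
df_lof = len(groups) - 1         # 3 distinct x values - 1 fitted parameter = 2
F = (lack_fit / df_lof) / (pure_err / df_pe)
print(round(F, 3))   # a small F, consistent with no lack of fit
```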

This same ANCOVA method can be used to compare the K values across lakes, or across time within the same lake. If you have a large number of lakes, each measured multiple times, some very interesting models can be fit that are beyond the scope of these notes - please contact me. Similarly, interest may lie in modeling K as a function of other lake-specific covariates such as lake size, productivity, etc. Again, please contact me, as this is beyond the scope of these notes.


Statistical significance is not the same as biological significance! While there was evidence of a differential K in this data set, this statistical significance does not imply biological importance. I have no idea whether the observed differences in K among these three groups have any biological meaning.

5.10 Final Notes

Some sections need to be added here on the following topics:

• the danger of ANCOVA when there is no overlap in the covariate

• the choice between a paired t-test, a multivariate test, or ANCOVA in the case of two time points

Chapter 6

Multiple linear regression

6.1 Introduction

In previous chapters, the relationship between a single, continuous variable (Y, a.k.a. the response variable) and a single continuous variable (X, a.k.a. the predictor or explanatory variable) was explored using simple linear regression. In this chapter, this will be generalized to the case of more than one explanatory (X) variable. 1

There are many good books covering this topic - refer to the list in previous chapters.

Fortunately, many of the techniques learned in the previous chapter on simple linear regression carry over directly to the more general multiple regression. There are a few subtle differences in interpretation, and additional problems (such as variable selection) must be solved.

It turns out that multiple regression methods are very general methods covering a wide range of statistical problems under the rubric of general linear models. Surprisingly, multiple regression is a general solution for two-sample t-tests, for ANOVA models, for simple linear regression models, etc. The exact theory is beyond the scope of these notes, but intuitive explanations will be provided as needed.

6.1.1 Data format and missing values

The data are collected and stored in a tabular format with rows representing observations and columns representing different variables. One of the variables will be the response (Y) variable; there can be several predictor (X) variables. Virtually all computer packages require variables to be stored in columns and observations stored in rows.

1 It is possible to also have more than one Y variable - this is known as multivariate multiple regression but is not covered in this chapter.

The response variable (Y) must be continuous. It is NOT appropriate to do multiple regression when the Y variable represents categories - the appropriate methodology in this case is logistic regression. If the Y variable represents counts, a technique known as Poisson regression may be more appropriate - consult the chapter on generalized linear models for more details. Finally, in some cases, the value of Y may be censored, i.e. the exact value is not known, but it is known to be above or below certain threshold values (e.g. above or below detection limits). The analysis of such data is beyond the scope of these notes - consult the chapter on Tobit analysis for details.

Surprisingly, there is much more flexibility in the type of the X variables. They may be continuous, as seen previously in simple linear regression, or they may be dichotomous variables taking only the values of 0 or 1 (known as indicator variables). 2 These indicator variables are used to represent different groups (e.g. male and female) in the data.

The dataset is assumed to be complete, with NO missing values in any of the X variables. If an observation (row) has some missing X values, most computer packages practice what is known as case-wise deletion, i.e. the entire observation will be dropped from the analysis. Consequently, it is always important to check the computer output to see exactly how many observations have been used in the analysis.

Missing Y values also imply that the observation (row) will be deleted from the analysis. However, if the set of X variables is complete, it is still possible to obtain predictions of Y for the observed set of X values.

As in previous chapters, missing data should be examined to see if they are missing completely at random (MCAR), in which case there is usually no problem in the analysis other than reduced sample size; missing at random (MAR), which is again handled relatively easily; or informative missing (IM), which poses serious problems in the analysis. Seek help for the latter case.
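Case-wise deletion is easy to mimic, and mimicking it shows why the reported n can silently shrink. A sketch (assuming numpy; the tiny arrays are made up):

```python
import numpy as np

# Three observations; one has a missing X, one a missing Y.
X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [4.0, 5.0]])
y = np.array([10.0, 20.0, np.nan])

# Case-wise deletion: keep only rows complete in every X column AND in Y.
complete = ~(np.isnan(X).any(axis=1) | np.isnan(y))
Xc, yc = X[complete], y[complete]
print(Xc.shape[0])   # 1 -- only the first row survives
```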

6.1.2 The statistical model

The statistical model for multiple regression is an extension of that for simple linear regression.

The response variable, denoted by Y, is measured along with a set of predictor variables, denoted by X_1, X_2, ..., X_p, where p is the number of predictor variables.

The formal statistical model is:

Y_i = β_0 + β_1 X_i1 + β_2 X_i2 + ... + β_p X_ip + ε_i

where the unknown parameters are the set of β's. The deviation between the observed value of Y and the predicted value from the regression equation, ε_i, is distributed as a Normal distribution with a mean of 0 and an (unknown) variance of σ^2.

2 In actual fact, any set of two distinct values may be used, but traditional usage is to use 0 and 1.


This is often written using a shorthand notation in many statistical packages as:

Y = X_1 X_2 ... X_p

where the intercept (β_0) and the residual variation (ε) are implicit.

This can also be written using matrices as:

Y = Xβ + ε

where Y is an n × 1 column vector, X is an n × (p+1) matrix [don't forget the intercept column] of the predictors, β is a (p+1) × 1 column vector (the intercept β_0, plus the p "slopes" β_1, ..., β_p), and ε is an n × 1 vector of residuals that has a multivariate normal distribution with a mean of 0 and a covariance matrix of Iσ^2, where I is the identity matrix.
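The matrix form translates directly into code: stack a column of ones (the intercept) with the predictor columns and solve the least-squares problem. A sketch with numpy, on made-up data generated exactly (no noise) so the coefficients are recovered exactly:

```python
import numpy as np

# Made-up data generated exactly from Y = 1 + 2*X1 + 3*X2 (no noise).
x1 = np.array([0.0, 1.0, 2.0, 3.0])
x2 = np.array([1.0, 0.0, 1.0, 0.0])
y = 1.0 + 2.0 * x1 + 3.0 * x2

# X is n x (p+1): don't forget the intercept column of ones.
X = np.column_stack([np.ones_like(x1), x1, x2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.round(beta, 6))   # [1. 2. 3.]
```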

Note that this format for multiple regression is very flexible. By appropriate definition of the X variables, many different problems can be cast into a multiple-regression framework. In future courses you will see that ANOVA (a technique to compare means among multiple groups) is actually nothing but regression in disguise!

6.1.3 Assumptions

Not surprisingly, the assumptions for a multiple regression analysis are very similar to those required for a simple linear regression.

Linearity

Because of the multiple X variables, the assumption of linearity is not as straightforward as for simple linear regression.

Multiple regression analysis assumes that the MARGINAL relationship between Y and each X is linear. This means that if all other X variables are held constant, then changes in the particular X variable lead to a linear change in the Y variable. Because this is a MARGINAL relationship, simple plots of Y vs. each X variable may not be linear. This is because the simple pairwise plots can't hold the other variables fixed.

To assess this relationship, residuals from the fit should be plotted against each X variable in turn. If the scatter of the residuals is not random around 0 but shows some pattern (e.g. a quadratic curve), this usually indicates that the marginal relationship between Y and that particular X is not linear. Alternatively, fit a model that includes both X and X² and test if the coefficient associated with X² is zero; unfortunately, this test could fail to detect a higher-order relationship. Third, if there are multiple readings at some X-values, then a test of goodness-of-fit (what JMP calls the Lack of Fit test) can be performed, where the variation of the responses at the same X value is compared to the variation around the regression line.
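The quadratic-term check can be sketched in a few lines. The data below are invented purely to show the mechanics (a response that is genuinely curved in X), and the fits use ordinary least squares via NumPy rather than any particular package:

```python
import numpy as np

# Invented data: Y is deliberately curved in X, so a straight-line fit
# leaves a quadratic pattern in the residuals.
rng = np.random.default_rng(1)
x = np.linspace(0, 10, 50)
y = 2 + 0.5 * x + 0.1 * x**2 + rng.normal(0, 0.5, 50)

# Fit Y = b0 + b1*X, then Y = b0 + b1*X + b2*X^2.
X1 = np.column_stack([np.ones_like(x), x])
X2 = np.column_stack([np.ones_like(x), x, x**2])
b1, rss1, *_ = np.linalg.lstsq(X1, y, rcond=None)
b2, rss2, *_ = np.linalg.lstsq(X2, y, rcond=None)

# The X^2 term soaks up the leftover curvature, so the residual sum of
# squares drops sharply; a formal test would compare b2[2] to its se.
print(rss1[0] > rss2[0])
```

If the response really were linear in X, adding the X² column would change the residual sum of squares very little.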

©2012 Carl James Schwarz 391 November 23, 2012


CHAPTER 6. MULTIPLE LINEAR REGRESSION

Correct sampling scheme

The Y values must be a random sample from the population of Y values for every set of X values in the sample. Fortunately, it is not necessary to have a completely random sample from the population, as the regression line is valid even if the X values are deliberately chosen. However, for a given set of X, the Y values from the population must be a simple random sample.

This latitude gives considerable freedom in selecting points to investigate the relationship between Y and X. This will be discussed more in class.

No outliers or influential points

All the points must belong to the relationship – there should be no unusual points.

The plot of the residuals against the row number or against the predicted values should be investigated to see if there are unusual points. The marginal scatter plots of the residuals from the fit vs. each X should also be examined. As well, leverage plots (Section 6.2.6) are useful for detecting influential points.

Outliers can have a dramatic effect on the fitted line.

Equal variation along the line

The variability about the regression plane must be similar for all sets of X, i.e. the scatter of the points above and below the fitted surface should be roughly constant over the entire surface. This is assessed by looking at the plots of the residuals against each X variable to see if the residuals are roughly uniformly scattered around zero, with no increase and no decrease in spread over the entire line.

Independence

Each value of Y is independent of any other value of Y. The most common case where this fails is time series data.

This assumption can be assessed by again looking at residual plots against time or other variables.


Normality of errors

The difference between the value of Y and the expected value of Y is assumed to be normally distributed. This is one of the most misunderstood assumptions. Many people erroneously assume that the distribution of Y over all X values must be normally distributed, i.e. they look simply at the distribution of the Y's, ignoring the Xs. The assumption of normality only states that the residuals, the differences between the values of Y and the points on the line, must be normally distributed.

This can be assessed by looking at normal probability plots of the residuals. As in ANOVA, for small sample sizes you have little power to detect non-normality, and for large sample sizes it is not that important.

X variables measured without error

It sometimes turns out that the X variables are not known precisely. For example, if you wish to investigate the relationship of illness to second-hand cigarette smoke, it is surprisingly difficult to get an estimate of the "dose" of cigarettes that a worker has been exposed to.

This general problem is called the "error in variables" problem and has a long history in statistics. A detailed discussion of this issue is beyond the scope of these notes.

The uncertainty in each X variable should be assessed.

6.1.4 Obtaining Estimates

The same principle of least squares as in simple linear regression is used to obtain estimates. In general, the sum of squared deviations between the predicted and observed values is computed, and the regression surface that minimizes this value is the final relationship.

The estimated intercept and slopes can be compactly expressed using matrix notation

β̂ = (X′X)⁻¹X′Y

but the details are beyond the scope of these notes. Hand formulae are all but impossible except for trivially small examples – let the computer do the work. Of course, this implies that the scientist has the responsibility to ensure that the brain is engaged before putting the package in gear!

As with all estimates, a measure of precision can be obtained. As before, this is the standard error of each of the estimates. Again, there are computational formulae, but in this age of computers these are not important. As before, approximate 95% confidence intervals for the corresponding population parameters are found as estimate ± 2 × se. Most packages will compute the 95% confidence intervals for the slopes as well.
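As a sketch of the matrix formula in action (with invented data and NumPy; nothing here is specific to any one statistical package), the estimates, their standard errors, and the approximate 95% intervals can be computed directly:

```python
import numpy as np

# Invented data: an intercept column plus two predictors.
rng = np.random.default_rng(42)
n = 30
X = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
beta_true = np.array([1.0, 2.0, -0.5])
Y = X @ beta_true + rng.normal(0, 0.1, n)

# beta-hat = (X'X)^{-1} X'Y, computed via a linear solve for stability.
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)

# Standard errors: sqrt of the diagonal of s^2 (X'X)^{-1}.
resid = Y - X @ beta_hat
s2 = resid @ resid / (n - X.shape[1])
se = np.sqrt(s2 * np.diag(np.linalg.inv(X.T @ X)))

# Approximate 95% confidence intervals: estimate +/- 2*se.
ci = np.column_stack([beta_hat - 2 * se, beta_hat + 2 * se])
print(np.round(beta_hat, 2))
```

With so little residual noise, the estimates land close to the coefficients used to generate the data, and the ± 2 × se intervals cover them.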


Once the fit has been obtained, the fit of the model can be assessed in various ways as outlined below.

The overall fit of the model is assessed using a Whole Model Test that is traditionally placed in an ANOVA table. This test examines if there is at least one X variable that seems to be marginally related to the Y values. Usually, it is of little interest.

The individual marginal contribution of each X variable (how each X variable affects the response holding all the other X variables constant) can be assessed directly either from the reported estimates and standard errors or from an Effect Test – these are exactly equivalent.

Formal tests of hypotheses about the marginal contribution of each variable can also be done. Usually, these are only done on the slope parameters, as these are typically of most interest. The null hypothesis is that the population marginal slope of a particular X variable is 0, i.e. there is no marginal relationship between Y and that particular X. More formally, the null hypothesis for the Xi variable is:

H: βi = 0

Again notice that the null hypothesis is ALWAYS in terms of a population parameter and not in terms of a sample statistic.

The alternate hypothesis is typically chosen as:

A: βi ≠ 0

although one-sided tests looking for either a positive or negative slope are possible.

The test statistic is found as

T = (bi − 0) / se(bi)

and is compared to a t-distribution with the appropriate degrees of freedom to obtain the p-value. This is usually done automatically by most computer packages. The p-value is interpreted in exactly the same way as in ANOVA, i.e. it measures the probability of observing this data if the hypothesis of no relationship were true.
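As a small illustration with made-up numbers (an estimated slope of 0.45 with a standard error of 0.18), the test statistic and an approximate two-sided p-value can be computed by hand; for moderate-to-large degrees of freedom the t-distribution is close to the normal, so the normal CDF gives a serviceable approximation:

```python
import math

# Made-up numbers: suppose a fit reported b_i = 0.45 with se(b_i) = 0.18.
b_i, se_b = 0.45, 0.18
t = (b_i - 0) / se_b          # T = (b_i - 0) / se(b_i)

# Two-sided p-value via the normal approximation to the t-distribution
# (a package would use the exact t-distribution for the actual df):
p_approx = math.erfc(abs(t) / math.sqrt(2))
print(round(t, 2), round(p_approx, 3))  # prints: 2.5 0.012
```

A p-value near 0.01 would be taken as good evidence of a marginal relationship for that variable.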

It is also possible to obtain tests for sets of predictors (e.g. can several X variables be simultaneously dropped from the model?), as will be seen later in the notes.

Finally, if there are a large number of X variables, is there an objective way to decide which subset of the X variables is useful in predicting Y? Again, this is deferred until later in this chapter.

6.1.5 Predictions

Once the best-fitting line is found, it can be used to make predictions for new sets of X.

There are two types of predictions that are commonly made. It is important to distinguish between them, as these two intervals are the source of much confusion in regression problems.


First, the experimenter may be interested in predicting a SINGLE future individual value for a particular set of X. Second, the experimenter may be interested in predicting the AVERAGE of ALL future responses at a particular set of X.³ The prediction interval for an individual response is sometimes called a confidence interval for an individual response, but this is an unfortunate (and incorrect) use of the term confidence interval. Strictly speaking, confidence intervals are computed for fixed unknown parameter values; prediction intervals are computed for future random variables.

Both of the above intervals should be distinguished from the confidence interval for the slope.

In both cases, the estimate is found in the same manner – substitute the new set of X into the equation and compute the predicted value Ŷ. In most computer packages this is accomplished by inserting a new "dummy" observation in the dataset with the value of Y missing, but the values of X present. The missing Y value prevents this new observation from being used in the fitting process, but the X values allow the package to compute an estimate for this observation.

What differs between the two predictions are the estimates of uncertainty.

In the first case (predicting single values), there are two sources of uncertainty involved in the prediction. First, there is the uncertainty caused by the fact that the estimated line is based upon a sample. Then there is the additional uncertainty that the value could be above or below the predicted line. This interval is often called a prediction interval at a new X.

In the second case (predicting the mean of future responses), only the uncertainty caused by estimating the line based on a sample is relevant. This interval is often called a confidence interval for the mean at a new X.

The prediction interval for an individual response is typically MUCH wider than the confidence interval for the mean of all future responses, because it must account for the uncertainty from the fitted line plus the individual variation around the fitted line.

Many textbooks have the formulae for the se for the two types of predictions, but again, there is little to be gained by examining them. What is important is that you read the documentation carefully to ensure that you understand exactly what interval is being given to you.
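The two standard errors differ only by an extra "1" under the square root, which represents the individual variation around the line. A sketch with invented straight-line data (the new point x0 is hypothetical) shows the relationship:

```python
import numpy as np

# Invented simple-linear-regression data.
rng = np.random.default_rng(0)
n = 40
X = np.column_stack([np.ones(n), rng.uniform(0, 10, n)])
Y = X @ np.array([3.0, 1.5]) + rng.normal(0, 1.0, n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ Y
resid = Y - X @ beta_hat
s = np.sqrt(resid @ resid / (n - 2))

x0 = np.array([1.0, 5.0])            # hypothetical new observation at X = 5
h = x0 @ XtX_inv @ x0
se_mean = s * np.sqrt(h)             # se for the mean response at x0
se_indiv = s * np.sqrt(1 + h)        # se for a single new response at x0
print(se_indiv > se_mean)            # the prediction interval is always wider
```

The "1 + h" term is why the individual interval can never be narrower than the interval for the mean, no matter how large the sample.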

6.1.6 Example: blood pressure

Blood pressure tends to increase with age, body mass, and stress. To investigate the relationship of blood pressure to these variables, a sample of men in a large corporation was selected. For each subject, their age (years), body mass (kg), and a stress index (ranging from 0 to 100) were recorded along with their blood pressure.

The raw data are presented in the following table:

³There is actually a third interval, for the mean of the next "m" individual values, but this is rarely encountered in practice.


Age     Blood Pressure  Body Mass  Stress Index
(years) (mm)            (kg)       (no units)
50      120             55         69
20      141             47         83
20      124             33         77
30      126             65         75
30      117             47         71
50      129             58         73
60      123             46         67
50      125             68         71
40      132             70         77
55      123             42         69
40      132             33         74
40      155             55         86
20      147             48         84
31      .               53         86
32      146             59         .
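As a rough cross-check of this analysis outside JMP, the table can be re-entered and fit with NumPy (a sketch: the "." entries are coded as NaN and dropped case-wise, mirroring what regression packages do with incomplete rows):

```python
import numpy as np

# The table above, with '.' coded as NaN; only complete rows enter the fit.
age    = np.array([50, 20, 20, 30, 30, 50, 60, 50, 40, 55, 40, 40, 20, 31, 32], float)
bp     = np.array([120, 141, 124, 126, 117, 129, 123, 125, 132, 123, 132, 155, 147, np.nan, 146], float)
mass   = np.array([55, 47, 33, 65, 47, 58, 46, 68, 70, 42, 33, 55, 48, 53, 59], float)
stress = np.array([69, 83, 77, 75, 71, 73, 67, 71, 77, 69, 74, 86, 84, 86, np.nan], float)

# Case-wise deletion, as most regression packages practice it:
ok = ~np.isnan(age) & ~np.isnan(bp) & ~np.isnan(mass) & ~np.isnan(stress)
X = np.column_stack([np.ones(ok.sum()), age[ok], mass[ok], stress[ok]])
beta_hat, *_ = np.linalg.lstsq(X, bp[ok], rcond=None)
print(int(ok.sum()))  # 13 complete cases remain, matching the JMP fit below
```

The coefficients in beta_hat (intercept, age, mass, stress) can be compared with the Parameter Estimates reported by JMP later in this section.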

JMP Analysis

The raw data is also available in a JMP data sheet called bloodpress.jmp, available from the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms.

The data has been entered with rows corresponding to the different subjects and columns corresponding to the different variables:


Notice that the response variable is continuous, as are the other variables.⁴ Also notice that the blood pressure value is missing for one subject – it cannot be used in the analysis, but predictions can be made for this subject as all the X values are present. One subject is missing one of the X variables – this subject cannot be used in the fitting process nor for making predictions. The remaining sample size is only 13 subjects.

As usual, the researcher needs to think about why certain values are missing.

It is also interesting to note that measurement error in the X variables could be a concern. For example, it is highly unlikely that the first subject is exactly 20.000000 years old! People usually truncate their age when asked, e.g. even on the day before their 21st birthday, a person will still report their age as 20 years old. Here the error in aging ranges from about 5% of the value (when age is around 20 years old) to about 2% (when age is around 50 years old). How was weight collected? If the subjects were actually weighed, the actual number may not be in dispute (i.e. it is unlikely that the scale is wrong), but then the weight includes shoes, clothing, and so on. If the weight is a recalled measurement, many people under-report their actual weight, often by quite a margin. And how is stress measured? It is likely an index based on a survey, but it is not even clear how to numerically measure stress – after all, stress can't simply be measured like temperature.

Begin by plotting the variables against each other – a simple way is a scatter-plot matrix, available under the Analyze->MultiVariateMethods->Multivariate platform:

⁴In actual fact, these variables have been discretized, but as the discretization interval is small relative to typical values, they can be treated as being continuous.


The scatter-plot matrix shows no strong simple relationships between pairs of variables. Rather surprisingly, weight seems to decrease with age, and there appears to be a general increase of blood pressure with weight.

These pairwise scatter plots are primarily useful for checking for outliers and other problems in the data – often a multivariate relationship is too complex to be seen in simple pairwise plots.

We will fit the model where the response variable (blood pressure) is modeled as a function of the three predictor variables (age, weight, and stress index). Using the shorthand notation discussed earlier, the model is

BloodPressure = Age Weight Stress

This model is fit using the Analyze->Fit Model platform:


The X variables can be listed in any order.

The output from the Analyze->Fit Model platform is voluminous and cannot be displayed in one panel, so it is necessary to look at several parts in more detail.

Because of the missing values, only 13 subjects could be used in the model fit:


The number of cases actually used in the fit should always be ascertained, because in large datasets the missing-value pattern may not be easily discerned.

First, assess the overall fit of the model by examining the plot of the actual blood pressure vs. the predicted blood pressure. If the model made exact predictions, then the points on the plot would all lie perfectly on the 45° line. The plot from this fit shows that most points lie fairly close to the 45° line. As well, there are no points that appear to have undue leverage on the plot, as there is a general scatter around the 45° line.

The residual plot also shows a random scatter of residuals around the value of 0 with no apparent pattern.

The whole model test, i.e. whether any of the X variables provide information on predicting Y, is found in the Analysis of Variance table. The p-value is very small, and so there is good evidence that at least one X variable appears to predict the blood pressure. Of course, at this point it is unclear which X variables are good predictors and which X variables may be poor predictors.

The fitted regression equation is found by looking at the Parameter Estimates area and is:

predicted BloodPressure = −61.3 + 0.45(Age) − 0.087(Stress) + 2.37(Weight)

These coefficients are interpreted as the MARGINAL increase in blood pressure when each variable changes by 1 unit AND ALL OTHER VARIABLES REMAIN FIXED. For example, the coefficient of 0.45 for age indicates that the estimated blood pressure increases by 0.45 units for each year increase in age, assuming that the stress index and weight remain constant. The concept of marginality, i.e. the marginal increase in Y when a single X variable is changed but all other X variables are held fixed, is the crucial concept in multiple regression. In some cases, for example polynomial regression, it is impossible to hold all other X variables fixed, as you will see later in this chapter.

The sign of the coefficient for stress is somewhat surprising, but as you will see in a few minutes, it is nothing to worry about.

Are there any X variables that don't appear to be useful in predicting blood pressure? The Effect Tests or the Parameter Estimates table provide some clues.

The p-values from the Effect Tests table and the Parameter Estimates table are identical; the F-statistic is simply the t-ratio squared. These are MARGINAL tests, i.e. is a particular X variable useful in predicting the blood pressure given that all other variables remain in the model? For example, the test for age examines if blood pressure changes with age after adjusting for stress and weight. The test for stress examines if blood pressure changes with stress after adjusting for age and weight.

In this example, the p-value for stress appears to be statistically not significant. This would imply that blood pressure does not seem to increase with stress after adjusting for age and weight. This would indicate that perhaps stress could be dropped from the model, and a final model using only age and weight may be suitable. Consequently, the negative sign on the coefficient is not really worrisome.

Again, this concept of marginality is crucial for the proper interpretation of the statistical tests. If two X variables are related, it is possible that both of the statistical tests could be non-significant, but this does not imply that both variables can be dropped from the model. Later in this chapter (Section 6.4), it will be shown how to test if multiple variables can be simultaneously dropped from the model.

The leverage plots should also be examined to see that any relationship between the predictor and response variables is not highly dependent upon a single (high-leverage) point. Leverage plots, in general, examine the new information each X variable contributes in predicting Y after adjusting for all the other variables in the model. The general theory is presented in Section 6.2.6. Two features of the plot should be examined. The general statistical significance of the X variable is found by considering the slope of the line and whether the confidence curves contain the horizontal line:


We see that the confidence curves in the leverage plots for age and weight both do not contain the horizontal line. However, the confidence curve on the leverage plot for stress includes the horizontal line, indicating that this variable's contribution to predicting blood pressure is not statistically useful.

The second feature of leverage plots that should be examined is the distribution of points along the X axis of the leverage plot. There should be a fairly even distribution along the bottom axis, and the fitted line in the leverage plot should not be heavily influenced by a few points with high leverage.

By clicking on the red triangle associated with the fit, it is possible to save various predictions to the data table. For example, save the predicted values and the two types of confidence intervals (for the mean and for individuals):


Notice that for observation 14, only the blood pressure was missing, and so predictions of the blood pressure for that individual can be made. However, for individual 15, at least one of the X variables had a missing value, and so no predictions can be made.

The predictions are simply found by substituting the X values into the prediction equation. As in simple linear regression, there are two different confidence intervals. The confidence interval for the MEAN response would be useful for predicting the average blood pressure over many people with the same values of X as recorded. The confidence interval for the INDIVIDUAL response would be useful for predicting the blood pressure of a single future individual with those particular X values. A common error is to confuse these two types of intervals.

As in simple linear regression, a common way to make predictions is to add rows to the end of the data table with the Y variable deliberately set to missing and the X values set to those of interest. These rows are NOT used in the model fitting, but because the X set is complete, predictions can be made.

If the residuals are saved to the data table, a normal probability plot of the residuals can be made using the Analyze->Distribution platform on the saved residuals.
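If you want to see what such a plot is computing, the coordinates of a normal probability plot can be built by hand: sort the residuals and pair them with normal quantiles at standard plotting positions. The residuals below are invented; in practice you would use the ones saved from the fit:

```python
import numpy as np
from statistics import NormalDist

# Invented residuals standing in for the ones saved from the fit.
resid = np.array([-1.2, 0.3, 0.8, -0.5, 0.1, 1.5, -0.9, 0.4, -0.2, 0.6])
n = len(resid)

sample_q = np.sort(resid)  # ordered residuals (sample quantiles)
# Theoretical normal quantiles at plotting positions (i - 0.5)/n:
theo_q = np.array([NormalDist().inv_cdf((i - 0.5) / n) for i in range(1, n + 1)])

# A roughly straight plot (high correlation between the two sets of
# quantiles) suggests the residuals are approximately normal.
r = np.corrcoef(theo_q, sample_q)[0, 1]
print(round(r, 2))
```

Plotting sample_q against theo_q reproduces the normal probability plot that the platform draws.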

Similarly, the residuals can be plotted against each X variable in turn to assess if there is a linear marginal relationship between Y and each X variable. Each of these residual plots should show a random scatter around zero.

It is also possible to do inverse predictions, but this is beyond the scope of these notes.

There are lots of other interesting features of the Analyze->Fit Model platform that are beyond the scope of these notes.

6.2 Regression problems and diagnostics

6.2.1 Introduction

"All models are wrong, but some are useful." – G.E.P. Box, on page 424 of Empirical Model-Building and Response Surfaces (1987), co-authored with Norman R. Draper.

This famous quote implies that no study ever satisfies the assumptions made when modeling the data. However, unless the violations are extreme, perhaps the model can still be useful for making predictions.

In this section, we will take a detailed look at a number of diagnostic measures to assess the fit of our model to the data.

6.2.2 Preliminary characteristics

Before building complex models, the analyst should become familiar with the basic properties of their data. This is accomplished by:

• Examine the RRR's of experimental and survey design as they relate to this study.

• What is the scale (nominal, ordinal, interval, ratio) of each variable?

• Which are the predictor variables and which is the response variable?

• What is the type (discrete, continuous, discretized continuous) of each variable?

Then do some basic plots and tabulations to spot potential problems in the data:

• Missing values. Examine the pattern of missing values. Most regression packages practice case-wise deletion, i.e. any observation (row) that is missing any of the X or the Y variables is not used in the analysis. If you have a large dataset with many X variables, even a small percentage of missing values can lead to many rows being deleted from the analysis. Think about how the missing values came about: are they MCAR (missing completely at random), MAR (missing at random), or IM (informatively missing)? JMP has a nice feature to tabulate the pattern of missing values under the Tables menu.

• Single variable descriptive statistics. For each variable in the dataset, do some basic descriptive statistics and plots (e.g. histograms, dot-plots, box-plots) to identify potentially extreme observations. Check that all values are plausible: if one variable records the sex of the subject, only two possible values should be recorded; it is unlikely that a woman has 20 natural children; it is unlikely that a human male is more than 3 m tall; etc.

• Pairwise plots. Create bivariate plots of all the variables. Check for unusual-looking observations. These may be perfectly valid observations, but they should be examined in more detail to make sure. A casement plot (a matrix of pairwise scatterplots) can be created easily in JMP using the Analyze->MultiVariateMethods->Multivariate platform.

6.2.3 Residual plots

After the model is fit, compute the residuals, which are simply the VERTICAL differences between the observed and predicted values, ε̂_i = Y_i - Ŷ_i. Most computer packages will compute and plot residuals easily.

The basic assumption about the VERTICAL discrepancies was that they have a mean of zero and a CONSTANT variance σ². We estimated the variance by the MSE in the ANOVA table.

There are several different types of residuals that can be computed and plotted:

• Standardized residual. This is simply computed as z_i = ε̂_i / √MSE and is an attempt to create residuals with a mean of 0 and a variance of 1, i.e. like a standard normal distribution. Because all the residuals are divided by the same value, the pattern seen in the standardized residuals will be the same as that seen in the ordinary residuals.

• Studentized residual. The precision of the predictions changes at different parts of the regression line. You saw earlier that the confidence band for the mean response got wider as the prediction point moved further away from the center of the data. The studentized residuals (see book for computational details) attempt to standardize each residual by its approximate precision. Because each residual is adjusted individually, plots of the studentized residuals will look slightly different from those of the regular or standardized residuals, but they will be similar.

• Jackknifed residuals. Less commonly computed, jackknifed residuals are computed by fitting a regression line after dropping each point in turn, and then finding the residual. For example, if there were 4 data points, the jackknifed residual for the first point would be the difference between the observed value and the predicted value based on a regression line fit to points 2, 3, and 4 only. The jackknifed residual for the second observation would be the difference between the observed value and the predicted value based on the 1st, 3rd, and 4th observations. Plots based on these residuals will appear similar to, but not exactly the same as, plots based on the other residuals.
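The distinctions among these residual types can be made concrete with a small sketch. The code below (hypothetical data; simple linear regression for brevity) computes the ordinary and standardized residuals directly, and the jackknifed residuals by refitting with each point dropped in turn:

```python
import math

def fit_slr(xs, ys):
    """Least squares for y = b0 + b1*x."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - xbar) ** 2 for x in xs)
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    return ybar - b1 * xbar, b1

def residual_types(xs, ys):
    n = len(xs)
    b0, b1 = fit_slr(xs, ys)
    ordinary = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]
    mse = sum(e * e for e in ordinary) / (n - 2)      # the MSE from the ANOVA table
    standardized = [e / math.sqrt(mse) for e in ordinary]
    jackknifed = []                                   # refit without point i, then predict it
    for i in range(n):
        a0, a1 = fit_slr(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        jackknifed.append(ys[i] - (a0 + a1 * xs[i]))
    return ordinary, standardized, jackknifed

xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]     # hypothetical data
ys = [1.2, 1.9, 3.2, 3.8, 5.1, 9.0]     # the last point sits well above the trend
ordinary, standardized, jackknifed = residual_types(xs, ys)
```

For the last point, which sits well off the trend, the jackknifed residual is noticeably larger in magnitude than the ordinary residual; that is exactly what makes it useful for flagging outliers.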

Several plots can be constructed. First, look at the univariate distribution of the residuals. Which observations correspond to the largest negative and positive residuals?

Second, plot the residuals against each predictor variable, against the PREDICTED Y values, and against the order in which the data were collected (this may be, but is not necessarily, the order of the observations in the dataset). Don't plot the residuals against the observed Y values, because you will see strange patterns that are artifacts of the plot.⁵ A good residual plot will show random scatter around zero; bad residual plots will show a definite pattern. Typical residual plots are illustrated below; with small datasets, the patterns will not be as clear cut.

⁵ Basically, negative residuals will be associated with smaller Y values, and these will increase as Y increases, and then crash and rise and then crash and rise again.

With small datasets, don't over-analyze the plots: only gross deviations from the ideal plots are of interest.

A modern alternative to residual plots is to plot the absolute values of the residuals and fit LOWESS curves through them. Consult our Stat400 course (Data Analysis) for details.

Many books present formal tests for residuals. I find these not particularly useful, and prefer the simple residual plots. However, one useful diagnostic is the Durbin-Watson test for autocorrelation; consult the chapter on trend analysis in this collection for details.

Finally, many books also present what are known as normal probability plots to assess the normality of the residuals. Again, I have found these to be less than useful.

6.2.4 Actual vs. Predicted Plot

In multiple regression, it is very difficult to look at plots of Y vs. each X variable and come to anything very useful. In general, you are trying to view a multi-dimensional space in two dimensions.

A plot of the actual Y values vs. the predicted Y values is useful to assess how well the model does in predicting each observation. This plot is produced automatically by JMP and many other packages. In some packages, you will have to save the predicted values and do the plot yourself.

6.2.5 Detecting influential observations

An influential observation is defined as an observation whose deletion greatly changes the results of the regression. There are many techniques available for spotting individual influential points; however, many of these methods will fail to detect pairs of influential points in close proximity to each other.

Cook's D

One popular measure of an observation's influence is Cook's Distance. This statistic measures the extent to which the regression coefficients change when each individual observation is deleted. It is a summary measure of the impact of the observation's deletion and is a weighted sum⁶ of (β̂_0 - β̂_0(-i))², (β̂_1 - β̂_1(-i))², ..., (β̂_k - β̂_k(-i))², where β̂_k(-i) is the regression coefficient for the k-th variable after dropping the i-th observation.

If a point has no effect on the fit, then D_i will be zero. Large values of D_i indicate points that have a large influence on the fit. There is no easy rule for determining which values of D_i are extreme.⁷ A general rule of thumb is to look at the distribution of the D values and examine those observations corresponding to extreme values.

⁶ Refer to the original paper for the exact formula.

⁷ An often-quoted rule is to look at values of D_i that are greater than 1, but recent work has shown that this rule does not perform effectively.
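As a sketch of the idea (not JMP's implementation), Cook's D can be computed by brute force: refit with each observation deleted and measure how much the fitted values shift. The fitted-value form used below, D_i = Σ_j (Ŷ_j - Ŷ_j(-i))² / (p·MSE), is equivalent to the weighted sum of squared coefficient changes; the data are hypothetical:

```python
def fit_slr(xs, ys):
    """Least squares for y = b0 + b1*x."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - xbar) ** 2 for x in xs)
    b1 = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / sxx
    return ybar - b1 * xbar, b1

def cooks_d(xs, ys):
    """Cook's D by brute force: refit with each observation deleted and
    compare the fitted values over the whole dataset."""
    n, p = len(xs), 2                    # p = number of regression coefficients
    b0, b1 = fit_slr(xs, ys)
    fitted = [b0 + b1 * x for x in xs]
    mse = sum((y - f) ** 2 for y, f in zip(ys, fitted)) / (n - p)
    ds = []
    for i in range(n):
        a0, a1 = fit_slr(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        shift = sum((f - (a0 + a1 * x)) ** 2 for x, f in zip(xs, fitted))
        ds.append(shift / (p * mse))
    return ds

xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 20.0]                      # hypothetical data
ys = [1.1, 2.0, 2.9, 4.2, 5.1, 5.8, 7.2, 7.9, 9.1, 2.0]     # last point far off the trend
ds = cooks_d(xs, ys)
```

Here the tenth point, far from the trend of the others, produces by far the largest D_i.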


Hats

Oddly named statistics are the hats, or leverage values. These are computed under the idea that if a point has extreme influence, the regression should predict it exactly. Consequently, the hats are computed from what is known (for historical reasons) as the hat matrix, defined as X(X′X)⁻¹X′, and should not be computed by hand! If a hat-value is larger than about twice the average hat-value, this is usually taken to indicate an influential point. There are more formal rules for checking the hat values, but these are seldom worthwhile.
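For simple linear regression the diagonal of the hat matrix has a closed form, h_i = 1/n + (x_i - x̄)²/Sxx, which makes the twice-the-average rule easy to sketch. The data below are hypothetical; note that the average hat-value always equals p/n (number of coefficients over number of observations):

```python
# Leverages (hat values) for simple linear regression via the closed form
# h_i = 1/n + (x_i - xbar)^2 / Sxx, i.e. the diagonal of the hat matrix.
def hat_values(xs):
    n = len(xs)
    xbar = sum(xs) / n
    sxx = sum((x - xbar) ** 2 for x in xs)
    return [1.0 / n + (x - xbar) ** 2 / sxx for x in xs]

xs = [2, 3, 4, 5, 6, 7, 18.0]           # hypothetical: one x far from the rest
h = hat_values(xs)
p = 2                                   # coefficients: intercept and slope
avg = sum(h) / len(h)                   # always equals p/n
flagged = [i for i, hi in enumerate(h) if hi > 2 * avg]   # twice-the-average rule
```

The isolated x value (18) is the only point flagged: it has high leverage even before looking at its y value.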

Caution

It is clear that some observations must be the most extreme in every sample, and so it would be silly to automatically delete these extreme observations without a careful consideration of the underlying data! The purpose of Cook's D and other similar statistics is to warn the analyst that certain observations require additional scrutiny. Don't data snoop simply to polish the fit!

6.2.6 Leverage plots

These are likely the most useful of the diagnostic tools for spotting influential observations and are produced by many computer packages.

The leverage plots produced by JMP are examples of what are also called partial regression plots or adjusted variable plots. They are constructed for each individual variable. Suppose that we are regressing Y on four predictors X_1, ..., X_4. The leverage plot for X_1 is constructed as follows:

1. Find the residuals when Y is regressed against all the other variables except X_1, i.e. fit the model Y = X_2 X_3 X_4. Denote this residual as ε̂_Y|X(-1), where the -1 indicates that the first variable was dropped from the set of X's.

2. Find the residuals when X_1 is regressed against the other X variables, i.e. fit the model X_1 = X_2 X_3 X_4. Denote this residual as ε̂_X1|X(-1), where the -1 indicates that the first variable was dropped from the set of X's.

3. Plot the first residual against the second residual for each observation.⁸

Now if X_1 has no further information about Y (after accounting for the other X's), then the X_1 variable really isn't needed, and so all the first residuals should be centered around zero with random scatter.

⁸ JMP actually adds the mean of Y and X_1 to the residuals before plotting, but this does not change the shape of the plot.


But suppose that X_1 is important in predicting Y. Then the residuals from the regression of Y on the other X variables should be missing the contribution of X_1, and the residual plot should show an upward (or downward) trend. In fact, if you fit a regression line to the leverage plot, the slope will equal the slope in the full regression model. If the contribution of X_1 is not linear, then the plot will show a non-linear relationship.
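The claim that the leverage-plot slope equals the full-model slope (the Frisch-Waugh-Lovell result) can be checked numerically. The sketch below, with hypothetical data, fits a two-predictor model, computes the two sets of residuals exactly as in steps 1 and 2 above, and regresses one on the other:

```python
def ols(X, y):
    """Least squares via the normal equations (Gaussian elimination)."""
    k = len(X[0])
    A = [[sum(r[i] * r[j] for r in X) for j in range(k)]
         + [sum(r[i] * v for r, v in zip(X, y))] for i in range(k)]
    for c in range(k):                       # elimination with partial pivoting
        piv = max(range(c, k), key=lambda r: abs(A[r][c]))
        A[c], A[piv] = A[piv], A[c]
        for r in range(c + 1, k):
            f = A[r][c] / A[c][c]
            A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    b = [0.0] * k
    for c in reversed(range(k)):             # back substitution
        b[c] = (A[c][k] - sum(A[c][j] * b[j] for j in range(c + 1, k))) / A[c][c]
    return b

def resid(X, y):
    b = ols(X, y)
    return [v - sum(bb * xx for bb, xx in zip(b, r)) for r, v in zip(X, y)]

# Hypothetical data: y depends on both x1 and x2, and x1, x2 are correlated.
x1 = [1, 2, 3, 4, 5, 6, 7, 8.0]
x2 = [2, 1, 4, 3, 6, 5, 8, 7.0]
y  = [5.1, 6.8, 9.2, 10.9, 13.3, 14.8, 17.4, 18.7]

full = ols([[1.0, a, b] for a, b in zip(x1, x2)], y)   # [b0, b1, b2]
ey = resid([[1.0, b] for b in x2], y)     # step 1: residuals of y on the other X's
e1 = resid([[1.0, b] for b in x2], x1)    # step 2: residuals of x1 on the other X's
slope = sum(a * b for a, b in zip(e1, ey)) / sum(a * a for a in e1)
```

The residual-on-residual slope reproduces the full-model coefficient of x1 to machine precision, which is exactly what a line fitted to the leverage plot shows.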

Why is X_1 regressed against the other X variables? Recall that the interpretation of a slope in multiple regression is the MARGINAL contribution after adjusting for all other variables in the model. In other words, the slope reflects the NEW information in X_1 after adjusting for the other X's. How is the new information in X_1 found? Yes, by regressing X_1 against the other variables. For example, suppose that X_1 was an exact copy of another variable in the dataset. Then the second residuals would all be zero, indicating no new information (why?). So, if the leverage plot shows a very thin vertical band of points, this may be an indication that a certain variable does NOT have useful marginal information, i.e. is redundant given the other variables. This condition is known as multicollinearity and is discussed later in this chapter.

If a single observation has high leverage, the leverage plot will show the observation as an outlier. The diagram below demonstrates some of the important cases for leverage plots:

In JMP, and many other packages, the points on these plots are hot-linked to the data sheet. By clicking on these points, you can identify the observation in the data sheet.

The concept of leverage plots is sufficiently important and non-obvious that a numerical example will be examined. In JMP, open the Fitness.jmp dataset from the JMP sample dataset library. This dataset consists of measurements taken on subjects: their age, weight, oxygen consumption, time to run a mile, and three measurements of their pulse rate.

The first few lines of the data file are:

Fit a model to predict oxygen consumption as the Y variable, with age, weight, runtime, and the three pulse measurements as the X variables. The estimated slopes are:

and the leverage plot for Runtime is:


To reproduce this leverage plot, first fit the model for oxygen consumption dropping the run-time variable, and save the residuals to the data sheet.

Next, regress run-time against the other X variables and save the residuals to the data sheet:

This will give the data sheet with two new columns added:

Finally, plot the Residual of Oxygen on all but runtime vs. the Residual of runtime on the other X variables, and fit a line through that plot using the Analyze->Fit Y-by-X platform:

You will see that this plot looks the same as the leverage plot (though the Y and X axes are scaled slightly differently) and that the slope on this plot (-2.639) matches the estimated slope seen earlier.

Leverage plots should be used with some caution. They will show the nature of the functional relationship with the variable, but not its exact form. As well, because these plots are constructed after adjusting for the other variables, a variety of curvature models should be investigated. And if the functional form of the other variables is incorrect (e.g. an age² term is needed but has not been added to the model), then the true nature of the relationship may be missed.

You can get JMP to save all the leverage pairs under the Save Columns pop-down menu.

6.2.7 Collinearity

It is often the case that many of the X variables are related to each other. For example, if you wanted to predict blood pressure as a function of several variables including height and weight, there is a strong relationship between these two latter variables. When the relationship among the predictor variables is strong, they are said to be collinear. This can lead to problems in fitting the model and in interpreting the results of a model fit. In this example, it is conceivable that you could increase the weight of a subject while holding height constant; but suppose the two variables were total hours of sunshine and total hours of cloud in a year. If one increases, the other must decrease.

Because the regression coefficients are interpreted as the MARGINAL contribution of each predictor, collinearity among the predictors can mask the contribution of a variable. For example, if both height and weight are fit in a model, then the marginal contribution of height (given weight is already in the model) is small; similarly, the marginal contribution of weight (given height is in the model) is also small. However, it would not be valid to say that the marginal contribution of both height and weight (together) is small. In Section 6.4, methods for testing if several variables can be deleted simultaneously from the model are presented.

If the predictor variables were perfectly collinear, the whole model-fitting procedure breaks down. It turns out that a certain matrix used in the model fitting cannot be numerically inverted (similar to trying to divide by zero) and no estimates are possible. If the variables are nearly, but not perfectly, collinear, many different sets of estimates can be found that give very nearly the same predictions!

Not all the story is bad: multicollinearity does not imply that the whole regression model is useless. Even if predictor variables are highly related, good predictions are still possible, provided that you make predictions at values of X that are similar to those used in the model fitting.

The basic tool for diagnosing potential collinearity is the variance inflation factor (VIF) for each regression coefficient. In JMP this is obtained by right-clicking on the table of parameter estimates after the Analyze->Fit Model platform is run. For example, the VIFs for the fitness dataset are:


The VIF is interpreted as the increase in the variance (se²) of the estimate compared to what would be expected if the variable were completely independent of all the other predictor variables. The VIF equals 1 when a predictor is not collinear with the other predictors. VIFs that are very large, typically around 10 or higher, are usually taken as an indication of potential collinearity.

In the fitness dataset, there is evidence of collinearity between the average pulse rate during the run (Run Pulse) and the maximum pulse rate during the run (Max Pulse). This is not unexpected.
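With just two predictors, each VIF reduces to 1/(1 - r²), where r is the correlation between the two predictors. The sketch below uses hypothetical pulse readings (not the actual Fitness.jmp values) to show how a strong correlation inflates the VIF:

```python
def r_squared(y, x):
    """R^2 from a simple linear regression of y on x (with intercept)."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((a - xbar) ** 2 for a in x)
    sxy = sum((a - xbar) * (b - ybar) for a, b in zip(x, y))
    syy = sum((b - ybar) ** 2 for b in y)
    return (sxy * sxy) / (sxx * syy)

# Hypothetical pulse readings: the two variables track each other closely.
run_pulse = [170, 168, 174, 176, 166, 172, 180, 178.0]
max_pulse = [172, 170, 176, 179, 168, 174, 183, 180.0]

# With two predictors, VIF = 1 / (1 - R^2) from regressing one on the other.
vif = 1.0 / (1.0 - r_squared(max_pulse, run_pulse))
```

For these readings the VIF is far above the rule-of-thumb threshold of 10, whereas two uncorrelated predictors would give a VIF of exactly 1.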

If collinearity is detected, remedial measures include dropping some of the redundant predictor variables,⁹ or more sophisticated fitting methods such as ridge or robust regression (which are beyond the scope of this course).

⁹ An obvious question is how you tell which variables are redundant. Common methods are a principal component analysis of the X variables, or examining the correlations among the predictors. Seek help if you run into a problem of extreme multicollinearity.


6.3 Polynomial, product, and interaction terms

6.3.1 Introduction

The assumption of a marginal linear relationship between the response variable and an X variable is sometimes not true, and a quadratic or (rarely) a cubic or higher polynomial in X is often fit in order to approximate this non-linear relationship.

The basic way to deal with polynomial regression (i.e. quadratic and higher terms) is to create new predictor variables involving X², X³, .... Although not necessary with modern software, it is often a good idea to center variables that will be used in quadratic and higher relationships to avoid a high degree of collinearity among the terms. For example, replace X and X² by (X - X̄) and (X - X̄)², respectively. While the actual coefficients may change, the p-values for testing the linear and quadratic slopes are unaffected, and predictions are also unaffected; this is exactly analogous to what happens in regression when there is a unit change between imperial and metric units for some variable.
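The effect of centering is easy to demonstrate: when the X values are spaced symmetrically around their mean, X and (X - X̄)² are exactly uncorrelated, while X and X² are very nearly collinear. A small sketch, using the watering levels from the example in the next section:

```python
def corr(x, y):
    """Pearson correlation of two equal-length lists."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxy = sum((a - xbar) * (b - ybar) for a, b in zip(x, y))
    sxx = sum((a - xbar) ** 2 for a in x)
    syy = sum((b - ybar) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

x = [6, 8, 10, 12, 14.0]                       # the watering levels used below
raw = corr(x, [v * v for v in x])              # X vs X^2: nearly collinear
xbar = sum(x) / len(x)
cen = corr(x, [(v - xbar) ** 2 for v in x])    # X vs (X - Xbar)^2: uncorrelated
```

The raw correlation is above 0.99, while the centered one is exactly zero for this symmetric design; this is why centering removes most of the collinearity between the linear and quadratic terms.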

The model fit is

Y_i = β_0 + β_1 X_i1 + β_2 X²_i1 + ε_i

If the squared term is called X_2, the model is:

Y_i = β_0 + β_1 X_i1 + β_2 X_i2 + ε_i

which now looks exactly like an ordinary multiple regression model.

The rest of the model fitting, testing, etc. proceeds exactly as outlined in previous sections. However, there are two potential problems with polynomial models.

• Models should be hierarchical. This means that if you include a term involving X² in the model, you must also include the term involving X. If you include the quadratic but not the linear term, you are restricting the quadratic curve to a very special shape which is not usually wanted in practice. This will be outlined in class.

• The interpretation of the estimates must be done with care. Normally, the estimated slopes are the MARGINAL contribution of the variable to the response, i.e. after holding all other variables constant. However, if the regression equation includes both X and X² terms, it is impossible to hold X fixed while changing X² alone.

What degree of polynomial is suitable? This is usually determined by fitting successively higher polynomial terms until the added term is no longer statistically significant, and then using the previous model. While polynomial models allow for some degree of curvature in the response, it is very rare to fit terms involving cubic and higher powers. The reason for this is that such curves seldom have biological plausibility, and they have wide oscillations in their predicted values.

The researcher should also investigate if a transformation of the Y or X variable may linearize the relationship. For example, a plot of log(Y) vs. X may show a linear fit. Similarly, 1/X may be a more suitable predictor.¹⁰ It is possible to use least squares to fit truly non-linear models where no transformation or polynomial terms provide a good fit. This is beyond the scope of this course.

6.3.2 Example: Tomato growth as a function of water

An experiment was run to investigate the yield of tomato plants as a function of the amount of water provided over the season. A series of plots were randomized to different watering levels and, at the end of the season, the yield of the plants was determined.

The raw data follow:

Water Yield

6 49.2
6 48.1
6 48.0
6 49.6
6 47.0
8 51.5
8 51.7
8 50.4
8 51.2
8 48.4
10 51.1
10 51.5
10 50.3
10 48.9
10 48.7
12 48.6
12 48.0
12 46.4
12 46.2
12 47.0
14 43.2
14 42.6
14 42.1
14 43.9
14 40.5

¹⁰ For example, should the fuel economy of a car be measured as miles/gallon (distance/consumption) or L/100 km (consumption/distance)?


JMP Analysis:

The raw data are also available in a JMP data sheet called tomatowater.jmp, available from the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms.

The data are entered into JMP in the usual fashion: columns represent variables and rows represent observations. The scale of both variables should be continuous.

As usual, begin with a plot of the data:


The relationship is clearly non-linear and looks as if a quadratic may be suitable.

Before fitting the model, think about the assumptions required for the fit and assess if these are suitable for the data at hand.

There are two ways to fit simple polynomial models (i.e. those involving only polynomial terms in X) in JMP. If your regression model is a mixture of polynomial and other X variables, then the second method must be used.

In the first method, the Analyze->Fit Y-by-X platform can be used directly. For example, select the platform:

and choose Polynomial Fit:

which gives a plot of the fitted line:


and statistics about the fit:<br />


The fitted curve is:

Yield = 57.726857 − 0.762 Water − 0.2928571 (Water − 10)²

Notice that JMP has automatically centered the quadratic term by subtracting the mean X of 10 from each value prior to squaring. As you will see in a few minutes, this has no effect upon the test of significance of the quadratic term, nor on the actual predicted values.
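The claim about centering can be checked numerically. The water/yield values below are invented purely for illustration (they are not the tomatowater.jmp data); the point is only that the centered and uncentered quadratic fits give identical predictions and the same quadratic coefficient:

```python
import numpy as np

# Invented data for illustration only
water = np.array([6.0, 8.0, 10.0, 12.0, 14.0])
yield_ = np.array([40.0, 52.0, 57.0, 55.0, 46.0])

xbar = water.mean()

# Uncentered design matrix [1, W, W^2] and centered [1, W, (W - mean)^2]
X_raw = np.column_stack([np.ones_like(water), water, water ** 2])
X_ctr = np.column_stack([np.ones_like(water), water, (water - xbar) ** 2])

b_raw, *_ = np.linalg.lstsq(X_raw, yield_, rcond=None)
b_ctr, *_ = np.linalg.lstsq(X_ctr, yield_, rcond=None)

# The two design matrices span the same column space, so the fitted
# values coincide, and the quadratic coefficient is unchanged.
print(np.allclose(X_raw @ b_raw, X_ctr @ b_ctr))  # True
print(np.isclose(b_raw[2], b_ctr[2]))             # True
```

The intercept and linear coefficients do change under centering; only the fit itself and the highest-order coefficient are invariant.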

The ANOVA table can be used to examine whether the linear or quadratic terms provide any predictive power. The table of estimates shows that the quadratic term is clearly statistically significant. Confidence intervals for the regression coefficients can be found in the usual fashion by right-clicking in the table and requesting the appropriate columns (not shown).

A residual plot is obtained in the usual fashion:


which shows no evidence of a problem.

If a cubic polynomial is fit (in the same fashion as the quadratic polynomial), you will see that the cubic term is not statistically significant, indicating that a quadratic model is sufficient.

Confidence bands for the mean response at each X and for an individual response at each X can also be obtained in the usual way:


Again, the scientist must understand the difference between the confidence bounds for each type of prediction, as outlined in earlier chapters.

The second way to fit polynomial models (and the only way when polynomial terms are intermixed with other variables) is to use the Analyze->Fit Model platform. First, variables corresponding to X² and X³ (if needed) must be created using the formula editor of JMP: 11

11 It is preferable to use JMP's formula editor rather than creating these variables outside of the data sheet because these columns will be hot-linked to the original column. If, for example, a value of X is updated, then the values of the squared and cubic terms will also be updated automatically.


and a portion of the resulting data table is shown below:


Note that the X variable was centered before squaring and cubing.

Now use the Analyze->Fit Model platform to fit using the water and water-squared terms:


The plot of actual vs. predicted shows a good fit:

The ANOVA table (not shown) can be used to assess the overall fit of the model, as seen in earlier sections.

The estimates match those seen earlier, as do the p-values:

Confidence intervals for the regression coefficients can be found in the usual fashion by right-clicking in the table and requesting the appropriate columns (not shown).

The leverage plot for the X² term shows that this polynomial term is required and is not influenced by any unusual values: 12

Confidence intervals for the mean response or individual responses are saved to the data table in the usual fashion (but are not shown in these notes):

12 Because of the hierarchical restriction, the leverage plot for the linear term is not of interest.


Finally, getting a plot of the actual fitted line takes a bit of work if using the Analyze->Fit Model platform. First, save the predicted values to the data table:


Then use the Overlay Plot under the Graph menu to plot the individual points and the predicted values:


and then join up the predicted values (and remove the fitted points)


to finally give the plot that we saw earlier (whew!). Unfortunately, there does not appear to be any way to draw a smooth curve short of getting predictions for many points between the observed values of X and drawing the curve through these smaller increments.


6.3.3 Polynomial models with several variables

The methods of the previous section can be extended to cases where several variables have quadratic or higher powers. It is also possible to include cross-products of these variables as well.

There are no conceptual difficulties in having multiple polynomial variables. However, the analyst must ensure that models are hierarchical (i.e. if higher powers or cross-products are included, then lower-order terms must also be included). Consequently, leverage plots of the lower-order terms are likely not to be very useful when higher-order terms are included in the model.

In practice, polynomial models are commonly restricted to quadratic terms or lower. The goal is not so much to elucidate the underlying mechanism of the response, but rather to get a good approximation to the response surface. Indeed, there is a whole suite of techniques (commonly called response surface methodology) used to fit and explore polynomial models in this context. Often predictions of where the maximum or minimum response is found are important.


There are many excellent books available. JMP also has specialized tools in the Analyze->Fit Model platform to assist in the fitting of response surfaces. These are beyond the scope of these notes.

6.3.4 Cross-product and interaction terms

Recall that the interpretation of the regression coefficient associated with the i-th predictor variable is the marginal (i.e. keeping all other variables in the model fixed) increase in Y per unit change in X_i. This marginal increase is the same regardless of the values of the other X variables.

But sometimes the contribution of the i-th variable depends upon the value of another, the j-th, predictor. For example, suppose that blood pressure tends to increase by .5 units for every kg increase in body mass for people under 1.5 m in height, but tends to increase by .6 units for every kg increase in body mass for people over 1.5 m in height. We would say that body mass interacts with the height variable. This concept is very similar to the analogous interaction of factors in ANOVA models. 13

Consider a model where blood pressure depends upon age and height via the model:

BP = AGE HEIGHT

This corresponds to the formal statistical model of:

Y_i = β_0 + β_1 AGE_i + β_2 HEIGHT_i + ε_i

You can see that if age increases by 1 unit, then the value of Y increases by β_1 units regardless of the value of height. Similarly, every time height increases by 1 unit, Y increases by β_2 regardless of the value of age.

Now consider the model written as:

BP = AGE HEIGHT AGE*HEIGHT

which corresponds to the formal statistical model of:

Y_i = β_0 + β_1 AGE_i + β_2 HEIGHT_i + β_3 AGE_i × HEIGHT_i + ε_i

The cross-product of age and height enters into the model as a new predictor variable. 14 Now look at what happens when age is increased by 1 unit. The value of Y increases not simply by β_1 but by β_1 + β_3 HEIGHT_i. When height is small, the increase in Y per unit change in age is smaller than when height is large. Similarly, an increase of 1 unit in the value of height will lead to an increase of β_2 + β_3 AGE_i. The effect of height will be less for younger subjects than for older subjects.
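A small numerical sketch may make this concrete; the coefficient values here are invented purely for illustration, not estimated from any data:

```python
# Hypothetical coefficients (assumed for illustration only)
b0, b1, b2, b3 = 50.0, 0.4, 10.0, 0.2

def bp(age, height):
    # Model with a cross-product term: Y = b0 + b1*AGE + b2*HEIGHT + b3*AGE*HEIGHT
    return b0 + b1 * age + b2 * height + b3 * age * height

# The per-unit-age increase in Y is b1 + b3 * height, so it depends on height:
slope_short = bp(41, 1.5) - bp(40, 1.5)   # = b1 + b3 * 1.5
slope_tall = bp(41, 1.9) - bp(40, 1.9)    # = b1 + b3 * 1.9
print(round(slope_short, 2), round(slope_tall, 2))  # 0.7 0.78
```

Without the b3 term, the two differences would be identical, which is exactly the "same marginal increase" property of the additive model.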

The use of product terms in multiple regression can easily be extended to products involving more than two variables and, more importantly as discussed in Section 6.5.3, to products with indicator variables.

13 Indeed, this is not surprising, as ANOVA is actually a special case of regression.

14 The actual X matrix would then have four columns. Column 1 would consist of all 1's; column 2 would consist of the values of age; column 3 would consist of the values of height; and column 4 would contain the actual products of age and height for each individual.


There is no real problem in fitting these models other than that the model must conform to the hierarchical principle. This principle states that if terms like X_i X_j are in the model, so must be all lower-order terms – in this case, both X_i and X_j as separate terms must remain in the model. This is the same principle as you saw for polynomial models.

6.4 The general linear test

6.4.1 Introduction

In previous sections, you saw how to test if a specific regression coefficient in the population was zero using the t-test provided by most computer packages. It is tempting, then, to try to test whether multiple X variables can be dropped simultaneously when their individual p-values are all not statistically significant.

Unfortunately this strategy often fails. The basic reason for its failure is that very often regression coefficients are highly interrelated because their corresponding X variables are not orthogonal to each other. For example, suppose that both height and weight were X variables in a model that was trying to predict blood pressure. The tests of the hypotheses for the slopes for weight and height are MARGINAL tests, i.e. is the slope associated with weight in the population zero assuming that all other variables (including height) are retained in the model. Because of the high interdependency between height and weight, the p-value for the test of marginal zero slope for weight may not be statistically significant. Similarly, the p-value for the test of marginal zero slope for height (assuming that weight is in the model) may also be statistically non-significant. However, it does not follow that both height and weight can be simultaneously removed from the model.

In order to test if a set of predictor variables can be simultaneously removed from the model, a General Linear Test is performed. The mechanics of the test are:

1. Fit the full model, i.e. with all variables present. Find SSE_full from the full model.

2. Fit the reduced model, i.e. dropping the variables of interest. Find SSE_reduced from the reduced model.

3. If the reduced model is still an adequate fit, then SSE_reduced should be very close to SSE_full – after all, if the dropped variables were not important, then the reduction in prediction error should be small. Construct a test statistic as:

F_general = [(SSE_reduced − SSE_full) / (df_reduced − df_full)] / [SSE_full / df_full]

where df_reduced and df_full are the error degrees of freedom of the reduced and full models. This is compared to an F-distribution with the appropriate degrees of freedom. Large values of the F-statistic indicate evidence that not all variables can be simultaneously dropped.

Of course, this procedure has been automated in most statistical packages, as will be illustrated by an example.


6.4.2 Example: Predicting body fat from measurements

The percentage of body fat in humans is a good indicator of future problems with cardiovascular and other diseases.

The following was taken from Wikipedia: 15

Body fat percentage is the fraction of the total body mass that is adipose tissue. This index is often used as a means to monitor progress during a diet or as a measure of physical fitness for certain sports, such as body building. It is more accurate as a measure of health than body mass index (BMI) since it directly measures body composition and there are separate body fat guidelines for men and women. However, its popularity is less than BMI because most of the techniques used to measure body fat percentage require equipment and skills that are not readily available.

The most accurate method has been to weigh a person underwater in order to obtain the average density (mass per unit volume). Since fat tissue has a lower density than muscles and bones, it is possible to estimate the fat content. This estimate is distorted by the fact that muscles and bones have different densities: for a person with a more-than-average amount of bone tissue, the estimate will be too low. However, this method gives highly reproducible results for individual persons (±1%). The body fat percentage is commonly calculated from one of two formulas:

Brozek formula: BF = (4.57/p − 4.142) × 100
Siri formula: BF = (4.95/p − 4.50) × 100

In these formulas, p is the body density in kg/L obtained by weighing the person out of water and then dividing by the volume obtained by dunking the person underwater.
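The two formulas are easy to transcribe directly; the density of 1.05 kg/L used below is just an illustrative value, not taken from the study:

```python
# Brozek and Siri formulas from the text; p is body density in kg/L.
def brozek(p):
    return (4.57 / p - 4.142) * 100

def siri(p):
    return (4.95 / p - 4.50) * 100

# An illustrative density of 1.05 kg/L gives similar answers from both:
print(round(brozek(1.05), 1), round(siri(1.05), 1))  # 21.0 21.4
```

As expected, the two formulas agree closely over the densities typical of the human body.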

BTW, the American Council on Exercise has associated categories with ranges of body fat. Women generally have less muscle mass than men and therefore have a higher body fat percentage range for each category.

Description      Women     Men
Essential fat    10-13%    2-5%
Athletes         14-20%    6-13%
Fitness          21-24%    14-17%
Acceptable       25-31%    18-24%
Obesity          32%+      25%+

Many studies have been done to see if predictions of body fat can be made based on simple measurements such as the circumferences of various body parts.

A study of middle-aged men measured the percentage of body fat using the difficult methods explained above and also took measurements of the circumference of the thigh, triceps, and mid-arm.

15 2006-05-15, at http://en.wikipedia.org/wiki/Body_fat_percentage


Here are the raw data:

Triceps  Thigh  Mid-arm  PerBodyFat
19       43     29       11.9
24       49     28       22.8
30       51     37       18.7
29       54     31       20.1
19       42     30       12.9
25       53     23       21.7
31       58     27       27.1
27       52     30       25.4
22       49     23       21.3
25       53     24       19.3
31       56     30       25.4
30       56     28       27.2
18       46     23       11.7
19       44     28       17.8
14       45     21       12.8
29       54     30       23.9
27       55     25       22.6
30       58     24       25.4
22       48     27       14.8
25       51     27       21.1

JMP Analysis

The raw data is also available in a JMP data sheet called bodyfat.jmp available from the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms.

Fit the multiple-regression model using the Analyze->Fit Model platform:


The resulting estimates all have tests for the marginal population slope that are statistically non-significant:

But at the same time, the whole-model test:


shows that there is predictive ability in these X variables because the overall p-value is statistically significant. The problem is that the X variables are all highly related. Indeed, a scatter-plot matrix of the X variables shows a high degree of relationship among them:

A general linear test for dropping, say, both the triceps and thigh X variables is constructed using the Custom Test pop-down menu item:


and then specifying which X variables are to be tested together. You need a separate column in the Custom Test for each variable to be tested – if you specify multiple variables in a single column, you will get a test for a crazy hypothesis:

The final result:


has a p-value of .000003, which is very strong evidence that both variables cannot be dropped simultaneously.

If you look at the ANOVA table from the full model:

the SSE_full = 100.1 with 16 df.

The reduced model is fit using the Analyze->Fit Model platform with just the Mid-arm variable, and the reduced model ANOVA table is:


with the SSE_reduced = 487.4 with 18 df.

The general linear test is found as:

F_general = [(SSE_reduced − SSE_full) / (df_reduced − df_full)] / [SSE_full / df_full]
          = [(487.4 − 100.1) / (18 − 16)] / [100.1 / 16]
          = 193.65 / 6.26
          = 30.95

which agrees (up to rounding) with the value reported above.
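The arithmetic of the general linear test can be reproduced directly from the SSE and df values printed above:

```python
# General linear test from the SSEs of the full and reduced body-fat models.
sse_full, df_full = 100.1, 16
sse_reduced, df_reduced = 487.4, 18

F_general = ((sse_reduced - sse_full) / (df_reduced - df_full)) / (sse_full / df_full)
print(round(F_general, 2))  # 30.95 -- matches the hand calculation up to rounding
```

The p-value would then come from an F-distribution with (df_reduced − df_full, df_full) = (2, 16) degrees of freedom.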

6.4.3 Summary

The general linear test is often used to test if a "chunk" of X variables can be removed from the model. Often this chunk will be a set of variables that have something in common.

For example, often all quadratic terms are tested simultaneously, or a variable and all its higher-order interaction terms (e.g. X, X², X³, etc.).

6.5 Indicator variables

6.5.1 Introduction

Indicator variables (also known as dummy variables) are a device to incorporate nominally scaled variables into regression contexts. For example, suppose you looked at the relationship between blood pressure and weight. In general, the blood pressure of an individual increases with weight. But in general, males are larger than females, so a body weight of 90 kg may have a different effect for males than for females. So how can sex (a nominally scaled variable) be incorporated into the regression equation?

It turns out that using indicator variables makes ordinary regression a general tool for many more applications than regression alone. Indeed, it is possible to show that two-sample t-tests, single-factor completely randomized design ANOVAs, and even more complex experimental designs can be analyzed using regression methods. This is why many computer packages present their analysis tools for comparing means and fitting regressions as variants of general linear models.

6.5.2 Defining indicator variables

Unfortunately, there is no standard way to define an indicator variable in a regression setting; fortunately, it turns out that it doesn't matter which formulation is used – it is always possible to get an appropriate answer.

In general, if a nominally scaled variable has k categories, you will require k − 1 indicator variables. In many cases, computer packages will generate these automatically if the package knows that the variable is to be treated as a nominally scaled variable. 16

For example, as sex only has two levels, only one indicator variable is required. It could be coded as:

X_1 = 1 if male, 0 if female

or as:

X_1 = 1 if male, −1 if female

Many other codings are possible.

For a nominally scaled variable with three levels, two indicator variables will be needed. For example, suppose that the size of a person is classified as small, medium, or large. Then the indicator variables could be defined as:

X_1 = 1 if small, 0 if medium or large
X_2 = 1 if medium, 0 if small or large

Now the pair of variables defines the three classes as: (X_1, X_2) = (1, 0) = small, (X_1, X_2) = (0, 1) = medium, and (X_1, X_2) = (0, 0) = large.
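This coding is simple enough to sketch directly; the function below is a hypothetical helper that transcribes the two indicators defined above, with "large" as the reference level:

```python
# Reference coding for a three-level nominal "size" variable,
# matching the indicators defined in the text ("large" is the reference).
def size_indicators(size):
    x1 = 1 if size == "small" else 0
    x2 = 1 if size == "medium" else 0
    return (x1, x2)

print([size_indicators(s) for s in ("small", "medium", "large")])
# [(1, 0), (0, 1), (0, 0)]
```

Note that two columns suffice for three levels, which is the k − 1 rule in action.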

Many packages use what are known as reference coding rules for indicator variables: the i-th indicator variable takes the value 1 to indicate the i-th value of the variable for the first k − 1 values of the variable, and all the indicator variables take the value 0 to refer to the last value of the variable. 17

16 That is why it is good practice to code nominally scaled variables using alphanumeric codes (e.g. m and f for sex), rather than numeric codes such as 3 or 7.

17 Always check the package documentation carefully to see if the package is using this rule. If it uses a different coding scheme, you will have to interpret the estimates carefully.

So, how do indicator variables help incorporate the effects of a nominally scaled variable? Consider the variable sex (taking two levels, labeled f and m in that order). A single indicator variable, say Sex, is defined that takes the value 1 for females and 0 for males. Now consider the following estimated regression equation:

BloodPressure = 110 − 10 ∗ Sex + .10 ∗ Weight

The estimated blood pressure for a female who weighs 100 kg would be:

110 = 110 − 10(1) + .10(100)

while the estimated blood pressure for a male who weighs 100 kg would be:

120 = 110 − 10(0) + .10(100)

Hence, the coefficient associated with sex (with a value of −10) would be interpreted as the difference in blood pressure between females and males for all weight classes, i.e. the relationship consists of two parallel lines (with a slope against weight of 0.10) separated by 10 units.
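The parallel-lines interpretation is a direct transcription of the estimated equation above (Sex coded 1 for female, 0 for male):

```python
# Estimated equation from the text: two parallel lines 10 units apart.
def blood_pressure(sex, weight):
    # sex: 1 = female, 0 = male
    return 110 - 10 * sex + 0.10 * weight

female = blood_pressure(1, 100)
male = blood_pressure(0, 100)
print(round(female), round(male))  # 110 120

# The male-female gap is the same at every weight (parallel lines):
gap_light = blood_pressure(0, 60) - blood_pressure(1, 60)
gap_heavy = blood_pressure(0, 120) - blood_pressure(1, 120)
print(round(gap_light), round(gap_heavy))  # 10 10
```

The constant gap is exactly the −10 coefficient on the Sex indicator.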

On the other hand, consider the regression equation:

BloodPressure = 110 − 10 ∗ Sex + .10 ∗ Weight − 0.05 ∗ Sex∗Weight

Notice that two variables (the Sex indicator variable and the weight variable) are multiplied together. Now, the estimated blood pressure for a female who weighs 100 kg would be:

105 = 110 − 10(1) + .10(100) − 0.05(1)(100)

while the estimated blood pressure for a male who weighs 100 kg would be:

120 = 110 − 10(0) + .10(100) − 0.05(0)(100)

Hence, the coefficient associated with the product of sex and weight would be interpreted as the differential response to weight between males and females, i.e. the relationship consists of two non-parallel lines. The slope against weight for males is .10, while for females it is .10 − .05 = .05.
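The two different slopes can be recovered by differencing the equation above at adjacent weights:

```python
# Estimated equation with the Sex*Weight cross-product from the text.
def blood_pressure(sex, weight):
    # sex: 1 = female, 0 = male
    return 110 - 10 * sex + 0.10 * weight - 0.05 * sex * weight

# Per-kg slope for each sex, obtained by differencing adjacent weights:
male_slope = blood_pressure(0, 101) - blood_pressure(0, 100)
female_slope = blood_pressure(1, 101) - blood_pressure(1, 100)
print(round(male_slope, 2), round(female_slope, 2))  # 0.1 0.05
```

The female slope is the weight coefficient plus the cross-product coefficient, .10 − .05 = .05, which is the "non-parallel lines" behaviour described above.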

This idea can be extended to nominally scaled variables with more than two levels in a straightforward way. Fortunately, most packages will do the coding automatically for you, and all that is necessary is to specify the model appropriately and understand what the various model formulations imply.

6.5.3 The ANCOVA model

The use of indicator variables has, for historical reasons, been referred to as the Analysis of Covariance (ANCOVA) approach. It actually has two separate, but functionally identical, uses.

The first use is to incorporate nominally scaled variables into regression situations. The modeling starts off with individual regression lines, one for each value of the nominal variable (e.g. a separate line for males and females). A statistical test is used to see if the lines are parallel. If there is evidence that the individual regression lines are not parallel, then a separate regression line must be used for each group for prediction purposes. If there is no evidence of non-parallelism, then the next task is to see if the lines are coincident, i.e. have both the same intercept and the same slope. If there is evidence that the lines are not coincident, then a series of parallel lines is used to make predictions. All of the data are used to estimate the common slope. If there is no evidence that the lines are not coincident, then all of the data can simply be pooled together and a single regression line fit to all of the data.

The three possibilities are shown below for the case of two groups – the extension to many groups is obvious:


Second, ANCOVA has been used to test for differences in means among the groups when some of the variation in the response variable can be “explained” by a covariate. For example, the effectiveness of two different diets can be compared by randomizing people to the two diets and measuring the weight change during the experiment. However, some of the variation in weight change may be related to initial weight. Perhaps by “standardizing” everyone to some common weight, we can more easily detect differences among the groups. This will be discussed in a later chapter.

A very nice book on the Analysis of Covariance is Analysis of Messy Data, Volume III: Analysis of Covariance by G. A. Milliken and D. E. Johnson. Details are available at http://www.statsnetbase.com/ejournals/books/book_summary/summary.asp?id=869.


6.5.4 Assumptions

As before, it is important to verify the assumptions underlying the analysis before it is started. As ANCOVA is a combination of ANOVA and regression, the assumptions are similar. Both uses of ANCOVA have similar assumptions:

• The response variable Y is continuous (interval or ratio scaled).

• The data are collected under a completely randomized design. 18 This implies that the treatment must be randomized completely over the entire set of experimental units in an experimental study, or that units must be selected at random from the relevant populations in an observational study.

• There must be no outliers. Plot Y vs. X for each group separately to see if there are any points that don’t appear to follow the straight line.

• The relationship between Y and X must be linear for each group. 19 Check this assumption by looking at the individual plots of Y vs. X for each group.

• The variance must be equal for both groups around their respective regression lines. Check that the spread of the points is equal across the range of X and that the spread is comparable between the two groups. This can be formally checked by comparing the MSE from a separate regression line for each group, as the MSE estimates the variance of the data around the regression line.

• The residuals must be normally distributed around the regression line for each group. This assumption can be checked by examining the residual plots from the fitted model for evidence of non-normality. For large samples, this is not too crucial; for small sample sizes, you will likely have inadequate power to detect anything but gross departures.

6.5.5 Comparing individual regression lines

You saw in earlier chapters that a statistical model is a powerful shorthand to describe what analysis is fit to a set of data. The model must describe the treatment structure, the experimental unit structure, and the randomization structure. Let Y be the response variable, X be the continuous X-variable, and Group be the nominally scaled group variable with TWO levels, i.e. only one indicator variable will be generated, called I.

In this and the previous chapter, we use a shorthand model notation. For example, the model notation

Y = X

would refer to a regression of Y on X with the underlying statistical model:

Y = β₀ + β₁X + ε

18 It is possible to relax this assumption - this is beyond the scope of this course.

19 It is possible to relax this assumption as well, but this is again beyond the scope of this course.


where the subscript corresponding to individual subjects has been dropped for clarity.

We now use an extension of the model notation. The model notation:

Y = X Group Group*X

refers to the model:

Y = β₀ + β₁X + β₂I + β₃I·X + ε

Lastly, the model notation:

Y = X Group

refers to the model:

Y = β₀ + β₁X + β₂I + ε
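The correspondence between the shorthand notation and the underlying models can be sketched in code. The following is a minimal pure-Python illustration (no statistics library assumed); the x values, group labels, and responses are invented purely for this sketch, and `ols` solves the normal equations directly:

```python
# Sketch of the three models above, fitted by ordinary least squares.
# All data below are invented for illustration only.

def ols(X, y):
    """Solve the normal equations (X'X) b = X'y by Gauss-Jordan elimination."""
    k = len(X[0])
    XtX = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
    Xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(k)]
    A = [row + [v] for row, v in zip(XtX, Xty)]          # augmented matrix
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(A[r][c]))  # partial pivoting
        A[c], A[p] = A[p], A[c]
        for r in range(k):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    return [A[i][k] / A[i][i] for i in range(k)]

# Hypothetical data: X value, group label, response Y
x = [1, 2, 3, 4, 5, 1, 2, 3, 4, 5]
g = ['m'] * 5 + ['f'] * 5
y = [2.1, 3.9, 6.2, 7.8, 10.1, 4.9, 6.1, 6.8, 8.2, 9.1]
I = [1 if gi == 'm' else 0 for gi in g]                  # indicator variable

# Y = X                  -> design columns [1, X]
# Y = X Group            -> columns [1, X, I]        (parallel lines)
# Y = X Group Group*X    -> columns [1, X, I, I*X]   (separate lines)
b_single   = ols([[1, xi] for xi in x], y)
b_parallel = ols([[1, xi, Ii] for xi, Ii in zip(x, I)], y)
b_separate = ols([[1, xi, Ii, Ii * xi] for xi, Ii in zip(x, I)], y)
```

In the separate-lines model, the 'm' group's fitted line has intercept β₀+β₂ and slope β₁+β₃, which reproduces exactly what a simple regression on the 'm' rows alone would give; that is the sense in which this model is "almost equivalent" to fitting each group separately.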

These can be diagrammed in graphs. If the lines for each group are not parallel:


the appropriate model is

Y1 = X Group Group*X

The terms can be in any order. This is read as: variation in Y can be explained by a common intercept (never specified), group effects (different intercepts), a common slope on X, and an “interaction” between Group and X, which is interpreted as different slopes for each group. This model is almost equivalent to fitting a separate regression line for each group. The only advantage of using this joint model compared to fitting separate slopes is that all of the groups contribute to a better estimate of residual error. If the number of data points per group is small, this can lead to improvements in precision compared to fitting each group individually.

If the lines are parallel across groups, but not coincident:

the appropriate model is

Y2 = Group X

The terms can be in any order. The only difference between this and the previous model is that this simpler model lacks the Group*X “interaction” term. It would not be surprising, then, that a statistical test to see if


this simpler model is tenable would correspond to examining the p-value of the test on the Group*X term from the complex model. This is exactly analogous to testing for interaction effects between factors in a two-factor ANOVA.

Lastly, if the lines are coincident:

the appropriate model is

Y3 = X

Now the difference between this model and the previous model is the Group term that has been dropped. Again, it would not be surprising that this corresponds to the test of the Group effect in the formal statistical test. The test for coincident lines should only be done if there is insufficient evidence against parallelism. While it is possible to test for a non-zero slope, this is rarely done.
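The test for parallelism amounts to comparing the nested parallel-lines and separate-lines models through their residual sums of squares. A hedged sketch, with invented data: for a single X variable both models have closed forms, since the common slope of the parallel-lines model pools the within-group sums of cross-products, and the per-group intercept for a fixed slope b is ȳ − b·x̄.

```python
# Extra-sum-of-squares comparison of the parallel-lines model (Y2) and
# the separate-lines model (Y1). Data are invented for illustration.
from math import fsum

def sums(xs, ys):
    n = len(xs); mx = fsum(xs) / n; my = fsum(ys) / n
    sxx = fsum((x - mx) ** 2 for x in xs)
    sxy = fsum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return n, mx, my, sxx, sxy

groups = {
    "a": ([1, 2, 3, 4, 5], [5.0, 4.1, 3.2, 2.0, 1.2]),
    "b": ([1, 2, 3, 4, 5], [6.1, 5.6, 4.9, 4.6, 4.0]),
}
stats = {g: sums(xs, ys) for g, (xs, ys) in groups.items()}

# Common slope pools within-group Sxy and Sxx
b_common = fsum(s[4] for s in stats.values()) / fsum(s[3] for s in stats.values())

def rss(slope_for):
    """Residual SS when each group gets intercept mean(y) - b*mean(x)."""
    total = 0.0
    for g, (xs, ys) in groups.items():
        _, mx, my, _, _ = stats[g]
        b = slope_for(g)
        a = my - b * mx
        total += fsum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    return total

rss_parallel = rss(lambda g: b_common)                    # Y = X Group
rss_separate = rss(lambda g: stats[g][4] / stats[g][3])   # Y = X Group Group*X

n_total = sum(s[0] for s in stats.values())
df_full = n_total - 2 * len(groups)                       # 2 parameters per group
F = ((rss_parallel - rss_separate) / (len(groups) - 1)) / (rss_separate / df_full)
```

With only one extra parameter per additional group, this F statistic matches the test on the Group*X term from the complex model; dropping the Group term from the parallel-lines model would be tested the same way.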


6.5.6 Example: Degradation of dioxin

An unfortunate byproduct of pulp-and-paper production used to be dioxins - a very hazardous material. This material was discharged into waterways with the pulp-and-paper effluent, where it bioaccumulated in living organisms such as crabs. Newer processes have eliminated this byproduct, but the dioxins in the organisms take a long time to degrade.

Government environmental protection agencies take samples of crabs from affected areas each year and measure the amount of dioxins in the tissue. The following example is based on a real study.

Each year, four crabs are captured from two monitoring stations which are situated quite a distance apart on the same inlet where the pulp mill was located. The liver is excised and the livers from all four crabs are composited together into a single sample. 20 The dioxin level in this composite sample is measured. As there are many different forms of dioxins with different toxicities, a summary measure, called the Total Equivalent Dose (TEQ), is computed from the sample.

As seen in the chapter on regression, the appropriate response variable is log(TEQ).

Is the rate of decline the same for both sites? Did the sites have the same initial concentration?

Here are the raw data, which are also available on the web in the SampleProgramLibrary at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms.

20 Compositing is a common analytical tool. There is little loss of useful information induced by the compositing process - the only loss is the among-individual-sample variability, which can be used to determine the optimal allocation between the number of samples within years and the number of years to monitor.


Site  Year     TEQ   log(TEQ)
 a    1990  179.05     5.19
 a    1991   82.39     4.41
 a    1992  130.18     4.87
 a    1993   97.06     4.58
 a    1994   49.34     3.90
 a    1995   57.05     4.04
 a    1996   57.41     4.05
 a    1997   29.94     3.40
 a    1998   48.48     3.88
 a    1999   49.67     3.91
 a    2000   34.25     3.53
 a    2001   59.28     4.08
 a    2002   34.92     3.55
 a    2003   28.16     3.34
 b    1990   93.07     4.53
 b    1991  105.23     4.66
 b    1992  188.13     5.24
 b    1993  133.81     4.90
 b    1994   69.17     4.24
 b    1995  150.52     5.01
 b    1996   95.47     4.56
 b    1997  146.80     4.99
 b    1998   85.83     4.45
 b    1999   67.72     4.22
 b    2000   42.44     3.75
 b    2001   53.88     3.99
 b    2002   81.11     4.40
 b    2003   70.88     4.26
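The ANCOVA fit discussed below can be cross-checked from the tabulated values with closed-form least squares. This mirrors, but is not, the JMP analysis: the common slope of the parallel-lines model pools the within-site sums of cross-products, and because both sites were sampled in the same years, the vertical separation between the parallel lines is simply the difference in mean log(TEQ).

```python
# Closed-form parallel-lines (ANCOVA) fit to the dioxin data above.
from math import fsum

years = list(range(1990, 2004))
log_teq = {
    "a": [5.19, 4.41, 4.87, 4.58, 3.90, 4.04, 4.05,
          3.40, 3.88, 3.91, 3.53, 4.08, 3.55, 3.34],
    "b": [4.53, 4.66, 5.24, 4.90, 4.24, 5.01, 4.56,
          4.99, 4.45, 4.22, 3.75, 3.99, 4.40, 4.26],
}

xbar = fsum(years) / len(years)                        # 1996.5
sxx = fsum((x - xbar) ** 2 for x in years)             # same for both sites
sxy = {s: fsum((x - xbar) * y for x, y in zip(years, ys))
       for s, ys in log_teq.items()}

# Common slope pools the two sites; about -0.084 from these rounded
# values, in line with the -0.083 reported in the output below
slope = (sxy["a"] + sxy["b"]) / (2 * sxx)

# Difference between the parallel lines: mean log(TEQ) at b minus a,
# about 0.46, matching the LSMeans contrast reported below
diff = (fsum(log_teq["b"]) - fsum(log_teq["a"])) / len(years)
```

The per-site slopes sxy["a"]/sxx ≈ -0.108 and sxy["b"]/sxx ≈ -0.059 likewise agree with the separate-line estimates quoted in the text.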

The data are entered into JMP in the usual fashion. Make sure that Site is a nominal-scale variable and that Year is a continuous variable.

In cases with multiple groups, it is often helpful to use a different plotting symbol for each group. This is easily accomplished in JMP by selecting the rows (say, for site a) and using Rows->Markers to set the plotting symbol for the selected rows:


The final data sheet has two different plotting symbols for the two sites:


Before fitting the various models, begin with an exploratory examination of the data, looking for outliers and checking the assumptions.

Each year’s data are independent of other years’ data, as a different set of crabs was selected. Similarly, the data from one site are independent of the data from the other site. This is an observational study, so the question arises of how exactly the crabs were selected. In this study, crab pots were placed on the floor of the sea to capture the available crabs in the area.

Whenever multiple sets of data are collected over time, there is always the worry about common year effects (also known as process error). For example, if the response variable were the body mass of small fish, then poor growing conditions in a single year could depress the growth of fish in all locations. This would violate the assumption of independence, as the residual at one site in a year would be related to the residual at the other site in the same year. You would tend to see the residuals “paired”, with negative residuals from the fitted


line at one site matched (by year) with negative residuals at the other site. In this case, this is unlikely to have occurred. Degradation of dioxin is relatively independent of external environmental factors, and the variation that we see about the two regression lines is related solely to sampling error based on the particular set of crabs that were sampled. It seems unlikely that the residuals are related. 21

Use the Analyze->Fit Y-by-X platform and specify log(TEQ) as the Y variable and Year as the X variable:

Then specify a grouping variable by clicking on the pop-down menu near the Bivariate Fit title line:

21 If you actually try to fit a process error term to this model, you find that the estimated process error is zero.


and selecting Site as the grouping variable:


Now select the Fit Line option from the same pop-down menu:


to get separate lines fit for each group:


The relationships for each site appear to be linear. The actual estimates are also presented:

The scatterplot doesn’t show any obvious outliers. The estimated slope for site a is −0.107 (se 0.02), while the estimated slope for site b is −0.06 (se 0.02). The 95% confidence intervals (not shown on the output, but available by right-clicking/ctrl-clicking on the parameter estimates table) overlap considerably, so


the slopes could be the same for the two groups.

The MSE from site a is 0.10 and the MSE from site b is 0.12. These correspond to standard deviations of √0.10 = 0.32 and √0.12 = 0.35, which are very similar, so the assumption of equal standard deviations seems reasonable.

The residual plots (not shown) also look reasonable.

The assumptions appear to be satisfied, so let us now fit the various models.

First, fit the model allowing for separate lines for each group. The Analyze->Fit Model platform is used:

The terms can be in any order and correspond to the model described earlier. This gives the following output:


The regression plot is just the same as the plot of the two individual lines seen earlier. What is of interest is the Effect test for the Site*Year interaction. Here the p-value is not very small, so there is no evidence that the lines are not parallel.

We need to refit the model, dropping the interaction term:


which gives the following regression plot:


This shows the fitted parallel lines. The effect tests:

now have a small p-value for the Site effect, indicating that the lines are not coincident, i.e. they are parallel with different intercepts. This would mean that the rate of decay of the dioxin appears to be equal at both sites, but the initial concentrations appear to be different.

The estimated (common) slope is found in the Parameter Estimates portion of the output:


and has a value of −0.083 (se 0.016). Because the analysis was done on the log scale, this implies that the dioxin levels changed by a factor of exp(−0.083) = 0.92 from year to year, i.e. about an 8% decline each year. The 95% confidence interval for the slope on the log scale is (−0.12 → −0.05), which corresponds to a factor between exp(−0.12) = 0.89 and exp(−0.05) = 0.95 per year, i.e. between roughly an 11% and a 5% decline per year. 22
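The back-transformation used above is simply exponentiation of the slope and of its confidence limits; the numbers are those reported in the output:

```python
# A slope b on the (natural) log scale corresponds to a year-to-year
# multiplicative factor exp(b) on the original TEQ scale.
from math import exp

slope, lo, hi = -0.083, -0.12, -0.05   # estimate and 95% CI from the output
factor = exp(slope)                    # about 0.920: an 8% decline per year
ci_factor = (exp(lo), exp(hi))         # about (0.887, 0.951)
```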

While it is possible to estimate the difference between the parallel lines from the Parameter Estimates table, it is easier to look at the section of the output corresponding to the Site effects. Here the estimated LSMeans correspond to the log(TEQ) at the average value of Year - not really of interest. As in previous chapters, the difference in means is often of more interest than the raw means themselves. This is found by using the pop-down menu and selecting an LSMeans Contrast or Multiple Comparison procedure to give:

22 The confidence intervals are found by right-clicking/ctrl-clicking in the Parameter Estimates table.


The estimated difference between the lines (on the log scale) is 0.46 (se 0.13). Because the analysis was done on the log scale, this corresponds to a ratio of exp(0.46) = 1.58 in dioxin levels between the two sites, i.e. site b has 1.58 times the dioxin level of site a. Because the lines are parallel and declining, the dioxin levels are falling at both sites, but the 1.58 ratio remains constant.


Finally, the Actual by Predicted plot (not shown here), the leverage plots (not shown here), and the residual plot don’t show any evidence of a problem with the fit.

6.5.7 Example: More refined analysis of stream-slope example

In the chapter on paired comparisons, the example of the effect of stream slope was examined, based on:

Isaak, D.J. and Hubert, W.A. (2000). Are trout populations affected by reach-scale stream slope? Canadian Journal of Fisheries and Aquatic Sciences, 57, 468-477.

In that paper, stream slope was (roughly) categorized into high or low slope classes and a paired analysis was performed. In this section, we will use the actual stream slopes to examine the relationship between fish density and stream slope.

Recall that a stream reach is a portion of a stream, from 10 to several hundred meters in length, that exhibits a consistent slope. The slope influences the general speed of the water, which exerts a dominant influence on the structure of physical habitat in streams. If fish populations are influenced by the structure of physical habitat, then the abundance of fish populations may be related to the slope of the stream.


Reach-scale stream slope and the structure of associated physical habitats are thought to affect trout populations, yet previous studies confound the effect of stream slope with other factors that influence trout populations.

Past studies addressing this issue have used sampling designs wherein data were collected either by taking repeated samples along a single stream or by measuring many streams distributed across space and time. Reaches on the same stream will likely have correlated measurements, making the use of simple statistical tools problematic. [Indeed, if only a single stream is measured at multiple locations, then this is an example of pseudo-replication and inference is limited to that particular stream.]

Inference from streams spread over time and space is made more difficult by inter-stream differences and by temporal variation in trout populations if samples are collected over extended periods of time. This extra variation reduces the power of any survey to detect effects.

For this reason, a paired approach was taken. A total of twenty-three streams were sampled from a large watershed. Within each stream, two reaches were identified and the actual slope gradient was measured.

In each reach, fish abundance was determined using electro-fishing methods and the numbers converted to a density per 100 m² of stream surface.

The following table presents the (fictitious, but based on the above paper) raw data.

Estimates of fish density from a paired experiment

Stream  slope (%)  slope class  density (per 100 m²)
   1       0.7        low            15.0
   1       4.0        high           21.0
   2       2.4        low            11.0
   2       6.0        high            3.1
   3       0.7        low             5.9
   3       2.6        high            6.4
   4       1.3        low            12.2
   4       4.0        high           17.6
   5       0.6        low             6.2
   5       4.4        high            7.0
   6       1.3        low            39.8
   6       3.2        high           25.0
   7       2.0        low             6.5
   7       4.2        high           11.2
   8       1.3        low             9.6
   8       4.2        high           17.5
   9       2.0        low             7.3
   9       3.6        high           10.0
  10       0.7        low            11.3
  10       3.5        high           21.0
  11       2.3        low            12.1
  11       6.0        high           12.1
  12       2.5        low            13.2
  12       4.2        high           15.0
  13       2.3        low             5.0
  13       6.0        high            5.0
  14       1.2        low            10.2
  14       2.9        high            6.0
  15       0.7        low             8.5
  15       2.9        high            7.0
  16       1.1        low             5.8
  16       3.0        high            5.0
  17       2.2        low             5.1
  17       5.0        high            5.0
  18       0.7        low            65.4
  18       3.2        high           55.0
  19       0.7        low            13.2
  19       3.0        high           15.0
  20       0.3        low             7.1
  20       3.2        high           12.0
  21       2.3        low            44.8
  21       7.0        high           48.0
  22       1.8        low            16.0
  22       6.0        high           20.0
  23       2.2        low             7.2
  23       6.0        high           10.1

Notice that the density varies considerably among streams but appears to be fairly consistent within each stream.


The raw data are available in a JMP data file called paired-stream.jmp in the Sample Programs Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms.

As noted earlier, this is an example of an Analytical Survey. The treatments (low or high slope) cannot be randomized within a stream – the randomization occurs by selecting streams at random from some larger population of potential streams. As noted in the earlier chapter on Observational Studies, causal inference is limited whenever a randomization of experimental units to treatments cannot be performed.

Unlike the example presented in other chapters, where the slope is divided (arbitrarily) into two classes (low and high slope), we will now use the actual slope. A simple regression CANNOT be used because of the non-independence introduced by measuring two reaches on the same stream. However, an ANCOVA will prove to be useful here.

First, it seems sensible that the response to stream slope will be multiplicative rather than additive, i.e. an increase in the stream slope will change the fish density by a common fraction, rather than simply changing the density by a fixed amount. For example, it may turn out that a 1 unit change in the slope reduces density by 10% - if the density before the change was 100 fish/m², then after the change the new density will be 90 fish/m². Similarly, if the original density was only 10 fish/m², then the final density will be 9 fish/m². In both cases, the reduction is a fixed fraction, and NOT the same fixed amount (a change of 10 vs. 1).
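The contrast between additive and multiplicative change can be seen numerically, using the densities from the paragraph above: a 10% reduction subtracts different amounts on the raw scale but a constant amount on the log scale, which is why the log transform turns a multiplicative effect into the additive one that regression assumes.

```python
# A 10% reduction changes density by different absolute amounts...
from math import log

before_hi, before_lo = 100.0, 10.0
drop_hi = before_hi - 0.90 * before_hi    # 10 fish per 100 m^2
drop_lo = before_lo - 0.90 * before_lo    # 1 fish per 100 m^2

# ...but by the same constant amount on the (natural) log scale:
shift_hi = log(0.90 * before_hi) - log(before_hi)
shift_lo = log(0.90 * before_lo) - log(before_lo)   # both = log(0.9)
```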

Create the log(density) column in the usual fashion (not illustrated here). In cases like this, the natural logarithm is preferred because the resulting estimates have a very nice, simple interpretation. 23

An appropriate model will be one where each stream has a separate intercept (corresponding to the different productivities of the streams - acting like a block), with a common slope for all streams. The simplified model syntax would look like

log(density) = Stream Slope

where the term Stream represents a nominally scaled variable and gives the different intercepts, and Slope is the effect of the common slope on log(density).
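Because each stream contributes exactly two reaches, this "separate intercept per stream, common slope" model has a simple closed form that can serve as a cross-check on the JMP output discussed below (this mirrors, but is not, the JMP analysis): the pooled least-squares slope reduces to sum(dx·dy)/sum(dx²), computed from the within-stream differences of slope and log(density).

```python
# Closed-form common-slope estimate for the stream-slope ANCOVA.
from math import log, fsum

# (low slope %, low density, high slope %, high density), one tuple per
# stream, transcribed from the table above
data = [
    (0.7, 15.0, 4.0, 21.0), (2.4, 11.0, 6.0, 3.1), (0.7, 5.9, 2.6, 6.4),
    (1.3, 12.2, 4.0, 17.6), (0.6, 6.2, 4.4, 7.0), (1.3, 39.8, 3.2, 25.0),
    (2.0, 6.5, 4.2, 11.2), (1.3, 9.6, 4.2, 17.5), (2.0, 7.3, 3.6, 10.0),
    (0.7, 11.3, 3.5, 21.0), (2.3, 12.1, 6.0, 12.1), (2.5, 13.2, 4.2, 15.0),
    (2.3, 5.0, 6.0, 5.0), (1.2, 10.2, 2.9, 6.0), (0.7, 8.5, 2.9, 7.0),
    (1.1, 5.8, 3.0, 5.0), (2.2, 5.1, 5.0, 5.0), (0.7, 65.4, 3.2, 55.0),
    (0.7, 13.2, 3.0, 15.0), (0.3, 7.1, 3.2, 12.0), (2.3, 44.8, 7.0, 48.0),
    (1.8, 16.0, 6.0, 20.0), (2.2, 7.2, 6.0, 10.1),
]

def pooled_slope(rows):
    """Within-stream pooled slope: sum(dx*dy) / sum(dx**2)."""
    dx = [hs - ls for ls, _, hs, _ in rows]
    dy = [log(hd) - log(ld) for _, ld, _, hd in rows]
    return fsum(x * y for x, y in zip(dx, dy)) / fsum(x * x for x in dx)

slope_all = pooled_slope(data)                 # close to the 0.025 reported below
slope_no2 = pooled_slope(data[:1] + data[2:])  # close to the 0.05 reported below
```

Dropping stream 2 (the rogue stream identified below) roughly doubles the estimate, in line with the re-analysis described later in this section.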

This is fit using the Analyze->Fit Model platform as:

23 The JMP dataset also created a different plotting symbol for each stream using the Rows->Color or Mark by Column menu.


Note that Stream must have a nominal scale and that Slope must have a continuous scale. The order of the terms in the effects box is not important.

The output from the Analyze->Fit Model platform is voluminous, but a careful reading reveals several interesting features.

First is a plot of the common slope fit to each stream:


This shows a gradual increase in density as slope increases. This plot is hard to interpret, but a plot of observed vs. predicted values is clearer:


Generally, the observed values are close to the predicted values, except for two potential outliers. By clicking on these points, it is seen that both points belong to stream 2, where it appears that an increase in the slope causes a large decrease in density, contrary to the general pattern seen in the other streams.

The effect tests:

fail to detect any influence of slope. Indeed, the estimated coefficient associated with a change in slope


is estimated to be .025 (se .0299) which is not statistically significant. 24<br />

Residual plots also show the odd behavior of stream 2:<br />

If this rogue stream is “eliminated” from the analysis, the the resulting plots do not show any problems<br />

(try it), but now the results are statistically significant (p = 0.035):<br />

24 Because the analysis was done on the (natural) log scale, “smallish” slope coefficients have an approximate percentage interpretation. In this example, a slope of .025 on the log scale implies that the estimated fish density INCREASES by about 2.5% every time the stream slope increases by one percentage point.
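The footnote's approximation is easy to verify numerically; a minimal sketch (the coefficient value is the one quoted above):

```python
import math

# Slope coefficient on the natural-log scale, from the example above
beta = 0.025

# The exact multiplicative change in density per one-percentage-point
# increase in stream slope is exp(beta); subtract 1 for the % change
exact = math.exp(beta) - 1   # about 0.0253, i.e. about 2.5%

# For small beta, exp(beta) - 1 is approximately beta, which is why the
# log-scale coefficient itself can be read directly as a proportional change
print(f"exact: {exact:.4f}  approx: {beta:.4f}")
```

The approximation degrades for larger coefficients, which is why the footnote restricts it to “smallish” slopes.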


The estimated change in log-density per percentage-point change in the slope is found to be:

i.e. the coefficient is .05 (se .02), interpreted as: a one-percentage-point increase in stream slope increases fish density by about 5%. 25

The remaining residual plot and leverage plots show no problems.

6.6 Example: Predicting PM10 levels

Small particulates are known to have adverse health effects. Here is some background information from Wikipedia: 26

The effects of inhaling particulate matter have been widely studied in humans and animals and include asthma, lung cancer, cardiovascular issues, and premature death. The size of the particle determines where in the body the particle will come to rest if inhaled. Larger particles are generally filtered by small hairs in the nose and throat and do not cause problems, but particulate matter smaller than about 10 micrometers, referred to as PM10, can settle in the bronchial tubes and lungs and cause health problems. Particles smaller than 2.5 micrometers, PM2.5, can penetrate directly into the lung, whereas particles smaller than 1 micrometer, PM1, can penetrate into the alveolar region of the lung and tend to be the most hazardous when inhaled.

The large number of deaths and other health problems associated with particulate pollution was first demonstrated in the early 1970s (Lave et al., 1973) and has been reproduced many times

25 This easy interpretation occurs because the natural log transform was used. If the common (base 10) log transform were used, there would no longer be such a simple interpretation.

26 Downloaded from http://en.wikipedia.org/wiki/Particulate on 2006-05-22


since. PM pollution is estimated to cause 20,000–50,000 deaths per year in the United States (Mokdad et al., 2004) and 200,000 deaths per year in Europe. For this reason, the US Environmental Protection Agency (EPA) sets standards for PM10 and PM2.5 concentrations in urban air. EPA regulates primary particulate emissions and precursors to secondary emissions (NOx, sulfur, and ammonia). Many urban areas in the US and Europe still frequently violate the particulate standards, though urban air has gotten cleaner, on average, with respect to particulates over the last quarter of the 20th century.

The data are a subsample of 500 observations from a data set that originated in a study relating air pollution at a road to traffic volume and meteorological variables, collected by the Norwegian Public Roads Administration.

The response variable consists of hourly values of the logarithm of the concentration (why?) of PM10 (particles), measured at Alnabru in Oslo, Norway, between October 2001 and August 2003. The predictor variables are the logarithm of the number of cars per hour, temperature 2 meters above ground (degrees C), wind speed (meters/second), the temperature difference between 25 and 2 meters above ground (degrees C), wind direction (degrees between 0 and 360), hour of day, and day number from October 1, 2001.

The data were extracted from http://lib.stat.cmu.edu/datasets/ and are available in the file pm10.jmp in the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms.

Wind direction is an interesting variable as it ranges from 0 to 360 around a circle and cannot be used directly in a regression setting – after all, directions of 1 degree and 359 degrees are very similar, but have vastly “different” measured values.

Examine the histogram of the wind directions (obtained from the Analyze->Distribution platform):


This seems to indicate that there are two major wind directions. The “E” winds correspond to wind directions from about 320 → 360 degrees and from 0 → 150 degrees, while the “W” winds correspond to directions between 150 → 320 degrees.

Convert these measurements into a nominal scaled variable using JMP’s formula editor:


This classifies the wind direction into the two categories. A character coding is used to prevent computer packages from interpreting a numeric code as an interval or ratio scaled variable. An indicator variable could be created for this variable as seen in earlier chapters.
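Outside JMP, the same recoding can be sketched in a few lines; a hedged example in Python (the function name is mine, and the cutpoints are the ones read off the histogram above):

```python
def wind_category(degrees):
    """Collapse a circular wind direction (0-360) into the two dominant
    directions seen in the histogram: 'E' for roughly 320-360 and 0-150
    degrees, 'W' for 150-320 degrees."""
    if degrees >= 320 or degrees < 150:
        return "E"
    return "W"

# A character code ('E'/'W') keeps software from treating the category
# as an interval- or ratio-scaled number.
print(wind_category(1), wind_category(359), wind_category(200))
```

Note that 1 degree and 359 degrees, numerically far apart, land in the same category – exactly the circularity problem the recoding is meant to fix.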

An initial scatterplot matrix of the data is obtained by using the Analyze->MultiVariateMethods->Multivariate platform:


There is no obvious relationship among the variables. The plot of the day variable shows a large gap. Inspection of the data shows that recording was stopped for about 100 days in the middle of the data set – the reasons for this are unknown. The number of cars/hour varies over the hour of the day in a predictable fashion. The wind direction variable shows that most of the data points have wind blowing in the two major directions corresponding to E and W as broken into categories earlier.

A plot of the log(PM10) concentration by the condensed wind direction:


shows no obvious relationship between the PM10 and the wind direction.

The Analyze->Fit Model platform was used to fit a model to the continuous and indicator variables.


The leverage plots (not shown) don’t reveal any problems in the fit. The actual vs. predicted plot:


appears to show some evidence that the fitted line tends to under-predict at high log(PM10) concentrations and over-predict at lower log(PM10) concentrations, but the visual impression may be an artifact of the density of points. The residual plot:


doesn’t show any problems with the fit. In any case, the R² is not large, indicating plenty of residual variation not explained by the regressor variables.

The estimates table:

doesn’t show any problems with variance inflation, but perhaps some variables can be deleted. Use the Custom Test option:


to see if the day, wind-direction, and hour can be removed. [I suspect that any hour effect has been taken up by the log(cars) effect and so is redundant (why?). Similarly, any trend over time (the day effect) may also be included in the log(cars) effect (why?)]:

[Why are three columns needed to test the three variables?] The results of the “chunk” test are:


showing that these variables can be safely deleted. The Analyze->Fit Model platform is again used, but now dropping these apparently redundant variables.

The revised estimates from this reduced model again show no problems in the leverage plots, no problems in the residual plots, and no problems with the VIF. The estimates are:


This time, it appears that both temperature variables are also redundant. This is somewhat surprising but, on sober second thought, perhaps not. Temperature wouldn’t affect the creation of particles – after all, if the cars are the driving force behind the levels, the cars will produce the same particulate levels regardless of temperature. Perhaps temperature only affects how the PM10 levels affect human health, i.e. on hot days, perhaps people feel more affected by pollution.

A “chunk” test using the Custom Test procedure shows that the temperature variables can also be dropped (not shown).

The final model includes only two variables, the log(cars/hour) and the wind speed. The final estimates are:

As the number of cars/hour increases, the pollution level increases. As both the pollution level and the number of cars have been measured on the log scale, the coefficient must be interpreted carefully. A doubling of the number of cars corresponds to an increase of .7 on the natural logarithm scale (log(2)=.7). Hence, the log(PM10) increases by .7(.32)=.22, which corresponds to an exp(.22) = 1.25-fold increase on the anti-log scale. In other words, a doubling of cars/hour corresponds to a 25% increase in the PM10 levels.

As wind speed increases, the concentration of PM10 decreases. A similar exercise shows that an increase in wind speed of 1 m/second causes the PM10 concentration to decrease by about 10%.
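Both effect sizes can be reproduced with a few lines of arithmetic. A sketch: the log(cars) coefficient .32 is quoted above, but the wind-speed coefficient is not printed in this excerpt, so the value −0.105 (≈ ln(0.9)) is an assumption back-figured from the stated 10% decrease per m/s:

```python
import math

b_log_cars = 0.32    # coefficient quoted in the text
b_wind = -0.105      # ASSUMED: back-figured from the "about 10%" decrease

# Doubling cars/hour adds log(2) = 0.7 on the log-cars scale,
# so log(PM10) rises by log(2) * 0.32 = 0.22
factor = math.exp(math.log(2) * b_log_cars)   # about 1.25, i.e. +25%

print(f"doubling cars/hour: x{factor:.2f} PM10")
print(f"+1 m/s wind speed:  x{math.exp(b_wind):.2f} PM10")
```

The same two-step recipe (change on the log scale, then exponentiate) works for any log-log coefficient in this chapter.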

The leverage plots and residual plots show no problems in the data.

How well does the model perform in practice? One way to assess this is to save the standard errors of predictions of the mean and of individual predictions to the data table:


(similar actions are done to save the standard error for individual predictions and the actual predicted values). Then compute the ratio of each of the standard errors to the predicted values:


(again, only one formula is shown) and use the Analyze->Distribution platform to see the histograms of the relative prediction errors:


Predictions of the MEAN response are fairly good – the relative standard errors are under 5%, so the 95% confidence intervals for the predicted response will be fairly tight. However, as expected, the prediction intervals for individual responses are fairly poor – the relative prediction standard errors are around 25%, which means that the 95% prediction intervals will be about ±50%! It is unclear how useful this is for advising individuals to take preventive actions under certain conditions of traffic volume and wind speed.
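The ±50% figure is just the usual “estimate ± 2 standard errors” rule applied to the relative standard errors saved above; a quick check (the 2-standard-error multiplier is an approximation to the 95% level):

```python
# A 95% interval is roughly estimate +/- 2 standard errors, so a
# relative standard error becomes a relative interval half-width of
# about 2 * (se / prediction).
def rel_halfwidth_95(relative_se):
    return 2 * relative_se

print(rel_halfwidth_95(0.05))   # mean response: about +/- 10%
print(rel_halfwidth_95(0.25))   # individual response: about +/- 50%
```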

6.7 Variable selection methods

6.7.1 Introduction

Up to now, it has been assumed that the variables to be used in the regression equation are basically known and all that matters is perhaps deleting some variables as being unimportant, or deciding upon the degree of the polynomial needed for a variable.

In some cases, researchers are faced with several tens (sometimes hundreds or thousands) of predictors and help is needed in even selecting a reasonable subset of variables to describe the relationship. The techniques in this section are called variable selection methods. CAUTION: Variable selection methods, despite their apparent objectivity, are no substitute for intelligent thought. As you will see in the remainder of this section, there are numerous caveats that must be kept in mind when using these methods.

There are two philosophies underlying variable selection methods. The first philosophy is that there is a unique correct model that explains the data. This MAY be true in physical systems where the goal of the project is to understand mechanisms of action. The role of variable selection is to try to come up with the variables that describe the mechanism of action. The second philosophy (and one that I personally find more appealing) is that reality is hopelessly complex and that all our models are wrong. We hope via regression methods to come up with a prediction function that works satisfactorily. There is NO unique set of predictors which is “correct” – there may be several sets of predictors that all give reasonable answers, and the choice among these sets is not obvious.

In both cases, model selection follows five general steps:

1. Specify the maximum model (i.e. the largest set of predictors).

2. Specify a criterion for selecting a model.

3. Specify a strategy for selecting variables.

4. Specify a mechanism for fitting the models – usually least squares.

5. Assess the goodness-of-fit of the models and the predictions.

6.7.2 Maximum model

The maximum model is the set of predictors that contains all potential predictors of interest. Often researchers will add polynomial terms (e.g. X₁²), cross-product terms (e.g. X₁X₂), or transformations of variables (e.g. ln(X₁)).

If the first philosophy is correct, this maximal model must contain the correct model as a subset of the potential predictor variables. As the maximum model, this model has the highest predictive power, but some predictors may be redundant. Under the second philosophy, we know that this (and all models) are wrong, but we hope that this maximal model is a reasonable predictor function. Again, some predictors may be redundant.

Some caution must be used in specifying a maximum model. First, try to avoid including many variables that are collinear. For example, height and weight are highly collinear – are both variables really needed? If including polynomial or cross-product terms, center the variables (i.e. subtract the mean) before squaring or taking cross-products. Use scientific knowledge to select the potential predictors and the shape of the prediction function. Classification variables (i.e. nominal or ordinal scaled variables) will generate a separate indicator variable for each level of the variable. Some computer programs (e.g. JMP) may generate contrasts among these indicator variables as well.


Second, there are various rules of thumb for the maximum number of predictors that should be entertained for a dataset. Generally, you want about 10 observations for each potential predictor variable. Hence, if your maximum model has 30 potential predictor variables, this rule of thumb would require that you have at least 300 observations! Remember that a nominal scaled variable with k values will require k−1 indicator variables!

Third, examine the contrast within variables. If a variable is essentially constant (e.g. every subject had essentially the same weight), then this is a useless predictor variable as no “effect” of weight will be apparent. If an indicator variable only points to a single case (e.g. only a single female in the dataset), then the results may be highly specific to the dataset analyzed. Low-contrast variables should not be included in the maximum model.

6.7.3 Selecting a model criterion

The model criterion is an “index” that is computed for each candidate model and used to compare the various models. Given a particular criterion, one can order the models from “best” to “worst”.

The criterion used should be related to the goal of the analysis. If the goal is prediction, the selection criterion should be related to errors in predictions. If the goal is variable subset selection, then the criterion should be related to the quality of the subset.

There is NO single best criterion. A literature search will reveal at least 10 criteria that have been proposed. In this chapter, five of the criteria will be discussed – this is not to say that these five are the optimal criteria, but rather the most frequently chosen. These criteria are R², F_p, MSE_p, C_p, and AIC.

R²

The R² criterion is the simplest criterion in use. The value of R² measures, in some sense, the proportion of total variation in the data that is explained by the predictors. Consequently, higher values of R² are “better”.

However, this criterion has a number of defects. First, R² will never decrease as you add variables (regardless of usefulness) to models. But in many cases, a plot of R² by the number of variables shows a rapid increase as variables are added, then a leveling off where new variables essentially add very little new information. Models near the bend of the curve seem to offer a reasonable description of the data. Some packages attempt to adjust the value of R² for the number of variables (called the adjusted R²), and so the value of the adjusted R² again near the bend of the curve would be the target.
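The adjustment most packages report divides each sum of squares by its degrees of freedom. A sketch of the standard formula (nothing here is specific to JMP, and the illustrative numbers are made up):

```python
def adjusted_r2(r2, n, p):
    """Usual adjusted R-squared for n observations and p predictors
    (excluding the intercept): 1 - (1 - R^2)(n - 1)/(n - p - 1).
    Unlike R^2, it can DECREASE when a useless predictor is added."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# A predictor that barely raises R^2 can lower the adjusted R^2:
print(adjusted_r2(0.500, n=100, p=5))   # about 0.473
print(adjusted_r2(0.502, n=100, p=6))   # about 0.470
```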

F_p

The F_p criterion is essentially a series of hypothesis tests to see which set of p variables is not statistically different from the full model. If the test statistic for a set of p predictors is not statistically significant, then the other variables can be dropped.


The danger with this criterion is that every test has an α probability of a Type I (false positive) error. So if you do 50 tests, each at α = .05, there is a very good chance that at least one of the tests will show a statistically significant result when in fact it is not. If you decide to use this criterion, you likely want to do the tests at a more stringent level, i.e. use α = .01 or α = .001.
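The “very good chance” can be quantified: if the 50 tests were independent, the chance of at least one false positive is 1 − (1 − α)⁵⁰. A quick computation (independence is an idealization; real tests on overlapping subsets are correlated):

```python
# Probability of at least one Type I error across m independent
# tests, each run at significance level alpha.
def familywise_error(alpha, m):
    return 1 - (1 - alpha) ** m

print(f"alpha=0.05,  m=50: {familywise_error(0.05, 50):.2f}")   # about 0.92
print(f"alpha=0.001, m=50: {familywise_error(0.001, 50):.3f}")  # about 0.049
```

This is why the more stringent α = .001 keeps the overall false-positive rate near the nominal 5% even across 50 tests.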

MSE_p

This criterion uses the estimated residual variance about the regression line. This residual variance is a combination of unexplainable variation and excess variation caused by unknown predictors. In many cases, there is a subset that has the minimal residual variation.

C_p and AIC

These are two related criteria (and in linear regression they can be shown to be equivalent).

Mallows’ C_p is computed as:

C_p = SSE(p)/MSE(k) − [n − 2(p + 1)]

where SSE(p) is the error sum of squares from the subset with p predictors, p EXCLUDING the intercept 27; MSE(k) is the MSE from the maximum model; and n is the number of observations.

If the maximum model does contain the “truth”, then Mallows showed that C_p should be close to p + 1 28 for a subset model that is closest to the “truth”.

Akaike’s Information Criterion (AIC) is a 1-1 transformation of C_p and can be thought of as

AIC = fit + penalty for predictors.

In the case of multiple regression, AIC has a simple form:

AIC = n log(SSE/n) + 2p

where p is now the number of predictors INCLUDING the intercept. The model with the smallest AIC is usually preferred as this model has the best fit after accounting for a penalty for adding too many predictors.
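Both criteria are simple functions of the fitted sums of squares; a sketch using the formulas above (the numbers are synthetic, not from any example in this chapter, and p follows each formula's own convention for the intercept):

```python
import math

def mallows_cp(sse_p, mse_full, n, p):
    """Mallows' C_p for a subset with p predictors (p EXCLUDING the
    intercept); mse_full is the MSE of the maximum model."""
    return sse_p / mse_full - (n - 2 * (p + 1))

def aic_regression(sse, n, p):
    """AIC for a least-squares fit, with p INCLUDING the intercept,
    as in the formula above."""
    return n * math.log(sse / n) + 2 * p

# Synthetic illustration: n = 50, full-model MSE = 2.0. A 3-predictor
# subset with SSE = 92 gives C_p = 4 = p + 1, i.e. the subset looks
# about as good as the full model.
print(mallows_cp(sse_p=92.0, mse_full=2.0, n=50, p=3))   # 4.0
print(aic_regression(sse=92.0, n=50, p=4))
```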

However, AIC goes further. Under the philosophy that all models are wrong, but some are useful, it is possible to obtain model weights for several potential models, and to “average” the results of several competing models. This avoids the entire discussion of which is the best wrong model, but rather works on the philosophy that if several models that all seem to fit the data similarly give wildly different answers, then

27 Some textbooks define p to INCLUDE the intercept and so the last term may look like n − 2p rather than n − 2(p + 1). Both are equivalent.

28 Again, if p is defined to include the intercept, then C_p should be close to p rather than p + 1.


this uncertainty in the response must be incorporated. Burnham and Anderson (2002) have written a very nice book on the use of AIC and its philosophy. Unfortunately, the use of model weights is beyond the scope of this course.

6.7.4 Which subsets should be examined?

When we start with k potential predictors, there are many, many potential models that involve subsets of the k predictors. How are these subsets chosen?

All possible subsets

If there are k predictor variables in the maximum model, there are around 2^k possible subsets. This number can be enormous – for example, with 10 potential predictors, there are around 2^10 = 1024 subsets; with 20 predictors, there are around 2^20 = 1,048,576 possible models, etc.

With modern computers and good algorithms, it is actually possible to search all subsets for up to about 15 predictors (and this number gets higher each year). 29 Don’t use Excel!
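The 2^k count is just an include/exclude choice for each predictor, which the standard library can enumerate directly; a small sketch (the predictor names are shortened labels from the PM10 example, used here only for illustration):

```python
from itertools import combinations

predictors = ["log_cars", "temp", "wind_speed", "temp_diff"]   # k = 4

# Every subset of the k predictors, from the empty model up to all k
subsets = [combo
           for r in range(len(predictors) + 1)
           for combo in combinations(predictors, r)]

print(len(subsets))   # 2^4 = 16
```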

The all-possible-subsets strategy is preferred for reasonably sized problems. Because it looks at all possible models, it is unlikely that you would miss the “correct” model among the subsets. However, there may be several different models that are all essentially the same, and being forced to select one of these models is a bit arbitrary – hence one of the driving forces behind the AIC.

Backward elimination

If you have many predictors, then all possible subsets may not be feasible. The backward elimination procedure starts with the maximum model and successively “deletes” variables until no further variables can be deleted.

The algorithm proceeds as follows:

1. Fit the maximum model.

2. Decide which variable to delete. Look at each of the individual p-values for variables still in the model. If all of the p-values are less than some α (say .05, but this varies among packages), then stop. Else, find the variable with the largest (why?) p-value and drop this variable.

3. Refit the model. Refit the model after dropping this variable, and repeat step 2 until no further variables can be deleted.

29 It turns out that by cleverly computing various statistics, you can actually predict the results from many subsets without actually having to fit all the subsets.


One must be careful to ensure that models are hierarchical, i.e. if an X² term remains in the model, then the corresponding X term must also remain. Many computer packages will violate this restriction if left to their own devices.
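Steps 1-3 can be sketched as a loop. In this hedged sketch the refit is faked with a fixed p-value table so the control flow is self-contained; a real implementation would refit the regression at each pass (and the p-values would change as variables leave):

```python
def fit_pvalues(variables):
    # Stand-in for refitting the model: a MADE-UP p-value table.
    table = {"x1": 0.001, "x2": 0.30, "x3": 0.02, "x4": 0.60}
    return {v: table[v] for v in variables}

def backward_eliminate(variables, alpha=0.05):
    variables = list(variables)
    while variables:
        pvals = fit_pvalues(variables)         # step 1/3: (re)fit
        worst = max(variables, key=pvals.get)  # step 2: largest p-value
        if pvals[worst] < alpha:               # everything significant: stop
            break
        variables.remove(worst)                # drop it and refit
    return variables

print(backward_eliminate(["x1", "x2", "x3", "x4"]))   # ['x1', 'x3']
```

A real version would also enforce the hierarchy restriction above, never dropping X while an X² term remains.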

Forward addition

This is the reverse of the backward elimination procedure. Start with a null model, and keep adding variables until no more can be added. The variable with the smallest incremental p-value at each step is the variable that is added.

Again, you must ensure that if an X² term is entered, the corresponding X term is also entered.

Stepwise selection

It may turn out that adding a variable during a forward process makes an existing variable redundant. The forward addition process has no mechanism for deleting variables once they’ve entered the model.

In a stepwise selection procedure, after a variable is entered, a backward elimination procedure is attempted to see if any variable can be removed.

Closing words

In all of these automated selection procedures, there is no guarantee that the chosen model will be “optimal” in any sense. As well, because of the many, many statistical tests performed, none of the p-values at the final step should be interpreted literally. It is also well known that if data generated completely at random are used with stepwise methods, they will often select a “prediction” model that is just noise.

Consequently, the results that you obtain may be highly specific to the dataset collected and may not be reproducible with other datasets. Refer to Section 6.7.5 for ideas on evaluating the reliability of the analysis.

6.7.5 Goodness-of-fit

Even with automated variable selection methods, there is no guarantee that the fitted models actually fit the data well. Consequently, the usual residual diagnostics must be performed as outlined in earlier sections.

At the same time, the analyst should avoid becoming fixated on the results from a single dataset. There is no guarantee that the results from this particular dataset translate into other datasets. There are several ways to try to assess how well the chosen relationship will work in the future:

©2012 Carl James Schwarz, November 23, 2012

CHAPTER 6. MULTIPLE LINEAR REGRESSION

• Try on a new dataset. In some cases, the study can be repeated, and a comparison of the model selected from the existing study and the new study is instructive.

• Split-sample. If there are many observations, the sample can be split into two. Model selection is done on each half independently, and the two analyses compared. If a variable is selected in one half but not the other, this is an indication of instability in the analysis.

How well does the model do in predictions? Recall that R² measures the percentage of variation explained by the model. Use the first half of the data, fit a model, and find the R² for the first half. Use the model from the first sample to predict the data points for the second sample, and compute the squared correlation between the observed and predicted values. This second R² will typically be smaller than the R² based on the first sample. If the shrinkage in R² is large, this is bad news: it implies that the results from the first sample did not do well in predicting the values in the second sample.

• Cross-validation. In some cases, you do not have sufficient data to split into two halves. In these cases, single-case (leave-one-out) cross-validation is often attempted. In this method, you fit a model excluding each case in turn, and then use the fitted model to predict the held-out case. A comparison of the fitted vs. actual values is a measure of predictive ability.
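Both the split-sample check and leave-one-out cross-validation can be sketched in a few lines. This is a minimal illustration assuming numpy and scikit-learn on a small synthetic dataset (the coefficients and noise level are made up), not an analysis of any dataset from these notes:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_predict

rng = np.random.default_rng(2)
n = 200
X = rng.normal(size=(n, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=2.0, size=n)

# Split-sample check: fit on the first half, predict the second half.
X1, y1, X2, y2 = X[:100], y[:100], X[100:], y[100:]
fit = LinearRegression().fit(X1, y1)
r2_fit = fit.score(X1, y1)                             # R^2 on the fitting half
r2_pred = np.corrcoef(y2, fit.predict(X2))[0, 1] ** 2  # squared correlation on the holdout

# Leave-one-out cross-validation: each case is predicted from a
# model fit to the remaining n-1 cases.
y_loo = cross_val_predict(LinearRegression(), X, y, cv=LeaveOneOut())
r2_loo = np.corrcoef(y, y_loo)[0, 1] ** 2

print(round(r2_fit, 2), round(r2_pred, 2), round(r2_loo, 2))
```

A large gap between the fitting-half R² and the holdout (or leave-one-out) R² is the "shrinkage" warning sign described above.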

6.7.6 Example: Calories of candy bars

The JMP installation includes a dataset on the composition of popular candy bars. This is available under the Help → Sample Data Library → Food and Nutrition section, or in the candybar.jmp file in the Sample Program Library in the http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms directory.

For each of about 50 brands of candy bars, the total calories and the composition (grams of fat, grams of fiber, etc.) were measured. Can the total calories be predicted from the various constituents?

A preliminary scatter plot of the data shows a strong relationship between calories and total grams of fat and/or grams of saturated fat, but a weaker relationship between calories and grams of protein and grams of carbohydrates.

There are no obvious outliers, except for a few candy bars which appear to have unusual levels of vitamins.

The Analyze->Fit Model platform is used to request a stepwise regression analysis to try to predict the number of calories in the candy bars.

In this case, the philosophy that the correct model must be a subset of these variables is likely correct. The mechanism by which calories "appear" in food is well understood: likely a combination of fat, protein, and carbohydrates. It is unlikely that fiber or vitamins contribute anything substantial to the total calories.

The stepwise dialogue box has a number (!) of options and statistics available. Detailed explanation of these features is available in the JMP help, but a summary is below:

• The direction of the stepwise procedure can be changed from forward to backward or to mixed. If you wish to do backward elimination, you will have to Enter All variables first before selecting this option. All-possible-regressions is available from the red-triangle pop-down menu.

• The probability to enter and to leave are set fairly liberally. A probability to enter of 0.25 indicates that variables that have any chance of being useful are added; the probability to leave indicates that as long as some marginal predictive ability is available, the variable should be retained.

• If the Go button is pressed, the procedure is completely automatic. If the Step button is pressed, the procedure goes step-by-step through the algorithm. The Make Model button is used at the end to fit the final selected model and obtain the usual diagnostic features.

• The package reports the MSE, R², the adjusted R², Cp, and AIC for each model. These can be used to assess the progress of the procedure.

• The actual model under consideration consists of those variables with check marks inside the Entered boxes. If you wish to force a variable to be always present, this is possible by entering the variable and locking it in.

Change the direction to Mixed and then repeatedly press the Step button.

For the first step, the program computes the p-values for each new variable to enter the model. The variable with the smallest p-value below the Prob to Enter will be selected to enter; in this case, that is the Total Fat variable.

The model now consists of the intercept and the total fat variable, for a total of p = 2 predictors. The Cp is extremely large; the R² has increased from the previous model; the MSE has decreased.

None of the variables has a p-value greater than the Prob to Leave, so nothing happens at the "leaving step" and the Step button must be pressed again.

Based on the previous output, the carbohydrate variable will be entered (why?), and then the protein variable (why?).

At this point we are getting models with enormous R² values (close to 100%), which is practically unheard of in ecological contexts. Note that Cp is becoming close to p.

Which variable would be entered next? Surprisingly, sodium is entered next, followed by saturated fat, and finally the procedure halts.

Both backward elimination and forward selection also pick this final model (try it).

The Make Model button will take these selected variables and create the Analyze->Fit Model dialogue box to fit this final model.

None of the leverage plots shows anything amiss; the residual plots look good.

The VIF for total fat is a bit worrisome: notice that both the total fat and saturated fat variables are in the model. Presumably, saturated fat is included in the total fat and is redundant. Try refitting this model dropping the saturated fat variable and re-examine the estimates.

Again, all the leverage plots look fine, and the VIFs are all small. In our final model, each additional gram of total fat increases calories by 8.9 30 ; each additional gram of protein increases calories by 4.7 31 ; each additional gram of carbohydrates increases calories by 4.1 32 ; and each mg of sodium decreases calories by a minuscule amount. The biological relevance of the sodium contribution is unknown. Perhaps this is an artifact of this particular data set?

This particular example was "easy", as the true model is known and the response is almost exactly predicted by the predictors. As noted earlier, most ecological contexts are not so nearly perfect.

6.7.7 Example: Fitness dataset

This will be demonstrated in class.

6.7.8 Example: Predicting zooplankton biomass

What drives the biomass of zooplankton on reefs? The zooplankton was broken into two size classes (190–600 µm and >600 µm), and environmental variables were sampled at 51 irregularly spaced sites (sampling interval: 156–37 m), arranged along a straight-line cross-shelf transect 8.4 km in length.

The raw data are available at http://www.esapubs.org/archive/ecol/E085/050/suppl-1.htm#anchorFilelist in the Guadeloupe.txt file, and in the guadeloupe.jmp file in the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms.

The response variable is the log-transformed zooplankton biomass of two size classes (original units: mg/m³ ash-free dry mass). 33 The predictor variables include:

• the coordinate (km) of the sampling site along the transect;

• environmental variables such as dissolved oxygen (mg/L), salinity (psu), wind speed (m/s), phytoplankton biomass (log-transformed, original units: µg/L), turbidity (NTU), and swell height (m);

• habitat variables coded as 14 indicator variables indicating various habitat classes.

30 The accepted value for fat is 9 calories/gram.
31 The accepted value for protein is 4 calories/gram.
32 The accepted value for carbohydrates is 4 calories/gram.
33 Why was a log-transform used?

We will try to develop a prediction equation for the larger zooplankton category.

It is always good practice to do some preliminary plots of the data to search for outliers and general trends in the data before beginning a more sophisticated analysis.

Start with a scatterplot matrix of the continuous variables, obtained from the Analyze->MultiVariateMethods->Multivariate platform.

There appears to be a strong bivariate relationship of biomass with distance along the transect line and with phytoplankton biomass. At the same time, several of the predictors appear to be highly related. For example, the distance along the transect line and phytoplankton biomass are very strongly related, as are wind speed and swell height. A quadratic relationship between some of the predictor variables is also apparent (e.g. wind speed vs. distance). A few unusual points appear; e.g. look at the plot of salinity vs. log(zooplankton), where two points seem at odds with the rest of the data. By clicking on these points, we see that they correspond to site 5 (whose marker I subsequently changed to an X to see where it fit in the rest of the plot) and site 1 (whose marker I subsequently changed to a triangle for the remainder of the analysis).


A common problem with indicator variables is insufficient contrast, i.e. there are only a few sampling sites with a particular habitat variable. You can see how many of each habitat variable are present by simply counting the number of 1's in each indicator-variable column, or by finding the "sum" of each column.


These indicate that there is only 1 site with under 25% coverage of sea-grass on muddy sand, and that most of the indicator variables occur on less than 10% of the sites. I would be hesitant to read too much into any regression equation that includes most of these indicator variables, as I suspect they will be specific to this particular dataset and not generalizable to other datasets.

So, based on this preliminary analysis, I would expect that distance and/or phytoplankton and/or turbidity would be the primary predictors for zooplankton biomass in this category. With only 51 data points, I would be reluctant to include more than about 5 predictor variables, using the rule of thumb of 10 observations per predictor.

The Analyze->Fit Model platform is used to request a stepwise regression analysis.

The step history shows that R² increases fairly rapidly until it hits around 80% and then tends to level off; the Cp also approaches p 34 around step 9 or 10.

The summary of the steps shows that the transect location is the first variable in, followed, surprisingly, by several indicator variables, followed by phytoplankton biomass. It is somewhat surprising that both the transect location and the phytoplankton biomass are entered into the model, as they are highly related.

Rerun the stepwise procedure, a step at a time, for the first 9 steps, and then press the Make Model button to actually fit this model. The plot of actual vs. predicted values shows a reasonable fit. Some of the leverage plots for the indicator variables show that the fit is determined by a single site or a pair of sites.

34 Note that JMP uses the convention that the count p INCLUDES the intercept.

The VIFs for the transect location and phytoplankton biomass variables are large, a consequence of the strong relationship between these two variables.

I would subsequently remove one of the transect location or phytoplankton biomass variables, and would likely remove any entered indicator variable that depends on a single site, as this is surely an artifact of this particular dataset.

All-possible-subsets regression is barely feasible with a regression problem of this size. It took less than three minutes to fit on my Macintosh G4 at home, but the output file was enormous! I suspect that unless some way is found to condense the output into something more user-friendly, this would not be a feasible way to proceed.



Chapter 7

Logistic Regression

7.1 Introduction

7.1.1 Difference between standard and logistic regression

In regular multiple-regression problems, the Y variable is assumed to have a continuous distribution, with the vertical deviations around the regression line being independently normally distributed with a mean of 0 and a constant variance σ². The X variables are either continuous or indicator variables.

In some cases, the Y variable is a categorical variable, often with two distinct classes. The X variables can be either continuous or indicator variables. The object is now to predict the CATEGORY in which a particular observation will lie.

For example:

• The Y variable is over-winter survival of a deer (yes or no) as a function of body mass, condition factor, and winter severity index.

• The Y variable is fledging (yes or no) of birds as a function of distance from the edge of a field, food availability, and a predation index.

• The Y variable is breeding (yes or no) of birds as a function of nest density, predators, and temperature.

Consequently, the linear regression model with normally distributed vertical deviations really doesn't make much sense: the response variable is a category and does NOT follow a normal distribution. In these cases, a popular methodology is logistic regression.

There are a number of good books on the use of logistic regression:


• Agresti, A. (2002). Categorical Data Analysis. Wiley: New York.

• Hosmer, D.W. and Lemeshow, S. (2000). Applied Logistic Regression. Wiley: New York.

These should be consulted for all the gory details on the use of logistic regression.

7.1.2 The Binomial Distribution

A common probability model for outcomes that come in only two states (e.g. alive or dead, success or failure, breeding or not breeding) is the Binomial distribution. The Binomial distribution counts the number of times that a particular event will occur in a sequence of observations. 1 The binomial distribution is used when a researcher is interested in the occurrence of an event, not in its magnitude. For instance, in a clinical trial, a patient may survive or die. The researcher studies the number of survivors, and not how long the patient survives after treatment. In a study of bird nests, the number in the clutch that hatch is measured, not the length of time to hatch.

In general, the binomial distribution counts the number of events in a set of trials, e.g. the number of deaths in a cohort of patients, the number of broken eggs in a box of eggs, or the number of eggs that hatch from a clutch. Other situations in which binomial distributions arise are quality control, public opinion surveys, medical research, and insurance problems.

It is important to examine the assumptions being made before a Binomial distribution is used. The conditions for a Binomial distribution are:

• n identical trials (n could be 1);

• all trials are independent of each other;

• each trial has only one outcome, success or failure;

• the probability of success is constant for the set of n trials (some books use p to represent the probability of success; other books use π); 2

• the response variable Y is the number of successes 3 in the set of n trials.

However, not all experiments that on the surface look like binomial experiments satisfy all the required assumptions. Typical failures of the assumptions include non-independence (e.g. the first bird that hatches destroys the remaining eggs in the nest), or a changing p within a set of trials (e.g. when measuring genetic abnormalities for a particular mother as a function of her age: for many species, older mothers have a higher probability of genetic defects in their offspring as they age).

1 The Poisson distribution is a close cousin of the Binomial distribution and is discussed in other chapters.
2 Following the convention that Greek letters refer to population parameters, just as µ refers to the population mean.
3 There is great flexibility in defining what is a success. For example, you could count either the number of eggs that hatch or the number of eggs that fail to hatch in a clutch. You will get the same answers from the analysis after making the appropriate substitutions.


The probability of observing Y = y successes in n trials, if each success has probability p of occurring, can be computed using:

\[
p(Y = y \mid n, p) = \binom{n}{y} p^y (1-p)^{n-y}
\]

where the binomial coefficient is computed as

\[
\binom{n}{y} = \frac{n!}{y!\,(n-y)!}
\]

and where \(n! = n(n-1)(n-2)\cdots(2)(1)\).

For example, the probability of observing Y = 3 eggs hatching from a nest with n = 5 eggs in the clutch, if the probability of success is p = .2, is

\[
p(Y = 3 \mid n = 5, p = .2) = \binom{5}{3} (.2)^3 (1-.2)^{5-3} = .0512
\]

Fortunately, we will have little need for these probability computations. There are many tables that tabulate the probabilities for various combinations of n and p; check the web.
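The hatching example can also be checked directly; a short sketch using only the Python standard library:

```python
from math import comb

def binom_pmf(y, n, p):
    """P(Y = y | n, p) = C(n, y) * p^y * (1-p)^(n-y)."""
    return comb(n, y) * p**y * (1 - p)**(n - y)

# probability of 3 hatches from a clutch of 5 when p = .2
print(round(binom_pmf(3, 5, 0.2), 4))  # 0.0512
```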

There are two important properties of a binomial distribution that will serve us in the future. If Y is Binomial(n, p), then:

• E[Y] = np

• V[Y] = np(1 − p), and the standard deviation of Y is \(\sqrt{np(1-p)}\)

For example, if n = 20 and p = .4, then the average number of successes in these 20 trials is E[Y] = np = 20(.4) = 8.
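These two moments are easy to confirm by simulation; a small sketch assuming numpy (the number of replicate draws is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 20, 0.4
draws = rng.binomial(n, p, size=200_000)  # many replicates of the 20-trial experiment

print(round(draws.mean(), 2))  # close to E[Y] = n*p = 8
print(round(draws.var(), 2))   # close to V[Y] = n*p*(1-p) = 4.8
```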

If an experiment is observed, and a certain number of successes is observed, then the estimator for the success probability is found as:

\[
\hat{p} = \frac{Y}{n}
\]

For example, if a clutch of 5 eggs is observed (the set of trials) and 3 successfully hatch, then the estimated proportion of eggs that hatch is \(\hat{p} = 3/5 = .60\). This is exactly analogous to the case where a sample is drawn from a population and the sample average \(\bar{Y}\) is used to estimate the population mean µ.

7.1.3 Odds, risk, odds-ratio, and probability

The odds of an event and the odds ratio of events are very common terms in logistic contexts. Consequently, it is important to understand exactly what these say and don't say.


The odds of an event are defined as:

\[
\text{Odds}(\text{event}) = \frac{P(\text{event})}{P(\text{not event})} = \frac{P(\text{event})}{1 - P(\text{event})}
\]

The notation used is often a colon separating the odds values. Some sample values are tabulated below:

Probability   Odds
.01           1:99
.1            1:9
.5            1:1
.6            6:4 or 3:2 or 1.5
.9            9:1
.99           99:1

For very small or very large odds, the probability of the event is approximately equal to the odds. For example, if the odds are 1:99, then the probability of the event is 1/100, which is roughly equal to 1/99.

The odds ratio (OR) is, by definition, the ratio of two odds:

\[
OR_{A \text{ vs. } B} = \frac{\text{odds}(A)}{\text{odds}(B)} = \frac{P(A)/(1 - P(A))}{P(B)/(1 - P(B))}
\]

For example, if the probability of an egg hatching under condition A is 1/10 and the probability of an egg hatching under condition B is 1/20, then the odds ratio is OR = (1:9)/(1:19) = 2.1:1. Again, for very small or very large odds, the odds ratio is approximately equal to the ratio of the probabilities.

An odds ratio of 1 would indicate that the probabilities of the two events are equal.
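A quick check of this arithmetic in plain Python, using the same hypothetical hatching probabilities as the text:

```python
def odds(p):
    """Odds of an event with probability p."""
    return p / (1 - p)

# hatching probability 1/10 under condition A, 1/20 under condition B
or_ab = odds(0.10) / odds(0.05)
print(round(or_ab, 1))  # 2.1, i.e. an odds ratio of about 2.1:1
```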

In many studies, you will hear reports that the odds of an event have doubled. This gives NO information about the base rate. For example, did the odds increase from 1:million to 2:million, or from 1:10 to 2:10?

It turns out that it is convenient to model probabilities on the log-odds scale. The log-odds (LO), also known as the logit, is defined as:

\[
\text{logit}(A) = \log_e(\text{odds}(A)) = \log_e\left(\frac{P(A)}{1 - P(A)}\right)
\]

We can extend the previous table to compute the log-odds:

Probability   Odds                Logit
.01           1:99                −4.59
.1            1:9                 −2.20
.5            1:1                 0
.6            6:4 or 3:2 or 1.5   .41
.9            9:1                 2.20
.99           99:1                4.59

Notice that the log-odds is zero when the probability is .5, and that the log-odds of .01 is symmetric with the log-odds of .99.

It is also easy to go back from the log-odds scale to the regular probability scale in two equivalent ways:

\[
p = \frac{e^{\text{log-odds}}}{1 + e^{\text{log-odds}}} = \frac{1}{1 + e^{-\text{log-odds}}}
\]

Notice the minus sign in the second back-translation. For example, LO = 10 translates to p = .9999; LO = 4 translates to p = .98; LO = 1 translates to p = .73; etc.
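The logit and its inverse are two one-line functions; a sketch using only the standard library, reproducing the back-translations quoted above:

```python
from math import exp, log

def logit(p):
    """Log-odds of a probability p in (0, 1)."""
    return log(p / (1 - p))

def inv_logit(lo):
    """Back-translate a log-odds value to a probability."""
    return 1 / (1 + exp(-lo))

print(round(inv_logit(4), 2))  # 0.98
print(round(inv_logit(1), 2))  # 0.73
```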

7.1.4 Modeling the probability of success

Now, if the probability of success were the same for all sets of trials, the analysis would be trivial: simply tabulate the total number of successes and divide by the total number of trials to estimate the probability of success. However, what we are really interested in is the relationship of the probability of success to some covariate X, such as temperature or condition factor.

For example, consider the following (hypothetical) example of an experiment where various clutches of bird eggs were found, and the number of eggs that hatched and fledged was measured along with the height of the nest above the ground:


Height   Clutch Size   Fledged   p̂
2.0      4             0         0.00
3.0      3             0         0.00
2.5      5             0         0.00
3.3      3             2         0.67
4.7      4             1         0.25
3.9      2             0         0.00
5.2      4             2         0.50
10.5     5             5         1.00
4.7      4             2         0.50
6.8      5             3         0.60
7.3      3             3         1.00
8.4      4             3         0.75
9.2      3             2         0.67
8.5      4             4         1.00
10.0     3             3         1.00
12.0     6             6         1.00
15.0     4             4         1.00
12.2     3             3         1.00
13.0     5             5         1.00
12.9     4             4         1.00

Notice that the probability of fledging seems to increase with height above the ground (potentially reflecting distance from predators?).

We would like to model the probability of success as a functi<strong>on</strong> of height. As a first attempt, suppose<br />

that we plot the estimated probability of success (̂p) as a functi<strong>on</strong> of height and try and fit a straight line to<br />

the plotted points.<br />

The Analyze->Fit Y-by-X plat<str<strong>on</strong>g>for</str<strong>on</strong>g>m was used, and ̂p was treated as the Y variable and Height as the X<br />

variable:<br />

[Scatterplot of p̂ versus Height with a fitted straight line]

This procedure is not entirely satisfactory for a number of reasons:

• The data points seem to follow an S-shaped relationship, with probabilities of success near 0 at lower heights and near 1 at greater heights.

• The fitted line gives predictions for the probability of success that are greater than 1 or less than 0, which is impossible.

• The fitted line cannot deal properly with the fact that the probability of success is likely close to 0% for a wide range of small heights and essentially close to 100% for a wide range of taller heights.

• The assumption of a normal distribution for the deviations from the fitted line is not tenable, as the p̂ are essentially discrete for the small clutch sizes found in this experiment.

• While not apparent from this graph, the variability of the response changes over different parts of the regression line. For example, when the true probability of success is very low (say 0.1), the standard deviation in the number fledged for a clutch with 5 eggs is √(5(.1)(.9)) = .67, while the standard deviation of the number fledged for a clutch with 5 eggs and a probability of success of 0.5 is √(5(.5)(.5)) = 1.1, which is almost twice as large as the previous standard deviation.
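The two standard deviations in the last bullet follow from the binomial formula sd = √(np(1−p)); a quick check in Python (a sketch, not part of the original notes):

```python
import math

def binom_sd(n, p):
    """Standard deviation of the number of successes in n independent
    binomial trials, each with success probability p: sqrt(n*p*(1-p))."""
    return math.sqrt(n * p * (1 - p))

print(round(binom_sd(5, 0.1), 2))  # clutch of 5 eggs, p = 0.1
print(round(binom_sd(5, 0.5), 2))  # clutch of 5 eggs, p = 0.5
```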

For these (and other) reasons, the analysis of this type of data is commonly done on the log-odds (also called the logit) scale. The odds of an event is computed as:

ODDS = p / (1 − p)

and the log-odds is found as the (natural) logarithm of the odds:

LO = log( p / (1 − p) )

This transformation converts the 0-1 scale of probability to a −∞ to ∞ scale, as illustrated below:

p       LO
0.001   -6.91
0.01    -4.60
0.05    -2.94
0.1     -2.20
0.2     -1.39
0.3     -0.85
0.4     -0.41
0.5      0.00
0.6      0.41
0.7      0.85
0.8      1.39
0.9      2.20
0.95     2.94
0.99     4.60
0.999    6.91

Notice that the log-odds scale is symmetrical about 0, and that for moderate values of p, changes on the p-scale have nearly constant changes on the log-odds scale. For example, going from .5 → .6 → .7 on the p-scale corresponds to moving from 0 → .41 → .85 on the log-odds scale.

It is also easy to go back from the log-odds scale to the regular probability scale:

p = e^LO / (1 + e^LO) = 1 / (1 + e^(−LO))

For example, an LO = 10 translates to p = .9999; an LO = 4 translates to p = .98; an LO = 1 translates to p = .73; etc.

We can now return to the previous data. At first glance, it would seem that the log-odds is simply estimated as:

LÔ = log( p̂ / (1 − p̂) )

but this doesn't work well with small sample sizes (it can be shown that the simple logit function is biased) or when values of p̂ are close to 0 or 1 (the simple logit function hits ±∞). Consequently, in small samples or when the observed probability of success is close to 0 or 1, the empirical log-odds is often computed as:

LÔ_empirical = log( (np̂ + .5) / (n(1 − p̂) + .5) ) = log( (p̂ + .5/n) / (1 − p̂ + .5/n) )

We compute the empirical log-odds for the hatching data:

Height   Clutch   Fledged   p̂      LÔ_emp
 2.0     4        0         0.00   -2.20
 3.0     3        0         0.00   -1.95
 2.5     5        0         0.00   -2.40
 3.3     3        2         0.67    0.51
 4.7     4        1         0.25   -0.85
 3.9     2        0         0.00   -1.61
 5.2     4        2         0.50    0.00
10.5     5        5         1.00    2.40
 4.7     4        2         0.50    0.00
 6.8     5        3         0.60    0.34
 7.3     3        3         1.00    1.95
 8.4     4        3         0.75    0.85
 9.2     3        2         0.67    0.51
 8.5     4        4         1.00    2.20
10.0     3        3         1.00    1.95
12.0     6        6         1.00    2.56
15.0     4        4         1.00    2.20
12.2     3        3         1.00    1.95
13.0     5        5         1.00    2.40
12.9     4        4         1.00    2.20
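The LÔ_emp column can be reproduced directly from the empirical log-odds formula, since np̂ is the number fledged and n(1 − p̂) is the number not fledged. A small Python check (the function name is mine, not from the notes):

```python
import math

def empirical_log_odds(fledged, clutch):
    """Empirical log-odds with the 0.5 adjustment:
    log((n*p + .5) / (n*(1 - p) + .5))."""
    return math.log((fledged + 0.5) / (clutch - fledged + 0.5))

# A few rows of the table: (clutch size, number fledged)
for clutch, fledged in [(4, 0), (3, 2), (5, 3), (6, 6)]:
    print(round(empirical_log_odds(fledged, clutch), 2))
```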

and now plot the empirical log-odds against height:

[Plot of the empirical log-odds versus Height with a fitted straight line]

The fit is much nicer, the relationship has been linearized, and now, no matter what the prediction, it can always be translated back to a probability between 0 and 1 using the inverse transform seen earlier.

7.1.5 Logistic regression

But this is still not enough. Even on the log-odds scale the data points are not normally distributed around the regression line. Consequently, rather than using ordinary least-squares to fit the line, a technique called generalized linear modeling is used.

In generalized linear models, a method called maximum likelihood is used to find the parameters of the model (in this case, the intercept and the regression coefficient of height) that give the best fit to the data. While details of maximum likelihood estimation are beyond the scope of this course, they are closely related to weighted least squares in this class of problems. Maximum Likelihood Estimators (often abbreviated as MLEs) are, under fairly general conditions, guaranteed to be the "best" (in the sense of having smallest standard errors) in large samples. In small samples, there is no guarantee that MLEs are optimal, but in practice MLEs seem to work well. In most cases, the calculations must be done numerically – there are no simple formulae as in simple linear regression. 4
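As a sketch of what the numerical fitting involves, the MLE for the clutch data can be computed with a hand-rolled Newton-Raphson iteration. This is a pedagogical illustration, not the algorithm JMP actually uses; in practice you would rely on a packaged routine:

```python
import math

# (height, clutch size, number fledged) from the table above
data = [(2.0, 4, 0), (3.0, 3, 0), (2.5, 5, 0), (3.3, 3, 2), (4.7, 4, 1),
        (3.9, 2, 0), (5.2, 4, 2), (10.5, 5, 5), (4.7, 4, 2), (6.8, 5, 3),
        (7.3, 3, 3), (8.4, 4, 3), (9.2, 3, 2), (8.5, 4, 4), (10.0, 3, 3),
        (12.0, 6, 6), (15.0, 4, 4), (12.2, 3, 3), (13.0, 5, 5), (12.9, 4, 4)]

def fit_logistic(data, iters=30):
    """Newton-Raphson maximum likelihood for the model LO = b0 + b1*height
    with grouped binomial counts (y fledged out of n eggs)."""
    b0 = b1 = 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, n, y in data:
            eta = max(-30.0, min(30.0, b0 + b1 * x))  # guard against overflow
            p = 1.0 / (1.0 + math.exp(-eta))
            w = n * p * (1.0 - p)        # binomial weight (variance of y)
            g0 += y - n * p              # score (gradient of log-likelihood)
            g1 += (y - n * p) * x
            h00 += w                     # entries of the negative Hessian
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det  # Newton step: solve H d = g
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

b0, b1 = fit_logistic(data)
print(round(b0, 2), round(b1, 2))   # JMP reports -4.03 and 0.72 for these data
```

Because the log-likelihood is concave, Newton's method converges quickly here and reproduces the JMP estimates reported later in this section.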

In order to fit a logistic regression using maximum likelihood estimation, the data must be in a standard format. In particular, both successes and failures must be recorded, along with a classification variable that is nominally scaled. For example, the first clutch (at 2.0 m) will generate two lines of data – one for the successful fledges and one for the unsuccessful fledges. If the count for a particular outcome is zero, it can be omitted from the data table, but I prefer to record a value of 0 so that there is no doubt that all eggs were examined and none of this outcome were observed.

A new column was created in JMP for the number of eggs that failed to fledge, and after stacking the revised dataset, the dataset in JMP that can be used for logistic regression looks like: 5

4 Other methods that are quite popular are non-iterative weighted least squares and discriminant function analysis. These are beyond the scope of this course.

5 This stacked data is available in the eggsfledge2.jmp dataset available from the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms.

[Stacked JMP data table with columns for Height, Outcome, and Count]

The Analyze->Fit Y-by-X platform is used to launch simple logistic regression:

Note that the Outcome is the actual Y variable (and is nominally scaled), while the Count column simply indicates how many of this outcome were observed. The X variable is Height as before. JMP knows this is a logistic regression by the combination of a nominally or ordinally scaled Y variable and a continuously scaled X variable, as seen by the reminder at the left of the platform dialogue box.

This gives the output:

[JMP output for the logistic regression fit]

The first point to note is that most computer packages make arbitrary decisions on what is a "success" and what is a "failure" when fitting the logistic regression. It is important to always look at the output carefully to see what has been defined as a success. In this case, at the bottom of the output, JMP has indicated that fledged is considered a "success" and not fledged a "failure". If it had reversed the roles of these two categories, everything would be "identical" except reversed appropriately.

Second, rather bizarrely, the actual data points plotted by JMP really don't have any meaning! According to the JMP help screens:

    Markers for the data are drawn at their x-coordinate, with the y position jittered randomly within the range corresponding to the response category for that row.

So if you do the analysis on the exact same data, the data points are jittered and will look different even though the fit is the same. The explanation in the JMP support pages on the web states: 6

    The exact vertical placement of points in the logistic regression plots (for instance, on pages 308 and 309 of the JMP User's Guide, Version 2, and pages 114 and 115 of the JMP Statistics and Graphics Guide, Version 3) has no particular interpretation. The points are placed midway between curves so as to assure their visibility. However, the location of a point between a particular set of curves is important. All points between a particular set of curves have the same observed value for the dependent variable. Of course, the horizontal placement of each point is meaningful with respect to the horizontal axis.

This is rather unfortunate, to say the least! This means that the user must create a nice plot by hand. This plot should show the estimated proportions as a function of height, with the fitted curve then overdrawn.

Fortunately, the fitted curves are correct (whew). The curve doesn't look linear only because JMP has transformed back from the log-odds scale to the regular probability scale. A line on the log-odds scale has a characteristic "S" shape on the regular probability scale, with the ends of the curve flattening out at 0 and 1. Using the Cross Hairs tool, you can read the predicted probability of success (fledging) off the fitted curve for any height; for example, a height of 5 m gives a predicted probability of about .39, and the predicted probability climbs rapidly at greater heights.

The table of parameter estimates gives the estimated fit on the log-odds scale:

LÔ = −4.03 + .72(Height)

Substituting in the value Height = 5 gives an estimated log-odds of −.43, which on the regular probability scale corresponds to .394.

The coefficient associated with height is interpreted as the increase in the log-odds of fledging when height is increased by 1 m.

As in simple regression, the precision of the estimates is given by the standard error. An approximate 95% confidence interval for the coefficient associated with height is found in the usual fashion, i.e. estimate ± 2(se). 7 This confidence interval does NOT include 0; therefore there is good evidence that the probability of fledging is not constant over the various heights.

6 http://www.jmp.com/support/techsup/notes/001897.html

7 It is not possible to display the 95% confidence intervals in the Analyze->Fit Y-by-X platform output by right-clicking in the table (don't ask me why not). However, if the Analyze->Fit Model platform is used to fit the model, then right-clicking in the Estimates table does make the 95% confidence intervals available.


Similarly, the p-value is interpreted in the same way – how consistent is the data with the hypothesis of NO effect of height upon the survival rate? Rather than the t-test seen in linear regression, maximum likelihood methods often construct the test statistic in a different fashion (called χ² likelihood ratio tests). The test statistic is not particularly of interest – only the final p-value matters. In this case, it is well below α = .05, so there is good evidence that the probability of success is not constant across heights. As in all cases, statistical significance is no guarantee of biological relevance.

In theory, it is possible to obtain prediction intervals and confidence intervals for the MEAN probability of success at new values of X – JMP does not provide these in the Analyze->Fit Y-by-X platform with logistic regression. It does do Inverse Predictions and can give confidence bounds on the inverse prediction, which require the confidence bounds to be computed, so it is a mystery to me why the confidence intervals for the mean probability of success at future X values are not provided.

The Analyze->Fit Model platform can also be used to fit a logistic regression in the same way:

Be sure to specify the Y variable as a nominally or ordinally scaled variable; the count as the frequency variable; and the X variables in the usual fashion. The Analyze->Fit Model platform automatically switches to indicate that a logistic regression will be run.

The same information as previously seen is shown again. But you can now obtain 95% confidence intervals for the parameter estimates, and there are additional options under the red-triangle pop-down menu. These features will be explored in more detail in further examples.

Lastly, the Analyze->Fit Model platform using the Generalized Linear Model option in the personality box in the upper right corner can also be used to fit this model. Specify a binomial distribution with the logit link. You get similar results with more goodies under the red-triangles, such as confidence intervals for the MEAN probability of success that can be saved to the data table, residual plots, and more. Again, these will be explored in more detail in the examples.

7.2 Data Structures

There are two common ways in which data can be entered for logistic regression: either as individual observations or as grouped counts.

If individual data points are entered, each line of the data file corresponds to a single individual. The columns will correspond to the predictors (X), which can be continuous (interval or ratio scales) or classification variables (nominal or ordinal). The response (Y) must be a classification variable with any two possible outcomes 8. Most packages will arbitrarily choose one of these classes to be the success – often this is the first category when sorted alphabetically. I would recommend that you do NOT code the response variable as 0/1 – it is far too easy to forget that the 0/1 correspond to nominally or ordinally scaled variables and not to continuous variables.

As an example, suppose you wish to predict if an egg will hatch given the height in a tree. The data structure for individuals would look something like:

Egg   Height   Outcome
1     10       hatch
2     15       not hatch
3      5       hatch
4     10       hatch
5     10       not hatch
...

Notice that even though three eggs were all at a height of 10 m, separate data lines for each of the three eggs appear in the data file.

In grouped counts, each line in the data file corresponds to a group of events with the same predictor (X) variables. Often researchers record the number of events and the number of successes in two separate columns, or the number of successes and the number of failures in two separate columns. This data must be converted to two rows per group – one for the successes and one for the failures – with one variable representing the outcome and a second variable representing the frequency of this event. The outcome will be the Y variable, while the count will be the frequency variable. 9

8 In more advanced classes this restriction can be relaxed.

For example, the above data could be originally entered as:

Height   Hatch   Not Hatch
10       2       1
15       0       1
 5       1       0
...

but must be translated (e.g. using the Tables → Stack command) to:

Height   Outcome     Count
10       Hatch       2
10       Not Hatch   1
15       Hatch       0
15       Not Hatch   1
...
 5       Hatch       1
 5       Not Hatch   0

While it is not required that counts of zero have data lines present, it is good statistical practice to remind yourself that you did look for failures but failed to find any.
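The stacking step can be sketched in a few lines of Python (a toy illustration of the Tables → Stack operation, not JMP code):

```python
def stack(rows):
    """Convert wide counts (height, hatched, not hatched) into the long
    (height, outcome, count) layout required for logistic regression."""
    out = []
    for height, hatched, not_hatched in rows:
        out.append((height, "Hatch", hatched))
        out.append((height, "Not Hatch", not_hatched))
    return out

wide = [(10, 2, 1), (15, 0, 1), (5, 1, 0)]   # the small example above
for row in stack(wide):
    print(row)
```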

7.3 Assumptions made in logistic regression

Many of the assumptions made for logistic regression parallel those made for ordinary regression, with obvious modifications.

1. Check sampling design. In these course notes it is implicitly assumed that the data are collected either as a simple random sample or under a completely randomized design experiment. This implies that the units selected must be a random sample (with equal probability) from the relevant populations, or complete randomization during the assignment of treatments to experimental units. The experimental unit must equal the observational unit (no pseudo-replication), and there must be no pairing, blocking, or stratification.

It is possible to generalize logistic regression to cases where pairing, blocking, or stratification took place (for example, in case-control studies), but these are not covered during this course.

9 Refer to the section on Poisson regression for an alternate way to analyze this type of data where the count is the response variable.


Common ways in which this assumption is violated include:

• Collecting data under a cluster design. For example, classrooms are selected at random from a school district and individuals within a classroom are then measured. Or herds or schools of animals are selected and all individuals within the herd or school are measured.

• Quota samples are used to select individuals with certain classifications. For example, exactly 100 males and 100 females are sampled and you are trying to predict sex as the outcome measure.

2. No outliers. This is usually pretty easy to check. A logistic regression only allows two categories within the response variable. If there are more than two categories of responses, this may represent a typographical error and should be corrected, or categories should be combined into larger categories. It is possible to generalize logistic regression to the case of more than two possible outcomes. Please contact a statistician for assistance.

3. Missing values are MCAR. The usual assumption as listed in earlier chapters.

4. Binomial distribution. This is a crucial assumption. A binomial distribution is appropriate when there is a fixed number of trials at a given set of covariates (could be 1 trial); there is a constant probability of "success" within that set of trials; each trial is independent; and the number of successes in the n trials is measured.

Common ways in which this assumption is violated are:

• Items within a set of trials do not operate independently of each other. For example, subjects could be litter mates, twins, or share environmental variables. This can lead to over- or under-dispersion.

• The probability of success within the set of trials is not constant. For example, suppose a set of trials is defined by weight class. Not everyone in the weight class is exactly the same weight, and so their probability of "success" could vary. Animals don't all have exactly the same survival rates.

• The number of trials is not fixed. For example, sampling could occur until a certain number of successes occur. In this case, a negative binomial distribution would be more appropriate.

5. Independence among subjects. See above.

7.4 Example: Space Shuttle - Single continuous predictor

In January 1986, the space shuttle Challenger was destroyed on launch. Subsequent investigations showed that an O-ring, a piece of rubber used to seal two segments of the booster rocket, failed, allowing highly flammable fuel to leak, light, and destroy the ship. 10

As part of the investigation, the following chart of previous launches and the temperature at which the shuttle was launched was presented:

10 Refer to http://en.wikipedia.org/wiki/Space_Shuttle_Challenger_disaster.

[Chart of O-ring failures versus launch temperature]

The raw data is available in the JMP file spaceshuttleoring.jmp, available from the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms.

Notice that the raw data has a single line for each previous launch, even though there are multiple launches at some temperatures. The X variable is temperature and the Y variable is the outcome – either f for failure of the O-ring, or OK for a launch where the O-ring did not fail.


With the data in single-observation format, it is impossible to make a simple plot of the empirical logistic function. If some of the temperatures were pooled, you might be able to make a simple plot.

The Analyze->Fit Y-by-X platform was used and gave the following results:

First, notice that JMP treats a failure f as a "success", and will model the probability of failure as a function of temperature. This is why it is important that you examine computer output carefully to see exactly what a package is doing.

The graph showing the fitted logistic curve must be interpreted carefully. While the plotted curve is correct, the actual data points are randomly placed – groan – see the notes in the previous section.

The estimated model is:

logit(failure) = 10.875 − .17(temperature)

So the log-odds of failure decrease by .17 (se .083) units for every degree (°F) increase in launch temperature. Conversely, the log-odds of failure increase by .17 for every degree (°F) decrease in temperature. The p-value for no effect of temperature is just below α = .05.

Using the same reasoning as was done for ordinary regression, the odds of failure increase by a factor of e^.17 = 1.18, i.e. almost an 18% increase per degree drop.

To predict the failure rate at a given temperature, a two-stage process is required. First, estimate the log-odds by substituting in the X values of interest. Second, convert the estimated log-odds to a probability using:

p(x) = e^LO(x) / (1 + e^LO(x)) = 1 / (1 + e^(−LO(x)))

The actual launch was at 32 °F. While it is extremely dangerous to try to predict outside the range of observed data, the estimated log-odds of failure of the O-ring is 10.875 − .17(32) = 5.43, and then p(failure) = e^5.43 / (1 + e^5.43) = .99+, i.e. well over 99%!
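The two-stage prediction can be checked numerically; a short Python sketch using the fitted coefficients from the text:

```python
import math

def inv_logit(lo):
    """Back-translate a log-odds value to a probability."""
    return 1.0 / (1.0 + math.exp(-lo))

def p_failure(temp_f):
    """Estimated P(O-ring failure) from the fitted model in the text:
    logit(failure) = 10.875 - 0.17 * temperature (deg F)."""
    return inv_logit(10.875 - 0.17 * temp_f)

print(round(p_failure(32), 3))   # the actual launch temperature; well over 99%
```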

It is possible to find confidence bounds for these predictions – the easiest way is to create some "dummy" rows in the data table corresponding to the future predictions, with the response variable left blank. Use JMP's Exclude Rows feature to exclude these rows from the model fit. Then use the red-triangle to save predictions and confidence bounds back to the data table.

The Analyze->Fit Model platform gives the same results with additional analysis options that we will examine in future examples.

The Analyze->Fit Model platform using the Generalized Linear Model option also gives the same results with additional analysis options. For example, it is possible to compute confidence intervals for the predicted probability of success at the new X. Use the pop-down menu beside the red-triangle:

©2012 Carl James Schwarz 549 November 23, 2012

CHAPTER 7. LOGISTIC REGRESSION

The predicted values and 95% confidence intervals for the predicted probability are stored in the data table:


These are found by finding the predicted log-odds and a 95% confidence interval for the predicted log-odds, and then inverting the confidence interval endpoints in the same way as the predicted probabilities are obtained from the predicted log-odds.

While the predicted value and the 95% confidence interval are available, for some odd reason the se of the predicted probability is not presented – this is odd as it is easily computed. The confidence intervals are quite wide given that there were only 24 data values and only a few failures.

It should be noted that only predictions of the probability of success and confidence intervals for the probability of success are computed. These intervals would apply to all future subjects that have the particular value of the covariates. Unlike the case of linear regression, it really doesn’t make sense to predict individual outcomes as these are categories. It is sensible to look at which category is most probable and then use this as a “guess” for the individual response, but that is about it. This area of predicting categories for individuals is called discriminant analysis and has a long history in statistics. There are many excellent books on this topic.

7.5 Example: Predicting Sex from physical measurements - Multiple continuous predictors

The extension to multiple continuous X variables is immediate. As before, there are now several predictors. It is usually highly unlikely to have multiple observations with exactly the same set of X values, so the data sets usually consist of individual observations.

Let us proceed by example using the Fitness data set available in the JMP sample data library. This dataset has variables on age, weight, and measurements of performance while performing a fitness assessment. In this case we will try and predict the sex of the subject given the various attributes.

As usual, before doing any computations, examine the data for unusual points. Look at pairwise plots, the pattern of missing values, etc.

It is important that the data be collected under a completely randomized design or simple random sample. If your data are collected under a different design, e.g. a cluster design, please seek suitable assistance.

Use the Analyze->Fit Model platform to fit a logistic regression trying to predict sex from the age, weight, oxygen consumption and run time:


This gives the summary output:


First determine which category is being predicted. In this case, the sex = f category will be predicted.

The Whole Model Test examines if there is evidence of any predictive ability in the 4 predictor variables. The p-value is very small, indicating that there is predictive ability.

Because we have NO categorical predictors, the Effect Tests can be ignored for now. The Parameter Estimates look for the marginal contribution of each predictor to predicting the probability of being a Female. Just like in regular regression, these are MARGINAL contributions, i.e. how much would the log-odds for the probability of being female change if this variable changed by one unit and all other variables remained in the model and did not change. In this case, there is good evidence that weight is a good predictor (not surprisingly), but also some evidence that oxygen consumption may be useful. 11 If you look at the dot plots for the weight for the two sexes and for the oxygen consumption for the two sexes, the two groups seem to be separated on these variables:

11 The output above actually appears to be a bit contradictory. The chi-square value for the effect of weight is 17 with a p-value < .0001. Yet the 95% confidence interval for the coefficient associated with weight ranges from (−1.57 → .105), which INCLUDES zero, and so would not be statistically significant! It turns out that JMP has mixed two (asymptotically) equivalent methods in this one output. The chi-square value and p-value are computed using a likelihood ratio test (a model with and without this variable is fit and the difference in fit is measured), while the confidence intervals are computed using a Wald approximation (estimate ± 2(se)). In small samples, the sampling distribution for an estimate may not be very symmetric or close to normally shaped, and so the Wald intervals may not perform well.
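To see the size of the disagreement the footnote describes, one can back out the implied Wald standard error from the printed interval (a rough sketch, assuming the estimate ± 2(se) convention stated above):

```python
# Sketch: recover the Wald standard error implied by the reported 95% CI
# for the weight coefficient (assuming the CI is estimate +/- 2*se), then
# compare the implied Wald chi-square to the likelihood-ratio chi-square
# of 17 reported in the same JMP output.
lower, upper = -1.57, 0.105          # reported Wald CI for the weight coefficient
estimate = (lower + upper) / 2       # midpoint recovers the point estimate
se = (upper - lower) / 4             # half-width divided by 2, since CI = est +/- 2*se

wald_chisq = (estimate / se) ** 2    # Wald test statistic, (estimate/se)^2

print(f"estimate = {estimate:.3f}, se = {se:.3f}")
print(f"implied Wald chi-square = {wald_chisq:.1f} vs likelihood-ratio chi-square = 17")
```

The Wald statistic is far smaller than the likelihood-ratio statistic, which is exactly the small-sample disagreement the footnote warns about.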


The estimated coefficient for weight is −.73. This indicates that the log-odds of being female decrease by .73 for every additional unit of weight, all other variables held fixed. This often appears in scientific reports as the adjusted effect of weight – the adjusted term implies that it is the marginal contribution. Confidence intervals for the individual coefficients (for predicting the log-odds of being female) are interpreted in the same way.

Just like in regular regression, collinearity can be a problem in the X values. There is no easy test for collinearity in logistic regression in JMP, but similar diagnostics as in ordinary regression are becoming available.

Before dropping more than one variable, it is possible to test if two or more variables can be dropped. Use the Custom Test options from the drop-down menu:


Complete the boxes in a similar way as in ordinary linear regression. For example, to test if both age and runtime can be dropped:

which gives:


It appears safe to drop both variables.

Just as in regular regression, you can fit quadratic and product terms to try and capture some non-linearity in the log-odds. This affects the interpretation of the estimated coefficients in the same way as in ordinary regression. The simpler model involving weight and oxygen consumption, their quadratic terms, and cross product term was fit using the Analyze->Fit Model platform:


Surprisingly, the model has problems:


Ironically, it is because the model is too good a fit. It appears that you can discriminate perfectly between men and women by fitting this model. Why does a perfect fit cause problems? The reason is that if p(sex = f) = 1, the log-odds is then +∞, and it is hard to get a predicted value of ∞ from an equation without some terms also being infinite.

If you plot the weight against oxygen consumption using different symbols for males and females, you can see the near complete separation based on simply looking at oxygen consumption and weight without the need for quadratic and cross products:


I’ll continue by fitting just a model with linear effects of weight and oxygen consumption as an illustration. Use the Analyze->Fit Model platform to fit this model with just the two covariates:


Both covariates are now statistically significant and cannot be dropped.

The Goodness-of-fit statistic is computed in two ways (which are asymptotically equivalent), but both are tedious to compute by hand. The Deviance of a model is a measure of how well a model performs. As there are 31 data points, you could get a perfect fit by fitting a model with 31 parameters – this is exactly what happens if you try and fit a line through 2 points, where 2 parameters (the slope and intercept) will fit exactly two data points. A measure of goodness of fit is then found for the model in question based on the fitted parameters of this model. In both cases, the measure of fit is called the deviance, which is simply twice the negative of the log-likelihood, which in turn is related to the probability of observing this data given the parameter values. The difference in deviances is the deviance goodness-of-fit statistic. If the current model is a good model, the difference in deviance should be small (this is the column labeled chi-square). There is no simple calibration of deviances 12, so a p-value must be found which says how large this difference is. The p-value of .96 indicates that the difference is actually quite small – almost 96% of the time you would get a larger difference in deviances.

Similarly, the row labeled the Pearson goodness-of-fit is based on the same idea. A perfect fit is obtained with a model of 31 parameters. A comparison of the observed and predicted values is found for the model with 3 parameters. How big is the difference in fit? How unusual is it?

NOTE that for goodness-of-fit tests, you DO NOT WANT TO REJECT the null hypothesis. Hence p-values for a goodness-of-fit test that are small (e.g. less than α = .05) are NOT good!

12 The df = 31 − 3 = 28.
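The chi-square calibration in footnote 12 can be sketched without a statistics library: for even df the chi-square survival function has a closed form, so we can invert the reported p-value of .96 (the deviance difference itself is read off JMP’s output, so this just shows what size of difference the p-value corresponds to):

```python
import math

def chi2_sf(x, df):
    """Survival function P(X > x) for a chi-square with EVEN df, using the
    closed form exp(-x/2) * sum_{k < df/2} (x/2)^k / k!."""
    assert df % 2 == 0
    half = x / 2.0
    term, total = 1.0, 1.0
    for k in range(1, df // 2):
        term *= half / k
        total += term
    return math.exp(-half) * total

# df = 31 observations - 3 parameters = 28. Invert the reported p-value
# of .96 by bisection to find the implied deviance difference.
lo, hi = 0.0, 28.0
for _ in range(60):
    mid = (lo + hi) / 2
    if chi2_sf(mid, 28) > 0.96:
        lo = mid          # statistic still too small; move right
    else:
        hi = mid

print(f"deviance difference of about {lo:.1f} gives p = .96, "
      f"well below its expected value of 28 under a good fit")
```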


So for this model, there is no reason to be upset with the fit.

The residual plots look strange, but this is an artifact of the data:

Along the bottom axis is the predicted probability of being female. Now consider a male subject. If the predicted probability of being female is small (e.g. close to 0 because the subject is quite heavy), then there is an almost perfect agreement of the observed response with the predicted probability. If you compute a residual by defining male=0 and female=1, then the residual here would be computed as (obs − predicted)/se(predicted) = (0 − 0)/blah = 0. This corresponds to points near the (0,0) area of the plots.

What about males whose predicted probability of being female is almost .7 (which corresponds to observation 15)? This is a poor prediction, and the residual is computed as (0 − .7)/se(predicted), which is approximately equal to (0 − .7)/√(.7(.3)) ≈ −1.52, with some further adjustment to compute the se of the predicted value. This corresponds to the point near (.7, −1.5).

On the other hand, a female with a predicted probability of being female of .7 will have a residual equal to approximately (1 − .7)/√(.7(.3)) = .65.

Hence the two lines on the graph correspond to males and females respectively. What you want to see is this two-parallel-line system, particularly with few males near the probability of being female close to 1, and few females with probability of being female close to 0.
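The rough residual arithmetic above can be sketched directly (ignoring, as the text notes, the further adjustment for the se of the predicted value):

```python
import math

# Sketch of the rough residual described above: a Pearson-style residual
# (obs - p_hat) / sqrt(p_hat * (1 - p_hat)), without JMP's further
# adjustment for the se of the predicted value.
def rough_residual(obs, p_hat):
    """obs is 0 (male) or 1 (female); p_hat is the predicted P(female)."""
    return (obs - p_hat) / math.sqrt(p_hat * (1 - p_hat))

print(rough_residual(0, 0.7))   # a male predicted .7 female: about -1.5
print(rough_residual(1, 0.7))   # a female predicted .7 female: about .65
```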

There are four possible residual plots available in JMP – they are all based on a similar procedure with minor adjustments in the way they compute a standard error. Usually, all four plots are virtually the same – anomalies among the plots should be investigated carefully.

7.6 Examples: Lung Cancer vs. Smoking; Marijuana use of students based on parental usage - Single categorical predictor

7.6.1 Retrospective and Prospective odds-ratio

In this section, the case where the predictor (X) variable is also a categorical variable will be examined. As seen in multiple linear regression, categorical X variables are handled by the creation of indicator variables. A categorical variable with k classes will generate k − 1 indicator variables. As before, there are many ways to define these indicator variables and the user must examine the computer software carefully before using any of the raw estimated coefficients associated with a particular indicator variable.

It turns out that there are multiple ways to analyze such data – all of which are asymptotically equivalent. Also, this particular topic is usually divided into two sub-categories – problems where there are only two levels of the predictor variable and cases where there are three or more levels of the predictor variable. This division actually has a good reason – it turns out that in the case of 2 levels for the predictor and 2 levels for the response variable (the classic 2 × 2 contingency table), it is possible to use a retrospective study and actually get valid estimates of the prospective odds ratio.

For example, suppose you were interested in looking at the relationship between smoking and lung cancer. In a prospective study, you could randomly select 1000 smokers and 1000 non-smokers from their relevant populations and follow them over time to see how many developed lung cancer. Suppose you obtained the following results:

Cohort       Lung Cancer   No lung cancer
Smokers              100              900
Non-smoker            10              990

Because this is a prospective study, it is quite valid to say that the probability of developing lung cancer if you are a smoker is 100/1000 and the probability of developing lung cancer if you are not a smoker is 10/1000. The odds of developing cancer if you are a smoker are 100:900 and the odds of developing cancer if you are a non-smoker are 10:990. The odds ratio of developing cancer of a smoker vs. a non-smoker is then

OR(LC)_{S vs. NS} = (100 : 900) / (10 : 990) = 11 : 1
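The same odds ratio falls out of the cross-product of the table entries, which is the convenient way to compute it (a small sketch):

```python
# Sketch: the prospective odds ratio from the 2x2 smoking table above.
# (a:b) / (c:d) is the cross-product ratio (a*d) / (b*c).
def odds_ratio(a, b, c, d):
    return (a * d) / (b * c)

or_prosp = odds_ratio(100, 900, 10, 990)   # smokers vs non-smokers
print(or_prosp)   # 11.0
```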


But a prospective study takes too long, so an alternate way of studying the problem is to do a retrospective study. Here samples of 1000 people with lung cancer and 1000 people without lung cancer are selected at random from their respective populations. For each subject, you determine if they smoked in the past. Suppose you get the following results:

Lung Cancer   Smoker   Non-smoker
yes              810          190
no               280          720

Now you can’t directly find the probability of lung cancer if you are a smoker. It is NOT simply 810/(810 + 280) because you selected equal numbers of people with and without lung cancer, while less than 30% of the population generally smokes. Unless that proportion is known, it is impossible to compute the probability of getting lung cancer if you are a smoker or non-smoker directly, and so it would seem that finding the odds of lung cancer would be impossible.

However, not all is lost. Let P(smoker) represent the probability that a randomly chosen person is a smoker; then P(non-smoker) = 1 − P(smoker). Bayes’ Rule 13 gives:

P(lung cancer | smoker) = P(smoker | lung cancer) P(lung cancer) / P(smoker)

P(no lung cancer | smoker) = P(smoker | no lung cancer) P(no lung cancer) / P(smoker)

P(lung cancer | non-smoker) = P(non-smoker | lung cancer) P(lung cancer) / P(non-smoker)

P(no lung cancer | non-smoker) = P(non-smoker | no lung cancer) P(no lung cancer) / P(non-smoker)

This doesn’t appear to be helpful, as P(smoker) and P(non-smoker) are unknown. But look at the odds-ratio of getting lung cancer of a smoker vs. a non-smoker:

OR(LC)_{S vs. NS} = ODDS(lung cancer if smoker) / ODDS(lung cancer if non-smoker)
                  = [P(lung cancer | smoker) / P(no lung cancer | smoker)] / [P(lung cancer | non-smoker) / P(no lung cancer | non-smoker)]

If you substitute in the above expressions, you find that:

OR(LC)_{S vs. NS} = [P(smoker | lung cancer) / P(smoker | no lung cancer)] / [P(non-smoker | lung cancer) / P(non-smoker | no lung cancer)]

which can be computed from the retrospective study. Based on the above table, we obtain

OR(LC)_{S vs. NS} = (.810/.280) / (.190/.720) = 11 : 1

This symmetry in odds-ratios between prospective and retrospective studies only works in the 2 × 2 case for simple random sampling.
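A quick check that the retrospective table really does recover the prospective odds ratio (a sketch; the individual probabilities remain unrecoverable without P(smoker)):

```python
# Sketch: the retrospective 2x2 table yields the same odds ratio via the
# cross-product ratio, even though P(lung cancer | smoker) itself cannot
# be computed without knowing P(smoker).
def cross_product_ratio(a, b, c, d):
    # rows of the table: (smoker, non-smoker) for lung cancer yes / no
    return (a * d) / (b * c)

or_retro = cross_product_ratio(810, 190, 280, 720)
print(round(or_retro, 2))   # 10.96, i.e. about 11:1, matching the prospective study
```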

13 See http://en.wikipedia.org/wiki/Bayes_rule


7.6.2 Example: Parental and student usage of recreational drugs

A study was conducted where students at a college were asked about their personal use of marijuana and if their parents used alcohol and/or marijuana. 14 The following data is a collapsed version of the table that appears in the report:

                 Student Usage
Parental Usage    Yes      No
Yes               125      85
No                 94     141

This is a retrospective analysis as the students are interviewed and past behavior of parents is recorded.

The data are entered in JMP in the usual format. There will be four lines, and three variables corresponding to parental usage, student usage, and the count.

Start using the Analyze->Fit Y-by-X platform:

14 “Marijuana Use in College,” Youth and Society, 1979, 323-334.


but don’t forget to specify the Count as the frequency variable. It doesn’t matter which variable is entered as the X or Y variable. Note that JMP actually will switch from the logistic platform to the contingency platform 15, as noted by the diagram at the lower left of the dialogue box.

The mosaic plot shows the relative percentages in each of the student usage groups:

15 Refer to the chapter on Chi-square tests.


The contingency table (after selecting the appropriate percentages for display from the red-triangle pop-down menu) 16:

16 In my opinion, I would never display percentages to more than integer values. Displays such as 42.92% are just silly as they imply a precision of 1 part in 10,000 but you only have 219 subjects in the first row.


The contingency table approach tests the hypothesis of independence between the X and Y variable, i.e. is the proportion of parents who use marijuana the same for the two groups of students:

As explained in the chapter on chi-square tests, there are two (asymptotically) equivalent ways to test this hypothesis – the Pearson chi-square statistic and the likelihood ratio statistic. In this case, you would come to the same conclusion.

The odds-ratio is obtained from the red-triangle at the top of the display:


and gives:

It is estimated that the odds of children using marijuana if their parents use marijuana or alcohol are about 2.2 times the odds of a child using marijuana for parents who don’t use marijuana or alcohol. The 95% confidence interval for the odds-ratio is between 1.51 and 3.22. In this case, you would examine if the confidence interval for the odds-ratio includes the value of 1 (why?) to see if anything interesting is happening.
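These numbers can be reproduced from the table with the usual large-sample formulas (a sketch; the log odds-ratio standard error √(1/a + 1/b + 1/c + 1/d) is the standard Wald approximation, not necessarily JMP’s exact computation):

```python
import math

# Sketch: odds ratio and large-sample (Wald) 95% CI for the 2x2
# parental/student usage table.
a, b, c, d = 125, 85, 94, 141        # Yes/Yes, Yes/No, No/Yes, No/No counts
or_hat = (a * d) / (b * c)           # cross-product ratio
se_log_or = math.sqrt(1/a + 1/b + 1/c + 1/d)   # se of log(OR), Wald approximation
ci_lo = math.exp(math.log(or_hat) - 1.96 * se_log_or)
ci_hi = math.exp(math.log(or_hat) + 1.96 * se_log_or)

print(f"OR = {or_hat:.2f}, 95% CI = ({ci_lo:.2f}, {ci_hi:.2f})")
# about 2.21 with CI (1.51, 3.22), matching the output described above
```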

If the Analyze->Fit Model platform is used and a logistic regression is fit:


This gives the output:


The coefficient of interest is the effect of student usage on the no/yes log-odds for parental usage. The test for the effect of student usage has a chi-square test value of 17.02 with a small p-value, which matches the likelihood ratio test from the contingency table approach. Many packages use different codings for categorical X variables (as seen in the section on multiple regression), so you need to check the computer manual carefully to understand exactly what the coefficient measures.

However, the odds-ratio can be found from the red-triangle pop-down menu:


and matches what was seen earlier.

Finally, the Analyze->Fit Model platform can be used with the Generalized Linear Model option:


This gives:


The test for a student effect has the same results as seen previously. But, ironically, there is no easy way to compute the odds ratio. It turns out that given the parameterization used by JMP, the log-odds ratio is twice the coefficient of the student-usage term, i.e. twice −.3955. The odds-ratio would be found as the anti-log of this value, i.e. e^{2×−.3955} = .4522, and the confidence interval for the odds-ratio can be found by anti-logging twice the confidence interval endpoints for this coefficient, i.e. ranging from (e^{2×−.5866} = .31 → e^{2×−.2068} = .66). 17 These values are the inverse of the values seen earlier, but this is an artefact of which category is modelled. For example, the odds ratios are related by:

Parents_{Y vs. N}(student_{Y vs. N}) = 1 / Parents_{N vs. Y}(student_{Y vs. N}) = 1 / Parents_{Y vs. N}(student_{N vs. Y}) = Parents_{N vs. Y}(student_{N vs. Y})
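Under this parameterization, the conversion from coefficient to odds ratio is just arithmetic (a sketch using the rounded values quoted above; as the footnote warns, other packages may use a different coding):

```python
import math

# Sketch: converting the GLM coefficient for student usage into an odds
# ratio, assuming JMP's coding where the log-odds ratio is TWICE the
# reported coefficient.
coef, ci_lo, ci_hi = -0.3955, -0.5866, -0.2068   # coefficient and its 95% CI
or_hat = math.exp(2 * coef)

print(round(or_hat, 2), round(1 / or_hat, 2))
# about 0.45, whose inverse 2.21 matches the odds ratio seen earlier
print(round(math.exp(2 * ci_lo), 2), round(math.exp(2 * ci_hi), 2))
# about (0.31, 0.66), matching the anti-logged interval above
```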

7.6.3 Example: Effect of selenium on tadpole deformities

The generalization of the above to more than two levels of the X variable is straightforward and parallels the analysis of a single-factor CRD ANOVA. Again, we will assume that the experimental design is a completely randomized design or simple random sample.

Selenium (Se) is an essential element required for the health of humans, animals and plants, but becomes a toxicant at elevated concentrations. The most sensitive species to selenium toxicity are oviparous (egg-laying) animals. Ecological impacts in aquatic systems are usually associated with teratogenic effects (deformities) in early life stages of oviparous biota as a result of maternal sequestering of selenium in eggs. In aquatic environments, inorganic selenium, found in water or in sediments, is converted to organic selenium at the base of the food chain (e.g., bacteria and algae) and then transferred through dietary pathways to other aquatic organisms (invertebrates, fish). Selenium also tends to biomagnify up the food chain, meaning that it accumulates to higher tissue concentrations among organisms higher in the food web.

17 This simple relationship may not be true with other computer packages. YMMV.

Selenium often occurs naturally in ores and can leach from mine tailings. This leached selenium can make its way to waterways and potentially contaminate organisms.

As a preliminary survey, samples of tadpoles were selected from a control site and three sites identified as low, medium, and high concentrations of selenium based on hydrologic maps and expert opinion. These tadpoles were examined, and the number that had deformities was counted.

Here is the raw data:

Site      Tadpoles   Deformed   % deformed
Control        208         56         27%
low            687        243         35%
medium         832        329         40%
high           597        283         47%

The data are entered in JMP in the usual fashion:


Notice that the status of the tadpoles as deformed or not deformed is entered along with the count of each status.

As the selenium level has an ordering, it should be declared as an ordinal scale, and the ordering of the values for the selenium levels should be specified using the Column Information → Column Properties → Value Ordering dialogue box.

The hypothesis to be tested can be written in a number of equivalent ways:

• H: p(deformity) is the same for all levels of selenium.

• H: odds(deformity) is the same for all levels of selenium.

• H: log-odds(deformity) is the same for all levels of selenium.

• H: p(deformity) is independent of the level of selenium. 18

• H: odds(deformity) is independent of the level of selenium.

• H: log-odds(deformity) is independent of the level of selenium.

• H: p_C(D) = p_L(D) = p_M(D) = p_H(D), where p_L(D) is the probability of deformities at low doses, etc.

18 The use of independent in the hypothesis is a bit old-fashioned and is not the same as statistical independence.

There are again several ways in which these data can be analyzed.

Start with the Analyze->Fit Y-by-X platform:

This will give a standard contingency table analysis (see the chapter on chi-square tests).

The mosaic plot seems to show an increasing trend in deformities with increasing selenium levels. It is a pity that JMP doesn't display any measure of precision (such as se bars or confidence intervals) on this plot.

The contingency table (with suitable percentages shown 19) also gives the same impression.

19 I would display percentages to the nearest integer. Unfortunately, there doesn't appear to be an easy way to control this in JMP.

A formal test for equality of the proportion of deformities across all levels of the factor gives the following test statistics and p-values:

There are two common test statistics: the Pearson chi-square statistic, which examines the difference between observed and expected counts (see the chapter on chi-square tests), and the likelihood-ratio test, which compares the model when the hypothesis is true vs. the model when the hypothesis is false. The two are asymptotically equivalent. There is strong evidence against the hypothesis of equal proportions of deformities.
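Both statistics can be reproduced by hand from the table of counts. This is a sketch of the computation under the pooled (null) proportion, not a transcript of JMP's output:

```python
import math

# Observed (deformed, not deformed) counts per site, from the table above.
obs = {"Control": (56, 152), "low": (243, 444),
       "medium": (329, 503), "high": (283, 314)}

total_def = sum(d for d, nd in obs.values())
total = sum(d + nd for d, nd in obs.values())
p_pool = total_def / total          # pooled deformity proportion under H0

pearson = 0.0      # sum of (O - E)^2 / E over the 8 cells
lrt = 0.0          # G^2 = 2 * sum of O * log(O / E)
for d, nd in obs.values():
    n = d + nd
    e_d, e_nd = n * p_pool, n * (1 - p_pool)
    pearson += (d - e_d) ** 2 / e_d + (nd - e_nd) ** 2 / e_nd
    lrt += 2 * (d * math.log(d / e_d) + nd * math.log(nd / e_nd))

# Both statistics are referred to a chi-square distribution with 3 df.
```

Both statistics land in the mid-30s here, far out in the tail of a chi-square distribution with 3 df, consistent with the strong evidence reported above.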

Unfortunately, most contingency table analyses stop here. A naked p-value, which indicates that there is evidence of a difference but does not tell you where the differences might lie, is not very informative! In the same way that an ANOVA must be followed by a comparison of the means among the treatment levels, this test should be followed by a comparison of the proportions of deformities among the factor levels.

Logistic regression methods will enable us to estimate the relative odds of deformities among the various classes.

Start with the Analyze->Fit Model platform:

This gives the output:

First, the Effect Tests section tests the hypothesis of equality of the proportion of deformities among the four levels of selenium. The test statistic and p-value match those seen earlier, so there is good evidence of a difference in the deformity proportions among the various levels.

At this point in an ANOVA, a multiple comparison procedure (such as Tukey's HSD) would be used to examine which levels may have different means from the other levels. There is no simple equivalent for logistic regression implemented in JMP. 20 It would be possible to use a simple Bonferroni correction if the number of groups is small.

JMP provides some information on comparisons among the levels. In the Parameter Estimates section, it presents comparisons of the proportion of deformities among the successive levels of selenium. 21 The estimated difference in the log-odds of deformed for the low vs. control group is .39 (se .18). The associated p-value for no difference in the proportion of deformed is .02, which is less than the α = .05 level, so there is evidence of a difference in the proportion of deformed between these two levels.

By requesting the confidence interval and the odds-ratio, these can be transformed to the odds scale (rather than the log-odds scale).

20 This is somewhat puzzling as the theory should be straightforward.

21 This is purely a function of the internal coding used by JMP. Other packages may use different coding. YMMV.


Unfortunately, there is no simple mechanism to do more general contrasts in this variant of the Analyze->Fit Model platform.

The Generalized Linear Model platform in the Analyze->Fit Model platform gives more options:

The output you get is very similar to what was seen previously. Suppose that a comparison between the proportions of deformities at the high and control levels of selenium is wanted.

Use the red-triangle pop-down menu to select the Contrast option:

Then select the radio button for comparisons among selenium levels:

Click on the + and − to form the contrast. Here you are interested in LO_high − LO_control, where the LO are the log-odds for a deformity.

This gives:

The estimated log-odds ratio is .89 (se .18). This implies that the odds-ratio of deformity is e^.89 = 2.43, i.e. the odds of deformity are 2.43 times greater at the high selenium site than at the control site. The p-value is well below α = .05, so there is strong evidence that this effect is real. It is possible to compute the se of the odds-ratio using the Delta method – a pity that JMP doesn't do this directly. 22 An approximate 95% confidence interval for the log-odds ratio could be found using the usual rule of estimate ± 2se. The 95% confidence interval for the odds-ratio would be found by taking anti-logs of the end points.

This procedure could then be repeated for any contrast of interest.
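The back-transformation and the delta-method se can be sketched with the numbers from the output above (working on the log scale and exponentiating at the end):

```python
import math

# High vs. control contrast: estimated log-odds ratio and its se, from the output above.
est, se = 0.89, 0.18

odds_ratio = math.exp(est)                     # about 2.43
ci_log = (est - 2 * se, est + 2 * se)          # approximate 95% CI on the log-odds scale
ci_odds = tuple(math.exp(v) for v in ci_log)   # anti-logs of the end points
se_odds = se * odds_ratio                      # delta-method se of the odds-ratio (footnote 22)
```

The interval on the odds scale runs from roughly 1.7 to 3.5, so the plausible effect sizes are all well above even odds.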

7.7 Example: Pet fish survival as function of covariates - Multiple categorical predictors

There is no conceptual problem in having multiple categorical X variables. Unlike the case of a single categorical X variable, there is no simple contingency table approach. However, in more advanced classes, you will learn about a technique called log-linear modeling that can often be used for these types of tables.

Again, before analyzing any dataset, ensure that you understand the experimental design. In these notes, it is assumed that the design is a completely randomized design or a simple random sample. If your design is more complex, please seek suitable help.

A fish is a popular pet for young children – yet the survival rate of many of these fish is likely poor. What factors seem to influence the survival probabilities of pet fish?

A large pet store conducted a customer follow-up survey of purchasers of pet fish. A number of customers were called and asked about the hardness of the water used for the fish (soft, medium, or hard), where the fish was kept (which was then classified into cool or hot locations within the living dwelling), whether they had previous experience with pet fish (yes or no), and whether the pet fish was alive six months after purchase (yes or no).

Here is the raw data 23:

22 For those so inclined, if θ̂ is the estimator with associated se, then the se of e^θ̂ is found as se(e^θ̂) = se(θ̂) × e^θ̂. In this case, the se of the odds-ratio would be .18 × e^.89 = .44.

23 Taken from Cox and Snell, Analysis of Binary Data.

Softness Temp PrevPet    N  Alive
h        c    n         89     37
h        h    n         67     24
m        c    n        102     47
m        h    n         70     23
s        c    n        106     57
s        h    n         48     19
h        c    y        110     68
h        h    y         72     42
m        c    y        116     66
m        h    y         56     33
s        c    y        116     63
s        h    y         56     29

There are three factors in this study:

• Softness with three levels (h, m, or s);

• Temperature with two levels (c or h);

• Previous ownership with two levels (y or n).

This is a factorial experiment because all 12 treatment combinations appear in the experiment.

The experimental unit is the household. The observational unit is also the household. There is no pseudo-replication.

The randomization structure is likely complete. It seems unlikely that people would pick particular individual fish depending on their water hardness, temperature, or previous history of pet ownership.

The response variable is the Alive/Dead status at the end of six months. This is a discrete binary outcome. For example, in the first row of the data table, there were 37 households where the fish was still alive after 6 months and therefore 89 − 37 = 52 households where the fish had died somewhere in the 6-month interval.
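The dead counts and cell proportions follow from each row of the table; a sketch of that bookkeeping (values copied from the table above):

```python
# (softness, temperature, previous pet, households N, alive at 6 months) per row.
rows = [
    ("h", "c", "n",  89, 37), ("h", "h", "n",  67, 24),
    ("m", "c", "n", 102, 47), ("m", "h", "n",  70, 23),
    ("s", "c", "n", 106, 57), ("s", "h", "n",  48, 19),
    ("h", "c", "y", 110, 68), ("h", "h", "y",  72, 42),
    ("m", "c", "y", 116, 66), ("m", "h", "y",  56, 33),
    ("s", "c", "y", 116, 63), ("s", "h", "y",  56, 29),
]

dead = [n - alive for *_, n, alive in rows]      # first row: 89 - 37 = 52
total_households = sum(n for *_, n, _ in rows)   # just over 1000 households in all
```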

One way to analyze these data would be to compute the proportion of households that had fish alive after six months, and then use a three-factor CRD ANOVA on the estimated proportions. However, each treatment combination is based on a different number of trials (ranging from 48 to 116), which implies that the variance of the estimated proportion is not constant. This violates (though likely not too badly) one of the assumptions of ANOVA – that of constant variance in each treatment combination. Also, this approach seems to throw away data, as roughly 1000 observations are collapsed into 12 cells.
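The non-constant variance can be quantified: the binomial se of an estimated proportion is sqrt(p(1 − p)/n), so the smallest and largest cells give noticeably different precisions. A sketch:

```python
import math

def se_prop(p, n):
    # Binomial standard error of an estimated proportion.
    return math.sqrt(p * (1 - p) / n)

# With p near .5, the smallest (n=48) and largest (n=116) cells differ noticeably.
se_small = se_prop(0.5, 48)    # about .072
se_large = se_prop(0.5, 116)   # about .046
```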

Because the outcome is a discrete binary response and each trial within each treatment is independent, a logistic regression (or generalized linear model) approach can be used.

The data are available in the JMP data file fishsurvive.jmp in the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms. Here is the data file:

To begin with, construct some profile plots to get a feel for what is happening. Create new variables corresponding to the proportion of fish alive and its logit 24. These are created using the formula editor of JMP in the usual fashion. Also, for reasons which will become apparent in a few minutes, create a variable which is the concatenation of the Temperature and Previous Ownership factor levels. This gives:

24 Recall that logit(p) = log(p/(1 − p)).
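The logit transform used for the second profile plot is easy to reproduce outside JMP; a minimal sketch:

```python
import math

def logit(p):
    # logit(p) = log(p / (1 - p)); undefined at p = 0 or p = 1.
    return math.log(p / (1 - p))

# First row of the fish data: 37 of 89 alive.
p_alive = 37 / 89
l_alive = logit(p_alive)   # negative, since p_alive < .5
```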


Now use the Analyze->Fit Y-by-X platform and specify that the p(alive) or logit(alive) is the response variable, with the WaterSoftness as the factor.

Then specify a matching column for the plot (do this on both plots) using the concatenated variable defined above.


This creates the two profile plots 25:

The profile plots seem to indicate that p(alive) tends to increase with water softness if this is a first-time pet owner, and (ironically) tends to decrease for a previous pet owner. Of course, without standard error bars it is difficult to tell if these trends are real or not. The sample sizes in each group are around 100 households. If p(alive) = .5, then the approximate size of a standard error is se = √(.5(.5)/100) = .05, or the approximate 95% confidence intervals are ±.1. It looks as if any trends will be hard to detect with the sample sizes used in this experiment.

25 To get the labels on the graph, set the concatenated variable to be a label variable and the rows corresponding to the h softness level to be labeled rows.

In order to fit a logistic-regression model, you must first create a new variable representing the number Dead in each trial 26, and then stack 27 the Alive and Dead variables, labelling the columns as Status and the Count of each Status, to give the final table:

Whew! Now we can finally fit a model to the data and test for various effects. In JMP 6.0 and later, there are two ways to proceed (both give the same answers, but the generalized linear model platform gives a richer set of outputs). Use the Analyze->Fit Model platform:

26 Use a formula to subtract the number alive from the number of trials.

27 Use the Tables->Stack command.

Notice that the response variable is Status and that the frequency variable is the Count of the number of times each status occurs. The model effects box is filled with each factor's main effect, and the second- and third-order interactions.

This gives the following output:


Check to see exactly what is being modeled. In this case, it is the probability of the first level of the responses, logit(alive).

Then examine the effect tests. Just as in ordinary ANOVA modeling, start with the most complex term and work backwards, successively eliminating terms until nothing more can be eliminated. The third-order interaction is not statistically significant. Eliminate this term from the Analyze->Fit Model dialog box, and refit using only main effects and two-factor interactions. 28

Successive terms were dropped to give the final model:

28 Just like regular ANOVA, you can't examine the p-values of lower-order interaction terms if a higher-order interaction is present. In this case, you can't look at the p-values for the second-order interactions when the third-order interaction is present in the model. You must first refit the model after the third-order interaction is dropped.

It appears that there is good evidence of an effect of Previous Ownership, marginal evidence of an effect of Temperature, and an interaction between water softness and previous ownership. [Because the two-factor interaction was retained, the main effects of softness and previous ownership must be retained in the model even though it looks as if there is no main effect of softness. Refer to the previous notes on two-factor ANOVA for details.]

Save the predicted p(alive) to the data table 29

29 CAUTION: the predicted p(alive) is saved to the data line even if the actual status is dead.

and plot the observed proportions against the predicted values as seen in the regression examples earlier. 30

30 Use the Analyze->Fit Y-by-X platform, and then the Fit Special option to draw a line with slope=1 on the plot.

The plot isn't bad and seems to have captured most of what is happening. Use the Analyze->Fit Y-by-X platform, with the Matching Column as before, to create the profile plot of the predicted values:

It is a pity that JMP gives you no easy way to annotate the standard errors or confidence intervals for the predicted mean p(alive), but the confidence bounds can be saved to the data table.

Unlike regular regression, it makes no sense to make predictions for individual fish.

By using the Contrast pop-down menu, you can estimate the difference in survival rates (but, unfortunately, on the logit scale) as needed. For example, suppose that you wished to estimate the difference in survival rates between fish raised in hard water with no previous experience and fish raised in hard water with previous experience. Use the Contrast pop-down menu:

The contrast is specified by pressing the − and + boxes as needed:

This gives:

Again this is on the logit scale and implies that logit(p(alive))_hn − logit(p(alive))_hy = −.86 (se .22). This is highly statistically significant. But what does this mean? Working backwards, we get:

logit(p(alive)_hn) − logit(p(alive)_hy) = −.86

log[ p(alive)_hn / (1 − p(alive)_hn) ] − log[ p(alive)_hy / (1 − p(alive)_hy) ] = −.86

log[ odds(alive)_hn / odds(alive)_hy ] = −.86

odds(alive)_hn / odds(alive)_hy = e^−.86 = .423

Or, the odds of a fish being alive from a non-owner in hard water are about 1/2 of the odds of a fish being alive from a previous owner in hard water. If you look at the previous graphs, this indeed does match. It is possible to compute a se for this odds-ratio, but that is beyond the scope of this course.
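The back-transformation, and the delta-method se that the notes set aside, can be sketched with the numbers from the contrast output above:

```python
import math

# Contrast on the logit scale: logit(p_hn) - logit(p_hy) = -.86 with se .22.
est, se = -0.86, 0.22

odds_ratio = math.exp(est)   # about .423: odds(alive), non-owner vs. previous owner
se_or = se * odds_ratio      # delta-method se of the odds-ratio (same rule as footnote 22)
```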

7.8 Example: Horseshoe crabs - Continuous and categorical predictors.

As is to be expected, combinations of continuous and categorical X variables can also be fit using reasoning similar to the ANCOVA models discussed in the chapter on multiple regression.

If the categorical X variable has k categories, k − 1 indicator variables will be created using an appropriate coding. Different computer packages use different codings, so you must read the package documentation carefully in order to interpret the estimated coefficients. However, the different codings must, in the end, arrive at the same final estimates of effects.

Unlike the ANCOVA model with continuous responses, there are no simple plots in logistic regression to examine visually the parallelism of the response or the equality of intercepts. 31 Preliminary plots where data are pooled into various classes so that empirical logistic plots can be made seem to be the best that can be done.

As in the ANCOVA model, there are three models that are usually fit. Let X represent the continuous predictor, let Cat represent the categorical predictor, and p the probability of success. The three models are:

• logit(p) = X Cat X∗Cat - different intercepts and slopes for each group;

• logit(p) = X Cat - different intercepts but a common slope (on the logit scale);

• logit(p) = X - the same slope and intercept for all groups - coincident lines.

The choice among these models is made by examining the Effect Tests for the various terms. For example, to select between the first and second model, look at the p-value of the X∗Cat term; to select between the second and third model, examine the p-value for the Cat term.
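These comparisons between nested models amount to likelihood-ratio (drop-in-deviance) tests. A sketch with hypothetical deviance values (illustrative only, not from any real fit); the closed-form chi-square tail probability below is for 3 df, which would apply if Cat had four levels so that X∗Cat contributes 3 parameters:

```python
import math

def chi2_sf_3df(x):
    # P(chi-square with 3 df > x); a closed form exists for odd df.
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)

# Hypothetical deviances from two nested logistic fits (illustrative values only).
dev_full = 185.2       # logit(p) = X + Cat + X*Cat
dev_reduced = 190.1    # logit(p) = X + Cat
lr = dev_reduced - dev_full    # drop-in-deviance statistic
p_value = chi2_sf_3df(lr)      # compare to chi-square with 3 df
```

A small p-value here would argue for keeping the X∗Cat (non-parallelism) term; a large one would support the common-slope model.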

31 This is a general problem in logistic regression because the responses are one of two discrete categories.

These concepts will be illustrated using a dataset on nesting horseshoe crabs 32 that is analyzed in Agresti's book. 33

The design of the study is given in Brockmann, H.J. (1996). Satellite male groups in horseshoe crabs, Limulus polyphemus. Ethology, 102, 1-21. Again, it is important to check that the design is a completely randomized design or a simple random sample. As in regression models, you do have some flexibility in the choice of the X settings, but for a particular weight and color, the data must be selected at random from the relevant population.

Each female horseshoe crab had a male resident in her nest. The study investigated other factors affecting whether the female had any other males, called satellites, residing nearby. These other factors include:

• crab color, where 2=light medium, 3=medium, 4=dark medium, 5=dark;

• spine condition, where 1=both good, 2=one worn or broken, 3=both worn or broken;

• weight;

• carapace width.

The number of satellites was measured; for this example we will convert the number of satellite males into a presence (at least 1 satellite) or absence (no satellites) score.

A JMP dataset crabsatellites.jmp is available from the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms. A portion of the datafile is shown below:

32 See http://en.wikipedia.org/wiki/Horseshoe_crab.

33 These are available from Agresti's web site at http://www.stat.ufl.edu/~aa/cda/sas/sas.html.

Note that the color and spine condition variables should be declared with an ordinal scale despite having numerical codes. The number of satellite males was converted to a presence/absence value using the JMP formula editor.

A preliminary scatter plot of the variables shows some interesting features.

There is a very high positive relationship between carapace width and weight, but there are a few anomalous crabs that should be investigated further, as shown in this magnified plot:

There are three points with weights in the 1200-1300 g range whose carapace widths suggest that the weights should be in the 2200-2300 g range, i.e. a typographical error in the first digit. There is a single crab whose weight suggests a width of 24 cm rather than 21 cm – perhaps a typo in the last digit. Finally, there is one crab which is extremely large compared to the rest of the group. In the analysis that follows, I've excluded these five crabs.
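A screen of this kind can be automated. The rule and the data points below are hypothetical (illustrative only, not the real file): flag a crab whose weight per cm of carapace width falls outside a plausible band, which would catch the first-digit typos described above.

```python
# Hypothetical (width in cm, weight in g) pairs; the second mimics a first-digit typo.
crabs = [(26.0, 2300), (25.8, 1300), (28.2, 2900)]

def suspicious(width, weight, lo=70.0, hi=130.0):
    # Crude screen: grams of weight per cm of carapace width outside [lo, hi].
    return not (lo <= weight / width <= hi)

flags = [suspicious(w, wt) for w, wt in crabs]   # only the 1300 g crab is flagged
```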

The final point also appears to have an unusual number of satellite males compared to the other crabs in the dataset.

The Analyze->Fit Y-by-X platform was then used to examine the differences in means or proportions for the other variables when grouped by the presence/absence score. These are not shown in these notes, but they generally demonstrate some separation in the means or proportions between the two groups, though there is considerable overlap in the individual values. The group with no satellite males tends to have darker colors than the presence group, while the distinction between the spine conditions is not clear cut.

Because of the high correlation between carapace size and weight, the weight variable was used as the continuous covariate and the color variable was used as the discrete covariate.

A preliminary analysis divided weight into four classes (up to 2000 g; 2000-2500 g; 2500-3000 g; and over 3000 g).³⁴ Similarly, a new variable (PA) was created to be 0 (for absence) or 1 (for presence) for the presence/absence of satellite males. The Tables->Summary platform was used to compute the mean PA (which then corresponds to the estimated probability of presence) for each combination of weight class and color:

³⁴ The formula commands of JMP were used.

©2012 Carl James Schwarz. November 23, 2012

CHAPTER 7. LOGISTIC REGRESSION
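The same weight-class summary can be sketched outside JMP. Below is a minimal pandas version; the column names Weight (grams), Color, and Satellites (count of satellite males) are illustrative assumptions, not the names in the original JMP data table:

```python
import pandas as pd

def mean_pa_by_class(crabs: pd.DataFrame) -> pd.DataFrame:
    """Mean presence/absence (PA) for each weight class x color cell.

    Column names 'Weight' (g), 'Color', 'Satellites' are illustrative.
    """
    crabs = crabs.copy()
    # PA = 1 if any satellite males are present, 0 otherwise
    crabs["PA"] = (crabs["Satellites"] > 0).astype(int)
    # the four weight classes used in the preliminary analysis
    crabs["WeightClass"] = pd.cut(
        crabs["Weight"],
        bins=[0, 2000, 2500, 3000, float("inf")],
        labels=["up to 2000 g", "2000-2500 g", "2500-3000 g", "over 3000 g"],
    )
    # the mean of a 0/1 variable estimates the probability of presence
    return (
        crabs.groupby(["Color", "WeightClass"], observed=True)["PA"]
        .mean()
        .reset_index()
    )
```

Averaging the 0/1 PA variable within each cell is exactly what the Tables->Summary step computes.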


Finally, the Analyze->Fit Y-by-X platform was used to plot the probability of presence by weight class, using the Matching Column option to join lines of the same color:


Note that despite the appearance of non-parallelism for the bottom line, the point in the 2500-3000 gram category is based on only 4 crabs and so has very poor precision. Similarly, the point near 100% in the 0-2000 g category is based on a single data point! The parallelism hypothesis may be appropriate.

A generalized linear model using the Analyze->Fit Y-by-X platform was used to fit the most general model using the raw data:

This gives the results:


The p-value for non-parallelism (refer to the line corresponding to the Color*Weight term) is just over α = .05, so there is some evidence that perhaps the lines are not parallel. The parameter estimates are not interpretable without understanding the coding scheme used for the indicator variables. The goodness-of-fit test does not indicate any problems.

Let us continue with the parallel-slopes model by dropping the interaction term. This gives the following results:


There is good evidence that the log-odds of NO males present decrease as weight increases (i.e. the log-odds of a male being present increase as weight increases), with an estimated increase of .0016 in the log-odds per gram increase in weight. There is very weak evidence that the intercepts are different, as the p-value is just under 10%.
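A slope of .0016 per gram is easier to interpret on the odds scale over a larger weight change. For example, over a 100 g increase (this is simple arithmetic on the estimate quoted above, not additional output from the fit):

```python
import math

slope = 0.0016   # estimated change in log-odds of presence per gram (from the fit above)
delta = 100      # consider a 100 gram increase in weight
odds_ratio = math.exp(slope * delta)
print(round(odds_ratio, 3))  # 1.174: odds of a male being present rise about 17% per 100 g
```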

The goodness-of-fit test seems to indicate no problem. The residual plot must be interpreted carefully, but its appearance was explained in a previous section.

The different intercepts will be retained to illustrate how to graph the final model. Use the red-triangle menu to save the predicted probabilities to the data table. Note that you may wish to rename the predicted column to remind yourself that the probability of NO male is being predicted.


Use the Analyze->Fit Y-by-X platform to plot the predicted probability of absence against weight, use the group-by option to separate by color, and then fit a spline (a smooth flexible curve) to draw the four curves:


to give the final plot:


Notice that while the models are linear on the log-odds scale, the plots will show a non-linear shape on the regular probability scale.

The color=5 group appears to be different from the rest. If you do a contrast among the intercepts (not really a good idea, as this could be considered data dredging), you indeed find evidence that the intercept (on the log-odds scale) for color 5 may be different from the average of the intercepts for the other three colors:


7.9 Assessing goodness of fit

As is the case in all model fitting in Statistics, it is important that the model provides an adequate fit to the data at hand. Without such an assessment, the inferences drawn from the model may be misleading or even totally wrong!

One of the “flaws” of many published papers is a lack of detail on how the fit of the model to the data was assessed. The logistic regression model is a powerful statistical tool, but it must be used with caution. Assessing goodness-of-fit for logistic regression models is more difficult than for multiple regression because of the binary (success/failure) nature of the response variable. Nevertheless, many of the methods used in multiple regression have been extended to the logistic regression case.

A nice review paper of the methods of assessing fit is given by:

Hosmer, D. W., Taber, S., and Lemeshow, S. (1991). The importance of assessing the fit of logistic regression models: a case study. American Journal of Public Health, 81, 1630–1635. http://dx.doi.org/10.2105/AJPH.81.12.1630

In any statistical model, there are two components – the structural portion (e.g. the fitted curve) and the residual or noise portion (e.g. the deviation of the actual values from the fitted curve). The process of building a model focuses on the structural portion. Which variables are important in predicting the response? Is the correct scale used (e.g. should x or x² be used)? After the structural model is fit, the analyst should assess the degree of fit.

Assessing goodness-of-fit (GOF) usually entails two stages. First, compute a statistic that summarizes the general fit of the model to the data. Second, compute statistics for individual observations that assess the (lack of) fit of the model to individual observations and their leverage in the fit. This may identify particular observations that are outliers or have undue influence or leverage on the fit. These points need to be inspected carefully, but it is important to remember that data should not be arbitrarily deleted based solely on a statistical measure.

Let π̂ᵢ represent the predicted probability for case i, whose response yᵢ is either 0 (for failure) or 1 (for success). The deviance of a point is defined as

dᵢ = √( 2 | ln( π̂ᵢ^yᵢ (1 − π̂ᵢ)^(1−yᵢ) ) | )

and is basically a function of the log-likelihood for that observation.

The total deviance is defined as:

D = Σ dᵢ²

Another statistic, the Pearson residual, is defined as:

rᵢ = (yᵢ − π̂ᵢ) / √( π̂ᵢ (1 − π̂ᵢ) )

and the Pearson chi-square statistic is defined as:

χ² = Σ rᵢ²
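These quantities are straightforward to compute directly from the responses and fitted probabilities. A minimal numpy sketch (the signed version of dᵢ shown here is the usual deviance residual; its square matches the unsigned definition above):

```python
import numpy as np

def deviance_resid(y, pihat):
    """Signed deviance residual; d_i^2 = 2|ln(pihat^y (1-pihat)^(1-y))|."""
    loglik = y * np.log(pihat) + (1 - y) * np.log(1 - pihat)
    return np.sign(y - pihat) * np.sqrt(2 * np.abs(loglik))

def pearson_resid(y, pihat):
    """r_i = (y_i - pihat_i) / sqrt(pihat_i (1 - pihat_i))."""
    return (y - pihat) / np.sqrt(pihat * (1 - pihat))

# small worked example with made-up responses and fitted probabilities
y = np.array([1, 0, 1, 0])
pihat = np.array([0.8, 0.3, 0.5, 0.9])
D = np.sum(deviance_resid(y, pihat) ** 2)   # total deviance
X2 = np.sum(pearson_resid(y, pihat) ** 2)   # Pearson chi-square statistic
```

Note how the poorly predicted last case (y = 0 with π̂ = 0.9) dominates both sums.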

The summary statistics D and χ² each have degrees of freedom approximately equal to n − (p + 1), where p is the number of predictor variables, but they don't have any nice distributional forms (i.e. you can't assume that they follow a chi-square distribution). This is because the individual components are essentially formed from an n × 2 contingency table with all counts 1 or 0, so the problem of small expected counts found in chi-square tests is quite serious. Consequently, any p-value reported for these overall goodness-of-fit measures is not very reliable; about the only useful thing to do is compare these statistics to their degrees of freedom to compute an approximate variance inflation factor, as seen earlier in the Fitness example.

One strategy for sparse tables is to pool. The Hosmer-Lemeshow test divides the data into 10 groups of equal sizes based on the deciles of the fitted values. The observed and expected counts are computed by summing the estimated probabilities and the observed values in the usual fashion, and then a standard chi-square goodness-of-fit statistic is computed. It is compared to a chi-square distribution with 8 df.

Any assessment of goodness of fit should then start with an examination of the D, χ², and Hosmer-Lemeshow statistics, followed by a careful evaluation of the individual terms dᵢ and rᵢ.
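The pooling test can be sketched in a few lines of numpy. This version forms near-equal groups by sorting on the fitted values; exact decile handling and tie-breaking vary between packages, so treat it as illustrative:

```python
import numpy as np
from scipy.stats import chi2

def hosmer_lemeshow(y, pihat, g=10):
    """Pool cases into g near-equal groups by sorted fitted value, then
    compare observed and expected counts with a chi-square statistic
    on g - 2 degrees of freedom."""
    order = np.argsort(pihat)
    stat = 0.0
    for idx in np.array_split(order, g):
        n = len(idx)
        obs = y[idx].sum()        # observed successes in the group
        exp = pihat[idx].sum()    # expected successes = sum of fitted probs
        # success cell plus failure cell; ((n-obs)-(n-exp))^2 = (obs-exp)^2
        stat += (obs - exp) ** 2 / exp + (obs - exp) ** 2 / (n - exp)
    return stat, chi2.sf(stat, g - 2)
```

With g = 10 groups the reference distribution has the 8 df mentioned above.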

To start with, examine the residual plots. Suppose we wish to predict membership in a category as a function of a continuous covariate. For example, can we predict the sex of an individual based on their weight?

Again refer to the Fitness dataset. The (generalized linear) model is:

Yᵢ distributed as Binomial(pᵢ)
φᵢ = logit(pᵢ)
φᵢ = Weight

The residual plot is produced automatically from the Generalized Linear Model option of the Analyze->Fit Model platform and looks like:³⁵

³⁵ I added reference lines at zero, 2, and −2 by clicking on the Y axis of the plot.


This plot looks a bit strange!

Along the bottom of the plot is the predicted probability of being female.³⁶ This is found by substituting the weight of each person into the estimated linear part, and then back-transforming from the logit scale to the ordinary probability scale. The first point on the plot, identified by a square box, is from a male who weighs over 90 kg. The predicted probability of being female is very small, about 5%.

The first question is exactly how a residual is defined when the Y variable is a category. For example, how would the residual for this point be computed? It makes no sense to simply take the observed value (male) minus the predicted probability (.05).

Many computer packages redefine the categories using 0 and 1 labels. Because JMP was modeling the probability of being female, all males are assigned the value of 0, and all females the value of 1. Hence the residual for this point is 0 − .05 = −0.05 which, after studentization, is plotted as shown.

The bottom line in the residual plot corresponds to the male subjects; the top line corresponds to the female subjects. Where are the areas of concern? You would be concerned about females who have a very small predicted probability of being female, and males who have a large predicted probability of being female. These are located in the plot in the circled areas.

³⁶ The first part of the output from the platform states that the probability of being female is being modeled.

The residual plot's strange appearance is an artifact of the modeling process.
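The two-band appearance can be reproduced directly: for a binary response coded 0/1, the raw residual at fitted probability π̂ is either 1 − π̂ or −π̂, so the points must fall on two monotone bands whatever the fit:

```python
import numpy as np

# fitted probabilities along the horizontal axis of the residual plot
pihat = np.linspace(0.05, 0.95, 19)

# raw residual = observed (0 or 1) minus fitted probability
resid_coded_1 = 1 - pihat   # subjects coded 1 (here, females): the top band
resid_coded_0 = 0 - pihat   # subjects coded 0 (here, males): the bottom band
```

Every subject coded 1 sits above zero and every subject coded 0 sits below, which is why the plot separates into two stripes; studentization rescales the bands but does not merge them.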

What happens if the predictors in a logistic regression are also categorical? Based on what was seen for the ordinary regression case, you can expect to see a set of vertical lines. But there are only two possible responses, so the plot reduces to a (non-informative) set of lattice points.

For example, consider predicting the survival rates of Titanic passengers as a function of their sex. This model is:

Yᵢ distributed as Binomial(pᵢ)
φᵢ = logit(pᵢ)
φᵢ = Sex

The residual plot is produced automatically from the Generalized Linear Model option of the Analyze->Fit Model platform and looks like:³⁷

³⁷ I added reference lines at zero, 2, and −2 by clicking on the Y axis of the plot.


The same logic applies as in the previous sections. Because Sex is a discrete predictor with two possible values, there are only two possible predicted probabilities of survival, corresponding to the two vertical lines in the plot. Because the response variable is categorical, it is converted to 0 or 1 values, and the residuals are computed, which then correspond to the two dots in each vertical line. Note that each dot represents several hundred data values!

This residual plot is rarely informative – after all, if there are only two outcomes and only two categories for the predictors, some people have to lie in the two outcomes for each of the two categories of predictors.

The leverage of a point measures how extreme the set of predictors is relative to the rest of the predictors in the study. Leverage in logistic regression depends not only on this distance, but also on the weight in the predictions, which is a function of π(1 − π). Consequently, points with very small (i.e. π̂ᵢ < 0.15) or very large (i.e. π̂ᵢ > 0.85) predicted probabilities actually have little weight on the fit, and the maximum leverage occurs at points where the predicted probability is close to 0.15 or 0.85.

Hosmer et al. (1991) suggest plotting the leverage of each point vs. π̂ᵢ to determine the regions where the leverage is highest. These values may not be available in your package of choice.

Hosmer et al. (1991) also suggest computing Cook's distance – how much the regression coefficients change if a case is dropped from the model. These values may not be available in your package of choice.
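If your package does not report these, both can be sketched from the design matrix and the fitted probabilities. The hat diagonal below uses the weighted form described above; the Cook's-distance analogue is a commonly used one-step approximation, not necessarily the exact quantity Hosmer et al. compute:

```python
import numpy as np

def logistic_leverage(X, pihat):
    """Diagonal of H = W^(1/2) X (X'WX)^(-1) X' W^(1/2),
    with W = diag(pihat (1 - pihat)); X includes an intercept column."""
    Xw = X * np.sqrt(pihat * (1 - pihat))[:, None]
    H = Xw @ np.linalg.solve(Xw.T @ Xw, Xw.T)
    return np.diag(H)

def cooks_distance(X, y, pihat):
    """One-step approximation to the change in the coefficients when a
    case is deleted, built from Pearson residuals and leverages."""
    h = logistic_leverage(X, pihat)
    r = (y - pihat) / np.sqrt(pihat * (1 - pihat))  # Pearson residuals
    return r ** 2 * h / (X.shape[1] * (1 - h) ** 2)
```

A handy check on any implementation: the leverages always sum to the number of columns of X, because H is a projection matrix.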

7.10 Variable selection methods

7.10.1 Introduction

In the previous examples, there were only a few predictor variables, and generally only one model was really of interest. In many cases, the form of the model is unknown, and some sort of variable selection method is required to build a realistic model.

As in ordinary regression, these variable selection methods are NO substitute for intelligent thought, experience, and common sense.

As always, before starting any analysis, check the sample or experimental design. This chapter only deals with data collected under a simple random sample or completely randomized design. If the sample or experimental design is more complex, please consult with a friendly statistician.

Epidemiologists often advise that all clinically relevant variables should be included regardless of whether they are statistically significant or not. The rationale for this approach is to provide as complete control of confounding as possible – we saw in regular regression that collinearity among variables can mask statistical significance. The major problem with this approach is over-fitting. Over-fitted models have too many variables relative to the number of observations, leading to numerically unstable estimates with large standard errors.


I prefer a more subdued approach rather than this shotgun approach, and would follow these steps to find a reasonable model:

• Start with a multivariate scatter-plot matrix to investigate pairwise relationships among variables. Are there pairs of variables that appear to be highly correlated? Are there any points that don't seem to follow the pattern seen with the other points?

• Examine each variable separately using the Analyze->Distribution platform to check for anomalous values, etc.

• Start with simple univariate logistic regressions with each variable in turn.

For continuous variables, there are three suggested analyses. First, use the binary variable as the X variable and do a simple two-sample t-test to look for differences among the means of the potential predictors; the dot plots should show some separation of the two groups. Second, try a simple univariate logistic regression using the binary variable as the Y variable with each individual predictor. Third, although it seems odd to do so, convert the binary response variable to a 0/1 continuous response and try some of the standard smoothing methods, such as a spline fit, to investigate the general form of the response. Does it look logistic? Are quadratic terms needed?

For nominal or ordinal variables, the above analyses often start with a contingency table. Particular attention should be paid to problem cases – cells in a contingency table which have a zero count. For example, an experiment may be testing different doses of a drug for the LD50³⁸ and no deaths occurred at a particular dose. In these situations, the log-odds of success are ±∞, which is impossible to model properly using virtually any standard statistical package.³⁹ If there are cells with 0 counts, some pooling is often required.

Looking at all the variables, which appear to be statistically significant? Approximately how large are these simple effects – can the predictor variables be ranked in approximate order of univariate importance?

• Based upon the above results, start with a model that includes what appear to be the most important variables. As a rule of thumb,⁴⁰ include variables that have a p-value under .25 rather than relying on a stricter criterion. At this stage of the game, building a good starting model is of primary importance.

• Use standard variable selection methods, such as stepwise selection (forward, backward, combined) or all-subsets regression, to investigate potential models. These mechanical methods are not to be used as a substitute for thinking! Remember that highly collinear variables can mask the importance of each other.

If categorical variables are to be included, then some care must be used on how the various indicator variables are included. The reason for this is that the coding of the indicator variables is arbitrary, and the selection of a particular indicator variable may be an artifact of the coding used. One strategy is that all the indicator variables should be included or excluded as a set, rather than individually selecting separate indicator variables. As you will see in the example, JMP has four different rules that could be used.

³⁸ LD50 = Lethal Dose, 50th percentile – the dose which kills 50% of the subjects.
³⁹ However, refer to Hosmer and Lemeshow (2000) for details on alternate approaches.
⁴⁰ Hosmer and Lemeshow (2000), p. 95.


• Once main effects have been identified, look at quadratic, interaction, and cross-product terms.

• Verify the final model. Look for collinearity, high leverage, etc. Check if the response to the selected variables is linear on the logistic scale. For example, break a continuous variable into 4 classes, and refit the same model with these discretized classes. The estimates of the effects for each class should then follow an approximate linear pattern.

• Cross-validate the model so that artifacts of that particular dataset are not highlighted.

7.10.2 Example: Predicting credit worthiness

In the credit business, banks are interested in whether prospective consumers will pay back their credit or not. The aim of credit-scoring is to model or predict the probability that a consumer with certain covariates is to be considered a potential risk.

If you visit http://www.stat.uni-muenchen.de/service/datenarchiv/welcome_e.html you will find a dataset consisting of 1000 consumer credits from a German bank. For each consumer, the binary response variable “creditability” is available. In addition, 20 covariates that are assumed to influence creditability were recorded. The dataset is available in the creditcheck.jmp datafile from the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms. The variable descriptions are available at http://www.stat.uni-muenchen.de/service/datenarchiv/kredit/kreditvar_e.html and in the Sample Program Library.

I will assume that the initial steps in variable selection have been done, such as scatter-plots, looking for outliers, etc.

This dataset has a mixture of continuous variables (such as the length of time an account has been paid in full), nominally scaled variables (such as sex, or the purpose of the credit request), and ordinally scaled variables (such as length of employment). Some of the ordinal variables may even be close enough to interval or ratio scaled to be usable as continuous variables (such as length of employment). Both approaches should be tried, particularly if the estimates for the individual categories appear to be increasing in a linear fashion.

The Analyze->Fit Model platform was used to specify the response variable, the potential covariates, and that a variable selection method will be used:


This brings up the standard dialogue box for stepwise and other variable selection methods.


In the stepwise paradigm, the usual forward, backward, and mixed (i.e. forward followed by a backward step at each iteration) directions are available:

In cases where variables are nominally or ordinally scaled (and discrete), JMP provides a number of ways to include/exclude the individual indicator variables:

For example, consider the variable Repayment, which had levels 0 to 4, corresponding to 0 = repayment problems in the past, up to 4 = completely satisfactory repayment of past credit. JMP will create 4 indicator variables to represent these 5 categories. These indicator variables are derived in a hierarchical fashion:


The first indicator variable splits the classes in such a way as to maximize the difference in the proportion of credit worthiness between the two parts of the split. This corresponds to grouping levels 0 and 1 vs. levels 2, 3, and 4. The next indicator variables then split the splits, again, if possible, to maximize the difference in credit worthiness between the two parts of the split. [If the split is of a pair of categories, there is no choice in the split.] This corresponds to splitting the 0&1 group into another indicator variable that distinguishes category 0 from 1. The 2&3&4 group is split into two sub-splits corresponding to categories 2&3 vs. category 4. Finally, the 2&3 group is split into an indicator variable differentiating categories 2 and 3.
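One simple 0/1 version of these nested splits can be coded directly. JMP's internal parameterization of the hierarchy may differ; this sketch only mirrors the split structure described above:

```python
import pandas as pd

def hierarchical_indicators(repayment: pd.Series) -> pd.DataFrame:
    """0/1 indicators for the nested splits of the 5-level Repayment
    variable: {0,1} vs {2,3,4}, then 0 vs 1, {2,3} vs 4, and 2 vs 3."""
    r = repayment.astype(int)
    return pd.DataFrame({
        "split_01_vs_234": (r >= 2).astype(int),  # top-level split
        "split_0_vs_1": (r == 1).astype(int),     # within the {0,1} branch
        "split_23_vs_4": (r == 4).astype(int),    # within the {2,3,4} branch
        "split_2_vs_3": (r == 3).astype(int),     # within the {2,3} branch
    })
```

Each lower-level column is only meaningful within its branch of the hierarchy, which is exactly why the entry rules below treat the indicators as a structured set rather than as four independent predictors.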

Now the rules for entering effects correspond to:

• Combined. When terms enter the model, they are combined with all higher terms in the hierarchy and tested as a group to enter or leave.

• Restrict. Terms cannot be entered into the model unless terms higher in the hierarchy are already entered. Hence the indicator variable that distinguishes categories 0 and 1 in the repayment variable cannot enter before the indicator variable that contrasts 0&1 and 2&3&4.

• No Rules. Each indicator variable is free to enter or leave the model regardless of the presence or absence of other variables in the set.

• Whole Effects. All indicator variables in a set must enter or leave together as a set.

The Combined and Whole Effects rules are the two most common choices.

This platform also supports all-possible-subsets regression:


This should be used cautiously with a large number of variables.

Because it is computationally difficult to fit thousands of models using maximum likelihood methods for each of the potential new variables that enter the model, a computationally simpler (but asymptotically equivalent) test procedure (called the Wald or score test) is used in the table of variables to enter or leave. In a forward selection, the variable with the smallest p-value or the largest Wald test statistic is chosen:

Once this variable is chosen, the current model is refit using maximum likelihood, so the report in the Step History may show a slightly different test statistic (the L-R ChiSquare) than the score statistic, and the p-value may be different.

The stepwise selection continues.

In a few steps, the next variable to enter is the indicator variable that distinguishes categories 2&3 and 4. Because of the restriction on entering terms, if this indicator variable is entered, the first cut must also be entered. Hence, this step actually enters 2 variables, and the number of predictors jumps from 3 to 5:


In a few more steps, some of the credit-purpose variables are entered, again as a pair.

The stepwise selection continues for a total of 18 steps.

As before, once you have identified a candidate model, it must be fit and examined in more detail. Use the Make Model button to fit the final model. Note that JMP must add new columns to the data table corresponding to the indicator variables created during the stepwise procedure. These can be confusing to the novice, but just keep in mind that any set of indicator variables is somewhat arbitrary.


The model fit then has separate variables used for each indicator variable created:


The log-odds of NOT repaying the loan is computed (see the bottom of the estimates table). Do the coefficients make sense?

Can some variables be dropped?

Pay attention to how the indicator variables have been split. For example, do you understand what terms are used if the borrower intends to use the credit to do repairs (CreditPurpose value = 6)?

Models that are similar to this one should also be explored.

Again, just as in the case of ordinary regression, model validation using other data sets or hold-out samples should be explored.

7.11 Model comparison using AIC

Sorry, to be added later.


7.12 Final Words

7.12.1 Two common problems

Two common problems can be encountered with logistic regression.

Zero counts

As noted earlier, zero counts for one category of a nominal or ordinal predictor (X) variable are problematic, as the log-odds of that category then approach ±∞, which is somewhat difficult to model.

One simplistic approach, similar to the computation of the empirical logistic estimate, is to add 1/(2n) to each cell so that the counts are no longer integers; most packages will deal with non-integer counts without problems.
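As a sketch of this idea, the adjusted (empirical) log-odds can be computed directly. The function below uses the common add-0.5-per-cell version of the adjustment; the counts are hypothetical, not from the notes:

```python
import math

def empirical_logit(successes, n):
    """Empirical logistic transform: add 0.5 to each cell so the
    log-odds remain finite even when a cell count is zero."""
    return math.log((successes + 0.5) / (n - successes + 0.5))

# With a zero cell, the raw log-odds log(0/20) would be -infinity,
# but the adjusted version stays finite:
print(empirical_logit(0, 20))    # about -3.71
print(empirical_logit(10, 20))   # 0.0 (an even split)
```

The adjustment matters only for cells near 0 or n; for moderate counts it changes the log-odds very little.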

If the zero counts arise from spreading the data over too many cells, perhaps some pooling of adjacent cells is warranted. If the data are sufficiently dense that pooling is not needed, perhaps this level of the variable can be dropped.

Complete separation

Ironically, this is a problem because the logistic model is performing too well! We saw an example of this earlier, when the fitness data could predict perfectly the sex of the subject.

This is a problem because now the predicted log-odds for the groups must again be ±∞. This can only happen if some of the estimated coefficients are also infinite, which is difficult to deal with numerically. Theoretical considerations show that in the case of complete separation, maximum likelihood estimates do not exist!

Sometimes this complete separation is an artifact of too many variables and not enough observations. Furthermore, it is not so much a problem of the total number of observations, but also of the division of observations between the two binary outcomes. If you have 1000 observations but only 1 "success", then any model with more than a few variables will be 100% efficient in capturing the single success; however, it is almost certain to be an artifact of the particular dataset.
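A small numerical sketch (toy data, not from the notes) shows why no finite MLE exists under complete separation: the log-likelihood keeps climbing toward its supremum of 0 as the slope grows, so no finite slope maximizes it.

```python
import math

# Completely separated toy data: every x < 0 is a failure,
# every x > 0 is a success.
xs = [-2.0, -1.0, 1.0, 2.0]
ys = [0, 0, 1, 1]

def loglik(beta):
    """Bernoulli log-likelihood for the model logit(p) = beta * x."""
    ll = 0.0
    for x, y in zip(xs, ys):
        p = 1.0 / (1.0 + math.exp(-beta * x))
        ll += math.log(p) if y == 1 else math.log(1.0 - p)
    return ll

# The likelihood increases without bound in beta (supremum 0 is only
# reached as beta goes to infinity), so the MLE does not exist:
for beta in (1.0, 5.0, 25.0):
    print(beta, loglik(beta))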


7.12.2 Extensions

Choice of link function

The logit link function is the most common choice for the link function between the probability of an outcome and the scale on which the predictors operate in a linear fashion.

However, other link functions have been used in different situations. For example, the log link (log(p)), the log-log link (log(−log(p))), the complementary log-log link (log(−log(1 − p))), the probit function (the inverse normal distribution), and the identity link (p) have all been proposed for various special cases. Please consult a statistician for details.
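For illustration only, two of these links and their inverses can be written out directly; the round-trip check below is a sketch, not an endorsement of either link for any particular dataset:

```python
import math

# Each link maps a probability p in (0,1) to the whole real line;
# its inverse maps the linear predictor back to a probability.
links = {
    "logit":   (lambda p: math.log(p / (1.0 - p)),
                lambda x: 1.0 / (1.0 + math.exp(-x))),
    "cloglog": (lambda p: math.log(-math.log(1.0 - p)),
                lambda x: 1.0 - math.exp(-math.exp(x))),
}

for name, (link, inv) in links.items():
    for p in (0.1, 0.5, 0.9):
        assert abs(inv(link(p)) - p) < 1e-9  # round-trip recovers p
        print(name, p, round(link(p), 3))
```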

More than two response categories

Logistic regression traditionally has two response categories that are classified as "success" or "failure". It is possible to extend this modelling framework to cases where the response variable has more than two categories.

This is known as multinomial logistic regression, discrete choice modelling, or the polychotomous or polytomous logistic model, depending upon your field of expertise.

There is a difference in the analysis if the responses can be ordered (i.e. the response variable takes an ordinal scale) or remain unordered (i.e. the response variable takes a nominal scale).

The basic idea is to compute a logistic regression of each category against a reference category. So a response variable with three categories is translated into two logistic regressions where, for example, the first regression is category 1 vs. category 0 and the second regression is category 2 vs. category 0. These can be used to derive the results of category 2 vs. category 1. What is of particular interest is the role of the predictor variables in each of the possible comparisons, e.g. does weight have the same effect upon mortality for three different disease outcomes?

Consult one of the many books on logistic regression for details.
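The step of deriving the category-2-vs.-1 comparison from the two baseline regressions is just a difference of log-odds. A quick check with hypothetical probabilities:

```python
import math

# Hypothetical fitted probabilities for one subject, with three
# response categories and category 0 as the reference:
p = {0: 0.5, 1: 0.3, 2: 0.2}

logodds_1_vs_0 = math.log(p[1] / p[0])  # first baseline regression
logodds_2_vs_0 = math.log(p[2] / p[0])  # second baseline regression

# The 2-vs-1 comparison is the difference of the two baselines:
logodds_2_vs_1 = logodds_2_vs_0 - logodds_1_vs_0
assert abs(logodds_2_vs_1 - math.log(p[2] / p[1])) < 1e-12
print(round(logodds_2_vs_1, 4))
```

Because the reference-category log-odds cancel, the choice of reference category changes the coefficients but not the fitted comparisons.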

Exact logistic regression with very small datasets

The methods presented in this chapter rely upon maximum likelihood methods and asymptotic arguments. In very small datasets, these large-sample approximations may not perform well.

There are several statistical packages which perform exact logistic regression and do not rely upon asymptotic arguments.

A simple search of the web brings up several such packages.


More complex experimental designs

The results of this chapter have all assumed that the sampling design was a simple random sample or that the experimental design was a completely randomized design.

Logistic regression can be extended to many more complex designs.

In matched-pair designs, each "success" in the outcome is matched with a randomly chosen "failure" along as many covariates as possible. For example, lung cancer patients could be matched with healthy patients with common age, weight, occupation, and other covariates. These designs are very common in health studies. There are many good books on the analysis of such designs.

Clustered designs are also very common, where groups of subjects all receive a common treatment. For example, classrooms may be randomly assigned to different reading programs, and the success or failure of individual students within the classrooms in obtaining reading goals is assessed. Here the experimental unit is the classroom, not the individual student, and the methods of this chapter are not directly applicable. Several extensions have been proposed for this type of "correlated" binary data (students within the same classroom are all exposed to exactly the same set of experimental and non-experimental factors). The most common is known as Generalized Estimating Equations and is described in many books.

More complex experimental designs (e.g. split-plot designs) can also be run with binary outcomes. These complex designs require high-powered computational machinery to analyze.

7.12.3 Yet to do

- examples
- dov's example used in a comprehensive exam in previous years



Chapter 8

Poisson Regression

8.1 Introduction

In past chapters, multiple-regression methods were used to predict a continuous Y variable given a set of predictors, and logistic-regression methods were used to predict a dichotomous categorical variable given a set of predictors.

In this chapter, we will explore the use of Poisson-regression methods, which are typically used to predict counts of (rare) events given a set of predictors.

Just as multiple regression implicitly assumed that the Y variable had a normal distribution, and logistic regression assumed that the choice of categories in Y was based on a binomial distribution, Poisson regression assumes that the observed counts are generated from a Poisson distribution.

The Poisson distribution is often used to model count data when the events being counted are somewhat rare, e.g. cancer cases, the number of accidents, the number of satellite males around a female bird, etc. It is characterized by the expected number of events µ, with probability mass function:

P(Y = y | µ) = e^(−µ) µ^y / y!

where y! = y(y − 1)(y − 2) ... (2)(1), and y ≥ 0. The probability mass function is available in tabular form, or can be computed by many statistical packages. While the values of Y are restricted to being non-negative integers, it is not necessary for µ to be an integer.

In the following graph, 1000 observations were each generated from a Poisson distribution with differing means.



For very small values of µ, virtually all the counts are zero, with only a few counts that are positive. As µ increases, the shape of the distribution looks more and more like a normal distribution; indeed, for large µ, a normal distribution can be used as an approximation to the distribution of Y.
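The probability mass function given above is easy to evaluate directly. This sketch checks that the probabilities sum to one and that the mean equals µ (here a non-integer value):

```python
import math

def poisson_pmf(y, mu):
    """P(Y = y | mu) = exp(-mu) * mu**y / y!  for y = 0, 1, 2, ..."""
    return math.exp(-mu) * mu**y / math.factorial(y)

mu = 2.5  # the mean need not be an integer
probs = [poisson_pmf(y, mu) for y in range(100)]
print(round(sum(probs), 6))                               # 1.0
print(round(sum(y * p for y, p in enumerate(probs)), 6))  # 2.5
```

Summing y·P(Y = y) recovers µ; the same calculation with (y − µ)² recovers the variance, which is also µ (see the properties listed below, which hold for any Poisson mean).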

Sometimes µ is further parameterized by a rate parameter and a group size, i.e. µ = Nλ where λ is the rate per unit and N is the group size. For example, the number of cancers in a group of 100,000 people could be modeled using λ as the rate per 1000 people and N = 100.

Two important properties of the Poisson distribution are:

E[Y] = µ
V[Y] = µ

Unlike the normal distribution, which has a separate parameter for the mean and variance, the Poisson distribution's variance is equal to its mean. This means that once you estimate the mean, you have also estimated the variance, and so it is not necessary to have replicate counts to estimate the sample variance from data. As will be seen later, this can be quite limiting because for many populations the data are over-dispersed, i.e. the variance is greater than you would expect from a simple Poisson distribution.

Another important property is that the Poisson distribution is additive. If Y_1 is Poisson(µ_1) and Y_2 is Poisson(µ_2), then Y = Y_1 + Y_2 is also Poisson(µ = µ_1 + µ_2).

Lastly, the Poisson distribution is a limiting distribution of a Binomial distribution as n becomes large and p becomes very small.
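This limiting behavior is easy to verify numerically: fix µ = np and let n grow while p shrinks (the values below are illustrative only):

```python
import math

def binom_pmf(y, n, p):
    """Binomial probability of y successes in n trials."""
    return math.comb(n, y) * p**y * (1.0 - p)**(n - y)

def poisson_pmf(y, mu):
    """Poisson probability mass function."""
    return math.exp(-mu) * mu**y / math.factorial(y)

# Fix mu = n*p = 2 and let n grow while p shrinks toward zero:
mu = 2.0
for n in (10, 100, 10000):
    p = mu / n
    gap = max(abs(binom_pmf(y, n, p) - poisson_pmf(y, mu))
              for y in range(8))
    print(n, gap)  # the largest pmf difference shrinks as n grows
```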

Poisson regression is another example of a Generalized Linear Model (GLIM).¹ As in all GLIMs, the modeling process is a three-step affair:

Y_i is assumed Poisson(µ_i)
θ_i = log(µ_i)
θ_i = β_0 + β_1 X_i1 + β_2 X_i2 + ...

Here the link function is the natural logarithm, log. In many cases, the mean changes in a multiplicative fashion. For example, if the population size doubled, then the expected number of cancer cases should also double. As populations age, the rate of cancer increases linearly on a log scale. Additionally, by modeling log(µ_i), it is impossible to get negative estimates of the mean.

The linear part of the GLIM can consist of continuous X or categorical X variables, or mixtures of both types of predictors. Categorical variables will be converted to indicator variables in exactly the same way as in multiple and logistic regression.

Unlike multiple regression, there are no closed-form solutions for the estimates of the parameters. Standard maximum likelihood estimation (MLE) methods are used.² MLEs are guaranteed to be the "best" estimators (smallest standard errors) as the sample size increases, and seem to work well even if the sample sizes are not large. Standard methods are used to estimate the standard errors of the estimates. Model comparisons are done using likelihood-ratio tests, whose test statistics follow a chi-square distribution, which is used to give a p-value that is interpreted in the standard fashion. Predictions are done in the usual fashion; these initially appear on the log scale and must be anti-logged to provide estimates on the ordinary scale.

¹ Logistic regression is another GLIM.
² A discussion of the theory of MLE is beyond the scope of this course, but is covered in Stat-330 and Stat-402.

8.2 Experimental design

In this chapter, we will again assume that the data are collected under a completely randomized design. In some of the examples that follow, blocked designs will be analyzed, but we will not explore how to analyze split-plot or repeated-measures designs, or designs with pseudo-replication.

The analysis of such designs in a generalized linear models framework is possible; please consult with a statistician if you have a complex experimental design.

8.3 Data structure

The data structure is straightforward. Columns represent variables and rows represent observations. The response variable, Y, will be a count of the number of events and will be set to a continuous scale. The predictor variables, X, can be either continuous or categorical; in the latter case, indicator variables will be created.

As usual, the coding that a package uses for indicator variables is important if you want to interpret directly the estimates of the effect of the indicator variable. Consult the documentation for the package for details.

8.4 Single continuous X variable

The JMP file salamanders-burn.jmp, available in the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms, contains data on the number of salamanders in a fixed-size quadrat at various locations in a large forest. The locations of the quadrats were chosen to represent a range of years since a forest fire burned the understory.

A simple plot of the data:


shows an increasing relationship between the number of salamanders and the time since the forest understory burned.

Why can't a simple regression analysis using standard normal theory be used to fit the curve?

First, the assumption of normality is suspect. The counts of the number of salamanders are discrete, with most under 10. It is impossible to get a negative number of salamanders, so the bottom-left part of the graph would require the normal distribution to be truncated at Y = 0.

Second, it appears that the variance of the counts at any particular age increases with age since burned. This violates the assumption of equal variance for all X values made for standard regression models.

Third, the fitted line from ordinary regression could go negative. It is impossible to have a negative number of salamanders.

It seems reasonable that a Poisson distribution could be used to model the number of salamanders. They are relatively rare and seem to forage independently of each other. These conditions are the underpinnings of a Poisson distribution.

The process of fitting the model and interpreting the output is analogous to that used in logistic regression.


The basic model is then:

Y_i ~ Poisson(µ_i)
θ_i = log(µ_i)
θ_i = β_0 + β_1 Years_i

As in the logistic model, the distribution of the data about the mean (line 1) has a link function (line 2) between the mean for each Y and the linear structural part of the model (line 3). In logistic regression, the logit link was used to ensure that all values of p were between 0 and 1. In Poisson regression, the log (natural logarithm) link is traditionally used to ensure that the mean is always positive.

The model must be fit using maximum likelihood methods, just as in logistic regression.
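To show concretely what "fit by maximum likelihood" means here, the following is a bare-bones Newton-Raphson fit (equivalently, Fisher scoring/IRLS for the canonical log link) of the one-predictor Poisson model. The counts are made up for illustration; they are not the salamander data:

```python
import math

def fit_poisson(xs, ys, iters=50):
    """Maximize the Poisson log-likelihood for log(mu_i) = b0 + b1*x_i
    by Newton-Raphson, solving the 2x2 score equations at each step."""
    b0, b1 = 0.0, 0.0
    for _ in range(iters):
        u0 = u1 = h00 = h01 = h11 = 0.0
        for x, y in zip(xs, ys):
            mu = math.exp(b0 + b1 * x)
            u0 += y - mu            # score (gradient) for b0
            u1 += (y - mu) * x      # score (gradient) for b1
            h00 += mu               # Fisher information entries
            h01 += mu * x
            h11 += mu * x * x
        det = h00 * h11 - h01 * h01
        b0 += (h11 * u0 - h01 * u1) / det   # Newton step, 2x2 solve
        b1 += (h00 * u1 - h01 * u0) / det
    return b0, b1

# Made-up counts that rise with years since burn:
years  = [1, 3, 5, 8, 12, 15, 20, 25]
counts = [1, 2, 2, 3, 3, 5, 6, 7]
b0, b1 = fit_poisson(years, counts)
print(b0, b1)  # b1 > 0: counts increase with years
```

At the maximum, both score sums are zero, so the fitted means reproduce the total count exactly; that invariant is a useful convergence check.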

This model is fit in JMP using the Analyze->Fit Model platform:

Be sure to specify the proper distribution and link function.


This gives the output:

Most of the output parallels that seen in logistic regression. At the top of the output is a summary of the variable being analyzed, the distribution for the raw data, the link used, and the total number of observations (rows in the dataset).

The Whole Model Test is analogous to that in multiple regression: is there evidence that the set of predictors (in this case there is only one predictor) has any predictive ability over that seen by random chance? The test statistic is computed using a likelihood-ratio test comparing this model to a model with only the intercept. The p-value is very small, indicating that the model has some predictive ability. [Because there is only 1 predictor, this test is equivalent to the Effect Test discussed below.]

The goodness-of-fit statistic compares the model with the intercept and the single predictor to a model where every observation is predicted individually. If the model fits well, the chi-square test statistic should be approximately equal to the degrees of freedom, and the p-value should be LARGE, i.e. much larger than .05.³ There is no evidence of a problem in the fit. Later in this section, we will examine how to adjust for slight lack of fit.
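The deviance behind this goodness-of-fit statistic can be computed by hand. The sketch below (with made-up counts and fitted means) uses the standard Poisson deviance formula, with the convention that 0·log(0) = 0:

```python
import math

def poisson_deviance(ys, mus):
    """Residual deviance: twice the log-likelihood gap between the
    saturated model (mu_i = y_i) and the fitted means mu_i."""
    d = 0.0
    for y, mu in zip(ys, mus):
        term = y * math.log(y / mu) if y > 0 else 0.0
        d += 2.0 * (term - (y - mu))
    return d

# A perfect fit has deviance 0; a poor fit has a larger deviance:
print(poisson_deviance([2, 5, 9], [2.0, 5.0, 9.0]))            # 0.0
print(round(poisson_deviance([2, 5, 9], [5.0, 5.0, 5.0]), 2))  # 4.92
```

Comparing this deviance to its degrees of freedom is exactly the "chi-square statistic should be approximately equal to the degrees of freedom" check described above.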

The Effect Tests examine whether each predictor (or, in the case of a categorical variable, the entire set of indicator variables) makes a statistically significant marginal contribution to the fit. As in the multiple-regression model, these are MARGINAL contributions, i.e. assuming that all other variables remain in the model and are fixed at their current values. There is only one predictor, and there is strong evidence against the hypothesis of no marginal contribution.

³ Remember that in goodness-of-fit tests, you DON'T want to find evidence against the null hypothesis.


Finally, the Parameter Estimates section reports the estimated β's. So our fitted model is:

Y_i ~ Poisson(µ_i)
θ_i = log(µ_i)
θ_i = 0.59 + 0.045 Years_i

Each line also tests whether the corresponding population coefficient is zero. Because each of the X variables in the model is a single variable (i.e. not a set of categories), the results of the parameter-estimates tests match the effect tests.

We can obtain predictions by following the drop-down menu:


For example, consider the first row of the data. At 12 years since the last burn, we estimate the mean response by starting at the bottom of the model and working upwards:

θ_1 = 0.59 + 0.045(12) = 1.13
µ_1 = exp(1.13) ≈ 3.1

which matches, up to rounding of the coefficients, the predicted value of 3.08 in the table.

As in ordinary normal-theory regression, confidence limits for the mean response and for an individual response may be found. The above table shows the confidence interval for the mean response.
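The two-step prediction is a few lines of arithmetic. Using the rounded coefficients, the last digit differs slightly from the report's full-precision value of 3.08:

```python
import math

b0, b1 = 0.59, 0.045   # rounded estimates from the fitted model
years = 12

theta = b0 + b1 * years   # prediction on the log (link) scale
mu = math.exp(theta)      # anti-log back to the count scale
print(round(theta, 2), round(mu, 2))
```

The same anti-log step applies to the endpoints of a confidence interval computed on the log scale.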

Finally, a residual plot may also be constructed:

There is no evidence of a lack of fit.


8.5 Single continuous X variable - dealing with overdispersion

One of the weaknesses of Poisson regression is the very restrictive assumption that the variance of a Poisson distribution is equal to its mean. In some cases, data are over-dispersed, i.e. the variance is greater than predicted by a simple Poisson distribution. In this section, we will illustrate how to detect overdispersion and how to adjust the analysis to account for it.

In the section on Logistic Regression, a dataset was examined on nesting horseshoe crabs⁴ that is analyzed in Agresti's book.⁵

The design of the study is given in Brockmann, H.J. (1996). Satellite male groups in horseshoe crabs, Limulus polyphemus. Ethology, 102, 1-21. Again, it is important to check that the design is a completely randomized design or a simple random sample. As in regression models, you do have some flexibility in the choice of the X settings, but for a particular weight and color, the data must be selected at random from the relevant population.

Each female horseshoe crab had a male resident in her nest. The study investigated other factors affecting whether the female had any other males, called satellites, residing nearby. These other factors include:

• crab color, where 2=light medium, 3=medium, 4=dark medium, 5=dark.

• spine condition, where 1=both good, 2=one worn or broken, 3=both worn or broken.

• weight

• carapace width

In the section on Logistic Regression, a derived variable on the presence or absence of satellite males was examined. In this section, we will examine the actual number of satellite males.

A JMP dataset crabsatellites.jmp is available from the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms. A portion of the datafile is shown below:

⁴ See http://en.wikipedia.org/wiki/Horseshoe_crab.
⁵ These are available from Agresti's web site at http://www.stat.ufl.edu/~aa/cda/sas/sas.html.


Note that the color and spine condition variables should be declared with an ordinal scale despite having numerical codes. In this analysis we will use the actual number of satellite males.

As noted in the section on Logistic Regression, a preliminary scatterplot of the variables shows some interesting features.


There is a very high positive relationship between carapace width and weight, but there are a few anomalous crabs that should be investigated further, as shown in this magnified plot:


There are three points with weights in the 1200-1300 g range whose carapace widths suggest that the weights should be in the 2200-2300 g range, i.e. a typographical error in the first digit. There is a single crab whose weight suggests a width of 24 cm rather than 21 cm; perhaps a typo in the last digit. Finally, there is one crab which is extremely large compared to the rest of the group. In the analysis that follows, I've excluded these five data values.

To begin with, fit a model that attempts to predict the mean number of satellite crabs as a function of the weight of the female crab, i.e.

Y_i distributed Poisson(µ_i)
θ_i = log(µ_i)
θ_i = β_0 + β_1 Weight_i

The Generalized Linear Model platform of JMP is used:


This gives selected output:


There are two parts of the output which show that the fit is not very satisfactory. First, while the studentized residual plot does not show any structural defects (the residuals are scattered around zero)⁶, it does show substantial numbers of points outside of the (−2, 2) range. This suggests that the data are too variable relative to the Poisson assumption. Second, the goodness-of-fit statistic has a very small p-value, indicating that the data are not well fit by the model.

This is an example of overdispersion. To see this overdispersion, divide the weight classes into categories, e.g. 0000−2500 g, 2500−3000 g, etc. [This has already been done in the dataset.]⁷ Now find the mean and variance of the number of satellite males for each weight class using the Tables->Summary platform:

⁶ The “lines” in the plot are artifacts of the discrete nature of the response. See the chapter on residual plots for more details.
⁷ The choice of 4 weight classes is somewhat arbitrary. I would usually try to subdivide the data into between 4 and 10 classes, ensuring that at least 20-30 observations are in each class.


If the Poisson assumption were true, then the variance of the number of satellite males should be roughly equal to the mean in each class. In fact, the variance in the number of satellite males appears to be roughly 3× that of the mean.
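The within-class mean-to-variance check takes only a few lines. The counts below are hypothetical stand-ins for the satellite-male counts in one weight class, used only to show the mechanics:

```python
import statistics

# Hypothetical satellite-male counts for one weight class (not the real crab data).
counts = [0, 2, 7, 3, 0, 6]

m = statistics.mean(counts)        # 3.0
v = statistics.variance(counts)    # sample variance, 8.8
# Under a Poisson model this ratio should be near 1; here it is about 3,
# mirroring the overdispersion described in the text.
print(m, v, round(v / m, 2))
```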

With generalized linear models, there are two ways to adjust for over-dispersion.


A different distribution can be used that is more flexible in the mean-to-variance ratio. A common distribution used in these cases is the negative binomial distribution. In more advanced classes, you will learn that the negative binomial distribution can arise from a Poisson distribution with extra variation in the mean rates. JMP does not allow the fitting of a negative binomial distribution, but this option is available in SAS.

An “ad hoc” method, that nevertheless has theoretical justification, is to allow some flexibility in the variance. For example, rather than restricting V[Y] = E[Y] = µ, perhaps V[Y] = cµ, where c is called the over-dispersion factor. Note that if this formulation is used, the data are no longer distributed as a Poisson distribution; in fact, there is NO actual probability function that has this property. Nevertheless, this quasi-distribution still has nice properties, and the over-dispersion factor can be estimated using quasi-likelihood methods that are analogous to regular likelihood methods.

The end result is that the over-dispersion factor is used to adjust the se and the test statistics. The adjusted se are obtained by multiplying the se from the Poisson model by √ĉ. The adjusted chi-square test statistics are found by dividing the test statistics from the Poisson model by ĉ, and the p-values are adjusted by looking up the adjusted test statistic in the appropriate table.

How is the over-dispersion factor c estimated? There are two methods, both of which are asymptotically equivalent. Both involve taking a goodness-of-fit statistic and dividing it by its degrees of freedom:

ĉ = goodness-of-fit statistic / df

Usually, ĉ’s of less than 10 (corresponding to a potential inflation in the se by a factor of about 3) are acceptable – if the inflation factor is more than about 10, the lack-of-fit is so large that alternate methods should be used.
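A minimal sketch of this estimate, using the goodness-of-fit values reported later in this example (chi-square 519.7857 on 166 df):

```python
import math

def c_hat(gof_stat, df):
    """Estimate the over-dispersion factor: goodness-of-fit statistic / df."""
    return gof_stat / df

# Values from the crab example in this section: chi-square 519.7857 on 166 df.
c = c_hat(519.7857, 166)      # about 3.13
se_inflation = math.sqrt(c)   # Poisson se's get multiplied by about 1.77
usable = c < 10               # rule of thumb: c-hat above ~10 signals severe lack of fit
print(round(c, 2), round(se_inflation, 2), usable)
```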

In JMP, the adjustment for over-dispersion occurs in the Analyze->Fit Model dialogue box:


The revised output is now:


Notice that the overdispersion factor has been estimated as

ĉ = chi-square / df = 519.7857 / 166 = 3.13

This is very close to the “guess” that we made based on looking at the variance-to-mean ratio among weight classes.

The estimated intercept and slope are unchanged and their interpretation is as before. For example, the


estimated slope of .000668 is the estimated increase in the log number of male satellite crabs when the female crab’s weight increases by 1 g. A 1000 g increase in body weight corresponds to a 1000 × .000668 = .668 increase in the log(number of satellite males), which corresponds to an increase by a factor of e^.668 = 1.95, i.e. the mean number of male satellite crabs almost doubles. The estimated se has been “inflated” by √ĉ = √3.13 = 1.77. The confidence intervals for the slope and intercept are now wider.
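The slope interpretation above can be verified directly, using the estimate from the output:

```python
import math

slope = 0.000668   # estimated slope per gram of female weight, from the output above

# A 1000 g increase adds 0.668 on the log scale ...
log_change = 1000 * slope
# ... which multiplies the mean number of satellite males by about 1.95.
factor = math.exp(log_change)
print(round(log_change, 3), round(factor, 2))
```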

The chi-square test statistics have been “deflated” by ĉ and the p-values have been adjusted accordingly.

Finally, the residual plot has been rescaled by the factor of √ĉ and now most residuals lie between −2 and 2. Note that the pattern of the residual plot doesn’t change; all that the over-dispersion adjustment does is change the residual variance so that the standardization brings the residuals closer to 0.

Predictions of the mean response at levels of X are obtained in the usual fashion:

giving (partial output):


The se of the predicted mean will also have been adjusted for overdispersion, as will the confidence intervals for the mean number of male satellite crabs at each weight value.

However, notice that the menu item for a prediction interval for the INDIVIDUAL response is “grayed out” and it is now impossible to obtain prediction intervals for the ACTUAL number of events. By using the overdispersion factor, you are no longer assuming that the counts are distributed as a Poisson distribution – in fact, there is NO REAL DISTRIBUTION that has the mean-to-variance ratio that is implicitly assumed when using the overdispersion factor. Without an actual distribution, it is impossible to make predictions for individual events.

We save the predicted values to the dataset and plot the final results on both the ordinary scale:



and on the log-scale (the scale where the model is “linear”):



8.6 Single Continuous X variable with an OFFSET

In the previous examples, the sampling units (where the counts were obtained) were all the same size (e.g. the number of satellite males around a single female). In some cases, the sampling units are of different sizes. For example, if the number of weeds is counted in a quadrat plot, then hopefully the size of the plot is constant. However, it is conceivable that the size of the plot varies because different people collected different parts of the data. Or if the number of events is counted in a time interval (e.g. the number of fish captured in a fishing trip), the time intervals could be of different lengths.

Often these types of data are pre-standardized, i.e. converted to a per m² or per hour basis, and then an


analysis is attempted on this standardized variable. However, standardization destroys the Poisson shape of the data and turns out to be unnecessary if the size of the sampling unit is also recorded.

The incidence of non-melanoma skin cancer among women in the early 1970s in Minneapolis-St. Paul, Minnesota, and Dallas-Fort Worth, Texas is summarized below:

City  Age Class  Age Mid  Count  Pop Size
msp   15-24         20       1   172,675
msp   25-34         30      16   123,065
msp   35-44         40      30    96,216
msp   45-54         50      71    92,051
msp   55-64         60     102    72,159
msp   65-74         70     130    54,722
msp   75-84         80     133    32,185
msp   85+           90      40     8,328
dfw   15-24         20       4   181,343
dfw   25-34         30      38   146,207
dfw   35-44         40     119   121,374
dfw   45-54         50     221   111,353
dfw   55-64         60     259    83,004
dfw   65-74         70     310    55,932
dfw   75-84         80     226    29,007
dfw   85+           90      65     7,538

We will first examine the relationship of cancer incidence to age by using the age midpoint as our continuous X variable and only using the Minneapolis data (for now).

The data set is available in the JMP data file skincancer.jmp from the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms.

Is there a relationship between the age of a cohort and the cancer incidence rate? Notice that a comparison of the raw counts is not very sensible because of the different sizes of the age cohorts. Most people would first STANDARDIZE the incidence rate, e.g. find the incidence per person by dividing the number of cancers by the number of people in each cohort:


A plot of the standardized incidence rate by the mid-age of each cohort:

shows a curved relationship between the incidence rate and the mid-point of the age cohort. This suggests a

theoretical model of the form:

Incidence = C e^age

i.e. an exponential increase in the cancer rates with age.

This suggests that a log-transform be applied to BOTH sides. But a plot of the logarithm of the incidence rate against log(age midpoint):

is still not linear, with a dip for the youngest cohorts. There appears to be a strong relationship between the log(cancer rate) and log(age) that may not be linear, but a quadratic looks as if it could fit quite nicely, i.e. a model of the form

log(incidence) = β_0 + β_1 log(age) + β_2 log(age)² + residual


Is it possible to include the population size directly? Expand the above model:

log(incidence) = β_0 + β_1 log(age) + β_2 log(age)² + residual

log(count / pop size) = β_0 + β_1 log(age) + β_2 log(age)² + residual

log(count) − log(pop size) = β_0 + β_1 log(age) + β_2 log(age)² + residual

log(count) = log(pop size) + β_0 + β_1 log(age) + β_2 log(age)² + residual

Notice that the log(pop size) has a known coefficient of 1 associated with it, i.e. there is NO β coefficient associated with log(pop size).
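The algebra above can be checked numerically: standardizing by population size is exactly the same as moving log(pop size) to the right-hand side with coefficient 1. Using the 35–44 Minneapolis row of the table (count 30, population 96,216):

```python
import math

count, pop = 30, 96216   # the 35-44 Minneapolis row of the table

lhs = math.log(count / pop)               # log of the pre-standardized rate
rhs = math.log(count) - math.log(pop)     # raw count with log(pop) moved to the right
# Identical by the algebra above -- the offset enters with a fixed coefficient of 1.
print(abs(lhs - rhs) < 1e-9)
```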

Also notice that log(POPSIZE) is known in advance and is NOT a parameter to be estimated. Variables such as population size are often called offset variables; note that most packages expect to see the offset variable pre-transformed according to the link function used. In this case, the log link is used, so the offset is log(POPSIZE_age), as you will see in a minute.

Our GLIM model will then be:

Y_age distributed Poisson(µ_age)

φ_age = log(µ_age) = log(POPSIZE_age) + log(λ_age)

φ_age = β_0 + β_1 log(AGE) + β_2 log(AGE)²

This can be rewritten slightly as:

log(POPSIZE_age) + log(λ_age) = β_0 + β_1 log(AGE) + β_2 log(AGE)²

or

log(λ_age) = β_0 + β_1 log(AGE) + β_2 log(AGE)² − log(POPSIZE_age)

So the modeling can be done in terms of estimating the effect of log(age) upon the incidence rate, rather than upon the raw counts, as long as the offset variable log(POPSIZE_age) is known.

To perform a Poisson regression, first create the offset variable log(POPSIZE_age) using the formula editor of JMP.

The Analyze->Fit Model platform launches the analysis:


Note that the raw count is the Y variable, and that the offset variable is specified separately from the X variables.

The output is:


The goodness-of-fit statistic indicates no evidence of lack-of-fit, i.e. no need to adjust for over-dispersion. Based on the results of the Effect Test for the quadratic term, it appears that a linear fit may actually be sufficient, as the p-value for the quadratic term is almost 10%. The reason for this apparent non-need for the quadratic term is that the smaller age cohorts have very few counts, so the actual incidence rate is very imprecisely estimated.

Finally, the Parameter Estimates section reports the estimated β’s (remember these are on the log-scale). Each line also tests if the corresponding population coefficient is zero. Because each of the X variables in the model is a single variable (i.e. not a set of categories), the results of the parameter-estimate tests match the effect tests.

Based on the output so far, it appears that we can drop the quadratic term. This term was dropped, and the model refit:


The final model is

log(λ̂)_age = −21.32 + 3.60 log(age)

The predicted log(λ) for age 40 is found as:

log(λ̂)_40 = −21.32 + 3.60 log(40) = −8.04

This incidence rate is on the log-scale, so the predicted incidence rate is found by taking anti-logs: e^−8.04 = .000322, or .322/thousand people, or 322/million people.

In order to make predictions about the expected number of cancers in each age cohort that would be seen under this model, you would need to add back the log(POPSIZE) for the appropriate age class:

log(µ̂_40) = log(λ̂)_40 + log(POPSIZE_40) = −8.04 + 11.47 = 3.43

Finally, the predicted number of cases is simply the anti-log of this value:

Ŷ_40 = e^log(µ̂_40) = e^3.43 ≈ 30.9
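The whole prediction chain, from log-rate to expected count, can be reproduced in a few lines (coefficients taken from the fitted model above; small rounding differences from the printed output are expected):

```python
import math

b0, b1 = -21.32, 3.60    # estimates from the final (linear in log(age)) model
pop_40 = 96216           # population of the Minneapolis 35-44 (mid-age 40) cohort

log_rate = b0 + b1 * math.log(40)        # about -8.04 on the log scale
rate = math.exp(log_rate)                # about .000322, i.e. roughly 322 per million
expected = math.exp(log_rate + math.log(pop_40))   # add back the offset: about 31 cases
print(round(log_rate, 2), round(expected, 1))
```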

Of course, this can be done automatically by the platform by requesting:


This also allows you to save the confidence limits for the average number of skin cancers expected for this age class, assuming the same population size (the mean confidence bounds), and the confidence limits for the individual counts (the individual confidence bounds).

In this case, the expected number of skin cancer cases for the 35–44 age group is 30.69, with a 95% confidence interval for the mean number of cases ranging from (26.0 → 36.8). The confidence bound for the actual number of cases (assuming the model is correct) is somewhere between 19 and 43 cases.

By adding new data lines to the data table (before the model fit) with the Y variable missing, but the age and offset variables present, you can make forecasts for any set of new X values.

The residual plot:


isn’t too bad – the large negative residual for the first age class (near where 0 skin cancers are predicted) is a bit worrisome; I suspect this is where the quadratic curve may provide a better fit.

A plot of actual vs. predicted values can be obtained directly:


or by saving the predicted values to the data sheet and using the Analyze->Fit Y-by-X platform with Fit Special to add the reference line:


These plots show excellent agreement with the data.

Finally, it is nice to construct an overlay plot of the empirical log(rates) (the first plot constructed) with the estimated log(rate) and confidence bounds as a function of log(age). Create the predicted log(rate) using the formula editor and the predicted skin cancer numbers by subtracting the log(POPSIZE) (why?):


Repeat the same formula for the lower and upper bounds of the 95% confidence interval for the mean number of cases:

Finally, use the Graph → Overlay Plot to plot the empirical estimates, the predicted values of λ, and the 95% confidence interval for λ on the same plot:


and fiddle⁸ with the plot to join up the predictions and confidence bounds but leave the actual empirical points as is, to give the final plot:

⁸ I had to turn on the connect-through-missing option under the red triangle.


Remember that the point with the smallest log(rate) is based on a single skin cancer case and is not very reliable. That is why the quadratic fit was likely not selected.

8.7 ANCOVA models

Just like in regular multiple regression, it is possible to mix continuous and categorical variables and test for parallelism of the effects. Of course, this parallelism is assessed on the link scale (in most cases for Poisson data, on the log scale).

There is nothing new compared to what was seen with ordinary regression and logistic regression. The three appropriate models are:

log(λ) = X
log(λ) = X Cat
log(λ) = X Cat X*Cat

where X is the continuous predictor and Cat is the categorical predictor. The first model assumes a common line for all categories of the Cat variable. The second model assumes parallel slopes, but differing intercepts. The third model assumes separate lines for each category.
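The parallel-slopes model (the second one) can be illustrated with hypothetical coefficients: on the link scale the two categories differ by a constant no matter what the value of X, which translates into a constant rate ratio on the original scale.

```python
import math

# Hypothetical coefficients for the parallel-slopes model  log(lambda) = X + Cat.
b0, b_city, b1 = -2.0, 0.4, 1.5   # baseline intercept, dfw offset, shared slope

def log_lam(x, city):
    """Category shifts the intercept only; the slope in x is common."""
    return b0 + (b_city if city == "dfw" else 0.0) + b1 * x

# The dfw-msp gap on the link (log) scale is the same at every x ...
gaps = [log_lam(x, "dfw") - log_lam(x, "msp") for x in (1.0, 2.5, 4.0)]
# ... so on the original scale the model implies one constant rate RATIO.
ratio = math.exp(gaps[0])
print(gaps, round(ratio, 3))
```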

Fitting would start with the most complex model (the third model) and test if there is evidence of non-parallelism. If none were found, the second model would be examined, and a test would be made for common intercepts. Finally, the simplest model may be an adequate fit.

Let us return to the skin cancer data examined earlier in this chapter. It is of interest to see if there is a consistent difference in skin cancer rates between the two cities. Presumably Dallas, which receives more intense sun, would have a higher skin cancer rate.

The data are available in the skincancer.jmp data set in the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms. Use all of the data. As before, the log(population size) will be the offset variable.

A preliminary data plot of the empirical cancer rate for the two cities:


shows roughly parallel responses, but now the curvature is much more pronounced in Dallas.

Perhaps a quadratic model should first be fit, with a separate response curve for each city. In shorthand model notation, this is:

log(λ) = City log(Age) log(Age)² City*log(Age) City*log(Age)²

where City is the effect of the two cities, log(Age) is the continuous X variable, and the interaction terms represent the non-parallelism of the responses. This is specified as:


As before, use the Generalized Linear Model option of the Analyze->Fit Model platform and don’t forget to specify the log(pop size) as the offset variable. This gives the output:


The Whole Model Test shows evidence that the model has predictive ability. The Goodness-of-fit Test shows that this model is a reasonable fit (p-values around .30). The Effect Test shows that perhaps both of the interaction terms can be dropped, but some care must be taken, as these are marginal tests and cannot simply be combined.

A “Chunk Test”, similar to that seen in logistic regression, can be done to see if both interaction terms can be dropped simultaneously:



The p-value is just above α = .05, so I would be a little hesitant to drop both interaction terms. On the other hand, some of the larger age classes have such large sample sizes and large count values that very minor differences in fit can likely be detected.

The simpler model with two parallel quadratic curves was then fit:



This simpler model also shows no strong evidence of lack-of-fit. Now, however, the quadratic term cannot be dropped.

The parameter estimates must be interpreted carefully for categorical data. Every package codes indicator variables in different ways, and so the interpretation of the estimates associated with the indicator variables differs among packages. JMP codes indicator variables so that estimates are the difference in response between the specified level and the AVERAGE of all levels. So in this case, the estimate associated with City[dfw] = .401 represents 1/2 the distance between the two parallel curves. Consequently, the difference in log(λ) between Minneapolis and Dallas is 2 × .401 = .802 (se 2 × .026 = .052). This is a consistent difference for all age groups.

This can also be estimated, without having to worry too much about the coding details, by doing a contrast between the estimates for the city effects:



This gives the same results as above.


This is a difference on the log-scale. As seen in an earlier chapter, this can be converted to an estimate of the ratio of incidence rates by taking anti-logs. In this case, Dallas is estimated to have e^.802 = 2.23 TIMES the skin cancer rate of Minneapolis. This is consistent with what is seen in the raw data. The SE of this ratio is found using an application of the Delta method.⁹ The delta method indicates that the SE of an exponentiated estimate is found as

SE(e^θ̂) = SE(θ̂) e^θ̂

In this case

SE(ratio) = .052 × 2.23 ≈ .12

Confidence bounds are found by finding the usual confidence bounds on the log-scale and then taking anti-logs of the end points. In this case, the 95% confidence interval for the difference in log(λ) is (.802 − 2(.052) → .802 + 2(.052)), or (.698 → .906). Taking anti-logs gives a 95% confidence interval for the ratio of skin cancer rates of (2.01 → 2.47).
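These delta-method calculations can be reproduced directly, using the estimate and se from the output above:

```python
import math

est, se = 0.802, 0.052   # difference in log(lambda) (dfw vs msp) and its se

ratio = math.exp(est)                 # estimated rate ratio, about 2.23
se_ratio = se * ratio                 # delta method: SE(e^b) = SE(b) * e^b
lo = math.exp(est - 2 * se)           # back-transformed 95% interval: about 2.01 ...
hi = math.exp(est + 2 * se)           # ... to about 2.47
print(round(ratio, 2), round(se_ratio, 2), round(lo, 2), round(hi, 2))
```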

The residual plot (not shown) looks reasonable.

8.8 Categorical X variables - a designed experiment

Just like ANOVA is used to analyze data from designed experiments, generalized linear models can also be used to analyze count data from designed experiments. However, JMP is limited to designs without random effects, e.g. no GLIMs that involve split-plot designs.

Consider an experiment to investigate 10 treatments (a control vs. a 3×3 factorial structure for two factors A and B) on controlling insect numbers. The experiment was run in a randomized block design (see earlier chapters). In each block, the 10 treatments were randomized to 10 different trees. On each tree, a trap was mounted, and the number of insects caught in each trap was recorded.

Here is the raw data.¹⁰

⁹ A form of a Taylor Series expansion. Consult many books on statistics for details.

¹⁰ This is example 10.4.1 from SAS for Linear Models, 4th Edition. Data extracted from http://ftp.sas.com/samples/A56655 on 2006-07-19.


Block  Treatment  A  B  Count
1      1          1  1      6
1      2          1  2      2
1      5          2  2      3
1      8          3  2      3
1      7          3  1      1
1      0          0  0     16
1      3          1  3      4
1      6          2  3      1
1      9          3  3      1
1      4          2  1      5
2      1          1  1      9
2      2          1  2      6
2      5          2  2      4
2      8          3  2      2
2      7          3  1      2
2      0          0  0     25
2      3          1  3      3
2      6          2  3      5
2      9          3  3      0
2      4          2  1      3
3      1          1  1      2
3      2          1  2     14
3      5          2  2      6
3      8          3  2      3
3      7          3  1      2
3      0          0  0      5
3      3          1  3      5
3      6          2  3     17
3      9          3  3      2
3      4          2  1      3
4      1          1  1     22
4      2          1  2      4
4      5          2  2      3
4      8          3  2      4
4      7          3  1      3
4      0          0  0      9
4      3          1  3      5
4      6          2  3      1
4      9          3  3      9
4      4          2  1      2


CHAPTER 8. POISSON REGRESSION<br />

The data are available in the JMP data file insectcount.jmp in the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms.

The RCB model was fit using a generalized linear model with a log link:

Count_i distributed Poisson(µ_i)

φ_i = log(µ_i)

φ_i = Block Treatment

where the simplified syntax Block and Treatment refers to the block and treatment effects. Both Block and Treatment are categorical, and will be translated to sets of indicator variables in the usual way.
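The translation into indicator variables can be sketched as follows. This is plain reference-cell (dummy) coding, shown only to illustrate the mechanics; as noted later, JMP actually uses a different (effect) coding.

```python
def indicators(values, reference):
    """Reference-cell (dummy) coding: one 0/1 column per non-reference level."""
    levels = sorted(set(values) - {reference})
    return {lvl: [1 if v == lvl else 0 for v in values] for lvl in levels}

blocks = [1, 1, 2, 2, 3, 3, 4, 4]        # block labels for eight observations
cols = indicators(blocks, reference=1)   # 0/1 columns for blocks 2, 3 and 4
print(sorted(cols))    # [2, 3, 4]
print(cols[2])         # [0, 0, 1, 1, 0, 0, 0, 0]
```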

This model is fit in JMP using the Analyze->Fit Model platform. Note that the block and treatment variables must be nominally scaled. There is NO offset variable, as the insect cages were all of equal size.


This produces the output:

The goodness-of-fit test shows strong evidence that the model doesn't fit, as the p-values are very small. Lack-of-fit can be caused by inadequacies of the actual model (perhaps a more complex model with block and treatment interactions is needed?), failure of the Poisson assumption, or use of the wrong link function.

The residual plot:


shows that the data are more variable than expected under a Poisson distribution (about 95% of the residuals should be within ±2). The base model and link function seem reasonable, as there is no pattern to the residuals, merely over-dispersion relative to a Poisson distribution.

The adjustment for over-dispersion is made, as seen earlier, in the Analyze->Fit Model dialogue box:


which gives the revised output:


Note that the over-dispersion factor ĉ = 3.5. The test-statistics for the Effect Tests are adjusted by this factor (compare the chi-square of 76.37 for the treatment effects in the absence of adjusting for over-dispersion with the chi-square of 21.79 after adjusting for over-dispersion), and the p-values have been adjusted as well.

The residuals have been adjusted by √ĉ and now look more acceptable:


Note that the pattern of the residual plot doesn't change; all the over-dispersion adjustment does is change the residual variance so that the standardization brings the residuals closer to 0.

If you compare the parameter estimates between the two models, you will find that the estimates are unchanged, but the reported se are increased by √ĉ to account for over-dispersion. As is the case with all categorical X variables, the interpretation of the estimates for the indicator variables depends upon the coding used by the package. JMP uses a coding where each indicator variable is compared to the mean response over all indicator variables.

Predictions of the mean response at levels of X are obtained in the usual fashion. The se will also be adjusted for overdispersion. However, it is now impossible to obtain prediction intervals for the ACTUAL number of events. By using the overdispersion factor, you are no longer assuming that the counts are distributed as a Poisson distribution – in fact, there is NO REAL DISTRIBUTION that has the mean-to-variance ratio that is implicitly assumed by using the overdispersion factor. Without an actual distribution, it is impossible to make predictions for individual events.

If comparisons are of interest among the treatment levels, it is better to use the built-in Contrast facilities of the package to compute the estimates and standard errors rather than trying to do this by hand. For example, suppose we are interested in comparing treatment 0 (the control) to the treatment with factor A at level 1 and factor B at level 1 (corresponding to treatment 1). The contrast is estimated as:


The estimated difference in the log(mean) is −.34 (se .39), which corresponds to a ratio of e^(−.34) = .71 of treatment 1 to control, i.e. on average, the number of insects in the treatment 1 traps is 71% of the number of insects in the control traps. An application of the delta-method shows that the se of the ratio is computed as se(e^θ̂) = se(θ̂) e^θ̂ = .39(.71) = .28. However, there was no evidence of a difference in trap counts, as the standard error was sufficiently large. A 95% confidence interval for the difference in log(mean) is found as −.34 ± 2(.39), which gives (−1.12 → .44). Because the p-value was larger than α = .05, this confidence interval includes zero. When this interval is anti-logged, the 95% confidence interval for the ratio of mean counts is (.32 → 1.55), i.e. the true ratio of treatment counts to control counts is between .32 and 1.55. Because the p-value was greater than α = .05, this interval contains the value of 1 (indicating that the ratio of counts was 1:1). It is also correct to compute the 95% confidence interval for the ratio using the estimated ratio ± its se. This gives (.71 ± 2(.28)) or (.15 → 1.27). In large samples, these confidence intervals are equivalent; in smaller samples, there is no real objective way to choose between them.
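The delta-method arithmetic above can be checked directly. The inputs are the numbers quoted in the text; the tiny differences from the printed .32 and .15 come from rounding in the intermediate steps:

```python
import math

theta, se_theta = -0.34, 0.39       # estimated log(mean) difference and its se

ratio = math.exp(theta)             # ratio of treatment-1 mean to control mean
se_ratio = se_theta * ratio         # delta method: se(e^theta) = se(theta) * e^theta

# 95% CI on the log scale, then anti-logged
lo_log, hi_log = theta - 2 * se_theta, theta + 2 * se_theta
lo_ratio, hi_ratio = math.exp(lo_log), math.exp(hi_log)

# 95% CI computed directly on the ratio scale
lo_direct, hi_direct = ratio - 2 * se_ratio, ratio + 2 * se_ratio

print(f"ratio = {ratio:.2f}, se = {se_ratio:.2f}")            # 0.71, 0.28
print(f"log-scale CI: ({lo_log:.2f}, {hi_log:.2f})")          # (-1.12, 0.44)
print(f"anti-logged CI: ({lo_ratio:.2f}, {hi_ratio:.2f})")    # (0.33, 1.55)
print(f"ratio-scale CI: ({lo_direct:.2f}, {hi_direct:.2f})")  # (0.16, 1.27)
```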

8.9 Log-linear models for multi-dimensional contingency tables

In the chapter on logistic regression, k × 2 contingency tables were analyzed to see if the proportions of responses in the population that fell in the two categories (e.g. survived or died) were the same across the k levels of the factor (e.g. sex, or passenger class, or dose of a drug).

The use of logistic regression is a special case of the general r × c contingency table where observations are classified by r levels of a factor and c levels of a response. A separate chapter described the use of χ² tests to test the hypothesis of equal population proportions in the c levels of the response across all levels of the factor. This is also known as the test of independence of the response to the levels of the factor.

This can be generalized to the analysis of multi-dimensional tables using Poisson regression. In more advanced courses, you can learn how the two previous cases are simple cases of this more general modelling approach. Consult Agresti's book for a fuller account of this topic.

8.10 Variable selection methods

To be added later

8.11 Summary

Poisson regression is the standard tool for the analysis of "smallish" count data. If the counts are large (say on the order of hundreds), you could likely use ordinary or weighted regression methods without difficulty.

This chapter only concerns itself with data collected under a simple random sample or a completely randomized design. If the data are collected under other designs, please consult with a statistician for the proper analysis.

A common problem that I have encountered is data that have been pre-standardized. For example, data may be recorded on the number of tree stems in 100 m² test plots. These data could likely be modeled using Poisson regression. But then the data are standardized to a "per hectare" basis. These standardized data are NO LONGER distributed as a Poisson distribution. It would be preferable to analyze the data using the sampling units that were used to collect the data, with an offset variable used to adjust for the differing sizes of survey units.

A common cause of overdispersion is non-independence in the data. For example, data may be collected using a cluster design rather than by a simple random sample. Overdispersion can be accounted for using quasi-likelihood methods. As a rule of thumb, overdispersion factors ĉ of 10 or less are acceptable; very large overdispersion factors indicate other serious problems in the model. An alternative to the use of the correction factor is to use a different distribution, such as the negative binomial distribution.

Related models for this chapter are the zero-inflated Poisson (ZIP) models. In these models there is an excess number of zeroes relative to what would be expected under a Poisson model. The ZIP model has two parts: the probability that an observation will be zero, and then the distribution of the non-zero counts. There is a substantial base in the literature on this model.
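A small simulation (parameter choices are my own) shows the excess-zero structure the ZIP model captures: with probability π an observation is a structural zero, otherwise it is Poisson(λ), so the zero fraction is π + (1 − π)e^(−λ) rather than the plain Poisson value e^(−λ):

```python
import numpy as np

rng = np.random.default_rng(2012)
n, pi, lam = 100_000, 0.30, 2.5     # mixing probability and Poisson mean

# ZIP draw: structural zero with probability pi, otherwise Poisson(lam).
structural = rng.random(n) < pi
y = np.where(structural, 0, rng.poisson(lam, size=n))

expected_zero = pi + (1 - pi) * np.exp(-lam)   # ZIP zero probability
poisson_zero = np.exp(-lam)                    # plain Poisson zero probability
print("observed zero fraction:", round(float((y == 0).mean()), 3))
print("ZIP theory:", round(float(expected_zero), 3),
      " Poisson theory:", round(float(poisson_zero), 3))
```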

